On Jun 23, 2017, at 4:55 PM, John Bent <John.Bent(a)seagategov.com> wrote:
On Jun 19, 2017, at 8:04 PM, Andreas Dilger <adilger(a)dilger.ca> wrote:
>
> On Jun 19, 2017, at 2:46 AM, John Bent <John.Bent(a)seagategov.com> wrote:
>> Just FYI, the above contains a small correction to the proposal I sent that
>> George and I figured out while we were running. The original idea was modeled after
>> Jim Gray’s sort benchmark, which measured how much data could be sorted in 1 minute.
>> Originally it was called Terasort and measured how quickly a terabyte could be
>> sorted, but after a few years a terabyte was so small that the test was effectively
>> testing only how quickly you could launch a job.
>>
>> So we said: how much can you do in 5 minutes? But when we ran the test we
>> realized it was better to use the IOR self-reported numbers for bandwidth, because
>> they exclude the MPI start-up and finalize times, which gives a more accurate
>> reflection of IO. But then we realized that people could satisfy the 5 minute limit
>> while running for only a short time and still get a high bandwidth reported from
>> IOR, because they did only a small amount of IO, which fit in server-side caches
>> that don’t necessarily respect a sync command.
>>
>> So the new proposal is to require a minimum of 5 minutes. To try to get
>> consistent numbers, I think we should also set a maximum time. So I propose that
>> we’ll use the self-reported IOR/mdtest results iff the self-reported IO time is
>> between 5 and 5.5 minutes.
>
> In this case, it might be useful to have the test loop with some kind of linear
> approximation/binary search until it gets the IOR/mdtest results within 300-330s.
> The user can avoid this if they supply a size parameter such that the first run
> completes within the required time window. If multiple runs are needed, the test
> script should report the final sizes used along with the other config parameters so
> that they can be used for subsequent runs.
>
> Alternately, since we are already modifying IOR, we could add an option to limit
> the test run to the specified time. Something like "run until N seconds have
> elapsed, then continue until all threads have written the same amount of data" so
> that it should be equivalent to a regular IOR run with the same specified size.
>
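The search loop suggested above could be sketched roughly as follows, where `run_ior(size)` is a hypothetical stand-in for launching IOR at a given aggregate size and parsing its self-reported IO time (names and parameters are illustrative, not an actual IO-500 script):

```python
def find_size(run_ior, initial_size, lo_s=300.0, hi_s=330.0, max_iters=10):
    """Scale the IOR aggregate size until the self-reported IO time lands
    in [lo_s, hi_s].  run_ior(size) is a hypothetical helper that launches
    IOR at the given size and returns its self-reported elapsed seconds."""
    size = initial_size
    for _ in range(max_iters):
        elapsed = run_ior(size)
        if lo_s <= elapsed <= hi_s:
            # Report the final size so it can be reused in later runs.
            return size, elapsed
        # Linear approximation: assume bandwidth is roughly constant and
        # rescale the size toward the middle of the target window.
        size = max(1, int(size * (lo_s + hi_s) / 2.0 / elapsed))
    raise RuntimeError("did not converge on the target time window")
```

With a user-supplied size that already lands in the window, the first call returns immediately, which matches the point above about avoiding the iteration entirely.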
Hey Andreas,
Interesting that you suggest this. A Seagate colleague also asked for something similar.
Also, we used to run the LANL benchmark in this way; Brent Welch used to complain that it
was a bad workload since real apps don’t do this and it is important to account for
stragglers.
The Seagate colleague actually requested this for a different reason. He refers to this
mode as ‘stone-walled’ and the other mode as ‘non-stone-walled.’ He suggested that IO-500
should do both stone-walled and non-stone-walled, because the difference between them is
a measurement of load imbalance. This is a good point. However, I think that IO-500 is
not about helping diagnose why a storage system might be inefficient; it is just about
measuring expected performance. So doing stone-walled is interesting, and comparing it
to non-stone-walled is also interesting, but neither is the purpose of IO-500 as I’ve
been thinking of it.
To clarify, I'm not proposing to report the stone-wall IOR results, which is
"compute bandwidth based on data written and elapsed time as soon as the first client
finishes". I think IOR already has this today. Rather, I propose a slightly
different mode that would be equivalent to a proper non-stone-walled IOR run that has
been hand-tuned to finish at a specified time limit, without the burden of hand-tuning
the file size.
In particular, it would be:
- run until the specified time limit has elapsed, or maximum file size is reached
- all ranks do an MPI_Reduce(MPI_MAX) to find the peak data written by any thread
- possibly round this up to some even number
- all ranks finish writing data to match the peak data found at the time limit
- the total data size is printed as part of the results
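In IOR itself these steps would be MPI code; purely as a single-process illustration (all names hypothetical), the reduce-and-equalize logic amounts to:

```python
def equalize_after_stonewall(written_at_limit, round_to=1):
    """Simulate the proposed mode: at the time limit, find the peak bytes
    written by any rank (an MPI_Reduce with MPI_MAX in real code), optionally
    round it up to an even multiple, and have every rank write up to that
    target so the run matches a fixed-size non-stone-walled run."""
    peak = max(written_at_limit)                    # stands in for MPI_MAX reduce
    target = -(-peak // round_to) * round_to        # round up to an even multiple
    extra = [target - w for w in written_at_limit]  # stragglers' remaining IO
    total = target * len(written_at_limit)          # total data size to report
    return target, extra, total
```

On a perfectly balanced system every entry of `extra` is zero, so the run finishes exactly at the time limit; any nonzero entries are the straggler writes that extend the run past it.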
For a perfectly balanced system the test would complete at the time limit, since all nodes
would have written the same amount of data at that time. A real-world system would have
some delay after the time limit (unless stone wall was enabled) as the straggler ranks
finish writing their data to reach the largest file size. This shouldn't take too
long after the time limit, but would still make it possible to determine the amount of
IO imbalance between ranks from how much longer than the specified time limit the test
run takes.
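As a tiny illustration of that diagnostic (a hypothetical metric, not something IOR prints today), the imbalance could be expressed as the fractional overrun past the time limit:

```python
def io_imbalance(elapsed_s, time_limit_s=300.0):
    """Fraction of the time limit spent waiting for stragglers after it
    expired; 0.0 indicates a perfectly balanced system (hypothetical
    metric derived from the run's actual elapsed time)."""
    return max(0.0, elapsed_s - time_limit_s) / time_limit_s
```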
The main goal is just to allow IO-500 to get a 300s-ish test run easily, without
having to iterate in the script or by hand, but the actual test run time is also useful
information at that point to gauge the IO imbalance.
Cheers, Andreas