On Jun 23, 2017, at 5:57 PM, Andreas Dilger <adilger(a)dilger.ca> wrote:
On Jun 23, 2017, at 4:55 PM, John Bent <John.Bent(a)seagategov.com> wrote:
> On Jun 19, 2017, at 8:04 PM, Andreas Dilger <adilger(a)dilger.ca> wrote:
>>
>> On Jun 19, 2017, at 2:46 AM, John Bent <John.Bent(a)seagategov.com> wrote:
>>> Just FYI, the above contains a small correction to the proposal I sent, which
>>> George and I figured out while we were running. The original idea was modeled after
>>> Jim Gray’s sort benchmark, which measured how much data could be sorted in 1 minute.
>>> Originally it was called Terasort and measured how quickly a terabyte could be sorted,
>>> but after a few years a terabyte was so small that the test was effectively testing
>>> only how quickly you could launch a job.
>>>
>>> So we asked how much you can do in 5 minutes. But when we ran the test we
>>> realized it was better to use the IOR self-reported numbers for bandwidth, because
>>> they exclude the MPI start-up and finalize times, which is a more accurate reflection
>>> of IO. But then we realized that people could run for only a small amount of time,
>>> satisfy the 5-minute limit, and get a high bandwidth reported from IOR, because they
>>> did a small amount of IO that fit in server-side caches, which don’t necessarily
>>> respect a sync command.
>>>
>>> So the new proposal is to run for a minimum of 5 minutes. To try to get
>>> consistent numbers, I think we should also set a maximum time. So I propose that we’ll
>>> use the self-reported IOR/mdtest results iff the self-reported IO time is between 5
>>> and 5.5 minutes.
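[A minimal sketch of the acceptance rule proposed above, assuming the 5-to-5.5-minute window maps to 300-330 seconds; the helper name is illustrative, not part of any IOR/mdtest tooling:]

```python
# Hypothetical helper illustrating the proposed acceptance rule: a
# self-reported IOR/mdtest result only counts if its self-reported IO
# time falls in the 5-to-5.5-minute (300-330 s) window.

MIN_SECONDS = 300.0  # 5 minutes: rules out cache-only runs
MAX_SECONDS = 330.0  # 5.5 minutes: keeps run lengths comparable

def result_is_valid(self_reported_seconds):
    """Return True iff the self-reported IO time is within [300, 330] s."""
    return MIN_SECONDS <= self_reported_seconds <= MAX_SECONDS
```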
>>
>> In this case, it might be useful to have the test loop with some kind of linear
>> approximation/binary search until it gets the IOR/mdtest results within 300-330s. The
>> user can avoid this if they supply a size parameter such that the first run completes
>> within the required time window. If multiple runs are needed, the test script should
>> report the final sizes used along with the other config parameters so that they can be
>> reused for subsequent runs.
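[The tuning loop suggested above could be sketched as follows. This is a hypothetical illustration, not part of any IO-500 script: `run_benchmark` stands in for an actual IOR/mdtest invocation, and the linear approximation assumes observed bandwidth is roughly constant, so run time scales with size:]

```python
# Hypothetical sketch of the suggested loop: scale the size by a linear
# approximation (time is assumed roughly proportional to size) until a
# run lands in the 300-330 s acceptance window.

TARGET_LOW, TARGET_HIGH = 300.0, 330.0

def tune_size(run_benchmark, initial_size, max_iterations=10):
    """Iterate until run_benchmark(size) reports a time in [300, 330] s.

    Returns (size, seconds) of the accepted run, or None if not found,
    so the final size can be reported for reuse in subsequent runs.
    """
    size = initial_size
    for _ in range(max_iterations):
        seconds = run_benchmark(size)
        if TARGET_LOW <= seconds <= TARGET_HIGH:
            return size, seconds
        # Linear approximation: aim for the middle of the window.
        midpoint = (TARGET_LOW + TARGET_HIGH) / 2.0
        size = max(1, int(size * midpoint / seconds))
    return None
```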
>>
>> Alternately, since we are already modifying IOR, we could add an option to limit
>> the test run to the specified time. Something like "run until N seconds have
>> elapsed, then continue until all threads have written the same amount of data", so
>> that it should be equivalent to a regular IOR run with the same specified size.
>>
> Hey Andreas,
>
> Interesting that you suggest this. A Seagate colleague also asked for something
> similar. Also, we used to run the LANL benchmark in this way; Brent Welch used to
> complain that it was a bad workload, since real apps don’t do this and it is
> important to account for stragglers.
>
> The Seagate colleague actually requested this for a different reason. He refers to
> this mode as ‘stone-walled’ and the other mode as ‘non-stone-walled.’ He suggested
> that IO-500 should do both stone-walled and non-stone-walled, because the difference
> between them is a measurement of load imbalance. This is a good point. However, I
> think that IO-500 is not about helping diagnose why a storage system might be
> inefficient; it is just about measuring expected performance. So doing stone-walled
> is interesting, and comparing it to non-stone-walled is also interesting, but neither
> is the purpose of IO-500 as I’ve been thinking of it.
To clarify, I'm not proposing to report the stone-wall IOR results, which is
"compute bandwidth based on data written and elapsed time as soon as the first client
finishes". I think IOR already has this today. Rather, I propose a slightly
different mode that would be equivalent to a proper non-stonewall IOR run that is
hand-tuned to finish after a specified time limit, without the burden of hand-tuning the
file size.
In particular, it would be:
- run until the specified time limit has elapsed, or maximum file size is reached
- all ranks do an MPI_Reduce(MPI_MAX) to find the peak data written by any rank
- possibly round this up to some even number
- all ranks finish writing data to match the peak data found at the time limit
- the total data size is printed as part of the results
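[The steps above can be sketched in a single-process simulation, with each "rank" modeled simply as a per-rank bandwidth and the MPI_Reduce(MPI_MAX) step modeled as a plain max() over ranks. The function name and rounding granularity are illustrative assumptions, not IOR code:]

```python
# Minimal sketch of the proposed mode: ranks write until the time limit,
# the peak amount written by any rank is found (the MPI_Reduce(MPI_MAX)
# step, modeled here as max()), rounded up to an even size, and then
# every rank continues until it has written that same amount.

def equalized_run(rank_bandwidths, time_limit, granularity=1):
    """Return (target_size, total_elapsed) for the proposed mode.

    rank_bandwidths: data written per second by each simulated rank.
    """
    # Phase 1: amount each rank has written when the time limit hits.
    written = [bw * time_limit for bw in rank_bandwidths]

    # Phase 2: find the peak and round it up to an even number.
    peak = max(written)
    target = -(-peak // granularity) * granularity  # ceiling division

    # Phase 3: stragglers keep writing until they reach `target`;
    # total elapsed time is set by the slowest rank.
    total_elapsed = max(target / bw for bw in rank_bandwidths)
    return target, total_elapsed
```

In this model the imbalance measurement falls out directly: `total_elapsed - time_limit` is zero for perfectly balanced ranks and grows with the stragglers' deficit.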
For a perfectly balanced system the test would complete at the time limit, since all
ranks would have written the same amount of data at that time. A real-world system would
have some delay after the time limit (unless stone wall was enabled) as the straggler
ranks finish writing their data to reach the largest file size. This shouldn't take
too long to complete after the time limit, but it would still make it possible to
determine the amount of IO imbalance between ranks from how much longer than the
specified time limit the test run takes.
The main goal is just to allow IO-500 to get a 300s-ish test run easily without
having to iterate in the script or manually, but the actual test run time is also
useful information at that point to gauge the IO imbalance.
Oh, very nice indeed. This gives everything: the runtime determinism of stone-wall
without losing the straggler effect, while also gaining the imbalance measurement. We
should definitely do this unless... does anyone fear that checking the timestamp after
every storage IO, and issuing a collective once the elapsed time reaches 300 seconds,
could reduce the benchmark fidelity?
Thanks,
John
Cheers, Andreas