Dear Andreas,
In a smaller group we had previously discussed an "automatic" tool to
parameterize the IO500.
I'm also very much in favor of this.
I presume the easiest way to support it is to create a wrapper script
that explores the settings and ramps up the operations.
That way it does not change the benchmarks directly.
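
To make this concrete, here is a rough Python sketch of such a wrapper;
the run_ior helper, the ior command line, and the starting size are all
placeholders, not the real IO500 interface:

  import subprocess, time

  def run_ior(block_size_mb):
      # Run one IOR pass and return the elapsed wall time in seconds.
      start = time.time()
      # Placeholder invocation; the real flags would come from the config.
      subprocess.check_call(["mpirun", "ior", "-b", "%dm" % block_size_mb])
      return time.time() - start

  size_mb = 64                        # arbitrary starting block size
  while run_ior(size_mb) < 300:       # ramp up until the 5-minute floor
      size_mb *= 2
  print("use -b %dm for the official run" % size_mb)
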
Julian
2017-06-24 2:03 GMT+02:00 John Bent <John.Bent(a)seagategov.com>:
> On Jun 23, 2017, at 5:57 PM, Andreas Dilger <adilger(a)dilger.ca> wrote:
>
> On Jun 23, 2017, at 4:55 PM, John Bent <John.Bent(a)seagategov.com> wrote:
>> On Jun 19, 2017, at 8:04 PM, Andreas Dilger <adilger(a)dilger.ca> wrote:
>>>
>>> On Jun 19, 2017, at 2:46 AM, John Bent <John.Bent(a)seagategov.com> wrote:
>>>> Just FYI, the above contains a small correction to the proposal I sent,
>>>> which George and I figured out while we were running. The original idea
>>>> was modeled after Jim Gray's sort benchmark, which measured how much data
>>>> could be sorted in one minute. Originally it was called Terasort and
>>>> measured how quickly a terabyte could be sorted, but after a few years a
>>>> terabyte was so small that the test was effectively measuring only how
>>>> quickly you could launch a job.
>>>>
>>>> So we said: see how much you can do in 5 minutes. But when we ran the
>>>> test we realized it was better to use the IOR self-reported bandwidth
>>>> numbers, because they exclude the MPI start-up and finalize times, which
>>>> makes them a more accurate reflection of IO. But then we realized that
>>>> people could do only a small amount of IO, still satisfy the 5-minute
>>>> limit, and get a high bandwidth reported from IOR, because that small
>>>> amount of IO would fit in server-side caches, which don't necessarily
>>>> respect a sync command.
>>>>
>>>> So the new proposal is to require a minimum of 5 minutes. To try to get
>>>> consistent numbers, I think we should also set a maximum time. So I
>>>> propose that we use the self-reported IOR/mdtest results iff the
>>>> self-reported IO time is between 5 and 5.5 minutes.
>>>
>>> In this case, it might be useful to have the test loop with some kind of
>>> linear approximation/binary search until the IOR/mdtest run time falls
>>> within 300-330s. The user can avoid this by supplying a size parameter
>>> such that the first run completes within the required time window. If
>>> multiple runs are needed, the test script should report the final sizes
>>> used, along with the other config parameters, so that they can be reused
>>> for subsequent runs.
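>>>
>>> Something along these lines, as a sketch only (run_ior here is a
>>> hypothetical helper that returns the self-reported IO time; it does
>>> not exist yet):
>>>
>>>   def find_size(size_mb, run_ior):
>>>       # run_ior(size_mb) -> self-reported IO time in seconds
>>>       while True:
>>>           t = run_ior(size_mb)
>>>           if 300 <= t <= 330:
>>>               return size_mb        # report this size for reuse
>>>           # linear approximation: scale toward the window midpoint
>>>           size_mb = max(1, int(size_mb * 315.0 / t))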
>>>
>>> Alternatively, since we are already modifying IOR, we could add an option
>>> to limit the test run to the specified time: something like "run until N
>>> seconds have elapsed, then continue until all threads have written the
>>> same amount of data", so that it is equivalent to a regular IOR run with
>>> the same specified size.
>>>
>> Hey Andreas,
>>
>> Interesting that you suggest this. A Seagate colleague also asked for
>> something similar. Also, we used to run the LANL benchmark in this way;
>> Brent Welch used to complain that it was a bad workload, since real apps
>> don't do this and it is important to account for stragglers.
>>
>> The Seagate colleague actually requested this for a different reason. He
>> refers to this mode as 'stone-walled' and the other mode as
>> 'non-stone-walled.' He suggested that IO500 should do both stone-walled
>> and non-stone-walled, because the difference between them is a measurement
>> of load imbalance. This is a good point. However, I think that IO500 is
>> not about helping diagnose why a storage system might be inefficient; it
>> is just about measuring expected performance. So doing stone-walled is
>> interesting, and comparing it to non-stone-walled is also interesting, but
>> neither is the purpose of IO500 as I've been thinking of it.
>
> To clarify, I'm not proposing to report the stone-wall IOR results, which
> is "compute bandwidth based on data written and elapsed time as soon as
> the first client finishes". I think IOR already has this today. Rather, I
> propose a slightly different mode that would be equivalent to a proper
> non-stonewall IOR run that is hand-tuned to finish after a specified time
> limit, without the burden of hand-tuning the file size.
>
> In particular, it would be (roughly sketched in code below):
> - run until the specified time limit has elapsed, or the maximum file size
>   is reached
> - all ranks do an MPI_Reduce(MPI_MAX) to find the peak data written by any
>   thread
> - possibly round this up to some even number
> - all ranks finish writing data to match the peak found at the time limit
> - the total data size is printed as part of the results
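>
> A minimal mpi4py sketch of that loop (write_one_block is just a stand-in
> for IOR's transfer routine; I use allreduce because every rank needs the
> peak, and the rounding step is omitted):
>
>   from mpi4py import MPI
>   import time
>
>   TIME_LIMIT = 300                  # seconds
>   MAX_BLOCKS = 10**9                # cap on file size, in blocks
>
>   def write_one_block():
>       pass                          # stand-in for IOR's transfer call
>
>   comm = MPI.COMM_WORLD
>   deadline = time.time() + TIME_LIMIT
>   blocks = 0
>   while time.time() < deadline and blocks < MAX_BLOCKS:
>       write_one_block()
>       blocks += 1
>   peak = comm.allreduce(blocks, op=MPI.MAX)
>   while blocks < peak:              # stragglers catch up to the peak
>       write_one_block()
>       blocks += 1
>   if comm.rank == 0:
>       print("total data: %d blocks x %d ranks" % (peak, comm.size))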
>
> For a perfectly balanced system the test would complete at the time limit,
> since all nodes would have written the same amount of data by that time. A
> real-world system would have some delay after the time limit (unless
> stonewalling was enabled) as the straggler ranks finish writing their data
> to reach the largest file size. This shouldn't take too long to complete
> after the time limit, but it would still make it possible to determine the
> amount of IO imbalance between ranks from how much longer than the
> specified time limit the test run takes.
>
> The main goal is just to allow IO-500 to get a roughly 300s test run
> easily, without having to iterate in the script or manually, but the
> actual test run time also becomes useful information at that point for
> gauging the IO imbalance.
>
Oh, very nice indeed. This gives us everything: the runtime determinism of
stone-walling without losing the straggler effect, while also gaining the
imbalance measurement. We should definitely do this, unless... does anyone
fear that checking the timestamp after every storage IO, and issuing a
collective once the elapsed time reaches 300 seconds, could reduce the
benchmark fidelity?
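
If the per-IO clock check is a worry, it could be amortized, e.g. by only
checking the clock every 64 transfers. A quick hypothetical sketch:

  import time

  CHECK_INTERVAL = 64               # transfers between clock checks
  deadline = time.time() + 300

  def do_io():
      pass                          # stand-in for one IOR transfer

  ops = 0
  while True:
      do_io()
      ops += 1
      # amortize the timestamp cost across many transfers
      if ops % CHECK_INTERVAL == 0 and time.time() >= deadline:
          break
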
Thanks,
John
> Cheers, Andreas
--
http://wr.informatik.uni-hamburg.de/people/julian_kunkel