On Jun 26, 2017, at 12:13 PM, Andreas Dilger
<adilger@dilger.ca<mailto:adilger@dilger.ca>> wrote:
On Jun 23, 2017, at 6:03 PM, John Bent
<John.Bent@seagategov.com<mailto:John.Bent@seagategov.com>> wrote:
On Jun 23, 2017, at 5:57 PM, Andreas Dilger
<adilger@dilger.ca<mailto:adilger@dilger.ca>> wrote:
On Jun 23, 2017, at 4:55 PM, John Bent
<John.Bent@seagategov.com<mailto:John.Bent@seagategov.com>> wrote:
On Jun 19, 2017, at 8:04 PM, Andreas Dilger
<adilger@dilger.ca<mailto:adilger@dilger.ca>> wrote:
On Jun 19, 2017, at 2:46 AM, John Bent
<John.Bent@seagategov.com<mailto:John.Bent@seagategov.com>> wrote:
Just FYI, the above contains a small correction to the proposal I sent that George and I
figured out while we were running. The original idea was modeled after Jim Gray's sort
benchmark, which measured how much data could be sorted in 1 minute. Originally it was
called Terasort and measured how quickly a terabyte could be sorted, but after a few years
a terabyte was so small that the test was effectively measuring only how quickly you could
launch a job. So we said: how much can you do in 5 minutes? But when we ran the test we
realized it was better to use the IOR self-reported bandwidth numbers, because they
exclude the MPI start-up and finalize times and are therefore a more accurate reflection
of IO. But then we realized that people could run for only a short time, satisfy the
5-minute limit, and still get a high bandwidth reported from IOR, because their small
amount of IO would fit in server-side caches, which don't necessarily respect a sync
command.
So the new proposal is to run for a minimum of 5 minutes. To try to get consistent
numbers, I think we should also set a maximum time. So I propose that we use the
self-reported IOR/mdtest results if and only if the self-reported IO time is between 5
and 5.5 minutes.
In this case, it might be useful to have the test script loop, with some kind of linear
approximation or binary search, until it gets the IOR/mdtest run time within 300-330s.
The user can avoid this by supplying a size parameter such that the first run completes
within the required time window. If multiple runs are needed, the test script should
report the final sizes used along with the other config parameters, so that they can be
reused for subsequent runs.
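As a sketch, the script-side search could look like the following. The `run_ior(size)` helper here is hypothetical, standing in for an actual IOR invocation that returns the self-reported elapsed IO time in seconds; the 300-330s window and the linear-approximation step match the proposal above.

```python
def find_size(run_ior, start_size, lo=300.0, hi=330.0, max_iters=10):
    """Search for a per-rank size whose self-reported IOR run time
    lands in the [lo, hi] second window.

    run_ior(size) is a hypothetical helper that runs IOR with the
    given size and returns the elapsed IO time in seconds.
    """
    size = start_size
    target = (lo + hi) / 2.0  # aim for the middle of the window
    for _ in range(max_iters):
        elapsed = run_ior(size)
        if lo <= elapsed <= hi:
            return size, elapsed
        # Linear approximation: assume bandwidth is roughly constant,
        # so scale the size by the ratio of target to measured time.
        size = max(1, int(size * target / elapsed))
    return size, elapsed
```

Under the constant-bandwidth assumption this converges in one or two runs; a real system's bandwidth varies with size, which is why the script should still report the final size it settled on.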
Alternatively, since we are already modifying IOR, we could add an option to limit the
test run to the specified time: something like "run until N seconds have elapsed, then
continue until all threads have written the same amount of data", so that it should be
equivalent to a regular IOR run with the same specified size.
Hey Andreas,
Interesting that you suggest this. A Seagate colleague also asked for something similar.
Also, we used to run the LANL benchmark in this way; Brent Welch used to complain that it
was a bad workload since real apps don’t do this and it is important to account for
stragglers.
The Seagate colleague actually requested this for a different reason. He refers to this
mode as 'stone-walled' and the other mode as 'non-stone-walled.' He suggested that IO-500
should do both stone-walled and non-stone-walled, because the difference between them is a
measurement of load imbalance. This is a good point. However, I think that IO-500 is not
about helping diagnose why a storage system might be inefficient; it is just about
measuring expected performance. So doing stone-walled is interesting, and comparing it to
non-stone-walled is also interesting, but neither is the purpose of IO-500 as I've been
thinking of it.
To clarify, I'm not proposing to report the stone-wall IOR results, which is
"compute bandwidth based on data written and elapsed time as soon as the first client
finishes". I think IOR already has this today. Rather, I propose a slightly
different mode that would be equivalent to a proper non-stonewall IOR run that is
hand-tuned to finish after a specified time limit, without the burden of hand-tuning the
file size.
In particular, it would be:
- run until the specified time limit has elapsed, or maximum file size is reached
- all ranks do an MPI_Reduce(MPI_MAX) to find the peak data written by any thread
- possibly round this up to some even number
- all ranks finish writing data to match the peak data found at the time limit
- the total data size is printed as part of the results
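A minimal simulation of the steps above, in plain Python rather than MPI: the `MPI_Reduce(MPI_MAX)` is modeled as a `max()` across ranks, and the per-rank write rates are made-up numbers representing imbalance.

```python
def run_with_equalized_stonewall(rank_blocks_per_sec, time_limit, round_to=1):
    """Simulate the proposed IOR mode: every rank writes until the time
    limit, the peak block count is found with a max-reduction, and the
    stragglers keep writing until all ranks match that peak.

    rank_blocks_per_sec: hypothetical per-rank write rates (imbalance).
    Returns (blocks_at_limit, final_blocks, extra_time_per_rank).
    """
    # Phase 1: blocks each rank manages to write before the time limit.
    at_limit = [int(rate * time_limit) for rate in rank_blocks_per_sec]
    # MPI_Reduce(MPI_MAX) equivalent: find the peak across all ranks,
    # optionally rounded up to an even multiple.
    peak = max(at_limit)
    peak = ((peak + round_to - 1) // round_to) * round_to
    # Phase 2: stragglers continue until they reach the peak; the extra
    # wall-clock time past the limit is the imbalance measurement.
    extra = [(peak - done) / rate
             for done, rate in zip(at_limit, rank_blocks_per_sec)]
    final = [peak] * len(at_limit)
    return at_limit, final, extra
```

Every rank ends with the same total, so the run is equivalent to a fixed-size IOR run of that size, and the largest entry in `extra` is how far past the time limit the run stretches on an imbalanced system.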
For a perfectly balanced system the test would complete at the time limit, since all nodes
would have written the same amount of data at that time. A real-world system would have
some delay after the time limit (unless stone-wall was enabled) as the straggler ranks
finish writing their data to reach the largest file size. This shouldn't take too long to
complete after the time limit, but would still make it possible to determine the amount of
IO imbalance between ranks by how much longer than the specified time limit the test run
takes.
The main goal is just to allow IO-500 to get a roughly 300s test run easily, without
having to iterate in the script or manually, but the actual test run time is also useful
information at that point for gauging the IO imbalance.
Oh, very nice indeed. This gives us everything: the runtime determinism of stone-wall,
without losing the straggler effect, while gaining the imbalance measurement. We should
definitely do this unless... does anyone fear that checking the timestamp after every
storage IO, and issuing a collective once the elapsed time reaches 300 seconds, could
reduce the benchmark fidelity?
The overhead of checking the elapsed time is fairly small, since the threads are IO bound;
otherwise they aren't driving the storage hard enough... It looks like IOR is already
checking the timestamp, but this may be after every iteration in a loop rather than after
every IO? I need to look more closely.
If checking the timestamp after every IO is really a concern, this could also be done with
an asynchronous alarm() with a signal handler that just sets a global variable when it
fires. The overhead of checking this global variable after every IO would be minimal.
The only possible concern of implementing it this way is whether the signal is handled
properly by the MPI threads.
Thanks Andreas, but what about the overhead of the collective after the timer goes off?
I guess I think it's pretty minimal, since we're doing 300 seconds of IO.
Thanks,
John
Cheers, Andreas