On Jun 26, 2017, at 12:13 PM, Andreas Dilger <adilger@dilger.ca> wrote:

On Jun 23, 2017, at 6:03 PM, John Bent <John.Bent@seagategov.com> wrote:


On Jun 23, 2017, at 5:57 PM, Andreas Dilger <adilger@dilger.ca> wrote:

On Jun 23, 2017, at 4:55 PM, John Bent <John.Bent@seagategov.com> wrote:
On Jun 19, 2017, at 8:04 PM, Andreas Dilger <adilger@dilger.ca> wrote:

On Jun 19, 2017, at 2:46 AM, John Bent <John.Bent@seagategov.com> wrote:
Just FYI, the above contains a small correction to the proposal I sent, which George and I figured out while we were running.  The original idea was modeled after Jim Gray’s sort benchmark, which measured how much data could be sorted in one minute.  Originally it was called Terasort and measured how quickly you could sort a terabyte, but after a few years a terabyte was so small that the test was effectively measuring only how quickly you could launch a job.

So we changed it to how much you can do in 5 minutes.  But when we ran the test, we realized it was better to use the IOR self-reported bandwidth numbers because they exclude the MPI start-up and finalize times, which is a more accurate reflection of the IO.  But then we realized that people could do IO for only a small amount of time, still satisfy the 5-minute limit, and get a high bandwidth reported from IOR because they did a small amount of IO which fit in server-side caches, which don’t necessarily respect a sync command.

So the new proposal is to require a minimum of 5 minutes.  To try to get consistent numbers, I think we should also set a maximum time.  So I propose that we use the self-reported IOR/mdtest results iff the self-reported IO time is between 5 and 5.5 minutes.

In this case, it might be useful to have the test loop with some kind of linear approximation or binary search until it gets the IOR/mdtest results within 300-330s.  The user can avoid this if they supply a size parameter such that the first run completes within the required time window.  If multiple runs are needed, the test script should report the final sizes used along with the other config parameters so that they can be reused for subsequent runs.
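Roughly like this, just as a sketch of the search logic (run_ior() here is a stand-in for launching IOR and parsing its self-reported time, not an existing interface):

/* Sketch only, not real wrapper code: scale the per-rank size by
 * target/measured time until the self-reported run time lands in the
 * 300-330s acceptance window. */
#include <stdio.h>

/* placeholder: run IOR with this per-rank size and return the
 * self-reported IO time in seconds (excludes MPI startup/finalize) */
extern double run_ior(long long size_bytes);

long long find_size(long long initial_size)
{
    const double t_min = 300.0, t_max = 330.0, t_target = 315.0;
    long long size = initial_size;

    for (int attempt = 0; attempt < 10; attempt++) {
        double elapsed = run_ior(size);

        if (elapsed >= t_min && elapsed <= t_max) {
            printf("accepted: size=%lld bytes, time=%.1fs\n", size, elapsed);
            return size;
        }
        /* linear approximation: bandwidth is roughly constant, so the
         * run time should scale linearly with the amount of data */
        size = (long long)(size * (t_target / elapsed));
    }
    return -1;    /* did not converge; report the sizes tried to the user */
}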

Alternately, since we are already modifying IOR, we could add an option to limit the test run to the specified time.  Something like "run until N seconds have elapsed, then continue until all threads have written the same amount of data" so that it should be equivalent to a regular IOR run with the same specified size.

Hey Andreas,

Interesting that you suggest this.  A Seagate colleague also asked for something similar.  Also, we used to run the LANL benchmark in this way; Brent Welch used to complain that it was a bad workload since real apps don’t do this and it is important to account for stragglers.

The Seagate colleague actually requested this for a different reason.  He refers to this mode as ‘stone-walled’ and the other mode as ‘non-stone-walled.’  He suggested that IO500 should do both stone-walled and non-stone-walled because the difference between them is a measurement of load imbalance.  This is a good point.  However, I think that IO500 is not about helping diagnose why a storage system might be inefficient; it is just about measuring expected performance.  So doing stone-walled is interesting, and comparing it to non-stone-walled is also interesting, but neither is the purpose of IO500 as I’ve been thinking of it.

To clarify, I'm not proposing to report the stone-wall IOR results, which is "compute bandwidth based on data written and elapsed time as soon as the first client finishes".  I think IOR already has this today.  Rather, I propose a slightly different mode that would be equivalent to a proper non-stonewall IOR run that is hand-tuned to finish after a specified time limit, without the burden of hand-tuning the file size.

In particular, it would be:
- run until the specified time limit has elapsed, or maximum file size is reached
- all ranks do an MPI_Reduce(MPI_MAX) to find the peak data written by any thread
- possibly round this up to some even number
- all ranks finish writing data to match the peak data found at the time limit
- the total data size is printed as part of the results
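In rough pseudo-C it might look like the following (illustrative only, not the actual IOR code paths; write_one_block() stands in for whatever transfer call IOR makes, and the reduction is shown as an Allreduce so every rank learns the peak):

/* Sketch of the proposed mode, not actual IOR code. */
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

extern void write_one_block(uint64_t block_index);   /* placeholder IO call */

void timed_equalized_write(double time_limit, uint64_t max_blocks)
{
    double start = MPI_Wtime();
    uint64_t written = 0;

    /* phase 1: write until the time limit elapses or the size cap is hit */
    while (written < max_blocks && (MPI_Wtime() - start) < time_limit)
        write_one_block(written++);

    /* phase 2: find the peak amount written by any rank at the time limit
     * (an MPI_Reduce + MPI_Bcast would work equally well) */
    uint64_t peak = 0;
    MPI_Allreduce(&written, &peak, 1, MPI_UINT64_T, MPI_MAX, MPI_COMM_WORLD);
    /* could round 'peak' up to some even number here */

    /* phase 3: stragglers catch up so every rank writes the same amount,
     * making the run equivalent to a fixed-size non-stonewall IOR run */
    while (written < peak)
        write_one_block(written++);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("final per-rank size: %llu blocks\n", (unsigned long long)peak);
}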

For a perfectly balanced system the test would complete at the time limit, since all nodes would have written the same amount of data at that time.  A real-world system would have some delay after the time limit (unless stone-walling was enabled) as the straggler ranks finish writing their data to reach the largest file size.  This shouldn't take too long after the time limit, but it would still make it possible to determine the amount of IO imbalance between ranks by how much longer than the specified time limit the test run takes.

The main goal is just to allow IO-500 to get a roughly 300s test run easily, without having to iterate in the script or by hand, but the actual test run time is also useful information at that point for gauging the IO imbalance.

Oh, very nice indeed.  This gives everything: the runtime determinism of stone-walling without losing the straggler effect, while also gaining the imbalance measurement.  We should definitely do this unless... does anyone fear that checking the timestamp after every storage IO and issuing a collective once the elapsed time reaches 300 seconds could reduce the benchmark fidelity?

The overhead of checking the elapsed time is fairly small, since the threads are IO bound (otherwise they aren't driving the storage hard enough)...  It looks like IOR is already checking the timestamp, but this may be after every iteration of a loop rather than after every IO?  I need to look more closely.

If checking the timestamp after every IO is really a concern, this could also be done with an asynchronous alarm() and a signal handler that just sets a global variable when it fires.  The overhead of checking this global variable after every IO would be minimal.  The only possible concern with implementing it this way is whether the signal is handled properly by the MPI threads.
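Something along these lines, just as a sketch (not actual IOR code; do_one_io() is a placeholder for a single transfer):

/* Sketch of the alarm() variant.  The handler only sets a flag; the IO
 * loop checks the flag after every transfer. */
#include <signal.h>
#include <string.h>
#include <unistd.h>

extern void do_one_io(void);              /* placeholder for one transfer */

static volatile sig_atomic_t time_expired = 0;

static void alarm_handler(int sig)
{
    (void)sig;
    time_expired = 1;                     /* async-signal-safe: set flag only */
}

void write_until_alarm(unsigned int seconds)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = alarm_handler;
    sigaction(SIGALRM, &sa, NULL);
    alarm(seconds);                       /* SIGALRM fires once, per rank */

    while (!time_expired)
        do_one_io();                      /* flag check costs almost nothing */

    alarm(0);                             /* cancel any pending alarm */
    /* open question from above: verify SIGALRM delivery interacts cleanly
     * with the MPI library's own signal and thread usage */
}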

Thanks Andreas.  What about the overhead of the collective after the timer goes off?  I guess I think it's pretty minimal, since we're doing 300 seconds of IO.

Thanks,

John


Cheers, Andreas