On Jun 19, 2017, at 8:04 PM, Andreas Dilger <adilger@dilger.ca> wrote:

On Jun 19, 2017, at 2:46 AM, John Bent <John.Bent@seagategov.com> wrote:


On Jun 17, 2017, at 10:54 PM, Georgios Markomanolis <georgios.markomanolis@kaust.edu.sa> wrote:

Hello everybody,

Today I had the chance to work with John for a while, and we decided to make a first version of the script. I am attaching it; we have tested it on our Cray Burst Buffer at KAUST and it works without issues, though I still have to tune it a bit for better performance. We tried to introduce variables so it is easy to modify, and the script calculates everything automatically. We need to clarify what we want as the final result; for now it follows the formula from the previous email with the geometric means.

Some topics for discussion: why do we need to run mdtest and IOR for a minimum of 5 minutes? For example, on our system with more than 1 TB/s of IOR bandwidth, I would need to create 300+ TB in 5 minutes on the Burst Buffer. Is it not reasonable instead to create a file somewhat larger than half the memory of the reserved resources? Do you think there is still a chance of caching when using more than half the memory? For mdtest, creating millions of files can hurt the filesystem, especially Lustre; we have had real cases that Lustre could not handle.
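To make the sizing suggestion concrete, here is a rough sketch (the node count matches the example run below; the memory per node and tasks per node are hypothetical placeholders, not the values in the attached script):

# Size the write phase a bit above half the aggregate memory of the
# reservation, so the data cannot simply be absorbed by client-side caches.
NODES=2048                  # compute nodes in the reservation
MEM_PER_NODE_GB=128         # hypothetical memory per node
TASKS_PER_NODE=4            # hypothetical MPI tasks per node
TOTAL_GB=$(( NODES * MEM_PER_NODE_GB * 6 / 10 ))        # ~60% of aggregate memory
GB_PER_TASK=$(( TOTAL_GB / (NODES * TASKS_PER_NODE) ))  # IOR block size per task
echo "write ${TOTAL_GB} GB in total, ${GB_PER_TASK} GB per task"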

Just FYI, the above contains a small correction to the proposal I sent, which George and I figured out while we were running.  The original idea was modeled after Jim Gray's sort benchmark, which measured how much data could be sorted in one minute.  Originally it was called Terasort and measured how quickly a terabyte could be sorted, but after a few years a terabyte was so small that the test was effectively measuring only how quickly you could launch a job.

So we said: how much can you do in 5 minutes?  But when we ran the test, we realized it was better to use the IOR self-reported bandwidth numbers, because they exclude the MPI start-up and finalize times and are therefore a more accurate reflection of the IO.  But then we realized that people could do only a small amount of actual IO while still satisfying the 5-minute limit, and get a high bandwidth reported from IOR because that small amount of IO fit in server-side caches, which don't necessarily respect a sync command.

So the new proposal is a minimum of 5 minutes.  To get consistent numbers, I think we should also set a maximum time.  So I propose that we use the self-reported IOR/mdtest results if and only if the self-reported IO time is between 5 and 5.5 minutes.

In this case, it might be useful to have the test loop with some kind of linear approximation or binary search until the IOR/mdtest run time falls within 300-330 s.  The user can avoid this by supplying a size parameter such that the first run completes within the required time window.  If multiple runs are needed, the test script should report the final sizes used along with the other configuration parameters, so they can be reused for subsequent runs.
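A rough sketch of what such a loop could look like in the driver script, assuming a hypothetical run_ior_easy wrapper that performs the write phase with a given per-task size and prints the self-reported IO time in whole seconds (neither exists in the current script):

SIZE_GB=64                  # hypothetical starting size per task
TARGET_MIN=300
TARGET_MAX=330
for attempt in 1 2 3 4 5; do
    secs=$(run_ior_easy "$SIZE_GB")       # self-reported write time in seconds
    if [ "$secs" -ge "$TARGET_MIN" ] && [ "$secs" -le "$TARGET_MAX" ]; then
        echo "accepted: ${SIZE_GB} GB per task finished in ${secs} s"
        break
    fi
    # linear approximation: scale the size toward the middle of the window
    SIZE_GB=$(( SIZE_GB * (TARGET_MIN + TARGET_MAX) / 2 / secs ))
done
echo "final size: ${SIZE_GB} GB per task (report this so the run can be reproduced)"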

Alternatively, since we are already modifying IOR, we could add an option to limit the test run to the specified time.  Something like "run until N seconds have elapsed, then continue until all threads have written the same amount of data", so that it should be equivalent to a regular IOR run with the same specified size.

Hey Andreas,

Interesting that you suggest this.  A Seagate colleague also asked for something similar.  We also used to run the LANL benchmark this way; Brent Welch used to complain that it was a bad workload, since real applications don't do this and it is important to account for stragglers.

The Seagate colleague actually requested this for a different reason.  He refers to this mode as 'stonewalled' and the other mode as 'non-stonewalled.'  He suggested that IO-500 should do both, because the difference between them is a measurement of load imbalance.  This is a good point.  However, I think IO-500 is not about helping diagnose why a storage system might be inefficient; it is just about measuring expected performance.  So doing stonewalled runs is interesting, and comparing them to non-stonewalled runs is also interesting, but neither is the purpose of IO-500 as I have been thinking of it.

Thanks,

John

Cheers, Andreas

Thanks,

John

I think we should start a repository and make updates there; we can also keep optimum parameters per site.

The IOR bandwidths are written with details to the file ior_$JOBID, for example:

Bandwidth 1 is  774570.92 MB/s and duration is  17.30 seconds
Bandwidth 2 is  123680.01 MB/s and duration is   7.78 seconds
Bandwidth 3 is 1250231.51 MB/s and duration is  10.72 seconds
Bandwidth 4 is   83650.59 MB/s and duration is  11.51 seconds

and the mdtest results to the file mdt_$JOBID.
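If it helps, the four bandwidth lines can be pulled out of that file and folded into the geometric mean with something like the following sketch (the line format matches the example above; field $4 is the MB/s column):

awk 'BEGIN { prod = 1 }
     /^Bandwidth/ { prod *= $4; n++ }
     END { if (n) printf "geometric mean: %.2f MB/s over %d runs\n", prod^(1/n), n }' "ior_$JOBID"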

The script is prepared with SLURM commands, and in this example I used 2048 compute nodes, so feel free to adapt it.
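For anyone adapting it, the SLURM part is roughly of this shape (the directives carry illustrative values; only the node count matches this example run, and $WORKDIR / $GB_PER_TASK stand in for the script's own variables):

#!/bin/bash
#SBATCH --job-name=io500
#SBATCH --nodes=2048
#SBATCH --ntasks-per-node=4        # illustrative; tune per site
#SBATCH --time=02:00:00            # illustrative
#SBATCH --output=io500_%j.out

# one of the phases: file-per-process write with an fsync at the end
srun ior -w -F -e -t 2m -b ${GB_PER_TASK}g -o $WORKDIR/ior_easy/testfile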

If we use Lustre, we need to run the experiment in a separately striped directory. Even for the Burst Buffer, I need to increase the number of MPI-IO aggregators in the shared-file case, but at least we have a first version for now.
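As a sketch of what that site-specific tuning might look like (the stripe count and hint value are placeholders, not recommendations):

# stripe the shared-file directory across all OSTs before the ior_hard run
lfs setstripe -c -1 "$WORKDIR/ior_hard"
# on Cray MPICH, raise the number of MPI-IO collective-buffering aggregators
export MPICH_MPIIO_HINTS="*:cb_nodes=256"    # placeholder value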

See you at the BOF.

Best regards,
George

________________________________________
George Markomanolis, PhD
Computational Scientist
KAUST Supercomputing Laboratory (KSL)
King Abdullah University of Science & Technology
Al Khawarizmi Bldg. (1) Room 0123
Thuwal
Kingdom of Saudi Arabia
Mob:   +966 56 325 9012
Office: +966 12 808 0393

From: IO-500 <io-500-bounces@vi4io.org> on behalf of Julian Kunkel <juliankunkel@googlemail.com>
Date: Saturday, 17 June 2017 at 9:42 AM
To: "io-500@vi4io.org" <io-500@vi4io.org>
Subject: [IO-500] Detailed benchmark proposal

Somehow the mail from John did not get through, so here it is (if there is an email issue, please mail me). Thanks also to all those we had discussions with, besides the Dagstuhl meeting...

From: John Bent <John.Bent@seagategov.com>
Sent: 16 June 2017 22:30:23 CEST
To: "io-500@vi4io.org" <io-500@vi4io.org>
Subject: Detailed benchmark proposal

All,


Sorry for the long silence on the mailing list. However, we have made some substantial progress recently as we prepare for our ISC BOF next week.  For those of you at ISC, please join us from 11 to 12 on Tuesday in Substanz 1&2.


The progress we made recently happened because a bunch of us attended a workshop at Dagstuhl in Germany last month and had multiple discussions about the benchmark.


Here are the highlights of what was discussed and the progress we made at Dagstuhl:


• General agreement that the IOR-hard, IOR-easy, mdtest-hard, mdtest-easy approach is appropriate.
• We should add a ‘find’ command as this is a popular and important workload.
• The multiple bandwidth measurements should be combined via geometric mean into one bandwidth.
• The multiple IOPS measurements should also be combined via geometric mean into one IOPS number.
• The bandwidth and the IOPs should be multiplied to create one final score.
• The ranking uses that final score but the webpage can be sorted using other metrics.
• The webpage should allow filtering as well so, for example, people can look at only the HDD results.
• We should separate the write/create phases from the read/stat phases to help ensure that caching is avoided.
• Nathan Hjelm volunteered to combine the mdtest and IOR benchmarks into one git repo and has now done so.  This removes the #ifdef mess from mdtest, and both now share the nice modular IOR backend.


So the top-level summary of the benchmark in pseudo-code has become:


# write/create phase
bw1 = ior_easy -write [user supplies their own parameters maximizing data writes that can be done in 5 minutes]
md1 = md_test_easy -create [user supplies their own parameters maximizing file creates that can be done in 5 minutes]
bw2 = ior_hard -write [we supply parameters: unaligned strided into single shared file]
md2 = md_test_hard -create [we supply parameters: creates of 3900 byte files into single shared directory]

# read/stat phase
bw3 = ior_easy -read [cross-node read of everything that was written in bw1]
md3 = md_test_easy -stat [cross-node stat of everything that was created in md1]
bw4 = ior_hard -read
md4 = md_test_hard -stat

# find phase
md5 = [we supply parameters to find a subset of the files that were created in the tests]

# score phase
bw = geo_mean(bw1 bw2 bw3 bw4)
md = geo_mean(md1 md2 md3 md4 md5)
total = bw * md
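
To make the hard phases concrete, they might map onto invocations roughly like the ones below; the transfer size, segment count, and file counts are hypothetical placeholders until the parameters in the attached file are agreed:

# ior_hard write: unaligned strided IO into a single shared file
ior -w -a MPIIO -t 47008 -b 47008 -s 10000 -e -o $WORKDIR/ior_hard/file
# mdtest_hard create: 3900-byte files in a single shared directory
mdtest -C -F -w 3900 -n 100000 -d $WORKDIR/mdt_hard
# the read/stat phases later reuse the same parameters
ior -r -a MPIIO -t 47008 -b 47008 -s 10000 -o $WORKDIR/ior_hard/file
mdtest -T -F -n 100000 -d $WORKDIR/mdt_hard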

Now we are moving on to precisely defining what the parameters should look like for the hard tests and to creating a standard, so that people can start running the benchmark on their systems.  By doing so, we will define the formal process so we can actually make this an official benchmark.  Please see the attached file, in which we have started precisely defining these parameters; let's start iterating on it to get the parameters correct.


Thanks,


John

--
Dr. Julian Kunkel
Abteilung Forschung
Deutsches Klimarechenzentrum GmbH (DKRZ)
Bundesstraße 45a • D-20146 Hamburg • Germany

Phone:  +49 40 460094-161
Fax: +49 40 460094-270
E-mail: kunkel@dkrz.de
URL: http://www.dkrz.de



<io_500.sh>



Cheers, Andreas