On Jun 17, 2017, at 10:54 PM, Georgios Markomanolis <georgios.markomanolis@kaust.edu.sa> wrote:

Hello everybody,
 
Today I had the chance to work with John for some time, and we decided to make a first version of the script. I attach it; we have tested it on our Cray burst buffer at KAUST and it works without issues, though I still have to tune it a bit for better performance. We introduced some variables to make it easy to modify. The script calculates everything automatically. We need to clarify what we want as the final result; for now it follows the formula from the previous email with the geometric means.
 
Some topics for discussion: why do we need to run mdtest and IOR for a minimum of 5 minutes? For example, on our system, where IOR exceeds 1 TB/s, I would need to create 300+ TB in 5 minutes on the burst buffer. Would it not be reasonable instead to create a file somewhat larger than half the memory of the reserved resources? Do you think there is any chance of caching once we use more than half the memory? For mdtest, creating millions of files can hurt the file system, especially Lustre; we have had real cases where Lustre could not handle it.
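To make the question concrete, here is a rough sketch of the sizing rule I have in mind; MEM_PER_NODE_GB is a placeholder the user would set for their machine, and the 60% margin is just one choice of "a bit more than half":

# Sketch: size the dataset to exceed half of aggregate memory, so reads
# cannot be served entirely from client-side caches.
# MEM_PER_NODE_GB is a placeholder value, not part of the real script.
MEM_PER_NODE_GB=128
NODES=${SLURM_NNODES:-2048}
TOTAL_MEM_GB=$(( NODES * MEM_PER_NODE_GB ))
DATA_GB=$(( TOTAL_MEM_GB * 6 / 10 ))   # ~60% of aggregate memory
echo "target dataset size: ${DATA_GB} GB (aggregate memory: ${TOTAL_MEM_GB} GB)"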
 
Just FYI, the above contains a small correction to the proposal I sent, which George and I figured out while we were running.  The original idea was modeled after Jim Gray’s sort benchmark, which measured how much data could be sorted in one minute.  Originally it was called Terasort and measured how quickly a terabyte could be sorted, but after a few years a terabyte was so small that the test was effectively measuring only how quickly you could launch a job.

So we said: measure how much you can do in 5 minutes.  But when we ran the test, we realized it was better to use the IOR self-reported bandwidth numbers, because they exclude the MPI start-up and finalize times and are therefore a more accurate reflection of the I/O.  But then we realized that people could do only a small amount of I/O, small enough to fit in server-side caches (which don’t necessarily respect a sync command), still satisfy the 5-minute wall-clock limit, and get a high self-reported bandwidth from IOR.

So the new proposal is to require a minimum of 5 minutes of self-reported I/O time.  To get consistent numbers, I think we should also set a maximum time.  So I propose that we use the self-reported IOR/mdtest results iff the self-reported I/O time is between 5 and 5.5 minutes.
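For example, the validity check could be as simple as this sketch, assuming the self-reported I/O time in seconds has already been extracted into a hypothetical variable IOR_TIME_S:

# Sketch: accept a result only if its self-reported I/O time falls in
# the proposed 5.0-5.5 minute window (300-330 seconds).
IOR_TIME_S=312.4    # hypothetical extracted value
if awk -v t="$IOR_TIME_S" 'BEGIN { exit !(t >= 300 && t <= 330) }'; then
    echo "valid: ${IOR_TIME_S} s is within the window"
else
    echo "invalid: ${IOR_TIME_S} s is outside 300-330 s"
fi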

Thanks,

John

I think we should start a repository and make updates there; we can also keep the optimal parameters per site.
 
The IOR bandwidth numbers are written, with details, to ior_$JOBID; for example:
 
Bandwidth 1 is  774570.92 MB/s and duration is  17.30 seconds
Bandwidth 2 is  123680.01 MB/s and duration is   7.78 seconds
Bandwidth 3 is 1250231.51 MB/s and duration is  10.72 seconds
Bandwidth 4 is   83650.59 MB/s and duration is  11.51 seconds
 
and the mdtest results are in the file mdt_$JOBID.
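For scripting, the per-run numbers can be pulled out of the IOR log with a one-liner like this, based on the line format shown above:

# Extract the per-run bandwidth and duration from ior_$JOBID.
awk '/^Bandwidth/ { printf "run %s: %s MB/s in %s s\n", $2, $4, $(NF-1) }' ior_$JOBID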
 
The script is prepared with SLURM commands, and in this example I used 2048 compute nodes, so feel free to adapt it.
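For orientation, the skeleton looks roughly like this; the binary paths and benchmark flags here are placeholders, not our tuned values:

#!/bin/bash
#SBATCH --job-name=io500
#SBATCH --nodes=2048                 # as in this example run; adapt per site
#SBATCH --time=01:00:00
# Placeholders: the real script sets its own paths and tuned flags.
IOR_BIN=${IOR_BIN:-ior}
MDTEST_BIN=${MDTEST_BIN:-mdtest}
srun --ntasks-per-node=1 "$IOR_BIN"    -w -r > ior_$SLURM_JOB_ID
srun --ntasks-per-node=1 "$MDTEST_BIN" -C -T > mdt_$SLURM_JOB_ID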
 
If we use Lustre, we need to run the experiment in a suitably striped directory. Even for the burst buffer, I need to increase the number of MPI-IO aggregators in the shared-file case, but at least we have a first version for now.
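For the Lustre case, the striping can be set on the target directory beforehand, for example (the path is a placeholder):

# Stripe the benchmark directory across all OSTs (-c -1 means all OSTs).
lfs setstripe -c -1 /lustre/io500_run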
 
See you at the BoF.
 
Best regards,
George
 
________________________________________
George Markomanolis, PhD
Computational Scientist
KAUST Supercomputing Laboratory (KSL)
King Abdullah University of Science & Technology
Al Khawarizmi Bldg. (1) Room 0123
Thuwal
Kingdom of Saudi Arabia
Mob:   +966 56 325 9012
Office: +966 12 808 0393
 
From: IO-500 <io-500-bounces@vi4io.org> on behalf of Julian Kunkel <juliankunkel@googlemail.com>
Date: Saturday, 17 June 2017 at 9:42 AM
To: "io-500@vi4io.org" <io-500@vi4io.org>
Subject: [IO-500] Detailed benchmark proposal
 
Somehow the mail from John did not get through, so here it is (if there is an email issue, please mail me). Thanks also to all those we had discussions with beyond the Dagstuhl meeting...
 

From: John Bent <John.Bent@seagategov.com>
Sent: 16 June 2017 22:30:23 CEST
To: "io-500@vi4io.org" <io-500@vi4io.org>
Subject: Detailed benchmark proposal

All, 


Sorry for the long silence on the mailing list. However, we have made some substantial progress recently as we prepare for our ISC BOF next week.  For those of you at ISC, please join us from 11 to 12 on Tuesday in Substanz 1&2.


The progress that we have made recently happened because a bunch of us attended a workshop at Dagstuhl in Germany last month and had multiple discussions about the benchmark.


Here are the highlights of what was discussed and the progress that we made at Dagstuhl:


  1. General agreement that the IOR-hard, IOR-easy, mdtest-hard, mdtest-easy approach is appropriate.  
  2. We should add a ‘find’ command as this is a popular and important workload.  
  3. The multiple bandwidth measurements should be combined via geometric mean into one bandwidth.
  4. The multiple IOPs measurements should also be combined via geometric mean into one IOPs.
  5. The bandwidth and the IOPs should be multiplied to create one final score.
  6. The ranking uses that final score but the webpage can be sorted using other metrics.
  7. The webpage should allow filtering as well so, for example, people can look at only the HDD results.
  8. We should separate the write/create phases from the read/stat phases to help ensure that caching is avoided.
  9. Nathan Hjelm volunteered to combine the mdtest and IOR benchmarks into one git repo and has now done so.  This removes the #ifdef mess from mdtest, and now they both share the nice modular IOR backend.


So the top-level summary of the benchmark in pseudo-code has become:


# write/create phase
bw1 = ior_easy -write [user supplies their own parameters maximizing data writes that can be done in 5 minutes]
md1 = md_test_easy -create [user supplies their own parameters maximizing file creates that can be done in 5 minutes]
bw2 = ior_hard -write [we supply parameters: unaligned strided into single shared file]
md2 = md_test_hard -create [we supply parameters: creates of 3900 byte files into single shared directory]
 
# read/stat phase
bw3 = ior_easy -read [cross-node read of everything that was written in bw1]
md3 = md_test_easy -stat [cross-node stat of everything that was created in md1]
bw4 = ior_hard -read
md4 = md_test_hard -stat
 
# find phase
md5 = [we supply parameters to find a subset of the files that were created in the tests]
 
# score phase
bw = geo_mean( bw1 bw2 bw3 bw4)
md = geo_mean( md1 md2 md3 md4 md5)
total = bw * md
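 
As a concrete sketch of the score phase in shell, assuming the bw* and md* values have been collected into variables by the preceding phases:

# Geometric mean of the positive numbers given as arguments.
geo_mean() {
    echo "$@" | awk '{ for (i = 1; i <= NF; i++) s += log($i)
                       printf "%.4f\n", exp(s / NF) }'
}
bw=$(geo_mean "$bw1" "$bw2" "$bw3" "$bw4")
md=$(geo_mean "$md1" "$md2" "$md3" "$md4" "$md5")
total=$(awk -v b="$bw" -v m="$md" 'BEGIN { printf "%.4f\n", b * m }')
echo "score = $total"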
 
Now we are moving on to precisely defining what the parameters should look like for the hard tests, and to creating a standard so that people can start running the benchmark on their systems.  By doing so, we will define the formal process needed to make this an official benchmark.  Please see the attached file, in which we’ve started precisely defining these parameters.  Let’s start iterating on this file to get the parameters correct.


Thanks,


John

-- 
Dr. Julian Kunkel
Research Department
Deutsches Klimarechenzentrum GmbH (DKRZ)
Bundesstraße 45a • D-20146 Hamburg • Germany

Phone:  +49 40 460094-161
Fax: +49 40 460094-270
E-mail: kunkel@dkrz.de
URL: http://www.dkrz.de

Managing Director: Prof. Dr. Thomas Ludwig
Registered office: Hamburg
Hamburg District Court, HRB 39784
 


[Attachment: io_500.sh]