On Jun 19, 2017, at 2:46 AM, John Bent <John.Bent(a)seagategov.com> wrote:
>
> On Jun 17, 2017, at 10:54 PM, Georgios Markomanolis
<georgios.markomanolis(a)kaust.edu.sa> wrote:
>
> Hello everybody,
>
> Today I had the chance to work with John for some time and we decided to make a first
version of the script. I attach it; we have tested it on our Cray Burst Buffer at KAUST
and it works without issues, though I have to tune it a bit for better performance. We tried to
create some variables that are easy to modify. The script calculates everything automatically. We
need to clarify what we want as the final result; for now it follows the formula of the previous
email with the geometric means.
>
> Some topics for discussion: why do we need to run mdtest and IOR for 5 minutes
minimum? For example, on our system with more than 1 TB/s IOR, I need to create 300 TB+ in 5
minutes on the BB. Would it not be reasonable to create a file a bit larger than half the memory of
the reserved resources? Do you think that there is a chance of caching when using more than
half the memory? For mdtest, creating millions of files can hurt the filesystem,
especially Lustre. We have had some real cases that Lustre could not handle.
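The 300 TB figure follows directly from bandwidth times duration. A minimal sketch of that arithmetic; the function name is only illustrative and is not part of the attached script:

```python
def required_volume_tb(bandwidth_tb_per_s: float, duration_s: float) -> float:
    """Data volume in TB written when a bandwidth is sustained for a duration."""
    return bandwidth_tb_per_s * duration_s


# 1 TB/s sustained for 5 minutes (300 s), as in the example above.
print(required_volume_tb(1.0, 5 * 60))
```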
>
Just FYI, the above contains a small correction to the proposal I sent that George and I
figured out while we were running. The original idea was modeled after Jim Gray’s sort
benchmark, which was to see how much could be sorted in 1 minute. Originally it was called
Terasort and measured how quickly a terabyte could be sorted, but after a few years a terabyte
was so small that the test was effectively testing only how quickly you could launch a job.
So we said: how much can you do in 5 minutes? But when we ran the test we realized it was
better to use the IOR self-reported numbers for bandwidth because they exclude the MPI
start-up and finalize times, which gives a more accurate reflection of IO. But then we
realized that people could run for only a small amount of time, satisfy the 5 minute
limit, and get a high bandwidth reported from IOR because they did a small amount of IO
which fit in server-side caches, which don’t necessarily respect a sync command.
So the new proposal is to run for at least 5 minutes. To try to get consistent
numbers, I think we should also set a maximum time. So I propose that we use the
self-reported IOR/mdtest results iff the self-reported IO time is between 5 and 5.5
minutes.
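The acceptance rule proposed here can be written as a simple check. This is a sketch only: the names are hypothetical, and treating both ends of the window as inclusive is an assumption the proposal does not spell out.

```python
MIN_SECONDS = 5 * 60      # 300 s lower bound from the proposal
MAX_SECONDS = 5.5 * 60    # 330 s upper bound from the proposal


def result_is_valid(io_seconds: float) -> bool:
    """Accept a self-reported result only if its IO time is inside the window.

    Assumption: both bounds are inclusive.
    """
    return MIN_SECONDS <= io_seconds <= MAX_SECONDS


for seconds in (17.30, 300.0, 315.0, 331.0):
    print(seconds, result_is_valid(seconds))
```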
In this case, it might be useful to have the test loop with some kind of linear
approximation/binary search until the IOR/mdtest results fall within 300-330s. The
user can avoid this by supplying a size parameter such that the first run completes
within the required time window. If multiple runs are needed, the test script should
report the final sizes used along with the other config parameters so that they can be
reused for subsequent runs.
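One way such a loop could look, as a sketch under stated assumptions: `run` is a hypothetical callable that launches the benchmark with a given size and returns the self-reported IO time in seconds, run time is assumed to scale roughly linearly with size, and aiming for the middle of the window (315 s) is a choice made here, not part of the proposal.

```python
from typing import Callable, Tuple


def find_size(run: Callable[[int], float], size: int,
              low: float = 300.0, high: float = 330.0,
              max_runs: int = 10) -> Tuple[int, float, int]:
    """Rescale `size` until run(size) lands in [low, high] seconds.

    Returns (final size, elapsed seconds, number of runs) so the final
    size can be reported and reused for subsequent runs.
    """
    target = (low + high) / 2.0
    for attempt in range(1, max_runs + 1):
        elapsed = run(size)
        if elapsed <= 0:
            raise ValueError("benchmark reported a non-positive run time")
        if low <= elapsed <= high:
            return size, elapsed, attempt
        # Linear approximation: assume time is proportional to size.
        size = max(1, round(size * target / elapsed))
    raise RuntimeError(f"no size found within {max_runs} runs")


def fake_run(size: int) -> float:
    """Stand-in for a real benchmark: 1000 units/s plus 2 s fixed overhead."""
    return size / 1000.0 + 2.0


final_size, elapsed, runs = find_size(fake_run, 10_000)
print(f"size={final_size} elapsed={elapsed:.1f}s after {runs} run(s)")
```

If the first size already lands in the window, the loop returns after a single run, which matches the case where the user supplies a suitable size up front.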
Alternately, since we are already modifying IOR, we could add an option to limit the test
run to the specified time. Something like "run until N seconds have elapsed, then
continue until all threads have written the same amount of data" so that it should be
equivalent to a regular IOR run with the same specified size.
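The "run until N seconds, then let everyone catch up" idea can be illustrated with a small simulation. This is not IOR code and makes no claim about how IOR would implement it: the per-rank write rates are hypothetical inputs, and blocks are assumed to be written at a constant rate.

```python
import math
from typing import List, Tuple


def stonewall_with_catch_up(rates_blocks_per_s: List[float],
                            deadline_s: float) -> Tuple[int, List[float]]:
    """Simulate a time-limited run in which all ranks end with equal data.

    Each rank writes at its own rate until the deadline. The largest block
    count reached by any rank becomes the common target, and the slower
    ranks keep writing until they reach it.
    Returns (blocks per rank, per-rank finish time in seconds).
    """
    if not rates_blocks_per_s or any(r <= 0 for r in rates_blocks_per_s):
        raise ValueError("need at least one rank and positive rates")
    at_deadline = [math.floor(r * deadline_s) for r in rates_blocks_per_s]
    target = max(at_deadline)
    finish_times = [target / r for r in rates_blocks_per_s]
    return target, finish_times


target, finish = stonewall_with_catch_up([10.0, 8.0, 5.0], 300.0)
print(target, finish)
```

Because every rank ends with the same block count, the result corresponds to a regular run with that fixed size; the cost is that the slowest rank determines the total run time.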
Cheers, Andreas
Thanks,
John
> I think we should start a repository and keep the script updated there; we can also
keep optimum parameters per site.
>
> The IOR bandwidth results are written to ior_$JOBID with details, for example:
>
> Bandwidth 1 is 774570.92 MB/s and duration is 17.30 seconds
> Bandwidth 2 is 123680.01 MB/s and duration is 7.78 seconds
> Bandwidth 3 is 1250231.51 MB/s and duration is 10.72 seconds
> Bandwidth 4 is 83650.59 MB/s and duration is 11.51 seconds
>
> and the mdtest results are in the file mdt_$JOBID.
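A minimal sketch of how these lines could be parsed and combined with a geometric mean. The regular expression is derived only from the four sample lines above; anything else about the layout of ior_$JOBID is an assumption, and the function name is hypothetical.

```python
import re
from statistics import geometric_mean

# Pattern taken from the sample lines in this email.
LINE_RE = re.compile(
    r"Bandwidth\s+(\d+)\s+is\s+([0-9.]+)\s+MB/s\s+and\s+duration\s+is\s+([0-9.]+)\s+seconds"
)


def parse_bandwidths(text):
    """Return a list of (index, MB/s, seconds) tuples found in the text."""
    results = []
    for match in LINE_RE.finditer(text):
        index, mb_per_s, seconds = match.groups()
        results.append((int(index), float(mb_per_s), float(seconds)))
    return results


SAMPLE = """\
Bandwidth 1 is 774570.92 MB/s and duration is 17.30 seconds
Bandwidth 2 is 123680.01 MB/s and duration is 7.78 seconds
Bandwidth 3 is 1250231.51 MB/s and duration is 10.72 seconds
Bandwidth 4 is 83650.59 MB/s and duration is 11.51 seconds
"""

results = parse_bandwidths(SAMPLE)
combined = geometric_mean([bw for _, bw, _ in results])
print(f"{len(results)} results, geometric mean {combined:.2f} MB/s")
```

All four sample durations are well under 300 seconds, so none of these runs would qualify under the 5 to 5.5 minute window discussed in this thread.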
>
> The script is prepared with SLURM commands and in this example I have used 2048
compute nodes, so feel free to adapt.
>
> If we use Lustre, we need to run the experiment in a different striped folder.
Even for the BB, I need to increase the number of MPI I/O aggregators in the shared-file
case, but at least we have a first version for now.
>
> See you at BOF.
>
> Best regards,
> George
>
> ________________________________________
> George Markomanolis, PhD
> Computational Scientist
> KAUST Supercomputing Laboratory (KSL)
> King Abdullah University of Science & Technology
> Al Khawarizmi Bldg. (1) Room 0123
> Thuwal
> Kingdom of Saudi Arabia
> Mob: +966 56 325 9012
> Office: +966 12 808 0393
>
> From: IO-500 <io-500-bounces(a)vi4io.org> on behalf of Julian Kunkel
<juliankunkel(a)googlemail.com>
> Date: Saturday, 17 June 2017 at 9:42 AM
> To: "io-500(a)vi4io.org" <io-500(a)vi4io.org>
> Subject: [IO-500] Detailed benchmark proposal
>
> Somehow the mail from John did not get through, so here it is (if there is an email
issue, please mail me). Thanks also to all those we had discussions with beyond the
Dagstuhl meeting...
>
> From: John Bent <John.Bent(a)seagategov.com>
> Sent: 16 June 2017 22:30:23 CEST
> To: "io-500(a)vi4io.org" <io-500(a)vi4io.org>
> Subject: Detailed benchmark proposal
>
> All,
>
>
> Sorry for the long silence on the mailing list. However, we have made some
substantial progress recently as we prepare for our ISC BOF next week. For those of you
at ISC, please join us from 11 to 12 on Tuesday in Substanz 1&2.
>
>
> The progress that we have made recently happened because a bunch of us were attending
a German workshop last month at Dagstuhl and had multiple discussions about the
benchmark.
>
>
> Here are the highlights of what was discussed and the progress that we made at
Dagstuhl:
>
>
> • General agreement that the IOR-hard, IOR-easy, mdtest-hard, mdtest-easy approach
is appropriate.
> • We should add a ‘find’ command as this is a popular and important workload.
> • The multiple bandwidth measurements should be combined via geometric mean into one
bandwidth.
> • The multiple IOPs measurements should also be combined via geometric mean into one
IOPs.
> • The bandwidth and the IOPs should be multiplied to create one final score.
> • The ranking uses that final score but the webpage can be sorted using other
metrics.
> • The webpage should allow filtering as well so, for example, people can look at
only the HDD results.
> • We should separate the write/create phases from the read/stat phases to help
ensure that caching is avoided
> • Nathan Hjelm volunteered to combine the mdtest and IOR benchmarks into one git
repo and has now done so. This removes the #ifdef mess from mdtest and now they both
share the nice modular IOR backend
>
>
> So the top-level summary of the benchmark in pseudo-code has become:
>
>
> # write/create phase
> bw1 = ior_easy -write [user supplies their own parameters maximizing data writes that
can be done in 5 minutes]
> md1 = md_test_easy -create [user supplies their own parameters maximizing file
creates that can be done in 5 minutes]
> bw2 = ior_hard -write [we supply parameters: unaligned strided into single shared
file]
> md2 = md_test_hard -create [we supply parameters: creates of 3900 byte files into
single shared directory]
>
> # read/stat phase
> bw3 = ior_easy -read [cross-node read of everything that was written in bw1]
> md3 = md_test_easy -stat [cross-node stat of everything that was created in md1]
> bw4 = ior_hard -read
> md4 = md_test_hard -stat
>
> # find phase
> md5 = [we supply parameters to find a subset of the files that were created in the
tests]
>
> # score phase
> bw = geo_mean( bw1 bw2 bw3 bw4)
> md = geo_mean( md1 md2 md3 md4 md5)
> total = bw * md
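The score phase of the pseudo-code can be sketched in Python as follows. The function name is hypothetical and the input values are made-up placeholders chosen so the arithmetic is easy to follow; they are not measurements.

```python
from statistics import geometric_mean


def io500_score(bandwidths, metadata_rates):
    """Combine results as in the score phase of the pseudo-code.

    bandwidths:     bw1..bw4 (IOR easy/hard, write/read)
    metadata_rates: md1..md5 (mdtest easy/hard, create/stat, plus find)
    Returns (bw, md, total) where total = bw * md.
    """
    bw = geometric_mean(bandwidths)
    md = geometric_mean(metadata_rates)
    return bw, md, bw * md


# Placeholder inputs, not real results.
bw, md, total = io500_score([1.0, 1.0, 16.0, 1.0],
                            [1.0, 1.0, 1.0, 1.0, 32.0])
print(bw, md, total)
```

With a geometric mean, one very low result pulls the combined value down more strongly than it would with an arithmetic mean, so a system cannot hide a weak "hard" result behind a strong "easy" one.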
>
> Now we are moving on to precisely define what the parameters should look like for the
hard tests and to create a standard so that people can start running it on their systems.
By doing so, we will define the formal process so we can actually make this an official
benchmark. Please see the attached file in which we’ve started precisely defining these
parameters. Let’s start iterating please on this file to get these parameters correct.
>
>
> Thanks,
>
>
> John
>
> --
> Dr. Julian Kunkel
> Research Department
> Deutsches Klimarechenzentrum GmbH (DKRZ)
> Bundesstraße 45a • D-20146 Hamburg • Germany
>
> Phone: +49 40 460094-161
> Fax: +49 40 460094-270
> E-mail: kunkel(a)dkrz.de
> URL: http://www.dkrz.de
>
> Managing Director: Prof. Dr. Thomas Ludwig
> Registered office: Hamburg
> Hamburg District Court, HRB 39784
>
>
> This message and its contents including attachments are intended solely for the
original recipient. If you are not the intended recipient or have received this message in
error, please notify me immediately and delete this message from your computer system. Any
unauthorized use or distribution is prohibited. Please consider the environment before
printing this email.
> <io_500.sh>_______________________________________________
> IO-500 mailing list
> IO-500(a)vi4io.org
>
https://www.vi4io.org/cgi-bin/mailman/listinfo/io-500