ISC BOF report


John Bent

23 Jun 2017

10:57 p.m.

All,

We had a great session at ISC (about 30 people, I think) and made great progress in the weeks leading up to it as well. Thanks to Satoshi, we even got the two attached slides added into the official slides being released from the Top 500 session! We had 6 people sign up at the BOF saying that they’ll run the benchmark when it is finalized.

I know I always say that we have ‘almost’ finalized the benchmark. But we really are getting much closer; it helped so much that Nathan combined the benchmarks and George worked on the script.

I think we only have two open questions right now:

1. Do 47K random IO in the IOR-hard or do 47K simple strided? My original thinking was strided, but someone pointed out that the idea is to create the bounding box and random is harder than strided. Also, random might be increasingly prevalent these days with more analytics, machine learning, graph analytics, etc. So I propose that we do random unless there are objections here.

2. Should we do some sort of mixed IO workload in addition to running the 4 tests serially? I like the idea but am not sure how exactly to do it. Do we need to merely mix IOR-hard and IOR-easy, or md-hard and md-easy, or both, or mix all 4 at once? Do we just launch multiple command lines in the background and hope that the mpirun launch times are fast enough that they overlap? Do we need to modify IOR/mdtest to split the ranks in half and do different workloads with the two halves? Thoughts?

We also made some procedural decisions. The initial steering committee will be Jay Lofstead, Julian Kunkel, and myself. That steering committee membership will last until IO500 is up and running and stable, at which point the community can nominate new members. All decisions will be discussed first on the mailing list and we will try for as much consensus as possible. The VI4IO organization will host the IO500.

Thanks very much,

John


Andreas Dilger

24 Jun

12:02 a.m.

On Jun 23, 2017, at 4:57 PM, John Bent <John.Bent(a)seagategov.com> wrote:

...

May as well go random at that point.

...

2. Should we do some sort of mixed IO workload in addition to running the 4 tests serially? I like the idea but am not sure how exactly to do it. Do we need to merely mix IOR-hard and IOR-easy or md-hard and md-easy or both or mix all 4 at once? Do we just launch multiple command lines in the background and hope that the mpirun launch times are fast enough that they overlap? Do we need to modify IOR/mdtest to split the ranks in half and do different workloads with the two halves? Thoughts?

This would be a bit of a dog's breakfast, and very hard to specify the test parameters. Do all of the loads need to be running for the whole duration? How does this work if some jobs finish early? What if, for example, there was a workload scheduler in the storage that (automatically?) segregated the IO of each workload and they actually ran serially and didn't contend at all? Would that be considered an improvement, since this could help real-world jobs as well? Maybe something for V2?

_______________________________________________
IO-500 mailing list
IO-500(a)vi4io.org
https://www.vi4io.org/cgi-bin/mailman/listinfo/io-500

Cheers, Andreas

John Bent

12:05 a.m.

...

On Jun 23, 2017, at 6:02 PM, Andreas Dilger <adilger(a)dilger.ca> wrote: On Jun 23, 2017, at 4:57 PM, John Bent <John.Bent(a)seagategov.com> wrote:

May as well go random at that point.

Agree about the scheduler. But what about a modified IOR that splits the ranks in two, so that one half does easy and the other half does hard? Thanks, John
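To make the rank-splitting idea concrete, here is a minimal sketch of how ranks could be divided between the two workloads. It is not how IOR works today; the function names `assign_workload` and `split_ranks` are hypothetical, and the choice to give the extra rank to the hard half when the count is odd is an assumption made only for illustration.

```python
def assign_workload(rank, size):
    """Return "easy" or "hard" for a rank when the job is split in half.

    Hypothetical helper: the lower half of the ranks runs the easy
    pattern and the upper half runs the hard pattern.
    """
    if size < 2:
        raise ValueError("need at least two ranks to run two workloads")
    if not 0 <= rank < size:
        raise ValueError("rank must be in the range [0, size)")
    # With an odd rank count the extra rank lands in the hard half
    # (an arbitrary choice for this sketch).
    return "easy" if rank < size // 2 else "hard"


def split_ranks(size):
    """Group all rank ids of a job of the given size by workload."""
    groups = {"easy": [], "hard": []}
    for rank in range(size):
        groups[assign_workload(rank, size)].append(rank)
    return groups


if __name__ == "__main__":
    print(split_ranks(8))
```

In an MPI program the same decision could serve as the color passed to MPI_Comm_split, so that each half gets its own communicator and runs its workload independently of the other half.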


Julian Kunkel

7:04 a.m.

Dear John,

I would move find and any concurrent benchmark or other application-specific mode to the extended benchmark. That makes it easier to roll out the initial version.

Cheers, Julian

-- http://wr.informatik.uni-hamburg.de/people/julian_kunkel

Georgios Markomanolis

7:16 a.m.

Dear Julian,

My apologies, I sent my answer to another thread (I have many open in Outlook), which made the communication confusing. As I said in my previous email, the mixed workload has been tested initially. I understand that it is more advanced, but I would like to have it in the initial release because it is a real case, and otherwise we can be criticized that the benchmark does not correspond to real cases. A good beginning is halfway to success.

I would like to be able to see something like 3 numbers from IOR (possibly more, this is just an example): worst case, best case, and an average. For example, if I have only 2 numbers, 100 GB/s and 1.5 TB/s, this says nothing about the real expectation. I can think of more details, but those could certainly go into the extended version, not the basic one. The basic benchmark should be simple to execute, and I believe we can do that with the mixed workload.

I will be a bit slow in my answers now as I am getting ready for holidays. Enjoy I/O

Best regards, George

________________________________________
George Markomanolis, PhD
Computational Scientist
KAUST Supercomputing Laboratory (KSL)
King Abdullah University of Science & Technology
Al Khawarizmi Bldg. (1) Room 0123
Thuwal
Kingdom of Saudi Arabia
Mob: +966 56 325 9012
Office: +966 12 808 0393

John Bent

2:11 p.m.

Mixed requires a bit of work, but I think it is worth it. Find requires no effort, so I see no reason to drop it.

Perhaps one reason not to include it is that it might take an intractable amount of time. George, have you tried the find part yet? Just how bad is it?

However, this is exactly the point. It's a hard workload and it will make most systems look bad. If it is optional, then people won't run it because they don't want to look bad.

We had agreement on find: the community asked for it, and we gave it to them. It makes our steering committee look bad if we change now. I suggest we leave find and figure out how to add mixed.

If we are still having our own discussion about this, I think we should move it to the full mailing list.

Thx

John


Georgios Markomanolis

2:44 p.m.

I volunteer to support the mixed workload. I need it, and it is really good for the first impression of the benchmark when it is released. In my opinion it is useful for everybody, except if you have a machine dedicated to a single job.

I have tried the find command and it is really slow. It takes 3-4 minutes when I have 1-1.5 million files (I don't remember the exact time; I am waiting for my flight for my holidays). I agree to include find; there is no reason to remove it. The trap here is that the fewer MPI processes you use, the fewer files you create, so the find command runs for a shorter time, but the result is also lower because it searched across a smaller number of files. I don't know if we want to define a rule such as a minimum of X files or something like that.

Best regards, George


Andreas Dilger

29 Jun

11:35 p.m.

On Jun 24, 2017, at 8:44 AM, Georgios Markomanolis <georgios.markomanolis(a)kaust.edu.sa> wrote:

...

I have tried the find command and it is really slow. It takes 3-4 minutes when I have 1-1.5 million files (I don't remember the exact time; I am waiting for my flight for my holidays). I agree to include find; there is no reason to remove it. The trap here is that the fewer MPI processes you use, the fewer files you create, so the find command runs for a shorter time, but the result is also lower because it searched across a smaller number of files. I don't know if we want to define a rule such as a minimum of X files or something like that.

The time that find takes to run depends on the preceding mdtest runs, which get their maximum score when a large number of files are created. We should make sure that the find result does not gain a significant advantage by completing faster because file creation was very slow and created few files. That means the result needs to be in "files per second" or similar, and not just the time it takes for find to finish.

I don't think running find in a loop to get a longer runtime is an improvement, since this would benefit from slower create rates (fewer files) by being able to re-scan the same files from cache.

Cheers, Andreas


John Bent

11:47 p.m.

...

On Jun 29, 2017, at 1:35 PM, Andreas Dilger <adilger(a)dilger.ca> wrote: That means the result needs to be in "files per second"

Absolutely. It will be just a fifth IOPS number to combine with the other 4 IOPS numbers (mdtest create easy/hard and mdtest stat easy/hard) using the geometric mean. Figure out the number of files created by the four produce phases: n. Then measure the wall-clock time of the find command: w. Then the 'find' IOPS is n/w.

My concern is that it will be so slow that people will give up and not run it.

John
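The calculation described above can be sketched as follows. This is only an illustration of the n/w rate and the geometric mean over five rates, not the benchmark's actual scoring script; the function names and the input values in the usage line are hypothetical.

```python
import math


def find_rate(files_created, find_wallclock_seconds):
    """Return the 'find' rate n/w in files per second."""
    if find_wallclock_seconds <= 0:
        raise ValueError("wall-clock time must be positive")
    return files_created / find_wallclock_seconds


def geometric_mean(values):
    """Return the geometric mean of a non-empty list of positive numbers."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("need a non-empty list of positive values")
    # Averaging the logarithms avoids overflow when multiplying large rates.
    return math.exp(sum(math.log(v) for v in values) / len(values))


def metadata_score(create_easy, create_hard, stat_easy, stat_hard,
                   files_created, find_wallclock_seconds):
    """Combine the four mdtest rates with the find rate via geometric mean."""
    rates = [create_easy, create_hard, stat_easy, stat_hard,
             find_rate(files_created, find_wallclock_seconds)]
    return geometric_mean(rates)


if __name__ == "__main__":
    # Made-up rates (operations per second) purely to exercise the functions.
    print(metadata_score(8000.0, 4000.0, 20000.0, 15000.0, 2_000_000, 1200.0))
```

Because the geometric mean multiplies the rates together, a very slow find pulls the combined number down sharply, which is the reason a system cannot hide a weak find result behind strong create and stat rates.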

Andreas Dilger

11:52 p.m.

On Jun 29, 2017, at 5:47 PM, John Bent <John.Bent(a)seagategov.com> wrote:

My concern is that it will be so slow that people will give up and not run it.

I don't see why that would be true? For Lustre at least, readdir() and stat() are about 2x as fast as creating files. Cheers, Andreas

John Bent

11:58 p.m.

...

On Jun 29, 2017, at 1:52 PM, Andreas Dilger <adilger(a)dilger.ca> wrote:


I don't see why that would be true? For Lustre at least, readdir() and stat() are about 2x as fast as creating files.

Even if you create across 10,000 nodes and then readdir from just 1? Thx John


Andreas Dilger

30 Jun

12:11 a.m.

On Jun 29, 2017, at 5:58 PM, John Bent <John.Bent(a)seagategov.com> wrote:


Even if you create across 10,000 nodes and then readdir from just 1?

Hmm, good point. I was thinking about parallel stat, but it isn't orders of magnitude slower. George mentioned the find command took 3-4 minutes, which isn't slower than the 5-minute create phase... Cheers, Andreas

John Bent

12:27 a.m.

On Jun 29, 2017, at 2:11 PM, Andreas Dilger <adilger(a)dilger.ca> wrote:

Hmm, good point. I was thinking about parallel stat, but it isn't orders of magnitude slower. George mentioned the find command took 3-4 minutes, which isn't slower than the 5-minute create phase...

I don't think George has done a 5-min create phase yet. He's just getting the scripts working still. Thx John


Georgios Markomanolis

3:22 a.m.

Hi,

A short update: I just tried to run a larger mdtest (I am still calibrating). For mdtest easy, the create takes 4 minutes, and for the hard it takes 4.5 minutes (I am getting close). I create 2 million files. At this moment the find command still has not finished, and it has been running for 20 minutes. The total execution time of the benchmark, without having calibrated IOR to 5 minutes, is almost 46 minutes and counting until the find command finishes. All the IOR runs except the hard write took less than 2 minutes, and the hard write took 7.7 minutes. We are also still missing the mixed workload.

I use 1000 compute nodes with 2 MPI processes per node. I have a time limit of 1 hour, so I am afraid that the job could be killed due to the time limit.

Best regards, George
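As a rough worked example of what these numbers imply, the sketch below bounds the find rate from above. It assumes that the 2 million files are everything the find has to scan and that the find had been running for 20 minutes when the message was written; both are readings of the message rather than measured facts, and the function name is hypothetical.

```python
def find_rate_upper_bound(files_created, elapsed_seconds):
    """Upper bound on files/s for a find still running after elapsed_seconds.

    The final wall-clock time w is larger than the elapsed time, so the
    final rate n/w must be smaller than n/elapsed.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return files_created / elapsed_seconds


files_created = 2_000_000      # assumed total number of files scanned by find
elapsed_seconds = 20 * 60      # find had not finished after 20 minutes
total_processes = 1000 * 2     # 1000 nodes with 2 MPI processes per node

bound = find_rate_upper_bound(files_created, elapsed_seconds)
remaining_minutes = 60 - 46    # 1 hour limit minus roughly 46 minutes used

print(f"{total_processes} processes created the files")
print(f"find rate is below {bound:.0f} files/s")
print(f"about {remaining_minutes} minutes remain before the time limit")
```

Under these assumptions the find rate is below roughly 1,667 files per second, and the serial find has to finish in about 14 more minutes to stay inside the 1 hour limit.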

