Hi Glenn,

 

Thanks for the comments. This is part of what we wrestled with for a long time. Cache flushing was part of it. 80% of node memory was another thought, but few sites have a requirement that high (LANL/SNL, and maybe LLNL, have an app that needs that much for checkpoint/restart).

 

What we settled on was 90% forward progress for an application over an hour, which leaves 6 minutes for checkpoint/restart (C/R); 5 minutes seemed like a simpler number for people to work with. Seeing how much you could do in that time seemed like the right metric. I don't think any of us expected that we'd run into file system limitations.
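
Spelled out, the C/R budget behind that choice is just (trivial, but for the record):

  $ echo $(( 60 * (100 - 90) / 100 ))   # minutes of the hour left for C/R
  6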

 

Our goal was to limit the ability to game things while still encouraging people to try the benchmark, since it was clear that careful tuning wouldn't buy that much. We wanted something general, like "keep writing for 5 minutes," rather than something like "80% of node memory," which would push people toward limiting the number of nodes or carefully selecting the node count and its mapping to the storage targets. With a time limit, it strictly becomes a question of how much you can do in the fixed time rather than how carefully you can configure the run.

 

We were stumped trying to come up with a different metric that would have the same effect as "do as much as you can within the time limit." Suggestions are welcome. Ideas for how to keep the spirit of the 5-minute do-all-you-can test on file systems that cannot sustain it for 5 minutes are also welcome.

 

The newly announced 10-node challenge is intended to address per-node performance rather than ultimate scaled-up performance. If you don't think it will achieve the right results, please let us know; we can still adjust things.

 

Best,

 

Jay

 

From: IO-500 <io-500-bounces@vi4io.org> on behalf of Glenn Lockwood <glock@lbl.gov>
Date: Monday, July 30, 2018 at 2:45 PM
To: "io-500@vi4io.org" <io-500@vi4io.org>
Subject: Re: [IO-500] [EXTERNAL] Re: mdtest-hard issue

 

I think this speaks to a more fundamental problem with the 300-second walltime requirement.  I maintain that this requirement should be removed and replaced with something that speaks more directly to its intent (which was to eliminate caching effects, right?).

 

In an ideal world, one would be able to flush the entire contents of system memory to a storage system in zero seconds, and the capacity of that storage system would be sized to be as small as possible so as to minimize its overall cost.  As we move towards that goal, we move further away from being able to meet the 300-second requirement; either one has to 

 

1. slow the file system down so that it takes a full 300 seconds (which becomes less representative of what users actually _want_ to do with a parallel storage system), or

 

2. size the performance tier so large that it can absorb a full 300 seconds' worth of continuous dumping (which may be much larger than the system memory)

 

I think the IO-500 benchmarks should be modeled on real I/O tasks; for example, read/write 80% of the aggregate node memory at whatever degree of parallelism you choose to run for the hero numbers, then work backwards from there.  Throw in some language about "walltime must include time to persistence" and use the IOR -C option (and the analogous mdtest one).
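
For concreteness, here is a rough sketch of what such a run might look like; the node count, ranks per node, and memory size below are placeholders for illustration, not proposed values, and the exact option set would still need to be agreed on:

  # Placeholders: adjust for the system under test
  NODES=16
  PPN=32              # MPI ranks per node
  NODE_MEM_GIB=256    # memory per node, GiB
  # Each rank writes its share of 80% of node memory
  BLK_GIB=$(( NODE_MEM_GIB * 80 / 100 / PPN ))
  mpirun -np $(( NODES * PPN )) ior -w -r -C -e -F \
      -t 1m -b ${BLK_GIB}g -o /path/to/testdir/ior_file

Here -C shifts each rank to read back data written by a different rank (defeating the client cache), -e fsyncs on write close so the walltime includes reaching persistence, and -F is file-per-process; the run then takes however long that transfer takes rather than a fixed 300 seconds.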

 

Nobody intentionally performs I/O for a minimum period of time, so nobody architects storage systems to handle that workload.  This is only going to become a more glaring issue as the bandwidth-to-capacity ratio of performance tiers goes up and storage architects tune their desired $/bandwidth and $/capacity.

 

Glenn

 

 

On Mon, Jul 30, 2018 at 11:50 AM Lofstead, Gerald F II <gflofst@sandia.gov> wrote:

Should we pursue making this a bug report for the file system instead?

On 7/30/18, 12:44 PM, "IO-500 on behalf of John Bent" <io-500-bounces@vi4io.org on behalf of johnbent@gmail.com> wrote:

    Very nice Julian.

    However, if we are to consider allowing this as a valid submission, then I propose we add synchronization when switching directories to ensure fully N-1 behavior. Yes, it does somewhat unfairly penalize Osamu here. However, not synchronizing somewhat unfairly advantages him, as it allows his fast processes non-contended access to subsequent directories.  If a submission cannot abide by the official rules, then it seems better to unfairly penalize it than to unfairly advantage it.

    Anyone else have thoughts here?

    Thanks

    John

    > On Jul 26, 2018, at 5:57 AM, Julian Kunkel <juliankunkel@googlemail.com> wrote:
    >
    > Dear Osamu,
    > thanks for pointing this out.
    > I was not aware of this problem.
    >
    > I have now made some modifications to mdtest to compensate for this
    > limitation. It can now limit the number of files per directory, creating
    > extra top-level directories as needed.
    > E.g. with a limit of 50 files and a total of 200 files, one would see
    > in the top-level testing directory:
    > #test-dir.0-0
    > #test-dir.0-1
    > #test-dir.0-2
    > #test-dir.0-3
    >
    > Each of these directories contains the complete subdirectory tree that
    > previously lived under #test-dir.0-0.
    > There is no synchronization when switching between these trees.
    > I expect this emulates the behavior of a single large directory quite
    > well, but there is no perfect solution.
    >
    >
    > The stuff is already integrated into the testing branch of io-500-dev.
    > To use it:
    >
    > In io-500-dev
    > $ git checkout testing
    > $ ./utilities/prepare.sh
    >
    > in io500.sh
    > Line 85:
    > io500_mdtest_hard_other_options=""
    > Change that to:
    > io500_mdtest_hard_other_options="-I 8000000"
    > This will effectively create a directory for each batch of 8M files.
    > Note that the number of items (-n) must be a multiple of -I.
    >
    > Note that this feature cannot be used with stonewalling at the moment!
    > At some point, the mdtest code needs a redesign to allow such changes.
    >
    > Hence, in line 30 set:
    > io500_stonewall_timer=0
    >
    > If you want to give it a try, I would welcome it...
    >
    > Best,
    > Julian
    >
    > 2018-07-26 3:05 GMT+01:00 Osamu Tatebe <tatebe@cs.tsukuba.ac.jp>:
    >> Hi,
    >>
    >> I would like to bring up again an issue regarding the mdtest-hard
    >> benchmark.  It creates files in a single directory for at least five
    >> minutes.  As metadata performance improves, a greater number of files
    >> will be created, and this will hit the limit on the number of files in
    >> a single directory.
    >>
    >> In fact, the limit of the Lustre file system is about 8M files per
    >> directory, although it depends on the length of the pathname.  When the
    >> metadata performance exceeds roughly 27K IOPS (8M files / 300 seconds),
    >> it is not possible to execute the mdtest-hard create benchmark for five
    >> minutes, since the number of files created would exceed 8M.
    >>
    >> This is the reason we cannot submit the Lustre result.  I expect this
    >> will become quite common as metadata performance improves further.
    >>
    >> Regards,
    >> Osamu
    >>
    >> ---
    >> Osamu Tatebe, Ph.D.
    >> Center for Computational Sciences, University of Tsukuba
    >> 1-1-1 Tennodai, Tsukuba, Ibaraki 3058577 Japan
    >> _______________________________________________
    >> IO-500 mailing list
    >> IO-500@vi4io.org
    >> https://www.vi4io.org/mailman/listinfo/io-500
    >
    >
    >
    > --
    > Dr. Julian Kunkel
    > Lecturer, Department of Computer Science
    > +44 (0) 118 378 8218
    > http://www.cs.reading.ac.uk/
    > https://hps.vi4io.org/
    > _______________________________________________
    > IO-500 mailing list
    > IO-500@vi4io.org
    > https://www.vi4io.org/mailman/listinfo/io-500
    _______________________________________________
    IO-500 mailing list
    IO-500@vi4io.org
    https://www.vi4io.org/mailman/listinfo/io-500


_______________________________________________
IO-500 mailing list
IO-500@vi4io.org
https://www.vi4io.org/mailman/listinfo/io-500