I think this speaks to a more fundamental problem with the 300-second walltime requirement.  I maintain that this requirement should be removed and replaced with something that speaks more directly to its intent (which was to eliminate caching effects, right?).

In an ideal world, one would be able to flush the entire contents of system memory to a storage system in zero seconds, and the capacity of that storage system would be sized as small as possible to minimize its overall cost.  As we move towards that goal, we move further away from being able to meet the 300-second requirement; either one has to:

1. slow the file system down so that it takes a full 300 seconds (which makes the benchmark less representative of what users actually _want_ to do with a parallel storage system), or

2. size the performance tier large enough that it can absorb a full 300 seconds' worth of continuous dumping (which may be much larger than the aggregate system memory).

I think the IO-500 benchmarks should be modeled on real I/O tasks; for example, read/write 80% of the aggregate node memory at whatever degree of parallelism you choose to run for the hero numbers, then work backwards from there.  Throw in some language about "walltime must include time to persistence" and use the IOR -C option (and the analogous mdtest one), as in the sketch below.
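
To make that concrete, here is a rough sketch of what the bandwidth phase could look like.  The node memory (256 GiB), ranks per node (16), and the nodes variable here are purely illustrative assumptions on my part; the point is the ~80%-of-memory sizing plus -C (shift readers onto a different node) and -e (fsync on close):

    # 16 ranks/node x 13 GiB per rank ~= 208 GiB/node, roughly 80% of 256 GiB of RAM
    $ mpirun -np $((nodes * 16)) ior -a POSIX -F -w -r -C -e -t 1m -b 13g

The analogous mdtest knob would be its task-offset option (-N), so that the stat/read phases land on a different node than the creates.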

Nobody intentionally performs I/O for a minimum period of time, so nobody architects storage systems to handle that workload.  This is only going to become a more glaring issue as the bandwidth-to-capacity ratio of performance tiers goes up and storage architects tune toward their desired $/bandwidth and $/capacity.

Glenn


On Mon, Jul 30, 2018 at 11:50 AM Lofstead, Gerald F II <gflofst@sandia.gov> wrote:
Should we pursue making this a bug report for the file system instead?

On 7/30/18, 12:44 PM, "IO-500 on behalf of John Bent" <io-500-bounces@vi4io.org on behalf of johnbent@gmail.com> wrote:

    Very nice, Julian.

    However, if we are to consider allowing this as a valid submission, then I propose we add synchronization between directory switches to ensure fully N-1 behavior. Yes, it does somewhat unfairly penalize Osamu here. However, not synchronizing somewhat unfairly advantages him, as it allows his fast processes uncontended access to subsequent directories.  If a submission cannot abide by the official rules, then it seems better to unfairly penalize it than to unfairly advantage it. 

    Anyone else have thoughts here?

    Thanks

    John

    > On Jul 26, 2018, at 5:57 AM, Julian Kunkel <juliankunkel@googlemail.com> wrote:
    >
    > Dear Osamu,
    > thanks for pointing this out.
    > I was not aware of this problem.
    >
    > I have now made some modifications to mdtest to compensate for this limitation.
    > It can now limit the number of files per directory, creating extra
    > top-level directories as needed.
    > E.g. with a limit of 50 files and a total of 200 files, one would see
    > in the top-level testing directory:
    > #test-dir.0-0
    > #test-dir.0-1
    > #test-dir.0-2
    > #test-dir.0-3
    >
    > The subdirectory tree of each of these directories is the same complete
    > tree as before under #test-dir.0-0.
    > There is no synchronization when switching between these trees.
    > I expect this emulates the behavior of a single large directory quite
    > well, but there is no perfect solution.
    >
    >
    > This is already integrated into the testing branch of io-500-dev.
    > To use it:
    >
    > In io-500-dev
    > $ git checkout testing
    > $ ./utilities/prepare.sh
    >
    > in io500.sh
    > Line 85:
    > io500_mdtest_hard_other_options=""
    > Change that to:
    > io500_mdtest_hard_other_options="-I 8000000"
    > This will effectively create a new directory for every batch of 8M files.
    > Note that the number of items (-n) must be a multiple of -I.
    >
    > Note that this feature cannot be used with stonewalling at the moment!
    > At some point, the mdtest code needs a redesign to allow such changes.
    >
    > Hence, set in line 30:
    > io500_stonewall_timer=0
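    >
    > For example, an io500.sh prepared for this workaround would then carry
    > both changes together (8M being the per-directory limit Osamu reported
    > for Lustre; the total item count -n must be a multiple of it):
    >
    >   io500_mdtest_hard_other_options="-I 8000000"   # line 85
    >   io500_stonewall_timer=0                        # line 30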
    >
    > If you want to give it a try, I would welcome it...
    >
    > Best,
    > Julian
    >
    > 2018-07-26 3:05 GMT+01:00 Osamu Tatebe <tatebe@cs.tsukuba.ac.jp>:
    >> Hi,
    >>
    >> I would like to raise again an issue regarding the mdtest-hard benchmark.
    >> It creates files in a single directory for at least five minutes.  As
    >> metadata performance improves, more and more files will be created, and
    >> this will hit the limit on the number of files in a single directory.
    >>
    >> In fact, the limit of the Lustre file system is about 8M files per
    >> directory, although it depends on the length of the pathname.  When the
    >> metadata performance exceeds roughly 27K IOPS (8M files / 300 seconds is
    >> about 26.7K creates per second), it is not possible to run the
    >> mdtest-hard create benchmark for five minutes, since the number of files
    >> would exceed 8M.
    >>
    >> This is the reason we cannot submit the Lustre result.  I expect this
    >> will become quite common as metadata performance improves further.
    >>
    >> Regards,
    >> Osamu
    >>
    >> ---
    >> Osamu Tatebe, Ph.D.
    >> Center for Computational Sciences, University of Tsukuba
    >> 1-1-1 Tennodai, Tsukuba, Ibaraki 3058577 Japan
    >
    >
    >
    > --
    > Dr. Julian Kunkel
    > Lecturer, Department of Computer Science
    > +44 (0) 118 378 8218
    > http://www.cs.reading.ac.uk/
    > https://hps.vi4io.org/


_______________________________________________
IO-500 mailing list
IO-500@vi4io.org
https://www.vi4io.org/mailman/listinfo/io-500