I think this speaks to a more fundamental problem with the 300-second
walltime requirement. I maintain that this requirement should be removed
and replaced with something that speaks more directly to its intent
(which was to eliminate caching effects, right?).
In an ideal world, one would be able to flush the entire contents of system
memory to a storage system in zero seconds, and the capacity of that
storage system would be sized to be as small as possible so as to minimize
its overall cost. As we move toward that goal, we move further away from
being able to meet the 300-second requirement; one either has to
1. slow the file system down so that it takes a full 300 seconds (which
becomes less representative of what users actually _want_ to do with a
parallel storage system), or
2. size the performance tier so large that it can absorb a full 300
seconds' worth of continuous dumping (which may be much larger than the
system memory).
I think the IO-500 benchmarks should be modeled on real I/O tasks; for
example, read/write 80% of the aggregate node memory for whatever degree of
parallelism you choose to run to get the hero numbers, then work backwards
from there. Throw in some language about "walltime must include time to
persistence" and use the IOR -C option (and the analogous mdtest one).
Nobody intentionally performs I/O for a minimum period of time, so nobody
architects storage systems to handle that workload. This is only going to
become a more glaring issue as the bandwidth-to-capacity ratio of
performance tiers goes up and storage architects tune their desired
$/bandwidth and $/capacity.
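To make the 80%-of-memory sizing concrete, here is a minimal sketch; the
node count, per-node memory, and process count are hypothetical placeholders,
and the echoed ior command line (with -C for rank-shifted read-back and -e to
include fsync in the timing) is an illustration, not a vetted invocation:

```shell
# Sketch: size an IOR run to touch ~80% of aggregate node memory.
# NODES, MEM_GIB_PER_NODE, and PROCS_PER_NODE are hypothetical values.
NODES=10
MEM_GIB_PER_NODE=128
PROCS_PER_NODE=16
# Per-process block size in GiB = 0.8 * node memory / processes per node
BLOCK_GIB=$(( MEM_GIB_PER_NODE * 8 / 10 / PROCS_PER_NODE ))
echo "per-process block size: ${BLOCK_GIB}g"
# -C shifts ranks on the read phase so no client reads back its own
# cached data; -e includes fsync (time to persistence) in the timing.
echo "mpirun -np $(( NODES * PROCS_PER_NODE )) ior -w -r -C -e -F -b ${BLOCK_GIB}g -t 1m"
```

Working backwards from the desired memory coverage like this, rather than
from a fixed walltime, is the point: the run length falls out of the data
volume instead of the other way around.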
Glenn
On Mon, Jul 30, 2018 at 11:50 AM Lofstead, Gerald F II <gflofst(a)sandia.gov>
wrote:
Should we pursue making this a bug report for the file system
instead?
On 7/30/18, 12:44 PM, "IO-500 on behalf of John Bent" <
io-500-bounces(a)vi4io.org on behalf of johnbent(a)gmail.com> wrote:
Very nice Julian.
However, if we are to consider allowing this as a valid submission,
then I propose we add synchronization between directory switches to ensure
fully N-1 behavior. Yes, it does somewhat unfairly penalize Osamu here.
However, not synchronizing somewhat unfairly advantages him, as it allows
his fast processes non-contended access to subsequent directories. If a
submission cannot abide by the official rules, then it seems better to
unfairly penalize it than to unfairly advantage it.
Anyone else have thoughts here?
Thanks
John
> On Jul 26, 2018, at 5:57 AM, Julian Kunkel <
juliankunkel(a)googlemail.com> wrote:
>
> Dear Osamu,
> thanks for pointing this out.
> I was not aware of this problem.
>
> I have now made some modifications to mdtest to compensate for this
> limitation.
> It can now limit the number of files per directory, creating extra
> top-level directories as needed.
> E.g. with a limit of 50 files and a total of 200 files, one would see
> in the top-level testing directory:
> #test-dir.0-0
> #test-dir.0-1
> #test-dir.0-2
> #test-dir.0-3
>
> The subdirectory tree under each of these is the same complete tree
> as before under #test-dir.0-0.
> There is no synchronization when switching between these trees.
> I expect this emulates the behavior of a single large directory quite
> well, though there is no perfect solution.
>
>
> The change is already integrated into the testing branch of
> io-500-dev.
> To use it:
>
> In io-500-dev
> $ git checkout testing
> $ ./utilities/prepare.sh
>
> in io500.sh
> Line 85:
> io500_mdtest_hard_other_options=""
> Change that to:
> io500_mdtest_hard_other_options="-I 8000000"
> This will effectively create a directory for each batch of 8M files.
> Note that the number of items (-n) must be a multiple of (-I).
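To make the batching arithmetic concrete, a minimal sketch (the -I and -n
values below are illustrative, not from a real run; -n is mdtest's
per-process item count):

```shell
# Illustrative values only: files-per-directory batch (-I) and
# item count (-n); -n must divide evenly by -I.
I=8000000
N=32000000
if [ $(( N % I )) -ne 0 ]; then
  echo "error: -n must be a multiple of -I" >&2
  exit 1
fi
# Number of #test-dir.0-* top-level directories mdtest would create
echo "top-level directories: $(( N / I ))"   # prints: top-level directories: 4
```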
>
> Note that this feature cannot be used with stonewalling at the
moment!
> At some point, the mdtest code needs a redesign to allow such
changes.
>
> Hence, in line 30 of io500.sh, set:
> io500_stonewall_timer=0
>
> If you want to give it a try, I would welcome it...
>
> Best,
> Julian
>
> 2018-07-26 3:05 GMT+01:00 Osamu Tatebe <tatebe(a)cs.tsukuba.ac.jp>:
>> Hi,
>>
>> I would like to raise again an issue regarding the mdtest-hard
>> benchmark. It creates files in a single directory for at least five
>> minutes. As metadata performance improves further, more and more
>> files will be created, eventually hitting the limit on the number of
>> files in a single directory.
>>
>> Actually, the limit of the Lustre file system is about 8M files,
>> although it depends on the length of the pathname. When the metadata
>> performance is better than about 27K IOPS, it is not possible to run
>> the mdtest-hard create benchmark for five minutes, since more than
>> 8M files would be created.
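The ~27K figure follows directly from the two numbers in the thread; a
quick back-of-envelope check:

```shell
# Create rate at which a 300-second mdtest-hard run overflows a single
# directory, given Lustre's ~8M files-per-directory limit (figures as
# reported in this thread).
FILES_LIMIT=8000000
RUN_SECONDS=300
echo "$(( FILES_LIMIT / RUN_SECONDS )) creates/s"   # prints: 26666 creates/s
```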
>>
>> This is the reason we cannot submit a Lustre result. I expect this
>> will become quite common as metadata performance improves.
>>
>> Regards,
>> Osamu
>>
>> ---
>> Osamu Tatebe, Ph.D.
>> Center for Computational Sciences, University of Tsukuba
>> 1-1-1 Tennodai, Tsukuba, Ibaraki 3058577 Japan
>> _______________________________________________
>> IO-500 mailing list
>> IO-500(a)vi4io.org
>> https://www.vi4io.org/mailman/listinfo/io-500
>
>
>
> --
> Dr. Julian Kunkel
> Lecturer, Department of Computer Science
> +44 (0) 118 378 8218
> http://www.cs.reading.ac.uk/
> https://hps.vi4io.org/