Hi Glenn,
Thanks for the comments. This is something we wrestled with for a long time. Cache
flushing was part of it. Writing 80% of node memory was another thought, but few sites
have that high a requirement (LANL/SNL, and maybe LLNL, have an app that needs that much
for checkpoint/restart).
What we settled on was 90% forward progress for an app over an hour, which leaves 6
minutes for checkpoint/restart; 5 minutes seemed like a simpler number for people to work
with. Seeing how much you could do in that time seemed like the right metric. I don't
think any of us expected that we'd run into file system limitations.
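The arithmetic behind that budget is quick to check; a sketch using the 60-minute window and 90% figure given above:

```shell
# 90% forward progress over a 60-minute window leaves 10% for checkpoint/restart.
window_min=60
cr_min=$((window_min * 10 / 100))
echo "C/R budget: ${cr_min} minutes"   # 6 minutes, simplified to 5 for the benchmark
```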
Our goal was to limit the ability to game things and to encourage people to try it, since
it was clear that careful tuning wouldn't buy much. We wanted something general, like
"keep writing for 5 minutes," rather than "write 80% of node memory," which would push
people toward limiting the number of nodes or carefully selecting the node count and its
mapping to the storage targets. With a time limit, it strictly becomes a question of how
much you can do in the fixed time rather than how carefully you can configure things.
We were stumped trying to find a different metric that would have the same effect as "do
as much as you can within the time limit." Suggestions are welcome, as are ideas for
keeping the spirit of the 5-minute do-all-you-can test on file systems that cannot
sustain it for the full 5 minutes.
The newly announced 10-node challenge is intended to address per-node performance rather
than ultimate scale-up performance. If that doesn't look like it will achieve the right
results in your opinion, please let us know; we can still adjust things.
Best,
Jay
From: IO-500 <io-500-bounces@vi4io.org> on behalf of Glenn Lockwood
<glock@lbl.gov>
Date: Monday, July 30, 2018 at 2:45 PM
To: "io-500@vi4io.org" <io-500@vi4io.org>
Subject: Re: [IO-500] [EXTERNAL] Re: mdtest-hard issue
I think this speaks to a more fundamental problem with the 300-second walltime
requirement. I maintain that this requirement should be removed and replaced with
something that speaks more directly to its intent (which was to eliminate caching
effects, right?).
In an ideal world, one would be able to flush the entire contents of system memory to a
storage system in zero seconds, and the capacity of that storage system would be sized to
be as small as possible so as to minimize its overall cost. As we move towards that goal,
we move further away from being able to meet the 300-second requirement; either one has to
1. slow the file system down so that it takes a full 300 seconds (which becomes less
representative of what users actually _want_ to do with a parallel storage system), or
2. size the performance tier so large that it can absorb a full 300 seconds' worth of
continuous dumping (which may be much larger than the system memory).
I think the IO-500 benchmarks should be modeled on real I/O tasks: for example,
read/write 80% of the aggregate node memory for whatever degree of parallelism you choose
to run to get the hero numbers, then work backwards from there. Throw in some language
about "walltime must include time to persistence" and use the IOR -C option (and
the analogous mdtest one).
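A rough sketch of that sizing rule. The node memory, task count, and transfer size below are hypothetical, and the echoed command is illustrative rather than a proposed invocation; -w/-r/-C/-e/-t/-b are real IOR flags (write, read, task reordering, fsync on close, transfer size, block size):

```shell
# Size each task's block so the job moves 80% of aggregate node memory,
# letting the walltime fall out of the data volume instead of fixing 300 s.
node_mem_mib=262144      # 256 GiB per node (hypothetical)
tasks_per_node=32        # hypothetical degree of parallelism
per_task_mib=$((node_mem_mib * 80 / 100 / tasks_per_node))
echo "per-task block size: ${per_task_mib} MiB"
echo "ior -w -r -C -e -t 1m -b ${per_task_mib}m"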
Nobody intentionally performs I/O for a minimum period of time, so nobody architects
storage systems to handle that workload. This will only become a more glaring issue as
the bandwidth-to-capacity ratio of performance tiers goes up and storage architects tune
their desired $/bandwidth and $/capacity.
Glenn
On Mon, Jul 30, 2018 at 11:50 AM Lofstead, Gerald F II
<gflofst@sandia.gov> wrote:
Should we pursue making this a bug report for the file system instead?
On 7/30/18, 12:44 PM, "IO-500 on behalf of John Bent"
<io-500-bounces@vi4io.org on behalf of johnbent@gmail.com> wrote:
Very nice Julian.
However, if we are to consider allowing this as a valid submission, then I propose we
add synchronization between directory switches to ensure fully N-1 behavior. Yes, it does
somewhat unfairly penalize Osamu here. However, not synchronizing somewhat unfairly
advantages him, as it allows his fast processes non-contended access to subsequent
directories. If a submission cannot abide by the official rules, then it seems better to
unfairly penalize it than to unfairly advantage it.
Anyone else have thoughts here?
Thanks
John
On Jul 26, 2018, at 5:57 AM, Julian Kunkel
<juliankunkel@googlemail.com> wrote:
Dear Osamu,
thanks for pointing this out.
I was not aware of this problem.
I have now made some modifications to mdtest to compensate for this limitation.
It can now limit the number of files per directory, creating extra
top-level directories if needed.
E.g. with a limit of 50 files and a total of 200 files, one would see
in the top-level testing directory:
#test-dir.0-0
#test-dir.0-1
#test-dir.0-2
#test-dir.0-3
The subdirectory tree of each directory is the complete tree as before
under #test-dir.0-0.
There is no synchronization when switching between these trees.
I expect that this emulates the behavior of a single large directory
very well, though there is no perfect solution.
The change is already integrated into the testing branch of io-500-dev.
To use it:
In io-500-dev
$ git checkout testing
$ ./utilities/prepare.sh
in io500.sh
Line 85:
io500_mdtest_hard_other_options=""
Change that to:
io500_mdtest_hard_other_options="-I 8000000"
This will effectively create a new directory for each batch of 8M files.
Note that the number of items (-n) must be a multiple of (-I).
Note that this feature cannot be used with stonewalling at the moment!
At some point, the mdtest code needs a redesign to allow such changes.
Hence, in line 30 set:
io500_stonewall_timer=0
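A quick sanity check for the multiple-of constraint above (the total item count here is just an example):

```shell
# Per the note above, the item count (-n) must divide evenly into -I-sized
# batches, one directory per batch.
n_total=24000000   # example item count for -n
batch=8000000      # matches -I 8000000 above
if [ $((n_total % batch)) -eq 0 ]; then
  echo "OK: ${n_total} items -> $((n_total / batch)) directories"
else
  echo "ERROR: -n must be a multiple of -I" >&2
fi
```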
If you want to give it a try, I would welcome it...
Best,
Julian
2018-07-26 3:05 GMT+01:00 Osamu Tatebe
<tatebe@cs.tsukuba.ac.jp>:
> Hi,
>
> I would like to raise again an issue regarding the mdtest-hard benchmark. It
> creates files in a single directory for at least five minutes. As
> metadata performance improves, a greater number of files will be
> created. This will hit the limit on the number of files in a
> single directory.
>
> Actually, the limit of the Lustre file system is about 8M files per
> directory, although it depends on the length of the pathname. When metadata
> performance exceeds about 27K IOPS, it is not possible to run the
> mdtest-hard create benchmark for five minutes, since the number of files
> created would exceed 8M.
>
> This is the reason we cannot submit a Lustre result. I expect this issue
> will become quite common as metadata performance improves.
>
> Regards,
> Osamu
>
> ---
> Osamu Tatebe, Ph.D.
> Center for Computational Sciences, University of Tsukuba
> 1-1-1 Tennodai, Tsukuba, Ibaraki 3058577 Japan
> _______________________________________________
> IO-500 mailing list
> IO-500@vi4io.org
> https://www.vi4io.org/mailman/listinfo/io-500
--
Dr. Julian Kunkel
Lecturer, Department of Computer Science
+44 (0) 118 378 8218
http://www.cs.reading.ac.uk/
https://hps.vi4io.org/
_______________________________________________
IO-500 mailing list
IO-500@vi4io.org
https://www.vi4io.org/mailman/listinfo/io-500