I think the 5-minute IO requirement (for IOR at least) is quite representative of real
systems. While I agree that everyone _wants_
to have a system with 1-second checkpoints, the practical outcome for
most large systems is that the budget/power/space considerations mean
that the maximum acceptable IO time is 10% of the runtime, which is
5-6 minutes per hour. That means the funding for storage will not be
more than what provides this minimum requirement (even though some
people would want it to be more), since otherwise more money could be
spent on compute. At least this is true for all of the systems I've
been involved in.
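To put rough numbers on that (all of them purely illustrative): with 2 PiB
of aggregate client RAM, an 80%-of-memory checkpoint, and a 10% IO budget
(6 minutes per hour), the storage only needs to sustain about 4.5 TiB/s:

$ awk 'BEGIN { printf "%.2f TiB/s\n", (2048 * 0.8) / (0.10 * 3600) }'
4.55 TiB/s

Any bandwidth beyond that is money that could have gone into compute instead.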
The other reason to have a time-based result is that it avoids the
"write only to RAM" issue, without making the test run excessively
long. If the requirement is to write 50% of client RAM, then one can
easily get a great result by writing into the other 50% of RAM
on all of the clients, or by running with a small number of clients
that dump directly into the RAM on the servers, but that isn't a real
HPC workload result either.
The goal is to report the performance of the storage system itself.
Also, the benchmark should be able to complete in a reasonable time,
and requiring 2x client RAM or other cache-avoidance sizing rules may
cause the test to run for many hours if a large number of clients are
involved.
I agree that caching layers are definitely desirable to have, so I
don't think there should be a "must write to disk" requirement, but
rather "report the results for each layer separately". That way we
can see that the flash layer gets X GB/s, Y IOPS, and the disk layer
gets 10% of those numbers, or whatever.
Initially I wasn't sure whether the 300s runtime for the metadata
tests was reasonable or not, but as HPC moves away from the typically
well-structured workloads (e.g. file per process, but only once per
hour) into genomics and AI, the "create many millions of files" pattern
is becoming a very common workload.
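To put the scale in perspective (numbers purely illustrative), even a
modest per-process file count adds up to millions of files in aggregate,
e.g.:

$ mpirun -np 1024 mdtest -n 10000 -F -u -d /mnt/pfs/mdtest

creates (and later removes) on the order of 10M files across 1024 tasks
(the path and task count here are hypothetical).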
The limitation of 8M files/directory that Osamu reported for Lustre
(we report 10M in our manual, but it depends on filename length and
other factors) is (or at least was) a real issue for ldiskfs/ext4
MDTs. In newer versions of Lustre and with the appropriate hardware
configuration, it is possible to exceed the 8/10M files per directory
by using DNE to stripe a single directory across multiple MDTs, by
using ZFS MDTs, or (with an upcoming release of e2fsprogs and the
right settings) with ldiskfs/ext4 itself
(up to about 5B entries/directory in theory).
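As a sketch of what that looks like in practice (the path and MDT count
here are just examples, not a recommendation):

# requires a Lustre version with DNE striped directories and >= 4 MDTs configured
$ lfs setdirstripe -c 4 /mnt/lustre/io500/mdt-hard
$ lfs getdirstripe /mnt/lustre/io500/mdt-hard

so that the entries of the single large directory are spread across 4 MDTs.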
On Jul 30, 2018, at 2:35 PM, Glenn Lockwood <glock(a)lbl.gov> wrote:
I think this speaks to a more fundamental problem with the 300-second walltime
requirement. I maintain that this requirement should be removed and replaced
with something that speaks more directly to its intent (which was to eliminate
caching effects).
In an ideal world, one would be able to flush the entire contents of system memory to a
storage system in zero seconds, and the capacity of that storage system would be sized to
be as small as possible so as to minimize its overall cost. As we move towards that goal,
we move further away from being able to meet the 300-second requirement; either one has to
1. slow the file system down so that it takes a full 300 seconds (which becomes less
representative of what users actually _want_ to do with a parallel storage system), or
2. size the performance tier so large that it can absorb a full 300 seconds' worth of
continuous dumping (which may be much larger than the system memory)
I think the IO-500 benchmarks should be modeled off of real I/O tasks; for example,
read/write 80% of the aggregate node memory for whatever degree of parallelism you choose
to run to get the hero numbers, then work backwards from there. Throw in some language
about "walltime must include time to persistence" and use the IOR -C option (and
the analogous mdtest one).
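As a rough sketch of such a run (node memory, task counts, transfer sizes,
and paths are all illustrative):

# e.g. 16 tasks/node each writing a 9600 MiB block covers ~80% of a 192 GiB node
$ mpirun -np 1024 ior -a POSIX -F -w -r -C -e -t 1m -b 9600m -o /mnt/pfs/ior.dat

where -e forces an fsync after the write phase and -C shifts ranks for the
read-back so clients don't simply read their own cached data.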
Nobody intentionally performs I/O for a minimum period of time, so nobody architects
storage systems to handle that workload. This is only going to become a more glaring
issue as the ratio of bandwidth/capacity of performance tiers goes up and storage
architects tune their desired $/bandwidth and $/capacity.
On Mon, Jul 30, 2018 at 11:50 AM Lofstead, Gerald F II <gflofst(a)sandia.gov> wrote:
Should we pursue making this a bug report for the file system instead?
On 7/30/18, 12:44 PM, "IO-500 on behalf of John Bent"
<io-500-bounces(a)vi4io.org on behalf of johnbent(a)gmail.com> wrote:
Very nice Julian.
However, if we are to consider allowing this as a valid submission, then I propose we
add synchronization between directory switching to ensure fully N-1 behavior. Yes, it does
somewhat unfairly penalize Osamu here. However, not synchronizing somewhat unfairly
advantages him as it allows his fast processes non-contended access to subsequent
directories. If a submission cannot abide by the official rules, then it seems
better to unfairly penalize it than to unfairly advantage it.
Anyone else have thoughts here?
> On Jul 26, 2018, at 5:57 AM, Julian Kunkel <juliankunkel(a)googlemail.com> wrote:
> Dear Osama,
> thanks for pointing this out.
> I was not aware of this problem.
> I now made some modifications to mdtest to compensate for this limitation.
> Now it can limit the number of files per directory, creating one
> extra top-level directory if needed.
> E.g., with a limit of 50 files and a total of 200 files, one would see
> four subdirectories (#test-dir.0-0 through #test-dir.0-3) in the
> top-level testing directory.
> The subdirectory tree of each directory is the complete tree as before
> under #test-dir.0-0.
> There is no synchronization when switching between these trees.
> I expect that this emulates the behavior of a single large directory
> very well, but there is no perfect solution.
> The stuff is already integrated into the testing branch of io-500-dev.
> To use it:
> In io-500-dev
> $ git checkout testing
> $ ./utilities/prepare.sh
> In io500.sh, line 85, change io500_mdtest_hard_other_options to:
> io500_mdtest_hard_other_options="-I 8000000"
> Which effectively will create a directory for each batch of 8M files.
> Note that the number of items (-n) must be a multiple of (-I)
> Note that this feature cannot be used with stonewalling at the moment!
> At some point, the mdtest code needs a redesign to allow such changes.
> Hence, adjust the corresponding setting in line 30 of io500.sh accordingly.
> If you want to give it a try, I would welcome it...
> 2018-07-26 3:05 GMT+01:00 Osamu Tatebe <tatebe(a)cs.tsukuba.ac.jp>:
>> I would like to remind you of an issue regarding the mdtest-hard benchmark.
>> It creates files in a single directory for at least five minutes. As
>> metadata performance improves, a greater number of files will be created,
>> which will eventually hit the limit on the number of files in a single
>> directory.
>> Actually, the limit for the Lustre file system is about 8M files, although
>> it depends on the length of the pathname. When the metadata performance
>> is better than 27K IOPS, it is not possible to run the mdtest-hard
>> create benchmark for five minutes, since the number of files created
>> exceeds 8M.
>> This is one reason we cannot submit the Lustre result. I expect this
>> will become quite common as metadata performance improves further.
>> Osamu Tatebe, Ph.D.
>> Center for Computational Sciences, University of Tsukuba
>> 1-1-1 Tennodai, Tsukuba, Ibaraki 3058577 Japan
> Dr. Julian Kunkel
> Lecturer, Department of Computer Science
> +44 (0) 118 378 8218