On Jun 19, 2017, at 7:10 AM, Andreas Dilger <adilger@dilger.ca> wrote:
On Jun 18, 2017, at 1:01 PM, Michael Kluge <michael.kluge@tu-dresden.de> wrote:
IOR has an option to allocate a certain amount of the host's memory. I suggest that we set
this to 90-95 percent and define the total amount of data written as twice the size of main
memory. Otherwise, the 10+ PB of main memory in SUMMIT would make the list useless ;)
If I read everything correctly, the current run rules define an execution time of 5 minutes
and just count the number of bytes/IOPS/files touched during this time. I agree that most
of the time our users do I/O in bursts. Is the benchmark basically only about “who can
write the most data with one file per process in 5 minutes”? Why 5 minutes, and not “how
long does it take to dump 80% of main memory to some redundant permanent storage” (with
fsync())?
The reason a fixed time of 5 minutes was chosen for the write phase is that this is
typically the maximum time available to dump a full-system checkpoint every hour while
retaining roughly 90% compute efficiency. If the limit were based on a particular data
size then, as you say, the writes might fit entirely into RAM and the storage wouldn't be
exercised at all. Conversely, checkpoints are rarely larger than the total RAM size.
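To make the checkpoint reasoning concrete, here is a back-of-the-envelope sketch in Python (the system numbers are hypothetical, and `required_bandwidth`/`compute_efficiency` are illustrative helpers, not part of any benchmark code) of the aggregate bandwidth a 5-minute full-memory dump implies, and the compute efficiency of hourly checkpoints:

```python
# Back-of-the-envelope check (hypothetical numbers): how much aggregate
# write bandwidth a 5-minute full-system checkpoint dump requires, and
# what compute efficiency hourly checkpoints of that length imply.

PB = 10**15  # bytes

def required_bandwidth(memory_bytes, dump_seconds):
    """Aggregate write bandwidth (bytes/s) to dump memory_bytes in dump_seconds."""
    return memory_bytes / dump_seconds

def compute_efficiency(checkpoint_seconds, period_seconds):
    """Fraction of each checkpoint period left over for computation."""
    return 1.0 - checkpoint_seconds / period_seconds

mem = 2 * PB  # assumed total system RAM (hypothetical)
bw = required_bandwidth(mem, 5 * 60)       # 5-minute dump window
eff = compute_efficiency(5 * 60, 60 * 60)  # one checkpoint per hour

print(f"bandwidth needed: {bw / 10**12:.1f} TB/s")  # ~6.7 TB/s
print(f"compute efficiency: {eff:.1%}")             # ~91.7%
```

A 5-minute dump each hour actually leaves a bit above the 90% efficiency target, which is why 5 minutes works as an upper bound.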
Michael, I like your 90% memory idea. Like the 5-minute rule, it keeps the benchmark
from degrading over time. For example, if we said “how fast can you write 1 PB,” then in
10 years the test will be too small.
However, my concern about 90% of memory is that machines with very small amounts of
memory will be able to fit all of their data into server-side caches, which might not
respect a sync command.
Do we want to define some rules about how safe the data has to be? Should it be OK if this
data ends up in a single burst buffer with no copy anywhere else? I would recommend
that results are only valid if the data survives the failure of one of the storage devices
used. For example, for the mdtest workload I could imagine a file system that has
directory locking turned off and uses an SSD/NVRAM backend, and thus would just behave
like the “IOR hard” workload.
The goal of the benchmark is to have a separate measurement for each layer of the
storage system. That would give one result for a flash/burst-buffer tier and a separate
result for the disk-based storage.
Another point is that I am more a fan of application-driven benchmarks. The numbers above
do not tell me anything about my applications, so why should I actually run the benchmark?
Just to be “on the list”?
Applications are going to be different, even among HPC sites. The goal of the benchmark
is to try to determine the outer limits of the performance space of the storage. How
does it handle large aligned reads and writes? How does it handle small unaligned IOPS?
How fast are the primitive metadata operations and small-file workloads? With this raw
information, it should (in theory) be possible to derive the I/O behaviour of actual
applications.
Application-driven benchmarks (something like SPEC CPU, but for I/O) that scale with the
machine (and with the machine's main memory) could actually become a standard that
industry could also use to advertise their systems.
This benchmark is also intended to scale with the system size and performance, rather than
being a fixed size. At the same time, the tests are intended to run within a reasonable
time period rather than taking many hours.
In addition, if we as a site have I/O patterns that are close to one of the benchmarks, we
could put some weight on that benchmark and adjust our tenders accordingly; the industry
partners would then know how to design the storage system for our particular requirements,
simply because the I/O pattern is a standard they already know how to deal with.
Since the benchmark itself is essentially capturing orthogonal performance parameters of
the storage system, it should be possible to weight the results of the different test
phases to get something approaching your desired performance characteristics.
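As a sketch of that weighting idea (the numbers and the `weighted_geomean` helper are my own illustration, not anything from the proposal), a site could replace the plain geometric mean with a weighted one to emphasize the phase closest to its workload:

```python
import math

def weighted_geomean(values, weights):
    """Weighted geometric mean: exp(sum(w_i * ln(v_i)) / sum(w_i))."""
    total_w = sum(weights)
    return math.exp(
        sum(w * math.log(v) for v, w in zip(values, weights)) / total_w
    )

# Hypothetical per-phase bandwidths (MB/s) for ior easy/hard write and
# easy/hard read, with site-specific weights that emphasize the
# shared-file ("hard") write pattern.
bw_phases = [12000.0, 800.0, 15000.0, 1100.0]
weights   = [1.0, 3.0, 1.0, 1.0]

print(f"site-weighted bandwidth score: {weighted_geomean(bw_phases, weights):.0f}")
```

With equal weights this reduces to the plain geometric mean, so a site-specific ranking stays comparable in shape to the official one.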
One more thing the current approach does not deal with at all is that in the very near
future applications will access permanent storage through interfaces that IOR does not
cover, storing data using CPU load/store (mov) instructions. Thus, if the list is
established using some combination of IOR+mdtest+POSIX, I think it has no chance to
reflect the really fast I/O subsystems that are coming, like
http://pmem.io/
The benchmark is not necessarily tied to POSIX. With IOR it is possible to run different
backend IO interfaces (HDF5, MPI-IO, S3 in newer versions), and I believe that mdtest has
been updated to do the same. It should be possible to interface IOR/mdtest with
persistent memory storage systems like DAOS as well.
Yes, mdtest has been ported to replace all the ugly #ifdefs with the same IOR backend.
Nathan Hjelm did this work and will be pushing it back to the main LANL IOR GitHub, at
which point all the old mdtest repos will be deprecated. We are proposing that our IO500
use this LANL GitHub repository.
In terms of POSIX or not, this is the reason behind the easy/hard benchmarks. Currently
we are dictating POSIX for mdtest hard and IOR hard, and allowing the submitter to use
whatever backend they want (including adding a new one) for mdtest easy and IOR easy.
Should we find ourselves in a POSIX-free future, we'll change the POSIX requirement for
the hard tests.
Thanks,
John
Cheers, Andreas
From: IO-500 [mailto:io-500-bounces@vi4io.org] On Behalf Of John Bent
Sent: Friday, June 16, 2017 22:30
To: io-500@vi4io.org
Subject: [IO-500] Detailed benchmark proposal
All,
Sorry for the long silence on the mailing list. However, we have made some substantial
progress recently as we prepare for our ISC BOF next week. For those of you at ISC,
please join us from 11 to 12 on Tuesday in Substanz 1&2.
The progress that we have made recently happened because a bunch of us were attending a
workshop at Dagstuhl in Germany last month and had multiple discussions about the
benchmark. Here are the highlights of what was discussed and the progress that we made at
Dagstuhl:
• General agreement that the IOR-hard, IOR-easy, mdtest-hard, mdtest-easy approach is
appropriate.
• We should add a ‘find’ command as this is a popular and important workload.
• The multiple bandwidth measurements should be combined via geometric mean into one
bandwidth.
• The multiple IOPs measurements should also be combined via geometric mean into one
IOPs.
• The bandwidth and the IOPs should be multiplied to create one final score.
• The ranking uses that final score but the webpage can be sorted using other metrics.
• The webpage should allow filtering as well so, for example, people can look at only the
HDD results.
• We should separate the write/create phases from the read/stat phases to help ensure that
caching is avoided.
• Nathan Hjelm volunteered to combine the mdtest and IOR benchmarks into one git repo and
has now done so. This removes the #ifdef mess from mdtest, and now they both share the
nice modular IOR backend.
So the top-level summary of the benchmark in pseudo-code has become:
# write/create phase
bw1 = ior_easy -write [user supplies their own parameters, maximizing data written in
5 minutes]
md1 = mdtest_easy -create [user supplies their own parameters, maximizing file creates
in 5 minutes]
bw2 = ior_hard -write [we supply parameters: unaligned strided writes into a single
shared file]
md2 = mdtest_hard -create [we supply parameters: creates of 3900-byte files into a single
shared directory]
# read/stat phase
bw3 = ior_easy -read [cross-node read of everything written in bw1]
md3 = mdtest_easy -stat [cross-node stat of everything created in md1]
bw4 = ior_hard -read
md4 = mdtest_hard -stat
# find phase
md5 = find [we supply parameters: find a subset of the files created in the tests above]
# score phase
bw = geo_mean(bw1, bw2, bw3, bw4)
md = geo_mean(md1, md2, md3, md4, md5)
total = bw * md
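The score phase above can be sketched directly in Python (the measurements below are hypothetical placeholders; only the combination rule itself, geometric means multiplied together, comes from the proposal):

```python
import math

def geo_mean(xs):
    """Geometric mean of positive measurements."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical phase results: bandwidths (GB/s) from the four IOR runs,
# metadata rates (kIOPS) from the four mdtest runs plus the 'find' phase.
bw1, bw2, bw3, bw4 = 120.0, 8.5, 150.0, 11.0
md1, md2, md3, md4, md5 = 85.0, 12.0, 140.0, 30.0, 200.0

bw = geo_mean([bw1, bw2, bw3, bw4])       # one combined bandwidth number
md = geo_mean([md1, md2, md3, md4, md5])  # one combined IOPS number
total = bw * md                           # final ranking score

print(f"bw={bw:.1f} md={md:.1f} score={total:.1f}")
```

The geometric mean keeps one very large result (e.g. an easy write hitting cache) from dominating the score, which a plain arithmetic mean would allow.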
Now we are moving on to precisely defining what the parameters should look like for the
hard tests and to creating a standard so that people can start running the benchmark on
their systems. In doing so, we will define the formal process so we can actually make this
an official benchmark. Please see the attached file, in which we've started precisely
defining these parameters. Please let's start iterating on this file to get the parameters
correct.
Thanks,
John
_______________________________________________
IO-500 mailing list
IO-500@vi4io.org
https://www.vi4io.org/cgi-bin/mailman/listinfo/io-500