Hello everybody and Michael,
I have sent two emails to the list that will probably arrive late, if you have not
got them already, so I am sorry if you receive them later.
Michael, this is exactly what I was also discussing with Julien today. It is OK to have
some generic benchmarks, but, given the criticism that Linpack is not related to any
application, we should also have some benchmarks related to applications.
My proposal was: let’s discuss which applications consume the largest percentage of
core-hours on the supercomputers across various sites, and let’s try to mimic their I/O
through IOR or something else, but IOR first, as it is a ready-made and well-known
solution. If that is not possible, we could look at other solutions.
For example, we already tested md-real today, and I mentioned that I would also like
to create fewer but larger files, because this could mimic cases where users do data
assimilation: they run multiple models with various parameters, and they don’t save
large data because they execute a model, for example, 100 times at the same moment.
About the duration of the benchmark, one personal concern is that it should not be
complicated like Linpack, nor take a long time and make it difficult for people to find
resources etc. I assume that 3 hours is more than enough. But I agree that it depends.
However, if I create a lot of data, I know that I will disturb other users on Lustre, and
I do not want to do that for 3 hours.
About safe data, how would you check that? About pmem and the future, the benchmark (suite)
could be updated; at this moment not many people have access to this, right? Of course,
this does not mean that we should not get ready for it, but I am saying let’s get ready
first for the basic ones and we’ll evolve from there, just my opinion.
Everybody wants to have a benchmark suite that will help with the next procurement, right?
I hope to have nice discussions at the BOF with all the people who will be there.
George Markomanolis, PhD
KAUST Supercomputing Laboratory (KSL)
King Abdullah University of Science & Technology
Al Khawarizmi Bldg. (1) Room 0123
Kingdom of Saudi Arabia
Mob: +966 56 325 9012
Office: +966 12 808 0393
From: IO-500 <io-500-bounces(a)vi4io.org> on behalf of Michael Kluge
Date: Sunday, 18 June 2017 at 9:01 PM
To: "io-500(a)vi4io.org" <io-500(a)vi4io.org>
Subject: Re: [IO-500] Detailed benchmark proposal
IOR has an option to allocate a certain amount of the host’s memory. I suggest that we set
this to 90-95 percent and the total amount of data written to twice the size of the main
memory? Otherwise, the 10+ PB main memory of SUMMIT would make the list useless ;)
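To make the sizing suggestion above concrete, here is a minimal sketch of the arithmetic. The function name, the node count, the memory per node, and the processes per node are hypothetical values chosen for illustration; only the 90 percent memory fraction and the factor of two come from the suggestion itself.

```python
def size_run(node_mem_gib, nodes, procs_per_node,
             hog_fraction=0.90, data_factor=2.0):
    """Return (memory to pin per node, total data to write, data per process).

    All sizes are in GiB. hog_fraction is the share of each node's memory
    that the benchmark allocates so it cannot serve as page cache;
    data_factor scales the total data written relative to aggregate memory.
    """
    if not 0.0 <= hog_fraction < 1.0:
        raise ValueError("hog_fraction must be in [0, 1)")
    if nodes <= 0 or procs_per_node <= 0:
        raise ValueError("nodes and procs_per_node must be positive")
    hog_per_node = hog_fraction * node_mem_gib
    total_data = data_factor * node_mem_gib * nodes
    per_process = total_data / (nodes * procs_per_node)
    return hog_per_node, total_data, per_process


if __name__ == "__main__":
    # Hypothetical machine: 100 nodes, 512 GiB each, 32 processes per node.
    hog, total, per_proc = size_run(node_mem_gib=512, nodes=100, procs_per_node=32)
    print(f"pin {hog:.1f} GiB per node, write {total:.0f} GiB in total, "
          f"{per_proc:.1f} GiB per process")
```

With these assumed numbers, each process would have to write 32 GiB, which shows how quickly the data volume grows with main memory under this rule.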
If I read everything correctly the current run rules define an execution time of 5 minutes
and just count the numbers of bytes/iops/files touched during this time. I agree that most
of the time our users do I/O in bursts. Is the benchmark basically only about “who can
write the most data with one file per process in 5 mins”? Why 5 minutes and not “how long
does it take to dump 80% of the main memory to some redundant permanent storage” (with
Do we want to define some rules about how safe the data has to be? Should it be OK if this
data ends up in a single burst buffer and there is no copy somewhere? I would recommend
that the results are only valid if data survives one failure of one of the storage devices
For example: For the mdtest-workload I could imagine a file system that has directory
locking turned off and is using an SSD/NVRAM backend and thus would just behave like the
“IOR hard” workload.
Another point is that I am more of a fan of application driven benchmarks. The numbers above
do not tell me anything about my applications, so why should I actually run the benchmark?
Just to be “on the list”?
Application driven benchmarks (something like SPEC CPU, but SPEC I/O) that scale with the
machine (and with the machine’s main memory) could actually become a standard that the
industry could also use to advertise their systems.
In addition, if we as a site have I/O patterns that are close to one of the benchmarks, we
could put some weight on this benchmark and adjust our tenders and the industry partner
would know how to design the storage system with respect to our special requirements. Just
because they know the I/O pattern because it is a standard and they know how to deal with
One more thing that the current approach does not deal with at all is the fact that in
the very near future applications will access permanent storage using interfaces that IOR does
not cover and store data by using mov() instructions in the CPU. Thus, if the list is
established using some combination of IOR+MDTEST+POSIX, I think it has no chance to
reflect the really fast I/O subsystems that are coming like http://pmem.io/
Sorry for the lengthy statement …
Dr.-Ing. Michael Kluge
Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
Falkenbrunnen, Room 240
Phone: (+49) 351 463-34217
Fax: (+49) 351 463-37773
From: IO-500 [mailto:firstname.lastname@example.org] On behalf of John Bent
Sent: Friday, 16 June 2017 22:30
Subject: [IO-500] Detailed benchmark proposal
Sorry for the long silence on the mailing list. However, we have made some substantial
progress recently as we prepare for our ISC BOF next week. For those of you at ISC,
please join us from 11 to 12 on Tuesday in Substanz
The progress that we have made recently happened because a bunch of us were attending a
German workshop last month at Dagstuhl and had multiple discussions about the benchmark.
Here are the highlights of what was discussed and the progress that we made at Dagstuhl:
1. General agreement that the IOR-hard, IOR-easy, mdtest-hard, mdtest-easy approach is
2. We should add a ‘find’ command as this is a popular and important workload.
3. The multiple bandwidth measurements should be combined via geometric mean into one
4. The multiple IOPs measurements should also be combined via geometric mean into one
5. The bandwidth and the IOPs should be multiplied to create one final score.
6. The ranking uses that final score but the webpage can be sorted using other
7. The webpage should allow filtering as well so, for example, people can look at only
the HDD results.
8. We should separate the write/create phases from the read/stat phases to help ensure
that caching is avoided
9. Nathan Hjelm volunteered to combine the mdtest and IOR benchmarks into one git repo
and has now done so. This removes the #ifdef mess from mdtest and now they both share the
nice modular IOR backend
So the top-level summary of the benchmark in pseudo-code has become:
# write/create phase
bw1 = ior_easy -write [user supplies their own parameters maximizing data writes that can
be done in 5 minutes]
md1 = md_test_easy -create [user supplies their own parameters maximizing file creates
that can be done in 5 minutes]
bw2 = ior_hard -write [we supply parameters: unaligned strided into single shared file]
md2 = md_test_hard -create [we supply parameters: creates of 3900 byte files into single
# read/stat phase
bw3 = ior_easy -read [cross-node read of everything that was written in bw1]
md3 = md_test_easy -stat [cross-node stat of everything that was created in md1]
bw4 = ior_hard -read
md4 = md_test_hard -stat
# find phase
md5 = [we supply parameters to find a subset of the files that were created in the
# score phase
bw = geo_mean( bw1 bw2 bw3 bw4)
md = geo_mean( md1 md2 md3 md4 md5)
total = bw * md
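
The score phase above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the official scoring code: the function names and the sample measurements are hypothetical, and the units of the inputs are not fixed here.

```python
import math


def geo_mean(values):
    """Geometric mean of strictly positive numbers, computed in log space."""
    values = list(values)
    if not values:
        raise ValueError("need at least one measurement")
    if any(v <= 0 for v in values):
        raise ValueError("all measurements must be positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def final_score(bw_results, md_results):
    """Combine bandwidth and metadata results as in the pseudo-code.

    bw_results holds bw1..bw4, md_results holds md1..md5.
    Returns (bw, md, total) with total = bw * md.
    """
    bw = geo_mean(bw_results)
    md = geo_mean(md_results)
    return bw, md, bw * md


if __name__ == "__main__":
    # Made-up measurements, for illustration only.
    bw, md, total = final_score([10.0, 0.5, 12.0, 0.8],
                                [50.0, 5.0, 80.0, 6.0, 100.0])
    print(f"bw={bw:.3f} md={md:.3f} total={total:.3f}")
```

One consequence of the geometric mean is that a single very poor result (for example a slow ior_hard run) pulls the combined number down more strongly than it would with an arithmetic mean, so a system cannot rank well on the easy tests alone.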
Now we are moving on to precisely define what the parameters should look like for the hard
tests and to create a standard so that people can start running it on their systems. By
doing so, we will define the formal process so we can actually make this an official
benchmark. Please see the attached file in which we’ve started precisely defining these
parameters. Let’s start iterating please on this file to get these parameters correct.