Comments below ...
On 4/6/17, 10:18 AM, "IO-500 on behalf of Andreas Dilger"
<io-500-bounces@vi4io.org on behalf of adilger@dilger.ca> wrote:
On Apr 5, 2017, at 10:18 PM, John Bent
<John.Bent@seagategov.com> wrote:
>
> All,
>
> I met with Ilene Carpenter about a week ago and she had a bunch of
>interesting thoughts about IO-500. Hers are the top-level bullets and my
>thoughts are the sub-bullets.
>
> • Maybe it should be called HPC IO-500 to distinguish it from
>hyperscalers, since the focus is on HPC and the machines on the Top 500
>list.
> • On the other hand, Top 500 is not called HPC Top 500.
> • Benchmarks get what benchmarks get. What about performance when
>another job is running?
> • The idea was data-easy, data-hard, metadata-easy, and metadata-hard,
>and then to figure out how to combine these four into one number. Perhaps
>we add a fifth test that runs all four simultaneously.
I'm not against running all of them together, but the aggregate shouldn't
be worse than data-hard + metadata-hard in the end; otherwise, by
definition, those were not the "hard" numbers we were looking for.
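(As a minimal sketch of one way the four numbers could be combined, here
is a geometric mean; the choice of mean, and all names and units, are
assumptions for discussion, not anything we've agreed on:)

    # Hypothetical scoring sketch: combine the four phase results into a
    # single number via a geometric mean so that no one phase dominates.
    def io500_score(data_easy, data_hard, md_easy, md_hard):
        """Geometric mean of two bandwidth numbers (GiB/s) and two
        metadata rates (kIOPS); all names/units are illustrative."""
        phases = [data_easy, data_hard, md_easy, md_hard]
        product = 1.0
        for p in phases:
            product *= p
        return product ** (1.0 / len(phases))

    # Example: 100 GiB/s easy, 10 GiB/s hard, 500 kIOPS easy, 50 kIOPS hard
    print(io500_score(100.0, 10.0, 500.0, 50.0))  # ~70.7

One appeal of a geometric mean here is that a poor "hard" result drags
the whole score down, so strong "-easy" numbers can't mask it.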
> • Storage performance degrades with age. You run Linpack on day 1 and
>get a number. You run Linpack on day N and get the same number. If
>your purpose is to bound user expectation, then a number from day 1 may
>no longer be the correct bound on day N when the storage is fragmented.
Agreed. While the "-easy" numbers may degrade over time, at least the
"-hard" numbers shouldn't get worse with age. Also, there is value to
knowing how the "-easy" numbers decline over time. The amount that the
"-easy" performance declines is a function of the underlying filesystem
and storage technology (e.g. SSD vs. HDD) so unlike Top 500 there would
be value in re-running the benchmarks every year and amending the storage
system's entry (though I think the ranking should be based on the peak
number).
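(A hedged sketch of what "amend the entry but rank on the peak" could
look like as a record; the field names are illustrative assumptions:)

    from dataclasses import dataclass, field

    # Hypothetical list-entry sketch: keep every dated re-run so the
    # decline of the "-easy" numbers stays visible, but rank on the peak.
    @dataclass
    class Entry:
        system: str
        filesystem: str
        results: list = field(default_factory=list)  # (date, score) pairs

        def ranking_score(self):
            # Ranking uses the best score ever submitted ...
            return max(score for _, score in self.results)

        def history(self):
            # ... but the full history remains available for trend analysis.
            return sorted(self.results)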
> • Another challenge with IO 500 is that Linpack is so easy. You run
>it and you get two indisputable answers: the result which can be
>verified to be correct and the time that it took to get the result. IO
>benchmarks are much harder. Are people allowed to set up RAID0 RAM
>disks and do the benchmark into them?
Sure, if that is a storage option that they actually provide for users.
I know of a few sites that use(d) RAM-based Lustre filesystems for shared
storage since their application didn't have checkpoint-restart so if any
node crashed the test run was lost anyway.
> • IO 500 is great because it enables people to look at historical
>trends.
> • For example, when do various systems start showing up? If I'm
>procuring a new storage system and I see that Ceph, BeeGFS, or OrangeFS
>are high up in the IO 500, then I'm much more likely to consider them
>instead of just Lustre and GPFS.
Definitely one of the things I've wanted to see in the past was which
filesystem each system on the Top-500 was using. We've had to generate
these results manually in the past.
> • NREL just doubled their flops on a system but didn't touch storage.
>It'd be nice if the IO 500 could somehow capture this. Both before and
>after will have the same storage performance, but the after system is
>worse because it is imbalanced.
While the number of clients running the tests should be part of the
results, I don't think a later change in the client system should affect
previous results (see my earlier comment about submitting updated numbers
periodically). There could be the ability to link the IO-500 storage
results to the Top-500 system results, but there isn't necessarily a 1:1
relationship between them.
In particular, many sites have Lustre or GPFS site-wide filesystems that
are accessed by multiple (and changing) compute clusters, so while a
particular benchmark result may depend on the number of clients actually
running the test, the result itself shouldn't be a function of the total
number of clients accessing the storage.
The discussion we had was about system balance: how the Top 500 has
created a tendency to value flops above having a balanced system, and the
possibility of somehow assigning a balance score (storage capacity and
performance relative to flop rate). This would only apply to systems that
have dedicated HPC-oriented filesystems; I don't think we'd want to try
to assess balance for a site-wide filesystem.
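(A rough sketch of the kind of ratios such a balance score might start
from, in the spirit of the old bytes-per-flop rules of thumb; the
specific metrics and numbers are assumptions:)

    # Hypothetical balance-ratio sketch: storage capacity and bandwidth
    # relative to compute rate. All inputs here are illustrative.
    def balance_ratios(capacity_pb, bandwidth_gbs, peak_pflops):
        return {
            "PB_per_PFLOPS": capacity_pb / peak_pflops,
            "GBps_per_PFLOPS": bandwidth_gbs / peak_pflops,
        }

    # The NREL case above: doubling flops without touching storage halves
    # both ratios, which is exactly the imbalance we'd want to surface.
    print(balance_ratios(10.0, 500.0, 1.0))  # before the compute upgrade
    print(balance_ratios(10.0, 500.0, 2.0))  # after: same storage, worse balance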
Thinking about how to deal with site-wide filesystems makes me wonder:
what is the *identifier* of entries in the proposed IO-500 list?
Is it a system name (e.g. Peregrine, Cori, Summit, etc.)?
Is it a system name plus a file system name (as in Peregrine:/scratch,
Jim:/scratch1, Jim:/scratch2, Bob:/ptmp)?
Maybe it is a system and the aggregation of multiple parallel file systems
(as in Jim:/scratch1+Jim:/scratch2)?
Maybe it is a site name plus a file system name (as in
NERSC:/site-wide-GPFS, NREL:Peregrine:/scratch)?
We've focused a fair amount of energy on defining benchmarks, but I don't
recall any discussion on this.
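(To make the naming options above concrete, here is a hedged sketch of a
structured identifier; the separators and all field names are purely
illustrative, not a proposal:)

    # Hypothetical identifier sketch covering the options above: a site,
    # an optional system, and one or more filesystems.
    def entry_id(site, system=None, filesystems=()):
        parts = [site]
        if system:
            parts.append(system)
        if filesystems:
            parts.append("+".join(sorted(filesystems)))
        return ":".join(parts)

    print(entry_id("NREL", "Peregrine", ["/scratch"]))
    # -> NREL:Peregrine:/scratch
    print(entry_id("Jim", filesystems=["/scratch1", "/scratch2"]))
    # -> Jim:/scratch1+/scratch2
    print(entry_id("NERSC", filesystems=["/site-wide-GPFS"]))
    # -> NERSC:/site-wide-GPFS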