On 4/6/17, 11:36 AM, "Andreas Dilger" <adilger(a)dilger.ca> wrote:
On Apr 6, 2017, at 10:38 AM, Carpenter, Ilene <Ilene.Carpenter(a)nrel.gov> wrote:
> On 4/6/17, 10:18 AM, "IO-500 on behalf of Andreas Dilger"
> <io-500-bounces(a)vi4io.org on behalf of adilger(a)dilger.ca> wrote:
>
>> On Apr 5, 2017, at 10:18 PM, John Bent <John.Bent(a)seagategov.com> wrote:
>>>
>>> All,
>>>
>>> I met with Ilene Carpenter about a week ago and she had a bunch of
>>> interesting thoughts about IO-500. Hers are the top-level bullets and
>>> my thoughts are the sub-bullets.
>>>
>>> • Maybe it should be called HPC IO-500 to distinguish it from
>>> hyperscalers, since the focus is on HPC and the machines on the Top 500
>>> list.
>>> • On the other hand, Top 500 is not called the HPC Top 500.
>>> • Benchmarks get what benchmarks get. What about performance when
>>> another job is running?
>>> • The idea was data-easy, data-hard, metadata-easy, and metadata-hard,
>>> and then to figure out how to combine these four into one number.
>>> Perhaps we add a fifth test which runs all four simultaneously.
>>
>> I'm not against running all of them together, but the aggregate shouldn't
>> be worse than data-hard + metadata-hard in the end; otherwise, by
>> definition, those are not the "hard" numbers we were looking for.
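
On the point above about combining the four numbers into one score: here is
a minimal sketch of one possible combination rule, assuming something like a
geometric mean is used. The rule itself is still an open question, and the
function name, argument names, and units below are only illustrative.

import math

def combined_score(data_easy_gibs, data_hard_gibs, md_easy_kiops, md_hard_kiops):
    # Geometric mean of the four phase results, so no single "easy" number
    # can dominate the score.
    results = [data_easy_gibs, data_hard_gibs, md_easy_kiops, md_hard_kiops]
    return math.exp(sum(math.log(r) for r in results) / len(results))

# Example: 100 GiB/s easy, 10 GiB/s hard, 500 kIOPS easy, 50 kIOPS hard
# -> (100 * 10 * 500 * 50) ** 0.25, roughly 70.7
print(combined_score(100.0, 10.0, 500.0, 50.0))
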
>>
>>> • Storage performance degrades with age. You run Linpack on day 1 and
>>> get a number. You run Linpack on day N and get the same number. If
>>> your purpose is to bound user expectations, then a number from day 1
>>> may no longer be the correct bound on day N when the storage is
>>> fragmented.
>>
>> Agreed. While the "-easy" numbers may degrade over time, at least the
>> "-hard" numbers shouldn't get worse with age. Also, there is value to
>> knowing how the "-easy" numbers decline over time. The amount that the
>> "-easy" performance declines is a function of the underlying filesystem
>> and storage technology (e.g. SSD vs. HDD), so unlike Top 500 there would
>> be value in re-running the benchmarks every year and amending the storage
>> system's entry (though I think the ranking should be based on the peak
>> number).
>>
>>> • Another challenge with IO 500 is that Linpack is so easy. You run
>>> it and you get two indisputable answers: the result, which can be
>>> verified to be correct, and the time that it took to get the result.
>>> IO benchmarks are much harder. Are people allowed to set up RAID0 RAM
>>> disks and run the benchmark against them?
>>
>> Sure, if that is a storage option that they actually provide for users.
>> I know of a few sites that use(d) RAM-based Lustre filesystems for shared
>> storage, since their application didn't have checkpoint-restart, so if any
>> node crashed the test run was lost anyway.
>>
>>> • IO 500 is great because it enables people to look at historical
>>> trends.
>>> • For example, when do various systems start showing up? If I'm
>>> procuring a new storage system and I see that Ceph, BeeGFS, or OrangeFS
>>> are high up in the IO 500, then I'm much more likely to consider them
>>> instead of just Lustre and GPFS.
>>
>> Definitely. One of the things I've wanted to see is which filesystem
>> each system on the Top-500 is using; we've had to generate these results
>> manually in the past.
>>
>>> • NREL just doubled their flops on a system but didn't touch storage.
>>> It'd be nice if the IO 500 could somehow capture this. Both before and
>>> after will have the same storage performance, but the after system is
>>> worse because it is imbalanced.
>>
>> While the number of clients running the tests should be part of the
>> results, I don't think a later change in the client system should affect
>> previous results (see my earlier comment about submitting updated numbers
>> periodically). There could be the ability to link the IO-500 storage
>> results to the Top-500 system results, but there isn't necessarily a 1:1
>> relationship between them.
>>
>> In particular, many sites have Lustre or GPFS site-wide filesystems that
>> are accessed by multiple (and changing) compute clusters, so while a
>> particular benchmark result may depend on the number of clients actually
>> running the test, the result itself shouldn't be a function of the total
>> number of clients accessing the storage.
>
> The discussion we had was about system balance, how Top 500 has caused a
> tendency to value flops above having a balanced system, and the possibility
> of somehow assigning a balance score: storage capacity and performance
> relative to flop rate. This would only apply to systems that have
> dedicated HPC-oriented filesystems; I don't think we'd want to try to
> assess balance for a site-wide filesystem.
I think if there is a link from the IO-500 results to the Top-500 results
then this is something that readers can work out themselves, but I don't
think the posted IO-500 results should depend on the compute nodes
directly. Otherwise, this would imply that the IO-500 result changes if
the Linpack number is updated (e.g. a better compiler or added GPUs), but
that doesn't necessarily make sense.
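
If a reader did want a rough balance number from a linked pair of entries,
it could be as simple as dividing the measured storage numbers by the
Linpack Rmax. A minimal sketch (the function, field names, and example
numbers are made up for illustration, not a proposed metric):

def balance_ratios(write_bw_gib_s, capacity_pb, rmax_pflops):
    # GiB/s of write bandwidth per PFLOP/s, and PB of capacity per PFLOP/s.
    return (write_bw_gib_s / rmax_pflops, capacity_pb / rmax_pflops)

# Example: 500 GiB/s write bandwidth, 10 PB of scratch, Rmax of 5 PFLOP/s
# -> 100.0 GiB/s per PFLOP/s and 2.0 PB per PFLOP/s.
bw_per_pflop, pb_per_pflop = balance_ratios(500.0, 10.0, 5.0)
print(bw_per_pflop, pb_per_pflop)
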
> Thinking about how to deal with site-wide filesystems makes me wonder:
> what is the *identifier* of entries in the proposed IO-500 list?
> Is it the system name (i.e. Peregrine, Cori, Summit, etc.)?
> Is it a system name plus a file system name (as in Peregrine:/scratch,
> Jim:/scratch1, Jim:/scratch2, Bob:/ptmp)?
> Maybe it is a system and the aggregation of multiple parallel file systems
> (as in Jim:/scratch1+Jim:/scratch2)?
> Maybe it is a site name plus a file system name (as in
> NERSC:/site-wide-GPFS, NREL:Peregrine:/scratch)?
>
> We've focused a fair amount of energy on defining benchmarks, but I don't
> recall any discussion on this.
Definitely the filesystem should be identified separately from (or in
addition to) the cluster that it is attached to. Some sites have names
for site-wide filesystems themselves, since they are attached to multiple
compute clusters.

So that would be CENTER:/filesystem-name for site-wide filesystems, and
either CENTER:SYSTEM:/filesystem-name or SYSTEM:/filesystem-name for
system-specific filesystems, which may have pretty generic names like
"scratch".
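
For concreteness, a minimal sketch of how such an identifier could be
composed; the function and argument names are just illustrative, not a
proposed format:

def entry_id(center, filesystem, system=None):
    # CENTER:/fs for site-wide filesystems, CENTER:SYSTEM:/fs for
    # filesystems dedicated to a single compute system.
    if system:
        return f"{center}:{system}:{filesystem}"
    return f"{center}:{filesystem}"

print(entry_id("NERSC", "/site-wide-GPFS"))              # NERSC:/site-wide-GPFS
print(entry_id("NREL", "/scratch", system="Peregrine"))  # NREL:Peregrine:/scratch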