On Jun 19, 2017, at 5:04 PM, Harms, Kevin <harms(a)alcf.anl.gov> wrote:
Just my $0.02 here, but I think the usefulness of the IO-500 would be in capturing the details
of how the storage system is tuned and what tuning options were used when running the
job. Reporting, say, all of the Lustre/GPFS tuning parameters for a given storage system
would be nice to know. For example, an IOR test done using the exact same compute system
and storage system would give different results if GPFS was configured with a 256
KiB block size as opposed to an 8 MiB block size.
I think people submitting "optimized" results, with the test setup documented
and all of the storage/FS configuration documented, is more useful than trying to converge
on a set of run rules.
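As a rough sketch of what "documenting the configuration" could mean in practice, a submission script might archive the output of the file systems' own query tools alongside the results. The specific parameters below are illustrative examples, not an authoritative list:

```python
# Sketch: collect file system tuning parameters to archive with an IO-500
# submission. The exact parameters captured are examples; a real submission
# would dump whatever tunables its file system exposes.
import subprocess

# "lctl get_param" is Lustre's tunable query interface; "mmlsfs" reports
# GPFS file system attributes (e.g. "-B" for the block size).
CAPTURE_COMMANDS = {
    "lustre_params": ["lctl", "get_param", "osc.*", "llite.*"],
    "gpfs_blocksize": ["mmlsfs", "all", "-B"],
}

def capture_tuning(commands=CAPTURE_COMMANDS):
    """Run each command and return {label: output}; note missing tools."""
    report = {}
    for label, cmd in commands.items():
        try:
            result = subprocess.run(cmd, capture_output=True, text=True,
                                    timeout=30)
            report[label] = result.stdout
        except FileNotFoundError:
            report[label] = "<tool not installed on this node>"
    return report
```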
For example, 5 minutes of runtime sounds good but what happens if someone deploys
"burst buffer type" storage that can write 100% of memory in less than a minute?
I guess that would be possible, but at least with the systems I've been involved in,
the burst buffer storage should be able to store at least 3-4 full checkpoints, and the
storage system performance is typically scaled to achieve the 5-minute/90% target, for cost
reasons as much as anything.
I don't _think_ that writing for 5 minutes to a storage system is an onerous
requirement. Storage devices would need to have a minimum "300s full bandwidth write
capacity" (FBWC) so they don't run out of space during the test, and looking
around I don't see much available that doesn't meet this performance vs. capacity
requirement.
Definitely every HDD and SATA SSD has capacity > bandwidth * 300s due to performance
limitations of the SATA interface. With SATA 3.0 at 600 MB/s, the 300s FBWC is 180 GB, and
no HDD that small is even sold today (nor could one accept data at that rate
anyway). Only the very cheapest consumer-grade SSDs are below 256 GB capacity, and they are
unlikely to have an actual write bandwidth that hits the SATA 3.0 limit; above 256 GB
capacity they are limited by SATA bandwidth.
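The arithmetic is just capacity required = write bandwidth * 300 s; a one-function sketch:

```python
# 300s full-bandwidth write capacity (FBWC): the space a device must have
# so that writing at full speed for 300 seconds does not fill it.
TEST_SECONDS = 300

def fbwc_gb(write_bw_mb_s):
    """Capacity (GB) consumed by writing at write_bw_mb_s MB/s for 300 s."""
    return write_bw_mb_s * TEST_SECONDS / 1000.0

# SATA 3.0 tops out around 600 MB/s, so the FBWC is 180 GB -- smaller than
# any HDD on sale today, and smaller than all but the cheapest SSDs.
print(fbwc_gb(600))  # 180.0
```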
High-end PCIe3 SSD/NVRAM devices _can_ hit very high write bandwidths. The PCIe3 bandwidth
itself would put a limit on the maximum device size needed to run the benchmark. I think
SSD devices today generally already meet these targets. It's easy for me to find the
Intel PCI SSD specs, but I expect other vendor devices to be similar. Note that
performance is often a function of the device size, due to the bandwidth limit of
individual NAND chips. I've listed the interesting performance/capacity points for
various models below:
Card              Capacity  Max write b/w  300s FBWC
PCIe3 x1                    984 MB/s       296 GB
PCIe3 x4                    3940 MB/s      1182 GB
PCIe3 x8                    7880 MB/s      2364 GB
PCIe3 x16                   15754 MB/s     4726 GB
Intel DC P3500    400 GB    1000 MB/s      300 GB
Intel DC P3500    1200 GB   1200 MB/s      360 GB
Intel DC P3500    2000 GB   1800 MB/s      540 GB
Intel DC P3700    400 GB    1080 MB/s      324 GB
Intel DC P3700    800 GB    1900 MB/s      570 GB
Intel DC P3700    2000 GB   1900 MB/s      570 GB
Intel DC P3608    1600 GB   2000 MB/s      600 GB
Intel DC P3608    4000 GB   3000 MB/s      900 GB
Intel DC P4800X   375 GB    2200 MB/s      660 GB*
The only device that doesn't meet the 300s FBWC is the new Optane P4800X, which
hits about 2200 MB/s with 1 MB transfers, or even with 4 KB transfers at queue depth 8 [*]. It is
worth noting from the same article that the P4800X line is also scheduled to get a
750 GB model, which would have enough capacity to meet the 300s FBWC requirement as well.
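Checking the capacity-vs-FBWC condition programmatically, with a few device numbers copied from the table above:

```python
# Check devices against the 300s FBWC requirement:
# capacity must be at least write_bandwidth * 300 s.
DEVICES = [
    # (name, capacity_gb, max_write_mb_s) -- from the spec table above
    ("Intel DC P3500 400GB",  400,  1000),
    ("Intel DC P3700 2000GB", 2000, 1900),
    ("Intel DC P3608 4000GB", 4000, 3000),
    ("Intel DC P4800X 375GB", 375,  2200),
]

def meets_fbwc(capacity_gb, write_mb_s, seconds=300):
    """True if the device can absorb a full-bandwidth write for `seconds`."""
    return capacity_gb * 1000 >= write_mb_s * seconds

failing = [name for name, cap, bw in DEVICES if not meets_fbwc(cap, bw)]
print(failing)  # only the 375 GB P4800X falls short (660 GB needed)
```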
In summary, I don't think there are really going to be (m)any storage systems that can
deliver the required throughput to store 100% of RAM in under 5 minutes, simply due to the
economics of performance vs. capacity.
Cheers, Andreas
[*]
http://www.anandtech.com/show/11209/intel-optane-ssd-dc-p4800x-review-a-d...
kevin
________________________________________
From: IO-500 <io-500-bounces(a)vi4io.org> on behalf of John Bent
<John.Bent(a)seagategov.com>
Sent: Friday, June 16, 2017 3:56:09 AM
To: io-500(a)vi4io.org
Subject: [IO-500] detailed benchmark proposal
All,
Sorry for the long silence on the mailing list. However, we have made some substantial
progress recently as we prepare for our ISC BOF next week. For those of you at ISC,
please join us from 11 to 12 on Tuesday in Substanz 1&2.
The progress that we have made recently happened because a bunch of us attended a
German workshop last month at Dagstuhl and had multiple discussions about the benchmark.
Here are the highlights from what was discussed and the progress that we made at Dagstuhl:
1. General agreement that the IOR-hard, IOR-easy, mdtest-hard, mdtest-easy approach is
appropriate.
2. We should add a ‘find’ command as this is a popular and important workload.
3. The multiple bandwidth measurements should be combined via geometric mean into one
bandwidth.
4. The multiple IOPs measurements should also be combined via geometric mean into one
IOPs.
5. The bandwidth and the IOPs should be multiplied to create one final score.
6. The ranking uses that final score but the webpage can be sorted using other
metrics.
7. The webpage should allow filtering as well so, for example, people can look at only
the HDD results.
8. We should separate the write/create phases from the read/stat phases to help ensure
that caching is avoided.
9. Nathan Hjelm volunteered to combine the mdtest and IOR benchmarks into one git repo
and has now done so. This removes the #ifdef mess from mdtest, and now they both share the
nice modular IOR backend.
So the top-level summary of the benchmark in pseudo-code has become:
# write/create phase
bw1 = ior_easy -write [user supplies their own parameters maximizing data writes that can
be done in 5 minutes]
md1 = md_test_easy -create [user supplies their own parameters maximizing file creates
that can be done in 5 minutes]
bw2 = ior_hard -write [we supply parameters: unaligned strided into single shared file]
md2 = md_test_hard -create [we supply parameters: creates of 3900 byte files into single
shared directory]
# read/stat phase
bw3 = ior_easy -read [cross-node read of everything that was written in bw1]
md3 = md_test_easy -stat [cross-node stat of everything that was created in md1]
bw4 = ior_hard -read
md4 = md_test_hard -stat
# find phase
md5 = [we supply parameters to find a subset of the files that were created in the
tests]
# score phase
bw = geo_mean( bw1 bw2 bw3 bw4 )
md = geo_mean( md1 md2 md3 md4 md5 )
total = bw * md
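The score phase above (geometric means of the bandwidth and metadata results, multiplied into one number) is only a few lines of code. A sketch with placeholder numbers, NOT real measurements:

```python
import math

def geo_mean(values):
    """Geometric mean: the n-th root of the product of n values."""
    return math.prod(values) ** (1.0 / len(values))

# Placeholder results only: bandwidths (GB/s) from the four IOR phases,
# metadata rates (kIOPS) from the four mdtest phases plus the 'find' phase.
bw_results = [10.0, 2.0, 12.0, 3.0]         # bw1..bw4
md_results = [50.0, 8.0, 60.0, 10.0, 40.0]  # md1..md5

bw = geo_mean(bw_results)
md = geo_mean(md_results)
total = bw * md  # the single score used for ranking
```

The geometric mean (rather than the arithmetic mean) keeps a system from climbing the list on one outlier phase: a near-zero result in any phase drags the whole score down.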
Now we are moving on to precisely defining what the parameters should look like for the
hard tests and to creating a standard so that people can start running it on their systems.
By doing so, we will define the formal process so we can actually make this an official
benchmark. Please see the attached file in which we’ve started precisely defining these
parameters. Let’s start iterating on this file to get these parameters correct.
Thanks,
John
_______________________________________________
IO-500 mailing list
IO-500(a)vi4io.org
https://www.vi4io.org/cgi-bin/mailman/listinfo/io-500