Frankly, I'm much more OK with scratch-focused solutions dominating the list than I am with vendor-built specialty systems (potentially with custom software) at the top of it.  If they must appear, they should be on a separate list.

I agree on the durability side of things, and on sharing as much data about the architectures and configurations as possible.

--Ken

On Mar 11, 2020, at 2:26 PM, Dean Hildebrand via IO-500 <io-500@vi4io.org> wrote:

Just to +1 the idea of full transparency.  I'm OK with scratch-space-focused solutions dominating the top of the list, but their type of durability and availability should be stated.

I would go beyond what can be manually scraped in Linux (e.g., you can't see the version of EC used on a storage controller) and mandate that submissions describe their data durability architecture in at least some detail.  I don't think it would be onerous to include something like "8+3 RS with dual redundant servers" or, for Google Cloud, "OSS consists of a VM using Standard Persistent Disk".

Even if it weren't mandated, we should encourage the audience to view any solution that doesn't include durability/availability information as effectively a scratch-space solution.
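
For illustration, a minimal sketch of what durability fields like the ones suggested above might look like alongside the existing information.* fields; the field names here are hypothetical suggestions, not part of the current schema:

    # Hypothetical durability/availability fields for a submission record.
    # None of these names exist in the current IO-500 schema; they only
    # illustrate the level of detail being proposed.
    durability_info = {
        "information_data_protection": "8+3 Reed-Solomon across servers",
        "information_server_redundancy": "dual redundant servers",
        "information_durability_class": "persistent",  # vs. "scratch"
        # Cloud example: "OSS consists of a VM using Standard Persistent Disk"
    }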

On Wed, Mar 11, 2020 at 9:08 AM Mark Nelson via IO-500 <io-500@vi4io.org> wrote:
On 3/10/20 6:34 PM, Andreas Dilger wrote:

> On Mar 10, 2020, at 4:43 PM, Harms, Kevin via IO-500 <io-500@vi4io.org> wrote:
>> Mark,
>>
>>   Currently there is no requirement for replication = 2; you can run with replication = 1.
> That is true, but it depends on what you want to show to users.  Some systems might reasonably run without redundancy (e.g. short-term scratch space where NVMe MTTF is much longer than the file lifetime), but that is not necessarily desirable for most storage.  So I think that should be _possible_, but using an unrealistic configuration for vendor submissions may be setting users up for disappointment if that isn't how systems are normally configured in the field.


That was sort of my take.  It seems a little unsportsmanlike to set
the system up in a way that no one would actually use.  If I use 1x
replication (and run the tests a little longer for better dynamic
subtree partitioning) I can push the score up to 62.5.  It would
probably scale higher/faster with faster subtree partitioning, since
the mdtest results aren't changing much.  In any event, it seems far
more informative for users to report the ~56 score with 2x
replication, or even 3x replication, which is what we recommend for
more permanent filesystem use.  I guess the exception might be very
highly transient scratch space?
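
For context, a minimal sketch of how the replication factor can be switched between runs, assuming the default CephFS pool names (cephfs_data, cephfs_metadata); the pool names are an assumption and may differ per cluster:

    import subprocess

    # Set the replication factor on the CephFS pools before a run.
    # Pool names are the common defaults and are assumed here;
    # each call is equivalent to: ceph osd pool set <pool> size <n>
    def set_replication(size, pools=("cephfs_data", "cephfs_metadata")):
        for pool in pools:
            subprocess.run(
                ["ceph", "osd", "pool", "set", pool, "size", str(size)],
                check=True)

    set_replication(2)  # e.g. the 2x configuration behind the ~56 score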


>
> As for somewhat apples-to-oranges comparisons (at least two fruits, not apples-to-asteroids), in some cases it is possible to get _some_ information about the underlying storage, but I don't think that is *easy* with the current results.  If you go to the IO-500 lists and add all of the "information.*" fields at the bottom of the page, it will give you some details about the systems the tests were run on.


That's what I figured, but I thought I would ask.  Apples-to-apples is
tough, but it would be nice to know whether you are comparing even
remotely similar setups without having to do a bunch of digging and
guesswork.


>
> That said, the information is recorded inconsistently in some places (e.g. in some cases it looks like the number of devices is per server, and in others it is the total number of devices), but at least it gives you some idea of what the other storage systems are using.

Yeah, that's exactly what I'm seeing.  I'm trying to look through some
of the 10-node challenge results on this page:


https://www.vi4io.org/io500/list/19-11/10node?fields=information__system,information__institution,information__storage_vendor,information__filesystem_type,information__client_nodes,information__client_total_procs,io500__score,io500__bw,io500__md,information__data,information__list_id&equation=&sort_asc=false&sort_by=io500__score&radarmax=6&query=


Down at the bottom there is a "Download complete data as CSV" link,
but that appears to be for the regular IO-500 list, since the results
are different (many of them use way more than 10 clients).  I can see
how people are taking liberties with the meaning of the different
information fields.  I'm also not entirely sure how I would fill them
out for Ceph.  On our test cluster, metadata and data are stored on
the same object stores and everything runs on the same 10 nodes and
80 devices.  I.e., would I put 0 MDS servers and 0 MDS devices, or
10 MDS servers and 80 MDS devices?  It's difficult to express that
with the fields we have available (maybe it would help to have
explicit devices-per-node, total device count, and total node count
fields?).  You almost need some kind of topology visualization.
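
To make that concrete, a sketch of how the co-located layout might be recorded if the list had more explicit fields; these field names are hypothetical, not part of the current schema:

    # Hypothetical, more explicit layout fields for a co-located Ceph setup.
    # These names do not exist in the current IO-500 schema; they just show
    # how 10 nodes / 80 devices serving data, metadata, and clients could be
    # recorded without zeroing out or double-counting the MDS side.
    layout = {
        "total_nodes": 10,
        "devices_per_node": 8,
        "total_devices": 80,
        "roles_per_node": ["OSD", "MDS", "client"],  # co-located services
    }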


>
> One of the goals for the next IO-500 list is to automate some of the information capture so that this is recorded more accurately and consistently.  For example, having a script for Lustre, Ceph, GPFS, BeeGFS, etc. to scrape information from the client and/or server about RAM, CPU, network, filesystem size, version, devices, OS versions, tunable parameters, etc. to include with the test results would be very useful, even if not all of it fits into the database schema at this point.


We've got some tools in Ceph that gather information like this, though
I don't recall everything they collect.  It might be possible to tie
into them, though I have a feeling you could cover many
Lustre/BeeGFS/Ceph deployments with a single external tool.
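
As a rough sketch of what a filesystem-agnostic, client-side scraper might collect (the field set and output format below are assumptions, not an agreed-upon schema):

    import json, os, platform

    # Rough sketch of a filesystem-agnostic scraper run on a client node.
    # It only touches standard Linux interfaces; per-filesystem details
    # (Lustre/BeeGFS/Ceph versions, tunables) would need their own hooks.
    def scrape(mountpoint):
        st = os.statvfs(mountpoint)
        mem_kb = next(int(line.split()[1]) for line in open("/proc/meminfo")
                      if line.startswith("MemTotal"))
        return {
            "hostname": platform.node(),
            "kernel": platform.release(),
            "cpu_count": os.cpu_count(),
            "mem_total_kb": mem_kb,
            "fs_size_bytes": st.f_blocks * st.f_frsize,
            "fs_free_bytes": st.f_bavail * st.f_frsize,
            "mounts": [line.strip() for line in open("/proc/mounts")
                       if mountpoint in line],
        }

    if __name__ == "__main__":
        print(json.dumps(scrape("/mnt/cephfs"), indent=2))  # example mountpoint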


>
> Cheers, Andreas
>
>
>> ________________________________________
>> From: IO-500 <io-500-bounces@vi4io.org> on behalf of Mark Nelson via IO-500 <io-500@vi4io.org>
>> Sent: Tuesday, March 10, 2020 4:30 PM
>> To: io-500@vi4io.org
>> Subject: [IO-500] How to judge scoring vs storage HW
>>
>> Hi Folks,
>>
>> I'm one of the Ceph developers but used to work in the HPC world in a
>> previous life.  Recently I saw that we were listed on the SC19 IO-500
>> 10-node challenge list but had ranked pretty low.  I figured that it might
>> be fun to play around for a couple of days and see if I could get our
>> score up a bit.
>>
>> Let me first say that it's great having mdtest and ior packaged up like
>> this.  Already the hard test cases have identified a couple of
>> performance issues we should take care of with unaligned reads/writes
>> and CephFS dynamic subtree partitioning (which are also dragging our
>> score down).  Very useful!  I was so happy with the effort that I ended
>> up writing a new libcephfs aiori backend for ior/mdtest.  The PR just
>> merged but is here for anyone interested:
>>
>> https://github.com/hpc/ior/pull/217
>>
>> Our test cluster has 10 nodes with 8 NVMe drives each, and we are
>> co-locating the metadata servers and client processes on the same nodes
>> during testing.  So far with 2x replication we've managed to hit scores
>> in the 55-60 range which looks like it would have put us in 10th place
>> on the SC19 list (note that for that result we are pre-creating the
>> mdtest easy directories for static round-robin MDS pinning, though we
>> have a feature coming soon for ephemeral pinning via a single
>> parent-directory xattr).  Anyway, I have really no idea how that score
>> actually compares to the other systems listed.  I was wondering if
>> there's any way to easily compare what kind of hardware and software
>> configuration is being used for the storage clusters for each entry?
>>
>> I.e., in our case we're using 2x replication and 10 nodes total with pretty
>> beefy Xeon CPUs, 8x P4610 NVMe drives, and 4x 25GbE.  Total storage
>> capacity before replication is ~640TB.
>>
>> Thanks,
>> Mark
>>
>> _______________________________________________
>> IO-500 mailing list
>> IO-500@vi4io.org
>> https://www.vi4io.org/mailman/listinfo/io-500

_______________________________________________
IO-500 mailing list
IO-500@vi4io.org
https://www.vi4io.org/mailman/listinfo/io-500