Just to +1 the idea of full transparency.  I'm OK with scratch-space-focused solutions dominating the top of the list, but the type of durability and availability they provide should be stated.

I would go beyond what can be manually scraped in Linux (e.g., you can't see the erasure coding (EC) scheme used on a storage controller) and mandate that submissions describe their data durability architecture in at least some level of detail.  I don't think it would be onerous to include something like "8+3 Reed-Solomon with dual redundant servers" or, to take Google Cloud as an example, "each OSS is a VM using a Standard Persistent Disk".
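Purely to illustrate the level of detail I have in mind (these field names are hypothetical, not part of the current submission schema), something along the lines of:

    information_data_protection  = "8+3 Reed-Solomon, dual redundant servers"
    information_data_replication = 1          # or 2, 3, ...
    information_durability_class = "scratch"  # scratch / persistent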

Even if it wasn't mandated, we should encourage the audience to view any solution that doesn't include durability/availability information as effectively being a scratch space solution.

On Wed, Mar 11, 2020 at 9:08 AM Mark Nelson via IO-500 <io-500@vi4io.org> wrote:
On 3/10/20 6:34 PM, Andreas Dilger wrote:

> On Mar 10, 2020, at 4:43 PM, Harms, Kevin via IO-500 <io-500@vi4io.org> wrote:
>> Mark,
>>
>>   currently there is no requirement for replication = 2, you can run with replication = 1.
> That is true, but it depends on what you want to show users.  Some systems might reasonably run without redundancy (e.g., short-term scratch space where NVMe MTTF is much longer than the file lifetime), but that is not necessarily desirable for most storage.  So I think that should be _possible_, but using an unrealistic configuration for vendor submissions may be setting users up for disappointment if that isn't how systems are normally configured in the field.


That was sort of my take.  It seems a little unsportsmanlike to set
the system up in a way that no one would actually use.  If I use 1x
replication (and run the tests a little longer so dynamic sub-tree
partitioning has more time to balance) I can push the score up to
62.5.  It would probably scale higher/faster with faster sub-tree
partitioning, since the mdtest results aren't changing much.  In any
event, it seems far more informative for users to report the ~56
score with 2x replication, or even 3x replication, which is what we
recommend for more permanent filesystem use.  I guess the exception
might be for very highly transient scratch space?
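(For a bit of context on why the gap isn't larger: my understanding
is that the overall score is the geometric mean of the bandwidth and
metadata components, so a bandwidth hit from replication is dampened.
A back-of-the-envelope sketch with made-up numbers:)

    # Assumes score = sqrt(bw * md), which is my understanding of how
    # the IO-500 score is formed; the numbers below are hypothetical,
    # not our actual results.
    from math import sqrt

    def io500_score(bw_gibs, md_kiops):
        return sqrt(bw_gibs * md_kiops)

    print(io500_score(30.0, 130.0))  # ~62.4, e.g. 1x replication
    print(io500_score(24.0, 130.0))  # ~55.9, same metadata, ~20% less bandwidth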


>
> As for somewhat apples-to-oranges comparisons (at least two fruits, not apples-to-asteroids), in some cases it is possible to get _some_ information about the underlying storage, but I don't think that is *easy* with the current results.  If you go to the IO-500 lists, and add all of the "information.*" fields at the bottom of the page, this will give you some details about the systems that the tests were run on.


That's what I figured, but I thought I would ask.  Apples-to-apples
is tough, but it would be nice to know whether you're comparing even
remotely similar setups without a bunch of digging and guesswork.


>
> That said, the information is recorded inconsistently in some places (e.g. in some cases it looks like the number of devices is per server, and in others it is the total number of devices), but at least it gives you some idea of what the other storage systems are using.

Yeah, that's exactly what I'm seeing.  I'm trying to look through some
of the 10 node challenge results on this page:


https://www.vi4io.org/io500/list/19-11/10node?fields=information__system,information__institution,information__storage_vendor,information__filesystem_type,information__client_nodes,information__client_total_procs,io500__score,io500__bw,io500__md,information__data,information__list_id&equation=&sort_asc=false&sort_by=io500__score&radarmax=6&query=


Down at the bottom there is a "Download complete data as CSV" link,
but that appears to be for the regular IO-500 list since the results
are different (many results are using way more than 10 clients).  I
can see how people are taking liberties with the meaning of the
different information fields.  I'm also not entirely sure how I would
fill them out for Ceph.  On our test cluster, metadata and data are
stored on the same object stores and everything is running on the
same 10 nodes and 80 devices.  I.e., would I put 0 MDS servers and 0
MDS devices, or 10 MDS servers and 80 MDS devices?  It's difficult to
express that with the fields we have available (maybe it would be
helpful to have explicit devices/node, total device count, and total
node count fields?).  You almost need some kind of topology
visualization.
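Just to sketch what I mean by explicit fields (hypothetical names,
nothing that exists in the schema today), something like this would
remove the ambiguity for a converged setup like ours:

    information_total_nodes            = 10
    information_devices_per_node       = 8
    information_total_devices          = 80
    information_md_colocated_with_data = true   # MDS on the same nodes/devices
    information_clients_on_servers     = true   # client procs share the 10 nodes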


>
> One of the goals for the next IO-500 list is to automate some of the information capture so that this is recorded more accurately and consistently.  For example, having a script for Lustre, Ceph, GPFS, BeeGFS, etc. to scrape information from the client and/or server about RAM, CPU, network, filesystem size, version, devices, OS versions, tunable parameters, etc. to include with the test results would be very useful, even if not all of it fits into the database schema at this point.


We've got some tools in Ceph that gather information like this,
though I don't recall everything they gather.  It might be possible
to tie into them, though I have a feeling you could cover many
Lustre/BeeGFS/Ceph deployments with a single external tool.
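To make that concrete, here's a rough sketch of the kind of
filesystem-agnostic scraper I'm imagining.  It's Python, only pokes
at /proc and standard CLI tools, the mount point is just a
placeholder, and Ceph/Lustre/BeeGFS specifics would need their own
plugins on top:

    #!/usr/bin/env python3
    # Sketch only: gather generic client-side info for a submission.
    import json
    import platform
    import shutil
    import subprocess

    def first_matching(path, prefix):
        # Return the value of the first "prefix: value" line in path.
        try:
            with open(path) as f:
                for line in f:
                    if line.startswith(prefix):
                        return line.split(":", 1)[1].strip()
        except OSError:
            pass
        return None

    def gather(mountpoint="/mnt/io500"):  # placeholder path
        usage = shutil.disk_usage(mountpoint)
        return {
            "hostname": platform.node(),
            "kernel": platform.release(),
            "cpu_model": first_matching("/proc/cpuinfo", "model name"),
            "mem_total": first_matching("/proc/meminfo", "MemTotal"),
            "fs_size_bytes": usage.total,
            "mount_info": subprocess.run(
                ["findmnt", "-J", mountpoint],
                capture_output=True, text=True).stdout,
        }

    if __name__ == "__main__":
        print(json.dumps(gather(), indent=2))

Filesystem-specific bits (e.g. Ceph pool/replication settings, Lustre
striping and tunables) could then be bolted on as small per-filesystem
plugins.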


>
> Cheers, Andreas
>
>
>> ________________________________________
>> From: IO-500 <io-500-bounces@vi4io.org> on behalf of Mark Nelson via IO-500 <io-500@vi4io.org>
>> Sent: Tuesday, March 10, 2020 4:30 PM
>> To: io-500@vi4io.org
>> Subject: [IO-500] How to judge scoring vs storage HW
>>
>> Hi Folks,
>>
>> I'm one of the Ceph developers but used to work in the HPC world in a
>> previous life.  Recently I saw that we were listed on the SC19 IO-500 10
>> node challenge list but had ranked pretty low.  I figured that it might
>> be fun to play around for a couple of days and see if I could get our
>> score up a bit.
>>
>> Let me first say that it's great having mdtest and ior packaged up like
>> this.  Already the hard test cases have identified a couple of
>> performance issues we should take care of with unaligned reads/writes
>> and cephfs dynamic subtree partitioning (which are also dragging our
>> score down).  Very useful!  I was so happy with the effort that I ended
>> up writing a new libcephfs aiori backend for ior/mdtest.  The PR just
>> merged but is here for anyone interested:
>>
>> https://github.com/hpc/ior/pull/217
>>
>> Our test cluster has 10 nodes with 8 NVMe drives each, and we are
>> co-locating the metadata servers and client processes on the same nodes
>> during testing.  So far with 2x replication we've managed to hit scores
>> in the 55-60 range which looks like it would have put us in 10th place
>> on the SC19 list (note that for that result we are pre-creating the
>> mdtest easy directories for static round-robin MDS pinning, though we
>> have a feature coming soon for ephemeral pinning via a single
>> parent-directory xattr).  Anyway, I have really no idea how that score
>> actually compares to the other systems listed.  I was wondering if
>> there's any way to easily compare what kind of hardware and software
>> configuration is being used for the storage clusters for each entry?
>>
>> IE in our case we're using 2x replication and 10 nodes total with pretty
>> beefy Xeon CPUs, 8xP4610 NVMe drives, and 4x25GbE.  Total storage
>> capacity before replication is ~640TB.
>>
>> Thanks,
>> Mark
>>
>> _______________________________________________
>> IO-500 mailing list
>> IO-500@vi4io.org
>> https://www.vi4io.org/mailman/listinfo/io-500
>
> Cheers, Andreas

_______________________________________________
IO-500 mailing list
IO-500@vi4io.org
https://www.vi4io.org/mailman/listinfo/io-500