Thanks to everyone for the great discussion.  From re-reading everyone's comments, here is what I gleaned:

1. I think I derailed the conversation with my 1PB suggestion.  It was simply a capacity that I felt couldn't reasonably be stored in a single node.  So if the concern is shared memory on a single node, then that was just one way to get around the issue without dictating architecture.  But it sounds like it was a poor suggestion :)

2. It sounds like everyone has different ideas for why the 10 node challenge exists (e.g., let small systems submit, evaluate smaller apps on large storage systems, encourage participation).  While this seems ok (and it is possibly even desirable for a benchmark to serve so many purposes), the actual results seem less useful than the hero runs: what guidance do they provide to our users?  Some entries have huge backends, some have a few TBs and a few storage servers...so it's really varied.  I guess it's 'reader of list beware' (as always seems to be the case).

3. I like the thread limit suggestion: it keeps things hard for many FSs and is probably a little more realistic about how real jobs run on physical h/w and utilize storage (as Steve stated).  But it's not clear to me that the 10 node challenge is primarily focused on the real world anyway, given the myriad of reasons for its existence.

Finally, I do still think it would be useful for the committee/community to come up with a more singular focus for the 10 node challenge rather than the grab bag that exists today.

Thanks again to everyone; beyond driving fixes for FS perf bugs, the io500 also seems like a great way for the community to discuss real issues.

On Fri, Oct 4, 2019 at 3:18 PM Carlile, Ken <carlilek@janelia.hhmi.org> wrote:

The 1 PB limit barring all-flash systems is exactly what I had in mind. While I do have a few all-flash systems (or ones that can convincingly make the argument for the purposes of io500), arguably my fastest is 500T. I suspect I have a faster one that’s only 100T, but I’ll never have a chance to test that one...

--Ken

Sent from my <advertising redacted>

> On Oct 4, 2019, at 5:55 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>
> IMHO, the "10 physical nodes" requirement makes sense from the point of view
> previously stated, that running 10 virtual hosts on the same physical machine
> can dramatically skew the results since they can share the same memory and
> do virtually no actual network traffic, bypassing the "read your neighbour"
> requirement.
>
> One of the main motivators for IO-500 is to explore filesystem scaling in real
> clusters, and if we allow 10 virtual nodes, why not devolve into 10 containers
> in the same OS instance, or 10 mountpoints on a single node with a local server?
> I think that is bypassing the intent of the benchmark completely.
>
> As for why the 10-node challenge exists, IMHO there are two motivations for this:
> - see storage performance with a limited number of clients, so that users/admins
>  can get a realistic sense of how IO performance will scale based on the number
>  of clients (i.e. assuming benchmark numbers are limited by protocol and network
>  performance), while the hero numbers are based on the number of servers (i.e.
>  assuming benchmark numbers are limited by aggregate storage bandwidth and IOPS).
> - let's be realistic: the list doesn't yet have 500 submissions, let alone
>  500 top systems, so this is a reasonable way to improve audience participation
>  that actually provides some useful metrics.
>
> In particular I like the idea of comparing the 10-node result with the N-node
> result to see just how much of the storage bandwidth can be driven by a small
> number of clients.  In theory, the 10-node case could saturate the network
> bandwidth (assuming storage bandwidth > 10x network bandwidth), but in practice
> this is not always the case, and the difference shows areas that could improve
> CPU, protocol, or server efficiency.  I think the IO-500 has already driven
> real-world improvements (cf. Cambridge) that have improved the lives of users.
> I think that 10-node results will continue to be useful to submit even after
> there are more than 500 larger results available on the list.
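>
> As a rough worked illustration (the link speed here is an assumption, just
> to make the comparison concrete): 10 clients with one 100Gb/s link each can
> move at most about 10 x 12.5 GB/s = 125 GB/s, so any backend faster than
> that is invisible to a 10-node run, and the gap between the 10-node and
> hero numbers shows how close the clients get to that ceiling.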
>
> As for the 1PB minimum, I think that would drive down participation, especially
> in the (IMHO important) flash storage arena, since that can be cost-prohibitive
> today.  I think the list will naturally fill out over time, with new and large
> systems coming online submitting results during their acceptance phase pushing
> the 10-node results from the top spots, and eventually from the IO-500 list
> entirely.  In the meantime, I don't see a need to refuse valid results while
> there is still a lot of room on the list that needs to be filled.
>
> Cheers, Andreas
>
>
>> On Oct 4, 2019, at 10:46 AM, Carlile, Ken via IO-500 <io-500@vi4io.org> wrote:
>>
>> I think the 1PB is a non-starter. Why exclude the small guys?
>>
>> What confuses me is the statement that it's ok to run multiple VMs as long as the iron count is 10. Or am I misreading that?
>>
>> 10 clients makes sense to me because certain places simply don't HAVE that many clients to throw at the benchmark, and it normalizes the speeds across a standard number of clients.
>>
>> --Ken
>>
>>> On Oct 4, 2019, at 12:43 PM, Dean Hildebrand via IO-500 <io-500@vi4io.org> wrote:
>>>
>>> Julian, thanks for the examples.
>>>
>>> I think what you may be getting at is that the 10 client challenge is really about, "Given a large storage system that submits a result to the standard io500, how well does it do with only 10 clients?".
>>>
>>> If this is the case, and we don't want to encourage the submission of small non-scalable storage systems, then maybe there are other ways to achieve it such as:
>>> - A submission to the 10 client challenge is only valid if a submission is also made to the standard io500 list.  Users can then look at both rankings to get an understanding of the system.
>>> - Each submission must have at least 1PB of storage capacity, which will increase by 10% each year.
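>>> (For scale, the second option compounds slowly: 1 PB * 1.1^5 ≈ 1.6 PB
>>> after five years.)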
>>>
>>> Just rough ideas, but maybe we need to clarify why an io500 list cares about 10 clients?
>>> Dean
>>>
>>>
>>> On 10/3/19 1:39 AM, Julian Kunkel wrote:
>>>> Hi,
>>>> IMHO: A simple way of seeing this matter for the 10 node challenge is
>>>> that it really should be about 10 nodes with interconnects, to
>>>> normalize results to some extent. Such runs reflect a real
>>>> configuration.
>>>> However, deploying 10 VMs on a single host and seeing a performance
>>>> gain vs. running directly on the host seems to be artificial.
>>>>
>>>> Regarding cheating: theoretically one could run 10 VMs on one big
>>>> node, where the host throttles the creation rates to a level such that
>>>> all data remains available in a big cache (say NVDIMMs) from the
>>>> perspective of the host (and thus of the VMs).  Every read would then
>>>> be served from cache.
>>>>
>>>> Here is a rather artificial example (if you have more appropriate
>>>> numbers, use them):
>>>>
>>>> For IOR BW assume:
>>>> * writes at 5 GiB/s to NVDIMMs (throttled) => ~1.5 TB per write phase
>>>> (5 GiB/s over the 300 s stonewall), so about 3 TB of space needed / doable.
>>>> * reads at 500 GiB/s.
>>>> => (5*5*500*500)^0.25 = 50 score
>>>> Not an issue so far.
>>>>
>>>> For MD, 10 million IOPS for create and 100 million for any
>>>> read/delete and find phase would give, in the list's kIOPS units:
>>>> (10000*10000*100000*100000*100000*100000*100000*100000)^(1/8)
>>>> => 56234.13
>>>>
>>>> Total score: sqrt(56234*50) = 1676.812
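>>>>
>>>> For anyone who wants to check the arithmetic, here is a minimal sketch
>>>> in Python (the inputs are the hypothetical numbers above, in GiB/s and
>>>> kIOPS, not measurements):
>>>>
>>>>   from math import prod
>>>>
>>>>   def geomean(values):
>>>>       # geometric mean, as used for the IO500 sub-scores
>>>>       return prod(values) ** (1.0 / len(values))
>>>>
>>>>   bw = geomean([5, 5, 500, 500])        # GiB/s  => 50.0
>>>>   md = geomean([10000]*2 + [100000]*6)  # kIOPS  => ~56234.13
>>>>   print((bw * md) ** 0.5)               # total  => ~1676.81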
>>>>
>>>> Yes, it is a synthetic example, but there could be technology out there
>>>> that generates such numbers, or people may create an IOR backend to
>>>> exploit such a setup.
>>>> You could also use two nodes: since the IO500 does rank-shifting by
>>>> one node, packing the 10 VM-nodes five per physical host means only
>>>> 1/5th of the data needs to be transferred over the network, which
>>>> would also lead to an artificially inflated number.
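>>>>
>>>> A small sketch of that effect (assuming the reader rank is shifted by
>>>> one node's worth of tasks, the usual IOR-style reordering; the exact
>>>> shift depends on configuration):
>>>>
>>>>   # which rank reads back the block that write_rank wrote
>>>>   def reader(write_rank, tasks_per_node, total_tasks):
>>>>       return (write_rank + tasks_per_node) % total_tasks
>>>>
>>>>   # 10 VM-"nodes" with 1 task each, packed 5 VMs per physical host
>>>>   host = lambda rank: rank // 5
>>>>   cross = sum(host(r) != host(reader(r, 1, 10)) for r in range(10))
>>>>   print(cross / 10)  # => 0.2, only 1/5th of reads cross the network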
>>>>
>>>> Personally, I would be interested in such gaming results; you can
>>>> always submit such numbers to the full list as synthetic "upper
>>>> bounds".
>>>>
>>>> Best,
>>>> Julian
>>>>
>>>> On Wed, Oct 2, 2019 at 10:02 PM Dean Hildebrand via IO-500
>>>> <io-500@vi4io.org> wrote:
>>>>> As a cloud provider, this rule isn't too onerous, as there is always a way to get dedicated machines through sole-tenant offerings and simply using large VMs (although it is a waste of $$ to use clients that have 60+ cores just to run a single benchmark process).
>>>>>
>>>>> I'm more curious about the thinking here; can someone from the committee provide some background?  This is one of those funny and rare cases where we are worried about someone with fewer resources having an advantage over someone with more resources.  If a system with 1 or 2 clients can beat 10... isn't that one measure of success from an HPC point of view?
>>>>>
>>>>> Dean
>>>>>
>>>>> On 9/30/19 9:10 AM, John Bent via IO-500 wrote:
>>>>>
>>>>>> To IO500 Community,
>>>>>>
>>>>>>
>>>>>> The committee has received some queries about the rules concerning virtual machines for the 10 Node Challenge. As such, the committee has added the following rule:
>>>>>>
>>>>>>
>>>>>> 13. For the 10 Node Challenge, there must be exactly 10 physical nodes for client processes, and at least one benchmark process must run on each.
>>>>>>
>>>>>> Virtual machines can be used but the above rule must be followed. More than one virtual machine can be run on each physical node.
>>>>>>
>>>>>>
>>>>>> Although we recognize that this may disadvantage cloud architectures, we do want to stress that this rule only applies to the 10 Node Challenge. The committee did feel it was important to add this rule to ensure that the 10 Node Challenge sublist offers the maximum potential for fair comparisons by ensuring equivalent client hardware quantities. Submissions with any number/combination of virtual and physical machines can of course always be submitted to the full list.
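>>>>>>
>>>>>> For illustration, a compliant 10 Node Challenge run might use an MPI
>>>>>> hostfile naming exactly 10 physical client nodes (the hostnames, slot
>>>>>> counts, and exact launch command here are hypothetical and depend on
>>>>>> the harness version):
>>>>>>
>>>>>>   # hostfile: 10 physical nodes, at least 1 process lands on each
>>>>>>   node01 slots=4
>>>>>>   ...
>>>>>>   node10 slots=4
>>>>>>
>>>>>>   mpirun -np 40 --hostfile hostfile ./io500.sh config-10node.ini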
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>>
>>>>>> The IO500 Committee
>>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> IO-500 mailing list
>>>>> IO-500@vi4io.org
>>>>> https://www.vi4io.org/mailman/listinfo/io-500
>>>>>
>>>>
>>>>
>>>
>>> _______________________________________________
>>> IO-500 mailing list
>>> IO-500@vi4io.org
>>> https://www.vi4io.org/mailman/listinfo/io-500
>>
>> _______________________________________________
>> IO-500 mailing list
>> IO-500@vi4io.org
>> https://www.vi4io.org/mailman/listinfo/io-500