So the gist of it is that currently we let applications/users pin a
given directory to a given MDS via an xattr. That takes it out of the
automatic subtree partitioning scheme so the user has direct control
over where it goes. For example, in the real world you might have an
application decide that each process is going to pin its directories
round-robin over all MDSes, or use a hashing scheme, or, if there is
some locality involved, pin its directory to an MDS running in the same
node/rack/zone/etc. The idea here is that once the user/application
decides to set their own directory pin, we stay out of the way.
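
As a minimal sketch of what such user-side pinning could look like (this is
not taken from our actual scripts): the Python below picks an MDS rank either
round-robin or by hashing the directory name, and writes it through the
ceph.dir.pin xattr. The helper names, the example paths and the MDS count are
made up for illustration, and the xattr setter is injectable so the logic can
be tried without a CephFS mount.

```python
import os
import zlib

# CephFS export-pin xattr; the value is the MDS rank as a decimal string.
PIN_XATTR = "ceph.dir.pin"


def round_robin_rank(index, num_mds):
    """Rank for the index-th directory when cycling over all MDS ranks."""
    return index % num_mds


def hashed_rank(name, num_mds):
    """Rank derived from a stable hash (CRC32) of the directory name."""
    return zlib.crc32(name.encode("utf-8")) % num_mds


def pin_directory(path, rank, setxattr=None):
    """Pin one directory to one MDS rank by setting the xattr on it."""
    if setxattr is None:
        setxattr = os.setxattr  # Linux only; needs a mounted CephFS path
    setxattr(path, PIN_XATTR, str(rank).encode("ascii"))


def pin_round_robin(dirs, num_mds, setxattr=None):
    """Pin each directory in order, cycling over ranks; return the plan."""
    plan = {}
    for index, path in enumerate(dirs):
        rank = round_robin_rank(index, num_mds)
        pin_directory(path, rank, setxattr)
        plan[path] = rank
    return plan
```

With a mounted file system, something like
pin_round_robin(["/mnt/cephfs/mdtest-easy/dir.0", "/mnt/cephfs/mdtest-easy/dir.1"], num_mds=4)
would spread the (hypothetical) per-process directories over four ranks;
swapping round_robin_rank for hashed_rank gives the hashing variant.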

If I understand correctly, the new rule means we can't pre-create any
of the sub-directories, so mdtest itself would need to set the pin
xattr on each directory, unless we can somehow go in and set them after
mdtest creates the directories but before files are written.
Alternatively, the rules would allow us to set an xattr on the
top-level directories and do the pinning behind the scenes (something
like a "use_rr_pinning" xattr). We're planning on adding something like
that anyway as a convenience option, but the rule would still limit any
kind of user-defined pinning strategy built on our existing pinning scheme.

I guess I just want to understand the rationale. Why allow the
top-level directory to be tuned but not the lower levels? Why is the
rule being introduced now, in the middle of a Call for Submissions?

Mark

On 5/28/20 1:25 PM, Julian Kunkel wrote:
Hi Mark,
just to be sure:
What you cannot do with the current ruleset is to pre-create a
directory for each individual *process* in mdtest-easy as the process
wants to create such a directory.
You can create the top-level directories, e.g., md-easy, md-hard and
configure them differently.
Setting a top-level xattr seems fine to me. Is there a serious problem
with that approach?
Best,
Julian
P.S. These statements merely reflect my own personal view; the only
mechanism for announcing official IO500 policies and decisions is the
committee(a)io500.org email address.
On Thu, May 28, 2020 at 7:17 PM Mark Nelson via IO-500 <io-500(a)vi4io.org> wrote:
> Thinking about this more, could I please ask what the rationale here
> is? Ultimately we'll do the pinning one way or another (maybe not for
> ISC20, we'll see). Right now we pin the easy mdtest subdirs
> individually in the script. We can accomplish the same thing with a
> top-level xattr and round-robin as subdirectories are created inside
> ceph itself, but that just trades user control over the pinning scheme
> for the convenience of setting a top-level xattr. The whole thing is
> pretty arbitrary except that this isn't how we do it right now.
>
>
> I'd like to understand where you guys are coming from on this one. Are
> you worried about being able to game the benchmark if you can set subdir
> xattrs? Wouldn't a real performance-focused user potentially want to
> set subdir tunings (and not just in the ceph case) for a real-world
> use-case that the easy mdtest benchmark is supposed to represent?
>
>
> Mark
>
>
> On 5/28/20 12:31 PM, Mark Nelson wrote:
>> Sigh. That means I'll need to have our ephemeral pinning code do it
>> inside ceph rather than just pinning those directories in the script
>> as I've been doing previously. Not impossible, just more work to do
>> under the time crunch while also trying to debug the API issues with
>> the new C version of the benchmark. This is getting rather frustrating.
>>
>>
>> Mark
>>
>>
>> On 5/28/20 12:24 PM, John Bent wrote:
>>> Mark and all,
>>>
>>> The committee just added a rule clarifying precreation of directories
>>> to the rules page: https://www.vi4io.org/io500/rules/submission. The
>>> newly added rule states:
>>>
>>> "Each of the four main phases (IOR easy and hard, and mdtest easy and
>>> hard) has a subdirectory which can be precreated and tuned (e.g.
>>> using tools such as lfs_setstripe or beegfs_ctl); however, additional
>>> subdirectories within these subdirectories cannot be precreated."
>>>
>>> Below my signature, I am including my standard disclaimer that my
>>> email is not necessarily an official IO500 position but note that the
>>> rules page itself is. :)
>>>
>>> Hope this is clear; please do reply with any questions or need for
>>> further clarification,
>>>
>>> Thanks,
>>>
>>> John(*)
>>> * These statements merely reflect my own personal view; the only
>>> mechanism for announcing official IO500 policies and decisions is the
>>> committee(a)io500.org email address.
>>>
>>>
>>> On Wed, May 27, 2020 at 5:14 PM John Bent <johnbent(a)gmail.com> wrote:
>>>
>>> Hey Mark,
>>>
>>> Thanks for the interest. It will be great to get your
>>> contributions!
>>>
>>> 1. Must be exactly 300 seconds.
>>> 2. Does not include the directories. Other historical submissions
>>> have tuned the directories exactly as you describe.
>>> 3. Yes, 10+ metal nodes in AWS satisfies this requirement.
>>>
>>> Other committee members, and community members, please chime in if
>>> I got anything wrong! Mark, you might note the disclaimer below
>>> my signature which is just our committee's way of being careful.
>>> I'll make sure to discuss this email with the rest of the
>>> committee and will let you know if any of my answers need official
>>> clarification.
>>>
>>> Thanks,
>>>
>>> John(*)
>>>
>>> * These statements merely reflect my own personal view; the only
>>> mechanism for announcing official IO500 policies and decisions is
>>> the committee(a)io500.org email address.
>>>
>>>
>>> On Wed, May 27, 2020 at 4:44 PM Mark Nelson via IO-500 <io-500(a)vi4io.org> wrote:
>>>
>>> Hi Folks,
>>>
>>>
>>> We are thinking about throwing together some cephfs io500 results for
>>> ISC20 and I just wanted to make sure that we are doing the right thing
>>> in a couple of cases. Any help would be much appreciated since we've
>>> never submitted results before. We might have a couple of additional
>>> questions later on, but for now:
>>>
>>>
>>> 1) "All create/write phases must run for at least 300 seconds; the
>>> stonewall flag must be set to 300 which should ensure this."
>>>
>>> Is it acceptable to set the stonewall higher than 300, or is a setting
>>> of exactly 300 required?
>>>
>>>
>>> 2) "The file names for the mdtest output files may not be pre-created."
>>>
>>> Does this also include the directories? We have the ability to pin
>>> directories to specific MDSes, which helps in the easy tests. We also
>>> have an experimental feature that more or less does this pseudo-randomly
>>> behind the scenes so long as a top level xattr is set, but it would be
>>> convenient if we could just pre-create the mdtest directories and set
>>> the xattr to pin them individually in the "directory setup" phase of the
>>> test if allowed. Likewise, we have code that allows users to provide a
>>> hint if a specific directory is expected to have lots of files, which
>>> can improve performance in the hard tests. I would like to pre-create
>>> the mdtest directory so that we can set the xattr informing ceph that we
>>> expect a lot of files to be written in that directory.
>>>
>>>
>>> 3) "Only submissions using at least 10 physical client nodes are
>>> eligible to win IO500 awards and at least one benchmark process must run
>>> on each."
>>>
>>> We are planning on running on AWS. So long as we are using 10+ metal
>>> nodes does that meet the requirement to have "at least 10 physical
>>> client nodes"?
>>>
>>>
>>> Thanks,
>>>
>>> Mark
>>>
>>> _______________________________________________
>>> IO-500 mailing list
>>> IO-500(a)vi4io.org
>>> https://www.vi4io.org/mailman/listinfo/io-500
>>>
--
Dr. Julian Kunkel
Lecturer, Department of Computer Science
+44 (0) 118 378 8218
http://www.cs.reading.ac.uk/
https://hps.vi4io.org/
PGP Fingerprint: 1468 1A86 A908 D77E B40F 45D6 2B15 73A5 9D39 A28E