Re: [IO-500] [EXTERNAL] Re: Benchmark abstraction

Wednesday, 23 November 2016

Dear John,
this looks good. I appreciate a careful yet rapid approach and schedule.

We recently asked for funding for an I/O benchmarking workshop in March;
I hope it is funded so that we can receive additional input.
The overall schedule is timely for our BoF at ISC in June, such that
we could present the survey results and current status of the effort
then.

I would extend 1. by ultimately defining the high-level access
patterns. I think my proposal for the definition of initial access
patterns was not too bad.

I created a Google Drive Folder and document for adding points:
https://drive.google.com/drive/folders/0Byyti4QVNd-pWjJCamlRU01ZLUU
I will work on converting our discussion into the doc.

This document is to be understood as a supplement to this list,
preserving the current state of the discussion and allowing people to
"vote" on the relevance of issues.

Regards,
Julian

2016-11-23 15:58 GMT+01:00 John Bent <John.Bent(a)seagategov.com>:
...
 Upon further reflection, there is no need to be hasty. My concern is simply
 interminable discussion. Bounded discussion OTOH sounds great. How about:

 1. Open discussion from now until Jan 31 to define goals, identify concerns,
 etc.
 2. In Feb, people can submit benchmark proposals.  Two months to discuss and
 refine these.
 3. If consensus not reached by April 1, vote.
 4. Survey the top 100 sites using the chosen benchmark proposal

 Thoughts?

 Thanks

 John

 On Nov 23, 2016, at 12:37 AM, John Bent <John.Bent(a)seagategov.com> wrote:

 PLFS will do horribly in this test because it will completely fail the IOR
 hard and it will do horribly on both mdtest hard and easy.  If someone fixes
 it (like maybe the Mohror burst-fs stuff), then that seems like fair game.

 I think we’re at a bit of an impasse in terms of whether to use IOR and
 mdtest or try to create new and better benchmarks.  I fear that an attempt
 to create new and better benchmarks is doomed to failure.  We can never make
 everyone happy.  Dongarra went with Linpack.  It’s universally reviled yet
 Top 500 is massively successful.

 Now, here’s a question: do we want to be successful or do we want to advance
 science?  Because maybe they are different things.  Has Top 500 with an
 imperfect benchmark advanced science?

 In terms of our impasse, how about we spend six weeks trying to define a
 more well-liked benchmark?  If we fail, we go ahead with IOR and mdtest.
 How do we determine failure/success?  I guarantee we won’t find consensus.
 Maybe six weeks to discuss and then we submit candidates and vote?

 I also like my idea of surveying the Top 100.  I suspect if we propose IOR
 hard, IOR easy, mdtest hard, mdtest easy to them today, we will get 30 of
 them that say they will do it.  I suspect if we spend six months discussing
 and then survey them again with whatever we agree upon, we will get 30 of
 them that say they will do it.  And I think 30 is enough that we should do
 it because 30 is large enough that the rest will follow.  Heck, I think ORNL
 alone is enough that the rest will follow.

 But if our goal is not just to succeed, but actually to advance science then
 I’m willing to spend six weeks and see if we converge on something better
 than IOR and mdtest.  My prediction (which I think Andreas shares)?  We
 won’t.

 Sarp and others who haven’t yet spoken, can you please weigh in on whether:

 A. We should survey immediately with IOR and mdtest
 B. We should spend six weeks trying to find something better
 C. Some other path?

 By the way, I liked Dean’s suggestion that we change the benchmark every
 year in terms of how well it would advance science.  But it absolutely
 terrifies me as a practical matter.  This discussion every year?  Plus I
 think sites are less likely to participate if they have to learn how to run,
 and tune for, a different benchmark every year.

 Also, replies to Jay inline:

 On Nov 22, 2016, at 5:03 PM, Lofstead, Gerald F II <gflofst(a)sandia.gov>
 wrote:

 Sarp: Can you share any of the materials from your previous effort?

 I have a few other comments to add in:

 1. A better metadata testing tool is a great idea. Let’s focus on forward
 looking tools rather than clinging to old tools. My concern is how well can
 we avoid “gaming” the new tool. mdtest is well understood and can probably
 be controlled for.

 If mdtest hard and mdtest easy as we’ve discussed can be gamed, then any
 other benchmark can as well.  But if someone “games” to do well with mdtest
 hard and mdtest easy, then I do think that sets bounds for user expectations
 of metadata performance.

 2. We have had a lot of discussion of moving to object storage because we
 don’t have a choice. The vendors are addressing the needs of their 95%
 customer. I don’t think IOR is a fair test of this. It ends up being a test
 of the mapping from the data structures to objects. For example, using
 something like PLFS will be a HUGE advantage for “fixing” the IO to be more
 object oriented no matter the IO API/middleware limitations. In essence, you
 could cheat trivially.

 As I discussed above, PLFS would fail horribly.

 3. By doing mdtest and ior separately, we are decoupling the two. Striping
 issues that hit the metadata server are part of the file creation AND IO
 performance issues. Do we want to combine these in a more direct test
 somehow?

 Maybe we should do IOR hard, IOR easy, mdtest hard, mdtest easy
 independently as four measurements.  Then run all four concurrently as a
 fifth measurement?
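
 A minimal sketch of the five-measurement idea above, assuming each benchmark is wrapped in a zero-argument callable that launches it and returns its result. The wrapper names and the helper `run_five_measurements` are hypothetical; nothing here is taken from IOR or mdtest themselves.

```python
from concurrent.futures import ThreadPoolExecutor


def run_five_measurements(benchmarks):
    """benchmarks: dict mapping a name to a zero-argument callable that
    runs one benchmark and returns its result (hypothetical wrappers)."""
    # Measurements 1-4: each benchmark alone, one after another.
    isolated = {name: run() for name, run in benchmarks.items()}

    # Measurement 5: all four started together so they contend for the system.
    with ThreadPoolExecutor(max_workers=len(benchmarks)) as pool:
        futures = {name: pool.submit(run) for name, run in benchmarks.items()}
        together = {name: fut.result() for name, fut in futures.items()}

    return isolated, together
```

 In practice each callable would start a parallel job and parse its output; comparing `isolated` with `together` would show how much data and metadata traffic interfere with each other.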

 4. How much of what we are testing is intended to be the hardware vs. the
 storage software layer (e.g., Lustre) vs. the middleware (MPI-IO + PLFS vs.
 ADIOS + BP) vs. IO API (HDF/NetCDF vs. ADIOS vs. POSIX vs. MPI-IO)? Testing
 at each of these levels makes a lot of sense and has different value to
 different audiences. I’d argue that all that matters is the top level test
 since we are trying to support applications. If they do N-1 files, unless
 the system ALWAYS uses PLFS, it should suffer the stack performance
 characteristics. Doing something “simple” at a lower layer does not
 represent what end users care about—what IO performance can I expect? I
 think IOR can do a lot of it, but it isn’t a complete solution.

 I agree that all that matters is the top-level test since we are trying to
 support apps.  I’d phrase it as ‘we are trying to help apps predict their
 performance’ as this was a comment made at the BoF by someone whose name I
 sadly do not know.

 5. How do we deal with burst buffers in their various incarnations? Do we
 make rules about relative sizes of BB and main memory to decide if other
 storage systems have to be considered? Is there a different metric such as
 accessible from some external location that determines what we want to
 benchmark? Is that fair since many systems are being bought with a BB to
 hide that latency in the general case believing that there are sufficient
 IOPS and back end bandwidth to drain without slowing applications?

 I like 5 minutes of sustained IO.  If the BB is large enough to get super
 high bandwidth during 5 minutes, then I’m willing to believe it’s a good
 storage system.  Sure, someone might build a BB sized for 5 minutes and then
 not even bother to have a second tier just because they want to win IO 500.
 And they’ve done dumb stuff like that for Top 500 too.  We can’t build a
 perfect benchmark.

 But . . . I’m willing to spend some time trying to build a good one.
 Although, to be honest, I remain of the opinion that IOR and mdtest are
 already a good one.  I’ve heard good arguments against them but nothing
 sufficient to persuade me that we can do any better.  Maybe I just haven’t
 understood the arguments well enough.

 There are tons more things to consider.

 Agreed.  So many that attempting to consider them all dooms us to inaction.

 Thanks,

 John

 Best,

 Jay

 From: IO-500 <io-500-bounces(a)vi4io.org> on behalf of Julian Kunkel
 <juliankunkel(a)googlemail.com>
 Date: Tuesday, November 22, 2016 at 12:49 AM
 To: John Bent <John.Bent(a)seagategov.com>
 Cc: "io-500(a)vi4io.org" <io-500(a)vi4io.org>
 Subject: [EXTERNAL] Re: [IO-500] Benchmark abstraction

 Dear All,
 I'm not *against* using IOR, but at this stage I rather favour a clear
 separation between
 What metrics are useful to measure, and why, and, in a second step,
 How they are measured.

 This also serves as validation that we do the right thing. I have always found
 this useful when defining a test, and a benchmark is just a performance
 test for me. The intended purpose helps not only in communication but also
 prevents unintentional optimizations of systems.

 Again I agree that IOR could be the vehicle but I would hope the community
 first agrees on the metrics before there might be detailed discussions
 about the tool.

 Regards
 Julian

 On 21.11.2016 at 10:37 PM, "John Bent"
<John.Bent(a)seagategov.com> wrote:

 Thanks Sarp!  Some comments in-line.

 On Nov 21, 2016, at 2:22 PM, Oral, H. Sarp <oralhs(a)ornl.gov> wrote:

 Well, I agree with John that trying to define a new and all around benchmark
 is highly difficult. We tried that (and looked at a few other benchmarks at the
 time) and failed. No need to repeat the same mistakes, I think.

 And I also agree that the benchmarks need to be simple and easy to run and
 representative of realistic scenarios.

 Rather than limiting to two IOR instances, we can perhaps increase them
 slightly to cover more I/O workloads with IOR, if needed.

 By the way, we already have an IOR version that we integrated with ADIOS. We
 can share it with the community. And IOR already supports HDF5, and MPIIO.
 Between POSIX and these mid level libs, I think IOR covers a majority of the
 use cases. The trick is coming up with good, canned command line option sets
 for IOR covering various I/O workloads.

 There is really nothing other than mdtest for measuring metadata today, as far as I know.

 In terms of mdtest, when I speak of it, I have to admit that I’m speaking of
 a theoretical future mdtest which does not yet exist.  IOR is beautifully
 engineered with a fantastic plug-in feature as you mention.  The mdtest I’m
 envisioning takes mdtest.c from its current GitHub repository, moves it into
 the IOR GitHub repository, and rewrites it to replace the POSIX calls with calls to
 this IOR plug-in interface.  The plug-in interface is already almost a
 superset of what mdtest needs.  I think only ‘stat’ needs to be added.  That
 way, when people add new plug-ins to IOR, they will simultaneously add them
 to mdtest.  Also, for our benchmark, they’d simply pull and ‘make’ from a
 single repository.
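
 To illustrate the idea above, here is a rough sketch of a metadata test written only against a back-end interface, so that any new back end is usable by it automatically. The `Backend` class and its method names are hypothetical stand-ins chosen for this example; they are not the actual IOR plug-in interface, which is written in C.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical plug-in interface; NOT the actual IOR interface."""

    @abstractmethod
    def create(self, path): ...

    @abstractmethod
    def delete(self, path): ...

    @abstractmethod
    def stat(self, path): ...  # the one call said to be missing today


class PosixBackend(Backend):
    def create(self, path):
        # Create an empty file, failing if it already exists.
        os.close(os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644))

    def delete(self, path):
        os.unlink(path)

    def stat(self, path):
        return os.stat(path).st_size


def metadata_phases(backend, directory, count):
    """mdtest-like create/stat/delete loop written only against Backend."""
    paths = [os.path.join(directory, f"file.{i}") for i in range(count)]
    for p in paths:
        backend.create(p)
    sizes = [backend.stat(p) for p in paths]
    for p in paths:
        backend.delete(p)
    return sizes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(metadata_phases(PosixBackend(), d, 10))
```

 Because `metadata_phases` never calls POSIX directly, a back end for another storage API would only need to implement the three methods.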

 Any volunteers?  :)

 Here are what I believe to be the most recently maintained repositories:
 https://github.com/MDTEST-LANL/mdtest
 https://github.com/IOR-LANL/ior

 I have to admit that I have not yet looked at md-real-io to do a comparison.
 (sorry Julian, it is on my TODO list…)

 So, we are on board.

 Fantastic.  Our survey is now 1% complete.  :)

 Thanks,

 John

 Thanks,

 Sarp

 --
 Sarp Oral, PhD

 National Center for Computational Sciences
 Oak Ridge National Laboratory
 oralhs(a)ornl.gov
 865-574-2173

 On 11/20/16, 12:33 PM, "IO-500 on behalf of Julian Kunkel"
 <io-500-bounces(a)vi4io.org on behalf of juliankunkel(a)googlemail.com> wrote:

    Dear John,
    I would definitely not go with mdtest. That one can be well optimized by
 read ahead / sync. Also it is POSIX only.
    Note that for overcoming the caching problem, I wrote the md-real-io
 benchmark that shares many things with mdtest.
    I would wait for the community feedback, not ignore that concepts such
 as ADIOS may not necessarily fit as IOR back ends, and rather go with
 abstract definitions first.

    Regards
    Julian

    On 20.11.2016 at 6:12 PM, "John Bent"
<John.Bent(a)seagategov.com> wrote:

    To attempt defining the perfect IO benchmark is Quixotic.  Those who
 dislike IO500 will always dislike IO500 regardless of what the specific
 benchmark is.  Those who like the idea will accept an imperfect benchmark.

    Therefore, I suggest we move forward with the straw person proposal: IOR
 hard, IOR easy, mdtest hard, mdtest easy.

    * Average IOR hard and IOR easy.  Average mdtest hard and mdtest easy.
    * Their product determines the winner.
    * Don’t report the product since it’s a meaningless unit; report the
 averages.
    * e.g. The winner of IO500 is TaihuLight with a score of 250 GB/s and
 300K IOPs.
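
    A rough sketch of the scoring rule in the bullets above. It assumes "average" means the arithmetic mean, and every class name, field name, and number is hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    """One site's four measurements (all values used here are hypothetical)."""
    name: str
    ior_easy_gbs: float       # IOR easy bandwidth, GB/s
    ior_hard_gbs: float       # IOR hard bandwidth, GB/s
    mdtest_easy_kiops: float  # mdtest easy rate, thousands of ops/s
    mdtest_hard_kiops: float  # mdtest hard rate, thousands of ops/s

    @property
    def bandwidth(self) -> float:
        # Assumption: "average" is the arithmetic mean of hard and easy.
        return (self.ior_easy_gbs + self.ior_hard_gbs) / 2

    @property
    def iops(self) -> float:
        return (self.mdtest_easy_kiops + self.mdtest_hard_kiops) / 2

    @property
    def ranking_key(self) -> float:
        # Used only for ordering; not reported because its unit is meaningless.
        return self.bandwidth * self.iops


def rank(submissions):
    """Order submissions from best to worst by the product of the averages."""
    return sorted(submissions, key=lambda s: s.ranking_key, reverse=True)


def report(s):
    """Report the two averages, not their product."""
    return f"{s.name}: {s.bandwidth:g} GB/s and {s.iops:g}K IOPs"
```

    For example, a site with 400 and 100 GB/s on IOR easy and hard, and 500K and 100K IOPs on mdtest easy and hard, would be reported as 250 GB/s and 300K IOPs.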

    Unless a proposal is strictly much better than IOR hard, IOR easy, mdtest
 hard, mdtest easy, I don’t think we should consider it.  The beauty of IOR
 hard, IOR easy, mdtest hard, mdtest easy is that they are well-understood,
 well-accepted benchmarks that are trivial to download and compile, and whose
 results are immediately
 understandable.  Every RFP in the world uses them.  The one problem is they
 need a pithier name than “IOR hard, IOR easy, mdtest hard, mdtest easy”...

    My suggestion is to poll the top 100 of the
    top500.org <http://top500.org> and ask them this:

    "If we were to do an IO500, and our benchmark was IOR hard, IOR easy,
 mdtest hard, mdtest easy, would you participate?  If not, would you
 participate with a different benchmark?”

    If the bulk of the answers are “yes,” then we just figure out how to
 organize and administer this thing.
    If the bulk of the answers are “no,” then we give up and do something
 else.
    If the bulk of the answers are “no, yes,” then we need to find a new
 benchmark.

    Thanks,

    John

    On Nov 20, 2016, at 7:11 AM, Julian Kunkel <juliankunkel(a)googlemail.com>
 wrote:

    Dear all,
    based on our discussion during the BoF at SC, we could focus on the
    access pattern(s) of interest first. Later we can define which
    benchmarks (such as IOR) could implement these patterns (e.g., how to
    call existing benchmarks).

    This strategy gives other I/O paradigms the option to create a
    benchmark with that pattern that fits their I/O paradigm/architecture.

    Here is a draft of one that is probably not too difficult to discuss:
    Goal: IOmax: Sustained performance for well-formed I/O

    Rationales:
    The benchmark shall determine the best sustained I/O behavior without
    in-memory caching and I/O variability. A set of real applications that
    are highly optimized should be able to show the described access
    behavior.

    Use case: A large data structure is distributed across N
    threads/processes; a time series of this data structure shall be
    stored/retrieved efficiently. (This could be a checkpoint.)

    Processing steps:
    S0) Each thread allocates and initializes a large consecutive memory
    region of size S with a random (but well defined) pattern
    S1) Repeat T times: Each process persists/reads its data to/from the
    storage. Each iteration is protected with a global barrier and the
    runtime is measured
    S2) Compute the throughput (as IOmax) by dividing the total accessed
    data volume (N*S) by the maximum observed runtime for any single
    iteration in step S1

    Rules:
    R1) The data of each thread and timestep must be stored individually
    and cannot be overwritten during a benchmark run
    R2) It must be ensured that the time includes all processes needed to
    persist all data in volatile memory (for writes) and that, prior to the
    start of reads, no data is cached in any volatile memory
    R3) A valid result must verify that read returns the expected (random) data
    R4) N, T and S can be set arbitrarily. T must be >= 3. The benchmark
    shall be repeated several times

    Reported metrics:
    * IOmax
    * Working set size W: N*T*S
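
    The steps and rules above can be sketched as follows. This is a single-process (N = 1) illustration of the write half only, under stated assumptions: function names, the seed, and the file naming are hypothetical, `fsync` stands in for R2 on writes, and no attempt is made to drop caches before reading. A real run would use N processes with a global barrier around each iteration and take the slowest process's time.

```python
import os
import random
import tempfile
import time


def make_pattern(size, seed):
    """S0: a random but well-defined (seeded) buffer of `size` bytes."""
    return random.Random(seed).getrandbits(8 * size).to_bytes(size, "little")


def iomax(n_procs, size_bytes, runtimes_s):
    """S2: volume of one iteration (N*S) divided by the slowest iteration."""
    if len(runtimes_s) < 3:
        raise ValueError("R4 requires T >= 3 iterations")
    return n_procs * size_bytes / max(runtimes_s)


def working_set(n_procs, iterations, size_bytes):
    """Reported working set size W = N*T*S."""
    return n_procs * iterations * size_bytes


def write_phase(directory, size_bytes, iterations, seed=42):
    """Single-process stand-in for the write half of S1."""
    runtimes = []
    for t in range(iterations):
        data = make_pattern(size_bytes, seed + t)  # generated outside the timer
        path = os.path.join(directory, f"step{t}.bin")  # R1: one file per timestep
        start = time.perf_counter()
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # R2: include the time to reach stable storage
        runtimes.append(time.perf_counter() - start)
    return runtimes


def verify(directory, size_bytes, iterations, seed=42):
    """R3: read back and compare against the regenerated pattern."""
    for t in range(iterations):
        with open(os.path.join(directory, f"step{t}.bin"), "rb") as f:
            if f.read() != make_pattern(size_bytes, seed + t):
                return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        times = write_phase(d, size_bytes=1 << 20, iterations=3)
        print("verified:", verify(d, 1 << 20, 3))
        print("IOmax (bytes/s):", iomax(1, 1 << 20, times))
        print("W (bytes):", working_set(1, 3, 1 << 20))
```

    Note that dividing by the maximum runtime makes IOmax a worst-iteration figure, so a single slow iteration lowers the reported value.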

    Regards,
    Julian
    _______________________________________________
    IO-500 mailing list
    IO-500(a)vi4io.org
    https://www.vi4io.org/cgi-bin/mailman/listinfo/io-500

    STRICTLY PERSONAL AND CONFIDENTIAL. This email may contain
    confidential and proprietary material for the sole use of
    the intended recipient. Any review or distribution by others
    is strictly prohibited. If you are not the intended recipient
    please contact the sender and delete all copies.


-- 
http://wr.informatik.uni-hamburg.de/people/julian_kunkel
