Definitely like GitHub for issues. Personally, I prefer the mailing list over Slack for discussions, though. For me, it feels easier to search and easier to group things into one coherent thread. I might be showing my age with this preference. :)
On Mon, Jun 1, 2020 at 9:35 AM Julian Kunkel via IO-500 <
io-500@vi4io.org> wrote:
Thanks all for your reporting,
I personally agree with the approach: we should try to keep the
mailing list for the most important discussions, and post such error
messages as GitHub issues or discuss them on Slack to minimize
traffic.
Thanks & Best,
Julian
On Mon, Jun 1, 2020 at 4:05 PM Mark Nelson via IO-500 <io-500@vi4io.org> wrote:
>
> On 6/1/20 8:44 AM, Pinkesh Valdria via IO-500 wrote:
>
> > I made some progress over the weekend troubleshooting why the find
> > phase was not working, but I am not out of the woods yet. I would
> > appreciate it if someone could confirm that I am on the right path,
> > or acknowledge that the items below are known issues with workarounds.
> >
> > Thanks for your help. It's a long email, but detailed, to avoid a
> > lot of back-and-forth.
> >
> > Here are the differences I found and the workarounds I had to use.
> >
> > *Issue 1:*
> >
> > The C version of io500 expects a field labelled external-extra-args
> > in the config.ini file, but the shell version (io500.sh) looks for a
> > field labelled external-args (without "-extra-"); see the line
> > below:
> >
> > https://github.com/VI4IO/io500-app/blob/master/io500.sh#L229
> >
> >
> > io500_find_cmd_args="$(get_ini_param find external-args)"
> >
> >
> > less config-full.ini
> >
> > [find]
> >
> > # Set to an external script to perform the find phase
> >
> > external-script =
> >
> > # Extra arguments for external scripts
> >
> > external-extra-args =
> >
> > # Startup arguments for external scripts
> >
> > external-mpi-args =
> >
> > # Set the number of processes for pfind
> >
> > nproc =
> >
> > # Pfind queue length
> >
> > pfind-queue-length = 10000
> >
> > # Pfind Steal from next
> >
> > pfind-steal-next = FALSE
> >
> > # Parallelize the readdir by using hashing. Your system must support this!
> >
> > pfind-parallelize-single-dir-access-using-hashing = FALSE
> >
> > *Temporary workaround*: I changed the io500.sh code to look for:
> > io500_find_cmd_args="$(get_ini_param find external-extra-args)"
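The workaround above can be sketched as a one-line sed substitution rather than a hand edit (a sketch only, assuming GNU or BSD sed; `io500.sh.demo` is a stand-in file created here so the substitution can be shown end to end):

```shell
# Demo stand-in for the real io500.sh line (hypothetical file name).
echo 'io500_find_cmd_args="$(get_ini_param find external-args)"' > io500.sh.demo

# Rewrite the lookup to the key name that config-full.ini documents;
# -i.bak keeps a backup of the original file.
sed -i.bak 's/find external-args/find external-extra-args/' io500.sh.demo

cat io500.sh.demo   # now reads the external-extra-args key
```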
> >
> > *Issue 2*: I had to set io500_find_mpi to True in the io500.sh
> > script to avoid getting this error: "io500_find_mpi: unbound
> > variable". But I don't know if there is a different way to set the
> > value for the C version using the config.ini file; can someone share
> > how to pass that value to the C-app version?
> >
> > function setup_find {
> >
> > io500_find_mpi="True"
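A minimal sketch of an alternative, assuming bash with "set -u" active (which is what makes an unset io500_find_mpi abort with "unbound variable"): a parameter-expansion default keeps any value set elsewhere and only falls back to True when the variable is unset.

```shell
# "set -u" makes reading an unset variable a fatal error, reproducing
# the "unbound variable" failure mode from the script.
set -u

# ${var:-default} is safe under set -u: use the existing value if the
# variable is set and non-empty, otherwise fall back to "True".
io500_find_mpi="${io500_find_mpi:-True}"

echo "io500_find_mpi=$io500_find_mpi"
```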
> >
> > *Issue 3*: How do I validate that the parameters I am setting in the
> > config.ini file are being used at runtime? I set the following, but
> > don't see them below:
> >
> > [find]
> >
> > external-script = /mnt/beeond/io500-app/bin/pfind
> >
> > #nproc = 30
> >
> > pfind-queue-length = 2000
> >
> > pfind-steal-next = TRUE
> >
> > pfind-parallelize-single-dir-access-using-hashing = FALSE
> >
> > The io500 C version hangs at the command below, and I don't see the
> > queue length, steal-next, etc.:
> >
> > [find]
> >
> > t_start = 2020-06-01 08:28:30
> >
> > exe = /mnt/beeond/io500-app/bin/pfind ./out//2020.06.01-06.57.23-app
> > -newer ./out//2020.06.01-06.57.23-app/timestampfile -size 3901c -name
> > "*01*"
> >
> > nproc = 1
> >
> > *Issue 4*: I applied a manual workaround to make the find phase work
> > by setting some parameters in io500.sh, but they are not passed to
> > the C version, and hence that fails.
> >
> > In io500.sh script:
> >
> > function setup_find {
> >
> > io500_find_mpi="True"
> >
> > io500_find_cmd_args="-N -q 2000 -s $io500_stonewall_timer -r
> > $io500_result_dir/pfind_results"
> >
> >
> > Non-C version (script) – success:
> >
> > [Starting] find
> >
> > [Exec] mpiexec --allow-run-as-root -mca btl self -x UCX_TLS=rc,self,sm
> > -x HCOLL_ENABLE_MCAST_ALL=0 -mca coll_hcoll_enable 0 -x
> > UCX_IB_TRAFFIC_CLASS=105 -x UCX_IB_GID_INDEX=3 -n 30 -npernode 10
> > --hostfile /mnt/beeond/hostsfile.cn /mnt/beeond/io500-app/bin/pfind
> > ./out//2020.06.01-06.57.23-scr -newer
> > ./out//2020.06.01-06.57.23-scr/timestampfile -size 3901c -name "*01*"
> > -N -q 2000 -s 300 -r ./results//2020.06.01-06.57.23-scr/pfind_results
> >
> > [Results] in ./results//2020.06.01-06.57.23-scr/find.txt.
> >
> > [FIND] MATCHED 28170/15966876 in 44.6593 seconds
> >
> > [RESULT] IOPS phase 3 find 357.520
> > kiops : time 44.66 seconds
> >
> > io500 C version – hangs at the command below, and I don't see the
> > queue length, steal-next, etc.:
> >
> > [find]
> >
> > t_start = 2020-06-01 08:28:30
> >
> > exe = /mnt/beeond/io500-app/bin/pfind ./out//2020.06.01-06.57.23-app
> > -newer ./out//2020.06.01-06.57.23-app/timestampfile -size 3901c -name
> > "*01*"
> >
> > nproc = 1
> >
> > *Issue 5*: As you can see in Issue 4, I am passing some parameters
> > using bash variables. If I do the same in the config.ini file, they
> > are passed as-is, without being interpreted by the bash script. How
> > do I pass such variables to the C version of io500?
> >
> > I already tried the two variants below, and they are not interpreted
> > when processed by the io500 C-app version.
> >
> > In config.ini file:
> >
> > external-extra-args = -s \$io500_stonewall_timer -r
> > \$io500_result_dir/pfind_results
> >
> > or
> >
> > external-extra-args = -s $io500_stonewall_timer -r
> > $io500_result_dir/pfind_results
> >
> > [find]
> >
> > t_start = 2020-05-31 11:43:52
> >
> > exe = /mnt/beeond/io500-app/bin/pfind -s $io500_stonewall_timer -r
> > $io500_result_dir/pfind_results ./out//2020.05.31-10.52.56-app -newer
> > ./out//2020.05.31-10.52.56-app/timestampfile -size 3901c -name "*01*"
> >
> > nproc = 1
> >
> > [find]
> >
> > t_start = 2020-05-31 15:55:52
> >
> > exe = /mnt/beeond/io500-app/bin/pfind -s \$io500_stonewall_timer -r
> > \$io500_result_dir/pfind_results ./out//2020.05.31-15.55.38-app -newer
> > ./out//2020.05.31-15.55.38-app/timestampfile -size 3901c -name "*01*"
> >
> > nproc = 1
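One workaround sketch for this: since the C app reads config.ini verbatim and never passes it through bash, expand the variables at the time the file is *generated* instead (an unquoted here-document expands them). The values and the config-find.ini file name below are example placeholders, not the real run settings.

```shell
# Example placeholder values standing in for the script's variables.
io500_stonewall_timer=300
io500_result_dir=./results/demo

# Unquoted EOF: bash expands ${...} while writing the file, so the C app
# later sees literal, already-expanded option values.
cat > config-find.ini <<EOF
[find]
external-script = /mnt/beeond/io500-app/bin/pfind
external-extra-args = -s ${io500_stonewall_timer} -r ${io500_result_dir}/pfind_results
EOF

cat config-find.ini
```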
> >
> > This is a small test cluster:
> >
> > less 2020.06.01-06.57.23-scr/result_summary.txt
> >
> > [RESULT] BW phase 1 ior_easy_write 6.062
> > GiB/s : time 362.17 seconds
> >
> > [RESULT] IOPS phase 1 mdtest_easy_write 7.300
> > kiops : time 2054.93 seconds
> >
> > [RESULT] BW phase 2 ior_hard_write 1.605
> > GiB/s : time 321.58 seconds
> >
> > [RESULT] IOPS phase 2 mdtest_hard_write 3.173
> > kiops : time 304.69 seconds
> >
> > [RESULT] IOPS phase 3 find 357.520
> > kiops : time 44.66 seconds
> >
> > [RESULT] BW phase 3 ior_easy_read 8.269
> > GiB/s : time 265.51 seconds
> >
> > [RESULT] IOPS phase 4 mdtest_easy_stat 144.149
> > kiops : time 104.06 seconds
> >
> > [RESULT] BW phase 4 ior_hard_read 3.847
> > GiB/s : time 134.18 seconds
> >
> > [RESULT] IOPS phase 5 mdtest_hard_stat 82.220
> > kiops : time 11.76 seconds
> >
> > [RESULT] IOPS phase 6 mdtest_easy_delete 54.334
> > kiops : time 276.07 seconds
> >
> > [RESULT] IOPS phase 7 mdtest_hard_read 22.822
> > kiops : time 42.37 seconds
> >
> > [RESULT] IOPS phase 8 mdtest_hard_delete 8.042
> > kiops : time 123.97 seconds
> >
> > [SCORE] Bandwidth 4.19424 GiB/s : IOPS 31.5378 kiops : TOTAL 11.5012
> >
> > C-version app – partial result (it hangs at find):
> >
> > [root@inst-5n58i-good-crow results]# less
> > 2020.06.01-06.57.23-app/result.txt | egrep "\[|score"
> >
> > [ior-easy-write]
> >
> > score = 6.136592
> >
> > [mdtest-easy-write]
> >
> > score = 41.913878
> >
> > [timestamp]
> >
> > [ior-hard-write]
> >
> > score = 1.538194
> >
> > [mdtest-hard-write]
> >
> > score = 3.012750
> >
> > [find]
> >
> > [root@inst-5n58i-good-crow results]#
> >
>
> FWIW the default version of find is still working the same as it did
> previously for us. This is not really related to what you are seeing,
> but what I'm still noticing is that we're getting an unreasonable find
> advantage when our easy and hard create results are highly skewed (which
> I'm still working on debugging since we are still sometimes seeing a lot
> of variation). This is from an 8 node development cluster:
>
>
> First run (C-app in this case):
>
> [RESULT] ior-easy-write 59.443067 GiB/s : time 467.383 seconds
> [RESULT] mdtest-easy-write 277.570531 kIOPS : time 369.282 seconds
> [RESULT] ior-hard-write 11.037549 GiB/s : time 333.391 seconds
> [RESULT] mdtest-hard-write 48.336249 kIOPS : time 328.619 seconds
> [RESULT] find 512.379261 kIOPS : time 230.722 seconds
> [RESULT] ior-easy-read 70.285275 GiB/s : time 397.338 seconds
> [RESULT] mdtest-easy-stat 627.847802 kIOPS : time 165.087 seconds
> [RESULT] ior-hard-read 13.879120 GiB/s : time 275.325 seconds
> [RESULT] mdtest-hard-stat 179.562335 kIOPS : time 1221.149 seconds
> [RESULT] mdtest-easy-delete 140.591170 kIOPS : time 917.852 seconds
> [RESULT] mdtest-hard-read 43.742740 kIOPS : time 362.040 seconds
> [RESULT] mdtest-hard-delete 11.804945 kIOPS : time 1341.281 seconds
> [SCORE] Bandwidth 28.284600 GB/s : IOPS 124.102186 kiops : TOTAL 59.246778
>
>
> Second run (Script):
>
> [RESULT] BW phase 1 ior_easy_write 55.296 GiB/s : time
> 477.21 seconds
> [RESULT] BW phase 2 ior_hard_write 10.749 GiB/s : time
> 331.08 seconds
> [RESULT] BW phase 3 ior_easy_read 67.308 GiB/s : time
> 391.89 seconds
> [RESULT] BW phase 4 ior_hard_read 14.842 GiB/s : time
> 239.77 seconds
> [RESULT] IOPS phase 1 mdtest_easy_write 278.851 kiops : time
> 367.23 seconds
> [RESULT] IOPS phase 2 mdtest_hard_write 8.455 kiops : time
> 323.59 seconds
> [RESULT] IOPS phase 3 find 1470.910 kiops : time
> 71.48 seconds
> [RESULT] IOPS phase 4 mdtest_easy_stat 967.127 kiops : time
> 105.89 seconds
> [RESULT] IOPS phase 5 mdtest_hard_stat 47.136 kiops : time
> 58.04 seconds
> [RESULT] IOPS phase 6 mdtest_easy_delete 224.795 kiops : time
> 455.53 seconds
> [RESULT] IOPS phase 7 mdtest_hard_read 46.749 kiops : time
> 58.52 seconds
> [RESULT] IOPS phase 8 mdtest_hard_delete 28.490 kiops : time
> 99.08 seconds
> [SCORE] Bandwidth 27.7589 GiB/s : IOPS 121.449 kiops : TOTAL 58.0628
>
>
> We don't really "deserve" a score of 58 on that second run because the
> only reason find was so fast was that we wrote out far less
> mdtest_hard_write data.
>
>
> Mark
>
> > From: Pinkesh Valdria <pinkesh.valdria@oracle.com>
> > Date: Saturday, May 30, 2020 at 3:01 AM
> > To: Andreas Dilger <adilger@dilger.ca>
> > Cc: <io-500@vi4io.org>
> > Subject: Re: [IO-500] Io500 runs twice - is that expected starting
> > 2020 ?
> >
> > Thanks Andreas,
> >
> > The C-app benchmark failed; I mean it never completed and there was
> > no score at the end. I waited for 8 hours before I hit Ctrl-C. It
> > only has 4 results vs. the 12 results in the first section. Are
> > there special logs to check for the C version of the benchmark?
> >
> > The below is a very small development cluster I am using until I
> > figure out how to run IO500 correctly.
> >
> > [Leaving] datafiles in ./out//2020.05.29-17.31.34-scr
> >
> > [Summary] Results files in ./results//2020.05.29-17.31.34-scr
> >
> > [Summary] Data files in ./out//2020.05.29-17.31.34-scr
> >
> > [RESULT] BW phase 1 ior_easy_write 6.188
> > GiB/s : time 357.44 seconds
> >
> > [RESULT] BW phase 2 ior_hard_write 1.132
> > GiB/s : time 367.41 seconds
> >
> > [RESULT] BW phase 3 ior_easy_read 8.090
> > GiB/s : time 273.37 seconds
> >
> > [RESULT] BW phase 4 ior_hard_read 3.726
> > GiB/s : time 111.69 seconds
> >
> > [RESULT] IOPS phase 1 mdtest_easy_write 4.263
> > kiops : time 3518.34 seconds
> >
> > [RESULT] IOPS phase 2 mdtest_hard_write 2.953
> > kiops : time 303.52 seconds
> >
> > [RESULT] IOPS phase 3 find 91.550
> > kiops : time 173.64 seconds
> >
> > [RESULT] IOPS phase 4 mdtest_easy_stat 137.243
> > kiops : time 109.30 seconds
> >
> > [RESULT] IOPS phase 5 mdtest_hard_stat 84.140
> > kiops : time 10.65 seconds
> >
> > [RESULT] IOPS phase 6 mdtest_easy_delete 55.311
> > kiops : time 271.19 seconds
> >
> > [RESULT] IOPS phase 7 mdtest_hard_read 21.778
> > kiops : time 41.16 seconds
> >
> > [RESULT] IOPS phase 8 mdtest_hard_delete 7.133
> > kiops : time 129.24 seconds
> >
> > [SCORE] Bandwidth 3.81212 GiB/s : IOPS 24.1149 kiops : TOTAL 9.58796
> >
> > The io500.sh was run
> >
> > Running the C version of the benchmark now
> >
> > IO500 version io500-isc20
> >
> > [RESULT] ior-easy-write 6.233210 GiB/s : time 359.216 seconds
> >
> > [RESULT] mdtest-easy-write 3.750549 kIOPS : time 3999.448 seconds
> >
> > [RESULT] ior-hard-write 1.415903 GiB/s : time 349.737 seconds
> >
> > [RESULT] mdtest-hard-write 3.006432 kIOPS : time 305.006 seconds
> >
> > ^C
> >
> > [root@inst-q7cdd-good-crow io500-app]#
> >
> > From: Andreas Dilger <adilger@dilger.ca>
> > Date: Saturday, May 30, 2020 at 2:33 AM
> > To: Pinkesh Valdria <pinkesh.valdria@oracle.com>
> > Cc: <io-500@vi4io.org>
> > Subject: Re: [IO-500] Io500 runs twice - is that expected starting
> > 2020 ?
> >
> > Hi Pinkesh,
> >
> > The dual runs of the IO500 benchmark for this list are intentional,
> >
> > and documented in the README-ISC20.txt file in the source tree.
> >
> > This is to allow comparison between the historical io500.sh script and the
> >
> > new C application that runs the same IOR, mdtest, and find commands.
> >
> > Please submit both results for ISC'20.
> >
> > We wanted to be sure that the transition to the new C-app didn't
> >
> > introduce any errors in the results. The need to run the benchmark
> >
> > twice will hopefully be gone for the SC'20 list.
> >
> > Cheers, Andreas(*)
> >
> > * speaking on my own behalf and not on behalf of the IO500 board
> >
> >
> >
> >
> > On May 29, 2020, at 16:02, Pinkesh Valdria via IO-500
> > <io-500@vi4io.org> wrote:
> >
> >
> >
> > Hello IO-500 experts,
> >
> > I am trying to configure io500. When I run it, it runs twice: the
> > first run is the regular one, and the 2nd is labelled "Running the C
> > version of the benchmark now". Is that because I misconfigured it,
> > or is it required to run both starting in 2020? My config*.ini file
> > is below.
> >
> > [root@inst-q7cdd-good-crow io500-app]# ./io500.sh config-test1.ini
> >
> > System: inst-q7cdd-good-crow
> >
> > …..
> >
> > Running the IO500 Benchmark now
> >
> > [Creating] directories
> >
> > …..
> >
> > [Summary] Results files in ./results//2020.05.29-17.31.34-scr
> >
> > [Summary] Data files in ./out//2020.05.29-17.31.34-scr
> >
> > [RESULT] BW phase 1 ior_easy_write 6.188 GiB/s :
> > time 357.44 seconds
> >
> > [RESULT] BW phase 2 ior_hard_write 1.132 GiB/s :
> > time 367.41 seconds
> >
> > [RESULT] BW phase 3 ior_easy_read 8.090 GiB/s
> > : time 273.37 seconds
> >
> > [RESULT] BW phase 4 ior_hard_read 3.726 GiB/s :
> > time 111.69 seconds
> >
> > [RESULT] IOPS phase 1 mdtest_easy_write 4.263 kiops
> > : time 3518.34 seconds
> >
> > [RESULT] IOPS phase 2 mdtest_hard_write 2.953 kiops
> > : time 303.52 seconds
> >
> > [RESULT] IOPS phase 3 find 91.550 kiops :
> > time 173.64 seconds
> >
> > [RESULT] IOPS phase 4 mdtest_easy_stat 137.243 kiops :
> > time 109.30 seconds
> >
> > [RESULT] IOPS phase 5 mdtest_hard_stat 84.140 kiops :
> > time 10.65 seconds
> >
> > [RESULT] IOPS phase 6 mdtest_easy_delete 55.311 kiops :
> > time 271.19 seconds
> >
> > [RESULT] IOPS phase 7 mdtest_hard_read
> > 21.778 kiops : time 41.16 seconds
> >
> > [RESULT] IOPS phase 8 mdtest_hard_delete 7.133
> > kiops : time 129.24 seconds
> >
> > [SCORE] Bandwidth 3.81212 GiB/s : IOPS 24.1149 kiops : TOTAL 9.58796
> >
> > The io500.sh was run
> >
> > Running the C version of the benchmark now
> >
> > IO500 version io500-isc20
> >
> > <currently running …when I posted this question ….
> >
> > ***************************************************
> >
> > config-test1.ini (END)
> >
> > ***************************************************
> >
> > [global]
> >
> > datadir = ./out/
> >
> > resultdir = ./results/
> >
> > timestamp-resultdir = TRUE
> >
> > # Choose parameters that are very small for all benchmarks
> >
> > [debug]
> >
> > stonewall-time = 300 # for testing
> >
> > [ior-easy]
> >
> > transferSize = 2m
> >
> > blockSize = 102400m
> >
> > [mdtest-easy]
> >
> > API = POSIX
> >
> > # Files per proc
> >
> > n = 500000
> >
> > [ior-hard]
> >
> > API = POSIX
> >
> > # Number of segments10000000
> >
> > segmentCount = 400000
> >
> > [mdtest-hard]
> >
> > API = POSIX
> >
> > # Files per proc 1000000
> >
> > n = 40000
> >
> > [find]
> >
> > external-script = /mnt/beeond/io500-app/bin/pfind
> >
> > pfind-parallelize-single-dir-access-using-hashing = FALSE
> >
> > ***************************************************
> >
> > _______________________________________________
> > IO-500 mailing list
> > IO-500@vi4io.org
> > https://www.vi4io.org/mailman/listinfo/io-500
> >
> >
>
--
Dr. Julian Kunkel
Lecturer, Department of Computer Science
+44 (0) 118 378 8218
http://www.cs.reading.ac.uk/
https://hps.vi4io.org/
PGP Fingerprint: 1468 1A86 A908 D77E B40F 45D6 2B15 73A5 9D39 A28E
_______________________________________________
IO-500 mailing list
IO-500@vi4io.org
https://www.vi4io.org/mailman/listinfo/io-500