December 2016 - IO-500 - Lists.Vi4io.Org

by Lofstead, Gerald F II

Hi all,

I just wanted to add a note before the holidays help everyone disappear for a week or two. I think we have settled on the following:

1. Anything that is capable of being treated as “storage” should have separate benchmarking. This will give us some idea how things like SCR will perform. It will also tell us the worst-case performance should data be unavailable in the fast tier, or should the fast-tier capacity be exceeded so that data migration/retargeting is required. If data migration is required, some measure of the simultaneous drain/fill should also be benchmarked.

2. We need to settle on benchmarks for traditional HPC workloads, such as engineering codes and bulk synchronous simulations with distributed, but dependent, data sets. We also need to determine what benchmarks we want to support for all phases of data movement/staging for other workloads, such as bio/genomics, chemistry, or data analytics workloads. Data distribution and reading performance are important. In the case of flash, erasing the data from a previous application needs to be included unless there is some guarantee that it won’t be an issue (doubtful).

Irene had some ideas about cloud workloads that may be different from those described above. Hopefully she can educate the rest of us on what we should include. Support for the Swift or S3 APIs, where available, is a simple but likely inadequate first step.

Sarp has tried this effort previously, but ran into serious issues. If he could share the specifics of what they ran into and/or what they developed before abandoning the project, that would be extremely valuable.

Best,
Jay


Metadata benchmarking

by Julian Kunkel

Dear all,

on our document https://docs.google.com/document/d/1zIl6XHjAJxjX3NMjNR49wkFoe42Lt9ti-Qsfi... I worked on the description of a metadata benchmark. I also prototyped the behavior in the "md-real-io" benchmark and validated that the pattern can be usefully implemented with NoSQL solutions (MongoDB) and the relational model, besides being useful for POSIX interfaces. Plugins are implemented for these; S3 and other interfaces will follow. Surely there are alternatives.

I measured the pattern on a local file system and on Lustre, and in my view it reveals the hardware performance (SSD vs. HDD and file system behavior) much better than mdtest. mdtest also has quite a lot of knobs to play with; probably I just do not know how to use them to better reveal the hardware features and avoid bulk optimizations / caching.

Some results are posted here:
* https://www.vi4io.org/tools/benchmarks/mdtest
* https://www.vi4io.org/tools/benchmarks/md-real-io

Anyway, I wanted to move on with the high-level discussion and have here some thoughts about the access patterns / use case. I'm open to other use cases that make sense in big data and HPC. Here is an excerpt of MDsmall (more in the document):

MDsmall
* Goal: Determine the performance for accessing small data objects independently.
* The term data object refers to data that is created/accessed independently of all other data objects.

Rationales:
* The benchmark shall determine the sustained performance of creating, accessing and deleting data objects while preventing caching.
* It shall simulate interactive usage, as usage of a system by users often leads to such small accesses, since certain data artifacts such as source code are small. For example, some users compile a program consisting of many source code pieces; on a file system, they may run “ls”, “stat”, “cat”, or editors.
* Alternatively, it simulates a producer-consumer / stream processing system.

Use case:
* N processes independently work on data objects; they behave like a producer-consumer system / stream processing engine. Each process is a consumer reading/deleting a data object and a producer creating a new data object based on the previous data. A process consumes data from a few (fixed) other processes and also produces data for a few other processes.
* The objects are considered to be distributed across multiple logical data sets (such as directories, buckets, databases, …); each data set is considered to be a queue for the producers/consumers. The exact mapping from logical to physical object does not matter, but all processes shall be able to access all objects at any time.

Example:
* The producer/consumer queues can be considered to lead to the following “communication” pattern (receive == read/delete, send == write).
* Assume D=1: a process receives data from the left process and sends data to the right. (Virtually, the process has actually processed the data and produced some product, but we don’t care about the processing.)

Regards,
Julian
--
http://wr.informatik.uni-hamburg.de/people/julian_kunkel
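To make the D=1 example above concrete, here is a minimal sketch of the ring pattern on a POSIX directory tree. It is not the md-real-io implementation: the function name `run_ring`, the `queue<rank>` directories and the `obj<i>` file names are all hypothetical, the N "processes" are emulated sequentially in a single Python process, and nothing is done to prevent caching, so it illustrates only the access pattern and not a usable measurement.

```python
import os
import tempfile
import time


def run_ring(base_dir, n_procs=4, iterations=5, obj_size=1024):
    """Emulate the D=1 producer/consumer ring sequentially (illustrative only)."""
    # One directory per emulated process serves as its queue ("data set").
    queues = [os.path.join(base_dir, f"queue{rank}") for rank in range(n_procs)]
    for queue in queues:
        os.makedirs(queue)
    payload = b"x" * obj_size

    # Precreate one object per queue so every consumer has something to read.
    for queue in queues:
        with open(os.path.join(queue, "obj0"), "wb") as f:
            f.write(payload)

    ops = {"read": 0, "delete": 0, "create": 0}
    start = time.perf_counter()
    for i in range(iterations):
        for rank in range(n_procs):
            # Receive == read + delete from the process's own queue.
            src = os.path.join(queues[rank], f"obj{i}")
            with open(src, "rb") as f:
                data = f.read()
            ops["read"] += 1
            os.remove(src)
            ops["delete"] += 1
            # Send == write a new object into the right neighbour's queue.
            dst = os.path.join(queues[(rank + 1) % n_procs], f"obj{i + 1}")
            with open(dst, "wb") as f:
                f.write(data)
            ops["create"] += 1
    elapsed = time.perf_counter() - start

    # Objects still waiting in the queues after the last iteration.
    leftover = sum(len(os.listdir(queue)) for queue in queues)
    return ops, leftover, elapsed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ops, leftover, elapsed = run_ring(tmp)
        print(ops, leftover, f"{elapsed:.4f}s")
```

A real run would use separate processes (for example via MPI), larger object counts per data set, and a working set big enough to defeat client-side caches.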


Workshop on 23rd and 24th

by Julian Kunkel

Dear all,

as mentioned before, to foster the benchmarking effort we are organizing a workshop about the "understanding of I/O performance behavior" on March 23rd and 24th, 2017, at DKRZ, Hamburg. We invite each of you to participate in this workshop and (if you are interested) give a talk related to this topic. Of course, you can also participate as a regular attendee without giving a talk.

Please fill out the form to indicate your interest in the workshop and, optionally, a preliminary title: https://goo.gl/forms/nj1b7NR2Vb4gw7ek2

== Abstract ==

Understanding I/O performance behavior is crucial to optimize I/O-intense applications but also the infrastructure of data centers. However, with the dawn of new technologies such as NVRAM, burst-buffers, active storage/function shipping, and network attached memory, the complexity of storage infrastructure increases significantly and the boundary between memory and storage blurs. During the procurement of new systems, data centers have to ensure that the application's needs are met. Therefore, they need to define the proper requirements for storage and provide I/O benchmarks that represent application workloads to quantify and verify I/O performance.

The main goal of the workshop is the discussion of tools to identify (in-)efficient usage of I/O resources on modern storage subsystems from the perspective of users and data centers. The workshop covers:
* a discussion of design alternatives of storage architectures and their implications on user workflows;
* telemetry and monitoring information necessary to understand actual rather than reported I/O activity, enabling efficient performance optimization of system and applications;
* the development of representative benchmarks resembling the applications' needs.

The discussion of alternative storage architectures lays the foundation for the requirements of the monitoring and benchmarking efforts.

Speakers involved in storage and file system research will present experience in alternative storage architectures, application workflows, monitoring tools to identify bottlenecks in I/O, and (benchmarking) tools to quantify I/O performance. Scientists involved in various application domains can give an introduction to their workflows and I/O requirements. By bringing together application developers/users and I/O experts, we support the development of tools to identify and quantify I/O inefficiencies that support users and data centers.

Workshop homepage: http://wr.informatik.uni-hamburg.de/events/2017/uiop

If you have any questions, contact me or Jay.

Regards,
Julian

