Dear all,
in our document
https://docs.google.com/document/d/1zIl6XHjAJxjX3NMjNR49wkFoe42Lt9ti-Qsfi...
I have worked on the description of a metadata benchmark.
I also prototyped the behavior in the "md-real-io" benchmark and
validated that the pattern can be implemented not only for POSIX
interfaces but also with NoSQL solutions (MongoDB) and the relational
model.
Plugins for these are implemented already; S3 and other interfaces will follow.
Surely there are alternatives.
I measured the pattern on a local file system and on Lustre; in my
view, it reveals the hardware performance (SSD vs. HDD, and file
system behavior) much better than mdtest does.
Admittedly, mdtest has quite a lot of knobs to play with, so perhaps I
just do not know how to better expose the hardware characteristics and
avoid bulk optimizations / caching.
Some results are posted here:
* https://www.vi4io.org/tools/benchmarks/mdtest
* https://www.vi4io.org/tools/benchmarks/md-real-io
Anyway, I wanted to move on with the high-level discussion, so here
are some thoughts about the access patterns / use cases. I'm open to
other use cases that make sense for big data and HPC.
Here is an excerpt of MDsmall (more in the document):
MDsmall
* Goal: Determine the performance for accessing small data objects
independently.
* The term data object refers to data that is created/accessed
independently of all other data objects.
Rationales:
* The benchmark shall determine the sustained performance of creating,
accessing, and deleting data objects while preventing caching.
* It shall simulate interactive usage: users of a system often issue
such small accesses, as certain data artifacts such as source code are
small. For example, users who compile a program consisting of many
source code pieces on a file system may run "ls", "stat", "cat", or
editors.
* Alternatively, it simulates a producer-consumer / stream processing system.
Use case:
* N processes independently work on data objects; they behave like a
producer-consumer system / stream processing engine. Each process is a
consumer, reading/deleting a data object, and a producer, creating a
new data object based on the previous data. A process consumes data
from a few (fixed) other processes and likewise produces data for a
few other processes.
* The objects are considered to be distributed across multiple logical
data sets (such as directories, buckets, databases, …); each data set
is considered to be a queue for the producers/consumers. The exact
mapping from logical to physical objects does not matter, but all
processes shall be able to access all objects at any time.
Example:
* The producer/consumer queues can be considered to lead to the
following "communication" pattern (receive == read/delete, send ==
write).
* Assume D=1: a process receives data from the left process and sends
data to the right. (Conceptually, the process has processed the
received data and produced some product, but we do not care about the
processing itself.)
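The D=1 ring pattern above can be sketched with plain POSIX files,
using one directory per logical data set as a queue. This is a
minimal single-process simulation only; the directory layout, object
size, iteration count, and naming scheme are illustrative assumptions
of mine, not the actual parameters or code of md-real-io:

```python
# Sketch of the MDsmall D=1 producer-consumer pattern on POSIX files.
# Each rank consumes (read + delete) the object in its own data set,
# then produces a new object into the right neighbor's data set.
import os
import tempfile

N, ITER, SIZE = 4, 3, 3901   # ranks, iterations, object size (all illustrative)

root = tempfile.mkdtemp()
for rank in range(N):                       # one data set (directory) per queue
    os.mkdir(os.path.join(root, "dataset.%d" % rank))

def obj(ds, it):
    # Path of the object for iteration `it` in data set `ds`.
    return os.path.join(root, "dataset.%d" % ds, "obj.%d" % it)

for rank in range(N):                       # seed iteration 0 for every queue
    with open(obj(rank, 0), "wb") as f:
        f.write(b"x" * SIZE)

for it in range(ITER):
    for rank in range(N):
        with open(obj(rank, it), "rb") as f:    # receive == read ...
            data = f.read()
        os.unlink(obj(rank, it))                # ... and delete
        with open(obj((rank + 1) % N, it + 1), "wb") as f:
            f.write(data)                       # send == write to the right

# After the run, each queue again holds exactly one (new) object.
counts = [len(os.listdir(os.path.join(root, "dataset.%d" % r)))
          for r in range(N)]
print(counts)  # -> [1, 1, 1, 1]
```

In a real MPI run the ranks would execute the inner loop concurrently,
and the benchmark would time the create/access/delete operations
rather than move payload around.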
Regards,
Julian
--
http://wr.informatik.uni-hamburg.de/people/julian_kunkel