Dear All,
With the explosive growth of colossal data from various academic and industrial sectors,
many High-Performance Computing (HPC) and data analytics systems have been developed to
meet the needs of data collection, processing and analysis. Accordingly, many research
groups around the world have explored unconventional and cutting-edge ideas for the
management of storage and I/O.
To give the I/O research community a global picture of the current state of the art and
its vibrant progress, the Journal of Computer Science and Technology (JCST,
http://jcst.ict.ac.cn) invited Prof. Xian-He Sun of Illinois Institute of Technology and
Prof. Weikuan Yu of Florida State University to organize the Special Section on Selected
I/O Technologies for High-Performance Computing and Data Analytics, which consists of the
following eight high-quality papers from China, Europe, Japan, and the United States.
Due to COVID-19, we have made this special section free to access. We hope that a great
number of readers and users find it interesting and useful for their respective needs and
endeavors. We thank the authors for their contributions and the reviewers for their
valuable time and effort.
Thank you.
Journal of Computer Science and Technology
05 January 2020, Volume 35 Issue 1
Special Section on Selected I/O Technologies for High-Performance Computing and Data
Analytics
Preface
Xian-He Sun, Weikuan Yu
Journal of Computer Science and Technology, 2020, 35 (1): 1-3. DOI:
10.1007/s11390-020-0001-9
Ad Hoc File Systems for High-Performance Computing
André Brinkmann, Kathryn Mohror, Weikuan Yu, Philip Carns, Toni Cortes, Scott A. Klasky,
Alberto Miranda, Franz-Josef Pfreundt, Robert B. Ross, Marc-André Vef
Journal of Computer Science and Technology, 2020, 35 (1): 4-26. DOI:
10.1007/s11390-020-9801-1
Abstract Storage backends of parallel compute clusters are still based mostly on magnetic
disks, while newer and faster storage technologies such as flash-based SSDs or
non-volatile random access memory (NVRAM) are deployed within compute nodes. Including
these new storage technologies into scientific workflows is unfortunately today a mostly
manual task, and most scientists therefore do not take advantage of the faster storage
media. One approach to systematically include node-local SSDs or NVRAMs into scientific
workflows is to deploy ad hoc file systems over a set of compute nodes, which serve as
temporary storage systems for single applications or longer-running campaigns. This paper
presents results from the Dagstuhl Seminar 17202 "Challenges and Opportunities of
User-Level File Systems for HPC" and discusses application scenarios as well as
design strategies for ad hoc file systems using node-local storage media. The discussion
includes open research questions, such as how to couple ad hoc file systems with the batch
scheduling environment and how to schedule stage-in and stage-out processes of data
between the storage backend and the ad hoc file systems. Also presented are strategies to
build ad hoc file systems by using reusable components for networking and how to improve
storage device compatibility. Various interfaces and semantics are presented, for example
those used by the three ad hoc file systems BeeOND, GekkoFS, and BurstFS. Their
presentation covers a range from file systems running in production to cutting-edge
research focusing on reaching the performance limits of the underlying devices.
Design and Implementation of the Tianhe-2 Data Storage and Management System
Yu-Tong Lu, Peng Cheng, Zhi-Guang Chen
Journal of Computer Science and Technology, 2020, 35 (1): 27-46. DOI:
10.1007/s11390-020-9799-4
Abstract With the convergence of high-performance computing (HPC), big data and artificial
intelligence (AI), the HPC community is pushing for "triple use" systems to
expedite scientific discoveries. However, supporting these converged applications on HPC
systems presents formidable challenges in terms of storage and data management due to the
explosive growth of scientific data and the fundamental differences in I/O characteristics
among HPC, big data and AI workloads. In this paper, we discuss the driving force behind
the converging trend, highlight three data management challenges, and summarize our
efforts in addressing these data management challenges on a typical HPC system at the
parallel file system, data management middleware, and user application levels. As HPC
systems are approaching the border of exascale computing, this paper sheds light on how to
enable application-driven data management as a preliminary step toward the deep
convergence of exascale computing ecosystems, big data, and AI.
Lessons Learned from Optimizing the Sunway Storage System for Higher Application I/O
Performance
Qi Chen, Kang Chen, Zuo-Ning Chen, Wei Xue, Xu Ji, Bin Yang
Journal of Computer Science and Technology, 2020, 35 (1): 47-60. DOI:
10.1007/s11390-020-9798-5
Abstract It is hard for applications to fully utilize the peak bandwidth of the
storage system in high-performance computers because of I/O interferences, storage resource
misallocations and complex long I/O paths. We performed several studies to bridge this gap
in the Sunway storage system, which serves the supercomputer Sunway TaihuLight. To locate
these issues and connections between them, an end-to-end performance monitoring and
diagnosis tool was developed to understand I/O behaviors of applications and the system.
With the help of the tool, we were able to find out the root causes of such performance
barriers at the I/O forwarding layer and the parallel file system layer. An
application-aware I/O forwarding allocation framework was used to address the I/O
interferences and resource misallocations at the I/O forwarding layer. A performance-aware
data placement mechanism was proposed to mitigate the impact of I/O interferences and
performance variations of storage devices in the PFS. Together, applications obtained much
better I/O performance. During the process, we also proposed a lightweight storage stack
to shorten the I/O path of applications with N-N I/O pattern. This paper summarizes these
studies and presents the lessons learned from the process.
Gfarm/BB—Gfarm File System for Node-Local Burst Buffer
Osamu Tatebe, Shukuko Moriwake, Yoshihiro Oyama
Journal of Computer Science and Technology, 2020, 35 (1): 61-71. DOI:
10.1007/s11390-020-9803-z
Abstract Burst buffer has become a major component to meet the I/O performance requirement
of HPC bursty traffic. This paper proposes Gfarm/BB, a file system for a burst
buffer that efficiently exploits node-local storage systems. Although node-local storages
improve storage performance, they are only available during the job allocation. Gfarm/BB
should have better access and metadata performance while it should be constructed
on-demand before the job execution. To improve the read and write performance, it exploits
the file descriptor passing and remote direct memory access (RDMA). It improves the
metadata performance by omitting the persistency and the redundancy since it is a
temporary file system. Using RDMA, write and read bandwidths are improved by 1.7x and
2.2x, respectively, compared with IP over InfiniBand (IPoIB). It achieves 14 700
operations per second in
the directory creation performance, which is 13.4x faster than the fully persistent and
redundant case. The construction of Gfarm/BB takes 0.31 seconds using 2 nodes. IOR
benchmark and ARGOT-IO application I/O benchmark show the scalable performance improvement
by exploiting the locality of node-local storages. Compared with BeeOND, Gfarm/BB shows
2.6x and 2.4x better performance in IOR write and read benchmarks, respectively, and it
shows 2.5x better performance in ARGOT-IO.
GekkoFS—A Temporary Burst Buffer File System for HPC Applications
Marc-André Vef, Nafiseh Moti, Tim Süß, Markus Tacke, Tommaso Tocci, Ramon Nou, Alberto
Miranda, Toni Cortes, André Brinkmann
Journal of Computer Science and Technology, 2020, 35 (1): 72-91. DOI:
10.1007/s11390-020-9797-6
Abstract Many scientific fields increasingly use high-performance computing (HPC) to
process and analyze massive amounts of experimental data while storage systems in
today's HPC environments have to cope with new access patterns. These patterns include
many metadata operations, small I/O requests, or randomized file I/O, while
general-purpose parallel file systems have been optimized for sequential shared access to
large files. Burst buffer file systems create a separate file system that applications can
use to store temporary data. They aggregate node-local storage available within the
compute nodes or use dedicated SSD clusters and offer a peak bandwidth higher than that of
the backend parallel file system without interfering with it. However, burst buffer file
systems typically offer many features that a scientific application, running in isolation
for a limited amount of time, does not require. We present GekkoFS, a temporary,
highly-scalable file system which has been specifically optimized for the aforementioned
use cases. GekkoFS provides relaxed POSIX semantics, offering only the features that are
actually required by most (not all) applications. GekkoFS is, therefore, able to provide
scalable I/O performance and reaches millions of metadata operations already for a small
number of nodes, significantly outperforming the capabilities of common parallel file
systems.
I/O Acceleration via Multi-Tiered Data Buffering and Prefetching
Anthony Kougkas, Hariharan Devarajan, Xian-He Sun
Journal of Computer Science and Technology, 2020, 35 (1): 92-120. DOI:
10.1007/s11390-020-9781-1
Abstract Modern High-Performance Computing (HPC) systems are adding extra layers to the
memory and storage hierarchy, named deep memory and storage hierarchy (DMSH), to increase
I/O performance. New hardware technologies, such as NVMe and SSD, have been introduced in
burst buffer installations to reduce the pressure for external storage and boost the
burstiness of modern I/O systems. The DMSH has demonstrated its strength and potential in
practice. However, each layer of DMSH is an independent heterogeneous system and data
movement among more layers is significantly more complex even without considering
heterogeneity. How to efficiently utilize the DMSH is a subject of research facing the HPC
community. Further, accessing data with high throughput and low latency is more
imperative than ever. Data prefetching is a well-known technique for hiding read latency
by requesting data before it is needed to move it from a high-latency medium (e.g., disk)
to a low-latency one (e.g., main memory). However, existing solutions do not consider the
new deep memory and storage hierarchy and also suffer from under-utilization of
prefetching resources and unnecessary evictions. Additionally, existing approaches
implement a client-pull model where understanding the application's I/O behavior
drives prefetching decisions. Moving towards exascale, where machines run multiple
applications concurrently by accessing files in a workflow, a more data-centric approach
resolves challenges such as cache pollution and redundancy. In this paper, we present the
design and implementation of Hermes: a new, heterogeneous-aware, multi-tiered, dynamic, and
distributed I/O buffering system. Hermes enables, manages, supervises, and, in some sense,
extends I/O buffering to fully integrate into the DMSH. We introduce three novel data
placement policies to efficiently utilize all layers and we present three novel techniques
to perform memory, metadata, and communication management in hierarchical buffering
systems. Additionally, we demonstrate the benefits of a truly hierarchical data prefetcher
that adopts a server-push approach to data prefetching. Our evaluation shows that, in
addition to automatic data movement through the hierarchy, Hermes can significantly
accelerate I/O and outperforms state-of-the-art buffering platforms by more than 2x.
Lastly, results show 10%-35% performance gains over existing prefetchers and over 50% when
compared to systems with no prefetching.
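As a rough illustration of the prefetching idea summarized in the abstract above, the
following Python sketch shows a toy sequential read-ahead cache of the client-pull kind
that the abstract attributes to existing approaches. It is not Hermes and does not
reflect its server-push design or its placement policies. The class name ReadAheadCache,
the slow_read callback, and the capacity and depth parameters are all hypothetical, and
the prefetch is done synchronously here for simplicity, whereas a real prefetcher would
issue it in the background.

```python
from collections import OrderedDict


class ReadAheadCache:
    """Toy sequential read-ahead prefetcher over a slow block store."""

    def __init__(self, slow_read, capacity=8, depth=2):
        self.slow_read = slow_read   # callable: block id -> data (high-latency tier)
        self.capacity = capacity     # maximum number of blocks kept in the fast tier
        self.depth = depth           # number of blocks fetched ahead of each read
        self.cache = OrderedDict()   # block id -> data, kept in LRU order
        self.hits = 0
        self.misses = 0

    def _install(self, block_id):
        """Bring a block into the fast tier, evicting the LRU block if full."""
        if block_id in self.cache:
            self.cache.move_to_end(block_id)
            return
        self.cache[block_id] = self.slow_read(block_id)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)

    def read(self, block_id):
        """Return a block, counting whether it was already in the fast tier."""
        if block_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(block_id)
        else:
            self.misses += 1
            self._install(block_id)
        data = self.cache[block_id]
        # Fetch the following blocks before the application asks for them.
        # (Synchronous in this sketch; real systems prefetch asynchronously.)
        for ahead in range(1, self.depth + 1):
            self._install(block_id + ahead)
        return data


if __name__ == "__main__":
    cache = ReadAheadCache(lambda b: f"block-{b}")
    for i in range(10):
        cache.read(i)
    print("hits:", cache.hits, "misses:", cache.misses)
```

In a purely sequential scan such as the one in the usage lines, only the first read has
to wait for the slow tier. A random access pattern would gain little from this scheme.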
Mochi: Composing Data Services for High-Performance Computing Environments
Robert B. Ross, George Amvrosiadis, Philip Carns, Charles D. Cranor, Matthieu Dorier,
Kevin Harms, Greg Ganger, Garth Gibson, Samuel K. Gutierrez, Robert Latham, Bob Robey,
Dana Robinson, Bradley Settlemyer, Galen Shipman, Shane Snyder, Jerome Soumagne, Qing
Zheng
Journal of Computer Science and Technology, 2020, 35 (1): 121-144. DOI:
10.1007/s11390-020-9802-0
Abstract Technology enhancements and the growing breadth of application workflows running
on high-performance computing (HPC) platforms drive the development of new data services
that provide high performance on these new platforms, provide capable and productive
interfaces and abstractions for a variety of applications, and are readily adapted when
new technologies are deployed. The Mochi framework enables composition of specialized
distributed data services from a collection of connectable modules and subservices. Rather
than forcing all applications to use a one-size-fits-all data staging and I/O software
configuration, Mochi allows each application to use a data service specialized to its
needs and access patterns. This paper introduces the Mochi framework and methodology. The
Mochi core components and microservices are described. Examples of the application of the
Mochi methodology to the development of four specialized services are detailed. Finally, a
performance evaluation of a Mochi core component, a Mochi microservice, and a composed
service providing an object model is performed. The paper concludes by positioning Mochi
relative to related work in the HPC space and indicating directions for future work.
ExaHDF5: Delivering Efficient Parallel I/O on Exascale Computing Systems
Suren Byna, M. Scot Breitenfeld, Bin Dong, Quincey Koziol, Elena Pourmal, Dana Robinson,
Jerome Soumagne, Houjun Tang, Venkatram Vishwanath, Richard Warren
Journal of Computer Science and Technology, 2020, 35 (1): 145-160. DOI:
10.1007/s11390-020-9822-9
Abstract Scientific applications at exascale generate and analyze massive amounts of data.
A critical requirement of these applications is the capability to access and manage this
data efficiently on exascale systems. Parallel I/O, the key technology that enables moving data
between compute nodes and storage, faces monumental challenges from new applications,
memory, and storage architectures considered in the designs of exascale systems. As the
storage hierarchy is expanding to include node-local persistent memory, burst buffers,
etc., as well as disk-based storage, data movement among these layers must be efficient.
Parallel I/O libraries of the future should be capable of handling file sizes of many
terabytes and beyond. In this paper, we describe new capabilities we have developed in
Hierarchical Data Format version 5 (HDF5), the most popular parallel I/O library for
scientific applications. HDF5 is one of the most used libraries at the leadership
computing facilities for performing parallel I/O on existing HPC systems. The
state-of-the-art features we describe include: Virtual Object Layer (VOL), Data Elevator,
asynchronous I/O, full-featured single-writer and multiple-reader (Full SWMR), and
parallel querying. In this paper, we introduce these features, their implementations, and
the performance and feature benefits to applications and other libraries.
Best Regards,
Editorial Office
Journal of Computer Science and Technology
P.O.Box 2704, Beijing 100190
P.R.China
Tel: (8610) 62610746; 62600340
Online Submission:
https://mc03.manuscriptcentral.com/jcst
E-mail: jcst@ict.ac.cn
http://jcst.ict.ac.cn