The Parallel Virtual File System for High-Performance Computing Clusters
Monica Kashyap; Jenwei Hsieh, Ph.D.; Christopher Stanton; and Rizwan Ali (November 2002)
In the process of continuously improving the performance of Linux® commodity cluster systems, the development of a high-speed parallel file system has become a necessary component for parallel computing. The Parallel Virtual File System (PVFS) was created to provide such a parallel file system for high-performance computing (HPC) clusters and large, I/O-intensive parallel applications. This article describes performance evaluations conducted by Dell on a cluster of DellTM PowerEdgeTM 2650 servers using PVFS.
The Parallel Virtual File System (PVFS) Project at Clemson University was conceived to create an open-source parallel file system for clusters of PCs running the Linux® operating system. PVFS has been widely used as a high-performance, large parallel file system for temporary storage and as an infrastructure for parallel I/O research.
As a parallel file system, PVFS stores data on the existing local file systems of multiple cluster nodes; many clients can access this data simultaneously. A parallel file system has several advantages:
- Provides a global name space
- Distributes data across multiple disks
- Allows the use of different user interfaces
- Can include additional I/O interfaces to support larger files
Within a high-performance computing (HPC) cluster, PVFS enables high-performance I/O that is comparable to that achieved through other proprietary file systems, such as the Intel® Parallel File System on the Accelerated Strategic Computing Initiative (ASCI) Red supercomputer. In October 2000, Argonne National Laboratory achieved an I/O throughput of 1.05 GB/sec on a PVFS cluster. This I/O throughput shows that the performance of a PC cluster is comparable to that of a proprietary system; in April 1997, 1.0 GB/sec throughput had been reached using the commercial file system of the proprietary ASCI Red system. This article describes tests conducted by Dell to evaluate the performance of PVFS on a cluster of DellTM PowerEdgeTM 2650 servers.
Understanding the structure of PVFS
To provide high-speed access to the file system across the cluster, PVFS stripes file data across multiple disks of certain cluster nodes known as I/O nodes. Striping allows the storage capacity across the cluster to be quite large, depending on the number of nodes in the cluster. The resulting storage capacity provides users with a large global scratch space for the entire cluster.
Like many network file systems or parallel file systems, PVFS is implemented using a client-server architecture. It employs a group of collaborative user-space server processes (daemons) to provide a cluster-wide, consistent name space and to stripe data across multiple cluster nodes. Messages between PVFS clients and servers are exchanged over TCP/IP for reliable communications. All PVFS file system data is stored on the I/O nodes' local file systems, which can encompass one partition on a disk drive, the entire disk drive, or a logical volume of many disk drives using any supported local Linux file system, such as ext2, ext3, and ReiserFS.
PVFS uses three types of nodes: the management node, the I/O node, and the compute node (see Figure 1 ). One cluster node can assume one, two, or all three of these functions simultaneously. The current PVFS implementation has one management node, one or more I/O nodes, and multiple compute nodes.
Figure 1. Logical view of PVFS
The management node
A management node runs the metadata server (the mgr daemon), which handles all file metadata. Metadata contains file descriptions, including the name of the file, its location in the directory hierarchy, its owner, and how it is distributed across the I/O nodes.
The I/O node
An I/O node stores file data for the file system. It runs the I/O server (the iod daemon), which handles the storage and retrieval of data from disk.
The compute node
A compute node runs applications. It uses libpvfs, a client-side library of I/O calls that provides low-level access to the PVFS servers. This library handles the data transfer between user buffers and PVFS servers, and it hides these operations from users.
When opening, closing, creating, or removing a file, an application on the compute node communicates directly with the metadata server through libpvfs. Once the management node locates a file, it returns the file's location to the application. The application can then use libpvfs to directly contact the appropriate I/O nodes for read-write operations without communicating with the metadata server (see Figure 2 ).
Figure 2. Metadata and data flow in PVFS
The compute nodes also have Linux kernel support that allows them to mount file systems from a PVFS server in the same manner as they would under Network File System (NFS). This functionality lets existing programs access all PVFS files without modification.
PVFS offers three interfaces through which it can be accessed:
- PVFS native application programming interface (API): This UNIX® -like interface for PVFS file access is available through libpvfs. The API enables specifications of how files will be striped across the I/O nodes.
- Linux kernel interface: This interface allows PVFS to be accessed and manipulated from the user space in the same manner as local Linux file systems. Through this interface, existing applications or common utilities can manipulate data stored on PVFS.
- ROMIO interface: ROMIO is a portable implementation of MPI-IO (Message Passing Interface-I/O), a parallel I/O interface in MPI-2. This interface allows programmers to access PVFS files through MPI.
Configuring the PVFS test environment
To evaluate the performance of PVFS, Dell tested the scalability of the file system on a cluster of 40 rack-optimized Dell PowerEdge 2650 servers, 24 of which were used for compute nodes and 16 for I/O nodes. One of the I/O nodes also ran the metadata server. Each PowerEdge 2650 contained two Intel XeonTM processors at 2.4 GHz with 512 KB of level 2 (L2) cache, 2 GB of main memory, and two integrated Gigabit Ethernet1 interfaces.
Each compute node had one 36 GB Ultra3 (U160) SCSI system disk, and each I/O node had five 33.6 GB Ultra3 (U160) SCSI disks. The nodes were interconnected with a MyricomTM MyrinetTM -2000 switch.
The software environment included PVFS version 1.5.4 built on the Red Hat® Linux 7.3 operating system, which was used on all the cluster nodes. The cluster used version 1.5.1 of the Myricom GM message-passing system.
Dell used an MPI-IO test program called perf, which is part of the ROMIO source code. It performs concurrent read and write operations to the same file. To each MPI process, perf assigns a fixed-size data array of either 4 MB, 16 MB, or 64 MB. The array is written to a fixed disjoint region of the shared file using MPI_File_write() and is read from the same region of the file using MPI_File_read(). All MPI processes use MPI_Barrier() to synchronize before each I/O operation.
The perf program measures four types of concurrent I/O operations:
- Write operations without file synchronization
- Write operations with file synchronization (using MPI_File_sync())
- Read operations without file synchronization
- Read operations after file synchronization
These tests provide an upper bound on the MPI-IO performance expected from a given set of I/O nodes and the file system. Using as many as 24 compute nodes, Dell ran perf against the PVFS file system, which incorporated up to 16 I/O nodes.
Evaluating PVFS performance
Dell performed initial tests using the first disk on the I/O nodes for the operating system and the remaining four disks for PVFS. Software RAID-0 (striping with no fault tolerance) was used to create a single 32 GB logical unit number (LUN) that comprised an 8 GB partition from each of the four 33.6 GB hard drives. Figures 3 through 6 show the results obtained for the four types of perf tests. In each figure, the y-axis represents the I/O throughput in megabytes per second and the x-axis shows the number of compute nodes accessing the I/O nodes. In these tests, the compute nodes used a 64 MB access size.
When write operations occurred without file synchronization, PVFS achieved good scalability using 12 or 16 I/O nodes, which could handle the write requests from 24 compute nodes while achieving more than 1.7 GB/sec and 2.0 GB/sec throughput, respectively, without peaking (see Figure 3 ). However, when the system used 4 I/O nodes, 8 compute nodes were enough to saturate the I/O network after achieving 0.6 GB/sec throughput. Similarly, for 8 I/O nodes, 16 compute nodes saturated the network after reaching a throughput of 1.2 GB/sec.
Figure 3. Bandwidth from write operations without file synchronization
For write operations with file synchronizations, the measurement includes the time required for a call to the MPI_File_sync() routine after the write. This routine forces file updates to propagate to the storage device before the routine returns. The aggregate bandwidth was much lower than that achieved without synchronization; Figure 4 shows that I/O throughput peaked at roughly 675 MB/sec with 16 I/O nodes.
Figure 4. Bandwidth from write operations with file synchronization
Figure 5 shows the results obtained for read operations without file synchronization. The system displayed good scalability when using 16 I/O nodes: it achieved close to 1.8 GB/sec throughput without reaching a peak. This type of operation shows no significant performance difference compared to read operations after file synchronization (see Figure 6 ).
Figure 5. Bandwidth from read operations without file synchronization
Figure 6. Bandwidth from read operations after file synchronization
Dell conducted an additional study using 16 I/O nodes to understand the performance differences, if any, between RAID-0 configurations with one, two, three, or four hard drives on each I/O node. The results show minimal differences between the four drive configurations when the I/O nodes wrote to the file without file synchronization (see Figure 7 ). On the other hand, the number of hard drives did affect write access with file synchronization-the more hard drives, the better the performance.
Figure 7. Bandwidth from write operations, without (left) and with (right) file synchronization, using varying numbers of hard drives and 16 I/O nodes
To observe the effect of access size on performance, Dell measured the aggregate bandwidths achieved when each compute node used access sizes of 4 MB, 16 MB, or 64 MB. The results show that, as the number of compute nodes increases, peak performance increases with larger access size (see Figure 8 ). The difference is negligible with read accesses but is significant with write accesses.
Figure 8. Bandwidth from read (left) and write (right) operations, after file synchronization, using different access sizes and 16 I/O nodes
Enhancing the performance of PVFS
At this time, PVFS effectively provides a global scratch space for HPC clusters, but it is still being tested and tuned for better performance. The aggregate read and write bandwidths obtained from the performance tests show that a Dell cluster running PVFS efficiently uses the Myrinet network.
Currently, PVFS is limited because it uses TCP for all communication, which inhibits some performance. The project is redesigning PVFS to use Virtual Interface Architecture (VIA), GM, and other interfaces, in addition to TCP. The fault tolerance of PVFS also needs improvement. For instance, if one node fails in the cluster, the entire file system becomes unreachable. Because the metadata contains all the information about the cluster, the metadata server becomes a single point of failure. However, PVFS has the potential to enhance future experiments of high-performance parallel I/O and parallel file systems.
Monica Kashyap (firstname.lastname@example.org) is a systems engineer in the Scalable Systems Group at Dell. She has a B.S. in Applied Science and Computer Engineering from the University of North Carolina at Chapel Hill.
Jenwei Hsieh, Ph.D. (email@example.com) is a member of the Scalable Systems Group at Dell. Jenwei is responsible for developing high-performance clusters. He has published more than 30 technical papers in the area of multimedia computing and communications, high-speed networking, serial storage interfaces, and distributed network computing. Jenwei has a Ph.D. in Computer Science from the University of Minnesota and a B.E. from Tamkang University in Taiwan.
Christopher Stanton (firstname.lastname@example.org) is a senior systems engineer in the Scalable Systems Group at Dell. His HPC cluster-related interests include cluster installation, management, and performance benchmarking. Christopher graduated from the University of Texas at Austin with a B.S. and special honors in Computer Science.
Rizwan Ali (email@example.com) is a systems engineer working in the Scalable Systems Group at Dell. His current research interests are performance benchmarking and high-speed interconnects. Rizwan has a B.S. in Electrical Engineering from the University of Minnesota.
For more information