
12.2 Using PVFS

To make effective use of PVFS, you need to understand how PVFS distributes files across the cluster. PVFS uses a simple striping scheme controlled by three parameters.


base

The I/O server on which the file starts, given as an index; the first I/O server is 0. Typically, this defaults to 0.


pcount

The number of I/O servers among which the file is partitioned. Typically, this defaults to the total number of I/O servers.


ssize

The size of each strip, i.e., each contiguous block of data written to a single I/O server. Typically, this defaults to 64 KB.

Figure 12-2 should help clarify how files are distributed. In the figure, the file is broken into eight pieces and distributed among four I/O servers. base is the index of the first I/O server used; pcount is the number of servers used, four in this case; and ssize is the size of each of the eight pieces. The idea, of course, is to select a strip size that optimizes parallel access to the file.

Figure 12-2. Overlap within files
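Given these parameters, and assuming the usual round-robin layout, you can work out where any byte of a file lives: the strip containing byte offset x is stored on I/O server

base + (x / ssize) mod pcount

where the division is integer division. In the figure, for instance, pieces 0 and 4 both land on the first server.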


You can examine the distribution of a file using the pvstat utility. For example,

[root@fanny pvfs]# pvstat data

data: base = 0, pcount = 5, ssize = 65536

[root@fanny pvfs]# ls -l data

-rw-r--r--    1 root     root     10485760 Jun 21 12:49 data

A little arithmetic shows this file is broken into 160 strips, with 32 on each of the five I/O servers.
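The numbers come straight from the pvstat and ls output above:

10,485,760 bytes / 65,536 bytes per strip = 160 strips
160 strips / 5 I/O servers = 32 strips per server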

If you copy a file to a PVFS filesystem using cp, it will be partitioned automatically for you using what should be reasonable defaults. For more control, you can use the u2p utility. With u2p, the command-line option -s sets the stripe size; -b specifies the base; and -n specifies the number of nodes. Here is an example:

[root@fanny /]# u2p -s16384 data /mnt/data

1 node(s); ssize = 8192; buffer = 0; nanMBps (0 bytes total)

[root@fanny /]# pvstat /mnt/data

/mnt/data: base = 0, pcount = 1, ssize = 8192

Typically, u2p is used to convert an existing file for use with a parallel program.

While the Unix system calls read and write work with PVFS without any changes, large numbers of small accesses will not perform well. The buffered routines from the standard I/O library (e.g., fread and fwrite) should perform better, provided an adequate buffer is used.
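For example, the following sketch reads a file in small records but lets the standard I/O library fetch a full strip at a time. The file name is hypothetical, and the buffer size is simply matched to the 64 KB default strip size.

#include <stdio.h>

#define BUFSIZE (64*1024)   /* match the default 64 KB strip size */

int main(void)
{
    /* /mnt/pvfs/data is a hypothetical file on a mounted PVFS volume */
    FILE *fp = fopen("/mnt/pvfs/data", "r");
    static char iobuf[BUFSIZE];
    char record[128];

    if (fp == NULL)
        return 1;

    /* Hand stdio a buffer the size of one strip; setvbuf() must be
       called before the first read. Small fread() calls are then
       satisfied from memory rather than by separate requests to the
       I/O servers. */
    setvbuf(fp, iobuf, _IOFBF, BUFSIZE);

    while (fread(record, sizeof(record), 1, fp) == 1) {
        /* process one 128-byte record ... */
    }

    fclose(fp);
    return 0;
}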

To make optimal use of PVFS, you will need to write your programs to use PVFS explicitly. This can be done through the native PVFS API provided by the libpvfs.a library. Details can be found in Using the Parallel Virtual File System, part of the documentation available at the PVFS web site. Programming examples are included with the source in the examples subdirectory. Clearly, you should understand your application's data requirements before you begin programming.
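To give a flavor of the native interface, here is a sketch that creates a file with explicit striping parameters. It follows the pvfs_open() call and struct pvfs_filestat described in the PVFS documentation, but treat the exact signature, the O_META flag, and the header name as assumptions to be checked against the pvfs.h shipped with your installation.

#include <fcntl.h>
#include <pvfs.h>   /* header name and location may vary; see your installation */

int main(void)
{
    /* base 0, all I/O servers (a pcount of -1), 64 KB strips; the
       last two fields (soff, bsize) are unused. These field meanings
       follow the PVFS documentation -- verify them against your headers. */
    struct pvfs_filestat pstat = {0, -1, 64*1024, 0, 0};
    int fd;

    /* O_META asks pvfs_open() to take the striping parameters from
       pstat; /mnt/pvfs/newdata is a hypothetical file name. */
    fd = pvfs_open("/mnt/pvfs/newdata", O_CREAT|O_RDWR|O_META, 0644, &pstat);
    if (fd < 0)
        return 1;

    /* ... pvfs_write(), pvfs_lseek(), etc. ... */

    pvfs_close(fd);
    return 0;
}

Link the program against the library, e.g., cc -o newdata newdata.c -lpvfs.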

Alternatively, PVFS can be used with the ROMIO MPI-IO interface from http://www.mcs.anl.gov. ROMIO is included with both MPICH and LAM/MPI. (If you compile ROMIO yourself, you need to specify PVFS support. Typically, you use the configure options -lib=/usr/local/lib/libpvfs.a and -file_system=pvfs+nfs+ufs.) ROMIO provides two important optimizations: data sieving and two-phase I/O. Additional information is available at the ROMIO web site.
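As a quick illustration of the ROMIO approach, in the sketch below every process opens the same PVFS file collectively and reads its own 64 KB chunk through the standard MPI-IO calls; ROMIO applies its optimizations behind these calls. The file name is hypothetical.

#include <mpi.h>

#define CHUNK 65536           /* 64 KB per process, matching the strip size */

int main(int argc, char *argv[])
{
    int rank;
    char buf[CHUNK];
    MPI_File fh;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* All processes open the file collectively. */
    MPI_File_open(MPI_COMM_WORLD, "/mnt/pvfs/data",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    /* Each process reads the 64 KB chunk at the offset given by its rank. */
    MPI_File_read_at(fh, (MPI_Offset)rank * CHUNK, buf, CHUNK,
                     MPI_BYTE, &status);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

ROMIO also lets you pass striping hints to PVFS through an MPI_Info object when a file is created; the hint names are listed in the ROMIO documentation.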
