
12.1 PVFS

PVFS is a freely available, software-based solution jointly developed by Argonne National Laboratory and Clemson University. PVFS is designed to distribute data among the disks throughout the cluster and works with both serial and parallel programs. Programs can access PVFS files through traditional Unix file I/O semantics, through the MPI-2 ROMIO interface, or through the native PVFS semantics. PVFS provides a consistent namespace and transparent access using existing utilities, along with a mechanism for programming application-specific access. Although PVFS is developed on x86-based Linux platforms, it runs on some other platforms as well. It is available for both OSCAR and Rocks. PVFS2, a second-generation PVFS, is in the works.

On the downside, PVFS does not provide redundancy, does not support symbolic or hard links, and does not provide an fsck-like utility.

Figure 12-1 shows the overall architecture for a cluster using PVFS. Machines in a cluster using PVFS fall into three possibly overlapping categories based on functionality. Each PVFS filesystem has one metadata server, a filesystem management node that maintains information about the filesystem such as file ownership, access privileges, and the location of file data, i.e., the filesystem's metadata.

Figure 12-1. Internal cluster architecture


Because PVFS distributes files across the cluster nodes, the actual files are located on the disks on I/O servers. I/O servers store the data using the existing hardware and filesystem on that node. By spreading or striping a file across multiple nodes, applications have multiple paths to data. A compute node may access a portion of the file on one machine while another node accesses a different portion of the file located on a different I/O server. This eliminates the bottleneck inherent in a single file server approach such as NFS.

The remaining nodes are the client nodes. These are the actual compute nodes within the clusters, i.e., where the parallel jobs execute. With PVFS, client nodes and I/O servers can overlap. For a small cluster, it may make sense for all nodes to be both client and I/O nodes. Similarly, the metadata server can also be an I/O server or client node, or both. Once you start writing data to these machines, it is difficult to change the configuration of your system. So give some thought to what you need.

12.1.1 Installing PVFS on the Head Node

Installing and configuring PVFS is more complicated than most of the other software described in this book for a couple of reasons. First, you will need to decide how to partition your cluster. That is, you must decide which machine will be the metadata server, which machines will be clients, and which machines will be I/O servers. For each type of machine, there is different software to install and a different configuration. If a machine is going to be both a client and an I/O server, it must be configured for each role. Second, in order to limit the overhead of accessing the filesystem through the kernel, a kernel module is used. This may entail further tasks such as making sure the appropriate kernel header files are available or patching the code to account for differences among Linux kernels.

This chapter describes a simple configuration where fanny is the metadata server, a client, and an I/O server, and all the remaining nodes are both clients and I/O servers. As such, it should provide a fairly complete idea about how PVFS is set up. If you are configuring your cluster differently, you won't need to do as much. For example, if some of your nodes are only I/O nodes, you can skip the client configuration steps on those machines.

In this example, the files are downloaded, compiled, and installed on fanny since fanny plays all three roles. Once the software is installed on fanny, the appropriate pieces are pushed to the remaining machines in the cluster.

The first step, then, is to download the appropriate software. To download PVFS, first go to the PVFS home page (http://www.parl.clemson.edu/pvfs/) and follow the link to files. This site has links to several download sites. (You'll want to download the documentation from this site before moving on to the software download sites.) There are two tar archives to download: the sources for PVFS and for the kernel module.

You should also look around for any patches you might need. For example, at the time this was written, because of customizations to the kernel, the current version of PVFS would not compile correctly under Red Hat 9.0. Fortunately, a patch from http://www.mcs.anl.gov/~robl/pvfs/redhat-ntpl-fix.patch.gz was available.[1] Other patches may also be available.

[1] Despite the URL, this was an uncompressed text file at the time this was written.

Once you have the files, copy the files to an appropriate directory and unpack them.

[root@fanny src]# gunzip pvfs-1.6.2.tgz

[root@fanny src]# gunzip pvfs-kernel-1.6.2-linux-2.4.tgz

[root@fanny src]# tar -xvf pvfs-1.6.2.tar

...

[root@fanny src]# tar -xvf pvfs-kernel-1.6.2-linux-2.4.tar

...

It is simpler if you unpack both archives under the same directory. In this example, the directory /usr/local/src is used. Following the documentation that comes with PVFS, a link is created to the first directory.

[root@fanny src]# ln -s pvfs-1.6.2 pvfs

This will save a little typing but isn't essential.

Be sure to look at the README and INSTALL files that come with the sources.


Next, apply any patches you may need. As noted, with this version the kernel module sources need to be patched.

[root@fanny src]# mv redhat-ntpl-fix.patch pvfs-kernel-1.6.2-linux-2.4/

[root@fanny src]# cd pvfs-kernel-1.6.2-linux-2.4

[root@fanny pvfs-kernel-1.6.2-linux-2.4]# patch -p1 -b < \

> redhat-ntpl-fix.patch

patching file config.h.in

patching file configure

patching file configure.in

patching file kpvfsd.c

patching file kpvfsdev.c

patching file pvfsdev.c

Apply any other patches that might be needed.

The next steps are compiling PVFS and the PVFS kernel module. Here are the steps for compiling PVFS:

[root@fanny pvfs-kernel-1.6.2-linux-2.4]# cd /usr/local/src/pvfs

[root@fanny pvfs]# ./configure

...

[root@fanny pvfs]# make

...

[root@fanny pvfs]# make install

...

There is nothing new here.

Next, repeat the process with the kernel module.

[root@fanny src]# cd /usr/local/src/pvfs-kernel-1.6.2-linux-2.4

[root@fanny pvfs-kernel-1.6.2-linux-2.4]# ./configure

...

[root@fanny pvfs-kernel-1.6.2-linux-2.4]# make

...

[root@fanny pvfs-kernel-1.6.2-linux-2.4]# make install

install -c -d /usr/local/sbin

install -c mount.pvfs /usr/local/sbin

install -c pvfsd /usr/local/sbin

NOTE: pvfs.o must be installed by hand!

NOTE: install mount.pvfs by hand to /sbin if you want 'mount -t pvfs' to work

This should go very quickly.

As you can see from the output, installing the kernel module requires some additional manual steps. Specifically, you need to decide where you want to put the kernel module. The following works for Red Hat 9.0.

[root@fanny pvfs-kernel-1.6.2-linux-2.4]# mkdir \

> /lib/modules/2.4.20-6/kernel/fs/pvfs

[root@fanny pvfs-kernel-1.6.2-linux-2.4]# cp pvfs.o \

> /lib/modules/2.4.20-6/kernel/fs/pvfs/pvfs.o

If you are doing something different, you may need to poke around a bit to find the right location.
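
If you are running a different kernel, one way to locate the proper directory is to query the running kernel version and build the path from it. Here is a sketch (on this system uname -r reports 2.4.20-6; the depmod step simply refreshes the module dependency list so that modprobe can also find the module):

[root@fanny pvfs-kernel-1.6.2-linux-2.4]# uname -r

2.4.20-6

[root@fanny pvfs-kernel-1.6.2-linux-2.4]# mkdir -p /lib/modules/$(uname -r)/kernel/fs/pvfs

[root@fanny pvfs-kernel-1.6.2-linux-2.4]# cp pvfs.o /lib/modules/$(uname -r)/kernel/fs/pvfs/

[root@fanny pvfs-kernel-1.6.2-linux-2.4]# depmod -a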

12.1.2 Configuring the Metadata Server

If you have been following along, at this point you should have all the software installed on the head node, i.e., the node that will function as the metadata server for the filesystem. The next step is to finish configuring the metadata server. Once this is done, the I/O server and client software can be installed and configured.

Configuring the meta-server is straightforward. First, create a directory to store filesystem data.

[root@fanny pvfs-kernel-1.6.2-linux-2.4]# mkdir /pvfs-meta

Keep in mind that this directory is used to store information about the PVFS filesystem; the actual data is not stored here. Once PVFS is running, you can ignore this directory.

Next, create the two metadata configuration files and place them in this directory. Fortunately, PVFS provides a script to simplify the process.

[root@fanny pvfs-kernel-1.6.2-linux-2.4]# cd /pvfs-meta

[root@fanny pvfs-meta]# /usr/local/bin/mkmgrconf

This script will make the .iodtab and .pvfsdir files

in the metadata directory of a PVFS file system.

   

Enter the root directory (metadata directory):

/pvfs-meta/

Enter the user id of directory: 

root

Enter the group id of directory: 

root

Enter the mode of the root directory: 

777

Enter the hostname that will run the manager: 

fanny

Searching for host...success

Enter the port number on the host for manager: 

(Port number 3000 is the default)

3000

Enter the I/O nodes: (can use form node1, node2, ... or

nodename{#-#,#,#})

fanny george hector ida james

Searching for hosts...success

I/O nodes: fanny george hector ida james

Enter the port number for the iods: 

(Port number 7000 is the default)

7000

Done!

Running this script creates the two configuration files .pvfsdir and .iodtab. The file .pvfsdir contains permission information for the metadata directory. Here is the file the mkmgrconf script creates when run as shown.

84230

0

0

0040777

3000

fanny

/pvfs-meta/

/

The first entry is the inode number of the configuration file. The remaining entries correspond to the questions answered earlier.

The file .iodtab is a list of the I/O servers and their port numbers. For this example, it should look like this:

fanny:7000

george:7000

hector:7000

ida:7000

james:7000

Systems can be listed by name or by IP address. If the default port (7000) is used, the port can be omitted from the file.
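
For example, an equivalent .iodtab that uses hypothetical IP addresses, listed in the same order, and relies on the default port might look like this:

192.168.1.1

192.168.1.2

192.168.1.3

192.168.1.4

192.168.1.5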

The .iodtab file is an ordered list of I/O servers. Once PVFS is running, you should not change the .iodtab file. Otherwise, you will almost certainly render existing PVFS files inaccessible.


12.1.3 I/O Server Setup

To set up the I/O servers, you need to create a data directory on the appropriate machines, create a configuration file, and then push the configuration file, along with the other I/O server software, to the appropriate machines. In this example, all the nodes in the cluster including the head node are I/O servers.

The first step is to create a directory with the appropriate ownership and permissions on all the I/O servers. We start with the head node.

[root@fanny /]# mkdir /pvfs-data

[root@fanny /]# chmod 700 /pvfs-data

[root@fanny /]# chown nobody.nobody /pvfs-data

Keep in mind that these directories are where the actual pieces of a data file will be stored. However, you will not access this data in these directories directly. That is done through the filesystem at the appropriate mount point. These PVFS data directories, like the meta-server's metadata directory, can be ignored once PVFS is running.

Next, create the configuration file /etc/iod.conf using your favorite text editor. (This is optional, but recommended.) iod.conf describes the iod environment. Every line, apart from comments, consists of a key and a corresponding value. Here is a simple example:

# iod.conf-iod configuration file

datadir /pvfs-data

user nobody

group nobody

logdir /tmp

rootdir /

debug 0

As you can see, this specifies a directory for the data, the user and group under which the I/O daemon iod will run, the log and root directories, and a debug level. You can also specify other parameters such as the port and buffer information. In general, the defaults are reasonable, but you may want to revisit this file when fine-tuning your system.

While this takes care of the head node, the process must be repeated for each of the remaining I/O servers. First, create the directory and configuration file for each of the remaining I/O servers. Here is an example using the C3 utilities. (C3 is described in Chapter 10.)

[root@fanny /]# cexec mkdir /pvfs-data

...

[root@fanny /]# cexec chmod 700 /pvfs-data

...

[root@fanny /]# cexec chown nobody.nobody /pvfs-data

...

[root@fanny /]# cpush /etc/iod.conf

...

Since the configuration file is the same, it's probably quicker to copy it to each machine, as shown here, rather than re-create it.

Finally, since the iod daemon was created only on the head node, you'll need to copy it to each of the remaining I/O servers.

[root@fanny root]# cpush /usr/local/sbin/iod

...

While this example uses C3's cpush, you can use whatever you are comfortable with.

If you aren't configuring every machine in your cluster to be an I/O server, you'll need to adapt these steps as appropriate for your cluster. This is easy to do with C3's range feature.
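
For example, if only nodes 1 through 3 in your C3 cluster definition were I/O servers, something like the following should work (a sketch using C3's range syntax; the node numbering depends on your /etc/c3.conf):

[root@fanny /]# cexec :1-3 mkdir /pvfs-data

...

[root@fanny /]# cexec :1-3 chmod 700 /pvfs-data

...

[root@fanny /]# cexec :1-3 chown nobody.nobody /pvfs-data

...

[root@fanny /]# cpush :1-3 /etc/iod.conf

...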

12.1.4 Client Setup

Client setup is a little more involved. For each client, you'll need to create a PVFS device file, copy over the kernel module, create a mount point and a PVFS mount table, and copy over the appropriate executable along with any other utilities you might need on the client machine. In this example, all nodes including the head are configured as clients. But because we have already installed software on the head node, some of the steps aren't necessary for that particular machine.

First, a special character file needs to be created on each of the clients using the mknod command.

[root@fanny /]# cexec mknod /dev/pvfsd c 60 0

...

/dev/pvfsd is used to communicate between the pvfsd daemon and the kernel module pvfs.o. Once the filesystem is mounted, it allows programs to access PVFS files using traditional Unix filesystem semantics.

We will need to distribute both the kernel module and the daemon to each node.

[root@fanny /]# cpush /usr/local/sbin/pvfsd

...

[root@fanny /]# cexec mkdir /lib/modules/2.4.20-6/kernel/fs/pvfs/

...

[root@fanny /]# cpush /lib/modules/2.4.20-6/kernel/fs/pvfs/pvfs.o

...

The kernel module registers the filesystem with the kernel while the daemon performs network transfers.

Next, we need to create a mount point.

[root@fanny root]# mkdir /mnt/pvfs

[root@fanny /]# cexec mkdir /mnt/pvfs

...

This example uses /mnt/pvfs, but /pvfs is another frequently used alternative. The mount directory is where the files appear to be located. This is the directory you'll use to access or reference files.

The mount.pvfs executable is used to mount a filesystem using PVFS and should be copied to each client node.

[root@fanny /]# cpush /usr/local/sbin/mount.pvfs /sbin/

...

mount.pvfs can be invoked by the mount command on some systems, or it can be called directly.
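
For example, once the daemons described in the next section are running, either of the following should mount the filesystem on a client (a sketch; the mount -t pvfs form assumes mount.pvfs was copied to /sbin as shown above):

[root@fanny root]# /sbin/mount.pvfs fanny:/pvfs-meta /mnt/pvfs

[root@fanny root]# mount -t pvfs fanny:/pvfs-meta /mnt/pvfs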

Finally, create /etc/pvfstab, a mount table for the PVFS system. This needs to contain only a single line of information as shown here:

fanny:/pvfs-meta  /mnt/pvfs  pvfs  port=3000  0  0

If you are familiar with /etc/fstab, this should look very familiar. The first field is the path to the metadata information. The next field is the mount point. The third field is the filesystem type, which is followed by the port number. The last two fields, traditionally used to determine when a filesystem is dumped or checked, aren't currently used by PVFS. These fields should be zeros. You'll probably need to change the first two fields to match your cluster, but everything else should work as shown here.

Once you have created the mount table, push it to the remaining nodes.

[root@fanny /]# cpush /etc/pvfstab

...

[root@fanny /]# cexec chmod 644 /etc/pvfstab

...

Make sure the file is readable as shown.

While it isn't strictly necessary, there are some other files that you may want to push to your client nodes. The installation of PVFS puts a number of utilities in /usr/local/bin. You'll need to push these to the clients before you'll be able to use them effectively. The most useful include mgr-ping, iod-ping, pvstat, and u2p.

[root@fanny root]# cpush /usr/local/bin/mgr-ping

...

[root@fanny root]# cpush /usr/local/bin/iod-ping

...

[root@fanny root]# cpush /usr/local/bin/pvstat

...

[root@fanny pvfs]# cpush /usr/local/bin/u2p

...

As you gain experience with PVFS, you may want to push other utilities across the cluster.

If you want to do program development using PVFS, you will need access to the PVFS header files and libraries and the pvfstab file. By default, header and library files are installed in /usr/local/include and /usr/local/lib, respectively. If you do program development only on your head node, you are in good shape. But if you do program development on any of your cluster nodes, you'll need to push these files to those nodes. (You might also want to push the manpages as well, which are installed in /usr/local/man.)
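
For example, a PVFS-aware program might be compiled with something like the following sketch, which assumes the PVFS library was installed as /usr/local/lib/libpvfs.a and uses a hypothetical source file my-pvfs-app.c:

[root@fanny src]# gcc -I/usr/local/include -L/usr/local/lib -o my-pvfs-app my-pvfs-app.c -lpvfs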

12.1.5 Running PVFS

Finally, now that you have everything installed, you can start PVFS. You need to start the appropriate daemons on the appropriate machines and load the kernel module. To load the kernel module, use the insmod command.

[root@fanny root]# insmod /lib/modules/2.4.20-6/kernel/fs/pvfs/pvfs.o

[root@fanny root]# cexec insmod /lib/modules/2.4.20-6/kernel/fs/pvfs/pvfs.o

...

Next, run the mgr daemon on the metadata server. This is the management daemon.

[root@fanny root]# /usr/local/sbin/mgr

On each I/O server, start the iod daemon.

[root@fanny root]# /usr/local/sbin/iod

[root@fanny root]# cexec /usr/local/sbin/iod

...

Next, start the pvfsd daemon on each client node.

[root@fanny root]# /usr/local/sbin/pvfsd

[root@fanny root]# cexec /usr/local/sbin/pvfsd

...

Finally, mount the filesystem on each client.

[root@fanny root]# /usr/local/sbin/mount.pvfs fanny:/pvfs-meta /mnt/pvfs

[root@fanny /]# cexec /sbin/mount.pvfs fanny:/pvfs-meta /mnt/pvfs

...

PVFS should be up and running.[2]

[2] Although not described here, you'll probably want to make the necessary changes to your startup file so that this is all done automatically. PVFS provides scripts enablemgr and enableiod for use with Red Hat machines.

To shut PVFS down, use the umount command to unmount the filesystem, e.g., umount /mnt/pvfs, stop the PVFS processes with kill or killall, and unload the pvfs.o module with the rmmod command.
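
For example, on a node that plays all three roles, the shutdown might look something like this (a sketch; only stop the daemons that actually run on a given node):

[root@fanny root]# umount /mnt/pvfs

[root@fanny root]# killall pvfsd

[root@fanny root]# killall iod

[root@fanny root]# killall mgr

[root@fanny root]# rmmod pvfs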

12.1.5.1 Troubleshooting

There are several things you can do to quickly check whether everything is running. Perhaps the simplest is to copy a file to the mounted directory and verify that it is accessible on the other nodes, as shown below. If that test fails, there are a couple of other things you might want to try to narrow down the problem.
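
For instance, the following copies a small file into the PVFS directory and then checks that every node sees it (a sketch using C3's cexec; any small file will do):

[root@fanny root]# cp /etc/hosts /mnt/pvfs/test-file

[root@fanny root]# cexec ls -l /mnt/pvfs/test-file

...

[root@fanny root]# rm /mnt/pvfs/test-file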

First, use ps to ensure the daemons are running on the appropriate machines. For example,

[root@fanny root]# ps -aux | grep pvfsd

root     15679  0.0  0.1  1700  184 ?        S    Jun21   0:00 /usr/local/sbin/pvfsd

Of course, mgr should be running only on the metadata server and iod should be running on all the I/O servers (but nowhere else).

Each process will create a log file, by default in the /tmp directory. Look to see if these are present.

[root@fanny root]# ls -l /tmp

total 48

-rwxr-xr-x    1 root     root          354 Jun 21 11:13 iolog.OxLkSR

-rwxr-xr-x    1 root     root            0 Jun 21 11:12 mgrlog.z3tg11

-rwxr-xr-x    1 root     root          119 Jun 21 11:21 pvfsdlog.msBrCV

...

The garbage at the end of the filenames is generated to produce a unique filename.
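
If a daemon is misbehaving, its log file is a good place to start, e.g.:

[root@fanny root]# tail /tmp/iolog.*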

The mounted PVFS filesystem will appear in the listing produced by the mount command.

[root@fanny root]# mount

...

fanny:/pvfs-meta on /mnt/pvfs type pvfs (rw)

...

This should work on each node.

In addition to the fairly obvious tests just listed, PVFS provides a couple of utilities you can turn to. The utilities iod-ping and mgr-ping can be used to check whether the I/O and metadata servers are running and responding on a particular machine.

Here is an example of using iod-ping:

[root@fanny root]# /usr/local/bin/iod-ping

localhost:7000 is responding.

[root@fanny root]# cexec /usr/local/bin/iod-ping

************************* local *************************

--------- george.wofford.int---------

localhost:7000 is responding.

--------- hector.wofford.int---------

localhost:7000 is responding.

--------- ida.wofford.int---------

localhost:7000 is responding.

--------- james.wofford.int---------

localhost:7000 is responding.

The iod daemon seems to be OK on all the I/O servers. If you run mgr-ping, only the metadata server should respond.
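
You can run the same sort of check for the metadata server (output not shown); only fanny, the metadata server, should report that it is responding:

[root@fanny root]# /usr/local/bin/mgr-ping

...

[root@fanny root]# cexec /usr/local/bin/mgr-ping

...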
