
2.3 Architecture and Cluster Software

Once you have established the mission for your cluster, you can focus on its architecture and select the software. Most high-performance clusters use an architecture similar to that shown in Figure 1-5. The software described in this book is generally compatible with that basic architecture. If this does not match the mission of your cluster, you still may be able to use many of the packages described in this book, but you may need to make a few adaptations.

Putting together a cluster involves the selection of a variety of software. The possibilities are described briefly here. Each is discussed in greater detail in subsequent chapters in this book.

2.3.1 System Software

The operating system is probably one of the first selections you will want to make, but it should actually be the final software decision. When selecting an operating system, the fundamental question is compatibility. If you have a compelling reason to use a particular piece of software and it will run only under a single operating system, the choice has been made for you. For example, openMosix uses extensions to the Linux kernel, so if you want openMosix, you must use Linux. Provided the basic issue of compatibility has been met, the primary reasons to select a particular operating system are familiarity and support. Stick with what you know and what's supported.

All the software described in this book is compatible with Linux. Most, but not all, of the software will also work nicely with other Unix systems. In this book, we'll be assuming the use of Linux. If you'd rather use BSD or Solaris, you'll probably be OK with most of the software, but be sure to check its compatibility before you make a commitment. Some of the software, such as MPICH, even works with Windows.

There is a natural human tendency to want to go with the latest available version of an operating system, and there are some obvious advantages to using the latest release. However, compatibility should drive this decision as well. Don't expect clustering software to be immediately compatible with the latest operating system release. Compatibility may require that you use an older release. (For more on Linux, see Chapter 4.)

In addition to the operating system itself, you may need additional utilities or extensions to the basic services provided by the operating system. For example, to create a cluster you'll need to install the operating system and software on a large number of machines. While you could do this manually with a small cluster, it's an error-prone and tedious task. Fortunately, you can automate the process with cloning software. Cloning is described in detail in Chapter 8.

High-performance systems frequently require extensive I/O. To optimize performance, parallel file systems may be used. Chapter 12 looks at the Parallel Virtual File System (PVFS), an open source high-performance file system.

2.3.2 Programming Software

There are two basic decisions you'll need to make with respect to programming software: the programming languages you want to support and the libraries you want to use. If you have a small user base, you may be able to standardize on a single language and a single library. If you can pull this off, go for it; life will be much simpler. However, if you need to support a number of different users and applications, you may be forced to support a wider variety of programming software.

The parallel programming libraries provide a mechanism that allows you to easily coordinate computing and exchange data among programs running on the cluster. Without this software, you'll be forced to rely on operating system primitives to program your cluster. While it is certainly possible to use sockets to build parallel programs, it is a lot more work and more error prone. The most common libraries are the Message Passing Interface (MPI) and Parallel Virtual Machine (PVM) libraries.
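To give a feel for the extra work involved when you rely on operating system primitives, here is a minimal sketch in Python that sums a list by farming pieces out to workers over raw sockets. It is only an illustration under stated assumptions: threads connected by `socket.socketpair()` stand in for processes on separate cluster nodes, and the helper names (`send_msg`, `recv_msg`, `parallel_sum`) are hypothetical. Notice how much of the code deals with message framing and serialization rather than the computation itself; that bookkeeping is the kind of thing a library such as MPI or PVM takes off your hands.

```python
import json
import socket
import struct
import threading


def send_msg(sock, obj):
    """Serialize obj as JSON and send it with a 4-byte length prefix."""
    payload = json.dumps(obj).encode("utf-8")
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes; a single recv() may return fewer."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_msg(sock):
    """Receive one length-prefixed JSON message."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, length).decode("utf-8"))


def worker(sock):
    """Receive a list of numbers and reply with their sum."""
    with sock:
        numbers = recv_msg(sock)
        send_msg(sock, sum(numbers))


def parallel_sum(numbers, n_workers=4):
    """Split numbers among workers and combine their partial sums."""
    links = []
    for i in range(n_workers):
        # Assumption: a socket pair plus a thread stands in for a remote node.
        parent_end, worker_end = socket.socketpair()
        thread = threading.Thread(target=worker, args=(worker_end,))
        thread.start()
        send_msg(parent_end, numbers[i::n_workers])  # every n-th element
        links.append((parent_end, thread))

    total = 0
    for parent_end, thread in links:
        total += recv_msg(parent_end)
        parent_end.close()
        thread.join()
    return total


if __name__ == "__main__":
    print(parallel_sum(list(range(1, 101))))  # prints 5050
```

Even this toy version has to handle partial reads, framing, and cleanup by hand, and it still ignores node failures, timeouts, and data types other than JSON-friendly values.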

The choice of programming languages depends on the parallel libraries you want to use. Typically, the libraries provide bindings for only a small number of programming languages. There is no point in installing Ada if you can't link it to the parallel library you want to use. Traditionally, parallel programming libraries support C and FORTRAN, and C++ is growing in popularity. Libraries and languages are discussed in greater detail in Chapter 9.

2.3.3 Control and Management

In addition to the programming software, you'll need to keep your cluster running. This includes scheduling and management software.

Cluster management includes both routine system administration tasks and monitoring the health of your cluster. With a cluster, even a simple task can become cumbersome if it has to be replicated over a large number of systems. Just checking which systems are available can be a considerable time sink if done on a regular basis. Fortunately, there are several packages that can be used to simplify these tasks. Cluster Command and Control (C3) provides a command-line interface that extends across a cluster, allowing easy replication of tasks on each machine in a cluster or on a subset of the cluster. Ganglia provides web-based monitoring in a single interface. Both C3 and Ganglia can be used with federated clusters as well as simple clusters. C3 and Ganglia are described in Chapter 10.
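The idea of replicating one command across many machines can be sketched in a few lines of Python. This is not how C3 is implemented and it does not use C3's commands; it is a hypothetical illustration that assumes passwordless `ssh` access to each node. The names `ssh_runner`, `run_everywhere`, and `unavailable`, as well as the host names in the usage note, are invented for the example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_runner(host, command):
    """Run command on host over ssh; return (exit_code, output).

    Assumes key-based (passwordless) ssh login is already set up.
    """
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, command],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout.strip()


def run_everywhere(hosts, command, runner=ssh_runner, max_parallel=16):
    """Run the same command on every host in parallel.

    Returns a dict mapping each host to (exit_code, output).
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {host: pool.submit(runner, host, command) for host in hosts}
        return {host: future.result() for host, future in futures.items()}


def unavailable(results):
    """List the hosts whose command did not exit cleanly."""
    return sorted(host for host, (code, _) in results.items() if code != 0)
```

With hypothetical node names, `unavailable(run_everywhere(["node01", "node02"], "uptime"))` would return the nodes that did not answer. The `runner` parameter exists so that the transport can be swapped out, for example with a stub when no cluster is at hand.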

Scheduling software determines when your users' jobs will be executed. Typically, scheduling software can allocate resources, establish priorities, and do basic accounting. For Linux clusters there are two likely choices: Condor and Portable Batch System (PBS). If you need a more advanced scheduler, you might also consider Maui. PBS is available as a commercial product, PBSPro, and as open source software, OpenPBS. OpenPBS is described in Chapter 11.
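The three duties mentioned above (allocating resources, establishing priorities, and basic accounting) can be illustrated with a toy scheduler in Python. This sketch is purely hypothetical and bears no relation to the internals of Condor, PBS, or Maui. It assumes that nodes are the only resource, that a lower priority number runs first, and that users are charged for the node-hours they request at the moment a job starts.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class Job:
    priority: int                       # lower number runs first (assumption)
    seq: int                            # submission order breaks ties
    name: str = field(compare=False)
    user: str = field(compare=False)
    nodes: int = field(compare=False)
    hours: float = field(compare=False)


class ToyScheduler:
    def __init__(self, total_nodes):
        self.free_nodes = total_nodes
        self.queue = []                 # heap ordered by (priority, seq)
        self.usage = {}                 # user -> requested node-hours
        self._seq = count()

    def submit(self, name, user, nodes, hours, priority=10):
        """Queue a job; it does not start until dispatch() is called."""
        job = Job(priority, next(self._seq), name, user, nodes, hours)
        heapq.heappush(self.queue, job)

    def release(self, nodes):
        """Return nodes to the free pool when a job finishes."""
        self.free_nodes += nodes

    def dispatch(self):
        """Start jobs in strict priority order while enough nodes are free.

        Stops at the first job that does not fit (no backfilling).
        Returns the names of the jobs that were started.
        """
        started = []
        while self.queue and self.queue[0].nodes <= self.free_nodes:
            job = heapq.heappop(self.queue)
            self.free_nodes -= job.nodes                  # allocation
            charge = job.nodes * job.hours                # accounting
            self.usage[job.user] = self.usage.get(job.user, 0.0) + charge
            started.append(job.name)
        return started
```

A real scheduler also has to track running jobs, enforce time limits, and decide whether smaller jobs may jump ahead of a large job that is waiting for nodes; the strict ordering used here is simply the easiest policy to show.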
