
3.1 Design Decisions

While you may have some idea of what you want, it is still worthwhile to review the implications of your choices. There are several closely related and overlapping issues to consider when acquiring PCs for the nodes in your cluster:

  • Will you have identical systems or a mixture of hardware?

  • Will you scrounge for existing computers, buy assembled computers, or buy the parts and assemble your own computers?

  • Will you have full systems with monitors, keyboards, and mice, minimal systems, or something in between?

  • Will you have dedicated computers, or will you share your computers with other users?

  • Do you have a broad or shallow user base?

This is the most important thing I'll say in this chapter: if at all possible, use identical systems for your nodes. Life will be much simpler. You'll need to develop and test only one configuration, and then you can clone the remaining machines. When programming your cluster, you won't have to consider different hardware capabilities as you attempt to balance the workload among machines. Maintenance and repair will also be easier, since you will have less to become familiar with and will need to keep fewer parts on hand. You can certainly use heterogeneous hardware, but it will be more work.

In constructing a cluster, you can scrounge for existing computers, buy assembled computers, or buy the parts and assemble your own. Scrounging is the cheapest way to go, but it is often the most time consuming. Scrounged systems usually mean a wide variety of hardware, which creates both hardware and software problems, and older scrounged systems are more likely to develop hardware problems of their own. If this is your only option, try to standardize the hardware as much as possible. When acquiring computers, look for people doing bulk upgrades. If you can find someone replacing a number of computers at one time, there is a good chance the computers being replaced were themselves a bulk purchase and will be very similar or identical. These could come from a computer laboratory at a college or university or from an IT department doing a periodic upgrade.

Buying new, preassembled computers may be the simplest approach if money isn't the primary concern. This is often the best approach for mission-critical applications or when time is a critical factor. Buying new is also the safest way to go if you are uncomfortable assembling computers. Most system integrators will allow considerable latitude over what to include with your systems, particularly if you are buying in bulk. If you are using a system integrator, try to have the integrator provide a list of MAC addresses and label each machine.

Building your own system is cheaper, provides higher performance and reliability, and allows for customization. Assembling your own computers may seem daunting, but it isn't that difficult. You'll need time, personnel, space, and a few tools. It's a good idea to build a single system and test it for hardware and software compatibility before you commit to a large bulk order. Even if you do buy preassembled computers, you will still need to do some testing and maintenance. Unfortunately, even new computers are occasionally DOA.[1] So the extra time may be less than you'd think. And by building your own, you'll probably be able to afford more computers.

[1] Dead on arrival: nonfunctional when first installed.

If you are constructing a dedicated cluster, you will not need full systems. The more you can leave out of each computer, the more computers you will be able to afford, and the less you will need to maintain on individual computers. For example, with dedicated clusters you can probably do without monitors, keyboards, and mice for the individual compute nodes. Minimal machines have the smallest footprint, allowing larger clusters when space is limited, and they have lower power and air conditioning requirements. With a minimal configuration, wiring is usually significantly easier, particularly if you use rack-mounted equipment. (However, heat dissipation can be a serious problem with rack-mounted systems.) Minimal machines also have the advantage of being less likely to be reallocated by middle management.

The size of your user base will also affect your cluster design. With a broad user base, you'll need to prepare for a wider range of potential uses: more application software and more system tools. This implies more secondary storage and, perhaps, more memory. There is also an increased likelihood that your users will need direct access to individual nodes.

Shared machines, i.e., computers that have other uses in addition to their role as a cluster node, may be a way of constructing a part-time cluster that would not be possible otherwise. If your cluster is shared, then you will need complete, fully functioning machines. While this book won't focus on such clusters, it is certainly possible to have a setup that is a computer lab on work days and a cluster on the weekend, or office machines by day and cluster nodes at night.

3.1.1 Node Hardware

Obviously, your computers need adequate hardware for all intended uses. If your cluster includes workstations that are also used for other purposes, you'll need to consider those other uses as well. This probably means acquiring fairly standard workstations. For a dedicated cluster, you determine your own needs, and there may be a lot you can leave out: audio cards and speakers, video capture cards, etc. Beyond these obvious expendables, there are other parts you might consider omitting, such as disk drives, keyboards, mice, and displays. However, you should be aware of some of the potential problems you'll face with a truly minimalist approach. This subsection is a quick review of the design decisions you'll need to make.

3.1.1.1 CPUs and motherboards

While you can certainly purchase CPUs and motherboards from different sources, you need to select each with the other in mind. These two items are the heart of your system, and for optimal performance you'll need total compatibility between the two. If you are buying your systems piece by piece, consider buying an Intel- or AMD-compatible motherboard with an installed CPU. However, you should be aware that some motherboards with permanently affixed CPUs are poor performers, so choose with care.

You should also buy your equipment from a known, trusted source with a reputable warranty. For example, in recent years a number of boards have been released with low-grade electrolytic capacitors. While these capacitors work fine initially, the board life is disappointingly brief. People who bought these boards from fly-by-night companies were out of luck.

In determining the performance of a node, the most important factors are processor clock rate, cache size, bus speed, memory capacity, disk access speed, and network latency. The first four are determined by your selection of CPU and motherboard. And if you are using integrated EIDE interfaces and network adapters, all six are at least influenced by your choice of CPU and motherboard.

Clock speed can be misleading. It is most useful for comparing processors within the same family; comparisons across families are an unreliable way to gauge performance. For example, an AMD Athlon 64 may outperform an Intel Pentium 4 running at the same clock rate. Processor speed is also very application dependent. If your data set fits within the large cache in a Prescott-core Pentium 4 but won't fit in the smaller cache in an Athlon, you may see much better performance with the Pentium.

Selecting a processor is a balancing act. Your choice will be constrained by cost, performance, and compatibility. Remember, the rationale behind a commodity off-the-shelf (COTS) cluster is buying machines that have the most favorable price to performance ratio, not pricey individual machines. Typically you'll get the best ratio by purchasing a CPU that is a generation behind the current cutting edge. This means comparing the numbers. When comparing CPUs, you should look at the increase in performance versus the increase in the total cost of a node. When the cost starts rising significantly faster than the performance, it's time to back off. When a 20 percent increase in performance raises your cost by 40 percent, you've gone too far.
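
To make the comparison concrete, here is a minimal sketch in Python of the kind of arithmetic involved. The prices and benchmark scores are made-up, illustrative figures, not vendor quotes; substitute real numbers for the systems you are considering.

    # Hypothetical price/performance comparison for two candidate nodes.
    # All figures below are illustrative placeholders, not real quotes.
    nodes = [
        # (description, total node cost in dollars, relative benchmark score)
        ("previous-generation CPU", 900, 100),
        ("cutting-edge CPU", 1260, 120),
    ]

    base_cost, base_perf = nodes[0][1], nodes[0][2]
    for name, cost, perf in nodes:
        perf_gain = (perf - base_perf) / base_perf * 100
        cost_increase = (cost - base_cost) / base_cost * 100
        print(f"{name}: {perf / cost:.3f} performance per dollar "
              f"(+{perf_gain:.0f}% performance for +{cost_increase:.0f}% cost)")

With these particular numbers, the cutting-edge node offers 20 percent more performance for 40 percent more money, which by the guideline above means it is time to back off.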

Since Linux works with most major chip families, stay mainstream and you shouldn't have any software compatibility problems. Nonetheless, it is a good idea to test a system before committing to a bulk purchase. Since a primary rationale for building your own cluster is the economic advantage, you'll probably want to stay away from the less common chips. While clusters built with UltraSPARC systems may be wonderful performers, few people would describe these as commodity systems. So unless you just happen to have a number of these systems that you aren't otherwise using, you'll probably want to avoid them.[2]

[2] Radajewski and Eadline's Beowulf HOWTO refers to "Computer Shopper"-certified equipment. That is, if equipment isn't advertised in Computer Shopper, it isn't commodity equipment.

With standalone workstations, the overall benefit of multiple processors (i.e., SMP systems) is debatable since a second processor can remain idle much of the time. A much stronger argument can be made for the use of multiple processor systems in clusters where heavy utilization is assured. They add additional CPUs without requiring additional motherboards, disk drives, power supplies, cases, etc.

When comparing motherboards, look to see what is integrated into the board. There are some significant differences. Serial, parallel, and USB ports along with EIDE disk adapters are fairly standard. You may also find motherboards with integrated FireWire ports, a network interface, or even a video interface. While you may be able to save money with built-in network or display interfaces (provided they actually meet your needs), make sure they can be disabled should you want to install your own adapter in the future. If you are really certain that some fully integrated motherboard meets your needs, eliminating the need for daughter cards may allow you to go with a small case. On the other hand, expandability is a valuable hedge against the future. In particular, having free memory slots or adapter slots can be crucial at times.

Finally, make sure the BIOS Setup options are compatible with your intended configuration. If you are building a minimal system without a keyboard or display, make sure the BIOS will allow the machine to boot without them attached; not all BIOSs will.

3.1.1.2 Memory and disks

Subject to your budget, the more cache and RAM in your system, the better. Typically, the faster the processor, the more RAM you will need. A very crude rule of thumb is one byte of RAM for every floating-point operation per second. So a processor capable of 100 MFLOPs would need around 100 MB of RAM. But don't take this rule too literally.
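
If it helps to see the rule in code, the following Python fragment is nothing more than the arithmetic above; treat the results as a rough starting point, not a requirement.

    # "One byte of RAM per FLOPS" rule of thumb: 1 MFLOPS -> roughly 1 MB of RAM.
    # This is only a crude guideline; actual needs are application dependent.
    for mflops in (100, 500, 1000):
        print(f"{mflops} MFLOPS -> roughly {mflops} MB of RAM")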

Ultimately, what you will need depends on your applications. Paging creates a severe performance penalty and should be avoided whenever possible. If you are paging frequently, then you should consider adding more memory. It comes down to matching the memory size to the cluster application. While you may be able to get some idea of what you will need by profiling your application, if you are creating a new cluster for as yet unwritten applications, you will have little choice but to guess what you'll need as you build the cluster and then evaluate its performance after the fact. Having free memory slots can be essential under these circumstances.

Which disks to include, if any, is perhaps the most controversial decision you will make in designing your cluster. Opinions vary widely. The cases both for and against diskless systems have been grossly overstated. This decision is one of balancing various tradeoffs. Different contexts tip the balance in different directions. Keep in mind, diskless systems were once much more popular than they are now. They disappeared for a reason. Despite a lot of hype a few years ago about thin clients, the reemergence of these diskless systems was a spectacular flop. Clusters are, however, a notable exception. Diskless clusters are a widely used, viable approach that may be the best solution in some circumstances.

There are a number of obvious advantages to diskless systems. There is a lower cost per machine, which means you may be able to buy a bigger cluster with better performance. With rapidly declining disk prices, this is becoming less of an issue. A small footprint translates into lowered power and HVAC needs. And once the initial configuration has stabilized, software maintenance is simpler.

But the real advantage of diskless systems, at least with large clusters, is reduced maintenance. With diskless systems, you eliminate all moving parts aside from fans. For example, the average life (often known as mean time between failures, mean time before failure, or mean time to failure) of one manufacturer's disks is reported to be 300,000 hours, or 34 years of continuous operation. If you have a cluster of 100 machines, you'll replace about three of these drives a year. This is a nuisance, but doable. If you have a cluster with 12,000 nodes, then you are looking at a failure, on average, every 25 hours, or roughly once a day.
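
The arithmetic behind these numbers is easy to check. The short Python sketch below assumes failures are spread evenly over the rated life, which is a simplification (real disks tend to fail more often when very new or very old), but it reproduces the figures quoted above.

    # Back-of-the-envelope disk failure estimates from a quoted MTBF figure.
    mtbf_hours = 300_000          # manufacturer's rated mean time between failures
    hours_per_year = 24 * 365     # 8,760

    for drives in (100, 12_000):
        cluster_mtbf = mtbf_hours / drives         # hours between failures, cluster-wide
        failures_per_year = hours_per_year / cluster_mtbf
        print(f"{drives:>6} drives: one failure about every {cluster_mtbf:.0f} hours, "
              f"or roughly {failures_per_year:.0f} replacements per year")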

There is also a downside to consider. Diskless systems are much harder for inexperienced administrators to configure, particularly with heterogeneous hardware. The network is often the weak link in a cluster. In diskless systems the network will see more traffic from the network file system, compounding the problem. Paging across a network can be devastating to performance, so it is critical that you have adequate local memory. But while local disks can reduce network traffic, they don't eliminate it. There will still be a need for network-accessible file systems.

Simply put, disk-based systems are more versatile and more forgiving. If you are building a dedicated cluster with new equipment and have experience with diskless systems, you should definitely consider diskless systems. If you are new to clusters, a disk-based cluster is a safer approach. (Since this book's focus is getting started with clusters, it does not describe setting up diskless clusters.)

If you are buying hard disks, there are three issues: interface type (EIDE vs. SCSI), disk latency (a function of rotational speed), and disk capacity. From a price-performance perspective, EIDE is probably a better choice than SCSI, since virtually all motherboards include a built-in EIDE interface. And unless you are willing to pay a premium, you won't have much choice with respect to disk latency. Almost all current drives rotate at 7,200 RPM. While a few 10,000 RPM drives are available, their performance, unlike their price, is typically not all that much higher. With respect to disk capacity, you'll need enough space for the operating system, local paging, and the data sets you will be manipulating. When recycling older computers, a 10 GB disk should be adequate for most uses unless you have extremely large data sets, and often smaller disks can be used. For new systems, you'll be hard pressed to find anything smaller than 20 GB, which should satisfy most uses. Of course, other non-cluster needs may dictate larger disks.

You'll probably want to include either a floppy drive or CD-ROM drive in each system. Since CD-ROM drives can be bought for under $15 and floppy drives for under $5, you won't save much by leaving these out. For disk-based systems, CD-ROMs or floppies can be used to initiate and customize network installs. For example, when installing the software on compute nodes, you'll typically use a boot floppy for OSCAR systems and a CD-ROM on Rocks systems. For diskless systems, CD-ROMs or floppies can be used to boot systems over the network without special BOOT ROMs on your network adapters. The only compelling reason to not include a CD-ROM or floppy is a lack of space in a truly minimal system.

When buying any disks, don't forget the cables.

3.1.1.3 Monitors, keyboards, and mice

Many minimal systems elect not to include monitors, keyboards, or mice but rely on the network to provide local connectivity as needed. While this approach is viable only with a dedicated cluster, its advantages include lower cost, less equipment to maintain, and a smaller equipment footprint. There are also several problems you may encounter with these headless systems. Depending on the system BIOS, you may not be able to boot a system without a display card or keyboard attached. When such systems boot, they probe for an attached keyboard and monitor and halt if none are found. Often, there will be a CMOS option that will allow you to override the test, but this isn't always the case.

Another problem comes when you need to configure or test equipment. The lack of a monitor and keyboard can complicate such tasks, particularly if you have network problems. One possible solution is a crash cart: a cart with a keyboard, mouse, and display that can be wheeled to individual machines and connected temporarily. Provided the network is up and the system is booting properly, X Windows or VNC provides a software solution.

Yet another alternative, particularly for small clusters, is the use of a keyboard-video-mouse (KVM) switch. With these switches, you can attach a single keyboard, mouse, and monitor to a number of different machines. The switch allows you to determine which computer is currently connected. You'll be able to access only one of the machines at a time, but you can easily cycle among the machines at the touch of a button. It is not too difficult to jump between machines and perform several tasks at once. However, it is fairly easy to get confused about which system you are logged on to. If you use a KVM switch, it is a good idea to configure the individual systems so that each displays its name, either as part of the prompt for command-line systems or as part of the background image for GUI-based systems.

There are a number of different switches available. Avocent even sells a KVM switch that operates over IP and can be used with remote clusters. Some KVM switches can be very pricey, so be sure to shop around. Don't forget to include the cost of cables when pricing KVM switches. Frequently, they are not included with the switch and are usually overpriced. You'll need a set for every machine you want to leave connected, but not necessarily for every machine.

The interaction between the system and the switch may provide a surprise or two. As previously noted, some systems don't allow booting without a keyboard, i.e., there is no CMOS override for booting without a keyboard. A KVM switch may be able to fool these systems: they may detect a keyboard through the switch even when the switch is set to a different machine. On the other hand, if you are installing Linux on a computer and the installer probes for a monitor, the monitor won't be found unless the switch is set to that system.

Keep in mind, both the crash cart and the KVM switch approaches assume that individual machines have display adapters.


For this reason, you should seriously consider including a video card even if you are going with headless systems. Very inexpensive cards or integrated adapters can be used since you won't need anything fancy. Typically, embedded video will add only a few dollars to the price of a motherboard.

One other possibility is to use serial consoles. Basically, the idea is to replace the attached monitor and keyboard with a serial connection to a remote system. With a fair amount of work, most Linux systems can be reconfigured to work in this manner. If you are using rack-mount machines, many of them support serial console redirection out of the box. With this approach, the systems use a connection to a serial port to eliminate the need for a KVM switch. Additional hardware is available that will allow you to multiplex serial connections from a number of machines. If this approach is of interest, consult the Remote Serial Console HOWTO at http://www.tldp.org/HOWTO/Remote-Serial-Console-HOWTO/.

3.1.1.4 Adapters, power supplies, and cases

As just noted, you should include a video adapter. The network adapter is also a key component. You must buy an adapter that is compatible with the cluster network. If you are planning to boot a diskless system over the network, you'll need an adapter that supports it. This translates into an adapter with an appropriate network BOOT ROM, i.e., one with Preboot Execution Environment (PXE) support. Many adapters come with a built-in (but empty) BOOT ROM socket so that the ROM can be added. You can purchase BOOT ROMs for these cards or burn your own. However, it may be cheaper to buy a new card with an installed BOOT ROM than to add BOOT ROMs to existing cards. And unless you are already set up to burn ROMs, you'll need to be outfitting several machines before it becomes cost effective to buy an EPROM burner.

To round things out, you'll need something to put everything in and a way to supply power, i.e., a case and power supply. With the case, you'll have to balance keeping the footprint small and having room to expand your system. If you buy too small a power supply, it won't meet your needs or allow you to expand your system. If you buy too large a power supply, you waste money and space. If you add up the power requirements for your individual components and add in another 50 percent as a fudge factor, you should be safe.
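
If you want to let the computer do the arithmetic, a sketch like the following will serve. The wattage figures are hypothetical placeholders; substitute the numbers from your own components' specification sheets.

    # Rough power-supply sizing: sum the component draws, then add a 50 percent
    # fudge factor. All wattages below are made-up example values.
    components = {
        "motherboard and CPU": 90,   # watts
        "memory": 10,
        "hard disk": 15,
        "CD-ROM drive": 10,
        "network adapter": 5,
        "fans": 10,
    }

    total = sum(components.values())
    recommended = total * 1.5        # 50 percent fudge factor
    print(f"Component total: {total} W; recommended supply: about {recommended:.0f} W")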

One last word about node selection: while we have considered components individually, you should also think about the system collectively before you make a final decision. If the individual systems collectively generate more heat than you can manage, you may need to reconsider how you configure them. For example, Google is said to use less-powerful machines in its clusters in order to balance computation needs with total operational costs, a judgment that includes the impact of cooling needs.

3.1.2 Cluster Head and Servers

Thus far, we have been looking at the compute nodes within the cluster. Depending on your configuration, you will need a head node and possibly additional servers. Ideally, the head node and most servers should be complete systems, since doing so adds little to your overall cost and can simplify customizing and maintaining these systems. Typically, there is no need for these systems to use the same hardware as your compute nodes. Go for performance enhancements that you might not be able to afford on every node. These machines are the place for large, fast disks and lots of fast memory. A faster processor is also in order.

On smaller clusters, you can usually use one machine as both the head and as the network file server. This will be a dual-homed machine (two network interfaces) that serves as an access point for the cluster. As such, it will be configured to limit and control access as well as provide it. When the services required by the network file systems put too great a strain on the head node, the network file system can be moved to a separate server to improve performance.

If you are setting up systems as I/O servers for a parallel file system, it is likely that you'll want larger and faster drives on these systems. Since you may have a number of I/O servers in a larger cluster, you may need to look more closely at cost and performance trade-offs.

3.1.3 Cluster Network

By definition, a cluster is a networked collection of computers. For commodity clusters, networking is often the weak link. The two key factors to consider when designing your network are bandwidth and latency. Your application or application mix will determine just how important these two factors are. If you need to move large blocks of data, bandwidth will be critical. For real-time applications or applications that have lots of interaction among nodes, minimizing latency is critical. If you have a mix of applications, both can be critical.
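
One way to get a feel for the difference is the common first-order model in which transfer time is the latency plus the message size divided by the bandwidth. The Python sketch below uses illustrative figures, not measurements, for a Fast Ethernet network and a hypothetical low-latency interconnect; the point is simply that small messages are dominated by latency while large transfers are dominated by bandwidth.

    # First-order model of message transfer time: time = latency + size / bandwidth.
    # The latency and bandwidth figures are illustrative assumptions, not benchmarks.
    def transfer_time(msg_bytes, latency_s, bandwidth_bytes_per_s):
        return latency_s + msg_bytes / bandwidth_bytes_per_s

    fast_ethernet = (100e-6, 100e6 / 8)   # ~100 microseconds, 100 Mbps
    low_latency   = (10e-6, 2e9 / 8)      # ~10 microseconds, 2 Gbps

    for size in (100, 1_000_000):         # a tiny message and a 1 MB block
        t_fe = transfer_time(size, *fast_ethernet)
        t_ll = transfer_time(size, *low_latency)
        print(f"{size:>9} bytes: Fast Ethernet {t_fe * 1e6:9.1f} us, "
              f"low-latency interconnect {t_ll * 1e6:9.1f} us")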

It should come as no surprise that a number of approaches and products have been developed. High-end Ethernet is probably the most common choice for clusters. But for some low-latency applications, including many real-time applications, you may need to consider specialized low-latency hardware. There are a number of choices. The most common alternative to Ethernet is Myrinet from Myricom, Inc. Myrinet is a proprietary solution providing high-speed bidirectional connectivity (currently about 2 Gbps in each direction) and low latencies (currently under 4 microseconds). Myrinet uses a source-routing strategy and allows arbitrary length packets.

Other competitive technologies that are emerging or are available include cLAN from Emulex, QsNet from Quadrics, and Infiniband from the Infiniband consortium. These are high-performance solutions and this technology is rapidly changing.

The problem with these alternative technologies is their extremely high cost. Adapters can cost more than the combined cost of all the other hardware in a node. And once you add in the per node cost of the switch, you can easily triple the cost of a node. Clearly, these approaches are for the high-end systems.

Fortunately, most clusters will not need this extreme level of performance. Continuing gains in speed and rapidly declining costs make Ethernet the network of choice for most clusters. Now that Gigabit Ethernet is well established and 10 Gigabit Ethernet has entered the marketplace, the highly expensive proprietary products are no longer essential for most needs.

For Gigabit Ethernet, you will be better served with an embedded adapter rather than an add-on PCI board since Gigabit can swamp the PCI bus. Embedded adapters use workarounds that take the traffic off the PCI bus. Conversely, with 100BaseT, you may prefer a separate adapter rather than an embedded one since an embedded adapter may steal clock cycles from your applications.
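
A little arithmetic shows why Gigabit traffic can swamp a standard PCI bus. A 32-bit, 33 MHz PCI slot has a theoretical peak of roughly 133 MB/s, shared by every device on the bus, while full-duplex Gigabit Ethernet can carry up to about 250 MB/s; the figures below are theoretical maxima, not measured throughput.

    # Compare the theoretical throughput of a standard PCI bus with
    # full-duplex Gigabit Ethernet traffic.
    pci_32_33_mb_s = 33e6 * 4 / 1e6          # 32-bit at 33 MHz: ~132 MB/s, shared
    gigabit_fd_mb_s = 2 * 1e9 / 8 / 1e6      # 1 Gbps each direction: 250 MB/s

    print(f"32-bit/33 MHz PCI bus:        {pci_32_33_mb_s:.0f} MB/s")
    print(f"Full-duplex Gigabit Ethernet: {gigabit_fd_mb_s:.0f} MB/s")

Even before you add disk or other I/O traffic, a busy Gigabit adapter can demand more than the bus can deliver, which is why embedded adapters that bypass the PCI bus are attractive.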

Unless you are just playing around, you'll probably want, at minimum, switched Fast Ethernet. If your goal is just to experiment with clusters, almost any level of networking can be used. For example, clusters have been created using FireWire ports. For two (or even three) machines, you can create a cluster using crossover cables.

Very high-performance clusters may have two parallel networks. One is used for message passing among the nodes, while the second is used for the network file system. In the past, elaborate technologies, architectures, and topologies were developed to optimize communications. For example, channel bonding uses multiple interfaces to multiplex channels for higher bandwidth. Hypercube topologies have been used to minimize communication path lengths. These approaches are beyond the scope of this book. Fortunately, declining networking prices and faster networking equipment have lessened the need for them.
