  Linux Cluster HOWTO
  Ram Samudrala (me@ram.org)
  v1.31, November 7, 2004

  How to set up high-performance Linux computing clusters.
  ______________________________________________________________________

  Table of Contents


  1. Introduction

  2. Hardware

     2.1 Node hardware
     2.2 Server hardware
     2.3 Desktop hardware
     2.4 Firewall/gateway hardware
     2.5 Miscellaneous/accessory hardware
     2.6 Putting-it-all-together hardware
     2.7 Costs

  3. Software

     3.1 Operating system: Linux, of course!
     3.2 Networking software
     3.3 Parallel processing software
     3.4 Costs

  4. Set up, configuration, and maintenance

     4.1 Disk configuration
     4.2 Package configuration
     4.3 Operating system installation and maintenance
        4.3.1 Personal cloning strategy
        4.3.2 Cloning and maintenance packages
           4.3.2.1 FAI
           4.3.2.2 SystemImager
        4.3.3 DHCP vs. hard-coded IP addresses
     4.4 Known hardware issues
     4.5 Known software issues

  5. Performing tasks on the cluster

     5.1 Rough benchmarks
     5.2 Uptimes

  6. Acknowledgements

  7. Bibliography



  ______________________________________________________________________

  1.  Introduction

  This document describes how I set up my Linux computing clusters for
  high-performance computing which I need for my research.

  Use the information below at your own risk.  I disclaim all
  responsibility for anything you may do after reading this HOWTO. The
  latest version of this HOWTO will always be available at
  http://www.ram.org/computing/linux/linux_cluster.html.


  Unlike other documentation that talks about setting up clusters in a
  general way, this is a specific description of how our lab is setup
  and includes not only details the compute aspects, but also the
  desktop, laptop, and public server aspects.  This is done mainly for
  local use, but I put it up on the web since I received several e-mail
  messages based on my newsgroup query requesting the same information.
  Even today, as I plan another 64-node cluster, I find that there is a
  dearth of information about exactly how to assemble components to form
  a node that works reliably under Linux that includes information not
  only about the compute nodes, but about hardware that needs to work
  well with the nodes for productive research to happen.  The main use
  of this HOWTO as it stands is that it's a report on what kind of
  hardware works well with Linux and what kind of hardware doesn't.

  2.  Hardware

  This section covers the hardware choices I've made. Unless noted in
  the ``known hardware issues'' section, assume that everything works
  really well.

  Hardware installation is also fairly straight-forward unless otherwise
  noted, with most of the details covered by the manuals. For each
  section, the hardware is listed in the order of purchase (most recent
  is listed first).

  2.1.  Node hardware

  32 machines have the following setup each:


    2 XEON 2.66GHZ 533FSB CPUs

    Supermicro 6013A-T 1u case and motherboard

    2 512MB PC2100 DDR REG ECC RAM

    1 80GB SEA 7200 SATA HD

    1 250GB SEA 7200 SATA HD

  32 machines have the following setup each:


    2 XEON 2.4GHZ 533FSB CPUs

    Supermicro X5DPR-1G2 motherboard

    2 512MB PC2100 DDR REG ECC RAM

    1 40GB SEA 7200 HD

    1 120GB SEA 7200 HD

    Supermicro Slim 24X CDROM

    CSE-812 400 C/B 1U case

  32 machines have the following setup each:


    2 AMD Palamino MP XP 2000+ 1.67 GHz CPUs

    Asus A7M266-D w/LAN Dual DDR motherboard

    2 Kingston 512mb PC2100 DDR-266MHz REG ECC RAM

    1 41 GB Maxtor 7200rpm ATA100 HD

    1 120 GB Maxtor 5400rpm ATA100 HD

    Asus CD-A520 52x CDROM

    1.44mb floppy drive

    ATI Expert 2000 Rage 128 32mb

    IN-WIN P4 300ATX Mid Tower case

    Enermax P4-430ATX power supply

  32 machines have the following setup each:


    2 AMD Palamino MP XP 1800+ 1.53 GHz CPUs

    Tyan S2460 Dual Socket-A/MP motherboard

    Kingston 512mb PC2100 DDR-266MHz REG ECC RAM

    1 20 GB Maxtor UDMA/100 7200rpm HD

    1 120 GB Maxtor 5400rpm ATA100 HD

    Asus CD-A520 52x CDROM

    1.44mb floppy drive

    ATI Expert 98 8mb AGP video card

    IN-WIN P4 300ATX Mid Tower case

    Intel PCI PRO-100 10/100Mbps network card

    Enermax P4-430ATX power supply

  32 machines have the following setup each:


    2 Pentium III 1 GHz Intel CPUs

    Supermicro 370 DLE Dual PIII-FCPGA motherboard

    2 256 MB 168-pin PC133 Registered ECC Micron RAM

    1 20 GB Maxtor ATA/66 5400 RPM HD

    1 40 GB Maxtor UDMA/100 7200 RPM HD

    Asus CD-S500 50x CDROM

    1.4 MB floppy drive

    ATI Expert 98 8 MB PCI video card

    IN-WIN P4 300ATX Mid Tower case

  2.2.  Server hardware

  Two servers for external use (dissemination of information) with the
  following setups:


    2 AMD Opteron 240 1.4 GHz CPUs

    RIOWORKS HDAMB DUAL OPTERON motherboard

    4 KINGSTON 512MB PC3200 REG ECC RAM

    80GB MAX 7200 UDMA 133 HD

    6 200GB WD 7200 8MB HD

    ASUS 52X CD-A520 CDROM

    1.44mb floppy drive

    Antec 4U22ATX550EPS 4u case


    2 AMD Palamino MP XP 2000+ 1.67 GHz CPUs

    Asus A7M266-D w/LAN Dual DDR

    4 Kingston 512mb PC2100 DDR-266MHz REG ECC RAM

    Asus CD-A520 52x CDROM

    1 41 GB Maxtor 7200rpm ATA100 HD

    6 120 GB Maxtor 5400rpm ATA100 HD

    1.44mb floppy drive

    ATI Expert 2000 Rage 128 32mb

    IN-WIN P4 300ATX mid tower case

    Enermax P4-430ATX power supply

  2.3.  Desktop hardware

  1 desktop with the following setup:


    4 AMD 842 Opteron 1.6 GHz CPUs

    Tyan S4880UG2NR motherboard

    80GB MAX 7200 HD

    2 250GB WD 7200 HD

    10 GB DDR PC3200 REG ECC RAM

    Chenbro SR107 BLACK 550W 4u case


  2 desktops with the following setup:


    2 AMD Opteron 240 1.4 GHz CPUs

    K8T MASTER2-FAR K8T800 ATX motherboard

    2 KINGSTON 512MB PC2700 REG. ECC RAM

    550W Antec Xeon power supply

    ANTEC SX630II 300W mid-tower case

    1.44mb floppy drive

    PRO 660 TV/DVI FX5200T 128MB video card

    1 80GB SEA 7200 harddisk

    2 200GB WD 7200 8MB harddisk

    CREATIVE SB 128 5.1 PCI soundcard


  1 desktop with the following setup:


    2 AMD XP 2600 MP CPUs

    MSI K7D Master-L DUAL MS-6501 motherboard

    4 1024MB PC2100 DDR REG ECC RAM

    1 40GB SEA 7200 Maxtor harddisk

    2 120GB SEA 7200 Maxtor hardidks

    PIONEER DVR-AO5 IDE DVD-RW

    1.44mb floppy drive

    ATI Expert 2000 Rage 128 32mb video card

    IN-WIN P4 300ATX Mid Tower case

    Intel PCI PRO-100 10/100Mbps network card

    450W ENERMAX P4-430ATX power supply

    CREATIVE SB 128 5.1 PCI soundcard


  2 desktops with the following setup:


    2 AMD XP 2600 MP CPUs

    MSI K7D Master-L DUAL MS-6501 motherboard

    2 512MB PC2100 DDR REG ECC RAM

    1 40GB SEA 7200 Maxtor harddisk

    2 120GB SEA 7200 Maxtor hardidks

    MSI 52X24X52X CR52-A2 CD-RW

    1.44mb floppy drive

    ATI Expert 2000 Rage 128 32mb video card

    IN-WIN P4 300ATX Mid Tower case

    Intel PCI PRO-100 10/100Mbps network card

    450W ENERMAX P4-430ATX power supply

    CREATIVE SB 128 5.1 PCI soundcard


  1 desktop with the following setup:


    2 AMD Palamino MP XP 2000+ 1.67 GHz CPUs

    Asus A7M266-D w/LAN Dual DDR

    2 Kingston 512mb PC2100 DDR-266MHz REG ECC RAM

    Ricoh 32x12x10 CDRW/DVD Combo EIDE

    1.44mb floppy drive

    1 41 GB Maxtor 7200rpm ATA100 HD

    1 120 GB Maxtor 5400rpm ATA100 HD

    ATI Expert 2000 Rage 128 32mb video card

    IN-WIN P4 300ATX Mid Tower case

    Intel PCI PRO-100 10/100Mbps network card

    Enermax P4-430ATX power supply


  1 desktop with the following setup:


    2 Intel Xeon 1.7 GHz 256K 400FS

    Supermicro P4DCE Dual Xeon motherboard

    4 256mb RAMBUS 184-Pin 800 MHz memory

    2 120 GB Maxtor ATA/100 5400 RPM HD

    1 60 GB Maxtor ATA/100 7200 RPM HD

    52X Asus CD-A520 INT IDE CDROM

    1.4 MB floppy drive

    Leadtex 64 MB GF2 MX400 AGP

    Creative SB LIVE Value PCI 5.1

    Microsoft Natural Keyboard

    Microsoft Intellimouse Explorer

    Supermicro SC760 full-tower case with 400W PS


  2 desktops with the following setup:


    2 AMD K7 1.2g/266 MP Socket A CPU

    Tyan S2462NG Dual Socket A motherboard

    4 256mb PC2100 REG ECC DDR-266Mhz

    3 40 GB Maxtor UDMA/100 7200 RPM HD

    50X Asus CD-A520 INT IDE CDROM

    1.4 MB floppy drive

    Chaintech Geforce2 MX200 32mg AGP

    Creative SB LIVE Value PCI

    Full-tower case with 300W PS


  2 desktops with the following setup:


    2 Pentium III 1 GHz Intel CPUs

    Supermicro 370 DLE Dual PIII-FCPGA motherboard

    4 256 MB 168-pin PC133 Registered ECC Micron RAM

    3 40 GB Maxtor UDMA/100 7200 RPM HD

    Asus CD-S500 50x CDROM

    1.4 MB floppy drive

    Jaton Nvidia TNT2 32mb PCI

    Creative SB LIVE Value PCI

    Full-tower case with 300W PS


  2 desktops with the following setup:


    2 Pentium III 1 GHz Intel CPUs

    Supermicro 370 DLE Dual PIII-FCPGA motherboard

    4 256 MB 168-pin PC133 Registered ECC Micron RAM

    3 40 GB Maxtor UDMA/100 7200 RPM HD

    Mitsumi 8x/4x/32x CDRW

    1.4 MB floppy drive

    Jaton Nvidia TNT2 32mb PCI

    Creative SB LIVE Value PCI

    Full-tower case with 300W PS


  1 desktop with the following setup:


    2 Pentium III 1 GHz Intel CPUs

    Supermicro 370 DE6 Dual PIII-FCPGA motherboard

    4 256 MB 168-pin PC133 Registered ECC Micron RAM

    3 40 GB Maxtor UDMA/100 7200 RPM HD

    Ricoh 32x12x10 CDRW/DVD Combo EIDE

    Asus CD-A520 52x CDROM

    1.4 MB floppy drive

    Asus V7700 64mb GeForce2-GTS AGP video card

    Creative SB Live Platinum 5.1 sound card

    Full-tower case with 300W PS


  3 desktops with the following setup:


    2 Pentium III 1 GHz Intel CPUs

    Supermicro 370 DE6 Dual PIII-FCPGA motherboard

    4 256 MB 168-pin PC133 Registered ECC Micron RAM

    3 40 GB Maxtor UDMA/100 7200 RPM hard disk

    Ricoh 32x12x10 CDRW/DVD Combo EIDE

    1.4 MB floppy drive

    Asus V7700 64mb GeForce2-GTS AGP video card

    Creative SB Live Platinum 5.1 sound card

    Full-tower case with 300W PS

  2.4.  Firewall/gateway hardware

  1 firewall with the following setup:


    AMD Palamino XP 1700+ 1.47GHz CPU

    MSI KT3 Ultra2 KT333 MS-6380E motherboard

    512 MB PC2100 DDR-266MHz DIMM RAM

    40GB Seagate 7200rpm ATA/100 hard disk

    Asus 52X CD-A520 INT IDE cdrom

    1.44 MB floppy drive

    ATI Expert 2000 Rage 128 32mb video card

    3 Intel Pro/1000T Gigabit Server ethernet cards

    4U Black Rackmount Steel case

  1 gateway with the following setup. The gateway  is a mirror of the
  firewall in case the firewall breaks.


    AMD Palamino XP 1800+ 1.57GHz CPU


    MSI KT3 Ultra2 KT333 MS-6380E motherboard

    512 MB PC2100 DDR-266MHz DIMM RAM

    40GB Seagate 7200rpm ATA/100 hard disk

    Asus 52X CD-A520 INT IDE cdrom

    1.44 MB floppy drive

    ATI Expert 2000 Rage 128 32mb video card

    3 Intel Pro/1000T Gigabit Server ethernet cards

    4U Black Rackmount Steel case

  2.5.  Miscellaneous/accessory hardware

  Backup:


    2 Sony 20/40 GB DSS4 SE LVD DAT drives

  Monitors:


    3 17" Viewsonic VG700 LCB monitor

    2 17" Viewsonic VE700 LCD monitor

    1 20.1" Viewsonic VP201M LCD monitor

    1 22" Viewsonic P220F 0.25-0.27m monitor

    4 21" Sony CPD-G500 .24mm monitor

    2 18" Viewsonic VP181 LCD monitor

    1 17" Viewsonic VE170 LCD monitor

    2 21" Sun monitors

  Printers:


    HP colour laserject 4600dn

  Keyboards/mice:


    Microsoft Internet/Natural Keyboard

    Microsoft Intellimouse Explorer

  2.6.  Putting-it-all-together hardware

  We used to use KVM switches with a cheap monitor to connect up and
  "look" at all the machines:


    15" .28dp XLN CTL monitor

    3 Belkin Omniview 16-Port Pro switches

    Belkin Omniview 2-Port switch

    2 APC AR203 netshelter rack units

  While this is a nice solution, I think it's kind of needless. What we
  need is a small hand held monitor that can plug into the back of the
  PC (operated with a stylus, like the Palm). I don't plan to use more
  monitor switches/KVM cables.

  Networking is important:


    1 Netgear FS750ATNA 48 port/1 git network switch

    2 Netgear FS750NA 48 port/1 git network switch

    1 Netgear FSM750S 48 port/2 git network switch

    1 Netgear FS517TS 16 port/1 git network switch

    1 Netgear FS524 24 port network switch

    1 Cisco Catalyst 3448 XL Enterprise Edition 48 port network switch

    1 Netgear ME102NA Wireless Access Point

    1 Netgear MA401NA Wireless PCMCIA network card

  2.7.  Costs

  Our vendor is Hard Drives Northwest (http://www.hdnw.com). For each
  compute node in our cluster (containing two processors), we paid about
  $1500-$2000, including taxes. Generally, our goal is to keep the cost
  of each processor to below $1000 (including housing it).

  3.  Software

  3.1.  Operating system: Linux, of course!

  The following kernels and distributions are what are being used:


    Kernel 2.2.16-22, distribution KRUD 7.0

    Kernel 2.4.9-7, distribution KRUD 7.2

    Kernel 2.4.18-10, distribution KRUD 7.3

    Kernel 2.4.20-13.9, distribution KRUD 9.0

    Kernel 2.4.22-1.2188, distribution KRUD 2004-05

    Kernel 2.6.5-1.358, distribution Fedora Core 2

  These distributions work very well for us since updates are sent to us
  on CD and there's no reliance on an external network connection for
  updates. They also seem "cleaner" than the regular Red Hat
  distributions, and the setup is extremely stable.

  3.2.  Networking software

  We use Shorewall 1.3.14a ((http://www.shorewall.net) for the firewall.

  3.3.  Parallel processing software

  We use our own software for parallelising applications but have
  experimented with PVM <http://www.csm.ornl.gov/pvm/pvm_home.html> and
  MPI <http://www-unix.mcs.anl.gov/mpi/mpich/>. In my view, the overhead
  for these pre-packaged programs is too high.  I recommend writing
  application-specific code for the tasks you perform (that's one
  person's view).

  3.4.  Costs

  Linux and most software that run on Linux are freely copiable.

  4.  Set up, configuration, and maintenance

  4.1.  Disk configuration

  This section describes disk partitioning strategies.



       farm/cluster machines:

       hda1 - swap   (2 * RAM)
       hda2 - /      (remaining disk space)
       hdb1 - /maxa  (total disk)

       desktops (without windows):

       hda1 - swap   (2 * RAM)
       hda2 - /      (4-8 GB)
       hda3 - /spare (remaining disk space)
       hdb1 - /maxa  (total disk)
       hdd1 - /maxb  (total disk)

       desktops (with windows):

       hda1 - /win   (total disk)
       hdb1 - swap   (2 * RAM)
       hdb2 - /      (4 GB)
       hdb3 - /spare (remaining disk space)
       hdd1 - /maxa  (total disk)

       laptops (single disk):

       hda1 - /win   (half the total disk size)
       hda2 - swap   (2 * RAM)
       hda3 - /      (remaining disk space)



  4.2.  Package configuration

  Install a minimal set of packages for the farm. Users are allowed to
  configure desktops as they wish.

  4.3.  Operating system installation and maintenance

  4.3.1.  Personal cloning strategy

  I believe in having a completely distributed system. This means each
  machine contains a copy of the operating system.  Installing the OS on
  each machine manually is cumbersome. To optimise this process, what I
  do is first set up and install one machine exactly the way I want to.
  I then create a tar and gzipped file of the entire system and place it
  on a bootable CD-ROM which I then clone on each machine in my cluster.

  The commands I use to create the tar file are as follows:


       tar -czvlps --same-owner --atime-preserve -f /maxa/slash.tgz /



  I use a script called go that takes a machine number as its argument
  and untars the slash.tgz file on the CD-ROM and replaces the hostname
  and IP address in the appropriate locations. A version of the go
  script and the input files for it can be accessed at:
  http://www.ram.org/computing/linux/linux/cluster/. This script will
  have to be edited based on your cluster design.

  To make this work, I use Martin Purschke's Custom Rescue Disk
  (http://www.phenix.bnl.gov/~purschke/RescueCD/) to create a bootable
  CD image containing the .tgz file representing the cloned system, as
  well as the go script and other associated files.  This is burned onto
  a CD-ROM.

  There are several documents that describe how to create your own
  custom bootable CD, including the Linux Bootdisk HOWTO
  (http://www.linuxdoc.org/HOWTO/Bootdisk-HOWTO/), which also contains
  links to other pre-made boot/root disks.

  Thus you have a system where all you have to do is insert a CDROM,
  turn on the machine, have a cup of coffee (or a can of coke) and come
  back to see a full clone. You then repeat this process for as many
  machines as you have. This procedure has worked extremely well for me
  and if you have someone else actually doing the work (of inserting and
  removing CD-ROMs) then it's ideal. In my system, I specify the IP
  address by specifying the number of the machine, but this could be
  completely automated through the use of DHCP.

  Rob Fantini (rob@fantinibakery.com) has contributed modifications of
  the scripts above that he used for cloning a Mandrake 8.2 system
  accessible at
  http://www.ram.org/computing/linux/cluster/fantini_contribution.tgz.

  4.3.2.  Cloning and maintenance packages

  4.3.2.1.  FAI

  FAI (http://www.informatik.uni-koeln.de/fai/) is an automated system
  to install a Debian GNU/Linux operating system on a PC cluster. You
  can take one or more virgin PCs, turn on the power and after a few
  minutes Linux is installed, configured and running on the whole
  cluster, without any interaction necessary.


  4.3.2.2.  SystemImager

  SystemImager (http://systemimager.org) is software that automates
  Linux installs, software distribution, and production deployment.

  4.3.3.  DHCP vs. hard-coded IP addresses

  If you have DHCP set up, then you don't need to reset the IP address
  and that part of it can be removed from the go script.

  DHCP has the advantage that you don't muck around with IP addresses at
  all provided the DHCP server is configured appropriately. It has the
  disadvantage that it relies on a centralised server (and like I said,
  I tend to distribute things as much as possible). Also, linking
  hardware ethernet addresses to IP addresses can make it inconvenient
  if you wish to replace machines or change hostnames routinely.


  4.4.  Known hardware issues

  The hardware in general has worked really well for us. Specific issues
  are listed below:

  The AMD dual 1.2 GHz machines run really hot. Two of them in a room
  increase the temperature significantly. Thus while they might be okay
  as desktops, the cooling and power consumption when using them as part
  of a large cluster is a consideration. The AMD Palmino configuration
  described previously seems to work really well, but I definitely
  recommend getting two fans in the case--this solved all our
  instability problems.

  4.5.  Known software issues

  Some tar executables apparently don't create a tar file the nice way
  they're supposed to (especially in terms of referencing and de-
  referencing symbolic links). The solution to this I've found is to use
  a tar executable that does, like the one from RedHat 7.0.

  5.  Performing tasks on the cluster

  This section is still being developed as the usage on my cluster
  evolves, but so far we tend to write our own sets of message passing
  routines to communicate between processes on different machines.

  Many applications, particularly in the computational genomics areas,
  are massively and trivially parallelisable, meaning that perfect
  distribution can be achieved by spreading tasks equally across the
  machines (for example, when analysing a whole genome using a technique
  that operates on a single gene/protein, each processor can work on one
  gene/protein at a time independent of all the other processors).

  So far we have not found the need to use a professional queueing
  system, but obviously that is highly dependent on the type of
  applications you wish to run.

  5.1.  Rough benchmarks

  For the single most important program we run (our ab initio protein
  folding simulation program), using the Pentium 3 1 GHz processor
  machine as a frame of reference, on average:



       Xeon    1.7 GHz processor is about 22% slower
       Athlon  1.2 GHz processor is about 36% faster
       Athlon  1.5 GHz processor is about 50% faster
       Athlon  1.7 GHz processor is about 63% faster
       Xeon    2.4 GHz processor is about 45% faster
       Xeon    2.7 GHz processor is about 80% faster
       Opteron 1.4 GHz processor is about 70% faster
       Opteron 1.6 GHz processor is about 88% faster



  Yes, the Athlon 1.5 GHz is faster than the Xeon 1.7 GHz since the Xeon
  executes only six instructions per clock (IPC) whereas the Athlon
  executes nine IPC (you do the math!). This is however an highly non-
  rigourous comparison since the executables were each compiled on the
  machines (so the quality of the math libraries for example will have
  an impact) and the supporting hardware is different.



  5.2.  Uptimes

  These machines are incredibly stable both in terms of hardware and
  software once they have been debugged (usually some in a new batch of
  machines have hardware problems), running constantly under very heavy
  loads.  One example is given below. Reboots have generally occurred
  when a circuit breaker is tripped.



         2:29pm  up 495 days,  1:04,  2 users,  load average: 4.85, 7.15, 7.72



  6.  Acknowledgements

  The following people have been helpful in getting this HOWTO done:


    Michal Guerquin (Michal Guerquin)

    Michael Levitt (Michael Levitt)

  7.  Bibliography

  The following documents may prove useful to you---they are links to
  sources that make use of high-performance computing clusters:


    Ram Samudrala's research page (which describes the kind of research
     done with these clusters)
     <http://www.ram.org/research/research.html>

    RAMP web page <http://www.ram.org/computing/ramp/ramp.html>

    RAMBIN web page <http://www.ram.org/computing/rambin/rambin.html>



