Setting Up a New Cluster
This document gives a basic outline for starting a new cluster. It is based on the RHEL (Red Hat Enterprise Linux) distribution and Dell servers: R430 (server and compute nodes), R730 (GPU nodes with P100 cards), and R730 (SMP nodes). Accordingly, three partitions (batch, gpu, and smp) are created in SLURM. Because there are different generations of compute nodes and GPU nodes, we have also grouped them using the “feature” option in SLURM. See the HPC Resource View Portal for the hybrid cluster structure. The internal high-bandwidth, low-latency private network (“interconnect”) is currently provided by 100/25/10 Gbps Mellanox/Arista Ethernet switches.
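As a rough illustration of how partitions and node features fit together, the sketch below prints a slurm.conf fragment with the three partitions named above. The hostnames, node counts, and feature tags (gen1, gen2, p100, smp) are hypothetical placeholders, not the actual cluster layout, and a real configuration needs further node attributes (CPUs, memory, GRES) that are omitted here.

```python
# Hypothetical node groups: (partition, hostname prefix, first index, last index, feature tag).
# None of these names or counts come from the real cluster.
NODE_GROUPS = [
    ("batch", "comp", 1, 4, "gen1"),
    ("batch", "comp", 5, 8, "gen2"),
    ("gpu", "gpu", 1, 2, "p100"),
    ("smp", "smp", 1, 1, "smp"),
]


def node_range(prefix, first, last):
    """Return a Slurm hostlist expression such as comp[001-004]."""
    if first == last:
        return f"{prefix}{first:03d}"
    return f"{prefix}[{first:03d}-{last:03d}]"


def slurm_conf_lines(groups, default_partition="batch"):
    """Build NodeName lines (one per group) and PartitionName lines (one per partition)."""
    lines = []
    partitions = {}  # partition name -> list of hostlist expressions, in input order
    for partition, prefix, first, last, feature in groups:
        hosts = node_range(prefix, first, last)
        lines.append(f"NodeName={hosts} Feature={feature} State=UNKNOWN")
        partitions.setdefault(partition, []).append(hosts)
    for partition, hostlists in partitions.items():
        default = "YES" if partition == default_partition else "NO"
        nodes = ",".join(hostlists)
        lines.append(f"PartitionName={partition} Nodes={nodes} Default={default} State=UP")
    return lines


if __name__ == "__main__":
    print("\n".join(slurm_conf_lines(NODE_GROUPS)))
```

With a layout like this, a job could target one generation inside a partition through Slurm's constraint option, for example `sbatch -p batch --constraint=gen2 job.sh`.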
General Reference – OpenHPC: https://openhpc.community/
CWRU HPC Overview:
Compute Nodes: CPU, GPU, Hadoop, DTN (Globus, Aspera, GDC)
Login Nodes: SSH/SFTP, Visualization, Hadoop Edge
Head Nodes: SLURM, XCat, LDAP, MySQL, DHCP, DNS, Ansible, Cloudera/Hadoop
Storage: Panasas, ZFS, Qumulo
Network: HPC, ILO/IPMI
Network & System Management & Monitoring: SolarWinds
Important Notes
- We prefer servers that are rack-mountable, have C13-C14 power connections, a 25GbE card and a 1GbE connection, and are IPMI-enabled (management)
- Reserve a block of IP addresses for the interfaces
- The instructions may change slightly over time, but the idea will remain the same
- Install the OS on a management node using a kickstart file, and PXE boot all other servers using xCAT installed on the management node.
- Maintain the server information in HPC Inventory for future lookup
Pre-requisites
- Set up the servers and switches in the racks.
- Keep all head nodes (see Appendix B: Head Nodes) in one rack with a KVM console (see Appendix A: Remote Management section for details). Connect all head nodes to the KVM switch, and keep a long roving cable for connecting one compute node at a time.
- Get static IPs for all the servers, head nodes as well as compute nodes.
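One way to keep the reserved IP block and the host list consistent is to derive the assignments from a single script and paste the result into the inventory. The following is a minimal sketch using Python's standard `ipaddress` module. The network `10.10.0.0/20`, the hostnames, and the `skip` offset are all hypothetical values chosen for illustration.

```python
import ipaddress


def plan_addresses(network, hostnames, skip=10):
    """Assign consecutive host addresses from a reserved block.

    `skip` leaves the first few usable addresses free, e.g. for
    gateways and switches (an assumption, adjust to local policy).
    """
    net = ipaddress.ip_network(network)
    usable = list(net.hosts())[skip:]
    if len(hostnames) > len(usable):
        raise ValueError("reserved block is too small for the host list")
    return {name: str(ip) for name, ip in zip(hostnames, usable)}


if __name__ == "__main__":
    # Hypothetical block and hostnames.
    plan = plan_addresses("10.10.0.0/20", ["head01", "comp001", "comp002"])
    for name, ip in plan.items():
        print(f"{name}\t{ip}")
```

The function raises an error instead of wrapping around when the block is too small, so a shortage of addresses shows up before any node is configured.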
Head Nodes & Compute Nodes – BIOS and DRAC configuration
Here is the procedure for setting up the servers.
- Rack the node and connect:
  - the 1GbE yellow cable to LOM1 (usually em0)
  - the 10GbE twinax cable or the 100/25GbE breakout cables that come with the order to the 10/25GbE slot (usually p2p1)
  - the outward-facing interface (1G/10G) that connects to the internet (required for head nodes only, not for compute nodes)
- The HPC group will install the nodes physically and set up the BIOS/DRAC.
- Enter the BIOS setup:
  - Change the BIOS startup order to: PXE, Hard Drive.
  - Disable the Logical Processor setting (hyperthreading).
  - Note down the Ethernet MAC addresses, and record the em0 MAC (the provisioning MAC) on the HPC Inventory page. If forgotten, this information is available in the iDRAC: Hardware -> Network devices -> Embedded NIC1
- In the DRAC menu:
  - Choose Non dedicated with LOM1 (we want the shared interface to reduce the number of cables)
  - Enter the appropriate DRAC IP with netmask 255.255.240.0 (these should be pre-assigned)
  - Enable IPMI over LAN (necessary; otherwise rpower shows ERROR timeout even though DRAC access from the portal works)
  - Set the user to root and the password to the DRAC password.
- Save and Exit (the node will reboot)
- Test whether you can access the node via DRAC (see Appendix C: DRAC Access)
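Before relying on rpower, it can help to confirm two things from the management node: that the DRAC address sits in the expected subnet for the 255.255.240.0 netmask, and that IPMI over LAN answers. The sketch below is one possible check, not the site's procedure. It assumes `ipmitool` is installed on the machine running it, and the example addresses are hypothetical. The `-E` option tells ipmitool to read the password from the `IPMI_PASSWORD` environment variable, so the password does not appear on the command line.

```python
import ipaddress
import os
import subprocess

DRAC_NETMASK = "255.255.240.0"  # netmask from the DRAC step above


def same_drac_subnet(ip_a, ip_b, netmask=DRAC_NETMASK):
    """Return True if both addresses fall in the same subnet under the netmask."""
    net_a = ipaddress.ip_interface(f"{ip_a}/{netmask}").network
    net_b = ipaddress.ip_interface(f"{ip_b}/{netmask}").network
    return net_a == net_b


def ipmi_power_status_cmd(drac_ip, user="root"):
    """Build the ipmitool command line; -E takes the password from IPMI_PASSWORD."""
    return ["ipmitool", "-I", "lanplus", "-H", drac_ip, "-U", user, "-E",
            "chassis", "power", "status"]


def check_node(drac_ip, password, user="root"):
    """Run the IPMI query and return (ok, message). Requires ipmitool on PATH."""
    env = dict(os.environ, IPMI_PASSWORD=password)
    result = subprocess.run(ipmi_power_status_cmd(drac_ip, user), env=env,
                            capture_output=True, text=True, timeout=30)
    return result.returncode == 0, (result.stdout or result.stderr).strip()


if __name__ == "__main__":
    # Hypothetical addresses: a management-node DRAC-side address and a node's DRAC IP.
    print(same_drac_subnet("10.10.1.5", "10.10.15.200"))
    print(" ".join(ipmi_power_status_cmd("10.10.15.200")))
```

If `check_node` times out while the DRAC web interface is reachable, the "Enable IPMI over LAN" setting is the first thing to recheck.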
Operating System (OS) Installation
We install the RHEL OS on a management head node first, because it is the provisioning server and provides the kickstart files for installing the OS on all the compute nodes over the network, using PXE boot through xCAT. You can install the OS on the management node either from media devices or over the network (if your IT department has an extra server designated for this).