Setting Up a New Cluster



This document provides a basic overview of how to set up a new cluster. It is based on the RHEL (Red Hat Enterprise Linux) distribution and Dell servers: R430 (head and compute nodes), R730 (GPU nodes with P100s), and R730 (SMP nodes). Accordingly, three partitions (batch, gpu, and smp) are created using SLURM. Since there are different generations of compute nodes and GPU nodes, we have also grouped them using the “feature” option in SLURM. Check the HPC Resource View Portal, which shows the hybrid cluster structure. The internal high-bandwidth, low-latency private network (“interconnect”) is currently provided by 100/25/10 Gbps Mellanox/Arista Ethernet switches.
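The partition and feature layout described above can be sketched in slurm.conf roughly as follows; the node names, counts, and CPU figures are illustrative placeholders, not the actual cluster values:

```
# Node definitions: group hardware generations with the Feature option
NodeName=compute[001-040] CPUs=16 Feature=broadwell State=UNKNOWN
NodeName=gpu[001-004]     CPUs=16 Gres=gpu:p100:2 Feature=p100 State=UNKNOWN
NodeName=smp[001-002]     CPUs=48 Feature=smp State=UNKNOWN

# The three partitions mentioned above
PartitionName=batch Nodes=compute[001-040] Default=YES State=UP
PartitionName=gpu   Nodes=gpu[001-004] State=UP
PartitionName=smp   Nodes=smp[001-002] State=UP
```

Users can then target a particular hardware generation within a partition via a constraint, e.g. `sbatch -p gpu --constraint=p100 job.sh`.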

General Reference – OpenHPC:

CWRU HPC Overview:

Compute Nodes: CPU, GPU, Hadoop, DTN (Globus, Aspera, GDC)

Login Nodes: SSH/SFTP, Visualization, Hadoop Edge

Head Nodes: SLURM, XCat, LDAP, MySQL, DHCP, DNS, Ansible, Cloudera/Hadoop

Storage: Panasas, ZFS, Qumulo

Network: HPC, ILO/IPMI

Network & System Management & Monitoring: SolarWinds

Important Notes

  • We prefer rack-mountable servers with C13-C14 power connections, a 25GbE card plus a 1GbE connection, and IPMI-enabled management
  • Reserve a block of IP addresses for the interfaces
  • These instructions may change slightly over time, but the overall idea remains the same

  • Install the OS on a management node using a kickstart file, and PXE-boot all the other servers using xCAT installed on the management node.
  • Record the servers' information in the HPC Inventory for future look-up


  • Set up the servers and switches in the racks.

  • Keep all head nodes in one rack with a KVM console (see Appendix A: Remote Management for details); connect all head nodes (see Appendix B: Head Nodes) to a KVM switch, and keep a long roving cable for connecting one compute node at a time.

  • Get static IPs for all the servers, head nodes as well as compute nodes.


Head Nodes & Compute Nodes – BIOS and DRAC configuration

Here is the procedure for setting up the servers.

  1. Rack the node and connect

    1. 1GE yellow cable to LOM1 (usually em0)

    2. the 10GbE twinax cable or the 100/25GbE breakout cable that comes with the order to the 10/25GbE slot (usually p2p1)

    3. The outward-facing interface (1G/10G) that connects to the internet is also required (for head nodes only; not for compute nodes)

  2. The HPC group will install the nodes physically, as well as set up the BIOS/DRAC.

  3. Enter the BIOS setup:

    1. Change the BIOS startup order to: PXE, Hard Drive.

    2. Disable the Logical Processor option (i.e., hyper-threading)

    3. Note down the Ethernet MACs, recording em0 (the provisioning MAC) in the HPC Inventory page; if forgotten, this information is available in the iDRAC under Hardware -> Network Devices -> Embedded NIC1

    4. In the DRAC menu:

      • choose Non-dedicated with LOM1 (we want the shared interface to reduce the number of cables)

      • enter the appropriate DRAC IP settings (they should be pre-assigned)

      • enable IPMI over LAN (necessary; otherwise rpower shows an ERROR timeout even though DRAC access from the portal works)

      • set the user to root and the password to the DRAC password

    5. Save and Exit (Node will reboot)

Test whether you can access the node via DRAC (see Appendix C: DRAC Access)

Operating System (OS) Installation

We will install the RHEL OS on a management head node first, as it is the provisioning server that provides the kickstart files for installing the OS on all the compute nodes over the network using PXE boot through xCAT. You can install the OS either from media devices or over the network (if your IT department has an extra server designated for this). more …
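A minimal kickstart file for the management node could look like the sketch below; the network values, timezone, and device name are placeholders for your own settings, and the plaintext password is for illustration only:

```
# ks.cfg (sketch): unattended RHEL install for the management node
install
lang en_US.UTF-8
keyboard us
timezone America/New_York
network --bootproto=static --device=em0 --ip=10.1.0.10 --netmask=255.255.255.0
rootpw --plaintext changeme   # placeholder; use rootpw --iscrypted in practice
bootloader --location=mbr
clearpart --all --initlabel
autopart
reboot
%packages
@core
%end
```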



Open Science Framework (OSF)


The Open Science Framework (OSF) [1] [2] is a tool that promotes open, centralized workflows by enabling capture of different aspects and products of the research lifecycle, including developing a research idea, designing a study, storing and analyzing collected data, and writing and publishing reports or papers. It is developed and maintained by the Center for Open Science (COS), a nonprofit organization founded in 2013 that conducts research into scientific practice, builds and supports scientific research communities, and develops research tools and infrastructure to enable managing and archiving research. As an organization, the COS encourages openness, integrity, and reproducibility in research across scientific disciplines. The OSF supports a variety of tools and services to assist in the research process. This review focuses primarily on the core functionality of the OSF, with brief descriptions of some of the other existing tools and services. more …

Singularity HPC Container Solution


Singularity [1] enables users to have full control of their environment. This means that a non-privileged user can “swap out” the operating system on the host for one they control. So if the host system is running RHEL7 but your application runs in Ubuntu, you can create an Ubuntu image, install your applications into that image, copy the image to another host, and run your application on that host in its native Ubuntu environment!
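The RHEL7-host/Ubuntu-application scenario above can be sketched with a Singularity definition file; the base image and package below are arbitrary examples:

```
# ubuntu.def (sketch): an Ubuntu image that can run on a RHEL7 host
Bootstrap: docker
From: ubuntu:20.04

%post
    # Install your application and its dependencies into the image
    apt-get update && apt-get install -y python3

%runscript
    exec python3 "$@"
```

Build it with `singularity build ubuntu.sif ubuntu.def` (typically requires root or `--fakeroot`), copy the resulting `.sif` file to any host, and run it with `singularity run ubuntu.sif myscript.py`.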

Digital Signal Processing (DSP)


The signals that we want to use may contain noise (unwanted signals), so they need to be processed, for example by filtering out the noise with filters. The information they contain can also be displayed, analyzed, or converted to another type of signal. Digital Signal Processing (DSP) incorporates all these methods of processing signals so as to retrieve the desired information from them.
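As a minimal sketch of noise filtering, the moving-average filter below smooths a signal by averaging each sample with its neighbors; the window length and the example signal are arbitrary choices for illustration:

```python
def moving_average(x, window=5):
    """Smooth signal x by averaging each sample over a window
    centered on it, clipped at the signal edges."""
    half = window // 2
    y = []
    for n in range(len(x)):
        lo = max(0, n - half)
        hi = min(len(x), n + half + 1)
        y.append(sum(x[lo:hi]) / (hi - lo))
    return y

# A constant signal with one noisy spike: the filter spreads and damps it
signal = [1.0] * 10
signal[5] = 11.0
smoothed = moving_average(signal)
print(smoothed[5])  # 3.0 -- the spike is damped from 11.0
```

Real DSP filters (e.g. FIR/IIR designs) weight the window samples instead of averaging them uniformly, but the structure is the same.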

Signals and Systems


Signals such as sound, heartbeat, heat, earthquakes, current, and more carry information that gets transmitted from one place to another. Systems process those signals to modify or transform them. A signal is fed into a system as a time-dependent input stream x(t), which results in an output y(t), the response of the system. The mathematical modeling of signals and systems helps in the design and development of electronic devices.
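The input/output relationship described above can be illustrated in discrete time with a simple first-order system; the feedback coefficient 0.5 is an arbitrary choice for this example:

```python
def first_order_system(x, a=0.5):
    """Discrete-time system y[n] = a*y[n-1] + x[n]: each output sample
    depends on the current input and the previous output (the system's
    memory), analogous to x(t) -> system -> y(t)."""
    y = []
    prev = 0.0
    for sample in x:
        prev = a * prev + sample
        y.append(prev)
    return y

# Impulse input: the response decays geometrically
impulse = [1.0, 0.0, 0.0, 0.0]
print(first_order_system(impulse))  # [1.0, 0.5, 0.25, 0.125]
```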

more … [PDF]

Research computing resources and services for Data Science

What is Data Science?

A definition from Wikipedia [1]:

“Data science, also known as data-driven science, is an interdisciplinary field about scientific processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, machine learning, data mining, and predictive analytics, similar to Knowledge Discovery in Databases (KDD)”.

The indirect path between humans and data, through computer scientists, unlike the direct one in the statistics domain, has given us the modern and emerging domain of Data Science. Data Science aims to provide natural human-data interfaces where people can interact naturally with information, using the concepts of Open Data (e.g. Drupal/DKAN), Open Knowledge (e.g. The Open Knowledge Network), Open Systems (e.g. The Open Group), and open-source software (e.g. the deep learning packages Tensorflow and PyCasp) and platforms (e.g. CDH, Cloudera Hadoop).

Data Science is an interdisciplinary field because it adopts techniques and theories from a broad spectrum of fields in mathematics, statistics, operations research, information science, and computer science, including signal processing, probability models, machine learning, statistical learning, data mining, databases, data engineering, pattern recognition and learning, visualization, predictive analytics, uncertainty modeling, data warehousing, data compression, computer programming, artificial intelligence, and high-performance computing. Data Science also applies to wider domains, including finance, the social sciences, and the humanities.

more … [PDF]