How I built a GPU Deep Learning Cluster with ray 

It all began at the end of May. Eric texted me that he was working on a personal research project and needed my help to realize his goal: he wanted me to help him set up a distributed computing cluster, which he could then use to run some experiments on. Of course, I agreed, and I spent the rest of May getting acquainted with Kubernetes, the open-source container orchestration platform originally developed by Google, as my friend was planning to use it to run his experiments.

The First Challenge: No GPUs

In June, the project finally kicked off with me getting access to four rooms, each containing about 17 PCs, which were to be connected to form a computing cluster.

At this point, the project would have been easy: just install Ubuntu or any other Linux distro on each machine, install the appropriate drivers, and set up Kubernetes and Docker in such a way that the GPUs of the machines could be used for my partner’s experiments. Unfortunately, the school imposed one big restriction that made the project more complicated: we were not allowed to modify the Windows 10 installation on the machines. Installing Linux directly on the physical machines was therefore not a realistic option.

So, I first looked for a way to use Windows as the OS running the computing cluster. However, this was not possible as GPU Passthrough for Docker on Windows did not work at the time (and still does not work now -_-), so it was back to the drawing board.

Then, as another solution, I wanted to use Hyper-V to run a Linux VM on each PC and use PCIe passthrough to make the GPU available to that VM, but sadly Hyper-V on Windows 10 Pro does not support that (Discrete Device Assignment is limited to Windows Server).

So finally I tried a different approach altogether. As I was not allowed to modify the local Windows installation on each PC, would it be possible to keep a Linux installation on a remote storage server and have each PC boot that remote installation? As it turned out, this was the solution I eventually settled on; at first, however, I had no chance to actually pursue it, as it required some network tweaking on the school’s side, so I had to go with the worst setup overall: a no-GPU cluster using Linux VMs running on Windows.

Finally: GPUs possible

When I talked to Eric, my project partner, about the current situation, he was not very happy. I told him about the possible solutions and that we would have to go with the no-GPU setup if the school was not willing to make the network changes required for the remote-boot approach.

So my partner went ahead (without my knowledge) and got permission from the school to push the required network changes through. Oh boy, you could not believe how surprised I was when I got that phone call from him. He also got me a contact in the school’s IT support department, who later supported me during the setup of the cluster.

Therefore, there was nothing standing in my way - or so I thought.

The Second Challenge: How to boot Linux from remote storage?

Now, as I had no more (bureaucratic) roadblocks in front of me, it was time to actually figure out how I would get Linux to boot from remote storage.

PXE: An Introduction

PXE, or the Preboot Execution Environment, is a client-server environment that was standardized by Intel as part of its Wired for Management specification in the mid-1990s, receiving its 2.0 update in 1998 and its 2.1 update in 1999. The specification describes how the standard industry protocols TCP/IP, DHCP and TFTP can be used to let a PC boot a software image (typically an operating system) over the network.

Basically, any PXE-enabled device contains code, either on its NIC (Network Interface Card) or as part of its (UEFI) firmware, that enables it to obtain an NBP, a Network Bootstrap Program. The NBP is retrieved from a TFTP server on the client’s local network, to which the client is directed by a properly set-up DHCP server. The NBP then bootstraps the client device: it downloads any additionally needed files from remote servers on the network into the client’s main memory and executes them to actually boot the OS.

First Try: CoreOS via PXE

So, as this was my first time working with PXE, I had to do some research with Google, and I soon found a number of articles showing how to use PXE to boot the installers of Ubuntu, Fedora, CentOS and other Linux distros. However, I could not find any articles that showed how to PXE-boot a full Linux installation. So out of curiosity I tried to see if I could boot CoreOS via PXE.

This might seem random, but at the time I was working on this project I was also interested in CoreOS, which I knew was intended as a lightweight host OS for containers. So it would be a perfect candidate for an OS that supports easy booting over the network, or so I thought.

And it turned out that CoreOS is actually quite easy to boot via PXE; there is even an open-source tool for exactly that.

However, this attempt quickly came to a close when I found out that setting up GPU passthrough for containers on CoreOS is not very easy, as installing the Nvidia GPU drivers there is a pain (articles showing how it can be done do exist, but the process is involved).

Second Try: Ubuntu 14.04 via PXE and NFS

So, I took another round of googling to (finally) find an article that showed how to set up a PXE boot of a full Ubuntu 14.04 installation.

I tried to follow the tutorial, but for some reason I just could not get Ubuntu 14.04 to boot in my test setup at home. As I was closing in on the final deployment date (the end of June), I moved straight on to the next, temporarily final, solution.

Third Try: CentOS 7 via PXE and NFS

Finally, after some additional googling, I found an automated script that could turn a blank CentOS 7 installation into a full-fledged PXE server. I quickly tested whether the script worked in my small home setup, and it actually did.

After that I went straight to school and implemented this “solution” in one of the four rooms to test how the solution performed in the real world. Sadly, this solution performed so poorly that I had to throw it away, again.

I then consulted with another friend, Sebastian De Ro, about this and he just recommended that I should retry it with Ubuntu.

Final Try: Ubuntu 14.04 again

I followed my friend’s advice and tried the Ubuntu tutorial again. This time, with my friend on the phone, I somehow got it to work. So, when I was back in school, I quickly replaced the CentOS setup with the Ubuntu one. This time the test deployment was a full success: all PCs in the room where I carried out the testing booted Ubuntu 14.04.

Next Step: Setting Up Kubernetes

As the PCs were now finally booting over the network, it was time to set up Kubernetes. For this, I took advantage of kubeadm, a great tool that allowed me to easily set up a Kubernetes cluster within 1–2 hours.

Kubernetes: An introduction

Kubernetes is an open-source container orchestration system originally developed by Google that lets you manage containerized applications declaratively. For a quick introduction to the core Kubernetes concepts, the official documentation is a good starting point.

Now, after I used kubeadm to setup my cluster, the architecture looked as follows:

  • The master server (the head of the Kubernetes cluster, aptly named “kube-master”) was an Ubuntu 16.04 VM located in the school’s data center, which also housed the PXE server used for remote booting.
  • The slave servers (the machines that actually did the computing work in the cluster, named “kube-slaves”) were the “diskless” PCs, which booted a remote Ubuntu 14.04 installation via PXE and would be responsible for the computations required for Eric’s experiments.

However, the final touch was still missing: The Ubuntu 14.04 machines still lacked the required GPU & CUDA drivers needed for GPU acceleration.

So, I went ahead and tried to install the GPU & CUDA drivers onto the Ubuntu 14.04 installation.

The Third Challenge: CUDA Toolkit only for Ubuntu 16.04 & above

Sadly, this was the moment I faced the next challenge: CUDA Toolkit 9.2, which according to my friend was essential, simply did not exist for Ubuntu 14.04. Therefore I could not use my Ubuntu 14.04 remote installation for the “diskless” PCs, and I had to upgrade my Ubuntu.

Net-booting Ubuntu 16.04: A Royal Pain

I turned back to good old Google and tried to find a newer tutorial than the one that I had already used. However, there was no newer version, so I just went ahead and followed the old article as much as I could.

And I got quite far with that: I managed to get the machines to initiate the booting sequence, but then got stuck in the final boot phase. It took me quite some time to figure out that two major changes in Ubuntu 16.04 were causing the boot failure:

  1. The way the root mount device is specified in the /etc/fstab file had changed: in the article, the root mount device was specified as /dev/nfs, but Ubuntu 16.04 expects the form <NFS Server IP>:<Path to Mount>.
  2. The folder structure had also changed: the /var/run and /var/lock folders are now /run and /run/lock.

These two changes caused errors in the boot process because the /etc/fstab file contained wrong entries. So, after fixing this, Ubuntu 16.04 easily booted via PXE. However, finding these errors was a pain.

Finally Solved: Nvidia GPU & CUDA Drivers installed

Now that the net-boot setup for Ubuntu 16.04 was finally ready, I once again replaced the net-boot deployment in the school’s data center, this time with the Ubuntu 16.04 installation. Finally, I PXE-booted one of the machines in my test room and installed the Nvidia GPU and CUDA drivers together with CUDA Toolkit 9.2. With that, this challenge was also solved.

And finally the first version of the GPU cluster was ready. Now, we went ahead and tried to actually deploy the application, which would do the calculations required for my friend’s experiments, onto the cluster.

The Fourth Challenge: Deploying the Application

Next, my friend and I got together to deploy the code that would carry out the experiments on the compute cluster.

However, we had trouble getting things to run.

At first, the Docker images would not be pulled by the slaves, but this could be worked around with a private registry, which I set up for this project.

Yet the second problem was even worse: the application would not run on the cluster. It seemed to be a network-related error, so I again spent some time trying to figure out what was going wrong. Even after days of debugging, however, I could not get the application to run. So I got together with Eric and asked him why we had chosen this setup in the first place.

He explained that we were using Docker because someone had recommended using Kubernetes. However, after some discussion we came to the conclusion that we did not need Kubernetes at all, but could use another tool for the distributed computing: ray.

Good Bye Kubernetes, Hello ray

We decided to kill the Kubernetes approach and go with ray, a “flexible, high-performance distributed execution framework.”

ray: An Introduction

Ray is a Python framework that allows parallel execution of Python code by the use of “remote functions” like this:
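(The code snippet embedded in the original post is no longer available; what follows is a minimal illustrative sketch of the pattern, not the project’s actual code.)

    import ray

    # Start (or connect to) a local ray instance.
    ray.init()

    # The decorator turns an ordinary Python function into a "remote function".
    @ray.remote
    def square(x):
        return x * x

    # .remote() schedules the task asynchronously and returns a future-like object ID.
    futures = [square.remote(i) for i in range(8)]

    # ray.get() blocks until the results are available.
    print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]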

For more info about ray and what it is capable of, see the official documentation.

I quickly adapted the remote Ubuntu installation by removing Kubernetes and installing ray on each machine. This went quite well.
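Conceptually, wiring the machines into a ray cluster boils down to starting a head process on the central machine, pointing a worker process on every PC at it, and then connecting to the cluster from the driver script. The sketch below shows the idea; the address is a placeholder, and the exact CLI flags and keyword arguments have changed between ray releases (older versions used redis_address/--redis-address instead of address/--address), so treat it as illustrative rather than the exact commands we ran.

    # On the head node (the central server):   ray start --head --port=6379
    # On every worker PC:                      ray start --address=<head-ip>:6379
    #
    # Then, from the driver script:
    import ray

    # Connect to the existing cluster instead of starting a local one.
    # "10.0.0.1:6379" is a placeholder for the head node's address.
    ray.init(address="10.0.0.1:6379")

    # Sanity check: how many CPUs/GPUs does the cluster report?
    print(ray.cluster_resources())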

Finally, I ran some last tests on the test deployment and then went ahead and deployed the full cluster. All 68 machines were interconnected and finally ready to do work.

The first real version of the cluster went live in the last week of June. Eric then started adapting his experimentation code to the new framework. Meanwhile, he asked me to run some final tests on the cluster, just to see whether it could handle the network load his code would place on it. Unfortunately, the results of those tests were quite the deal breaker.
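To give an idea of what such a check can look like, here is a rough sketch (not the actual test code; the address, payload size and repetition count are made up): push large numpy arrays to remote tasks and time how long the cluster takes to move them.

    import time
    import numpy as np
    import ray

    ray.init(address="10.0.0.1:6379")  # placeholder head-node address

    @ray.remote
    def receive(payload):
        # The worker just reports how many bytes arrived.
        return payload.nbytes

    # Roughly 100 MB of float64 zeros as a dummy payload.
    payload = np.zeros(100 * 1024 * 1024 // 8)

    start = time.time()
    sizes = ray.get([receive.remote(payload) for _ in range(10)])
    elapsed = time.time() - start

    print("moved ~%.0f MB in %.1f s" % (sum(sizes) / 1e6, elapsed))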

The Fifth Challenge: Network too slow

By now it was the end of June: the cluster was finally ready to do work but the final network load tests revealed some very serious bottlenecks. The school network was not fast enough to enable the cluster to do the experiment calculations in a feasible amount of time.

At first, I thought there was a problem with my setup, but network tests with iperf3 and some talks with the school’s IT support department confirmed my worst fears: the school’s network architecture simply could not deliver the throughput the cluster needed, so I had to come up with a (partly crazy) solution:

Build my own “Data Center” using the PCs the school had lent us for the project.

The Final Solution: Building A “Data Center”

So, what did this mean? Well, it meant that all the 68 PCs that the school had given us for the project had to be brought to one place, where they would then be put into one big network using one of the school’s more powerful network switches (a Cisco Catalyst 6500).

That place was one of the school’s networking labs, which also housed the Cisco Catalyst 6500 switch we then used to interconnect all 68 PCs into one big network.

After getting this solution approved by the school’s IT support department, Eric and I, together with one of the school’s IT support interns, brought all the PCs from their original locations to the networking lab where the “data center” would take shape.

Then I spent a few days connecting all PCs to a central power supply as well as connecting all PCs to the central switch, so that all machines would turn on and form a network that could handle the loads that my friend’s experimentation code would require.

Afterwards, I spent some time setting up a central server that would serve the computing cluster as:

  • a storage server holding the remote Ubuntu 16.04 installation (exported via NFS) that all the worker machines would boot
  • a PXE server enabling the worker machines to boot that remote Ubuntu 16.04 installation
  • a DHCP server so that all the machines could communicate over the network
  • the internet gateway
  • and the head of the ray computing cluster

Finally, I configured the Cisco switch so that it would be ready for use and then I prepared another PC so that we could use that machine as the controller of the cluster.

Once all of this was done, the cluster was finally ready for use.

Because of a few issues (the room in which the cluster had been built had to be cleaned, and the school was closed on weekends), the setup of the final “data center” solution took longer than planned, but we managed to finally deploy my friend’s experimentation code on July 14th.

Me (on the left) and Eric Steinberger (on the right) next to the finished cluster

Here is what the cluster looked like:

Epilogue

After my friend’s experimentation code was deployed on the cluster, we let it run until August 29th, the date when everything had to be reverted. Again with the help of the interns from the school’s IT support department, we put all the PCs back in their original locations and reverted all changes on them. All the data generated by my friend’s experimentation code was copied from the central server onto an external hard drive, and the central server was reset together with the Cisco switch. Finally, all the cabling was packed away and the project concluded successfully.