Notes found online about installing NVIDIA graphics card drivers on Ubuntu... quite detailed and possibly useful

Installing Nvidia CUDA 8.0 on Ubuntu 16.04 for Linux GPU Computing (New Troubleshooting Guide)
Published: April 1, 2017
Victor Oliveira Antonino

If you want to train deep neural networks, you are probably familiar with packages like Caffe, Keras, TensorFlow, Theano, and Torch. These libraries can use the GPU to speed up training, which can take a very long time on a CPU. No news so far, especially if you are an experienced machine learning engineer. However, installing CUDA on Ubuntu can be a very frustrating experience.

These are the most frequent causes:

You were greeted by a black screen after installing Nvidia Driver
You got stuck in “login loop” after installing Nvidia Driver
When you tried to run the base installer (cuda_<version>_linux.run) you received this lovely message (especially on EC2 instances): "The driver installation is unable to locate the kernel source. Please make sure that the kernel source packages are installed and set up correctly. If you know that the kernel source packages are installed and set up correctly, you may pass the location of the kernel source with the '--kernel-source-path' flag."
Even though there are tons of tutorials on the web, I have lost days installing CUDA on Ubuntu across different computers, both laptops and desktops. You might be familiar with most of the steps presented here, so feel free to skip ahead until you find something useful.


Before you start, switch to a text console by pressing CTRL+ALT+F1, log in using your credentials, and stop your current X server session:
sudo service lightdm stop
Why?
X is an application that manages one or more graphical displays. It makes sense to stop it before replacing the graphics driver: the X server is what draws and manages windows (resizing, moving, decorative elements, title bars, minimize and close buttons, etc.), and the NVIDIA installer will not install the driver while an X server is running. [Ref]
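
The command above assumes LightDM, the default display manager on a stock Ubuntu 16.04 desktop. Other flavours may use a different one (gdm3 or sddm, for example). Here is a minimal sketch for finding out which one to stop, assuming the Debian/Ubuntu convention that /etc/X11/default-display-manager records the path of the active display manager. The helper name dm_name is made up for this example.

```shell
#!/bin/sh
# Print the display manager's service name, read from the file that
# Debian/Ubuntu use to record the default display manager.
# dm_name is a hypothetical helper, not part of any package.
dm_name() {
    dm_file=${1:-/etc/X11/default-display-manager}
    [ -r "$dm_file" ] || return 1
    basename "$(cat "$dm_file")"
}

if dm=$(dm_name); then
    # Only print the command; run it yourself from the text console.
    echo "sudo service $dm stop"
else
    echo "no default display manager recorded" >&2
fi
```

On a machine without a graphical session (a typical EC2 instance) the file may not exist, in which case there is nothing to stop.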


1. Update your system


sudo apt-get update
sudo apt-get upgrade -y
sudo apt-get dist-upgrade -y
Why?
Keeping your system up to date is essential, right? Ubuntu images are not updated constantly and you are probably using a snapshot from a point in time. [Ref]


2. Install build-essential package


sudo apt-get install build-essential
Why?
If some library needs a C/C++ compiler, you need to install build-essential. [Ref]


3. Blacklist the "nouveau" driver


echo -e "blacklist nouveau\nblacklist lbm-nouveau\noptions nouveau modeset=0\nalias nouveau off\nalias lbm-nouveau off\n" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
sudo update-initramfs -u
Reboot the computer and repeat step 1.


Why?
Nouveau is a free and open-source driver developed by reverse engineering Nvidia's proprietary Linux drivers. We can't use it for several reasons: its performance is inferior to that of Nvidia's proprietary driver, it has no CUDA support, and it conflicts with the proprietary driver, which leads to the black screen and login loop issues. In other words, let's disable the conflicting modules.
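
After the reboot it is worth confirming that the blacklist actually took effect. Here is a small sketch that looks for the module in /proc/modules (the same information lsmod prints). The helper name nouveau_loaded is hypothetical.

```shell
#!/bin/sh
# Succeeds if a module named "nouveau" appears in a /proc/modules-style
# listing (module name is the first field on each line).
# nouveau_loaded is a hypothetical helper for this example.
nouveau_loaded() {
    grep -q '^nouveau ' "${1:-/proc/modules}" 2>/dev/null
}

if nouveau_loaded; then
    echo "nouveau is still loaded: re-check the blacklist file and reboot"
else
    echo "nouveau is not loaded"
fi
```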


4. Install linux kernel modules


When asked about GRUB changes, choose the package maintainer's version.


sudo apt-get install linux-image-extra-virtual
Why?
This is tricky, especially if you are using an EC2 instance. This link gives a good explanation of why this is needed. However, I will quote the important piece:


"Nvidia's driver depends on the drm module, but that's not included in the default 'virtual' ubuntu that's on the cloud (as it usually has no graphics). It's available in the linux-image-extra-virtual package (and linux-image-generic supposedly), but just installing those directly will install the drm module for the NEWEST available kernel, not the one we're currently running. Hence, we need to specify the version manually. This command will probably need to be re-run every time you upgrade the kernel and reboot."

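The quote says the version has to be specified manually, so here is a sketch of how the package name for the running kernel could be built. It assumes the linux-image-extra-<release> naming used by Ubuntu 16.04-era generic kernels; cloud-specific kernels may not ship such a package, so check the name with apt-cache before installing. The helper extra_pkg_for is made up for this example.

```shell
#!/bin/sh
# Build the package name for a given kernel release string.
# extra_pkg_for is a hypothetical helper; the naming pattern
# linux-image-extra-<release> is an assumption to verify with apt-cache.
extra_pkg_for() {
    printf 'linux-image-extra-%s\n' "$1"
}

pkg=$(extra_pkg_for "$(uname -r)")
echo "running kernel: $(uname -r)"
# Print rather than run, so the name can be checked first.
echo "apt-cache policy $pkg"
echo "sudo apt-get install $pkg"
```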

5. Install linux source and headers


sudo apt-get install linux-source
apt-get source linux-image-$(uname -r)
sudo apt-get install linux-headers-$(uname -r)
Why?
This is also needed to avoid the "unable to locate the kernel source" message!


CUDA toolkit documentation may not be very appealing to some, but I will also quote another important piece that explicitly says:


"The CUDA Driver requires that the kernel headers and development packages for the running version of the kernel be installed at the time of the driver installation, as well whenever the driver is rebuilt. For example, if your system is running kernel version 3.17.4-301, the 3.17.4-301 kernel headers and development packages must also be installed."

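A quick way to check this requirement before launching the installer is to look for a headers directory that matches the running kernel. This is a minimal sketch, assuming the headers package unpacks to /usr/src/linux-headers-<release> as it does on Ubuntu. The helper name headers_present is hypothetical.

```shell
#!/bin/sh
# Succeeds if a headers directory for kernel release $1 exists under $2
# (default /usr/src, where Ubuntu's linux-headers packages unpack).
# headers_present is a hypothetical helper for this example.
headers_present() {
    [ -d "${2:-/usr/src}/linux-headers-$1" ]
}

release=$(uname -r)
if headers_present "$release"; then
    echo "headers found for $release"
else
    echo "headers missing for $release: install linux-headers-$release"
fi
```

If the kernel was upgraded since the last reboot, the installed headers may belong to the newer kernel rather than the running one, which is exactly the mismatch the quote warns about.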

6. Install CUDA 8.0


Run the following commands:


wget https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda_8.0.61_375.26_linux-run -O cuda_8.0.61_375.26_linux.run
sudo sh cuda_8.0.61_375.26_linux.run --override --no-opengl-lib
Your log may be similar to this:


Do you accept the previously read EULA? (accept/decline/quit): accept
Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 375.26? ((y)es/(n)o/(q)uit): y
Install the CUDA 8.0 Toolkit? ((y)es/(n)o/(q)uit): y
Enter Toolkit Location [ default is /usr/local/cuda-8.0 ]:
Do you want to install a symbolic link at /usr/local/cuda? ((y)es/(n)o/(q)uit): y
Install the CUDA 8.0 Samples? ((y)es/(n)o/(q)uit): y
Enter CUDA Samples Location [ default is /home/user ]: /usr/local/cuda-8.0
Why?
The "--override" flag is needed so that you don't get the error "Toolkit: Installation Failed. Using unsupported Compiler."


The "--no-opengl-lib" prevents the driver installation from installing NVIDIA's GL libraries. Useful for systems where the display is driven by a non-NVIDIA GPU. In such systems, NVIDIA's GL libraries could prevent X from loading properly. This flag is very important to avoid getting stuck in “login loop” or black screen!


Wait... something is still not quite right! I am still receiving the message 'the driver installation is unable to locate the kernel source', even though I am using the flag --kernel-source-path=<path>!


So.. let's check the following log file:


sudo vi /var/log/nvidia-installer.log
It says:


"ERROR: The kernel module failed to load, because it was not signed by a key that is trusted by the kernel. Please try installing the driver again, and sign the kernel module when prompted to do so.


ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if a driver such as rivafb, nvidiafb, or nouveau is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA graphics device(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver release."
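
Rather than paging through the whole log in an editor, the error lines can be pulled out with grep. The sketch below also tests for the module-signing message quoted above, since that one points to a different cause than missing kernel sources. The helper names are made up for this example.

```shell
#!/bin/sh
# Print every ERROR line from an nvidia-installer log (path in $1).
# installer_errors and signing_error are hypothetical helpers.
installer_errors() {
    grep 'ERROR:' "$1"
}

# Succeeds if the log contains the module-signing error quoted above.
signing_error() {
    grep -q 'not signed by a key that is trusted by the kernel' "$1"
}

log=/var/log/nvidia-installer.log
if [ -r "$log" ]; then
    installer_errors "$log" || echo "no ERROR lines found"
    if signing_error "$log"; then
        echo "module signing rejected: likely a Secure Boot issue"
    fi
else
    echo "cannot read $log (try with sudo)" >&2
fi
```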




Usually the error "Unable to load the kernel module 'nvidia.ko'" is associated with DKMS, and installing the Linux kernel modules in step 4 might be enough. [See here]


However, my experience installing CUDA on a desktop computer showed me something different, especially because of what the first error message in the log says!


And there you have it:


Many linux distributions require modules to be cryptographically signed by a key trusted by the kernel when these modules are loaded into kernels running on UEFI systems with Secure Boot enabled. For those who did not get the last piece, the Unified Extensible Firmware Interface (UEFI) is a specification that defines a software interface between an operating system and platform firmware. UEFI replaces the Basic Input/Output System (BIOS) firmware interface originally present in all IBM PC-compatible personal computers.


Here, you can find details about how to generate signing keys in nvidia-installer.


Easy alternative? Disable UEFI Secure Boot (if possible), or use a kernel that doesn't require signed modules.


How to disable Secure Boot on Ubuntu, then!?!?


Since Ubuntu kernel build 4.4.0-21.37 this can be fixed by running:


sudo apt install mokutil
sudo mokutil --disable-validation
Since questions may arise, see the discussions of third-party kernel modules on UEFI systems with Secure Boot enabled, and of the consequences of disabling it.
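
Before and after changing the setting, mokutil can also report the current Secure Boot state. This sketch assumes mokutil's usual "SecureBoot enabled" / "SecureBoot disabled" wording for its --sb-state output. The helper sb_enabled is hypothetical.

```shell
#!/bin/sh
# Succeeds if the text in $1 reports Secure Boot as enabled.
# Assumes mokutil's usual "SecureBoot enabled"/"SecureBoot disabled" wording.
# sb_enabled is a hypothetical helper for this example.
sb_enabled() {
    case "$1" in
        *"SecureBoot enabled"*) return 0 ;;
        *) return 1 ;;
    esac
}

if command -v mokutil >/dev/null 2>&1; then
    state=$(mokutil --sb-state 2>&1) || true
    echo "$state"
    if sb_enabled "$state"; then
        echo "Secure Boot is on: unsigned modules will be rejected"
    fi
else
    echo "mokutil not installed" >&2
fi
```

On a machine that boots in legacy BIOS mode there is no Secure Boot to disable, and the signing error should not appear in the first place.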


I hope that after this you are able to see the "beautiful" nvidia-smi output in your terminal, similar to the one above.
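
To close the loop, here is a short post-install check. It assumes the installer created the /usr/local/cuda symbolic link offered in the prompts earlier. The helper name check_tool is made up for this example.

```shell
#!/bin/sh
# Succeeds if the command named in $1 is on PATH.
# check_tool is a hypothetical helper for this example.
check_tool() {
    command -v "$1" >/dev/null 2>&1
}

if check_tool nvidia-smi; then
    nvidia-smi || echo "nvidia-smi failed: the kernel module may not be loaded"
else
    echo "nvidia-smi not found: the driver is probably not installed"
fi

# /usr/local/cuda is the symlink the installer offers to create.
if [ -x /usr/local/cuda/bin/nvcc ]; then
    /usr/local/cuda/bin/nvcc --version
else
    echo "nvcc not found under /usr/local/cuda/bin"
fi
```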