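CUDA from Beginner to Mastery (a master's work, with admiration)

阿新 • Published: 2019-01-21

At my advisor's request, I first came across CUDA programming in 2012 while taking a high-performance computing course, and later applied it to a real project, where it sped up the processing program by more than 1K times. GPU-based parallel computing is clearly an attractive choice for applications that need speed. I graduate in less than a year and worry that these skills will leave with me, so this summer I plan to start a CUDA column that goes from beginner to mastery step by step and shares some lessons learned along the way. I hope it gives some guidance to fellow students learning CUDA. My ability is limited and mistakes are inevitable; discussion is welcome.

PS: Applying for a column apparently requires more than 15 original posts first... Never mind, I will write enough posts and then apply, and move them all over at that point.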

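NVIDIA introduced CUDA (Compute Unified Device Architecture) in 2006. It lets NVIDIA GPUs be used for general-purpose computing and brings parallel computing from large clusters down to ordinary graphics cards, so that a laptop with a GeForce card is enough to run fairly large parallel programs.

The advantage of a graphics card over a large cluster is very low power consumption and modest cost, with outstanding performance. Take my laptop as an example: a GeForce 610M, for which the DeviceQuery program reports the following hardware parameters:

Its peak throughput comes to 48 x 0.95 = 45.6 GFLOPS (48 cores at 0.95 GHz, counting one floating-point operation per core per cycle). The laptop's CPU parameters are as follows:

The CPU (4 cores) comes to 2.5 GHz x 4 = 10 GFLOPS by the same rough count. So the graphics card offers about 4 to 5 times the compute performance of the quad-core i5 CPU, and we can put this resource to work accelerating time-consuming applications.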
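The back-of-the-envelope estimate above can be written as a tiny host-side helper. This is only a sketch: `peakGflops` and its `flopsPerCycle` parameter are names made up here, and the number it prints is a theoretical ceiling, not a measurement.

```cuda
#include <cstdio>

// Hypothetical helper (not part of CUDA): rough theoretical peak in GFLOPS.
//   cores         - number of arithmetic units (GPU small cores or CPU cores)
//   clockGHz      - clock frequency in GHz
//   flopsPerCycle - floating-point operations counted per unit per cycle
static double peakGflops(int cores, double clockGHz, int flopsPerCycle)
{
    return cores * clockGHz * flopsPerCycle;
}

int main()
{
    // Figures quoted in the text: 48 cores at 0.95 GHz vs. 4 cores at 2.5 GHz,
    // counting one operation per core per cycle, as the text does.
    double gpu = peakGflops(48, 0.95, 1);
    double cpu = peakGflops(4, 2.5, 1);
    printf("GPU %.1f GFLOPS, CPU %.1f GFLOPS, ratio %.2f\n", gpu, cpu, gpu / cpu);
    return 0;
}
```

Passing 2 for `flopsPerCycle` (counting a fused multiply-add as two operations) doubles both figures but leaves the ratio unchanged.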

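To do a good job you first need good tools. To program the GPU with CUDA, we need the following:

1. A hardware platform, that is, a graphics card. If yours is not an NVIDIA card, then I am sorry: nothing else supports CUDA.

2. An operating system. I have used Windows XP and Windows 7 without problems; this blog uses Windows 7.

3. A C compiler. VS2008 is recommended, to match this blog.

4. The CUDA compiler NVCC. It comes with the CUDA Toolkit, which can be downloaded from the official website free of charge, with no registration and no license. The latest version at the time of writing is 5.0, which is the version used in this blog.

5. Other tools (such as Visual Assist, to help with code highlighting).

With everything ready, start installing. VS2008 takes a while to install; the full edition is recommended (NVIDIA's website says the Express edition also works), and the process needs no detailed description. CUDA Toolkit 5.0 contains the NVCC compiler, design documents, sample programs, the CUDA runtime library, the CUDA header files and other essential materials.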

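After installation, this icon appears on the desktop:

Yes, that is the one. Double-click it to run and you will see a large collection of samples. Find Simple OpenGL and run it to see the effect:

Click Run at the spot marked with the yellow line on the right and you will see a beautiful 3D sine surface; drag with the left mouse button to rotate it and with the right button to zoom. If this runs, your environment is basically set up.

Possible problems:

1. You log in to another server through Remote Desktop. The server has a CUDA-capable graphics card, but CUDA programs will not run in your remote session. This is because a remote login uses your local graphics resources and cannot see the server's card, so you get the error that no CUDA-capable card was found. Solutions: (1) install two graphics cards in the remote server, one for display only and the other for computing; (2) do not log in through a graphical interface; use a command-line interface such as telnet instead.

2. Two or more graphics cards support CUDA; how do you tell which one the program runs on? You have to control this in your program by choosing a card that meets certain criteria, such as a higher clock frequency, more video memory, or a higher compute version. The details are covered in a later post.

That is enough for now. In the next section we look at how to program the GPU in VS2008.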

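Picking up where we left off: since the samples ran successfully, the next step is to understand how each part of a sample is implemented. We start simple, of course. Most programming languages begin with a hello world example, but our graphics card cannot talk; it can only do simple arithmetic. So the most fitting hello world for CUDA, I think, is vector addition.

Open VS2008 and choose File->New->Project. The dialog below pops up; set it as follows:

Then click OK to go straight to the project view.

In the project there is only a single .cu file, with the following content: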

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size);
__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}
int main()
{
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };
    // Add vectors in parallel.
    cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }
    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
           c[0], c[1], c[2], c[3], c[4]);
    // cudaThreadExit must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaThreadExit();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaThreadExit failed!");
        return 1;
    }
    return 0;
}
// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size)
{
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus;
    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
        goto Error;
    }
    // Allocate GPU buffers for three vectors (two input, one output).
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
    cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
    // Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
    cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
    // Launch a kernel on the GPU with one thread for each element.
    addKernel<<<1, size>>>(dev_c, dev_a, dev_b);
    // cudaThreadSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaThreadSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaThreadSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }
    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    return cudaStatus;
}

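As you can see, a CUDA program is not much different from a C program; it just adds some library functions whose names start with "cuda" and one specially declared function:

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

This is the function that runs on the GPU. It is called a kernel function; be careful not to confuse it with operating-system kernel functions.

Press F7 to build, and we get the following output: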

1>------ Build started: Project: cuda_helloworld, Configuration: Debug Win32 ------
1>Compiling with CUDA Build Rule...
1>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\\bin\nvcc.exe" -G -gencode=arch=compute_10,code=\"sm_10,compute_10\" -gencode=arch=compute_20,code=\"sm_20,compute_20\" --machine 32 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT " -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\\include" -maxrregcount=0 --compile -o "Debug/kernel.cu.obj" kernel.cu
1>tmpxft_000000ec_00000000-8_kernel.compute_10.cudafe1.gpu
1>tmpxft_000000ec_00000000-14_kernel.compute_10.cudafe2.gpu
1>tmpxft_000000ec_00000000-5_kernel.compute_20.cudafe1.gpu
1>tmpxft_000000ec_00000000-17_kernel.compute_20.cudafe2.gpu
1>kernel.cu
1>kernel.cu
1>tmpxft_000000ec_00000000-8_kernel.compute_10.cudafe1.cpp
1>tmpxft_000000ec_00000000-24_kernel.compute_10.ii
1>Linking...
1>Embedding manifest...
1>Performing Post-Build Event...
1>copy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\\bin\cudart*.dll" "C:\Users\DongXiaoman\Documents\Visual Studio 2008\Projects\cuda_helloworld\Debug"
1>C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\\bin\cudart32_50_35.dll
1>C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\\bin\cudart64_50_35.dll
1>2 file(s) copied.
1>Build log was saved at "file://c:\Users\DongXiaoman\Documents\Visual Studio 2008\Projects\cuda_helloworld\cuda_helloworld\Debug\BuildLog.htm"
1>cuda_helloworld - 0 error(s), 105 warning(s)
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========

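As we can see, compiling a .cu file relies on the nvcc tool. Its detailed usage is covered in a later post.

Run the program directly and the result looks like this:

If the output is correct, our first program is a success!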

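When you are new to CUDA, have run a few official samples and read other people's code, it does not seem hard. But when you sit down to write code yourself, you never know what to do first and what next, or which topic to start learning from. That is when you need a book or tutorial that guides you step by step until you truly master it.

Among books on CUDA, the ones I think are good are these:

Beginners can start with 《GPU高效能程式設計CUDA實戰》 (the Chinese edition of CUDA by Example), written by Americans. It is very hands-on, but do not expect to understand all of it (PS: I still do not really understand some of its concepts); that will not hold back further study. For a more complete picture of CUDA, 《GPGPU程式設計技術》 introduces general-purpose GPU programming strategies in an objective and detailed way; after reading it you will understand graphics cards more deeply and see through the mystery of the GPU. 《OpenGL程式設計指南》 (the OpenGL Programming Guide) is there purely for the fun of graphics interaction and can be read selectively. 《GPU高效能運算之CUDA》 was given to me by a senior labmate; it suits quick lookup of key techniques and concepts (it feels like a translation of the official programming guide).

These guides are not enough. The problems we run into in real projects will certainly not be found in these books, so we also need the following tools:

This folder holds the manuals for many tools, such as CUDA_GDB, Nsight and CUDA_Profiler, which help with debugging. It also documents some useful libraries: CUFFT is dedicated to fast Fourier transforms, CUBLAS to linear algebra (matrix and vector computation), and CUSPARSE to sparse matrix representation and computation. Using these libraries makes algorithm design easier and development faster. There are also some introductory tutorials worth reading, which will bring you into closer contact with the NVCC compiler.

That is all for the preface. I plan to cover CUDA in the following order: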

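1. Getting to know the device

2. Thread parallelism

3. Block parallelism

4. Stream parallelism

5. Thread communication

6. Thread communication example: reduction

7. Memory model

8. Constant memory

9. Texture memory

10. Page-locked host memory

11. Graphics interoperability

12. Optimization guidelines

13. CUDA and MATLAB interface

14. CUDA and MFC interface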

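The previous three sections gave a brief introduction to CUDA; with this section we really start programming.

First, beginners should have a solid understanding of the device they use, which helps a great deal later when learning to optimize parallel programs. Detailed hardware parameters can be found in the books from the previous section and in the official materials, but if that still does not feel concrete enough, we can obtain them ourselves.

Taking the sample from section two as a template, the part of the code we change slightly is as follows: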

// Add vectors in parallel.
cudaError_t cudaStatus;
int num = 0;
cudaDeviceProp prop;
cudaStatus = cudaGetDeviceCount(&num);
for (int i = 0; i < num; i++)
{
    cudaGetDeviceProperties(&prop, i);
}
cudaStatus = addWithCuda(c, a, b, arraySize);

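The purpose of this change is to let our program obtain the number of devices and their properties automatically by calling CUDA API functions. As the saying goes, know yourself and know your enemy, and you will never be defeated.

cudaError_t is the CUDA error type; it is an enumeration whose values are integers.

cudaDeviceProp is the device property structure. Its definition can be found in the CUDA Toolkit installation directory; on my machine the path is C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include\driver_types.h, where the definition reads: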

/**
* CUDA device properties
*/
struct __device_builtin__ cudaDeviceProp
{
char name[256]; /**< ASCII string identifying device */
size_t totalGlobalMem; /**< Global memory available on device in bytes */
size_t sharedMemPerBlock; /**< Shared memory available per block in bytes */
int regsPerBlock; /**< 32-bit registers available per block */
int warpSize; /**< Warp size in threads */
size_t memPitch; /**< Maximum pitch in bytes allowed by memory copies */
int maxThreadsPerBlock; /**< Maximum number of threads per block */
int maxThreadsDim[3]; /**< Maximum size of each dimension of a block */
int maxGridSize[3]; /**< Maximum size of each dimension of a grid */
int clockRate; /**< Clock frequency in kilohertz */
size_t totalConstMem; /**< Constant memory available on device in bytes */
int major; /**< Major compute capability */
int minor; /**< Minor compute capability */
size_t textureAlignment; /**< Alignment requirement for textures */
size_t texturePitchAlignment; /**< Pitch alignment requirement for texture references bound to pitched memory */
int deviceOverlap; /**< Device can concurrently copy memory and execute a kernel. Deprecated. Use instead asyncEngineCount. */
int multiProcessorCount; /**< Number of multiprocessors on device */
int kernelExecTimeoutEnabled; /**< Specified whether there is a run time limit on kernels */
int integrated; /**< Device is integrated as opposed to discrete */
int canMapHostMemory; /**< Device can map host memory with cudaHostAlloc/cudaHostGetDevicePointer */
int computeMode; /**< Compute mode (See ::cudaComputeMode) */
int maxTexture1D; /**< Maximum 1D texture size */
int maxTexture1DMipmap; /**< Maximum 1D mipmapped texture size */
int maxTexture1DLinear; /**< Maximum size for 1D textures bound to linear memory */
int maxTexture2D[2]; /**< Maximum 2D texture dimensions */
int maxTexture2DMipmap[2]; /**< Maximum 2D mipmapped texture dimensions */
int maxTexture2DLinear[3]; /**< Maximum dimensions (width, height, pitch) for 2D textures bound to pitched memory */
int maxTexture2DGather[2]; /**< Maximum 2D texture dimensions if texture gather operations have to be performed */
int maxTexture3D[3]; /**< Maximum 3D texture dimensions */
int maxTextureCubemap; /**< Maximum Cubemap texture dimensions */
int maxTexture1DLayered[2]; /**< Maximum 1D layered texture dimensions */
int maxTexture2DLayered[3]; /**< Maximum 2D layered texture dimensions */
int maxTextureCubemapLayered[2];/**< Maximum Cubemap layered texture dimensions */
int maxSurface1D; /**< Maximum 1D surface size */
int maxSurface2D[2]; /**< Maximum 2D surface dimensions */
int maxSurface3D[3]; /**< Maximum 3D surface dimensions */
int maxSurface1DLayered[2]; /**< Maximum 1D layered surface dimensions */
int maxSurface2DLayered[3]; /**< Maximum 2D layered surface dimensions */
int maxSurfaceCubemap; /**< Maximum Cubemap surface dimensions */
int maxSurfaceCubemapLayered[2];/**< Maximum Cubemap layered surface dimensions */
size_t surfaceAlignment; /**< Alignment requirements for surfaces */
int concurrentKernels; /**< Device can possibly execute multiple kernels concurrently */
int ECCEnabled; /**< Device has ECC support enabled */
int pciBusID; /**< PCI bus ID of the device */
int pciDeviceID; /**< PCI device ID of the device */
int pciDomainID; /**< PCI domain ID of the device */
int tccDriver; /**< 1 if device is a Tesla device using TCC driver, 0 otherwise */
int asyncEngineCount; /**< Number of asynchronous engines */
int unifiedAddressing; /**< Device shares a unified address space with the host */
int memoryClockRate; /**< Peak memory clock frequency in kilohertz */
int memoryBusWidth; /**< Global memory bus width in bits */
int l2CacheSize; /**< Size of L2 cache in bytes */
int maxThreadsPerMultiProcessor;/**< Maximum resident threads per multiprocessor */
};

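The trailing comments already explain what each field means. Some of the terms may still be hard for beginners, but that does not matter; for now we only need to look at the following fields:

name: the device name;

totalGlobalMem: the amount of video memory (global memory), in bytes;

major, minor: the CUDA version number of the device (its compute capability); there are versions 1.1, 1.2, 1.3, 2.0, 2.1 and so on;

clockRate: the GPU clock frequency, in kHz;

multiProcessorCount: the number of "big cores" on the GPU. One big core (properly called a streaming multiprocessor, SM) contains several "small cores" (streaming processors, SP).

Build and run. In the VS2008 project, set a breakpoint at the cudaGetDeviceProperties() call, step over that function, then open the Watch window, switch to the Auto tab and expand the +. On my laptop I get the following result: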

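As you can see, the device name is GeForce 610M, with 1 GB of video memory, device version 2.1 (fairly high-end, haha), a clock frequency of 950 MHz (note that the value 950000 is in kHz), and 1 big core. On some high-performance GPUs (such as the Tesla and Kepler series) the number of big cores can reach dozens or even over a hundred, allowing much larger-scale parallel processing.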
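If stepping through the debugger feels clumsy, the same fields can simply be printed. The following is a minimal sketch rather than part of the original project; it reads the clock through `cudaDeviceGetAttribute` with `cudaDevAttrClockRate`, which is my own choice so that the code does not rely on the `clockRate` member being present in every toolkit's `cudaDeviceProp`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int num = 0;
    cudaError_t status = cudaGetDeviceCount(&num);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(status));
        return 1;
    }
    for (int i = 0; i < num; i++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess)
            continue;                               // skip devices we cannot query
        int clockKHz = 0;                           // reported in kHz
        cudaDeviceGetAttribute(&clockKHz, cudaDevAttrClockRate, i);
        printf("Device %d: %s\n", i, prop.name);
        printf("  global memory      : %.0f MB\n", prop.totalGlobalMem / (1024.0 * 1024.0));
        printf("  compute capability : %d.%d\n", prop.major, prop.minor);
        printf("  clock rate         : %d MHz\n", clockKHz / 1000);
        printf("  multiprocessors    : %d\n", prop.multiProcessorCount);
    }
    return 0;
}
```

The output depends entirely on the machine it runs on, so no sample output is shown here.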

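PS: While reading the SDK code today I found a function in helper_cuda.h that looks up the number of small cores per big core from the CUDA device version. It looks very useful and worth borrowing in future programs, so I copy it here: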

// Beginning of GPU Architecture definitions
inline int _ConvertSMVer2Cores(int major, int minor)
{
    // Defines for GPU Architecture types (using the SM version to determine the # of cores per SM
    typedef struct
    {
        int SM; // 0xMm (hexadecimal notation), M = SM Major version, and m = SM minor version
        int Cores;
    } sSMtoCores;
    sSMtoCores nGpuArchCoresPerSM[] =
    {
        { 0x10, 8 },   // Tesla Generation (SM 1.0) G80 class
        { 0x11, 8 },   // Tesla Generation (SM 1.1) G8x class
        { 0x12, 8 },   // Tesla Generation (SM 1.2) G9x class
        { 0x13, 8 },   // Tesla Generation (SM 1.3) GT200 class
        { 0x20, 32 },  // Fermi Generation (SM 2.0) GF100 class
        { 0x21, 48 },  // Fermi Generation (SM 2.1) GF10x class
        { 0x30, 192 }, // Kepler Generation (SM 3.0) GK10x class
        { 0x35, 192 }, // Kepler Generation (SM 3.5) GK11x class
        { -1, -1 }
    };
    int index = 0;
    while (nGpuArchCoresPerSM[index].SM != -1)
    {
        if (nGpuArchCoresPerSM[index].SM == ((major << 4) + minor))
        {
            return nGpuArchCoresPerSM[index].Cores;
        }
        index++;
    }
    // If we don't find the values, we default use the previous one to run properly
    printf("MapSMtoCores for SM %d.%d is undefined. Default to use %d Cores/SM\n", major, minor, nGpuArchCoresPerSM[7].Cores);
    return nGpuArchCoresPerSM[7].Cores;
}
// end of GPU Architecture definitions

So one big core of a version 2.1 device has 48 small cores, while one big core of the version 3.0 and 3.5 devices in this table has 192!
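Multiplying the per-SM figure by the SM count gives the total number of small cores. The sketch below uses a trimmed-down lookup of my own, `coresPerSM`, that covers only the versions in the table quoted above and returns -1 for anything else; it is not the SDK function itself.

```cuda
#include <cstdio>

// Hypothetical, trimmed-down lookup based on the table quoted above.
// Returns -1 for versions the table does not list.
static int coresPerSM(int major, int minor)
{
    switch ((major << 4) + minor) {
        case 0x10: case 0x11: case 0x12: case 0x13: return 8;
        case 0x20: return 32;
        case 0x21: return 48;
        case 0x30: case 0x35: return 192;
        default:   return -1;
    }
}

int main()
{
    // Values reported for the GeForce 610M in the text: version 2.1, 1 SM.
    int major = 2, minor = 1, smCount = 1;
    int perSM = coresPerSM(major, minor);
    if (perSM > 0)
        printf("SM %d.%d: %d SM x %d cores = %d cores\n",
               major, minor, smCount, perSM, smCount * perSM);
    else
        printf("SM %d.%d is not in the table\n", major, minor);
    return 0;
}
```

For the laptop in the text this gives 1 x 48 = 48 small cores, the same 48 used in the throughput estimate in the first section.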

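As mentioned earlier, when the computer has several CUDA-capable graphics cards, how do we tell which one the program runs on? Let us look at how the addWithCuda function does it.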

cudaError_t cudaStatus;
// Choose which GPU to run on, change this on a multi-GPU system.
cudaStatus = cudaSetDevice(0);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
    goto Error;
}

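It calls cudaSetDevice(0), where 0 is the number of the first device found; if there are several devices, they are numbered 0, 1, 2, and so on.

Now look at the code added in this section. The function cudaGetDeviceCount(&num) returns the total number of devices, so the device numbers we can choose from are 0, 1, ..., num-1. We can enumerate the devices one by one, fetch the properties of each with cudaGetDeviceProperties(&prop, i), apply some sorting or filtering rule to find the device number opt that best suits our application, and then call cudaSetDevice(opt) to select it. The selection criteria can be based on processing power, compute version, device name and so on. These APIs are used again later when we discuss concurrent streams.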
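The enumerate-then-select procedure just described could look roughly like the sketch below. The ranking rule in `betterThan` (higher compute version first, then more global memory) and the function name `pickDevice` are assumptions chosen for illustration; a real application would rank devices by whatever matters to it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical ranking rule: higher compute version wins, then more memory.
static bool betterThan(const cudaDeviceProp &a, const cudaDeviceProp &b)
{
    if (a.major != b.major) return a.major > b.major;
    if (a.minor != b.minor) return a.minor > b.minor;
    return a.totalGlobalMem > b.totalGlobalMem;
}

// Returns the number of the preferred device, or -1 if none can be queried.
static int pickDevice()
{
    int num = 0;
    if (cudaGetDeviceCount(&num) != cudaSuccess) return -1;
    int opt = -1;
    cudaDeviceProp best;                  // only read once opt >= 0
    for (int i = 0; i < num; i++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) continue;
        if (opt < 0 || betterThan(prop, best)) {
            best = prop;
            opt = i;
        }
    }
    return opt;
}

int main()
{
    int opt = pickDevice();
    if (opt < 0) {
        fprintf(stderr, "No CUDA-capable device found\n");
        return 1;
    }
    cudaError_t status = cudaSetDevice(opt);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed: %s\n", cudaGetErrorString(status));
        return 1;
    }
    printf("Using device %d\n", opt);
    return 0;
}
```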

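We should all be familiar with multithreading. In an operating system, the process is the basic unit of resource allocation, while the thread is the basic unit of CPU time scheduling (assuming a single CPU here).

Carrying the thread concept over to CUDA programming, we can regard the thread as the smallest unit that executes a CUDA program. Remember the kernel function in the project we created earlier? On the GPU, every thread runs that kernel function once.

But threads are scheduled very differently on a GPU than on a CPU. On the CPU, threads have priorities and are scheduled from high to low, with round-robin time slicing among threads of equal priority. On the GPU, threads have no notion of priority and all get an equal chance. A thread is in one of only two states, waiting for resources or executing: if its resources are not ready it waits, and as soon as they are ready it runs. When GPU resources are plentiful, all threads execute concurrently and the speedup comes very close to the theoretical ratio. When GPU resources are fewer than the total number of threads, some threads have to wait for earlier ones to release resources, and execution becomes partly serialized.

Let us reuse the code from the previous section. Very little changes, so here it is again: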

#include "cuda_runtime.h" // CUDA runtime API
#include "device_launch_parameters.h"
#include <stdio.h>
cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size);
__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}
int main()
{
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };
    // Add vectors in parallel.
    cudaError_t cudaStatus;
    int num = 0;
    cudaDeviceProp prop;
    cudaStatus = cudaGetDeviceCount(&num);
    for (int i = 0; i < num; i++)
    {
        cudaGetDeviceProperties(&prop, i);
    }
    cudaStatus = addWithCuda(c, a, b, arraySize);
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }
    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n", c[0], c[1], c[2], c[3], c[4]);
    // cudaThreadExit must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaThreadExit();
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaThreadExit failed!");
        return 1;
    }
    return 0;
}
// Focus on understanding this function.
cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size)
{
    int *dev_a = 0; // device-side (GPU) data pointers
    int *dev_b = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus; // status indicator
    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0); // select the device to run on
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
        goto Error;
    }
    // Allocate device (GPU) memory.
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
    cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
    // Copy the input data to the GPU.
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
    cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
    // Launch the kernel.
    addKernel<<<1, size>>>(dev_c, dev_a, dev_b);
    // cudaThreadSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaThreadSynchronize(); // synchronize
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaThreadSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }
    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost); // copy the result back to the host
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
Error:
    cudaFree(dev_c); // free device (GPU) memory
    cudaFree(dev_a);
    cudaFree(dev_b);
    return cudaStatus;
}

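The line addKernel<<<1, size>>>(dev_c, dev_a, dev_b); is where the kernel is launched, and the call looks a little different from C. <<<>>> is the execution configuration: the 1 means that only one thread group (also called a thread block, Block) is allocated, and size means that each block has size threads (Thread). In this program size is 5, the number of elements passed in, so the kernel function runs once on each of 5 GPU threads, 5 times in total. How do these 5 threads know their own "identity"? Through the built-in variable threadIdx. It has three components x, y, z (its type is uint3), and its range is set by the second <<<>>> parameter, which is a dim3. Since we passed a single number, only the x value is meaningful. With the kernel line int i = threadIdx.x; each thread obtains its own id and thereby finds the task it should carry out.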
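To see the y component of threadIdx come into play, the second launch parameter can be given as a `dim3`. The following is an illustrative sketch, not part of the original project: the kernel `writeIds`, the wrapper `runWriteIds` and the 4 x 3 block shape are all made up here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define W 4   // illustrative block width
#define H 3   // illustrative block height

// Each thread stores its own flattened index within a 2-D block.
__global__ void writeIds(int *out)
{
    int i = threadIdx.y * blockDim.x + threadIdx.x;
    out[i] = i;
}

// Hypothetical wrapper: launches one W x H block and copies the ids to host.
static cudaError_t runWriteIds(int *host)
{
    int *dev = 0;
    cudaError_t status = cudaMalloc((void**)&dev, W * H * sizeof(int));
    if (status != cudaSuccess) return status;
    dim3 block(W, H);                 // x = W, y = H, z defaults to 1
    writeIds<<<1, block>>>(dev);
    // cudaMemcpy waits for the kernel to finish before copying.
    status = cudaMemcpy(host, dev, W * H * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return status;
}

int main()
{
    int host[W * H] = { 0 };
    cudaError_t status = runWriteIds(host);
    if (status != cudaSuccess) {
        fprintf(stderr, "%s\n", cudaGetErrorString(status));
        return 1;
    }
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) printf("%3d", host[y * W + x]);
        printf("\n");
    }
    return 0;
}
```

Each row of the printout corresponds to one value of threadIdx.y and each column to one value of threadIdx.x.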

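In the next section we introduce block parallelism.

Having reused the same version of the code so many times, I feel a little guilty, so this time I will make a bigger change. Keep your eyes open.

Block parallelism is comparable to running multiple processes in an operating system. As mentioned in the previous section, CUDA has the concept of a thread group (thread block): a group of threads is organized together, shares a portion of the resources, and is scheduled internally. Thread blocks have nothing to do with one another, which lends itself to coarser-grained parallelism. We change the code from the previous section into a block-parallel version as follows: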

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size);
__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = blockIdx.x;
    c[i] = a[i] + b[i];
}
int main()
{
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };
    // Add vectors in parallel.
    cudaError_t cudaStatus;
    int num = 0;
    cudaDeviceProp prop;
    cudaStatus = cudaGetDeviceCount(&num);
    for (int i = 0; i < num; i++)
    {
        cudaGetDeviceProperties(&prop, i);
    }
    cudaStatus = addWithCuda(c, a, b, arraySize);
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }
    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n", c[0], c[1], c[2], c[3], c[4]);
    // cudaThreadExit must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaThreadExit();
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaThreadExit failed!");
        return 1;
    }
    return 0;
}
// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size)
{
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus;
    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
        goto Error;
    }
    // Allocate GPU buffers for three vectors (two input, one output).
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
    cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
    // Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
    cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
    // Launch a kernel on the GPU with one single-thread block for each element.
    addKernel<<<size, 1>>>(dev_c, dev_a, dev_b);
    // cudaThreadSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaThreadSynchronize();
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaThreadSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }
    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    return cudaStatus;
}

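Compared with the previous section, only two lines change: the index line in the kernel and the kernel launch. In <<<>>> the first parameter becomes size and the second becomes 1, meaning that we allocate size thread blocks, each containing just 1 thread, so there are still 5 threads in total. These 5 threads are independent of one another and each runs the kernel function to produce its result. What differs from the previous section is how each thread obtains its id: int i = blockIdx.x; which is the thread block ID.

So a student asks: what is the difference between thread parallelism and block parallelism?

Thread parallelism is fine-grained, with efficient scheduling. Block parallelism is coarse-grained: resources have to be reallocated each time a block is scheduled, and sometimes there is only one set of resources, so all the thread blocks have to queue up and run serially. (The hardware also runs threads in groups of warpSize, so a block that holds a single thread leaves most of such a group idle.)

Does that mean we should always use thread parallelism and avoid block parallelism as far as possible?

Of course not. Sometimes a task can be handled by divide and conquer: split one large problem into several smaller ones and give each to one thread block. Inside a block we use fine-grained thread parallelism, and between blocks coarse-grained parallelism. This makes full use of the hardware and reduces the computational complexity of the thread-level part. Decomposing sensibly to reduce the problem size pays off fully in applications such as matrix multiplication and vector inner products.

In practice the two are usually combined. The organization of thread blocks and threads is shown in the figure below.

Several thread blocks are organized into a Grid, called a thread grid (going from one-dimensional threads to two-dimensional thread blocks to a three-dimensional grid; it feels quite spatial).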
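Combining the two levels usually means computing a global index from blockIdx and threadIdx together. The sketch below reworks the vector addition that way. It is my own illustration rather than the author's next listing: the extra parameter `n`, the bounds check, the wrapper `addOnGpu` and the choice of 2 threads per block are all assumptions made for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One element per thread: blockIdx selects the chunk, threadIdx the slot in it.
__global__ void addKernel(int *c, const int *a, const int *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                        // the last block may be only partly used
        c[i] = a[i] + b[i];
}

// Hypothetical wrapper: adds two host arrays of length n on the GPU.
static cudaError_t addOnGpu(int *c, const int *a, const int *b, int n, int threadsPerBlock)
{
    int *dev_a = 0, *dev_b = 0, *dev_c = 0;
    size_t bytes = n * sizeof(int);
    cudaError_t status = cudaMalloc((void**)&dev_a, bytes);
    if (status == cudaSuccess) status = cudaMalloc((void**)&dev_b, bytes);
    if (status == cudaSuccess) status = cudaMalloc((void**)&dev_c, bytes);
    if (status == cudaSuccess) status = cudaMemcpy(dev_a, a, bytes, cudaMemcpyHostToDevice);
    if (status == cudaSuccess) status = cudaMemcpy(dev_b, b, bytes, cudaMemcpyHostToDevice);
    if (status == cudaSuccess) {
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // round up
        addKernel<<<blocks, threadsPerBlock>>>(dev_c, dev_a, dev_b, n);
        status = cudaGetLastError();  // catches an invalid launch configuration
    }
    if (status == cudaSuccess) status = cudaMemcpy(c, dev_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_a);                  // freeing a null pointer is harmless
    cudaFree(dev_b);
    cudaFree(dev_c);
    return status;
}

int main()
{
    const int n = 5;
    const int a[n] = { 1, 2, 3, 4, 5 };
    const int b[n] = { 10, 20, 30, 40, 50 };
    int c[n] = { 0 };
    // 2 threads per block gives 3 blocks; the last block has one idle thread.
    cudaError_t status = addOnGpu(c, a, b, n, 2);
    if (status != cudaSuccess) {
        fprintf(stderr, "%s\n", cudaGetErrorString(status));
        return 1;
    }
    printf("{%d,%d,%d,%d,%d}\n", c[0], c[1], c[2], c[3], c[4]);
    return 0;
}
```

The bounds check matters whenever the element count is not a multiple of the block size, as is the case here.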

In the next section we introduce stream parallelism, a higher level of parallelism.

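So far we have not discussed program structure, and some of you may be impatient to know how a CUDA program actually executes. So before introducing streams, this section briefly describes the structure of a CUDA program.

CUDA source files have the suffix .cu. Some editors may not recognize files with this suffix; in VS2008 we can go to Tools->Options->Text Editor->File Extension and add the cu suffix under VC++, as shown below:

A .cu file contains both CPU code (called host code) and GPU code (called device code). How are the two told apart? By declaration: every function carrying the "__global__" or "__device__" prefix is device code that runs on the GPU. The difference is that a __global__ function can be called from host code, while a __device__ function can only be called from device code.

Functions with no prefix are host code. Host code can be declared explicitly with the __host__ prefix. Device code has to be compiled by NVCC, while host code only needs the host compiler (such as cl.exe in VS2008, or GCC on Linux). Host code mainly handles necessary steps such as initializing the device environment and transferring data; device code only does the computation.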
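The qualifiers described above can be seen side by side in a small example. This is a sketch with made-up function names (`square`, `addOne`, `compute`); it also uses the combination `__host__ __device__`, which the text does not cover, to compile one function for both sides.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __device__: runs on the GPU and can only be called from device code.
__device__ int square(int x) { return x * x; }

// __host__ __device__ (an addition to the text): compiled for both CPU and GPU.
__host__ __device__ int addOne(int x) { return x + 1; }

// __global__: runs on the GPU and is launched from host code.
__global__ void compute(int *out)
{
    int i = threadIdx.x;
    out[i] = addOne(square(i));
}

// No qualifier (equivalent to __host__): an ordinary CPU function.
int main()
{
    const int n = 4;
    int host[n] = { 0 };
    int *dev = 0;
    if (cudaMalloc((void**)&dev, n * sizeof(int)) != cudaSuccess) return 1;
    compute<<<1, n>>>(dev);
    cudaError_t status = cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (status != cudaSuccess) {
        fprintf(stderr, "%s\n", cudaGetErrorString(status));
        return 1;
    }
    for (int i = 0; i < n; i++)
        printf("i=%d  device: %d  host check: %d\n", i, host[i], addOne(i * i));
    return 0;
}
```

Calling `square` from `main` would be rejected by the compiler, which is a quick way to convince yourself of the rule.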

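Host code contains functions whose names begin with "cuda". These all belong to the CUDA Runtime API; they handle tasks such as device initialization, memory allocation and memory copies. The functions cudaGetDeviceCount(), cudaGetDeviceProperties() and cudaSetDevice() that we used earlier when getting to know the device are all runtime API calls. We do not have to memorize the exact parameter declarations of these functions; the official tools from section three make them easy to look up. Let us open this file:

Once it is open, type a runtime function into the PDF search box, for example cudaMemcpy. The result looks like this:

As you can see, the parameters of this API function are: first the destination, second the source, third the number of bytes, and fourth the kind of copy. If the kind is unfamiliar, just click the hyperlink for a detailed explanation:

So this API can copy memory from host to host, host to device, device to host, and device to device. We can also see that the API manual makes it very convenient to look up the functions we need, so keep it open whenever you write CUDA programs and be ready to consult it; this greatly improves programming efficiency.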
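All four copy directions can be exercised in one round trip. The sketch below is only an illustration; the function name `roundTrip`, the buffer names and the length N are made up for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 5   // illustrative array length

// Hypothetical round trip: host -> host -> device -> device -> host.
static cudaError_t roundTrip(int *dst, const int *src)
{
    const size_t bytes = N * sizeof(int);
    int tmp[N];
    int *dev_x = 0, *dev_y = 0;
    cudaError_t status = cudaMalloc((void**)&dev_x, bytes);
    if (status == cudaSuccess) status = cudaMalloc((void**)&dev_y, bytes);
    // Arguments: destination, source, size in bytes, kind of copy.
    if (status == cudaSuccess) status = cudaMemcpy(tmp, src, bytes, cudaMemcpyHostToHost);
    if (status == cudaSuccess) status = cudaMemcpy(dev_x, tmp, bytes, cudaMemcpyHostToDevice);
    if (status == cudaSuccess) status = cudaMemcpy(dev_y, dev_x, bytes, cudaMemcpyDeviceToDevice);
    if (status == cudaSuccess) status = cudaMemcpy(dst, dev_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_x);                  // freeing a null pointer is harmless
    cudaFree(dev_y);
    return status;
}

int main()
{
    const int src[N] = { 1, 2, 3, 4, 5 };
    int dst[N] = { 0 };
    cudaError_t status = roundTrip(dst, src);
    if (status != cudaSuccess) {
        fprintf(stderr, "%s\n", cudaGetErrorString(status));
        return 1;
    }
    printf("%d %d %d %d %d\n", dst[0], dst[1], dst[2], dst[3], dst[4]);
    return 0;
}
```

If the kind does not match where the pointers actually live, the call may fail or misbehave, so the fourth argument deserves as much care as the first two.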

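Now for today's topic: stream parallelism.

We have already covered thread parallelism and block parallelism, and know that the former is fine-grained and the latter coarse-grained. We also know how CUDA organizes threads, namely the Grid-Block-Thread structure. A group of threads working in parallel can be organized into a block, and a group of blocks working in parallel into a Grid. A natural question follows: a Grid is just one grid, so can we use several grids to work in parallel? The answer is to use streams.

Streams make it possible to run several kernel functions on one device. With block parallelism or thread parallelism, every thread runs the same kernel function (same code, same arguments). With stream parallelism we can run different kernel functions, or pass different arguments to the same kernel function, achieving task-level parallelism.

A stream in CUDA is represented by the cudaStream_t type, and the following APIs are used: cudaStreamCreate(cudaStream_t * s) creates a stream, cudaStreamDestroy(cudaStream_t s) destroys one, cudaStreamSynchronize() synchronizes a single stream, cudaDeviceSynchronize() synchronizes all streams on the whole device, and cudaStreamQuery() checks whether the work in a stream has finished. The API manual gives the exact meaning of each.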
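The life cycle of a single stream, using the calls just listed, might look like the sketch below. The kernel `fill` and the wrapper `fillOnStream` are made up for illustration, and the four-parameter launch syntax used here is the one explained after the listing that follows in the text.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 5   // illustrative array length

__global__ void fill(int *out, int value)
{
    out[threadIdx.x] = value;
}

// Hypothetical wrapper: fills N ints on a stream of its own and copies them back.
static cudaError_t fillOnStream(int *host, int value)
{
    int *dev = 0;
    cudaStream_t s;
    cudaError_t status = cudaMalloc((void**)&dev, N * sizeof(int));
    if (status != cudaSuccess) return status;
    status = cudaStreamCreate(&s);
    if (status != cudaSuccess) { cudaFree(dev); return status; }

    fill<<<1, N, 0, s>>>(dev, value);         // queue the kernel on stream s

    // cudaStreamQuery does not block; it only reports the stream's state.
    cudaError_t q = cudaStreamQuery(s);
    printf("right after launch: %s\n",
           q == cudaSuccess ? "finished" :
           q == cudaErrorNotReady ? "still running" : "error");

    status = cudaStreamSynchronize(s);        // block until stream s is idle
    if (status == cudaSuccess)
        status = cudaMemcpy(host, dev, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaStreamDestroy(s);
    cudaFree(dev);
    return status;
}

int main()
{
    int host[N] = { 0 };
    cudaError_t status = fillOnStream(host, 7);
    if (status != cudaSuccess) {
        fprintf(stderr, "%s\n", cudaGetErrorString(status));
        return 1;
    }
    printf("%d %d %d %d %d\n", host[0], host[1], host[2], host[3], host[4]);
    return 0;
}
```

Which of the two states the query reports depends on timing, so the first line of output can differ from run to run.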

Below we redo the task from the previous two examples with streams. It is still the {1,2,3,4,5} + {10,20,30,40,50} = {11,22,33,44,55} example. The code is as follows:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size);
__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = blockIdx.x;
    c[i] = a[i] + b[i];
}
int main()
{
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };
    // Add vectors in parallel.
    cudaError_t cudaStatus;
    int num = 0;
    cudaDeviceProp prop;
    cudaStatus = cudaGetDeviceCount(&num);
    for (int i = 0; i < num; i++)
    {
        cudaGetDeviceProperties(&prop, i);
    }
    cudaStatus = addWithCuda(c, a, b, arraySize);
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }
    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n", c[0], c[1], c[2], c[3], c[4]);
    // cudaThreadExit must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaThreadExit();
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaThreadExit failed!");
        return 1;
    }
    return 0;
}
// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size)
{
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus;
    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
        goto Error;
    }
    // Allocate GPU buffers for three vectors (two input, one output).
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
    cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
    // Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
    cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
    cudaStream_t stream[5];
    for (int i = 0; i < 5; i++)
    {
        cudaStreamCreate(&stream[i]); // create the streams
    }
    // Launch one single-thread kernel on each stream, one per element.
    for (int i = 0; i < 5; i++)
    {
        addKernel<<<1, 1, 0, stream[i]>>>(dev_c + i, dev_a + i, dev_b + i); // run on stream i
    }
    // cudaDeviceSynchronize waits for all streams to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto DestroyStreams;
    }
    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMemcpy failed!");
        goto DestroyStreams;
    }
DestroyStreams:
    // Reached only after the streams exist; earlier failures jump straight to Error.
    for (int i = 0; i < 5; i++)
    {
        cudaStreamDestroy(stream[i]); // destroy the streams
    }
Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    return cudaStatus;
}

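Notice that the kernel function is still the same as in the block-parallel version; only the launch changes. <<<>>> now takes two more parameters. The first two mean the same as in block and thread parallelism: the number of thread blocks (1 here) and the number of threads per block (also 1 here). The third, 0, is the amount of dynamically allocated shared memory per block in bytes, which we will discuss later. The fourth is the stream object, which says on which stream this kernel runs. We created 5 streams and loaded one kernel onto each, with slightly different arguments, so each kernel works on a different object. This gives task-level parallelism: when we have several unrelated tasks, we can write several kernel functions and, resources permitting, load them onto different streams and run them, achieving even coarser-grained parallelism.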
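The listing above runs the same kernel on every stream. To illustrate the "different kernels on different streams" case, here is a hedged sketch with two made-up kernels, `addKernel` and `mulKernel`, launched on two streams; the wrapper `runTwoTasks` and all buffer names are assumptions for the example. Whether the two kernels really overlap in time depends on the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 5   // illustrative array length

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

__global__ void mulKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = a[i] * b[i];
}

// Hypothetical wrapper: computes a+b and a*b as two independent tasks.
static cudaError_t runTwoTasks(int *sum, int *prod, const int *a, const int *b)
{
    const size_t bytes = N * sizeof(int);
    int *dev_a = 0, *dev_b = 0, *dev_sum = 0, *dev_prod = 0;
    cudaStream_t s[2];
    int created = 0;                  // how many streams actually exist

    cudaError_t status = cudaMalloc((void**)&dev_a, bytes);
    if (status == cudaSuccess) status = cudaMalloc((void**)&dev_b, bytes);
    if (status == cudaSuccess) status = cudaMalloc((void**)&dev_sum, bytes);
    if (status == cudaSuccess) status = cudaMalloc((void**)&dev_prod, bytes);
    if (status == cudaSuccess) status = cudaMemcpy(dev_a, a, bytes, cudaMemcpyHostToDevice);
    if (status == cudaSuccess) status = cudaMemcpy(dev_b, b, bytes, cudaMemcpyHostToDevice);
    while (status == cudaSuccess && created < 2) {
        status = cudaStreamCreate(&s[created]);
        if (status == cudaSuccess) created++;
    }
    if (status == cudaSuccess) {
        addKernel<<<1, N, 0, s[0]>>>(dev_sum, dev_a, dev_b);    // task 1
        mulKernel<<<1, N, 0, s[1]>>>(dev_prod, dev_a, dev_b);   // task 2
        status = cudaDeviceSynchronize();                       // wait for both
    }
    if (status == cudaSuccess) status = cudaMemcpy(sum, dev_sum, bytes, cudaMemcpyDeviceToHost);
    if (status == cudaSuccess) status = cudaMemcpy(prod, dev_prod, bytes, cudaMemcpyDeviceToHost);

    for (int i = 0; i < created; i++) cudaStreamDestroy(s[i]);
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_sum); cudaFree(dev_prod);
    return status;
}

int main()
{
    const int a[N] = { 1, 2, 3, 4, 5 };
    const int b[N] = { 10, 20, 30, 40, 50 };
    int sum[N] = { 0 }, prod[N] = { 0 };
    cudaError_t status = runTwoTasks(sum, prod, a, b);
    if (status != cudaSuccess) {
        fprintf(stderr, "%s\n", cudaGetErrorString(status));
        return 1;
    }
    for (int i = 0; i < N; i++)
        printf("%d + %d = %d, %d * %d = %d\n", a[i], b[i], sum[i], a[i], b[i], prod[i]);
    return 0;
}
```

Counting the streams that were actually created keeps the cleanup from destroying a handle that was never initialized.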

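That is all there is to stream parallelism. When handling a task, we can pick whichever form of parallelism suits it best.

The previous sections introduced three ways of doing parallel processing on the GPU: thread parallelism, block parallelism and stream parallelism. In all of them we stressed repeatedly that the work done by the individual threads is unrelated: no two threads overlap, and each thread minds only its own small plot, with no interest in the others, as if they did not exist...

In real applications, of course, such cases are rare. Only things like vector addition or element-wise vector multiplication have that much parallelism. In other applications, such as summing a set of numbers or finding the maximum (or minimum), the threads are no longer independent but related in some way; thread 2 may need the result of thread 1. This is where the thread communication techniques of this section come in.

Thread communication in CUDA can be implemented in three ways:

1. Shared memory;

2. Thread synchronization;

3. Atomic operations;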
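To make the three mechanisms listed above concrete before going into detail, here is a small sum of a set of numbers that happens to use all of them. It is only a sketch under my own assumptions (a fixed block size `BLOCK` of 8 and a made-up wrapper `sumOnGpu`), not the reduction example planned for a later section of this series.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 8   // threads per block; must be a power of two for the loop below

__global__ void sumKernel(int *result, const int *data, int n)
{
    __shared__ int cache[BLOCK];                  // 1. shared memory, one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (i < n) ? data[i] : 0;
    __syncthreads();                              // 2. wait until every thread has stored its value

    // Pairwise addition inside the block; the active half shrinks each round.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(result, cache[0]);              // 3. atomic update shared by all blocks
}

// Hypothetical wrapper: sums n host integers on the GPU.
static cudaError_t sumOnGpu(int *out, const int *data, int n)
{
    int *dev_data = 0, *dev_result = 0;
    cudaError_t status = cudaMalloc((void**)&dev_data, n * sizeof(int));
    if (status == cudaSuccess) status = cudaMalloc((void**)&dev_result, sizeof(int));
    if (status == cudaSuccess) status = cudaMemcpy(dev_data, data, n * sizeof(int), cudaMemcpyHostToDevice);
    if (status == cudaSuccess) status = cudaMemset(dev_result, 0, sizeof(int));
    if (status == cudaSuccess) {
        int blocks = (n + BLOCK - 1) / BLOCK;     // round up
        sumKernel<<<blocks, BLOCK>>>(dev_result, dev_data, n);
        status = cudaGetLastError();
    }
    if (status == cudaSuccess) status = cudaMemcpy(out, dev_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_data);
    cudaFree(dev_result);
    return status;
}

int main()
{
    const int n = 20;
    int data[n];
    for (int i = 0; i < n; i++) data[i] = i + 1;  // 1, 2, ..., 20
    int total = 0;
    cudaError_t status = sumOnGpu(&total, data, n);
    if (status != cudaSuccess) {
        fprintf(stderr, "%s\n", cudaGetErrorString(status));
        return 1;
    }
    printf("sum = %d (expected %d)\n", total, n * (n + 1) / 2);
    return 0;
}
```

Without the two `__syncthreads()` calls some threads could read cache entries that other threads have not written yet, which is exactly the dependency between threads described above.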

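The first two are the most commonly used. Shared memory is a special memory located inside the SM. Remember the SM? It is the streaming multiprocessor, the "big core". An SM contains not only a number of SPs (streaming processors, the "small cores") but also some high-speed cache, a register file, shared memory and so on. Its structure is shown in the figure:

As the figure shows, one SM contains M SPs, Shared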
