Repost: A Concise and Practical Introduction to CUDA

Sample code for adding 2 numbers on a GPU

Terminology: Host (a CPU and host memory), device (a GPU and device memory).

This sample code adds 2 numbers together with a GPU:

  • Define a kernel (a function to run on a GPU).
  • Allocate & initialize the host data.
  • Allocate & initialize the device data.
  • Invoke the kernel on the GPU.
  • Copy kernel output to the host.
  • Cleanup.

Define a kernel

Use the keyword __global__ to define a kernel. A kernel is a function that runs on the GPU instead of the CPU. This kernel adds 2 numbers a & b and stores the result in c.

// Kernel definition 
// Run on GPU
// Add 2 numbers and store the result in c
__global__ void add(int *a, int *b, int *c) 
{
    *c = *a + *b;
}

Allocate & initialize host data

On the host, allocate the input and output parameters for the kernel call, and initialize all input parameters.

int main(void) {
    // Allocate & initialize host data - run on the host
    int a, b, c;         // host copies of a, b, c
    a = 2;
    b = 7;
    ...
}	

Allocate and copy host data to the device

A CUDA application manages the device space memory through calls to the CUDA runtime. This includes device memory allocation and deallocation as well as data transfer between the host and device memory.

We allocate space on the device so we can copy the kernel's input (a & b) from the host to the device. We also allocate space so we can copy the result from the device back to the host later.

int main(void) {
    ...

    int *d_a, *d_b, *d_c; // device copies of a, b, c
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy a & b from the host to the device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);
    ...
}

Invoke the kernel

Invoke the kernel add with the device copies of a, b, and c as parameters.

int main(void) {
    ...
    // Launch add() kernel on GPU with parameters (d_a, d_b, d_c) 
    add<<<1,1>>>(d_a, d_b, d_c);
    ...
}

To provide data parallelism, a multithreaded CUDA application is partitioned into blocks of threads that execute independently of (and often concurrently with) each other. Each parallel invocation of add is referred to as a block. Each block has multiple threads. These blocks of threads can be scheduled on any of the available streaming multiprocessors (SMs) within the GPU. In our simple example, since we just add one pair of numbers, we only need 1 block containing 1 thread (<<<1,1>>>).

In contrast to a regular C function call, a kernel launch <<<N, M>>> executes the kernel in parallel on N blocks of M CUDA threads each. On current GPUs, a thread block may contain up to 1024 threads.
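
As an illustrative sketch of a multi-thread launch (the kernel name add_vectors and the sizes are made up), each thread computes its own index from blockIdx, blockDim, and threadIdx and handles one element:

// Sketch only: element-wise vector add with N blocks of M threads.
__global__ void add_vectors(int *a, int *b, int *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against extra threads
        c[i] = a[i] + b[i];
}

// Possible launch: 4 blocks of 256 threads cover n <= 1024 elements.
// add_vectors<<<4, 256>>>(d_a, d_b, d_c, n);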

Copy kernel output to the host

Copy the addition result from the device to the host:

    // Copy result back to the host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

Clean up

Clean up memory:

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);

Putting it together: Heterogeneous Computing

In CUDA, the host code and the device code live in a single source file.

nvcc add.cu   # Compile the source code
./a.out       # Run the code.

The following is the complete source code for our example.

// Kernel definition 
// Run on GPU
__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

int main(void) {
    // Allocate & initialize host data - run on the host
    int a, b, c;          // host copies of a, b, c
    a = 2;
    b = 7;

    int *d_a, *d_b, *d_c; // device copies of a, b, c
    // Allocate space for device copies of a, b, c
    int size = sizeof(int);
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy a & b from the host to the device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to the host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

CUDA logical model

    add<<<4,4>>>(d_a, d_b, d_c);

A CUDA application is composed of multiple blocks of threads (a grid), with each thread calling the kernel once.

In the figure below, the grid has 6 blocks with 12 threads per block.

[Figure: a grid of thread blocks (source: Nvidia)]
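
As a rough sketch (the grid and block shapes here are assumptions chosen only to match those counts), such a configuration can be expressed with dim3 dimensions:

// Sketch: dim3 describes 1D, 2D, or 3D grids and blocks.
dim3 grid(3, 2);    // 3 * 2 = 6 blocks in the grid
dim3 block(4, 3);   // 4 * 3 = 12 threads per block
add<<<grid, block>>>(d_a, d_b, d_c);

// Inside a kernel, each thread can derive a unique flat index, e.g.:
// int blockId  = blockIdx.y * gridDim.x + blockIdx.x;
// int threadId = blockId * (blockDim.x * blockDim.y)
//              + threadIdx.y * blockDim.x + threadIdx.x;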

GPU physical model

A GPU is composed of many Streaming Multiprocessors (SMs), plus a global memory accessible by all SMs as well as local memory.

Each SM contains multiple cores, which share the SM's shared memory; each core also has memory local to itself.

Here is the architecture of the GeForce 8800, with 16 SMs, each containing 8 cores (streaming processors, SPs).

An SM in the Fermi architecture:

Execution model

Device level

When a CUDA application on the host invokes a kernel grid, the blocks of the grid are enumerated and a global work distribution engine assigns them to SMs with available execution capacity. Threads of the same block always run on the same SM. Multiple thread blocks, and multiple threads within a thread block, can execute concurrently on one SM. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.

All threads in a grid execute the same kernel. The GPU can handle multiple kernels from the same application simultaneously. Pascal GP100 can handle a maximum of 32 thread blocks and 2048 threads per SM.

Here, we have a CUDA application composed of 8 blocks. It can be executed on a GPU with 2 SMs or 4 SMs. With 4 SMs, blocks 0 & 4 are assigned to SM0, blocks 1 & 5 to SM1, blocks 2 & 6 to SM2, and blocks 3 & 7 to SM3.

[Figure: scheduling 8 blocks onto GPUs with 2 or 4 SMs (source: Nvidia)]

The entire device can only process a single application at a time, and switching between applications is slow.
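
The per-device limits mentioned above (SM count, threads per SM, warp size) can be queried at run time; the sketch below simply prints a few standard cudaDeviceProp fields for device 0:

#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Warp size: %d\n", prop.warpSize);
    return 0;
}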

SM level

Once a block of threads is assigned to an SM, the threads are divided into units called warps. A block is partitioned into warps, each containing threads with consecutive, increasing thread IDs, with the first warp containing thread 0. The size of a warp is determined by the hardware implementation. A warp scheduler selects a warp that is ready to execute its next instruction. In the Fermi architecture, the warp scheduler schedules warps of 32 threads. Every thread in a warp runs the same instruction. In the diagram below, there are 2 dispatch units; each one runs a different warp, and within each warp all threads run the same instruction. When the threads in a warp are waiting for a previous instruction to complete, the warp scheduler selects another warp to execute. Two warps from different blocks or different kernels can be executed concurrently.
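
As a small illustrative sketch (the kernel name and parameters are made up), each thread can work out which warp and lane it belongs to from its thread index and the built-in warpSize:

// Sketch: launched with a single 1D block.
__global__ void warp_info(int *warp_id, int *lane_id)
{
    int t = threadIdx.x;
    warp_id[t] = t / warpSize;   // which warp of the block (warpSize is 32 on current GPUs)
    lane_id[t] = t % warpSize;   // position of this thread within its warp
}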

Branch divergence

A warp executes one common instruction at a time. Each core (SP) runs the same instruction for its thread in the warp. To execute a branch like:

if (a[index]==0)
    a[index]++;
else	
    a[index]--;

the SM masks off (skips execution on) the cores whose threads do not satisfy the branch condition:

[Table: eight cores c0-c7 with a = 3, 3, -3, 7, 2, 6, -2, -1, showing which cores are active for each instruction of the branch: the test if a[index]==0, then a[index]++, then a[index]--]

So full efficiency is realized when all 32 threads of a warp branch to the same execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path.

To maximize throughput, all threads in a warp should follow the same control flow. The program can be rewritten so that threads within a warp branch the same way:

if (a[index]<range) 
   ...    // More likely, threads within a warp will branch the same way.
else
   ...   

is preferred over

if (a[index]%2==0) 
   ...
else
   ...   

Loop unrolling is another technique to avoid branching. For example, the loop

for (int i = 0; i < 4; i++)
  c[i] += a[i];

can be unrolled into straight-line code with no loop condition to evaluate:

c[0] += a[0]; c[1] += a[1]; c[2] += a[2]; c[3] += a[3];
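
As a side note, nvcc can perform this unrolling itself when the trip count is known at compile time; the sketch below uses the standard #pragma unroll directive (the kernel name add4 is made up):

__global__ void add4(int *a, int *c)
{
    #pragma unroll
    for (int i = 0; i < 4; i++)   // compiler emits 4 copies of the body, no loop branch
        c[i] += a[i];
}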

Memory model

Every SM has a shared memory that is accessible by all threads of the same block. Each thread has its own set of registers and local memory. All blocks can access a global memory, a constant memory (read only), and a texture memory (read-only memory optimized for spatial data).

Local, global, constant, and texture memory all reside off chip. Local, constant, and texture memory are all cached. Each SM has an L1 cache for global memory references, and all SMs share an L2 cache. Access to shared memory is on the order of TB/s; global memory is an order of magnitude slower. Each GPU has a read-only constant memory with lower latency and higher throughput, and a read-only texture memory.

Type            | Read/write     | Speed
Global memory   | read and write | slow, but cached
Texture memory  | read only      | cache optimized for 2D/3D access patterns
Constant memory | read only      | where constants and kernel arguments are stored
Shared memory   | read/write     | fast
Local memory    | read/write     | part of global memory; used when data does not fit in registers; slow but cached
Registers       | read/write     | fast

Local memory is just thread local global memory. It is much slower than either registers or shared memory.

Speed (Fast to slow):

  1. Register file
  2. Shared Memory
  3. Constant Memory
  4. Texture Memory
  5. (Tie) Local Memory and Global Memory
Declaration                | Memory   | Scope  | Lifetime
int v                      | register | thread | thread
int vArray[10]             | local    | thread | thread
__shared__ int sharedV     | shared   | block  | block
__device__ int globalV     | global   | grid   | application
__constant__ int constantV | constant | grid   | application
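
The sketch below simply places the declarations from the table into context (the variable names come from the table; the kernel name and the out parameter are made up for illustration):

__constant__ int constantV;          // constant memory: grid scope, application lifetime
__device__   int globalV;            // global memory:   grid scope, application lifetime

__global__ void scopes(int *out)
{
    int v = globalV + constantV;     // register: thread scope, thread lifetime
    int vArray[10];                  // local memory: thread scope, thread lifetime
    __shared__ int sharedV;          // shared memory: block scope, block lifetime

    if (threadIdx.x == 0)
        sharedV = v;                 // one thread writes, the whole block can read
    __syncthreads();

    vArray[0] = sharedV;
    out[threadIdx.x] = vArray[0];
}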

When threads in a warp load data from global memory, the hardware detects whether the addresses are consecutive and coalesces consecutive accesses into a single DRAM transaction.
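
As an illustrative sketch (the kernel names are made up), the first kernel below generates coalesced accesses because consecutive threads read consecutive addresses, while the strided pattern in the second cannot be combined into a single transaction:

// Coalesced: thread i reads a[i], so a warp touches 32 consecutive ints.
__global__ void copy_coalesced(int *out, const int *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = a[i];
}

// Strided: thread i reads a[32 * i]; the warp's addresses are far apart,
// so the loads cannot be merged into one transaction.
__global__ void copy_strided(int *out, const int *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = a[32 * i];
}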

Shared memory

Shared memory is on-chip and much faster than local and global memory; its latency is roughly 100x lower than uncached global memory latency. Threads can access data in shared memory that was loaded from global memory by other threads within the same thread block. Memory access can be coordinated with thread synchronization to avoid race conditions (__syncthreads). Shared memory can be used as a user-managed data cache and for highly parallel data reductions.

Static shared memory:

#include <stdio.h>

__global__ void staticReverse(int *d, int n)
{
  __shared__ int s[64];
  int t = threadIdx.x;
  int tr = n-t-1;
  s[t] = d[t];
  // Do not continue until every thread in the block has written its element of s[].
  __syncthreads();
  d[t] = s[tr];
}

int main(void)
{
  const int n = 64;
  int a[n], r[n], d[n];

  for (int i = 0; i < n; i++) {
    a[i] = i;
    r[i] = n-i-1;
    d[i] = 0;
  }

  int *d_d;
  cudaMalloc(&d_d, n * sizeof(int)); 

  // run version with static shared memory
  cudaMemcpy(d_d, a, n*sizeof(int), cudaMemcpyHostToDevice);
  staticReverse<<<1,n>>>(d_d, n);
  cudaMemcpy(d, d_d, n*sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; i++)
    if (d[i] != r[i]) printf("Error: d[%d]!=r[%d] (%d, %d)\n", i, i, d[i], r[i]);

  cudaFree(d_d);
  return 0;
}

__syncthreads() is a lightweight, block-level synchronization barrier: it ensures that all threads in the block have reached the barrier (and that their writes to shared memory are visible) before any thread continues.

Dynamic shared memory: when the size of the shared array is not known at compile time, declare it as extern __shared__ and pass the size in bytes as the third parameter of the kernel launch configuration.

#include <stdio.h>

__global__ void dynamicReverse(int *d, int n)
{
  // Dynamic shared memory	
  extern __shared__ int s[];
  int t = threadIdx.x;
  int tr = n-t-1;
  s[t] = d[t];
  __syncthreads();
  d[t] = s[tr];
}

int main(void)
{
  const int n = 64;
  int a[n], r[n], d[n];

  for (int i = 0; i < n; i++) {
    a[i] = i;
    r[i] = n-i-1;
    d[i] = 0;
  }

  int *d_d;
  cudaMalloc(&d_d, n * sizeof(int)); 

  // run dynamic shared memory version
  cudaMemcpy(d_d, a, n*sizeof(int), cudaMemcpyHostToDevice);
  dynamicReverse<<<1,n,n*sizeof(int)>>>(d_d, n);
  cudaMemcpy(d, d_d, n * sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; i++)
    if (d[i] != r[i]) printf("Error: d[%d]!=r[%d] (%d, %d)\n", i, i, d[i], r[i]);

  cudaFree(d_d);
  return 0;
}

Shared memory is accessible by multiple threads. To reduce potential bottlenecks, it is divided into logical banks, with successive sections of memory assigned to successive banks. Each bank can service only one request at a time; multiple simultaneous accesses from different threads to the same bank result in a bank conflict, and the accesses are serialized.

Shared memory banks are organized such that successive 32-bit words are assigned to successive banks, and the bandwidth is 32 bits per bank per clock cycle. For devices of compute capability 1.x, the warp size is 32 threads and the number of banks is 16, so a shared memory request for a warp is split into 2 requests, one for each half-warp.
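
As an illustrative sketch (the kernel name is made up, and 32 banks of 32-bit words are assumed, as on current GPUs), a stride-2 access pattern makes pairs of threads hit the same bank, while a stride-1 pattern is conflict-free:

// Sketch: launched with one block of 32 threads (one warp).
__global__ void bank_demo(int *out)
{
    __shared__ int s[64];
    int t = threadIdx.x;

    s[t] = t;               // stride-1 store: each thread writes a different bank
    __syncthreads();

    // Stride-2 load: threads t and t+16 both read from the same bank
    // (bank = (2*t) mod 32), giving a 2-way bank conflict.
    out[t] = s[2 * t];
}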