CUDA Practice (1): Converting an RGB Image to Grayscale with CUDA

Setting up the project

The project is built with CMake. The CMakeLists.txt is as follows:

cmake_minimum_required(VERSION 2.8)
project(image_process)
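find_package(OpenCV REQUIRED)   # looks for FindXXX.cmake or XXXConfig.cmake, which sets variables such as OpenCV_LIBS
find_package(CUDA REQUIRED)     # REQUIRED makes configuration fail with an error if the package is not found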
cuda_add_executable(image_process main.cu)
target_link_libraries(image_process ${OpenCV_LIBS})

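Open question, still unresolved: how does cuda_add_executable arrange for NVCC to compile the sources, and how else can compilation with nvcc be specified?

Writing the code

The idea is simple: run the color-to-grayscale conversion three ways (a CUDA kernel, a plain CPU loop, and cv::cvtColor) and compare the execution times.

The CUDA program

Each thread processes one pixel. The grid and block dimensions are set as follows: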

dim3 threadsPerBlock(32, 32);
dim3 blocksPerGrid((imgwidth + threadsPerBlock.x - 1) / threadsPerBlock.x,
        (imgheight + threadsPerBlock.y - 1) / threadsPerBlock.y);

The kernel is written as follows:

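__global__ void rgb2grayincuda(uchar3 * const d_in, unsigned char * const d_out, 
                                uint imgheight, uint imgwidth)
{
    // Global pixel coordinates handled by this thread
    const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int idy = blockIdx.y * blockDim.y + threadIdx.y;
    
    // Threads in the rounded-up part of the grid lie outside the image and do nothing
    if (idx < imgwidth && idy < imgheight)
    {
        // The input comes from cv::imread, which stores pixels as B, G, R,
        // so x = blue, y = green, z = red
        uchar3 bgr = d_in[idy * imgwidth + idx];
        d_out[idy * imgwidth + idx] = 0.114f * bgr.x + 0.587f * bgr.y + 0.299f * bgr.z;
    }
}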

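One subtle point in the kernel: when the image width or height is not an exact multiple of the block size, the grid is rounded up, so some threads in the edge blocks fall outside the image and do no work for the whole launch. The bounds check on idx and idy is what keeps those threads from reading or writing out of range.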
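
To make the rounding concrete, here is a small host-side sketch in plain C++ (no GPU needed). The helper names blocks_needed and idle_threads are made up for illustration; they only reproduce the ceiling division used for blocksPerGrid and count how many launched threads land outside the image.

```cpp
// Ceiling division: the smallest number of blocks of size `block`
// that covers `n` items. Same formula as blocksPerGrid in the post.
unsigned int blocks_needed(unsigned int n, unsigned int block)
{
    return (n + block - 1) / block;
}

// Number of launched threads that fall outside a width x height image
// when the grid is rounded up to whole bx x by blocks.
// (Hypothetical helper, for illustration only.)
unsigned long long idle_threads(unsigned int width, unsigned int height,
                                unsigned int bx, unsigned int by)
{
    const unsigned long long threads_x =
        static_cast<unsigned long long>(blocks_needed(width, bx)) * bx;
    const unsigned long long threads_y =
        static_cast<unsigned long long>(blocks_needed(height, by)) * by;
    const unsigned long long pixels =
        static_cast<unsigned long long>(width) * height;
    return threads_x * threads_y - pixels;
}
```

For an assumed 1000 x 600 image with 32 x 32 blocks, this gives a 32 x 19 grid, 622,592 launched threads, and 22,592 of them idle. An image whose sides are multiples of 32 has no idle threads.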

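When timing, call cudaDeviceSynchronize() to synchronize the CPU and GPU before stopping the timer. A kernel launch returns to the CPU immediately, so without the synchronization the measured time is only how long the CPU takes to launch the kernel.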
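
As a sketch of the same timing pattern, the helper below measures elapsed wall-clock time with std::chrono::steady_clock. The name time_seconds is an assumption made for this example, not something from the original program, which uses clock(). Note that clock() reports processor time consumed by the program rather than elapsed time, so the two can differ.

```cpp
#include <chrono>

// Runs `work` once and returns the elapsed wall-clock time in seconds.
// `work` must not return until the job has really finished; for a CUDA
// kernel that means the callable ends with cudaDeviceSynchronize().
// (Hypothetical helper, for illustration only.)
template <typename Work>
double time_seconds(Work&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

With the kernel from this post, the callable would launch rgb2grayincuda and then call cudaDeviceSynchronize() before returning. The CPU loop and cv::cvtColor can be passed in directly.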

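The CPU loop function

The interface mirrors the kernel, except that the input is a plain unsigned char pointer instead of uchar3. The pixels are traversed through raw pointers: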

void rgb2grayincpu(unsigned char * const d_in, unsigned char * const d_out,
                                uint imgheight, uint imgwidth)
{
    for(uint i = 0; i < imgheight; i++)
    {
        for(uint j = 0; j < imgwidth; j++)
        {
            // Interleaved data from cv::imread: 3 bytes per pixel in B, G, R order
            d_out[i * imgwidth + j] = 0.114f * d_in[(i * imgwidth + j)*3]
                                     + 0.587f * d_in[(i * imgwidth + j)*3 + 1]
                                     + 0.299f * d_in[(i * imgwidth + j)*3 + 2];
        }
    }
}

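The *3 cost me quite a bit of time: the input is interleaved with three bytes per pixel, so the byte offset of pixel (i, j) is (i * imgwidth + j) * 3, not i * imgwidth + j. I guess I am still too green.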
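
The indexing can be checked on a tiny image. The sketch below is plain C++ with made-up helper names (channel_offset, gray_at). It assumes a tightly packed buffer in B, G, R order, which is how cv::imread lays out a 3-channel image, and uses the same three weights as the post.

```cpp
#include <cstddef>
#include <vector>

// Byte offset of channel `c` (0 = B, 1 = G, 2 = R) of pixel (row, col)
// in a tightly packed, interleaved 3-channel image of the given width.
std::size_t channel_offset(std::size_t row, std::size_t col,
                           std::size_t width, std::size_t c)
{
    return (row * width + col) * 3 + c;
}

// Grayscale value of one pixel, assuming B, G, R channel order.
// The float result is truncated when converted to unsigned char.
unsigned char gray_at(const std::vector<unsigned char>& bgr,
                      std::size_t row, std::size_t col, std::size_t width)
{
    const float b = bgr[channel_offset(row, col, width, 0)];
    const float g = bgr[channel_offset(row, col, width, 1)];
    const float r = bgr[channel_offset(row, col, width, 2)];
    return static_cast<unsigned char>(0.114f * b + 0.587f * g + 0.299f * r);
}
```

For a 2 x 2 test image holding a black, a pure blue, a pure green, and a pure red pixel, the last pixel's red byte sits at offset (1 * 2 + 1) * 3 + 2 = 11, and the four gray values come out as 0, 29, 149, and 76.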

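Timing results

cuda (s)      cpu (s)       cv::cvtColor (s)
0.00077100    0.00244700    0.09298100

The CUDA version takes about one third of the CPU time (roughly a 3.2x speedup), which is less than I expected. A poorly chosen block configuration may be the cause. Note that the CUDA figure covers only the kernel launch and synchronization, not the cudaMalloc or cudaMemcpy calls. Surprisingly, OpenCV's cvtColor is the slowest by far: about 38 times slower than the CPU loop and about 120 times slower than the CUDA kernel.

Guess: if each thread processed several pixels, the kernel might be faster.
Open question: why is OpenCV's cvtColor so slow?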
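
One common way to let each thread handle several pixels is a strided loop. The sketch below is only a host-side model in plain C++, with made-up names (gray_strided, gray_all), that shows how the pixels would be divided among workers. It assumes B, G, R input and says nothing about whether the GPU version would actually be faster, which would have to be measured.

```cpp
#include <cstddef>
#include <vector>

// Host-side model of a strided loop: worker `worker_id` out of `num_workers`
// handles pixels worker_id, worker_id + num_workers, worker_id + 2 * num_workers, ...
// In a CUDA kernel, worker_id would be blockIdx.x * blockDim.x + threadIdx.x
// and num_workers would be gridDim.x * blockDim.x.
void gray_strided(const unsigned char* bgr, unsigned char* gray,
                  std::size_t num_pixels,
                  std::size_t worker_id, std::size_t num_workers)
{
    for (std::size_t p = worker_id; p < num_pixels; p += num_workers)
    {
        gray[p] = static_cast<unsigned char>(0.114f * bgr[p * 3]
                                           + 0.587f * bgr[p * 3 + 1]
                                           + 0.299f * bgr[p * 3 + 2]);
    }
}

// Runs all workers one after another; on the GPU they would run in parallel.
// num_workers must be greater than zero.
std::vector<unsigned char> gray_all(const std::vector<unsigned char>& bgr,
                                    std::size_t num_workers)
{
    const std::size_t num_pixels = bgr.size() / 3;
    std::vector<unsigned char> gray(num_pixels, 0);
    for (std::size_t w = 0; w < num_workers; ++w)
        gray_strided(bgr.data(), gray.data(), num_pixels, w, num_workers);
    return gray;
}
```

The result does not depend on the number of workers: with fewer workers than pixels each worker simply loops more times, and workers whose index is past the last pixel do nothing.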

Source code

#include <cstdio>   // printf, fgetc
#include <iostream>
#include <time.h>
#include "opencv2/highgui.hpp"  // actually located under /usr/include
#include "opencv2/opencv.hpp"
using namespace cv;
using namespace std;

#define PAUSE printf("Press Enter key to continue..."); fgetc(stdin);

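__global__ void rgb2grayincuda(uchar3 * const d_in, unsigned char * const d_out, 
                                uint imgheight, uint imgwidth)
{
    // Global pixel coordinates handled by this thread
    const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int idy = blockIdx.y * blockDim.y + threadIdx.y;
    
    // Threads in the rounded-up part of the grid lie outside the image and do nothing
    if (idx < imgwidth && idy < imgheight)
    {
        // cv::imread stores pixels as B, G, R, so x = blue, y = green, z = red
        uchar3 bgr = d_in[idy * imgwidth + idx];
        d_out[idy * imgwidth + idx] = 0.114f * bgr.x + 0.587f * bgr.y + 0.299f * bgr.z;
    }
}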

void rgb2grayincpu(unsigned char * const d_in, unsigned char * const d_out,
                                uint imgheight, uint imgwidth)
{
    for(uint i = 0; i < imgheight; i++)
    {
        for(uint j = 0; j < imgwidth; j++)
        {
            // Interleaved data from cv::imread: 3 bytes per pixel in B, G, R order
            d_out[i * imgwidth + j] = 0.114f * d_in[(i * imgwidth + j)*3]
                                     + 0.587f * d_in[(i * imgwidth + j)*3 + 1]
                                     + 0.299f * d_in[(i * imgwidth + j)*3 + 2];
        }
    }
}

int main(void)
{
    Mat srcImage = imread("./test.jpg");
    imshow("srcImage", srcImage);
    waitKey(0);

    const uint imgheight = srcImage.rows;
    const uint imgwidth = srcImage.cols;

    Mat grayImage(imgheight, imgwidth, CV_8UC1, Scalar(0));

    uchar3 *d_in;
    unsigned char *d_out;

    cudaMalloc((void**)&d_in, imgheight*imgwidth*sizeof(uchar3));
    cudaMalloc((void**)&d_out, imgheight*imgwidth*sizeof(unsigned char));

    cudaMemcpy(d_in, srcImage.data, imgheight*imgwidth*sizeof(uchar3), cudaMemcpyHostToDevice);
    
    dim3 threadsPerBlock(32, 32);
    dim3 blocksPerGrid((imgwidth + threadsPerBlock.x - 1) / threadsPerBlock.x,
        (imgheight + threadsPerBlock.y - 1) / threadsPerBlock.y);

    clock_t start, end;
    start = clock();

    rgb2grayincuda<<<blocksPerGrid, threadsPerBlock>>>(d_in, d_out, imgheight, imgwidth);

    cudaDeviceSynchronize();
    end = clock();

    printf("cuda exec time is %.8f\n", (double)(end-start)/CLOCKS_PER_SEC);

    cudaMemcpy(grayImage.data, d_out, imgheight*imgwidth*sizeof(unsigned char), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);

    start = clock();

    rgb2grayincpu(srcImage.data, grayImage.data, imgheight, imgwidth);

    end = clock();

    printf("cpu exec time is %.8f\n", (double)(end-start)/CLOCKS_PER_SEC);

    start = clock();
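    // COLOR_BGR2GRAY is the C++ API name; the legacy CV_BGR2GRAY macro
    // is not pulled in by opencv.hpp in OpenCV 4
    cvtColor(srcImage, grayImage, COLOR_BGR2GRAY);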

    end = clock();

    printf("opencv-cpu exec time is %.8f\n", (double)(end-start)/CLOCKS_PER_SEC);

    imshow("grayImage", grayImage);
    waitKey(0);

    return 0;

}