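C++ in Practice: Optimizing OpenCL Matrix Multiplication (Part 2)

Preface

The previous article analyzed a simple OpenCL matrix-multiplication kernel in which each work-item computes exactly one element of the result matrix. The next step is to compute the result matrix one block (tile) at a time and see how the run time changes.

Detailed analysis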

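First, a word on OpenCL's memory address spaces. The common ones are:
Global memory, the __global qualifier: usually applied to a pointer to data of some type.
Local memory, the __local qualifier: variables declared __local are shared within a work-group, so all work-items in a work-group can communicate through local memory.
Private memory, __private: the internal variables of each work-item.
Constant memory, __constant.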

The OpenCL memory model is shown below:
[Figure: OpenCL memory model]

Let us look at the performance characteristics of the earlier matrix-multiplication kernel:

__kernel void hello_kernel(__global const int *a,
                           __global const int *b,
                           __global int *result_matrix,int  result_matrix_row,
                           int  result_matrix_col,int  compute_size)
{
    int row = get_global_id(0);
    int col = get_global_id(1);
    int sum = 0;
    for (int i = 0; i < compute_size; i++) {
        sum += a[row*compute_size+i] * b[i*result_matrix_col+col];
    }
    result_matrix[row*result_matrix_col+col] = sum;
}

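First, M*N work-items are launched in total. Each one runs a for loop of length K (compute_size), and every iteration loads one element from array a and one from array b. Altogether a single kernel launch performs M*N*K*2 loads from global memory; the factor 2 accounts for the two arrays a and b.

Second, each work-item computes one element of the result matrix and stores it, which gives M*N stores to global memory.

As the memory model in the figure above shows, this access pattern is not optimal. Within a work-group we can declare local memory to cut down on these global accesses.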
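To make the counts above concrete, here is a small host-side sketch in plain C++ (no OpenCL needed). The names MemOps and naiveGlobalOps are made up for this illustration; the formulas are the M*N*K*2 loads and M*N stores derived above.

```cpp
#include <cassert>
#include <cstdint>

// Global-memory traffic of the one-element-per-work-item kernel for an
// (M x K) times (K x N) product. Hypothetical helper, for illustration only.
struct MemOps {
    std::uint64_t loads;
    std::uint64_t stores;
};

MemOps naiveGlobalOps(std::uint64_t M, std::uint64_t N, std::uint64_t K) {
    assert(M > 0 && N > 0 && K > 0);
    // Each of the M*N work-items reads one element of a and one of b in each
    // of its K loop iterations, then writes a single result element.
    return MemOps{M * N * K * 2, M * N};
}
```

For the 512*512 case, naiveGlobalOps(512, 512, 512) gives 268,435,456 loads and 262,144 stores, so the loads are what is worth attacking.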

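The figure below, taken from an overseas blog, illustrates the idea behind this optimization:
[Figure: illustration of the tiling optimization]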

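In effect, the earlier single row-times-column product becomes a product of several rows with several columns; in essence it is still multiplying corresponding elements and summing them.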
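A minimal CPU sketch of that claim, assuming row-major storage and a K that is a multiple of the tile size. The function names are illustrative and this is not the kernel itself: it only shows that walking K in chunks and adding the partial sums gives the same result as the plain product.

```cpp
#include <cassert>
#include <vector>

// Reference: plain row-by-column product of row-major matrices.
std::vector<int> multiplyNaive(const std::vector<int>& A, const std::vector<int>& B,
                               int M, int N, int K) {
    std::vector<int> C(M * N, 0);
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < K; ++k)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
    return C;
}

// Same product, but K is walked in chunks of `tile` elements. Each chunk
// contributes a partial sum; the partial sums add up to the full dot product.
std::vector<int> multiplyTiled(const std::vector<int>& A, const std::vector<int>& B,
                               int M, int N, int K, int tile) {
    assert(tile > 0 && K % tile == 0);
    std::vector<int> C(M * N, 0);
    for (int t = 0; t < K / tile; ++t) {
        for (int i = 0; i < M; ++i) {
            for (int j = 0; j < N; ++j) {
                int partial = 0;
                for (int k = 0; k < tile; ++k) {
                    const int kk = t * tile + k;  // position along K
                    partial += A[i * K + kk] * B[kk * N + j];
                }
                C[i * N + j] += partial;
            }
        }
    }
    return C;
}
```

Calling multiplyTiled with any tile size that divides K returns the same vector as multiplyNaive; only the order of the additions changes.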

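The central idea is to compute block by block within a work-group and add up the partial results. The number of work-items is unchanged at M*N. What differs is that, within a work-group, the relevant values of the global arrays A and B are first copied into local memory. Every work-item in the group then reads these local variables, which is faster than reading global memory. The inner loop of length TS also reads local memory, so this is where the optimization comes from. Here is the implementation: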


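// TS (the tile edge length) must be defined for the OpenCL compiler, e.g. with
// "#define TS 16" at the top of the .cl file or the build option "-D TS=16".
// A, B and C are indexed in column-major order: element (r, c) of A is A[c*M + r].
__kernel void hello_kernel(const __global int* A,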
                     const __global int* B,
                     __global int* C, int M,  int N,  int K) {

    // Thread identifiers
    const int row = get_local_id(0); // Local row ID (max: TS)
    const int col = get_local_id(1); // Local col ID (max: TS)
    const int globalRow = TS*get_group_id(0) + row; // Row ID of C (0..M)
    const int globalCol = TS*get_group_id(1) + col; // Col ID of C (0..N)



    // Local memory to fit a tile of TS*TS elements of A and B
    __local int Asub[TS][TS];
    __local int Bsub[TS][TS];

    // Initialise the accumulation register
    int acc = 0;

    // Loop over all tiles
    const int numTiles = K/TS;
    for (int t=0; t<numTiles; t++) {

        // Load one tile of A and B into local memory
        const int tiledRow = TS*t + row;
        const int tiledCol = TS*t + col;
        Asub[col][row] = A[tiledCol*M + globalRow];
        Bsub[col][row] = B[globalCol*K + tiledRow];
        printf("Asub[%d][%d]=A[%d]=%d\t",col,row,tiledCol*M + globalRow,A[tiledCol*M + globalRow]);
        // Synchronise to make sure the tile is loaded
        barrier(CLK_LOCAL_MEM_FENCE);

        // Perform the computation for a single tile
        for (int k=0; k<TS; k++) {
            acc += Asub[k][row] * Bsub[col][k];
            //printf("acc[%d][%d]=%d\n",k,row,Asub[k][row]);
        }
        printf("acc = %d\n",acc);
        // Synchronise before loading the next tile
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Store the final result in C
    C[globalCol*M + globalRow] = acc;
}

Next, let us look in detail at how this kernel executes.
First, the host-side configuration:

#define TS 16
size_t globalWorkSize[2];
globalWorkSize[0] = heightA;
globalWorkSize[1] = widthB;
size_t localWorkSize[2];
localWorkSize[0] = TS;
localWorkSize[1] = TS;
errNum = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL,
                                globalWorkSize, localWorkSize,
                                0, NULL, NULL);

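The localWorkSize argument is new here, and it is set to 16 in each dimension. The value has to be chosen with care:
TS should be a power of two, and the work-group has to fit the device limit,
that is,
localWorkSize[0]*localWorkSize[1] <= CL_DEVICE_MAX_WORK_GROUP_SIZE. In OpenCL 1.x each global work size (heightA and widthB here) must also be a multiple of the corresponding local size. How do you find this limit on your own machine? Query it with: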

size_t maxWorkItemPerGroup;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(maxWorkItemPerGroup), &maxWorkItemPerGroup, NULL);
printf("maxWorkItemPerGroup: %zu\n", maxWorkItemPerGroup);  // %zu is the size_t format

On my machine this prints 256, so TS can be at most 16 here (16*16 = 256).
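The same arithmetic can be written as a tiny helper. This is a sketch under the assumption that the tile is square and a power of two; largestSquareTile is a hypothetical name, and the argument would be the value reported for CL_DEVICE_MAX_WORK_GROUP_SIZE.

```cpp
#include <cassert>
#include <cstddef>

// Largest power-of-two tile edge TS such that a TS x TS work-group does not
// exceed maxWorkGroupSize work-items. Hypothetical helper for illustration.
std::size_t largestSquareTile(std::size_t maxWorkGroupSize) {
    assert(maxWorkGroupSize >= 1);
    std::size_t ts = 1;
    // Double the edge while the doubled square still fits.
    while ((ts * 2) * (ts * 2) <= maxWorkGroupSize) {
        ts *= 2;
    }
    return ts;
}
```

With a limit of 256 this returns 16, matching the value used here. The sketch ignores other limits a device may impose, such as the per-dimension sizes reported by CL_DEVICE_MAX_WORK_ITEM_SIZES or the amount of local memory the two tiles need.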

Now for the details of the kernel:

const int row = get_local_id(0); // Local row ID (max: TS)
const int col = get_local_id(1); // Local col ID (max: TS)

The get_local_id calls return the 2D index of the current work-item within its work-group.

const int globalRow = TS*get_group_id(0) + row; // Row ID of C (0..M)
const int globalCol = TS*get_group_id(1) + col; // Col ID of C (0..N)

These lines compute the work-item's global index from the ID of its work-group and its own index within that group. get_group_id returns the ID of the work-group the current work-item belongs to.
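The index arithmetic can be checked on the host with a small sketch. The names below are made up for this example; the formula mirrors `TS*get_group_id(d) + get_local_id(d)` and assumes a local size of TS and no global offset, in which case it matches what get_global_id(d) would return.

```cpp
#include <cassert>

// Hypothetical pair of (work-group ID, index inside the group) along one dimension.
struct WorkItemIndex {
    int group;
    int local;
};

// Mirrors TS*get_group_id(d) + get_local_id(d) from the kernel.
int globalIndex(int ts, int groupId, int localId) {
    assert(ts > 0 && groupId >= 0 && localId >= 0 && localId < ts);
    return ts * groupId + localId;
}

// Inverse mapping: which group and which slot inside it a global index lands in.
WorkItemIndex splitGlobalIndex(int ts, int globalId) {
    assert(ts > 0 && globalId >= 0);
    return WorkItemIndex{globalId / ts, globalId % ts};
}
```

For example, with TS = 16 the work-item with local index 3 in group 1 handles global index 19, and splitGlobalIndex(16, 19) recovers group 1 and local index 3.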

    __local int Asub[TS][TS];
    __local int Bsub[TS][TS];

This declares local memory, which is visible to all work-items in the same work-group.

const int numTiles = K/TS;

This is the number of tiles each work-item has to loop over. It also shows that K must be a multiple of TS.
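A short sketch of what this division implies, again in plain C++ with hypothetical helper names. It rejects a K that TS does not divide (integer division would otherwise silently drop the last partial tile), shows one way to round a size up for padding, and counts the global loads of the tiled kernel under the assumption that each work-item loads one element of A and one of B per tile.

```cpp
#include <cassert>
#include <cstdint>

// Number of tiles along K; K must be a multiple of ts.
std::uint64_t numTiles(std::uint64_t K, std::uint64_t ts) {
    assert(ts > 0 && K % ts == 0);
    return K / ts;
}

// Smallest multiple of ts that is >= n, e.g. as a padded matrix dimension.
std::uint64_t roundUpToMultiple(std::uint64_t n, std::uint64_t ts) {
    assert(ts > 0);
    return (n + ts - 1) / ts * ts;
}

// Global loads of the tiled kernel: M*N work-items, 2 loads per tile each.
std::uint64_t tiledGlobalLoads(std::uint64_t M, std::uint64_t N,
                               std::uint64_t K, std::uint64_t ts) {
    return M * N * numTiles(K, ts) * 2;
}
```

For M = N = K = 512 and TS = 16 this gives 16,777,216 global loads, a factor of TS fewer than the M*N*K*2 = 268,435,456 of the first kernel.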

for (int t=0; t<numTiles; t++) {

        // Load one tile of A and B into local memory
        const int tiledRow = TS*t + row;
        const int tiledCol = TS*t + col;
        Asub[col][row] = A[tiledCol*M + globalRow];
        Bsub[col][row] = B[globalCol*K + tiledRow];
        printf("Asub[%d][%d]=A[%d]=%d\t",col,row,tiledCol*M + globalRow,A[tiledCol*M + globalRow]);
        // Synchronise to make sure the tile is loaded
        barrier(CLK_LOCAL_MEM_FENCE);

        // Perform the computation for a single tile
        for (int k=0; k<TS; k++) {
            acc += Asub[k][row] * Bsub[col][k];
            //printf("acc[%d][%d]=%d\n",k,row,Asub[k][row]);
        }
        printf("acc = %d\n",acc);
        // Synchronise before loading the next tile
        barrier(CLK_LOCAL_MEM_FENCE);
    }

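This for loop is the core of the kernel. Each iteration first computes the work-item's index within the current tile.

It then copies a 16*16 block of A and a 16*16 block of B from global memory into local memory. This code is hard to follow if you think sequentially. The key statement is barrier(CLK_LOCAL_MEM_FENCE), which synchronizes all work-items in a work-group: no work-item continues until every work-item in the group has reached this point. In other words, all 16*16 work-items must arrive here, which guarantees that the two 16*16 local arrays Asub and Bsub are fully populated, before execution carries on.

Next comes the loop over k. The two sub-matrices of A and B are now in local memory, and multiplying row by column and summing gives this tile's contribution to the result. Each work-item still computes a single element, and the work-group as a whole computes every element of the corresponding sub-matrix of the result.

Then there is another synchronization. It ensures that the whole group has finished computing with the current tile before the next tile is loaded. When the outer loop ends, the work-item has its element of the result matrix and stores it to global memory: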

    // Store the final result in C
    C[globalCol*M + globalRow] = acc;
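To follow the walkthrough without a GPU, the kernel can be emulated on the CPU. The sketch below is an illustration, not the author's code: it runs the work-items of each group in two sequential phases, and the end of each phase plays the role of a barrier. It keeps the kernel's index expressions, which treat A, B and C as column-major, and compares against a plain column-major reference. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// CPU emulation of the tiled kernel. A (M x K), B (K x N) and C (M x N) use
// the kernel's column-major indexing: element (r, c) of A is A[c*M + r].
std::vector<int> emulateTiledKernel(const std::vector<int>& A,
                                    const std::vector<int>& B,
                                    int M, int N, int K, int TS) {
    assert(TS > 0 && M % TS == 0 && N % TS == 0 && K % TS == 0);
    std::vector<int> C(M * N, 0);
    std::vector<int> Asub(TS * TS), Bsub(TS * TS), acc(TS * TS);

    for (int g0 = 0; g0 < M / TS; ++g0) {          // get_group_id(0)
        for (int g1 = 0; g1 < N / TS; ++g1) {      // get_group_id(1)
            std::fill(acc.begin(), acc.end(), 0);  // one accumulator per work-item
            for (int t = 0; t < K / TS; ++t) {
                // Phase 1: each work-item (row, col) loads one element of A and B.
                for (int row = 0; row < TS; ++row) {
                    for (int col = 0; col < TS; ++col) {
                        const int globalRow = TS * g0 + row;
                        const int globalCol = TS * g1 + col;
                        Asub[col * TS + row] = A[(TS * t + col) * M + globalRow];
                        Bsub[col * TS + row] = B[globalCol * K + (TS * t + row)];
                    }
                }
                // First barrier: both tiles are complete before anyone reads them.
                // Phase 2: each work-item adds its TS products for this tile.
                for (int row = 0; row < TS; ++row)
                    for (int col = 0; col < TS; ++col)
                        for (int k = 0; k < TS; ++k)
                            acc[col * TS + row] += Asub[k * TS + row] * Bsub[col * TS + k];
                // Second barrier: tiles are not overwritten while still in use.
            }
            // Store the final result of every work-item in the group.
            for (int row = 0; row < TS; ++row)
                for (int col = 0; col < TS; ++col)
                    C[(TS * g1 + col) * M + (TS * g0 + row)] = acc[col * TS + row];
        }
    }
    return C;
}

// Straightforward column-major reference product.
std::vector<int> referenceColumnMajor(const std::vector<int>& A,
                                      const std::vector<int>& B,
                                      int M, int N, int K) {
    std::vector<int> C(M * N, 0);
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < M; ++i)
            for (int k = 0; k < K; ++k)
                C[j * M + i] += A[k * M + i] * B[j * K + k];
    return C;
}
```

Merging the two phases into one loop would let a "work-item" read tile entries that have not been written yet, which is exactly what the first barrier prevents on the device.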

The main program follows:

//
//  main.cpp
//  OpenCL
//
//  Created by wmy on 2017/9/19.
//  Copyright © 2017 wmy. All rights reserved.
//


#include <OpenCL/OpenCL.h>
#include <iostream>
#include <fstream>
#include <sstream>
#include <unistd.h>
#include <sys/time.h>
#include<time.h>
#include<stdio.h>
#include<stdlib.h>
#include <mach/mach_time.h>
#include <boost/algorithm/string.hpp>


using namespace std;

//const int ARRAY_SIZE = 100000;


//4*3---3*5

const int midle = 32;
const int heightA = 32;

const int widthB = 32;
//const int heightB = 3;

// 1. Select an OpenCL platform and create a context
cl_context CreateContext()
{
    cl_int errNum;
    cl_uint numPlatforms;
    cl_platform_id firstPlatformId;
    cl_context context = NULL;

    // Pick the first available platform
    errNum = clGetPlatformIDs(1, &firstPlatformId, &numPlatforms);
    if (errNum != CL_SUCCESS || numPlatforms <= 0)
    {
        std::cerr << "Failed to find any OpenCL platforms." << std::endl;
        return NULL;
    }

    // Create an OpenCL context
    cl_context_properties contextProperties[] =
    {
        CL_CONTEXT_PLATFORM,
        (cl_context_properties)firstPlatformId,
        0
    };
    context = clCreateContextFromType(contextProperties, CL_DEVICE_TYPE_GPU,
                                      NULL, NULL, &errNum);

    return context;
}


// 2. Select a device and create a command queue
cl_command_queue CreateCommandQueue(cl_context context, cl_device_id *device)
{
    cl_int errNum;
    cl_device_id *devices;
    cl_command_queue commandQueue = NULL;
    size_t deviceBufferSize = -1;

    // Query the size in bytes of the context's device list
    errNum = clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &deviceBufferSize);

    if (deviceBufferSize <= 0)
    {
        std::cerr << "No devices available.";
        return NULL;
    }

    // Allocate space for the device list
    devices = new cl_device_id[deviceBufferSize / sizeof(cl_device_id)];
    printf("deviceBufferSize / sizeof(cl_device_id)=%zu\n",deviceBufferSize / sizeof(cl_device_id));
    errNum = clGetContextInfo(context, CL_CONTEXT_DEVICES, deviceBufferSize, devices, NULL);




//    size_t      valueSize;
//    clGetDeviceInfo(devices[0], CL_DEVICE_NAME, 0, NULL, &valueSize);
//    char* value = (char*) malloc(valueSize);
//    clGetDeviceInfo(devices[0], CL_DEVICE_NAME, valueSize, value, NULL);
//    printf("Device1 Name: %s\n", value);
//    free(value);
//    
//    clGetDeviceInfo(devices[1], CL_DEVICE_NAME, 0, NULL, &valueSize);
//    value = (char*) malloc(valueSize);
//    clGetDeviceInfo(devices[1], CL_DEVICE_NAME, valueSize, value, NULL);
//    printf("Device2 Name: %s\n", value);
//    free(value);


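    // Create the command queue on devices[1], the second device in the list.
    // NOTE: this requires at least two GPU devices in the context, and the
    // handle returned to the caller below is devices[0], a different device.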
    commandQueue = clCreateCommandQueue(context, devices[1], 0, NULL);

    *device = devices[0];
    delete[] devices;
    return commandQueue;
}


// 3. Create and build the program object
cl_program CreateProgram(cl_context context, cl_device_id device, const char* fileName)
{
    cl_int errNum;
    cl_program program;

    std::ifstream kernelFile(fileName, std::ios::in);
    if (!kernelFile.is_open())
    {
        std::cerr << "Failed to open file for reading: " << fileName << std::endl;
        return NULL;
    }

    std::ostringstream oss;
    oss << kernelFile.rdbuf();

    std::string srcStdStr = oss.str();
    const char *srcStr = srcStdStr.c_str();
    program = clCreateProgramWithSource(context, 1,
                                        (const char**)&srcStr,
                                        NULL, NULL);

    errNum = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    return program;
}

// Create the memory objects (buffers for a, b and the result)
bool CreateMemObjects(cl_context context, cl_mem memObjects[3],
                      int *a, int *b)
{
    memObjects[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof(int) * midle*heightA, a, NULL);
    memObjects[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof(int) * widthB*midle, b, NULL);
    memObjects[2] = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                   sizeof(int) * widthB*heightA, NULL, NULL);
    return true;
}


// Release OpenCL resources
void Cleanup(cl_context context, cl_command_queue commandQueue,
             cl_program program, cl_kernel kernel, cl_mem memObjects[3])
{
    for (int i = 0; i < 3; i++)
    {
        if (memObjects[i] != 0)
            clReleaseMemObject(memObjects[i]);
    }
    if (commandQueue != 0)
        clReleaseCommandQueue(commandQueue);

    if (kernel != 0)
        clReleaseKernel(kernel);

    if (program != 0)
        clReleaseProgram(program);

    if (context != 0)
        clReleaseContext(context);
}


void checkError(cl_int error, int line) {
    if (error != CL_SUCCESS) {
        switch (error) {
            case CL_DEVICE_NOT_FOUND:                 printf("-- Error at %d:  Device not found.\n", line); break;
            case CL_DEVICE_NOT_AVAILABLE:             printf("-- Error at %d:  Device not available\n", line); break;
            case CL_COMPILER_NOT_AVAILABLE:           printf("-- Error at %d:  Compiler not available\n", line); break;
            case CL_MEM_OBJECT_ALLOCATION_FAILURE:    printf("-- Error at %d:  Memory object allocation failure\n", line); break;
            case CL_OUT_OF_RESOURCES:                 printf("-- Error at %d:  Out of resources\n", line); break;
            case CL_OUT_OF_HOST_MEMORY:               printf("-- Error at %d:  Out of host memory\n", line); break;
            case CL_PROFILING_INFO_NOT_AVAILABLE:     printf("-- Error at %d:  Profiling information not available\n", line); break;
            case CL_MEM_COPY_OVERLAP:                 printf("-- Error at %d:  Memory copy overlap\n", line); break;
            case CL_IMAGE_FORMAT_MISMATCH:            printf("-- Error at %d:  Image format mismatch\n", line); break;
            case CL_IMAGE_FORMAT_NOT_SUPPORTED:       printf("-- Error at %d:  Image format not supported\n", line); break;
            case CL_BUILD_PROGRAM_FAILURE:            printf("-- Error at %d:  Program build failure\n", line); break;
            case CL_MAP_FAILURE:                      printf("-- Error at %d:  Map failure\n", line); break;
            case CL_INVALID_VALUE:                    printf("-- Error at %d:  Invalid value\n", line); break;
            case CL_INVALID_DEVICE_TYPE:              printf("-- Error at %d:  Invalid device type\n", line); break;
            case CL_INVALID_PLATFORM:                 printf("-- Error at %d:  Invalid platform\n", line); break;
            case CL_INVALID_DEVICE:                   printf("-- Error at %d:  Invalid device\n", line); break;
            case CL_INVALID_CONTEXT:                  printf("-- Error at %d:  Invalid context\n", line); break;
            case CL_INVALID_QUEUE_PROPERTIES:         printf("-- Error at %d:  Invalid queue properties\n", line); break;
            case CL_INVALID_COMMAND_QUEUE:            printf("-- Error at %d:  Invalid command queue\n", line); break;
            case CL_INVALID_HOST_PTR:                 printf("-- Error at %d:  Invalid host pointer\n", line); break;
            case CL_INVALID_MEM_OBJECT:               printf("-- Error at %d:  Invalid memory object\n", line); break;
            case CL_INVALID_IMAGE_FORMAT_DESCRIPTOR:  printf("-- Error at %d:  Invalid image format descriptor\n", line); break;
            case CL_INVALID_IMAGE_SIZE:               printf("-- Error at %d:  Invalid image size\n", line); break;
            case CL_INVALID_SAMPLER:                  printf("-- Error at %d:  Invalid sampler\n", line); break;
            case CL_INVALID_BINARY:                   printf("-- Error at %d:  Invalid binary\n", line); break;
            case CL_INVALID_BUILD_OPTIONS:            printf("-- Error at %d:  Invalid build options\n", line); break;
            case CL_INVALID_PROGRAM:                  printf("-- Error at %d:  Invalid program\n", line); break;
            case CL_INVALID_PROGRAM_EXECUTABLE:       printf("-- Error at %d:  Invalid program executable\n", line); break;
            case CL_INVALID_KERNEL_NAME:              printf("-- Error at %d:  Invalid kernel name\n", line); break;
            case CL_INVALID_KERNEL_DEFINITION:        printf("-- Error at %d:  Invalid kernel definition\n", line); break;
            case CL_INVALID_KERNEL:                   printf("-- Error at %d:  Invalid kernel\n", line); break;
            case CL_INVALID_ARG_INDEX:                printf("-- Error at %d:  Invalid argument index\n", line); break;
            case CL_INVALID_ARG_VALUE:                printf("-- Error at %d:  Invalid argument value\n", line); break;
            case CL_INVALID_ARG_SIZE:                 printf("-- Error at %d:  Invalid argument size\n", line); break;
            case CL_INVALID_KERNEL_ARGS:              printf("-- Error at %d:  Invalid kernel arguments\n", line); break;
            case CL_INVALID_WORK_DIMENSION:           printf("-- Error at %d:  Invalid work dimension\n", line); break;
            case CL_INVALID_WORK_GROUP_SIZE:          printf("-- Error at %d:  Invalid work group size\n", line); break;
            case CL_INVALID_WORK_ITEM_SIZE:           printf("-- Error at %d:  Invalid work item size\n", line); break;
            case CL_INVALID_GLOBAL_OFFSET:            printf("-- Error at %d:  Invalid global offset\n", line); break;
            case CL_INVALID_EVENT_WAIT_LIST:          printf("-- Error at %d:  Invalid event wait list\n", line); break;
            case CL_INVALID_EVENT:                    printf("-- Error at %d:  Invalid event\n", line); break;
            case CL_INVALID_OPERATION:                printf("-- Error at %d:  Invalid operation\n", line); break;
            case CL_INVALID_GL_OBJECT:                printf("-- Error at %d:  Invalid OpenGL object\n", line); break;
            case CL_INVALID_BUFFER_SIZE:              printf("-- Error at %d:  Invalid buffer size\n", line); break;
            case CL_INVALID_MIP_LEVEL:                printf("-- Error at %d:  Invalid mip-map level\n", line); break;
            case -1024:                               printf("-- Error at %d:  *clBLAS* Functionality is not implemented\n", line); break;
            case -1023:                               printf("-- Error at %d:  *clBLAS* Library is not initialized yet\n", line); break;
            case -1022:                               printf("-- Error at %d:  *clBLAS* Matrix A is not a valid memory object\n", line); break;
            case -1021:                               printf("-- Error at %d:  *clBLAS* Matrix B is not a valid memory object\n", line); break;
            case -1020:                               printf("-- Error at %d:  *clBLAS* Matrix C is not a valid memory object\n", line); break;
            case -1019:                               printf("-- Error at %d:  *clBLAS* Vector X is not a valid memory object\n", line); break;
            case -1018:                               printf("-- Error at %d:  *clBLAS* Vector Y is not a valid memory object\n", line); break;
            case -1017:                               printf("-- Error at %d:  *clBLAS* An input dimension (M,N,K) is invalid\n", line); break;
            case -1016:                               printf("-- Error at %d:  *clBLAS* Leading dimension A must not be less than the size of the first dimension\n", line); break;
            case -1015:                               printf("-- Error at %d:  *clBLAS* Leading dimension B must not be less than the size of the second dimension\n", line); break;
            case -1014:                               printf("-- Error at %d:  *clBLAS* Leading dimension C must not be less than the size of the third dimension\n", line); break;
            case -1013:                               printf("-- Error at %d:  *clBLAS* The increment for a vector X must not be 0\n", line); break;
            case -1012:                               printf("-- Error at %d:  *clBLAS* The increment for a vector Y must not be 0\n", line); break;
            case -1011:                               printf("-- Error at %d:  *clBLAS* The memory object for Matrix A is too small\n", line); break;
            case -1010:                               printf("-- Error at %d:  *clBLAS* The memory object for Matrix B is too small\n", line); break;
            case -1009:                               printf("-- Error at %d:  *clBLAS* The memory object for Matrix C is too small\n", line); break;
            case -1008:                               printf("-- Error at %d:  *clBLAS* The memory object for Vector X is too small\n", line); break;
            case -1007:                               printf("-- Error at %d:  *clBLAS* The memory object for Vector Y is too small\n", line); break;
            case -1001:                               printf("-- Error at %d:  Code -1001: no GPU available?\n", line); break;
            default:                                  printf("-- Error at %d:  Unknown with code %d\n", line, error);
        }
        exit(1);
    }
}
#define TIMES 10
#define TS 16

int main(int argc, char** argv)
{



    cl_context context = 0;
    cl_command_queue commandQueue = 0;
    cl_program program = 0;
    cl_device_id device = 0;
    cl_kernel kernel = 0;
    cl_mem memObjects[3] = { 0, 0, 0 };
    cl_int errNum;
   // uint64_t t1,t2,t3;
    clock_t t1,t2,t3,t4;


    const char* filename = "/Users/wangmingyong/Projects/OpenCL/OpenCL/HelloWorld.cl";
    // 1. Select an OpenCL platform and create a context
    context = CreateContext();

    // 2. Select a device and create a command queue
    commandQueue = CreateCommandQueue(context, &device);


    size_t      maxWorkItemPerGroup;
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,sizeof(maxWorkItemPerGroup), &maxWorkItemPerGroup, NULL);
    printf("maxWorkItemPerGroup: %zu\n", maxWorkItemPerGroup);

    size_t      valueSize;
    clGetDeviceInfo(device, CL_DEVICE_NAME, 0, NULL, &valueSize);
    char* value = (char*) malloc(valueSize);
    clGetDeviceInfo(device, CL_DEVICE_NAME, valueSize, value, NULL);
    printf("Device Name: %s\n", value);
    free(value);

    // 3. Create and build the program object
    program = CreateProgram(context, device, filename);//"HelloWorld.cl");

    // 4. Create the OpenCL kernel and allocate memory
    kernel = clCreateKernel(program, "hello_kernel", NULL);

    // Create the data to process
    int result[widthB*heightA]{0};
    int a[midle*heightA];
    int b[widthB*midle];
    for (int i = 0; i < heightA; i++)
    {
        for(int j = 0;j < midle;j++)
        {
            a[i*midle+j]=2;//10.0f * ((int) rand() / (int) RAND_MAX);
        }

    }


    for (int k = 0; k < midle; k++)
    {
        for(int m = 0;m < widthB;m++)
        {
            b[k*widthB+m]=3;//10.0f * ((int) rand() / (int) RAND_MAX);
        }

    }

    t1 = clock();  //mach_absolute_time();
    //printf("t1 = %.8f\n",(double)t1);

    for(int tt=0;tt<TIMES;tt++){
        for(int l=0;l<heightA;l++){
            for(int n = 0;n<widthB;n++){
                for(int q=0;q<midle;q++){
                    result[l*widthB+n] +=a [l*midle+q]*b[q*widthB+n];

                }
                //std::cout<<"r = "<<result[l*widthB+n]<<std::endl;
            }
        }
    }
    t2 = clock(); //mach_absolute_time();
    //printf("t2 = %.8f\n",(double)t2);

    // Create the memory objects
    if (!CreateMemObjects(context, memObjects, a, b))
    {
        Cleanup(context, commandQueue, program, kernel, memObjects);
        return 1;
    }

    // 5. Set the kernel arguments and run the kernel
    errNum = clSetKernelArg(kernel, 0, sizeof(cl_mem), &memObjects[0]);
    errNum |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &memObjects[1]);
    errNum |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &memObjects[2]);
    errNum |= clSetKernelArg(kernel, 3, sizeof(int), &heightA);
    errNum |= clSetKernelArg(kernel, 4, sizeof(int), &widthB);
    errNum |= clSetKernelArg(kernel, 5, sizeof(int), &midle);

    size_t globalWorkSize[2];
    globalWorkSize[0]= heightA;
    globalWorkSize[1]=widthB;
    size_t localWorkSize[2] ;
    localWorkSize[0]= TS;
    localWorkSize[1]= TS;
    t3 = clock();
    for(int run=0;run<TIMES;run++){
        errNum = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL,
                                        globalWorkSize, localWorkSize,
                                        0, NULL, NULL);
        checkError(errNum,__LINE__);
        if(errNum == CL_SUCCESS)
            cout<<"enqueue success!"<<endl;
        else
            printf("errNum= %d\n",errNum);

        //mach_absolute_time();
        // 6. Read back the result and release OpenCL resources
        errNum = clEnqueueReadBuffer(commandQueue, memObjects[2], CL_TRUE,
                                     0, widthB*heightA * sizeof(int), result,
                                     0, NULL, NULL);
        //    for(int p=0;p<20;p++){
        //        cout<<"new ="<<result[p];
        //    }

    }


        t4 = clock();



    printf("cpu t = %.8f\n",(float)(t2-t1)/CLOCKS_PER_SEC/TIMES);
    printf("gpu t = %.8f \n",(double)(t4-t3)/CLOCKS_PER_SEC/TIMES);

    std::cout << std::endl;
    std::cout << "Executed program successfully." << std::endl;
    getchar();
    Cleanup(context, commandQueue, program, kernel, memObjects);

    return 0;
}
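The program above times both paths but never compares their outputs: result accumulates the CPU sums over TIMES passes and is then overwritten by clEnqueueReadBuffer. A minimal sketch of a separate correctness check follows, assuming row-major inputs as in the CPU loop of main. The helper names cpuReference and firstMismatch are hypothetical and not part of the original program.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major CPU reference written into a fresh buffer: (M x K) times (K x N).
std::vector<int> cpuReference(const std::vector<int>& a, const std::vector<int>& b,
                              int M, int N, int K) {
    assert(a.size() == static_cast<std::size_t>(M) * K);
    assert(b.size() == static_cast<std::size_t>(K) * N);
    std::vector<int> c(static_cast<std::size_t>(M) * N, 0);
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < K; ++k)
                c[i * N + j] += a[i * K + k] * b[k * N + j];
    return c;
}

// Index of the first element where the two results differ, or -1 if none do.
long firstMismatch(const std::vector<int>& expected, const std::vector<int>& actual) {
    assert(expected.size() == actual.size());
    for (std::size_t i = 0; i < expected.size(); ++i) {
        if (expected[i] != actual[i]) return static_cast<long>(i);
    }
    return -1;
}
```

With the constant inputs used in main (every element of a is 2, every element of b is 3, size 32), each reference element is 32*2*3 = 192. Constant inputs cannot reveal a row-major versus column-major mix-up, so non-constant test data is the safer choice when checking the kernel.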

Timing results follow (seconds, as printed by the program):

Size       CPU          GPU
32*32      0.00010410   0.00040870
128*128    0.00676160   0.00040980
512*512    0.52244419   0.00058840

Compared with the previous kernel (gpu1 is the previous kernel, gpu2 is the tiled kernel; times in seconds):

Size       gpu1         gpu2
32*32      0.00029130   0.00040870
128*128    0.00036250   0.00040980
512*512    0.00056370   0.00058840

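The times look about the same. When I removed the read-buffer step the time dropped a lot, but the two kernels still came out in the same order of magnitude. Keep in mind that clEnqueueNDRangeKernel only enqueues the kernel and returns without waiting for it to finish, so without the blocking clEnqueueReadBuffer the measurement mostly reflects enqueue overhead rather than kernel run time. When I raise the dimension to 1024 the program fails, so I cannot check performance at larger sizes. One thing worth checking is that a, b and result are stack arrays, which at 1024*1024 ints each come to about 12 MiB in total and may exceed the default stack size. I will follow up on why the program is limited at 1024; I am not sure whether it is down to my machine.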

Without the read-buffer step (gpu t, seconds):

Size   kernel1      kernel2
32     0.00000360   0.00000370
128    0.00000400   0.00000310
512    0.00000460   0.00000320
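The numbers above come from clock(), which reports processor time used by the process rather than elapsed time. A wall-clock measurement may be a closer fit when the host mostly waits for the GPU. Below is a minimal sketch using std::chrono; averageSeconds is a hypothetical helper, and it assumes the callable blocks until its work is really done (for GPU work, for example by ending with clFinish or a blocking read).

```cpp
#include <cassert>
#include <chrono>

// Average wall-clock seconds per call of `work` over `repeats` runs.
// Assumption: `work` returns only once the work it started has completed.
template <typename F>
double averageSeconds(F&& work, int repeats) {
    assert(repeats > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count() / repeats;
}
```

In the program above, the body of the `run` loop could be passed as a lambda with repeats set to TIMES, which keeps the CPU and GPU measurements on the same clock.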