
Converting between half and float in OpenCL

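Using the half type in a kernel can speed up computation at the cost of some precision. Inside a kernel, arithmetic on half data is fairly convenient, but on the host half is much less pleasant to work with. Look at the definition of cl_half: typedef uint16_t cl_half __attribute__((aligned(2))); it is really just a uint16_t. So if host code uses the value stored in a cl_half directly, the system interprets it as a uint16_t. The most common situation is probably that a kernel stores its results as half, the buffer is copied back to the host, and the host then needs to use the values. The following points about converting between half and float are worth noting.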
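To make the pitfall concrete, here is a minimal sketch of what happens when host code uses the stored value as-is. The name cl_half_bits is a local stand-in for the real cl_half typedef (an assumption, so the snippet builds without an OpenCL SDK), and misread_as_integer is a hypothetical helper, not part of any API.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-in for the cl_half typedef quoted above (assumption: the real
// OpenCL header is not included so this builds without an SDK).
typedef uint16_t cl_half_bits;

// Hypothetical helper: uses the stored 16-bit pattern directly, the way naive
// host code would. The pattern is converted as an integer, not decoded as a
// floating-point number.
float misread_as_integer(cl_half_bits h)
{
    return static_cast<float>(h);
}
```

The pattern 0x3C00 encodes 1.0 in half precision (sign 0, biased exponent 15, fraction 0), yet misread_as_integer(0x3C00) yields 15360.0f, which is why an explicit decoding step is needed on the host.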

Using vstore_half and vload_half

The OpenCL 1.1 specification says:

Loads from a pointer to a half and stores to a pointer to a half can be performed using the vload_half, vload_halfn, vloada_halfn and vstore_half, vstore_halfn, vstorea_halfn functions
respectively as described in section 6.11.7. The load functions read scalar or vector half values
from memory and convert them to a scalar or vector float value. The store functions take a
scalar or vector float value as input, convert it to a half scalar or vector value (with appropriate
rounding mode) and write the half scalar or vector value to memory

The function declarations are as follows (the vector variants are analogous):

float vload_half (size_t offset, const __global half *p);
void vstore_half (float data, size_t offset, __global half *p);

In other words, on load vload_half takes half data from memory and automatically converts it to float; on store vstore_half takes float data, automatically converts it to half, and writes the result to memory.
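As a rough illustration of the two declarations above, the sketch below keeps a small OpenCL C kernel as a raw string on the host, ready to be passed to clCreateProgramWithSource. The kernel name scale_half, the gain parameter, and the constant kScaleHalfSource are all hypothetical; only vload_half, vstore_half, and get_global_id come from OpenCL C.

```cpp
#include <cassert>
#include <string>

// OpenCL C source held as a raw string. The kernel (scale_half) and its
// gain argument are made up for illustration.
static const std::string kScaleHalfSource = R"CLC(
__kernel void scale_half(__global const half *src,
                         __global half *dst,
                         const float gain)
{
    const size_t i = get_global_id(0);
    const float v = vload_half(i, src);  // read the i-th half, widened to float
    vstore_half(v * gain, i, dst);       // narrow back to half, store at index i
}
)CLC";
```

The offset argument counts half elements, not bytes, so index i addresses src[i]. Because the kernel only uses pointers to half and does its arithmetic in float, it should not need the cl_khr_fp16 extension, though that is worth confirming against the target device.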

Using read_imageh and write_imageh

Here are the declarations:

half4 read_imageh (image2d_t image, sampler_t sampler, int2 coord);
half4 read_imageh (image2d_t image, sampler_t sampler, float2 coord);

void write_imageh (image2d_t image, int2 coord, half4 color);

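For read_imageh, the returned value depends on the image_channel_data_type of the image2d_t:

image_channel_data_type              Return value
CL_UNORM_INT8 or CL_UNORM_INT16      normalized to [0.0, 1.0]
CL_SNORM_INT8 or CL_SNORM_INT16      normalized to [-1.0, 1.0]
CL_HALF_FLOAT                        the half-precision value as stored

If the image2d_t was not created with one of the channel data types in the table, the value returned by the read is undefined. Likewise, write_imageh can only write to an image object created with one of the types in the table.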
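The normalized ranges in the table follow from the conversion rules the OpenCL specification gives for normalized integer channels. A host-side sketch of those rules for the 8-bit cases is shown below; the function names are hypothetical, and the result is computed in float here, whereas read_imageh would deliver it as half.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// CL_UNORM_INT8: the stored byte c maps to c / 255, i.e. [0.0, 1.0].
float unorm_int8_to_float(uint8_t c)
{
    return static_cast<float>(c) / 255.0f;
}

// CL_SNORM_INT8: the stored byte c maps to max(-1, c / 127), i.e. [-1.0, 1.0].
// The max() keeps -128 from producing a value slightly below -1.0.
float snorm_int8_to_float(int8_t c)
{
    return std::max(-1.0f, static_cast<float>(c) / 127.0f);
}
```

The 16-bit variants follow the same pattern with 65535 and 32767 as divisors. The exact rounding a device applies when narrowing to half may differ slightly from this float computation.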

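Converting between float and half on the host

As mentioned earlier, on the host a half is stored as a plain uint16_t, so we need an algorithm or rule to decode its bits and recover the half-precision value we actually want. Fortunately, I found conversion routines in Qualcomm's OpenCL SDK, which you can download; the code is reproduced below: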

//--------------------------------------------------------------------------------------
// File: half_float.cpp
// Desc:
//
// Author:      QUALCOMM
//
//               Copyright (c) 2018 QUALCOMM Technologies, Inc.
//                         All Rights Reserved.
//                      QUALCOMM Proprietary/GTDR
//--------------------------------------------------------------------------------------

#include "half_float.h"
#include <cmath>
#include <limits>

cl_half to_half(float f)
{
    static const struct
    {
        unsigned int bit_size       = 16;                                                 // total number of bits in the representation
        unsigned int num_frac_bits  = 10;                                                 // number of fractional (mantissa) bits
        unsigned int num_exp_bits   = 5;                                                  // number of (biased) exponent bits
        unsigned int sign_bit       = 15;                                                 // position of the sign bit
        unsigned int sign_mask      = 1 << 15;                                            // mask to extract sign bit
        unsigned int frac_mask      = (1 << 10) - 1;                                      // mask to extract the fractional (mantissa) bits
        unsigned int exp_mask       = ((1 << 5) - 1) << 10;                               // mask to extract the exponent bits
        unsigned int e_max          = (1 << (5 - 1)) - 1;                                 // max value for the exponent
        int          e_min          = -((1 << (5 - 1)) - 1) + 1;                          // min value for the exponent
        unsigned int max_normal     = ((((1 << (5 - 1)) - 1) + 127) << 23) | 0x7FE000;    // max value that can be represented by the 16 bit float
        unsigned int min_normal     = ((-((1 << (5 - 1)) - 1) + 1) + 127) << 23;          // min value that can be represented by the 16 bit float
        unsigned int bias_diff      = ((unsigned int)(((1 << (5 - 1)) - 1) - 127) << 23); // difference in bias between the float16 and float32 exponent
        unsigned int frac_bits_diff = 23 - 10;                                            // difference in number of fractional bits between float16/float32
    } float16_params;

    static const struct
    {
        unsigned int abs_value_mask    = 0x7FFFFFFF; // ANDing with this value gives the abs value
        unsigned int sign_bit_mask     = 0x80000000; // ANDing with this value gives the sign
        unsigned int e_max             = 127;        // max value for the exponent
        unsigned int num_mantissa_bits = 23;         // 23 bit mantissa on single precision floats
        unsigned int mantissa_mask     = 0x007FFFFF; // 23 bit mantissa on single precision floats
    } float32_params;

    const union
    {
        float f;
        unsigned int bits;
    } value = {f};

    const unsigned int f_abs_bits = value.bits & float32_params.abs_value_mask;
    const bool         is_neg     = value.bits & float32_params.sign_bit_mask;
    const unsigned int sign       = (value.bits & float32_params.sign_bit_mask) >> (float16_params.num_frac_bits + float16_params.num_exp_bits + 1);
    cl_half            half       = 0;

    if (std::isnan(value.f))
    {
        half = float16_params.exp_mask | float16_params.frac_mask;
    }
    else if (std::isinf(value.f))
    {
        half = is_neg ? float16_params.sign_mask | float16_params.exp_mask : float16_params.exp_mask;
    }
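    else if (f_abs_bits > float16_params.max_normal)
    {
        // Clamp to the max finite float16 value (0x7BFF = 65504). The biased exponent must be
        // 2^5 - 2: an all-ones exponent with a nonzero fraction (0x7FFF) would encode NaN.
        half = sign | (((1 << float16_params.num_exp_bits) - 2) << float16_params.num_frac_bits) | float16_params.frac_mask;
    }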
    else if (f_abs_bits < float16_params.min_normal)
    {
        const unsigned int frac_bits    = (f_abs_bits & float32_params.mantissa_mask) | (1 << float32_params.num_mantissa_bits);
        const int          nshift       = float16_params.e_min + float32_params.e_max - (f_abs_bits >> float32_params.num_mantissa_bits);
        const unsigned int shifted_bits = nshift < 24 ? frac_bits >> nshift : 0;
        half                            = sign | (shifted_bits >> float16_params.frac_bits_diff);
    }
    else
    {
        half = sign | ((f_abs_bits + float16_params.bias_diff) >> float16_params.frac_bits_diff);
    }
    return half;
}

cl_float to_float(cl_half f)
{
    static const struct {
        uint16_t sign_mask                   = 0x8000;
        uint16_t exp_mask                    = 0x7C00;
        int      exp_bias                    = 15;
        int      exp_offset                  = 10;
        uint16_t biased_exp_max              = (1 << 5) - 1;
        uint16_t frac_mask                   = 0x03FF;
        float    smallest_subnormal_as_float = 5.96046448e-8f;
    } float16_params;

    static const struct {
        int sign_offset = 31;
        int exp_bias    = 127;
        int exp_offset  = 23;
    } float32_params;

    const bool     is_pos          = (f & float16_params.sign_mask) == 0;
    const uint32_t biased_exponent = (f & float16_params.exp_mask) >> float16_params.exp_offset;
    const uint32_t frac            = (f & float16_params.frac_mask);
    const bool     is_inf          = biased_exponent == float16_params.biased_exp_max
                                     && (frac == 0);

    if (is_inf)
    {
        return is_pos ? std::numeric_limits<float>::infinity() : -std::numeric_limits<float>::infinity();
    }

    const bool is_nan = biased_exponent == float16_params.biased_exp_max
                        && (frac != 0);
    if (is_nan)
    {
        return std::numeric_limits<float>::quiet_NaN();
    }

    const bool is_subnormal = biased_exponent == 0;
    if (is_subnormal)
    {
        return static_cast<float>(frac) * float16_params.smallest_subnormal_as_float * (is_pos ? 1.f : -1.f);
    }

    const int      unbiased_exp        = static_cast<int>(biased_exponent) - float16_params.exp_bias;
    const uint32_t biased_f32_exponent = static_cast<uint32_t>(unbiased_exp + float32_params.exp_bias);

    union
    {
        cl_float f;
        uint32_t ui;
    } res = {0};

    res.ui = (is_pos ? 0 : 1 << float32_params.sign_offset)
             | (biased_f32_exponent << float32_params.exp_offset)
             | (frac << (float32_params.exp_offset - float16_params.exp_offset));

    return res.f;
}
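When adopting a bit-level converter like the one above, it helps to have an independent reference to compare against. The following is a minimal sketch, not part of the Qualcomm SDK: decode_half_reference is a hypothetical name, and it decodes the three fields arithmetically with std::ldexp instead of assembling float bits.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Reference decoder for a 16-bit half pattern (hypothetical helper).
// Layout: 1 sign bit, 5 exponent bits (bias 15), 10 fraction bits.
float decode_half_reference(uint16_t h)
{
    const int sign = (h >> 15) & 0x1;
    const int exp  = (h >> 10) & 0x1F;
    const int frac = h & 0x3FF;

    float magnitude;
    if (exp == 31)
    {
        // All-ones exponent: infinity if the fraction is zero, NaN otherwise.
        magnitude = frac ? std::numeric_limits<float>::quiet_NaN()
                         : std::numeric_limits<float>::infinity();
    }
    else if (exp == 0)
    {
        // Zero or subnormal: frac * 2^-24, no implicit leading 1.
        magnitude = std::ldexp(static_cast<float>(frac), -24);
    }
    else
    {
        // Normal: (1 + frac / 1024) * 2^(exp - 15) == (1024 + frac) * 2^(exp - 25).
        magnitude = std::ldexp(static_cast<float>(1024 + frac), exp - 25);
    }
    return sign ? -magnitude : magnitude;
}
```

Useful spot checks are 0x3C00 (1.0), 0xC000 (-2.0), 0x7BFF (65504, the largest finite half), 0x0001 (the smallest subnormal, 2^-24), and 0x7C00 (positive infinity). Any host-side to_float should agree with the reference on these patterns.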

About conversion precision

Here is a set of numbers to give you a feel for it:

// Original float data
11.15780    -128.9570      6.154780    0.9487320    -1327.1247      256.0         0.0         -127.597       917.0         -1.0047
// After to_half followed by to_float
11.156250   -128.875000    6.152344    0.948730     -1327.000000    256.000000    0.000000    -127.562500    917.000000    -1.003906

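According to the documentation, integers in the range 0 to 2048 can be represented exactly. For other values the precision is best described in relative terms: half keeps 11 significant bits (10 stored fraction bits plus the implicit leading bit), so the larger the magnitude of a number, the larger the absolute error, and the smaller the magnitude, the smaller the absolute error. Note also that to_half discards the extra fraction bits with a right shift, so it truncates toward zero rather than rounding to nearest; that is why -128.9570 comes back as -128.875 and not -129.0. For typical image processing this precision is usually sufficient.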
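The round-trip numbers above can be reproduced without any bit manipulation by computing the spacing between adjacent half values at a given magnitude. The sketch below does that under two stated assumptions: the input is finite and within the half range (no overflow handling), and narrowing truncates toward zero, as the shift-based to_half does. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Spacing between adjacent half values near x: 2^(floor(log2|x|) - 10),
// with the exponent clamped at -14, where subnormals share a spacing of 2^-24.
float half_ulp(float x)
{
    int e = 0;
    std::frexp(std::fabs(x), &e);   // |x| = m * 2^e with m in [0.5, 1)
    int exponent = e - 1;           // equals floor(log2|x|) for nonzero x
    if (exponent < -14)
        exponent = -14;
    return std::ldexp(1.0f, exponent - 10);
}

// Snap x onto the half grid by truncating toward zero (assumption: x is
// finite and does not exceed the half range of +/-65504).
float truncate_to_half_grid(float x)
{
    const float ulp = half_ulp(x);
    return std::trunc(x / ulp) * ulp;
}
```

For example, the spacing is 2^-7 near 11.1578 and 0.125 near 128.957, which gives 11.15625 and -128.875 as in the table. Between 1024 and 2048 the spacing is exactly 1.0, which is why integers up to 2048 survive the round trip unchanged, while 2049 falls back to 2048.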