GPU-Accelerated Computing: An Introduction to the Tools

阿新 • Published: 2019-01-27

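I mainly use GPU-accelerated computing from R and Python, relying on mature tools to improve my productivity. Below is a brief summary of some of these approaches.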

R package for GPU-accelerated computing: gputools

1) gputools: an R package of GPU-accelerated functions covering common operations.
https://cran.r-project.org/web/packages/gputools/
2) iFes: Incremental Feature Selection Algorithm accelerated by GPU.

Python packages for GPU-accelerated computing: cudamat and gnumpy

Algorithms implemented on the GPU in Python

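Searching PyPI for "gpu" is certain to turn up packages. Known ones include cudatree 0.6 (random forests) and dpmix 0.3 (Gaussian mixture models).
Alternatively, search Baidu or Google for Python+GPU+algorithm.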

Supplementary code

1. Test code for R's gputools

## Objective: test the GPU computation functions in R ##
## time: 2015.08.27
## author: yjm
## run `nvidia-smi` in a shell to check the GPU information
library(gputools)
help(package = 'gputools')
## the help page lists the functions that compute on the GPU ##
## chooseGpu; ## default=0 

chooseGpu(deviceId=0)
############################################
## cpuMatMult (matrix multiplication); 
matA <- matrix(runif(2000*3000), 2000, 3000)
matB <- matrix(runif(3000*4000), 3000, 4000)
t1 = Sys.time()
y1 = cpuMatMult(matA, matB)
t2 = Sys.time()
t2-t1
#### also without GPU (base R)
y2 = matA %*% matB
t3 = Sys.time()
t3-t2
y3 = crossprod(t(matA), matB)
t4 = Sys.time()
t4-t3
############################################# 

## getGpuId; 
?getGpuId
getGpuId()
## default is device 0
## gpuCor (correlation coefficients: "pearson" or "kendall"); coefficient matrix
numAvars <- 5
numBvars <- 10
numSamples <- 30
A <- matrix(runif(numAvars*numSamples), numSamples, numAvars)
B <- matrix(runif(numBvars*numSamples), numSamples, numBvars)
gpuCor(A, B, method="pearson")
gpuCor(A, B, method="kendall")
A[3,2] <- NA
gpuCor(A, B, use="pairwise.complete.obs", method="pearson")
#### without GPU
cor.test(A[,1], B[,1], method='pearson')
#########################################################
## gpuCrossprod (cross-product)
matA <- matrix(runif(3000*2000), 3000, 2000)
matB <- matrix(runif(3000*4000), 3000, 4000)
t1 = Sys.time()
y1 = gpuCrossprod(matA, matB)
t2 = Sys.time()
t2-t1
#### without GPU
y2 = t(matA) %*% matB
t3 = Sys.time()
t3-t2
y3 = crossprod(matA, matB)
t4 = Sys.time()
t4-t3
#############################################################
## gpuDist(matrix, method) (compute distances between vectors); each row is a vector
numVectors <- 500
dimension <- 1000
Vectors <- matrix(runif(numVectors*dimension), numVectors, dimension)
t1 = Sys.time()
y1 = gpuDist(Vectors, "euclidean")
t2 = Sys.time()
t2-t1
#### without GPU
y2 = dist(Vectors, "euclidean")
t3 = Sys.time()
t3-t2
#gpuDist(Vectors, "maximum")
#gpuDist(Vectors, "manhattan")
#gpuDist(Vectors, "minkowski", 4)
###################################################
## gpuDistClust
## gpuGlm (glm with gpu)
## gpuGranger (granger causality tests)
## gpuHclust (Hierarchical Clustering)
## gpuLm (lm with gpu)
## gpuLm.defaultTol
## gpuLm.fit 
## gpuLsfit (least squares fit)
## gpuMatMult (matrix multiplication with GPU)
matA <- matrix(runif(2000*3000), 2000, 3000)
matB <- matrix(runif(3000*4000), 3000, 4000)
t1 = Sys.time()
y1 = gpuMatMult(matA, matB)
t2 = Sys.time()
t2-t1
#### without GPU
y2 = matA %*% matB
t3 = Sys.time()
t3-t2
y3 = crossprod(t(matA), matB)
t4 = Sys.time()
t4-t3
###############################################
## gpuMi () # mutual information # each column represents a random variable
x <- matrix(runif(60), 20, 3)
y <- matrix(runif(60), 20, 3)
# do something interesting
y[,2] <- 3.0 * (x[,1] + x[,3])
z <- gpuMi(x, y, bins = 10, splineOrder = 3)
print(z)
## gpuQr (QR decomposition)
## gpuSolve 
## gpuTcrossprod (matrix Transposed Cross-product with GPU)
matA <- matrix(runif(2000*3000), 2000, 3000)
matB <- matrix(runif(4000*3000), 4000, 3000)
t1 = Sys.time()
y1 = gpuTcrossprod(matA, matB)
t2 = Sys.time()
t2 - t1
#### without GPU
y2 = matA %*% t(matB)
t3 = Sys.time()
t3 - t2
y3 = tcrossprod(matA, matB)
t4 = Sys.time()
t4 - t3
## gpuTtest (T-test with gpu)
quit()

## conclusion:
## when nrow and ncol exceed 1000, GPU computation is much faster than CPU (more than 100 times in these tests).
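
The R script above times each call by taking differences of `Sys.time()`. The same pattern can be sketched in Python with a small helper that repeats the call and also checks that the two routes give the same result before their timings are compared. This is a minimal CPU-only sketch using NumPy; `time_call` and `crossprod` are hypothetical helper names introduced here for illustration, and the matrix sizes are kept small on purpose.

```python
import time

import numpy as np


def time_call(fn, *args, repeats=3):
    """Run fn(*args) `repeats` times; return (best wall-clock seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return best, result


def crossprod(x, y):
    """NumPy analogue of R's crossprod(x, y), i.e. t(x) %*% y (hypothetical helper)."""
    return x.T @ y


rng = np.random.default_rng(0)
mat_a = rng.random((200, 300))
mat_b = rng.random((300, 400))

# Two CPU routes to the same product, mirroring `matA %*% matB`
# and `crossprod(t(matA), matB)` in the R script.
t_matmul, y_matmul = time_call(lambda a, b: a @ b, mat_a, mat_b)
t_cross, y_cross = time_call(lambda a, b: crossprod(a.T, b), mat_a, mat_b)

# Timings are only worth comparing once the results are known to agree.
results_match = np.allclose(y_matmul, y_cross)
print(f"matmul: {t_matmul:.6f}s  crossprod: {t_cross:.6f}s  match: {results_match}")
```

Taking the best of several runs reduces the effect of one-off start-up costs, which matters even more on a GPU where the first call often includes initialization.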

2. Test code for Python's cudamat package


# coding: utf-8

# In[179]:
## test the cudamat package
## time: 2015.09.07
## author: yjm
## cudamat.py contains the author's comments, which are worth reading

#import pycuda as cuda
#import pycuda.autoinit
#from pycuda.compiler import SourceModule
import cudamat as cm
import nose
import numpy as np

# 1)
cm.cublas_init()
#print(cm.CUDAMatrix.ones.shape)
# 2)
#cm.cublas_shutdown() ## this also resets cm.CUDAMatrix.ones to zero

# In[180]:

## reshape ##
m = 256
n = 128
cm1 = np.array(np.random.rand(n, m)*10, dtype = np.float32, order = 'F')
cm2 = np.array(np.random.rand(m, 1)*10, dtype = np.float32, order = 'F')
gm1 = cm.CUDAMatrix(cm1)
gm2 = cm.CUDAMatrix(cm2)
# CUDAMatrix may raise an error here
# solution: try setting the "gpu_memory" value in your .pbtxt file to "2G" or "2.5G"
print('---- shapes of the variables on the CPU ----')
print(cm1.shape)
print(cm2.shape)
print('---- shapes of the variables on the GPU ----')
print(gm1.shape)
print(gm2.shape)
print('test reshape in gpu')
gm1.reshape((m, n))
print(gm1.shape)
print('test assign')
#gm2.assign(gm1)
gm1.reshape((n,m))
print(gm1.shape)

# In[81]:

## matrix product (dot) on the GPU ## and transpose of GPU matrices ##
m = 256
n = 128
cm1 = np.array(np.random.rand(n, m)*10, dtype = np.float32, order = 'F')
cm2 = np.array(np.random.rand(m, n)*10, dtype = np.float32, order = 'F')
gm1 = cm.CUDAMatrix(cm1)
gm2 = cm.CUDAMatrix(cm2)
print(gm1.shape)
print(gm2.shape)
gm = cm.dot(gm1, gm2) ## here is dot on GPU ##
print(gm.shape)
gm = cm.dot(gm2.T, gm1.T) ## here is transpose on GPU ##
print(gm.shape)

# In[63]:

## assign ## sets the values of a GPU matrix (from another matrix or a scalar) ##
cm2 = np.array(np.random.rand(n, m)*10, dtype=np.float32, order='F')
gm2 = cm.CUDAMatrix(cm2)
print('----cm2-----')
print(cm2[1:5, 1:5])
print('----cm1-----')
print(cm1[1:5, 1:5])
gm1.assign(gm2)
print('----after gm1.assign(gm2)----')
gm1.copy_to_host()
print(gm1.numpy_array[1:5, 1:5])

# In[64]:

## assign ##
a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order = 'F')
m1 = cm.CUDAMatrix(a)
m1.assign(np.pi)
m1.copy_to_host()
print(m1.numpy_array[1:5, 1:5])

# In[74]:

## get a row slice ## the slicing is done directly on the GPU ##
m = 256
n = 128
start = 11
end = 15

a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
b = np.array(np.random.rand(end-start, n)*10, dtype=np.float32, order='F')
c = np.array(a[start:end,:], order='F')

m1 = cm.CUDAMatrix(a)
m2 = cm.CUDAMatrix(b)
print(m1.shape)
print(m2.shape)

m1.get_row_slice(start, end, target = m2)
m3 = m1.get_row_slice(start, end)
m1.copy_to_host()
m2.copy_to_host()
m3.copy_to_host()
print('--after m1.get_row_slice(start, end, target = m2)---')
print(m1.shape)
print(m2.shape)
print(m3.shape)
print(m1.numpy_array[start:end, 1:5])
print(m2.numpy_array[:, 1:5])
print(m3.numpy_array[:, 1:5])

# In[95]:

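## add_col_vec ## add a column vector element-wise to every column of the matrix ##
## example:
## a=[1; 2; 3]; b=[1,1,1; 2,2,2; 3,3,3] ## then b.add_col_vec(a) = [2,2,2; 4,4,4; 6,6,6]
## add_row_vec ## add a row vector element-wise to every row of the matrix ##
## with a=[1, 2, 3] as a row, b.add_row_vec(a) = [2,3,4; 3,4,5; 4,5,6]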
m = 256
n = 128
a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
b = np.array(np.random.rand(m, 1)*10, dtype=np.float32, order='F')
c = a + b
m1 = cm.CUDAMatrix(a)
m2 = cm.CUDAMatrix(b)
print('---a-m1[1:5, 1:5]---')
print(a[1:5, 1:5])
print('---b-m2[1:5, :]---')
print(b[1:5, :])
print('---c-m1+m2---')
print(c[1:5, 1:5])

print('----after m1.add_col_vec(m2)----')
#m1.add_col_vec(m2, target = m3) 
## target is optional: it names the matrix that receives the result ##
## without target, the result is written back into m1 ##
m1.add_col_vec(m2)  ## add column vector m2 in place to every column of m1 ##
m1.copy_to_host()
print(m1.numpy_array[1:5, 1:5])

# In[103]:

## add_col_mult ## add (column vector * scalar) to every column of the GPU matrix, i.e. m1 + mult * m2 ##
m = 256
n = 128
a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
b = np.array(np.random.rand(m, 1)*10, dtype=np.float32, order='F')
m1 = cm.CUDAMatrix(a)
m2 = cm.CUDAMatrix(b)
print('---a--m1[1:5, 1:5]----')
print(a[1:5, 1:5])
print('---b--m2[1:5, :]----')
print(b[1:5, :])
m1.add_col_mult(m2, np.pi)
m1.copy_to_host()
print(m1.numpy_array[1:5, 1:5])

# In[104]:

## mult_by_col; mult_by_row; div_by_col; div_by_row; ## multiply or divide every column (or row) element-wise by a vector ##

# In[175]:

## sum ##
m = 256
n = 128
a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
rowSumRes = np.array(np.random.rand(m, 1)*10, dtype=np.float32, order='F') ## holds the row sums ##
colSumRes = np.array(np.random.rand(1, n)*10, dtype=np.float32, order='F') ## holds the column sums ##
m1 = cm.CUDAMatrix(a)
growSumRes = cm.CUDAMatrix(rowSumRes)
gcolSumRes = cm.CUDAMatrix(colSumRes)
print('---a--m1[1:5, 1:5]----')
print(a[1:5, 1:5])
mult = 1 ## scaling factor applied to the sums ##
m1.sum(axis = 1, target = growSumRes, mult = mult)
m1.sum(axis = 0, target = gcolSumRes, mult = mult)
growSumRes.copy_to_host()
gcolSumRes.copy_to_host()  ## copy back as well, otherwise the stale host array is printed ##
print(growSumRes.numpy_array[1:5,:])
print(gcolSumRes.numpy_array[:,1:5])

# In[177]:

## mean ##
m = 256
n = 128
a = np.array(np.random.rand(m, n)*10, dtype=np.float32, order='F')
rowMeaRes = np.array(np.random.rand(m, 1)*10, dtype=np.float32, order='F') ## holds the row means ##
colMeaRes = np.array(np.random.rand(1, n)*10, dtype=np.float32, order='F') ## holds the column means ##
m1 = cm.CUDAMatrix(a)
growMeaRes = cm.CUDAMatrix(rowMeaRes)  ## use the mean buffers, not the sum buffers from the previous cell ##
gcolMeaRes = cm.CUDAMatrix(colMeaRes)
print('---a--m1[1:5, 1:5]----')
print(a[1:5, 1:5])
m1.mean(axis = 1, target = growMeaRes)
m1.mean(axis = 0, target = gcolMeaRes)
growMeaRes.copy_to_host()
gcolMeaRes.copy_to_host()
print(growMeaRes.numpy_array[1:5, :])
print(gcolMeaRes.numpy_array[:, 1:5])

# In[181]:

## maximum and minimum ## max/min
## max(axis, target = None)
## min(axis, target = None)
## used the same way as sum and mean.
## sign: the element-wise sign of the matrix ##

# In[185]:

## sigmoid ## apply_sigmoid(target=None)
m = 256
n = 128
a = np.array(np.random.randn(m, n)*10, dtype=np.float32, order='F')
b = np.array(np.random.randn(m, n)*10, dtype=np.float32, order='F')

c = 1. / (1. + np.exp(-a))

m1 = cm.CUDAMatrix(a)
m2 = cm.CUDAMatrix(b)
m1.apply_sigmoid(target = m2)
m1.apply_sigmoid()

m1.copy_to_host()
m2.copy_to_host()
print(m1.numpy_array[1:5, 1:5])
print(m2.numpy_array[1:5, 1:5])
print(c[1:5, 1:5])

## hyperbolic tangent ## tanh ##
## gm.apply_tanh(target = gm2)
## gm.apply_tanh()
## soft threshold ## soft_threshold ##

## log ## exp ## sqrt ## pow ## where ##
## log(mat, target = None)
## exp(mat, target = None)
## pow(mat, p, target = None)
## where(condition_mat, if_mat, else_mat, target=None)
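
When checking GPU results such as those above, it helps to have CPU reference versions to compare against after `copy_to_host()`. The following is a minimal NumPy-only sketch; the `ref_*` names are hypothetical helpers introduced here, and the semantics (in particular the soft-threshold formula `sign(x) * max(|x| - alpha, 0)`) are assumptions based on how the operations are exercised in the test code, not a statement of cudamat's implementation.

```python
import numpy as np


def ref_add_col_vec(mat, vec):
    """Add the (m, 1) column vector `vec` to every column of the (m, n) matrix."""
    return mat + vec


def ref_add_col_mult(mat, vec, mult):
    """Add mult * vec to every column of the matrix."""
    return mat + mult * vec


def ref_sum(mat, axis, mult=1.0):
    """Sum along `axis`, keeping a 2-D result: (1, n) for axis=0, (m, 1) for axis=1."""
    return mult * mat.sum(axis=axis, keepdims=True)


def ref_sigmoid(mat):
    """Element-wise logistic sigmoid, as computed for `c` in the sigmoid cell."""
    return 1.0 / (1.0 + np.exp(-mat))


def ref_soft_threshold(mat, alpha):
    """Assumed soft threshold: shrink each element toward zero by alpha."""
    return np.sign(mat) * np.maximum(np.abs(mat) - alpha, 0.0)


# The small example from the add_col_vec comment.
b = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]], dtype=np.float32)
a = np.array([[1], [2], [3]], dtype=np.float32)

col_added = ref_add_col_vec(b, a)          # every column gets [1, 2, 3] added
col_scaled = ref_add_col_mult(b, a, 2.0)   # every column gets 2 * [1, 2, 3] added
row_sums = ref_sum(b, axis=1)              # shape (3, 1)
col_sums = ref_sum(b, axis=0)              # shape (1, 3)
shrunk = ref_soft_threshold(np.array([-2.0, 0.5, 3.0]), 1.0)

print(col_added)
print(row_sums.ravel(), col_sums.ravel())
```

A GPU result can then be compared with something like `np.allclose(m1.numpy_array, ref_add_col_vec(a, b), atol=1e-4)`, keeping a loose tolerance because the GPU matrices are single precision.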
