Processing tabular data with Python 3

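# Technical background

Data processing is a very active research area: by modelling data collected from large real-world scenarios, we can try to predict what is likely to happen next. For example, suppose we have gold price data covering 2002 to 2018:

![](https://img2020.cnblogs.com/blog/2277440/202103/2277440-20210327212959935-1428574262.png)

The data comes from an [open-source project](https://gitee.com/lbliang/gold_price_predict/blob/master/data.xls) on Gitee. It contains the following fields: date, opening price, closing price, highest price, lowest price, trading volume, and turnover. If we analyze this data with a machine-learning model, we may be able to predict gold prices that are not present in the data set. If the predictions fit well, that matters a great deal for some people's investment strategies. Real-world data of this kind, however, is often very large. The file used here is only a little over 300 KB, but more often we have to deal with 10 GB or even more than 1 TB of data. If we cannot even process the data, how can we build a model on it?

# Processing Excel tables in Python

Let's start with the simplest case and ignore performance for now. We can use `xlrd` to open and load an Excel table in Python:

```python
# table.py
import xlrd


def read_table_by_xlrd():
    # Open the workbook; xlrd parses the whole file into memory.
    workbook = xlrd.open_workbook(r'data.xls')
    sheet_name = workbook.sheet_names()
    print('All sheets in the file data.xls are: {}'.format(sheet_name))
    # Work with the first sheet and inspect a cell, a row, and a column.
    sheet = workbook.sheet_by_index(0)
    print('The cell value of row index 0 and col index 1 is: {}'.format(sheet.cell_value(0, 1)))
    print('The elements of row index 0 are: {}'.format(sheet.row_values(0)))
    print('The length of col index 1 are: {}'.format(len(sheet.col_values(1))))


if __name__ == '__main__':
    read_table_by_xlrd()
```

The code above produces the following output:

```bash
[dechin@dechin-manjaro gold]$ python3 table.py
All sheets in the file data.xls are: ['Sheet1', 'Sheet2', 'Sheet3']
The cell value of row index 0 and col index 1 is: 開
The elements of row index 0 are: ['時間', '開', '高', '低', '收', '量', '額']
The length of col index 1 are: 3923
```

We have successfully loaded an xls table into Python's memory and can now analyze the data. The header row reads: date, open, high, low, close, volume, and turnover. If the data needs to be modified, a library such as `openpyxl` (which works with the newer `.xlsx` format) can be used, but we will not go into that here.

Another very common and very powerful Python library for tabular data is `pandas`. Here we use ipython to briefly show how to process tabular data with pandas:

```python
[dechin@dechin-manjaro gold]$ ipython
Python 3.8.5 (default, Sep 4 2020, 07:30:14)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.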

In [1]: import pandas as pd

In [2]: !ls -l
總用量 368
-rw-r--r-- 1 dechin dechin 372736 3月 27 21:31 data.xls
-rw-r--r-- 1 dechin dechin    563 3月 27 21:42 table.py

In [3]: data = pd.read_excel('data.xls', 'Sheet1')  # read the Excel file

In [4]: data.to_csv('data.csv', encoding='utf-8')  # write it out as a csv file

In [7]: !ls -l
總用量 588
-rw-r--r-- 1 dechin dechin 221872 3月 27 21:52 data.csv
-rw-r--r-- 1 dechin dechin 372736 3月 27 21:31 data.xls
-rw-r--r-- 1 dechin dechin    563 3月 27 21:42 table.py

In [8]: !head -n 10 data.csv  # show the first 10 lines of the csv file
,時間,開,高,低,收,量,額
0,2002-10-30,83.98,92.38,82.0,83.52,352,29373370
1,2002-10-31,83.9,83.92,83.9,83.91,66,5537480
2,2002-11-01,84.5,84.65,84.0,84.51,77,6502510
3,2002-11-04,84.9,85.06,84.9,84.99,95,8076330
4,2002-11-05,85.1,85.2,85.1,85.13,61,5193650
5,2002-11-06,84.9,84.9,84.9,84.9,1,84900
6,2002-11-07,85.0,85.15,85.0,85.14,26,2212310
7,2002-11-08,85.25,85.28,85.1,85.16,35,2981780
8,2002-11-11,85.18,85.19,85.18,85.19,65,5537050
```

Besides Python statements, ipython can also run system commands when they are prefixed with `!`, which is very convenient. A csv file simply separates fields with commas and records with newlines, instead of the tab character `\t` that is commonly used elsewhere.

However, whether we use xlrd or pandas, we face the same problem: all of the data has to be loaded into memory before it can be processed. A typical personal computer has only 8 GB to 16 GB of memory, and even with a comparatively large 64 GB we can only process files smaller than 64 GB in memory, which is far from enough for big-data scenarios. The `vaex` library introduced in the next section is a good solution to this. Incidentally, on Linux the local memory and its current usage can be checked as follows:

```bash
[dechin@dechin-manjaro gold]$ vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   交換     空閒   緩衝    快取   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 35812168 328340 2904872    0    0    20    27  362  365  8  4 88  0  0
[dechin@dechin-manjaro gold]$ vmstat 2 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   交換     空閒   緩衝    快取   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 35810916 328356 2905844    0    0    20    27  362  365  8  4 88  0  0
 0  0      0 35811916 328364 2904952    0    0     0     6  613  688  1  1 99  0  0
 0  0      0 35812168 328364 2904856    0    0     0     0  672  642  0  1 99  0  0
```

The free-memory column (`空閒`) shows about 36 GB. This machine has 40 GB of memory in total, which is fairly large.

# Installing and using vaex
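The memory-mapping idea that this section relies on can be sketched with Python's standard `mmap` module. This is only an illustration of the general technique, not how vaex is implemented: the flat binary file of little-endian float64 values, the helper names `write_column` and `column_mean`, and the chunk size are all assumptions made for this example.

```python
import mmap
import os
import struct
import tempfile

ITEM = struct.calcsize("<d")  # 8 bytes per little-endian float64


def write_column(path, values):
    """Store floats as a flat binary column (hypothetical layout for this sketch)."""
    with open(path, "wb") as f:
        f.write(struct.pack("<%dd" % len(values), *values))


def column_mean(path, chunk_items=65536):
    """Average a binary column through a memory map, one chunk at a time."""
    total = 0.0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count = len(mm) // ITEM
            end = count * ITEM  # ignore any trailing partial value
            step = chunk_items * ITEM
            for start in range(0, end, step):
                # Slicing copies only this chunk out of the mapped file.
                block = mm[start:min(start + step, end)]
                total += sum(struct.unpack("<%dd" % (len(block) // ITEM), block))
    return total / count


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "open_price.bin")
        # A few opening prices taken from the csv preview above.
        write_column(path, [83.98, 83.9, 84.5, 84.9, 85.1])
        print(column_mean(path))
```

Only one chunk is copied into Python objects at a time, so peak memory is bounded by the chunk size rather than by the file size. That is the property a memory-mapped workflow depends on.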
`vaex` offers a memory-mapped approach to data processing: instead of loading the whole data file into memory, we operate directly on the file stored on disk. In other words, the size of the files we can process is no longer limited by the amount of memory; any file that fits on disk can be processed.

Nowadays even a small personal-PC disk has 128 GB, far more than memory can hold. Of course, depending on how the disk is partitioned, not all of that storage is necessarily available. The free disk space of the partition that holds the current directory can be checked like this:

```bash
[dechin@dechin-manjaro gold]$ df -hl .
檔案系統        容量  已用  可用 已用% 掛載點
/dev/nvme0n1p9  144G   57G   80G   42% /
```

The available column (`可用`) shows that we still have 80 GB of free disk space. If we placed an 80 GB table file in the current directory, neither pandas nor xlrd could process it, because it is far beyond what memory can hold. With vaex we could still process the file.

The [official vaex documentation](https://vaex.io/docs/index.html) also describes how vaex works and what its advantages are:

![](https://img2020.cnblogs.com/blog/2277440/202103/2277440-20210327232341427-1032605792.png)

## Installing vaex

Like most third-party Python packages, vaex can be downloaded and managed with `pip`. Quite a few files are downloaded, so the process can be slow; just wait patiently:

```bash
[dechin@dechin-manjaro gold]$ python3 -m pip install vaex
Collecting vaex
  Downloading vaex-4.1.0-py3-none-any.whl (4.5 kB)
Collecting vaex-ml<0.12,>=0.11.0
  Downloading vaex_ml-0.11.1-py3-none-any.whl (95 kB)
     |████████████████████████████████| 95 kB 81 kB/s
Collecting vaex-core<5,>=4.1.0
  Downloading vaex_core-4.1.0-cp38-cp38-manylinux2010_x86_64.whl (2.5 MB)
     |████████████████████████████████| 2.5 MB 61 kB/s
Collecting vaex-viz<0.6,>=0.5.0
  Downloading vaex_viz-0.5.0-py3-none-any.whl (19 kB)
Collecting vaex-astro<0.9,>=0.8.0
  Downloading vaex_astro-0.8.0-py3-none-any.whl (20 kB)
Collecting vaex-hdf5<0.8,>=0.7.0
  Downloading vaex_hdf5-0.7.0-py3-none-any.whl (15 kB)
Collecting vaex-server<0.5,>=0.4.0
  Downloading vaex_server-0.4.0-py3-none-any.whl (13 kB)
Collecting vaex-jupyter<0.7,>=0.6.0
  Downloading vaex_jupyter-0.6.0-py3-none-any.whl (42 kB)
     |████████████████████████████████| 42 kB 82 kB/s
Requirement already satisfied: traitlets in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-ml<0.12,>=0.11.0->vaex) (5.0.5)
Requirement already satisfied: numba in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-ml<0.12,>=0.11.0->vaex) (0.51.2)
Requirement already satisfied: jinja2 in /home/dechin/anaconda3/lib/python3.8/site-packages (from
vaex-ml<0.12,>=0.11.0->vaex) (2.11.2)
Requirement already satisfied: psutil>=1.2.1 in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-core<5,>=4.1.0->vaex) (5.7.2)
Requirement already satisfied: six in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-core<5,>=4.1.0->vaex) (1.15.0)
Requirement already satisfied: cloudpickle in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-core<5,>=4.1.0->vaex) (1.6.0)
Requirement already satisfied: numpy>=1.16 in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-core<5,>=4.1.0->vaex) (1.20.1)
Requirement already satisfied: dask[array] in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-core<5,>=4.1.0->vaex) (2.30.0)
Collecting pyarrow>=3.0
  Downloading pyarrow-3.0.0-cp38-cp38-manylinux2014_x86_64.whl (20.7 MB)
     |████████████████████████████████| 20.7 MB 86 kB/s
Requirement already satisfied: pandas in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-core<5,>=4.1.0->vaex) (1.1.3)
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out.
(read timeout=15)")': /simple/tabulate/
Collecting tabulate>=0.8.3
  Downloading tabulate-0.8.9-py3-none-any.whl (25 kB)
Requirement already satisfied: pyyaml in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-core<5,>=4.1.0->vaex) (5.3.1)
Collecting frozendict
  Downloading frozendict-1.2.tar.gz (2.6 kB)
Collecting aplus
  Downloading aplus-0.11.0.tar.gz (3.7 kB)
Requirement already satisfied: requests in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-core<5,>=4.1.0->vaex) (2.24.0)
Requirement already satisfied: nest-asyncio>=1.3.3 in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-core<5,>=4.1.0->vaex) (1.4.2)
Collecting progressbar2
  Downloading progressbar2-3.53.1-py2.py3-none-any.whl (25 kB)
Requirement already satisfied: future>=0.15.2 in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-core<5,>=4.1.0->vaex) (0.18.2)
Requirement already satisfied: matplotlib>=1.3.1 in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-viz<0.6,>=0.5.0->vaex) (3.3.4)
Requirement already satisfied: pillow in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-viz<0.6,>=0.5.0->vaex) (8.0.1)
Requirement already satisfied: astropy in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-astro<0.9,>=0.8.0->vaex) (4.0.2)
Requirement already satisfied: h5py>=2.9 in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-hdf5<0.8,>=0.7.0->vaex) (2.10.0)
Collecting cachetools
  Downloading cachetools-4.2.1-py3-none-any.whl (12 kB)
Requirement already satisfied: tornado>4.1 in /home/dechin/anaconda3/lib/python3.8/site-packages (from vaex-server<0.5,>=0.4.0->vaex) (6.0.4)
Collecting xarray
  Downloading xarray-0.17.0-py3-none-any.whl (759 kB)
     |████████████████████████████████| 759 kB 28 kB/s
Collecting ipympl
  Downloading ipympl-0.7.0-py2.py3-none-any.whl (106 kB)
     |████████████████████████████████| 106 kB 39 kB/s
Collecting ipyleaflet
  Downloading ipyleaflet-0.13.6-py2.py3-none-any.whl (3.3 MB)
     |████████████████████████████████| 3.3 MB 75 kB/s
Collecting ipyvuetify<2,>=1.2.2
  Downloading ipyvuetify-1.6.2-py2.py3-none-any.whl (11.7 MB)
     |████████████████████████████████| 11.7 MB 173 kB/s
Collecting ipyvolume>=0.4
  Downloading ipyvolume-0.5.2-py2.py3-none-any.whl (2.9 MB)
     |████████████████████████████████| 2.9 MB 66 kB/s
Collecting bqplot>=0.10.1
  Downloading bqplot-0.12.23-py2.py3-none-any.whl (1.2 MB)
     |████████████████████████████████| 1.2 MB 175 kB/s
Requirement already satisfied: ipython-genutils in /home/dechin/anaconda3/lib/python3.8/site-packages (from traitlets->vaex-ml<0.12,>=0.11.0->vaex) (0.2.0)
Requirement already satisfied: setuptools in /home/dechin/anaconda3/lib/python3.8/site-packages (from numba->vaex-ml<0.12,>=0.11.0->vaex) (50.3.1.post20201107)
Requirement already satisfied: llvmlite<0.35,>=0.34.0.dev0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from numba->vaex-ml<0.12,>=0.11.0->vaex) (0.34.0)
Requirement already satisfied: MarkupSafe>=0.23 in /home/dechin/anaconda3/lib/python3.8/site-packages (from jinja2->vaex-ml<0.12,>=0.11.0->vaex) (1.1.1)
Requirement already satisfied: toolz>=0.8.2; extra == "array" in /home/dechin/anaconda3/lib/python3.8/site-packages (from dask[array]->vaex-core<5,>=4.1.0->vaex) (0.11.1)
Requirement already satisfied: pytz>=2017.2 in /home/dechin/anaconda3/lib/python3.8/site-packages (from pandas->vaex-core<5,>=4.1.0->vaex) (2020.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /home/dechin/anaconda3/lib/python3.8/site-packages (from pandas->vaex-core<5,>=4.1.0->vaex) (2.8.1)
Requirement already satisfied: certifi>=2017.4.17 in /home/dechin/anaconda3/lib/python3.8/site-packages (from requests->vaex-core<5,>=4.1.0->vaex) (2020.6.20)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /home/dechin/anaconda3/lib/python3.8/site-packages (from requests->vaex-core<5,>=4.1.0->vaex) (1.25.11)
Requirement already satisfied: idna<3,>=2.5 in
/home/dechin/anaconda3/lib/python3.8/site-packages (from requests->vaex-core<5,>=4.1.0->vaex) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /home/dechin/anaconda3/lib/python3.8/site-packages (from requests->vaex-core<5,>=4.1.0->vaex) (3.0.4)
Collecting python-utils>=2.3.0
  Downloading python_utils-2.5.6-py2.py3-none-any.whl (12 kB)
Requirement already satisfied: cycler>=0.10 in /home/dechin/anaconda3/lib/python3.8/site-packages (from matplotlib>=1.3.1->vaex-viz<0.6,>=0.5.0->vaex) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/dechin/anaconda3/lib/python3.8/site-packages (from matplotlib>=1.3.1->vaex-viz<0.6,>=0.5.0->vaex) (1.3.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /home/dechin/anaconda3/lib/python3.8/site-packages (from matplotlib>=1.3.1->vaex-viz<0.6,>=0.5.0->vaex) (2.4.7)
Collecting ipywidgets>=7.6.0
  Downloading ipywidgets-7.6.3-py2.py3-none-any.whl (121 kB)
     |████████████████████████████████| 121 kB 175 kB/s
Requirement already satisfied: ipykernel>=4.7 in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (5.3.4)
Collecting branca<0.5,>=0.3.1
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Collecting shapely
  Downloading Shapely-1.7.1-cp38-cp38-manylinux1_x86_64.whl (1.0 MB)
     |████████████████████████████████| 1.0 MB 98 kB/s
Collecting traittypes<3,>=0.2.1
  Downloading traittypes-0.2.1-py2.py3-none-any.whl (8.6 kB)
Collecting ipyvue<2,>=1.5
  Downloading ipyvue-1.5.0-py2.py3-none-any.whl (2.7 MB)
     |████████████████████████████████| 2.7 MB 80 kB/s
Collecting ipywebrtc
  Downloading ipywebrtc-0.5.0-py2.py3-none-any.whl (1.1 MB)
     |████████████████████████████████| 1.1 MB 99 kB/s
Collecting pythreejs>=1.0.0
  Downloading pythreejs-2.3.0-py2.py3-none-any.whl (3.4 MB)
     |████████████████████████████████| 3.4 MB 30 kB/s
Requirement already satisfied: widgetsnbextension~=3.5.0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from
ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (3.5.1)
Requirement already satisfied: nbformat>=4.2.0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (5.0.8)
Requirement already satisfied: ipython>=4.0.0; python_version >= "3.3" in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (7.19.0)
Collecting jupyterlab-widgets>=1.0.0; python_version >= "3.6"
  Downloading jupyterlab_widgets-1.0.0-py3-none-any.whl (243 kB)
     |████████████████████████████████| 243 kB 115 kB/s
Requirement already satisfied: jupyter-client in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipykernel>=4.7->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (6.1.7)
Collecting ipydatawidgets>=1.1.1
  Downloading ipydatawidgets-4.2.0-py2.py3-none-any.whl (275 kB)
     |████████████████████████████████| 275 kB 73 kB/s
Requirement already satisfied: notebook>=4.4.1 in /home/dechin/anaconda3/lib/python3.8/site-packages (from widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (6.1.4)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /home/dechin/anaconda3/lib/python3.8/site-packages (from nbformat>=4.2.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (3.2.0)
Requirement already satisfied: jupyter-core in /home/dechin/anaconda3/lib/python3.8/site-packages (from nbformat>=4.2.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (4.6.3)
Requirement already satisfied: backcall in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.2.0)
Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (3.0.8)
Requirement already
satisfied: pickleshare in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.7.5)
Requirement already satisfied: pexpect>4.3; sys_platform != "win32" in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (4.8.0)
Requirement already satisfied: pygments in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (2.7.2)
Requirement already satisfied: jedi>=0.10 in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.17.1)
Requirement already satisfied: decorator in /home/dechin/anaconda3/lib/python3.8/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (4.4.2)
Requirement already satisfied: pyzmq>=13 in /home/dechin/anaconda3/lib/python3.8/site-packages (from jupyter-client->ipykernel>=4.7->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (19.0.2)
Requirement already satisfied: terminado>=0.8.3 in /home/dechin/anaconda3/lib/python3.8/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.9.1)
Requirement already satisfied: argon2-cffi in /home/dechin/anaconda3/lib/python3.8/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (20.1.0)
Requirement already satisfied: Send2Trash in /home/dechin/anaconda3/lib/python3.8/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (1.5.0)
Requirement already satisfied: nbconvert in /home/dechin/anaconda3/lib/python3.8/site-packages (from
notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (6.0.7)
Requirement already satisfied: prometheus-client in /home/dechin/anaconda3/lib/python3.8/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.8.0)
Requirement already satisfied: pyrsistent>=0.14.0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.17.3)
Requirement already satisfied: attrs>=17.4.0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (20.3.0)
Requirement already satisfied: wcwidth in /home/dechin/anaconda3/lib/python3.8/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.2.5)
Requirement already satisfied: ptyprocess>=0.5 in /home/dechin/anaconda3/lib/python3.8/site-packages (from pexpect>4.3; sys_platform != "win32"->ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.6.0)
Requirement already satisfied: parso<0.8.0,>=0.7.0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from jedi>=0.10->ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.7.0)
Requirement already satisfied: cffi>=1.0.0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (1.14.3)
Requirement already satisfied: mistune<2,>=0.8.1 in /home/dechin/anaconda3/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.8.4)
Requirement already satisfied:
testpath in /home/dechin/anaconda3/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.4.4)
Requirement already satisfied: pandocfilters>=1.4.1 in /home/dechin/anaconda3/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (1.4.3)
Requirement already satisfied: jupyterlab-pygments in /home/dechin/anaconda3/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.1.2)
Requirement already satisfied: bleach in /home/dechin/anaconda3/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (3.2.1)
Requirement already satisfied: entrypoints>=0.2.2 in /home/dechin/anaconda3/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.3)
Requirement already satisfied: defusedxml in /home/dechin/anaconda3/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.6.0)
Requirement already satisfied: nbclient<0.6.0,>=0.5.0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.5.1)
Requirement already satisfied: pycparser in /home/dechin/anaconda3/lib/python3.8/site-packages (from cffi>=1.0.0->argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (2.20)
Requirement already satisfied: webencodings in /home/dechin/anaconda3/lib/python3.8/site-packages (from
bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (0.5.1)
Requirement already satisfied: packaging in /home/dechin/anaconda3/lib/python3.8/site-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (20.4)
Requirement already satisfied: async-generator in /home/dechin/anaconda3/lib/python3.8/site-packages (from nbclient<0.6.0,>=0.5.0->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.6.0->ipympl->vaex-jupyter<0.7,>=0.6.0->vaex) (1.10)
Building wheels for collected packages: frozendict, aplus
  Building wheel for frozendict (setup.py) ... done
  Created wheel for frozendict: filename=frozendict-1.2-py3-none-any.whl size=3148 sha256=1ae5d8fe0d670f73bf3ee88453978246919197a616f0e08e601c84cc244cb238
  Stored in directory: /home/dechin/.cache/pip/wheels/9b/9b/56/5713233cf7226423ab6c58c08081551a301b5863e343ba053c
  Building wheel for aplus (setup.py) ...
done
  Created wheel for aplus: filename=aplus-0.11.0-py3-none-any.whl size=4412 sha256=9762d51c5ece813b0c5a27ff6ebc1a86e709d55edb7003dcc11272c954dd39c7
  Stored in directory: /home/dechin/.cache/pip/wheels/de/93/23/3db69e1003030a764c9827dc02137119ec5e6e439afd64eebb
Successfully built frozendict aplus
Installing collected packages: pyarrow, tabulate, frozendict, aplus, python-utils, progressbar2, vaex-core, vaex-ml, vaex-viz, vaex-astro, vaex-hdf5, cachetools, vaex-server, xarray, jupyterlab-widgets, ipywidgets, ipympl, branca, shapely, traittypes, ipyleaflet, ipyvue, ipyvuetify, ipywebrtc, ipydatawidgets, pythreejs, ipyvolume, bqplot, vaex-jupyter, vaex
  Attempting uninstall: ipywidgets
    Found existing installation: ipywidgets 7.5.1
    Uninstalling ipywidgets-7.5.1:
      Successfully uninstalled ipywidgets-7.5.1
Successfully installed aplus-0.11.0 bqplot-0.12.23 branca-0.4.2 cachetools-4.2.1 frozendict-1.2 ipydatawidgets-4.2.0 ipyleaflet-0.13.6 ipympl-0.7.0 ipyvolume-0.5.2 ipyvue-1.5.0 ipyvuetify-1.6.2 ipywebrtc-0.5.0 ipywidgets-7.6.3 jupyterlab-widgets-1.0.0 progressbar2-3.53.1 pyarrow-3.0.0 python-utils-2.5.6 pythreejs-2.3.0 shapely-1.7.1 tabulate-0.8.9 traittypes-0.2.1 vaex-4.1.0 vaex-astro-0.8.0 vaex-core-4.1.0 vaex-hdf5-0.7.0 vaex-jupyter-0.6.0 vaex-ml-0.11.1 vaex-server-0.4.0 vaex-viz-0.5.0 xarray-0.17.0
```

Once `Successfully installed` appears, the installation has succeeded and vaex is ready to use.

## Performance comparison

Other tools can also open and read table files just fine, so to show the advantage of vaex we use ipython to compare how long each one takes to open a file:

```python
[dechin@dechin-manjaro gold]$ ipython
Python 3.8.5 (default, Sep 4 2020, 07:30:14)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import vaex

In [2]: import xlrd

In [3]: %timeit xlrd.open_workbook(r'data.xls')
46.4 ms ± 76.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit vaex.open('data.csv')
4.95 ms ± 48.5 µs per loop (mean ± std. dev.
of 7 runs, 100 loops each)

In [7]: %timeit vaex.open('data.hdf5')
1.34 ms ± 1.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

The results show that opening the same data takes xlrd nearly `50ms`, while vaex needs as little as about `1ms`. Note that the two are not doing the same amount of work: xlrd parses the whole workbook up front, whereas opening an HDF5 file with vaex does not read all of the data into memory. Even so, a gap this large makes vaex well worth a closer look. Comparisons with other libraries have already been made in this [article](https://zhuanlan.zhihu.com/p/240797772): according to it, even compared with pandas, vaex reads more than 1000 times faster, while computations are several times faster. Overall it performs very well.

## Data format conversion

The test in the previous section used a file we had not mentioned before: `data.hdf5`, which was converted from `data.csv`. This section describes how to convert data into formats that vaex can open and recognize. The first option is to use pandas to convert the `csv` file directly to `hdf5`, much as we converted the `xls` file to `csv` in the section on processing Excel tables in Python:

```python
[dechin@dechin-manjaro gold]$ ipython
Python 3.8.5 (default, Sep 4 2020, 07:30:14)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [4]: data = pd.read_csv('data.csv')

In [10]: data.to_hdf('data.hdf5','data',mode='w',format='table')

In [11]: !ls -l
總用量 932
-rw-r--r-- 1 dechin dechin 221872 3月 27 21:52 data.csv
-rw-r--r-- 1 dechin dechin 348524 3月 27 22:17 data.hdf5
-rw-r--r-- 1 dechin dechin 372736 3月 27 21:31 data.xls
-rw-r--r-- 1 dechin dechin    563 3月 27 21:42 table.py
```

This creates an hdf5 file in the current directory. The drawback is that an hdf5 file generated this way is not directly compatible with vaex. If we read it with `df = vaex.open('data.hdf5')`, the output looks like this:

```python
In [3]: df
Out[3]:
#      table
0      '(0, [83.98, 92.38, 82. , 83.52], [ 0, ...
1      '(1, [83.9 , 83.92, 83.9 , 83.91], [ 1, ...
2      '(2, [84.5 , 84.65, 84. , 84.51], [ 2, ...
3      '(3, [84.9 , 85.06, 84.9 , 84.99], [ 3, ...
4      '(4, [85.1 , 85.2 , 85.1 , 85.13], [ 4, ...
...    ...
3,917  '(3917, [274.65, 275.35, 274.6 , 274.61], [ ...
3,918  '(3918, [274.4, 275.2, 274.1, 275. ], [ 391...
3,919  '(3919, [275. , 275.01, 274. , 274.19], [ ...
3,920  '(3920, [275.2, 275.2, 272.6, 272.9], [ 392...
3,921  '(3921, [272.96, 273.73, 272.5 , 272.93], [ ...
```

The most important information, the column names, is lost here: everything ends up in a single `table` column. The values themselves are all preserved, but reading them back is very inconvenient. We therefore recommend the second conversion method, which converts the format directly with vaex:

```python
[dechin@dechin-manjaro gold]$ ipython
Python 3.8.5 (default, Sep 4 2020, 07:30:14)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import vaex

In [2]: df = vaex.from_csv('data.csv')

In [3]: df.export_hdf5('vaex_data.hdf5')

In [4]: !ls -l
總用量 1220
-rw-r--r-- 1 dechin dechin 221856 3月 27 22:34 data.csv
-rw-r--r-- 1 dechin dechin 348436 3月 27 22:34 data.hdf5
-rw-r--r-- 1 dechin dechin 372736 3月 27 21:31 data.xls
-rw-r--r-- 1 dechin dechin    563 3月 27 21:42 table.py
-rw-r--r-- 1 dechin dechin 293512 3月 27 22:52 vaex_data.hdf5
```

This creates a `vaex_data.hdf5` file in the current directory. Let's try reading this new hdf5 file:

```python
[dechin@dechin-manjaro gold]$ ipython
Python 3.8.5 (default, Sep 4 2020, 07:30:14)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import vaex

In [2]: df = vaex.open('vaex_data.hdf5')

In [3]: df
Out[3]:
#      i     t             s       h       l      e       n      a
0      0     '2002-10-30'  83.98   92.38   82.0   83.52   352    29373370
1      1     '2002-10-31'  83.9    83.92   83.9   83.91   66     5537480
2      2     '2002-11-01'  84.5    84.65   84.0   84.51   77     6502510
3      3     '2002-11-04'  84.9    85.06   84.9   84.99   95     8076330
4      4     '2002-11-05'  85.1    85.2    85.1   85.13   61     5193650
...    ...   ...           ...     ...     ...    ...     ...    ...
3,917  3917  '2018-11-23'  274.65  275.35  274.6  274.61  13478  3708580608
3,918  3918  '2018-11-26'  274.4   275.2   274.1  275.0   13738  3773763584
3,919  3919  '2018-11-27'  275.0   275.01  274.0  274.19  13984  3836845568
3,920  3920  '2018-11-28'  275.2   275.2   272.6  272.9   15592  4258130688
3,921  3921  '2018-11-28'  272.96  273.73  272.5  272.93  592    161576336

In [4]: df.s
Out[4]:
Expression = s
Length: 3,922 dtype: float64 (column)
-------------------------------------
   0  83.98
   1  83.9
   2  84.5
   3  84.9
   4  85.1
 ...
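3917  274.65
3918  274.4
3919  275
3920  275.2
3921  272.96

In [11]: df.plot(df.i, df.s, show=True)  # draw the plot
/home/dechin/anaconda3/lib/python3.8/site-packages/vaex/viz/mpl.py:311: UserWarning: `plot` is deprecated and it will be removed in version 5.x. Please `df.viz.heatmap` instead.
  warnings.warn('`plot` is deprecated and it will be removed in version 5.x. Please `df.viz.heatmap` instead.')
```

One thing to note: in the new hdf5 file the column names have changed from Chinese, such as 高 (high) and 低 (low), to English letters such as h and l. To make the data easier to work with, we manually edited the header of the csv file to use English names before converting it to hdf5. As the warning above shows, `df.plot` is deprecated in this version of vaex in favor of `df.viz.heatmap`. Finally, we used vaex's built-in plotting to draw the gold price movement over this period of more than a decade:

![](https://img2020.cnblogs.com/blog/2277440/202103/2277440-20210327225938814-81696372.png)

vaex ships only a few built-in plotting methods, summarized below:

![](https://img2020.cnblogs.com/blog/2277440/202103/2277440-20210327230030164-1291770376.png)

The most commonly used one is the heatmap, which is why the gold price plot here is rendered as a heatmap. Even so, the functionality is essentially complete, and the performance is exceptionally strong.

# Summary

This article introduced three Python libraries for processing tabular data: xlrd, pandas, and vaex, with particular emphasis on the performance of vaex and its value for big-data applications. With a few simple examples we got a first impression of the characteristics of each library, so that they can be chosen as appropriate in real scenarios.

# Copyright notice

This article was first published at: https://www.cnblogs.com/dechinphy/p/vaex.html

Author ID: DechinPhy

More original articles: https://www.cnblogs.com/dechinphy/

# References

1. https://zhuanlan.zhihu.com/p/2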