Neo4j Study Notes (2)

阿新 • Published: 2018-12-16

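Comparing Several Ways to Import Data into Neo4j

1. The problem

While using Neo4j I ran into a problem: I needed to import more than ten billion records (nodes and relationships together) and wanted to find the most suitable way to do it. So I decided to measure the efficiency and cost of the various import methods.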

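2. Overview of common import methods

(1) Cypher create statements: write one create for every record

(2) Cypher load csv: convert the data to CSV and read it with LOAD CSV

(3) The official neo4j-import tool, which is to be replaced by neo4j-admin import

(4) The official Java API, BatchInserter

(5) The community-written batch-import tool

(6) neo4j APOC: load csv together with apoc.create.relationship, as tested in 3.6

(7) Custom development tailored to the actual business scenario

How do these tools differ? How fast are they? Which scenarios does each one suit?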

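| | create statements | load csv statements | neo4j-import | BatchInserter | batch-import | apoc |
|---|---|---|---|---|---|---|
| Use case | Initial import, incremental update | | Initial import | Initial import | Initial import, incremental update (with restrictions) | Incremental update |
| Import speed | Very slow, about 1,000/s | A few thousand/s | Tens of thousands/s | Tens of thousands/s | Tens of thousands/s | A few thousand/s |
| Measured | None | 9.5k/s (nodes + relationships) | 120k/s (nodes + relationships) | 10k/s (nodes + relationships) | 10k/s (nodes + relationships) | 4k/s (incremental update on 100 million records), 10k/s (incremental update on a few million records) |
| Pros | 1. Easy to use 2. Can insert in real time | 1. Official ETL tool 2. Can load local or remote CSV 3. Can insert in real time | 1. Official tool 2. Low resource usage | 1. Official API | 1. Supports incremental updates 2. Built on BatchInserter | 1. Official ETL tool 2. Supports incremental updates 3. Supports online import 4. Supports passing Label and RelationShip dynamically |
| Cons | 1. Slow 2. Preparing the data and assembling the CQL is complex | 1. Relatively slow 2. Cannot pass Label or RelationShip dynamically | 1. Offline import only, the Neo4j database must be stopped 2. Only usable for the initial import | 1. Offline import only, the Neo4j database must be stopped 2. Must be used from a Java environment | 1. Offline import only, the Neo4j database must be stopped | 1. Moderate speed |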

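Note: the ids used in these tests (not Neo4j internal ids) are Strings holding uuids. If a single Label holds no more than 2^32 records, an int type can be used instead.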
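
To make the note above concrete, here is a minimal Python sketch of swapping uuid strings for sequential integer ids before an import. The helper name `assign_int_ids` and the in-memory dictionary are my own assumptions for illustration; nothing like this was part of the tests, and a real data set of this size would need an on-disk mapping.

```python
import uuid

def assign_int_ids(uuids):
    """Map each distinct uuid string to a sequential int id (hypothetical helper)."""
    mapping = {}
    for u in uuids:
        if u not in mapping:
            mapping[u] = len(mapping)
    # The note above gives 2^32 records per Label as the limit for int ids.
    if len(mapping) > 2**32:
        raise ValueError("more than 2^32 distinct ids, keep the String ids")
    return mapping

sample = [uuid.uuid4().hex for _ in range(5)]
ids = assign_int_ids(sample + sample[:2])  # repeated uuids keep their first id
print(ids[sample[0]], ids[sample[4]], len(ids))
```

Relationship files would then have to be rewritten with the same mapping so that both ends still point at the right nodes.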

3. Hands-on tests of each import method

3.1 create

Not tested.

Batch insert commands

./bin/neo4j-shell -c < /data/stale/data01/neo4j/neo4j_script.cypher

./bin/neo4j-shell -path ./data/dbms/ -conf ./conf/neo4j.conf -file create_index.cypher

3.2 load csv

load csv cannot take the Label or the RelationShip type dynamically, so both were hard-coded in these tests.

Data format

uuid,name,Label
b6b0ea842890425588d4d3cfb38139a9,"文爍",Label1
5099c4f943d94fa1873165e3f6f3c2fb,"齊賀喜",Label3
c83ed0ae9fb34baa956a42ecf99c8f6e,"李雄",Label2
e62d1142937f4de994854fa1b3f0670a,"房玄齡",Label
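
A file in this shape can be produced with a few lines of Python. This is only a sketch of how such test data might be generated, not the script used for the tests above; `write_node_csv` and its arguments are hypothetical, and the csv module leaves the names unquoted unless quoting is needed.

```python
import csv
import io
import uuid

def write_node_csv(out, names, labels):
    """Write uuid,name,Label rows (names and labels are illustrative inputs)."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["uuid", "name", "Label"])
    for i, name in enumerate(names):
        # uuid4().hex yields a 32-character hex string like the ids shown above
        writer.writerow([uuid.uuid4().hex, name, labels[i % len(labels)]])

buf = io.StringIO()
write_node_csv(buf, ["文爍", "齊賀喜", "李雄"], ["Label1", "Label2", "Label3"])
rows = buf.getvalue().splitlines()
print(rows[0])
print(len(rows) - 1, "data rows")
```

For a real test the `io.StringIO` buffer would be replaced by a file opened with `newline=""` and `encoding="utf-8"`.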

3.2.1 Importing 100,000 nodes

neo4j-sh (?)$ using periodic commit 10000 load csv with headers from "file:/data/stale/data01/neo4j/node_uuid_10w.csv" as line with line create (:Test {uuid:line.uuid, name:line.name});
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 100000
Properties set: 200000
Labels added: 100000
3412 ms

Importing 100,000 records (nodes only, no relationships) took 3.412 s.

3.2.2 Importing 10 million nodes

neo4j-sh (?)$ load csv from "file:/data/stale/data01/neo4j/node_uuid_1kw.csv" as line return count(*);
+----------+
| count(*) |
+----------+
| 10000010 |
+----------+
1 row
7434 ms
neo4j-sh (?)$ using periodic commit 10000 load csv with headers from "file:/data/stale/data01/neo4j/node_uuid_1kw.csv" as line with line create (:Test {uuid:line.uuid, name:line.name});
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 10000009
Properties set: 20000018
Labels added: 10000009
151498 ms

Importing 10 million records (nodes only, no relationships) took 151.498 s.

3.2.3 Importing 1 million relationships

neo4j-sh (?)$ using periodic commit 100000 load csv with headers from "file:/data/stale/data01/neo4j/relathionship_uuid_100w.csv" as line with line merge (n1:Test {uuid:line.uuid1}) merge (n2:Test {uuid:line.uuid2}) with * create (n1)-[r:Relationship]->(n2);
+-------------------+
| No data returned. |
+-------------------+
Relationships created: 1000000
75737 ms

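Creating 1 million relationships took 75.737 s.

Because I had imported the nodes beforehand, every node already existed when merge ran. The output confirms that only 1 million relationships were created and no nodes.

This approach has a drawback, though: the relationship type must be hard-coded, so it suits data with a single relationship type and not data with several. Another downside is the merge: the uuid is a String, so the import slows down as the data grows.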
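
One possible way to live with the hard-coded type, which I did not test here, is to split the relationship rows by type and generate one load csv statement per type. The Python sketch below only builds the statement text. The helper names, the per-type file URLs and the `uuid1`/`uuid2` column names are assumptions modelled on the query in 3.2.3.

```python
from collections import defaultdict

def split_by_type(rows):
    """Group (start_uuid, end_uuid, rel_type) rows by relationship type."""
    groups = defaultdict(list)
    for start, end, rel_type in rows:
        groups[rel_type].append((start, end))
    return dict(groups)

def load_csv_statement(rel_type, csv_url):
    """Build one load csv statement with the type written in as a literal."""
    if not rel_type.isidentifier():
        raise ValueError(f"unsafe relationship type: {rel_type!r}")
    return (
        f'using periodic commit 10000 load csv with headers from "{csv_url}" as line '
        "merge (n1:Test {uuid:line.uuid1}) merge (n2:Test {uuid:line.uuid2}) "
        f"with * create (n1)-[:{rel_type}]->(n2);"
    )

rows = [("a", "b", "RelationShip1"), ("c", "a", "RelationShip2"), ("b", "c", "RelationShip1")]
groups = split_by_type(rows)
for rel_type in sorted(groups):
    # hypothetical file name per type; each file would hold only that type's rows
    print(load_csv_statement(rel_type, f"file:/tmp/rel_{rel_type}.csv"))
```

Each group would be written to its own CSV file and loaded with its own statement, so the cost is one pass over the data per relationship type.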

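The load csv speed I quote is the node import time plus the relationship import time.

Importing 1 million nodes + relationships (Label and RelationShip both hard-coded, i.e. a single Label and a single RelationShip) took 15.149 + 75.737 = 90.886 s in total, where 15.149 s appears to be one tenth of the 151.498 s measured for 10 million nodes. That puts load csv at roughly 11,000 records/s (1.1w/s), but this setup is rarely used in practice, so treat the figure as a rough reference only.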
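
The arithmetic behind that figure, written out as a small Python sketch. It counts one node plus its relationship as one record, which is my reading of how the rates in this post are computed; `records_per_second` is just an illustrative helper.

```python
def records_per_second(records, *durations_s):
    """Throughput when the given phases run one after another."""
    return records / sum(durations_s)

node_time_s = 15.149  # 1 million nodes, as quoted above
rel_time_s = 75.737   # 1 million relationships, measured in 3.2.3
rate = records_per_second(1_000_000, node_time_s, rel_time_s)
print(f"{rate:.0f} records/s")
```

The result lands at about 11,000 records per second.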

3.3 neo4j-import

3.3.1 Data format

node.csv

uuid:ID(users),name:String,:Label
c63bc1e7dc594fd49fbe36dd664ff0a6,"維特",Label1
b52fb5f2266b4edbadc82b5ec4c430b8,"廖二鬆",Label2
d95d430cfeee47dd95f9bf5e0ec1ae93,"徐青偏",Label3
b2d1fffc8173461fa603d4fbb601b3ee,"楊礎維",Label2

relationship.csv

uuid:START_ID(users),uuid:END_ID(users),:TYPE
c63bc1e7dc594fd49fbe36dd664ff0a6,b2d1fffc8173461fa603d4fbb601b3ee,RelationShip1
d95d430cfeee47dd95f9bf5e0ec1ae93,c63bc1e7dc594fd49fbe36dd664ff0a6,RelationShip2
b2d1fffc8173461fa603d4fbb601b3ee,b52fb5f2266b4edbadc82b5ec4c430b8,RelationShip3
b52fb5f2266b4edbadc82b5ec4c430b8,d95d430cfeee47dd95f9bf5e0ec1ae93,RelationShip1
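
Since every relationship row refers to node ids from node.csv, it can be worth checking the two files against each other before a long import. The following Python sketch is an assumption of mine about how such a pre-check could look, not part of the original tests. It reads the ids into memory, which would not scale to the 110 million row case without more care.

```python
import csv
import io

def dangling_relationships(node_file, rel_file):
    """Return relationship rows whose start or end id is not a known node id."""
    node_rows = csv.reader(node_file)
    next(node_rows)  # skip the header line
    node_ids = {row[0] for row in node_rows if row}
    rel_rows = csv.reader(rel_file)
    next(rel_rows)  # skip the header line
    return [row for row in rel_rows
            if row and (row[0] not in node_ids or row[1] not in node_ids)]

nodes = io.StringIO(
    'uuid:ID(users),name:String,:Label\n'
    'c63bc1e7dc594fd49fbe36dd664ff0a6,"維特",Label1\n'
    'b52fb5f2266b4edbadc82b5ec4c430b8,"廖二鬆",Label2\n'
)
rels = io.StringIO(
    'uuid:START_ID(users),uuid:END_ID(users),:TYPE\n'
    'c63bc1e7dc594fd49fbe36dd664ff0a6,b52fb5f2266b4edbadc82b5ec4c430b8,RelationShip1\n'
    'b52fb5f2266b4edbadc82b5ec4c430b8,d95d430cfeee47dd95f9bf5e0ec1ae93,RelationShip1\n'
)
bad = dangling_relationships(nodes, rels)
print(len(bad), "dangling relationship(s)")
```

In this toy input the second relationship points at a node id that is absent from the node rows, so it is the one reported.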

3.3.2 Importing 10 million records (integer ids, ASCII-only properties)

IMPORT DONE in 27s 932ms. 
Imported:
  10000000 nodes
  10000000 relationships
  20000000 properties
Peak memory usage: 209.81 MB

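Initial import of 10 million records into an empty database (10 million nodes, 10 million relationships, 20 million properties; ids are integers and the properties contain only digits and English letters) took 27s 932ms.

3.3.3 Importing 10 million records (with Chinese properties)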

IMPORT DONE in 1m 50s 9ms. 
Imported:
  10000000 nodes
  10000000 relationships
  20000000 properties
Peak memory usage: 209.81 MB

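Initial import of 10 million records into an empty database (10 million nodes, 10 million relationships, 20 million properties, including Chinese properties) took 1m 50s 9ms.

3.3.4 Importing 110 million records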

IMPORT DONE in 15m 9s 37ms. 
Imported:
  110000010 nodes
  110000000 relationships
  220000020 properties
Peak memory usage: 2.27 GB
There were bad entries which were skipped and logged into /data/stale/data01/neo4j/neo4j-community-3.1.0/data/databases/test_uuid_1y_graph.db/bad.log

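Initial import of 110 million records into an empty database (110 million nodes, 110 million relationships, 220 million properties) took 15m 9s 37ms.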
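
To compare the three neo4j-import runs, the "IMPORT DONE in ..." durations can be converted to seconds and divided into the node counts. The Python sketch below does that for the figures reported above. `parse_duration` is a hypothetical helper and only handles the m, s and ms units that appear in these logs.

```python
import re

UNIT_SECONDS = {"m": 60.0, "s": 1.0, "ms": 0.001}

def parse_duration(text):
    """Convert a duration such as '15m 9s 37ms' to seconds."""
    total = 0.0
    for value, unit in re.findall(r"(\d+)(ms|m|s)\b", text):
        total += int(value) * UNIT_SECONDS[unit]
    return total

runs = [
    ("3.3.2", 10_000_000, "27s 932ms"),
    ("3.3.3", 10_000_000, "1m 50s 9ms"),
    ("3.3.4", 110_000_010, "15m 9s 37ms"),
]
rates = {}
for section, nodes, duration in runs:
    rates[section] = nodes / parse_duration(duration)
    print(section, f"{rates[section]:.0f} nodes/s")
```

Counting one node together with its relationship as one record, the 110 million run works out to roughly 121,000 per second, in line with the roughly 120,000/s (12w/s) figure in the comparison table.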

3.4 BatchInserter

batch-import calls the BatchInserter code, so I did not test BatchInserter separately. Its speed can be assumed to match that of batch-import.

3.5 batch-import

Data format

node.csv

uuid:string:users,name:String,:label
c63bc1e7dc594fd49fbe36dd664ff0a6,"維特",Label1
b52fb5f2266b4edbadc82b5ec4c430b8,"廖二鬆",Label2
d95d430cfeee47dd95f9bf5e0ec1ae93,"徐青偏",Label3
b2d1fffc8173461fa603d4fbb601b3ee,"楊礎維",Label2

relationship.csv

uuid:string:users,uuid:string:users,type
c63bc1e7dc594fd49fbe36dd664ff0a6,b2d1fffc8173461fa603d4fbb601b3ee,RelationShip1
d95d430cfeee47dd95f9bf5e0ec1ae93,c63bc1e7dc594fd49fbe36dd664ff0a6,RelationShip2
b2d1fffc8173461fa603d4fbb601b3ee,b52fb5f2266b4edbadc82b5ec4c430b8,RelationShip3
b52fb5f2266b4edbadc82b5ec4c430b8,d95d430cfeee47dd95f9bf5e0ec1ae93,RelationShip1

Using: Importer batch.properties /data/stale/data01/neo4j/neo4j-community-3.1.0/data/databases/test_uuid_1000w_graph.db /data/stale/data01/neo4j/node_uuid_100w.csv /data/stale/data01/neo4j/relationship_uuid_100w.csv
  
Using Existing Configuration File
..........
Importing 1000000 Nodes took 15 seconds 
..........
Importing 1000000 Relationships took 16 seconds 
  
Total import time: 92 seconds

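With 10 million records already in the database, importing 1 million records (1 million nodes, 1 million relationships, 2 million properties, including Chinese properties) took 92 s.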

3.6 apoc

load csv + merge + apoc.create.relationship

neo4j-sh (?)$ using periodic commit 1000000
> load csv from 'file:/data/stale/data01/neo4j/relathionship_uuid_1kw.csv' as line fieldterminator ','
> merge (n1:Test {uuid: line[0]})
> merge (n2:Test {uuid: line[1]})
> with n1, n2, line
> CALL apoc.create.relationship(n1, line[2], {}, n2) YIELD rel
> return count(rel) ;
+------------+
| count(rel) |
+------------+
| 10000010   |
+------------+
1 row
Nodes created: 8645143
Properties set: 8645143
Labels added: 8645143
2395852 ms

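Incrementally updating 10 million records on top of 110 million existing records took 2395.852 s.

Memory usage: VIRT 90G, RES 78G

This part also uses merge, so the larger the data set, the slower it gets.

4. Conclusion

Choose the method that best fits your situation.

(1) neo4j-import is fast, but it requires an empty database and Neo4j must be stopped during the import, i.e. it is an offline import. You also have to prepare the data in advance, preferably without duplicates. If there are duplicates, they can be skipped during the import and then inspected or corrected using bad.log.

(2) batch-import can import incrementally, but Neo4j must be stopped during the import (offline import). The incoming data is not compared with what is already in the database, so all of it must be new, otherwise duplicates will appear.

(3) load csv is fairly general-purpose and can import while the Neo4j database is running, but it is relatively slow, the data must be prepared in advance, and it cannot create Labels or RelationShips dynamically.

(4) apoc is quite handy and can create Labels and RelationShips dynamically, but its speed is only moderate.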
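
Points (1) and (2) both come down to removing duplicates before an offline import. A minimal Python sketch of that preprocessing step, assuming the id is the first column and the rows fit in memory. The function name and the toy rows are mine, not something used in the tests above.

```python
def dedupe_by_id(rows, id_column=0):
    """Split rows into (kept, duplicates); the first row seen for an id wins."""
    seen = set()
    kept, duplicates = [], []
    for row in rows:
        key = row[id_column]
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
            kept.append(row)
    return kept, duplicates

rows = [
    ["a1", "name-1", "Label1"],  # illustrative ids, not real uuids
    ["b2", "name-2", "Label2"],
    ["a1", "name-3", "Label1"],  # same id as the first row
]
kept, duplicates = dedupe_by_id(rows)
print(len(kept), "kept,", len(duplicates), "duplicate(s)")
```

Keeping the rejected rows in a separate list makes it possible to review them by hand, in the same way bad.log is reviewed after a neo4j-import run.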
