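protobuf serialization/deserialization performance and open questions

For a tensorflow project, I needed to measure the performance of protobuf serialization and deserialization. The test procedure and results are as follows:

I. Test environment

python 2.7 + proto3

II. Test method

1. Define a custom proto message (adapted from the example shipped with protobuf):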

syntax = "proto3";

message Person {
  string name = 1;
  int32 id = 2;  // Unique ID number for this person.
  string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    string number = 1;
    PhoneType type = 2;
  }

  repeated PhoneNumber phones = 4;
}

// Our address book file is just one of these.
message AddressBook {
  repeated Person people = 1;
}

2. Compile the proto file:

protoc --python_out=. addressbook.proto

This generates addressbook_pb2.py.


3. In the test script, vary the loop count to change the size of the serialized payload:

for i in range(1024 * 1024):
  PromptForAddress(address_book.people.add())

4. Serialize

  begin = datetime.datetime.now()
  serialized = address_book.SerializeToString()
  end = datetime.datetime.now()
  print end-begin
  print len(serialized)
  f.write(serialized)

5. Deserialize
    book = f.read()
    parsebegin = datetime.datetime.now()
    address_book.ParseFromString(book)
    parseend = datetime.datetime.now()
    print parseend-parsebegin
    print len(book)
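
The two snippets above time a single call with datetime.datetime.now(). As a rough sketch of an alternative, the helper below repeats a call several times with a monotonic high-resolution timer and keeps the best run, which tends to be less sensitive to background noise. It assumes Python 3.3 or later (the scripts in this post target Python 2.7), and both time_call and fake_serialize are hypothetical names used only for illustration. fake_serialize is a stand-in workload so the sketch runs without protobuf installed.

```python
import time


def time_call(func, *args, repeat=5):
    """Call func(*args) `repeat` times; return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = func(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best, result


def fake_serialize(size):
    """Stand-in for address_book.SerializeToString(): builds `size` bytes."""
    return b"x" * size


best_seconds, payload = time_call(fake_serialize, 1024 * 1024)
print("best of 5: %.6f s for %d bytes" % (best_seconds, len(payload)))
```

With the real message, the same helper could be called as time_call(address_book.SerializeToString) and time_call(address_book.ParseFromString, book).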

The complete Python script is as follows:
#! /usr/bin/env python

# See README.txt for information and build instructions.

import addressbook_pb2
import sys
import datetime

# This function fills in a Person message with fixed test data.
def PromptForAddress(person):
  person.id = 160824
  person.name = "xxxxx xxxxx"
  person.email = "[email protected]"
  phone_number = person.phones.add()
  phone_number.number = "12345678"
  phone_number.type = addressbook_pb2.Person.MOBILE

  phone_number = person.phones.add()
  phone_number.number = "23456789"
  phone_number.type = addressbook_pb2.Person.HOME

  phone_number = person.phones.add()
  phone_number.number = "34567890"
  phone_number.type = addressbook_pb2.Person.WORK

# Main procedure:  Reads the entire address book from a file (timing the
#   parse), adds 1024 * 1024 people with fixed test data, then serializes
#   the result (timing it) and writes it back out to the same file.
if len(sys.argv) != 2:
  print "Usage:", sys.argv[0], "ADDRESS_BOOK_FILE"
  sys.exit(-1)
address_book = addressbook_pb2.AddressBook()

# Read the existing address book.
try:
  with open(sys.argv[1], "rb") as f:
    book = f.read()
    parsebegin = datetime.datetime.now()
    address_book.ParseFromString(book)
    parseend = datetime.datetime.now()
    print parseend-parsebegin
    print len(book)



#    address_book.ParseFromString(f.read())
except IOError:
  print sys.argv[1] + ": File not found.  Creating a new file."

# Add an address.
for i in range(1024 * 1024):
  PromptForAddress(address_book.people.add())

# Write the new address book back to disk.
with open(sys.argv[1], "wb") as f:
  begin = datetime.datetime.now()
  serialized = address_book.SerializeToString()
  end = datetime.datetime.now()
  print end-begin
  print len(serialized)
  f.write(serialized)
  

6. Vary the loop count and record the serialization and deserialization performance for protobuf messages of different sizes.

III. Test results

Size (MB)    Serialization (s)    Deserialization (s)
1.03         0.799453             0.950107
53.00        36.759911            43.303041
61.64        41.674104            52.206466
81.00        63.077295            79.234909
106.00       72.048027            88.280556
102.83       81.08806             102.28786
162.00       128.883403           164.042591
205.66       163.994605           199.729636
243.00       197.582673           246.699898

Note: the size column is the length of the string obtained after serialization, i.e. len(serialized) in the script.
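
As a quick way to read the table, the sketch below converts each row into a throughput in MB/s. The numbers are copied from the table; the names ROWS, RATES and throughput are only illustrative.

```python
# (size_mb, serialize_s, deserialize_s) rows copied from the results table.
ROWS = [
    (1.03, 0.799453, 0.950107),
    (53.00, 36.759911, 43.303041),
    (61.64, 41.674104, 52.206466),
    (81.00, 63.077295, 79.234909),
    (106.00, 72.048027, 88.280556),
    (102.83, 81.08806, 102.28786),
    (162.00, 128.883403, 164.042591),
    (205.66, 163.994605, 199.729636),
    (243.00, 197.582673, 246.699898),
]


def throughput(rows):
    """Return (size_mb, serialize_mb_per_s, deserialize_mb_per_s) per row."""
    return [(size, size / ser, size / de) for size, ser, de in rows]


RATES = throughput(ROWS)
for size, ser_rate, de_rate in RATES:
    print("%7.2f MB  serialize %.2f MB/s  deserialize %.2f MB/s"
          % (size, ser_rate, de_rate))
```

If the time were exactly proportional to the size, every row would give the same rate. Here the rates stay within roughly 1.0 to 1.5 MB/s across all rows.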

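IV. Analysis and open questions

The results show that the time grows roughly linearly with size: the larger the payload, the longer it takes. At 243 MB, serialization takes about 198 s (a little over 3 minutes) and deserialization about 247 s (about 4 minutes). The results raise the following questions:

1. Is the test method sound? I believe it is workable, but the measured times are higher than I expected.

2. This test was run in Python. When I ran the test in C++, the results were much better than in Python (the C++ part follows the article "Performance comparison of FlatBuffers and protobuf").

I only compared a small payload (1 KB), looping serialization and deserialization 100 times each. The results are below. (Both tests use the same proto file. The C++ test uses ParseFromArray/SerializeToArray, and the Python test uses ParseFromString/SerializeToString.)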

          Serialization (ms)    Deserialization (ms)
Python    63.879                82.89
C++       1.336                 1.352

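By these numbers, C++ is roughly 48 times faster than Python for serialization and roughly 61 times faster for deserialization. Can C++ really be that much faster than Python?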
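
One possible factor, which is an assumption on my part and not something the measurements above establish, is which backend the Python protobuf package is using: a pure-Python backend is generally much slower than a C++-backed one. The sketch below reports the active backend. It relies on google.protobuf.internal.api_implementation, which is an internal module and may change between releases; protobuf_backend is a hypothetical helper name, and it returns None when protobuf is not installed.

```python
import os


def protobuf_backend():
    """Return the active protobuf Python backend name, or None if unavailable."""
    try:
        from google.protobuf.internal import api_implementation
    except ImportError:
        return None
    # Typically a short string such as "python", "cpp" or "upb".
    return api_implementation.Type()


backend = protobuf_backend()
print("protobuf backend:", backend)
# This environment variable, when set, requests a specific backend.
print("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION =",
      os.environ.get("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"))
```

If this prints "python", rerunning the benchmark with a C++-backed installation would show how much of the gap comes from the backend and how much from the language.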

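3. According to the material I have read, the cost of serialization and deserialization also depends on the structure of the proto (for example, deep nesting). I therefore suggest running another test together with tensorflow once I have learned it: when training a model, time the serialization and deserialization steps separately.

The questions above still need discussion. Comments and corrections are welcome.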

V. References and further reading