[Source Code Analysis] PyTorch Pipeline Parallelism (2) -- How to Partition the Model

0x00 Abstract

In the previous article we introduced the basics of PyTorch pipeline parallelism. In this article we look at its automatic balancing mechanism and model partitioning.

Links to the other articles in this pipeline-parallelism series:

[Source Code Analysis] Deep Learning Pipeline Parallelism: GPipe (1) -- Basic Pipeline Implementation

[Source Code Analysis] Deep Learning Pipeline Parallelism: GPipe (2) -- Gradient Accumulation

[Source Code Analysis] Deep Learning Pipeline Parallelism: GPipe (3) -- Recomputation

[Source Code Analysis] Deep Learning Pipeline Parallelism: PipeDream (1) -- The Profile Phase

[Source Code Analysis] Deep Learning Pipeline Parallelism: PipeDream (2) -- Computing Partitions

[Source Code Analysis] Deep Learning Pipeline Parallelism: PipeDream (3) -- Model Conversion

[Source Code Analysis] Deep Learning Pipeline Parallelism: PipeDream (4) -- The Runtime Engine

[Source Code Analysis] Deep Learning Pipeline Parallelism: PipeDream (5) -- The Communication Module

[Source Code Analysis] Deep Learning Pipeline Parallelism: PipeDream (6) -- The 1F1B Strategy

[Source Code Analysis] PyTorch Pipeline Parallelism (1) -- Basics

The figures in this article come from the paper and the GitHub source code.

0x01 The Problem

The first questions pipeline parallelism has to face are:

  • How do we split a large model into several smaller models? What is the splitting algorithm?
  • How do we assign these smaller models to multiple devices? What is the assignment algorithm?
  • How do we achieve overall performance that is optimal or near-optimal? What is the evaluation metric?

For example, how should a large model with 6 layers be split into three smaller models?

+---------------------------------------------------------------------------------+
|                                                                                 |
|  Layer 1 +---> Layer 2 +---> Layer 3 +---> Layer 4 +---> Layer 5 +---> Layer 6  |
|                                                                                 |
+-----------------------------------------+---------------------------------------+
                                          |
                                          |    ?    ?    ?    ?    ?
                                          |
                                          v
   +--------------------+       +--------------------+       +--------------------+
   | Device 1           |       | Device 2           |       | Device 3           |
   |                    |       |                    |       |                    |
   |   Layer 1          |       |   Layer 4          |       |   Layer 6          |
   |      +             |       |      +             |       |                    |
   |      |             |       |      |             |       |                    |
   |      v             |       |      v             |       |                    |
   |   Layer 2          |       |   Layer 5          |       |                    |
   |      +             |       |                    |       |                    |
   |      |             |       |                    |       |                    |
   |      v             |       |                    |       |                    |
   |   Layer 3          |       |                    |       |                    |
   |                    +------>+                    +------>+                    |
   +--------------------+       +--------------------+       +--------------------+

Next, let's see how torchgpipe solves these problems.

0x02 Automatic Balancing

torchgpipe provides the submodule torchgpipe.balance to compute partitions, with the goal of making the resource difference between any two (pairwise) partitions as small as possible. Resource usage is measured by profiling.

2.1 Automatic Balancing

How the model is split affects GPU utilization: a layer with a heavy computational load, for example, slows down everything downstream, so we need to find a model's optimal balance point. But determining that optimal balance is hard, especially while the user is still designing the model, in which case the architecture may keep changing over time. In such cases torchgpipe strongly recommends torchgpipe.balance for automatic balancing. It will not give the user the optimal balance, but it produces a good-enough one.

Note that this feature is provided by torchgpipe itself; it does not come from the original GPipe paper by Huang et al.

torchgpipe provides two balancing tools. Both work on per-layer profiling results, and users can pick whichever suits their needs:

  • ~torchgpipe.balance.balance_by_time: tracks the elapsed time of each layer.
  • ~torchgpipe.balance.balance_by_size: detects the CUDA memory usage of each layer.

It is used as follows; the user needs to provide a sample input for the model.

from torchgpipe import GPipe
from torchgpipe.balance import balance_by_time

partitions = torch.cuda.device_count()
sample = torch.rand(128, 3, 224, 224)  # the user needs to provide a sample input
balance = balance_by_time(partitions, model, sample)

model = GPipe(model, balance, chunks=8)

2.2 Basic Classes/Functions

2.2.1 Batch

Batch is a basic class located in torchgpipe/microbatch.py. Its job is to wrap a tensor or a tuple of tensors so that they can be handled uniformly. Batch stores the tensor(s) in its value member variable, and when its call method is invoked, the function passed in is applied to the value tensor(s).

For example, the Pipeline.compute method that we will cover later contains the following code, which applies a partition to the tensors inside a batch:

def compute(batch: Batch = batch,
            partition: nn.Sequential = partition,
            skip_tracker: SkipTrackerThroughPotals = skip_trackers[i],
            ) -> Batch:
    with use_skip_tracker(skip_tracker):
        return batch.call(partition)

task = Task(streams[j], compute=compute, finalize=None)

The definition of Batch is as follows:

Tensors = Tuple[Tensor, ...]
TensorOrTensors = Union[Tensor, Tensors]
Function = Callable[[TensorOrTensors], TensorOrTensors]


class Batch:
    """An abstraction of an atomic tensor or a tuple of tensors. This
    eliminates every boilerplate code to classify an atomic tensor or a tuple
    of tensors.
    ::

        x = generate_tensor_or_tensors()
        x = Batch(x)

        # in-place update
        x[0] = F.apply(x[0])
        x[:] = F.apply(*x)

        # f(x) if x is a tensor.
        # f(*x) if x is a tuple of tensors.
        # y is also a batch.
        y = x.call(f)

    """

    def __init__(self, value: TensorOrTensors) -> None:
        self.value = value
        self.atomic = torch.is_tensor(value)

    @property
    def tensor(self) -> Tensor:
        """Retrieves the underlying tensor."""
        if not self.atomic:
            raise AttributeError('not atomic batch')
        return cast(Tensor, self.value)

    @property
    def tensors(self) -> Tensors:
        """Retrieves the underlying tensors."""
        if self.atomic:
            raise AttributeError('batch is atomic')
        return cast(Tensors, self.value)

    @property
    def tensor_or_tensors(self) -> TensorOrTensors:
        """Retrieves the underlying tensor or tensors regardless of type."""
        return self.value

    def call(self, function: Function) -> 'Batch':  # this is the key method
        """Calls a function by the underlying tensor or tensors. It also wraps
        the output with :class:`Batch`.
        """
        return Batch(function(self.value))  # invokes the model's forward
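
As a quick illustration (a minimal sketch of my own, not from the torchgpipe source), Batch behaves the same way whether it wraps a single tensor or a tuple of tensors:

import torch
from torchgpipe.microbatch import Batch

x = Batch(torch.ones(2, 2))                     # atomic: wraps a single tensor
y = Batch((torch.ones(2, 2), torch.zeros(2)))   # non-atomic: wraps a tuple of tensors

print(x.atomic, y.atomic)     # True False
z = x.call(lambda t: t * 2)   # applies the function to the underlying tensor(s)
print(z.tensor.sum())         # tensor(8.)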

2.2.2 layerwise_sandbox

The layerwise_sandbox method copies the layers of a model without affecting the original model, which makes profiling easier.

def layerwise_sandbox(module: nn.Sequential,
                      device: torch.device,
                      ) -> Generator[nn.Module, None, None]:
    """Copies layers for ease to profile. It doesn't modify the given
    module.
    """
    for layer in module:
        layer_copy = copy.deepcopy(layer)
        layer_copy.to(device)
        layer_copy.train()
        yield layer_copy
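
A small usage sketch (illustrative, assuming the layerwise_sandbox defined above is in scope): each yielded module is an independent, train-mode copy placed on the profiling device, so nothing done to the copy affects the original model.

import torch
from torch import nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

first_copy = next(layerwise_sandbox(model, torch.device('cpu')))
print(first_copy is model[0])   # False: it is a deep copy of the first layer
print(first_copy.training)      # True: copies are put into train mode for profiling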

2.2.3 detach

The detach method detaches tensors from the autograd graph and produces a new set of tensors. These tensors are cut off from the current computation graph, but they still point to the original storage. detach can be used to cut off back-propagation along certain branches.

def detach(batch: Batch) -> None:
    """Detaches from autograd graph."""
    for i, x in enumerate(batch):
        batch[i] = x.detach().requires_grad_(x.requires_grad)
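
A small illustration (my own, using plain PyTorch) of what a single iteration of this loop does: the detached tensor shares storage with the original and keeps its requires_grad flag, but is no longer attached to the old graph, so back-propagation through earlier branches is cut off.

import torch

x = torch.ones(3, requires_grad=True) * 2         # produced by an op, so it has a grad_fn
d = x.detach().requires_grad_(x.requires_grad)    # what detach() does per tensor

print(d.grad_fn)                     # None: cut off from the old graph
print(d.requires_grad)               # True: the flag is preserved
print(d.data_ptr() == x.data_ptr())  # True: still points at the same storage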

detach is used in many places in the torchgpipe code. As the comment below explains, it is a workaround for a PyTorch bug.

    # A Python autograd function might fail with this error:
    #
    # RuntimeError: Returning Variables sharing storage with other Variables
    # that require grad is not supported in Python functions. Please submit a
    # feature request if you hit this error.
    #
    # It doesn't look like an essential restriction. But it happens on the
    # current PyTorch version. To avoid it, we should detach the tensor before
    # returning by identity autograd functions, such as Wait, Fork, and Join.
    #

2.3 Balancing by Computation Time

The balance_by_time method balances the partitions by elapsed execution time. Its parameters are:

  • partitions: the number of partitions

  • module: the sequential model to be partitioned

  • sample: a sample input with a given batch size

Essentially it calls profile_times to obtain the per-layer execution times from the sample, and then performs the partitioning.

def balance_by_time(partitions: int,
                    module: nn.Sequential,
                    sample: TensorOrTensors,
                    *,
                    timeout: float = 1.0,
                    device: Device = torch.device('cuda'),
                    ) -> List[int]:
    """Naive automatic balancing by elapsed time per layer.
    ::

        sample = torch.empty(128, 3, 224, 224)
        balance = balance_by_time(torch.cuda.device_count(), model, sample)
        gpipe = GPipe(model, balance, chunks=8)

    Args:
        partitions (int):
            intended number of partitions
        module (torch.nn.Sequential):
            sequential module to be partitioned
        sample (torch.Tensor):
            example input with arbitrary batch size

    Keyword Args:
        timeout (float):
            profiling iterates again if the timeout (in second) is not exceeded
            (default: ``1.0``)
        device ('cpu' or 'cuda' device):
            CPU or CUDA device where each layer is profiled (default: the
            current CUDA device)

    Returns:
        A list of number of layers in each partition. Use it for the `balance`
        parameter of :class:`~torchgpipe.GPipe`.

    .. note::
        `module` and `sample` must be placed on the same device.
    """
    times = profile_times(module, sample, timeout, torch.device(device))
    return balance_cost(times, partitions)

The Batch class used here wraps a tensor or a tuple of tensors so that they can be handled through a single interface.

profile_times obtains the per-layer execution times from the sample. Its logic is:

  • Iterate over the layers of the model; for each layer:

    • Wait for all kernels in all streams on the current device to finish
    • Record the start time
    • Run the layer's forward computation
    • Collect the output tensors that require gradients; if there are any, run the backward computation
    • Wait for all kernels in all streams on the current device to finish
    • Record the end time
  • Finally, return the list of per-layer execution times.

def profile_times(module: nn.Sequential,
                  sample: TensorOrTensors,
                  timeout: float,
                  device: torch.device,
                  ) -> List[int]:
    """Profiles elapsed times per layer."""
    if any(p.grad is not None for p in module.parameters()):
        raise ValueError('some parameter already has gradient')

    _batch = Batch(sample)
    for i, x in enumerate(_batch):
        _batch[i] = x.detach().to(device).requires_grad_(x.requires_grad)

    time_bufs: List[List[float]] = [[] for _ in module]
    begun_at = time.time()

    while time.time() - begun_at < timeout:
        batch = _batch

        # Iterate over the layers of the model.
        for i, layer in enumerate(layerwise_sandbox(module, device)):
            detach(batch)

            if device.type == 'cuda':
                torch.cuda.synchronize(device)  # wait for all kernels in all streams on this device
            tick = time.time()  # start time

            # Forward
            batch = batch.call(layer)  # run this layer's forward computation

            # Backward
            # Collect the output tensors that require gradients.
            backward_tensors = tuple(y for y in batch if y.requires_grad)
            # Run the backward computation.
            if backward_tensors:
                torch.autograd.backward(backward_tensors, backward_tensors)

            if device.type == 'cuda':
                torch.cuda.synchronize(device)  # wait for all kernels in all streams on this device
            tock = time.time()  # end time

            time_bufs[i].append(tock - tick)

    us = 1_000_000
    return [sum(int(t*us) for t in buf) for buf in time_bufs]
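
For intuition, here is a minimal end-to-end sketch (illustrative only; the exact numbers depend on your hardware) that profiles a small sequential model on the CPU and turns the per-layer times into a balance:

import torch
from torch import nn
from torchgpipe.balance import balance_by_time

model = nn.Sequential(
    nn.Linear(64, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)
sample = torch.rand(32, 64)

# Profiling on the CPU just for the sketch; 'cuda' is the default device.
balance = balance_by_time(2, model, sample, device='cpu')
print(balance)   # e.g. [2, 3] -- per-partition layer counts that sum to 5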

2.4 Balancing by Memory Size

The balance_by_size method balances the partitions by runtime memory usage. Its parameters are:

  • partitions: the number of partitions; judging from the example, this can be taken to be the number of devices.

  • module: the sequential model to be partitioned

  • input: an example mini-batch with the same size as the one used for training

Essentially it calls profile_sizes to measure per-layer runtime memory usage from the input, and then performs the partitioning.

During training, the memory required for the parameters depends on which optimizer is used. An optimizer may keep per-parameter buffers to track optimization statistics internally, such as the momentum buffer in SGD.

To get a more reliable size-based balance, the user should specify a param_scale appropriate for the optimizer. The default param_scale is 2 rather than 1 because of gradient accumulation: every optimizer needs to keep a gradient alongside each parameter. The docstring below lists reference values for common optimizers.

def balance_by_size(partitions: int,
                    module: nn.Sequential,
                    input: TensorOrTensors,
                    *,
                    chunks: int = 1,
                    param_scale: float = 2.0,
                    device: Device = torch.device('cuda'),
                    ) -> List[int]:
    """Naive automatic balancing by CUDA memory usage per layer.

    During training, required memory for parameters depends on which optimizer
    is used. Optimizers may use buffers for each parameter to track
    optimization statistics internally, such as momentum buffer in SGD.

    To get more reliable size based balance, you should specify `param_scale`
    with regard to your optimizer. The default `param_scale` is 2 instead of 1
    due to gradient accumulation which is necessary for every optimizer.

    Follow this guide to choose correct `param_scale` for typical optimizers:

    ========= ============= =========================================
    Optimizer `param_scale` Internal State
    ========= ============= =========================================
    SGD       2--3          (momentum_buffer)
    Adam      4--5          exp_avg, exp_avg_sq, (max_exp_avg_sq)
    Adadelta  4             square_avg, acc_delta
    Adagrad   3             sum
    RMSprop   3--5          square_avg, (momentum_buffer), (grad_avg)
    ========= ============= =========================================

    Here's a simple example with the Adam optimizer::

        balance = balance_by_size(
            torch.cuda.device_count(),
            model,

            # Same size with mini-batch to train
            torch.empty(1024, 3, 224, 224),

            # Number of micro-batches to train with GPipe
            chunks=8,

            # 4 for Adam
            param_scale=4.0,
        )

        gpipe = GPipe(model, balance, chunks=8)
        adam = Adam(gpipe.parameters())

    Args:
        partitions (int):
            intended number of partitions
        module (torch.nn.Sequential):
            sequential module to be partitioned
        input (torch.Tensor):
            example mini-batch with the same size to train

    Keyword Args:
        chunks (int):
            number of micro-batches will be used to train (default: ``1``)
        param_scale (float):
            how many copies of parameters would be allocated for training. It
            depends on optimizer. See the above guide. (default: ``2.0``)
        device ('cuda' device):
            CUDA device where each layer is profiled (default: the current CUDA
            device)

    Returns:
        A list of number of layers in each partition. Use it for the `balance`
        parameter of :class:`~torchgpipe.GPipe`.

    .. note::
        `module` and `input` must be placed on the same CUDA device.
    """
    sizes = profile_sizes(module, input, chunks, param_scale, torch.device(device))
    return balance_cost(sizes, partitions)
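
A rough back-of-envelope sketch (my own illustration, not part of torchgpipe) of why param_scale is around 4 for Adam: each parameter needs space for itself, its gradient, and Adam's exp_avg and exp_avg_sq buffers.

import torch
from torch import nn

layer = nn.Linear(1024, 1024)
param_bytes = sum(p.numel() * p.element_size() for p in layer.parameters())

param_scale = 4.0   # weight + grad + exp_avg + exp_avg_sq (Adam without amsgrad)
print(f'parameters alone:             {param_bytes / 2**20:.1f} MiB')
print(f'estimated training footprint: {param_bytes * param_scale / 2**20:.1f} MiB')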

The logic of profile_sizes is as follows:

  • Iterate over the layers of the model; for each layer:

    • Use torch.cuda.memory_allocated to measure the memory used by the forward pass, i.e. the activations. torch.cuda.memory_allocated(device=None) returns the current GPU memory occupied by tensors on the given device.
    • Use p.storage().size() * p.storage().element_size() to compute the parameter size.
      • In PyTorch, a storage is a contiguous block of memory, and a tensor can be viewed as a view mapped onto a storage.
      • element_size() returns the size in bytes of a single element.
    • Combine the activation size and the parameter size (each multiplied by its scale) and append the result to the list.
  • Return the list of memory sizes.

def profile_sizes(module: nn.Sequential,
                  input: TensorOrTensors,
                  chunks: int,
                  param_scale: float,
                  device: torch.device,
                  ) -> List[int]:
    """Profiles CUDA memory usage per layer."""
    if device.type != 'cuda':
        raise ValueError('size profiler supports only CUDA device')

    batch = Batch(input)
    sizes: List[int] = []

    latent_scale = batch[0].size(0) / chunks
    for i, x in enumerate(batch):
        batch[i] = x[:1].detach().to(device).requires_grad_(x.requires_grad)

    for layer in layerwise_sandbox(module, device):
        detach(batch)

        # Detect memory usage at forward.
        # Measure the memory used by the forward pass, i.e. the activations.
        memory_before = torch.cuda.memory_allocated(device)
        batch = batch.call(layer)  # run this layer's forward pass
        memory_after = torch.cuda.memory_allocated(device)
        latent_size = memory_after - memory_before

        # Analyze size of parameters.
        param_size = sum(p.storage().size() * p.storage().element_size()
                         for p in layer.parameters())

        # Combine size of parameters and activations with normalize scales,
        # then append the result to the list.
        size = latent_size*latent_scale + param_size*param_scale
        sizes.append(int(size))

    return sizes  # the list of per-layer memory sizes
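
A small check (illustrative, plain PyTorch) of the parameter-size formula used above: storage().size() counts the elements in the underlying contiguous buffer and element_size() gives bytes per element, so for an ordinary dense parameter it equals numel() * element_size():

from torch import nn

layer = nn.Linear(128, 64)

for name, p in layer.named_parameters():
    storage_bytes = p.storage().size() * p.storage().element_size()
    numel_bytes = p.numel() * p.element_size()
    print(name, storage_bytes, numel_bytes)   # identical for dense, non-shared parameters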

2.5 The Partitioning Algorithm

Once the per-layer computation times or memory sizes have been obtained, the actual split is performed by the following code:

times = profile_times(module, sample, timeout, torch.device(device))
return balance_cost(times, partitions)

balance_cost itself is just a thin wrapper; the real algorithm is blockpartition.solve.

def balance_cost(cost: List[int], partitions: int) -> List[int]:
    partitioned = blockpartition.solve(cost, partitions)
    return [len(p) for p in partitioned]

As its comment states, blockpartition.solve implements the algorithm from the following paper:

Implements "Block Partitions of Sequences" by Imre Bárány et al.Paper: https://arxiv.org/pdf/1308.2452.pdf

It is a mathematics paper; the steps of its algorithm correspond almost one-to-one with the comments quoted in the implementation below.

The paper is a purely mathematical argument. We will not study its inner workings here, only look at what the algorithm produces.

Recall that the supported models are sequential models, so whether we measure time or memory size, the input is simply a list. The job of solve is to split this list into groups whose totals are as even as possible.

Suppose the model has 6 layers and they need to be assigned to two devices. How should they be split?

blockpartition.solve([1, 2, 3, 4, 5, 6], partitions=2) 

The result is [[1, 2, 3, 4], [5, 6]].

If we split across three devices:

solve([1, 2, 3, 4, 5, 6], partitions=3)

The result is [[1, 2, 3], [4, 5], [6]].

Then balance_cost takes the number of layers in each partition, so the final balance is:

[3, 2, 1]

The code of the partitioning algorithm is shown below; interested readers can study it alongside the paper.

def solve(sequence: List[int], partitions: int = 1) -> List[List[int]]:
    """Splits a sequence into several partitions to minimize variance for each
    partition.

    The result might not be optimal. However, it can be done only in O(kn³),
    where k is the number of partitions and n is the length of the sequence.
    """
    if partitions < 1:
        raise ValueError(f'partitions must be a positive integer ({partitions} < 1)')

    n = len(sequence)
    if n < partitions:
        raise ValueError(f'sequence is shorter than intended partitions ({n} < {partitions})')

    # Normalize the sequence in [0, 1].
    minimum = min(sequence)
    maximum = max(sequence) - minimum

    normal_sequence: List[float]
    if maximum == 0:
        normal_sequence = [0 for _ in sequence]
    else:
        normal_sequence = [(x-minimum)/maximum for x in sequence]

    splits = [n//partitions * (x+1) for x in range(partitions-1)] + [n]

    def block_size(i: int) -> float:
        start = splits[i-1] if i > 0 else 0
        stop = splits[i]
        return sum(normal_sequence[start:stop])

    def leaderboard() -> Iterator[Tuple[float, int]]:
        return ((block_size(i), i) for i in range(partitions))

    while True:
        """
        (1) Fix p ∈ [k] with M(P) = bp. So Bp is a maximal block of P.
        """
        # max_size: M(P)
        max_size, p = max(leaderboard())

        while True:
            """
            (2) If M(P) ≤ m(P) + 1, then stop.
            """
            # min_size: m(P)
            min_size, q = min(leaderboard())
            if max_size <= min_size + 1:
                return [sequence[i:j] for i, j in zip([0]+splits[:-1], splits)]

            """
            (3) If M(P) > m(P) + 1, then let m(P) = bq for the q ∈ [k] which is
                closest to p (ties broken arbitrarily). Thus Bq is a minimal block
                of P. Let Bh be the block next to Bq between Bp and Bq. (Note that
                Bh is a non-empty block: if it were, then m(P) = 0 and we should
                have chosen Bh instead of Bq.)
            """
            if p < q:
                """
                So either p < q and then h = q−1 and we define P ∗ by moving
                the last element from Bh = Bq−1 to Bq,
                """
                h = q - 1
                splits[h] -= 1
            else:
                """
                or q < p, and then h = q + 1 and P ∗ is obtained by moving the
                first element of Bh = Bq+1 to Bq.
                """
                h = q + 1
                splits[q] += 1

            """
            Set P = P ∗ . If p = h, then go to (1), else go to (2).
            """
            if p == h:
                break
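
A quick sanity check (illustrative) that the blocks produced by solve indeed have roughly equal weight, and that balance_cost simply reports their lengths:

from torchgpipe.balance import blockpartition

costs = [1, 2, 3, 4, 5, 6]
blocks = blockpartition.solve(costs, partitions=3)

print(blocks)                     # [[1, 2, 3], [4, 5], [6]]
print([sum(b) for b in blocks])   # [6, 9, 6]  -- the block weights are close to each other
print([len(b) for b in blocks])   # [3, 2, 1]  -- exactly what balance_cost returns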

0x03 Model Partitioning

3.1 Invocation

Now that we have the profiling result, the next step is to split the layers of the model. How to do so can be seen in the usage example from the comment below: pass balance as an argument to the GPipe constructor.

'''
If your model is still under development, its optimal balance would change
frequently. In this case, we highly recommend 'torchgpipe.balance' for naive
automatic balancing:

    from torchgpipe import GPipe
    from torchgpipe.balance import balance_by_time

    partitions = torch.cuda.device_count()
    sample = torch.empty(...)
    balance = balance_by_time(partitions, model, sample)

    model = GPipe(model, balance, ...)
'''

3.2 Constructing GPipe

In GPipe's __init__ we can see that the split_module function is used to perform the split:

    def __init__(self,
                 module: nn.Sequential,
                 balance: Optional[Iterable[int]] = None,
                 *,
                 devices: Optional[Devices] = None,
                 chunks: int = chunks,
                 checkpoint: str = checkpoint,
                 deferred_batch_norm: bool = False,
                 ) -> None:
        super().__init__()

        chunks = int(chunks)
        checkpoint = str(checkpoint)

        verify_module(module)
        # Verify if the underlying skippable modules satisfy integrity. The
        # integrity can be verified before forward() because it is static.
        verify_skippables(module)

        self.chunks = chunks
        self.checkpoint = checkpoint

        if deferred_batch_norm:
            module = DeferredBatchNorm.convert_deferred_batch_norm(module, chunks)

        if devices is None:
            devices = range(torch.cuda.device_count())
        devices = [torch.device(d) for d in devices]
        devices = cast(List[torch.device], devices)

        try:
            # Split the model here.
            self.partitions, self.balance, self.devices = split_module(module, balance, devices)
        except BalanceError as exc:
            raise ValueError(recommend_auto_balance(str(exc)))

        self._copy_streams: List[List[AbstractStream]] = []
        self._skip_layout = inspect_skip_layout(self.partitions)

So let's look at the split_module function. Its main logic is as follows:

  • Iterate over the layers contained in the model; for each layer:

    • Add the layer to the layers dict
    • If the dict size equals balance[j], i.e. we have collected the number of layers device j should hold, then:
      • Build the collected layers into a sequential module, obtaining partition.
      • Call partition.to(device) to place the partition on the corresponding device. As mentioned in the previous article, ~torchgpipe.GPipe uses CUDA for training; the user does not need to move modules to the GPU manually, because ~torchgpipe.GPipe automatically moves each partition onto a different device.
      • Append this partition to the partitions array
      • Then move on to the next device
  • Finally, return partitions, balance, devices.

def split_module(module: nn.Sequential,
                 balance: Iterable[int],
                 devices: List[torch.device],
                 ) -> Tuple[List[nn.Sequential], List[int], List[torch.device]]:
    """Splits a module into multiple partitions.

    Returns:
        A tuple of (partitions, balance, devices).

        Partitions are represented as a :class:`~torch.nn.ModuleList` whose
        item is a partition. All layers in a partition are placed in the
        same device.

    Raises:
        BalanceError:
            wrong balance
        IndexError:
            the number of devices is fewer than the number of partitions.
    """
    balance = list(balance)

    j = 0
    partitions = []
    layers: NamedModules = OrderedDict()

    for name, layer in module.named_children():  # iterate over the layers of the model
        layers[name] = layer  # add the layer to the buffer

        if len(layers) == balance[j]:  # device j now has the number of layers it should hold
            # Group buffered layers as a partition.
            partition = nn.Sequential(layers)  # build the buffered layers into a sequential module
            device = devices[j]
            partition.to(device)  # place the partition on its device
            partitions.append(partition)  # add the new module to the partition array

            # Prepare for the next partition.
            layers.clear()
            j += 1  # move on to the next device

    partitions = cast(List[nn.Sequential], nn.ModuleList(partitions))
    del devices[j:]

    return partitions, balance, devices
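
To make the grouping concrete, here is a minimal pure-Python sketch (my own illustration; it stays on the CPU and skips the partition.to(device) step) that mirrors the loop above for balance = [3, 2, 1]:

from collections import OrderedDict
from torch import nn

model = nn.Sequential(*[nn.Linear(1, 1) for _ in range(6)])
balance = [3, 2, 1]

partitions, layers, j = [], OrderedDict(), 0
for name, layer in model.named_children():
    layers[name] = layer                      # buffer the layer under its original name
    if len(layers) == balance[j]:             # partition j is full
        partitions.append(nn.Sequential(layers))
        layers = OrderedDict()
        j += 1

print([len(p) for p in partitions])           # [3, 2, 1]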

Continuing the earlier example, balance is:

[3, 2, 1]

So the first three layers [1, 2, 3] are combined into one module, the middle two layers [4, 5] form another module, and the last layer [6] becomes a module on its own.

The final partition array is:

[module([1, 2, 3]), module([4, 5]), module([6])]

3.3 Example

Let's print it out to see for ourselves. The model contains 6 layers and is split into 3 partitions, containing 3, 2, and 1 layers respectively.

a = nn.Linear(1, 1)
b = nn.Linear(1, 1)
c = nn.Linear(1, 1)
d = nn.Linear(1, 1)
e = nn.Linear(1, 1)
f = nn.Linear(1, 1)

balance = [3, 2, 1]
model = nn.Sequential(a, b, c, d, e, f)
print(model)

model = GPipe(model, balance, devices=['gpu', 'gpu', 'gpu'])
print(model)

The output is as follows. We can see that the original model has been split into 3 partitions, each of which is a Sequential.

Sequential(
  (0): Linear(in_features=1, out_features=1, bias=True)
  (1): Linear(in_features=1, out_features=1, bias=True)
  (2): Linear(in_features=1, out_features=1, bias=True)
  (3): Linear(in_features=1, out_features=1, bias=True)
  (4): Linear(in_features=1, out_features=1, bias=True)
  (5): Linear(in_features=1, out_features=1, bias=True)
)

GPipe(
  (partitions): ModuleList(
    (0): Sequential(
      (0): Linear(in_features=1, out_features=1, bias=True)
      (1): Linear(in_features=1, out_features=1, bias=True)
      (2): Linear(in_features=1, out_features=1, bias=True)
    )
    (1): Sequential(
      (3): Linear(in_features=1, out_features=1, bias=True)
      (4): Linear(in_features=1, out_features=1, bias=True)
    )
    (2): Sequential(
      (5): Linear(in_features=1, out_features=1, bias=True)
    )
  )
)
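
Note that the second partition prints its children as (3) and (4) rather than (0) and (1). This is because split_module builds each partition from an OrderedDict of the original named_children, so the original layer names are preserved. A minimal illustration in plain PyTorch (not torchgpipe-specific):

from collections import OrderedDict
from torch import nn

part = nn.Sequential(OrderedDict([
    ('3', nn.Linear(1, 1)),
    ('4', nn.Linear(1, 1)),
]))
print(part)
# Sequential(
#   (3): Linear(in_features=1, out_features=1, bias=True)
#   (4): Linear(in_features=1, out_features=1, bias=True)
# )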

The runtime variables look like this:

model = {GPipe: 6}
  balance = {list: 3} [3, 2, 1]
  checkpoint = {str} 'except_last'
  chunks = {int} 1
  devices = {list: 3}
    0 = {device} gpu
    1 = {device} gpu
    2 = {device} gpu
  partitions = {ModuleList: 3}
    _modules =
      '0' = {Sequential: 3}
            Sequential(
              (0): Linear(in_features=1, out_features=1, bias=True)
              (1): Linear(in_features=1, out_features=1, bias=True)
              (2): Linear(in_features=1, out_features=1, bias=True))
      '1' = {Sequential: 2}
            Sequential(
              (3): Linear(in_features=1, out_features=1, bias=True)
              (4): Linear(in_features=1, out_features=1, bias=True))
      '2' = {Sequential: 1}
            Sequential(
              (5): Linear(in_features=1, out_features=1, bias=True))

One thing to note: GPipe's partitions member is of type nn.ModuleList. nn.ModuleList is a container that stores different modules and automatically registers each module's parameters into the network. However, nn.ModuleList does not define a network by itself: it merely stores modules together without implying any order between them; the actual execution order is determined by the forward function.
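
A tiny sketch (illustrative) of the difference: a Sequential can be called directly, whereas a ModuleList only stores modules, and you have to iterate over it yourself inside forward:

import torch
from torch import nn

class Chain(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(4, 4), nn.Linear(4, 2)])

    def forward(self, x):
        for block in self.blocks:   # the execution order lives in forward(), not in the container
            x = block(x)
        return x

print(Chain()(torch.rand(1, 4)).shape)   # torch.Size([1, 2])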

This raises the next question: inside a partition, a Sequential takes care of running the forward operations in order, but how is the execution order across partitions arranged? We will analyze this in a later article.

3.4 Summary

To wrap up, the overall flow goes from top to bottom:

  1. Run the system first with balance_by_size or balance_by_time to obtain the profiling result.
  2. Then use split_module to split the model.
  3. This yields a relatively balanced partitioning.
  4. These partitions are assigned to different devices.

The whole process is shown below:

+---------------------------------------------------------------------------------+
|                                                                                 |
|  Layer 1 +---> Layer 2 +---> Layer 3 +---> Layer 4 +---> Layer 5 +---> Layer 6  |
|                                                                                 |
+-----------------+---------------------------------------------+-----------------+
                  |                                              |
 balance_by_size  | 1                                          1 |  balance_by_time
                  |                                              |
                  v                                              v
       [[1, 2, 3], [4, 5], [6]]                      [[1, 2, 3, 4], [5, 6]]
                  +                                              +
                  |                                              |
                  +-----------------------+  +------------------+
                                          |  |
                                          v  v
                                      2  split_module
                                           +
                                           |
                                        3  |
                                           v
+---------------------------------------------------------------------------------------+
|                                                                                       |
|  +--------------------+       +--------------------+       +--------------------+     |
|  | Partition 1        |       | Partition 2        |       | Partition 3        |     |
|  |                    |       |                    |       |                    |     |
|  |   Layer 1          |       |   Layer 4          |       |   Layer 6          |     |
|  |      +             |       |      +             |       |                    |     |
|  |      |             |       |      |             |       |                    |     |
|  |      v             |       |      v             |       |                    |     |
|  |   Layer 2          |       |   Layer 5          |       |                    |     |
|  |      +             |       |                    |       |                    |     |
|  |      |             |       |                    |       |                    |     |
|  |      v             |       |                    |       |                    |     |
|  |   Layer 3          |       |                    |       |                    |     |
|  |                    +------>+                    +------>+                    |     |
|  +----------+---------+       +----------+---------+       +----------+---------+     |
|             |                            |                            |               |
+---------------------------------------------------------------------------------------+
              |                            |                            |
            4 |                          4 |                          4 |
              v                            v                            v
   +----------+---------+       +----------+---------+       +----------+---------+
   |                    |       |                    |       |                    |
   |      Device 1      |       |      Device 2      |       |      Device 3      |
   |                    |       |                    |       |                    |
   +--------------------+       +--------------------+       +--------------------+

This concludes our analysis of the automatic balancing mechanism. In the next article we will look at how the data is split, along with some runtime mechanisms.
