Commonly Used MPI Functions

阿新 • Published: 2018-12-14

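The MPI man pages have to be read online, or with man on a Linux system, which is inconvenient. Here I have sorted the commonly used functions into categories and summarized them.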

Contents

Basic structure: startup and shutdown

MPI_Init
MPI_Finalize
MPI_Comm_size
MPI_Comm_rank
MPI_Get_processor_name

Point-to-point communication: send and receive

MPI_Send
MPI_Recv
MPI_Get_count
MPI_Probe

Cartesian topology

MPI_Cart_create
MPI_Cart_coords
MPI_Cart_shift

Collective communication: broadcast and reduction

MPI_Barrier (synchronization point)
MPI_Bcast (broadcast)
MPI_Scatter
MPI_Gather
MPI_Gatherv
MPI_Allgather (many-to-many)
MPI_Reduce
MPI_Allreduce

Groups and Communicators

MPI_Comm_split
MPI_Comm_create
MPI_Comm_group
MPI_Group_union
MPI_Group_intersection
MPI_Comm_create_group
MPI_Group_incl

Copyright notice
Appendix A: Further reading
Appendix B: Configuring OpenMPI

Basic structure: startup and shutdown

Reference: http://mpitutorial.com/tutorials/mpi-hello-world/zh_cn/

#include <mpi.h>   // 1. MPI declarations
#include <stdio.h>

int main(int argc, char** argv) {
    // 2. Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Number of processes in the communicator
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Rank of this process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Name of the processor this process runs on
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    // 3. Clean up the MPI environment
    MPI_Finalize();
    return 0;
}

MPI_Init

MPI_Init(
    int* argc,
    char*** argv)

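MPI_Init creates all of MPI's global and internal variables. For example, a communicator is formed from all of the processes that were launched (the number of processes is given as an argument to the MPI launcher), and each process is assigned a unique rank.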

MPI_Finalize

MPI_Finalize()

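Cleans up the MPI environment. After this call no further MPI functions may be called (apart from a few query functions such as MPI_Initialized and MPI_Finalized).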

MPI_Comm_size

MPI_Comm_size(
    MPI_Comm communicator,
    int* size)

Returns the size of communicator, that is, the number of processes in it.

MPI_Comm_rank

MPI_Comm_rank(
    MPI_Comm communicator,
    int* rank)

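Returns the rank of the calling process in communicator. Each process in a communicator is given a rank, numbered consecutively from 0. Ranks are mainly used to identify the peer process when sending or receiving messages.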

MPI_Get_processor_name

MPI_Get_processor_name(
    char* name,
    int* name_length)

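Gets the name of the processor that the calling process is actually running on; name_length receives the length of that name.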

Point-to-point communication: send and receive

MPI_Send

Reference: http://mpitutorial.com/tutorials/mpi-send-and-receive/

MPI_Send(
    void* data,
    int count,
    MPI_Datatype datatype,
    int destination,
    int tag,
    MPI_Comm communicator)

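The calling process sends count elements of type datatype, starting at data, to the process whose rank is destination, with message tag tag, over communicator (usually MPI_COMM_WORLD).

This call blocks until the send buffer can be reused. That means it may return as soon as the message has been buffered; if the message cannot be buffered, it blocks until a matching receive is posted.

Possible values of datatype include: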

MPI datatype	C equivalent
MPI_SHORT	short int
MPI_INT	int
MPI_LONG	long int
MPI_LONG_LONG	long long int
MPI_UNSIGNED_CHAR	unsigned char
MPI_UNSIGNED_SHORT	unsigned short int
MPI_UNSIGNED	unsigned int
MPI_UNSIGNED_LONG	unsigned long int
MPI_UNSIGNED_LONG_LONG	unsigned long long int
MPI_FLOAT	float
MPI_DOUBLE	double
MPI_LONG_DOUBLE	long double
MPI_CHAR	char
MPI_BYTE	8 uninterpreted bits (no C equivalent)

MPI_Recv

MPI_Recv(
    void* data,
    int count,
    MPI_Datatype datatype,
    int source,
    int tag,
    MPI_Comm communicator,
    MPI_Status* status)

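The calling process receives a message from the process whose rank is source (MPI_ANY_SOURCE accepts any sender) with tag tag (MPI_ANY_TAG accepts any tag), over communicator. The data, of type datatype, is stored starting at data, and count is the maximum number of elements the buffer can hold. Information about the message actually received is stored in status: status.MPI_SOURCE is the sender's rank, status.MPI_TAG is the tag, and the number of elements received can be queried with MPI_Get_count.

This call blocks until a message matching source and tag has been received.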

MPI_Get_count

Reference: http://mpitutorial.com/tutorials/dynamic-receiving-with-mpi-probe-and-mpi-status/

MPI_Get_count(
    MPI_Status* status,
    MPI_Datatype datatype,
    int* count)

Given status and datatype, stores in *count the number of elements actually received.

MPI_Probe

MPI_Probe(
    int source,
    int tag,
    MPI_Comm comm,
    MPI_Status* status)

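MPI_Probe can serve as a prelude to MPI_Recv: it blocks until a matching message is available but does not receive it. Once status gives the size of the incoming message, you can allocate exactly enough memory and then receive the data with MPI_Recv.

Example: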

    MPI_Status status;
    int number_amount;  // filled in by MPI_Get_count below
    // Probe for an incoming message from process zero
    MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
    // When probe returns, the status object has the size and other
    // attributes of the incoming message. Get the size of the message.
    MPI_Get_count(&status, MPI_INT, &number_amount);
    // Allocate a buffer just big enough to hold the incoming numbers
    int* number_buf = (int*)malloc(sizeof(int) * number_amount);
    // Now receive the message with the allocated buffer
    MPI_Recv(number_buf, number_amount, MPI_INT, 0, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);

Cartesian topology

MPI_Cart_create

int MPI_Cart_create(MPI_Comm comm_old, int ndims, const int dims[],
    const int periods[], int reorder, MPI_Comm *comm_cart)

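ndims: the number of dimensions of the topology
dims[]: the size of each dimension ([3,2] means coordinates 0-2 in dimension 0 and 0-1 in dimension 1)
periods[]: whether each dimension wraps around; nonzero means yes, 0 means no
reorder: whether processes may be assigned new ranks in the new communicator

Takes the group of processes belonging to comm_old and builds a virtual process topology on it. The number of processes in the grid must not exceed the number of processes in comm_old. Processes that are not part of the Cartesian grid get MPI_COMM_NULL as comm_cart.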

MPI_Cart_coords

int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims,
    int coords[])

Typically you first call MPI_Comm_rank to get the calling process's rank in the Cartesian communicator, then MPI_Cart_coords to get its Cartesian coordinates.

MPI_Cart_shift

int MPI_Cart_shift(MPI_Comm comm, int direction, int disp,
    int *rank_source, int *rank_dest)

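direction: the dimension to shift along
disp: the direction and distance of the shift; a negative value shifts in the negative direction
rank_source: (output) rank of the process to receive from
rank_dest: (output) rank of the process to send to

Computes the source and destination ranks for a data-exchange (shift) operation.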

Collective communication: broadcast and reduction

MPI_Barrier (synchronization point)

MPI_Barrier(MPI_Comm communicator)

(Barrier) This call sets up a barrier that no process can pass until every process in the communicator has reached it.

MPI_Bcast (broadcast)

MPI_Bcast(
    void* data,
    int count,
    MPI_Datatype datatype,
    int root,
    MPI_Comm communicator)

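In a broadcast, one process sends the same data to every other process in a communicator. When the root process calls MPI_Bcast, the contents of data are sent to the other processes. When the other processes call MPI_Bcast, their data buffer is filled with the data received from the root.

MPI implementations typically use a tree-based broadcast algorithm to achieve good network utilization.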

MPI_Scatter

MPI_Scatter(
    void* send_data,
    int send_count,
    MPI_Datatype send_datatype,
    void* recv_data,
    int recv_count,
    MPI_Datatype recv_datatype,
    int root,
    MPI_Comm communicator)

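When the root process calls this function, it takes an array send_data and hands out its elements in rank order, send_count elements to each process. Every process (including root) receives recv_count elements of type recv_datatype into the array recv_data.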

MPI_Gather

MPI_Gather(
    void* send_data,
    int send_count,
    MPI_Datatype send_datatype,
    void* recv_data,
    int recv_count,
    MPI_Datatype recv_datatype,
    int root,
    MPI_Comm communicator)

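When each process calls this function, it takes the first send_count elements from its send_data array (of type send_datatype) and sends them to the root process. The root process stores the recv_count elements collected from each process in recv_data, ordered by rank.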

MPI_Gatherv

int MPI_Gatherv(
  const void* sendbuf, int sendcount, MPI_Datatype sendtype,
  void* recvbuf, const int recvcounts[], const int displs[],
  MPI_Datatype recvtype, int root, MPI_Comm comm)

Use this function when the processes contribute different amounts of data.

IN sendbuf: starting address of send buffer (choice)
IN sendcount: number of elements in send buffer (non-negative integer)
IN sendtype: data type of send buffer elements (handle)
OUT recvbuf: address of receive buffer (choice, significant only at root)
IN recvcounts: non-negative integer array (of length group size) containing the number of elements that are received from each process (significant only at root)
IN displs: integer array (of length group size). Entry i specifies the displacement relative to recvbuf at which to place the incoming data from process i (significant only at root)

MPI_Allgather (many-to-many)

MPI_Allgather(
    void* send_data,
    int send_count,
    MPI_Datatype send_datatype,
    void* recv_data,
    int recv_count,
    MPI_Datatype recv_datatype,
    MPI_Comm communicator)

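Works like MPI_Gather followed by MPI_Bcast: every process ends up with the data gathered from all processes, which is why there is no root parameter.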

MPI_Reduce

MPI_Reduce(
    void* send_data,
    void* recv_data,
    int count,
    MPI_Datatype datatype,
    MPI_Op op,
    int root,
    MPI_Comm communicator)

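Each process contributes an array send_data of count elements. The arrays are combined element by element with the operation op, and the result is stored on the root process in recv_data, which also has count elements.

The available MPI_Op operations include:

MPI_MAX - maximum
MPI_MIN - minimum
MPI_SUM - sum
MPI_PROD - product
MPI_LAND - logical AND
MPI_LOR - logical OR
MPI_BAND - bitwise AND
MPI_BOR - bitwise OR
MPI_MAXLOC - maximum value and the rank of the process that owns it
MPI_MINLOC - minimum value and the rank of the process that owns it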

MPI_Allreduce

MPI_Allreduce(
    void* send_data,
    void* recv_data,
    int count,
    MPI_Datatype datatype,
    MPI_Op op,
    MPI_Comm communicator)

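Groups and Communicators

Warning

MPI limits how many objects can exist at one time; if you use up the available objects without freeing them, you may get runtime errors.
A newly created MPI_Comm must be freed with MPI_Comm_free(MPI_Comm *comm); this function must not be given MPI_COMM_NULL as its argument.
A newly created MPI_Group must be freed with MPI_Group_free(MPI_Group *group).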

MPI_Comm_split

MPI_Comm_split(
	MPI_Comm comm,
	int color,
	int key,
	MPI_Comm* newcomm)

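Splits the processes of comm into new communicators: processes with the same color end up in the same newcomm, where they are ranked by key in ascending order, the smallest key getting rank 0.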

MPI_Comm_create

MPI_Comm_create(
	MPI_Comm comm,
	MPI_Group group,
    MPI_Comm* newcomm)

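group is a subset of the group of comm, and a new communicator newcomm is created from it. Processes outside the group that call this function get MPI_COMM_NULL as newcomm. Keep this in mind when freeing resources, since MPI_Comm_free must not be called on MPI_COMM_NULL.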

MPI_Comm_group

MPI_Comm_group(
	MPI_Comm comm,
	MPI_Group *group)

Gets the group *group associated with the communicator comm.

MPI_Group_union

MPI_Group_union(
	MPI_Group group1,
	MPI_Group group2,
	MPI_Group* newgroup)

MPI_Group_intersection

MPI_Group_intersection(
	MPI_Group group1,
	MPI_Group group2,
	MPI_Group* newgroup)

MPI_Comm_create_group

MPI_Comm_create_group(
	MPI_Comm comm,
	MPI_Group group,
	int tag,
	MPI_Comm* newcomm)

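group is a subgroup of the group associated with comm, and a new communicator newcomm is created from it. Processes outside the group that call this function get MPI_COMM_NULL as newcomm. Keep this in mind when freeing resources, since MPI_Comm_free must not be called on MPI_COMM_NULL.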

MPI_Group_incl

MPI_Group_incl(
	MPI_Group group,
	int n,
	const int ranks[],
	MPI_Group* newgroup)

The ranks array has n elements, each the rank of a process in group; these processes form the new group newgroup.

Copyright notice

Most of this article comes from A Comprehensive MPI Tutorial Resource, a concise introductory MPI tutorial, parts of which have a Chinese translation.

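Appendix A: Further reading

How to code parallel stuff in C/C++ using MPI with CLion on Windows: explains how to configure CLion on Windows to write MPI programs, and actually run them! (Hats off to the author.)
How to compile and run a simple MS-MPI program: explains how to compile and run MPI projects with Microsoft Visual Studio. (Hats off to Microsoft.)
Lawrence Livermore National Laboratory's MPI tutorial: https://computing.llnl.gov/tutorials/mpi/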

Appendix B: Configuring OpenMPI

When bash loads its environment variables, it runs /etc/profile -> ~/.bash_profile (-> ~/.bashrc) in turn. So we can put the OpenMPI variables at the end of ~/.bashrc.

export PATH=/opt/openmpi/1.10.7/bin/:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/openmpi/1.10.7/lib

Finally, remember to load the environment variables:

source ~/.bashrc