WebGPU Learning (11): Two Optimizations, "reuse render command buffer" and "dynamic uniform buffer offset"

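Hello everyone. This post introduces two optimizations, "reuse render command buffer" and "dynamic uniform buffer offset", along with the benchmark that the animometer sample (Chrome -> webgpu-samples -> animometer) runs against them.

Previous post:
WebGPU Learning (10): Introducing "Implementing Particle Effects on the GPU"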

Optimization: reuse render command buffer

The problem

Each frame is drawn with the following steps:

  • Create a command buffer
  • Begin a render pass
  • Record multiple render commands into the command buffer
  • End the render pass

The relevant code:

return function frame() {
    ...
    const commandEncoder = device.createCommandEncoder();
    ...
    const passEncoder = commandEncoder.beginRenderPass(renderPassDescriptor);
    
    passEncoder.setPipeline(pipeline);
    passEncoder.setVertexBuffer(0, verticesBuffer);
    passEncoder.setBindGroup(0, uniformBindGroup1);
    passEncoder.draw(36, 1, 0, 0);
    
    passEncoder.endPass();
    ...
}

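In general, the render commands recorded each frame do not change, so re-recording them every frame is wasted work. The overhead falls into two parts:

  • JS binding overhead: converting descriptor objects (for example the GPURenderPipelineDescriptor passed when creating a render pipeline) and strings, handling bounds, validating data, and so on
  • The cost of creating the render commands and encoding them into the command buffer

The optimization

WebGPU provides GPURenderBundle: record the render commands into a render bundle once, then execute that bundle every frame, which in effect reuses the command buffer.

WebGPU also supports creating multiple bundles, so different sets of render commands can be recorded into separate render bundles.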
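
To make the "record once, replay every frame" idea concrete, here is a minimal sketch in plain JavaScript. It does not use the WebGPU API: `MockBundleEncoder`, `MockPassEncoder`, and `bindingCalls` are hypothetical stand-ins whose only job is to count how many encoder methods the application calls from JS.

```javascript
// Hypothetical mock, NOT the WebGPU API: it only counts JS-side encoder calls.
let bindingCalls = 0;

class MockBundleEncoder {
    constructor() { this.commands = []; }
    setPipeline(p) { bindingCalls++; this.commands.push(["setPipeline", p]); }
    setBindGroup(i, g) { bindingCalls++; this.commands.push(["setBindGroup", i, g]); }
    draw(n) { bindingCalls++; this.commands.push(["draw", n]); }
    // finish() freezes a copy of the recorded commands into a reusable "bundle".
    finish() { return Object.freeze(this.commands.slice()); }
}

class MockPassEncoder {
    constructor() { this.executed = 0; }
    // One JS call replays every command recorded in the given bundles.
    executeBundles(bundles) {
        bindingCalls++;
        for (const bundle of bundles) this.executed += bundle.length;
    }
}

// Record once, outside the frame loop.
const bundleEncoder = new MockBundleEncoder();
bundleEncoder.setPipeline("pipeline");
bundleEncoder.setBindGroup(0, "uniformBindGroup1");
bundleEncoder.draw(36);
bundleEncoder.setBindGroup(0, "uniformBindGroup2");
bundleEncoder.draw(36);
const bundle = bundleEncoder.finish();
const callsToRecord = bindingCalls; // 5 calls, paid once

// Replay for three frames: one JS call per frame instead of five.
bindingCalls = 0;
let commandsExecuted = 0;
for (let frame = 0; frame < 3; frame++) {
    const pass = new MockPassEncoder();
    pass.executeBundles([bundle]);
    commandsExecuted += pass.executed;
}
const callsToReplay = bindingCalls;

console.log(callsToRecord, callsToReplay, commandsExecuted); // 5 3 15
```

The mock says nothing about GPU-side cost; it only illustrates why the JS binding time per frame can shrink when the same commands are replayed from a bundle.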

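Sample code

A note on the sample code:
1. It issues two draw calls, each with its own bind group.

The original and optimized versions are given below for reference:

  • Original sample code: no bundle
    The code: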
return function frame() {
    ...
    const commandEncoder = device.createCommandEncoder();
    ...
    const passEncoder = commandEncoder.beginRenderPass(renderPassDescriptor);
    passEncoder.setPipeline(pipeline);
    passEncoder.setVertexBuffer(0, verticesBuffer);

    passEncoder.setBindGroup(0, uniformBindGroup1);
    passEncoder.draw(36, 1, 0, 0);

    passEncoder.setBindGroup(0, uniformBindGroup2);
    passEncoder.draw(36, 1, 0, 0);

    passEncoder.endPass();
    ...
}
  • Optimized sample code: one bundle
    The code:
function recordRenderPass(passEncoder) {
    passEncoder.setPipeline(pipeline);
    passEncoder.setVertexBuffer(0, verticesBuffer);

    passEncoder.setBindGroup(0, uniformBindGroup1);
    passEncoder.draw(36, 1, 0, 0);

    passEncoder.setBindGroup(0, uniformBindGroup2);
    passEncoder.draw(36, 1, 0, 0);
}

const renderBundleEncoder = device.createRenderBundleEncoder({
    colorFormats: [swapChainFormat],
});
recordRenderPass(renderBundleEncoder);
const renderBundle = renderBundleEncoder.finish();


return function frame(timestamp) {
    ...
    const commandEncoder = device.createCommandEncoder();
    ...
    const passEncoder = commandEncoder.beginRenderPass(renderPassDescriptor);

    passEncoder.executeBundles([renderBundle]);

    passEncoder.endPass();
    ...
}
  • Optimized sample code: two bundles
    The code:
function recordRenderPass1(passEncoder) {
    passEncoder.setPipeline(pipeline);
    passEncoder.setVertexBuffer(0, verticesBuffer);

    passEncoder.setBindGroup(0, uniformBindGroup1);
    passEncoder.draw(36, 1, 0, 0);
}

function recordRenderPass2(passEncoder) {
    passEncoder.setPipeline(pipeline);
    passEncoder.setVertexBuffer(0, verticesBuffer);

    passEncoder.setBindGroup(0, uniformBindGroup2);
    passEncoder.draw(36, 1, 0, 0);
}

const renderBundleEncoder1 = device.createRenderBundleEncoder({
    colorFormats: [swapChainFormat],
});
recordRenderPass1(renderBundleEncoder1);
const renderBundle1 = renderBundleEncoder1.finish();



const renderBundleEncoder2 = device.createRenderBundleEncoder({
    colorFormats: [swapChainFormat],
});
recordRenderPass2(renderBundleEncoder2);
const renderBundle2 = renderBundleEncoder2.finish();


return function frame(timestamp) {
    ...
    const commandEncoder = device.createCommandEncoder();
    ...
    const passEncoder = commandEncoder.beginRenderPass(renderPassDescriptor);

    passEncoder.executeBundles([renderBundle1, renderBundle2]);

    passEncoder.endPass();
    ...
}
}

Further analysis

Let's look at the definitions related to bundles and render passes:

interface GPUDevice : EventTarget {
   ...
   GPURenderBundleEncoder createRenderBundleEncoder(GPURenderBundleEncoderDescriptor descriptor);
   ...
}

dictionary GPURenderBundleEncoderDescriptor : GPUObjectDescriptorBase {
    required sequence<GPUTextureFormat> colorFormats;
    GPUTextureFormat depthStencilFormat;
    // Related to MSAA; not analyzed here
    unsigned long sampleCount = 1;
};

...

interface GPUCommandEncoder {
    ...
    GPURenderPassEncoder beginRenderPass(GPURenderPassDescriptor descriptor);
    ...
}

...

dictionary GPURenderPassDescriptor : GPUObjectDescriptorBase {
    required sequence<GPURenderPassColorAttachmentDescriptor> colorAttachments;
    GPURenderPassDepthStencilAttachmentDescriptor depthStencilAttachment;
};

When creating a bundle, you must specify the same color attachment formats and depth/stencil attachment format as the render pass that will execute it.
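
The format rule can be written down as a small check. The sketch below is only an illustration: `isBundleCompatible` is a hypothetical helper, not part of WebGPU, and `passFormats` is an assumed plain object listing the formats of the render pass attachments. It compares only the two things mentioned here (color formats and the depth/stencil format) and ignores `sampleCount`.

```javascript
// Hypothetical helper, NOT part of WebGPU: checks that the formats a bundle
// encoder was created with match the formats of the render pass attachments.
function isBundleCompatible(bundleDescriptor, passFormats) {
    const bundleColors = bundleDescriptor.colorFormats;
    const passColors = passFormats.colorFormats;
    if (bundleColors.length !== passColors.length) return false;
    for (let i = 0; i < bundleColors.length; i++) {
        if (bundleColors[i] !== passColors[i]) return false;
    }
    // If the depth/stencil format is omitted, it must be omitted on both sides.
    return bundleDescriptor.depthStencilFormat === passFormats.depthStencilFormat;
}

// Assumed example: a pass with one color attachment and a depth/stencil attachment.
const passFormats = {
    colorFormats: ["bgra8unorm"],
    depthStencilFormat: "depth24plus-stencil8",
};

const matching = isBundleCompatible(
    { colorFormats: ["bgra8unorm"], depthStencilFormat: "depth24plus-stencil8" },
    passFormats
);
const missingDepth = isBundleCompatible({ colorFormats: ["bgra8unorm"] }, passFormats);

console.log(matching, missingDepth); // true false
```

In a real application the browser performs this validation itself; a check like this is only useful for catching the mismatch earlier in your own code.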

References

Encoder results reuse
Add GPURenderBundle
How do people reuse command buffers? (may require a VPN to access)

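Optimization: dynamic uniform buffer offset

The problem

In most applications, each draw call needs different uniform values, and therefore a different uniform buffer. Because uniform buffers are set through a bind group, this means creating and setting one bind group per draw call in every frame (or creating all the bind groups once as a cache and reusing them every frame).

Creating a bind group costs more than a draw call. The performance tests in "Proposal: Dynamic uniform and storage buffer offsets" show that modern graphics APIs can only create a limited number of bind groups (and since WebGPU is implemented on top of those APIs, the number is limited in WebGPU too):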

This means, in a single frame, the Metal devices can create 285 bind groups, the D3D12 devices can create 7270 bind groups, and the Vulkan devices can create 18561 bind groups.

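The optimization

WebGPU supports "dynamic uniform buffer offset", which means:
you create only one bind group, although it can only contain one or more uniform buffers (not storage buffers or other resources);
each draw call sets that same bind group with its own offset.

This removes the cost of creating multiple bind groups.

According to "Proposal: Dynamic uniform and storage buffer offsets":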

I believe we said:
We need at least one of the two for the MVP
Having both causes more complication because they will fight for root table space so we might have to introduce a combined limit for pushConstantSize + N * DynamicBufferCount.

The MVP version of WebGPU will probably not support dynamic storage buffer offsets; that is, the bind group cannot contain storage buffers.

Sample code

Notes on the sample code:
1. The bind group contains a single uniform buffer, which holds these uniform variables:

float scale;
float offsetX;
float offsetY;
float scalar;
float scalarOffset;

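2. There are 100 gameObjects in total, corresponding to 100 draw calls and 100 copies of the uniform data (stored in uniformBufferData).
3. The uniform buffer offset that each draw call passes for the bind group must be a multiple of 256.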
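
The alignment arithmetic behind these notes can be worked through on its own. This is a small sketch, assuming the 256-byte alignment stated above and the five-float uniform block from note 1; `alignUp` and `uniformOffset` are illustrative helper names, not WebGPU functions.

```javascript
// Alignment for dynamic uniform buffer offsets, as stated in the notes above.
const OFFSET_ALIGNMENT = 256;

// Round a byte size up to the next multiple of the alignment.
function alignUp(bytes, alignment) {
    return Math.ceil(bytes / alignment) * alignment;
}

// Five floats (scale, offsetX, offsetY, scalar, scalarOffset) = 20 bytes.
const uniformBytes = 5 * Float32Array.BYTES_PER_ELEMENT;
// Each gameObject's block is padded to 256 bytes so that every offset is aligned.
const alignedUniformBytes = alignUp(uniformBytes, OFFSET_ALIGNMENT);
// The same stride expressed in floats, for indexing into a Float32Array.
const alignedUniformFloats = alignedUniformBytes / Float32Array.BYTES_PER_ELEMENT;

// Byte offset of the i-th gameObject's uniform block.
function uniformOffset(i) {
    return i * alignedUniformBytes;
}

console.log(uniformBytes, alignedUniformBytes, alignedUniformFloats); // 20 256 64
console.log(uniformOffset(0), uniformOffset(1), uniformOffset(99));   // 0 256 25344
```

Only 20 of every 256 bytes carry data, so the padding is the price paid for being able to address each block with an aligned offset.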

The original and optimized versions are given below for reference:

  • Original sample code: a bind group cache
    The code:
const bindGroupLayout = device.createBindGroupLayout({
    bindings: [
        { binding: 0, visibility: GPUShaderStage.VERTEX, type: "uniform-buffer" },
    ],
});


const pipelineLayout = device.createPipelineLayout({ bindGroupLayouts: [bindGroupLayout] });


const pipeline = device.createRenderPipeline({
    layout: pipelineLayout,
    ...
});



const gameObjects = 100;
const uniformBytes = 5 * Float32Array.BYTES_PER_ELEMENT;
const alignedUniformBytes = Math.ceil(uniformBytes / 256) * 256;
const alignedUniformFloats = alignedUniformBytes / Float32Array.BYTES_PER_ELEMENT;

const uniformBuffer = device.createBuffer({
    size: gameObjects * alignedUniformBytes + Float32Array.BYTES_PER_ELEMENT,
    usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.UNIFORM
});


const uniformBufferData = new Float32Array(gameObjects * alignedUniformFloats);

// Cache array of bind groups
const bindGroups = new Array(gameObjects);

function setUniformBufferData(i) {
    uniformBufferData[alignedUniformFloats * i + 0] = Math.random() * 0.2 + 0.2;        // scale
    uniformBufferData[alignedUniformFloats * i + 1] = 0.9 * 2 * (Math.random() - 0.5);  // offsetX
    uniformBufferData[alignedUniformFloats * i + 2] = 0.9 * 2 * (Math.random() - 0.5);  // offsetY
    uniformBufferData[alignedUniformFloats * i + 3] = Math.random() * 1.5 + 0.5;       // scalar
    uniformBufferData[alignedUniformFloats * i + 4] = Math.random() * 10;               // scalarOffset
}

for (let i = 0; i < gameObjects; ++i) {
    setUniformBufferData(i);

    bindGroups[i] = device.createBindGroup({
        layout: bindGroupLayout,
        bindings: [{
            binding: 0,
            resource: {
                buffer: uniformBuffer,
                offset: i * alignedUniformBytes,
                size: 5 * Float32Array.BYTES_PER_ELEMENT,
            }
        }]
    });
}

uniformBuffer.setSubData(0, uniformBufferData);


return function frame() {
    ...
    const commandEncoder = device.createCommandEncoder();
    ...
    const passEncoder = commandEncoder.beginRenderPass(renderPassDescriptor);
    passEncoder.setPipeline(pipeline);
    passEncoder.setVertexBuffer(0, verticesBuffer);

    for (let i = 0; i < gameObjects; ++i) {
        passEncoder.setBindGroup(0, bindGroups[i]);
        passEncoder.draw(3, 1, 0, 0);
    }

    passEncoder.endPass();
    ...
}
  • Optimized sample code: using offsets
    The code:
// Set hasDynamicOffset to true
const dynamicBindGroupLayout = device.createBindGroupLayout({
    bindings: [
        { binding: 0, visibility: GPUShaderStage.VERTEX, type: "uniform-buffer", hasDynamicOffset: true },
    ],
});

const dynamicPipelineLayout = device.createPipelineLayout({ bindGroupLayouts: [dynamicBindGroupLayout] });

const dynamicPipeline = device.createRenderPipeline({
    layout: dynamicPipelineLayout,
    ...
});

// The definitions of gameObjects and so on are the same as in the original sample code, so they are omitted
...

for (let i = 0; i < gameObjects; ++i) {
    // setUniformBufferData is the same function as in the original sample code
    setUniformBufferData(i);
}

const dynamicBindGroup = device.createBindGroup({
    layout: dynamicBindGroupLayout,
    bindings: [{
        binding: 0,
        resource: {
            buffer: uniformBuffer,
            offset: 0,
            size: 5 * Float32Array.BYTES_PER_ELEMENT,
        },
    }],
});

uniformBuffer.setSubData(0, uniformBufferData);

const dynamicOffsets = [0];

return function frame() {
    ...
    const commandEncoder = device.createCommandEncoder();
    ...
    const passEncoder = commandEncoder.beginRenderPass(renderPassDescriptor);
    passEncoder.setPipeline(dynamicPipeline);
    passEncoder.setVertexBuffer(0, verticesBuffer);


    for (let i = 0; i < gameObjects; ++i) {
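        // The dynamicOffsets array is created once up front and its element is updated here, rather than calling "passEncoder.setBindGroup(0, dynamicBindGroup, [i * alignedUniformBytes]);" directly, because the latter adds the cost of creating an array on every call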
        dynamicOffsets[0] = i * alignedUniformBytes;
        passEncoder.setBindGroup(0, dynamicBindGroup, dynamicOffsets);
        passEncoder.draw(3, 1, 0, 0);
    }

    passEncoder.endPass();
    ...
}
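
The array-reuse trick in the loop above can be checked in isolation. The sketch below is not WebGPU code: `OffsetRecordingEncoder` is a hypothetical mock, and it assumes the encoder reads the offsets during the `setBindGroup` call rather than holding on to the array, which is what makes overwriting `dynamicOffsets[0]` on the next iteration safe.

```javascript
// Hypothetical mock, NOT the WebGPU API: it copies the offset at call time,
// which is the assumption that makes reusing one array safe.
class OffsetRecordingEncoder {
    constructor() { this.recordedOffsets = []; }
    setBindGroup(index, bindGroup, dynamicOffsets) {
        this.recordedOffsets.push(dynamicOffsets[0]);
    }
}

const gameObjects = 4;             // small count, for illustration only
const alignedUniformBytes = 256;   // stride between per-object uniform blocks

// Allocated once, outside the loop, and mutated on every iteration.
const dynamicOffsets = [0];
const encoder = new OffsetRecordingEncoder();

for (let i = 0; i < gameObjects; ++i) {
    dynamicOffsets[0] = i * alignedUniformBytes;
    encoder.setBindGroup(0, "dynamicBindGroup", dynamicOffsets);
}

console.log(encoder.recordedOffsets); // [ 0, 256, 512, 768 ]
```

Each call sees its own offset even though only one array was ever allocated, so the loop avoids creating a new array per draw call.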

References

Proposal: Dynamic uniform and storage buffer offsets

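Performance test

The animometer sample benchmarks these two optimizations.

(Note that "size: 6 * Float32Array.BYTES_PER_ELEMENT" in that sample should be changed to "size: 5 * Float32Array.BYTES_PER_ELEMENT".)

A screenshot of the sample running is shown below:

The buttons inside the red circle on the right enable the corresponding optimizations;
the purple circle at the top right sets the number of triangles to draw;
in the blue circle at the top left, the first line shows the CPU time spent per frame, which is mostly the JS binding time for the render pass; the second line shows the total time per frame, which equals CPU time plus GPU time.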

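Test data

Results from drawing 40,000 triangles on my machine (Mac Pro 2014, macOS Catalina 10.15.1, Chrome Canary 80.0.3977.4):

  • Bundle only, compared with no optimization

JS binding time dropped sharply, from 14 ms to 0.2 ms;
total time per frame dropped by only 20%.

  • Bundle plus offset, compared with bundle only

JS binding time and total time per frame were almost unchanged.

  • Offset only, compared with no optimization

JS binding time rose sharply, by 60%;
total time per frame rose only slightly, by 10%.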

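Conclusion

The offset optimization increases CPU-side cost but also reduces GPU-side cost, so the total time per frame rises only a little. It also makes the code simpler (only one bind group is created) and may reduce memory usage (this is a guess; I did not test it), so I recommend using it.

The bundle optimization greatly reduces CPU-side cost but also increases GPU-side cost. Still, the total time per frame drops by 20%, and there is room for browsers to optimize it further (see Encoder results reuse), so I recommend using it.

References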

animometer示