
Notes on Shader Optimization


Before optimizing, you must first understand what the rendering pipeline is.


Notes:

Application stage: this stage mainly involves the CPU and memory, handling work such as collision detection; the prepared data (vertex positions, normals, texture coordinates, textures) is then passed to the graphics hardware over the data bus.

Geometry stage: the pipeline diagram commonly shown for this stage actually has a problem (one that many blog posts also fail to point out). According to the OpenGL SuperBible, primitive assembly should come after the tessellation shader stage (the tessellation shaders still operate on individual patches); only then does the geometry shader run (since a geometry shader operates on a whole primitive) to generate new primitives. At the end of this stage come the familiar MVP transform and view-frustum clipping.



Rasterization stage: primitives enter the rasterizer (this stage also includes the scissor test, depth test, and stencil test), and the result is finally written to the on-screen framebuffer.


Some suggestions for writing shaders

Adapted from: http://www.cnblogs.com/sifenkesi/p/4716791.html

1. Compute only what needs to be computed

Minimize unused vertex data, such as texture coordinates: if some objects use two UV sets and others use one, do not put them in the same vertex buffer; this reduces the amount of data transferred.
Avoid heavy per-vertex computation, such as too many light sources or an overly complex lighting model.
Avoid a VS with too many instructions or too many branches; keep the VS as short and simple as possible.

2. Do as much of the computation as possible in the VS

Usually there are far more pixels to render than vertices, and far more vertices than objects. So whenever possible, move computation from the FS into the VS, or simply set certain fixed values from a script.
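
For example, for a smooth mesh, a diffuse term computed per vertex and interpolated to the fragment shader is often indistinguishable from the per-pixel version. A minimal GLSL ES sketch (mvpMatrix and lightDir are assumed uniforms, with lightDir normalized and in the same space as the normals):

attribute vec3 position;
attribute vec3 normal;
uniform mat4 mvpMatrix;
uniform vec3 lightDir;            // assumed: normalized, same space as normal
varying lowp float diffuse;

void main()
{
    // Computed once per vertex instead of once per fragment;
    // the rasterizer interpolates the result.
    diffuse = max(dot(normalize(normal), lightDir), 0.0);
    gl_Position = mvpMatrix * vec4(position, 1.0);
}

// The fragment shader then just multiplies its base color by the
// interpolated "diffuse" varying.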

3. Directive optimizations [Unity]

When writing Surface Shaders, a few directives can make the shader considerably faster.

  By default, many Surface Shader options are enabled to suit the common cases, but you can often turn some of them off to make your shader run faster (a combined pragma example follows this list):
  (1) approxview: for shaders that use the view direction, this makes the normalize of the view dir happen per-vertex instead of per-pixel. The effect of this optimization is usually noticeable.
  (2) halfasview: makes Specular shaders faster by passing the half vector (halfway between the light direction and the view direction) into the lighting function in place of the true viewDir.
  (3) noforwardadd: in forward rendering, only the single main directional light is rendered per-pixel; all other lights are rendered per-vertex or via SH. This guarantees the shader completes in a single pass.
  (4) noambient: disables ambient lighting and SH lighting, which makes the shader a little faster.
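
All four of these are extra keywords on the surface directive line. A hypothetical combination (surf and the BlinnPhong lighting model are placeholders for your own):

#pragma surface surf BlinnPhong approxview halfasview noforwardadd noambient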

4. Floating-point precision

  float: highest precision, typically 32 bits.
  half: medium precision, typically 16 bits, range about -60000 to 60000.
  fixed: lowest precision, typically 11 bits, range -2.0 to 2.0, with a precision of 1/256.
  Prefer the lowest precision that works: use fixed for colors and unit-length vectors; elsewhere use half when the value range allows, and fall back to float only when it does not.

  On mobile platforms, the key is to use low-precision data in the FS as much as possible. Also, on most mobile GPUs, converting between low and high precision is very expensive, and swizzle operations on fixed values are costly as well.
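
Unity's fixed/half/float correspond roughly to lowp/mediump/highp in GLSL ES. A minimal fragment shader sketch applying the rule of thumb above (names are illustrative):

precision mediump float;              // default for this shader: half-equivalent

uniform lowp sampler2D baseMap;       // color data: lowest precision
varying lowp vec3 tintColor;          // 0..1 colors fit comfortably in lowp
varying mediump vec2 uv;

void main()
{
    lowp vec4 base = texture2D(baseMap, uv);
    gl_FragColor = vec4(base.rgb * tintColor, base.a);
}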

5. Alpha Test

  The alpha test and the clip() function have different performance costs on different platforms.
  They are usually used to discard fully transparent pixels.
  However, on the PowerVR GPUs found in iOS devices and some Android devices, alpha testing is very expensive.
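
In GLSL ES the equivalent of Cg's clip() is the discard statement. A minimal cutout sketch (cutoff is an assumed uniform), showing exactly the pattern that is costly on PowerVR hardware:

precision mediump float;

uniform lowp sampler2D baseMap;
uniform lowp float cutoff;            // e.g. 0.5
varying mediump vec2 uv;

void main()
{
    lowp vec4 color = texture2D(baseMap, uv);
    if (color.a < cutoff)
        discard;                      // kills the fragment; expensive on PowerVR
    gl_FragColor = color;
}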

6. Color Mask

  On mobile devices the color mask is also very expensive, so avoid using it unless you truly need it.


Why do for, if, and other conditional statements in a shader lower the frame rate so much?

Author: 空明流轉
Link: https://www.zhihu.com/question/27084107/answer/39281771
Source: Zhihu
The copyright belongs to the author; please contact the author for authorization before reposting.

1. for and if do not necessarily mean a dynamic branch

Branch statements on the GPU (for, if-else, while) fall into three classes (see the sketch after this list).
Class 1: the condition depends only on compile-time constants.
The compiler can flatten the branch outright, or unroll the loop. For a for loop there is a trade-off: if the trip count is very large, or the body is very long, the compiler may choose not to unroll it, because instruction storage is limited and loading instructions has its own cost.
Extra cost: negligible.
Class 2: the condition depends only on compile-time constants and uniform variables.
The jump is fixed for the duration of the draw call and therefore predictable; every micro thread within a warp executes the same branch.
Extra cost: very low.
Class 3: the condition is a dynamic expression.
This is the true "dynamic branch".
Micro threads within the same warp may each need to take different branches.
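
A fragment shader sketch of the three classes in GLSL ES terms (MODE, lightBoost, and mask are illustrative names):

precision mediump float;

const int MODE = 2;                    // class 1: compile-time constant
uniform int lightBoost;                // class 2: uniform
uniform sampler2D mask;                // class 3 reads per-fragment data
varying mediump vec2 uv;

void main()
{
    mediump vec3 color = vec3(0.0);

    if (MODE == 2)                     // class 1: flattened at compile time, free
        color += vec3(0.1);

    if (lightBoost > 0)                // class 2: fixed for the whole draw call,
        color += vec3(0.2);            // every thread takes the same path

    if (texture2D(mask, uv).r > 0.5)   // class 3: truly dynamic; threads in the
        color = vec3(1.0) - color;     // same warp may diverge here

    gl_FragColor = vec4(color, 1.0);
}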


2. The jump itself is very cheap

With the introduction of the IP/EP (Instruction Pointer / Execution Pointer), modern GPUs behave no differently from CPUs when executing instructions: a jump is merely resetting a register.

3. Handling micro threads that take different branches

GPUs are fast because a single instruction can process the data of many micro threads at once (SIMD), but this requires all of those micro threads to be at the same instruction at the same moment.
When they are not, the usual approach on modern GPUs is to execute the branches multiple times, once for each path that any micro thread needs.

float x = tex.Load(int3(coord, 0)).r;
if (x == 5.0)
{
    // Threads 1 & 2 take this path
    output.Color = float4(1, 1, 1, 1);
}
else
{
    // Threads 3 & 4 take this path
    output.Color = float4(0, 0, 0, 0);
}

In the example above, the shader unit executes the statements of both branches; the difference is that while the if branch is executing, the results are simply not written back to the storage of threads 3 and 4 (no side effects).
This amounts to a significant increase in the amount of computation, which is the main cost of dynamic branching.
But if all the threads take the same branch, the other branch does not have to run at all: the shader unit will not pointlessly execute a branch that nothing needs, and the performance loss is small. Moreover, in real shaders, barring special cases, most threads within a warp take the same path even through dynamic branches.

4. Dynamic branches also make code harder to optimize

This point is often overlooked: code with dynamic branches tends to read or write memory inside the branches, creating dependencies before and after them, which makes it hard to optimize. Suppose you insist on writing something like this:
if(x == 1)
{
   color = tex1.Load(coord);
}
else if(x == 2)
{
   color = tex2.Load(coord);
}
...

How is the compiler supposed to optimize that for you?

As an aside, why does TextureArray exist? Partly for exactly this situation. In a TextureArray only the texel contents differ: the format, size, coordinates, LOD, and offsets can all be identical, so you can even expect the memory latency of fetching from the different texture surfaces to be very close. Many of these operations can then be merged into SIMD, which is much faster than sampling several separate textures. It is an example of turning SISD into SIMD by adding constraints (texture format, size, and addressing coordinates).
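
With a texture array, the branch chain above collapses into a single fetch whose layer index is the dynamic value. A GLSL ES 3.0 sketch (the textures array stacking tex1, tex2, ... as layers is an assumption):

#version 300 es
precision mediump float;

uniform mediump sampler2DArray textures;  // tex1, tex2, ... stored as layers
flat in int x;                            // the same dynamic selector as above
in vec2 coord;
out vec4 fragColor;

void main()
{
    // One SIMD-friendly fetch replaces the whole if/else-if chain;
    // x == 1 selects layer 0 (tex1), x == 2 selects layer 1 (tex2), ...
    fragColor = texture(textures, vec3(coord, float(x - 1)));
}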


How to locate a rendering pipeline bottleneck

Adapted from: http://blog.csdn.net/rabbit729/article/details/6398343

In general, the way to locate a rendering pipeline bottleneck is to vary the workload of each stage: if the throughput changes, that stage is the bottleneck. Once found, eliminate it by reducing that stage's workload, shifting work to other stages where possible.
  A bottleneck before rasterization is generally called "transform bound"; a bottleneck after triangle setup is called "fill bound".

Ways to locate the bottleneck:
1. Change the color depth of the framebuffer or render target (16-bit vs. 32-bit). If the frame rate changes, the bottleneck is the render target's fill rate.
2. Otherwise, try changing texture sizes and texture filtering settings. If the frame rate changes, the bottleneck is texturing.
3. Otherwise, change the resolution. If the frame rate changes, vary the number of pixel shader instructions: if the frame rate changes with it, the bottleneck is the pixel shader; otherwise it is the rasterization stage.
4. Otherwise, change the size of the vertex format. If the frame rate changes, the bottleneck is bus bandwidth to the graphics card.
5. If none of the above applies, the bottleneck is on the CPU side.


Best Practices for Shaders

Adapted from: OpenGL ES Programming Guide for iOS

Compile and Link Shaders During Initialization


Creating a shader program is an expensive operation compared to other OpenGL ES state changes. Compile, link, and validate your programs when your app is initialized. Once you’ve created all your shaders, the app can efficiently switch between them by calling glUseProgram.

Check for Shader Program Errors When Debugging

Reading diagnostic information after compiling or linking a shader program is not necessary in a Release build of your app and can reduce performance. Use OpenGL ES functions to read shader compile or link logs only in development builds of your app, as shown in Listing 10-1.

Listing 10-1 Read shader compile/link logs only in development builds
// After calling glCompileShader, glLinkProgram, or similar

#ifdef DEBUG
// Check the status of the compile/link
GLint logLen;
glGetProgramiv(prog, GL_INFO_LOG_LENGTH, &logLen);
if (logLen > 0) {
    // Show any errors as appropriate
    GLchar *log = (GLchar *)malloc(logLen);
    glGetProgramInfoLog(prog, logLen, &logLen, log);
    fprintf(stderr, "Prog Info Log: %s\n", log);
    free(log);
}
#endif

Similarly, you should call the glValidateProgram function only in development builds. You can use this function to find development errors such as failing to bind all texture units required by a shader program. But because validating a program checks it against the entire OpenGL ES context state, it is an expensive operation. Since the results of program validation are only meaningful during development, you should not call this function in Release builds of your app.


Use Separate Shader Objects to Speed Compilation and Linking

Many OpenGL ES apps use several vertex and fragment shaders, and it is often useful to reuse the same fragment shader with different vertex shaders, or vice versa. Because the core OpenGL ES specification requires a vertex and fragment shader to be linked together in a single shader program, mixing and matching shaders results in a large number of programs, increasing the total shader compile and link time when you initialize your app.
OpenGL ES 2.0 and 3.0 contexts on iOS support the EXT_separate_shader_objects extension. You can use the functions provided by this extension to compile vertex and fragment shaders separately, and to mix and match precompiled shader stages at render time using program pipeline objects. Additionally, this extension provides a simplified interface for compiling and using shaders, shown in Listing 10-2.

Listing 10-2 Compiling and using separate shader objects


- (void)loadShaders
{
    const GLchar *vertexSourceText = " ... vertex shader GLSL source code ... ";
    const GLchar *fragmentSourceText = " ... fragment shader GLSL source code ... ";

    // Compile and link the separate vertex shader program, then read its uniform variable locations
    _vertexProgram = glCreateShaderProgramvEXT(GL_VERTEX_SHADER, 1, &vertexSourceText);
    _uniformModelViewProjectionMatrix = glGetUniformLocation(_vertexProgram, "modelViewProjectionMatrix");
    _uniformNormalMatrix = glGetUniformLocation(_vertexProgram, "normalMatrix");

    // Compile and link the separate fragment shader program (which uses no uniform variables)
    _fragmentProgram = glCreateShaderProgramvEXT(GL_FRAGMENT_SHADER, 1, &fragmentSourceText);

    // Construct a program pipeline object and configure it to use the shaders
    glGenProgramPipelinesEXT(1, &_ppo);
    glBindProgramPipelineEXT(_ppo);
    glUseProgramStagesEXT(_ppo, GL_VERTEX_SHADER_BIT_EXT, _vertexProgram);
    glUseProgramStagesEXT(_ppo, GL_FRAGMENT_SHADER_BIT_EXT, _fragmentProgram);
}

- (void)glkView:(GLKView *)view drawInRect:(CGRect)rect
{
    // Clear the framebuffer
    glClearColor(0.65f, 0.65f, 0.65f, 1.0f);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    // Use the previously constructed program pipeline and set uniform contents in shader programs
    glBindProgramPipelineEXT(_ppo);
    glProgramUniformMatrix4fvEXT(_vertexProgram, _uniformModelViewProjectionMatrix, 1, 0, _modelViewProjectionMatrix.m);
    glProgramUniformMatrix3fvEXT(_vertexProgram, _uniformNormalMatrix, 1, 0, _normalMatrix.m);

    // Bind a VAO and render its contents
    glBindVertexArrayOES(_vertexArray);
    glDrawElements(GL_TRIANGLE_STRIP, _indexCount, GL_UNSIGNED_SHORT, 0);
}

Respect the Hardware Limits on Shaders


OpenGL ES limits the number of each variable type you can use in a vertex or fragment shader. The OpenGL ES specification doesn’t require implementations to provide a software fallback when these limits are exceeded; instead, the shader simply fails to compile or link. When developing your app you must ensure that no errors occur during shader compilation, as shown in Listing 10-1.

Use Precision Hints


Precision hints were added to the GLSL ES language specification to address the need for compact shader variables that match the smaller hardware limits of embedded devices. Each shader must specify a default precision; individual shader variables may override this precision to provide hints to the compiler on how that variable is used in your app. An OpenGL ES implementation is not required to use the hint information, but may do so to generate more efficient shaders. The GLSL ES specification lists the range and precision for each hint.

Important: The range limits defined by the precision hints are not enforced. You cannot assume your data is clamped to this range.
Follow these guidelines:

When in doubt, default to high precision.
Colors in the 0.0 to 1.0 range can usually be represented using low precision variables.
Position data should usually be stored as high precision.
Normals and vectors used in lighting calculations can usually be stored as medium precision.
After reducing precision, retest your app to ensure that the results are what you expect.

For example:

precision highp float; // Defines precision for float and float-derived (vector/matrix) types.
uniform lowp sampler2D sampler; // Texture2D() result is lowp.
varying lowp vec4 color;
varying vec2 texCoord;   // Uses default highp precision.
 
void main()
{
    gl_FragColor = color * texture2D(sampler, texCoord);
}


Perform Vector Calculations Lazily


Not all graphics processors include vector processors; they may perform vector calculations on a scalar processor. When performing calculations in your shader, consider the order of operations to ensure that the calculations are performed efficiently even if they are performed on a scalar processor.

If the code in Listing 10-4 were executed on a vector processor, each multiplication would be executed in parallel across all four of the vector's components. However, because of the placement of the parentheses, the same operation on a scalar processor would take eight multiplications, even though two of the three parameters are scalar values.

Listing 10-4 Poor use of vector operators
highp float f0, f1;
highp vec4 v0, v1;
v0 = (v1 * f0) * f1;


The same calculation can be performed more efficiently by shifting the parentheses as shown in Listing 10-5. In this example, the scalar values are multiplied together first, and the result multiplied against the vector parameter; the entire operation can be calculated with five multiplications.

Listing 10-5 Proper use of vector operations
highp float f0, f1;
highp vec4 v0, v1;
v0 = v1 * (f0 * f1);


Similarly, your app should always specify a write mask for a vector operation if it does not use all of the components of the result. On a scalar processor, calculations for components not specified in the mask can be skipped. Listing 10-6 runs twice as fast on a scalar processor because it specifies that only two components are needed.

Listing 10-6 Specifying a write mask
highp vec4 v0;
highp vec4 v1;
highp vec4 v2;
// Only the x and z components need to be computed and written.
v2.xz = (v0 * v1).xz;

Use Uniforms or Constants Instead of Computing Values in a Shader


Whenever a value can be calculated outside the shader, pass it into the shader as a uniform or a constant. Recalculating dynamic values can potentially be very expensive in a shader.
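
For example, an animation factor that is constant across a draw call should be computed once per frame on the CPU and uploaded, not re-evaluated per fragment. A sketch (pulse is an assumed uniform):

precision mediump float;

// Avoid: every fragment recomputing the same value, e.g.
//     mediump float pulse = 0.5 + 0.5 * sin(time * 6.2832);
// Instead, compute it once per frame on the CPU and pass it in:
uniform lowp float pulse;
varying lowp vec4 baseColor;

void main()
{
    gl_FragColor = baseColor * pulse;
}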

Use Branching Instructions with Caution

Branches are discouraged in shaders, as they can reduce the ability to execute operations in parallel on 3D graphics processors (although this performance cost is reduced on OpenGL ES 3.0–capable devices).

Your app may perform best if you avoid branching entirely. For example, instead of creating a large shader with many conditional options, create smaller shaders specialized for specific rendering tasks. There is a tradeoff between reducing the number of branches in your shaders and increasing the number of shaders you create. Test different options and choose the fastest solution.

If your shaders must use branches, follow these recommendations:

Best performance: Branch on a constant known when the shader is compiled.
Acceptable: Branch on a uniform variable.
Potentially slow: Branch on a value computed inside the shader.

Eliminate Loops

You can eliminate many loops by either unrolling the loop or using vectors to perform operations. For example, this code is very inefficient:


int i;
float f;
vec4 v;

for (i = 0; i < 4; i++)
    v[i] += f;


The same operation can be done directly using a component-wise add:

float f;
vec4 v;
v += f;


When you cannot eliminate a loop, it is preferred that the loop have a constant limit to avoid dynamic branches.

Avoid Computing Array Indices in Shaders

Using indices computed in the shader is more expensive than a constant or uniform array index. Accessing uniform arrays is usually cheaper than accessing temporary arrays.
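
A vertex shader sketch of the difference (bones and boneIndex are assumed names):

attribute vec4 position;
uniform mat4 bones[8];       // uniform array
uniform int boneIndex;       // index chosen on the CPU where possible

void main()
{
    // Cheaper: a constant or uniform index.
    gl_Position = bones[boneIndex] * position;

    // More expensive: an index computed inside the shader, e.g.
    //     int i = int(position.w);
    //     gl_Position = bones[i] * position;
}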

Be Aware of Dynamic Texture Lookups

Dynamic texture lookups, also known as dependent texture reads, occur when a fragment shader computes texture coordinates rather than using the unmodified texture coordinates passed into the shader. Dependent texture reads are supported at no performance cost on OpenGL ES 3.0–capable hardware; on other devices, dependent texture reads can delay loading of texel data, reducing performance. When a shader has no dependent texture reads, the graphics hardware may prefetch texel data before the shader executes, hiding some of the latency of accessing memory.

Listing 10-7 shows a fragment shader that calculates new texture coordinates. The calculation in this example can easily be performed in the vertex shader, instead. By moving the calculation to the vertex shader and directly using the vertex shader’s computed texture coordinates, you avoid the dependent texture read.

Note: It may not seem obvious, but any calculation on the texture coordinates counts as a dependent texture read. For example, packing multiple sets of texture coordinates into a single varying parameter and using a swizzle command to extract the coordinates still causes a dependent texture read.


Listing 10-7 Dependent Texture Read

varying vec2 vTexCoord;
uniform sampler2D textureSampler;

void main()
{
    vec2 modifiedTexCoord = vec2(1.0 - vTexCoord.x, 1.0 - vTexCoord.y);
    gl_FragColor = texture2D(textureSampler, modifiedTexCoord);
}
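
A sketch of the fix: flip the coordinates in the vertex shader and pass the result through unchanged, so the fragment shader's lookup is no longer dependent (attribute and uniform names are assumed):

// Vertex shader
attribute vec4 position;
attribute vec2 aTexCoord;
uniform mat4 mvpMatrix;
varying vec2 vTexCoord;

void main()
{
    // The flip now happens per vertex, not per fragment.
    vTexCoord = vec2(1.0 - aTexCoord.x, 1.0 - aTexCoord.y);
    gl_Position = mvpMatrix * position;
}

// The fragment shader samples with the unmodified varying:
//     gl_FragColor = texture2D(textureSampler, vTexCoord);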

Fetch Framebuffer Data for Programmable Blending


Traditional OpenGL and OpenGL ES implementations provide a fixed-function blending stage, illustrated in Figure 10-1. Before issuing a draw call, you specify a blending operation from a fixed set of possible parameters. After your fragment shader outputs color data for a pixel, the OpenGL ES blending stage reads color data for the corresponding pixel in the destination framebuffer, then combines the two according to the specified blending operation to produce an output color.

Figure 10-1 Traditional fixed-function blending


In iOS 6.0 and later, you can use the EXT_shader_framebuffer_fetch extension to implement programmable blending and other effects. Instead of supplying a source color to be blended by OpenGL ES, your fragment shader reads the contents of the destination framebuffer corresponding to the fragment being processed. Your fragment shader can then use whatever algorithm you choose to produce an output color, as shown in Figure 10-2.

Figure 10-2 Programmable blending with framebuffer fetch



This extension enables many advanced rendering techniques:

Additional blending modes. By defining your own GLSL ES functions for combining source and destination colors, you can implement blending modes not possible with the OpenGL ES fixed-function blending stage. For example, Listing 10-8 implements the Overlay and Difference blending modes found in popular graphics software.
Post-processing effects. After rendering a scene, you can draw a full-screen quad using a fragment shader that reads the current fragment color and transforms it to produce an output color. The shader in Listing 10-9 can be used with this technique to convert a scene to grayscale.
Non-color fragment operations. Framebuffers may contain non-color data. For example, deferred shading algorithms use multiple render targets to store depth and normal information. Your fragment shader can read such data from one (or more) render targets and use them to produce an output color in another render target.
These effects are possible without the framebuffer fetch extension—for example, grayscale conversion can be done by rendering a scene into a texture, then drawing a full-screen quad using that texture and a fragment shader that converts texel colors to grayscale. However, using this extension generally results in better performance.

To enable this feature, your fragment shader must declare that it requires the EXT_shader_framebuffer_fetch extension, as shown in Listing 10-8 and Listing 10-9. The shader code to implement this feature differs between versions of the OpenGL ES Shading Language (GLSL ES).

Using Framebuffer Fetch in GLSL ES 1.0

For OpenGL ES 2.0 contexts and OpenGL ES 3.0 contexts not using #version 300 es shaders, you use the gl_FragColor builtin variable for fragment shader output and the gl_LastFragData builtin variable to read framebuffer data, as illustrated in Listing 10-8.

Listing 10-8 Fragment shader for programmable blending in GLSL ES 1.0

#extension GL_EXT_shader_framebuffer_fetch : require
 
#define kBlendModeDifference 1
#define kBlendModeOverlay    2
#define BlendOverlay(a, b) ( (b<0.5) ? (2.0*b*a) : (1.0-2.0*(1.0-a)*(1.0-b)) )
 
uniform int blendMode;
varying lowp vec4 sourceColor;
 
void main()
{
    lowp vec4 destColor = gl_LastFragData[0];
    if (blendMode == kBlendModeDifference) {
        gl_FragColor = abs( destColor - sourceColor );
    } else if (blendMode == kBlendModeOverlay) {
        gl_FragColor.r = BlendOverlay(sourceColor.r, destColor.r);
        gl_FragColor.g = BlendOverlay(sourceColor.g, destColor.g);
        gl_FragColor.b = BlendOverlay(sourceColor.b, destColor.b);
        gl_FragColor.a = sourceColor.a;
    } else { // normal blending
        gl_FragColor = sourceColor;
    }
}

Using Framebuffer Fetch in GLSL ES 3.0

In GLSL ES 3.0, you use user-defined variables declared with the out qualifier for fragment shader outputs. If you declare a fragment shader output variable with the inout qualifier, it will contain framebuffer data when the fragment shader executes. Listing 10-9 illustrates a grayscale post-processing technique using an inout variable.

Listing 10-9 Fragment shader for color post-processing in GLSL ES 3.0

#version 300 es
#extension GL_EXT_shader_framebuffer_fetch : require
 
layout(location = 0) inout lowp vec4 destColor;
 
void main()
{
    lowp float luminance = dot(vec3(0.3, 0.59, 0.11), destColor.rgb);
    destColor.rgb = vec3(luminance);
}


Use Textures for Larger Memory Buffers in Vertex Shaders

In iOS 7.0 and later, vertex shaders can read from currently bound texture units. Using this technique you can access much larger memory buffers during vertex processing, enabling high performance for some advanced rendering techniques. For example:

① Displacement mapping. Draw a mesh with default vertex positions, then read from a texture in the vertex shader to alter the position of each vertex. Listing 10-10 demonstrates using this technique to generate three-dimensional geometry from a grayscale height map texture.

② Instanced drawing. As described in Use Instanced Drawing to Minimize Draw Calls, instanced drawing can dramatically reduce CPU overhead when rendering a scene that contains many similar objects. However, providing per-instance information to the vertex shader can be a challenge. A texture can store extensive information for many instances. For example, you could render a vast cityscape by drawing hundreds of instances from vertex data describing only a simple cube. For each instance, the vertex shader could use the gl_InstanceID variable to sample from a texture, obtaining a transformation matrix, color variation, texture coordinate offset, and height variation to apply to each building.


Listing 10-10 Vertex shader for rendering from a height map

attribute vec2 xzPos;

uniform mat4 modelViewProjectionMatrix;
uniform sampler2D heightMap;

void main()
{
    // Use the vertex X and Z values to look up a Y value in the texture.
    vec4 position = texture2D(heightMap, xzPos);
    // Put the X and Z values into their places in the position vector.
    position.xz = xzPos;

    // Transform the position vector from model to clip space.
    gl_Position = modelViewProjectionMatrix * position;
}
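
For the instanced-drawing case in ②, a minimal GLSL ES 3.0 sketch (perInstanceData is an assumed texture holding one texel of world-space offset per instance):

#version 300 es

uniform highp sampler2D perInstanceData;  // one texel per instance
uniform mat4 viewProjectionMatrix;
in vec4 position;                         // shared cube geometry

void main()
{
    // Fetch this instance's offset, using gl_InstanceID as the texel column.
    vec4 offset = texelFetch(perInstanceData, ivec2(gl_InstanceID, 0), 0);
    gl_Position = viewProjectionMatrix * (position + vec4(offset.xyz, 0.0));
}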

You can also use uniform arrays and uniform buffer objects (in OpenGL ES 3.0) to provide bulk data to a vertex shader, but vertex texture access offers several potential advantages. You can store much more data in a texture than in either a uniform array or uniform buffer object, and you can use texture wrapping and filtering options to interpolate the data stored in a texture. Additionally, you can render to a texture, taking advantage of the GPU to produce data for use in a later vertex processing stage.

To determine whether vertex texture sampling is available on a device (and the number of texture units available to vertex shaders), check the value of the MAX_VERTEX_TEXTURE_IMAGE_UNITS limit at run time. (See Verifying OpenGL ES Capabilities.)


Other optimization material: GPU 優化總結 (a GPU optimization round-up).