Richard's blog: Skia Performance Optimization

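Skia is the open-source vector graphics library used in Android.

In Android 2.2, Skia is nearly everywhere. It provides draw text, draw rectangle, draw bitmap, and similar functions, and the UI of many apks, the launcher for example, is actually rendered with Skia. The native Android code ships two variants: skia and skiaGL (the OpenGL-accelerated version). In the skia variant all computation is done on the CPU. It offers both a fixed-point and a floating-point build, but when I tried switching between them with the #define, compilation failed; some of the code has to be modified first. (Has anyone else tried getting this part of the code to build?)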

Because Skia's computation relies on the CPU, there is room to optimize the code. Taking drawRect as an example, we pick one blitter to optimize:

Code location: \external\skia\src\core\SkBlitter_RGB16.cpp

The blitter we pick is SkRGB16_Blitter, and the function to optimize is blitRect. First, here is the original SkRGB16_Blitter::blitRect:

Source3.cpp

01 void SkRGB16_Blitter::blitRect(int x, int y, int width, int height) {
02     SkASSERT(x + width <= fDevice.width() && y + height <= fDevice.height());
03     uint16_t* SK_RESTRICT device = fDevice.getAddr16(x, y);
04     unsigned    deviceRB = fDevice.rowBytes();
05     SkPMColor src32 = fSrcColor32;        
06     while (--height >= 0)
07         {
08         blend32_16_row(src32, device, width);
09         device = (uint16_t*)((char*)device + deviceRB);
10     }
11 }

The blend32_16_row function called on line 8 is shown below:

Source3.cpp

static inline void blend32_16_row(SkPMColor src, uint16_t dst[], int count) {
    SkASSERT(count > 0);
    // Source color spread into the expanded 32-bit layout, computed once per row.
    uint32_t src_expand = pmcolor_to_expand16(src);
    // Destination weight (inverse source alpha), reduced to the range 0..32.
    unsigned scale = SkAlpha255To256(0xFF - SkGetPackedA32(src)) >> 3;
    do {
        // One 16-bit load, one blend, and one 16-bit store per pixel.
        uint32_t dst_expand = SkExpand_rgb_16(*dst) * scale;
        *dst = SkCompact_rgb_16((src_expand + dst_expand) >> 5);
        dst += 1;
    } while (--count != 0);
}

SkRGB16_Blitter::blitRect walks the rectangle from top to bottom, computes the memory address of each line, and then calls blend32_16_row to process every pixel on that line.
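To make the per-pixel arithmetic concrete, here is a minimal, self-contained sketch of the same blend for a single RGB565 pixel. The helper names (expand565, compact565, blendPixel) and the separate a/r/g/b parameters are my own, chosen for illustration. They mirror what SkExpand_rgb_16, SkCompact_rgb_16, and pmcolor_to_expand16 appear to do, but they are not Skia's actual code, and SkAlpha255To256 is assumed here to be a plain "+1".

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for Skia's helpers (illustration only).
const uint32_t kGreenMask = 0x07E0;  // the 6 green bits of an RGB565 pixel

// Move green up by 16 bits so every channel has spare bits above it.
static inline uint32_t expand565(uint16_t c) {
    return ((c & kGreenMask) << 16) | (c & 0xF81Fu);
}

// Inverse of expand565: bring green back down and drop the spare bits.
static inline uint16_t compact565(uint32_t c) {
    return static_cast<uint16_t>(((c >> 16) & kGreenMask) | (c & 0xF81Fu));
}

// Blend a premultiplied 8-bit-per-channel source over one RGB565 pixel.
static inline uint16_t blendPixel(unsigned a, unsigned r, unsigned g, unsigned b,
                                  uint16_t dst) {
    assert(a <= 255 && r <= a && g <= a && b <= a);  // premultiplied input
    // Source channels sit 5 bits above their final expanded positions.
    uint32_t srcExpand = (static_cast<uint32_t>(g) << 24) | (r << 13) | (b << 2);
    // Destination weight in 0..32, assuming SkAlpha255To256(x) == x + 1.
    unsigned scale = ((255 - a) + 1) >> 3;
    uint32_t dstExpand = expand565(dst) * scale;
    // All three channels are summed and rescaled in one 32-bit operation.
    return compact565((srcExpand + dstExpand) >> 5);
}
```

With a fully transparent source the destination comes back unchanged, and a fully opaque source replaces it.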

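The function we want to optimize is blend32_16_row. The CPU I am using is an ARM9, whose loads and stores are 32 bits wide, yet blend32_16_row accesses dst 16 bits at a time. The optimization is to read 32 bits, that is, two elements, per access. The optimized code is as follows: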

Source3.cpp

static inline void blend32_16_row_32(SkPMColor src, uint16_t dst[], int count) {
    // The caller guarantees an even count and a 32-bit aligned dst.
    SkASSERT(count > 0 && (count & 1) == 0);
    uint32_t src_expand = pmcolor_to_expand16(src);
    unsigned scale = SkAlpha255To256(0xFF - SkGetPackedA32(src)) >> 3;
    uint32_t* dst32 = (uint32_t*)dst;
    uint32_t dstH, dstL;
    uint32_t dst_expandH, dst_expandL;
    do {
        // One 32-bit load fetches two adjacent 16-bit pixels.
        dstH = *dst32;
        dstL = dstH & 0xffff;
        dstH >>= 16;
        dst_expandH = SkExpand_rgb_16(dstH) * scale;
        dst_expandL = SkExpand_rgb_16(dstL) * scale;
        dstH = (uint32_t)SkCompact_rgb_16((src_expand + dst_expandH) >> 5);
        dstL = (uint32_t)SkCompact_rgb_16((src_expand + dst_expandL) >> 5);
        // Repack both results and write them back with one 32-bit store.
        *dst32++ = (dstH << 16) | (dstL & 0xffff);
        count -= 2;
    } while (count > 0);
}

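After 32 bits are read from dst, they are unpacked into two elements, each element is processed separately, and the results are packed back into 32 bits and written to the buffer.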
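The load, unpack, compute, repack pattern can also be shown on its own, independent of the blend math. The sketch below is not the code from SkBlitter_RGB16.cpp: processPairs, PixelOp, and addOne are hypothetical names, and std::memcpy stands in for the pointer cast so the example stays well-defined on any host.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

typedef uint16_t (*PixelOp)(uint16_t);  // hypothetical per-pixel operation

// Toy operation, used only to make the effect visible.
static uint16_t addOne(uint16_t p) { return static_cast<uint16_t>(p + 1); }

// Apply op to `count` pixels (count must be even), two pixels per 32-bit access.
static void processPairs(uint16_t* dst, int count, PixelOp op) {
    assert(count > 0 && (count & 1) == 0);
    for (int i = 0; i < count; i += 2) {
        uint32_t word;
        std::memcpy(&word, dst + i, sizeof(word));            // one 32-bit load
        uint32_t lo = op(static_cast<uint16_t>(word & 0xFFFF)); // low half
        uint32_t hi = op(static_cast<uint16_t>(word >> 16));    // high half
        word = (hi << 16) | lo;                               // repack in place
        std::memcpy(dst + i, &word, sizeof(word));            // one 32-bit store
    }
}
```

Because each half is written back to the position it was read from, the result does not depend on the host byte order.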

In addition, the memory address of dst is not always 32-bit aligned. For that case I simply unroll the loop. The code is as follows:

Source3.cpp

static inline void blend32_16_row_5(SkPMColor src, uint16_t dst[], int count) {
    // The caller guarantees an even count; dst only needs 16-bit alignment.
    SkASSERT(count > 0 && (count & 1) == 0);
    uint32_t src_expand = pmcolor_to_expand16(src);
    unsigned scale = SkAlpha255To256(0xFF - SkGetPackedA32(src)) >> 3;
    uint16_t* dstH = dst;  // write pointer; dst itself is the read pointer
    uint32_t dst_expandH, dst_expandL;
    do {
        // Loop unrolled by two: load both pixels, then store both results.
        dst_expandH = SkExpand_rgb_16(*dst++) * scale;
        dst_expandL = SkExpand_rgb_16(*dst++) * scale;
        *dstH++ = SkCompact_rgb_16((src_expand + dst_expandH) >> 5);
        *dstH++ = SkCompact_rgb_16((src_expand + dst_expandL) >> 5);
        count -= 2;
    } while (count > 0);
}

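The condition is that the width of the rect must be a multiple of 2.

Widths that are not a multiple of 2 can be optimized as well, but here I simply fall back to the original code.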
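For completeness, one common way to cover odd widths and misaligned rows is to peel off a single leading pixel until the pointer is 32-bit aligned, run the paired loop over the middle, and finish any leftover pixel on its own. This is only a sketch of that general idea, not something taken from the patch described in this post; processRow, PixelOp, and addOne are hypothetical names, and the per-pixel work is reduced to a toy operation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

typedef uint16_t (*PixelOp)(uint16_t);  // hypothetical per-pixel operation

static uint16_t addOne(uint16_t p) { return static_cast<uint16_t>(p + 1); }

// Apply op to any number of pixels starting at any 16-bit aligned address.
static void processRow(uint16_t* dst, int count, PixelOp op) {
    assert(count >= 0);
    // Head: one pixel, if needed, to reach a 32-bit boundary.
    if (count > 0 && (reinterpret_cast<uintptr_t>(dst) & 0x3) != 0) {
        *dst = op(*dst);
        ++dst;
        --count;
    }
    // Body: aligned pairs, one 32-bit load and store per two pixels.
    for (; count >= 2; count -= 2, dst += 2) {
        uint32_t word;
        std::memcpy(&word, dst, sizeof(word));
        uint32_t lo = op(static_cast<uint16_t>(word & 0xFFFF));
        uint32_t hi = op(static_cast<uint16_t>(word >> 16));
        word = (hi << 16) | lo;
        std::memcpy(dst, &word, sizeof(word));
    }
    // Tail: at most one pixel is left.
    if (count == 1) {
        *dst = op(*dst);
    }
}
```

Whether the extra branches pay off on a given CPU would have to be measured.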

The modified SkRGB16_Blitter::blitRect is as follows:

Source3.cpp

void SkRGB16_Blitter::blitRect(int x, int y, int width, int height) {
    SkASSERT(x + width <= fDevice.width() && y + height <= fDevice.height());
    uint16_t* SK_RESTRICT device = fDevice.getAddr16(x, y);
    unsigned    deviceRB = fDevice.rowBytes();
    SkPMColor src32 = fSrcColor32;
    if (((width & 0x1) == 0) && (((uintptr_t)device & 0x3) == 0)
            && ((deviceRB & 0x3) == 0)) {
        // Even width, 32-bit aligned first row, and a stride that keeps
        // every following row 32-bit aligned as well.
        while (--height >= 0) {
            blend32_16_row_32(src32, device, width);
            device = (uint16_t*)((char*)device + deviceRB);
        }
    } else if ((width & 0x1) == 0) {
        // Even width, but not 32-bit aligned: use the unrolled 16-bit loop.
        while (--height >= 0) {
            blend32_16_row_5(src32, device, width);
            device = (uint16_t*)((char*)device + deviceRB);
        }
    } else {
        // Odd width: fall back to the original row function.
        while (--height >= 0) {
            blend32_16_row(src32, device, width);
            device = (uint16_t*)((char*)device + deviceRB);
        }
    }
}
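One detail worth checking when dispatching to the 32-bit path: every row starts deviceRB bytes after the previous one, so an aligned first row only guarantees aligned later rows if the stride is itself a multiple of 4. A small predicate, sketched here under the hypothetical name canUseWordPath, captures the full condition in one place.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when every row of the rect can be processed
// two pixels at a time with aligned 32-bit accesses.
static bool canUseWordPath(const uint16_t* firstRow, size_t rowBytes, int width) {
    bool evenWidth     = (width & 0x1) == 0;
    bool alignedStart  = (reinterpret_cast<uintptr_t>(firstRow) & 0x3) == 0;
    bool alignedStride = (rowBytes & 0x3) == 0;  // keeps later rows aligned too
    return evenWidth && alignedStart && alignedStride;
}
```

Using uintptr_t for the pointer test also keeps the check valid on targets where a pointer is wider than unsigned int.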

The images built before and after the optimization were each benchmarked with the Draw Rect option of 0xBenchmark. The results are as follows.

[Figure: benchmark result before optimization]

[Figure: benchmark result after optimization]

Tuesday, October 9, 2012
