draw1: 6M for draw 0,0,100,100 no repl draw3: 4M for draw 0,0,100,100 no repl just read src, dst - 250k mask reading - 650k write dst - 100k alpha calculation - 3000k olddraw: 10M for draw 0, 0, 1000, 1000 no repl all ldepth 3 44M for draw 0, 0, 1000, 1000 src, mask ldepth 2 dst ldepth 3 draw4: 160M for draw 0, 0, 1000, 1000 no repl all r8g8b8 null loop: 10k src, dst reading: 13-15M each mask reading: 30M alpha calculation loop: 90M null alpha loop: 2M minimal loop control +20M alpha calculation with divides +190M alpha calculation wtih shifts +70M writeback: 11M