Krita/Benchmarking

From KDE Community Wiki

Benchmarking Krita performance

Tile engine

Data Manager

  • Image dimensions for the tests: 4096*4096, RGB
  • I ran every test a few times and picked the result that came up most often
  • The callgrind backend did not produce callgrind.* files, so I used valgrind directly, but that also profiles the Qt test library itself
  • http://lukast.mediablog.sk/callgrind/DatamanagerBenchmarks.tar.gz
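Picking the result that shows up most often across repeated runs can be sketched as follows. This is a Python illustration, not part of the Krita test suite; the name `most_frequent_result`, the rounding precision and the sample wall times are all assumptions made for the example.

```python
from collections import Counter


def most_frequent_result(walltimes_msec, precision=0):
    """Return the wall time (rounded to `precision` digits) seen most often.

    Ties are resolved in favour of the value that was seen first.
    """
    if not walltimes_msec:
        raise ValueError("need at least one measurement")
    rounded = [round(t, precision) for t in walltimes_msec]
    # most_common() lists equal counts in first-seen order.
    value, _count = Counter(rounded).most_common(1)[0]
    return value


# Hypothetical wall times (msec) from five runs of one benchmark.
runs = [38.2, 41.7, 38.4, 37.9, 45.0]
print(most_frequent_result(runs))
```

Rounding before counting matters: raw timings almost never repeat exactly, so without it every value would occur once.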
benchmark name | walltime | tickcounter | Mb/s
benchmarkWriteBytes | 38.0 msec per iteration (total: 380, iterations: 10) | 77,528,468.2 ticks per iteration (total: 775284683, iterations: 10) | 1333.3 Mb/s
benchmarkReadBytes | 39.3 msec per iteration (total: 394, iterations: 10) | 77,311,910.2 ticks per iteration (total: 773119103, iterations: 10) | 1628.4 Mb/s
benchmarkReadWriteBytes | 46.2 msec per iteration (total: 462, iterations: 10) | 91,198,881.7 ticks per iteration (total: 911988817, iterations: 10) | 1391.3 Mb/s
benchmarkExtent | 0.00020 msec per iteration (total: 34, iterations: 163840) | 735.0 ticks per iteration (total: 7350, iterations: 10) | N/A
benchmarkClear | 1.3 msec per iteration (total: 26, iterations: 20) | 2,542,070.2 ticks per iteration (total: 25420702, iterations: 10) | N/A
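The Mb/s column appears to be the image size divided by the wall time, in megabytes per second. A 4096*4096 image at 4 bytes per pixel is 64 MiB, and 64 MiB / 1,548.0 msec gives the 41.34 Mb/s listed for the horizontal iterator on trunk. The 4 bytes per pixel is inferred from the numbers, not stated on this page, and the function below is a Python sketch under that assumption rather than the code the benchmarks use.

```python
MIB = 1024 * 1024


def throughput_mib_per_s(walltime_msec, width=4096, height=4096, bytes_per_pixel=4):
    """Bytes touched in one iteration divided by the wall time, in MiB/s.

    bytes_per_pixel=4 is an assumption inferred from the published figures.
    """
    if walltime_msec <= 0:
        raise ValueError("wall time must be positive")
    image_bytes = width * height * bytes_per_pixel
    return (image_bytes / MIB) / (walltime_msec / 1000.0)


# Wall time of benchmarkWriteBytes, horizontal iterator, trunk state.
print(round(throughput_mib_per_s(1548.0), 2))
```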

Iterators

Horizontal Iterator

benchmark name | walltime | tickcounter | Mb/s
benchmarkWriteBytes | 1,383.4 msec per iteration (total: 13834, iterations: 10) | 4,389,801,089.3 ticks per iteration (total: 43898010893, iterations: 10) | 46.3 Mb/s
benchmarkReadBytes | 1,443.2 msec per iteration (total: 14433, iterations: 10) | 4,461,418,645.5 ticks per iteration (total: 44614186455, iterations: 10) | 44.4 Mb/s
benchmarkConstReadBytes | 1,380.7 msec per iteration (total: 13808, iterations: 10) | 4,501,257,062.3 ticks per iteration (total: 45012570623, iterations: 10) | 46.3 Mb/s
benchmarkReadWriteBytes | 2,041.7 msec per iteration (total: 20418, iterations: 10) | 5,736,531,494.3 ticks per iteration (total: 57365314943, iterations: 10) | 31.3 Mb/s
benchmarkNoMemCpy | 655.7 msec per iteration (total: 6557, iterations: 10) | 3,025,535,970.6 ticks per iteration (total: 30255359707, iterations: 10) | 97.7 Mb/s
benchmarkConstNoMemCpy | 583.7 msec per iteration (total: 5837, iterations: 10) | 2,889,942,765.8 ticks per iteration (total: 28899427658, iterations: 10) | 109.6 Mb/s
benchmarkTwoIteratorsNoMemCpy | 1,205.7 msec per iteration (total: 12057, iterations: 10) | 3,952,530,421.5 ticks per iteration (total: 39525304215, iterations: 10) | 53.1 Mb/s
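Each walltime cell carries three numbers: the per-iteration time, the total time and the iteration count, with the first equal to total / iterations up to rounding. The Python sketch below parses such a cell and checks that relation. The helper names and the tolerance are assumptions for illustration, and the pattern is written against the text as it appears on this page, not against any documented output format.

```python
import re

# Matches e.g. "1,383.4 msec per iteration (total: 13834, iterations: 10)".
_CELL = re.compile(
    r"([\d,]+(?:\.\d+)?) msec per iteration \(total: (\d+), iterations: (\d+)\)"
)


def parse_walltime(cell):
    """Return (per_iteration_msec, total_msec, iterations) from a walltime cell."""
    match = _CELL.search(cell)
    if match is None:
        raise ValueError("not a walltime cell: %r" % cell)
    per_iteration = float(match.group(1).replace(",", ""))
    return per_iteration, int(match.group(2)), int(match.group(3))


def is_consistent(cell, tolerance_msec=0.15):
    """True if the per-iteration time matches total / iterations within tolerance."""
    per_iteration, total, iterations = parse_walltime(cell)
    return abs(per_iteration - total / iterations) <= tolerance_msec


cell = "1,383.4 msec per iteration (total: 13834, iterations: 10)"
print(parse_walltime(cell), is_consistent(cell))
```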


Update, state: trunk, 17 Feb 2010 15:38

benchmark name | walltime | Mb/s
benchmarkWriteBytes | 1,548.0 msec per iteration (total: 15481, iterations: 10) | 41.34 Mb/s
benchmarkReadBytes | 3,087.8 msec per iteration (total: 30878, iterations: 10) | 20.73 Mb/s
benchmarkConstReadBytes | 3,062.0 msec per iteration (total: 30620, iterations: 10) | 20.90 Mb/s
benchmarkReadWriteBytes | 3,725.0 msec per iteration (total: 37251, iterations: 10) | 17.18 Mb/s
benchmarkNoMemCpy | 2,264.4 msec per iteration (total: 22644, iterations: 10) | 28.26 Mb/s
benchmarkConstNoMemCpy | 2,316.8 msec per iteration (total: 23168, iterations: 10) | 27.62 Mb/s
benchmarkTwoIteratorsNoMemCpy | 2,950.0 msec per iteration (total: 29501, iterations: 10) | 21.69 Mb/s

State: caching patch applied to trunk

benchmark name | walltime | Mb/s | speedup vs. trunk
benchmarkWriteBytes | 1,211.4 msec per iteration (total: 12114, iterations: 10) | 52.83 Mb/s | 1.28
benchmarkReadBytes | 1,196.2 msec per iteration (total: 11962, iterations: 10) | 53.50 Mb/s | 2.58
benchmarkConstReadBytes | 1,202.2 msec per iteration (total: 12022, iterations: 10) | 53.24 Mb/s | 2.55
benchmarkReadWriteBytes | 1,563.0 msec per iteration (total: 15631, iterations: 10) | 40.95 Mb/s | 2.38
benchmarkNoMemCpy | 389.1 msec per iteration (total: 3891, iterations: 10) | 164.48 Mb/s | 5.82
benchmarkConstNoMemCpy | 372.5 msec per iteration (total: 3725, iterations: 10) | 171.81 Mb/s | 6.21
benchmarkTwoIteratorsNoMemCpy | 670.3 msec per iteration (total: 6704, iterations: 10) | 95.48 Mb/s | 4.4
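The speedup figures are consistent with dividing the trunk wall time by the wall time with the caching patch applied. A small Python sketch of that arithmetic, using two rows from the tables above (the function name `speedup` is an assumption for the example):

```python
def speedup(baseline_msec, patched_msec):
    """How many times faster the patched run is than the baseline."""
    if baseline_msec <= 0 or patched_msec <= 0:
        raise ValueError("wall times must be positive")
    return baseline_msec / patched_msec


# (trunk msec, patched msec) per iteration, copied from the two tables above.
rows = {
    "benchmarkWriteBytes": (1548.0, 1211.4),
    "benchmarkReadBytes": (3087.8, 1196.2),
}
for name, (trunk, patched) in rows.items():
    print(name, round(speedup(trunk, patched), 2))
```

The same ratio can be taken from the Mb/s columns instead (patched / trunk), since throughput is inversely proportional to wall time for a fixed image size.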

Vertical Iterator

benchmark name | walltime | tickcounter | Mb/s
benchmarkWriteBytes | 1,541.9 msec per iteration (total: 15419, iterations: 10) | Not measured | 41.52 Mb/s
benchmarkReadBytes | 1,534.4 msec per iteration (total: 15344, iterations: 10) | Not measured | 41.7 Mb/s
benchmarkConstReadBytes | 1,460.5 msec per iteration (total: 14606, iterations: 10) | Not measured | 43.82 Mb/s
benchmarkReadWriteBytes | 2,156.3 msec per iteration (total: 21563, iterations: 10) | Not measured | 29.7 Mb/s
benchmarkNoMemCpy | 649.0 msec per iteration (total: 6490, iterations: 10) | Not measured | 98.6 Mb/s
benchmarkConstNoMemCpy | 599.3 msec per iteration (total: 5994, iterations: 10) | Not measured | 106.7 Mb/s
benchmarkTwoIteratorsNoMemCpy | 1,231.5 msec per iteration (total: 12316, iterations: 10) | Not measured | 52 Mb/s

Rectangular Iterator

benchmark name | walltime | Mb/s
benchmarkWriteBytes | 118.2 msec per iteration (total: 1182, iterations: 10) | 541.4 Mb/s
benchmarkReadBytes | 121.7 msec per iteration (total: 1217, iterations: 10) | 525.9 Mb/s
benchmarkConstReadBytes | 120.5 msec per iteration (total: 1205, iterations: 10) | 533.3 Mb/s
benchmarkReadWriteBytes | 167.0 msec per iteration (total: 1670, iterations: 10) | 383.2 Mb/s
benchmarkNoMemCpy | 35.7 msec per iteration (total: 358, iterations: 10) | 1792.7 Mb/s
benchmarkConstNoMemCpy | 37.7 msec per iteration (total: 377, iterations: 10) | 1697.6 Mb/s
benchmarkTwoIteratorsNoMemCpy | 65.2 msec per iteration (total: 652, iterations: 10) | 981.6 Mb/s

Random Iterator

benchmark name | walltime | Mb/s
benchmarkWriteBytes | 1,641.5 msec per iteration (total: 16415, iterations: 10) | 39.0 Mb/s
benchmarkReadBytes | 1,598.5 msec per iteration (total: 15985, iterations: 10) | 40.0 Mb/s
benchmarkConstReadBytes | 1,654.5 msec per iteration (total: 16545, iterations: 10) | 38.68 Mb/s
benchmarkReadWriteBytes | 2,934.8 msec per iteration (total: 29348, iterations: 10) | 21.8 Mb/s
benchmarkNoMemCpy | 971.3 msec per iteration (total: 9714, iterations: 10) | 65.9 Mb/s
benchmarkConstNoMemCpy | 938.6 msec per iteration (total: 9386, iterations: 10) | 68.2 Mb/s
benchmarkTwoIteratorsNoMemCpy | 1,929.7 msec per iteration (total: 19298, iterations: 10) | 33.2 Mb/s
benchmarkTileByTileWrite | 1,310.0 msec per iteration (total: 13101, iterations: 10) | 48.9 Mb/s
benchmarkTotalRandom | 27,999 msec per iteration (total: 27999, iterations: 1) | 2.2 Mb/s
benchmarkTotalRandomConst | 29,124 msec per iteration (total: 29124, iterations: 1) | 2.2 Mb/s

KisPainter

Composition (bitBlt)

benchmark name | walltime | Mb/s
benchmarkBitBlt | 5,456.8 msec per iteration (total: 54569, iterations: 10) | 234.6 Mb/s
benchmarkBitBltSelection | 5,922.8 msec per iteration (total: 59228, iterations: 10) | 216.1 Mb/s
benchmarkFixedBitBlt | 3,635.5 msec per iteration (total: 36356, iterations: 10) | 352.1 Mb/s
benchmarkFixedBitBltSelection | 5,342.1 msec per iteration (total: 53421, iterations: 10) | 239.6 Mb/s

Filters

Brightness/Contrast

benchmark name | walltime | Mb/s
benchmarkFilter | 1,783.5 msec per iteration (total: 17835, iterations: 10) | 14.47 Mb/s

Blur


benchmark name | walltime | Mb/s
benchmarkFilter | 31,674 msec per iteration (total: 31674, iterations: 1) | 0.81 Mb/s

Projection

Everything is benchmarked in one go.

benchmark name | walltime | Mb/s
benchmarkProjection | 834.6 msec per iteration (total: 8346, iterations: 10) | N/A

Painting strokes

  • We paint on an empty 4096x4096 paint device
  • The brush used is a 70px pixelbrush, autobrush (the default one)
  • The benchmark can run with any paintop; only the preset needs to be changed
  • The first test paints the stroke you can see in the preview box, at a different scale, on the 4096x4096px image
  • The second test paints 20 random lines (the same 20 lines in every run) with varying pressure (from 0.0 to 1.0)
  • http://lukast.mediablog.sk/callgrind/strokeBenchmarks.tar.gz [TODO: add bounds result]
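Painting the same 20 random lines in every run is typically achieved by seeding the random generator with a fixed value. The Python sketch below illustrates the idea only; it is not the Krita benchmark code. The function name, the seed value and the choice of one random pressure per end point are assumptions made for the example.

```python
import random


def make_random_lines(count=20, canvas_size=4096, seed=2010):
    """Build `count` line segments with a pressure value at each end point.

    A fixed seed makes every call return the same lines, so repeated
    benchmark runs paint identical strokes and stay comparable.
    """
    rng = random.Random(seed)  # private generator; global random state is untouched
    lines = []
    for _ in range(count):
        start = (rng.uniform(0, canvas_size), rng.uniform(0, canvas_size))
        end = (rng.uniform(0, canvas_size), rng.uniform(0, canvas_size))
        pressures = (rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0))
        lines.append((start, end, pressures))
    return lines


lines = make_random_lines()
print(len(lines), lines == make_random_lines())
```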


benchmark name | walltime | Mb/s
benchmarkStroke | 2,962 msec per iteration (total: 2962, iterations: 1) | N/A
benchmarkRandomLines | 18,576 msec per iteration (total: 18576, iterations: 1) | N/A

First results

Computer specification

Compiler options

gcc -Wnon-virtual-dtor -Wno-long-long -ansi -Wundef -Wcast-align -Wchar-subscripts -Wall -W -Wpointer-arith -Wformat-security -fno-exceptions -DQT_NO_EXCEPTIONS -fno-check-new -fno-common -Woverloaded-virtual -fno-threadsafe-statics -fvisibility=hidden -fvisibility-inlines-hidden -O2 -g -fPIC -Wl,--enable-new-dtags

The CMake configuration has an option called KritaDevs, which is what I used for the benchmarking. The flags above were taken from the output of make VERBOSE=1.

First optimizations

With performance fix + FastMath::atan2

benchmark name | walltime | Mb/s
benchmarkStroke | 650.2 msec per iteration (total: 6503, iterations: 10) | N/A
benchmarkRandomLines | 4,158.8 msec per iteration (total: 41589, iterations: 10) | N/A

Cyrille's tuning commits around lunch

benchmark name | walltime | Mb/s
benchmarkStroke | 533.3 msec per iteration (total: 5334, iterations: 10) | N/A
benchmarkRandomLines | 3,555.5 msec per iteration (total: 35556, iterations: 10) | N/A


Just with performance fix

benchmark name | walltime | Mb/s
benchmarkStroke | 683.7 msec per iteration (total: 6838, iterations: 10) | N/A
benchmarkRandomLines | 4,696.3 msec per iteration (total: 46964, iterations: 10) | N/A

Compute 1/4 for the symmetrical brushes

benchmark name | walltime | Mb/s
benchmarkStroke | 257.3 msec per iteration (total: 2574, iterations: 10) | N/A
benchmarkRandomLines | 1,449.2 msec per iteration (total: 14492, iterations: 10) | N/A
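The idea behind computing 1/4 for symmetrical brushes is that a mask symmetric about both axes only needs one quadrant evaluated; the other three are mirror copies. The Python sketch below shows this for a hard-edged circular mask and checks the result against a full evaluation. It is an illustration under those assumptions, not the Krita implementation, and both function names are hypothetical.

```python
def full_mask(diameter):
    """Reference version: evaluate every pixel of a hard-edged circular mask."""
    centre = (diameter - 1) / 2.0
    radius_sq = (diameter / 2.0) ** 2
    return [
        [1 if (x - centre) ** 2 + (y - centre) ** 2 <= radius_sq else 0
         for x in range(diameter)]
        for y in range(diameter)
    ]


def quarter_mask(diameter):
    """Evaluate only the top-left quadrant and mirror it into the other three."""
    centre = (diameter - 1) / 2.0
    radius_sq = (diameter / 2.0) ** 2
    half = (diameter + 1) // 2  # includes the middle row/column for odd sizes
    last = diameter - 1
    mask = [[0] * diameter for _ in range(diameter)]
    for y in range(half):
        for x in range(half):
            value = 1 if (x - centre) ** 2 + (y - centre) ** 2 <= radius_sq else 0
            mask[y][x] = value                # top-left
            mask[y][last - x] = value         # top-right
            mask[last - y][x] = value         # bottom-left
            mask[last - y][last - x] = value  # bottom-right
    return mask


print(quarter_mask(70) == full_mask(70))
```

For a 70px brush this evaluates 35*35 pixels instead of 70*70, so the per-pixel work drops to a quarter while the mirrored writes stay cheap.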