Benchmarking Krita performance
Tile engine
Data Manager
- Test image dimensions: 4096x4096, RGB
- I ran every test several times and picked the results that recurred most often
- The callgrind backend did not produce callgrind.* files, so I ran valgrind directly; the resulting profiles therefore also include the Qt test library itself
- http://lukast.mediablog.sk/callgrind/DatamanagerBenchmarks.tar.gz
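Many of the Mb/s figures in the tables on this page are consistent with dividing the size of the test image by the wall time of one iteration. The sketch below shows that arithmetic. It assumes 4 bytes per pixel (8-bit RGBA) and binary megabytes (MiB), neither of which is stated in the source; the function names are hypothetical and not part of the Krita benchmarks.

```cpp
#include <cstdint>

// Bytes touched by one pass over the whole image.
// bytesPerPixel = 4 is an assumption (8-bit RGBA); the page only says "RGB".
constexpr std::uint64_t imageBytes(std::uint64_t width, std::uint64_t height,
                                   std::uint64_t bytesPerPixel = 4)
{
    return width * height * bytesPerPixel;
}

// Throughput in MiB/s for one iteration that took walltimeMsec milliseconds.
inline double throughputMiBPerSec(std::uint64_t bytes, double walltimeMsec)
{
    const double mib = static_cast<double>(bytes) / (1024.0 * 1024.0);
    return mib / (walltimeMsec / 1000.0);
}
```

Under these assumptions a 4096x4096 image is 64 MiB, so an iteration of 39.3 msec corresponds to roughly 1628 MiB/s. Rows that do not match this formula were presumably taken from a different run or measure a different amount of data.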
benchmark name | walltime | tickcounter | Mb/s
benchmarkWriteBytes | 38.0 msec per iteration (total: 380, iterations: 10) | 77,528,468.2 ticks per iteration (total: 775284683, iterations: 10) | 1333.3 Mb/s
benchmarkReadBytes | 39.3 msec per iteration (total: 394, iterations: 10) | 77,311,910.2 ticks per iteration (total: 773119103, iterations: 10) | 1628.4 Mb/s
benchmarkReadWriteBytes | 46.2 msec per iteration (total: 462, iterations: 10) | 91,198,881.7 ticks per iteration (total: 911988817, iterations: 10) | 1391.3 Mb/s
benchmarkExtent | 0.00020 msec per iteration (total: 34, iterations: 163840) | 735.0 ticks per iteration (total: 7350, iterations: 10) | N/A
benchmarkClear | 1.3 msec per iteration (total: 26, iterations: 20) | 2,542,070.2 ticks per iteration (total: 25420702, iterations: 10) | N/A
Iterators
Horizontal Iterator
benchmark name | walltime | tickcounter | Mb/s
benchmarkWriteBytes | 1,383.4 msec per iteration (total: 13834, iterations: 10) | 4,389,801,089.3 ticks per iteration (total: 43898010893, iterations: 10) | 46.3 Mb/s
benchmarkReadBytes | 1,443.2 msec per iteration (total: 14433, iterations: 10) | 4,461,418,645.5 ticks per iteration (total: 44614186455, iterations: 10) | 44.4 Mb/s
benchmarkConstReadBytes | 1,380.7 msec per iteration (total: 13808, iterations: 10) | 4,501,257,062.3 ticks per iteration (total: 45012570623, iterations: 10) | 46.3 Mb/s
benchmarkReadWriteBytes | 2,041.7 msec per iteration (total: 20418, iterations: 10) | 5,736,531,494.3 ticks per iteration (total: 57365314943, iterations: 10) | 31.3 Mb/s
benchmarkNoMemCpy | 655.7 msec per iteration (total: 6557, iterations: 10) | 3,025,535,970.6 ticks per iteration (total: 30255359707, iterations: 10) | 97.7 Mb/s
benchmarkConstNoMemCpy | 583.7 msec per iteration (total: 5837, iterations: 10) | 2,889,942,765.8 ticks per iteration (total: 28899427658, iterations: 10) | 109.6 Mb/s
benchmarkTwoIteratorsNoMemCpy | 1,205.7 msec per iteration (total: 12057, iterations: 10) | 3,952,530,421.5 ticks per iteration (total: 39525304215, iterations: 10) | 53.1 Mb/s
Update
state: trunk, 17 Feb 2010, 15:38
benchmark name | walltime | Mb/s
benchmarkWriteBytes | 1,548.0 msec per iteration (total: 15481, iterations: 10) | 41.34 Mb/s
benchmarkReadBytes | 3,087.8 msec per iteration (total: 30878, iterations: 10) | 20.73 Mb/s
benchmarkConstReadBytes | 3,062.0 msec per iteration (total: 30620, iterations: 10) | 20.90 Mb/s
benchmarkReadWriteBytes | 3,725.0 msec per iteration (total: 37251, iterations: 10) | 17.18 Mb/s
benchmarkNoMemCpy | 2,264.4 msec per iteration (total: 22644, iterations: 10) | 28.26 Mb/s
benchmarkConstNoMemCpy | 2,316.8 msec per iteration (total: 23168, iterations: 10) | 27.62 Mb/s
benchmarkTwoIteratorsNoMemCpy | 2,950.0 msec per iteration (total: 29501, iterations: 10) | 21.69 Mb/s
state: caching patch applied to trunk
benchmark name | walltime | Mb/s
benchmarkWriteBytes | 1,211.4 msec per iteration (total: 12114, iterations: 10) | 52.83 Mb/s (speedup 1.28)
benchmarkReadBytes | 1,196.2 msec per iteration (total: 11962, iterations: 10) | 53.50 Mb/s (speedup 2.58)
benchmarkConstReadBytes | 1,202.2 msec per iteration (total: 12022, iterations: 10) | 53.24 Mb/s (speedup 2.55)
benchmarkReadWriteBytes | 1,563.0 msec per iteration (total: 15631, iterations: 10) | 40.95 Mb/s (speedup 2.38)
benchmarkNoMemCpy | 389.1 msec per iteration (total: 3891, iterations: 10) | 164.48 Mb/s (speedup 5.82)
benchmarkConstNoMemCpy | 372.5 msec per iteration (total: 3725, iterations: 10) | 171.81 Mb/s (speedup 6.21)
benchmarkTwoIteratorsNoMemCpy | 670.3 msec per iteration (total: 6704, iterations: 10) | 95.48 Mb/s (speedup 4.4)
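The speedup values quoted for the caching patch can be reproduced by dividing the wall time measured on trunk by the wall time measured with the patch applied. A minimal sketch of that calculation follows; the function name is hypothetical.

```cpp
// Speedup of a patched build over a baseline, from wall times per iteration.
// A value above 1.0 means the patched build is faster.
inline double speedup(double baselineMsec, double patchedMsec)
{
    return baselineMsec / patchedMsec;
}
```

For benchmarkReadBytes, for example, 3,087.8 msec on trunk against 1,196.2 msec with the patch gives a speedup of about 2.58.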
Vertical Iterator
benchmark name | walltime | tickcounter | Mb/s
benchmarkWriteBytes | 1,541.9 msec per iteration (total: 15419, iterations: 10) | Not measured | 41.52 Mb/s
benchmarkReadBytes | 1,534.4 msec per iteration (total: 15344, iterations: 10) | Not measured | 41.7 Mb/s
benchmarkConstReadBytes | 1,460.5 msec per iteration (total: 14606, iterations: 10) | Not measured | 43.82 Mb/s
benchmarkReadWriteBytes | 2,156.3 msec per iteration (total: 21563, iterations: 10) | Not measured | 29.7 Mb/s
benchmarkNoMemCpy | 649.0 msec per iteration (total: 6490, iterations: 10) | Not measured | 98.6 Mb/s
benchmarkConstNoMemCpy | 599.3 msec per iteration (total: 5994, iterations: 10) | Not measured | 106.7 Mb/s
benchmarkTwoIteratorsNoMemCpy | 1,231.5 msec per iteration (total: 12316, iterations: 10) | Not measured | 52 Mb/s
Rectangular Iterator
benchmark name | walltime | Mb/s
benchmarkWriteBytes | 118.2 msec per iteration (total: 1182, iterations: 10) | 541.4 Mb/s
benchmarkReadBytes | 121.7 msec per iteration (total: 1217, iterations: 10) | 525.9 Mb/s
benchmarkConstReadBytes | 120.5 msec per iteration (total: 1205, iterations: 10) | 533.3 Mb/s
benchmarkReadWriteBytes | 167.0 msec per iteration (total: 1670, iterations: 10) | 383.2 Mb/s
benchmarkNoMemCpy | 35.7 msec per iteration (total: 358, iterations: 10) | 1792.7 Mb/s
benchmarkConstNoMemCpy | 37.7 msec per iteration (total: 377, iterations: 10) | 1697.6 Mb/s
benchmarkTwoIteratorsNoMemCpy | 65.2 msec per iteration (total: 652, iterations: 10) | 981.6 Mb/s
Random Iterator
benchmark name | walltime | Mb/s
benchmarkWriteBytes | 1,641.5 msec per iteration (total: 16415, iterations: 10) | 39.0 Mb/s
benchmarkReadBytes | 1,598.5 msec per iteration (total: 15985, iterations: 10) | 40.0 Mb/s
benchmarkConstReadBytes | 1,654.5 msec per iteration (total: 16545, iterations: 10) | 38.68 Mb/s
benchmarkReadWriteBytes | 2,934.8 msec per iteration (total: 29348, iterations: 10) | 21.8 Mb/s
benchmarkNoMemCpy | 971.3 msec per iteration (total: 9714, iterations: 10) | 65.9 Mb/s
benchmarkConstNoMemCpy | 938.6 msec per iteration (total: 9386, iterations: 10) | 68.2 Mb/s
benchmarkTwoIteratorsNoMemCpy | 1,929.7 msec per iteration (total: 19298, iterations: 10) | 33.2 Mb/s
benchmarkTileByTileWrite | 1,310.0 msec per iteration (total: 13101, iterations: 10) | 48.9 Mb/s
benchmarkTotalRandom | 27,999 msec per iteration (total: 27999, iterations: 1) | 2.2 Mb/s
benchmarkTotalRandomConst | 29,124 msec per iteration (total: 29124, iterations: 1) | 2.2 Mb/s
KisPainter
Composition (bitBlt)
benchmark name | walltime | Mb/s
benchmarkBitBlt | 5,456.8 msec per iteration (total: 54569, iterations: 10) | 234.6 Mb/s
benchmarkBitBltSelection | 5,922.8 msec per iteration (total: 59228, iterations: 10) | 216.1 Mb/s
benchmarkFixedBitBlt | 3,635.5 msec per iteration (total: 36356, iterations: 10) | 352.1 Mb/s
benchmarkFixedBitBltSelection | 5,342.1 msec per iteration (total: 53421, iterations: 10) | 239.6 Mb/s
Filters
Brightness/Contrast
benchmark name | walltime | Mb/s
benchmarkFilter | 1,783.5 msec per iteration (total: 17835, iterations: 10) | 14.47 Mb/s
Blur
benchmark name | walltime | Mb/s
benchmarkFilter | 31,674 msec per iteration (total: 31674, iterations: 1) | 0.81 Mb/s
Projection
Everything is benchmarked in one go.
benchmark name | walltime | Mb/s
benchmarkProjection | 834.6 msec per iteration (total: 8346, iterations: 10) | N/A
Painting strokes
- We paint on an empty 4096x4096 paint device
- The brush used is a 70px pixel brush with the autobrush (the default one)
- The benchmark can run with any paintop; only the preset needs to be changed
- The first test paints the stroke you can see in the preview box, at a different scale, on the 4096x4096px image
- The second test paints 20 random lines (the same 20 lines in every run) with pressure varying from 0.0 to 1.0
- http://lukast.mediablog.sk/callgrind/strokeBenchmarks.tar.gz [TODO add bouds result]
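For the random-lines test to be comparable between runs, the "random" lines have to be identical every time. One common way to get that is a pseudo-random generator with a fixed seed, sketched below. This is an illustration only: the `Line` struct, the function name, and the choice to draw an independent pressure for each end point are assumptions, not the code of the actual benchmark.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical description of one benchmark line.
struct Line {
    double x1, y1, x2, y2;              // end points in image coordinates
    double startPressure, endPressure;  // pressure in the range 0.0 to 1.0
};

// Generates `count` lines inside an imageSize x imageSize image.
// The same seed yields the same lines in every run of the same build.
std::vector<Line> makeBenchmarkLines(std::size_t count, double imageSize, unsigned seed)
{
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> coord(0.0, imageSize);
    std::uniform_real_distribution<double> pressure(0.0, 1.0);

    std::vector<Line> lines;
    lines.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        Line l;
        l.x1 = coord(rng);
        l.y1 = coord(rng);
        l.x2 = coord(rng);
        l.y2 = coord(rng);
        l.startPressure = pressure(rng);
        l.endPressure = pressure(rng);
        lines.push_back(l);
    }
    return lines;
}
```

Calling `makeBenchmarkLines(20, 4096.0, 42)` twice returns the same 20 lines, so timing differences between builds come from the painting code and not from the input.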
benchmark name | walltime | Mb/s
benchmarkStroke | 2,962 msec per iteration (total: 2962, iterations: 1) | N/A
benchmarkRandomLines | 18,576 msec per iteration (total: 18576, iterations: 1) | N/A
First results
Computer specification
Compiler options
gcc -Wnon-virtual-dtor -Wno-long-long -ansi -Wundef -Wcast-align -Wchar-subscripts -Wall -W -Wpointer-arith -Wformat-security -fno-exceptions -DQT_NO_EXCEPTIONS -fno-check-new -fno-common -Woverloaded-virtual -fno-threadsafe-statics -fvisibility=hidden -fvisibility-inlines-hidden -O2 -g -fPIC -Wl,--enable-new-dtags
The CMake configuration has an option called KritaDevs, which is what I used for the benchmarking. The compiler flags above were taken from the output of make VERBOSE=1.
First optimizations
With performance fix + FastMath::atan2
benchmark name | walltime | Mb/s
benchmarkStroke | 650.2 msec per iteration (total: 6503, iterations: 10) | N/A
benchmarkRandomLines | 4,158.8 msec per iteration (total: 41589, iterations: 10) | N/A
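The source does not show how FastMath::atan2 is implemented, so the following is not Krita's code. It is a generic sketch of the kind of trade such a function makes: replacing the library arctangent with a cheap polynomial that is accurate to a few thousandths of a radian. The function name and the specific approximation are assumptions chosen for illustration.

```cpp
#include <cmath>

// Approximates std::atan2(y, x) without calling a library arctangent.
// Uses atan(z) ~= z * (pi/4 + 0.273 * (1 - z)) for 0 <= z <= 1 and
// reconstructs the full angle from the signs of x and y.
inline double approxAtan2(double y, double x)
{
    const double pi = 3.14159265358979323846;
    const double ax = std::fabs(x);
    const double ay = std::fabs(y);
    if (ax == 0.0 && ay == 0.0) {
        return 0.0;  // angle is undefined at the origin; return 0 by convention
    }

    double angle;  // angle in the first quadrant, 0 .. pi/2
    if (ax >= ay) {
        const double z = ay / ax;
        angle = z * (pi / 4.0 + 0.273 * (1.0 - z));
    } else {
        const double z = ax / ay;
        angle = pi / 2.0 - z * (pi / 4.0 + 0.273 * (1.0 - z));
    }
    if (x < 0.0) {
        angle = pi - angle;  // mirror into the left half-plane
    }
    if (y < 0.0) {
        angle = -angle;      // mirror into the lower half-plane
    }
    return angle;
}
```

Whether such an error is acceptable depends on what the angle is used for; a brush mask can usually tolerate it, a geometric computation may not.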
Cyrille's tuning commits around lunch
benchmark name | walltime | Mb/s
benchmarkStroke | 533.3 msec per iteration (total: 5334, iterations: 10) | N/A
benchmarkRandomLines | 3,555.5 msec per iteration (total: 35556, iterations: 10) | N/A
Just with performance fix
benchmark name | walltime | Mb/s
benchmarkStroke | 683.7 msec per iteration (total: 6838, iterations: 10) | N/A
benchmarkRandomLines | 4,696.3 msec per iteration (total: 46964, iterations: 10) | N/A
Compute 1/4 for the symmetrical brushes
benchmark name | walltime | Mb/s
benchmarkStroke | 257.3 msec per iteration (total: 2574, iterations: 10) | N/A
benchmarkRandomLines | 1,449.2 msec per iteration (total: 14492, iterations: 10) | N/A
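The idea behind computing only 1/4 of a symmetrical brush is that a mask which is mirror-symmetric about both axes needs its value function evaluated for one quadrant only; the other three quadrants are copies. The sketch below illustrates this with a simple circular falloff. The falloff function, the names, and the flat `std::vector` layout are hypothetical and stand in for whatever the real brush code uses.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical mask function: linear falloff from 1.0 at the centre
// to 0.0 at `radius`. It depends only on the distance from the centre.
inline double maskValue(double dx, double dy, double radius)
{
    const double d = std::sqrt(dx * dx + dy * dy) / radius;
    return d >= 1.0 ? 0.0 : 1.0 - d;
}

// Builds a size x size mask, evaluating maskValue() for the top-left
// quadrant only and mirroring each value into the other three quadrants.
std::vector<double> makeSymmetricMask(std::size_t size)
{
    std::vector<double> mask(size * size, 0.0);
    const double center = (static_cast<double>(size) - 1.0) / 2.0;
    const double radius = static_cast<double>(size) / 2.0;
    const std::size_t half = (size + 1) / 2;  // includes the middle row/column for odd sizes

    for (std::size_t y = 0; y < half; ++y) {
        for (std::size_t x = 0; x < half; ++x) {
            const double v = maskValue(x - center, y - center, radius);
            const std::size_t mx = size - 1 - x;  // mirrored column
            const std::size_t my = size - 1 - y;  // mirrored row
            mask[y * size + x] = v;
            mask[y * size + mx] = v;
            mask[my * size + x] = v;
            mask[my * size + mx] = v;
        }
    }
    return mask;
}
```

For a 70px brush this evaluates the mask function 1,225 times instead of 4,900, which is in line with the large drop in stroke time reported above, although the real saving depends on how expensive the mask function is relative to the rest of the stroke.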