GSoC/2018/StatusReports/ivanyossiIván
Optimize Krita Soft, Gaussian and Stamp brushes mask generation to use AVX with Vc Library
Summary
- Project Name: Optimize Krita Soft, Gaussian and Stamp brushes mask generation to use AVX with Vc Library
- Proposal: View Proposal
- Abstract: Digital painting app relies on quick painting response to give a natural experience. A painted line is composed of thousands of images, called dabs, placed one after the other, each dab is masked to generate a different brush tip shape. As mask shapes are more complex and bigger, rendering them can be costly and painting becomes laggy. This project seeks to minimize the time spent generating the mask by implementing the generator using AVX instructions sets. Vc library is used to interface with the SIMD operations. Testing suggest the speed gains can be up to 10 times faster which improves the workflow using big brushes or complex multibrushes.
- Final merged Code differential
Project Goals
Implement Mask AVX optimization (Mask Type / Status, task) Tasks
- Circular Gauss implemented, merged T8734
- Circular Soft implemented, merged T8868
- Rectangular Gaussian implemented, merged T9010
- Rectangular Soft implemented, merged T9133
- Added' Default Rectangular implemented, merged T9344
- Stamp Mask progressively implementing T9346
Code summaries
- 54 Commits merged
- Work done Differential (code made until June 12, 2018)
- Final merged Code differential
Implementations Status
Status report on each goal implementation.
Unit Test: Similarity test
Goal: Test the current mask generators produce the same mask representation.
This unit test makes sure the masks generated are equal to the dab shape stated by the Krita community. The mask shape equality ensures consistency between versions and every implementation needs to adhere to the shape accepted (unless a new definition is decided upon).
Current Status Current test verifies the equality between the old engine and new AVX vectorized engine. The similarity is adjusted such as no pixel is allowed to be different by more than a brightness value of 2 (in RGB 8-but space).
TODO Simplify code. The test checks the mask generated from the scalar an vector method are equal, but it doesn't check the mask generated is consistent with the expected Mask (the one defined by Krita).
Challenges Mask shape has many variants that affect size, ratio, fade and antialias. Each of this operations work in tandem but in some situations input variants won't alter result, or need to be tested separately. The test needs to include as many variants in as few shapes as possible.
Related blog posts
Commits and Differentials
- R37:8fa826838aa9: Adjust similarity Tolerance
- R37:5f60267ccd80: Modify similiarity test to try more mask variation
- R37:08efed86d2bb: Added CircleGauss to SimilarityTes
- R37:9963768b392b: Correctly compare images alpha channel by setting fuzzy alpha and tolerance
- R37:d67ecdd905ad: Adhere code to coding style more strictly
- R37:db1ebe824c92: KisBrushMaskSimilarityTest:
Circular Gauss
Goal: Implement Circular Gauss vectorized Mask generator using Vc
Gaussian mask generator uses a Gauss function to control the fade of the mask shape. Because of that is the slowest of all mask generators, since it calls the math erf() function twice on each pixel. The erf number can be approximated in a number of ways, original implementation does this using standard double precision erf() on each pixel, making it very slow.
Current Status: Implemented and added merge to master. Released in Krita 4.1. Mask generation is 10 times more faster to render. All tests pass which proves both scalar and vectorized implementation are identical. Code profiled, no bottle necks or code issues found. Feature work 100%
Challenges Gaussian depends in the correct erf() values generation, but no such function existed for the vectorized data type of Vc. Implement a correct and quick vectorized erf() using single precision float was the biggest issue. The standard erf() not only works in double precision but it also makes different operations depending on the input value. The implemented vcErf() takes into account that any value it will receive is between zero and 255. Working with cases we replicated the precision needed to replicate the original Scalar implementation.
Related blog posts
Commits and Differentials
- 37:b55ed74ac98b: FIX: Gauss Circular Mask Antialiasing
- 37:45cf521214b5: FIX: Float precision bug masking issues for vectorized GaussMask generator
- 37:8dc950e705ee: FIX: Gauss Circular Mask Antialiasing
- 37:b395b05ef54d: FIX: Float precision bug masking issues for vectorized GaussMask generator
- R37:daac6985670c: FIX: Missing Antialias on Vectorized Circular Gauss
- R37:df4cb29add28: FIX: Missing Antialias on Vectorized Circular Gauss
- R37:884dcc104e3a: FIX: Missing Antialias on Vectorized Circular Gauss
- R37:08efed86d2bb: Added CircleGauss to SimilarityTest
- R37:37effe636a30: ADD: Vectorized CircularGaussMask, UnitTestPAssing
- R37:a9b6c3a4eb36: ref T8734
Differentials
Circular Soft
Goal: Implement Circular Soft vectorized Mask generator using Vc
Soft Generator creates a Mask based on curve values. The curve itself is generated elsewhere using the initial values on the mask generator. The curve is defined by a list of points in which 0 < x < 1 and 0 < y < 1. Fade generation uses the same object as the Circular Gauss
Current Status: Implemented and merged to master. Mask generation improved by 5 times, the change is not as drastic as the Gauss version, this is because the scalar implementation was not as heavy dependent on math operations but on memory read. All tests variants pass. Profiling code shows no time consumer. Feature set is implemented in full.
Challenges Soft Mask values are determined by a curve represented as a Vector of gray values. Each value index position corresponds to the distance to the center of the Mask. For a Scalar approach getting value one by one using an index is something trivial. On Vc however the values needs to be in an array next to the other to allow for the best optimization. Getting the space values from the vector into the Vc SIMD array was the main problem to solve. Luckily there was no need for in house implementation as Vc has a method to gather indexes from different regions of an array into the Vc Array. Using this method and passing the data pointer of the vector allowed to access the curve values efficiently.
Related blog posts
Commits and Differentials
- R37:ae2f0e5cdaa1: Adjust format and on CircSoft Mask FastRow
- R37:f6182887b9b5: Modify maksBenchmark to create identical Soft Masks
- R37:e8de81d0db26: - Soft Circular vectorized brush mask Add missing antialias modification for
- R37:dfae36961a09: NEW: Implement Vectorized Soft Brush Mask Generator.
Differentials
Rectangular Gauss
Goal: Implement Rectangular Gauss vectorized Mask generator using Vc
Current Status: Implemented and merged to master. Speed up of 10x, mask generator reuses vcErf() implemented for Circular Gauss. Vectorize and Scalar implementation give the same result and profiling show no bottleneck
Challenges Rectangular shapes use a 2D fader to determine the gray value a the given coordinate. This value is calculated in a relation between x and y coordinates. The 2D fader made different operations depending on the x and y values received and returned the results at different moments in the function. Adapting this to use masks on vectors was the main challenge. Also since any operation done would be done for all values, the new implementation was made such that the hard double vcErf() operation was done only once per iteration.
Related blog posts
- No related blogposts
Commits and Differentials
- R37:461af3f9a957: Minor code clean up Set a bigger size for generated mask rect
- R37:55067015ca49: Include Gaussian Rectangular in FreeStrokeBenchMark
- R37:e36538da1a45: Optimize Rectangular Gauss Mask
Differentials
Rectangular Soft
Goal: Implement Rectangular Soft vectorized Mask generator using Vc
Current Status: Implemented and merged to master. Speed gains were as good as 4x, since most time is spent getting values from memory. Current implementation generates the same output as the scalar version.
Challenges Biggest challenge was to get the best performance from Vc initializations to avoid adding time to the slow memory access time. Original code used casting to get the integer part of the float value, at first this casting made the Vc code slow, but switching to internal VcIndexes made the casting super fast.
Related blog posts
- No related blogposts
Commits and Differentials
- R37:728cec98fba8: Adjust Spacing of auto_soft_rect.kpp test preset to be 0.1
- R37:21afc59cea14: Use Vc Indexes instead of custom SimdArray for integer casting
- R37:b8f2080917a3: Reduce casting on Vector Indexes creation
- R37:2565a74bdc9c: Vectorize Soft Rect Mask Generator
Differentials
Rectangular Default
Added to project goals as it was very similar to the other work done and not very time consuming Goal: Implement Rectangular Default vectorized Mask generator using Vc
Current Status: Implemented and merged to master. Speed is almost as fast the fastest Vc implementation, the Default Circular. Mask generated is equal to the original scalar version and profiling shows no bottlenecks
Challenges As usual with math dependent implementations, the difference in precision made the final values vary by more than the threshold in similarity test. This difference also made a bug surface much more often with single precision, fixing the bug and reducing the epsilon used for integer conversion allowed the masks to be equal. Default Rect does not uses a 2D fader which made the implementation a bit tricky to mimic the scalar version logic.
Related blog posts
- No related blogposts
Commits and Differentials
- R37:0a7d9b113791: Add Default Rectangular to FreehandStrokeBenchmark
- R37:ff83ebb73821: Reduce difference gap of Default Rect Mask Vector impl
- R37:da249312e649: Implement Vc Optimization for Default Rect Mask
Differentials
Stamp Mask
Goal: Optimize Predefined Mask generator using vectorized operations with Vc
Current Status: Identified current bottleneck: qpainter.drawImage method is not very fast. Corrently code is reorganized to avoid repetitions on highly iterative sections. Making current QImage transform fron QImage >> QImage to a QImage >> KisFixedPaintDevice.
'TODO Finish implementation of Qimage transform to KisFixedPaintDevice and make the code use the new code. Port the new transform code to Vc operations for fast processing.
Challenges Base code is completely different than auto_brush code used in previous optimizations. First challenge would be identifying the slow parts of the code, then understanding how eachp part of the code works to generate the Mask. With the the help of my mentor we discuss a plan to implement the optimization progressively in steps. Biggest challenge will be to speed up the image transformation beyond what is currently doing Qt.
Related blog posts
- No related blogposts
Commits and Differentials
GSoC Work report chronicle
First week, during community bonding, I read the documentation and made a first proposal for the Unit test to be used in the implementation process. This Unit test has to compare the new mask shape and the legacy one and assert they are similar with a certain error. Unit test works ok, but it is not as isolated as needed and possibly other brush preparations used could interfere with the brush mask testing.
On the following weeks and previous to the coding phase I started to be more on IRC and the forums and help out the users I could. I began reading more about Vc and Intel AVX and started to make a small map of the code about brush masks to know exactly what was going on. A second version of the unit test was made, this time we went deeper into the code and managed the Masks directly from the pointer data of KisMaskGenerator.
Coding phase
I spend the first week of coding phase working on understanding how to implement a fully featured Circular Gaussian. I get into the problem of implementing an in house erf for vectorize operations. Once this implementation was passing the test I made a quick painting test and run the FreehandStrokeBenchmark to see if there was more speed gain than with the first dummy implementation. The new implementation was super fast.
Second week my mentor asked me to create a BenchMark specifically for the MaskGeneration, the idea behind this is to have even more evidence that we are getting much better performance from the new vectorized version. The benchmark did not take long to implement and testings confirm the speed gains seen on the other test. I sent the code for review and it was suggested I merged it.
Third and Fourth week SoftBrush implementation was born and during the tests and feature implementation I realized there was some features missing from the Gauss implementation. The feature in question was the antialiasing. I ported the antialias code from the Soft Mask to Gauss Mask (since both use the same logic in the scalar version), and while testing I discovered that with some softness and fading values Gauss Mask failed. The image confirmed the mask was not coming out properly. I spent the next two day finding the root cause and fixing the bug, caused by float imprecision and one bad guard condition. The fixes also applied to Soft Brush and we finished initial feature complete implementation. I did not sent for review yet as I wanted to do much more in deep testing and optimization first.
Also we used this time to help out a little with the new documentation platform. I specifically sent two proposals to help in the automatization of the LaTeX version of the manual D13205, D13204. this should make easier to deploy the PDF version of the documentation when its needed.
5ª and 6º week Circular Soft was sent for revision and work on the Rectangular Gauss began, this show the necessity of implementing a new static class with all Vc extra operation needed by both Gaussian implementations to avoid repeating code. My mentor pointed out that my current 1D Fader implementation was not very good as it made private members public (trough methods) with defeated the purpose of encapsulation. A refactored 1D Fader was made and with those lessons learned I proceeded to make the Vc implementation for the 2D Fader. The finished implementation had a missing file so review process was delayed. In the end the revision found a regression on smaller brushes, which took me a few days to figure out as in the beginning I thought it was due to my code, but turned out to be that I turned on supersampling, mainly used for Default brush.
7º and 8º week. Used this week to fix bugs and improve the tests, but since I was doing a good advance I started working on Rectangular Soft which was pushed to my branch one week later. The implementation took longer to include all tests cases. And also because i was testing other options to generate the square image. And another full week to finish completely all tests on how to make it faster (this included making a version that used stack instead of heap, however it didn't prove fruitful).
Coding for GSoC had to be put aside in this lapse since my university calendar is wrapping up the trimester around the end of July, and as part of my final project I had to finish a system and run a heavy simulation. Attempts to run it on ubuntu prove unsuccessful so I had to use the main computer to run the simulations which took more than the 3 days anticipated, up to a week and a half. Running two simulations at a time produced errors and made the main computer unavailable for any type of work. In the end it took almost two weeks because I reduced the simulation running to one at a time at day 5. This let me work again on GSoC, but limiting any recompile to one core.
In the end the slowdown because of simulation suffered a little more with the write up of the project report (a small thesis). however I made it work by coding for GSoC half a day and writing the other half, allowing me to finish the project in time. :].
Back to coding! But barely we have time left to code!
Final weeks With two weeks to go my mentor and I traced a plan, for the remaining mask optimization. however since Rectangular Default used the same auto_brush framework we decided I should optimize it as well and then work on the Stamp mask (or officialy: predefined mask). I thought estimated the optimization of Rectangular default would take me at most 3 days, but as with everything, the simplest looking problem turned out to have hidden difficulties. In the end it took me an entire week to fix all quirks in the code and make it render the same as the original version. Also I made an alternate render of this mask a little more softer in the corners, but I haven't send this to evaluation to inclusion. All in all that lost week left me with one week to understand a completely new mask generator code for the predefined masks. In that week I was able to understand the code and start planing an implementation route. At the end of that week my mentor helped me to trace a concrete route to start incorporating optimizations to the code to eventually make the Vc vectorization.
Things left to do
The project finished almost all objectives and those finished are already merged into the master repository an ready to land on Krita 4.2. The missing generator pal is described below:
- Predefined mask generator is under refactor to start using an in-house implementation of the drawImage method of QPainter to work directly with the Krita paint device and Vc to optimize it. Once the scalar implementation works the same as the QPainter one, the idea is to create a vectorized version to speed all math operations doing what is now a two step operation into one.