From KDE Community Wiki
< GSoC‎ | 2018‎ | StatusReports

Optimize Krita Soft, Gaussian and Stamp brushes mask generation to use AVX with Vc Library


  • Project Name: Optimize Krita Soft, Gaussian and Stamp brushes mask generation to use AVX with Vc Library
  • Proposal: View Proposal
  • Abstract: Digital painting app relies on quick painting response to give a natural experience. A painted line is composed of thousands of images, called dabs, placed one after the other, each dab is masked to generate a different brush tip shape. As mask shapes are more complex and bigger, rendering them can be costly and painting becomes laggy. This project seeks to minimize the time spent generating the mask by implementing the generator using AVX instructions sets. Vc library is used to interface with the SIMD operations. Testing suggest the speed gains can be up to 10 times faster which improves the workflow using big brushes or complex multibrushes.
  • Final merged Code differential

Project Goals

Implement Mask AVX optimization (Mask Type / Status, task) Tasks

  • Circular Gauss implemented, merged T8734
  • Circular Soft implemented, merged T8868
  • Rectangular Gaussian implemented, merged T9010
  • Rectangular Soft implemented, merged T9133
  • Default Rectangular implemented, merged T9344 (Added later to plan)
  • Stamp Mask progressively implementing T9346

Project related links

Code summaries

Implementations Status

Status report on each goal implementation.

Unit Test: Similarity test

Goal: Test the current mask generators produce the same mask representation.

This unit test makes sure the masks generated are equal to the dab shape stated by the Krita community. The mask shape equality ensures consistency between versions and every implementation needs to adhere to the shape accepted (unless a new definition is decided upon).

Current Status Current test verifies the equality between the old engine and new AVX vectorized engine. The similarity is adjusted such as no pixel is allowed to be different by more than a brightness value of 2 (in RGB 8-but space).

TODO Simplify code. The test checks the mask generated from the scalar an vector method are equal, but it doesn't check the mask generated is consistent with the expected Mask (the one defined by Krita).

Challenges Mask shape has many variants that affect size, ratio, fade and antialias. Each of this operations work in tandem but in some situations input variants won't alter result, or need to be tested separately. The test needs to include as many variants in as few shapes as possible.

Related blog posts

Commits and Differentials

Phabricator Task T8581

Circular Gauss

Goal: Implement Circular Gauss vectorized Mask generator using Vc

Gaussian mask generator uses a Gauss function to control the fade of the mask shape. Because of that is the slowest of all mask generators, since it calls the math erf() function twice on each pixel. The erf number can be approximated in a number of ways, original implementation does this using standard double precision erf() on each pixel, making it very slow.

Current Status: Implemented and added merge to master. Released in Krita 4.1. Mask generation is 10 times more faster to render. All tests pass which proves both scalar and vectorized implementation are identical. Code profiled, no bottle necks or code issues found. Feature work 100%

Challenges Gaussian depends in the correct erf() values generation, but no such function existed for the vectorized data type of Vc. Implement a correct and quick vectorized erf() using single precision float was the biggest issue. The standard erf() not only works in double precision but it also makes different operations depending on the input value. The implemented vcErf() takes into account that any value it will receive is between zero and 255. Working with cases we replicated the precision needed to replicate the original Scalar implementation.

Related blog posts

Commits and Differentials

Phabricator Task T8734


Circular Soft

Goal: Implement Circular Soft vectorized Mask generator using Vc

Soft Generator creates a Mask based on curve values. The curve itself is generated elsewhere using the initial values on the mask generator. The curve is defined by a list of points in which 0 < x < 1 and 0 < y < 1. Fade generation uses the same object as the Circular Gauss

Current Status: Implemented and merged to master. Mask generation improved by 5 times, the change is not as drastic as the Gauss version, this is because the scalar implementation was not as heavy dependent on math operations but on memory read. All tests variants pass. Profiling code shows no time consumer. Feature set is implemented in full.

Challenges Soft Mask values are determined by a curve represented as a Vector of gray values. Each value index position corresponds to the distance to the center of the Mask. For a Scalar approach getting value one by one using an index is something trivial. On Vc however the values needs to be in an array next to the other to allow for the best optimization. Getting the space values from the vector into the Vc SIMD array was the main problem to solve. Luckily there was no need for in house implementation as Vc has a method to gather indexes from different regions of an array into the Vc Array. Using this method and passing the data pointer of the vector allowed to access the curve values efficiently.

Related blog posts

Commits and Differentials

Phabricator Task T8868


Rectangular Gauss

Goal: Implement Rectangular Gauss vectorized Mask generator using Vc

Current Status: Implemented and merged to master. Speed up of 10x, mask generator reuses vcErf() implemented for Circular Gauss. Vectorize and Scalar implementation give the same result and profiling show no bottleneck

Challenges Rectangular shapes use a 2D fader to determine the gray value a the given coordinate. This value is calculated in a relation between x and y coordinates. The 2D fader made different operations depending on the x and y values received and returned the results at different moments in the function. Adapting this to use masks on vectors was the main challenge. Also since any operation done would be done for all values, the new implementation was made such that the hard double vcErf() operation was done only once per iteration.

Related blog posts

  • No related blogposts

Commits and Differentials

Phabricator Task T9010


Rectangular Soft

Goal: Implement Rectangular Soft vectorized Mask generator using Vc

Current Status: Implemented and merged to master. Speed gains were as good as 4x, since most time is spent getting values from memory. Current implementation generates the same output as the scalar version.

Challenges Biggest challenge was to get the best performance from Vc initializations to avoid adding time to the slow memory access time. Original code used casting to get the integer part of the float value, at first this casting made the Vc code slow, but switching to internal VcIndexes made the casting super fast.

Related blog posts

  • No related blogposts

Commits and Differentials

Phabricator Task T9133


Rectangular Default

Added to project goals as it was very similar to the other work done and not very time consuming Goal: Implement Rectangular Default vectorized Mask generator using Vc

Current Status: Implemented and merged to master. Speed is almost as fast the fastest Vc implementation, the Default Circular. Mask generated is equal to the original scalar version and profiling shows no bottlenecks

Challenges As usual with math dependent implementations, the difference in precision made the final values vary by more than the threshold in similarity test. This difference also made a bug surface much more often with single precision, fixing the bug and reducing the epsilon used for integer conversion allowed the masks to be equal. Default Rect does not uses a 2D fader which made the implementation a bit tricky to mimic the scalar version logic.

Related blog posts

  • No related blogposts

Commits and Differentials

Phabricator Task T9344


Stamp Mask

Goal: Optimize Predefined Mask generator using vectorized operations with Vc

Current Status: Identified current bottleneck: qpainter.drawImage method is not very fast. Corrently code is reorganized to avoid repetitions on highly iterative sections. Making current QImage transform fron QImage >> QImage to a QImage >> KisFixedPaintDevice.

'TODO Finish implementation of Qimage transform to KisFixedPaintDevice and make the code use the new code. Port the new transform code to Vc operations for fast processing.

Challenges Base code is completely different than auto_brush code used in previous optimizations. First challenge would be identifying the slow parts of the code, then understanding how eachp part of the code works to generate the Mask. With the the help of my mentor we discuss a plan to implement the optimization progressively in steps. Biggest challenge will be to speed up the image transformation beyond what is currently doing Qt.

Related blog posts

  • No related blogposts

Commits and Differentials

Phabricator Task T9346

GSoC Work report chronicle

First week, during community bonding, I read the documentation and made a first proposal for the Unit test to be used in the implementation process. This Unit test has to compare the new mask shape and the legacy one and assert they are similar with a certain error. Unit test works ok, but it is not as isolated as needed and possibly other brush preparations used could interfere with the brush mask testing.

On the following weeks and previous to the coding phase I started to be more on IRC and the forums and help out the users I could. I began reading more about Vc and Intel AVX and started to make a small map of the code about brush masks to know exactly what was going on. A second version of the unit test was made, this time we went deeper into the code and managed the Masks directly from the pointer data of KisMaskGenerator.

Coding phase

I spend the first week of coding phase working on understanding how to implement a fully featured Circular Gaussian. I get into the problem of implementing an in house erf for vectorize operations. Once this implementation was passing the test I made a quick painting test and run the FreehandStrokeBenchmark to see if there was more speed gain than with the first dummy implementation. The new implementation was super fast.

Second week my mentor asked me to create a BenchMark specifically for the MaskGeneration, the idea behind this is to have even more evidence that we are getting much better performance from the new vectorized version. The benchmark did not take long to implement and testings confirm the speed gains seen on the other test. I sent the code for review and it was suggested I merged it.

Third and Fourth week SoftBrush implementation was born and during the tests and feature implementation I realized there was some features missing from the Gauss implementation. The feature in question was the antialiasing. I ported the antialias code from the Soft Mask to Gauss Mask (since both use the same logic in the scalar version), and while testing I discovered that with some softness and fading values Gauss Mask failed. The image confirmed the mask was not coming out properly. I spent the next two day finding the root cause and fixing the bug, caused by float imprecision and one bad guard condition. The fixes also applied to Soft Brush and we finished initial feature complete implementation. I did not sent for review yet as I wanted to do much more in deep testing and optimization first.

Also we used this time to help out a little with the new documentation platform. I specifically sent two proposals to help in the automatization of the LaTeX version of the manual D13205, D13204. this should make easier to deploy the PDF version of the documentation when its needed.

5ª and 6º week Circular Soft was sent for revision and work on the Rectangular Gauss began, this show the necessity of implementing a new static class with all Vc extra operation needed by both Gaussian implementations to avoid repeating code. My mentor pointed out that my current 1D Fader implementation was not very good as it made private members public (trough methods) with defeated the purpose of encapsulation. A refactored 1D Fader was made and with those lessons learned I proceeded to make the Vc implementation for the 2D Fader. The finished implementation had a missing file so review process was delayed. In the end the revision found a regression on smaller brushes, which took me a few days to figure out as in the beginning I thought it was due to my code, but turned out to be that I turned on supersampling, mainly used for Default brush.

7º and 8º week. Used this week to fix bugs and improve the tests, but since I was doing a good advance I started working on Rectangular Soft which was pushed to my branch one week later. The implementation took longer to include all tests cases. And also because i was testing other options to generate the square image. And another full week to finish completely all tests on how to make it faster (this included making a version that used stack instead of heap, however it didn't prove fruitful).

Coding for GSoC had to be put aside in this lapse since my university calendar is wrapping up the trimester around the end of July, and as part of my final project I had to finish a system and run a heavy simulation. Attempts to run it on ubuntu prove unsuccessful so I had to use the main computer to run the simulations which took more than the 3 days anticipated, up to a week and a half. Running two simulations at a time produced errors and made the main computer unavailable for any type of work. In the end it took almost two weeks because I reduced the simulation running to one at a time at day 5. This let me work again on GSoC, but limiting any recompile to one core.

In the end the slowdown because of simulation suffered a little more with the write up of the project report (a small thesis). however I made it work by coding for GSoC half a day and writing the other half, allowing me to finish the project in time. :].

Back to coding! But barely we have time left to code!

Final weeks With two weeks to go my mentor and I traced a plan, for the remaining mask optimization. however since Rectangular Default used the same auto_brush framework we decided I should optimize it as well and then work on the Stamp mask (or officialy: predefined mask). I thought estimated the optimization of Rectangular default would take me at most 3 days, but as with everything, the simplest looking problem turned out to have hidden difficulties. In the end it took me an entire week to fix all quirks in the code and make it render the same as the original version. Also I made an alternate render of this mask a little more softer in the corners, but I haven't send this to evaluation to inclusion. All in all that lost week left me with one week to understand a completely new mask generator code for the predefined masks. In that week I was able to understand the code and start planing an implementation route. At the end of that week my mentor helped me to trace a concrete route to start incorporating optimizations to the code to eventually make the Vc vectorization.

Things left to do

The project finished almost all objectives and those finished are already merged into the master repository an ready to land on Krita 4.2. The missing generator mask implementation path is described below:

  • Predefined mask generator is under refactor to start using an in-house implementation of the drawImage method of QPainter to work directly with the Krita paint device and Vc to optimize it. Once the scalar implementation works the same as the QPainter one, the idea is to create a vectorized version to speed all math operations doing what is now a two step operation into one.