This article is about working on KDevelop-PG-Qt itself; for an introduction to KDevelop-PG-Qt see Development/KDevelop-PG-Qt Introduction. It may also be useful for advanced KDevelop-PG-Qt users.
The idea behind the generated parser is called "recursive descent": there is exactly one function for every symbol. The parser tries to replace the start symbol with other symbols; the process ends when a terminal used in a rule is reached. It is a kind of depth-first search: for every sub-symbol, the parser invokes the corresponding function. More precisely, KDevelop-PG-Qt generates an LL(1) parser, which means it looks ahead exactly one token to decide which sub-symbol to choose.
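The scheme can be sketched in a few lines. This is not generated KDevelop-PG-Qt code; it is a minimal hand-written illustration for an invented toy grammar (expr → term ('+' term)*, term → NUMBER | '(' expr ')'), showing the "one function per symbol" structure and the single token of lookahead:

```cpp
// Illustrative sketch (not generated code): an LL(1) recursive-descent
// parser for the toy grammar
//   expr -> term ( '+' term )*
//   term -> NUMBER | '(' expr ')'
#include <cassert>
#include <cctype>
#include <string>

struct Parser {
    std::string input;
    size_t pos = 0;

    char peek() const { return pos < input.size() ? input[pos] : '\0'; }

    int parseExpr() {               // one function per symbol
        int value = parseTerm();
        while (peek() == '+') {     // decide with exactly one token of lookahead
            ++pos;
            value += parseTerm();
        }
        return value;
    }

    int parseTerm() {
        if (peek() == '(') {
            ++pos;
            int value = parseExpr();
            assert(peek() == ')');  // expect the matching parenthesis
            ++pos;
            return value;
        }
        int value = 0;
        while (std::isdigit(static_cast<unsigned char>(peek())))
            value = value * 10 + (input[pos++] - '0');
        return value;
    }
};
```

A real generated parser builds AST nodes instead of evaluating, but the call structure is the same: each nonterminal becomes a function, and `peek()` plays the role of the one-token lookahead.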
A strength of KDevelop-PG-Qt is the possibility to "inject" hand-written code nearly everywhere. That is why developers of KDevelop-PG-Qt should keep the following aspects in mind:
There is also another aspect: grammar constructs are better than C++ constructs, so feel free to replace common C++ code with KDevelop-PG-Qt constructs!
KDevelop-PG-Qt's generated sources follow the visitor pattern. That is the central design goal: it should be possible to visit the parse tree more than once, and it should be simple to implement visitors.
Recursive descent parsers have a serious problem. Consider operators in C++: there are 18 different precedence levels (from "," to "::"). An RDP would theoretically need 18 different symbols (from "CommaExpression" to "Name"; in practice there are tricks to reduce this for C++), and every expression would be parsed by invoking 18 functions down to the terminal, even if the expression is a plain numeric literal. But that is perhaps the smaller problem: the simple number would also need 18 AST nodes. Each node needs a type identifier, the start position, the end position, a pointer to the list containing its one and only child, and the ListNode itself; on a 64-bit system that is 48 bytes per node. The PHP plugin additionally introduces a DUContext pointer, which means 56 bytes per node, and that is a lot ;). But there is an alternative: walk through the tokens and construct the tree from the bottom up, attaching each token to the tree in the right place. The tree grows, but the leaves stay leaves, which is why this is called "bottom-up parsing". It works well, but only very few people have tested it. Months after writing it I noticed that the algorithm used is also known as the "shunting-yard algorithm".
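The bottom-up alternative described above can be sketched as follows. This is a minimal illustration of the shunting-yard idea with an invented three-operator precedence table; to stay short it evaluates the expression directly instead of building an AST, but the operator and operand stacks are exactly where a parser would attach tree nodes:

```cpp
// Illustrative shunting-yard sketch: tokens are consumed left to right
// and operators are applied bottom-up by precedence, so a plain numeric
// literal never needs one node (or one function call) per precedence level.
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

int precedence(char op) { return op == '*' ? 2 : 1; }  // '+'/'-' bind weaker than '*'

int evalInfix(const std::string& s) {
    std::vector<long long> values;  // operand stack
    std::vector<char> ops;          // operator stack
    auto apply = [&] {              // pop one operator and combine the top operands
        char op = ops.back(); ops.pop_back();
        long long b = values.back(); values.pop_back();
        long long a = values.back(); values.pop_back();
        values.push_back(op == '+' ? a + b : op == '-' ? a - b : a * b);
    };
    for (size_t i = 0; i < s.size(); ) {
        if (std::isdigit(static_cast<unsigned char>(s[i]))) {
            long long v = 0;
            while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i])))
                v = v * 10 + (s[i++] - '0');
            values.push_back(v);
        } else {
            // apply pending operators of higher or equal precedence first
            while (!ops.empty() && precedence(ops.back()) >= precedence(s[i]))
                apply();
            ops.push_back(s[i++]);
        }
    }
    while (!ops.empty()) apply();
    return static_cast<int>(values.back());
}
```

Replacing the arithmetic in `apply` with node construction yields the tree-building variant the text describes.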
That does not mean that parsers should use signals and slots, or that templates are forbidden; but KDevelop-PG-Qt was designed to integrate easily with Qt, so it should use Qt's classes like QString and QList.
Use create<Model::XYZItem>() to allocate an AST node. The BNFVisitor uses some custom machinery for better performance: the same memory pools as in the generated parsers are used. There is a special function template, nodeCast; it behaves like dynamic_cast and qobject_cast, but it should only be used for pointers to AST nodes.
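To make the nodeCast idea concrete, here is a standalone analogue. The node classes and the kind enum below are invented for the example, and the real nodeCast in kdevelop-pg-qt may differ in detail; the point is only the behavior the text describes: like dynamic_cast or qobject_cast, a mismatched cast yields a null pointer:

```cpp
// Illustrative standalone analogue of a nodeCast-style checked cast.
// The AST node types and kind values here are hypothetical.
#include <cassert>

struct AstNode { int kind; };                    // every node stores its kind
struct ExprAst : AstNode { enum { Kind = 1 }; }; // invented example node class
struct StmtAst : AstNode { enum { Kind = 2 }; }; // invented example node class

template <typename T>
T* nodeCast(AstNode* node) {
    // succeed only when the stored kind matches the target type's kind
    return (node && node->kind == T::Kind) ? static_cast<T*>(node) : nullptr;
}
```

A kind-field check like this is cheaper than RTTI-based dynamic_cast, which matters when a visitor touches every node of a large parse tree.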
In kdevelop-pg-qt/tests you can find tests and benchmarks for the classes located in kdevelop-pg-qt/include; these are the only test cases properly integrated with CMake/CTest, so you can simply run `make test`. In kdevelop-pg-qt/examples there are three basic examples for parsing (more or less) real programming languages. kdevelop-pg-qt/kdev-pg/tests contains test cases for errors and warnings; when changing anything related to the different kinds of problems, you should run test.sh to ensure that everything is still correct. Additionally, there are two examples, Op and Op2, for parsing operator expressions. Op builds a parse tree for more or less classical expressions. Op2 tries to be very complex and cover many features of KDevelop-PG-Qt; in the inputs for Op2 you can configure the syntax for operator expressions immediately before testing an expression.
We plan to make a 1.0 release in the next few weeks, but we do not know more than that. We will also move to git within the next few weeks. Ask apaku about git or release questions; ask The_User if you want to contribute.
This would probably require the macros mentioned below.
How to do it: for each conflict, try to resolve it by increasing k and computing the new FIRST and FOLLOW sets until a limit is reached. Maybe non-LL(k) grammars could also be detected (a conflict cannot be resolved by any finite k if the conflicting alternatives share a common, possibly infinite, beginning).
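The FIRST-set computation that such a conflict-resolution loop would repeat can be sketched as a fixed-point iteration. This is a generic textbook-style sketch, not KDevelop-PG-Qt's internal representation: nonterminals are single uppercase characters, terminals lowercase, and epsilon productions are omitted to keep it short:

```cpp
// Illustrative fixed-point computation of FIRST sets for a toy grammar.
// Representation: each nonterminal (uppercase char) maps to its list of
// right-hand sides; epsilon productions are not handled in this sketch.
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

using Grammar = std::map<char, std::vector<std::string>>;

std::map<char, std::set<char>> firstSets(const Grammar& g) {
    std::map<char, std::set<char>> first;
    bool changed = true;
    while (changed) {                       // iterate until nothing changes
        changed = false;
        for (const auto& [lhs, prods] : g)
            for (const std::string& rhs : prods) {
                char x = rhs[0];            // without epsilon, only the first symbol matters
                std::set<char> add =
                    g.count(x) ? first[x] : std::set<char>{x};
                for (char t : add)
                    changed |= first[lhs].insert(t).second;
            }
    }
    return first;
}
```

An LL(k) generalization would compute sets of terminal strings up to length k instead of single terminals, which is why the limit on k mentioned above matters: the sets grow quickly.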
Maybe it should not be done with macros, because a lot of #define, #undef etc. would look messy. Do not forget the macro for the current node! Maybe $ signs could be used if the replacement should happen at generation time.
You like one of these ideas? Contact us at #email@example.com or firstname.lastname@example.org! You have another idea? Add it to the list! You think a feature is more important than "low"? Contact us!
It should make tokenizing easier, e.g. UTF-8 support with Unicode character classes. It could be a bit tricky to keep the conditions small when constructing the DFA: expressions like "Unicode letter, without Chinese, without b, x and z" would be needed as conditions in the transition graph (we do not want to enumerate all Chinese characters in the transition graph!), but short expressions should of course be expanded, e.g. "Unicode letter without (Unicode letters without b, x and z)" becomes "b, x, z". The current prototype contains different implementations for handling sets of characters: traditional, table based, and range based.
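As an illustration of the range-based representation mentioned above (the names and layout here are invented for the example, not taken from the prototype): a character class is stored as a sorted list of disjoint code-point ranges and tested with a binary search, so a class like "any CJK ideograph" stays one entry instead of tens of thousands:

```cpp
// Illustrative range-based character set: sorted, disjoint inclusive
// [first, last] code-point ranges, membership tested by binary search.
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct CharRange { uint32_t first, last; };  // inclusive range of code points

struct RangeSet {
    std::vector<CharRange> ranges;           // must be sorted and disjoint

    bool contains(uint32_t c) const {
        // find the first range starting strictly after c ...
        auto it = std::upper_bound(ranges.begin(), ranges.end(), c,
            [](uint32_t v, const CharRange& r) { return v < r.first; });
        // ... then c is in the set iff the previous range covers it
        return it != ranges.begin() && c <= (it - 1)->last;
    }
};
```

A table-based implementation trades this O(log n) lookup for O(1) indexing at the cost of memory, which is the trade-off between the prototype's variants.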