Here’s a concise update on recent work involving Abstract Syntax Trees (ASTs).
Key recent themes
- ASTs are increasingly central to code understanding and manipulation, with research comparing parsing frontends (JDT, Tree-sitter, ANTLR, srcML) in terms of tree size, depth, and abstraction level, and how these properties affect code-related tasks. This helps explain why some tools prefer smaller, higher-abstraction ASTs for efficiency, while others favor richer trees for legibility and expressiveness.[4]
- Studies and surveys continue to examine how AST representations affect downstream tasks like code summarization, code search, and program understanding, noting that higher-abstraction trees can help some tasks, whereas overly detailed trees may add a learning burden for models.[4]
- AST-based tooling remains active, including language-aware parsing libraries and code analysis/patching ecosystems that use ASTs for transformations, linting, and automated fixes, with ongoing work in related GitHub projects and discussion of adoption in tooling pipelines.[8][9]
- There is ongoing academic interest in how different AST representations influence learning outcomes for AI models, particularly in programming-language understanding, with some findings suggesting smaller, more abstract ASTs can be advantageous for certain tasks, while richer ASTs may help others but risk redundancy.[2][4]
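To make the size and abstraction measures above concrete, here is a minimal sketch that computes tree size, tree depth, and the number of unique node types for a snippet. It uses Python's built-in `ast` module as a stand-in, not any of the parsers compared in the cited study (JDT, Tree-sitter, ANTLR, srcML), and the function name `ast_metrics` is hypothetical.

```python
import ast


def ast_metrics(source: str) -> dict:
    """Return rough size/abstraction measures for the AST of `source`.

    Assumption: these measures only loosely mirror the "tree size",
    "tree depth", and "unique types" notions from the cited comparison.
    """
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))  # every node, including context nodes like Store

    def depth(node: ast.AST) -> int:
        # Depth counts nodes on the longest root-to-leaf path.
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(child) for child in children), default=0)

    return {
        "size": len(nodes),
        "depth": depth(tree),
        "unique_types": len({type(node).__name__ for node in nodes}),
    }


if __name__ == "__main__":
    print(ast_metrics("x = 1"))
    print(ast_metrics("def f(a, b):\n    return a + b * 2\n"))
```

Running the same measurement over trees produced by different frontends for the same source is one way to see how much they differ in size and abstraction level.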
Illustration: how AST choices influence tooling
- If you’re building a code formatter or linter, a deeper AST with more node types can enable finer-grained transformations but may slow down analysis, whereas a leaner AST can speed up large-scale scans but might require extra logic to cover edge cases.[4]
- For code understanding by language models, choosing an AST that balances size and abstraction can affect learning efficiency and performance across tasks like summarization, patching, and code search.[2][4]
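As a rough illustration of the linter and automated-fix case, the sketch below finds `== None` / `!= None` comparisons and rewrites them to `is None` / `is not None`. It assumes Python 3.9+ (for `ast.unparse`) and uses only the standard `ast` module; the class and function names are hypothetical and do not come from any of the tools mentioned above.

```python
import ast


class NoneComparisonFixer(ast.NodeTransformer):
    """Rewrite `x == None` / `x != None` to `x is None` / `x is not None`."""

    def __init__(self) -> None:
        self.findings = []  # (line, column) of each comparison that was rewritten

    def visit_Compare(self, node: ast.Compare) -> ast.Compare:
        self.generic_visit(node)  # handle nested comparisons first
        new_ops = []
        changed = False
        for op, right in zip(node.ops, node.comparators):
            is_none = isinstance(right, ast.Constant) and right.value is None
            if is_none and isinstance(op, ast.Eq):
                new_ops.append(ast.Is())
                changed = True
            elif is_none and isinstance(op, ast.NotEq):
                new_ops.append(ast.IsNot())
                changed = True
            else:
                new_ops.append(op)
        if changed:
            self.findings.append((node.lineno, node.col_offset))
        node.ops = new_ops
        return node


def fix_none_comparisons(source: str):
    """Return (rewritten_source, findings) for the given source text."""
    fixer = NoneComparisonFixer()
    tree = fixer.visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree), fixer.findings


if __name__ == "__main__":
    fixed, findings = fix_none_comparisons("if x == None:\n    pass\n")
    print(fixed)
    print(findings)
```

Note that `ast.unparse` regenerates source from the abstract tree, so comments and original formatting are lost. That is one instance of the trade-off above: a leaner tree is easy to analyze, but a formatter or patching tool usually needs a richer, more concrete tree to preserve layout.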
If you’d like, I can pull the very latest papers or blog posts and summarize their findings, or tailor an AST-based approach for a particular language or tooling stack you’re using (e.g., JavaScript/TypeScript, Python, or C/C++). I can also generate a small example showing how different AST representations affect a simple code transformation.
Citations
- Abstract Syntax Tree comparisons across parsing methods and impacts on code tasks.[4]
- AST representations and their influence on code-related tasks and model learning.[2][4]
- Tools and discussions around AST-based code analysis, patching, and linting ecosystems.[8][9]
Sources
We apply the approach to gradually migrate the schemas of the AUTOBAYES program synthesis system to concrete syntax. First experiences show that this can result in a considerable reduction of the code size and an improved readability of the code. In particular, abstracting out fresh-variable generation and second-order term construction allows the formulation of larger continuous fragments and improves the locality in the schemas. … We used the recent grammar of the Arden Syntax v.2.10, and both...
www.science.gov
Based on the extensive experimental results, we conclude the following findings: • The ASTs generated by different AST parsing methods differ in size and abstraction level. The size (in terms of tree size and tree depth) and abstraction level (in terms of unique types and unique tokens) of the ASTs generated by JDT are the smallest and highest, respectively. On … pets require more high-level abstract summaries in code summarization, and code snippets semantically match but contain fewer query...
arxiv.org
interpreter, pyre-ast will be able to parse/reject it as well. Furthermore, abstract syntax trees obtained from pyre-ast is guaranteed to 100% match the results obtained by Python's own ast.parse API, down to every AST node and every line/column number.
alan.petitepomme.net
ievans on June 7, 2021 It supports many more languages (~17 at various stages of development) and being able to do AST patching as in the original is one of the capabilities we're experimenting with: https://semgrep.dev/docs/experiments/overview/#autofix Would love your feedback!
news.ycombinator.com
• The ASTs generated by different AST parsing methods differ in size and abstraction level. The size (in terms of tree size and tree depth) and abstraction level (in terms of unique types and unique tokens) of the ASTs generated by JDT are the smallest and highest, respectively. On the contrary, ASTs generated by ANTLR exhibit the largest size and the lowest abstraction level. Tree-sitter and srcML are both intermediate in structure size and abstraction level between JDT and ANTLR. … • Among...
arxiv.org