From a developer's perspective, compilers are black boxes: source code goes in one end, magic happens in the middle, and object files or assemblies come out the other end. While doing their job, compilers build up a deep understanding of our code. For decades this valuable information was out of reach.
The Roslyn project aims to open up compilers to developers and offer them as services. Through Roslyn, compilers become services with APIs that you can use for code-related tasks in your tools and applications.
To understand the APIs that Roslyn offers, we first need to understand the general phases of a compiler pipeline: parsing, declaration, binding, and emit.
For each of these phases, Roslyn provides a set of APIs to access its services and an object model that allows access to the information at that phase.
The fundamental data structure exposed by the Compiler APIs is the syntax tree. It represents the lexical and syntactic structure of source code. All compilation starts with parsing, because no part of the source code can be understood before it has been identified. These trees also enable tools to manipulate source code without editing it as text.
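Syntax trees have three key attributes: they hold all the source information in full fidelity, they can reproduce the exact text they were parsed from, and they are immutable and thread-safe.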
Each syntax tree is made up of nodes, tokens, and trivia.
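To make this concrete, here is a minimal sketch that parses a small snippet and inspects the resulting tree. It assumes the Microsoft.CodeAnalysis.CSharp NuGet package and C# 9 top-level statements; the sample source string is made up for illustration.

```csharp
using System;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

// Made-up sample input; any C# text works, valid or not.
const string source = "class C { int Add(int a, int b) { return a + b; } } // sample\n";

// ParseText runs only the parser; nothing is bound or compiled here.
SyntaxTree tree = CSharpSyntaxTree.ParseText(source);
SyntaxNode root = tree.GetRoot();

// Full fidelity: the tree can reproduce the original text, trivia included.
bool roundTrips = root.ToFullString() == source;

// A tree is made up of nodes, tokens, and trivia.
int nodeCount = root.DescendantNodesAndSelf().Count();
int tokenCount = root.DescendantTokens().Count();
int triviaCount = root.DescendantTrivia().Count();

Console.WriteLine($"round-trips: {roundTrips}");
Console.WriteLine($"nodes: {nodeCount}, tokens: {tokenCount}, trivia: {triviaCount}");
```

Comparing ToFullString() with the input is a quick way to check the full-fidelity property of a syntax tree.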
Syntax Nodes: the primary elements of syntax trees. They represent syntactic constructs such as declarations, statements, clauses, and expressions. They are non-terminal nodes in the syntax tree, which means they always have other nodes and tokens as children. Each category of syntax nodes is represented by a separate class derived from SyntaxNode.
The Parent property returns the node's parent.
ChildNodes() returns the node's direct child nodes in sequential order, based on their position in the source text. This sequence does not contain tokens.
DescendantNodes(), DescendantTokens(), and DescendantTrivia() return all the nodes, tokens, or trivia in the sub-tree rooted at the node, again in source order.
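The navigation members above can be sketched as follows. This is an illustrative snippet that assumes the Microsoft.CodeAnalysis.CSharp NuGet package and C# 9 top-level statements; the parsed class and method are made up.

```csharp
using System;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

SyntaxNode root = CSharpSyntaxTree.ParseText(
    "class C { void M() { int x = 1; x = x + 2; } }").GetRoot();

// DescendantNodes walks the whole sub-tree; pick the first method declaration.
MethodDeclarationSyntax method =
    root.DescendantNodes().OfType<MethodDeclarationSyntax>().First();

// Parent walks one level up: the method sits inside the class declaration.
var parent = method.Parent;
Console.WriteLine($"parent: {parent?.GetType().Name}");

// ChildNodes yields only direct child nodes, never tokens.
foreach (SyntaxNode child in method.ChildNodes())
    Console.WriteLine($"child: {child.GetType().Name}");

// The descendant enumerations cover everything below the method.
int descendantNodes = method.DescendantNodes().Count();
int descendantTokens = method.DescendantTokens().Count();
Console.WriteLine($"descendant nodes: {descendantNodes}, tokens: {descendantTokens}");
```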
Each syntax node subclass also exposes the same children through strongly typed properties. For example:
BinaryExpressionSyntax node class
Property Left/Right of type ExpressionSyntax
Property OperatorToken of type SyntaxToken
IfStatementSyntax node class
Optional property Else of type ElseClauseSyntax, which returns null if the child is not present.
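A short sketch of these strongly typed properties in use, assuming the Microsoft.CodeAnalysis.CSharp NuGet package and C# 9 top-level statements. The sample method is made up for illustration.

```csharp
using System;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

SyntaxNode root = CSharpSyntaxTree.ParseText(
    "class C { int M(int a, int b) { if (a > b) return a; return a + b; } }").GetRoot();

// The if statement in the sample has no else clause, so Else is null.
IfStatementSyntax ifStatement =
    root.DescendantNodes().OfType<IfStatementSyntax>().First();
bool hasElse = ifStatement.Else != null;

// Both "a > b" and "a + b" are BinaryExpressionSyntax; pick the addition.
BinaryExpressionSyntax sum = root.DescendantNodes()
    .OfType<BinaryExpressionSyntax>()
    .First(b => b.OperatorToken.Text == "+");

string left = sum.Left.ToString();    // operand expression on the left
string op = sum.OperatorToken.Text;   // the operator is a token, not a node
string right = sum.Right.ToString();  // operand expression on the right

Console.WriteLine($"has else: {hasElse}");
Console.WriteLine($"{left} {op} {right}");
```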
Syntax Tokens are the terminals of the language grammar, representing the smallest syntactic fragments of the code. They are never parents of other nodes or tokens. Syntax tokens consist of keywords, identifiers, literals, and punctuation.
Syntax Trivia represent the parts of the source text that are largely insignificant for normal understanding of the code, such as whitespace, comments, and preprocessor directives. Because trivia are not part of the normal language syntax and can appear anywhere between any two tokens, they are not included in the syntax tree as a child of a node. Instead, you can access them through a token’s LeadingTrivia or TrailingTrivia collections.
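The relationship between tokens and trivia can be sketched like this. The snippet assumes the Microsoft.CodeAnalysis.CSharp NuGet package and C# 9 top-level statements, and the two comments in the sample text are made up.

```csharp
using System;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

const string source = "// leading comment\nint x = 42; // trailing comment\n";
SyntaxNode root = CSharpSyntaxTree.ParseText(source).GetRoot();

// The first token is the 'int' keyword; the comment above it is its leading trivia.
SyntaxToken first = root.GetFirstToken();
bool leadingHasComment = first.LeadingTrivia
    .Any(t => t.IsKind(SyntaxKind.SingleLineCommentTrivia));

// The comment after ';' on the same line is trailing trivia of the semicolon.
SyntaxToken semicolon = root.DescendantTokens()
    .First(t => t.IsKind(SyntaxKind.SemicolonToken));
bool trailingHasComment = semicolon.TrailingTrivia
    .Any(t => t.IsKind(SyntaxKind.SingleLineCommentTrivia));

// Tokens are terminals: print each one with its kind.
foreach (SyntaxToken token in root.DescendantTokens())
    Console.WriteLine($"{token.Kind()}: '{token.Text}'");

Console.WriteLine($"leading comment: {leadingHasComment}, trailing comment: {trailingHasComment}");
```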
Spans: So that each node knows its position within the source text, it carries position information as TextSpan values. A TextSpan is a starting position and a character count, both represented as integers. Each node has two TextSpan properties:
The Span property is the text span from the start of the first token in the node’s sub-tree to the end of the last token. This span does not include any leading or trailing trivia.
The FullSpan property is the text span that includes the node’s normal span, plus the span of any leading or trailing trivia.
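The difference between the two spans is easiest to see on a statement surrounded by whitespace and a comment. The following sketch assumes the Microsoft.CodeAnalysis.CSharp NuGet package and C# 9 top-level statements; the sample text is made up.

```csharp
using System;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;
using Microsoft.CodeAnalysis.Text;

const string source = "class C { void M() {\n    x = 1; // done\n} }";
SyntaxNode root = CSharpSyntaxTree.ParseText(source).GetRoot();

ExpressionStatementSyntax statement =
    root.DescendantNodes().OfType<ExpressionStatementSyntax>().First();

// Span covers first token to last token; FullSpan also covers attached trivia.
TextSpan span = statement.Span;
TextSpan fullSpan = statement.FullSpan;

// Slice the original text with each span to see what it covers.
string spanText = source.Substring(span.Start, span.Length);
string fullText = source.Substring(fullSpan.Start, fullSpan.Length);

Console.WriteLine($"Span     [{span.Start}, +{span.Length}]: '{spanText}'");
Console.WriteLine($"FullSpan [{fullSpan.Start}, +{fullSpan.Length}]: '{fullText}'");
```

Here the indentation before the statement and the comment after it fall inside FullSpan but outside Span.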
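Kinds: Each node, token, or trivia has a Kind property, of type SyntaxKind, that identifies the exact syntax element represented. The Kind property allows for easy disambiguation of syntax node types that share the same node class. For example, a single BinaryExpressionSyntax class covers kinds such as AddExpression, SubtractExpression, and MultiplyExpression.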
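As a small illustration, the sketch below parses two expressions that share a node class but differ in kind. It assumes the Microsoft.CodeAnalysis.CSharp NuGet package and C# 9 top-level statements. Note that in current releases of that package the kind is read through the Kind() extension method rather than a property.

```csharp
using System;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

// Both expressions parse to the same node class...
var add = (BinaryExpressionSyntax)SyntaxFactory.ParseExpression("a + b");
var mul = (BinaryExpressionSyntax)SyntaxFactory.ParseExpression("a * b");

// ...but their kinds tell them apart.
SyntaxKind addKind = add.Kind();
SyntaxKind mulKind = mul.Kind();

// Tokens have kinds too.
SyntaxKind plusKind = add.OperatorToken.Kind();

Console.WriteLine($"{add.GetType().Name}: {addKind}");
Console.WriteLine($"{mul.GetType().Name}: {mulKind}");
Console.WriteLine($"operator token: {plusKind}");
```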
Errors: When the parser encounters code that does not conform to the defined syntax of the language, it uses one of two techniques to create a syntax tree.
If the parser expects a particular kind of token but does not find it, it may insert a missing token into the syntax tree at the location where the token was expected. A missing token represents the actual token that was expected, but it has an empty span, and its IsMissing property returns true.
The parser may skip tokens until it finds one where it can continue parsing. In this case, the skipped tokens are attached as a trivia node with the kind SkippedTokensTrivia.
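Both recovery techniques can be observed on deliberately broken input. This is a minimal sketch that assumes the Microsoft.CodeAnalysis.CSharp NuGet package and C# 9 top-level statements. The two broken snippets are made up, and the exact recovery the parser picks for other inputs may differ.

```csharp
using System;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

// Case 1: the semicolon after "int x = 1" is left out on purpose.
SyntaxTree missingTree = CSharpSyntaxTree.ParseText(
    "class C { void M() { int x = 1 } }");
SyntaxToken[] missingTokens = missingTree.GetRoot()
    .DescendantTokens()
    .Where(t => t.IsMissing)
    .ToArray();

foreach (SyntaxToken token in missingTokens)
    Console.WriteLine($"missing {token.Kind()} at {token.Span.Start}, length {token.Span.Length}");

// Case 2: a stray closing brace that the parser cannot place anywhere.
SyntaxTree skippedTree = CSharpSyntaxTree.ParseText("class C { } }");
bool hasSkippedTokens = skippedTree.GetRoot()
    .DescendantTrivia()
    .Any(t => t.IsKind(SyntaxKind.SkippedTokensTrivia));
Console.WriteLine($"skipped tokens trivia present: {hasSkippedTokens}");

// Either way a tree is produced, and the problems surface as diagnostics.
foreach (Diagnostic diagnostic in missingTree.GetDiagnostics())
    Console.WriteLine(diagnostic.GetMessage());
```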
In this post we explored Microsoft Roslyn, a project that offers the compiler as a service and opens many doors for developers who want to build code-focused tools and applications. In later posts we will gradually show Roslyn in action.