Introduction
From developer perspective, compilers are black boxes – source code
goes in one end, magic happens in the middle, and object files or
assemblies come out the other end. During their job, compilers build up
a deep understanding of our code. For decades this valuable information
was unreachable.
The Roslyn
project
aims to open compilers for developers and offer it as services. Through
Roslyn, compilers become services with APIs that can be used for code
related tasks in your tools and applications.
In order to understand the APIs that Roslyn offers, we need to
understand the general compiler phases:
- The parse phase where source is tokenized and parsed into syntax
that follows the language grammar.
- The declaration phase where declarations from source and
imported metadata are analyzed to form named symbols.
- The binding phase where identifiers in the code are matched to
symbols.
- The emitting phase where all the information built up by the
compiler is emitted as an assembly.
For each phase of these phases, Roslyn provides a set of APIs to access
its services and an object model that allow access to the information at
that phase.
- The parsing phase is exposed as a syntax tree
- The declaration phase as a hierarchical symbol table
- The binding phase as a model that exposes the result of the
compiler’s semantic analysis
- The emit phase as an API that produces IL byte codes.
Rosyln API Layers
- Compiler APIs contains a representation of a single invocation
of a compiler (including assembly references, compiler options, and
source code files) which provides the object models that corresponds
to the compiler pahases.
- Scripting APIs represents a runtime execution context for
C# or VB piece of code. It contains a scripting engine that
allows evaluation of expressions and statements as top-level
constructs in programs.
- Services APIs represents the starting point for doing code
analysis and refactoring over entire solutions by organizing all the
information about the projects in a solution into single object
model, offering you direct access to the compiler layer object
models without needing to parse files, configure options or manage
project to project dependencies. In addition, the services layer
surfaces a set of commonly used APIs used when implementing code
analysis and refactoring tools that function within a host
environment like the Visual Studio IDE.
- Editor Services APIs allows a user to easily extend Visual
Studio IDE features (such as IntelliSense, refactorings, and code
formatting). This layer has a dependency on the Visual Studio text
editor and the Visual Studio software development kit (SDK).
Working with Syn tax
The fundamental data structure exposed by the Compiler APIs is the
syntax tree. It represents the lexical and syntactic structure of source
code, No part of the source code is understood without it first being
identified. These trees enable tools to manipulate source code without
editing it as text.
Syntax Trees Attributes
Syntax trees have three key attributes:
- Full fidelity which means that the syntax tree contains every
piece of information found in the source text, every grammatical
construct, every lexical token, and everything else in between
including whitespace, comments, and preprocessor directives.
- Round-trippable which means that it can be traced back to the
text it was parsed from and get the text representation of the
sub-tree rooted at that node. This means that syntax trees can be
used as a way to construct and edit source text. By creating a tree
you have by implication created the equivalent source code, and by
editing a syntax tree, making a new tree out of changes to an
existing tree, you have effectively edited the source code.
- Immutable and thread-safe which means that after a tree is
obtained, it is a snapshot of the current state of the code, and
never changes. This allows multiple users to interact with the same
syntax tree at the same time in different threads without locking or
duplication.
Syntax Tree Elements
Each syntax tree is made up of nodes, tokens, and trivia.
Syntax Nodes: the primary elements of syntax trees. It represent
syntactic constructs such as declarations, statements, clauses, and
expressions. They are non-terminal nodes in the syntax tree, which means
they always have other nodes and tokens as children. Each category of
syntax nodes is represented by a separate class derived from
SyntaxNode.
Parent property returns the node parent.
ChildNodes() returns a list of child nodes in sequential order based
on its position in the source text. This list does not contain tokens.
DescendantNodes(), DescendantTokens(), and DescendantTrivia() returns
a list of child nodes in sequential order based on its position in the
source text.
Each syntax node subclass exposes all the same children through strongly
typed properties.
BinaryExpressionSyntax node class
- Property Left/Right of type ExpressionSyntax
- Property OperatorToken of type SyntaxToken
IfStatementSyntax node class
- Optional property ElseClauseSyntax which returns null if the child
is not present.
Syntax Tokens are the terminals of the language grammar,
representing the smallest syntactic fragments of the code. They are
never parents of other nodes or tokens. Syntax tokens consist of
keywords, identifiers, literals, and punctuation.
Syntax Trivia represent the parts of the source text that are
largely insignificant for normal understanding of the code, such as
whitespace, comments, and preprocessor directives. Because trivia are
not part of the normal language syntax and can appear anywhere between
any two tokens, they are not included in the syntax tree as a child of a
node. you can access them through token’s LeadingTrivia or
TrailingTrivia collections.
Spans In order to make each node knows its position within the
source text, each node has two properties of type TextSpan. A TextSpan
object is the beginning position and a count of characters, both
represented as integers. Each node has two TextSpan properties:
- The Span property is the text span from the start of the first
token in the node’s sub-tree to the end of the last token. This span
does not include any leading or trailing trivia.
- The FullSpan property is the text span that includes the node’s
normal span, plus the span of any leading or trailing trivia.
Kinds Each node, token, or trivia has a Kind property, of type
SyntaxKind, that identifies the exact syntax element represented. The
Kind property allows for easy disambiguation of syntax node types that
share the same node class.
Errors When the parser encounters code that does not conform to the
defined syntax of the language, it uses one of two techniques to create
a syntax tree.
- If the parser expects a particular kind of token, but does not find
it, it may insert a missing token into the syntax tree in the
location that the token was expected. A missing token represents the
actual token that was expected, but it has an empty span, and its
IsMissing property returns true.
- The parser may skip tokens until it finds one where it can continue
parsing. In this case, the skipped tokens that were skipped are
attached as a trivia node with the kind SkippedTokens.
In this post we explored Microsoft Roslyn, an interesting product that
offer the compiler as a service and opens a lot of doors in front of
developers to develop code-focused tools and applications. In later
posts we will gradually show Roslyn in action.