從零寫一個編譯器(九):語義分析之構造抽象語法樹(AST)
專案的完整程式碼在 C2j-Compiler
前言
在上一篇完成了符號表的構建,下一步就是輸出抽象語法樹(Abstract Syntax Tree,AST)
抽象語法樹(abstract syntax tree 或者縮寫為 AST),是原始碼的抽象語法結構的樹狀表現形式,這裡特指程式語言的原始碼。樹上的每個節點都表示原始碼中的一種結構。
AST對於編譯器是至關重要的,現在的編譯型語言一般通過AST來生成IR,解釋型語言也可以不用虛擬機器而直接遍歷AST來解釋執行,之後要寫直譯器和編譯器都依賴這個AST
這一篇主要檔案有:
- AstBuilder.java
- AstNode.java
- AstNodeImpl.java
- NodeKey.java
- NodeFactory.java
主要資料結構
AST節點的表示
public interface AstNode { AstNode addChild(AstNode node); AstNode getParent(); ArrayList<AstNode> getChildren(); void setAttribute(NodeKey key, Object value); Object getAttribute(NodeKey key); boolean isChildrenReverse(); void reverseChildren(); AstNode copy(); }
這是對AstNode介面的實現,並且繼承HashMap,這裡的NodeKey是
TokenType, VALUE, SYMBOL, PRODUCTION, TEXT
對應的value,
- TokenType就是非終結符的型別
- Text用來儲存解析物件的文字資訊
- Symbol對應的就是變數的符號物件
- Value是對應物件解析的值,比如int a = 1,那麼value的值就為1
public class AstNodeImpl extends HashMap<NodeKey, Object> implements AstNode { private Token type; private AstNodeImpl parent; private ArrayList<AstNode> children; String name; private boolean isChildrenReverse = false; public AstNodeImpl(Token type) { this.type = type; this.parent = null; this.children = new ArrayList<>(); setAttribute(NodeKey.TokenType, type); } @Override public AstNode addChild(AstNode node) { if (node != null) { children.add(node); ((AstNodeImpl) node).parent = this; } return node; } @Override public AstNode getParent() { return parent; } @Override public void reverseChildren() { if (isChildrenReverse) { return; } Collections.reverse(children); isChildrenReverse = true; } @Override public boolean isChildrenReverse() { return isChildrenReverse; } @Override public ArrayList<AstNode> getChildren() { reverseChildren(); return children; } @Override public void setAttribute(NodeKey key, Object value) { if (key == NodeKey.TEXT) { name = (String) value; } put(key, value); } @Override public Object getAttribute(NodeKey key) { return get(key); } @Override public String toString() { String info = ""; if (get(NodeKey.VALUE) != null) { info += "Node Value is " + get(NodeKey.VALUE).toString(); } if (get(NodeKey.TEXT) != null) { info += "\nNode Text is " + get(NodeKey.TEXT).toString(); } if (get(NodeKey.SYMBOL) != null) { info += "\nNode Symbol is " + get(NodeKey.SYMBOL).toString(); } return info + "\n Node Type is " + type.toString(); } @Override public AstNode copy() { AstNodeImpl copy = (AstNodeImpl) NodeFactory.createICodeNode(type); Set<Entry<NodeKey, Object>> attributes = entrySet(); Iterator<Map.Entry<NodeKey, Object>> it = attributes.iterator(); while (it.hasNext()) { Map.Entry<NodeKey, Object> attribute = it.next(); copy.put(attribute.getKey(), attribute.getValue()); } return copy; } }
NodeFactory
NodeFactory就是簡單的返回一個節點的實現
public class NodeFactory {
public static AstNode createICodeNode(Token type) {
return new AstNodeImpl(type);
}
}
構造AST
AST的建立也是需要在語法分析過程中根據reduce操作進行操作的。也就是在takeActionForReduce方法中呼叫AstBuilder的buildSyntaxTree方法
在AstBuilder裡面還是需要兩個堆疊來輔助操作
private Stack<AstNode> nodeStack = new Stack<>();
private LRStateTableParser parser = null;
private TypeSystem typeSystem = null;
private Stack<Object> valueStack = null;
private String functionName;
private HashMap<String, AstNode> funcMap = new HashMap<>();
private static AstBuilder instance;
構造AST的主要邏輯在buildSyntaxTree方法裡,需要注意的是有一些節點在解釋執行和程式碼生成的時候是不一樣的,有時程式碼生成需要的節點解釋執行的話並不需要
在這裡提一下UNARY這個非終結符,這個非終結符和NAME很像,但是它一般是代表進行運算和一些操作的時候,比如陣列,++,--或者函式呼叫的時候
其實構建AST的過程和符號表的構建過程有點兒類似,都是根據reduce操作來建立資訊和組合資訊,符號表是組合修飾符說明符等,而AST則是組合節點間的關係變成一棵樹
我們只看幾個操作
Specifiers_DeclList_Semi_TO_Def
這個節點需要注意的是,從堆疊的什麼地方拿到Symbol,這個需要從reduce次數和推導式中得出
* DEF -> SPECIFIERS DECL_LIST SEMI
* DECL -> VAR_DECL
* VAR_DECL -> NEW_NAME
* | VAR_DECL LP RP
* | VAR_DECL LP VAR_LIST RP
* | LP VAR_DECL RP
* | START VAR_DECL
從推導式可以看出,DEF節點的符號應該在valueStack.size() - 3,但是DECL和VAR_DECL沒有做reduce操作,所以符號應該在valueStack.size() - 2。這其實和前面的符號表構建算出之前符號的位置是一樣的。
TO_UNARY
這裡則是變數、數字或者字串的節點,如果是個變數的號,這個節點就需要一個Symbol的value了
case SyntaxProductionInit.Number_TO_Unary:
case SyntaxProductionInit.Name_TO_Unary:
case SyntaxProductionInit.String_TO_Unary:
node = NodeFactory.createICodeNode(Token.UNARY);
if (production == SyntaxProductionInit.Name_TO_Unary) {
assignSymbolToNode(node, text);
}
node.setAttribute(NodeKey.TEXT, text);
break;
其餘的節點無非是把一些語句拆分它的邏輯然後組成節點,真正的求值部分像Name_TO_Unary比較少,更多是比如把一個if else塊分成if節點、判斷節點、else節點,之後再按照這棵樹進行解釋執行或者程式碼生成
public AstNode buildSyntaxTree(int production, String text) {
AstNode node = null;
Symbol symbol = null;
AstNode child = null;
if (Start.STARTTYPE == Start.INTERPRETER) {
int p1 = SyntaxProductionInit.Specifiers_DeclList_Semi_TO_Def;
int p2 = SyntaxProductionInit.Def_To_DefList;
int p3 = SyntaxProductionInit.DefList_Def_TO_DefList;
boolean isReturn = production == p1 || production == p2 || production == p3;
if (isReturn) {
return null;
}
}
switch (production) {
case SyntaxProductionInit.Specifiers_DeclList_Semi_TO_Def:
node = NodeFactory.createICodeNode(Token.DEF);
symbol = (Symbol) valueStack.get(valueStack.size() - 2);
node.setAttribute(NodeKey.SYMBOL, symbol);
break;
case SyntaxProductionInit.Def_To_DefList:
node = NodeFactory.createICodeNode(Token.DEF_LIST);
node.addChild(nodeStack.pop());
break;
case SyntaxProductionInit.DefList_Def_TO_DefList:
node = NodeFactory.createICodeNode(Token.DEF_LIST);
node.addChild(nodeStack.pop());
node.addChild(nodeStack.pop());
break;
case SyntaxProductionInit.Number_TO_Unary:
case SyntaxProductionInit.Name_TO_Unary:
case SyntaxProductionInit.String_TO_Unary:
node = NodeFactory.createICodeNode(Token.UNARY);
if (production == SyntaxProductionInit.Name_TO_Unary) {
assignSymbolToNode(node, text);
}
node.setAttribute(NodeKey.TEXT, text);
break;
case SyntaxProductionInit.Unary_LP_RP_TO_Unary:
node = NodeFactory.createICodeNode(Token.UNARY);
node.addChild(nodeStack.pop());
break;
case SyntaxProductionInit.Unary_LP_ARGS_RP_TO_Unary:
node = NodeFactory.createICodeNode(Token.UNARY);
node.addChild(nodeStack.pop());
node.addChild(nodeStack.pop());
break;
case SyntaxProductionInit.Unary_Incop_TO_Unary:
case SyntaxProductionInit.Unary_DecOp_TO_Unary:
case SyntaxProductionInit.LP_Expr_RP_TO_Unary:
case SyntaxProductionInit.Start_Unary_TO_Unary:
node = NodeFactory.createICodeNode(Token.UNARY);
node.addChild(nodeStack.pop());
break;
case SyntaxProductionInit.Unary_LB_Expr_RB_TO_Unary:
node = NodeFactory.createICodeNode(Token.UNARY);
node.addChild(nodeStack.pop());
node.addChild(nodeStack.pop());
break;
case SyntaxProductionInit.Uanry_TO_Binary:
node = NodeFactory.createICodeNode(Token.BINARY);
child = nodeStack.pop();
node.setAttribute(NodeKey.TEXT, child.getAttribute(NodeKey.TEXT));
node.addChild(child);
break;
case SyntaxProductionInit.Binary_TO_NoCommaExpr:
case SyntaxProductionInit.NoCommaExpr_Equal_NoCommaExpr_TO_NoCommaExpr:
node = NodeFactory.createICodeNode(Token.NO_COMMA_EXPR);
child = nodeStack.pop();
String t = (String) child.getAttribute(NodeKey.TEXT);
node.addChild(child);
if (production == SyntaxProductionInit.NoCommaExpr_Equal_NoCommaExpr_TO_NoCommaExpr) {
child = nodeStack.pop();
t = (String) child.getAttribute(NodeKey.TEXT);
node.addChild(child);
}
break;
case SyntaxProductionInit.Binary_Plus_Binary_TO_Binary:
case SyntaxProductionInit.Binary_DivOp_Binary_TO_Binary:
case SyntaxProductionInit.Binary_Minus_Binary_TO_Binary:
case SyntaxProductionInit.Binary_Start_Binary_TO_Binary:
node = NodeFactory.createICodeNode(Token.BINARY);
node.addChild(nodeStack.pop());
node.addChild(nodeStack.pop());
break;
case SyntaxProductionInit.Binary_RelOP_Binary_TO_Binray:
node = NodeFactory.createICodeNode(Token.BINARY);
node.addChild(nodeStack.pop());
AstNode operator = NodeFactory.createICodeNode(Token.RELOP);
operator.setAttribute(NodeKey.TEXT, parser.getRelOperatorText());
node.addChild(operator);
node.addChild(nodeStack.pop());
break;
case SyntaxProductionInit.NoCommaExpr_TO_Expr:
node = NodeFactory.createICodeNode(Token.EXPR);
node.addChild(nodeStack.pop());
break;
case SyntaxProductionInit.Expr_Semi_TO_Statement:
case SyntaxProductionInit.CompountStmt_TO_Statement:
node = NodeFactory.createICodeNode(Token.STATEMENT);
node.addChild(nodeStack.pop());
break;
case SyntaxProductionInit.LocalDefs_TO_Statement:
node = NodeFactory.createICodeNode(Token.STATEMENT);
if (Start.STARTTYPE == Start.CODEGEN) {
node.addChild(nodeStack.pop());
}
break;
case SyntaxProductionInit.Statement_TO_StmtList:
node = NodeFactory.createICodeNode(Token.STMT_LIST);
if (nodeStack.size() > 0) {
node.addChild(nodeStack.pop());
}
break;
case SyntaxProductionInit.FOR_OptExpr_Test_EndOptExpr_Statement_TO_Statement:
node = NodeFactory.createICodeNode(Token.STATEMENT);
node.addChild(nodeStack.pop());
node.addChild(nodeStack.pop());
node.addChild(nodeStack.pop());
node.addChild(nodeStack.pop());
break;
case SyntaxProductionInit.StmtList_Statement_TO_StmtList:
node = NodeFactory.createICodeNode(Token.STMT_LIST);
node.addChild(nodeStack.pop());
node.addChild(nodeStack.pop());
break;
case SyntaxProductionInit.Expr_TO_Test:
node = NodeFactory.createICodeNode(Token.TEST);
node.addChild(nodeStack.pop());
break;
case SyntaxProductionInit.If_Test_Statement_TO_IFStatement:
node = NodeFactory.createICodeNode(Token.IF_STATEMENT);
node.addChild(nodeStack.pop());
node.addChild(nodeStack.pop());
break;
case SyntaxProductionInit.IfElseStatemnt_Else_Statemenet_TO_IfElseStatement:
node = NodeFactory.createICodeNode(Token.IF_ELSE_STATEMENT);
node.addChild(nodeStack.pop());
node.addChild(nodeStack.pop());
break;
case SyntaxProductionInit.While_LP_Test_Rp_TO_Statement:
case SyntaxProductionInit.Do_Statement_While_Test_To_Statement:
node = NodeFactory.createICodeNode(Token.STATEMENT);
node.addChild(nodeStack.pop());
node.addChild(nodeStack.pop());
break;
case SyntaxProductionInit.Expr_Semi_TO_OptExpr:
case SyntaxProductionInit.Semi_TO_OptExpr:
node = NodeFactory.createICodeNode(Token.OPT_EXPR);
if (production == SyntaxProductionInit.Expr_Semi_TO_OptExpr) {
node.addChild(nodeStack.pop());
}
break;
case SyntaxProductionInit.Expr_TO_EndOpt:
node = NodeFactory.createICodeNode(Token.END_OPT_EXPR);
node.addChild(nodeStack.pop());
break;
case SyntaxProductionInit.LocalDefs_StmtList_TO_CompoundStmt:
node = NodeFactory.createICodeNode(Token.COMPOUND_STMT);
node.addChild(nodeStack.pop());
break;
case SyntaxProductionInit.NewName_LP_RP_TO_FunctDecl:
case SyntaxProductionInit.NewName_LP_VarList_RP_TO_FunctDecl:
node = NodeFactory.createICodeNode(Token.FUNCT_DECL);
node.addChild(nodeStack.pop());
child = node.getChildren().get(0);
functionName = (String) child.getAttribute(NodeKey.TEXT);
symbol = assignSymbolToNode(node, functionName);
break;
case SyntaxProductionInit.NewName_TO_VarDecl:
nodeStack.pop();
break;
case SyntaxProductionInit.NAME_TO_NewName:
node = NodeFactory.createICodeNode(Token.NEW_NAME);
node.setAttribute(NodeKey.TEXT, text);
break;
case SyntaxProductionInit.OptSpecifiers_FunctDecl_CompoundStmt_TO_ExtDef:
node = NodeFactory.createICodeNode(Token.EXT_DEF);
node.addChild(nodeStack.pop());
node.addChild(nodeStack.pop());
funcMap.put(functionName, node);
break;
case SyntaxProductionInit.NoCommaExpr_TO_Args:
node = NodeFactory.createICodeNode(Token.ARGS);
node.addChild(nodeStack.pop());
break;
case SyntaxProductionInit.NoCommaExpr_Comma_Args_TO_Args:
node = NodeFactory.createICodeNode(Token.ARGS);
node.addChild(nodeStack.pop());
node.addChild(nodeStack.pop());
break;
case SyntaxProductionInit.Return_Semi_TO_Statement:
node = NodeFactory.createICodeNode(Token.STATEMENT);
break;
case SyntaxProductionInit.Return_Expr_Semi_TO_Statement:
node = NodeFactory.createICodeNode(Token.STATEMENT);
node.addChild(nodeStack.pop());
break;
case SyntaxProductionInit.Unary_StructOP_Name_TO_Unary:
node = NodeFactory.createICodeNode(Token.UNARY);
node.addChild(nodeStack.pop());
node.setAttribute(NodeKey.TEXT, text);
break;
default:
break;
}
if (node != null) {
node.setAttribute(NodeKey.PRODUCTION, production);
nodeStack.push(node);
}
return node;
}
小結
其實構造AST和建立符號表上非常相似,都是依據reduce操作的資訊來完成。在AST的構建中的主要任務就是對原始碼語句裡的邏輯進行分塊,比如對於一個ifelse語句:
上面的圖是我依據這個意思話的,和上面構造出來的AST不完全一致
另外我的github部落格:https://dejavudwh.