Building Your Own Programming Language From Scratch Part IV: Implementing Functi...

In this part of creating your own programming language, we will introduce new function constructions. Please checkout the previous parts:

The full source code is available over on GitHub.

1. Function Design

We will begin embedding function constructions with the lexical analyzer. As you may remember from the previous parts, we store language lexeme types in the TokenType enum with corresponding regular expressions.

With the new function statement, we’re missing the fun and return keywords. The first fun will stand for the beginning of the function declaration, the second return will be used to exit the function and pass the resulting value. Therefore, we add these missing keywords as “or” expressions:

package org.example.toylanguage.token;

...
public enum TokenType {
    ...
    Keyword("(if|then|end|print|input|struct|fun|return)(?=\\s|$)"),
    ...
}

Then we will map the fun and return keywords inside the syntax analyzer. The purpose of the syntax analyzer is to build statements from tokens generated by the lexical analyzer in the first step. But before diving into parsing function expressions, we will add a couple of auxiliary classes, which are required to store and manage function expressions.

1.1 Function Statement

First, we need a class for the function statement. It should extend the existing CompositeStatement, which will allow to store inner statements as the early created ConditionStatement (if/then):

package org.example.toylanguage.statement;

public class FunctionStatement extends CompositeStatement {
}

1.2 Function Definition

In order to store function declarations with all metadata information, we need a function definition class. We already have StructureDefinition for structures.

Let’s initialize a new one for the functions. It will contain the following fields: function name, function arguments and the FunctionStatement itself, which will be used later when we meet the actual function invocation to access inner function statements:

package org.example.toylanguage.definition;

@RequiredArgsConstructor
@Getter
@EqualsAndHashCode(onlyExplicitlyIncluded = true)
public class FunctionDefinition implements Definition {
	@EqualsAndHashCode.Include
	private final String name;
	private final List<String> arguments;
	private final FunctionStatement statement;
}

1.3 Function Header

Now we have classes to read a function definition and function inner statements, we can start reading the function header. We will extend the existent StatementParser#parseExpression() method to read the new fun Keyword:

package org.example.toylanguage;

public class StatementParser {
    ...
	private Statement parseExpression() {
		Token token = next(TokenType.Keyword, TokenType.Variable, TokenType.Operator);
		switch (token.getType()) {
			...
			case Keyword:
				switch (token.getValue()) {
						...
					case "fun":
						...

				}
			default:
				...
		}
	}
}

After we successfully caught fun Keyword, we will read the function name:

...
case "fun":
    Token type = tokens.next(TokenType.Variable);
...

After the function name, we expect to receive the function arguments surrounded by square brackets and divided by comma:

...
case "fun":
    Token type = tokens.next(TokenType.Variable);

    List<String> args = new ArrayList<>();

    if (tokens.peek(TokenType.GroupDivider, "[")) {

		tokens.next(TokenType.GroupDivider, "["); //skip open square bracket

		while (!tokens.peek(TokenType.GroupDivider, "]")) {
			Token arg = tokens.next(TokenType.Variable);
			args.add(arg.getValue());

			if (tokens.peek(TokenType.GroupDivider, ","))
				tokens.next();
		}

		tokens.next(TokenType.GroupDivider, "]"); //skip close square bracket
	}

...

In the next step, we initialize the FunctionDefinition instance and add it to the functions HashMap, which will be used later by inner ExpressionReader to construct function invocations:

public class StatementParser {
    ...
    private final Map<String, FunctionDefinition> functions;
    ...

    public StatementParser(List<Token> tokens) {
        ...
        this.functions = new HashMap<>();
    }

    ...
    case "fun":
        ...
        FunctionStatement functionStatement = new FunctionStatement();
        FunctionDefinition functionDefinition = new FunctionDefinition(type.getValue(), args, functionStatement);
        functions.put(type.getValue(), functionDefinition);
    ...

}

1.4 Function Body

After completing the statement header, we will read the inner statements until we reach the finalizing end Keyword the same way we did to read the ConditionStatement (if/then).

To parse an inner statement, we invoke the StatementParser#parseExpression(), which will read nested functions as well.

 ...
    case "fun":
        ...
        while (!tokens.peek(TokenType.Keyword, "end")) {
            Statement statement = parseExpression();
            functionStatement.addStatement(statement);
        }
        tokens.next(TokenType.Keyword, "end");
	  
	    return null;
...

In the end, we return null as we don’t want this function statement to be executed in the main program without a direct invocation call. This code definitely needs to be refactored, but for now it’s the easiest way to manage inner parseExpression() invocations.

2. Return Statement

In the previous section, we added the return as the TokenType.Keyword to be captured by the lexical analyzer. In this step, we will handle this keyword using the syntax analyzer.

First, we start with a statement implementation, which should contain only an expression to execute and pass as the result of the function invocation:

package org.example.toylanguage.statement;

@RequiredArgsConstructor
@Getter
public class ReturnStatement implements Statement {
    private final Expression expression;

    @Override
    public void execute() {
        Value<?> result = expression.evaluate();
    }
}

To pass the resulting expression of the invoked function, we introduce an auxiliary ReturnScope class with the invoke() method, indicating that we have triggered the end of the function:

package org.example.toylanguage.context;

@Getter
public class ReturnScope {
	private boolean invoked;
	private Value<?> result;

	public void invoke(Value<?> result) {
		setInvoked(true);
		setResult(result);
	}

	private void setInvoked(boolean invoked) {
		this.invoked = invoked;
	}

	private void setResult(Value<?> result) {
		this.result = result;
	}
}

There is one important note: we can have several nested function invocations with multiple function exits, but one ReturnScope cannot manage all function calls. Therefore, we create an extra class to manage the return context for each of the function invocation:

package org.example.toylanguage.context;

public class ReturnContext {
	private static ReturnScope scope = new ReturnScope();

	public static ReturnScope getScope() {
		return scope;
	}

	public static void reset() {
		ReturnContext.scope = new ReturnScope();
	}
}

Now we can go back to the ReturnStatement declaration and call ReturnScope#invoke() method using ReturnContext:

package org.example.toylanguage.statement;

...
public class ReturnStatement implements Statement {
    ...

    @Override
    public void execute() {
        Value<?> result = expression.evaluate();
        ReturnContext.getScope().invoke(result);
    }
}

The ReturnStatement is ready and will notify the current ReturnScope about function exit, therefore we can implement logic for the syntax analyzer to map return lexeme into ReturnStatement:

package org.example.toylanguage;

public class StatementParser {
	private Statement parseExpression() {
		Token token = next(TokenType.Keyword, TokenType.Variable, TokenType.Operator);
		switch (token.getType()) {
			...
			case Keyword:
				switch (token.getValue()) {
						...
					case "return":
						Expression expression = new ExpressionReader().readExpression();
                        return new ReturnStatement(expression);

				}
			default:
				...
		}
	}
}

The next step would be to subscribe function execution to the ReturnContext. The function needs to know at which moment we performed the return statement to stop the subsequent execution. We will extend ConditionStatement#execute() implementation, which will work both for the FunctionStatement and for the ConditionStatement (if/then), by stopping the execution at the moment we catch the return invocation:

package org.example.toylanguage.statement;

...
public class CompositeStatement implements Statement {
    ...

    @Override
    public void execute() {
        for (Statement statement : statements2Execute) {
            statement.execute();

            //stop the execution in case ReturnStatement has been invoked
            if (ReturnContext.getScope().isInvoked())
                return;
        }
    }
}

3. Empty Return

Our return statement can contain no resulting value, indicating just the end of the function without the provided result. Let’s introduce the new Value implementation containing “null” value.

First, we add new Null lexeme to parse “null” as different lexeme instead of Variable:

...
public enum TokenType {
    ...
    Numeric("[+-]?((?=[.]?[0-9])[0-9]*[.]?[0-9]*)"),
    Null("(null)(?=,|\\s|$)"),
    Text("\"([^\"]*)\""),
    ...
}

In the next step, we create the Value implementation for this Null lexeme. We can initialize a static instance as we will use the same object to refer:

package org.example.toylanguage.expression.value;

public class NullValue<T extends Comparable<T>> extends Value<T> {

	public static final NullValue<?> NULL_INSTANCE = new NullValue<>(null);

	private NullValue(T value) {
		super(value);
	}

	@Override
	public T getValue() {
		//noinspection unchecked
		return (T) this;
	}

	@Override
	public String toString() {
		return "null";
	}
}

The last thing would be to map Null lexeme as NullValue by the ExpressionReader. If our return statement will contain only return statement without “null” value, we will not have any operands for the returned expression.

Therefore, we should add extra validation at the end of the readExpression() method by returning NULL_INSTANCE in the case operands stack is empty:

private class ExpressionReader {
    ...

    private Expression readExpression() {
        while (tokens.peekSameLine(TokenType.Operator, TokenType.Variable, TokenType.Numeric, TokenType.Logical, TokenType.Null, TokenType.Text)) {
            Token token = tokens.next();
            switch (token.getType()) {
                case Operator:
                    ...
                default:
                    String value = token.getValue();
                    Expression operand;
                    switch (token.getType()) {
                        case Numeric:
                            ...
                        case Logical:
                            ...
                        case Text:
                            ...
                        case Null:
                            operand = NULL_INSTANCE;
                            break;
                        case Variable:
                        default:
                            ...
                    }
                    ...
            }
        }

        ...

        if (operands.isEmpty()) {
            return NULL_INSTANCE;
        } else {
            return operands.pop();
        }
    }
    ...
}

4. Memory Scope

Before diving into function expressions, first we will need to get familiar with the function scope. Basically, scope is the current context in the code.

Scope can be defined locally or globally. Currently, if we define any variable, it will be stored as a global variable in the StatementParser#variables Map and will be accessible in the global context.

Inside a function, we can also declare variables, and these variables should be accessible only in the scope of this declared function and should be isolated from each function call. To isolate the memory, we will introduce a new class MemoryScope, which will hold the Map of variables:

package org.example.toylanguage.context;

public class MemoryScope {
	private final Map<String, Value<?>> variables;

	public MemoryScope(MemoryScope parent) {
		this.variables = new HashMap<>();
	}

	public Value<?> get(String name) {
		return variables.get(name);
	}

	public void set(String name, Value<?> value) {
		variables.put(name, value);
	}
}

Unlike ReturnScope, we shouldn’t reset the memory (variables Map) after we enter a function or any other inner block of code. We still need a way to access global variables. Therefore, we can add a parent instance of the MemoryScope, which will be used to read the global variables declared outside of the function:

public class MemoryScope {
	private final Map<String, Value<?>> variables;
	private final MemoryScope parent;

	public MemoryScope(MemoryScope parent) {
		this.variables = new HashMap<>();
		this.parent = parent;
	}

	public Value<?> get(String name) {
		Value<?> value = variables.get(name);
		if (value == null && parent != null) {
			return parent.get(name);
		}
		return value;
	}

    ...
}

To switch between memory contexts easily, we will introduce the MemoryContext class as we did for the ReturnScope. The newScope() will be called when we enter an inner block of code and the endScope() will be triggered at the end of this block:

package org.example.toylanguage.context;

public class MemoryContext {
	private static MemoryScope memory = new MemoryScope(null);

	public static MemoryScope getMemory() {
		return memory;
	}

	public static void newScope() {
		MemoryContext.memory = new MemoryScope(memory);
	}

	public static void endScope() {
		MemoryContext.memory = memory.getParent();
	}
}

Now we‌ can refactor variables access for the AssignStatement, InputStatement and VariableExpression with the new MemoryContext interface:

package org.example.toylanguage.statement;

...
public class AssignStatement implements Statement {
    private final String name;
    private final Expression expression;

    @Override
    public void execute() {
        MemoryContext.getMemory().set(name, expression.evaluate());
    }
}

package org.example.toylanguage.statement;

...
public class InputStatement implements Statement {
    private final String name;
    private final Supplier<String> consoleSupplier;

    @Override
    public void execute() {
        ...

        MemoryContext.getMemory().set(name, value);
    }
}

package org.example.toylanguage.expression;

...
public class VariableExpression implements Expression {
    private final String name;

    @Override
    public Value<?> evaluate() {
        Value<?> value = MemoryContext.getMemory().get(name);
        ...
    }

    public void setValue(Value<?> value) {
        MemoryContext.getMemory().set(name, value);
    }
}

5. Function Invocation

In this section, we will make our syntax analyzer able to read and parse function invocations. First, we will start with the FunctionExpression declaration.

This class will contain the FunctionDefinition field to access the function details and the Map of Expression values, passed as arguments during function invocation:

package org.example.toylanguage.expression;

@RequiredArgsConstructor
@Getter
public class FunctionExpression implements Expression {
    private final FunctionDefinition definition;
    private final List<Expression> values;

    @Override
    public Value<?> evaluate() {
        return null;
    }
}

To evaluate() the function expression, we need to execute function inner statements. To retrieve them, we use the FunctionStatement instance from the declared FunctionDefinition field:

...
@Override
public Value<?> evaluate() {
	FunctionStatement statement = definition.getStatement();
    statement.execute();
	return null;
}
...

But before executing function statements, it’s important to initialize the new MemoryScope. It will allow inner statements to use its own memory. The inner memory shouldn’t be accessible outside this function call, therefore at the end of the execution, we set the memory scope back:

...
@Override
public Value<?> evaluate() {
	FunctionStatement statement = definition.getStatement();
    MemoryContext.newScope(); //set new memory scope
      
	//execute function body
	try {
		statement.execute();
	} finally {
		MemoryContext.endScope(); //end function memory scope
	}

	return null;
}
...

We should also initialize function arguments, which can be passed as value lexemes and not be available in the global memory scope. We can save them in the new function’s memory scope by creating and executing the AssignStatement for each of the function values:

...
@Override
public Value<?> evaluate() {
	FunctionStatement statement = definition.getStatement();
	MemoryContext.newScope(); //set new memory scope

	//initialize function arguments
	IntStream.range(0, values.size())
			.boxed()
			.map(i -> new AssignStatement(definition.getArguments().get(i), values.get(i)))
			.forEach(Statement::execute);

	//execute function body
	try {
		statement.execute();
	} finally {
		MemoryContext.endScope(); //end function memory scope
	}
      
      return null;
}
...

The last thing left to do is to return the result. We can grab it from the ReturnContext. The result will be captured in the ReturnScope when we invoke the ReturnStatement.

Don’t forget to reset the ReturnScope at the end of the function block to make the next function invocation use its own ReturnScope instance:

...	
@Override
public Value<?> evaluate() {
	FunctionStatement statement = definition.getStatement();
	MemoryContext.newScope(); //set new memory scope

	//initialize function arguments
	IntStream.range(0, values.size())
			.boxed()
			.map(i -> new AssignStatement(definition.getArguments().get(i), values.get(i)))
			.forEach(Statement::execute);

	//execute function body
	try {
		statement.execute();
		return ReturnContext.getScope().getResult();
	} finally {
		MemoryContext.endScope(); //end function memory scope
		ReturnContext.reset(); //reset return context
	}
}
...

Now we can switch to the ExpressionReader class and read an actual function invocation. The pattern of function call looks almost the same as the structure instantiation, except for the fact, we don’t have the New keyword in the front of the expression.

Before creating the VariableExpression, we check that the next lexeme is the opening square bracket, which will indicate the function call:

private class ExpressionReader {
    ...

    private Expression readExpression() {
        while (tokens.peekSameLine(TokenType.Operator, TokenType.Variable, TokenType.Numeric, TokenType.Logical, TokenType.Null, TokenType.Text)) {
            Token token = tokens.next();
            switch (token.getType()) {
                case Operator:
                   ...
                default:
                    String value = token.getValue();
                    Expression operand;
                    switch (token.getType()) {
                        case Numeric:
                            ...
                        case Logical:
                            ...
                        case Text:
                            ...
                        case Null:
                            ...
                        case Variable:
                        default:
                            if (!operators.isEmpty() && operators.peek() == Operator.StructureInstance) {
                                operand = readStructureInstance(token);
                            } else if (tokens.peekSameLine(TokenType.GroupDivider, "[")) {
                                operand = readFunctionInvocation(token);
                            } else {
                                operand = new VariableExpression(value);
                            }
                    }
                    ...
            }
        }

        ...
    }

}

Reading the function invocation will look almost the same as structure instantiation. First, we obtain the function definition by function name. Then, we read function arguments. In the end, we return an instance of FunctionExpression:

private FunctionExpression readFunctionInvocation(Token token) {
    FunctionDefinition definition = functions.get(token.getValue());

    List<Expression> arguments = new ArrayList<>();
    if (tokens.peekSameLine(TokenType.GroupDivider, "[")) {

        tokens.next(TokenType.GroupDivider, "["); //skip open square bracket

        while (!tokens.peekSameLine(TokenType.GroupDivider, "]")) {
            Expression value = new ExpressionReader().readExpression();
            arguments.add(value);

            if (tokens.peekSameLine(TokenType.GroupDivider, ","))
                tokens.next();
        }

        tokens.next(TokenType.GroupDivider, "]"); //skip close square bracket
    }

    return new FunctionExpression(definition, arguments);
}

6. Testing Time

We have finished embedding functions in both lexical and syntax analyzers. Now we should be able to parse function definitions and to read function invocations. Note that function invocation is just like any other expression, including structure instantiation.

Therefore, you can build any logical complex nested expressions. Let’s create a more difficult task for the syntax analysis to determine if two binary trees are the same:

#
# Determine if two binary trees are identical following these rules:
# - root nodes have the same value
# - left sub trees are identical
# - right sub trees are identical
#

struct TreeNode [ value, left, right ]

fun is_same_tree [ node1, node2 ]
    if node1 == null and node2 == null then
        return true
    end
    if node1 != null and node2 != null then
        return node1 :: value == node2 :: value and is_same_tree [ node1 :: left, node2 :: left ] and is_same_tree [ node1 :: right, node2 :: right ]
    end
    return false
end

#                               tree-s example
#
#          tree1            tree2            tree3            tree4
#            3                3                3                3
#          /   \            /   \            /   \            /   \
#         4     5          4     5          4     5          4     5
#       /  \   /  \      /  \   /  \      /  \   / \        /  \     \
#      6    7 8    9    6    7 8    9    6    7 9   8      6    7     9

# tree1 nodes
tree1_node9 = new TreeNode [ 9, null, null ]
tree1_node8 = new TreeNode [ 8, null ]
tree1_node7 = new TreeNode [ 7 ]
tree1_node6 = new TreeNode [ 6 ]
tree1_node5 = new TreeNode [ 5, tree1_node8, tree1_node9 ]
tree1_node4 = new TreeNode [ 4, tree1_node6, tree1_node7 ]
tree1 = new TreeNode [ 3, tree1_node4, tree1_node5 ]

tree2 = new TreeNode [ 3, new TreeNode [ 4, new TreeNode [ 6 ], new TreeNode [ 7 ] ], new TreeNode [ 5, new TreeNode [ 8 ], new TreeNode [ 9 ] ] ]
tree3 = new TreeNode [ 3, new TreeNode [ 4, new TreeNode [ 6 ], new TreeNode [ 7 ] ], new TreeNode [ 5, new TreeNode [ 9 ], new TreeNode [ 8 ] ] ]
tree4 = new TreeNode [ 3, new TreeNode [ 4, new TreeNode [ 6 ], new TreeNode [ 7 ] ], new TreeNode [ 5, null, new TreeNode [ 9 ] ] ]

print is_same_tree [ tree1, tree2 ]
print is_same_tree [ tree1, tree3 ]
print is_same_tree [ tree2, tree3 ]
print is_same_tree [ tree1, tree4 ]
print is_same_tree [ tree2, tree4 ]
print is_same_tree [ tree3, tree4 ]

In this part of creating your own programming language, we will introduce new function constructions. Please checkout the previous parts:

1. Function Design

1.1 Function Statement

1.2 Function Definition

1.3 Function Header

1.4 Function Body

2. Return Statement

3. Empty Return

4. Memory Scope

5. Function Invocation

6. Testing Time

Recommend

GitHub - codic12/worm: A dynamic, tag-based window manager written in Nim

为什么到了冬天，我们的尿意就变得频繁

字节排“毒” 得物见“顶”

Web3.0：中国如何引领互联网技术变革？

《战神: 诸神黄昏》或11月发售，目前仍有不少猜测

Three coins looking to beat the crypto blues: Gnox (GNOX), Apecoin (APE) and Aav...

请暂时抛弃使用 eBPF 取代服务网格和 sidecar 模式的幻想

Gender Discrimination Exists: Google Agreed To Pay $118 Million

9 Digital Wellness Tasks to Do After a Breakup

Apple Makes Further Adjustments to Dating App Rules to Satisfy Dutch Regulatory...

About Joyk