17

Crafting Interpreters: Classes and Instances

 4 years ago
source link: http://craftinginterpreters.com/classes-and-instances.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Caring too much for objects can destroy you. Only if you care for a thing enough, it takes on a life of its own, doesn’t it? And isn’t the whole point of things beautiful things that they connect you to some larger beauty?

Donna Tartt, The Goldfinch

The last area left to implement in clox is object-oriented programming. OOP is a bundle of intertwined features: classes, instances, fields, methods, initializers, and inheritance. Using relatively high-level Java, we packed all that into two chapters. Now that we’re coding in C, which feels like building a model of the Eiffel tower out of toothpicks, we’ll devote three chapters to covering the same territory. This makes for a leisurely stroll through the implementation. After strenuous chapters likeclosures and thegarbage collector, you have earned a rest. In fact, the book should be easy from here on out.

In this chapter, we cover the first three features: classes, instances, and fields. This is the stateful side of object orientation. Then in the next two chapters, we will hang behavior and code reuse off of those objects.

In a class-based object-oriented language, everything begins with classes. They define what sorts of objects exist in the program and are the factories used to produce new instances. Going bottom-up, we’ll start with their runtime representation and then hook that into the language.

By this point, we’re well-acquainted with the process of adding a new object type to the VM. We start with a struct:

} ObjClosure;

object.h

add after struct ObjClosure

typedef struct sObjClass {                    
  Obj obj;                                    
  ObjString* name;                            
} ObjClass;
ObjClosure* newClosure(ObjFunction* function);

object.h , add after struct ObjClosure

After the Obj header, we store the class’s name. This isn’t strictly needed for the user’s program, but it lets us show the name at runtime for things like stack traces.

The new type needs a corresponding case in the ObjType enum:

typedef enum {

object.h

in enum ObjType

  OBJ_CLASS,
  OBJ_CLOSURE,

object.h , in enum ObjType

And that type gets a corresponding pair of macros. First, for testing an object’s type:

#define OBJ_TYPE(value)         (AS_OBJ(value)->type)        

object.h

#define IS_CLASS(value)         isObjType(value, OBJ_CLASS)  
#define IS_CLOSURE(value)       isObjType(value, OBJ_CLOSURE)

object.h

And then for casting a Value to an ObjClass pointer:

#define IS_STRING(value)        isObjType(value, OBJ_STRING)

object.h

#define AS_CLASS(value)         ((ObjClass*)AS_OBJ(value))  
#define AS_CLOSURE(value)       ((ObjClosure*)AS_OBJ(value))

object.h

The VM creates new class objects using this function:

} ObjClass;                                   

object.h

add after struct ObjClass

ObjClass* newClass(ObjString* name);
ObjClosure* newClosure(ObjFunction* function);

object.h , add after struct ObjClass

The implementation lives over here:

object.c

add after allocateObject ()

ObjClass* newClass(ObjString* name) {                 
  ObjClass* klass = ALLOCATE_OBJ(ObjClass, OBJ_CLASS);
  klass->name = name; ​
  return klass;                                       
}

object.c , add after allocateObject ()

Pretty much all boilerplate. It takes in the class’s name as a string and stores it. Every time the user declares a new class, the VM will create a new one of these ObjClass structs to represent it.

When the VM no longer needs a class, it frees it like so:

  switch (object->type) {

memory.c

in freeObject ()

    case OBJ_CLASS: {        
      FREE(ObjClass, object);
      break;                 
    } ​

    case OBJ_CLOSURE: {

memory.c , in freeObject ()

We have a memory manager now, so we also need to support tracing through class objects:

  switch (object->type) {

memory.c

in blackenObject ()

    case OBJ_CLASS: {                     
      ObjClass* klass = (ObjClass*)object;
      markObject((Obj*)klass->name);      
      break;                              
    }                                     

    case OBJ_CLOSURE: {

memory.c , in blackenObject ()

When the GC reaches a class object, we mark its name to keep that string alive too.

The last operation the VM can perform on a class is printing it:

  switch (OBJ_TYPE(value)) {

object.c

in printObject ()

    case OBJ_CLASS:                              
      printf("%s", AS_CLASS(value)->name->chars);
      break;
    case OBJ_CLOSURE:

object.c , in printObject ()

A class simply prints its own name.

27 . 2 Class Declarations

Runtime representation in hand, we are ready to add support for classes to the language. As usual, we start in the parser:

static void declaration() {

compiler.c

in declaration ()

replace 1 line

  if (match(TOKEN_CLASS)) {     
    classDeclaration();         
  } else if (match(TOKEN_FUN)) {
    funDeclaration();

compiler.c , in declaration (), replace 1 line

Class declarations are statements and the parser recognizes one by the leading class keyword. The rest of the compilation happens over here:

compiler.c

add after function ()

static void classDeclaration() {                              
  consume(TOKEN_IDENTIFIER, "Expect class name.");            
  uint8_t nameConstant = identifierConstant(&parser.previous);
  declareVariable();

  emitBytes(OP_CLASS, nameConstant);                          
  defineVariable(nameConstant);                               

  consume(TOKEN_LEFT_BRACE, "Expect '{' before class body."); 
  consume(TOKEN_RIGHT_BRACE, "Expect '}' after class body."); 
}

compiler.c , add after function ()

Immediately after the class keyword is the class’s name. We take that identifier and add it to the surrounding function’s constant table as a string. As you just saw, printing a class shows its name, so the compiler needs to stuff the name string somewhere that the runtime can find. The constant table is the way to do that.

The class’s name is also used to bind the class object to a variable of the same name. So we declare a variable with that identifier right after consuming its token.

Next, we emit a new instruction to actually create the class object at runtime. That instruction takes the constant table index of the class’s name as an operand.

After that, but before compiling the body of the class, we define the variable for the class’s name. Declaring the variable adds it to the scope but recall froma previous chapter that we can’t use the variable until it’s defined . For classes, we define the variable before the body. That way, users can refer to the containing class inside the bodies of methods. That’s useful for things like factory methods.

Finally, we compile the body. We don’t have methods yet, so right now it’s simply an empty pair of braces. Lox doesn’t require fields to be declared in the class, so we’re done with the body and the parser for now.

The compiler is emitting a new instruction, so let’s define that:

  OP_RETURN,

chunk.h

in enum OpCode

  OP_CLASS,
} OpCode;

chunk.h , in enum OpCode

And add it to the disassembler:

    case OP_RETURN:                                         
      return simpleInstruction("OP_RETURN", offset);

debug.c

in disassembleInstruction ()

    case OP_CLASS:                                          
      return constantInstruction("OP_CLASS", chunk, offset);
    default:

debug.c , in disassembleInstruction ()

For such a large-seeming feature, the interpreter support is easy:

        frame = &vm.frames[vm.frameCount - 1]; 
        break;                                 
      }

vm.c

in run ()

      case OP_CLASS:                           
        push(OBJ_VAL(newClass(READ_STRING())));
        break;
    }

vm.c , in run ()

We load the string for the class’s name from the constant table and pass that to newClass() . That creates a new class object with the given name. We push that onto the stack and we’re good. If the class is bound to a global variable, then the compiler’s call to defineVariable() will emit code to store that object from the stack into the global variable table. Otherwise, it’s right where it needs to be on the stack for a new local variable.

There you have it, our VM supports classes now. You can run this:

class Brioche {}
print Brioche;

Unfortunately, printing is about all you can do with classes, so next is making them more useful.

27 . 3 Instances of Classes

Classes serve two main purposes in a language:

  1. They are how you create new instances.Sometimes this involves a new keyword, other times it’s a method call on the class object, but you usually mention the class by name somehow to get a new instance.

  2. They contain methods.These define how all instances of the class behave.

We won’t get to methods until the next chapter, so for now we will only worry about the first part. Before classes can create instances, we need a representation for them:

} ObjClass;

object.h

add after struct ObjClass

typedef struct {                    
  Obj obj;                          
  ObjClass* klass;                  
  Table fields; ​
} ObjInstance;
ObjClass* newClass(ObjString* name);

object.h , add after struct ObjClass

Each instance has a pointer to the class that it is an instance of. Instances know their class. We won’t use this much in this chapter, but it will become critical when we add methods.

More important to this chapter is how instances store their state. Lox lets users freely add fields to an instance at runtime. This means we need a storage mechanism that can grow. We could use a dynamic array, but we also want to look up fields by name as quickly as possible. There’s a data structure that’s just perfect for quickly accessing a set of values by name and even more convenient we’ve already implemented it. Each instance stores its fields using a hash table.

We only need to add an include and we’ve got it:

#include "chunk.h"

object.h

#include "table.h"
#include "value.h"

object.h

This new struct gets a new object type:

  OBJ_FUNCTION,

object.h

in enum ObjType

  OBJ_INSTANCE,
  OBJ_NATIVE,

object.h , in enum ObjType

I want to slow down a bit here because Lox’s notion of “type” and the VM’s implementation notion of type brush against each other in ways that can be confusing. Inside the C code that makes clox, there are a number of different types of Obj ObjString, ObjClosure, etc. Each has its own internal representation and semantics.

In the Lox language , users can define their own classes say Cake and Pie and then create instances of those classes. From the user’s perspective, an instance of Cake is a different “type” of object than an instance of Pie. But, from the VM’s perspective every class the user defines is simply another value of type ObjClass. Likewise, each instance in the user’s program, no matter what class it is an instance of, is an ObjInstance. That one VM object type covers instances of all classes. The two worlds map to each other something like this:

n2qIZvy.png!web

Got it? OK, back to the implementation. We also get our usual macros:

#define IS_FUNCTION(value)      isObjType(value, OBJ_FUNCTION)

object.h

#define IS_INSTANCE(value)      isObjType(value, OBJ_INSTANCE)
#define IS_NATIVE(value)        isObjType(value, OBJ_NATIVE)  

object.h

And:

#define AS_FUNCTION(value)      ((ObjFunction*)AS_OBJ(value))          

object.h

#define AS_INSTANCE(value)      ((ObjInstance*)AS_OBJ(value))          
#define AS_NATIVE(value)        (((ObjNative*)AS_OBJ(value))->function)

object.h

Since fields are added after the instance is created, the “constructor” function only needs to know the class:

ObjFunction* newFunction();

object.h

add after newFunction ()

ObjInstance* newInstance(ObjClass* klass);
ObjNative* newNative(NativeFn function);

object.h , add after newFunction ()

We implement that function here:

object.c

add after newFunction ()

ObjInstance* newInstance(ObjClass* klass) {                       
  ObjInstance* instance = ALLOCATE_OBJ(ObjInstance, OBJ_INSTANCE);
  instance->klass = klass;                                        
  initTable(&instance->fields);                                   
  return instance;                                                
}

object.c , add after newFunction ()

We store a reference to the instance’s class. Then we initialize the field table to an empty hash table. A new baby object is born!

At the sadder end of the instance’s lifespan, it gets freed:

      FREE(ObjFunction, object);                   
      break;                                       
    }                                              

memory.c

in freeObject ()

    case OBJ_INSTANCE: {                           
      ObjInstance* instance = (ObjInstance*)object;
      freeTable(&instance->fields);                
      FREE(ObjInstance, object);                   
      break;                                       
    }                                              

    case OBJ_NATIVE:

memory.c , in freeObject ()

The instance owns its field table so when freeing the instance, we also free the table.

Speaking of memory management, we also need to support instances in the garbage collector:

      markArray(&function->chunk.constants);       
      break;                                       
    }                                              

memory.c

in blackenObject ()

    case OBJ_INSTANCE: {                           
      ObjInstance* instance = (ObjInstance*)object;
      markObject((Obj*)instance->klass);           
      markTable(&instance->fields);                
      break;                                       
    }                                              

    case OBJ_UPVALUE:

memory.c , in blackenObject ()

If the instance is alive, we need to keep its class around. Also, we need to keep every object referenced by the instance’s fields. Most live objects that are not roots are reachable because some instance references the object in a field. Fortunately, we already have a nice markTable() function to make tracing them easy.

Less critical but still important is printing:

    case OBJ_FUNCTION:                                              
      printFunction(AS_FUNCTION(value));                            
      break;

object.c

in printObject ()

    case OBJ_INSTANCE:                                              
      printf("%s instance", AS_INSTANCE(value)->klass->name->chars);
      break;
    case OBJ_NATIVE:

object.c , in printObject ()

An instance prints its name followed by “instance”. (The “instance” part is mainly so that classes and instances don’t print the same.)

The real fun happens over in the interpreter. Lox has no special new keyword. The way to create an instance of a class is to invoke the class itself as if it were a function. The runtime already supports function calls and it checks the type of object being called to make sure the user doesn’t try to invoke a number or other invalid type.

We extend that with a new valid case:

    switch (OBJ_TYPE(callee)) {

vm.c

in callValue ()

      case OBJ_CLASS: {                                          
        ObjClass* klass = AS_CLASS(callee);                      
        vm.stackTop[-argCount - 1] = OBJ_VAL(newInstance(klass));
        return true;                                             
      }
      case OBJ_CLOSURE:

vm.c , in callValue ()

If the value being called the object that results when evaluating the expression to the left of the opening parenthesis is a class, then we treat it as a constructor call. We create a new instance of the called class and store the result in the stack.

We’re one step farther. Now we can define classes and create instances of them:

class Brioche {}
print Brioche();

Note the parentheses after Brioche on the second line now. This prints “Brioche instance”.

27 . 4 Get and Set Expressions

Our object representation for instances can already store state, so all that remains is exposing that functionality to the user. Fields are accessed and modified using get and set expressions. Not one to break with tradition, Lox uses the classic “dot” syntax:

eclair.filling = "pastry creme";
print eclair.filling;

The period full stop for my English friends works sort of like an infix operator. There is an expression to the left that is evaluated first and produces an instance. After that is the . followed by a field name. Since there is a preceding operand, we hook this into the parse table as an infix expression:

  { NULL,     NULL,    PREC_NONE },       // TOKEN_COMMA

compiler.c

replace 1 line

  { NULL,     dot,     PREC_CALL },       // TOKEN_DOT  
  { unary,    binary,  PREC_TERM },       // TOKEN_MINUS

compiler.c , replace 1 line

As in other languages, the . operator binds tightly, with precedence as high as the parentheses in a function call. After the parser consumes the dot token, it dispatches to:

compiler.c

add after call ()

static void dot(bool canAssign) {                              
  consume(TOKEN_IDENTIFIER, "Expect property name after '.'.");
  uint8_t name = identifierConstant(&parser.previous);

  if (canAssign && match(TOKEN_EQUAL)) {                       
    expression();                                              
    emitBytes(OP_SET_PROPERTY, name);                          
  } else {                                                     
    emitBytes(OP_GET_PROPERTY, name);                          
  }                                                            
}

compiler.c , add after call ()

The parser expects to find a property name immediately after the dot. We load that token’s lexeme into the constant table as a string so that the name is available at runtime.

We have two new expression forms getters and setters that this one function handles. If we see an equals sign after the field name, it must be a set expression that is assigning to a field. But we don’t always allow an equals sign after the field to be compiled. Consider:

a + b.c = 3

This is syntactically invalid according to Lox’s grammar, which means our Lox implementation is obligated to detect and report the error. If dot() silently parsed the = 3 part, we would incorrectly interpret the code as if the user had written:

a + (b.c = 3)

The problem is that the = side of a set expression has much lower precedence than the . part. The parser may call dot() in a context that is too high precedence to permit a setter to appear. To avoid incorrectly allowing that, we only parse and compile the equals part when canAssign is true. If an equals token appears when canAssign is false, dot() leaves it alone and returns. In that case, the compiler will eventually unwind up to parsePrecedence() which stops at the unexpected = still sitting as the next token and reports an error.

If we found an = in a context where it is allowed, then we compile the right-hand expression being stored in the field. After that, we emit a new OP_SET_PROPERTY instruction. That takes a single operand for the index of the property name in the constant table. If we didn’t compile a set expression, we assume it’s a getter and emit an OP_GET_PROPERTY instruction, which also takes an operand for the property name.

Now is a good time to define these two new instructions:

  OP_SET_UPVALUE,

chunk.h

in enum OpCode

  OP_GET_PROPERTY,
  OP_SET_PROPERTY,
  OP_EQUAL,

chunk.h , in enum OpCode

And add support for disassembling them:

      return byteInstruction("OP_SET_UPVALUE", chunk, offset);

debug.c

in disassembleInstruction ()

    case OP_GET_PROPERTY:                                          
      return constantInstruction("OP_GET_PROPERTY", chunk, offset);
    case OP_SET_PROPERTY:                                          
      return constantInstruction("OP_SET_PROPERTY", chunk, offset);
    case OP_EQUAL:

debug.c , in disassembleInstruction ()

27 . 4 . 1 Interpreting getter and setter expressions

Sliding over to the runtime, we’ll start with get expressions since those are a little simpler:

        *frame->closure->upvalues[slot]->location = peek(0);
        break;                                              
      }

vm.c

in run ()

      case OP_GET_PROPERTY: {                               
        ObjInstance* instance = AS_INSTANCE(peek(0));       
        ObjString* name = READ_STRING();

        Value value;                                        
        if (tableGet(&instance->fields, name, &value)) {    
          pop(); // Instance.                               
          push(value);                                      
          break;                                            
        }                                                   
      }
      case OP_EQUAL: {

vm.c , in run ()

When the interpreter reaches this instruction, the expression to the left of the . has already been executed and the resulting instance is on top of the stack. We read the field name from the constant pool and look it up in the instance’s field table. If the hash table contains an entry with that name, we pop the instance and push the entry’s value as the result.

Of course, the field might not exist. In Lox, we’ve defined that to be a runtime error. So we add a check for that and abort if it happens:

          push(value);                                        
          break;                                              
        }

vm.c

in run ()

        runtimeError("Undefined property '%s'.", name->chars);
        return INTERPRET_RUNTIME_ERROR;
      }

      case OP_EQUAL: {

vm.c , in run ()

There is another failure mode to handle which you’ve probably noticed. The above code assumes the expression to the left of the dot did evaluate to an ObjInstance. But there’s nothing preventing a user from writing:

var obj = "not an instance";
print obj.field;

The user’s program is wrong, but the VM still has to handle it with some grace. Right now, it will misinterpret the bits of the ObjString as an ObjInstance and, I don’t know, catch on fire or something definitely not graceful.

In Lox, only instances are allowed to have fields. You can’t stuff a field onto a string or number. So we need to check that the value is an instance before accessing any fields on it:

      case OP_GET_PROPERTY: {

vm.c

in run ()

        if (!IS_INSTANCE(peek(0))) {                      
          runtimeError("Only instances have properties.");
          return INTERPRET_RUNTIME_ERROR;                 
        }                                                 

        ObjInstance* instance = AS_INSTANCE(peek(0));

vm.c , in run ()

If the value on the stack isn’t an instance, we report a runtime error and safely exit.

Of course, get expressions are not very useful when no instances have any fields. For that we need setters:

        runtimeError("Undefined property '%s'.", name->chars);
        return INTERPRET_RUNTIME_ERROR;                       
      }

vm.c

in run ()

      case OP_SET_PROPERTY: {                                 
        ObjInstance* instance = AS_INSTANCE(peek(1));         
        tableSet(&instance->fields, READ_STRING(), peek(0));

        Value value = pop();                                  
        pop();                                                
        push(value);                                          
        break;                                                
      }
      case OP_EQUAL: {

vm.c , in run ()

This is a little more complex than OP_GET_PROPERTY . On top of the stack, we first have the instance whose field is being set and then above that is the value to be stored in the instance’s field. Like before, we read the instruction’s operand and find the field name string. Using that, we store the value on top of the stack into the instance’s field table.

After that is a little stack juggling. We pop the stored value off, then pop the instance, and finally push the value back on. In other words, we remove the second element from the stack while leaving the top alone. A setter is itself an expression whose result is the assigned value, so we need to leave that value on the stack. Here’s what I mean:

class Toast {}
var toast = Toast();
print toast.jam = "grape"; // Prints "grape".

Unlike when reading a field, we don’t need to worry about the hash table not containing the field. A setter implicitly creates the field if needed. We do need to handle the user incorrectly trying to store a field on a value that isn’t an instance:

      case OP_SET_PROPERTY: {

vm.c

in run ()

        if (!IS_INSTANCE(peek(1))) {                  
          runtimeError("Only instances have fields.");
          return INTERPRET_RUNTIME_ERROR;             
        }                                             

        ObjInstance* instance = AS_INSTANCE(peek(1));

vm.c , in run ()

Exactly like with get expressions, we check the value’s type and report a runtime error if it’s invalid. And, with that, the stateful side of Lox’s support for object-oriented programming is in place. Give it a try:

class Pair {}

var pair = Pair();
pair.first = 1;
pair.second = 2;
print pair.first + pair.second; // 3.

This doesn’t really feel very object -oriented. It’s more like a strange dynamically-typed variant of C where objects are loose struct-like bags of data. Sort of a dynamic procedural language. But this is a big step in expressiveness. Our Lox implementation now lets users freely aggregate data into bigger units. In the next chapter, we will breathe life into those inert blobs.

  1. Trying to access a non-existent field on an object immediately aborts the entire VM. The user has no way to recover from this runtime error, nor is there any way to see if a field exists before trying to access it. It’s up to the user to ensure on their own that only valid fields are read.

    How do other dynamically-typed languages handle missing fields? What do you think Lox should do? Implement your solution.

  2. Fields are accessed at runtime by their string name. But that name must always appear directly in the source code as an identifier token . A user program cannot imperatively build a string value and then use that as the name of a field. Do you think they should be able to? Devise a language feature that enables that and implement it.

  3. Conversely, Lox offers no way to remove a field from an instance. You can set a field’s value to nil , but the entry in the hash table is still there. How do other languages handle this? Choose and implement a strategy for Lox.

  4. Because fields are accessed by name at runtime, working with instance state is slow. It’s technically a constant-time operation thanks, hash tables but the constant factors are relatively large. This is a major component of why dynamic languages are slower than statically-typed ones.

    How do sophisticated implementations of dynamically-typed languages cope with and optimize this?


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK