LinuxQuestions.org - [SOLVED] Objects and assignment in interpreter

Page 1 of 9

Show 50 post(s) from this thread on one page

- Programming (https://www.linuxquestions.org/questions/programming-9/)

- - Objects and assignment in interpreter (https://www.linuxquestions.org/questions/programming-9/objects-and-assignment-in-interpreter-861721/)

MTK358

02-09-2011 07:37 PM

Objects and assignment in interpreter

I have it so that an Env object contains all the variables in the current scope, and the parent scope. You can also save them to variables like objects, and my idea is to use the "." operator to get values from them.

I have 2 issues about this:

First, how do I assign to them? I currently have it that only a variable name can be assigned to, but you should be able to assign to an "obj.value" expression, too. How do I keep track of what variable to set and still be able to say it in an expression and have it evaluate to the variable?

Second, these object don't really have a "type", they're just containers that contain any values you want under any name you want. How can I, for example, define a primitive "boolean" object and have things like if statements recognize it?

Nominal Animal

02-09-2011 09:52 PM

Quote:

Originally Posted by MTK358 (Post 4253614)

First, how do I assign to them? I currently have it that only a variable name can be assigned to, but you should be able to assign to an "obj.value" expression, too. How do I keep track of what variable to set and still be able to say it in an expression and have it evaluate to the variable?

The compilers and interpreters I know call assignable items as lvalues. Basically, you need to have a way to convert the expression, be it a variable or an object field, to an assignment target -- either the storage object, or to the setter method.

In practice, your parser needs to proceed one part a time: first "obj", then "value". When it first sees "obj", it much search the local scope to see if there is an object of that name, and if not, if the global scope has one. If not, then it is a new object, and you create it in the local scope.

When the parser finds the object the string refers to, it checks if the expression refers to the object, or some subobject. In your example, at "obj", the parser has already found the "obj" object, but there is an additional token "value", after the subobject operator. So, the parser must check all the subobjects of the current object to see if any of them are named "value".

Quote:

Originally Posted by MTK358 (Post 4253614)

Second, these object don't really have a "type", they're just containers that contain any values you want under any name you want. How can I, for example, define a primitive "boolean" object and have things like if statements recognize it?

For each primitive type you need a function that can take any container, and decide the value represented by the container. You also need an evaluator function for each statement type (if clause, trigraph, loop test). When your parser parses a statement, it of course recognizes the statement type, and gleans the references to the objects used in the statement. The parser will then call the evaluator function for that statement. The evaluator function will use the primitive functions to decide what the values in the containers mean.

Have you taken a look at the Python source code? It is very similar in its approach to variable and object storage and reference methods. If you are conversant in C, I'd recommend looking at the Extending and Embedding section, to see how Python developers solved these problems -- even if you yourself decide to do it another way.

I'm not sure I'm making any sense, though.

Nominal Animal

MTK358

02-10-2011 04:59 AM

One thing I'm confused about is that you're talking about the parser as if it actually interprets the code. I thought it should just build the AST and then the interpretation will be done in a completely different function.

Quote:

Originally Posted by Nominal Animal (Post 4253670)

I kind of get it now, even though I'm still a bit confused about the actual details.

Quote:

Originally Posted by Nominal Animal (Post 4253670)

For each primitive type you need a function that can take any container, and decide the value represented by the container.

So I should have a function that takes an object as a parameter and returns the type it represents?

Again, objects in my interpreter are actually scopes, they're the same thing. And they have no type, they're basically nesting associative arrays. If you think that's a bad idea and there's a better solution, you can tell me.

Nominal Animal

02-10-2011 04:22 PM

Quote:

Originally Posted by MTK358 (Post 4253932)

Sure, that is the standard way. I'm sorry, I meant "parser" in the general sense (as in lexer + parser + abstract tree + evaluator). Let's restart.

Quote:

Originally Posted by MTK358 (Post 4253932)

I kind of get it now, even though I'm still a bit confused about the actual details.

I'm quite confused, too. Why don't you tell us which bits you already have working, or already thought out at least? What language are you programming your interpreter in? I may not have any intelligent input for you, but I am very interested!

Quote:

Originally Posted by MTK358 (Post 4253932)

So I should have a function that takes an object as a parameter and returns the type it represents?

No, I meant that for every basic native type -- boolean, integer, float/real, string -- you should have a separate conversion function. It must be able to take any kind of an object, and return the machine representation of the value (or an error). For example, the boolean conversion function would take an object, and return a machine boolean.

Consider an if clause. In your AST, you have an if-statement node, with two or three child AST nodes: expression, true-statement, and optionally false-statement. Correct?

Your if statement evaluator function will call the boolean conversion function on the expression AST node, to determine which statement to execute. If the expression node is a leaf, the function will just peek inside the referred machine object, and then return true or false, depending on the contents.

If the expression is more complex, your conversion function will notice that the AST node is not a leaf. It must be a logical operator, with one or more child AST nodes. So, it calls the logical operator (say AND) evaluator function, giving it the child AST nodes as parameters. (I assume your parser has already taken care of the operator precedence, so the expression is a correct AST tree, and not just a chain of operations.)

The logical operator (AND) evaluator function will call the boolean conversion function on the first parameter. If it evaluates to FALSE, it will return machine FALSE. (These are the standard short circuit rules.) Otherwise it will call the conversion function on the second parameter, and return whatever it returns.

In this fashion, the evaluation goes on recursively, and the conversion functions peek at the machine objects. Program control statements are evaluated using machine types.

Okay, now back on track. If I've understood you correctly, you are having a problem of associating a small AST tree for "obj.value" ("obj" AST node which has a "." child AST node and a "value" grandchild AST node) to the machine object it represents.

In this case your assignment target evaluator function must first resolve the "obj" AST node to the proper object -- I think you already have this for standard variables. Then it must notice that it is not a leaf node, but has a "." child; therefore it must resolve the child node as an object, from within the "obj" object, and use that as the target. Note that this assumes your lexer and parser recognize and understand the "." as an operator.

In practice, each of your variable and object properties or fields or methods, must have a corresponding machine objects, with a recognizable label the evaluator functions can find. Python uses a single mechanism for all variables, fields and methods. It has two main objects, globals, and local. A new search will start at the local object. If not found, it'll check the global object. All further searches ("." operators) are limited to the current object. In C, your object reference resolver could look something like this:

Code:

typedef struct object object_t;

struct object {

    label_t name;

    size_t children;

    struct object *child;

    /* ... */

};





int label_match(const label_t *const name1, const label_t *const name2)

{

    /* Return nonzero if name1 == name2

    */

}





/* Internal function: resolves name in base_object to an object, or returns NULL.

*/

object_t *do_resolve(const label_t *const name,

                    object_t *base_object)

{

    size_t i = base_object->children;



    /* Search for the label in the given scope.

    */

    while (i--)

        if (label_match(name, base_object->child[i]->label))

            return base_object->child[i];



    return NULL;

}





/* Name-to-object resolver. Types are irrelevant.

 * Will return NULL if no match found.

*/

object_t *resolve(const label_t *const name,

                  object_t *current,

                  object_t *local,

                  object_t *global)

{

    /* Search for the label in the current scope.

    */

    if (current)

        return do_resolve(name, current);



    /* Search for the label in the local scope.

    */

    if (local) {

        current = do_resolve(name, local);

        if (current)

            return current;

    }



    /* The global scope is the last resort.

    */

    return do_resolve(name, global);

}

Note that hash tables are much more efficient. Also, you can use the above resolver for all object references, not just assignments. The current is initially NULL for each expression and statement. This will work correctly for nested references too (object.child.grandchild), and you will find that arrays and indexing is very easy to implement on top of this. Well, actually you could just use object.1 and so on, if you want!

Quote:

Originally Posted by MTK358 (Post 4253932)

Again, objects in my interpreter are actually scopes, they're the same thing. And they have no type, they're basically nesting associative arrays. If you think that's a bad idea and there's a better solution, you can tell me.

No, that sounds exactly like it's done in Python, and I'd do it that way myself. Lacking a type is no problem, as long as you can derive a machine value out of any object somehow.

One thing I'd do, for embedding reasons, is to make sure it is possible to instantiate multiple interpreters in the same process and thread. In other words, that the entire interpreter state is encapsulated in a single structure supplied to the main interpreter function. This is extremely useful when an application fires up scripts to handle certain events: the event handler scripts can then each be a separate instance. It would be especially if used as an Apache module: for each request, the worker process could clone the interpreter instance from a preconfigured, prepared state (perhaps even having run an initialization script for that subtree at configuration time). After processing the request, the instance is just thrown away. That would make for an extremely robust and effective scripting language for web pages..

Waiting to hear more,

Nominal Animal

MTK358

02-10-2011 04:33 PM

Quote:

Originally Posted by Nominal Animal (Post 4254503)

The problem is that I can't really choose a language. Python is easy but too slow (who would want an interpreted interpreted language?), I don't like C++ because of its' many confusing quirks, and C (which I would prefer) has no OOP functionality that would make things much easier.

Quote:

Originally Posted by Nominal Animal (Post 4254503)

OK, that sounds good in theory, but I'm not sure how that would be done.

Quote:

The logical operator (AND) evaluator function will call the boolean conversion function on the first parameter. If it evaluates to FALSE, it will return machine FALSE. (These are the standard short circuit rules.) Otherwise it will call the conversion function on the second parameter, and return whatever it returns.

OK. Just note that there aren't currently really any "operators", just functions stored as values in the object.

Operators, when implemented, would probably just call predefined functions on the object behind the scenes. That way, you can even use the operators for your own objects, not just the built-in ones.

Quote:

("obj" AST node which has a "." child AST node and a "value" grandchild AST node) to the machine object it represents.

I thought that there can only be a "." AST node with object and value leaves. I don't see how it can be done otherwise.

Nominal Animal

02-10-2011 09:03 PM

Quote:

Originally Posted by MTK358 (Post 4254514)

I've found C to be quite object-friendly when using C99. For example,

Code:

typedef enum objtype  objtype_t;

enum objtype {

    OBJECT_UNKNOWN = 0,

    OBJECT_GENERIC = 1,

    OBJECT_STRING,

    OBJECT_FLOAT,

    OBJECT_INTEGER,

    OBJECT_BOOLEAN,

};



typedef struct string  string_t;

struct string {

    size_t            size;

    size_t            used;

    unsigned char      data[];

};



typedef struct hash    hash_t;

struct hash {

    size_t            size; /* Of hashed object */

    uint64_t          hash;

};



typedef struct object  object_t;

struct object {

    object_t          *parent;

    string_t          *name;

    size_t            max_children;

    size_t            children;

    object_t          *child;

    hash_t            *child_name;

    objtype_t          type;

    /* Payload. Customize? */

    size_t            size;

    unsigned char      data[];

};

When you create an object, you first determine how much private data it needs, then allocate a large enough object_t, and put the private data in the data field. The easiest way is to alias it with a structure tailored to the object subtype.

For example, a generic object matches exactly your scope concept. It doesn't need to have any data, just child objects, and they can have any format. Since each has their own name, you already have a basic associative list construct. If you keep the child array as a binary tree, you can locate any child in logarithmic time. If you also use the child_hash array with e.g. crc32's of the name to be looked up, you can do extremely large lists/scopes efficiently. A hash table is constant-time, and thus probably even faster, but I find rehashing tedious. A tight binary tree as an array is very, very fast.

There are other, more interesting features you can add. For example, instead of embedding the data like above, you can use pointers and reference counting to save memory. Nested object structure is already very useful for garbage collection; whenever you exit the current scope, just free the current scope object and all objects it contains.

As to the conversion functions, consider this as an example:

Code:

/* 1: True, 0: False.

*/

int object_to_boolean(const object_t *const object)

{

    if (!object)

        return 0;



    switch (object->type) {



    case OBJECT_GENERIC:

        /* Generic objects are TRUE if they contain children,

        * false otherwise.

        */

        return (object->children > 0) ? 1 : 0;



    case OBJECT_STRING:

        /* String is TRUE if it is non-empty.

        * Assume string contents are in the payload as-is.

        */

        return (object->size > 0) ? 1 : 0;



    case OBJECT_FLOAT:

        /* Float has a single double value as a payload.

        */

        return (*(double *)(&object->data) == 0.0) ? 1 : 0;



    case OBJECT_INTEGER:

        /* Integer has a 64-bit integer value as a payload.

        */

        return (*(int64_t *)(&object->data) == (int64_t)0) ? 1 : 0;



    default:

        /* All other objects are considered FALSE for now.

        */

        return 0;

    }

}

I suspect that your functions and methods should be written so that they always return a pointer to a newly allocated object_t, and have a special built-in object type, say exception, to signal and propagate errors. An assignment operator then either copies, or replaces the object contents with the resolved target.

Quote:

Originally Posted by MTK358 (Post 4254514)

I thought that there can only be a "." AST node with object and value leaves. I don't see how it can be done otherwise.

Define it as a suffix operator. The parent "object" is then the AST node, "." operator is either the child or the right sibling, and "value" is a child to the operator node.

So, exactly what use case you have in mind for your interpreter?

Nominal Animal

MTK358

02-11-2011 08:47 AM

Quote:

Originally Posted by Nominal Animal (Post 4254649)

I've found C to be quite object-friendly when using C99. For example,

Code:

typedef enum objtype  objtype_t;

enum objtype {

    OBJECT_UNKNOWN = 0,

    OBJECT_GENERIC = 1,

    OBJECT_STRING,

    OBJECT_FLOAT,

    OBJECT_INTEGER,

    OBJECT_BOOLEAN,

};



typedef struct string  string_t;

struct string {

    size_t            size;

    size_t            used;

    unsigned char      data[];

};



typedef struct hash    hash_t;

struct hash {

    size_t            size; /* Of hashed object */

    uint64_t          hash;

};



typedef struct object  object_t;

struct object {

    object_t          *parent;

    string_t          *name;

    size_t            max_children;

    size_t            children;

    object_t          *child;

    hash_t            *child_name;

    objtype_t          type;

    /* Payload. Customize? */

    size_t            size;

    unsigned char      data[];

};

I don't understand this. I do get the objtype part, which seems like a good idea, and the string makes sense (although I don't get what it's for). But what's the hash struct, what does it do?

And I don't understand some of object's members:

What's the idea of the object storing its name? It could have many different names at once since it can be stored in variables and copied.

Why is there a limit on the number of children?

Why is the child member an object_t*, not and object_t**? How can you ensure that the children will be created one after another in an array, and not allocated in random parts of the computer's memory? Seems like you need an array of pointers to the children.

And finally, what's the hash for?

Quote:

whenever you exit the current scope, just free the current scope object and all objects it contains.

But the current scope could have been saved in a variable by the time it exits, so just destroying its contents would also ruin the purpose of the variable.

I'm considering using this (link) to make things easier (just write GC_malloc instead of malloc, GC_realloc instead of realloc, and forget about freeing things). Or will it cause serious performance issues?

Quote:

OK, but how will the type of the exception be tracked, since user defined exceptions will be just generic objects? Maybe there should be one exception type, that will have a name and message, and the user can add more data to it just by creating an instance of the built-in exception type and adding user-defined data usign the "." operator.

And since my idea is that all functions are anonymous, how would you implement it in C, where functions can't be nested and only have predefined global names? I guess you can define a struct that has the parameter names and a pointer to an AST node, but what about functions predefined in C?

Quote:

Define it as a suffix operator. The parent "object" is then the AST node, "." operator is either the child or the right sibling, and "value" is a child to the operator node.

I still don't get it. How can an operator that goes between two things be a suffix operator? This is how I see it:

Code:

expr: lvalue



lvalue: ID | expr DOT ID

Nominal Animal

02-11-2011 12:32 PM

Quote:

Originally Posted by MTK358 (Post 4255125)

But what's the hash struct, what does it do?

It is there only if you want to speed up child object lookups. It's not used by the example code at all. If you don't use a hash, you need to compare the string names, which is much slower if you have many child objects.

Quote:

Originally Posted by MTK358 (Post 4255125)

What's the idea of the object storing its name? It could have many different names at once since it can be stored in variables and copied.

In that case, you need a different object model. Consider the previous object type as a variable. The payload of the object should then point to the actual reference-counted data storage.

Quote:

Originally Posted by MTK358 (Post 4255125)

Why is there a limit on the number of children?

It's just the current maximum number of child object pointers the child array can take. If it gets full, you can always reallocate and increase the maximum.

Quote:

Originally Posted by MTK358 (Post 4255125)

Why is the child member an object_t*, not and object_t**?

Good catch, it should be object_t **child; or object_t *child[]; (no real difference, I prefer the latter for clarity).

Quote:

Originally Posted by MTK358 (Post 4255125)

And finally, what's the hash for?

Only to allow you to speed up child object lookups. It's optional.

Quote:

Originally Posted by MTK358 (Post 4255125)

But the current scope could have been saved in a variable by the time it exits, so just destroying its contents would also ruin the purpose of the variable.

If you use reference counting, that won't be a problem.

Here's some example code for reference counting:

Code:

typedef struct storage storage_t;

struct storage {

    struct storage *chain;

    int32_t        references;

    enum datatype  type;

    size_t          size;

    unsigned char  data[];

};



pthread_rwlock_t storage_access;

storage_t *garbage = NULL;



static inline void start_using_storage(void)

{

    if (pthread_rwlock_rdlock(&storage_access)) {

        /* Serious error condition! Stop execution! */

    }

}



static inline void finish_using_storage(void)

{

    if (pthread_rwlock_unlock(&storage_access)) {

        /* Serious error condition! Stop execution! */

    }

}



void collect_garbage(void)

{

    storage_t *next;



    if (thread_rwlock_wrlock(&storage_access)) {

        /* Serious error condition! Stop execution! */

    }



    next = __sync_fetch_and_and(&garbage, NULL);



    if (thread_rwlock_unlock(&storage_access)) {

        /* Serious error condition! Stop execution! */

    }



    while (next) {

        storage_t *const curr = next;

        next = next->next;



        if (__sync_fetch_and_or((int32_t *)&curr->references, (int32_t)0) > 0) {

            /* Serious error, a BUG. The reference count should not be able to grow. */

        }

        free(curr);

    }

}



static inline storage_t *new_storage(const enum datatype type, const size_t size)

{

    storage_t *storage;



    storage = malloc(sizeof(storage_t) + size);

    if (!storage)

        return NULL;



    storage->references = 1;

    storage->type = type;

    storage->size = size;



    return storage;

}



static inline storage_t *storage_attach(storage_t *const storage)

{

    int32_t references;



    if (!storage)

        return NULL;



    references = __sync_add_and_fetch((int32_t *)&storage->references, (int32_t)1);

    if (references < 2) {

        __sync_sub_and_fetch((int32_t *)&storage->references, (int32_t)1);

        return NULL;

    }



    return storage;

}



static inline void storage_detach(storage_t *const storage)

{

    int32_t references;



    if (!storage)

        return;



    references = __sync_sub_and_fetch((int32_t *)&storage->references, (int32_t)1);

    if (references < 1) {

        storage_t *temp;



        storage->size = 0;

        storage->type = CONTENT_DELETED;



        /* Add storage to the garbage chain */

        do {

            temp = __sync_fetch_and_or(&garbage, NULL);

            storage->chain = temp;

        } while (!__sync_bool_compare_and_swap(&garbage, temp, storage));

    }



    return;

}

The above is thread-safe, because it uses the atomic built-ins (available in most C compilers) to increase and decrease the reference count and when moving the object to the garbage chain. Storage in the garbage chain can only be released when none of the threads have any temporary storage objects, so I added the storage_access rwlock (which needs to be initialized first, not shown above). Basically, whenever an interpreter thread starts handling storage objects, it must call start_using_storage(), and after it's done, call finish_using_storage(). Many threads can handle storage simultaneously, it just blocks the garbage collection. At regular intervals, a thread must call collect_garbage() to release the memory used by the garbage objects.

In this manner you destroy the objects whenever you like, as long as you use storage_new for allocating new storage, storage_attach when adding a reference to the storage, and content_detach when removing a reference to the storage (including when destroying an object instance). The data is only freed when there are no more references to it.

The garbage chain is needed to avoid referencing already free()d data; the actual freeing is only done when it is surely safe.

Quote:

Originally Posted by MTK358 (Post 4255125)

I'm considering using this (link) to make things easier (just write GC_malloc instead of malloc, GC_realloc instead of realloc, and forget about freeing things). Or will it cause serious performance issues?

I have heard of it, but haven't used it myself, so I don't really know.

Quote:

Originally Posted by MTK358 (Post 4255125)

Quite right. You could put the exception "type" (string) into the object storage, and any used-defined data as subobjects. That way the user can have an unlimited number of exception types.

Quote:

Originally Posted by MTK358 (Post 4255125)

And since my idea is that all functions are anonymous, how would you implement it in C, where functions can't be nested and only have predefined global names? I guess you can define a struct that has the parameter names and a pointer to an AST node, but what about functions predefined in C?

Umm, I'm not sure what you mean. In C, functions can be nested, and you can use a function pointer, so while they do have a name defined at compilation time, you can add many other names using a number of methods. Native C functions take the parameter object list, and return an object, so I don't see the problem. Your structure could have one pointer for AST node, and another for a native C function; one would always be NULL.

Quote:

Originally Posted by MTK358 (Post 4255125)

I still don't get it. How can an operator that goes between two things be a suffix operator?

I thought the AST would be easier if the operator was attached to the object, and the value would be a sibling to the object -- so the operator would not actually go between the two, it'd be part of the left side. But I think I was wrong, and a normal operator works fine. Lets see.

If "." is a normal operator, the AST for a.b is (.(a,b)), right?
And for a.b.c it is (.(a,(.(b,c))), right?
Okay, then this should work:

Code:

struct object {

    struct object *sibling;

    struct object *child;

    enum objtype  type;

    size_t        name_length;

    unsigned char  name[];

    /* Storage omitted */

};



struct node {

    struct node  *sibling;

    struct node  *child;

    enum nodetype type;

    size_t        name_length;

    unsigned char name[];

};



struct object *subobject(const unsigned char name[], const size_t name_length,

                        struct object *root)

{

    if (!root)

        return NULL;



    if (root->name_length == name_length && !strcmp(name, root->name))

        return root;



    if (!root->child)

        return NULL;



    root = root->child;

    while (root) {

        if (root->name_length == name_length && !strcmp(name, root->name))

            return root;

        root = root->sibling;

    }



    return NULL;

}



struct object *resolve_object(const struct node *const ast,

                              struct object *local_scope,

                              struct object *global_scope)

{

    struct object *obj, *parent;



    switch (ast->type) {

    case AST_GENERIC:

        /* This node just names a local or global object. */

        obj = subobject(ast->name, ast->name_length, local_scope);

        if (obj)

            return obj;

        return subobject(ast->name, ast->name_lengthm global_scope);



    case AST_DOT_OPERATOR:

        /* This node is a dot operator. */

        ast = ast->child;

        parent = subobject(ast->name, ast->name_length, local_scope);

        if (!parent)

            parent = subobject(ast->name, ast->name_length, global_scope);

        if (!parent)

            return NULL;

        return resolve_object(ast->sibling, parent, NULL);



    /* others not implemented */

    }



    return NULL;

}

Quote:

Originally Posted by MTK358 (Post 4255125)

This is how I see it:

Code:

expr: lvalue

lvalue: ID | expr DOT ID

Yeah. Except if we're pedantic, the left side of the dot operator, expr, does not need to be an lvalue, so maybe

Code:

reference: ID | reference DOT ID

lvalue: reference

would be a better way of putting it.

Nominal Animal

MTK358

02-11-2011 12:51 PM

Quote:

Originally Posted by Nominal Animal (Post 4255384)

It's just that the idea of having an array with huge gaps in it for every object would be incredibly wasteful of memory.

Quote:

In that case, you need a different object model. Consider the previous object type as a variable. The payload of the object should then point to the actual reference-counted data storage.

???

Quote:

I still don't understand how that would work. Do you mean that there would be a node that stored an object and the name of the member to get?

That wouldn't work because there is no object during parsing, just an expression node that will evaluate to the object diring evaluation, and then you might as well just have a '.' node with the object node and member name node.

Quote:

If "." is a normal operator, the AST for a.b is (.(a,b)), right?
And for a.b.c it is (.(a,(.(b,c))), right?
Okay, then this should work:

Exactly.

Quote:

Code:

struct object {

    struct object *sibling;

    struct object *child;

    enum objtype  type;

    size_t        name_length;

    unsigned char  name[];

    /* Storage omitted */

};

Again, why is the name stored here? What's the purpose of the sibling pointer?

Quote:

Code:

struct node {

    struct node  *sibling;

    struct node  *child;

    enum nodetype  type;

    size_t        name_length;

    unsigned char  name[];

};

What is the sibling pointer for? There is often more than one sibling. And what's the name for? Nodes don't have a name, the nodetype enum is enough.

Quote:

Yeah. Except if we're pedantic, the left side of the dot operator, expr, does not need to be an lvalue, so maybe

Code:

reference: ID | reference DOT ID

lvalue: reference

would be a better way of putting it.

I don't understand any of that.

In case that's what confused you, when I wrote "expr: lvalue" I didn't mean that an expr can only be an lvalue, I assumed that it would be clear that expr should also match operators, function calls, etc.

Nominal Animal

02-11-2011 02:14 PM

Quote:

Originally Posted by MTK358 (Post 4255396)

???

The AST nodes must contain the names of referred to things, or evaluation is impossible.

The object_t type matches a named thing; each named thing corresponds to exactly one object_t. The scope the name is valid in determined by the tree (global, local) and the position in the tree. The actual data stored in the object_t is in a separate structure. This separate structure counts any references to it by the variable instances, and is only destroyed when the count reaches zero.

When you copy an object, say by evaluating a = b; you find the object_t named b in the current scope. Then you create a new object_t, set its name to a, and copy the storage reference from b, increasing the storage reference count. Then you check if there is an existing object_t named a, and if yes, you replace it (destroying it, decreasing the storage reference count to whatever storage it uses) with the new one.

Quote:

Originally Posted by MTK358 (Post 4255396)

That wouldn't work because there is no object during parsing, just an expression node that will evaluate to the object diring evaluation, and then you might as well just have a '.' node with the object node and member name node.

Like I said above, the AST nodes which refer to object names must contain the name string in them, right?
The example code resolve_object(AST node, local scope object, global scope object) uses the names stored in the AST nodes and the variable objects (names of variable objects) to resolve both simple object names (like a) and dot operator references (like a.b.c).

It should work if the AST for a.b.c is (.(a,.(b,c))).

Quote:

Originally Posted by MTK358 (Post 4255396)

Again, why is the name stored here?

It's the variable (or object or property) name. The data is stored elsewhere.

Quote:

Originally Posted by MTK358 (Post 4255396)

What's the purpose of the sibling pointer?

Since you disliked the array, this object type can be chained (thorough the sibling pointer) to form lists.

Quote:

Originally Posted by MTK358 (Post 4255396)

What is the sibling pointer for?

For the AST node, it is the "very next" node. You cannot have more than one immediate sibling in the AST tree. How would you express such a tree? You'd need a third dimension to put parallel trees in. Perhaps we are using different names to these things.

Quote:

Originally Posted by MTK358 (Post 4255396)

And what's the name for? Nodes don't have a name, the nodetype enum is enough.

It's the AST node string payload. For operators it is always empty. For object/method/function/variable references, it contains the name string. For constants, it contains the constant value, e.g. the string for string constants. Right?

Quote:

Originally Posted by MTK358 (Post 4255396)

In case that's what confused you, when I wrote "expr: lvalue" I didn't mean that an expr can only be an lvalue, I assumed that it would be clear that expr should also match operators, function calls, etc.

Yeah, okay. I think we can get further if we start working with actual code, or at least specify which reference to base further theoretical stuff on. All my experience with compilers and interpreters is in the practical programming side; I'm weak on the theory.

Nominal Animal

MTK358

02-11-2011 02:55 PM

Quote:

Originally Posted by Nominal Animal (Post 4255458)

The AST nodes must contain the names of referred to things, or evaluation is impossible.

I thought you were talking about objects. Now I'm totally confused.

Anyway, I also wanted to mention that saved scopes aren't the only data type that will be stored in variables: functions will be, too.

I guess I need a data type struct, that would say whether it's an object or a function, and the object and function types will "inherit" the data type struct by placing the data type struct's members before the type-specific members. That way an object or function can be cast to a generic data type struct.

Quote:

Since you disliked the array, this object type can be chained (thorough the sibling pointer) to form lists.

My original idea was to use a pair of dynamic array objects, and to access a member you would get the index of the name in the keys array, and get the object from the values array at that index.

Quote:

What's the point of tree nodes to know their siblings? My idea was that each AST node has an array of pointers to its child nodes, and the type of the node would determine what they mean.

For example, if the node is an if/else node, the eval function would assume that it has 3 children: first the test, then the true-body, then the false-body.

Quote:

Oh, I didn't think of that. I was thinking to have completely different structs for nodes that had a payload and not just child nodes, but I like your idea better because it means less messing around with different data types.

Quote:

I'll try to write some code.

MTK358

02-11-2011 03:16 PM

I started writing code to evaluate the syntax tree:

Code:

typedef enum {

    NODETYPE_INTEGER,  // create Integer

    NODETYPE_FLOAT,    // create Float

    NODETYPE_STRING,    // create String

    NODETYPE_IF,        // evaluate only if condition true/false

    NODETYPE_WHILE,    // loop while condition true

    NODETYPE_DOWHILE,  // loop while condition true, eval body at least once

    NODETYPE_CALL,      // call function

    NODETYPE_GETMEMBER, // get member from env ('.' operator)

    NODETYPE_ASSIGN,    // assign value to lvalue ('=' operator)

    NODETYPE_FUNC,      // create anonymous function

} nodetype_t;



typedef struct node node_t;

struct node {

    nodetype_t type;

    int child_count;

    node_t *children[];

    char *payload;

};



object_t* eval(node_t *node, env_t *env)

{

    switch (node->type) {

    case NODETYPE_INTEGER:

        return // create integer object from node's payload

    case NODETYPE_IF:

        if object_to_boolean(eval(node->children[0], env))

            return eval(node->children[1], env);

        if (node->child_count == 3)

            return eval(node->children[2], env);

        return NULL;

    case NODETYPE_WHILE:

        object_t *value = NULL;

        while object_to_boolean(eval(node->children[0], env))

            value = eval(node->children[1], env);

        return value;

    case NODETYPE_DOWHILE:

        object_t *value;

        do {

            value = eval(node->children[1], env);

        } while object_to_boolean(eval(node->children[0], env));

        return value;

    }

}

But I'm stuck now because I still don't understand how to assign to lvalues, and I found another problem: how to store the parameter names of a defined function on the AST?

I started naming things that can be stored in variables "object"s, functions "function"s, and environments (that will serve as scopes and can be saved in variables) "env"s. If you have a better idea, tell me.

Nominal Animal

02-11-2011 05:33 PM

Okay. Let's assume you have this code:

Code:

print("Command: ", argv.0, "\n")

for (i = 1; i < argc; i++)

    print(i, ":", argv.i, "\n")

if (argc < 1)

    print("No parameters.\n")

You parse that into this AST, right?

Code:

  CALL:"print" ( STRING:"Command: ",

                GETMEMBER ( NAME:"argv",

                            NUMBER:"0" ),

                STRING:"\n" ),

  FORLOOP( ASSIGN( NAME:"i",

                  NUMBER:"1" ),

          LESSTHAN( NAME:"i",

                    NAME:"argc" ),

          POSTINCREMENT( NAME:"i" ),

  CALL:"print"( NAME:"i",

                STRING:":"

                GETMEMBER( NAME:"argv",

                          NAME:"i" ),

                STRING:"\n" ),

  IF( LESSTHAN( NAME:"argc",

                NUMBER:"1" ),

  CALL:"print"( STRING:"No parameters.\n" )

As an ASCII graphic using only a parent, next AKA sibling, and child pointers in each node, this is

Code:

CALL ─── FORLOOP ─── CALL ─── IF ─── CALL

 │        │          │        │      │

 │        │          │        │    STRING

 │        │          │        │

 │        │          │      LESSTHAN

 │        │          │        │

 │        │          │      NAME ─── NUMBER

 │        │          │

 │        │          NAME ─── STRING ─── GETMEMBER ─── STRING

 │        │                              │

 │        │                              NAME ─── NUMBER

 │        │

 │        ASSIGN ─── LESSTHAN ─── POSTINCREMENT

 │        │          │            │

 │        │          │          NAME

 │        │          │

 │        │          NAME ─── NUMBER

 │        │

 │        NAME ─── NUMBER

 │

STRING ─── GETMEMBER ─── STRING

            │

          NAME ─── NUMBER

I'd recommend expanding the node type names a bit, to help group them. Since a double can hold any integer value, I think it might be enough to have a single number type.

Code:

typedef enum {



    NODE_NAME, /* A string token */

    NODE_NUMBER, /* A numeric token */

    NODE_STRING, /* A quoted string constant */



    NODE_OP_ASSIGN, /* Assignment operator */

    NODE_OP_GETMEMBER, /* Member operator */

    NODE_OP_LESSTHAN, /* < operator */



    NODE_CALL, /* Function call */



    NODE_LOOP_FOR, /* For loop construct */



    NODE_COND_IF, /* If conditional */



} nodetype_t;

The AST node can be just

Code:

typedef struct node node_t;

struct node {

    struct node  *parent;  /* Parent node */

    struct node  *next;    /* Next (sibling) node */

    struct node  *child;    /* First child node */

    nodetype_t    type;

    size_t        length;  /* Token string length, usually 0 */

    unsigned char string[]; /* Token string, will have at least one zero (EOS) at end */

};

Usually string is empty and length zero, but for NODE_NAME, NODE_CALL, NODE_STRING and NODE_NUMBER, it must contain the parsed token as a string.
In case of NODE_STRING, it should not contain the double quotes.

Each of your objects, functions and envs require a name. You can treat them all as generic objects.
Assuming again a simple binary tree form, and not child arrays:

Code:

typedef struct object object_t;

struct object {

    struct object *parent;  /* Container object */

    struct object *next;    /* First sibling */

    struct object *child;  /* First child */

    struct data  *data;    /* Reference to the actual data */

    size_t        namelen; /* Name length */

    unsigned char  name[];  /* Name */

};

The actual object contents are referenced in a polymorphic data structure, pointed to by struct data *data above.
When used internally by your parser, you cast it to the proper subtype first.

Code:

/* This bit is common to all data, found in the 'common' member

*/

struct data_common {

    struct data *garbage;  /* Garbage collection chain, normally NULL */

    int32_t      references; /* Number of objects referring to this data */

    int32_t      type;      /* Actual object type */

};



/* Any data type can be cast to this base data type:

*/

struct data {

    struct data_common common;

    unsigned char      data[]; /* Some opaque data, size unknown */

};



/* This data object contains a single native number (a double):

*/

struct data_number {

    struct data_common common;

    double            value;

};



/* This data object is a fixed-size vector of numbers.

* You don't need nor want this, this is just an example.

*/

struct data_numbervector {

    struct data_common common;

    size_t            size;

    double            value[];

};



/* This contains one editable string (or binary data blob):

*/

struct data_string {

    struct data_common common;

    size_t            allocated;

    size_t            used;

    unsigned char    *length;

};



/* This describes a (possibly anonymous) function.

*/

struct data_function {

    struct data_common common;

    size_t          parameters;      /* Number of parameters */

    size_t        *parameter_length; /* Lengths of local variable names */

    unsigned char **parameter_name;  /* Names of local variables to map parameter objects to */

    struct object  *function_state;  /* Persistent local state? */

    struct node    *execute;          /* Node to start execution */

};



/* This describes a native C function.

 * The function expects the following parameters:

 * - interpreter state

 * - global scope object

 * - local scope object

 * - number of parameter objects, possibly zero

 * - array of parameter objects, NULL if no parameters

*/

struct data_builtin {

    struct data_common common;

    object_t (*function)(state_t *, object_t *, object_t *, size_t, object_t *);

};

With very small adaptations, my previous name resolver fits right in. Given the interpreter state (to get the garbage collection chain and rwlocks) and local and global state objects it should work for this just fine. For assignment operation, you only need to use the resolver to get the target object; it should work for both standard variables and object properties. An evaluator that can execute the tree above should not be difficult at all. Need example code?

Nominal Animal

MTK358

02-11-2011 07:18 PM

Quote:

Originally Posted by Nominal Animal (Post 4255596)

Okay. Let's assume you have this code:

Code:

  print("Command: ", argv.0, "\n")

  for (i = 1; i < argc; i++) print(i, ":", argv.i, "\n")

  if (argc < 1) print("No parameters.\n")

You parse that into this AST, right?

Code:

  CALL:"print" ( STRING:"Command: ",

                GETMEMBER ( NAME:"argv",

                            NUMBER:"0" ),

                STRING:"\n" ),

  FORLOOP( ASSIGN( NAME:"i",

                  NUMBER:"1" ),

          LESSTHAN( NAME:"i",

                    NAME:"argc" ),

          POSTINCREMENT( NAME:"i" ),

  CALL:"print"( NAME:"i",

                STRING:":"

                GETMEMBER( NAME:"argv",

                          NAME:"i" ),

                STRING:"\n" ),

  IF( LESSTHAN( NAME:"argc",

                NUMBER:"1" ),

  CALL:"print"( STRING:"No parameters.\n" )

Wrong. This language has completely first-calss functions, no different whatsoever from other data. They can be stored in varaibles, passed to functions, and expressions can even evaluate to them. You have to have an expression node to specify the function, not a name string.

There is no such thing here as functions with permanent, predefined names.

Quote:

I'd recommend expanding the node type names a bit, to help group them. Since a double can hold any integer value, I think it might be enough to have a single number type.

Code:

typedef enum {



    NODE_NAME,          /* A string token */

    NODE_NUMBER,        /* A numeric token */

    NODE_STRING,        /* A quoted string constant */



    NODE_OP_ASSIGN,    /* Assignment operator */

    NODE_OP_GETMEMBER,  /* Member operator */

    NODE_OP_LESSTHAN,  /* <  operator */



    NODE_CALL,          /* Function call */



    NODE_LOOP_FOR,      /* For loop construct */



    NODE_COND_IF,      /* If conditional */



} nodetype_t;

The problem with floats is they don't have perfect accuracy. If you add 5 and 3, the result might not be exactly 8.0, and an equality test might fail.

Quote:

Assuming again a simple binary tree form, and not child arrays:

Code:

typedef struct object  object_t;

struct object {

    struct object *parent;  /* Container object */

    struct object *next;    /* First sibling */

    struct object *child;    /* First child */

    struct data  *data;    /* Reference to the actual data */

    size_t        namelen;  /* Name length */

    unsigned char  name[];  /* Name */

};

Objects (envs) do not need to know about their children and do not have siblings (are you confusing them with nodes?). And I still have no clue what it is with you and these names. I just don't get it.

Also, what's so bad about using a separate list type?

Nominal Animal

02-11-2011 10:27 PM

Quote:

Originally Posted by MTK358 (Post 4255656)

Wrong. This language has completely first-calss functions, no different whatsoever from other data. They can be stored in varaibles, passed to functions, and expressions can even evaluate to them.

Wrong what? I am showing you first-class functions! The strings in the AST tree are the tokens the script programmer uses to refer to the objects!

If you want to allow an expression to evaluate to a function name, similar to PHP variable functions, the function name needs to be specified in the first child node instead of the AST node itself. For example, a print call AST would be (CALL("print", ...)). The difference is not significant, really.

Quote:

Originally Posted by MTK358 (Post 4255656)

You have to have an expression node to specify the function, not a name string.

If you mean that the function name must be allowed to be an expression, then the minor change to the AST node handles that. It's really not a problem.
If you mean that you don't want name strings in the AST nodes, then I'm stumped. I don't know of any method you can refer to the objects then. What do you do with the strings the script programmer uses to label the objects and functions and variables?

Quote:

Originally Posted by MTK358 (Post 4255656)

There is no such thing here as functions with permanent, predefined names.

Permanent? Predefined? Of course not. The name is just the name of the object that can be used to refer to the function now.

Quote:

Originally Posted by MTK358 (Post 4255656)

The problem with floats is they don't have perfect accuracy. If you add 5 and 3, the result might not be exactly 8.0, and an equality test might fail.

All 32-bit (actually, 52-bit) integers are exactly represented by doubles. For example GNU awk uses doubles for all numeric values. But I guess I'd probably also implement separate integer types. And vectors! I need vectors!

Quote:

Originally Posted by MTK358 (Post 4255656)

I think here is our major difference in approach.

In my opinion, the memory layout of the object hierarchy should follow the logical hierarchy: members of an object should be children to the object. The data or code that defines the object, is in a separate, reference-counted structure, which has no name. The distinction is similar to directory entries and inodes in POSIX file systems. The interpreter has only pointers to two objects: the global object, and the current local scope object. When exiting a local scope, the interpreter changes to the parent of the current local scope object, and the current local scope object is destroyed. Since all data is reference-counted, return values and so on are perfectly safe, and garbage collection is easily implemented. Only the objects have names. The strings the script programmer uses to refer to anything is resolved to objects, based on the object names.

I have zero idea how you intend to refer to anything without using names.

Quote:

Originally Posted by MTK358 (Post 4255656)

Also, what's so bad about using a separate list type?

Nothing.. but why use one if it's not necessary for functionality nor readability?

Nominal Animal

All times are GMT -5. The time now is 03:48 PM.

Page 1 of 9

Last »

Show 50 post(s) from this thread on one page