Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Type checker #96

Draft
wants to merge 13 commits into
base: master
Choose a base branch
from
Draft

Conversation

YerinAlexey
Copy link
Collaborator

@YerinAlexey YerinAlexey commented Mar 22, 2024

Fixes #50

This took around two evenings of work, but there it is.

Currently not integrated in build system and generators because everything is going to break horribly in that case.

A quick design overview:

There are 3 main components to this:

  • Checked AST
  • Type table
  • Scope table

Checked AST

The checked AST is quite similar to the regular parsed AST with a few additions:

  • All expressions have a result type attached to them
  • Types in all structures are mandatory and inferred if needed. For functions that don't have a return type, void is used
  • Variables and types are referenced by IDs in their respective tables

Type table

The type table is actually a tree structure that stores and deduplicates types. Each type is given a unique Id value which can then be used to reference it anywhere in the type checker. The Type structure stores some metadata about the type like size and alignment (necessary for QBE/LLVM codegen) and a Repr (standing for representation) which is what "type" usually refers to.

Basic types are:

  • any (any type can be converted to it, currently has no defined size; TODO)
  • int
  • string
  • bool
  • void

The base types are inserted into the type table during its construction and their IDs are available as fields of the types::Table struct.

Those can be combined into composite types:

  • T[n] - fixed arrays
  • T[] - dynamic arrays
  • struct{...} - structures

There's also a special named type, which just relays information about its inner type but can be referred to by name. This is currently only used for structs, but can allow for more generic type aliases like C's typedef or Rust's type a = b;.

Though, I'm currently not very happy with the design of named types in particular right now

Scope table

Scope table is a tree-like structure, implemented similarly to type table, that tracks scopes and their relations, as well as variables in those scopes. A scope is established by either a function declaration, a loop or a block statement. Scopes also store a loop flag which makes tracking loops for break/continue much easier.

A scope stores mappings of variables to their type IDs, it could possibly also store additional information for codegen.

TODO

  • Function prototype syntax: fn declared_elsewhere(x: int): string;
  • Write tests
  • Check for duplicate variable and top-level declarations
  • Type hints for array types. Currently an empty array literal doesn't work because the type checker doesn't tell apart let x: int[] = [] and let x: string[] = [] when checking the [] expression
  • Add a system for signalling implicit conversions (such as passing a value to a parameter with any type) to code generators
  • Implement a better system for error reporting. Currently any stages after parsing have no access to position information, would be useful to add that to the syntax tree as well as returning something more meaningful than String for errors (hints should also fit here)
    • Fix up checker error messages and add more context (such as affected types)
  • Make naming consistent across parsed and checked AST
  • Refactor code generation to use the checked AST
  • Add documentation for type checker internals

@garritfra
Copy link
Collaborator

Oh wow, nice work!

This check phase is quite complex, but I assume there's no way around a second AST structure if we want proper typing.

(Please don't forget to add tests and fix up the dev docs afterwards, but I'm sure you're aware of that)

@YerinAlexey
Copy link
Collaborator Author

This check phase is quite complex, but I assume there's no way around a second AST structure if we want proper typing.

It's not actually that complex, there's just a lot of boilerplate code (especially function parameters that get split onto multiple lines). The logic is really simple if you get to it

(Please don't forget to add tests and fix up the dev docs afterwards, but I'm sure you're aware of that)

I'll probably add tests after refactoring error handling so failure tests can explicitly specify which error should be produced

The parser only checks the following token for a 'return;' construct but
doesn't actually consume the semicolon. When the parser starts processing the
next statement, it will fail because semicolon is not a valid token in that
context.
@YerinAlexey
Copy link
Collaborator Author

YerinAlexey commented Mar 23, 2024

The type checker finally supports all language features! I also went through a few rounds of refactoring and I think the new internal API turned out to be a bit better. There are some improvements for parser/AST here, which are independent of the type checker. Most notable change is disallowing uninitialized bindings without an explicit type (let x).

I also relaxed type restrictions for +, it can also concatenate strings now. I'm not sure what the exact semantics should be, but I ended up allowing string + (int|string|bool). The code also turned out to be a bit convoluted, I feel like it's better to move concatenation into a separate operator.

A few last things to be solved:

  • There needs to be a way of declaring a prototype for a function so that the type checker knows types of its arguments and return type
  • Handle duplicate objects in a better way. Currently I expect a lot of things to break if something is declared twice
  • Figure out the semantics of any for non-JS backends

My next goal is going to be adding tests for all this, refactoring error handling (it's inconsistent and lacks useful context), and trying to integrate it with the build system and generators

Short assignment forms (+=, -=, *=, /=) are now combined with the Assign
statement instead of being expressions.

To implement that, convoluted suffix expressions (field/array access and calls)
handling in the parser had to be reworked. Right now all suffix expressions are
parsed inside parse_expression. The assignment logic is also deduplicated
between parse_statement and various expression parsers.
Those are already handled under Whitespace
- Introduce a FileTable for easily referencing files instead of cloning paths
- Attach a Position to errors returned from lexer and parser
- Add End token signifying the end of file to provide position when reaching
  an unexpected EOF
- Move util::string_util::highlight_position_in_file to a method of lexer::Error
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Proper check-phase
2 participants