Introduction
Welcome to the docs of Kalei, a toy language named after Kaleidoscope.
These docs are a way to publicly store language ideas and compiler implementation details, and to keep links to articles that concern both. Right now, many pages may be empty or not clearly organized.
The most interesting pages are:
- Pipeline (Compiler) to see what the current compiler pipeline looks like.
- Future (Language) to check out some syntax/language ideas I would like to try.
Syntax
Expressions
Design
Short-circuiting operators
The `and` and `or` keywords make it clearer that these operators act on control flow, as opposed to the other character-based binary operators.
if condition and other_condition { }
if condition or other_condition { }
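The control-flow aspect is the same as Rust's `&&`/`||`: the right-hand operand only runs when the left-hand side does not already decide the result. A small Rust sketch of that behaviour (the `rhs`/`demo` helpers are made up for illustration):

```rust
// Tracks whether the right-hand side of a short-circuiting operator ran.
fn rhs(hit: &mut bool) -> bool {
    *hit = true;
    true
}

fn demo() -> (bool, bool) {
    let mut hit = false;
    // `false && rhs(..)` never evaluates the right-hand side.
    let _ = false && rhs(&mut hit);
    let skipped_first = !hit;
    // `false || rhs(..)` must evaluate it to know the result.
    let _ = false || rhs(&mut hit);
    (skipped_first, hit)
}
```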
Structs
Design
enum Option<T> {
Some(T),
None,
}
enum Toto {
Foo,
Bar(uint),
Baz { tata: uint }
}
struct Baz {
content: uint,
}
// struct NamedTuple<T, U> {
// foo: T,
// bar: U,
// }
fn foo(option: Option) { }
/*
enum Option { Foo, Bar }
*/
fn main() {
let maybe = .Some(10);
match maybe {
.Some(num) => {}
.None => {}
else => {}
}
let baz = Baz { content: 10 };
let inner = baz.content;
let _ { content } = &baz;
foo(.Foo);
}
Traits
Rust-like traits.
Design
Definition
trait Foo {
fn bar() uint;
fn baz() {
// snip
}
}
Implementation
struct MyType;
impl Foo for MyType {
// notice how there is no type in impl block
//
// concern: against code locality
fn bar() uint {
// chosen by a fair roll of dice
4
}
}
Match
Design
Invocation
(find a name for this)
match variable { <arms> }
match foo.method() { <arms> }
Postfix
let variable = Some(42);
variable.match { <arms> }
foo.method().match { <arms> }
foo.method().match { <arms> }.continue()
Arms
Small arrow syntax (languages)
<match> {
123 -> {}
else -> {}
}
Rust large arrow syntax
<match> {
123 => {}
_ => {}
}
no arrows
<match> {
123 { 456 }
_ {
// -snip-
}
}
Attributes
Design
-
Which character to introduce attribute?
#,@@clashes with bindings inmatcharms.
#attr
#attr("foo")
#attr{name = "c"}
fn item() {
#rustfmt::skip
let matrix = [
1, 2, 3,
4, 5, 6,
];
#allow(clippy::placeholder_name)
let foo = ();
}
-
Concern: Bare attributes (i.e. no
[]to delimit) on expression can conflict#foo (1) #foo() 1 #foo [1, 2, 3] #foo[] 1Currently, attributes have precedence.
- Do we not have bare attributes?
#foois#foo() - Do we delimit the attribute?
- Do we not have bare attributes?
Future
All the syntax here are thoughts on language design that may or may not be interesting to implement in a language. Most of the syntax here came from a quick though, is bad and needs to be bikeshed.
Modern programming languages should optimize for readability.
Reading
-
Left to Right Programming (Programs Should Be Valid as They Are Typed)
When writing a fragment of code, the sooner you give context information, the sooner the language server is able to help you back. That is by giving type information, methods and functions discovery, etc.
Syntax Samples
Many of the tests shown here are usual constructs in programming but tried in a left-to-right syntax. I would like to explore the LTR design space.
is operator
-
Motivation: LTR programming
Would work especially well with path inference.
Equiv. (Rust) if let <pattern> = <expr>
let option = Some(42);
if option is Some(num) { }
if option.is Some(num) { }
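The Rust equivalent is `if let`, which introduces the same binding but puts the pattern before the expression instead of reading left to right (the `first_even` helper is made up for illustration):

```rust
// Finds the first even number and scales it; `if let` is Rust's spelling of
// "does this value match the pattern, binding its contents if so".
fn first_even(nums: &[i32]) -> Option<i32> {
    let found = nums.iter().find(|&&n| n % 2 == 0);
    if let Some(num) = found {
        return Some(num * 10);
    }
    None
}
```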
not postfix operator
- Motivation: LTR programming
Would replace the current not prefix operator.
if not vec.is_empty() { }
// becomes
if vec.is_empty().not { }
if vec.is_empty().not() { } // keep as a boolean method?
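Rust already allows the "boolean method" spelling today via the `std::ops::Not` trait, which gives an idea of how the postfix form reads in practice:

```rust
use std::ops::Not;

// Postfix negation reads left to right: "vec, is it empty? no?".
fn non_empty(v: &Vec<i32>) -> bool {
    // prefix form would be: !v.is_empty()
    v.is_empty().not()
}
```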
postfix deref operator
- Motivation: LTR programming
let num = 10;
let myref = &num;
let mynum = myref.*;
bar.mac_call().*;
bar.field.*;
LTR Loop statements
let stmts = [];
let stmts_iter = stmts.iter();
while stmts_iter.next() is Some(stmt) { }
in stmts for stmt { }
in stmts the stmt { }
for stmts each stmt { }
for stmts be stmt { }
all stmts be stmt { }
loop stmts for stmt { }
stmts.for |stmt| { }
stmts.iter().enumerate().for |i, stmt| { }
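The last two forms are close to what Rust's `for_each` already offers, which can serve as a baseline for how a postfix loop feels (the `number_stmts` helper is made up for illustration):

```rust
// Postfix, left-to-right iteration: the collection comes first, the loop
// body last, mirroring `stmts.iter().enumerate().for |i, stmt| { }`.
fn number_stmts(stmts: &[&str]) -> Vec<String> {
    let mut out = Vec::new();
    stmts.iter().enumerate().for_each(|(i, stmt)| {
        out.push(format!("{i}: {stmt}"));
    });
    out
}
```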
LTR Trait implementation
for Struct impl Trait { }
Struct::impl Trait { }
Struct.impl Trait { }
Struct.impl { }
Postfix try control-flow operator
let res = Ok(98);
let num = res.?;
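Rust's `?` is already a postfix control-flow operator, so `res.?` would only change the spelling; the behaviour would stay the same:

```rust
// `?` early-returns the Err variant and unwraps the Ok variant.
fn double(res: Result<i32, String>) -> Result<i32, String> {
    let num = res?;
    Ok(num * 2)
}
```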
Type ascription
- Motivation: better type inference ergonomics
// annotate functions as such
let myexpr = f() : uint;
let myvec = (
(1..=100)
.into_iter()
.collect() : Vec<_>
)
.into_iter()
.map((|x| x+1):fn(uint) -> uint);
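For comparison, Rust today disambiguates the same `collect` call with a turbofish on the method rather than an annotation after the expression:

```rust
// The turbofish `::<Vec<_>>` plays the role of the proposed `: Vec<_>`
// ascription, but attaches to the call instead of following the expression.
fn collected() -> Vec<u32> {
    (1..=100).collect::<Vec<_>>()
}
```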
Generics
Which syntax to adopt?
`<` `>` collide with comparison operators and need special handling (e.g. Rust's turbofish)
fn forget[T](v: T);
fn substring['a](s: &'a str) &'a str;
fn substring(s: &str) &str;
fn foo<T>(arg: T) T {
arg
}
Path
Which syntax to adopt?
use std.arith.Plus; // can be confused with field access
use std/arith/Plus; // no
use std::arith::Plus;
Alternative keywords
My rationale for using verbs instead of nouns for keywords is that they are less likely to conflict with user variable names. On the other hand, it's difficult to have a wide range of verbs with similar semantics, or a single verb ends up covering too many things. Using one style or the other consistently would be more coherent.
- Verb keywords:
  - `def` for values (functions, constants)
  - `decl`/`dcl` for type-level constructs (structs, enums, etc.)
  - `let` for local values (variables)
- Noun keywords:
  - `fn`/`func`, `type`/`struct`/`enum`, `var`/`val`, `cst`/`const`
def func() { }
def constant = 42;
trait Add {
// dcl → declare instead of `type`
dcl Output;
// def → define instead of `fn`
def add(self, other: Self) Self::Output;
}
Callsite code
- Motivation: Avoid excessive monomorphization
fn as_ref_str(s: impl AsRef<str>) {
@callsite let s = s.as_ref();
// snip
}
// alternative vvv?
fn into_string(s: can Into<String>) {
let s = s.into();
// snip
}
fn callsite() {
let my_str = "...";
as_ref_str(my_str);
into_string(my_str);
// instead of the following to avoid generics
as_ref_str(my_str.as_ref());
as_ref_str(&my_str);
into_string(my_str.into());
}
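For reference, this is the trade-off in today's Rust that the `@callsite` idea wants to escape: the generic wrapper gets monomorphized once per argument type, while the non-generic version forces every caller to convert by hand (both helpers below are made-up illustrations, not the proposed semantics):

```rust
// Generic wrapper: ergonomic call sites, but one compiled copy per T.
fn as_ref_str_generic(s: impl AsRef<str>) -> usize {
    s.as_ref().len()
}

// Non-generic version: a single compiled copy, but callers must write the
// `.as_ref()` conversion themselves — exactly what `@callsite` would automate.
fn as_ref_str_plain(s: &str) -> usize {
    s.len()
}
```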
Arbitrary prefixed string literals
This could integrate well with a good comptime system. Is it worth it?
comptime fn regex(lit: &str) { }
let regex = regex""
// could lower to
let regex = comptime { regex("") }
// e.g. c"im a cstr"
Enforce purity and capabilities
Enforce purity through effects, doc_effects, capability system or another system that suits.
Track
- writes to stdio
- calls to unsafe
- hidden interaction with randomness (e.g. Rust HashMap)
- allocations
This is one of the largest goals here.
Require value binding
Can be simply implemented as a lint. Acts as if everything were (Rust) `#[must_use]`.
Rational number type
Motivation: experiment with floats as a last resort construct and work with rational numbers
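A minimal sketch of what such a type could look like, written in Rust (the `Rational` type and its API are hypothetical, not part of Kalei): unlike floats, sums of rationals stay exact.

```rust
// Exact rational arithmetic: values are kept as normalized num/den pairs.
#[derive(Debug, PartialEq, Clone, Copy)]
struct Rational {
    num: i64,
    den: i64,
}

fn gcd(a: i64, b: i64) -> i64 {
    if b == 0 { a.abs() } else { gcd(b, a % b) }
}

impl Rational {
    fn new(num: i64, den: i64) -> Self {
        let g = gcd(num, den);
        Rational { num: num / g, den: den / g }
    }

    fn add(self, other: Self) -> Self {
        // a/b + c/d = (a*d + c*b) / (b*d), then normalize.
        Rational::new(self.num * other.den + other.num * self.den, self.den * other.den)
    }
}
```

Note that `1/10 + 2/10` is exactly `3/10` here, whereas `0.1 + 0.2 != 0.3` with binary floats.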
no semicolon in statements, functional syntax?
- OCaml syntax?
fn foo(arg: Ty) ->
a = make_foo();
b = make_bar(a);
()
Concern: how to know whether or not to return the last value
- depend on the type system? consider it a return if the type matches; seems robust to errors
fn boo(arg: Ty) {
_ = make_blah()
}
Hot-reloading
Would be fun to play with that. Kind of like dotnet watch.
Path inference
enum InferKind {
Infer,
NoInfer,
}
fn foo(qrcode: InferKind);
foo(.Infer);
match InferKind.Infer {
.Infer -> {}
.NoInfer -> {}
}
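In Rust the closest approximation is importing the variants so call sites stay short; real path inference would instead resolve `.Infer` from the expected type (the `describe` helper is made up for illustration):

```rust
#[derive(Debug, PartialEq)]
enum InferKind {
    Infer,
    NoInfer,
}

fn describe(kind: InferKind) -> &'static str {
    // `use` brings the variants into scope, approximating `.Infer`.
    use InferKind::*;
    match kind {
        Infer => "infer",
        NoInfer => "no-infer",
    }
}
```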
Pipe operator/construct
Motivation: avoid nesting calls or let repetition
Question: how to tell where the arguments go
Related: See Uniform function call syntax
let value =
compute_value(my_looooooooog_value_name)
|> transform_value1()
|> transform_value2()
|> transform_value3();
// reuse closure syntax?
let value =
compute_value(my_looooooooog_value_name)
|> transform_value1
|> |v| transform_value2(v)
|> |v| transform_value3(v, 3);
// semantically equivalent to
let value = transform_value3(transform_value2(transform_value1(compute_value(my_looooooooog_value_name))), 3);
// which may format as
let value = transform_value3(
transform_value2(transform_value1(compute_value(
my_looooooooog_value_name,
))),
3,
);
// or chaining the same name to avoid nesting and keeping clarity
let value = compute_value(my_looooooooog_value_name);
let value = transform_value1(value);
let value = transform_value2(value);
let value = transform_value3(value, 3);
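A `|>` can be approximated in today's Rust with a blanket `pipe` method; this also gives one possible answer to the argument-placement question: the piped value is always the single closure argument (the `Pipe` trait and `demo` function below are a sketch, not a real library API):

```rust
// Universal postfix application: `x.pipe(f)` is `f(x)`.
trait Pipe: Sized {
    fn pipe<U>(self, f: impl FnOnce(Self) -> U) -> U {
        f(self)
    }
}

// Blanket impl makes `pipe` available on every type.
impl<T> Pipe for T {}

fn demo(x: i32) -> i32 {
    x.pipe(|v| v + 1)
        .pipe(|v| v * 2)
        .pipe(|v| v - 3)
}
```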
Operators
@ pattern
| pattern, bit-or
& pattern, bit-and
_ pattern
~ pattern?, bit-not?
# attributes
% binop? do not use for rem
* binop
= binop
+ binop
- binop
/ binop
? unary try
! unary not
\ escape
'' char, lifetime?
"" string
`` ?
() delim value param
[] delim index
{} delim stmts, group
<> delim type param
, punct param
: type
; stmt end
. field
^ ???
$ raw ident?
struct Fields {
foo: uint,
$let: uint,
}
fields.$let
Effects
Reading
-
Focuses on static effects (i.e. effects that give information at compile time, not attached to the concept of coroutine).
Samples
Name resolution
effect Resolve {
fn resolve(name: Sym) &ref Ty;
}
fn main() {
let items = <..>;
let map = Map<Sym, Ty>;
// I don't like the try/catch-like syntax for effects
handle {
items.for |item| {
'resolve(item);
}
} with {
// no type in effect handling
fn resolve(name) {
map.entry(name).match {
// not long arrows
.Occupied |ty| effect_continue &ty
.Vacant || {
// it's not in the handle ctx anymore???
let ty = resolve_item(name);
map.insert(name, ty);
effect_continue &ty;
}
}
}
}
}
fn resolve_item(item: hir::Item) Ty \ Resolve {
// translate hir to ty
// ...
let ty = 'resolve(item.subtype);
// ...
ty
}
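One common lowering for static effects (mentioned in the Capabilities reading below) is to pass the handler as an extra argument. A hedged Rust sketch of the memoizing `resolve` handler above, with the effect turned into an explicit cache parameter (`resolve` and its `name.len()` stand-in body are made up for illustration):

```rust
use std::collections::HashMap;

// The `Resolve` effect lowered to an explicit handler argument: look up the
// cache first, otherwise compute, store, and return (memoization), mirroring
// the Occupied/Vacant arms of the `handle`/`with` block.
fn resolve(name: &str, cache: &mut HashMap<String, u32>) -> u32 {
    if let Some(ty) = cache.get(name) {
        return *ty;
    }
    let ty = name.len() as u32; // stand-in for the real `resolve_item`
    cache.insert(name.to_string(), ty);
    ty
}
```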
Capabilities
Reading
-
⭐ Context Capabilities - tmandry
Discusses the addition of a context system in Rust.
`with` keyword for functions and trait implementations. Matches what I had in mind for a capability system. It is possible to make capabilities no-cost by using ZSTs.
Implementation passes context as extra function arguments.
-
Designing with Static Capabilities and Effects: Use, Mention, and Invariants
Samples
struct Env; // ZST. impl uses a global type
struct BasicArena { }
#explicit capability env = Env; // capability use-case
#implicit capability arena = BasicArena; // context use-case
fn std::env::var(s: AsRef<str>) String with Env {}
fn main() () \ {env} {
let arena = BasicArena.new();
use arena as Arena;
foo()
// pass some capabilities explicitly
.with(Env)
}
fn foo() uint \ {arena, env} {
let id = arena.alloc(1);
let var = std::env::var("foo")?;
}
Metaprogramming
Reading
-
The Metaprogramming Dilemma - gingerBill (Odin)
Defines a list of metaprogramming categories:
- Introspection (and Reflection for OOP languages)
- Compile Time Execution (CTE)
- Template Programming
- Macros (Textual and Syntactic)
- Parametric Polymorphism (“Generics”)
My goals are: generics, obviously; some kind of checked template programming that uses introspection; and a VM/interpreted compile-time execution that is sandboxed and proven deterministic but can still access things like the heap (Rust const evaluation is way too restrictive, with too many hacks).
-
Parametricity, or Comptime is Bonkers
Though I have not used Zig much, I agree that their comptime is very free-form; although all abstractions are leaky, it makes it too easy to depend on some construction.
Maybe never allow compiler intrinsics to output information without a trait (except `TypeId`).
// has to be the identity because T has no observable properties
fn foo<T>(foo: T) T { foo }
// as opposed to
fn foo<T: Any>(value: &T) _ {} // we only care about opaque value identity, as used for type-map contexts in Rust
fn foo<T: Shape>() _ {} // e.g. shows that it uses T for reflection
fn foo<T: Layout>() _ {} // layout only gives information for size/alignment
Samples
How to balance between user power and good ergonomics.
Shape intrinsic
- Motivation: do not depend on text/token manipulation macros
let shape = Shape::<Directions>::new().variants;
// equals: let shape = &[Brush, Line, Rectangle]
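Rust has no such intrinsic; a hand-written trait can approximate what the `Shape` intrinsic would generate automatically (the `Shape` trait and `Directions` enum below are a manual mock-up of the proposal):

```rust
// Manual stand-in for the proposed compiler-generated shape information.
trait Shape {
    fn variants() -> &'static [&'static str];
}

#[allow(dead_code)]
enum Directions {
    Brush,
    Line,
    Rectangle,
}

// With the intrinsic, this impl would be derived by the compiler.
impl Shape for Directions {
    fn variants() -> &'static [&'static str] {
        &["Brush", "Line", "Rectangle"]
    }
}
```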
Inline statements
- Motivation: do not depend on text/token manipulation macros
struct Foo { bla: Boo, bar: Baz }
fn deserialize(s: &str) T {
let result = Foo::default();
inline for field in Shape::new(result).fields {
field.value = deserialize(s);
}
result
}
inline if std.cfg.host.is_linux() {
} else {
}
Assignment reflection
[..] I very frequently find myself wanting to know the name I’m “about to be assigned to”.
-
Motivation: in languages where you specify a lot of data inline (e.g. Nix) you often end up with `"name" = function("name", { options })`
-
Concern: the value changes with an LSP rename. May be a source of confusion
-
Concern: makes the expression dependent on context → you cannot extract that expression somewhere else
let kalei = Source::new(`name);
// becomes
let kalei = Source::new("kalei");
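In Rust the same effect can be faked with a macro that sees the binding name and stringifies it (the `let_named!` macro and `Source` struct below are hypothetical illustrations, not a real API):

```rust
// Hypothetical stand-in for the backtick-name feature: the macro receives the
// binding identifier, so it can pass its name as a string to the initializer.
struct Source {
    name: &'static str,
}

macro_rules! let_named {
    ($name:ident = $f:expr) => {
        let $name = $f(stringify!($name));
    };
}

fn kalei_name() -> &'static str {
    // Expands to: let kalei = (|n| Source { name: n })("kalei");
    let_named!(kalei = |n| Source { name: n });
    kalei.name
}
```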
Pipeline
The current front-end compiler pipeline looks like the following:
Items to the left of an arrow are inputs, **bold** is a step, *italic* is an artefact.
`->` means "is used by", `=>` means "produces".
-
Source code -> **Lexing & Parsing** => *AST*
Understand the form of the user source code.
-
*AST* -> **Item Collection**, **Item Resolution** => *NameEnvironment*, *ResolutionMap*
Go through each item and associate item names to `ast::NodeId`s.
It makes sense to collect before AST Lowering because:
- this information can be used during the lowering for path resolution
- it is used to provide diagnostics on user code, and the AST is closer to what the user wrote
-
*AST*, *ResolutionMap* -> **AST Lowering** => *HIR*
We lower the AST to a flat structure (soon™) that is the HIR.
HIR is much more usable in the context of a compiler.
-
*HIR* -> **Type Collection** => *TypeEnvironment*
Go through each item and compute their `ty::EarlyItemTy`.
Once the collection phase is done, these are immediately resolved to `ty::LateTy`s.
-
*HIR*, *NameEnvironment*, *TypeEnvironment* -> **Inference** (per function) => *InferResult*
-
*HIR*, *InferResult* -> **Code Generation** => *Object*
Glossary
There are many core types that are used throughout this compiler. Here is a non-exhaustive list of them and their use.
Contexts
-
`SessionCtx` (`scx`)
Holds information needed for the whole compiler session, such as options, diagnostics, and debug output.
-
`DiagnosticCtx` (`dcx`)
Contains all the diagnostic accumulation and emission related code.
-
`TyCtx` (`tcx`)
Holds information needed for type inference, type checking, and their results.
IDs
-
`session::Symbol`
Created when lexing files. It is an index into interned identifiers and strings.
-
`session::Span` and `session::BytePos`
Used extensively throughout the different intermediate representations to show good diagnostics to the user.
-
`ast::NodeId`
Created during the parsing step.
-
`resolve::DefId`
Created during the collection step.
-
`hir::NodeId`
Created during the lowering step.
`hir::ExprId`: wraps a normal `NodeId` but ensures that it points to an expression.
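The `session::Symbol` interning described above can be sketched in a few lines of Rust (the `Symbol`/`Interner` names mirror the glossary; the exact implementation is a guess):

```rust
use std::collections::HashMap;

// A symbol is just an index into a table of unique strings.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Symbol(u32);

#[derive(Default)]
struct Interner {
    map: HashMap<String, Symbol>,
    strings: Vec<String>,
}

impl Interner {
    // Returns the existing symbol for `s`, or allocates a new one.
    fn intern(&mut self, s: &str) -> Symbol {
        if let Some(sym) = self.map.get(s) {
            return *sym;
        }
        let sym = Symbol(self.strings.len() as u32);
        self.strings.push(s.to_string());
        self.map.insert(s.to_string(), sym);
        sym
    }
}
```

Comparing two `Symbol`s is then a cheap integer comparison instead of a string comparison.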
TyKinds
-
`hir::TyKind`
Holds the item definition of a type.
-
`ty::TyKind<InferKind, RefKind>`
Holds the concrete definition of a type.
-
`ty::EarlyItemTy` (`ty::TyKind<NoInfer, DefId>`)
-
`ty::InferTy` (`ty::TyKind<Infer, NoRef>`)
Same, but keeps an extra field to store type variables during the inference step.
-
`ty::LateTy` (`ty::TyKind<NoInfer, NoRef>`)
Parsing
Future
Flat AST storage
-
Super-Flat ASTs (HackerNews comments)
Raises the important question of perf vs. DevEx.
Some random idea of representation
type Foo { id: u32 }
lexes as this stream with token length
IDENT(4) IDENT(3) LBRA(1) IDENT(2) COLON(1) IDENT(3) RBRA(1)
which parses as this tree stream (braces are not represented in the stream)
{ NAME { NAME TY } FIELDS_END } ITEM_END
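One way to store such a tree flat in Rust is to keep all nodes in a single `Vec` and replace child pointers with indices (a sketch of the idea, not the representation the compiler actually uses):

```rust
// Flat AST: nodes live in one Vec and refer to each other by index instead
// of by Box/pointer, improving locality and shrinking allocations.
#[derive(Debug, PartialEq)]
enum Node {
    Name(String),
    // Indices into the surrounding `Vec<Node>`.
    Field { name: usize, ty: usize },
    Item { name: usize, fields: Vec<usize> },
}

// Builds the flat encoding of: type Foo { id: u32 }
fn build_foo() -> Vec<Node> {
    vec![
        Node::Name("Foo".into()),             // 0
        Node::Name("id".into()),              // 1
        Node::Name("u32".into()),             // 2
        Node::Field { name: 1, ty: 2 },       // 3
        Node::Item { name: 0, fields: vec![3] }, // 4
    ]
}
```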
Type Checking
Reading
-
In particular Algorithm W was the first implemented algorithm for function-level type inference.
I intend to implement a form of System Fω.
-
Complete and Easy Bidirectional Typechecking for Higher-Rank Polymorphism
-
Rust
Rust uses a form of System Fω with Bidirectional typechecking implemented using a unification (union-find) algorithm.
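The union-find core of such a unification algorithm is small; a minimal sketch over numbered type variables (real inference tracks much more, e.g. the types each variable resolves to):

```rust
// Union-find over type variables: two variables unify by sharing a
// representative; `find` follows parents with path compression.
struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        // Every variable starts as its own representative.
        UnionFind { parent: (0..n).collect() }
    }

    fn find(&mut self, x: usize) -> usize {
        if self.parent[x] != x {
            let p = self.parent[x];
            let root = self.find(p);
            self.parent[x] = root; // path compression
        }
        self.parent[x]
    }

    // Unify two variables by linking their representatives.
    fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        self.parent[ra] = rb;
    }
}
```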
Code Generation
Future
Language Abstract Machine
One thing that helped me understand the concept of pointer provenance was realising that each language acts in its own abstract machine. An abstract machine is emulated through an interpreter.
This is why we have language specifications and undefined behaviour. Specifications, among other things, define the semantics of this abstract machine.
It is easy to overlook the abstract machine because its model and behaviour can map closely to the underlying implementation. This is especially true for languages like Rust or C++ that are designed as systems programming languages. Whether the language is implemented as an interpreter or compiled down to machine code, you need to understand how this translation works to have better control over performance, or <…>. By the way, it is easy to argue that machine code is "just" a lower-level language that is in turn interpreted by the CPU.
Among the implementations of Rust, one is miri (MIR Interpreter).
This said, it would be nice to define an abstract machine for this language.