Arx Syntax Specification (Lexical)¶

Status: draft 0.1.0

This document defines Arx lexical/token-level behavior for editor tooling.

Normative source: src/arx/lexer/syntax.json

Non-goal: full parsing or AST semantics.

1) Source of Truth¶

syntax.json is canonical.
Any derived editor grammar or tokenizer should be generated from, or manually validated against, that file.
If this Markdown and the JSON disagree, the JSON wins.

2) Whitespace and Newlines¶

Indentation is significant.
Canonical indentation unit is 2 spaces.
Newlines are significant boundaries for indentation handling.
Blank lines reset indentation tracking for that line.
Tabs policy is not finalized.
TODO(ARX-LEX-WS-001): decide whether tabs are forbidden or normalized.

Example¶

fn abs(x):
  if x < 0:
    return 0 - x
  else:
    return x

3) Comments¶

Line comments start with # and continue to end of line.
Block comments are not specified.

Example¶

fn print_star(n):
  putchard(42)  # ascii '*'

Edge Cases¶

# inside future string syntax is undefined until strings are specified.
TODO(ARX-LEX-COMMENT-002): decide whether // should become a comment delimiter.

4) Strings and Escapes¶

Current status: not yet specified for Arx lexical v0.1.0.

No official quote delimiters.
No official escape sequences.
No triple-quoted or raw string prefixes.
No string interpolation syntax.

TODO(ARX-LEX-STRINGS-001): define "..." / '...' behavior and escapes. TODO(ARX-LEX-STRINGS-002): define raw strings and interpolation policy.

Highlighter Guidance (current)¶

Treat quote characters as plain punctuation/operators.
Avoid speculative string scopes unless a tool explicitly opts into future-preview rules.

5) Numbers¶

Supported lexical number classes:

Decimal integers (e.g., 0, 42)
Decimal floats with one dot (e.g., 3.14, .5, 5.)

Not currently specified:

Non-decimal bases (0x, 0b, 0o)
Exponent forms (1e9)
Numeric separators (1_000)
Type suffixes (42u32, 1.0f32)

Edge Cases¶

1.2.3 is invalid (multiple decimal points).
. alone is invalid as a numeric literal.

6) Identifiers and Unicode Policy¶

Identifier shape:

Start: Unicode letter or _
Continue: Unicode alphanumeric or _

Reference regex for tooling:

(?:[_\p{L}])(?:[_\p{L}\p{N}])*

Policy status:

Current behavior tracks host-runtime Unicode character categories.
TODO(ARX-LEX-IDENT-001): lock policy to either XID_* classes or ASCII-only.

7) Keywords¶

Reserved keywords:

class
const
else
extern
fn
for
if
import
in
return
then
var

Contextual keywords:

as
from

Edge Cases¶

fnx, if_else, returnValue are identifiers, not keywords.
Case-sensitive matching: Fn and IF are identifiers.

8) Operators and Punctuation¶

Current operator/punctuation set for lexical highlighting:

Assignment/comparison/arithmetic: =, <, >, +, -, *, /
Structural punctuation: ., @, :, ,, ;

Brackets:

()
[]
{}

Notes:

Multi-character operators are not finalized.
Annotation lines use @[ followed by comma-separated modifier names and a closing ].
Grouped named imports use parentheses, require from, and allow trailing commas.
TODO(ARX-LEX-OPS-001): decide on ==, !=, <=, >=, ->.

9) Canonical Lexical Examples¶

Keywords vs identifiers¶

fn fnx(x):
  var if_else = x in
    return if_else

Numeric forms¶

fn demo() -> none:
  var a = 42 in
    var b = 3.14 in
      var c = .5 in
        var d = 5. in
          a + b + c + d

Comments¶

extern putchard(x)  # imported symbol

Annotation lines¶

@[public, static]
class Math:
  @[public, constant]
  version: int32 = 1

Imports¶

import std.math
import std.math as math

import sin from std.math
import sin as sine from std.math

import (sin, cos, tan as tangent) from std.math

import (
  sin,
  cos,
  tan as tangent,
) from std.math

10) Lightweight Consistency Checks (Repo-Agnostic)¶

Canonical check:
Parse src/arx/lexer/syntax.json.
Verify required keys exist (keywords, comment, strings, numbers, identifiers, whitespace, brackets).
Derived artifacts check:
If/when an editor grammar exists, assert it was generated from or validated against the JSON manifest.
Fail CI if keyword arrays differ.
Drift check:
Add a simple script that loads JSON and scans corpus snippets (if present) to ensure each declared keyword/operator appears in at least one sample.
Change discipline:
Require PRs touching syntax tooling to update syntax.json first, then derived files.