# Gulie — Guile Linter/Formatter: Architecture & Implementation Plan ## Context No linter, formatter, or static analyser exists for Guile Scheme. We're building one from scratch, called **gulie**. The tool is written in Guile itself, reusing as much of Guile's infrastructure as possible (reader, compiler, Tree-IL analyses, warning system). The design draws on patterns observed in 7 reference tools (see `docs/INSPIRATION.md`). Guile 3.0.11 is available in the devenv. No source code exists yet. --- ## High-level architecture Two independent passes, extensible to three: ``` .gulie.sexp (config) | file.scm ──┬──> [Tokenizer] ──> tokens ──> [CST parser] ──> CST | | | [Pass 1: Surface] line rules + CST rules | | | diagnostics-1 | └──> [Guile reader] ──> s-exprs ──> [Guile compiler] ──> Tree-IL | [Pass 2: Semantic] built-in analyses + custom Tree-IL rules | diagnostics-2 | [merge + suppress + sort + report/fix] ``` **Why two passes?** Guile's reader (`ice-9/read.scm:949-973`) irrecoverably strips comments, whitespace, and datum comments in `next-non-whitespace`. There is no way to get formatting info AND semantic info from one parse. Accepting this and building two clean, independent passes is simpler than fighting the reader. --- ## Module structure ``` gulie/ bin/gulie # CLI entry point (executable Guile script) gulie/ cli.scm # (gulie cli) — arg parsing, dispatch config.scm # (gulie config) — .gulie.sexp loading, defaults, merging diagnostic.scm # (gulie diagnostic) — record type, sorting, formatting tokenizer.scm # (gulie tokenizer) — hand-written lexer, preserves everything cst.scm # (gulie cst) — token stream → concrete syntax tree compiler.scm # (gulie compiler) — Guile compile wrapper, warning capture rule.scm # (gulie rule) — rule record, registry, define-rule macros engine.scm # (gulie engine) — orchestrator: file discovery, pass sequencing fixer.scm # (gulie fixer) — fix application (bottom-to-top edits) suppression.scm # (gulie suppression) — ; gulie:suppress parsing/filtering formatter.scm # (gulie formatter) — cost-based optimal pretty-printer rules/ surface.scm # (gulie rules surface) — trailing-ws, line-length, tabs, blanks indentation.scm # (gulie rules indentation) — indent checking vs CST comments.scm # (gulie rules comments) — comment style conventions semantic.scm # (gulie rules semantic) — wrappers around Guile's analyses idiom.scm # (gulie rules idiom) — pattern-based suggestions via match module-form.scm # (gulie rules module-form) — define-module checks test/ test-tokenizer.scm test-cst.scm test-rules-surface.scm test-rules-semantic.scm fixtures/ clean/ # .scm files producing zero diagnostics violations/ # .scm + .expected pairs (snapshot testing) ``` ~16 source files. Each has one clear job. --- ## Key components ### Tokenizer (`gulie/tokenizer.scm`) Hand-written character-by-character state machine. Must handle the same lexical syntax as Guile's reader but **preserve** what the reader discards. ```scheme (define-record-type (make-token type text line column) token? (type token-type) ;; symbol (see list below) (text token-text) ;; string: exact source text (line token-line) ;; integer: 1-based (column token-column)) ;; integer: 0-based ``` Token types (~15): `open-paren`, `close-paren`, `symbol`, `number`, `string`, `keyword`, `boolean`, `character`, `prefix` (`'`, `` ` ``, `,`, `,@`, `#'`, etc.), `special` (`#;`, `#(`, `#vu8(`, etc.), `line-comment`, `block-comment`, `whitespace`, `newline`, `dot`. **Critical invariant:** `(string-concatenate (map token-text (tokenize input)))` must reproduce the original input exactly. This is our primary roundtrip test. Estimated size: ~200-250 lines. Reference: Mallet's tokenizer (163 lines CL). ### CST (`gulie/cst.scm`) Trivial parenthesised tree built from the token stream: ```scheme (define-record-type (make-cst-node open close children) cst-node? (open cst-node-open) ;; for ( [ { (close cst-node-close) ;; for ) ] } (children cst-node-children)) ;; list of | ``` Children is a flat list of interleaved atoms (tokens) and nested nodes. Comments and whitespace are children like anything else. The first non-whitespace symbol child of a `` identifies the form (`define`, `let`, `cond`, etc.) — enough for indentation rules. Estimated size: ~80-100 lines. ### Compiler wrapper (`gulie/compiler.scm`) Wraps Guile's compile pipeline to capture warnings as structured diagnostics: ```scheme ;; Key Guile APIs we delegate to: ;; - (system base compile): read-and-compile, compile, default-warning-level ;; - (language tree-il analyze): make-analyzer, analyze-tree ;; - (system base message): %warning-types, current-warning-port ``` Strategy: call `read-and-compile` with `#:to 'tree-il` and `#:warning-level 2` while redirecting `current-warning-port` to a string port, then parse the warning output into `` records. Alternatively, invoke `make-analyzer` directly and hook the warning printers. Guile's built-in analyses (all free): - `unused-variable-analysis` - `unused-toplevel-analysis` - `unused-module-analysis` - `shadowed-toplevel-analysis` - `make-use-before-definition-analysis` (unbound variables) - `arity-analysis` (wrong arg count) - `format-analysis` (format string validation) ### Rule system (`gulie/rule.scm`) ```scheme (define-record-type (make-rule name description severity category type check-proc fix-proc) rule? (name rule-name) ;; symbol (description rule-description) ;; string (severity rule-severity) ;; 'error | 'warning | 'info (category rule-category) ;; 'format | 'style | 'correctness | 'idiom (type rule-type) ;; 'line | 'cst | 'tree-il (check-proc rule-check-proc) ;; procedure (signature depends on type) (fix-proc rule-fix-proc)) ;; procedure | #f ``` Three rule types with different check signatures: - **`'line`** — `(lambda (file line-num line-text config) -> diagnostics)` — fastest, no parsing - **`'cst`** — `(lambda (file cst config) -> diagnostics)` — needs tokenizer+CST - **`'tree-il`** — `(lambda (file tree-il env config) -> diagnostics)` — needs compilation Global registry: `*rules*` alist, populated at module load time via `register-rule!`. Convenience macros: `define-line-rule`, `define-cst-rule`, `define-tree-il-rule`. ### Diagnostic record (`gulie/diagnostic.scm`) ```scheme (define-record-type (make-diagnostic file line column severity rule message fix) diagnostic? (file diagnostic-file) ;; string (line diagnostic-line) ;; integer, 1-based (column diagnostic-column) ;; integer, 0-based (severity diagnostic-severity) ;; symbol (rule diagnostic-rule) ;; symbol (message diagnostic-message) ;; string (fix diagnostic-fix)) ;; | #f ``` Standard output: `file:line:column: severity: rule: message` ### Config (`gulie/config.scm`) File: `.gulie.sexp` in project root (plain s-expression, read with `(read)`, never evaluated): ```scheme ((line-length . 80) (indent . 2) (enable trailing-whitespace line-length unused-variable arity-mismatch) (disable tabs) (rules (line-length (max . 100))) (indent-rules (with-syntax . 1) (match . 1)) (ignore "build/**" ".direnv/**")) ``` Precedence: CLI flags > config file > built-in defaults. `--init` generates a template with all rules listed and commented. ### Suppression (`gulie/suppression.scm`) ```scheme ;; gulie:suppress trailing-whitespace — suppress on next line (define x "messy") (define x "messy") ; gulie:suppress — suppress on this line ;; gulie:disable line-length — region disable ... code ... ;; gulie:enable line-length — region enable ``` Parsed from raw text before rules run. Produces a suppression map that filters diagnostics after all rules have emitted. --- ## Indentation rules The key data is `scheme-indent-function` values from `.dir-locals.el` — an integer N meaning "N arguments on first line, then body indented +2": ```scheme (define *default-indent-rules* '((define . 1) (define* . 1) (define-public . 1) (define-syntax . 1) (define-module . 0) (lambda . 1) (lambda* . 1) (let . 1) (let* . 1) (letrec . 1) (letrec* . 1) (if . #f) (cond . 0) (case . 1) (when . 1) (unless . 1) (match . 1) (syntax-case . 2) (with-syntax . 1) (begin . 0) (do . 2) (parameterize . 1) (guard . 1))) ``` Overridable via config `indent-rules`. The indentation checker walks the CST, identifies the form by its first symbol child, looks up the rule, and compares actual indentation to expected. --- ## Formatting conventions (Guile vs Guix) Both use 2-space indent, same special-form conventions. Key difference: - **Guile:** 72-char fill column, `;;; {Section}` headers - **Guix:** 78-80 char fill column, `;;` headers Our default config targets Guile conventions. A Guix preset can override `line-length` and comment style. --- ## Formatter: cost-based optimal pretty-printing The formatter (`gulie/formatter.scm`) is a later-phase component that **rewrites** files with correct layout, as opposed to the indentation checker which merely **reports** violations. ### Why cost-based? When deciding where to break lines in a long expression, there are often multiple valid options. A greedy approach (fill as much as fits, then break) produces mediocre output — it can't "look ahead" to see that a break earlier would produce a better overall layout. The Wadler/Leijen family of algorithms evaluates alternative layouts and selects the optimal one. ### The algorithm (Wadler/Leijen, as used by fmt's `pretty-expressive`) The pretty-printer works with an abstract **document** type: ``` doc = text(string) — literal text | line — line break (or space if flattened) | nest(n, doc) — increase indent by n | concat(doc, doc) — concatenation | alt(doc, doc) — choose better of two layouts | group(doc) — try flat first, break if doesn't fit ``` The key operator is `alt(a, b)` — "try layout A, but if it overflows the page width, use layout B instead." The algorithm evaluates both alternatives and picks the one with the lower **cost vector**: ``` cost = [badness, height, characters] badness — quadratic penalty for exceeding page width height — number of lines used characters — total chars (tiebreaker) ``` This produces provably optimal output: the layout that minimises overflow while using the fewest lines. ### How it fits our architecture ``` CST (from tokenizer + cst.scm) → [doc generator] convert CST nodes to abstract doc, using form-specific rules → [layout solver] evaluate alternatives, select optimal layout → [renderer] emit formatted text with comments preserved ``` The **doc generator** uses the same form-identification logic as the indentation checker (first symbol child of a CST node) to apply form-specific layout rules. For example: - `define` — name on first line, body indented - `let` — bindings as aligned block, body indented - `cond` — each clause on its own line These rules are data (the `indent-rules` table extended with layout hints), making the formatter configurable just like the checker. ### Implementation approach We can either: 1. **Port `pretty-expressive`** from Racket — the core algorithm is ~300 lines, well-documented in academic papers 2. **Upgrade Guile's `(ice-9 pretty-print)`** — it already knows form-specific indentation rules but uses greedy layout; we'd replace the layout engine with cost-based selection Option 1 is cleaner (purpose-built). Option 2 reuses more existing code but would be a heavier modification. We'll decide when we reach that phase. ### Phase note The formatter is **Phase 6** work. Phases 0-4 deliver a useful checker without it. The indentation checker (Phase 4) validates existing formatting; the formatter (Phase 6) rewrites it. The checker comes first because it's simpler and immediately useful in CI. --- ## CLI interface ``` gulie [OPTIONS] [FILE|DIR...] --check Report issues, exit non-zero on findings (default) --fix Fix mode: auto-fix what's possible, report the rest --format Format mode: rewrite files with optimal layout --init Generate .gulie.sexp template --pass PASS Run only: surface, semantic, all (default: all) --rule RULE Enable only this rule (repeatable) --disable RULE Disable this rule (repeatable) --severity SEV Minimum severity: error, warning, info --output FORMAT Output: standard (default), json, compact --config FILE Config file path (default: auto-discover) --list-rules List all rules and exit --version Print version ``` Exit codes: 0 = clean, 1 = findings, 2 = config error, 3 = internal error. --- ## Implementation phases ### Phase 0: Skeleton - `bin/gulie` — shebang script, loads CLI module - `(gulie cli)` — basic arg parsing (`--check`, `--version`, file args) - `(gulie diagnostic)` — record type + standard formatter - `(gulie rule)` — record type + registry + `register-rule!` - `(gulie engine)` — discovers `.scm` files, runs line rules, reports - One trivial rule: `trailing-whitespace` (line rule) - **Verification:** `gulie --check some-file.scm` reports trailing whitespace ### Phase 1: Tokenizer + CST + surface rules - `(gulie tokenizer)` — hand-written lexer - `(gulie cst)` — token → tree - Surface rules: `trailing-whitespace`, `line-length`, `no-tabs`, `blank-lines` - Comment rule: `comment-semicolons` (check `;`/`;;`/`;;;` usage) - Roundtrip test: tokenize → concat = original - Snapshot tests for each rule ### Phase 2: Semantic rules (compiler pass) - `(gulie compiler)` — `read-and-compile` wrapper, warning capture - Semantic rules wrapping Guile's built-in analyses: `unused-variable`, `unused-toplevel`, `unbound-variable`, `arity-mismatch`, `format-string`, `shadowed-toplevel`, `unused-module` - **Verification:** run against Guile and Guix source files, check false-positive rate ### Phase 3: Config + suppression - `(gulie config)` — `.gulie.sexp` loading + merging - `(gulie suppression)` — inline comment suppression - `--init` command - Rule enable/disable via config and CLI ### Phase 4: Indentation checking - `(gulie rules indentation)` — CST-based indent checker - Default indent rules for standard Guile forms - Configurable `indent-rules` in `.gulie.sexp` ### Phase 5: Fix mode + idiom rules - `(gulie fixer)` — bottom-to-top edit application - Auto-fix for: trailing whitespace, line-length (where possible) - `(gulie rules idiom)` — `match`-based pattern suggestions on Tree-IL - `(gulie rules module-form)` — `define-module` form checks (sorted imports, etc.) ### Phase 6: Formatter (cost-based optimal layout) - `(gulie formatter)` — Wadler/Leijen pretty-printer with cost-based selection - Abstract document type: `text`, `line`, `nest`, `concat`, `alt`, `group` - Form-specific layout rules (reuse indent-rules table + layout hints) - Comment preservation through formatting - `--format` CLI mode - **Verification:** format Guile/Guix source files, diff against originals, verify roundtrip stability (format twice = same output) ### Phase 7: Cross-module analysis (future) - Load multiple modules, walk dependency graph - Unused exports, cross-module arity checks - `--pass cross-module` CLI option --- ## Testing strategy 1. **Roundtrip test** (tokenizer): tokenize → concat must equal original input 2. **Snapshot tests**: `fixtures/violations/rule-name.scm` + `.expected` pairs 3. **Clean file tests**: `fixtures/clean/*.scm` must produce zero diagnostics 4. **Unit tests**: `(srfi srfi-64)` for tokenizer, CST, config, diagnostics 5. **Real-world corpus**: run against `test/guix/` and `refs/guile/module/` for false-positive rate validation 6. **Formatter idempotency**: `format(format(x)) = format(x)` for all test files --- ## Key design decisions | Decision | Rationale | |----------|-----------| | Hand-written tokenizer, not extending Guile's reader | The reader is ~1000 lines of nested closures not designed for extension. A clean 200-line tokenizer is easier to write/test. | | Two independent passes, not a unified AST | Reader strips comments irrecoverably. Accepting this gives clean separation. | | Delegate to Guile's built-in analyses | They're battle-tested, handle macroexpansion edge cases, and are maintained upstream. | | `(ice-9 match)` for idiom rules, not logic programming | Built-in, fast, sufficient. miniKanren can be added later if needed. | | S-expression config, not YAML/TOML | Zero deps. Our users write Scheme. `(read)` does the parsing. | | Flat CST (parens + interleaved tokens), not rich AST | Enough for indentation/formatting checks. No overengineering. | | Cost-based optimal layout for the formatter | Greedy formatters produce mediocre output. Wadler/Leijen is cleaner and provably correct. Worth the investment when we reach that phase. | | Checker first, formatter later | Checking is simpler, immediately useful in CI, and validates the tokenizer/CST infrastructure that the formatter will build on. | --- ## Critical files to reference during implementation - `refs/guile/module/ice-9/read.scm:949-973` — what the reader discards (our tokenizer must keep) - `refs/guile/module/language/tree-il/analyze.scm:1461-1479` — `make-analyzer` API - `refs/guile/module/system/base/compile.scm:298-340` — `read-and-compile` / `compile` - `refs/guile/module/system/base/message.scm:83-220` — `%warning-types` definitions - `refs/guile/module/language/tree-il.scm` — Tree-IL node types and traversal - `refs/guile/module/ice-9/pretty-print.scm` — existing pretty-printer (form-specific rules to extract) - `refs/mallet/src/parser/tokenizer.lisp` — reference tokenizer (163 lines) - `refs/fmt/conventions.rkt` — form-specific formatting rules (100+ forms) - `refs/fmt/main.rkt` — cost-based layout selection implementation