Files
gulie/docs/PLAN.md
2026-04-01 23:35:50 +02:00

471 lines
19 KiB
Markdown

# Gulie — Guile Linter/Formatter: Architecture & Implementation Plan
## Context
No linter, formatter, or static analyser exists for Guile Scheme. We're building
one from scratch, called **gulie**. The tool is written in Guile itself, reusing
as much of Guile's infrastructure as possible (reader, compiler, Tree-IL
analyses, warning system). The design draws on patterns observed in 7 reference
tools (see `docs/INSPIRATION.md`).
Guile 3.0.11 is available in the devenv. No source code exists yet.
---
## High-level architecture
Two independent passes, extensible to three:
```
.gulie.sexp (config)
|
file.scm ──┬──> [Tokenizer] ──> tokens ──> [CST parser] ──> CST
| |
| [Pass 1: Surface] line rules + CST rules
| |
| diagnostics-1
|
└──> [Guile reader] ──> s-exprs ──> [Guile compiler] ──> Tree-IL
|
[Pass 2: Semantic] built-in analyses + custom Tree-IL rules
|
diagnostics-2
|
[merge + suppress + sort + report/fix]
```
**Why two passes?** Guile's reader (`ice-9/read.scm:949-973`) irrecoverably
strips comments, whitespace, and datum comments in `next-non-whitespace`. There
is no way to get formatting info AND semantic info from one parse. Accepting this
and building two clean, independent passes is simpler than fighting the reader.
---
## Module structure
```
gulie/
bin/gulie # CLI entry point (executable Guile script)
gulie/
cli.scm # (gulie cli) — arg parsing, dispatch
config.scm # (gulie config) — .gulie.sexp loading, defaults, merging
diagnostic.scm # (gulie diagnostic) — record type, sorting, formatting
tokenizer.scm # (gulie tokenizer) — hand-written lexer, preserves everything
cst.scm # (gulie cst) — token stream → concrete syntax tree
compiler.scm # (gulie compiler) — Guile compile wrapper, warning capture
rule.scm # (gulie rule) — rule record, registry, define-rule macros
engine.scm # (gulie engine) — orchestrator: file discovery, pass sequencing
fixer.scm # (gulie fixer) — fix application (bottom-to-top edits)
suppression.scm # (gulie suppression) — ; gulie:suppress parsing/filtering
formatter.scm # (gulie formatter) — cost-based optimal pretty-printer
rules/
surface.scm # (gulie rules surface) — trailing-ws, line-length, tabs, blanks
indentation.scm # (gulie rules indentation) — indent checking vs CST
comments.scm # (gulie rules comments) — comment style conventions
semantic.scm # (gulie rules semantic) — wrappers around Guile's analyses
idiom.scm # (gulie rules idiom) — pattern-based suggestions via match
module-form.scm # (gulie rules module-form) — define-module checks
test/
test-tokenizer.scm
test-cst.scm
test-rules-surface.scm
test-rules-semantic.scm
fixtures/
clean/ # .scm files producing zero diagnostics
violations/ # .scm + .expected pairs (snapshot testing)
```
~16 source files. Each has one clear job.
---
## Key components
### Tokenizer (`gulie/tokenizer.scm`)
Hand-written character-by-character state machine. Must handle the same lexical
syntax as Guile's reader but **preserve** what the reader discards.
```scheme
(define-record-type <token>
(make-token type text line column)
token?
(type token-type) ;; symbol (see list below)
(text token-text) ;; string: exact source text
(line token-line) ;; integer: 1-based
(column token-column)) ;; integer: 0-based
```
Token types (~15): `open-paren`, `close-paren`, `symbol`, `number`, `string`,
`keyword`, `boolean`, `character`, `prefix` (`'`, `` ` ``, `,`, `,@`, `#'`,
etc.), `special` (`#;`, `#(`, `#vu8(`, etc.), `line-comment`, `block-comment`,
`whitespace`, `newline`, `dot`.
**Critical invariant:** `(string-concatenate (map token-text (tokenize input)))` must
reproduce the original input exactly. This is our primary roundtrip test.
Estimated size: ~200-250 lines. Reference: Mallet's tokenizer (163 lines CL).
### CST (`gulie/cst.scm`)
Trivial parenthesised tree built from the token stream:
```scheme
(define-record-type <cst-node>
(make-cst-node open close children)
cst-node?
(open cst-node-open) ;; <token> for ( [ {
(close cst-node-close) ;; <token> for ) ] }
(children cst-node-children)) ;; list of <cst-node> | <token>
```
Children is a flat list of interleaved atoms (tokens) and nested nodes. Comments
and whitespace are children like anything else.
The first non-whitespace symbol child of a `<cst-node>` identifies the form
(`define`, `let`, `cond`, etc.) — enough for indentation rules.
Estimated size: ~80-100 lines.
### Compiler wrapper (`gulie/compiler.scm`)
Wraps Guile's compile pipeline to capture warnings as structured diagnostics:
```scheme
;; Key Guile APIs we delegate to:
;; - (system base compile): read-and-compile, compile, default-warning-level
;; - (language tree-il analyze): make-analyzer, analyze-tree
;; - (system base message): %warning-types, current-warning-port
```
Strategy: call `read-and-compile` with `#:to 'tree-il` and `#:warning-level 2`
while redirecting `current-warning-port` to a string port, then parse the
warning output into `<diagnostic>` records. Alternatively, invoke `make-analyzer`
directly and hook the warning printers.
Guile's built-in analyses (all free):
- `unused-variable-analysis`
- `unused-toplevel-analysis`
- `unused-module-analysis`
- `shadowed-toplevel-analysis`
- `make-use-before-definition-analysis` (unbound variables)
- `arity-analysis` (wrong arg count)
- `format-analysis` (format string validation)
### Rule system (`gulie/rule.scm`)
```scheme
(define-record-type <rule>
(make-rule name description severity category type check-proc fix-proc)
rule?
(name rule-name) ;; symbol
(description rule-description) ;; string
(severity rule-severity) ;; 'error | 'warning | 'info
(category rule-category) ;; 'format | 'style | 'correctness | 'idiom
(type rule-type) ;; 'line | 'cst | 'tree-il
(check-proc rule-check-proc) ;; procedure (signature depends on type)
(fix-proc rule-fix-proc)) ;; procedure | #f
```
Three rule types with different check signatures:
- **`'line`** — `(lambda (file line-num line-text config) -> diagnostics)` — fastest, no parsing
- **`'cst`** — `(lambda (file cst config) -> diagnostics)` — needs tokenizer+CST
- **`'tree-il`** — `(lambda (file tree-il env config) -> diagnostics)` — needs compilation
Global registry: `*rules*` alist, populated at module load time via
`register-rule!`. Convenience macros: `define-line-rule`, `define-cst-rule`,
`define-tree-il-rule`.
### Diagnostic record (`gulie/diagnostic.scm`)
```scheme
(define-record-type <diagnostic>
(make-diagnostic file line column severity rule message fix)
diagnostic?
(file diagnostic-file) ;; string
(line diagnostic-line) ;; integer, 1-based
(column diagnostic-column) ;; integer, 0-based
(severity diagnostic-severity) ;; symbol
(rule diagnostic-rule) ;; symbol
(message diagnostic-message) ;; string
(fix diagnostic-fix)) ;; <fix> | #f
```
Standard output: `file:line:column: severity: rule: message`
### Config (`gulie/config.scm`)
File: `.gulie.sexp` in project root (plain s-expression, read with `(read)`,
never evaluated):
```scheme
((line-length . 80)
(indent . 2)
(enable trailing-whitespace line-length unused-variable arity-mismatch)
(disable tabs)
(rules
(line-length (max . 100)))
(indent-rules
(with-syntax . 1)
(match . 1))
(ignore "build/**" ".direnv/**"))
```
Precedence: CLI flags > config file > built-in defaults.
`--init` generates a template with all rules listed and commented.
### Suppression (`gulie/suppression.scm`)
```scheme
;; gulie:suppress trailing-whitespace — suppress on next line
(define x "messy")
(define x "messy") ; gulie:suppress — suppress on this line
;; gulie:disable line-length — region disable
... code ...
;; gulie:enable line-length — region enable
```
Parsed from raw text before rules run. Produces a suppression map that filters
diagnostics after all rules have emitted.
---
## Indentation rules
The key data is `scheme-indent-function` values from `.dir-locals.el` — an
integer N meaning "N arguments on first line, then body indented +2":
```scheme
(define *default-indent-rules*
'((define . 1) (define* . 1) (define-public . 1) (define-syntax . 1)
(define-module . 0) (lambda . 1) (lambda* . 1)
(let . 1) (let* . 1) (letrec . 1) (letrec* . 1)
(if . #f) (cond . 0) (case . 1) (when . 1) (unless . 1)
(match . 1) (syntax-case . 2) (with-syntax . 1)
(begin . 0) (do . 2) (parameterize . 1) (guard . 1)))
```
Overridable via config `indent-rules`. The indentation checker walks the CST,
identifies the form by its first symbol child, looks up the rule, and compares
actual indentation to expected.
---
## Formatting conventions (Guile vs Guix)
Both use 2-space indent, same special-form conventions. Key difference:
- **Guile:** 72-char fill column, `;;; {Section}` headers
- **Guix:** 78-80 char fill column, `;;` headers
Our default config targets Guile conventions. A Guix preset can override
`line-length` and comment style.
---
## Formatter: cost-based optimal pretty-printing
The formatter (`gulie/formatter.scm`) is a later-phase component that
**rewrites** files with correct layout, as opposed to the indentation checker
which merely **reports** violations.
### Why cost-based?
When deciding where to break lines in a long expression, there are often multiple
valid options. A greedy approach (fill as much as fits, then break) produces
mediocre output — it can't "look ahead" to see that a break earlier would produce
a better overall layout. The Wadler/Leijen family of algorithms evaluates
alternative layouts and selects the optimal one.
### The algorithm (Wadler/Leijen, as used by fmt's `pretty-expressive`)
The pretty-printer works with an abstract **document** type:
```
doc = text(string) — literal text
| line — line break (or space if flattened)
| nest(n, doc) — increase indent by n
| concat(doc, doc) — concatenation
| alt(doc, doc) — choose better of two layouts
| group(doc) — try flat first, break if doesn't fit
```
The key operator is `alt(a, b)` — "try layout A, but if it overflows the page
width, use layout B instead." The algorithm evaluates both alternatives and
picks the one with the lower **cost vector**:
```
cost = [badness, height, characters]
badness — quadratic penalty for exceeding page width
height — number of lines used
characters — total chars (tiebreaker)
```
This produces provably optimal output: the layout that minimises overflow while
using the fewest lines.
### How it fits our architecture
```
CST (from tokenizer + cst.scm)
→ [doc generator] convert CST nodes to abstract doc, using form-specific rules
→ [layout solver] evaluate alternatives, select optimal layout
→ [renderer] emit formatted text with comments preserved
```
The **doc generator** uses the same form-identification logic as the indentation
checker (first symbol child of a CST node) to apply form-specific layout rules.
For example:
- `define` — name on first line, body indented
- `let` — bindings as aligned block, body indented
- `cond` — each clause on its own line
These rules are data (the `indent-rules` table extended with layout hints),
making the formatter configurable just like the checker.
### Implementation approach
We can either:
1. **Port `pretty-expressive`** from Racket — the core algorithm is ~300 lines,
well-documented in academic papers
2. **Upgrade Guile's `(ice-9 pretty-print)`** — it already knows form-specific
indentation rules but uses greedy layout; we'd replace the layout engine with
cost-based selection
Option 1 is cleaner (purpose-built). Option 2 reuses more existing code but
would be a heavier modification. We'll decide when we reach that phase.
### Phase note
The formatter is **Phase 6** work. Phases 0-4 deliver a useful checker without
it. The indentation checker (Phase 4) validates existing formatting; the
formatter (Phase 6) rewrites it. The checker comes first because it's simpler
and immediately useful in CI.
---
## CLI interface
```
gulie [OPTIONS] [FILE|DIR...]
--check Report issues, exit non-zero on findings (default)
--fix Fix mode: auto-fix what's possible, report the rest
--format Format mode: rewrite files with optimal layout
--init Generate .gulie.sexp template
--pass PASS Run only: surface, semantic, all (default: all)
--rule RULE Enable only this rule (repeatable)
--disable RULE Disable this rule (repeatable)
--severity SEV Minimum severity: error, warning, info
--output FORMAT Output: standard (default), json, compact
--config FILE Config file path (default: auto-discover)
--list-rules List all rules and exit
--version Print version
```
Exit codes: 0 = clean, 1 = findings, 2 = config error, 3 = internal error.
---
## Implementation phases
### Phase 0: Skeleton
- `bin/gulie` — shebang script, loads CLI module
- `(gulie cli)` — basic arg parsing (`--check`, `--version`, file args)
- `(gulie diagnostic)` — record type + standard formatter
- `(gulie rule)` — record type + registry + `register-rule!`
- `(gulie engine)` — discovers `.scm` files, runs line rules, reports
- One trivial rule: `trailing-whitespace` (line rule)
- **Verification:** `gulie --check some-file.scm` reports trailing whitespace
### Phase 1: Tokenizer + CST + surface rules
- `(gulie tokenizer)` — hand-written lexer
- `(gulie cst)` — token → tree
- Surface rules: `trailing-whitespace`, `line-length`, `no-tabs`, `blank-lines`
- Comment rule: `comment-semicolons` (check `;`/`;;`/`;;;` usage)
- Roundtrip test: tokenize → concat = original
- Snapshot tests for each rule
### Phase 2: Semantic rules (compiler pass)
- `(gulie compiler)` — `read-and-compile` wrapper, warning capture
- Semantic rules wrapping Guile's built-in analyses:
`unused-variable`, `unused-toplevel`, `unbound-variable`, `arity-mismatch`,
`format-string`, `shadowed-toplevel`, `unused-module`
- **Verification:** run against Guile and Guix source files, check false-positive rate
### Phase 3: Config + suppression
- `(gulie config)` — `.gulie.sexp` loading + merging
- `(gulie suppression)` — inline comment suppression
- `--init` command
- Rule enable/disable via config and CLI
### Phase 4: Indentation checking
- `(gulie rules indentation)` — CST-based indent checker
- Default indent rules for standard Guile forms
- Configurable `indent-rules` in `.gulie.sexp`
### Phase 5: Fix mode + idiom rules
- `(gulie fixer)` — bottom-to-top edit application
- Auto-fix for: trailing whitespace, line-length (where possible)
- `(gulie rules idiom)` — `match`-based pattern suggestions on Tree-IL
- `(gulie rules module-form)` — `define-module` form checks (sorted imports, etc.)
### Phase 6: Formatter (cost-based optimal layout)
- `(gulie formatter)` — Wadler/Leijen pretty-printer with cost-based selection
- Abstract document type: `text`, `line`, `nest`, `concat`, `alt`, `group`
- Form-specific layout rules (reuse indent-rules table + layout hints)
- Comment preservation through formatting
- `--format` CLI mode
- **Verification:** format Guile/Guix source files, diff against originals,
verify roundtrip stability (format twice = same output)
### Phase 7: Cross-module analysis (future)
- Load multiple modules, walk dependency graph
- Unused exports, cross-module arity checks
- `--pass cross-module` CLI option
---
## Testing strategy
1. **Roundtrip test** (tokenizer): tokenize → concat must equal original input
2. **Snapshot tests**: `fixtures/violations/rule-name.scm` + `.expected` pairs
3. **Clean file tests**: `fixtures/clean/*.scm` must produce zero diagnostics
4. **Unit tests**: `(srfi srfi-64)` for tokenizer, CST, config, diagnostics
5. **Real-world corpus**: run against `test/guix/` and `refs/guile/module/` for
false-positive rate validation
6. **Formatter idempotency**: `format(format(x)) = format(x)` for all test files
---
## Key design decisions
| Decision | Rationale |
|----------|-----------|
| Hand-written tokenizer, not extending Guile's reader | The reader is ~1000 lines of nested closures not designed for extension. A clean 200-line tokenizer is easier to write/test. |
| Two independent passes, not a unified AST | Reader strips comments irrecoverably. Accepting this gives clean separation. |
| Delegate to Guile's built-in analyses | They're battle-tested, handle macroexpansion edge cases, and are maintained upstream. |
| `(ice-9 match)` for idiom rules, not logic programming | Built-in, fast, sufficient. miniKanren can be added later if needed. |
| S-expression config, not YAML/TOML | Zero deps. Our users write Scheme. `(read)` does the parsing. |
| Flat CST (parens + interleaved tokens), not rich AST | Enough for indentation/formatting checks. No overengineering. |
| Cost-based optimal layout for the formatter | Greedy formatters produce mediocre output. Wadler/Leijen is cleaner and provably correct. Worth the investment when we reach that phase. |
| Checker first, formatter later | Checking is simpler, immediately useful in CI, and validates the tokenizer/CST infrastructure that the formatter will build on. |
---
## Critical files to reference during implementation
- `refs/guile/module/ice-9/read.scm:949-973` — what the reader discards (our tokenizer must keep)
- `refs/guile/module/language/tree-il/analyze.scm:1461-1479` — `make-analyzer` API
- `refs/guile/module/system/base/compile.scm:298-340` — `read-and-compile` / `compile`
- `refs/guile/module/system/base/message.scm:83-220` — `%warning-types` definitions
- `refs/guile/module/language/tree-il.scm` — Tree-IL node types and traversal
- `refs/guile/module/ice-9/pretty-print.scm` — existing pretty-printer (form-specific rules to extract)
- `refs/mallet/src/parser/tokenizer.lisp` — reference tokenizer (163 lines)
- `refs/fmt/conventions.rkt` — form-specific formatting rules (100+ forms)
- `refs/fmt/main.rkt` — cost-based layout selection implementation