# Gulie — Guile Linter/Formatter: Architecture & Implementation Plan

## Context

No linter, formatter, or static analyser exists for Guile Scheme. We're building
one from scratch, called **gulie**. The tool is written in Guile itself, reusing
as much of Guile's infrastructure as possible (reader, compiler, Tree-IL
analyses, warning system). The design draws on patterns observed in 7 reference
tools (see `docs/INSPIRATION.md`).

Guile 3.0.11 is available in the devenv. No source code exists yet.

---

## High-level architecture

Two independent passes, extensible to three:

```
                         .gulie.sexp (config)
                                   |
file.scm ──┬──> [Tokenizer] ──> tokens ──> [CST parser] ──> CST
           |                                                 |
           |                                         [Pass 1: Surface]  line rules + CST rules
           |                                                 |
           |                                           diagnostics-1
           |
           └──> [Guile reader] ──> s-exprs ──> [Guile compiler] ──> Tree-IL
                                                                       |
                                                              [Pass 2: Semantic]  built-in analyses + custom Tree-IL rules
                                                                       |
                                                                 diagnostics-2
                                                                       |
                                                    [merge + suppress + sort + report/fix]
```

**Why two passes?** Guile's reader (`ice-9/read.scm:949-973`) irrecoverably
strips comments, whitespace, and datum comments in `next-non-whitespace`. There
is no way to get formatting info AND semantic info from one parse. Accepting this
and building two clean, independent passes is simpler than fighting the reader.

---

## Module structure

```
gulie/
  bin/gulie                    # CLI entry point (executable Guile script)
  gulie/
    cli.scm                    # (gulie cli) — arg parsing, dispatch
    config.scm                 # (gulie config) — .gulie.sexp loading, defaults, merging
    diagnostic.scm             # (gulie diagnostic) — record type, sorting, formatting
    tokenizer.scm              # (gulie tokenizer) — hand-written lexer, preserves everything
    cst.scm                    # (gulie cst) — token stream → concrete syntax tree
    compiler.scm               # (gulie compiler) — Guile compile wrapper, warning capture
    rule.scm                   # (gulie rule) — rule record, registry, define-rule macros
    engine.scm                 # (gulie engine) — orchestrator: file discovery, pass sequencing
    fixer.scm                  # (gulie fixer) — fix application (bottom-to-top edits)
    suppression.scm            # (gulie suppression) — ; gulie:suppress parsing/filtering
    formatter.scm              # (gulie formatter) — cost-based optimal pretty-printer
    rules/
      surface.scm              # (gulie rules surface) — trailing-ws, line-length, tabs, blanks
      indentation.scm          # (gulie rules indentation) — indent checking vs CST
      comments.scm             # (gulie rules comments) — comment style conventions
      semantic.scm             # (gulie rules semantic) — wrappers around Guile's analyses
      idiom.scm                # (gulie rules idiom) — pattern-based suggestions via match
      module-form.scm          # (gulie rules module-form) — define-module checks
  test/
    test-tokenizer.scm
    test-cst.scm
    test-rules-surface.scm
    test-rules-semantic.scm
    fixtures/
      clean/                   # .scm files producing zero diagnostics
      violations/              # .scm + .expected pairs (snapshot testing)
```

17 source modules plus the `bin/gulie` script. Each has one clear job.

---

## Key components

### Tokenizer (`gulie/tokenizer.scm`)

Hand-written character-by-character state machine. Must handle the same lexical
syntax as Guile's reader but **preserve** what the reader discards.

```scheme
(define-record-type <token>
  (make-token type text line column)
  token?
  (type   token-type)     ;; symbol (see list below)
  (text   token-text)     ;; string: exact source text
  (line   token-line)     ;; integer: 1-based
  (column token-column))  ;; integer: 0-based
```

Token types (~15): `open-paren`, `close-paren`, `symbol`, `number`, `string`,
`keyword`, `boolean`, `character`, `prefix` (`'`, `` ` ``, `,`, `,@`, `#'`,
etc.), `special` (`#;`, `#(`, `#vu8(`, etc.), `line-comment`, `block-comment`,
`whitespace`, `newline`, `dot`.

**Critical invariant:** `(string-concatenate (map token-text (tokenize input)))` must
reproduce the original input exactly. This is our primary roundtrip test.

Estimated size: ~200-250 lines. Reference: Mallet's tokenizer (163 lines of Common Lisp).

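To make the roundtrip invariant concrete, here is a minimal sketch of a lossless tokenizer. It is not the planned implementation: it covers only parentheses, whitespace, newlines, line comments, and strings, and lumps every other atom into `symbol`. The helpers `delimiter?`, `scan-while`, `scan-string`, and `roundtrip` are hypothetical names, and line/column tracking ignores newlines inside string literals.

```scheme
(use-modules (srfi srfi-9))

(define-record-type <token>
  (make-token type text line column)
  token?
  (type   token-type)
  (text   token-text)
  (line   token-line)
  (column token-column))

(define (delimiter? c)
  (or (char-whitespace? c) (memv c '(#\( #\) #\" #\;))))

;; Index just past the run of characters starting at I that satisfy PRED.
(define (scan-while str i pred)
  (let loop ((j i))
    (if (and (< j (string-length str)) (pred (string-ref str j)))
        (loop (+ j 1))
        j)))

;; Index just past the string literal whose opening quote is at I.
;; An unterminated literal consumes the rest of the input.
(define (scan-string str i)
  (let ((len (string-length str)))
    (let loop ((j (+ i 1)))
      (cond ((>= j len) len)
            ((char=? (string-ref str j) #\\) (loop (min (+ j 2) len)))
            ((char=? (string-ref str j) #\") (+ j 1))
            (else (loop (+ j 1)))))))

(define (tokenize str)
  (let loop ((i 0) (line 1) (col 0) (acc '()))
    (if (>= i (string-length str))
        (reverse acc)
        (let* ((c (string-ref str i))
               (type (cond ((char=? c #\newline) 'newline)
                           ((char-whitespace? c) 'whitespace)
                           ((char=? c #\() 'open-paren)
                           ((char=? c #\)) 'close-paren)
                           ((char=? c #\;) 'line-comment)
                           ((char=? c #\") 'string)
                           (else 'symbol)))   ; simplification: any other atom
               (end (case type
                      ((newline open-paren close-paren) (+ i 1))
                      ((whitespace)
                       (scan-while str i
                                   (lambda (ch)
                                     (and (char-whitespace? ch)
                                          (not (char=? ch #\newline))))))
                      ((line-comment)
                       (scan-while str i
                                   (lambda (ch) (not (char=? ch #\newline)))))
                      ((string) (scan-string str i))
                      (else
                       (scan-while str i
                                   (lambda (ch) (not (delimiter? ch)))))))
               (tok (make-token type (substring str i end) line col)))
          (if (eq? type 'newline)
              (loop end (+ line 1) 0 (cons tok acc))
              (loop end line (+ col (- end i)) (cons tok acc)))))))

;; The roundtrip invariant from the text above.
(define (roundtrip str)
  (string-concatenate (map token-text (tokenize str))))
```

Because every token's text is a contiguous slice of the input and no character is skipped, `(string=? (roundtrip s) s)` holds for any string `s` in this sketch, including malformed input such as an unterminated string.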
### CST (`gulie/cst.scm`)

Trivial parenthesised tree built from the token stream:

```scheme
(define-record-type <cst-node>
  (make-cst-node open close children)
  cst-node?
  (open     cst-node-open)       ;; <token> for ( [ {
  (close    cst-node-close)      ;; <token> for ) ] }
  (children cst-node-children))  ;; list of <cst-node> | <token>
```

Children is a flat list of interleaved atoms (tokens) and nested nodes. Comments
and whitespace are children like anything else.

The first non-whitespace symbol child of a `<cst-node>` identifies the form
(`define`, `let`, `cond`, etc.) — enough for indentation rules.

Estimated size: ~80-100 lines.

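A rough sketch of the token-to-tree step and the form-identification helper follows. The names `parse-children`, `tokens->cst`, `trivia?`, and `cst-form-head` are hypothetical, and two choices are assumptions rather than part of the plan: an unclosed node gets `#f` as its close token, and a stray closing paren at top level is kept as a plain token so nothing is dropped. Only `(`/`)` tokens are handled.

```scheme
(use-modules (srfi srfi-9) (ice-9 receive))

(define-record-type <token>
  (make-token type text line column)
  token?
  (type token-type) (text token-text)
  (line token-line) (column token-column))

(define-record-type <cst-node>
  (make-cst-node open close children)
  cst-node?
  (open     cst-node-open)
  (close    cst-node-close)
  (children cst-node-children))

;; Consume tokens up to the matching close-paren (or end of input).
;; Returns three values: children in order, the close token or #f,
;; and the remaining tokens.
(define (parse-children tokens)
  (let loop ((tokens tokens) (acc '()))
    (cond ((null? tokens) (values (reverse acc) #f '()))
          ((eq? (token-type (car tokens)) 'close-paren)
           (values (reverse acc) (car tokens) (cdr tokens)))
          ((eq? (token-type (car tokens)) 'open-paren)
           (receive (children close rest) (parse-children (cdr tokens))
             (loop rest
                   (cons (make-cst-node (car tokens) close children) acc))))
          (else (loop (cdr tokens) (cons (car tokens) acc))))))

;; Top level: a list of nodes and tokens.
(define (tokens->cst tokens)
  (let loop ((tokens tokens) (acc '()))
    (receive (children close rest) (parse-children tokens)
      (let ((acc (append acc children (if close (list close) '()))))
        (if (null? rest) acc (loop rest acc))))))

(define (trivia? x)
  (and (token? x)
       (memq (token-type x)
             '(whitespace newline line-comment block-comment))
       #t))

;; Text of the head symbol, or #f when the first non-trivia child
;; is not a symbol token (for example a nested list).
(define (cst-form-head node)
  (let loop ((children (cst-node-children node)))
    (cond ((null? children) #f)
          ((trivia? (car children)) (loop (cdr children)))
          ((and (token? (car children))
                (eq? (token-type (car children)) 'symbol))
           (token-text (car children)))
          (else #f))))

;; Hand-built tokens for "(define x 1)"; positions are dummies here.
(define (tok type text) (make-token type text 1 0))
(define example-cst
  (tokens->cst
   (list (tok 'open-paren "(") (tok 'symbol "define") (tok 'whitespace " ")
         (tok 'symbol "x") (tok 'whitespace " ") (tok 'number "1")
         (tok 'close-paren ")"))))
```

Here `example-cst` is a one-element list holding a single node whose head is `"define"` and whose five children include the two whitespace tokens.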
### Compiler wrapper (`gulie/compiler.scm`)

Wraps Guile's compile pipeline to capture warnings as structured diagnostics:

```scheme
;; Key Guile APIs we delegate to:
;; - (system base compile): read-and-compile, compile, default-warning-level
;; - (language tree-il analyze): make-analyzer, analyze-tree
;; - (system base message): %warning-types
;; - core binding: current-warning-port
```

Strategy: call `read-and-compile` with `#:to 'tree-il` and `#:warning-level 2`
while redirecting `current-warning-port` to a string port, then parse the
warning output into `<diagnostic>` records. Alternatively, invoke `make-analyzer`
directly and hook the warning printers.

Guile's built-in analyses (all free):

- `unused-variable-analysis`
- `unused-toplevel-analysis`
- `unused-module-analysis`
- `shadowed-toplevel-analysis`
- `make-use-before-definition-analysis` (unbound variables)
- `arity-analysis` (wrong arg count)
- `format-analysis` (format string validation)

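The first strategy above might be sketched as follows. `compile-capturing-warnings` and `parse-warning-line` are hypothetical helpers. Two points are assumptions to verify against the Guile sources: that warnings are printed as `;;; FILE:LINE:COLUMN: warning: MESSAGE`, and that the Tree-IL analyses actually run when compilation stops at `'tree-il`. Because the second point may depend on the Guile version, the target language is a keyword argument rather than hard-coded.

```scheme
(use-modules (system base compile)
             (ice-9 regex))

;; Compile SOURCE (a string of Scheme code) and return two values:
;; the compiled result and the non-empty lines written to the
;; warning port during compilation.
(define* (compile-capturing-warnings source
                                     #:key (to 'tree-il) (warning-level 2))
  (let* ((sink (open-output-string))
         (result
          (parameterize ((current-warning-port sink))
            (call-with-input-string source
              (lambda (port)
                (read-and-compile port
                                  #:from 'scheme
                                  #:to to
                                  #:warning-level warning-level))))))
    (values result
            (filter (lambda (line) (not (string-null? line)))
                    (string-split (get-output-string sink) #\newline)))))

;; ASSUMED line format: ";;; FILE:LINE:COLUMN: warning: MESSAGE".
(define warning-rx
  (make-regexp "^;;; ([^:]+):([0-9]+):([0-9]+): warning: (.*)$"))

;; Returns (file line column message), or #f for lines that do not match.
(define (parse-warning-line line)
  (let ((m (regexp-exec warning-rx line)))
    (and m
         (list (match:substring m 1)
               (string->number (match:substring m 2))
               (string->number (match:substring m 3))
               (match:substring m 4)))))
```

Parsing printed text is fragile, since file names containing colons and multi-line messages would both defeat this regexp. That fragility is the main argument for the `make-analyzer` alternative mentioned above.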
### Rule system (`gulie/rule.scm`)

```scheme
(define-record-type <rule>
  (make-rule name description severity category type check-proc fix-proc)
  rule?
  (name        rule-name)         ;; symbol
  (description rule-description)  ;; string
  (severity    rule-severity)     ;; 'error | 'warning | 'info
  (category    rule-category)     ;; 'format | 'style | 'correctness | 'idiom
  (type        rule-type)         ;; 'line | 'cst | 'tree-il
  (check-proc  rule-check-proc)   ;; procedure (signature depends on type)
  (fix-proc    rule-fix-proc))    ;; procedure | #f
```

Three rule types with different check signatures:

- **`'line`** — `(lambda (file line-num line-text config) -> diagnostics)` — fastest, no parsing
- **`'cst`** — `(lambda (file cst config) -> diagnostics)` — needs tokenizer+CST
- **`'tree-il`** — `(lambda (file tree-il env config) -> diagnostics)` — needs compilation

Global registry: `*rules*` alist, populated at module load time via
`register-rule!`. Convenience macros: `define-line-rule`, `define-cst-rule`,
`define-tree-il-rule`.

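As an illustration of the registry and one convenience macro, consider the sketch below. The macro's argument shape, the `find-rule` helper, and the replace-on-reregister behaviour are assumptions made for the example, and diagnostics are shown as plain lists instead of `<diagnostic>` records to keep it short.

```scheme
(use-modules (srfi srfi-9))

(define-record-type <rule>
  (make-rule name description severity category type check-proc fix-proc)
  rule?
  (name        rule-name)
  (description rule-description)
  (severity    rule-severity)
  (category    rule-category)
  (type        rule-type)
  (check-proc  rule-check-proc)
  (fix-proc    rule-fix-proc))

;; Global registry: alist of (name . rule).
(define *rules* '())

;; Registering a name twice replaces the earlier entry, so reloading
;; a rules module does not create duplicates (an assumption here).
(define (register-rule! rule)
  (set! *rules*
        (cons (cons (rule-name rule) rule)
              (filter (lambda (entry)
                        (not (eq? (car entry) (rule-name rule))))
                      *rules*))))

(define (find-rule name)
  (assq-ref *rules* name))

;; Hypothetical macro shape: (define-line-rule (NAME DESC SEVERITY) PROC).
(define-syntax-rule (define-line-rule (name description severity) check-proc)
  (register-rule!
   (make-rule 'name description 'severity 'format 'line check-proc #f)))

(define-line-rule (trailing-whitespace "Line ends with whitespace" warning)
  (lambda (file line-num line-text config)
    (let ((trimmed (string-trim-right line-text)))
      (if (< (string-length trimmed) (string-length line-text))
          (list (list file line-num (string-length trimmed)
                      'warning 'trailing-whitespace "trailing whitespace"))
          '()))))

;; Usage: look the rule up and run its check on one line.
(define example-diagnostics
  ((rule-check-proc (find-rule 'trailing-whitespace))
   "a.scm" 1 "(foo)  " '()))
;; example-diagnostics
;; => (("a.scm" 1 5 warning trailing-whitespace "trailing whitespace"))
```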
### Diagnostic record (`gulie/diagnostic.scm`)

```scheme
(define-record-type <diagnostic>
  (make-diagnostic file line column severity rule message fix)
  diagnostic?
  (file     diagnostic-file)      ;; string
  (line     diagnostic-line)      ;; integer, 1-based
  (column   diagnostic-column)    ;; integer, 0-based
  (severity diagnostic-severity)  ;; symbol
  (rule     diagnostic-rule)      ;; symbol
  (message  diagnostic-message)   ;; string
  (fix      diagnostic-fix))      ;; <fix> | #f
```

Standard output: `file:line:column: severity: rule: message`

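The standard output format and a plausible sort order could look like the sketch below. `format-diagnostic`, `diagnostic<?`, and `sort-diagnostics` are hypothetical names, and ordering by file, then line, then column is an assumption; the plan only says diagnostics are sorted.

```scheme
(use-modules (srfi srfi-9))

(define-record-type <diagnostic>
  (make-diagnostic file line column severity rule message fix)
  diagnostic?
  (file     diagnostic-file)
  (line     diagnostic-line)
  (column   diagnostic-column)
  (severity diagnostic-severity)
  (rule     diagnostic-rule)
  (message  diagnostic-message)
  (fix      diagnostic-fix))

;; Render as "file:line:column: severity: rule: message".
(define (format-diagnostic d)
  (format #f "~a:~a:~a: ~a: ~a: ~a"
          (diagnostic-file d) (diagnostic-line d) (diagnostic-column d)
          (diagnostic-severity d) (diagnostic-rule d)
          (diagnostic-message d)))

;; Order by file name, then line, then column.
(define (diagnostic<? a b)
  (let ((fa (diagnostic-file a)) (fb (diagnostic-file b)))
    (cond ((string<? fa fb) #t)
          ((string<? fb fa) #f)
          ((< (diagnostic-line a) (diagnostic-line b)) #t)
          ((> (diagnostic-line a) (diagnostic-line b)) #f)
          (else (< (diagnostic-column a) (diagnostic-column b))))))

(define (sort-diagnostics diagnostics)
  (sort diagnostics diagnostic<?))

(define example
  (make-diagnostic "a.scm" 3 0 'warning 'trailing-whitespace
                   "trailing whitespace" #f))
;; (format-diagnostic example)
;; => "a.scm:3:0: warning: trailing-whitespace: trailing whitespace"
```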
### Config (`gulie/config.scm`)

File: `.gulie.sexp` in project root (plain s-expression, read with `(read)`,
never evaluated):

```scheme
((line-length . 80)
 (indent . 2)
 (enable trailing-whitespace line-length unused-variable arity-mismatch)
 (disable tabs)
 (rules
  (line-length (max . 100)))
 (indent-rules
  (with-syntax . 1)
  (match . 1))
 (ignore "build/**" ".direnv/**"))
```

Precedence: CLI flags > config file > built-in defaults.

`--init` generates a template with all rules listed and commented.

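One simple way to get this precedence is to concatenate the alists with the highest-priority layer first and let `assq-ref` return the first match. The sketch below assumes that approach; `%default-config`, `read-config-string`, `merge-config`, and `config-ref` are hypothetical names, and the default values shown are placeholders.

```scheme
;; Placeholder defaults (assumed values, for illustration only).
(define %default-config
  '((line-length . 72)
    (indent . 2)))

;; `read` parses one datum and never evaluates it, so a config file
;; cannot run code.
(define (read-config-string str)
  (call-with-input-string str read))

;; Highest-precedence layer first; lookups stop at the first match.
(define (merge-config cli-config file-config)
  (append cli-config file-config %default-config))

(define (config-ref config key)
  (assq-ref config key))

(define file-config
  (read-config-string
   "((line-length . 80) (disable tabs) (ignore \"build/**\"))"))

(define merged-no-cli (merge-config '() file-config))
(define merged-with-cli (merge-config '((line-length . 100)) file-config))

;; (config-ref merged-no-cli 'line-length)   => 80   (file beats default)
;; (config-ref merged-with-cli 'line-length) => 100  (CLI beats file)
;; (config-ref merged-no-cli 'indent)        => 2    (default)
;; (config-ref merged-no-cli 'disable)       => (tabs)
```

Note that list-valued entries such as `(disable tabs)` are replaced wholesale by a higher layer under this scheme, not unioned; whether that is the desired merge behaviour is a design choice the plan leaves open.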
### Suppression (`gulie/suppression.scm`)

```scheme
;; gulie:suppress trailing-whitespace — suppress on next line
(define x "messy")

(define x "messy") ; gulie:suppress — suppress on this line

;; gulie:disable line-length — region disable
... code ...
;; gulie:enable line-length — region enable
```

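Recognising these directives on a single line could be sketched as below. `parse-directive` is a hypothetical helper that returns `(kind rule target-line)`, where `rule` is `#f` when no rule name follows the directive (taken here to mean all rules, which is an assumption). The sketch treats the word after the directive as a rule name only if it starts with a letter, and it does not guard against directive-like text inside string literals.

```scheme
;; Parse one source line. Returns (KIND RULE TARGET-LINE) or #f.
;; KIND is suppress, disable, or enable. A `suppress` directive in a
;; comment-only line targets the next line; otherwise the same line.
(define (parse-directive line-text line-num)
  (let ((at (string-contains line-text "gulie:"))
        (semi (string-index line-text #\;)))
    (and at semi (< semi at)
         (let* ((words (string-tokenize (substring line-text (+ at 6))))
                (kind (and (pair? words)
                           (member (car words)
                                   '("suppress" "disable" "enable"))
                           (string->symbol (car words))))
                (rule (and kind
                           (pair? (cdr words))
                           (char-alphabetic? (string-ref (cadr words) 0))
                           (string->symbol (cadr words))))
                (standalone? (string-prefix? ";" (string-trim line-text))))
           (and kind
                (list kind
                      rule
                      (if (and (eq? kind 'suppress) standalone?)
                          (+ line-num 1)
                          line-num)))))))

;; (parse-directive ";; gulie:suppress trailing-whitespace" 4)
;; => (suppress trailing-whitespace 5)
;; (parse-directive "(define x 1) ; gulie:suppress" 7)
;; => (suppress #f 7)
;; (parse-directive ";; gulie:disable line-length" 9)
;; => (disable line-length 9)
```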
Parsed from raw text before rules run. Produces a suppression map that filters
diagnostics after all rules have emitted.

---

## Indentation rules

The key data is `scheme-indent-function` values from `.dir-locals.el` — an
integer N meaning "N arguments on first line, then body indented +2":

```scheme
(define *default-indent-rules*
  '((define . 1) (define* . 1) (define-public . 1) (define-syntax . 1)
    (define-module . 0) (lambda . 1) (lambda* . 1)
    (let . 1) (let* . 1) (letrec . 1) (letrec* . 1)
    (if . #f) (cond . 0) (case . 1) (when . 1) (unless . 1)
    (match . 1) (syntax-case . 2) (with-syntax . 1)
    (begin . 0) (do . 2) (parameterize . 1) (guard . 1)))
```

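How such a table might translate into an expected column is sketched below. The conventions are assumptions modelled on Emacs' handling of `scheme-indent-function`: body forms sit at +2 from the opening paren, the first N "distinguished" arguments sit at +4 when they wrap onto their own line, and forms with no integer rule align continuation lines under the first argument. `expected-column` and the reduced `%indent-rules` table are hypothetical.

```scheme
(define %indent-rules
  '((define . 1) (let . 1) (when . 1) (begin . 0) (if . #f)))

;; HEAD is the form's head symbol, OPEN-COLUMN the 0-based column of
;; its opening paren, ARG-INDEX the 0-based position of the argument
;; that starts the line being checked.
(define (expected-column rules head open-column arg-index)
  (let ((n (assq-ref rules head)))
    (cond ((not (integer? n))
           ;; No rule or #f: align under the first argument, which sits
           ;; after "(", the head symbol, and one space.
           (+ open-column 1 (string-length (symbol->string head)) 1))
          ((< arg-index n) (+ open-column 4))   ; distinguished argument
          (else (+ open-column 2)))))           ; body form

;; (expected-column %indent-rules 'when 0 1)  => 2   body of `when`
;; (expected-column %indent-rules 'when 0 0)  => 4   wrapped test expr
;; (expected-column %indent-rules 'if 4 1)    => 8   aligned under test
;; (expected-column %indent-rules 'begin 0 0) => 2
;; (expected-column %indent-rules 'foo 0 1)   => 5   ordinary call
```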
Overridable via config `indent-rules`. The indentation checker walks the CST,
identifies the form by its first symbol child, looks up the rule, and compares
actual indentation to expected.

---

## Formatting conventions (Guile vs Guix)

Both use 2-space indent, same special-form conventions. Key difference:

- **Guile:** 72-char fill column, `;;; {Section}` headers
- **Guix:** 78-80 char fill column, `;;` headers

Our default config targets Guile conventions. A Guix preset can override
`line-length` and comment style.

---

## Formatter: cost-based optimal pretty-printing

The formatter (`gulie/formatter.scm`) is a later-phase component that
**rewrites** files with correct layout, as opposed to the indentation checker
which merely **reports** violations.

### Why cost-based?

When deciding where to break lines in a long expression, there are often multiple
valid options. A greedy approach (fill as much as fits, then break) produces
mediocre output — it can't "look ahead" to see that a break earlier would produce
a better overall layout. Cost-based printers in the Wadler/Leijen lineage, such
as `pretty-expressive`, evaluate alternative layouts and select the one with the
lowest cost.

### The algorithm (Wadler/Leijen lineage, as used by fmt's `pretty-expressive`)

The pretty-printer works with an abstract **document** type:

```
doc = text(string)       — literal text
    | line               — line break (or space if flattened)
    | nest(n, doc)       — increase indent by n
    | concat(doc, doc)   — concatenation
    | alt(doc, doc)      — choose better of two layouts
    | group(doc)         — try flat first, break if doesn't fit
```

The key operator is `alt(a, b)`, which offers two candidate layouts (typically a
flat one and a broken one). The algorithm evaluates both alternatives and
picks the one with the lower **cost vector**, compared lexicographically:

```
cost = [badness, height, characters]

badness    — quadratic penalty for exceeding page width
height     — number of lines used
characters — total chars (tiebreaker)
```

This produces output that is provably optimal with respect to this cost: the
layout that minimises overflow while using the fewest lines.

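The cost comparison can be illustrated with a brute-force sketch that scores two already-rendered candidate layouts. This is not how `pretty-expressive` works internally (it computes costs incrementally over the document structure); it only shows the cost vector and its lexicographic ordering. The names `layout-cost`, `cost<?`, and `choose-layout` are hypothetical, and summing squared per-line overflow is one plausible reading of "quadratic penalty".

```scheme
;; Squared overflow of one line beyond PAGE-WIDTH (0 if it fits).
(define (overflow-badness line-width page-width)
  (let ((over (max 0 (- line-width page-width))))
    (* over over)))

;; LINES is a list of strings. Returns (badness height characters).
(define (layout-cost lines page-width)
  (list (apply + (map (lambda (l)
                        (overflow-badness (string-length l) page-width))
                      lines))
        (length lines)
        (apply + (map string-length lines))))

;; Lexicographic comparison of two cost vectors of equal length.
(define (cost<? a b)
  (cond ((null? a) #f)
        ((< (car a) (car b)) #t)
        ((> (car a) (car b)) #f)
        (else (cost<? (cdr a) (cdr b)))))

;; alt(a, b): keep A unless B is strictly cheaper.
(define (choose-layout a b page-width)
  (if (cost<? (layout-cost b page-width) (layout-cost a page-width))
      b
      a))

(define flat   '("(define (f x) (g (h x) (i x)))"))
(define broken '("(define (f x)" "  (g (h x) (i x)))"))

;; (layout-cost flat 20)   => (100 1 30)   one line, 10 columns over
;; (layout-cost broken 20) => (0 2 31)     fits, but one line taller
;; (choose-layout flat broken 20) => broken
;; (choose-layout flat broken 80) => flat
```

At width 20 the broken layout wins on badness alone; at width 80 both have zero badness, so height decides and the flat layout is kept.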
### How it fits our architecture

```
CST (from tokenizer + cst.scm)
  → [doc generator]   convert CST nodes to abstract doc, using form-specific rules
  → [layout solver]   evaluate alternatives, select optimal layout
  → [renderer]        emit formatted text with comments preserved
```

The **doc generator** uses the same form-identification logic as the indentation
checker (first symbol child of a CST node) to apply form-specific layout rules.
For example:

- `define` — name on first line, body indented
- `let` — bindings as aligned block, body indented
- `cond` — each clause on its own line

These rules are data (the `indent-rules` table extended with layout hints),
making the formatter configurable just like the checker.

### Implementation approach

We can either:

1. **Port `pretty-expressive`** from Racket — the core algorithm is ~300 lines,
   well-documented in academic papers
2. **Upgrade Guile's `(ice-9 pretty-print)`** — it already knows form-specific
   indentation rules but uses greedy layout; we'd replace the layout engine with
   cost-based selection

Option 1 is cleaner (purpose-built). Option 2 reuses more existing code but
would be a heavier modification. We'll decide when we reach that phase.

### Phase note

The formatter is **Phase 6** work. Phases 0-4 deliver a useful checker without
it. The indentation checker (Phase 4) validates existing formatting; the
formatter (Phase 6) rewrites it. The checker comes first because it's simpler
and immediately useful in CI.

---

## CLI interface

```
gulie [OPTIONS] [FILE|DIR...]

  --check              Report issues, exit non-zero on findings (default)
  --fix                Fix mode: auto-fix what's possible, report the rest
  --format             Format mode: rewrite files with optimal layout
  --init               Generate .gulie.sexp template
  --pass PASS          Run only: surface, semantic, all (default: all)
  --rule RULE          Enable only this rule (repeatable)
  --disable RULE       Disable this rule (repeatable)
  --severity SEV       Minimum severity: error, warning, info
  --output FORMAT      Output: standard (default), json, compact
  --config FILE        Config file path (default: auto-discover)
  --list-rules         List all rules and exit
  --version            Print version
```

Exit codes: 0 = clean, 1 = findings, 2 = config error, 3 = internal error.

---

## Implementation phases

### Phase 0: Skeleton

- `bin/gulie` — shebang script, loads CLI module
- `(gulie cli)` — basic arg parsing (`--check`, `--version`, file args)
- `(gulie diagnostic)` — record type + standard formatter
- `(gulie rule)` — record type + registry + `register-rule!`
- `(gulie engine)` — discovers `.scm` files, runs line rules, reports
- One trivial rule: `trailing-whitespace` (line rule)
- **Verification:** `gulie --check some-file.scm` reports trailing whitespace

### Phase 1: Tokenizer + CST + surface rules

- `(gulie tokenizer)` — hand-written lexer
- `(gulie cst)` — token → tree
- Surface rules: `trailing-whitespace`, `line-length`, `no-tabs`, `blank-lines`
- Comment rule: `comment-semicolons` (check `;`/`;;`/`;;;` usage)
- Roundtrip test: tokenize → concat = original
- Snapshot tests for each rule

### Phase 2: Semantic rules (compiler pass)

- `(gulie compiler)` — `read-and-compile` wrapper, warning capture
- Semantic rules wrapping Guile's built-in analyses:
  `unused-variable`, `unused-toplevel`, `unbound-variable`, `arity-mismatch`,
  `format-string`, `shadowed-toplevel`, `unused-module`
- **Verification:** run against Guile and Guix source files, check false-positive rate

### Phase 3: Config + suppression

- `(gulie config)` — `.gulie.sexp` loading + merging
- `(gulie suppression)` — inline comment suppression
- `--init` command
- Rule enable/disable via config and CLI

### Phase 4: Indentation checking

- `(gulie rules indentation)` — CST-based indent checker
- Default indent rules for standard Guile forms
- Configurable `indent-rules` in `.gulie.sexp`

### Phase 5: Fix mode + idiom rules

- `(gulie fixer)` — bottom-to-top edit application
- Auto-fix for: trailing whitespace, line-length (where possible)
- `(gulie rules idiom)` — `match`-based pattern suggestions on Tree-IL
- `(gulie rules module-form)` — `define-module` form checks (sorted imports, etc.)

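The bottom-to-top edit application planned for Phase 5 can be sketched as follows. Applying edits from the end of the file backwards means earlier offsets stay valid while later text changes length. The edit representation `(start end replacement)` with character offsets is an assumption for this example (the plan does not define `<fix>`), and `apply-edits` is a hypothetical name.

```scheme
;; EDITS is a list of (START END REPLACEMENT) with non-overlapping
;; character offsets into TEXT. Edits are applied in descending START
;; order so that no edit shifts the offsets of one still to come.
(define (apply-edits text edits)
  (let loop ((text text)
             (edits (sort edits (lambda (a b) (> (car a) (car b))))))
    (if (null? edits)
        text
        (let ((start (car (car edits)))
              (end (cadr (car edits)))
              (replacement (caddr (car edits))))
          (loop (string-append (substring text 0 start)
                               replacement
                               (substring text end))
                (cdr edits))))))

;; Two trailing-whitespace fixes, given in top-to-bottom order.
(define fixed
  (apply-edits "(foo)  \n(bar)\t\n"
               '((5 7 "") (13 14 ""))))
;; fixed => "(foo)\n(bar)\n"
```

Had the edits been applied top to bottom, removing the two spaces at offsets 5-7 would have shifted the tab from offset 13 to 11 and the second edit would have deleted the wrong character.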
### Phase 6: Formatter (cost-based optimal layout)

- `(gulie formatter)` — Wadler/Leijen pretty-printer with cost-based selection
- Abstract document type: `text`, `line`, `nest`, `concat`, `alt`, `group`
- Form-specific layout rules (reuse indent-rules table + layout hints)
- Comment preservation through formatting
- `--format` CLI mode
- **Verification:** format Guile/Guix source files, diff against originals,
  verify roundtrip stability (format twice = same output)

### Phase 7: Cross-module analysis (future)

- Load multiple modules, walk dependency graph
- Unused exports, cross-module arity checks
- `--pass cross-module` CLI option

---

## Testing strategy

1. **Roundtrip test** (tokenizer): tokenize → concat must equal original input
2. **Snapshot tests**: `fixtures/violations/rule-name.scm` + `.expected` pairs
3. **Clean file tests**: `fixtures/clean/*.scm` must produce zero diagnostics
4. **Unit tests**: `(srfi srfi-64)` for tokenizer, CST, config, diagnostics
5. **Real-world corpus**: run against `test/guix/` and `refs/guile/module/` for
   false-positive rate validation
6. **Formatter idempotency**: `format(format(x)) = format(x)` for all test files

---

## Key design decisions

| Decision | Rationale |
|----------|-----------|
| Hand-written tokenizer, not extending Guile's reader | The reader is ~1000 lines of nested closures not designed for extension. A clean 200-line tokenizer is easier to write/test. |
| Two independent passes, not a unified AST | Reader strips comments irrecoverably. Accepting this gives clean separation. |
| Delegate to Guile's built-in analyses | They're battle-tested, handle macroexpansion edge cases, and are maintained upstream. |
| `(ice-9 match)` for idiom rules, not logic programming | Built-in, fast, sufficient. miniKanren can be added later if needed. |
| S-expression config, not YAML/TOML | Zero deps. Our users write Scheme. `(read)` does the parsing. |
| Flat CST (parens + interleaved tokens), not rich AST | Enough for indentation/formatting checks. No overengineering. |
| Cost-based optimal layout for the formatter | Greedy formatters produce mediocre output. Wadler/Leijen is cleaner and provably correct. Worth the investment when we reach that phase. |
| Checker first, formatter later | Checking is simpler, immediately useful in CI, and validates the tokenizer/CST infrastructure that the formatter will build on. |

---

## Critical files to reference during implementation

- `refs/guile/module/ice-9/read.scm:949-973` — what the reader discards (our tokenizer must keep)
- `refs/guile/module/language/tree-il/analyze.scm:1461-1479` — `make-analyzer` API
- `refs/guile/module/system/base/compile.scm:298-340` — `read-and-compile` / `compile`
- `refs/guile/module/system/base/message.scm:83-220` — `%warning-types` definitions
- `refs/guile/module/language/tree-il.scm` — Tree-IL node types and traversal
- `refs/guile/module/ice-9/pretty-print.scm` — existing pretty-printer (form-specific rules to extract)
- `refs/mallet/src/parser/tokenizer.lisp` — reference tokenizer (163 lines)
- `refs/fmt/conventions.rkt` — form-specific formatting rules (100+ forms)
- `refs/fmt/main.rkt` — cost-based layout selection implementation