gulie/docs/PLAN.md

# Gulie — Guile Linter/Formatter: Architecture & Implementation Plan

## Context

As far as we know, no standalone linter or formatter exists for Guile Scheme;
the compiler's own warnings are the only static analysis available. We're
building one from scratch, called **gulie**. The tool is written in Guile itself
and reuses as much of Guile's infrastructure as possible (reader, compiler,
Tree-IL analyses, warning system). The design draws on patterns observed in
seven reference tools (see `docs/INSPIRATION.md`).

Guile 3.0.11 is available in the devenv. No source code exists yet.

---

## High-level architecture

Two independent passes, extensible to three:

```
                         .gulie.sexp (config)
                              |
  file.scm ──┬──> [Tokenizer] ──> tokens ──> [CST parser] ──> CST
              |         |
              |   [Pass 1: Surface]  line rules + CST rules
              |         |
              |    diagnostics-1
              |
              └──> [Guile reader] ──> s-exprs ──> [Guile compiler] ──> Tree-IL
                        |
                  [Pass 2: Semantic]  built-in analyses + custom Tree-IL rules
                        |
                   diagnostics-2
                        |
              [merge + suppress + sort + report/fix]
```

**Why two passes?** Guile's reader (`ice-9/read.scm:949-973`) irrecoverably
strips comments, whitespace, and datum comments in `next-non-whitespace`. There
is no way to get formatting info AND semantic info from one parse. Accepting this
and building two clean, independent passes is simpler than fighting the reader.

---

## Module structure

```
gulie/
  bin/gulie                        # CLI entry point (executable Guile script)
  gulie/
    cli.scm                        # (gulie cli) — arg parsing, dispatch
    config.scm                     # (gulie config) — .gulie.sexp loading, defaults, merging
    diagnostic.scm                 # (gulie diagnostic) — record type, sorting, formatting
    tokenizer.scm                  # (gulie tokenizer) — hand-written lexer, preserves everything
    cst.scm                        # (gulie cst) — token stream → concrete syntax tree
    compiler.scm                   # (gulie compiler) — Guile compile wrapper, warning capture
    rule.scm                       # (gulie rule) — rule record, registry, define-rule macros
    engine.scm                     # (gulie engine) — orchestrator: file discovery, pass sequencing
    fixer.scm                      # (gulie fixer) — fix application (bottom-to-top edits)
    suppression.scm                # (gulie suppression) — ; gulie:suppress parsing/filtering
    formatter.scm                  # (gulie formatter) — cost-based optimal pretty-printer
    rules/
      surface.scm                  # (gulie rules surface) — trailing-ws, line-length, tabs, blanks
      indentation.scm              # (gulie rules indentation) — indent checking vs CST
      comments.scm                 # (gulie rules comments) — comment style conventions
      semantic.scm                 # (gulie rules semantic) — wrappers around Guile's analyses
      idiom.scm                    # (gulie rules idiom) — pattern-based suggestions via match
      module-form.scm              # (gulie rules module-form) — define-module checks
  test/
    test-tokenizer.scm
    test-cst.scm
    test-rules-surface.scm
    test-rules-semantic.scm
    fixtures/
      clean/                       # .scm files producing zero diagnostics
      violations/                  # .scm + .expected pairs (snapshot testing)
```

17 modules plus the `bin/gulie` entry script. Each has one clear job.

---

## Key components

### Tokenizer (`gulie/tokenizer.scm`)

Hand-written character-by-character state machine. Must handle the same lexical
syntax as Guile's reader but **preserve** what the reader discards.

```scheme
(define-record-type <token>
  (make-token type text line column)
  token?
  (type   token-type)      ;; symbol (see list below)
  (text   token-text)      ;; string: exact source text
  (line   token-line)      ;; integer: 1-based
  (column token-column))   ;; integer: 0-based
```

Token types (~15): `open-paren`, `close-paren`, `symbol`, `number`, `string`,
`keyword`, `boolean`, `character`, `prefix` (`'`, `` ` ``, `,`, `,@`, `#'`,
etc.), `special` (`#;`, `#(`, `#vu8(`, etc.), `line-comment`, `block-comment`,
`whitespace`, `newline`, `dot`.

**Critical invariant:** `(string-concatenate (map token-text (tokenize input)))` must
reproduce the original input exactly. This is our primary roundtrip test.

Estimated size: ~200-250 lines. Reference: Mallet's tokenizer (163 lines CL).
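
A minimal sketch of the roundtrip invariant, assuming a drastically simplified
lexer. It only distinguishes parens, newlines, blanks, and line comments, and
lumps everything else into a hypothetical `atom` type that is not one of the
planned token types. Strings, block comments, and `#` syntax are not handled, so
text inside them is mis-classified, although concatenating the token texts still
reproduces the input. `roundtrip-ok?` is an illustrative name.

```scheme
(use-modules (srfi srfi-9))

(define-record-type <token>
  (make-token type text line column)
  token?
  (type   token-type)
  (text   token-text)
  (line   token-line)
  (column token-column))

;; Characters that end an atom in this toy lexer.
(define (delimiter? c)
  (or (char-whitespace? c) (memv c '(#\( #\) #\;))))

(define (tokenize str)
  (define len (string-length str))
  ;; Index of the first position >= i whose character fails PRED.
  (define (scan-while pred i)
    (let loop ((j i))
      (if (and (< j len) (pred (string-ref str j))) (loop (+ j 1)) j)))
  (let loop ((i 0) (line 1) (col 0) (acc '()))
    (if (>= i len)
        (reverse acc)
        (let* ((c (string-ref str i))
               (type+end
                (cond
                 ((char=? c #\newline) (cons 'newline (+ i 1)))
                 ((char=? c #\() (cons 'open-paren (+ i 1)))
                 ((char=? c #\)) (cons 'close-paren (+ i 1)))
                 ((char=? c #\;)
                  (cons 'line-comment
                        (scan-while (lambda (ch) (not (char=? ch #\newline)))
                                    i)))
                 ((char-whitespace? c)
                  (cons 'whitespace
                        (scan-while (lambda (ch)
                                      (and (char-whitespace? ch)
                                           (not (char=? ch #\newline))))
                                    i)))
                 (else
                  (cons 'atom
                        (scan-while (lambda (ch) (not (delimiter? ch))) i)))))
               (end (cdr type+end))
               (tok (make-token (car type+end) (substring str i end) line col)))
          ;; A newline token resets the column and bumps the line counter.
          (if (eq? (car type+end) 'newline)
              (loop end (+ line 1) 0 (cons tok acc))
              (loop end line (+ col (- end i)) (cons tok acc)))))))

;; The invariant: concatenating all token texts gives back the input.
(define (roundtrip-ok? str)
  (string=? str (string-concatenate (map token-text (tokenize str)))))
```

Every branch consumes at least one character, so the loop terminates, and no
character is ever dropped, which is what makes the invariant hold even when the
classification is wrong.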

### CST (`gulie/cst.scm`)

Trivial parenthesised tree built from the token stream:

```scheme
(define-record-type <cst-node>
  (make-cst-node open close children)
  cst-node?
  (open     cst-node-open)       ;; <token> for ( [ {
  (close    cst-node-close)      ;; <token> for ) ] }
  (children cst-node-children))  ;; list of <cst-node> | <token>
```

`children` is a flat list, in source order, of interleaved atoms (tokens) and
nested nodes. Comments and whitespace are children like anything else.

The first non-whitespace symbol child of a `<cst-node>` identifies the form
(`define`, `let`, `cond`, etc.) — enough for indentation rules.

Estimated size: ~80-100 lines.
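
A rough sketch of how the tree could be built from a token list, assuming the
two record types above. The names `parse-children`, `tokens->cst`, and
`cst-node-form-name` are illustrative, and representing the root as a node
with `#f` for both `open` and `close` is an assumption, not a settled design.

```scheme
(use-modules (srfi srfi-9) (srfi srfi-11))  ; records, let-values

(define-record-type <token>
  (make-token type text line column)
  token?
  (type token-type) (text token-text)
  (line token-line) (column token-column))

(define-record-type <cst-node>
  (make-cst-node open close children)
  cst-node?
  (open cst-node-open) (close cst-node-close)
  (children cst-node-children))

;; Collect children until a close-paren or the end of input.
;; Returns three values: children, the closing token (or #f), remaining tokens.
(define (parse-children tokens)
  (let loop ((tokens tokens) (acc '()))
    (cond
     ((null? tokens) (values (reverse acc) #f '()))
     ((eq? (token-type (car tokens)) 'close-paren)
      (values (reverse acc) (car tokens) (cdr tokens)))
     ((eq? (token-type (car tokens)) 'open-paren)
      (let-values (((children close rest) (parse-children (cdr tokens))))
        (loop rest (cons (make-cst-node (car tokens) close children) acc))))
     (else (loop (cdr tokens) (cons (car tokens) acc))))))

;; Build the root node. A stray close-paren at top level is kept as a
;; plain child so that no token is lost.
(define (tokens->cst tokens)
  (let loop ((tokens tokens) (acc '()))
    (let-values (((children close rest) (parse-children tokens)))
      (let ((acc (append acc children (if close (list close) '()))))
        (if (null? rest)
            (make-cst-node #f #f acc)
            (loop rest acc))))))

;; Name of the form: the first child that is not whitespace or a comment,
;; provided it is a symbol token; otherwise #f.
(define (cst-node-form-name node)
  (let loop ((children (cst-node-children node)))
    (cond
     ((null? children) #f)
     ((and (token? (car children))
           (memq (token-type (car children))
                 '(whitespace newline line-comment block-comment)))
      (loop (cdr children)))
     ((and (token? (car children))
           (eq? (token-type (car children)) 'symbol))
      (string->symbol (token-text (car children))))
     (else #f))))
```

An unterminated list ends up as a node whose `close` is `#f`, which a later
rule could report instead of aborting the whole pass.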

### Compiler wrapper (`gulie/compiler.scm`)

Wraps Guile's compile pipeline to capture warnings as structured diagnostics:

```scheme
;; Key Guile APIs we delegate to:
;; - (system base compile): read-and-compile, compile, default-warning-level
;; - (language tree-il analyze): make-analyzer, analyze-tree
;; - (system base message): %warning-types
;; - core binding (no import needed): current-warning-port
```

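Strategy: call `read-and-compile` with `#:to 'tree-il` and `#:warning-level 2`
while redirecting `current-warning-port` to a string port, then parse the
warning output into `<diagnostic>` records. Alternatively, invoke `make-analyzer`
directly and hook the warning printers. One thing to verify early in Phase 2:
the analyses may only run when compilation continues past Tree-IL. If so,
stopping at `'tree-il` emits no warnings, and we must either compile to a later
target or run the analyzer on the Tree-IL ourselves.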
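
A minimal sketch of the capture step, assuming that `compile` accepts
`#:warning-level` in the Guile version at hand and that `current-warning-port`
can be rebound with `parameterize`. It compiles a single datum all the way to
bytecode rather than stopping at Tree-IL, so that whatever analyses the
compiler runs get a chance to fire; that choice is an assumption to revisit.
`capture-warnings` is an illustrative name.

```scheme
(use-modules (system base compile) (srfi srfi-1))

;; Compile DATUM in a fresh module with warnings enabled and return the
;; non-empty lines Guile wrote to the warning port. Nothing is evaluated:
;; the bytecode is discarded.
(define (capture-warnings datum)
  (let ((text (call-with-output-string
                (lambda (port)
                  (parameterize ((current-warning-port port))
                    (compile datum
                             #:from 'scheme
                             #:to 'bytecode
                             #:env (make-fresh-user-module)
                             #:warning-level 2))))))
    (filter (lambda (line) (not (string-null? line)))
            (string-split text #\newline))))
```

Turning each captured line into a `<diagnostic>` then depends on the exact
warning text format, which should be checked against
`(system base message)` rather than assumed.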

Guile's built-in analyses (all free):
- `unused-variable-analysis`
- `unused-toplevel-analysis`
- `unused-module-analysis`
- `shadowed-toplevel-analysis`
- `make-use-before-definition-analysis` (unbound variables, use before definition)
- `arity-analysis` (wrong arg count)
- `format-analysis` (format string validation)

### Rule system (`gulie/rule.scm`)

```scheme
(define-record-type <rule>
  (make-rule name description severity category type check-proc fix-proc)
  rule?
  (name        rule-name)         ;; symbol
  (description rule-description)  ;; string
  (severity    rule-severity)     ;; 'error | 'warning | 'info
  (category    rule-category)     ;; 'format | 'style | 'correctness | 'idiom
  (type        rule-type)         ;; 'line | 'cst | 'tree-il
  (check-proc  rule-check-proc)  ;; procedure (signature depends on type)
  (fix-proc    rule-fix-proc))   ;; procedure | #f
```

Three rule types with different check signatures:
- **`'line`** — `(lambda (file line-num line-text config) -> diagnostics)` — fastest, no parsing
- **`'cst`** — `(lambda (file cst config) -> diagnostics)` — needs tokenizer+CST
- **`'tree-il`** — `(lambda (file tree-il env config) -> diagnostics)` — needs compilation

Global registry: `*rules*` alist, populated at module load time via
`register-rule!`. Convenience macros: `define-line-rule`, `define-cst-rule`,
`define-tree-il-rule`.
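
As a sketch of how the registry and one convenience macro might look, assuming
the `<rule>` record above. The macro's argument order is a guess, and the
diagnostics are returned as plain lists here only to keep the example
self-contained.

```scheme
(use-modules (srfi srfi-9))

(define-record-type <rule>
  (make-rule name description severity category type check-proc fix-proc)
  rule?
  (name        rule-name)
  (description rule-description)
  (severity    rule-severity)
  (category    rule-category)
  (type        rule-type)
  (check-proc  rule-check-proc)
  (fix-proc    rule-fix-proc))

;; Global registry: alist of (name . <rule>).
(define *rules* '())

;; Registering the same name twice replaces the earlier rule.
(define (register-rule! rule)
  (set! *rules* (assq-set! *rules* (rule-name rule) rule)))

;; Hypothetical macro shape: metadata, then the lambda list, then the body.
(define-syntax-rule (define-line-rule name description severity category
                      (file line-num line-text config)
                      body ...)
  (register-rule!
   (make-rule 'name description severity category 'line
              (lambda (file line-num line-text config) body ...)
              #f)))

(define-line-rule trailing-whitespace
  "Line ends with whitespace" 'warning 'format
  (file line-num line-text config)
  (let ((trimmed (string-trim-right line-text)))
    (if (< (string-length trimmed) (string-length line-text))
        ;; Placeholder diagnostic: (file line column severity rule message)
        (list (list file line-num (string-length trimmed)
                    'warning 'trailing-whitespace "trailing whitespace"))
        '())))
```

Because registration happens when the defining module is loaded, the engine
only needs to import the rule modules and then walk `*rules*`.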

### Diagnostic record (`gulie/diagnostic.scm`)

```scheme
(define-record-type <diagnostic>
  (make-diagnostic file line column severity rule message fix)
  diagnostic?
  (file     diagnostic-file)      ;; string
  (line     diagnostic-line)      ;; integer, 1-based
  (column   diagnostic-column)    ;; integer, 0-based
  (severity diagnostic-severity)  ;; symbol
  (rule     diagnostic-rule)      ;; symbol
  (message  diagnostic-message)   ;; string
  (fix      diagnostic-fix))      ;; <fix> | #f
```

Standard output: `file:line:column: severity: rule: message`
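
A small sketch of the standard output format and of a sort order for merged
diagnostics, assuming the record above. Ordering by file, then line, then
column is an assumption; `format-diagnostic` and `diagnostic<?` are
illustrative names.

```scheme
(use-modules (srfi srfi-9))

(define-record-type <diagnostic>
  (make-diagnostic file line column severity rule message fix)
  diagnostic?
  (file     diagnostic-file)
  (line     diagnostic-line)
  (column   diagnostic-column)
  (severity diagnostic-severity)
  (rule     diagnostic-rule)
  (message  diagnostic-message)
  (fix      diagnostic-fix))

;; file:line:column: severity: rule: message
(define (format-diagnostic d)
  (format #f "~a:~a:~a: ~a: ~a: ~a"
          (diagnostic-file d) (diagnostic-line d) (diagnostic-column d)
          (diagnostic-severity d) (diagnostic-rule d) (diagnostic-message d)))

;; Order by file name, then line, then column.
(define (diagnostic<? a b)
  (let ((fa (diagnostic-file a)) (fb (diagnostic-file b)))
    (cond
     ((string<? fa fb) #t)
     ((string<? fb fa) #f)
     ((< (diagnostic-line a) (diagnostic-line b)) #t)
     ((> (diagnostic-line a) (diagnostic-line b)) #f)
     (else (< (diagnostic-column a) (diagnostic-column b))))))

(define (sort-diagnostics diagnostics)
  (sort diagnostics diagnostic<?))
```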

### Config (`gulie/config.scm`)

File: `.gulie.sexp` in project root (plain s-expression, read with `(read)`,
never evaluated):

```scheme
((line-length . 80)
 (indent . 2)
 (enable trailing-whitespace line-length unused-variable arity-mismatch)
 (disable no-tabs)
 (rules
   (line-length (max . 100)))
 (indent-rules
   (with-syntax . 1)
   (match . 1))
 (ignore "build/**" ".direnv/**"))
```

Precedence: CLI flags > config file > built-in defaults.

`--init` generates a template with all rules listed and commented.
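
The loading and precedence rules could be sketched as follows. The defaults
shown (72 columns, 2-space indent) are assumptions taken from the Guile
conventions described in this plan, and `read-config-string` and `config-ref`
are illustrative names.

```scheme
;; Built-in defaults (assumed values).
(define %defaults
  '((line-length . 72)
    (indent . 2)))

;; Parse config text with `read`; the data is never evaluated.
(define (read-config-string str)
  (call-with-input-string str read))

;; Look KEY up in each layer, highest precedence first:
;; CLI flags > config file > built-in defaults.
(define (config-ref key cli file)
  (let ((hit (or (assq key cli) (assq key file) (assq key %defaults))))
    (and hit (cdr hit))))
```

For example, with a file config of `((line-length . 80) (disable no-tabs))`,
looking up `line-length` gives 80 unless a CLI layer such as
`((line-length . 100))` overrides it, while `indent` falls through to the
default.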

### Suppression (`gulie/suppression.scm`)

```scheme
;; Suppress one rule on the next line:
;; gulie:suppress trailing-whitespace
(define x    "messy")

;; Suppress on the same line:
(define x    "messy") ; gulie:suppress

;; Disable a rule for a region, then re-enable it:
;; gulie:disable line-length
;; ... code ...
;; gulie:enable line-length

Parsed from raw text before rules run. Produces a suppression map that filters
diagnostics after all rules have emitted.
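
A sketch of the line-level part of that map, under two assumptions that the
plan does not pin down: a directive with no rule names suppresses every rule,
and a directive on a comment-only line applies to the following line. Region
`gulie:disable`/`gulie:enable` handling is left out. All names are
illustrative.

```scheme
(use-modules (srfi srfi-1))

(define %marker "gulie:suppress")

;; LINES is the file as a list of strings. Returns a list of
;; (target-line . rules); an empty RULES list means "all rules".
(define (line-suppressions lines)
  (let loop ((lines lines) (n 1) (acc '()))
    (if (null? lines)
        (reverse acc)
        (let* ((text (car lines))
               (pos  (string-contains text %marker)))
          (if pos
              (let* ((rest  (substring text (+ pos (string-length %marker))))
                     (rules (map string->symbol (string-tokenize rest)))
                     ;; A comment-only line targets the next line.
                     (own-line? (string-prefix? ";" (string-trim text)))
                     (target (if own-line? (+ n 1) n)))
                (loop (cdr lines) (+ n 1) (cons (cons target rules) acc)))
              (loop (cdr lines) (+ n 1) acc))))))

;; Should a diagnostic for RULE on LINE be filtered out?
(define (suppressed? suppressions line rule)
  (any (lambda (entry)
         (and (= (car entry) line)
              (or (null? (cdr entry))
                  (and (memq rule (cdr entry)) #t))))
       suppressions))
```

Filtering after all rules have run keeps the rules themselves unaware of
suppression, at the cost of doing some work whose results are discarded.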

---

## Indentation rules

The key data are `scheme-indent-function` values, as set in `.dir-locals.el`. An
integer N means that the first N arguments are distinguished and the remaining
forms are the body, indented 2 columns from the opening paren:

```scheme
(define *default-indent-rules*
  '((define . 1) (define* . 1) (define-public . 1) (define-syntax . 1)
    (define-module . 0) (lambda . 1) (lambda* . 1)
    (let . 1) (let* . 1) (letrec . 1) (letrec* . 1)
    (if . #f) (cond . 0) (case . 1) (when . 1) (unless . 1)
    (match . 1) (syntax-case . 2) (with-syntax . 1)
    (begin . 0) (do . 2) (parameterize . 1) (guard . 1)))
```

Overridable via config `indent-rules`. The indentation checker walks the CST,
identifies the form by its first symbol child, looks up the rule, and compares
actual indentation to expected.
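
The lookup and comparison might be sketched like this, using a subset of the
table above. Indenting distinguished arguments by 4 columns follows the Emacs
convention and is an assumption here, as is treating forms without a rule as
ordinary calls whose arguments align with the first argument. `indent-rule`
and `expected-column` are illustrative names.

```scheme
;; Subset of the default table, for illustration only.
(define *default-indent-rules*
  '((define . 1) (lambda . 1) (let . 1) (if . #f) (begin . 0) (do . 2)))

;; Config overrides win over the built-in table. Returns N or #f.
(define (indent-rule form overrides)
  (let ((hit (or (assq form overrides)
                 (assq form *default-indent-rules*))))
    (and hit (cdr hit))))

;; Expected column for the argument at ARG-INDEX (0-based, not counting the
;; form name) when it starts its own line.
;;   OPEN-COL      column of the form's opening paren
;;   FIRST-ARG-COL column where the first argument actually starts
(define (expected-column rule open-col first-arg-col arg-index)
  (cond
   ((not rule) first-arg-col)           ; ordinary call: align with first arg
   ((< arg-index rule) (+ open-col 4))  ; distinguished argument
   (else (+ open-col 2))))              ; body form
```

The checker would then report a diagnostic whenever a child's actual column
differs from `expected-column`. Named `let` needs special handling, since the
loop name shifts the bindings to the second argument.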

---

## Formatting conventions (Guile vs Guix)

Both use a 2-space indent and the same special-form conventions. The key differences:
- **Guile:** 72-char fill column, `;;; {Section}` headers
- **Guix:** 78-80 char fill column, `;;` headers

Our default config targets Guile conventions. A Guix preset can override
`line-length` and comment style.

---

## Formatter: cost-based optimal pretty-printing

The formatter (`gulie/formatter.scm`) is a later-phase component that
**rewrites** files with correct layout, as opposed to the indentation checker
which merely **reports** violations.

### Why cost-based?

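When deciding where to break lines in a long expression, there are often multiple
valid options. A greedy approach (fill as much as fits, then break) produces
mediocre output because it can't look ahead to see that an earlier break would
give a better overall layout. Classic Wadler/Leijen printers are greedy in this
sense: `group` commits to the flat layout whenever the rest of the current line
fits. Cost-based printers such as `pretty-expressive` keep the same style of
document combinators but evaluate alternative layouts and select the one with
the lowest cost.

### The algorithm (cost-based selection over Wadler-style documents, as in fmt's `pretty-expressive`)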

The pretty-printer works with an abstract **document** type:

```
doc = text(string)       — literal text
    | line               — line break (or space if flattened)
    | nest(n, doc)       — increase indent by n
    | concat(doc, doc)   — concatenation
    | alt(doc, doc)      — choose better of two layouts
    | group(doc)         — try flat first, break if doesn't fit
```

The key operator is `alt(a, b)`, which offers two candidate layouts for the same
content. Rather than committing to A and falling back to B only on overflow, the
solver evaluates both alternatives and picks the one with the lower
**cost vector**:

```
cost = [badness, height, characters]

  badness    — quadratic penalty for exceeding page width
  height     — number of lines used
  characters — total chars (tiebreaker)
```

This produces output that is provably optimal with respect to the cost function.
Vectors are compared lexicographically, so overflow is minimised first, then
line count, then character count.
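
To make the cost comparison concrete, here is a small sketch that scores
already-rendered candidate layouts (lists of lines) and picks the cheapest. It
is not the solver itself, which avoids rendering every alternative. Taking
badness as the sum of squared per-line overflow and counting characters
without newlines are assumptions, and all names are illustrative.

```scheme
;; LINES is one candidate layout as a list of strings.
;; Returns the cost vector (badness height characters).
(define (layout-cost lines page-width)
  (let* ((widths (map string-length lines))
         (badness (apply + (map (lambda (w)
                                  (let ((over (max 0 (- w page-width))))
                                    (* over over)))
                                widths))))
    (list badness (length lines) (apply + widths))))

;; Lexicographic comparison of two cost vectors of equal length.
(define (cost<? a b)
  (cond ((null? a) #f)
        ((< (car a) (car b)) #t)
        ((> (car a) (car b)) #f)
        (else (cost<? (cdr a) (cdr b)))))

;; Pick the cheapest of a non-empty list of candidate layouts.
(define (best-layout candidates page-width)
  (let loop ((best (car candidates)) (rest (cdr candidates)))
    (cond ((null? rest) best)
          ((cost<? (layout-cost (car rest) page-width)
                   (layout-cost best page-width))
           (loop (car rest) (cdr rest)))
          (else (loop best (cdr rest))))))
```

With a 20-column page, the one-line layout of
`(define (f x) (g (h x) (i x)))` overflows by 10 columns and loses to the
two-line layout; with an 80-column page neither overflows, so the smaller
height wins and the flat layout is chosen.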

### How it fits our architecture

```
CST (from tokenizer + cst.scm)
  → [doc generator] convert CST nodes to abstract doc, using form-specific rules
  → [layout solver] evaluate alternatives, select optimal layout
  → [renderer] emit formatted text with comments preserved
```

The **doc generator** uses the same form-identification logic as the indentation
checker (first symbol child of a CST node) to apply form-specific layout rules.
For example:

- `define` — name on first line, body indented
- `let` — bindings as aligned block, body indented
- `cond` — each clause on its own line

These rules are data (the `indent-rules` table extended with layout hints),
making the formatter configurable just like the checker.

### Implementation approach

We can either:
1. **Port `pretty-expressive`** from Racket — the core algorithm is ~300 lines,
   well-documented in academic papers
2. **Upgrade Guile's `(ice-9 pretty-print)`** — it already knows form-specific
   indentation rules but uses greedy layout; we'd replace the layout engine with
   cost-based selection

Option 1 is cleaner (purpose-built). Option 2 reuses more existing code but
would be a heavier modification. We'll decide when we reach that phase.

### Phase note

The formatter is **Phase 6** work. Phases 0-4 deliver a useful checker without
it. The indentation checker (Phase 4) validates existing formatting; the
formatter (Phase 6) rewrites it. The checker comes first because it's simpler
and immediately useful in CI.

---

## CLI interface

```
gulie [OPTIONS] [FILE|DIR...]

  --check           Report issues, exit non-zero on findings (default)
  --fix             Fix mode: auto-fix what's possible, report the rest
  --format          Format mode: rewrite files with optimal layout
  --init            Generate .gulie.sexp template
  --pass PASS       Run only: surface, semantic, all (default: all)
  --rule RULE       Run only this rule (repeatable to select several)
  --disable RULE    Disable this rule (repeatable)
  --severity SEV    Minimum severity: error, warning, info
  --output FORMAT   Output: standard (default), json, compact
  --config FILE     Config file path (default: auto-discover)
  --list-rules      List all rules and exit
  --version         Print version
```

Exit codes: 0 = clean, 1 = findings, 2 = config error, 3 = internal error.

---

## Implementation phases

### Phase 0: Skeleton
- `bin/gulie` — shebang script, loads CLI module
- `(gulie cli)` — basic arg parsing (`--check`, `--version`, file args)
- `(gulie diagnostic)` — record type + standard formatter
- `(gulie rule)` — record type + registry + `register-rule!`
- `(gulie engine)` — discovers `.scm` files, runs line rules, reports
- One trivial rule: `trailing-whitespace` (line rule)
- **Verification:** `gulie --check some-file.scm` reports trailing whitespace

### Phase 1: Tokenizer + CST + surface rules
- `(gulie tokenizer)` — hand-written lexer
- `(gulie cst)` — token → tree
- Surface rules: `line-length`, `no-tabs`, `blank-lines` (joining `trailing-whitespace` from Phase 0)
- Comment rule: `comment-semicolons` (check `;`/`;;`/`;;;` usage)
- Roundtrip test: tokenize → concat = original
- Snapshot tests for each rule

### Phase 2: Semantic rules (compiler pass)
- `(gulie compiler)` — `read-and-compile` wrapper, warning capture
- Semantic rules wrapping Guile's built-in analyses:
  `unused-variable`, `unused-toplevel`, `unbound-variable`, `arity-mismatch`,
  `format-string`, `shadowed-toplevel`, `unused-module`
- **Verification:** run against Guile and Guix source files, check false-positive rate

### Phase 3: Config + suppression
- `(gulie config)` — `.gulie.sexp` loading + merging
- `(gulie suppression)` — inline comment suppression
- `--init` command
- Rule enable/disable via config and CLI

### Phase 4: Indentation checking
- `(gulie rules indentation)` — CST-based indent checker
- Default indent rules for standard Guile forms
- Configurable `indent-rules` in `.gulie.sexp`

### Phase 5: Fix mode + idiom rules
- `(gulie fixer)` — bottom-to-top edit application
- Auto-fix for: trailing whitespace, line-length (where possible)
- `(gulie rules idiom)` — `match`-based pattern suggestions on Tree-IL
- `(gulie rules module-form)` — `define-module` form checks (sorted imports, etc.)
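
The bottom-to-top edit application in this phase can be sketched as follows.
Representing an edit as a `(start end replacement)` list of character offsets
is a hypothetical choice, since the `<fix>` record is not specified yet, and
the sketch assumes edits do not overlap.

```scheme
;; Apply EDITS to TEXT. Each edit is (start end replacement), with START
;; inclusive and END exclusive, both offsets into the ORIGINAL text.
;; Applying from the highest offset down means earlier offsets stay valid.
(define (apply-edits text edits)
  (let loop ((text text)
             (edits (sort edits (lambda (a b) (> (car a) (car b))))))
    (if (null? edits)
        text
        (let ((start       (car (car edits)))
              (end         (cadr (car edits)))
              (replacement (caddr (car edits))))
          (loop (string-append (substring text 0 start)
                               replacement
                               (substring text end))
                (cdr edits))))))
```

Applied top-to-bottom instead, the first edit would shift every later offset,
and each fix would need to be re-based before use.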

### Phase 6: Formatter (cost-based optimal layout)
- `(gulie formatter)` — Wadler/Leijen pretty-printer with cost-based selection
- Abstract document type: `text`, `line`, `nest`, `concat`, `alt`, `group`
- Form-specific layout rules (reuse indent-rules table + layout hints)
- Comment preservation through formatting
- `--format` CLI mode
- **Verification:** format Guile/Guix source files, diff against originals,
  verify roundtrip stability (format twice = same output)

### Phase 7: Cross-module analysis (future)
- Load multiple modules, walk dependency graph
- Unused exports, cross-module arity checks
- `--pass cross-module` CLI option

---

## Testing strategy

1. **Roundtrip test** (tokenizer): tokenize → concat must equal original input
2. **Snapshot tests**: `fixtures/violations/rule-name.scm` + `.expected` pairs
3. **Clean file tests**: `fixtures/clean/*.scm` must produce zero diagnostics
4. **Unit tests**: `(srfi srfi-64)` for tokenizer, CST, config, diagnostics
5. **Real-world corpus**: run against `test/guix/` and `refs/guile/module/` for
   false-positive rate validation
6. **Formatter idempotency**: `format(format(x)) = format(x)` for all test files

---

## Key design decisions

| Decision | Rationale |
|----------|-----------|
| Hand-written tokenizer, not extending Guile's reader | The reader is ~1000 lines of nested closures not designed for extension. A clean tokenizer of roughly 200-250 lines is easier to write and test. |
| Two independent passes, not a unified AST | Reader strips comments irrecoverably. Accepting this gives clean separation. |
| Delegate to Guile's built-in analyses | They're battle-tested, handle macroexpansion edge cases, and are maintained upstream. |
| `(ice-9 match)` for idiom rules, not logic programming | Built-in, fast, sufficient. miniKanren can be added later if needed. |
| S-expression config, not YAML/TOML | Zero deps. Our users write Scheme. `(read)` does the parsing. |
| Flat CST (parens + interleaved tokens), not rich AST | Enough for indentation/formatting checks. No overengineering. |
| Cost-based optimal layout for the formatter | Greedy formatters produce mediocre output. A cost-based printer in the style of `pretty-expressive` is provably optimal for its cost function. Worth the investment when we reach that phase. |
| Checker first, formatter later | Checking is simpler, immediately useful in CI, and validates the tokenizer/CST infrastructure that the formatter will build on. |

---

## Critical files to reference during implementation

- `refs/guile/module/ice-9/read.scm:949-973` — what the reader discards (our tokenizer must keep)
- `refs/guile/module/language/tree-il/analyze.scm:1461-1479` — `make-analyzer` API
- `refs/guile/module/system/base/compile.scm:298-340` — `read-and-compile` / `compile`
- `refs/guile/module/system/base/message.scm:83-220` — `%warning-types` definitions
- `refs/guile/module/language/tree-il.scm` — Tree-IL node types and traversal
- `refs/guile/module/ice-9/pretty-print.scm` — existing pretty-printer (form-specific rules to extract)
- `refs/mallet/src/parser/tokenizer.lisp` — reference tokenizer (163 lines)
- `refs/fmt/conventions.rkt` — form-specific formatting rules (100+ forms)
- `refs/fmt/main.rkt` — cost-based layout selection implementation