Files

Steffen Beyer d0115672dd First iteration

2026-04-01 23:35:50 +02:00

19 KiB

Raw Permalink Blame History

Gulie — Guile Linter/Formatter: Architecture & Implementation Plan

Context

No linter, formatter, or static analyser exists for Guile Scheme. We're building one from scratch, called gulie. The tool is written in Guile itself, reusing as much of Guile's infrastructure as possible (reader, compiler, Tree-IL analyses, warning system). The design draws on patterns observed in 7 reference tools (see docs/INSPIRATION.md).

Guile 3.0.11 is available in the devenv. No source code exists yet.

High-level architecture

Two independent passes, extensible to three:

                         .gulie.sexp (config)
                              |
  file.scm ──┬──> [Tokenizer] ──> tokens ──> [CST parser] ──> CST
              |         |
              |   [Pass 1: Surface]  line rules + CST rules
              |         |
              |    diagnostics-1
              |
              └──> [Guile reader] ──> s-exprs ──> [Guile compiler] ──> Tree-IL
                        |
                  [Pass 2: Semantic]  built-in analyses + custom Tree-IL rules
                        |
                   diagnostics-2
                        |
              [merge + suppress + sort + report/fix]

Why two passes? Guile's reader (ice-9/read.scm:949-973) irrecoverably strips comments, whitespace, and datum comments in next-non-whitespace. There is no way to get formatting info AND semantic info from one parse. Accepting this and building two clean, independent passes is simpler than fighting the reader.

Module structure

gulie/
  bin/gulie                        # CLI entry point (executable Guile script)
  gulie/
    cli.scm                        # (gulie cli) — arg parsing, dispatch
    config.scm                     # (gulie config) — .gulie.sexp loading, defaults, merging
    diagnostic.scm                 # (gulie diagnostic) — record type, sorting, formatting
    tokenizer.scm                  # (gulie tokenizer) — hand-written lexer, preserves everything
    cst.scm                        # (gulie cst) — token stream → concrete syntax tree
    compiler.scm                   # (gulie compiler) — Guile compile wrapper, warning capture
    rule.scm                       # (gulie rule) — rule record, registry, define-rule macros
    engine.scm                     # (gulie engine) — orchestrator: file discovery, pass sequencing
    fixer.scm                      # (gulie fixer) — fix application (bottom-to-top edits)
    suppression.scm                # (gulie suppression) — ; gulie:suppress parsing/filtering
    formatter.scm                  # (gulie formatter) — cost-based optimal pretty-printer
    rules/
      surface.scm                  # (gulie rules surface) — trailing-ws, line-length, tabs, blanks
      indentation.scm              # (gulie rules indentation) — indent checking vs CST
      comments.scm                 # (gulie rules comments) — comment style conventions
      semantic.scm                 # (gulie rules semantic) — wrappers around Guile's analyses
      idiom.scm                    # (gulie rules idiom) — pattern-based suggestions via match
      module-form.scm              # (gulie rules module-form) — define-module checks
  test/
    test-tokenizer.scm
    test-cst.scm
    test-rules-surface.scm
    test-rules-semantic.scm
    fixtures/
      clean/                       # .scm files producing zero diagnostics
      violations/                  # .scm + .expected pairs (snapshot testing)

~16 source files. Each has one clear job.

Key components

Tokenizer (`gulie/tokenizer.scm`)

Hand-written character-by-character state machine. Must handle the same lexical syntax as Guile's reader but preserve what the reader discards.

(define-record-type <token>
  (make-token type text line column)
  token?
  (type   token-type)      ;; symbol (see list below)
  (text   token-text)      ;; string: exact source text
  (line   token-line)      ;; integer: 1-based
  (column token-column))   ;; integer: 0-based

Token types (~15): open-paren, close-paren, symbol, number, string, keyword, boolean, character, prefix (', `, ,, ,@, #', etc.), special (#;, #(, #vu8(, etc.), line-comment, block-comment, whitespace, newline, dot.

Critical invariant: (string-concatenate (map token-text (tokenize input))) must reproduce the original input exactly. This is our primary roundtrip test.

Estimated size: ~200-250 lines. Reference: Mallet's tokenizer (163 lines CL).

CST (`gulie/cst.scm`)

Trivial parenthesised tree built from the token stream:

(define-record-type <cst-node>
  (make-cst-node open close children)
  cst-node?
  (open     cst-node-open)       ;; <token> for ( [ {
  (close    cst-node-close)      ;; <token> for ) ] }
  (children cst-node-children))  ;; list of <cst-node> | <token>

Children is a flat list of interleaved atoms (tokens) and nested nodes. Comments and whitespace are children like anything else.

The first non-whitespace symbol child of a <cst-node> identifies the form (define, let, cond, etc.) — enough for indentation rules.

Estimated size: ~80-100 lines.

Compiler wrapper (`gulie/compiler.scm`)

Wraps Guile's compile pipeline to capture warnings as structured diagnostics:

;; Key Guile APIs we delegate to:
;; - (system base compile): read-and-compile, compile, default-warning-level
;; - (language tree-il analyze): make-analyzer, analyze-tree
;; - (system base message): %warning-types, current-warning-port

Strategy: call read-and-compile with #:to 'tree-il and #:warning-level 2 while redirecting current-warning-port to a string port, then parse the warning output into <diagnostic> records. Alternatively, invoke make-analyzer directly and hook the warning printers.

Guile's built-in analyses (all free):

unused-variable-analysis
unused-toplevel-analysis
unused-module-analysis
shadowed-toplevel-analysis
make-use-before-definition-analysis (unbound variables)
arity-analysis (wrong arg count)
format-analysis (format string validation)

Rule system (`gulie/rule.scm`)

(define-record-type <rule>
  (make-rule name description severity category type check-proc fix-proc)
  rule?
  (name        rule-name)         ;; symbol
  (description rule-description)  ;; string
  (severity    rule-severity)     ;; 'error | 'warning | 'info
  (category    rule-category)     ;; 'format | 'style | 'correctness | 'idiom
  (type        rule-type)         ;; 'line | 'cst | 'tree-il
  (check-proc  rule-check-proc)  ;; procedure (signature depends on type)
  (fix-proc    rule-fix-proc))   ;; procedure | #f

Three rule types with different check signatures:

'line — (lambda (file line-num line-text config) -> diagnostics) — fastest, no parsing
'cst — (lambda (file cst config) -> diagnostics) — needs tokenizer+CST
'tree-il — (lambda (file tree-il env config) -> diagnostics) — needs compilation

Global registry: *rules* alist, populated at module load time via register-rule!. Convenience macros: define-line-rule, define-cst-rule, define-tree-il-rule.

Diagnostic record (`gulie/diagnostic.scm`)

(define-record-type <diagnostic>
  (make-diagnostic file line column severity rule message fix)
  diagnostic?
  (file     diagnostic-file)      ;; string
  (line     diagnostic-line)      ;; integer, 1-based
  (column   diagnostic-column)    ;; integer, 0-based
  (severity diagnostic-severity)  ;; symbol
  (rule     diagnostic-rule)      ;; symbol
  (message  diagnostic-message)   ;; string
  (fix      diagnostic-fix))      ;; <fix> | #f

Standard output: file:line:column: severity: rule: message

Config (`gulie/config.scm`)

File: .gulie.sexp in project root (plain s-expression, read with (read), never evaluated):

((line-length . 80)
 (indent . 2)
 (enable trailing-whitespace line-length unused-variable arity-mismatch)
 (disable tabs)
 (rules
   (line-length (max . 100)))
 (indent-rules
   (with-syntax . 1)
   (match . 1))
 (ignore "build/**" ".direnv/**"))

Precedence: CLI flags > config file > built-in defaults.

--init generates a template with all rules listed and commented.

Suppression (`gulie/suppression.scm`)

;; gulie:suppress trailing-whitespace   — suppress on next line
(define x    "messy")

(define x    "messy") ; gulie:suppress  — suppress on this line

;; gulie:disable line-length            — region disable
... code ...
;; gulie:enable line-length             — region enable

Parsed from raw text before rules run. Produces a suppression map that filters diagnostics after all rules have emitted.

Indentation rules

The key data is scheme-indent-function values from .dir-locals.el — an integer N meaning "N arguments on first line, then body indented +2":

(define *default-indent-rules*
  '((define . 1) (define* . 1) (define-public . 1) (define-syntax . 1)
    (define-module . 0) (lambda . 1) (lambda* . 1)
    (let . 1) (let* . 1) (letrec . 1) (letrec* . 1)
    (if . #f) (cond . 0) (case . 1) (when . 1) (unless . 1)
    (match . 1) (syntax-case . 2) (with-syntax . 1)
    (begin . 0) (do . 2) (parameterize . 1) (guard . 1)))

Overridable via config indent-rules. The indentation checker walks the CST, identifies the form by its first symbol child, looks up the rule, and compares actual indentation to expected.

Formatting conventions (Guile vs Guix)

Both use 2-space indent, same special-form conventions. Key difference:

Guile: 72-char fill column, ;;; {Section} headers
Guix: 78-80 char fill column, ;; headers

Our default config targets Guile conventions. A Guix preset can override line-length and comment style.

Formatter: cost-based optimal pretty-printing

The formatter (gulie/formatter.scm) is a later-phase component that rewrites files with correct layout, as opposed to the indentation checker which merely reports violations.

Why cost-based?

When deciding where to break lines in a long expression, there are often multiple valid options. A greedy approach (fill as much as fits, then break) produces mediocre output — it can't "look ahead" to see that a break earlier would produce a better overall layout. The Wadler/Leijen family of algorithms evaluates alternative layouts and selects the optimal one.

The algorithm (Wadler/Leijen, as used by fmt's `pretty-expressive`)

The pretty-printer works with an abstract document type:

doc = text(string)       — literal text
    | line               — line break (or space if flattened)
    | nest(n, doc)       — increase indent by n
    | concat(doc, doc)   — concatenation
    | alt(doc, doc)      — choose better of two layouts
    | group(doc)         — try flat first, break if doesn't fit

The key operator is alt(a, b) — "try layout A, but if it overflows the page width, use layout B instead." The algorithm evaluates both alternatives and picks the one with the lower cost vector:

cost = [badness, height, characters]

  badness    — quadratic penalty for exceeding page width
  height     — number of lines used
  characters — total chars (tiebreaker)

This produces provably optimal output: the layout that minimises overflow while using the fewest lines.

How it fits our architecture

CST (from tokenizer + cst.scm)
  → [doc generator] convert CST nodes to abstract doc, using form-specific rules
  → [layout solver] evaluate alternatives, select optimal layout
  → [renderer] emit formatted text with comments preserved

The doc generator uses the same form-identification logic as the indentation checker (first symbol child of a CST node) to apply form-specific layout rules. For example:

define — name on first line, body indented
let — bindings as aligned block, body indented
cond — each clause on its own line

These rules are data (the indent-rules table extended with layout hints), making the formatter configurable just like the checker.

Implementation approach

We can either:

Port pretty-expressive from Racket — the core algorithm is ~300 lines, well-documented in academic papers
Upgrade Guile's (ice-9 pretty-print) — it already knows form-specific indentation rules but uses greedy layout; we'd replace the layout engine with cost-based selection

Option 1 is cleaner (purpose-built). Option 2 reuses more existing code but would be a heavier modification. We'll decide when we reach that phase.

Phase note

The formatter is Phase 6 work. Phases 0-4 deliver a useful checker without it. The indentation checker (Phase 4) validates existing formatting; the formatter (Phase 6) rewrites it. The checker comes first because it's simpler and immediately useful in CI.

CLI interface

gulie [OPTIONS] [FILE|DIR...]

  --check           Report issues, exit non-zero on findings (default)
  --fix             Fix mode: auto-fix what's possible, report the rest
  --format          Format mode: rewrite files with optimal layout
  --init            Generate .gulie.sexp template
  --pass PASS       Run only: surface, semantic, all (default: all)
  --rule RULE       Enable only this rule (repeatable)
  --disable RULE    Disable this rule (repeatable)
  --severity SEV    Minimum severity: error, warning, info
  --output FORMAT   Output: standard (default), json, compact
  --config FILE     Config file path (default: auto-discover)
  --list-rules      List all rules and exit
  --version         Print version

Exit codes: 0 = clean, 1 = findings, 2 = config error, 3 = internal error.

Implementation phases

Phase 0: Skeleton

bin/gulie — shebang script, loads CLI module
(gulie cli) — basic arg parsing (--check, --version, file args)
(gulie diagnostic) — record type + standard formatter
(gulie rule) — record type + registry + register-rule!
(gulie engine) — discovers .scm files, runs line rules, reports
One trivial rule: trailing-whitespace (line rule)
Verification: gulie --check some-file.scm reports trailing whitespace

Phase 1: Tokenizer + CST + surface rules

(gulie tokenizer) — hand-written lexer
(gulie cst) — token → tree
Surface rules: trailing-whitespace, line-length, no-tabs, blank-lines
Comment rule: comment-semicolons (check ;/;;/;;; usage)
Roundtrip test: tokenize → concat = original
Snapshot tests for each rule

Phase 2: Semantic rules (compiler pass)

(gulie compiler) — read-and-compile wrapper, warning capture
Semantic rules wrapping Guile's built-in analyses: unused-variable, unused-toplevel, unbound-variable, arity-mismatch, format-string, shadowed-toplevel, unused-module
Verification: run against Guile and Guix source files, check false-positive rate

Phase 3: Config + suppression

(gulie config) — .gulie.sexp loading + merging
(gulie suppression) — inline comment suppression
--init command
Rule enable/disable via config and CLI

Phase 4: Indentation checking

(gulie rules indentation) — CST-based indent checker
Default indent rules for standard Guile forms
Configurable indent-rules in .gulie.sexp

Phase 5: Fix mode + idiom rules

(gulie fixer) — bottom-to-top edit application
Auto-fix for: trailing whitespace, line-length (where possible)
(gulie rules idiom) — match-based pattern suggestions on Tree-IL
(gulie rules module-form) — define-module form checks (sorted imports, etc.)

Phase 6: Formatter (cost-based optimal layout)

(gulie formatter) — Wadler/Leijen pretty-printer with cost-based selection
Abstract document type: text, line, nest, concat, alt, group
Form-specific layout rules (reuse indent-rules table + layout hints)
Comment preservation through formatting
--format CLI mode
Verification: format Guile/Guix source files, diff against originals, verify roundtrip stability (format twice = same output)

Phase 7: Cross-module analysis (future)

Load multiple modules, walk dependency graph
Unused exports, cross-module arity checks
--pass cross-module CLI option

Testing strategy

Roundtrip test (tokenizer): tokenize → concat must equal original input
Snapshot tests: fixtures/violations/rule-name.scm + .expected pairs
Clean file tests: fixtures/clean/*.scm must produce zero diagnostics
Unit tests: (srfi srfi-64) for tokenizer, CST, config, diagnostics
Real-world corpus: run against test/guix/ and refs/guile/module/ for false-positive rate validation
Formatter idempotency: format(format(x)) = format(x) for all test files

Key design decisions

Decision	Rationale
Hand-written tokenizer, not extending Guile's reader	The reader is ~1000 lines of nested closures not designed for extension. A clean 200-line tokenizer is easier to write/test.
Two independent passes, not a unified AST	Reader strips comments irrecoverably. Accepting this gives clean separation.
Delegate to Guile's built-in analyses	They're battle-tested, handle macroexpansion edge cases, and are maintained upstream.
`(ice-9 match)` for idiom rules, not logic programming	Built-in, fast, sufficient. miniKanren can be added later if needed.
S-expression config, not YAML/TOML	Zero deps. Our users write Scheme. `(read)` does the parsing.
Flat CST (parens + interleaved tokens), not rich AST	Enough for indentation/formatting checks. No overengineering.
Cost-based optimal layout for the formatter	Greedy formatters produce mediocre output. Wadler/Leijen is cleaner and provably correct. Worth the investment when we reach that phase.
Checker first, formatter later	Checking is simpler, immediately useful in CI, and validates the tokenizer/CST infrastructure that the formatter will build on.

Critical files to reference during implementation

refs/guile/module/ice-9/read.scm:949-973 — what the reader discards (our tokenizer must keep)
refs/guile/module/language/tree-il/analyze.scm:1461-1479 — make-analyzer API
refs/guile/module/system/base/compile.scm:298-340 — read-and-compile / compile
refs/guile/module/system/base/message.scm:83-220 — %warning-types definitions
refs/guile/module/language/tree-il.scm — Tree-IL node types and traversal
refs/guile/module/ice-9/pretty-print.scm — existing pretty-printer (form-specific rules to extract)
refs/mallet/src/parser/tokenizer.lisp — reference tokenizer (163 lines)
refs/fmt/conventions.rkt — form-specific formatting rules (100+ forms)
refs/fmt/main.rkt — cost-based layout selection implementation

19 KiB Raw Permalink Blame History