19 KiB
Gulie — Guile Linter/Formatter: Architecture & Implementation Plan
Context
No linter, formatter, or static analyser exists for Guile Scheme. We're building
one from scratch, called gulie. The tool is written in Guile itself, reusing
as much of Guile's infrastructure as possible (reader, compiler, Tree-IL
analyses, warning system). The design draws on patterns observed in 7 reference
tools (see docs/INSPIRATION.md).
Guile 3.0.11 is available in the devenv. No source code exists yet.
High-level architecture
Two independent passes, extensible to three:
.gulie.sexp (config)
|
file.scm ──┬──> [Tokenizer] ──> tokens ──> [CST parser] ──> CST
| |
| [Pass 1: Surface] line rules + CST rules
| |
| diagnostics-1
|
└──> [Guile reader] ──> s-exprs ──> [Guile compiler] ──> Tree-IL
|
[Pass 2: Semantic] built-in analyses + custom Tree-IL rules
|
diagnostics-2
|
[merge + suppress + sort + report/fix]
Why two passes? Guile's reader (ice-9/read.scm:949-973) irrecoverably
strips comments, whitespace, and datum comments in next-non-whitespace. There
is no way to get formatting info AND semantic info from one parse. Accepting this
and building two clean, independent passes is simpler than fighting the reader.
Module structure
gulie/
bin/gulie # CLI entry point (executable Guile script)
gulie/
cli.scm # (gulie cli) — arg parsing, dispatch
config.scm # (gulie config) — .gulie.sexp loading, defaults, merging
diagnostic.scm # (gulie diagnostic) — record type, sorting, formatting
tokenizer.scm # (gulie tokenizer) — hand-written lexer, preserves everything
cst.scm # (gulie cst) — token stream → concrete syntax tree
compiler.scm # (gulie compiler) — Guile compile wrapper, warning capture
rule.scm # (gulie rule) — rule record, registry, define-rule macros
engine.scm # (gulie engine) — orchestrator: file discovery, pass sequencing
fixer.scm # (gulie fixer) — fix application (bottom-to-top edits)
suppression.scm # (gulie suppression) — ; gulie:suppress parsing/filtering
formatter.scm # (gulie formatter) — cost-based optimal pretty-printer
rules/
surface.scm # (gulie rules surface) — trailing-ws, line-length, tabs, blanks
indentation.scm # (gulie rules indentation) — indent checking vs CST
comments.scm # (gulie rules comments) — comment style conventions
semantic.scm # (gulie rules semantic) — wrappers around Guile's analyses
idiom.scm # (gulie rules idiom) — pattern-based suggestions via match
module-form.scm # (gulie rules module-form) — define-module checks
test/
test-tokenizer.scm
test-cst.scm
test-rules-surface.scm
test-rules-semantic.scm
fixtures/
clean/ # .scm files producing zero diagnostics
violations/ # .scm + .expected pairs (snapshot testing)
~16 source files. Each has one clear job.
Key components
Tokenizer (gulie/tokenizer.scm)
Hand-written character-by-character state machine. Must handle the same lexical syntax as Guile's reader but preserve what the reader discards.
(define-record-type <token>
(make-token type text line column)
token?
(type token-type) ;; symbol (see list below)
(text token-text) ;; string: exact source text
(line token-line) ;; integer: 1-based
(column token-column)) ;; integer: 0-based
Token types (~15): open-paren, close-paren, symbol, number, string,
keyword, boolean, character, prefix (', `, ,, ,@, #',
etc.), special (#;, #(, #vu8(, etc.), line-comment, block-comment,
whitespace, newline, dot.
Critical invariant: (string-concatenate (map token-text (tokenize input))) must
reproduce the original input exactly. This is our primary roundtrip test.
Estimated size: ~200-250 lines. Reference: Mallet's tokenizer (163 lines CL).
CST (gulie/cst.scm)
Trivial parenthesised tree built from the token stream:
(define-record-type <cst-node>
(make-cst-node open close children)
cst-node?
(open cst-node-open) ;; <token> for ( [ {
(close cst-node-close) ;; <token> for ) ] }
(children cst-node-children)) ;; list of <cst-node> | <token>
Children is a flat list of interleaved atoms (tokens) and nested nodes. Comments and whitespace are children like anything else.
The first non-whitespace symbol child of a <cst-node> identifies the form
(define, let, cond, etc.) — enough for indentation rules.
Estimated size: ~80-100 lines.
Compiler wrapper (gulie/compiler.scm)
Wraps Guile's compile pipeline to capture warnings as structured diagnostics:
;; Key Guile APIs we delegate to:
;; - (system base compile): read-and-compile, compile, default-warning-level
;; - (language tree-il analyze): make-analyzer, analyze-tree
;; - (system base message): %warning-types, current-warning-port
Strategy: call read-and-compile with #:to 'tree-il and #:warning-level 2
while redirecting current-warning-port to a string port, then parse the
warning output into <diagnostic> records. Alternatively, invoke make-analyzer
directly and hook the warning printers.
Guile's built-in analyses (all free):
unused-variable-analysisunused-toplevel-analysisunused-module-analysisshadowed-toplevel-analysismake-use-before-definition-analysis(unbound variables)arity-analysis(wrong arg count)format-analysis(format string validation)
Rule system (gulie/rule.scm)
(define-record-type <rule>
(make-rule name description severity category type check-proc fix-proc)
rule?
(name rule-name) ;; symbol
(description rule-description) ;; string
(severity rule-severity) ;; 'error | 'warning | 'info
(category rule-category) ;; 'format | 'style | 'correctness | 'idiom
(type rule-type) ;; 'line | 'cst | 'tree-il
(check-proc rule-check-proc) ;; procedure (signature depends on type)
(fix-proc rule-fix-proc)) ;; procedure | #f
Three rule types with different check signatures:
'line—(lambda (file line-num line-text config) -> diagnostics)— fastest, no parsing'cst—(lambda (file cst config) -> diagnostics)— needs tokenizer+CST'tree-il—(lambda (file tree-il env config) -> diagnostics)— needs compilation
Global registry: *rules* alist, populated at module load time via
register-rule!. Convenience macros: define-line-rule, define-cst-rule,
define-tree-il-rule.
Diagnostic record (gulie/diagnostic.scm)
(define-record-type <diagnostic>
(make-diagnostic file line column severity rule message fix)
diagnostic?
(file diagnostic-file) ;; string
(line diagnostic-line) ;; integer, 1-based
(column diagnostic-column) ;; integer, 0-based
(severity diagnostic-severity) ;; symbol
(rule diagnostic-rule) ;; symbol
(message diagnostic-message) ;; string
(fix diagnostic-fix)) ;; <fix> | #f
Standard output: file:line:column: severity: rule: message
Config (gulie/config.scm)
File: .gulie.sexp in project root (plain s-expression, read with (read),
never evaluated):
((line-length . 80)
(indent . 2)
(enable trailing-whitespace line-length unused-variable arity-mismatch)
(disable tabs)
(rules
(line-length (max . 100)))
(indent-rules
(with-syntax . 1)
(match . 1))
(ignore "build/**" ".direnv/**"))
Precedence: CLI flags > config file > built-in defaults.
--init generates a template with all rules listed and commented.
Suppression (gulie/suppression.scm)
;; gulie:suppress trailing-whitespace — suppress on next line
(define x "messy")
(define x "messy") ; gulie:suppress — suppress on this line
;; gulie:disable line-length — region disable
... code ...
;; gulie:enable line-length — region enable
Parsed from raw text before rules run. Produces a suppression map that filters diagnostics after all rules have emitted.
Indentation rules
The key data is scheme-indent-function values from .dir-locals.el — an
integer N meaning "N arguments on first line, then body indented +2":
(define *default-indent-rules*
'((define . 1) (define* . 1) (define-public . 1) (define-syntax . 1)
(define-module . 0) (lambda . 1) (lambda* . 1)
(let . 1) (let* . 1) (letrec . 1) (letrec* . 1)
(if . #f) (cond . 0) (case . 1) (when . 1) (unless . 1)
(match . 1) (syntax-case . 2) (with-syntax . 1)
(begin . 0) (do . 2) (parameterize . 1) (guard . 1)))
Overridable via config indent-rules. The indentation checker walks the CST,
identifies the form by its first symbol child, looks up the rule, and compares
actual indentation to expected.
Formatting conventions (Guile vs Guix)
Both use 2-space indent, same special-form conventions. Key difference:
- Guile: 72-char fill column,
;;; {Section}headers - Guix: 78-80 char fill column,
;;headers
Our default config targets Guile conventions. A Guix preset can override
line-length and comment style.
Formatter: cost-based optimal pretty-printing
The formatter (gulie/formatter.scm) is a later-phase component that
rewrites files with correct layout, as opposed to the indentation checker
which merely reports violations.
Why cost-based?
When deciding where to break lines in a long expression, there are often multiple valid options. A greedy approach (fill as much as fits, then break) produces mediocre output — it can't "look ahead" to see that a break earlier would produce a better overall layout. The Wadler/Leijen family of algorithms evaluates alternative layouts and selects the optimal one.
The algorithm (Wadler/Leijen, as used by fmt's pretty-expressive)
The pretty-printer works with an abstract document type:
doc = text(string) — literal text
| line — line break (or space if flattened)
| nest(n, doc) — increase indent by n
| concat(doc, doc) — concatenation
| alt(doc, doc) — choose better of two layouts
| group(doc) — try flat first, break if doesn't fit
The key operator is alt(a, b) — "try layout A, but if it overflows the page
width, use layout B instead." The algorithm evaluates both alternatives and
picks the one with the lower cost vector:
cost = [badness, height, characters]
badness — quadratic penalty for exceeding page width
height — number of lines used
characters — total chars (tiebreaker)
This produces provably optimal output: the layout that minimises overflow while using the fewest lines.
How it fits our architecture
CST (from tokenizer + cst.scm)
→ [doc generator] convert CST nodes to abstract doc, using form-specific rules
→ [layout solver] evaluate alternatives, select optimal layout
→ [renderer] emit formatted text with comments preserved
The doc generator uses the same form-identification logic as the indentation checker (first symbol child of a CST node) to apply form-specific layout rules. For example:
define— name on first line, body indentedlet— bindings as aligned block, body indentedcond— each clause on its own line
These rules are data (the indent-rules table extended with layout hints),
making the formatter configurable just like the checker.
Implementation approach
We can either:
- Port
pretty-expressivefrom Racket — the core algorithm is ~300 lines, well-documented in academic papers - Upgrade Guile's
(ice-9 pretty-print)— it already knows form-specific indentation rules but uses greedy layout; we'd replace the layout engine with cost-based selection
Option 1 is cleaner (purpose-built). Option 2 reuses more existing code but would be a heavier modification. We'll decide when we reach that phase.
Phase note
The formatter is Phase 6 work. Phases 0-4 deliver a useful checker without it. The indentation checker (Phase 4) validates existing formatting; the formatter (Phase 6) rewrites it. The checker comes first because it's simpler and immediately useful in CI.
CLI interface
gulie [OPTIONS] [FILE|DIR...]
--check Report issues, exit non-zero on findings (default)
--fix Fix mode: auto-fix what's possible, report the rest
--format Format mode: rewrite files with optimal layout
--init Generate .gulie.sexp template
--pass PASS Run only: surface, semantic, all (default: all)
--rule RULE Enable only this rule (repeatable)
--disable RULE Disable this rule (repeatable)
--severity SEV Minimum severity: error, warning, info
--output FORMAT Output: standard (default), json, compact
--config FILE Config file path (default: auto-discover)
--list-rules List all rules and exit
--version Print version
Exit codes: 0 = clean, 1 = findings, 2 = config error, 3 = internal error.
Implementation phases
Phase 0: Skeleton
bin/gulie— shebang script, loads CLI module(gulie cli)— basic arg parsing (--check,--version, file args)(gulie diagnostic)— record type + standard formatter(gulie rule)— record type + registry +register-rule!(gulie engine)— discovers.scmfiles, runs line rules, reports- One trivial rule:
trailing-whitespace(line rule) - Verification:
gulie --check some-file.scmreports trailing whitespace
Phase 1: Tokenizer + CST + surface rules
(gulie tokenizer)— hand-written lexer(gulie cst)— token → tree- Surface rules:
trailing-whitespace,line-length,no-tabs,blank-lines - Comment rule:
comment-semicolons(check;/;;/;;;usage) - Roundtrip test: tokenize → concat = original
- Snapshot tests for each rule
Phase 2: Semantic rules (compiler pass)
(gulie compiler)—read-and-compilewrapper, warning capture- Semantic rules wrapping Guile's built-in analyses:
unused-variable,unused-toplevel,unbound-variable,arity-mismatch,format-string,shadowed-toplevel,unused-module - Verification: run against Guile and Guix source files, check false-positive rate
Phase 3: Config + suppression
(gulie config)—.gulie.sexploading + merging(gulie suppression)— inline comment suppression--initcommand- Rule enable/disable via config and CLI
Phase 4: Indentation checking
(gulie rules indentation)— CST-based indent checker- Default indent rules for standard Guile forms
- Configurable
indent-rulesin.gulie.sexp
Phase 5: Fix mode + idiom rules
(gulie fixer)— bottom-to-top edit application- Auto-fix for: trailing whitespace, line-length (where possible)
(gulie rules idiom)—match-based pattern suggestions on Tree-IL(gulie rules module-form)—define-moduleform checks (sorted imports, etc.)
Phase 6: Formatter (cost-based optimal layout)
(gulie formatter)— Wadler/Leijen pretty-printer with cost-based selection- Abstract document type:
text,line,nest,concat,alt,group - Form-specific layout rules (reuse indent-rules table + layout hints)
- Comment preservation through formatting
--formatCLI mode- Verification: format Guile/Guix source files, diff against originals, verify roundtrip stability (format twice = same output)
Phase 7: Cross-module analysis (future)
- Load multiple modules, walk dependency graph
- Unused exports, cross-module arity checks
--pass cross-moduleCLI option
Testing strategy
- Roundtrip test (tokenizer): tokenize → concat must equal original input
- Snapshot tests:
fixtures/violations/rule-name.scm+.expectedpairs - Clean file tests:
fixtures/clean/*.scmmust produce zero diagnostics - Unit tests:
(srfi srfi-64)for tokenizer, CST, config, diagnostics - Real-world corpus: run against
test/guix/andrefs/guile/module/for false-positive rate validation - Formatter idempotency:
format(format(x)) = format(x)for all test files
Key design decisions
| Decision | Rationale |
|---|---|
| Hand-written tokenizer, not extending Guile's reader | The reader is ~1000 lines of nested closures not designed for extension. A clean 200-line tokenizer is easier to write/test. |
| Two independent passes, not a unified AST | Reader strips comments irrecoverably. Accepting this gives clean separation. |
| Delegate to Guile's built-in analyses | They're battle-tested, handle macroexpansion edge cases, and are maintained upstream. |
(ice-9 match) for idiom rules, not logic programming |
Built-in, fast, sufficient. miniKanren can be added later if needed. |
| S-expression config, not YAML/TOML | Zero deps. Our users write Scheme. (read) does the parsing. |
| Flat CST (parens + interleaved tokens), not rich AST | Enough for indentation/formatting checks. No overengineering. |
| Cost-based optimal layout for the formatter | Greedy formatters produce mediocre output. Wadler/Leijen is cleaner and provably correct. Worth the investment when we reach that phase. |
| Checker first, formatter later | Checking is simpler, immediately useful in CI, and validates the tokenizer/CST infrastructure that the formatter will build on. |
Critical files to reference during implementation
refs/guile/module/ice-9/read.scm:949-973— what the reader discards (our tokenizer must keep)refs/guile/module/language/tree-il/analyze.scm:1461-1479—make-analyzerAPIrefs/guile/module/system/base/compile.scm:298-340—read-and-compile/compilerefs/guile/module/system/base/message.scm:83-220—%warning-typesdefinitionsrefs/guile/module/language/tree-il.scm— Tree-IL node types and traversalrefs/guile/module/ice-9/pretty-print.scm— existing pretty-printer (form-specific rules to extract)refs/mallet/src/parser/tokenizer.lisp— reference tokenizer (163 lines)refs/fmt/conventions.rkt— form-specific formatting rules (100+ forms)refs/fmt/main.rkt— cost-based layout selection implementation