Technical Deep Dive 2026-04-10 8 min read

Why Tree-sitter Is the Future of Legacy Code Analysis

Traditional regex-based parsers fail on complex legacy languages. Tree-sitter provides incremental, error-tolerant parsing that handles COBOL, PL/I, and VB6 with 94%+ accuracy at native speed.

By AITYTECH Engineering

Legacy code analysis has long been plagued by a fundamental problem: most parsers weren't designed for languages like COBOL, PL/I, or JCL. Traditional approaches using regex patterns or hand-written recursive descent parsers break down when facing column-sensitive formatting, continuation lines, and the sheer complexity of 60-year-old language specifications.

The Problem with Traditional Parsers

Consider COBOL's fixed column format: columns 1-6 hold sequence numbers, column 7 is the indicator area (comments, continuations), columns 8-72 hold the program text, and columns 73-80 are the identification area. A naive parser that treats the source as a plain stream of characters will misread sequence numbers and indicator characters as program text.
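To make the column layout concrete, here is a minimal Python sketch that slices one fixed-format line into the four areas described above. The names CobolLine, split_fixed_format, and is_comment are illustrative and not part of any library. The sketch assumes space-padded lines without tabs and treats both * and / in column 7 as comment indicators.

```python
from typing import NamedTuple


class CobolLine(NamedTuple):
    sequence: str        # columns 1-6
    indicator: str       # column 7
    text: str            # columns 8-72
    identification: str  # columns 73-80


def split_fixed_format(line: str) -> CobolLine:
    """Split one fixed-format COBOL line into its column areas."""
    # Pad to 80 columns so short lines still slice cleanly.
    padded = line.rstrip("\r\n").ljust(80)
    return CobolLine(
        sequence=padded[0:6],
        indicator=padded[6],
        text=padded[7:72],
        identification=padded[72:80],
    )


def is_comment(line: CobolLine) -> bool:
    # Assumption: '*' and '/' in column 7 both mark a comment line.
    return line.indicator in ("*", "/")


sample = "000100* This is a comment line"
parsed = split_fixed_format(sample)
print(parsed.sequence, repr(parsed.indicator), is_comment(parsed))
```

Slicing by fixed offsets rather than splitting on whitespace is what keeps the sequence number and indicator from leaking into the program text.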

Most legacy analysis tools work around this with preprocessing steps that strip columns, normalize formatting, and apply regex patterns. This approach has three critical flaws:

Loss of position information: after preprocessing, line numbers and column positions no longer map to the original source
Error cascading: a single unrecognized construct causes the entire file to fail parsing
No incremental parsing: every edit requires a full reparse, making IDE integration impractical
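The first flaw is easy to reproduce. The sketch below is a deliberately naive preprocessing step, written for illustration only and not taken from any particular tool. It drops comment lines and keeps columns 8-72, then compares where the same token sits before and after.

```python
from typing import List, Optional, Tuple


def strip_columns(source: str) -> List[str]:
    """Naive preprocessing: drop comment lines, keep only columns 8-72."""
    kept = []
    for line in source.splitlines():
        padded = line.ljust(72)
        if padded[6] in ("*", "/"):  # comment indicator in column 7
            continue
        kept.append(padded[7:72].rstrip())
    return kept


def locate(lines: List[str], token: str) -> Optional[Tuple[int, int]]:
    """Return the 1-based (line, column) of the first occurrence of token."""
    for number, text in enumerate(lines, start=1):
        if token in text:
            return (number, text.index(token) + 1)
    return None


original = "\n".join([
    "000100* Customer update routine",
    "000200     MOVE A TO B.",
    "000300     DISPLAY B.",
])

orig_pos = locate(original.splitlines(), "DISPLAY")
new_pos = locate(strip_columns(original), "DISPLAY")
print(orig_pos)  # (3, 12)
print(new_pos)   # (2, 5)
```

Any diagnostic reported against the preprocessed text now points at line 2, column 5, while the construct lives at line 3, column 12 in the file the developer actually sees. A tool would need to carry an offset map through every preprocessing step to undo this.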

Tree-sitter: A Different Approach

Tree-sitter, developed by Max Brunsfeld at GitHub, takes a fundamentally different approach. Instead of failing on unrecognized input, it produces a concrete syntax tree (CST) that includes error nodes alongside successfully parsed constructs. This is exactly what legacy code analysis needs.
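As a rough illustration of what error nodes look like to a consumer, the sketch below walks a tree and collects every error or missing node with its position. It assumes nodes expose type, children, start_point, end_point, and is_missing, as the Python tree-sitter bindings do, and that error nodes carry the type "ERROR". StubNode and the node type names in the sample tree are hypothetical stand-ins so the example runs without a compiled grammar.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[int, int]  # zero-based (row, column)


@dataclass
class StubNode:
    """Stand-in for a tree-sitter node (hypothetical, for illustration)."""
    type: str
    start_point: Point
    end_point: Point
    children: List["StubNode"] = field(default_factory=list)
    is_missing: bool = False


def collect_errors(root) -> List[Tuple[str, Point, Point]]:
    """Return (kind, start, end) for every ERROR or missing node."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.type == "ERROR":
            found.append(("error", node.start_point, node.end_point))
        elif getattr(node, "is_missing", False):
            found.append(("missing", node.start_point, node.end_point))
        # Push children reversed so they are visited in document order.
        stack.extend(reversed(node.children))
    return found


# Hypothetical tree: one unknown dialect construct between two valid divisions.
tree = StubNode("program", (0, 0), (9, 21), [
    StubNode("identification_division", (0, 0), (2, 0)),
    StubNode("ERROR", (3, 7), (3, 40)),
    StubNode("procedure_division", (4, 0), (9, 21), [
        StubNode(".", (9, 20), (9, 20), is_missing=True),
    ]),
])

errors = collect_errors(tree)
print(errors)
```

The point of the design is that the two valid divisions remain available for analysis, and the error entries keep exact source positions for reporting.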

Key advantages for legacy languages:

Error tolerance: parse 94%+ of real-world COBOL even with dialect variations
Incremental parsing: only re-parse changed regions, enabling real-time IDE support
Consistent tree format: the same query API works across COBOL, JCL, PL/I, VB6, and every other supported language
Native speed: generated C parsers run at 10+ MB/s, orders of magnitude faster than interpreted alternatives
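Incremental parsing depends on telling the parser exactly which byte range changed. In the Python bindings this is done with Tree.edit, which takes start_byte, old_end_byte, new_end_byte and the matching start_point, old_end_point, new_end_point, followed by a call to parser.parse(new_source, old_tree). The sketch below only computes those six values in pure Python. The helper names point_at and describe_edit are assumptions made for illustration.

```python
from typing import Dict, Tuple

Point = Tuple[int, int]  # zero-based (row, column in bytes)


def point_at(source: bytes, byte_offset: int) -> Point:
    """Convert a byte offset into a (row, column) pair."""
    row = source.count(b"\n", 0, byte_offset)
    line_start = source.rfind(b"\n", 0, byte_offset) + 1
    return (row, byte_offset - line_start)


def describe_edit(old: bytes, start: int, old_end: int,
                  replacement: bytes) -> Tuple[bytes, Dict[str, object]]:
    """Apply a replacement and return the new source plus edit parameters."""
    new = old[:start] + replacement + old[old_end:]
    new_end = start + len(replacement)
    edit = {
        "start_byte": start,
        "old_end_byte": old_end,
        "new_end_byte": new_end,
        "start_point": point_at(old, start),
        "old_end_point": point_at(old, old_end),
        "new_end_point": point_at(new, new_end),
    }
    return new, edit


old_source = b"       MOVE A TO B.\n       DISPLAY B.\n"
start = old_source.index(b" A ") + 1  # byte offset of the operand A
new_source, edit = describe_edit(old_source, start, start + 1, b"TOTAL")

# With a real parser, the next steps would be:
#   old_tree.edit(**edit)
#   new_tree = parser.parse(new_source, old_tree)
print(edit)
```

Only the subtree covering the edited range needs to be rebuilt. The DISPLAY statement on the second line can be reused from the old tree.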

Benchmark: Tree-sitter vs Traditional Parsers

We benchmarked our tree-sitter COBOL parser against three traditional approaches using 276 real-world COBOL files (1.8MB total):

Approach	Parse Time	Success Rate	Error Recovery
Tree-sitter (our parser)	12 ms for a 1,200-line file	100%	Full (error nodes in the syntax tree)
Regex-based (typical)	~200 ms	~70%	None (fails completely)
ANTLR4 COBOL grammar	~80 ms	~85%	Partial (skips tokens)
Hand-written parser	~40 ms	~90%	Manual (per construct)

The tree-sitter approach is not only 3-16x faster but achieves the highest success rate with the best error recovery. When a COBOL program uses a dialect extension the parser doesn't know about, it marks that region as an error node while correctly parsing everything else.
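The speedup range follows directly from the parse times in the table. A quick check, using the approximate figures as given:

```python
# Parse times in milliseconds, copied from the benchmark table.
tree_sitter_ms = 12
others_ms = {
    "Regex-based": 200,
    "ANTLR4": 80,
    "Hand-written": 40,
}

# Speedup = competitor time / tree-sitter time, rounded to one decimal.
speedups = {name: round(ms / tree_sitter_ms, 1) for name, ms in others_ms.items()}
print(speedups)  # {'Regex-based': 16.7, 'ANTLR4': 6.7, 'Hand-written': 3.3}
```

The ratios run from about 3.3x against the hand-written parser to about 16.7x against the regex-based approach, in line with the range quoted above.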

Multi-Language: One API for All Legacy Languages

Perhaps the most powerful aspect of tree-sitter for legacy modernization is cross-language consistency. The same tree-sitter query that finds all procedure divisions in COBOL can be adapted to find all subroutines in VB6, all procedures in PL/I, or all job steps in JCL.
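One way to picture this is a single query template whose only language-specific part is the node type. The sketch below builds tree-sitter query patterns of the form (node_type) @capture, which is standard query syntax. The node type names in the mapping are hypothetical. Real names depend on each grammar's definition and would need to be checked against it.

```python
# Hypothetical node type names; the real ones come from each grammar.
UNIT_NODE_TYPES = {
    "cobol": "procedure_division",
    "vb6": "sub_declaration",
    "pli": "procedure_statement",
    "jcl": "exec_statement",
}


def unit_query(language: str, capture: str = "unit") -> str:
    """Build a query pattern that captures each top-level unit of work."""
    try:
        node_type = UNIT_NODE_TYPES[language]
    except KeyError:
        raise ValueError(f"no node type registered for {language!r}") from None
    return f"({node_type}) @{capture}"


for language in UNIT_NODE_TYPES:
    print(language, unit_query(language))
```

Under this arrangement, supporting another language means adding one entry to the mapping once its grammar is available. The code that runs the query and consumes the captures stays the same.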

This means modernization tools built on tree-sitter can support a new language by adding a grammar, with no parser rewrite required. Our service currently supports 7 legacy languages through a single unified API, all parsing at native C speed.

Getting Started

Our open-source legacy parser service provides a REST API for tree-sitter-powered analysis. Upload any COBOL, JCL, PL/I, VB6, VB.NET, PowerBuilder, or Assembly file and get back a complete syntax tree with error reporting, typically within milliseconds.