Language support
ohara extracts HEAD-snapshot symbols (classes, methods, functions,
records, sealed types, …) from source files using
tree-sitter. The
extracted symbols feed the vec_symbol, fts_symbol, and
fts_symbol_name tables and contribute one of the three retrieval
lanes in find_pattern.
Hunks (the diff bodies that drive the other two retrieval lanes) are
language-agnostic — they’re extracted by git2 regardless of the file
type, so even unsupported languages still appear in
find_pattern results via the BM25 hunk-text and vector lanes. Symbol
extraction is the language-specific piece.
Supported languages
| Language | Since | Notes |
|---|---|---|
| Rust | v0.1 | Functions, methods, structs, enums, traits, impls. |
| Python | v0.1 | Classes, methods, top-level functions. |
| Java 17+ | v0.4 | Classes (incl. sealed), interfaces, records, enums, methods. |
| Kotlin 1.9 / 2.0 | v0.4 | Classes (incl. sealed), data classes, objects + companion objects, interfaces, top-level + member functions. |
| TypeScript | v0.8 | .ts, .tsx. Functions, classes, methods, arrow-function consts, interfaces, type aliases, enums. |
| JavaScript | v0.8 | .js, .jsx, .mjs, .cjs. Functions, classes, methods, arrow-function consts. |
Java + Kotlin specifics
The v0.4 release was specifically designed to make ohara useful on Spring-flavored codebases, which means:
- Sealed types and records (Java 17+) and data classes / objects / companion objects (Kotlin) are first-class symbol kinds.
- Annotations stay inside
source_text. Decorators like@RestController,@Service,@Component,@SpringBootApplication, and Kotlin’s@Composable/@Serializableare preserved verbatim in the symbol’ssource_textfield. That means embeddings and BM25 both pick up Spring-style markers without any new query syntax — a query like “REST controller for user signup” matches symbols annotated with@RestControllerdirectly.
TypeScript + JavaScript specifics
The v0.8 release adds first-class symbol extraction for the JS/TS
ecosystem via the
tree-sitter-typescript
0.23 and
tree-sitter-javascript
0.23 grammars. Both grammars are loaded through the
tree-sitter-language
ABI shim so they keep working as the host tree-sitter runtime
advances.
- TypeScript dispatches on
.tsand.tsx; the latter is parsed with the TSX dialect of the grammar so JSX expressions inside.tsxcomponents don’t derail symbol extraction. - JavaScript dispatches on
.js,.jsx,.mjs, and.cjs. JSX inside.jsxparses cleanly via the same grammar. - Symbols extracted (both languages):
functiondeclarations,classdeclarations, methods (includingconstructor,get/set,static, andasync), and arrow-functionconsts (export const Foo = (…) => {…}) — the last one matters because the modern React/Next idiom expresses components and hooks as arrow-bound consts rather thanfunctiondeclarations. - Symbols extracted (TypeScript only):
interfacedeclarations,typealiases, andenumdeclarations. - Intentionally not extracted yet: decorators (preserved inside
source_textlike Java/Kotlin annotations, but not surfaced as separate symbols), ambient declarations in.d.tsfiles (the file extension dispatches normally butdeclareblocks aren’t symbolized), andnamespace/moduledeclarations (TS-only legacy module syntax). These are tracked on the roadmap.
Chunking
All six languages share the v0.3 AST sibling-merge chunker
(implemented in ohara-parse): depth-first traversal of the
tree-sitter parse tree, accumulate sibling nodes into the current
chunk while the running token total stays under a 500-token budget
(token ≈ char_count / 4 for chunking decisions). When a single node
exceeds the budget, emit it alone — preserves semantic locality at
the cost of larger chunks for occasional huge functions.
The chunker also records sibling-symbol names in a JSON-encoded
sibling_names column on each symbol row. That column feeds the
fts_symbol_name BM25 lane, so a query for “X” matches not just
symbols literally named X but symbols that share a parent with one.
As of v0.6 the per-language extractors live under
crates/ohara-parse/src/languages/ (one module per language) — same
behavior, tighter directory tree.
Grammar and parser versions
Plan-18 (v0.7) modernized the tree-sitter stack: the workspace tracks
tree-sitter 0.25, the rust/python/java grammars were bumped to their
current upstream releases (rust 0.21 → 0.24, python 0.21 → 0.25, java
0.21 → 0.23), and the kotlin grammar was swapped from the abandoned
fwcd/tree-sitter-kotlin to the more-recently-maintained
tree-sitter-grammars/tree-sitter-kotlin-ng
fork.
All four parser_versions bumped from "1" to "2" in lockstep, so
indexes built before v0.7 receive a query_compatible_needs_refresh
verdict per plan-13 (docs/superpowers/plans/2026-05-02-ohara-plan-13-index-metadata-and-rebuild-safety.md):
queries still work, but the symbol/hunk derived rows can be
out-of-date. Recovery is ohara index --force, which re-walks HEAD
symbols and rewrites derived rows without touching the embedding
vectors (model + dimension are unchanged).
Future languages
Tracked on the Roadmap. New languages are mostly a
matter of adding a tree-sitter grammar, a node-kind → symbol-kind
mapping, and a fixture file to the per-language unit tests in
crates/ohara-parse/. Annotations / decorators should follow the
“keep them in source_text” rule so the retrieval-side query story
stays consistent.