
feat(sparql-anything): add SparqlAnythingConverter for chunked non-RDF to RDF conversion #511

Open: ddeboer wants to merge 2 commits into main from feat/sparql-anything-converter

ddeboer (Member) commented Jun 23, 2026

Summary

Adds the @lde/sparql-anything package with a SparqlAnythingConverter, the first LDE piece of the geonames-rdf migration. It ports the convert loop of geonames-rdf's map.sh to TypeScript on top of @lde/task-runner. The scope is convert-only: no selector, transform, validation, or chaining.

What it does

new SparqlAnythingConverter({ queryFile, jarPath, adminCodesFile, taskRunner }).convert(chunkPaths, outputPath):

  • Runs the SPARQL Anything CLI once per input chunk to bound memory use (fx:ondisk plus per-process isolation), then stream-concatenates the per-chunk N-Triples into a single output file. Streaming keeps multi-gigabyte outputs out of memory; N-Triples has no prefixes or document structure, so plain concatenation is always valid.
  • Substitutes each chunk's path into the query's {SOURCE} placeholder through a temporary -q file, rather than map.sh's inline --query "$(sed …)". A temp file sidesteps shell-escaping a large SPARQL query passed to a shell: true runner.
  • Aborts the whole conversion when any chunk's process exits non-zero (the runner's wait() rejects), so a crashed chunk can never be silently dropped from the output — the file-output equivalent of map.sh's set -e.
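The substitution and concatenation steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names `substituteSource` and `concatenateFiles` are hypothetical, and error handling is reduced to the essentials.

```typescript
import { once } from "node:events";
import { createReadStream, createWriteStream } from "node:fs";
import { finished } from "node:stream/promises";

/** Replace every {SOURCE} placeholder with the chunk's path. */
export function substituteSource(query: string, chunkPath: string): string {
  // split/join rather than String.replace, so "$" sequences in a path are
  // never interpreted as replacement patterns.
  return query.split("{SOURCE}").join(chunkPath);
}

/** Append each part to outputPath in order, one buffer at a time. */
export async function concatenateFiles(
  partPaths: readonly string[],
  outputPath: string,
): Promise<void> {
  const out = createWriteStream(outputPath);
  try {
    for (const partPath of partPaths) {
      for await (const buffer of createReadStream(partPath)) {
        // Respect backpressure: wait for the writer to drain before reading on.
        if (!out.write(buffer)) await once(out, "drain");
      }
    }
    out.end();
    await finished(out);
  } catch (error) {
    out.destroy(); // release the file handle before propagating the failure
    throw error;
  }
}
```

Only one buffer per part is held in memory at a time, which is what keeps large outputs off the heap. The sketch assumes every per-chunk file ends with a newline, as N-Triples output normally does; otherwise the last triple of one part would run into the first triple of the next.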

It is not a pipeline Executor: it produces a file, not quads, so it sits standalone rather than inside Pipeline → Stage → Executor. Chunking stays in the caller (geonames download.sh); the converter consumes pre-split chunks.

Design notes

  • The converter takes a TaskRunner rather than spawning processes itself, so it runs unchanged on the host (NativeTaskRunner), in Docker, or anywhere else.
  • Implementation lives in src/sparql-anything-converter.ts (re-exported from index.ts) so V8 coverage measures it — the base config excludes **/index.ts.
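To illustrate the injection point, here is a hedged sketch of a per-chunk loop that depends only on a runner interface. The `TaskRunner` shape below is an assumption made for this example (a `run()` that starts a command and a `wait()` that rejects on a non-zero exit); the real @lde/task-runner interface may differ, and `runPerChunk` is a hypothetical name.

```typescript
// Hypothetical minimal runner shape for this sketch; the real
// @lde/task-runner interface may differ.
export interface TaskRunner<Task> {
  run(command: string): Promise<Task>;
  // Assumed to reject when the process exits non-zero.
  wait(task: Task): Promise<string>;
}

/** Run one command per chunk, strictly in order, stopping at the first failure. */
export async function runPerChunk<Task>(
  runner: TaskRunner<Task>,
  chunkPaths: readonly string[],
  commandFor: (chunkPath: string) => string,
): Promise<void> {
  for (const chunkPath of chunkPaths) {
    const task = await runner.run(commandFor(chunkPath));
    // No try/catch here: a rejection from wait() propagates to the caller
    // and ends the loop, so later chunks are never started.
    await runner.wait(task);
  }
}
```

Because the loop only sees the interface, a native runner, a Docker-backed runner, or a test stub can be passed in without changing the converter logic.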

Tests

Unit tests drive the converter through a stubbed TaskRunner (vertical TDD slices): the per-chunk CLI contract (-q temp file, --load, --format NT, --output), {SOURCE} substitution, multi-chunk concatenation order, and the fail-fast abort contract. 100% coverage.
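As a rough illustration of the CLI-contract slice, a recording stub can capture the command and compare it with the expected flags. Everything named here (`ChunkCommand`, `buildCommand`, `RecordingRunner`, `checkCliContract`, the file paths, and the idea that `--load` receives the admin codes file) is an assumption for the sake of the example, not the actual test code in this PR.

```typescript
import { deepStrictEqual } from "node:assert";

// Hypothetical option names for this sketch, not the package's actual API.
interface ChunkCommand {
  jarPath: string;
  queryPath: string; // temporary query file with {SOURCE} already substituted
  loadPath: string; // assumed here to be the admin codes file
  outputPath: string; // per-chunk N-Triples file
}

/** Build the per-chunk command line from the flags named above. */
export function buildCommand(c: ChunkCommand): string {
  // Paths are not shell-quoted in this sketch; real code would need to quote them.
  return [
    "java", "-jar", c.jarPath,
    "-q", c.queryPath,
    "--load", c.loadPath,
    "--format", "NT",
    "--output", c.outputPath,
  ].join(" ");
}

/** Stub runner that records commands instead of spawning processes. */
export class RecordingRunner {
  readonly commands: string[] = [];

  async run(command: string): Promise<number> {
    this.commands.push(command);
    return this.commands.length - 1; // fake task handle
  }

  async wait(_task: number): Promise<string> {
    return ""; // every stubbed process "exits" successfully
  }
}

/** Drive one chunk through the stub and assert on the recorded command. */
export async function checkCliContract(): Promise<string[]> {
  const runner = new RecordingRunner();
  const task = await runner.run(
    buildCommand({
      jarPath: "sparql-anything.jar",
      queryPath: "/tmp/query.sparql",
      loadPath: "admin-codes.ttl",
      outputPath: "/tmp/chunk-0.nt",
    }),
  );
  await runner.wait(task);
  deepStrictEqual(runner.commands, [
    "java -jar sparql-anything.jar -q /tmp/query.sparql --load admin-codes.ttl --format NT --output /tmp/chunk-0.nt",
  ]);
  return runner.commands;
}
```

A fail-fast test follows the same pattern with a stub whose `wait()` rejects for one chunk, asserting that the conversion rejects and that no later chunk is started.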

The real-jar integration test (actual SPARQL Anything jar over a tiny CSV fixture) and the geonames-rdf consumption bridge (how the shell/Docker repo invokes a TS converter) are deferred follow-ups.

ddeboer added 2 commits June 22, 2026 12:42
…F to RDF conversion

- Run the SPARQL Anything CLI once per input chunk (via an @lde/task-runner)
  to bound memory use, then stream-concatenate the per-chunk N-Triples into
  one file.
- Substitute each chunk's path into the query's `{SOURCE}` placeholder through
  a temporary `-q` file, avoiding shell-escaping a large inline SPARQL query.
- Abort the whole conversion when any chunk's process exits non-zero, so a
  crashed chunk can never be silently dropped from the output.
- Scaffold the @lde/sparql-anything package (0.1.0) and list it in the root
  README packages table and architecture diagram.
ddeboer force-pushed the feat/sparql-anything-converter branch from 80488b3 to 9e59373 on June 23, 2026 12:30