Provenance & lineage

Geopera records lineage at the moment data is created — not as an afterthought, a log, or a best-effort background job. When an operation produces an artifact, the lineage record is written in the same transaction that creates the artifact. If the lineage can’t be recorded, the operation does not succeed.

This page explains what that means in practice, what gets recorded, and why it’s a guarantee rather than a feature you have to enable.

Provenance by construction

Most systems treat audit trails as something bolted on: a log line, an event emitted after the fact, a table someone hopefully writes to. Those drift — code paths get added that forget to log, and the trail has holes.

Geopera inverts this. The operation pipeline has a fixed shape, and for any operation that isn’t a pure read it runs:

… → open provenance → execute → validate output → seal provenance → audit

The seal step enforces a contract: if an operation declares that it produces an artifact (an item, an order, a job…), it must have emitted that artifact’s lineage, or the operation fails closed with a provenance error — even if the underlying work “succeeded.” There is no flag to turn this off, and no privileged code path that skips it. A capability that produces data and forgets to record where it came from cannot ship.

The result: every produced artifact carries a complete record of how it came to exist, by construction.
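The seal contract can be sketched in a few lines of Python. This is a minimal illustration, not Geopera's kernel: `ProvenanceContext`, `run_operation`, `ProvenanceError`, and the artifact IDs are hypothetical names chosen for the example, and only the open, execute, and seal steps of the pipeline are modelled.

```python
from dataclasses import dataclass, field


class ProvenanceError(Exception):
    """Raised when a producing operation seals without the lineage it owes."""


@dataclass
class ProvenanceContext:
    # Artifact kinds the operation declares it produces (hypothetical field).
    declared_kinds: frozenset
    # (kind, artifact_id, [(relation, source_id), ...]) records emitted so far.
    emitted: list = field(default_factory=list)

    def emit(self, kind, artifact_id, derived_from):
        self.emitted.append((kind, artifact_id, list(derived_from)))

    def seal(self):
        emitted_kinds = {kind for kind, _, _ in self.emitted}
        missing = self.declared_kinds - emitted_kinds
        if missing:
            raise ProvenanceError(f"no lineage emitted for: {sorted(missing)}")


def run_operation(declared_kinds, execute):
    """open provenance -> execute -> seal provenance, failing closed."""
    ctx = ProvenanceContext(frozenset(declared_kinds))
    result = execute(ctx)  # the underlying work
    ctx.seal()             # raises even though execute() returned normally
    return result


def place_order(ctx):
    ctx.emit("order", "ord_1", [("ordered", "cap_1")])
    return "ord_1"


def forgetful_order(ctx):
    return "ord_2"  # the work "succeeds" but no lineage is emitted


print(run_operation({"order"}, place_order))
try:
    run_operation({"order"}, forgetful_order)
except ProvenanceError as err:
    print("failed closed:", err)
```

The point of the sketch is that the check lives in `run_operation`, not in each operation, so an operation body has no way to opt out of it.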

What gets recorded

Each producing operation emits one or more artifacts, each linked by derived-from edges to the things it came from. The artifact kinds today:

| Kind | Emitted by (examples) | Typically derived from |
|---|---|---|
| order | orders.archive.place, orders.tasking.place | the captures / AOIs it was placed for |
| item | uploads.complete, order delivery, processing output | the upload, order, or job that produced it |
| asset | asset ingestion | its parent item |
| processing_job | processing.create, clip.create_from_item | its input items/assets |
| report | reports.generate | the items/analytics it summarised |
| share_link | share.link.create | the item or collection it exports |
| collection, project | collections.create, projects.create | their parent scope |

For example:

  • An archive order is recorded as an order artifact derived from each capture it ordered (relation: "ordered").
  • A delivered item is recorded as an item derived from its order (relation: "delivered_for").
  • An uploaded item is an item derived from its upload session (relation: "uploaded").
  • A clip output is an item derived from both the processing job (relation: "produced_by") and the source item it was clipped from.

Chained together, these edges form a walkable graph: a clipped item → the job that made it → the source item → the order that delivered the source → the captures that order bought. Lineage survives even if an upstream source is later archived.
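As an illustration of walking that chain, the sketch below stores edges as (source, derived artifact, relation) tuples and does a breadth-first walk towards the ancestors. The IDs are made up, and the relation names "clipped_from" and "input" are placeholders: the page names only "ordered", "delivered_for", "uploaded", and "produced_by".

```python
from collections import deque

# (src_id, dst_id, relation): dst was derived from src.
EDGES = [
    ("job_1", "item_clip", "produced_by"),
    ("item_src", "item_clip", "clipped_from"),  # placeholder relation name
    ("item_src", "job_1", "input"),             # placeholder relation name
    ("ord_1", "item_src", "delivered_for"),
    ("cap_1", "ord_1", "ordered"),
]


def ancestors(root, edges, max_depth=20):
    """Return {node_id: depth} for the root and everything it derives from."""
    parents = {}
    for src, dst, _relation in edges:
        parents.setdefault(dst, []).append(src)

    depth_of = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if depth_of[node] == max_depth:
            continue  # do not expand beyond the depth bound
        for parent in parents.get(node, []):
            if parent not in depth_of:  # keep the shortest depth per node
                depth_of[parent] = depth_of[node] + 1
                queue.append(parent)
    return depth_of


print(ancestors("item_clip", EDGES))
# {'item_clip': 0, 'job_1': 1, 'item_src': 1, 'ord_1': 2, 'cap_1': 3}
```

Because the clip output links to both its job and its source item, the source item appears at depth 1 rather than depth 2; a breadth-first walk records each node once, at its shortest distance from the root.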

The idempotent no-op

A producing operation that legitimately produces nothing on a given call (for example, uploads.complete on an already-completed session, or an order placement that returns a cached result on an idempotent retry) marks itself as an idempotent no-op. The seal contract recognises this and doesn’t demand a fresh emit. Idempotent retries therefore stay safe and provenance stays honest: a record is written exactly when an artifact is actually created, and not otherwise.
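A toy model of the no-op rule, assuming a hypothetical in-memory `UploadKernel` that stands in for the real handling of uploads.complete. All class, method, and ID names are invented for the example.

```python
class ProvenanceError(Exception):
    """Raised when a producing call seals with neither lineage nor a no-op mark."""


class Context:
    def __init__(self):
        self.emitted = []  # lineage records emitted during this call
        self.noop = False  # set when the call legitimately produced nothing


class UploadKernel:
    """Toy stand-in for a kernel handling a hypothetical uploads.complete."""

    def __init__(self):
        self.completed = {}  # session_id -> item_id created for it
        self.lineage = []    # sealed lineage records

    def _complete(self, ctx, session_id):
        if session_id in self.completed:
            ctx.noop = True  # retry: nothing new is created
            return self.completed[session_id]
        item_id = f"it_{len(self.completed) + 1}"
        self.completed[session_id] = item_id
        ctx.emitted.append(("item", item_id, "uploaded", session_id))
        return item_id

    def complete_upload(self, session_id):
        ctx = Context()
        item_id = self._complete(ctx, session_id)
        # Seal: a producing call needs either fresh lineage or a no-op mark.
        if not ctx.emitted and not ctx.noop:
            raise ProvenanceError("producing call sealed without lineage")
        self.lineage.extend(ctx.emitted)
        return item_id


kernel = UploadKernel()
first = kernel.complete_upload("up_1")
retry = kernel.complete_upload("up_1")
print(first == retry, len(kernel.lineage))  # True 1
```

The retry returns the same item and adds no second lineage record, which is the behaviour the paragraph above describes.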

Why it matters

  • Auditability — for any artifact you can answer “where did this come from?” with a record that was written atomically with the artifact, not reconstructed from logs.
  • Reproducibility — the inputs an output was derived from are captured, so a result can be traced back to exactly what produced it.
  • Trust across clients — because the guarantee lives in the kernel, it holds no matter who called the operation: the portal, an SDK, an AI agent, or a worker registering its job’s outputs all produce the same complete lineage.

Integrity of the record

The provenance store is written through a privileged, security-defined path: operations emit through the kernel, which writes the artifact and its edges atomically; application roles cannot write provenance rows directly. So the lineage graph reflects what the governed operations actually did. A client cannot forge it by bypassing the emit path, because no other write path to the store exists.
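The atomic write can be illustrated with SQLite from the Python standard library. The two-table layout and the `emit` helper are assumptions made for the example, not Geopera's schema, and the sketch covers only atomicity, not the role restrictions.

```python
import sqlite3


def open_store():
    """Create an in-memory store with hypothetical artifact and edge tables."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE artifact (id TEXT PRIMARY KEY, kind TEXT NOT NULL)")
    db.execute(
        "CREATE TABLE edge ("
        " src_id TEXT NOT NULL, dst_id TEXT NOT NULL, relation TEXT NOT NULL)"
    )
    return db


def emit(db, kind, artifact_id, derived_from):
    """Write the artifact and its derived-from edges in one transaction."""
    with db:  # commits on success, rolls back if any statement raises
        db.execute("INSERT INTO artifact VALUES (?, ?)", (artifact_id, kind))
        db.executemany(
            "INSERT INTO edge VALUES (?, ?, ?)",
            [(src, artifact_id, relation) for relation, src in derived_from],
        )


def count(db, table):
    return db.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


db = open_store()
try:
    # The edge is invalid (relation is NULL), so the artifact row is rolled back too.
    emit(db, "item", "it_1", [(None, "ord_1")])
except sqlite3.IntegrityError:
    pass
print(count(db, "artifact"), count(db, "edge"))  # 0 0

emit(db, "item", "it_1", [("delivered_for", "ord_1")])
print(count(db, "artifact"), count(db, "edge"))  # 1 1
```

Either both the artifact and its lineage exist afterwards or neither does, which is what makes "artifact without lineage" an unreachable state rather than a rare one.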

Reading lineage

Walk the graph over the API with provenance.get (scope provenance:read):

```bash
curl -s -X POST https://api.geopera.com/v1/op/provenance.get \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "entity_type": "item", "entity_id": "it_3a9c...", "direction": "up", "max_depth": 20 }'
```

Example response:

```json
{
  "root": { "entity_type": "item", "entity_id": "it_3a9c..." },
  "direction": "up",
  "edges": [
    { "depth": 1, "src_type": "item", "src_id": "it_src...", "dst_type": "item", "dst_id": "it_3a9c...", "relation": "duplicated_from", "invocation_id": "..." }
  ],
  "nodes": [ { "entity_type": "item", "entity_id": "it_3a9c...", "depth": 0 }, { "entity_type": "item", "entity_id": "it_src...", "depth": 1 } ]
}
```
  • direction: up (ancestors: “how was this produced?”), down (descendants: “what came from this?”), or both. max_depth bounds the walk (1–50).
  • Supported root types: item, order, processing_job, collection, project.
  • Org-scoped: you can only read lineage for a root your organization owns — a cross-org or unknown root returns 404 (no existence leak), exactly like every other read.
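
The same call can be made from Python with only the standard library. This is a sketch rather than an official client: the endpoint, field names, and limits are taken from the curl example and the bullets above, while the helper names (`build_request`, `get_lineage`, `nodes_by_depth`) are invented for the example.

```python
import json
import urllib.request

API_URL = "https://api.geopera.com/v1/op/provenance.get"  # from the curl example
ROOT_TYPES = {"item", "order", "processing_job", "collection", "project"}
DIRECTIONS = {"up", "down", "both"}


def build_request(token, entity_type, entity_id, direction="up", max_depth=20):
    """Validate the arguments locally and build the POST request."""
    if entity_type not in ROOT_TYPES:
        raise ValueError(f"unsupported root type: {entity_type}")
    if direction not in DIRECTIONS:
        raise ValueError(f"direction must be one of {sorted(DIRECTIONS)}")
    if not 1 <= max_depth <= 50:
        raise ValueError("max_depth must be between 1 and 50")
    body = json.dumps({
        "entity_type": entity_type,
        "entity_id": entity_id,
        "direction": direction,
        "max_depth": max_depth,
    }).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def get_lineage(token, entity_type, entity_id, **options):
    """Send the request; urllib raises HTTPError for a 404 root."""
    request = build_request(token, entity_type, entity_id, **options)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def nodes_by_depth(lineage):
    """Group the response's nodes as {depth: [(entity_type, entity_id), ...]}."""
    grouped = {}
    for node in lineage["nodes"]:
        grouped.setdefault(node["depth"], []).append(
            (node["entity_type"], node["entity_id"])
        )
    return grouped
```

Calling `nodes_by_depth(get_lineage(token, "item", item_id))` would then list each ancestor by its distance from the root. Since a cross-org root and an unknown root both return 404, a caller cannot tell the two apart and should treat either as "not found".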