<!-- Source: https://docs.geopera.com/api-reference/guides/provenance · Markdown for LLMs -->

# Provenance & lineage

Geopera records **lineage at the moment data is created** — not as an afterthought, a
log, or a best-effort background job. When an operation produces an artifact, the
lineage record is written in the **same transaction** that creates the artifact. If the
lineage can't be recorded, the operation does not succeed.

This page explains what that means in practice, what gets recorded, and why it's a
guarantee rather than a feature you have to enable.
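To make "same transaction" concrete, here is a minimal sketch using Python's built-in `sqlite3`. The schema, the IDs, and the `create_item_with_lineage` helper are assumptions made for illustration; they are not Geopera's actual storage layer.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # Illustrative schema only; not Geopera's real tables.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE artifact (id TEXT PRIMARY KEY, kind TEXT NOT NULL)")
    db.execute(
        "CREATE TABLE edge ("
        " src_id TEXT NOT NULL, dst_id TEXT NOT NULL, relation TEXT NOT NULL)"
    )
    return db


def create_item_with_lineage(db, item_id, source_id, relation):
    # `with db:` wraps both writes in one transaction:
    # commit if both succeed, roll back if either raises.
    with db:
        db.execute("INSERT INTO artifact VALUES (?, 'item')", (item_id,))
        db.execute(
            "INSERT INTO edge VALUES (?, ?, ?)", (source_id, item_id, relation)
        )


db = make_db()
create_item_with_lineage(db, "it_a", "ord_1", "delivered_for")

try:
    # A NULL relation violates NOT NULL, so the lineage write fails...
    create_item_with_lineage(db, "it_b", "ord_1", None)
except sqlite3.IntegrityError:
    pass  # ...and the artifact row for it_b is rolled back with it.

ids = [row[0] for row in db.execute("SELECT id FROM artifact ORDER BY id")]
edge_count = db.execute("SELECT COUNT(*) FROM edge").fetchone()[0]
print(ids, edge_count)  # ['it_a'] 1
```

The second call leaves no trace: an artifact without its lineage row never becomes visible.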

## Provenance by construction

Most systems treat audit trails as something bolted on: a log line, an event emitted
after the fact, a table someone hopefully writes to. Such trails drift from reality: new
code paths get added that forget to log, and the trail develops holes.

Geopera inverts this. The operation pipeline has a fixed shape, and for any operation
that isn't a pure read it runs:

```text
… → open provenance → execute → validate output → seal provenance → audit
```

The **seal** step enforces a contract: if an operation *declares* that it produces an
artifact (an item, an order, a job…), it **must** have emitted that artifact's lineage,
or the operation fails closed with a provenance error — even if the underlying work
"succeeded." There is no flag to turn this off, and no privileged code path that skips
it. A capability that produces data and forgets to record where it came from cannot ship.

The result: every produced artifact carries a complete record of how it came to exist,
by construction.
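The seal contract can be sketched in a few lines. This is a simplified model, not Geopera's kernel: `ProvenanceContext`, `ProvenanceError`, `run_operation`, and the IDs are hypothetical names, and the validate-output and audit steps are omitted.

```python
from dataclasses import dataclass, field


class ProvenanceError(Exception):
    """Raised when a producing operation reaches the seal without lineage."""


@dataclass
class ProvenanceContext:
    declared_kinds: set                      # artifact kinds the operation declares
    emitted: list = field(default_factory=list)

    def emit(self, kind, artifact_id, derived_from):
        self.emitted.append((kind, artifact_id, tuple(derived_from)))

    def seal(self):
        emitted_kinds = {kind for kind, _, _ in self.emitted}
        missing = self.declared_kinds - emitted_kinds
        if missing:
            raise ProvenanceError(f"no lineage emitted for {sorted(missing)}")


def run_operation(declared_kinds, execute):
    ctx = ProvenanceContext(set(declared_kinds))  # open provenance
    result = execute(ctx)                         # execute
    ctx.seal()                                    # seal: fail closed if lineage is missing
    return result


def place_order(ctx):
    ctx.emit("order", "ord_1", derived_from=["capture_1"])
    return "ord_1"


def forgetful_order(ctx):
    return "ord_2"  # the work "succeeded" but no lineage was emitted


ok = run_operation({"order"}, place_order)
try:
    run_operation({"order"}, forgetful_order)
    sealed = True
except ProvenanceError:
    sealed = False
print(ok, sealed)  # ord_1 False
```

Because the check lives in `run_operation` rather than in each operation, an operation author cannot opt out of it.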

## What gets recorded

Each producing operation emits one or more **artifacts**, each linked by **derived-from
edges** to the things it came from. The artifact kinds today:

| Kind | Emitted by (examples) | Typically derived from |
|---|---|---|
| `order` | `orders.archive.place`, `orders.tasking.place` | the captures or areas of interest (AOIs) it was placed for |
| `item` | `uploads.complete`, order delivery, processing output | the upload, order, or job that produced it |
| `asset` | asset ingestion | its parent item |
| `processing_job` | `processing.create`, `clip.create_from_item` | its input items/assets |
| `report` | `reports.generate` | the items/analytics it summarised |
| `share_link` | `share.link.create` | the item or collection it exports |
| `collection`, `project` | `collections.create`, `projects.create` | their parent scope |

For example:

- An **archive order** is recorded as an `order` artifact derived from each capture it
  ordered (`relation: "ordered"`).
- A **delivered item** is recorded as an `item` derived from its order
  (`relation: "delivered_for"`).
- An **uploaded item** is an `item` derived from its upload session
  (`relation: "uploaded"`).
- A **clip output** is an `item` derived from both the processing job
  (`relation: "produced_by"`) and the source item it was clipped from.

Chained together, these edges form a walkable graph: a clipped item → the job that made
it → the source item → the order that delivered the source → the captures that order
bought. Lineage survives even if an upstream source is later archived.
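As a rough illustration of walking that graph, the sketch below stores edges as `(source, derived, relation)` tuples and collects ancestors breadth-first. All IDs are made up, and the relation names marked "assumed" are placeholders, since this page only names `ordered`, `delivered_for`, `uploaded`, and `produced_by`.

```python
from collections import defaultdict, deque

# Each edge reads: `derived` was derived from `source`.
EDGES = [
    ("capture_1", "ord_1", "ordered"),
    ("ord_1", "it_src", "delivered_for"),
    ("it_src", "job_1", "input_of"),        # assumed relation name
    ("job_1", "it_clip", "produced_by"),
    ("it_src", "it_clip", "clipped_from"),  # assumed relation name
]


def ancestors(root, edges, max_depth=20):
    """Return {ancestor_id: depth} for everything `root` was derived from."""
    parents = defaultdict(list)
    for source, derived, _relation in edges:
        parents[derived].append(source)

    depth_of = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if depth_of[node] == max_depth:
            continue  # do not walk past the depth bound
        for parent in parents[node]:
            if parent not in depth_of:  # keep the shortest depth, avoid revisits
                depth_of[parent] = depth_of[node] + 1
                queue.append(parent)

    del depth_of[root]
    return depth_of


lineage = ancestors("it_clip", EDGES)
print(lineage)
# {'job_1': 1, 'it_src': 1, 'ord_1': 2, 'capture_1': 3}
```

The clipped item reaches the source item at depth 1 both directly and through the job, so the walk records it once at its shortest distance.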

## The idempotent no-op

A producing operation can legitimately produce *nothing* on a given call: for example,
`uploads.complete` on an already-completed session, or an order placement that returned
a cached result on an idempotent retry. Such a call marks itself as an idempotent no-op.
The seal contract recognises this and doesn't demand a fresh emit. So idempotent retries
stay safe and provenance stays honest: a record is written exactly when an artifact is
actually created, and not otherwise.
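A minimal sketch of that behaviour, modelled on the `uploads.complete` case. The `Ctx` class, `mark_idempotent_noop`, `complete_upload`, and the in-memory stores are hypothetical stand-ins, not Geopera's interfaces.

```python
class ProvenanceError(Exception):
    pass


class Ctx:
    def __init__(self):
        self.emitted = []   # (kind, artifact_id, derived_from, relation)
        self.noop = False

    def emit(self, kind, artifact_id, derived_from, relation):
        self.emitted.append((kind, artifact_id, derived_from, relation))

    def mark_idempotent_noop(self):
        self.noop = True

    def seal(self):
        # A producing operation must either emit lineage or declare a no-op.
        if not self.emitted and not self.noop:
            raise ProvenanceError("producing operation emitted no lineage")


completed = {}   # session_id -> item_id (stand-in for real state)
lineage = []     # sealed lineage records


def complete_upload(session_id):
    ctx = Ctx()
    if session_id in completed:
        # Retry on a finished session: nothing new exists, so nothing is emitted.
        ctx.mark_idempotent_noop()
        item_id = completed[session_id]
    else:
        item_id = f"it_{len(completed) + 1}"
        completed[session_id] = item_id
        ctx.emit("item", item_id, session_id, "uploaded")
    ctx.seal()
    lineage.extend(ctx.emitted)
    return item_id


first = complete_upload("up_1")
retry = complete_upload("up_1")
print(first, retry, len(lineage))  # it_1 it_1 1
```

The retry returns the same item and passes the seal, but the lineage store still holds exactly one record.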

## Why it matters

- **Auditability** — for any artifact you can answer "where did this come from?" with a
  record that was written atomically with the artifact, not reconstructed from logs.
- **Reproducibility** — the inputs an output was derived from are captured, so a result
  can be traced back to exactly what produced it.
- **Trust across clients** — because the guarantee lives in the kernel, it holds no
  matter who called the operation: the portal, an SDK, an AI agent, or a worker
  registering its job's outputs all produce the same complete lineage.

## Integrity of the record

The provenance store is written through a privileged, security-defined path: operations
emit through the kernel, which writes the artifact and its edges atomically; application
roles cannot write provenance rows directly. So the lineage graph reflects what the
governed operations actually did. A client can't forge it by writing around the emit
path, because the kernel's emit is the only write path there is.

## Reading lineage

Walk the graph over the API with **`provenance.get`** (scope `provenance:read`):

```bash
curl -s -X POST https://api.geopera.com/v1/op/provenance.get \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "entity_type": "item", "entity_id": "it_3a9c...", "direction": "up", "max_depth": 20 }'
```

```json
{
  "root": { "entity_type": "item", "entity_id": "it_3a9c..." },
  "direction": "up",
  "edges": [
    {
      "depth": 1,
      "src_type": "item",
      "src_id": "it_src...",
      "dst_type": "item",
      "dst_id": "it_3a9c...",
      "relation": "duplicated_from",
      "invocation_id": "..."
    }
  ],
  "nodes": [
    { "entity_type": "item", "entity_id": "it_3a9c...", "depth": 0 },
    { "entity_type": "item", "entity_id": "it_src...", "depth": 1 }
  ]
}
```
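One way to turn a response like this into a readable trail is to sort the edges by depth and print each hop. The sketch below assumes only the field names shown in the sample response; `describe_edges` is a hypothetical helper, and the elided IDs are kept exactly as they appear above.

```python
import json

response_text = """
{
  "root": {"entity_type": "item", "entity_id": "it_3a9c..."},
  "direction": "up",
  "edges": [
    {"depth": 1, "src_type": "item", "src_id": "it_src...",
     "dst_type": "item", "dst_id": "it_3a9c...",
     "relation": "duplicated_from", "invocation_id": "..."}
  ],
  "nodes": [
    {"entity_type": "item", "entity_id": "it_3a9c...", "depth": 0},
    {"entity_type": "item", "entity_id": "it_src...", "depth": 1}
  ]
}
"""


def describe_edges(response: dict) -> list:
    """Render each edge as 'derived --relation--> source', nearest hops first."""
    hops = []
    for edge in sorted(response["edges"], key=lambda e: e["depth"]):
        hops.append(
            f'{edge["dst_type"]} {edge["dst_id"]} --{edge["relation"]}--> '
            f'{edge["src_type"]} {edge["src_id"]} (depth {edge["depth"]})'
        )
    return hops


hops = describe_edges(json.loads(response_text))
print("\n".join(hops))
# item it_3a9c... --duplicated_from--> item it_src... (depth 1)
```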

- **`direction`**: `up` (ancestors: "how was this produced?"), `down` (descendants:
  "what came from this?"), or `both`.
- **`max_depth`**: bounds the depth of the walk (1–50).
- Supported root types: `item`, `order`, `processing_job`, `collection`, `project`.
- **Org-scoped:** you can only read lineage for a root your organization owns — a
  cross-org or unknown root returns `404` (no existence leak), exactly like every other
  read.
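The same call can be prepared from Python with only the standard library. This is a sketch rather than an official client: `build_request` is a hypothetical helper, and the client-side checks simply mirror the limits listed above so that obviously invalid calls fail before they reach the API.

```python
import json
import urllib.request

API_URL = "https://api.geopera.com/v1/op/provenance.get"  # from the curl example
ROOT_TYPES = {"item", "order", "processing_job", "collection", "project"}
DIRECTIONS = {"up", "down", "both"}


def build_request(token, entity_type, entity_id, direction="up", max_depth=20):
    """Validate the arguments locally and return an unsent POST request."""
    if entity_type not in ROOT_TYPES:
        raise ValueError(f"unsupported root type: {entity_type!r}")
    if direction not in DIRECTIONS:
        raise ValueError(f"direction must be one of {sorted(DIRECTIONS)}")
    if not 1 <= max_depth <= 50:
        raise ValueError("max_depth must be between 1 and 50")

    body = json.dumps(
        {
            "entity_type": entity_type,
            "entity_id": entity_id,
            "direction": direction,
            "max_depth": max_depth,
        }
    ).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


req = build_request("TOKEN", "item", "it_3a9c...")
payload = json.loads(req.data)
print(req.get_method(), payload["direction"], payload["max_depth"])  # POST up 20
```

Sending it is a matter of passing `req` to `urllib.request.urlopen`, which raises `urllib.error.HTTPError` for a `404`, so a cross-org or unknown root surfaces as an exception rather than an empty graph.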
