Clarify role of ChainID

The OCI Image config document covers the calculation of the `ChainID`, but it doesn't go into why this is useful or how best to leverage it.

The best way to view it is as a hash of the ordered sequence of applied layers.

Let's say we have layers A, B, C, ordered from bottom to top, where A is the base and C is the top. Defining `|` as a binary application operator, the root filesystem may be `A|B|C`. While it is implied that `C` is only useful when applied to `A|B`, the identifier `C` is insufficient to identify this result, because using it that way would assert the equality `C = A|B|C`, which isn't true.

The main issue arises when `C` carries two definitions, `C = C` and `C = A|B|C`. If both hold (with some handwaving), then `C = x|C` must be true for any application `x`. This means that if an attacker can define `x`, relying on `C` provides no guarantee about which layers were applied beneath it, or in what order.

The `ChainID` addresses this problem by being defined as a compound hash. __We differentiate the changeset `C` from the order-dependent application `A|B|C` by saying that the resulting rootfs is identified by `ChainID(A|B|C)`, which can be calculated from `ImageConfig.rootfs`.__
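To make the distinction concrete, here is a minimal Python sketch. It assumes that DiffIDs and ChainIDs are digest strings of the form `sha256:<hex>` and that the compound hash is SHA-256 over the parent ChainID, a single space, and the next DiffID. The helpers `diff_id` and `chain_id` and the layer contents are hypothetical stand-ins, not part of the spec.

```python
import hashlib


def diff_id(content: bytes) -> str:
    """Hypothetical stand-in for a DiffID: a digest of the layer's content."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


def chain_id(diff_ids: list[str]) -> str:
    """Fold the ordered DiffIDs (bottom layer first) into one ChainID."""
    if not diff_ids:
        raise ValueError("at least one layer is required")
    chain = diff_ids[0]  # a single layer is identified by its DiffID
    for d in diff_ids[1:]:
        chain = "sha256:" + hashlib.sha256(f"{chain} {d}".encode()).hexdigest()
    return chain


a, b, c = (diff_id(x) for x in (b"layer A", b"layer B", b"layer C"))
x = diff_id(b"attacker-controlled layer")

trusted = chain_id([a, b, c])   # identifies the result of A|B|C
tampered = chain_id([x, c])     # identifies the result of x|C

# Both stacks end in the same changeset C, yet the identifiers differ.
print(trusted == tampered)  # prints False
```

Both stacks share `DiffID(C)`, so the changeset identifier alone cannot tell them apart, while the chained identifier changes as soon as anything beneath `C` changes.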

The definition from the spec is something like this (also, see the [base implementation](https://github.com/docker/docker/blob/85bc735b4a56223c84971839d819ff8dc494c181/layer/layer.go#L254)):

```
ChainID(layer[N]) = SHA256hex(ChainID(layer[N-1]) + " " + DiffID(layer[N]))
```

(Note that this definition is slightly insufficient, because it implies that layer[N] is `layer[0]|...|layer[N-1]|layer[N]`, which, as indicated above, doesn't quite add up.)

With our expanded example, we can write a symbolic definition of `ChainID(C)`, which is a variation on some function `Hchain(A|B|C)`, with some notation hand-waving.

```
ChainID(A) = DiffID(A)
ChainID(A|B) = SHA256(ChainID(A) + " " + DiffID(B))
ChainID(A|B|C) = SHA256(ChainID(A|B) + " " + DiffID(C))
```

(Note that the spec's definition may be missing the base case, `ChainID(A) = DiffID(A)`, as well.)

Let's expand this, for fun:

```
ChainID(A|B|C) = SHA256(SHA256(DiffID(A) + " " + DiffID(B)) + " " + DiffID(C))
```

Hopefully, the above is illustrative of the _actual_ contents of the `ChainID`.

Most importantly, `ChainID(C) != ChainID(A|B|C)`. Otherwise, the base case `ChainID(C) = DiffID(C)` could not be true.

Taking these considerations into account, we can write a new definition in the following form:
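The expansion and the inequality above can be checked numerically. The following is a sketch under assumptions: the helper `digest` and the placeholder layer contents are hypothetical, and digests are written as `sha256:<hex>` strings, whereas the pseudocode above writes plain `SHA256`.

```python
import hashlib


def digest(text: str) -> str:
    """Assumed digest format: 'sha256:' followed by the hex SHA-256 of text."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


# Placeholder DiffIDs; real ones come from the image config.
diff_a = digest("contents of layer A")
diff_b = digest("contents of layer B")
diff_c = digest("contents of layer C")

# Step-by-step form.
chain_a = diff_a
chain_ab = digest(chain_a + " " + diff_b)
chain_abc = digest(chain_ab + " " + diff_c)

# Fully expanded form.
expanded = digest(digest(diff_a + " " + diff_b) + " " + diff_c)

assert chain_abc == expanded   # both forms agree
assert chain_abc != diff_c     # ChainID(A|B|C) is not ChainID(C) = DiffID(C)
```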

```
ChainID(L0) = DiffID(L0)
ChainID(L0|...|Ln-1|Ln) = SHA256(ChainID(L0|...|Ln-1) + " " + DiffID(Ln))
```

While the notation is a little obtuse (suggestions welcome), it better reflects the recursive nature of the algorithm and the fact that the `ChainID` is not a property of the layer, but a property of the application of layers.
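One way to read the two-line definition is as a recursive function over the ordered list of DiffIDs. The sketch below is illustrative only: `digest` and `chain_id` are hypothetical names, and the `sha256:<hex>` string format is an assumption rather than something stated in this issue.

```python
import hashlib


def digest(text: str) -> str:
    """Assumed digest format: 'sha256:' followed by the hex SHA-256 of text."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def chain_id(diff_ids: list[str]) -> str:
    """ChainID of the application L0|...|Ln, given DiffIDs bottom layer first."""
    if not diff_ids:
        raise ValueError("ChainID is undefined for an empty layer list")
    if len(diff_ids) == 1:
        # Base case: ChainID(L0) = DiffID(L0)
        return diff_ids[0]
    *lower, top = diff_ids
    # Recursive case: hash the ChainID of the lower layers with the top DiffID.
    return digest(chain_id(lower) + " " + top)
```

The function takes the whole list rather than a single layer, which reflects the point that the `ChainID` belongs to the application of layers and not to any one of them.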

This has the following implications:

- [ ] Update the specification (#586)
  - [ ] Provide better context on the usage and role of ChainID -> implementations should use it to identify the unpacking result.
  - [ ] Clarify the recursive nature of this algorithm.
- [x] Provide implementation of `ChainID` function. (#486)

Clarify role of ChainID #482
