deprecate/go-best-effort-is-lfs-tracked #196

@bwalsh

Feature Request: Deprecate Pure-Go .gitattributes Matching in Favor of git check-attr

Summary

Deprecate the existing “best-effort” pure-Go matcher for .gitattributes and standardize on authoritative attribute resolution via git check-attr for all path-based routing, filtering, and policy decisions.

The pure-Go matcher is inherently incomplete and will produce incorrect results in common, real-world Git repositories due to Git’s attribute precedence rules. These failures are subtle, hard to debug, and can lead to incorrect routing (RO vs RW), incorrect enforcement, or data integrity issues.


Motivation

Git attributes are not defined by simple pattern matching against a single file. Git resolves them using:

  • hierarchical precedence
  • multiple attribute sources
  • overrides and negation
  • path-relative scope
  • repo configuration and info files

Re-implementing this logic outside of Git is brittle and error-prone.
Git already exposes the correct resolution mechanism via:

git check-attr

Using Git as the source of truth eliminates ambiguity and guarantees correctness.
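As a rough illustration of what delegating to Git could look like from Go, the sketch below shells out to git check-attr with -z and parses the machine-readable output (path, attribute, and info, each NUL-terminated). The names AttrResult, parseCheckAttrZ, and checkAttr are hypothetical and not taken from the existing codebase. The sketch also assumes git is on PATH and that paths are repo-relative.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// AttrResult is one (path, attribute, value) triple reported by git check-attr.
// Hypothetical type, introduced only for this sketch.
type AttrResult struct {
	Path, Attr, Value string
}

// parseCheckAttrZ parses `git check-attr -z` output, a flat sequence of
// NUL-terminated fields: <path> NUL <attribute> NUL <info> NUL.
func parseCheckAttrZ(out []byte) ([]AttrResult, error) {
	out = bytes.TrimSuffix(out, []byte{0})
	if len(out) == 0 {
		return nil, nil
	}
	fields := bytes.Split(out, []byte{0})
	if len(fields)%3 != 0 {
		return nil, fmt.Errorf("check-attr: got %d fields, want a multiple of 3", len(fields))
	}
	results := make([]AttrResult, 0, len(fields)/3)
	for i := 0; i < len(fields); i += 3 {
		results = append(results, AttrResult{
			Path:  string(fields[i]),
			Attr:  string(fields[i+1]),
			Value: string(fields[i+2]),
		})
	}
	return results, nil
}

// checkAttr asks Git for one attribute of one repo-relative path.
// Assumes the git binary is on PATH and repoDir is inside a worktree.
func checkAttr(repoDir, attr, path string) (string, error) {
	cmd := exec.Command("git", "check-attr", "-z", attr, "--", path)
	cmd.Dir = repoDir
	out, err := cmd.Output()
	if err != nil {
		return "", fmt.Errorf("git check-attr: %w", err)
	}
	results, err := parseCheckAttrZ(out)
	if err != nil {
		return "", err
	}
	if len(results) != 1 {
		return "", fmt.Errorf("check-attr: got %d results, want 1", len(results))
	}
	return results[0].Value, nil
}

func main() {
	// Hand-written sample bytes in the -z format, so this runs without a repository.
	sample := []byte("data/projectA/file.dat\x00drs.route\x00rw\x00")
	results, err := parseCheckAttrZ(sample)
	fmt.Println(results, err)
}
```

Parsing the -z form rather than the default "path: attr: info" lines avoids ambiguity for paths that contain colons or spaces.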


Problem Statement

The current pure-Go matcher:

  • Parses a single .gitattributes file
  • Applies “last match wins” semantics locally
  • Ignores Git’s full attribute resolution rules

This approach cannot faithfully replicate Git behavior and will return incorrect results in many common scenarios.


Typical Failure Scenarios

Below are non-edge-case, real-world situations where a best-effort matcher will give the wrong answer.


1. Nested .gitattributes Files

Scenario

.gitattributes
data/** drs.route=ro

data/projectA/.gitattributes
*.dat drs.route=rw

Path

data/projectA/file.dat

Correct Git behavior

drs.route = rw

Pure-Go failure

  • Only reads the root .gitattributes
  • Returns ro
  • Routes uploads incorrectly

Git consults the .gitattributes file in the path's own directory and in every parent directory up to the top of the worktree. Rules from the closest file take precedence.
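To make the scope of the problem concrete, the sketch below lists the in-tree .gitattributes files that are relevant to a single path, closest directory first. The function name attributeFilesFor is hypothetical. It covers only in-tree files, not .git/info/attributes or global attributes, and it is meant to show how much a single-file matcher skips rather than to serve as a replacement for Git's resolution.

```go
package main

import (
	"fmt"
	"path"
)

// attributeFilesFor lists the in-tree .gitattributes files relevant to a
// repo-relative, slash-separated path, closest directory first.
// Hypothetical helper for illustration only.
func attributeFilesFor(repoPath string) []string {
	var files []string
	dir := path.Dir(path.Clean(repoPath))
	for {
		files = append(files, path.Join(dir, ".gitattributes"))
		// "." is the repository root for relative paths; "/" guards against
		// an absolute path being passed by mistake.
		if dir == "." || dir == "/" {
			break
		}
		dir = path.Dir(dir)
	}
	return files
}

func main() {
	for _, f := range attributeFilesFor("data/projectA/file.dat") {
		fmt.Println(f)
	}
}
```

A matcher that reads only the last entry of this list (the root file) cannot see the rw override in data/projectA/.gitattributes.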


2. Attribute Overrides and Unsets

Scenario

*.dat drs.route=ro
scratch/** -drs.route

Path

scratch/test.dat

Correct Git behavior

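drs.route = unset

Pure-Go failure

  • Treats -drs.route as unknown or ignores it
  • Incorrectly keeps ro

Unset semantics are core to Git attributes and are difficult to model correctly. Note that -drs.route yields unset, which is distinct from unspecified (no rule matched, or the attribute was reset with !drs.route).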
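Git distinguishes four states for an attribute on a path, and git check-attr reports them as set, unset, unspecified, or the assigned value. A caller that delegates to Git still has to keep these apart. The sketch below uses a hypothetical AttrState type and classify function to show one way of doing so.

```go
package main

import "fmt"

// AttrState is the state of one attribute on one path.
// Hypothetical type for this sketch.
type AttrState int

const (
	Unspecified AttrState = iota // no rule matched, or reset with !attr
	Unset                        // explicitly removed with -attr
	Set                          // present with no value (bare attr)
	Valued                       // attr=value
)

// classify maps the info field reported by git check-attr to a state.
func classify(info string) AttrState {
	switch info {
	case "unspecified":
		return Unspecified
	case "unset":
		return Unset
	case "set":
		return Set
	default:
		return Valued
	}
}

func main() {
	for _, info := range []string{"ro", "unset", "unspecified", "set"} {
		fmt.Printf("%-12s -> state %d\n", info, classify(info))
	}
}
```

For routing purposes, both Unset and Unspecified mean that no route applies, but keeping them distinct helps when debugging why a rule did not take effect.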


3. info/attributes and Global Attributes

Git reads attributes from multiple sources:

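Order of precedence (simplified, highest first):

  1. .git/info/attributes
  2. .gitattributes in the same directory as the path
  3. .gitattributes in parent directories, up to the top of the worktree (the farther away, the lower the precedence)
  4. Global attributes (core.attributesFile), then system-wide attributes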

Scenario

.git/info/attributes
TARGET-ALL-P2/** drs.route=ro

No .gitattributes in the repo.

Correct Git behavior

drs.route = ro

Pure-Go failure

  • Never looks at .git/info/attributes
  • Returns unspecified

This is extremely common in controlled or managed repos.


4. Attribute Macros and Composition

Scenario

[attr]readonly drs.route=ro

data/** readonly

Correct Git behavior

drs.route = ro

Pure-Go failure

  • Does not expand attribute macros
  • Misses the route entirely

Macros are first-class Git features and are used heavily in larger repos.


5. Path Normalization and Platform Semantics

Git attribute matching uses:

  • forward-slash normalization
  • repo-relative paths
  • special handling for directories vs files

Scenario

  • Windows paths (\)
  • symlinked worktrees
  • submodules

A custom matcher will almost always diverge from Git’s behavior across platforms.
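Even when Git does the matching, the caller still has to hand it a repo-relative path. The sketch below shows one way to derive that path with the standard library. The function name repoRelative is hypothetical, and it assumes the caller already knows the worktree root (for example from git rev-parse --show-toplevel). It does not resolve symlinks or handle submodules.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// repoRelative converts an OS path into the repo-relative, forward-slash
// form used for attribute lookups. Hypothetical helper for illustration.
func repoRelative(repoRoot, osPath string) (string, error) {
	rel, err := filepath.Rel(repoRoot, osPath)
	if err != nil {
		return "", err
	}
	// On Windows this turns backslashes into forward slashes; elsewhere it is a no-op.
	rel = filepath.ToSlash(rel)
	if rel == ".." || strings.HasPrefix(rel, "../") {
		return "", fmt.Errorf("%s is outside repository %s", osPath, repoRoot)
	}
	return rel, nil
}

func main() {
	rel, err := repoRelative("/repo", "/repo/data/projectA/file.dat")
	fmt.Println(rel, err)
}
```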


6. Renames and History-Sensitive Evaluation

Git evaluates attributes based on the current tree context, not historical assumptions.

Scenario

  • File moved from scratch/ to TARGET-ALL-P2/
  • Different routing rules apply

Correct Git behavior

  • Attributes reflect current path

Pure-Go failure

  • Cached or inferred rules from old paths
  • Incorrect routing after renames

Impact

Incorrect attribute resolution can cause:

  • Files routed to the wrong backend (RO vs RW)
  • Uploads denied or allowed incorrectly
  • Silent policy violations
  • Extremely difficult debugging (“works locally but not in CI”)

Because Git itself resolves attributes whenever it applies them, any divergence between the matcher and Git introduces correctness risk.


Proposed Change

Deprecate

  • The “best-effort pure-Go .gitattributes matcher”

Standardize On

  • Calling git check-attr for all attribute lookups

Example:

git check-attr drs.route -- path/to/file

This provides:

  • Exact Git semantics
  • Correct precedence handling
  • Consistent behavior across platforms and environments
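Spawning one git process per path may be too slow for large file sets. A possible approach is to batch lookups with git check-attr --stdin -z, which reads NUL-separated paths and writes NUL-terminated (path, attribute, info) fields. The sketch below assumes a single attribute per call and git on PATH. The names encodePaths, decodeValues, and attrForPaths are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// encodePaths NUL-terminates each path, the input format read by
// `git check-attr --stdin -z`.
func encodePaths(paths []string) []byte {
	var buf bytes.Buffer
	for _, p := range paths {
		buf.WriteString(p)
		buf.WriteByte(0)
	}
	return buf.Bytes()
}

// decodeValues turns -z output (<path> NUL <attribute> NUL <info> NUL,
// repeated) into a path -> info map. Assumes one attribute was queried.
func decodeValues(out []byte) (map[string]string, error) {
	values := make(map[string]string)
	out = bytes.TrimSuffix(out, []byte{0})
	if len(out) == 0 {
		return values, nil
	}
	fields := bytes.Split(out, []byte{0})
	if len(fields)%3 != 0 {
		return nil, fmt.Errorf("check-attr: got %d fields, want a multiple of 3", len(fields))
	}
	for i := 0; i < len(fields); i += 3 {
		values[string(fields[i])] = string(fields[i+2])
	}
	return values, nil
}

// attrForPaths resolves one attribute for many repo-relative paths
// using a single git process.
func attrForPaths(repoDir, attr string, paths []string) (map[string]string, error) {
	cmd := exec.Command("git", "check-attr", "--stdin", "-z", attr)
	cmd.Dir = repoDir
	cmd.Stdin = bytes.NewReader(encodePaths(paths))
	out, err := cmd.Output()
	if err != nil {
		return nil, fmt.Errorf("git check-attr --stdin: %w", err)
	}
	return decodeValues(out)
}

func main() {
	// Shows the bytes that would be written to git's stdin.
	fmt.Printf("%q\n", encodePaths([]string{"data/a.dat", "scratch/b.dat"}))
}
```

A caller would use it as attrForPaths(repoDir, "drs.route", paths) and look up each path in the returned map.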

Migration Plan

  1. Mark the pure-Go matcher as deprecated

  2. Update internal callers to use git check-attr

  3. Retain the pure-Go matcher only as:

    • a test helper, or
    • a last-ditch fallback with explicit warnings
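The "last-ditch fallback with explicit warnings" option in step 3 could be wired up roughly as follows. This is only a sketch: Resolver, withFallback, and errGitUnavailable are hypothetical names, and the two resolvers in main are stand-ins for the git-backed lookup and the deprecated matcher.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Resolver looks up one attribute for one repo-relative path.
// Hypothetical type for this sketch.
type Resolver func(path, attr string) (string, error)

// errGitUnavailable is a stand-in error used by the demo in main.
var errGitUnavailable = errors.New("git not available")

// withFallback prefers primary and uses fallback only when primary fails,
// emitting a warning every time the fallback path is taken.
func withFallback(primary, fallback Resolver, warn func(format string, args ...any)) Resolver {
	return func(path, attr string) (string, error) {
		v, err := primary(path, attr)
		if err == nil {
			return v, nil
		}
		warn("authoritative lookup of %s for %s failed (%v); using deprecated best-effort matcher", attr, path, err)
		return fallback(path, attr)
	}
}

func main() {
	// Stand-ins: a git-backed resolver that fails and a legacy matcher.
	gitResolver := func(path, attr string) (string, error) { return "", errGitUnavailable }
	legacy := func(path, attr string) (string, error) { return "ro", nil }

	resolve := withFallback(gitResolver, legacy, log.Printf)
	v, err := resolve("data/file.dat", "drs.route")
	fmt.Println(v, err)
}
```

Keeping the warning on every fallback call makes any remaining reliance on the deprecated matcher visible in logs.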

Alternatives Considered

  • Re-implement full Git attribute resolution in Go
    ❌ High complexity, high maintenance, guaranteed drift over time

  • Maintain both implementations
    ❌ Ambiguous correctness, inconsistent behavior

Using Git itself is the simplest, most robust solution.


Recommendation

Deprecate the pure-Go attribute matcher and remove it from all routing and policy decisions in favor of authoritative resolution via git check-attr.

Git already solved this problem. We should not re-implement it.


Additional Rationale: Typical Git LFS Filter Scenarios Where Best-Effort Matching Fails

Git LFS usage amplifies the risk of incorrect attribute resolution because filter decisions affect both content storage and transfer semantics. A wrong answer doesn’t just misroute metadata — it can lead to missing objects, failed pushes, or corrupted workflows.

Below are common, real-world LFS patterns where a best-effort .gitattributes matcher will fail.


1. Mixed LFS / Non-LFS Paths with Overrides

Scenario

*.dat filter=lfs diff=lfs merge=lfs -text

# Explicitly exclude scratch outputs
scratch/** -filter -diff -merge

Path

scratch/results/output.dat

Correct Git behavior

  • filter = unset (NOT LFS)
  • File is stored directly in Git

Best-effort failure

  • Sees *.dat filter=lfs
  • Ignores or mishandles -filter
  • Treats file as LFS-managed

Impact

  • Pointer file written where raw content was expected
  • Downstream tools fail on unexpected pointer files
  • Users see “why is my scratch output in LFS?”
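Once attribute values come from git check-attr, the LFS decision itself reduces to a string comparison on the filter attribute. The sketch below uses hand-written sample results (a -filter rule makes git check-attr report unset). The attrValue type, the lfsTrackedPaths function, and the paths data/input.dat and README.md are hypothetical.

```go
package main

import "fmt"

// attrValue is one git check-attr result: the path, the attribute name,
// and the reported info ("set", "unset", "unspecified", or a value).
// Hypothetical type for this sketch.
type attrValue struct {
	Path, Attr, Info string
}

// lfsTrackedPaths returns the paths whose filter attribute resolved to "lfs".
// Anything else (unset, unspecified, another filter) is not LFS-managed.
func lfsTrackedPaths(results []attrValue) []string {
	var tracked []string
	for _, r := range results {
		if r.Attr == "filter" && r.Info == "lfs" {
			tracked = append(tracked, r.Path)
		}
	}
	return tracked
}

func main() {
	// Hand-written results, as git check-attr would report them for
	// the rules in this scenario.
	results := []attrValue{
		{"data/input.dat", "filter", "lfs"},
		{"scratch/results/output.dat", "filter", "unset"},
		{"README.md", "filter", "unspecified"},
	}
	fmt.Println(lfsTrackedPaths(results)) // prints [data/input.dat]
}
```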

2. Nested LFS Rules with Directory-Scoped Overrides

Scenario

.gitattributes
*.bin filter=lfs

data/raw/.gitattributes
*.bin -filter

Path

data/raw/sample.bin

Correct Git behavior

  • Not tracked by LFS

Best-effort failure

  • Only evaluates root .gitattributes
  • Treats file as LFS-managed

Impact

  • Large raw files unintentionally pushed through LFS
  • Uploads fail or are routed incorrectly
  • Hard to diagnose because the rule looks correct to the user

3. LFS Enablement via Attribute Macros

Scenario

[attr]lfsdata filter=lfs diff=lfs merge=lfs -text

*.bam lfsdata
*.cram lfsdata

Correct Git behavior

  • .bam and .cram files are LFS-tracked

Best-effort failure

  • Does not expand attribute macros
  • Returns filter=unspecified

Impact

  • Large genomics files committed directly into Git
  • Repository bloat
  • Silent failure until repo size explodes

This pattern is very common in scientific and media repositories.


4. info/attributes Used to Enforce LFS Globally

Scenario

.git/info/attributes
*.mp4 filter=lfs diff=lfs merge=lfs -text

No .gitattributes committed to the repo.

Correct Git behavior

  • .mp4 files are LFS-managed

Best-effort failure

  • Never reads .git/info/attributes
  • Treats files as non-LFS

Impact

  • CI and developer machines behave differently
  • LFS rules appear to “randomly not apply”
  • Violates operator expectations in managed environments

5. Conditional LFS Usage by Directory

Scenario

data/** filter=lfs
data/tmp/** -filter

Path

data/tmp/intermediate.bin

Correct Git behavior

  • Not LFS-managed

Best-effort failure

  • Applies first match only
  • Or applies both incorrectly
  • Returns filter=lfs

Impact

  • Temporary/intermediate files end up as LFS pointers
  • Users delete temp dirs and break LFS history
  • Garbage collection and pruning become unsafe

6. Rename-Sensitive LFS Semantics

Scenario

  • File initially in scratch/ (not LFS)
  • Later renamed to data/ (LFS-tracked)

scratch/** -filter
data/**    filter=lfs

Correct Git behavior

  • LFS applies based on current path, not history

Best-effort failure

  • Cached or inferred rules based on old location
  • Incorrectly treats renamed file as non-LFS

Impact

  • Pointer not created when expected
  • Push fails with (missing) because bytes aren’t in LFS store
  • Extremely confusing user experience

7. Cross-Platform Path Matching Issues

Git attribute matching:

  • normalizes to /
  • applies repo-relative paths
  • handles directories specially

Best-effort failure modes

  • Windows \ paths
  • Case sensitivity mismatches
  • Incorrect matching for ** patterns

Impact

  • LFS works on macOS/Linux, fails on Windows
  • Routing differs between developer machines and CI

Why This Matters More for LFS Than Other Attributes

For attributes like text or eol, a wrong answer is annoying.

For LFS, a wrong answer can cause:

  • pointer files where raw data is expected
  • raw data where pointers are required
  • missing objects at push time
  • irreversible repo pollution

Because LFS affects storage, transport, and history, correctness is non-negotiable.


Conclusion (Reinforced)

Any best-effort .gitattributes matcher will inevitably diverge from Git’s behavior in common LFS use cases.

For LFS-related decisions (filter=lfs, routing, policy enforcement):

git check-attr is not just preferable — it is required for correctness.

This strengthens the case to deprecate the pure-Go matcher entirely and rely on Git as the single source of truth.
