Improve station filtering with quality evaluation pipeline #74

Merged
bkeepers merged 4 commits into main from rewrite-filtering-v2 on Mar 3, 2026

@bkeepers bkeepers commented Mar 3, 2026

The previous approach filtered TICON stations during import using ad-hoc rules for proximity to other stations. This replaces that with a single, auditable quality evaluation pipeline that treats all stations uniformly.

Quality evaluation pipeline

tools/evaluate-quality.ts evaluates all stations and produces quality.json, which is checked in and used as the source of truth for filtering stations at runtime.

$ tools/evaluate-quality.ts
Loading stations...
Loaded 8279 stations (NOAA: 3451, TICON: 4828)

Computing quality factors...
Computing coverage...
Deduplicating...

=== Quality Evaluation Summary ===

Accepted: 6075 (NOAA: 3426, TICON: 2649)
Rejected: 2204
  constituents: 40
  datum: 421
  duplicate: 944
  range: 169
  superseded: 630

Score distribution (accepted stations):
  Min: 32  P25: 74  Median: 83  P75: 84  Max: 100

Accepted stations with issues: 581

Wrote quality.json (8279 entries)

The evaluation:

  • Applies hard gates that immediately reject: datum ordering violations, negligible tidal range (<2cm MHW–MLW), superseded sources (noaa/rws_hist suffixes in TICON), and missing or zero-amplitude essential constituents (M2, S2, K1, O1).
  • Computes six scored factors (0–1 each): epoch length (19-year nodal cycle target), recency of epoch, source authority tier, quality flags, amplitude plausibility, and geographic coverage value.
  • Deduplicates surviving stations regardless of source. The higher-scoring station wins, except a NOAA reference with accepted subordinates always beats a source-only station. Duplicates are defined as stations that are:
    • within 50m (always)
    • within 100m when one station has accepted subordinates depending on it
    • within 500m if their M2 amplitude ratio ≥ 0.9 (harmonically similar)
  • Handles three edge cases from the previous pipeline: NOAA stations with placeholder epoch dates (0000-0000) receive benefit-of-the-doubt scores; subordinate stations inherit epoch/recency from their reference; NOAA reference stations with subordinate dependents are protected from dedup loss.
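
As a rough illustration of the three duplicate rules above, the check might look like the following sketch. The `Candidate` shape, the haversine distance, the inclusive thresholds, and the smaller-over-larger definition of the M2 amplitude ratio are all assumptions made for this example; the actual implementation in tools/evaluate-quality.ts is not shown in this description.

```typescript
// Hypothetical station shape for illustration only.
interface Candidate {
  id: string;
  lat: number; // degrees
  lon: number; // degrees
  m2Amplitude: number; // metres
  hasAcceptedSubordinates: boolean;
}

const EARTH_RADIUS_M = 6371000;

// Great-circle distance in metres (haversine formula).
function distanceMeters(a: Candidate, b: Candidate): number {
  const toRad = (deg: number): number => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h =
    Math.pow(Math.sin(dLat / 2), 2) +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.pow(Math.sin(dLon / 2), 2);
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h));
}

// Assumption: "ratio" means smaller amplitude over larger, so it lies in [0, 1].
function m2Ratio(a: Candidate, b: Candidate): number {
  const hi = Math.max(a.m2Amplitude, b.m2Amplitude);
  const lo = Math.min(a.m2Amplitude, b.m2Amplitude);
  return hi > 0 ? lo / hi : 0;
}

// True when the pair matches any of the three proximity rules.
function isDuplicate(a: Candidate, b: Candidate): boolean {
  const d = distanceMeters(a, b);
  if (d <= 50) return true; // always a duplicate
  if (d <= 100 && (a.hasAcceptedSubordinates || b.hasAcceptedSubordinates)) return true;
  return d <= 500 && m2Ratio(a, b) >= 0.9; // harmonically similar
}
```

This sketch only decides whether two stations are duplicates. Choosing the winner (higher score, with the NOAA reference exception) would be a separate step.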

Quality result

The evaluation produces quality.json (generated, checked in), which contains a record for each station with its accepted status, score (0–100), contributing factors, issues found, reason for rejection if applicable, and whether it was marked redundant in deduplication.

Rejection breakdown:

  • superseded: 630 — TICON stations duplicating NOAA/RWS data we have from the primary source
  • duplicate: 944 — deemed a duplicate of another station based on proximity and similarity, with the lower-scoring station rejected
  • datum: 421 — definitional datum ordering violations (MHW ≤ MSL, MSL ≤ MLW, MLW < LAT, MHW < MLLW, HAT < MHHW)
  • range: 169 — negligible tidal range < 2cm
  • constituents: 40 — missing essential constituents, zero-amplitude essentials, or P1 > K1 (physically impossible)
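
The hard gates behind these rejection reasons could be sketched roughly as below. The `GateInput` shape, the ID suffix test, the unit (metres), and especially the order in which gates are checked are assumptions for illustration; only the individual conditions are taken from the lists in this description. The `duplicate` reason is not covered because it comes from the later deduplication step.

```typescript
type GateReason = "superseded" | "datum" | "range" | "constituents";
type DatumName = "HAT" | "MHHW" | "MHW" | "MSL" | "MLW" | "MLLW" | "LAT";
type Datums = Partial<Record<DatumName, number>>;

// Hypothetical input shape for illustration only.
interface GateInput {
  id: string; // e.g. "ticon/atka-9461710-usa-noaa" (assumed ID format)
  source: "noaa" | "ticon";
  datums: Datums; // metres
  constituents: { [name: string]: number | undefined }; // amplitude in metres
}

const ESSENTIAL = ["M2", "S2", "K1", "O1"];

// Compares two datums; a missing datum never counts as a violation.
function violates(d: Datums, a: DatumName, op: "<" | "<=", b: DatumName): boolean {
  const x = d[a];
  const y = d[b];
  if (x === undefined || y === undefined) return false;
  return op === "<" ? x < y : x <= y;
}

// The five ordering violations listed in the rejection breakdown.
function hasDatumViolation(d: Datums): boolean {
  return (
    violates(d, "MHW", "<=", "MSL") ||
    violates(d, "MSL", "<=", "MLW") ||
    violates(d, "MLW", "<", "LAT") ||
    violates(d, "MHW", "<", "MLLW") ||
    violates(d, "HAT", "<", "MHHW")
  );
}

// Returns the first failing gate, or null if the station passes all gates.
// The gate order here is an assumption, not taken from the PR.
function hardGate(s: GateInput): GateReason | null {
  if (s.source === "ticon" && /-(noaa|rws_hist)$/.test(s.id)) return "superseded";
  if (hasDatumViolation(s.datums)) return "datum";
  const { MHW, MLW } = s.datums;
  if (MHW !== undefined && MLW !== undefined && MHW - MLW < 0.02) return "range";
  const amp = (name: string): number => s.constituents[name] ?? 0;
  if (!ESSENTIAL.every((name) => amp(name) > 0)) return "constituents";
  if (amp("P1") > amp("K1")) return "constituents";
  return null;
}
```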

The issues array provides detailed flags for each station, which can be used for debugging or further analysis.

Here are example records for an accepted station and a rejected station:

[
  {
    "id": "noaa/8725649",
    "accepted": true,
    "score": 80,
    "factors": {
      "epoch": 1,
      "recency": 0.75,
      "source": 1,
      "quality": 1,
      "amplitude": 0.75,
      "coverage": 0.181
    },
    "issues": ["K2 amplitude (0.0360) exceeds S2 (0.0200)"]
  },
  {
    "id": "noaa/8762483",
    "accepted": false,
    "score": 0,
    "factors": {
      "epoch": 1,
      "recency": 0.75,
      "source": 1,
      "quality": 1,
      "amplitude": 0,
      "coverage": 0
    },
    "issues": [
      "MLW (-0.04) < MLLW (-0.039)",
      "Missing constituents for prediction: M2, S2, K1, O1"
    ],
    "reason": "constituents"
  }
]
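
For consumers of quality.json, the record shape in the example above can be written down as a type. This is a minimal sketch inferred from the two example records only: it assumes the file is an array of records, and it omits the deduplication flag because its field name does not appear in the example. The `summarize` helper is hypothetical and simply tallies records the way the evaluation summary does.

```typescript
// Field names taken from the example records; everything else is an assumption.
interface QualityFactors {
  epoch: number;
  recency: number;
  source: number;
  quality: number;
  amplitude: number;
  coverage: number;
}

interface QualityRecord {
  id: string;
  accepted: boolean;
  score: number; // 0-100
  factors: QualityFactors; // each 0-1
  issues: string[];
  reason?: string; // present only on rejected records
}

interface QualitySummary {
  accepted: number;
  rejected: number;
  byReason: { [reason: string]: number };
}

// Hypothetical helper: counts accepted records and rejections per reason.
function summarize(records: QualityRecord[]): QualitySummary {
  const summary: QualitySummary = { accepted: 0, rejected: 0, byReason: {} };
  for (const record of records) {
    if (record.accepted) {
      summary.accepted += 1;
      continue;
    }
    summary.rejected += 1;
    const reason = record.reason ?? "unknown";
    summary.byReason[reason] = (summary.byReason[reason] ?? 0) + 1;
  }
  return summary;
}
```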

Runtime filtering

quality.json is loaded at runtime. The npm module now exports both stations (filtered) and allStations (unfiltered):

import { stations, allStations } from '@neaps/tide-database';
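
Conceptually, the filtered list is the unfiltered list restricted to IDs that quality.json marks as accepted. The sketch below illustrates that relationship with stand-in types; `filterAccepted`, `StationLike`, and `QualityEntry` are hypothetical names, and how the module actually performs the filtering internally is not shown in this description.

```typescript
// Stand-in types for illustration; not the package's real types.
interface StationLike {
  id: string;
}

interface QualityEntry {
  id: string;
  accepted: boolean;
}

// Keeps only stations whose ID has an accepted quality entry.
// Assumption: stations with no entry at all are treated as not accepted.
function filterAccepted<T extends StationLike>(all: T[], quality: QualityEntry[]): T[] {
  const acceptedIds = new Set(quality.filter((q) => q.accepted).map((q) => q.id));
  return all.filter((station) => acceptedIds.has(station.id));
}
```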

@bkeepers bkeepers force-pushed the rewrite-filtering-v2 branch from 23a0264 to edb6df0 Compare March 3, 2026 15:08
@bkeepers bkeepers requested a review from Copilot March 3, 2026 15:21
Copilot AI left a comment

Pull request overview

This PR replaces ad-hoc, source-specific station filtering with a single quality evaluation pipeline that generates an auditable quality.json used for runtime filtering, and updates the repository’s station data set accordingly.

Changes:

  • Adds many new TICON station JSON records to the dataset.
  • Updates Prettier configuration to ignore the generated quality.json.
  • (Per description) Introduces a quality evaluation pipeline and runtime filtering based on quality.json.

Reviewed changes

Copilot reviewed 70 out of 1842 changed files in this pull request and generated no comments.

File Description
data/ticon/atka-9461710-usa-noaa.json Adds a new station record (inputs to quality evaluation).
data/ticon/astoria-9439040-usa-noaa.json Adds a new station record (inputs to quality evaluation).
data/ticon/arun_platform-arn-gbr-cco.json Adds a new station record (inputs to quality evaluation).
data/ticon/artificial_is-8537614-usa-noaa.json Adds a new station record (inputs to quality evaluation).
data/ticon/armitage_island-9449932-usa-noaa.json Adds a new station record (inputs to quality evaluation).
data/ticon/arko-2545-swe-smhi.json Adds a new station record (inputs to quality evaluation).
data/ticon/arena_cove-9416841-usa-noaa.json Adds a new station record (inputs to quality evaluation).
data/ticon/apra_harbor-1630000-usa-noaa.json Adds a new station record (inputs to quality evaluation).
data/ticon/apalachicola-8728690-usa-noaa.json Adds a new station record (inputs to quality evaluation).
data/ticon/anzio-anz-ita-ispra.json Adds a new station record (inputs to quality evaluation).
data/ticon/annapolis-8575512-usa-noaa.json Adds a new station record (inputs to quality evaluation).
data/ticon/anklam-9660001-deu-wsv.json Adds a new station record (inputs to quality evaluation).
data/ticon/andenes-anx-nor-nhs.json Adds a new station record (inputs to quality evaluation).
data/ticon/anchorage-9455920-usa-noaa.json Adds a new station record (inputs to quality evaluation).
data/ticon/american_falls_ny-9063010-usa-noaa.json Adds a new station record (inputs to quality evaluation).
data/ticon/althagen-9650024-deu-wsv.json Adds a new station record (inputs to quality evaluation).
data/ticon/alpena_mi-9075065-usa-noaa.json Adds a new station record (inputs to quality evaluation).
data/ticon/alliance-8761727-usa-noaa.json Adds a new station record (inputs to quality evaluation).
data/ticon/alitak-9457804-usa-noaa.json Adds a new station record (inputs to quality evaluation).
data/ticon/alicante_i-ali-esp-da_mm.json Adds a new station record (inputs to quality evaluation).
data/ticon/alexandria-807-egy-uhslc_fd.json Adds a new station record (inputs to quality evaluation).
data/ticon/alblasserdam-albsdm-nld-rws_hist.json Adds a new station record (inputs to quality evaluation).
data/ticon/albany-8518995-usa-noaa.json Adds a new station record (inputs to quality evaluation).
data/ticon/alameda-9414750-usa-noaa.json Adds a new station record (inputs to quality evaluation).
data/ticon/aguadilla-9759412-usa-noaa.json Adds a new station record (inputs to quality evaluation).
data/ticon/adak_island-9461380-usa-noaa.json Adds a new station record (inputs to quality evaluation).
data/ticon/acapulcoapi_radar-41-mex-unam.json Adds a new station record (inputs to quality evaluation).
data/ticon/acapulcoapi_nivelcbs-41-mex-unam.json Adds a new station record (inputs to quality evaluation).
data/ticon/acajutla-082-slv-uhslc_fd.json Adds a new station record (inputs to quality evaluation).
data/ticon/aberdeen-9441187-usa-noaa.json Adds a new station record (inputs to quality evaluation).
.prettierignore Ignores generated quality.json to avoid formatting churn.
Comments suppressed due to low confidence (5)

data/ticon/arun_platform-arn-gbr-cco.json:27

  • These datum values appear internally inconsistent: MHW is greater than MHHW, and MLLW is greater than MLW. If your quality pipeline rejects datum ordering violations (as described in the PR), this station will be auto-rejected. Suggest verifying the sourced datum values and correcting them (or, if these values are intentionally preserved from upstream despite being invalid, update the station metadata to clearly indicate the known datum ordering issue).
    "MHHW": 2.122,
    "MHW": 2.124,
    "MSL": 0,
    "MTL": 0.055,
    "MLW": -2.013,
    "MLLW": -2.012,

data/ticon/anklam-9660001-deu-wsv.json:23

  • MHW is greater than MHHW, which is a datum ordering violation. If the new evaluation pipeline hard-rejects datum ordering problems, this record will be rejected; please verify/correct the datum values or flag this explicitly as a known upstream issue.
    "MHHW": 0.002,
    "MHW": 0.003,

data/ticon/alliance-8761727-usa-noaa.json:23

  • MHHW is lower than MHW, which is a datum ordering violation. This will likely trigger the pipeline’s datum hard-gate; please verify/correct the datum values (or annotate as a known upstream issue if keeping the original values is intentional).
    "MHHW": -0.004,
    "MHW": 0.005,

data/ticon/arun_platform-arn-gbr-cco.json:20

  • This record’s datums show clear ordering issues (e.g., MHW > MHHW, MLLW > MLW), so the disclaimer "No obvious issues" is misleading. Suggest updating the disclaimer to reflect the known issue if keeping the values as-is.
  "disclaimers": "No obvious issues",

data/ticon/american_falls_ny-9063010-usa-noaa.json:2

  • The station name casing/abbreviation looks inconsistent with typical place naming (e.g., state abbreviation). Consider normalizing to a consistent convention used elsewhere in the dataset (for example, "American Falls, NY" or "American Falls (NY)") to improve readability.
  "name": "American Falls Ny",

@bkeepers bkeepers force-pushed the rewrite-filtering-v2 branch from 18c4bf5 to edb6df0 Compare March 3, 2026 22:00
@bkeepers bkeepers merged commit 28f0d16 into main Mar 3, 2026
6 checks passed
@bkeepers bkeepers deleted the rewrite-filtering-v2 branch March 3, 2026 22:23
@bkeepers bkeepers changed the title from "Rewrite filtering" to "Improve station filtering with quality evaluation pipeline" Mar 4, 2026