chitralabs/schemamatch

schemamatch

Compare and diff two tabular datasets (CSV, Excel, JSON) in Java with zero runtime dependencies.


Features

  • Zero runtime dependencies — pure Java 11, no transitive classpath pollution
  • 📄 Multi-format — CSV, TSV, XLSX, XLS, and JSON array inputs
  • 🔑 Key-column row matching — diff by business key, not row position
  • 🔢 Numeric tolerance — flag 0.5% price differences, ignore floating-point noise
  • 📅 Date normalization: 2024-01-15 == 01/15/2024 when enabled
  • 🔤 Case-insensitive comparison option
  • 📊 HTML reports — beautiful inline-CSS diff report, zero dependencies
  • 🗄️ JSON reports — machine-readable output for CI pipelines
  • 🚀 Streaming mode — 100K-row files under 50MB heap
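To make the comparison features above concrete, here is a minimal sketch of how a single cell comparison could combine numeric tolerance, date normalization, and case-insensitive matching. It is not the schemamatch implementation: the class name CellComparison, the method names, the choice of relative tolerance, and the two accepted date patterns are all assumptions made for illustration.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;

/** Illustrative sketch only; names and rules here are assumptions, not the library's API. */
final class CellComparison {

    // Hypothetical list of accepted date patterns.
    private static final List<DateTimeFormatter> DATE_FORMATS = List.of(
            DateTimeFormatter.ISO_LOCAL_DATE,            // 2024-01-15
            DateTimeFormatter.ofPattern("MM/dd/uuuu"));  // 01/15/2024

    /** Returns the parsed date, or null if no known pattern matches. */
    static LocalDate parseDateOrNull(String s) {
        for (DateTimeFormatter format : DATE_FORMATS) {
            try {
                return LocalDate.parse(s.trim(), format);
            } catch (DateTimeParseException ignored) {
                // try the next pattern
            }
        }
        return null;
    }

    /** Returns the parsed number, or null if the text is not numeric. */
    static Double parseNumberOrNull(String s) {
        try {
            return Double.parseDouble(s.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /**
     * Compares two cells. Numbers match when they differ by at most
     * tolerance * max(|a|, |b|), i.e. a relative tolerance (assumed).
     */
    static boolean cellsMatch(String baseline, String actual,
                              double tolerance, boolean ignoreCase, boolean normalizeDates) {
        Double x = parseNumberOrNull(baseline);
        Double y = parseNumberOrNull(actual);
        if (x != null && y != null) {
            double scale = Math.max(Math.abs(x), Math.abs(y));
            return Math.abs(x - y) <= tolerance * scale;
        }
        if (normalizeDates) {
            LocalDate d1 = parseDateOrNull(baseline);
            LocalDate d2 = parseDateOrNull(actual);
            if (d1 != null && d2 != null) {
                return d1.equals(d2);
            }
        }
        return ignoreCase ? baseline.equalsIgnoreCase(actual) : baseline.equals(actual);
    }
}
```

With a tolerance of 0.01, this sketch treats 100.0 and 100.5 as equal but flags 100.0 versus 102.0. A relative tolerance scales with the magnitude of the values, which is one plausible reading of "1% numeric tolerance"; an absolute tolerance would be an equally valid design.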

Quick Start

Maven

<dependency>
    <groupId>io.github.chitralabs.schemamatch</groupId>
    <artifactId>schemamatch-core</artifactId>
    <version>1.0.0</version>
</dependency>

Gradle

implementation 'io.github.chitralabs.schemamatch:schemamatch-core:1.0.0'

Usage

One-liner diff

DiffResult result = SchemaMatcher.diff("baseline.csv", "actual.csv");
System.out.println(result.isIdentical());         // false
System.out.println(result.getRowDiffCount());     // 3

Generate HTML report

SchemaMatcher.diff("before.xlsx", "after.xlsx")
             .report("diff-report.html");

With options

DiffResult r = SchemaMatcher.options()
        .keyColumn("customer_id")     // match rows by key, not position
        .tolerance(0.01)              // 1% numeric tolerance
        .ignoreCase(true)             // case-insensitive string comparison
        .diff("v1.csv", "v2.csv");

// Inspect column changes
r.getColumnDiffs().forEach(cd ->
    System.out.println(cd.getChangeType() + ": " + cd.getActualColumnName()));

// Inspect row changes
r.getRowDiffs().forEach(rd -> {
    System.out.println("Row " + rd.getRowIndex() + " [" + rd.getChangeType() + "]");
    rd.getChangedValues().forEach(vd ->
        System.out.println("  " + vd.getColumnName() + ": " +
                           vd.getBaselineValue() + " → " + vd.getActualValue()));
});
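The idea behind keyColumn can be sketched as follows: index both datasets by the key value, then classify each key as added, removed, or modified, so that reordered rows are not reported as changes. This is a hedged illustration rather than the library's algorithm; KeyedRowDiff, the Change enum, and its constant names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of key-based row matching; not the schemamatch implementation. */
final class KeyedRowDiff {

    // Hypothetical change types; the library's own enum constants may differ.
    enum Change { ADDED, REMOVED, MODIFIED }

    /** Indexes rows by the value in keyColumn, preserving first-seen order. */
    static Map<String, Map<String, String>> indexByKey(
            List<Map<String, String>> rows, String keyColumn) {
        Map<String, Map<String, String>> index = new LinkedHashMap<>();
        for (Map<String, String> row : rows) {
            // In this sketch a duplicate key simply overwrites the earlier row.
            index.put(row.get(keyColumn), row);
        }
        return index;
    }

    /** Returns one entry per key whose row was added, removed, or modified. */
    static Map<String, Change> diff(List<Map<String, String>> baseline,
                                    List<Map<String, String>> actual,
                                    String keyColumn) {
        Map<String, Map<String, String>> before = indexByKey(baseline, keyColumn);
        Map<String, Map<String, String>> after = indexByKey(actual, keyColumn);
        Map<String, Change> changes = new LinkedHashMap<>();

        for (Map.Entry<String, Map<String, String>> entry : before.entrySet()) {
            Map<String, String> match = after.get(entry.getKey());
            if (match == null) {
                changes.put(entry.getKey(), Change.REMOVED);
            } else if (!match.equals(entry.getValue())) {
                changes.put(entry.getKey(), Change.MODIFIED);
            }
        }
        for (String key : after.keySet()) {
            if (!before.containsKey(key)) {
                changes.put(key, Change.ADDED);
            }
        }
        return changes;
    }
}
```

Because rows are looked up by key, a row that only moved to a different position produces no entry in the result.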

JSON report for CI

SchemaMatcher.diff("expected.csv", "actual.csv").report("diff.json");
// In CI, fail the build when rows differ, e.g.: jq -e '.rowDiffCount == 0' diff.json
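If jq is not available on the build agent, the same gate can be sketched in plain Java. The example below assumes the JSON report contains a numeric rowDiffCount field, as the jq expression above implies; the DiffGate class, its methods, and the exit-code convention are hypothetical, and the regular expression is a shortcut rather than a real JSON parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative CI gate; assumes the report exposes a numeric "rowDiffCount" field. */
final class DiffGate {

    // Matches "rowDiffCount": 123 anywhere in the text (not a full JSON parser).
    private static final Pattern ROW_DIFF_COUNT =
            Pattern.compile("\"rowDiffCount\"\\s*:\\s*(\\d+)");

    /** Returns the row diff count, or -1 if the field is missing. */
    static int rowDiffCount(String json) {
        Matcher m = ROW_DIFF_COUNT.matcher(json);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Hypothetical convention: 0 = identical, 1 = rows differ, 2 = report unreadable. */
    static int exitCode(String json) {
        int count = rowDiffCount(json);
        if (count < 0) {
            return 2;
        }
        return count == 0 ? 0 : 1;
    }
}
```

A build step could read diff.json with Files.readString and pass the result of exitCode to System.exit, so that any non-zero row diff count fails the job.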

Supported Formats

| Format | Extension | Notes |
|--------|-----------|-------|
| CSV    | .csv      | RFC 4180, quoted fields, embedded commas |
| TSV    | .tsv      | Tab-delimited variant |
| Excel  | .xlsx     | Requires Apache POI on classpath |
| Excel  | .xls      | Legacy format, requires Apache POI |
| JSON   | .json     | Top-level array of objects |

For Excel support, add Apache POI to your own pom.xml:

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>5.2.5</version>
</dependency>
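For readers unfamiliar with the RFC 4180 rules mentioned for CSV (quoted fields, embedded commas), the following sketch splits one record into fields. It is an illustration under stated assumptions, not the parser schemamatch ships: CsvRecordParser is a hypothetical name, and the sketch handles a single record only, so quoted fields that span multiple lines are out of scope.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative RFC 4180-style splitter for one record; not the library's parser. */
final class CsvRecordParser {

    /** Splits a single CSV record into fields, honoring quotes and doubled quotes. */
    static List<String> parse(String record) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < record.length() && record.charAt(i + 1) == '"') {
                        field.append('"'); // "" inside quotes is an escaped quote
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    field.append(c);       // commas inside quotes are literal
                }
            } else if (c == '"') {
                inQuotes = true;           // opening quote
            } else if (c == ',') {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());      // last field (may be empty)
        return fields;
    }
}
```

For example, the record 1,"Smith, Jane","said ""hi""" yields three fields, with the comma and the inner quotes preserved in the second and third.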

Performance

| File size | Mode      | Heap    | Time  |
|-----------|-----------|---------|-------|
| 10K rows  | Standard  | < 20 MB | < 1 s |
| 100K rows | Streaming | < 50 MB | ~3 s  |
| 1M rows   | Streaming | < 50 MB | ~25 s |

Enable streaming for large files:

SchemaMatcher.options().streaming(5000).diff("huge.csv", "huge2.csv");
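The heap figures above suggest that streaming mode avoids loading whole files into memory. One way to get that behavior is to read both inputs in fixed-size batches, as in the sketch below. It assumes the argument to streaming(...) is a batch size in rows, which the README does not state, and it compares lines by position only; ChunkedLineDiff and its methods are hypothetical and do not reflect the library's internals.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative batch-wise positional diff; not the schemamatch streaming implementation. */
final class ChunkedLineDiff {

    /** Reads up to batchSize lines; returns an empty list at end of input. */
    static List<String> readBatch(BufferedReader in, int batchSize) throws IOException {
        List<String> batch = new ArrayList<>(batchSize);
        String line;
        while (batch.size() < batchSize && (line = in.readLine()) != null) {
            batch.add(line);
        }
        return batch;
    }

    /** Counts lines that differ or exist on one side only, holding one batch per side in memory. */
    static long countDifferingLines(Reader baseline, Reader actual, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        try (BufferedReader left = new BufferedReader(baseline);
             BufferedReader right = new BufferedReader(actual)) {
            long diffs = 0;
            while (true) {
                List<String> a = readBatch(left, batchSize);
                List<String> b = readBatch(right, batchSize);
                if (a.isEmpty() && b.isEmpty()) {
                    return diffs;
                }
                int common = Math.min(a.size(), b.size());
                for (int i = 0; i < common; i++) {
                    if (!a.get(i).equals(b.get(i))) {
                        diffs++;
                    }
                }
                // Lines present on only one side count as differences.
                diffs += Math.max(a.size(), b.size()) - common;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Memory use in this sketch is bounded by two batches regardless of file length. Key-based matching across batches would need extra bookkeeping (for example, sorted inputs or an on-disk index), which is left out here.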

License

Apache License 2.0 — see LICENSE.

© 2026 Chitrapradha Ganesan — github.com/chitralabs
