Skip to content

Add tests and fix corner cases for Parquet/GeoArrow extension type conversion#10065

Open
paleolimbot wants to merge 5 commits into
apache:mainfrom
paleolimbot:parquet-geospatial-typo
Open

Add tests and fix corner cases for Parquet/GeoArrow extension type conversion#10065
paleolimbot wants to merge 5 commits into
apache:mainfrom
paleolimbot:parquet-geospatial-typo

Conversation

@paleolimbot
Copy link
Copy Markdown
Member

@paleolimbot paleolimbot commented Jun 3, 2026

Which issue does this PR close?

Rationale for this change

There were several issues with conversion identified when I tried to integrate this in SedonaDB and that came to light when the spec was recently clarified.

I am sorry for missing these changes when I reviewed the initial implementation.

What changes are included in this PR?

  • A Parquet crs of None for geometry or geography is now converted to a GeoArrow CRS of "OGC:CRS84" (the named value for the default CRS in the Parquet spec)
  • A Parquet crs of "srid:0" is now converted to a GeoArrow "omitted" CRS. This was recently clarified in the Parquet spec (srid:0 is a named example in the list of allowed values).
  • A GeoArrow missing CRS is now encoded as "srid:0"
  • A GeoArrow CRS that is "lonlat-like" is now encoded as a Parquet crs of None. This logic was included in the previous implementation but was reversed (Parquet CRSes that looked like lonlat were omitted when written to GeoArrow, which is not correct).
  • The GeoArrow metadata struct uses the name "algorithm" and was serializing it to JSON. The GeoArrow spec uses the "edges" key. This led to invalid metadata being generated which was either rejected or incorrectly interpreted by consumers.

Are these changes tested?

Yes. I added high-level end-to-end LogicalType <-> extension metadata tests, since that is what matters (there were a few lower level tests that I updated as well).

@github-actions github-actions Bot added the parquet Changes to the parquet crate label Jun 3, 2026
Comment on lines +437 to +495
// Geometry with default CRS (defaults to OGC:CRS84 per Parquet spec)
(LogicalType::geometry(None), r#"{"crs":"OGC:CRS84"}"#),
// Geometry with srid:0 should result in an unset (omitted) CRS
(LogicalType::geometry(Some("srid:0".to_string())), r#"{}"#),
// Geometry with custom CRSes (authority:code and partial projjson)
(
LogicalType::geometry(Some("EPSG:4267".to_string())),
r#"{"crs":"EPSG:4267"}"#,
),
(
LogicalType::geometry(Some(
r#"{"id":{"authority":"EPSG","code":4326}}"#.to_string(),
)),
r#"{"crs":{"id":{"authority":"EPSG","code":4326}}}"#,
),
// Geography with default CRS (default OGC:CRS84, spherical edges)
(
LogicalType::geography(None, None),
r#"{"crs":"OGC:CRS84","edges":"spherical"}"#,
),
// Geography with explicit edges
(
LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::SPHERICAL)),
r#"{"crs":"OGC:CRS84","edges":"spherical"}"#,
),
(
LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::KARNEY)),
r#"{"crs":"OGC:CRS84","edges":"karney"}"#,
),
(
LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::VINCENTY)),
r#"{"crs":"OGC:CRS84","edges":"vincenty"}"#,
),
(
LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::ANDOYER)),
r#"{"crs":"OGC:CRS84","edges":"andoyer"}"#,
),
(
LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::THOMAS)),
r#"{"crs":"OGC:CRS84","edges":"thomas"}"#,
),
// Geometry with srid:0 should result in an unset (omitted) CRS
// and spherical edges
(
LogicalType::geography(Some("srid:0".to_string()), None),
r#"{"edges":"spherical"}"#,
),
// Geography with custom CRSes (authority:code and partial projjson)
(
LogicalType::geography(Some("EPSG:4267".to_string()), None),
r#"{"crs":"EPSG:4267","edges":"spherical"}"#,
),
(
LogicalType::geography(
Some(r#"{"id":{"authority":"EPSG","code":4326}}"#.to_string()),
None,
),
r#"{"crs":{"id":{"authority":"EPSG","code":4326}},"edges":"spherical"}"#,
),
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the test cases for reading Parquet files. For GeoArrow implementation to correctly recognize the intent of the CRS field in the Parquet file, this is the GeoArrow extension metadata that must be produced in each of these cases.

There were a few problems with the existing implementation:

  • LogicalType::geometry(None) (i.e., Parquet default CRS) was read as as {} (which in GeoArrow land means "I don't know what the CRS is)
  • LogicalType::geography(Some(...), None) (i.e. Parquet default edge algorithm) was read as {"crs":"..."} (i.e., no edges in the GeoArrow metadata, which consumers would recognize as GEOMETRY and not GEOGRAPHY)
  • LogicalType::geography(Some(...), Some(EdgeInterpolationAlgorithm::SPHERICAL)) (i.e. an actual edge algorithm) was read as {"crs":"...","algorithm":"spherical"}. The "algorithm" key isn't valid in GeoArrow and this would be rejected by consumers or ignored and read as GEOMETRY (not geography).

Comment on lines +540 to +612
// Geometry with no CRS should be GEOMETRY(srid:0)
(r#"{}"#, LogicalType::geometry(Some("srid:0".to_string()))),
// Geometry with string CRS
(
r#"{"crs":"EPSG:4267"}"#,
LogicalType::geometry(Some("\"EPSG:4267\"".to_string())),
),
// Geometry with PROJJSON CRS
(
r#"{"crs":{"id":{"authority":"EPSG","code":3857}}}"#,
LogicalType::geometry(Some(
r#"{"id":{"authority":"EPSG","code":3857}}"#.to_string(),
)),
),
// Geometry with lon/lat CRSes (canonically removed because lon/lat is the
// default Parquet CRS)
(r#"{"crs":"OGC:CRS84"}"#, LogicalType::geometry(None)),
(r#"{"crs":"EPSG:4326"}"#, LogicalType::geometry(None)),
(
r#"{"crs":{"id":{"authority":"EPSG","code":4326}}}"#,
LogicalType::geometry(None),
),
(
r#"{"crs":{"id":{"authority":"EPSG","code":"4326"}}}"#,
LogicalType::geometry(None),
),
(
r#"{"crs":{"id":{"authority":"OGC","code":"CRS84"}}}"#,
LogicalType::geometry(None),
),
// Geography with no CRS, spherical edges
(
r#"{"edges":"spherical"}"#,
LogicalType::geography(Some("srid:0".to_string()), None),
),
// Geography with OGC:CRS84 and spherical edges
(
r#"{"crs":"OGC:CRS84","edges":"spherical"}"#,
LogicalType::geography(None, None),
),
// Geography with different edge algorithms
(
r#"{"crs":"OGC:CRS84","edges":"karney"}"#,
LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::KARNEY)),
),
(
r#"{"crs":"OGC:CRS84","edges":"vincenty"}"#,
LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::VINCENTY)),
),
(
r#"{"crs":"OGC:CRS84","edges":"andoyer"}"#,
LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::ANDOYER)),
),
(
r#"{"crs":"OGC:CRS84","edges":"thomas"}"#,
LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::THOMAS)),
),
// Geography with custom CRS and edges
(
r#"{"crs":"EPSG:4267","edges":"karney"}"#,
LogicalType::geography(
Some("\"EPSG:4267\"".to_string()),
Some(EdgeInterpolationAlgorithm::KARNEY),
),
),
// Geography with PROJJSON CRS
(
r#"{"crs":{"id":{"authority":"EPSG","code":4267}},"edges":"spherical"}"#,
LogicalType::geography(
Some(r#"{"id":{"authority":"EPSG","code":4267}}"#.to_string()),
None,
),
),
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the tests for writing. The main issue with writing was that "edges":"spherical", which GeoArrow uses to communicate GEOGRAPHY, was not actually writing a geography type because the implementation expected "algorithm". An unset CRS (geoarrow metadata {}) was also written as a Parquet default CRS which is technically incorrect but also not common.

@paleolimbot paleolimbot marked this pull request as ready for review June 5, 2026 02:18
@paleolimbot
Copy link
Copy Markdown
Member Author

cc @BlakeOrth

@BlakeOrth
Copy link
Copy Markdown
Contributor

Thanks for the ping, I'll make some time to review this today.

Copy link
Copy Markdown
Contributor

@BlakeOrth BlakeOrth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes here look good to me, I didn't find anything meaningful that would require any follow-up. Thanks for taking care of this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parquet geospatial conversion uses metadata key "algorithm" instead of "edges" in geoarrow metadata

2 participants