Add tests and fix corner cases for Parquet/GeoArrow extension type conversion#10065
Add tests and fix corner cases for Parquet/GeoArrow extension type conversion#10065paleolimbot wants to merge 5 commits into
Conversation
| // Geometry with default CRS (defaults to OGC:CRS84 per Parquet spec) | ||
| (LogicalType::geometry(None), r#"{"crs":"OGC:CRS84"}"#), | ||
| // Geometry with srid:0 should result in an unset (omitted) CRS | ||
| (LogicalType::geometry(Some("srid:0".to_string())), r#"{}"#), | ||
| // Geometry with custom CRSes (authority:code and partial projjson) | ||
| ( | ||
| LogicalType::geometry(Some("EPSG:4267".to_string())), | ||
| r#"{"crs":"EPSG:4267"}"#, | ||
| ), | ||
| ( | ||
| LogicalType::geometry(Some( | ||
| r#"{"id":{"authority":"EPSG","code":4326}}"#.to_string(), | ||
| )), | ||
| r#"{"crs":{"id":{"authority":"EPSG","code":4326}}}"#, | ||
| ), | ||
| // Geography with default CRS (default OGC:CRS84, spherical edges) | ||
| ( | ||
| LogicalType::geography(None, None), | ||
| r#"{"crs":"OGC:CRS84","edges":"spherical"}"#, | ||
| ), | ||
| // Geography with explicit edges | ||
| ( | ||
| LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::SPHERICAL)), | ||
| r#"{"crs":"OGC:CRS84","edges":"spherical"}"#, | ||
| ), | ||
| ( | ||
| LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::KARNEY)), | ||
| r#"{"crs":"OGC:CRS84","edges":"karney"}"#, | ||
| ), | ||
| ( | ||
| LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::VINCENTY)), | ||
| r#"{"crs":"OGC:CRS84","edges":"vincenty"}"#, | ||
| ), | ||
| ( | ||
| LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::ANDOYER)), | ||
| r#"{"crs":"OGC:CRS84","edges":"andoyer"}"#, | ||
| ), | ||
| ( | ||
| LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::THOMAS)), | ||
| r#"{"crs":"OGC:CRS84","edges":"thomas"}"#, | ||
| ), | ||
| // Geometry with srid:0 should result in an unset (omitted) CRS | ||
| // and spherical edges | ||
| ( | ||
| LogicalType::geography(Some("srid:0".to_string()), None), | ||
| r#"{"edges":"spherical"}"#, | ||
| ), | ||
| // Geography with custom CRSes (authority:code and partial projjson) | ||
| ( | ||
| LogicalType::geography(Some("EPSG:4267".to_string()), None), | ||
| r#"{"crs":"EPSG:4267","edges":"spherical"}"#, | ||
| ), | ||
| ( | ||
| LogicalType::geography( | ||
| Some(r#"{"id":{"authority":"EPSG","code":4326}}"#.to_string()), | ||
| None, | ||
| ), | ||
| r#"{"crs":{"id":{"authority":"EPSG","code":4326}},"edges":"spherical"}"#, | ||
| ), |
There was a problem hiding this comment.
These are the test cases for reading Parquet files. For GeoArrow implementation to correctly recognize the intent of the CRS field in the Parquet file, this is the GeoArrow extension metadata that must be produced in each of these cases.
There were a few problems with the existing implementation:
LogicalType::geometry(None)(i.e., Parquet default CRS) was read as as{}(which in GeoArrow land means "I don't know what the CRS is)LogicalType::geography(Some(...), None)(i.e. Parquet default edge algorithm) was read as{"crs":"..."}(i.e., no edges in the GeoArrow metadata, which consumers would recognize as GEOMETRY and not GEOGRAPHY)LogicalType::geography(Some(...), Some(EdgeInterpolationAlgorithm::SPHERICAL))(i.e. an actual edge algorithm) was read as{"crs":"...","algorithm":"spherical"}. The"algorithm"key isn't valid in GeoArrow and this would be rejected by consumers or ignored and read as GEOMETRY (not geography).
| // Geometry with no CRS should be GEOMETRY(srid:0) | ||
| (r#"{}"#, LogicalType::geometry(Some("srid:0".to_string()))), | ||
| // Geometry with string CRS | ||
| ( | ||
| r#"{"crs":"EPSG:4267"}"#, | ||
| LogicalType::geometry(Some("\"EPSG:4267\"".to_string())), | ||
| ), | ||
| // Geometry with PROJJSON CRS | ||
| ( | ||
| r#"{"crs":{"id":{"authority":"EPSG","code":3857}}}"#, | ||
| LogicalType::geometry(Some( | ||
| r#"{"id":{"authority":"EPSG","code":3857}}"#.to_string(), | ||
| )), | ||
| ), | ||
| // Geometry with lon/lat CRSes (canonically removed because lon/lat is the | ||
| // default Parquet CRS) | ||
| (r#"{"crs":"OGC:CRS84"}"#, LogicalType::geometry(None)), | ||
| (r#"{"crs":"EPSG:4326"}"#, LogicalType::geometry(None)), | ||
| ( | ||
| r#"{"crs":{"id":{"authority":"EPSG","code":4326}}}"#, | ||
| LogicalType::geometry(None), | ||
| ), | ||
| ( | ||
| r#"{"crs":{"id":{"authority":"EPSG","code":"4326"}}}"#, | ||
| LogicalType::geometry(None), | ||
| ), | ||
| ( | ||
| r#"{"crs":{"id":{"authority":"OGC","code":"CRS84"}}}"#, | ||
| LogicalType::geometry(None), | ||
| ), | ||
| // Geography with no CRS, spherical edges | ||
| ( | ||
| r#"{"edges":"spherical"}"#, | ||
| LogicalType::geography(Some("srid:0".to_string()), None), | ||
| ), | ||
| // Geography with OGC:CRS84 and spherical edges | ||
| ( | ||
| r#"{"crs":"OGC:CRS84","edges":"spherical"}"#, | ||
| LogicalType::geography(None, None), | ||
| ), | ||
| // Geography with different edge algorithms | ||
| ( | ||
| r#"{"crs":"OGC:CRS84","edges":"karney"}"#, | ||
| LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::KARNEY)), | ||
| ), | ||
| ( | ||
| r#"{"crs":"OGC:CRS84","edges":"vincenty"}"#, | ||
| LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::VINCENTY)), | ||
| ), | ||
| ( | ||
| r#"{"crs":"OGC:CRS84","edges":"andoyer"}"#, | ||
| LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::ANDOYER)), | ||
| ), | ||
| ( | ||
| r#"{"crs":"OGC:CRS84","edges":"thomas"}"#, | ||
| LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::THOMAS)), | ||
| ), | ||
| // Geography with custom CRS and edges | ||
| ( | ||
| r#"{"crs":"EPSG:4267","edges":"karney"}"#, | ||
| LogicalType::geography( | ||
| Some("\"EPSG:4267\"".to_string()), | ||
| Some(EdgeInterpolationAlgorithm::KARNEY), | ||
| ), | ||
| ), | ||
| // Geography with PROJJSON CRS | ||
| ( | ||
| r#"{"crs":{"id":{"authority":"EPSG","code":4267}},"edges":"spherical"}"#, | ||
| LogicalType::geography( | ||
| Some(r#"{"id":{"authority":"EPSG","code":4267}}"#.to_string()), | ||
| None, | ||
| ), | ||
| ), |
There was a problem hiding this comment.
These are the tests for writing. The main issue with writing was that "edges":"spherical", which GeoArrow uses to communicate GEOGRAPHY, was not actually writing a geography type because the implementation expected "algorithm". An unset CRS (geoarrow metadata {}) was also written as a Parquet default CRS which is technically incorrect but also not common.
|
cc @BlakeOrth |
|
Thanks for the ping, I'll make some time to review this today. |
BlakeOrth
left a comment
There was a problem hiding this comment.
The changes here look good to me, I didn't find anything meaningful that would require any follow-up. Thanks for taking care of this!
Which issue does this PR close?
Rationale for this change
There were several issues with conversion identified when I tried to integrate this in SedonaDB and that came to light when the spec was recently clarified.
I am sorry for missing these changes when I reviewed the initial implementation.
What changes are included in this PR?
Nonefor geometry or geography is now converted to a GeoArrow CRS of"OGC:CRS84"(the named value for the default CRS in the Parquet spec)"srid:0"is now converted to a GeoArrow "omitted" CRS. This was recently clarified in the Parquet spec (srid:0 is a named example in the list of allowed values)."srid:0"None. This logic was included in the previous implementation but was reversed (Parquet CRSes that looked like lonlat were omitted when written to GeoArrow, which is not correct)."algorithm"and was serializing it to JSON. The GeoArrow spec uses the"edges"key. This led to invalid metadata being generated which was either rejected or incorrectly interpreted by consumers.Are these changes tested?
Yes. I added high-level end-to-end LogicalType <-> extension metadata tests, since that is what matters (there were a few lower level tests that I updated as well).