
Conversation

@ivorbosloper
Collaborator

The downside of our geopandas-based FiboaConverters is that the dataframe needs to fit in memory, which can be huge. A completely different approach would be to use DuckDB for the conversion. To test its viability, I took the dataset with the largest dataframe (Japan). I no longer need a 128GB+ machine; it's fast and performant on a laptop using 20GB of memory and finishes much faster (under 30 minutes).
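
In essence, the conversion becomes a single COPY ... TO statement that streams through DuckDB instead of materializing a GeoDataFrame in Python. A rough sketch of the idea via the Python API (file, layer and column names are placeholders, not the actual converter code):

import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial")
con.execute("LOAD spatial")

# Stream the source features through DuckDB and write Parquet directly,
# without ever holding the full dataframe in Python memory.
con.execute("""
    COPY (
        SELECT
            source_id AS id,
            ST_AsWKB(geom) AS geometry
        FROM ST_Read('input.gml')
    ) TO 'output.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
""")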

I need some help with the parquet metadata...

@m-mohr
Contributor

m-mohr commented Dec 8, 2025

You can set KV_METADATA in DuckDB? That's awesome and resolves the primary issue we had on our list! Will test later, thanks! Maybe it makes sense to jump on a call for the metadata discussion.

@ivorbosloper
Collaborator Author

> You can set KV_METADATA in DuckDB? That's awesome and resolves the primary issue we had on our list! Will test later, thanks! Maybe it makes sense to jump on a call for the metadata discussion.

Yes. I first tried the Python API, but that doesn't expose the metadata option. It would be great to have a call about the metadata details; you wrote the vecorel-parquet logic and seem well informed...
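
For reference, the COPY statement can still be issued as raw SQL from Python even though the relational API doesn't expose the metadata option. A sketch with placeholder metadata values (the actual GeoParquet/fiboa JSON still has to be filled in, and the 'fiboa' key name here is just an example):

import duckdb

con = duckdb.connect()

# The relational API has no metadata parameter, but COPY ... TO accepts
# KV_METADATA, so run it as plain SQL instead.
con.execute("""
    COPY (SELECT * FROM read_parquet('jptest.parquet'))
    TO 'jptest_meta.parquet' (
        FORMAT PARQUET,
        KV_METADATA {
            'geo': '...GeoParquet metadata JSON...',
            'fiboa': '...collection metadata JSON...'
        }
    )
""")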

@ivorbosloper
Collaborator Author

ivorbosloper commented Dec 9, 2025

I'm still stuck on fiboa validation:

fiboa validate jptest.parquet     
fiboa CLI 0.20.1 - Validator

Validating jptest.parquet: INVALID
 - geometry: Nullability differs, is True but must be False
 - id: Nullability differs, is True but must be False

As I understand it, the validator takes the Parquet schema and checks it against the fiboa schema (see the implementation). The fiboa schema is built from the extensions (and is correct in this case), but the Parquet schema is implicitly created by DuckDB (with TO 'file.parquet' (FORMAT parquet)) based on the result set of the query.

Can I force this to a 'non-null' result? I've tried casting with ::VARCHAR NOT NULL, but that's not valid syntax.

Maybe related to duckdb/duckdb#13949
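
A possible fallback (only a sketch, and it pulls the table back into Arrow memory after the COPY, which isn't ideal for the big datasets) would be to patch the nullability afterwards with pyarrow:

import pyarrow as pa
import pyarrow.parquet as pq

# Re-read the DuckDB output and rewrite it with the required columns
# marked as non-nullable, so the schema matches what the validator expects.
table = pq.read_table("jptest.parquet")
fields = [
    f.with_nullable(False) if f.name in ("id", "geometry") else f
    for f in table.schema
]
schema = pa.schema(fields, metadata=table.schema.metadata)
pq.write_table(table.cast(schema), "jptest_nonnull.parquet")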

@ivorbosloper
Collaborator Author

ivorbosloper commented Dec 9, 2025

I think the DuckDB Parquet writer doesn't support setting nullability derived from the result set:

$ duckdb
D CREATE TABLE test(id varchar not null);
D insert into test values ('A'), ('B');
D COPY (select * from test) to '/tmp/test.parquet' (FORMAT PARQUET);
D exit

$ parquet-dump-schema /tmp/test.parquet 
required group field_id=-1 duckdb_schema {
  optional binary field_id=-1 id (String);
}

The same happens with COPY test TO '/tmp/test.parquet' (FORMAT PARQUET).

@m-mohr
Contributor

m-mohr commented Dec 9, 2025

To me that sounds like a bug in DuckDB. Is there an open issue for it? If not, maybe open one?

@ivorbosloper
Collaborator Author

> To me that sounds like a bug in DuckDB. Is there an open issue for it? If not, maybe open one?

Is "Non-nullability" a property of a query result column? Maybe this information is lost. But at least it's a feature request..
