Conversation
The existing Avro serialization library didn't support DATE-typed fields. Folded in additional Databricks Avro lib components and moved the Avro libs into a .cloned package to avoid conflicts.
Horrible, ugly, not-so-good hacks:
* Convert DateType => String (this type isn't often used; we use it for birth dates)
* Encode TimestampType in microseconds
* Generate and pass along a schema to BigQuery
* Added a handy bigQueryTableExists() util method
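The two type conversions above can be sketched as plain functions. This is a minimal illustration, not the actual cloned spark-avro converter code; the function names `encode_date` and `encode_timestamp_micros` are hypothetical, and it assumes timestamps are naive UTC:

```python
from datetime import date, datetime, timezone

def encode_date(d: date) -> str:
    # DateType => String: emit an ISO-8601 date string that
    # BigQuery can ingest (e.g. a birth date).
    return d.isoformat()

def encode_timestamp_micros(ts: datetime) -> int:
    # TimestampType => integer microseconds since the Unix epoch,
    # matching BigQuery's microsecond-precision TIMESTAMP.
    # Assumes a naive datetime that represents UTC.
    return int(ts.replace(tzinfo=timezone.utc).timestamp() * 1_000_000)

print(encode_date(date(1984, 6, 1)))                           # 1984-06-01
print(encode_timestamp_micros(datetime(1970, 1, 1, 0, 0, 1)))  # 1000000
```

In the real patch this logic lives inside the Avro row converters, so every DateType and TimestampType value passes through it on write.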
This method extracts the schema from the DataFrame and uses it both when reading the data in and when generating the table schema. The old mechanism simply used the Avro definition to infer the table schema.
Shame that we have to copy spark-avro completely to get this to work. Would it be reasonable to upstream those two changes to the original project and pull it in as a dependency? The timestamp-encoding change in particular seems like it would break a lot of existing code if done in that project. That said, looking at the current structure of the Avro project, I don't see a good way to extend it so that only the conversion of those two types changes. Another option, which also isn't great, is to capture the Row prior to write and convert those two fields to milliseconds or String. The schema for BigQuery could be generated prior to this conversion. That doesn't feel like the correct solution either, though.
Right, I agree it's a tricky situation. I'll take a look and see if I can come up with a better route - maybe I could push some upstream flags into the Avro library.
This pull request shows off various hacks I made to Spark to correctly write data into BigQuery using Avro as the import format. I had to change the Avro format so that:
* DateType fields are converted to String
* TimestampType values are encoded in microseconds
I also pull in code from Appsflyer to build the JSON schema that is used to tell BigQuery how to handle some of the types that are not natively supported by Avro.
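The schema generation can be sketched as a lookup from Spark SQL type names to BigQuery column types. This is only an illustration in the spirit of that code, not the AppsFlyer implementation itself; the `SPARK_TO_BQ` table and `bigquery_schema` helper are hypothetical names, and the type mapping shown covers just a handful of common cases:

```python
import json

# Hypothetical mapping from Spark SQL type names to BigQuery types.
SPARK_TO_BQ = {
    "StringType": "STRING",
    "IntegerType": "INTEGER",
    "LongType": "INTEGER",
    "DoubleType": "FLOAT",
    "BooleanType": "BOOLEAN",
    "TimestampType": "TIMESTAMP",
    "DateType": "STRING",  # DATE fields travel as strings, per the hack above
}

def bigquery_schema(fields):
    """fields: list of (name, spark_type_name, nullable) tuples.

    Returns the list-of-dicts shape BigQuery accepts as a table
    schema definition (name / type / mode per column).
    """
    return [
        {
            "name": name,
            "type": SPARK_TO_BQ[spark_type],
            "mode": "NULLABLE" if nullable else "REQUIRED",
        }
        for name, spark_type, nullable in fields
    ]

print(json.dumps(bigquery_schema([("birth_date", "DateType", True)])))
```

The generated JSON is what gets passed along to BigQuery at load time, so the table schema no longer depends on what the Avro definition happens to infer.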