This repository was archived by the owner on Dec 15, 2021. It is now read-only.


Conceptual pull request #35

Open
markncooper wants to merge 4 commits into spotify:master from markncooper:master

Conversation


@markncooper markncooper commented May 2, 2017

This pull request shows off various hacks I made to Spark to correctly write data into BigQuery using Avro as the import format. I had to change the Avro format so that:

  • Objects of TimestampType are encoded as micros (to match BigQuery)
  • Objects of DateType are encoded as strings (this is a lesser-used type that wasn't previously supported)

I also pull in code from Appsflyer to build the JSON schema that is used to tell BigQuery how to handle some of the types that are not natively supported by Avro.
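Concretely, the two encodings described above amount to something like the following. This is a Python sketch purely for illustration; the actual patch lives in the project's Scala Avro-serialization code, and the function names here are hypothetical:

```python
from datetime import date, datetime, timezone

def timestamp_to_micros(ts: datetime) -> int:
    """Encode a timestamp as microseconds since the Unix epoch,
    which is how BigQuery interprets long-typed timestamp values."""
    return int(ts.replace(tzinfo=timezone.utc).timestamp() * 1_000_000)

def date_to_string(d: date) -> str:
    """Encode a DATE value as an ISO-8601 string, e.g. '2017-05-02'."""
    return d.isoformat()
```

For example, `timestamp_to_micros(datetime(1970, 1, 1, 0, 0, 1))` yields `1_000_000`, i.e. one second past the epoch expressed in micros.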

Mark Cooper added 4 commits March 20, 2017 13:29
The existing Avro serialization library didn't support DATE-typed
fields. Folded in additional Databricks Avro lib components and moved
the Avro libs to a .cloned package to avoid conflicts.
Horrible, ugly, not-so-good hacks

* Convert DateType => String (this type isn't often used; we use it for birth dates)
* Encode TimestampType in microseconds
* Generate and pass along a schema to BigQuery
* Add a handy bigQueryTableExists() util method
This method extracts the schema from the DataFrame and uses it both
when reading the data in and when generating the table schema. The old
mechanism simply used the Avro definition to infer the table
schema.
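The schema-generation step described above could be sketched roughly as follows. This is a hypothetical Python illustration, not the project's Scala code, and the type mapping shown is an assumed subset:

```python
# Hypothetical mapping from Spark SQL type names to BigQuery field types.
SPARK_TO_BQ = {
    "LongType": "INTEGER",
    "DoubleType": "FLOAT",
    "StringType": "STRING",
    "BooleanType": "BOOLEAN",
    "TimestampType": "TIMESTAMP",  # values carried as micros in the Avro data
    "DateType": "STRING",          # values carried as ISO strings in the Avro data
}

def build_bq_schema(fields):
    """fields: list of (name, spark_type_name, nullable) tuples taken from a
    DataFrame schema. Returns a BigQuery-style JSON table schema (list of
    field descriptors) telling BigQuery how to interpret each column."""
    return [
        {
            "name": name,
            "type": SPARK_TO_BQ[spark_type],
            "mode": "NULLABLE" if nullable else "REQUIRED",
        }
        for name, spark_type, nullable in fields
    ]
```

Passing a schema like this alongside the Avro load is what lets BigQuery treat a string-encoded DATE column or a long-encoded timestamp correctly, instead of inferring everything from the Avro definition alone.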
@richwhitjr richwhitjr (Contributor) commented

It's a shame that we have to copy avro-spark completely to get this to work. Would it be reasonable to upstream those two changes to the original project and pull it in as a dependency? The timestamp-encoding change in particular, though, seems like it would break a lot of existing code if done in that project. That said, looking at the current structure of the Avro project, I don't see a good way to extend it so that only the conversion of those two types changes.

Another option, which also isn't great, is to capture the Row prior to write and convert those two fields to milliseconds or String. The schema for BigQuery could be generated prior to this conversion. This doesn't feel like the correct solution either, though.
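The pre-write Row conversion suggested here could look roughly like the following. This is a Python sketch under stated assumptions (rows as plain dicts, and the PR's micros/string encoding rather than milliseconds), not the project's actual Scala code:

```python
from datetime import date, datetime, timezone

def convert_row(row: dict) -> dict:
    """Rewrite timestamp and date fields just before the write, so the
    stock Avro writer emits BigQuery-compatible values unchanged."""
    out = {}
    for key, value in row.items():
        # Check datetime first: datetime is a subclass of date.
        if isinstance(value, datetime):
            out[key] = int(value.replace(tzinfo=timezone.utc).timestamp() * 1_000_000)
        elif isinstance(value, date):
            out[key] = value.isoformat()
        else:
            out[key] = value
    return out
```

The trade-off is as the comment says: this keeps the Avro library untouched, but the conversion has to run over every row, and the BigQuery schema must be generated from the pre-conversion types.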

@markncooper markncooper (Author) commented May 3, 2017

Right, I agree it's a tricky situation. I'll take a look and see if I can come up with a better route - maybe I could push some flags upstream into the Avro library.
