This repository was archived by the owner on Dec 15, 2021. It is now read-only.


Conceptual pull request #35

Open
markncooper wants to merge 4 commits into spotify:master from markncooper:master

Conversation


@markncooper markncooper commented May 2, 2017

This pull request shows off various hacks I made to Spark to correctly write data into BigQuery using Avro as the import format. I had to change the Avro format so that:

  • Objects of TimestampType are encoded as micros (to match BigQuery)
  • Objects of DateType are encoded as strings (this is a lesser-used type that wasn't previously supported)

I also pull in code from Appsflyer to build the JSON schema that is used to tell BigQuery how to handle some of the types that are not natively supported by Avro.
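Concretely, the two encodings described above amount to something like the following. This is a Python sketch purely for illustration; the actual patch lives in the project's Scala Avro-serialization code, and the function names here are hypothetical:

```python
from datetime import date, datetime, timezone

def timestamp_to_micros(ts: datetime) -> int:
    """Encode a timestamp as microseconds since the Unix epoch,
    which is how BigQuery interprets long-typed timestamp values."""
    return int(ts.replace(tzinfo=timezone.utc).timestamp() * 1_000_000)

def date_to_string(d: date) -> str:
    """Encode a DATE value as an ISO-8601 string, e.g. '2017-05-02'."""
    return d.isoformat()
```

For example, `timestamp_to_micros(datetime(1970, 1, 1, 0, 0, 1))` yields `1_000_000`, i.e. one second past the epoch expressed in micros.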

Mark Cooper added 4 commits March 20, 2017 13:29
The existing Avro serialization library didn't support DATE-typed
fields. Folded in additional Databricks Avro lib components and moved
the Avro libs to a .cloned package to avoid conflicts.
Horrible, ugly, not-so-good hacks

* Convert DateType => String (this type isn't often used; we use it for birth dates)
* Encode TimestampType in microseconds
* Generate and pass along a schema to BigQuery
* Add a handy bigQueryTableExists() util method
This method extracts the schema from the DataFrame and uses it both
when reading the data in and when generating the table schema. The old
mechanism simply used the Avro definition to infer the table
schema.
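The schema-generation step described above could be sketched roughly as follows. This is a hypothetical Python illustration, not the project's Scala code, and the type mapping shown is an assumed subset:

```python
# Hypothetical mapping from Spark SQL type names to BigQuery field types.
SPARK_TO_BQ = {
    "LongType": "INTEGER",
    "DoubleType": "FLOAT",
    "StringType": "STRING",
    "BooleanType": "BOOLEAN",
    "TimestampType": "TIMESTAMP",  # values carried as micros in the Avro data
    "DateType": "STRING",          # values carried as ISO strings in the Avro data
}

def build_bq_schema(fields):
    """fields: list of (name, spark_type_name, nullable) tuples taken from a
    DataFrame schema. Returns a BigQuery-style JSON table schema (list of
    field descriptors) telling BigQuery how to interpret each column."""
    return [
        {
            "name": name,
            "type": SPARK_TO_BQ[spark_type],
            "mode": "NULLABLE" if nullable else "REQUIRED",
        }
        for name, spark_type, nullable in fields
    ]
```

Passing a schema like this alongside the Avro load is what lets BigQuery treat a string-encoded DATE column or a long-encoded timestamp correctly, instead of inferring everything from the Avro definition alone.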
@richwhitjr richwhitjr (Contributor) commented

It's a shame that we have to copy avro-spark completely to get this to work. Would it be reasonable to upstream those two changes to the original project and pull it in as a dependency? The timestamp-encoding change in particular, though, seems like it would break a lot of existing code if done in that project. That said, looking at the current structure of the Avro project, I don't see a good way to extend it so that only the conversion of those two types changes.

Another option, which also isn't great, is to capture the Row prior to write and convert those two fields to milliseconds or String. The schema for BigQuery could be generated prior to this conversion. This doesn't feel like the correct solution either, though.
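The pre-write Row conversion suggested here could look roughly like the following. This is a Python sketch under stated assumptions (rows as plain dicts, and the PR's micros/string encoding rather than milliseconds), not the project's actual Scala code:

```python
from datetime import date, datetime, timezone

def convert_row(row: dict) -> dict:
    """Rewrite timestamp and date fields just before the write, so the
    stock Avro writer emits BigQuery-compatible values unchanged."""
    out = {}
    for key, value in row.items():
        # Check datetime first: datetime is a subclass of date.
        if isinstance(value, datetime):
            out[key] = int(value.replace(tzinfo=timezone.utc).timestamp() * 1_000_000)
        elif isinstance(value, date):
            out[key] = value.isoformat()
        else:
            out[key] = value
    return out
```

The trade-off is as the comment says: this keeps the Avro library untouched, but the conversion has to run over every row, and the BigQuery schema must be generated from the pre-conversion types.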

@markncooper markncooper (Author) commented May 3, 2017

Right, I agree it's a tricky situation. I'll take a look and see if I can come up with a better route - maybe I could push some flags upstream into the Avro library.
