This repository was archived by the owner on Dec 15, 2021. It is now read-only.

The Apache Avro library failed to parse the header #57

@matthew-fishkin

Spark version: 2.2.0
Spotify/spark-bigquery version: 0.2.2

Hi,

I am trying to use the saveAsBigQueryTable function to write a DataFrame whose schema has an array of structs as a field. However, I get the following error:

The Apache Avro library failed to parse the header with the follwing error: Invalid namespace: .topic_scores

The offending field is:


{
    "type": [
        {
            "items": [
                {
                    "namespace": ".topic_scores",
                    "type": "record",
                    "name": "topic_scores",
                    "fields": [
                        {
                            "type": "int",
                            "name": "index"
                        },
                        {
                            "type": "float",
                            "name": "score"
                        }
                    ]
                },
                "null"
            ],
            "type": "array"
        },
        "null"
    ],
    "name": "topic_scores"
}

You can see that the namespace field begins with a dot. My guess is that the issue stems from https://github.com/spotify/spark-bigquery/blob/master/src/main/scala/com/databricks/spark/avro/SchemaConverters.scala#L342-L346
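To make it concrete why a leading dot is rejected, here is a minimal sketch of the naming rule from the Avro specification: a namespace is a dot-separated sequence of names, each starting with a letter or underscore and continuing with letters, digits, or underscores, while an empty namespace is allowed. The object `NamespaceCheck` and the helper `isValidNamespace` are hypothetical names used for illustration. They are not part of Avro, spark-avro, or spark-bigquery, and this is not the check BigQuery actually runs.

```scala
object NamespaceCheck {
  // Avro spec: a name starts with [A-Za-z_] and continues with [A-Za-z0-9_].
  private val NamePattern = "[A-Za-z_][A-Za-z0-9_]*".r.pattern

  // Hypothetical helper: true if every dot-separated component is a valid name.
  // The empty string is accepted because it denotes "no namespace".
  def isValidNamespace(namespace: String): Boolean =
    namespace.isEmpty ||
      // limit -1 keeps empty components, so ".a" splits into "" and "a"
      namespace.split("\\.", -1).forall(part => NamePattern.matcher(part).matches())
}
```

With this rule, `NamespaceCheck.isValidNamespace(".topic_scores")` is false because the component before the dot is empty, while `NamespaceCheck.isValidNamespace("com.databricks.spark.avro")` is true.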

I can't find a way to configure the recordNamespace value. According to the spark-avro documentation:

You can specify the record name and namespace like this:

import com.databricks.spark.avro._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()
val df = spark.read.avro("src/test/resources/episodes.avro")

val name = "AvroTest"
val namespace = "com.databricks.spark.avro"
val parameters = Map("recordName" -> name, "recordNamespace" -> namespace)

df.write.options(parameters).avro("/tmp/output")

I think this is the line that reads that option, and sets the value to an empty string if not provided: https://github.com/databricks/spark-avro/blob/branch-4.0/src/main/scala/com/databricks/spark/avro/DefaultSource.scala#L114
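Putting the two guesses together, this is roughly how I think the leading dot comes about. It is only a sketch of my assumption, not code copied from either library: `NamespaceSketch`, `recordNamespace`, and `nestedNamespace` are hypothetical names, and I am assuming that the namespace of a nested record is built by joining the parent namespace and the field name with a dot.

```scala
object NamespaceSketch {
  // Assumption: when the "recordNamespace" option is missing,
  // the namespace falls back to the empty string.
  def recordNamespace(options: Map[String, String]): String =
    options.getOrElse("recordNamespace", "")

  // Assumption: a nested record's namespace is "<parent>.<fieldName>".
  def nestedNamespace(parent: String, fieldName: String): String =
    s"$parent.$fieldName"
}

object NamespaceSketchDemo extends App {
  import NamespaceSketch._

  // No option set: the parent namespace is "", so the result starts with a dot.
  println(nestedNamespace(recordNamespace(Map.empty), "topic_scores"))
  // prints ".topic_scores"

  // Option set: the result is a well-formed namespace.
  val options = Map("recordNamespace" -> "com.databricks.spark.avro")
  println(nestedNamespace(recordNamespace(options), "topic_scores"))
  // prints "com.databricks.spark.avro.topic_scores"
}
```

If this is what happens, passing any non-empty recordNamespace would avoid the invalid namespace, which is why being able to set that option matters here.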

These options are not exposed anywhere in the Spotify library. Has anyone seen this issue, or does anyone have a workaround? Thanks!
