
Deequ/PyDeequ compatibility with Spark-Connect #221

@SemyonSinchenko

Description

Is your feature request related to a problem? Please describe.
At the moment python-deequ relies on direct calls to the underlying Spark JVM via SparkSession._jvm. But in PySpark Connect, _jvm is not available at all, so python-deequ currently works only with PySpark Classic, not with PySpark Connect. Several Spark Connect users have already reported this incompatibility.
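
For illustration, here is a minimal sketch of the pattern in question (simplified, not pydeequ's actual source):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# PySpark Classic: SparkSession._jvm is a Py4J JVMView, so JVM-side deequ
# classes such as com.amazon.deequ.VerificationSuite can be reached directly.
suite = spark._jvm.com.amazon.deequ.VerificationSuite()

# PySpark Connect: the session is a thin gRPC client with no Py4J gateway;
# it has no _jvm attribute, so the call above is impossible.
```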

Describe the solution you'd like

  • Add to deequ-core a Spark Connect plugin and the corresponding protobuf messages for deequ classes/objects;
  • Add to python-deequ an alternative implementation of the API, built on top of Python classes generated from the protobuf definitions (see the sketch after this list).
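
As a rough sketch of the mechanics: Spark Connect carries plugin payloads as `google.protobuf.Any` fields on its plan messages, which a server-side plugin recognizes by type URL and unpacks. The `StringValue` below is only a stand-in for a message generated from the proposed .proto files:

```python
from google.protobuf import any_pb2
from google.protobuf.wrappers_pb2 import StringValue

# Stand-in for a generated deequ message (e.g. a serialized check definition).
msg = StringValue(value="completeness(id) >= 0.95")

# Client side: pack the message into Any and attach it to the plan's
# extension field.
packed = any_pb2.Any()
packed.Pack(msg)

# Server side (inside the deequ plugin): recognize the payload by its type
# URL, unpack it, and translate it into calls to the JVM deequ classes.
if packed.Is(StringValue.DESCRIPTOR):
    unpacked = StringValue()
    packed.Unpack(unpacked)
```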

Describe alternatives you've considered
At the moment I am not aware of any alternative. Based on the Spark Connect documentation and discussions on the Spark mailing list / Jira, there are no plans to support _jvm in Connect.

Additional context
This is not especially hard to do. I did it recently for educational purposes, and it required:

  • About 350 lines of protobuf code;
  • About 550 lines of Scala code that implements plugin and protobuf parser;
  • Corresponding updates to python-deequ.

I'm willing to do all the required work and to maintain the protobuf code together with a new Python API.

In deequ-core this could be done as a submodule. In python-deequ it could be shipped as an extra: pip install pydeequ would work as it does today, while pip install pydeequ[connect] would additionally install the protobuf dependencies and the classes generated from the .proto files.
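
A hypothetical packaging sketch (setup.py style; package names and version pins are illustrative, not a tested configuration):

```python
from setuptools import setup, find_packages

setup(
    name="pydeequ",
    packages=find_packages(),
    extras_require={
        # `pip install pydeequ[connect]` additionally pulls in the protobuf
        # runtime required by the classes generated from the .proto files.
        "connect": ["protobuf>=4.21", "grpcio"],
    },
)
```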

It seems to me that adapting deequ/python-deequ to Spark Connect can be done without breaking changes. And it not only opens a way to make pydeequ work on Spark Connect, but also creates the potential for other deequ APIs (Spark Connect Go, Spark Connect Rust, etc.). It also opens a way to use deequ from Java/Scala via Spark Connect.

P.S. Using protobuf also makes it possible to decouple Scala from Python entirely and to avoid all the problems with default arguments, Option, etc. And it could be used not only in Connect but in PySpark Classic too.
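
For context, a simplified illustration of the Py4J friction mentioned above: Scala Option values (and defaulted parameters) have to be constructed by hand from Python today, while protobuf fields carry explicit presence and defaults in the message definition itself:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # PySpark Classic only

# Building Scala Option values by hand through the Py4J gateway:
none = spark._jvm.scala.Option.empty()         # Scala's None
some = spark._jvm.scala.Option.apply("where")  # Scala's Some("where")

# With protobuf messages there is no such bridging: proto3 `optional`
# fields expose explicit presence (HasField), and default values are part
# of the message definition rather than of a Scala method signature.
```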
