
Deequ/PyDeequ compatibility with Spark-Connect #221

@SemyonSinchenko

Description

Is your feature request related to a problem? Please describe.
At the moment python-deequ relies on direct calls to the underlying Spark JVM via SparkSession._jvm. But in PySpark Connect, _jvm is not available at all, so python-deequ currently works only with PySpark Classic, not with PySpark Connect. Several Spark Connect users have already reported this incompatibility.
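
For illustration, here is a minimal sketch of the pattern in question (simplified, not pydeequ's actual source):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# PySpark Classic: SparkSession._jvm is a Py4J JVMView, so JVM-side deequ
# classes such as com.amazon.deequ.VerificationSuite can be reached directly.
suite = spark._jvm.com.amazon.deequ.VerificationSuite()

# PySpark Connect: the session is a thin gRPC client with no Py4J gateway;
# it has no _jvm attribute, so the call above is impossible.
```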

Describe the solution you'd like

  • Add to deequ-core a Spark Connect plugin and the corresponding protobuf messages for deequ classes/objects;
  • Add to python-deequ an alternative implementation of the API, built on top of Python classes generated from the protobuf definitions (see the sketch after this list).
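
As a rough sketch of the mechanics: Spark Connect carries plugin payloads as `google.protobuf.Any` fields on its plan messages, which a server-side plugin recognizes by type URL and unpacks. The `StringValue` below is only a stand-in for a message generated from the proposed .proto files:

```python
from google.protobuf import any_pb2
from google.protobuf.wrappers_pb2 import StringValue

# Stand-in for a generated deequ message (e.g. a serialized check definition).
msg = StringValue(value="completeness(id) >= 0.95")

# Client side: pack the message into Any and attach it to the plan's
# extension field.
packed = any_pb2.Any()
packed.Pack(msg)

# Server side (inside the deequ plugin): recognize the payload by its type
# URL, unpack it, and translate it into calls to the JVM deequ classes.
if packed.Is(StringValue.DESCRIPTOR):
    unpacked = StringValue()
    packed.Unpack(unpacked)
```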

Describe alternatives you've considered
At the moment I am not aware of any alternative. Based on the Spark Connect documentation and discussions on the Spark mailing list / Jira, there are no plans to support _jvm in Connect.

Additional context
This is not especially hard to do. I did it recently for educational purposes, and it required:

  • About 350 lines of protobuf code;
  • About 550 lines of Scala code that implements plugin and protobuf parser;
  • Corresponding updates to python-deequ.

I'm willing to do all the required work and to maintain the protobuf code together with a new Python API.

In deequ-core this could be done as a submodule. In python-deequ it could be shipped as an extra: pip install pydeequ would work as it does today, while pip install pydeequ[connect] would additionally install the protobuf dependencies and the classes generated from the .proto files.
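
A hypothetical packaging sketch (setup.py style; package names and version pins are illustrative, not a tested configuration):

```python
from setuptools import setup, find_packages

setup(
    name="pydeequ",
    packages=find_packages(),
    extras_require={
        # `pip install pydeequ[connect]` additionally pulls in the protobuf
        # runtime required by the classes generated from the .proto files.
        "connect": ["protobuf>=4.21", "grpcio"],
    },
)
```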

It seems to me that adapting deequ/python-deequ to Spark Connect can be done without breaking changes. And it not only opens a way to make pydeequ work on Spark Connect, but also creates the potential for other deequ APIs (Spark Connect Go, Spark Connect Rust, etc.). It also opens a way to use deequ from Java/Scala via Spark Connect.

P.S. Using protobuf also makes it possible to decouple Scala from Python entirely and to avoid all the problems with default arguments, Option, etc. And it could be used not only in Connect but in PySpark Classic too.
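
For context, a simplified illustration of the Py4J friction mentioned above: Scala Option values (and defaulted parameters) have to be constructed by hand from Python today, while protobuf fields carry explicit presence and defaults in the message definition itself:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # PySpark Classic only

# Building Scala Option values by hand through the Py4J gateway:
none = spark._jvm.scala.Option.empty()         # Scala's None
some = spark._jvm.scala.Option.apply("where")  # Scala's Some("where")

# With protobuf messages there is no such bridging: proto3 `optional`
# fields expose explicit presence (HasField), and default values are part
# of the message definition rather than of a Scala method signature.
```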
