-
Thanks for starting this discussion. One quick clarification question: do we always have to send a copy of the files to the CU? I thought we were attaching a volume instead.
-
@xuang7, can you chime in based on your recent experience migrating a few full stacks to the platform? I will chime in later.
-
Can we utilize our existing data-processing interface to support this? I.e., treat a binary tuple as a single file, and a table of binary tuples as a directory. I don't like the idea of bypassing the assumption that data always comes from upstream operators. If we allow any operator to read data from the filesystem, any operator can become a source operator and it's no longer a dataflow.
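A minimal sketch of this idea, with illustrative names not taken from the actual SDK: a "directory" flows through the dataflow as a table of `(relative_path, payload)` binary tuples, and a downstream operator materializes the table into a real folder only when a library needs one.

```python
import os
import tempfile

def materialize_tuples(tuples, dest_dir):
    """Write (relative_path, payload) binary tuples into dest_dir,
    recreating the original folder layout."""
    for relative_path, payload in tuples:
        target = os.path.join(dest_dir, relative_path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as f:
            f.write(payload)
    return dest_dir

# Example: a simulated upstream table with two files (placeholder bytes).
table = [
    ("matrix.mtx.gz", b"\x1f\x8b..."),
    ("meta/barcodes.tsv.gz", b"\x1f\x8b..."),
]
out = materialize_tuples(table, tempfile.mkdtemp())
```

This keeps the dataflow assumption intact: the file contents arrive as tuples from an upstream operator, and only the final materialization step touches the local filesystem.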
-
Recently, I have been working on integrating brain-imaging visualization tools into the platform. These tools typically expect a folder containing a specific set of files, such as a manifest, coordinates, cell metadata, and other supporting files. Since we do not currently have a directory-access API, the workaround is essentially a file-by-file presigned-URL approach at the frontend level. For each file, the viewer has to:
This is effectively similar to Design Choice 1, except that it currently happens at the frontend level. It works as a workaround because my use case can lazy-load files on demand, but it may become inefficient when a dataset contains a large number of files. A directory-access API would simplify this significantly and would be very helpful for tools that require the full directory.
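The lazy, file-by-file pattern described above can be sketched as follows. All names here are illustrative (`presign` and `fetch` stand in for the real server round trip and HTTP GET), and the backend is simulated with dictionaries so the sketch is self-contained:

```python
class LazyFolder:
    """Hypothetical sketch of the frontend workaround: each file is
    resolved to a presigned URL only when the viewer first asks for it."""

    def __init__(self, presign, fetch):
        self._presign = presign   # path -> presigned URL (server round trip)
        self._fetch = fetch       # URL -> bytes (HTTP GET)
        self._cache = {}

    def read(self, path):
        if path not in self._cache:   # lazy: touch the server only on demand
            self._cache[path] = self._fetch(self._presign(path))
        return self._cache[path]

# Simulated backend for illustration.
urls = {"manifest.json": "https://example/presigned/manifest.json"}
blobs = {"https://example/presigned/manifest.json": b"{}"}

folder = LazyFolder(urls.__getitem__, blobs.__getitem__)
data = folder.read("manifest.json")
```

The cost model is visible in the sketch: one presign round trip per file on first access, which is fine for lazy viewers but scales poorly when a tool needs every file in a large directory up front.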
-
Following our offline discussion, we've decided to integrate new operators being developed by @aglinxinyuan with the Python UDF. The proposed workflow is as follows:
We initially considered using [...]. Please feel free to add your comments or suggestions on this approach.
-
The Problem
Currently, Texera's `DatasetFileDocument` API allows users to fetch individual files as streams. However, some bioinformatics and data science libraries require a local filesystem directory path to function. For example, R's `read10X(data.dir=...)` and Python's `scanpy.read_10x_mtx()` expect a folder containing a specific set of files (e.g., `matrix.mtx.gz`, `barcodes.tsv.gz`). These functions cannot operate on isolated file streams; they need a physical directory handle.

Root Cause

Texera's backend (LakeFS) stores data as flat objects using path-separator names. Consequently, there is no native "folder" object to pass to a UDF. Our current Python SDK and backend endpoints lack a mechanism to materialize a specific path prefix (a "folder") as a local directory on a computing unit.
Design Choice 1: Client-Side File-by-File Materialization
Introduce a `DatasetFolderDocument` class in the Python SDK that simulates a directory by downloading all objects with a matching prefix.

Workflow: each object under the folder prefix is downloaded into a local `/tmp` directory, recreating the folder layout.

Pros:

Cons:
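Design Choice 1 can be sketched as below. The class name matches the proposal, but the constructor arguments and method names are assumptions, and the flat object store is simulated with a dictionary so the sketch runs standalone:

```python
import os
import tempfile

class DatasetFolderDocument:
    """Sketch of Design Choice 1: client-side, file-by-file materialization
    of a path prefix into a local directory. Illustrative API, not the real SDK."""

    def __init__(self, list_objects, get_object):
        self._list_objects = list_objects   # prefix -> iterable of object keys
        self._get_object = get_object       # key -> bytes

    def materialize(self, prefix):
        local_dir = tempfile.mkdtemp()      # local /tmp staging directory
        for key in self._list_objects(prefix):
            relative = key[len(prefix):].lstrip("/")
            target = os.path.join(local_dir, relative)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "wb") as f:
                f.write(self._get_object(key))
        return local_dir   # a real path, usable by read10X / scanpy.read_10x_mtx

# Flat object store with path-separator names, as LakeFS exposes them.
store = {
    "data/filtered/matrix.mtx.gz": b"...",
    "data/filtered/barcodes.tsv.gz": b"...",
}
doc = DatasetFolderDocument(
    lambda p: [k for k in store if k.startswith(p)],
    store.__getitem__,
)
local = doc.materialize("data/filtered")
```

The key property is that everything happens in the SDK: one GET per object, no backend changes, at the cost of many round trips for large folders.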
Design Choice 2: Server-Side Archiving
Add a backend REST endpoint that accepts a folder path and streams back a single archive (e.g., ZIP or tar) containing the requested files. The user experience is the same as in Design Choice 1.
Workflow:
Pros:
Cons:
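Design Choice 2 can be sketched end to end with the standard-library `zipfile` module. The function names and the dictionary-backed store are illustrative; the real endpoint would stream the archive over HTTP:

```python
import io
import tempfile
import zipfile

def archive_prefix(store, prefix):
    """Server side (sketch): bundle all objects under the prefix
    into a single in-memory ZIP stream."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for key, payload in store.items():
            if key.startswith(prefix):
                zf.writestr(key[len(prefix):].lstrip("/"), payload)
    buf.seek(0)
    return buf

def extract_to_local(stream):
    """Client side (sketch): unpack the archive into a temp directory
    and return its path for directory-based libraries."""
    local_dir = tempfile.mkdtemp()
    with zipfile.ZipFile(stream) as zf:
        zf.extractall(local_dir)
    return local_dir

store = {
    "data/filtered/matrix.mtx.gz": b"...",
    "data/filtered/barcodes.tsv.gz": b"...",
}
local = extract_to_local(archive_prefix(store, "data/filtered"))
```

Compared with Design Choice 1, the client makes a single request per folder, trading per-file round trips for server-side CPU spent building the archive.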