-
Thanks for starting this discussion. One quick clarification question: do we always have to send a copy of the files to the CU? I thought we were attaching a volume instead.
-
@xuang7, can you chime in based on your recent experience migrating a few full stacks to the platform? I will chime in later.
-
Can we utilize our existing data-processing interface to support this? I.e., treat a binary tuple as a single file, and a table of binary tuples as a directory. I don't like the idea of bypassing the assumption that data always comes from upstream operators. If we allow any operator to read data from the filesystem, any operator can become a source operator and it's no longer a dataflow.
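A minimal sketch of this idea, with illustrative names not taken from the actual SDK: a "directory" flows through the dataflow as a table of `(relative_path, payload)` binary tuples, and a downstream operator materializes the table into a real folder only when a library needs one.

```python
import os
import tempfile

def materialize_tuples(tuples, dest_dir):
    """Write (relative_path, payload) binary tuples into dest_dir,
    recreating the original folder layout."""
    for relative_path, payload in tuples:
        target = os.path.join(dest_dir, relative_path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as f:
            f.write(payload)
    return dest_dir

# Example: a simulated upstream table with two files (placeholder bytes).
table = [
    ("matrix.mtx.gz", b"\x1f\x8b..."),
    ("meta/barcodes.tsv.gz", b"\x1f\x8b..."),
]
out = materialize_tuples(table, tempfile.mkdtemp())
```

This keeps the dataflow assumption intact: the file contents arrive as tuples from an upstream operator, and only the final materialization step touches the local filesystem.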
-
Recently, I have been working on integrating brain-imaging visualization tools into the platform. These tools typically expect a folder containing a specific set of files, such as a manifest, coordinates, cell metadata, and other supporting files. Since we do not currently have a directory-access API, the workaround is essentially a file-by-file presigned-URL approach at the frontend level. For each file, the viewer has to:
This is effectively similar to Design Choice 1, except that it currently happens at the frontend level. It works as a workaround because my use case can lazy-load files on demand, but it may become inefficient when a dataset contains a large number of files. A directory-access API would simplify this significantly and would be very helpful for tools that require the full directory.
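The lazy, file-by-file pattern described above can be sketched as follows. All names here are illustrative (`presign` and `fetch` stand in for the real server round trip and HTTP GET), and the backend is simulated with dictionaries so the sketch is self-contained:

```python
class LazyFolder:
    """Hypothetical sketch of the frontend workaround: each file is
    resolved to a presigned URL only when the viewer first asks for it."""

    def __init__(self, presign, fetch):
        self._presign = presign   # path -> presigned URL (server round trip)
        self._fetch = fetch       # URL -> bytes (HTTP GET)
        self._cache = {}

    def read(self, path):
        if path not in self._cache:   # lazy: touch the server only on demand
            self._cache[path] = self._fetch(self._presign(path))
        return self._cache[path]

# Simulated backend for illustration.
urls = {"manifest.json": "https://example/presigned/manifest.json"}
blobs = {"https://example/presigned/manifest.json": b"{}"}

folder = LazyFolder(urls.__getitem__, blobs.__getitem__)
data = folder.read("manifest.json")
```

The cost model is visible in the sketch: one presign round trip per file on first access, which is fine for lazy viewers but scales poorly when a tool needs every file in a large directory up front.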
-
Following our offline discussion, we've decided to integrate new operators being developed by @aglinxinyuan with the Python UDF. The proposed workflow is as follows:
We initially considered using [...]. Please feel free to add your comments or suggestions on this approach.
-
The Problem
Currently, Texera's `DatasetFileDocument` API allows users to fetch individual files as streams. However, some bioinformatics and data science libraries require a local filesystem directory path to function. For example, R's `read10X(data.dir=...)` and Python's `scanpy.read_10x_mtx()` expect a folder containing a specific set of files (e.g., `matrix.mtx.gz`, `barcodes.tsv.gz`). These functions cannot operate on isolated file streams; they need a physical directory handle.

Root Cause

Texera's backend (LakeFS) stores data as flat objects using path-separator names. Consequently, there is no native "folder" object to pass to a UDF. Our current Python SDK and backend endpoints lack a mechanism to materialize a specific path prefix (a "folder") as a local directory on a computing unit.
Design Choice 1: Client-Side File-by-File Materialization
Introduce a `DatasetFolderDocument` class in the Python SDK that simulates a directory by downloading all objects with a matching prefix.

Workflow: each object under the folder prefix is downloaded into a local `/tmp` directory, recreating the folder layout.

Pros:

Cons:
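Design Choice 1 can be sketched as below. The class name matches the proposal, but the constructor arguments and method names are assumptions, and the flat object store is simulated with a dictionary so the sketch runs standalone:

```python
import os
import tempfile

class DatasetFolderDocument:
    """Sketch of Design Choice 1: client-side, file-by-file materialization
    of a path prefix into a local directory. Illustrative API, not the real SDK."""

    def __init__(self, list_objects, get_object):
        self._list_objects = list_objects   # prefix -> iterable of object keys
        self._get_object = get_object       # key -> bytes

    def materialize(self, prefix):
        local_dir = tempfile.mkdtemp()      # local /tmp staging directory
        for key in self._list_objects(prefix):
            relative = key[len(prefix):].lstrip("/")
            target = os.path.join(local_dir, relative)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "wb") as f:
                f.write(self._get_object(key))
        return local_dir   # a real path, usable by read10X / scanpy.read_10x_mtx

# Flat object store with path-separator names, as LakeFS exposes them.
store = {
    "data/filtered/matrix.mtx.gz": b"...",
    "data/filtered/barcodes.tsv.gz": b"...",
}
doc = DatasetFolderDocument(
    lambda p: [k for k in store if k.startswith(p)],
    store.__getitem__,
)
local = doc.materialize("data/filtered")
```

The key property is that everything happens in the SDK: one GET per object, no backend changes, at the cost of many round trips for large folders.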
Design Choice 2: Server-Side Archiving
Add a backend REST endpoint that accepts a folder path and streams back a single archive (e.g., ZIP or tar) containing the requested files. The user experience is the same as in Design Choice 1.
Workflow:
Pros:
Cons:
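Design Choice 2 can be sketched end to end with the standard-library `zipfile` module. The function names and the dictionary-backed store are illustrative; the real endpoint would stream the archive over HTTP:

```python
import io
import tempfile
import zipfile

def archive_prefix(store, prefix):
    """Server side (sketch): bundle all objects under the prefix
    into a single in-memory ZIP stream."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for key, payload in store.items():
            if key.startswith(prefix):
                zf.writestr(key[len(prefix):].lstrip("/"), payload)
    buf.seek(0)
    return buf

def extract_to_local(stream):
    """Client side (sketch): unpack the archive into a temp directory
    and return its path for directory-based libraries."""
    local_dir = tempfile.mkdtemp()
    with zipfile.ZipFile(stream) as zf:
        zf.extractall(local_dir)
    return local_dir

store = {
    "data/filtered/matrix.mtx.gz": b"...",
    "data/filtered/barcodes.tsv.gz": b"...",
}
local = extract_to_local(archive_prefix(store, "data/filtered"))
```

Compared with Design Choice 1, the client makes a single request per folder, trading per-file round trips for server-side CPU spent building the archive.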