
Texera R UDF

R language support for Apache Texera, enabling data-processing workflows written in R.

Installation

Prerequisites

  • R (version 4.5.2)
  • Required R packages (install these specific versions):

# Install tested versions
install.packages("remotes")
remotes::install_version("arrow", version = "22.0.0.1")
remotes::install_version("coro", version = "1.1.0")
remotes::install_version("aws.s3", version = "0.3.22")

Install Plugin

# Install from GitHub
pip install git+https://github.com/kunwp1/texera-rudf.git

# Development install
git clone https://github.com/kunwp1/texera-rudf.git
cd texera-rudf
pip install -e .

Usage

The plugin provides two APIs for processing data in Texera workflows:

Tuple API (Row-by-Row Processing)

Source Operator:

library(coro)
coro::generator(function() {
  yield(list(col1 = "Hello World!", col2 = 1.0, col3 = TRUE))
})

UDF Operator:

library(coro)
coro::generator(function(tuple, port) {
  tuple$col4 <- tuple$col2 * 2
  yield(tuple)
})
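The generator above can be exercised outside Texera with plain coro. This is a minimal sketch under the assumption that the runtime simply calls the generator with a tuple (as a named list) and a port number, then drains it; `udf`, the sample tuple, and `port = 0` are illustrative names, not part of the plugin's API:

```r
library(coro)

# Same logic as the UDF operator above; 'port' is unused in this sketch.
udf <- coro::generator(function(tuple, port) {
  tuple$col4 <- tuple$col2 * 2
  yield(tuple)
})

# Feed one tuple through and collect the yielded results.
tuple <- list(col1 = "Hello World!", col2 = 1.0, col3 = TRUE)
out <- coro::collect(udf(tuple, port = 0))
out[[1]]$col4  # 2
```

Because `coro::generator` returns a factory, each call (`udf(tuple, port = 0)`) produces a fresh iterator, which `coro::collect` drains into a list.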

Table API (Batch Processing)

Source Operator:

function() {
  df <- data.frame(
    col1 = "Hello World!",
    col2 = 1.0,
    col3 = TRUE
  )
  return(df)
}

UDF Operator:

function(table, port) {
  table$col4 <- table$col2 * 2
  return(table)
}
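Since the Table API operators are ordinary R functions over data frames, their logic can be checked locally without a Texera runtime. A minimal sketch (the function and variable names are illustrative, and `port = 0` is an assumed placeholder):

```r
# Same logic as the Table API UDF operator above.
udf <- function(table, port) {
  table$col4 <- table$col2 * 2
  return(table)
}

# Stand-in for an input table arriving from an upstream operator.
df <- data.frame(col1 = "Hello World!", col2 = 1.0, col3 = TRUE)

result <- udf(df, port = 0)
result$col4  # 2
```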

Large Binary Support

Handle large binary objects (images, files, etc.) via S3-compatible storage:

Writing Large Binary:

library(coro)
coro::generator(function() {
  # Create a new large binary object
  lb <- largebinary()
  
  # Write data to it
  stream <- LargeBinaryOutputStream(lb)
  stream$write(charToRaw("Hello, Large Binary World!"))
  stream$close()
  
  yield(list(file_content = lb))
})

Reading Large Binary:

library(coro)
coro::generator(function(tuple, port) {
  # Read from large binary object
  stream <- LargeBinaryInputStream(tuple$file_content)
  data <- stream$read()
  stream$close()
  
  # Convert raw bytes to string
  content <- rawToChar(data)
  
  tuple$content_text <- content
  yield(tuple)
})
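The byte-level conversion used by the reader is plain base R and can be tried on its own, independent of the `largebinary()` and stream objects the plugin provides:

```r
# Round-trip a string through raw bytes, as the reader above does.
bytes <- charToRaw("Hello, Large Binary World!")
content <- rawToChar(bytes)
content  # "Hello, Large Binary World!"
```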

Features

  • Tuple API: Row-by-row processing with R generators
  • Table API: Batch processing with R dataframes
  • Apache Arrow: Efficient data transfer between Python and R
  • Large Binary Support: Handle large objects via S3-compatible storage

Requirements

Tested Versions

This plugin has been tested and verified to work with the following versions:

Python Environment:

  • Python: 3.10, 3.11, 3.12
  • rpy2: 3.5.11
  • rpy2-arrow: 0.0.8

R Environment:

  • R: 4.5.2
  • arrow: 22.0.0.1
  • coro: 1.1.0
  • aws.s3: 0.3.22

Other versions may work but have not been tested and are not guaranteed to be compatible.

License

Licensed under the MIT License. See LICENSE for details.

Contributing

Contributions are welcome! To contribute:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request
