diff --git a/DESCRIPTION b/DESCRIPTION
index 356eafe..7916adc 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,13 +1,15 @@
 Package: EndpointR
 Title: Connects to various Machine Learning inference providers
-Version: 0.2
-Authors@R: 
-    person("Jack", "Penzer", , "Jack.penzer@sharecreative.com", role = c("aut", "cre"))
+Version: 0.2.1
+Authors@R: c(
+    person("Jack", "Penzer", , "Jack.penzer@sharecreative.com", role = c("aut", "cre")),
+    person("Claude", "AI", role = "aut")
+    )
 Description: EndpointR is a 'batteries included', open-source R package for connecting to various APIs for Machine Learning model predictions. EndpointR is built for company-specific use cases, so may not be useful to a wide audience.
 License: MIT + file LICENSE
 Encoding: UTF-8
 Roxygen: list(markdown = TRUE)
-RoxygenNote: 7.3.2
+RoxygenNote: 7.3.3
 Suggests:
     spelling,
     broom,
@@ -33,7 +35,8 @@ Imports:
     tibble,
     S7,
     jsonvalidate,
-    arrow
+    arrow,
+    curl
 VignetteBuilder: knitr
 Depends:
     R (>= 3.5)
diff --git a/NAMESPACE b/NAMESPACE
index 92850c2..4485b08 100644
--- a/NAMESPACE
+++ b/NAMESPACE
@@ -22,6 +22,17 @@ export(hf_get_model_max_length)
 export(hf_perform_request)
 export(json_dump)
 export(json_schema)
+export(oai_batch_build_completions_req)
+export(oai_batch_build_embed_req)
+export(oai_batch_cancel)
+export(oai_batch_list)
+export(oai_batch_parse_completions)
+export(oai_batch_parse_embeddings)
+export(oai_batch_prepare_completions)
+export(oai_batch_prepare_embeddings)
+export(oai_batch_start)
+export(oai_batch_status)
+export(oai_batch_upload)
 export(oai_build_completions_request)
 export(oai_build_completions_request_list)
 export(oai_build_embedding_request)
@@ -32,6 +43,10 @@ export(oai_embed_batch)
 export(oai_embed_chunks)
 export(oai_embed_df)
 export(oai_embed_text)
+export(oai_file_content)
+export(oai_file_delete)
+export(oai_file_list)
+export(oai_file_upload)
 export(perform_requests_with_strategy)
 export(process_response)
 export(safely_from_json)
diff --git a/NEWS.md b/NEWS.md
index 23bd895..865a503 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,4 +1,52 @@
-# EndpointR 0.2
+# EndpointR 0.2.1
+
+## OpenAI Batch API
+
+Adds support for OpenAI's asynchronous Batch API, offering 50% cost savings and higher rate limits compared to synchronous endpoints. Ideal for large-scale embeddings, classifications, and batch inference tasks.
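+
+A typical end-to-end workflow (a minimal sketch; the IDs are illustrative and polling is simplified):
+
+``` r
+df <- data.frame(id = c("doc_1", "doc_2"), text = c("Hello world", "Goodbye world"))
+
+jsonl <- oai_batch_prepare_embeddings(df, text_var = text, id_var = id)
+file_info <- oai_batch_upload(jsonl)                                # upload JSONL to the Files API
+batch <- oai_batch_start(file_info$id, endpoint = "/v1/embeddings") # trigger the batch job
+
+status <- oai_batch_status(batch$id)                                # poll until "completed"
+if (status$status == "completed") {
+  content <- oai_file_content(status$output_file_id)
+  embeddings <- oai_batch_parse_embeddings(content)
+}
+```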
+
+**Request preparation:**
+
+- `oai_batch_build_embed_req()` - Build a single embedding request row
+- `oai_batch_prepare_embeddings()` - Prepare an entire data frame for batch embeddings
+- `oai_batch_build_completions_req()` - Build a single chat completions request row
+- `oai_batch_prepare_completions()` - Prepare an entire data frame for batch completions (supports structured outputs via JSON schema)
+
+**Job management:**
+
+- `oai_batch_upload()` - Upload prepared JSONL to OpenAI Files API
+- `oai_batch_start()` - Trigger a batch job on an uploaded file
+- `oai_batch_status()` - Check the status of a running batch job
+- `oai_batch_list()` - List all batch jobs associated with your API key
+- `oai_batch_cancel()` - Cancel an in-progress batch job
+
+**Results parsing:**
+
+- `oai_batch_parse_embeddings()` - Parse batch embedding results into a tidy data frame
+- `oai_batch_parse_completions()` - Parse batch completion results into a tidy data frame
+
+## OpenAI Files API
+
+- `oai_file_list()` - List files uploaded to the OpenAI Files API
+- `oai_file_content()` - Retrieve the content of a file (e.g., batch results)
+- `oai_file_delete()` - Delete a file from the Files API
+
+# EndpointR 0.2.0
 
 - error message and status propagation improvement. Now writes .error, .error_msg (standardised across package), and .status. Main change is preventing httr2 eating the errors before we can deal with them
 - adds parquet writing to oai_complete_df and oai_embed_df
diff --git a/R/openai_batch_api.R b/R/openai_batch_api.R
new file mode 100644
index 0000000..246e8cf
--- /dev/null
+++ b/R/openai_batch_api.R
@@ -0,0 +1,632 @@
+# embed request building ----
+#' Create a single OpenAI Batch API - Embedding request
+#'
+#' This function prepares a single row of data for the OpenAI Batch/Files APIs, where each row should be valid JSON. The APIs do not guarantee the results will be in the same order, so we need to provide an ID with each request.
+#'
+#' @param input Text input to embed
+#' @param id A custom, unique row ID
+#' @param model The embedding model to use
+#' @param dimensions Number of embedding dimensions (NULL uses model default)
+#' @param method The HTTP request type, usually 'POST'
+#' @param encoding_format Data type of the embedding values
+#' @param endpoint The API endpoint path, e.g. /v1/embeddings
+#'
+#' @returns A single row of JSON
+#'
+#' @export
+#' @examples
+#' \dontrun{
+#' text <- "embed_me"
+#' id <- "id_1"
+#' batch_req <- oai_batch_build_embed_req(text, id)
+#' }
+oai_batch_build_embed_req <- function(input, id, model = "text-embedding-3-small", dimensions = NULL, method = "POST", encoding_format = "float", endpoint = "/v1/embeddings") {
+
+  body <- purrr::compact(
+    # use compact so that if dimensions is NULL it gets dropped from the req
+    list(
+      input = input,
+      model = model,
+      dimensions = dimensions,
+      encoding_format = encoding_format
+    ))
+
+  embed_row <- list(
+    custom_id = id,
+    method = method,
+    url = endpoint,
+    body = body
+  )
+
+  embed_row_json <- jsonlite::toJSON(embed_row,
+                                     auto_unbox = TRUE)
+
+  return(embed_row_json)
+}
+
+#' Prepare a Data Frame for the OpenAI Batch API - Embeddings
+#'
+#' @details Takes an entire data frame and turns each row into a valid line of JSON ready for a .jsonl file upload to the OpenAI Files API + Batch API job trigger.
+#'
+#' Each request must have its own ID, as the Batch API makes no guarantees about the order the results will be returned in.
+#'
+#' To reduce the overall size of the embeddings (at some cost to their expressive power), you can set dimensions lower than the default (which varies by model).
+#'
+#' @param df A data frame containing text to process
+#' @param text_var Name of the column containing input text
+#' @param id_var Name of the column to use as row ID
+#' @inheritParams oai_batch_build_embed_req
+#'
+#' @returns A character string of newline-separated JSON requests
+#'
+#' @export
+#' @examples
+#' \dontrun{
+#' df <- data.frame(
+#'   id = c("doc_1", "doc_2", "doc_3"),
+#'   text = c("Hello world", "Embedding text", "Another document")
+#' )
+#' jsonl_content <- oai_batch_prepare_embeddings(df, text_var = text, id_var = id)
+#' }
+oai_batch_prepare_embeddings <- function(df, text_var, id_var, model = "text-embedding-3-small", dimensions = NULL, method = "POST", encoding_format = "float", endpoint = "/v1/embeddings") {
+
+  text_sym <- rlang::ensym(text_var)
+  id_sym <- rlang::ensym(id_var)
+
+  .texts <- dplyr::pull(df, !!text_sym)
+  .ids <- dplyr::pull(df, !!id_sym)
+
+  if (!.validate_batch_inputs(.ids, .texts)) {
+    return("")
+  }
+
+  reqs <- purrr::map2_chr(.texts, .ids, \(x, y) {
+    oai_batch_build_embed_req(
+      input = x,
+      id = as.character(y),
+      model = model,
+      dimensions = dimensions,
+      method = method,
+      encoding_format = encoding_format,
+      endpoint = endpoint
+    )
+  })
+
+  reqs <- paste0(reqs, collapse = "\n")
+
+  return(reqs)
+}
+
+
+#' Create a Single OpenAI Batch API - Chat Completions Request
+#'
+#' This function prepares a single row of data for the OpenAI Batch/Files APIs,
+#' where each row should be valid JSON. The APIs do not guarantee the results
+#' will be in the same order, so we need to provide an ID with each request.
+#'
+#' @param input Text input (user message) for the completion
+#' @param id A custom, unique row ID
+#' @param model The chat completion model to use
+#' @param system_prompt Optional system prompt to guide the model's behaviour
+#' @param temperature Sampling temperature (0 = deterministic, higher = more random)
+#' @param max_tokens Maximum number of tokens to generate
+#' @param schema Optional JSON schema for structured output (json_schema object or list)
+#' @param method The HTTP request type, usually 'POST'
+#' @param endpoint The API endpoint path, e.g. /v1/chat/completions
+#'
+#' @returns A row of JSON suitable for the Batch API
+#'
+#' @export
+#' @examples
+#' \dontrun{
+#' req <- oai_batch_build_completions_req(
+#'   input = "What is the capital of France?",
+#'   id = "query_1",
+#'   model = "gpt-4o-mini",
+#'   temperature = 0
+#' )
+#' }
+oai_batch_build_completions_req <- function(input, id, model = "gpt-4o-mini", system_prompt = NULL, temperature = 0, max_tokens = 500L, schema = NULL, method = "POST", endpoint = "/v1/chat/completions") {
+
+  messages <- list()
+
+  if (!is.null(system_prompt)) {
+    messages <- append(messages, list(list(role = "system", content = system_prompt)))
+  }
+
+  messages <- append(messages, list(list(role = "user", content = input)))
+
+  body <- list(
+    model = model,
+    messages = messages,
+    temperature = temperature,
+    max_tokens = max_tokens
+  )
+
+  if (!is.null(schema)) {
+    if (S7::S7_inherits(schema, json_schema)) {
+      body$response_format <- json_dump(schema)
+    } else if (is.list(schema)) {
+      body$response_format <- schema
+    }
+  }
+
+  req_row <- list(
+    custom_id = as.character(id),
+    method = method,
+    url = endpoint,
+    body = body
+  )
+
+  jsonlite::toJSON(req_row, auto_unbox = TRUE)
+}
+
+#' Prepare a Data Frame for the OpenAI Batch API - Chat Completions
+#'
+#' @description Takes an entire data frame and turns each row into a valid line
+#' of JSON ready for a .jsonl file upload to the OpenAI Files API + Batch API
+#' job trigger.
+#'
+#' @details Each request must have its own ID, as the Batch API makes no
+#' guarantees about the order the results will be returned in.
+#'
+#' @param df A data frame containing text to process
+#' @param text_var Name of the column containing input text
+#' @param id_var Name of the column to use as row ID
+#' @inheritParams oai_batch_build_completions_req
+#'
+#' @returns A character string of newline-separated JSON requests
+#'
+#' @export
+#' @examples
+#' \dontrun{
+#' df <- data.frame(
+#'   id = c("q1", "q2"),
+#'   prompt = c("What is 2+2?", "Explain gravity briefly.")
+#' )
+#' jsonl_content <- oai_batch_prepare_completions(
+#'   df,
+#'   text_var = prompt,
+#'   id_var = id,
+#'   system_prompt = "You are a helpful assistant."
+#' )
+#' }
+oai_batch_prepare_completions <- function(df, text_var, id_var, model = "gpt-4o-mini", system_prompt = NULL, temperature = 0, max_tokens = 500L, schema = NULL, method = "POST", endpoint = "/v1/chat/completions") {
+
+  text_sym <- rlang::ensym(text_var)
+  id_sym <- rlang::ensym(id_var)
+
+  .texts <- dplyr::pull(df, !!text_sym)
+  .ids <- dplyr::pull(df, !!id_sym)
+
+  if (!.validate_batch_inputs(.ids, .texts)) {
+    return("")
+  }
+
+  ## pre-process schema once if S7 object to avoid repeated json_dump() calls
+  if (!is.null(schema) && S7::S7_inherits(schema, json_schema)) {
+    schema <- json_dump(schema)
+  }
+
+  reqs <- purrr::map2_chr(.texts, .ids, \(x, y) {
+    oai_batch_build_completions_req(
+      input = x,
+      id = as.character(y),
+      model = model,
+      system_prompt = system_prompt,
+      temperature = temperature,
+      max_tokens = max_tokens,
+      schema = schema,
+      method = method,
+      endpoint = endpoint
+    )
+  })
+
+  return(paste0(reqs, collapse = "\n"))
+}
+
+
+#' Prepare and upload a file to the OpenAI Batch API
+#'
+#'
+#'
+#' @param jsonl_rows Rows of valid JSON, output of an oai_batch_prepare* function
+#' @param purpose The intended purpose of the uploaded file. Must be one of
+#'   "batch", "fine-tune", "assistants", "vision", "user_data", or "evals".
+#' @param key_name Name of the environment variable containing your API key
+#' @param endpoint_url OpenAI API endpoint URL (default: OpenAI's Files API V1)
+#'
+#' @returns Metadata for an upload to the OpenAI Files API
+#'
+#' @export
+#' @seealso `oai_file_upload()`, `oai_file_list()`
+#' @examples
+#' \dontrun{
+#' df <- data.frame(
+#'   id = c("doc_1", "doc_2"),
+#'   text = c("Hello world", "Goodbye world")
+#' )
+#' jsonl_content <- oai_batch_prepare_embeddings(df, text_var = text, id_var = id)
+#' file_info <- oai_batch_upload(jsonl_content)
+#' file_info$id # Use this ID to create a batch job
+#' }
+oai_batch_upload <- function(jsonl_rows, purpose = c("batch", "fine-tune", "assistants", "vision", "user_data", "evals"), key_name = "OPENAI_API_KEY", endpoint_url = "https://api.openai.com/v1/files") {
+
+  purpose <- match.arg(purpose)
+
+  .tmp <- tempfile(fileext = ".jsonl")
+  on.exit(unlink(.tmp)) # remove the temp file when the function exits, even on error
+  writeLines(jsonl_rows, .tmp) # send the content to the temp file for uploading to OAI
+  # question here is whether to also save this somewhere by force...
+  # once OAI have the file it's retained for 30 days.
+
+  result <- oai_file_upload(
+    file = .tmp,
+    purpose = purpose,
+    key_name = key_name,
+    endpoint_url = endpoint_url
+  )
+
+
+  return(result)
+}
+
+# batch job management ----
+#' Trigger a batch job to run on an uploaded file
+#'
+#' @details Once a file has been uploaded to the OpenAI Files API it's necessary to trigger the batch job. This ensures that your file is processed, and that processing is finalised within OpenAI's 24-hour completion window.
+#'
+#' It's important to choose the right endpoint. If processing should be done by the Completions API, be sure to route to /v1/chat/completions, and this must match each row in your uploaded file.
+#'
+#' Batch job IDs start with "batch_"; you'll receive an error from the API if you try to check batch status with a Files API file ID (the Files/Batch API split is a little clumsy).
+#'
+#' @param file_id File ID returned by oai_batch_upload()
+#' @param endpoint The API endpoint path, e.g. /v1/embeddings
+#' @param completion_window Time window for batch completion (OpenAI guarantees 24h only)
+#' @param metadata Optional list of metadata to tag the batch with
+#' @inheritParams oai_batch_upload
+#'
+#' @returns Metadata about an OpenAI Batch Job, including the batch ID
+#'
+#' @export
+#' @examples
+#' \dontrun{
+#' # After uploading a file with oai_batch_upload()
+#' batch_job <- oai_batch_start(
+#'   file_id = "file-abc123",
+#'   endpoint = "/v1/embeddings"
+#' )
+#' batch_job$id # Use this to check status later
+#' }
+oai_batch_start <- function(file_id, endpoint = c("/v1/embeddings", "/v1/chat/completions"), completion_window = "24h", metadata = NULL, key_name = "OPENAI_API_KEY") {
+
+  endpoint <- match.arg(endpoint)
+  api_key <- get_api_key(key_name)
+
+  body <- list(
+    input_file_id = file_id,
+    endpoint = endpoint,
+    completion_window = completion_window
+  )
+
+  if (!is.null(metadata)) {
+    body$metadata <- metadata
+  }
+
+  httr2::request("https://api.openai.com/v1/batches") |>
+    httr2::req_auth_bearer_token(api_key) |>
+    httr2::req_body_json(body) |>
+    httr2::req_error(is_error = ~ FALSE) |>
+    httr2::req_perform() |>
+    httr2::resp_body_json()
+}
+
+#' Check the Status of a Batch Job on the OpenAI Batch API
+#'
+#' @param batch_id Batch identifier (starts with 'batch_'), returned by oai_batch_start()
+#' @inheritParams oai_batch_upload
+#'
+#' @returns Metadata about an OpenAI Batch API Job, including status, error_file_id, output_file_id, input_file_id etc.
+#'
+#' @export
+#' @examples
+#' \dontrun{
+#' status <- oai_batch_status("batch_abc123")
+#' status$status # e.g., "completed", "in_progress", "failed"
+#' status$output_file_id # File ID for results when completed
+#' }
+oai_batch_status <- function(batch_id, key_name = "OPENAI_API_KEY") {
+
+  api_key <- get_api_key(key_name)
+
+  httr2::request(paste0("https://api.openai.com/v1/batches/", batch_id)) |>
+    httr2::req_auth_bearer_token(api_key) |>
+    httr2::req_error(is_error = ~ FALSE) |>
+    httr2::req_perform() |>
+    httr2::resp_body_json()
+}
+
+#' List Batch Jobs on the OpenAI Batch API
+#'
+#' Retrieve a paginated list of batch jobs associated with your API key.
+#'
+#' @param limit Maximum number of batch jobs to return
+#' @param after Cursor for pagination; batch ID to start after
+#' @inheritParams oai_batch_upload
+#'
+#' @returns A list containing batch job metadata and pagination information
+#'
+#' @export
+#' @examples
+#' \dontrun{
+#' # List recent batch jobs
+#' batches <- oai_batch_list(limit = 10)
+#'
+#' # Paginate through results
+#' next_page <- oai_batch_list(after = batches$last_id)
+#' }
+oai_batch_list <- function(limit = 20L, after = NULL, key_name = "OPENAI_API_KEY") {
+
+  api_key <- get_api_key(key_name)
+
+  req <- httr2::request("https://api.openai.com/v1/batches") |>
+    httr2::req_auth_bearer_token(api_key) |>
+    httr2::req_url_query(limit = limit)
+
+  if (!is.null(after)) {
+    req <- httr2::req_url_query(req, after = after)
+  }
+
+  req |>
+    httr2::req_error(is_error = ~ FALSE) |>
+    httr2::req_perform() |>
+    httr2::resp_body_json()
+}
+
+#' Cancel a Running Batch Job on the OpenAI Batch API
+#'
+#' Cancels an in-progress batch job. The batch will stop processing new
+#' requests, but requests already being processed may still complete.
+#'
+#' @inheritParams oai_batch_status
+#' @inheritParams oai_batch_upload
+#'
+#' @returns Metadata about the cancelled batch job
+#'
+#' @export
+#' @examples
+#' \dontrun{
+#' # Cancel a batch job that's taking too long
+#' cancelled <- oai_batch_cancel("batch_abc123")
+#' cancelled$status # Will be "cancelling" or "cancelled"
+#' }
+oai_batch_cancel <- function(batch_id, key_name = "OPENAI_API_KEY") {
+
+  api_key <- get_api_key(key_name)
+
+  httr2::request(paste0("https://api.openai.com/v1/batches/", batch_id, "/cancel")) |>
+    httr2::req_auth_bearer_token(api_key) |>
+    httr2::req_method("POST") |>
+    httr2::req_error(is_error = ~ FALSE) |>
+    httr2::req_perform() |>
+    httr2::resp_body_json()
+}
+
+
+# results parsing ----
+#' Parse an Embeddings Batch Job into a Data Frame
+#'
+#' Parses the JSONL content returned from a completed embeddings batch job
+#' and converts it into a tidy data frame with one row per embedding.
+#'
+#' @param content Character string of JSONL content from the batch output file
+#' @param original_df Optional original data frame to rename custom_id column
+#' @param id_var If original_df provided, the column name to rename custom_id to
+#'
+#' @returns A tibble with custom_id (or renamed), .error, .error_msg, and
+#'   embedding dimensions (V1, V2, ..., Vn)
+#'
+#' @export
+#' @examples
+#' \dontrun{
+#' # After downloading batch results with oai_file_content()
+#' content <- oai_file_content(status$output_file_id)
+#' embeddings_df <- oai_batch_parse_embeddings(content)
+#'
+#' # Optionally rename the ID column to match original data
+#' embeddings_df <- oai_batch_parse_embeddings(
+#'   content,
+#'   original_df = my_df,
+#'   id_var = doc_id
+#' )
+#' }
+oai_batch_parse_embeddings <- function(content, original_df = NULL, id_var = NULL) {
+
+  lines <- strsplit(content, "\n")[[1]]
+  lines <- lines[nchar(lines) > 0]
+
+  if (length(lines) == 0) {
+    return(tibble::tibble(
+      custom_id = character(),
+      .error = logical(),
+      .error_msg = character()
+    ))
+  }
+
+  parsed <- purrr::imap(lines, \(line, idx) {
+    tryCatch(
+      jsonlite::fromJSON(line, simplifyVector = FALSE),
+      error = function(e) {
+        list(
+          custom_id = paste0("__PARSE_ERROR_LINE_", idx),
+          error = list(message = paste("Failed to parse JSONL line", idx, ":", conditionMessage(e)))
+        )
+      }
+    )
+  })
+
+  results <- purrr::map(parsed, function(item) {
+    custom_id <- item$custom_id
+
+    if (!is.null(item$error)) {
+      return(tibble::tibble(
+        custom_id = custom_id,
+        .error = TRUE,
+        .error_msg = item$error$message %||% "Unknown error"
+      ))
+    }
+
+    embedding <- purrr::pluck(item, "response", "body", "data", 1, "embedding", .default = NULL)
+
+    if (is.null(embedding)) {
+      return(tibble::tibble(
+        custom_id = custom_id,
+        .error = TRUE,
+        .error_msg = "No embedding found in response"
+      ))
+    }
+
+    embed_tibble <- embedding |>
+      as.list() |>
+      stats::setNames(paste0("V", seq_along(embedding))) |>
+      tibble::as_tibble()
+
+    tibble::tibble(
+      custom_id = custom_id,
+      .error = FALSE,
+      .error_msg = NA_character_
+    ) |>
+      dplyr::bind_cols(embed_tibble)
+  })
+
+  result <- purrr::list_rbind(results)
+
+  if (!is.null(original_df) && !is.null(id_var)) {
+    id_sym <- rlang::ensym(id_var)
+    id_col_name <- rlang::as_name(id_sym)
+    result <- result |>
+      dplyr::rename(!!id_col_name := custom_id)
+  }
+
+  return(result)
+}
+
+#' Parse a Completions Batch Job into a Data Frame
+#'
+#' Parses the JSONL content returned from a completed chat completions batch
+#' job and converts it into a tidy data frame with one row per response.
+#'
+#' @inheritParams oai_batch_parse_embeddings
+#'
+#' @returns A tibble with custom_id (or renamed), content, .error, and .error_msg
+#'
+#' @export
+#' @examples
+#' \dontrun{
+#' # After downloading batch results with oai_file_content()
+#' content <- oai_file_content(status$output_file_id)
+#' completions_df <- oai_batch_parse_completions(content)
+#'
+#' # Optionally rename the ID column to match original data
+#' completions_df <- oai_batch_parse_completions(
+#'   content,
+#'   original_df = my_df,
+#'   id_var = query_id
+#' )
+#' }
+oai_batch_parse_completions <- function(content, original_df = NULL, id_var = NULL) {
+
+  lines <- strsplit(content, "\n")[[1]]
+  lines <- lines[nchar(lines) > 0]
+
+  if (length(lines) == 0) {
+    return(tibble::tibble(
+      custom_id = character(),
+      content = character(),
+      .error = logical(),
+      .error_msg = character()
+    ))
+  }
+
+  parsed <- purrr::imap(lines, \(line, idx) {
+    tryCatch(
+      jsonlite::fromJSON(line, simplifyVector = FALSE),
+      error = function(e) {
+        list(
+          custom_id = paste0("__PARSE_ERROR_LINE_", idx),
+          error = list(message = paste("Failed to parse JSONL line", idx, ":", conditionMessage(e)))
+        )
+      }
+    )
+  })
+
+  results <- purrr::map(parsed, function(item) {
+    custom_id <- item$custom_id
+
+    if (!is.null(item$error)) {
+      return(tibble::tibble(
+        custom_id = custom_id,
+        content = NA_character_,
+        .error = TRUE,
+        .error_msg = item$error$message %||% "Unknown error"
+      ))
+    }
+
+    response_content <- purrr::pluck(
+      item, "response", "body", "choices", 1, "message", "content",
+      .default = NA_character_
+    )
+
+    tibble::tibble(
+      custom_id = custom_id,
+      content = response_content,
+      .error = FALSE,
+      .error_msg = NA_character_
+    )
+  })
+
+  result <- purrr::list_rbind(results)
+
+  if (!is.null(original_df) && !is.null(id_var)) {
+    id_sym <- rlang::ensym(id_var)
+    id_col_name <- rlang::as_name(id_sym)
+    result <- result |>
+      dplyr::rename(!!id_col_name := custom_id)
+  }
+
+  return(result)
+}
+
+
+# internal/helpers ----
+#' @keywords internal
+.validate_batch_inputs <- function(.ids, .texts, max_requests = 50000) {
+  n_requests <- length(.texts)
+
+  if (n_requests == 0) {
+    cli::cli_warn("Input is empty. Returning empty JSONL string.")
+    return(FALSE)
+  }
+
+  if (anyDuplicated(.ids)) {
+    duplicated_ids <- unique(.ids[duplicated(.ids)])
+    cli::cli_abort(c(
+      "custom_id values must be unique within a batch",
+      "x" = "Found {length(duplicated_ids)} duplicate ID{?s}: {.val {head(duplicated_ids, 3)}}"
+    ))
+  }
+
+  if (n_requests > max_requests) {
+    cli::cli_abort(c(
+      "OpenAI Batch API supports maximum {max_requests} requests per batch",
+      "x" = "Attempting to create {n_requests} requests",
+      "i" = "Consider splitting your data into multiple batches"
+    ))
+  }
+
+  if (n_requests > 10000) {
+    cli::cli_alert_info("Large batch with {n_requests} requests - processing may take significant time")
+  }
+
+  return(TRUE)
+}
\ No newline at end of file
diff --git a/R/openai_files_api.R b/R/openai_files_api.R
new file mode 100644
index 0000000..fb03bb8
--- /dev/null
+++ b/R/openai_files_api.R
@@ -0,0 +1,162 @@
+#' List Files on the OpenAI Files API
+#'
+#' Retrieve a list of files that have been uploaded to the OpenAI Files API,
+#' filtered by purpose. Files are retained for 30 days after upload.
+#'
+#' @param purpose The intended purpose of the uploaded file. Must be one of
+#'   "batch", "fine-tune", "assistants", "vision", "user_data", or "evals".
+#' @param key_name Name of the environment variable containing your API key
+#'
+#' @returns A list containing file metadata and pagination information. Each
+#'   file entry includes id, filename, purpose, bytes, created_at, and status.
+#'
+#' @export
+#' @seealso [oai_file_content()] to retrieve file contents,
+#'   [oai_file_delete()] to remove files,
+#'   [oai_batch_upload()] to upload batch files
+#' @examples
+#' \dontrun{
+#' # List all batch files
+#' batch_files <- oai_file_list(purpose = "batch")
+#'
+#' # List fine-tuning files
+#' ft_files <- oai_file_list(purpose = "fine-tune")
+#'
+#' # Access file IDs
+#' file_ids <- purrr::map_chr(batch_files$data, "id")
+#' }
+oai_file_list <- function(purpose = c("batch", "fine-tune", "assistants", "vision", "user_data", "evals"), key_name = "OPENAI_API_KEY") {
+
+  purpose <- match.arg(purpose)
+  api_key <- get_api_key(key_name)
+
+  httr2::request("https://api.openai.com/v1/files") |>
+    httr2::req_auth_bearer_token(api_key) |>
+    httr2::req_url_query(purpose = purpose) |>
+    httr2::req_error(is_error = ~ FALSE) |>
+    httr2::req_perform() |>
+    httr2::resp_body_json()
+
+}
+
+
+#' Upload a file to the OpenAI Files API
+#'
+#' @param file Path to the file you wish to upload
+#' @param purpose The intended purpose of the uploaded file. Must be one of
+#'   "batch", "fine-tune", "assistants", "vision", "user_data", or "evals".
+#' @param key_name Name of the environment variable containing your API key
+#' @param endpoint_url OpenAI API endpoint URL (default: OpenAI's Files API V1)
+#'
+#' @returns File upload status and metadata including id, purpose, filename, created_at etc.
+#' @seealso \url{https://platform.openai.com/docs/api-reference/files?lang=curl}
+#' @export
+#' @examples
+#' \dontrun{
+#' tmp <- tempfile(fileext = ".jsonl")
+#' writeLines("Hello!", tmp)
+#' oai_file_upload(
+#'   file = tmp,
+#'   purpose = "user_data"
+#' )
+#'
+#' }
+oai_file_upload <- function(file, purpose = c("batch", "fine-tune", "assistants", "vision", "user_data", "evals"), key_name = "OPENAI_API_KEY", endpoint_url = "https://api.openai.com/v1/files") {
+
+  api_key <- get_api_key(key_name)
+  purpose <- match.arg(purpose)
+  stopifnot("`file` must be a path to an existing file" = is.character(file) && file.exists(file))
+
+  resp <- httr2::request(base_url = endpoint_url) |>
+    httr2::req_auth_bearer_token(api_key) |>
+    httr2::req_body_multipart(file = curl::form_file(file), purpose = purpose) |> # use `req_body_multipart` instead of `req_body_file` to send 'purpose' with file
+    httr2::req_error(is_error = ~ FALSE) |> # let errors from providers surface rather than be caught by httr2. v.helpful for developing prompts/schemas and debugging APIs
+    httr2::req_perform()
+
+  result <- httr2::resp_body_json(resp)
+
+  if (httr2::resp_status(resp) >= 400) {
+    error_msg <- result$error$message %||% "Unknown error"
+    cli::cli_abort(c(
+      "Failed to upload file to OpenAI Files API",
+      "x" = error_msg
+    ))
+  }
+
+  return(result)
+}
+
+
+#' Delete a File from the OpenAI Files API
+#'
+#' Permanently deletes a file from the OpenAI Files API. This action cannot
+#' be undone. Note that files associated with active batch jobs cannot be
+#' deleted until the job completes.
+#'
+#' @param file_id File identifier (starts with 'file-'), returned by
+#'   [oai_batch_upload()] or [oai_file_list()]
+#' @param key_name Name of the environment variable containing your API key
+#'
+#' @returns A list containing the file id, object type, and deletion status
+#'   (deleted = TRUE/FALSE)
+#'
+#' @export
+#' @seealso [oai_file_list()] to find file IDs,
+#'   [oai_file_content()] to retrieve file contents before deletion
+#' @examples
+#' \dontrun{
+#' # Delete a specific file
+#' result <- oai_file_delete("file-abc123")
+#' result$deleted # TRUE if successful
+#'
+#' }
+oai_file_delete <- function(file_id, key_name = "OPENAI_API_KEY") {
+
+  api_key <- get_api_key(key_name)
+
+  httr2::request(paste0("https://api.openai.com/v1/files/", file_id)) |>
+    httr2::req_auth_bearer_token(api_key) |>
+    httr2::req_method("DELETE") |>
+    httr2::req_error(is_error = ~ FALSE) |>
+    httr2::req_perform() |>
+    httr2::resp_body_json()
+}
+
+#' Retrieve Content from a File on the OpenAI Files API
+#'
+#' Downloads and returns the content of a file stored on the OpenAI Files API.
+#' For batch job outputs, this returns JSONL content that can be parsed with
+#' [oai_batch_parse_embeddings()] or [oai_batch_parse_completions()].
+#'
+#' @param file_id File identifier (starts with 'file-'), typically the
+#'   output_file_id from [oai_batch_status()]
+#' @param key_name Name of the environment variable containing your API key
+#'
+#' @returns A character string containing the file contents. For batch outputs,
+#'   this is JSONL format (one JSON object per line).
+#'
+#' @export
+#' @seealso [oai_batch_status()] to get output_file_id from completed batches,
+#'   [oai_batch_parse_embeddings()] and [oai_batch_parse_completions()] to
+#'   parse batch results
+#' @examples
+#' \dontrun{
+#' # Get batch job status and download results
+#' status <- oai_batch_status("batch_abc123")
+#'
+#' if (status$status == "completed") {
+#'   content <- oai_file_content(status$output_file_id)
+#'   results <- oai_batch_parse_embeddings(content)
+#' }
+#' }
+oai_file_content <- function(file_id, key_name = "OPENAI_API_KEY") {
+
+  api_key <- get_api_key(key_name)
+
+  resp <- httr2::request(paste0("https://api.openai.com/v1/files/", file_id, "/content")) |>
+    httr2::req_auth_bearer_token(api_key) |>
+    httr2::req_error(is_error = ~ FALSE) |>
+    httr2::req_perform()
+
+  httr2::resp_body_string(resp)
+}
diff --git a/R/zzz.R b/R/zzz.R
index a925ff3..e6dfe96 100644
--- a/R/zzz.R
+++ b/R/zzz.R
@@ -1,5 +1,5 @@
 utils::globalVariables(c(".embeddings", ".request", ".response", ".row_num", ".data", ".error",
-                         ".error_msg", ".status", "original_index", "text", ":=", ".row_id", "id", "label", "score", "verbose"))
+                         ".error_msg", ".status", "original_index", "text", ":=", ".row_id", "id", "label", "score", "verbose", "custom_id"))
 
 .onLoad <- function(...) {
   S7::methods_register()
diff --git a/README.Rmd b/README.Rmd
index d946ed1..6c96856 100644
--- a/README.Rmd
+++ b/README.Rmd
@@ -272,12 +272,16 @@
 Read the [LLM Providers Vignette](articles/llm_providers.html), and the [Structured Outputs Vignette](articles/structured_outputs_json_schema.html) for more information on common workflows with the OpenAI Chat Completions API [^1]
 
-[^1]: Content pending implementation for Anthroic Messages API, Gemini API, and OpenAI Responses API
+[^1]: Content pending implementation for Anthropic Messages API, Gemini API, and OpenAI Responses API
 
 
 # API Key Security
 
 - Read the [httr2 vignette](https://httr2.r-lib.org/articles/wrapping-apis.html#basics){target="_blank"} on managing your API keys securely and encrypting them.
+- Read the [EndpointR API Keys](articles/api_keys.html) vignette for information on which API keys you need for each endpoint we support, and how to securely import those API keys into your .Renviron file.
+
+# Batch Jobs
+
+- Read the [EndpointR vignette](articles/sync_async.html) on Synchronous vs Asynchronous APIs
 
-- Read the [EndpointR API Keys](articles/api_keys.html) vignette for information on which API keys you need for wach endpoint we support, and how to securely import those API keys into your .Renvironfile.
 
 ---
diff --git a/README.md b/README.md
index c61f480..4ad6404 100644
--- a/README.md
+++ b/README.md
@@ -291,9 +291,16 @@ information on common workflows with the OpenAI Chat Completions API
   and encrypting them.
 
 - Read the [EndpointR API Keys](articles/api_keys.html) vignette for
-  information on which API keys you need for wach endpoint we support,
+  information on which API keys you need for each endpoint we support,
   and how to securely import those API keys into your .Renvironfile.
+# Batch Jobs
+
+- Read the [EndpointR vignette](articles/sync_async.html) on Synchronous
+  vs Asynchronous APIs
+
+[^1]: Content pending implementation for Anthropic Messages API, Gemini
+  API, and OpenAI Responses API
 
------------------------------------------------------------------------
 
 SAMY Data Science
diff --git a/_pkgdown.yml b/_pkgdown.yml
index 89baf89..d173353 100644
--- a/_pkgdown.yml
+++ b/_pkgdown.yml
@@ -66,14 +66,9 @@ navbar:
       - text: Advanced Topics
      - text: Improving Performance
        href: articles/improving_performance.html
-  news:
-    text: Changelog
-    href: news/index.html
+      - text: Synchronous vs Asynchronous (Batch) APIs
+        href: articles/sync_async.html
 
-footer:
-  structure:
-    left: developed_by
-    right: built_with
 
 reference:
 - title: "Getting Started"
@@ -149,6 +144,30 @@
   - json_dump
   - validate_response
 
+- title: "OpenAI Files API"
+  desc: "Functions for uploading and managing files on OpenAI's Files API"
+  contents:
+  - oai_file_list
+  - oai_file_upload
+  - oai_file_delete
+  - oai_file_content
+
+- title: "OpenAI Batch API"
+  desc: "Functions for managing batches on OpenAI's Batch API"
+  contents:
+  - oai_batch_upload
+  - oai_batch_start
+  - oai_batch_status
+  - oai_batch_list
+  - oai_batch_cancel
+  - oai_batch_build_embed_req
+  - oai_batch_prepare_embeddings
+  - oai_batch_parse_embeddings
+  - oai_batch_build_completions_req
+  - oai_batch_prepare_completions
+  - oai_batch_parse_completions
+
+
 - title: "Schema Builders"
   desc: "Helper functions for creating different types of JSON schema properties"
   contents:
@@ -184,6 +203,8 @@
 authors:
   Jack Penzer:
     href: https://github.com/jpcompartir
+  Claude:
+    href: https://claude.ai
 
 repo:
   url:
@@ -195,6 +216,10 @@
 development:
 news:
   releases:
+  - text: "Version 0.2.1"
+    href: news/index.html#endpointr-021
+  - text: "Version 0.2.0"
+    href: news/index.html#endpointr-020
   - text: "Version 0.2"
     href: news/index.html#endpointr-012
   - text: "Version 0.1.2"
diff --git a/data/df_embeddings_hf.rda b/data/df_embeddings_hf.rda
index 66a4a80..6cfcdd5 100644
Binary files a/data/df_embeddings_hf.rda and b/data/df_embeddings_hf.rda differ
diff --git a/data/df_sentiment_classification_example.rda b/data/df_sentiment_classification_example.rda
index 8bcb1a4..2b533f6 100644
Binary files a/data/df_sentiment_classification_example.rda and b/data/df_sentiment_classification_example.rda differ
diff --git a/dev_docs/01_integrations.qmd b/dev_docs/01_integrations.qmd
index f85a047..bbcb769 100644
--- a/dev_docs/01_integrations.qmd
+++ b/dev_docs/01_integrations.qmd
@@ -103,6 +103,10 @@ oai_complete_good_auth <- oai_complete_df(
 oai_complete_good_auth
 ```
 
+## Batch API
+
+TODO:
+
 # Hugging Face
 
 ## hf embed
diff --git a/dev_docs/initial_release.qmd b/dev_docs/initial_release.qmd
index 38752e9..105a717 100644
--- a/dev_docs/initial_release.qmd
+++ b/dev_docs/initial_release.qmd
@@ -495,3 +495,4 @@ How much information to print? From {httr}'s docs:
 \> This is a wrapper around req_verbose() that uses an integer to control verbosity: - 0: no output - 1: show headers - 2: show headers and bodies - 3: show headers, bodies, and curl status messages
 
 You can also pass in a value for 'path', which will save the response to a file, we'll look more at how to manage this later.
+
diff --git a/dev_docs/openai_batch_api.qmd b/dev_docs/openai_batch_api.qmd
new file mode 100644
index 0000000..8d1a733
--- /dev/null
+++ b/dev_docs/openai_batch_api.qmd
@@ -0,0 +1,324 @@
+---
+title: "openai_batch_api"
+format: html
+---
+
+So... we could actually re-use some of the logic from oai_embed.R / oai_classify.R. There are some small differences, e.g. we feed a stub of the endpoint URL to each request, and then we upload the batch of inputs as a file directly to the files API, then create the batch.
+
+```{r}
+embed_req <- oai_build_embedding_request("xx", dimensions = 324)
+
+body <- embed_req$body$data
+
+row <- list(
+  custom_id = "xx",
+  method = "POST",
+  url = "/v1/embeddings",
+  body = embed_req$body$data
+)
+```
+
+Initial thought was to just stream_out, but stream_out expects a data frame as input. Which we can use further down the line, but not here.
+
+```{r}
+jsonlite::stream_out(row, con = stdout())
+
+tib_w_row <- tibble::tibble(rows = list(row))
+
+jsonlite::stream_out(tib_w_row, con = file("test_dir/jsonl_outputs/batch_api_test.jsonl"))
+
+read_in <- jsonlite::stream_in(con = file("test_dir/jsonl_outputs/batch_api_test.jsonl")) |> jsonlite::toJSON()
+read_in["rows"]
+
+tibble::tibble(x = 1:10^5) |>
+  chunk_dataframe(chunk_size = 80000)
+```
+
+So instead, we want to take the row, convert it to JSON with `auto_unbox = TRUE` and then writeLines to a `.jsonl` file. Recall that `auto_unbox` just stops each k:v pair's value being treated as a list
+
+```{r}
+# stream in/out not the right way
+jsonlite::toJSON(row, auto_unbox = TRUE) |>
+  writeLines("test_dir/jsonl_outputs/batch_api_test_write_lines.jsonl")
+
+
+x <- readLines("test_dir/jsonl_outputs/batch_api_test_write_lines.jsonl")
+jsonlite::toJSON(x, auto_unbox = TRUE)
+```
+
+We said we could re-use some of the logic, but looking at it we don't really benefit from using httr2 for each request - it's unnecessary overhead. So we just create lists for now. We may use httr2 for the actual batch request (but maybe not!)
+
+```{r}
+single_batch_row <- oai_batch_build_embed_req("hello", "1")
+
+list_rows <- purrr::map(1:10, \(x) oai_batch_build_embed_req("hello", x))
+```
+
+Then we can write them to a file as follows, and send it to OpenAI as a batch job.
+
+```{r}
+writeLines(
+  unlist(list_rows),
+  "test_dir/jsonl_outputs/write_ten_lines.jsonl")
+
+
+readLines(
+  "test_dir/jsonl_outputs/write_ten_lines.jsonl",
+  n = 2
+)
+```
+
+```{r}
+test_df <- tibble::tibble(
+  x = letters,
+  y = 1:length(letters)
+)
+
+test_df |>
+  mutate(
+    reqs = map2_chr(x, y, \(text, id) {
+      oai_batch_build_embed_req(
+        text,
+        id,
+        dimensions = 324
+      )
+    })
+  )
+
+xx <- test_df |>
+  oai_batch_prepare_embeddings(
+    x,
+    y
+  )
+
+oai_batch_upload(
+  xx
+)
+
+batch_job_data <- oai_batch_file_list()
+temp_id <- batch_job_data$data[[1]]$id
+
+oai_batch_file_delete(temp_id)
+
+oai_batch_file_list()
+```
+
+# Testing
+
+## Embeddings
+
+```{r}
+uploaded_files <- oai_file_list(
+  purpose = "batch"
+)
+assertthat::validate_that(length(uploaded_files$data) == 0)
+```
+
+First batch failed due to ID being an integer, so ID has to be a string...
+
+```{r}
+embedding_rows <- test_df |>
+  oai_batch_prepare_embeddings(
+    x,
+    y
+  )
+
+embedding_file <- oai_batch_upload(embedding_rows)
+
+embedding_batch <- oai_batch_start(embedding_file$id,
+                                   endpoint = "/v1/embeddings")
+
+batch_jobs <- oai_batch_list()
+batch_jobs
+oai_batch_status(embedding_batch$id)
+```
+
+And then this time we hit an error — we passed the file ID where a batch ID (prefixed with "batch") was expected:
+
+```
+$error$message
+``` + +Once we've made sure the ID is a string, then we create the file, upload it, start the batch, check the status, download it. + +I would say that this still feels quite janky. There's a few different file ids going on, and then we need to handle any errors separately it seems? + +```{r} +embedding_rows_id_string <- test_df |> + oai_batch_prepare_embeddings( + x, + y + ) + +embedding_file_id_string <- oai_batch_upload(embedding_rows_id_string) + +embedding_batch_id_string <- oai_batch_start(embedding_file_id_string$id, + endpoint = "/v1/embeddings") + +embedding_batch_metadata <- oai_batch_status( + embedding_batch_id_string$id # "batch_6960e0b48bf481909751c76756ac9fec" +) + +output_file_contents <- oai_file_content( + embedding_batch_metadata$output_file_id +) + + +oai_batch_parse_embeddings(output_file_contents, original_df = NULL) +``` + +## Low dimensions + +Looks good, 327 columns = 324 + 3 (id, .error, .error_msg) + +```{r} +ld_embed_rows <- test_df |> + oai_batch_prepare_embeddings( + x, + y, + dimensions = 324 + ) + +ld_file <- oai_batch_upload( + jsonl_rows = ld_embed_rows, + purpose = "batch" +) + +ld_batch_job <- oai_batch_start(ld_file$id, endpoint = "/v1/embeddings") + +oai_batch_status(ld_batch_job$id)[["status"]] +ld_results <- oai_file_content(oai_batch_status(ld_batch_job$id)[["output_file_id"]]) + +oai_batch_parse_embeddings(ld_results) + +``` + +Delete/clean up + +We can delete the input/output files, but it doesn't seem like we can actually delete batches. + +```{r} +oai_batch_list()[["data"]] |> + map( pluck("id") +) |> unlist() + +oai_batch_status("batch_6960ed03eea8819080aaa69a8982de66") +oai_file_delete("file-4f8STaon74XE5yP6M7mWmH") +oai_file_delete("file-FteMgeWV4mK85kdcntGU29") + +oai_batch_status("batch_6960dfe7d8ac8190acf682a59e844b71") +oai_file_delete("file-HRFKx63PimYZYqsraQHR6T") +oai_file_delete("file-HRFKx63PimYZYqsraQHR6T") +oai_batch_status("batch_6960e0b48bf481909751c76756ac9fec") +oai_file_delete('file-TcV6dGkFEsrzGTNxkPKYCb') # output file +oai_file_delete('file-K6vaHgwcJsE5z1MMFvVMix') # input file + +``` + +## Completions + +testing the funcs for Completions, we need to get an input, make a file, start the batch, checl the status, retrieve the content +```{r} +completions_req <- oai_batch_build_completions_req( + input = "Tell me a joke about my country, the United Kingdom", + id = "id_1" +) + +completions_file <- oai_batch_upload( + completions_req +) +``` + +Do need to remember to fill in endpoint = "/v1/chat/completions" here instead of default arg for embeddings +```{r} +completions_batch <- oai_batch_start( + completions_file$id, + endpoint = "/v1/chat/completions" +) + +completions_status <- oai_batch_status( + completions_batch$id +) + +completions_status$status +completions_status$output_file_id # file-Q8TaFRoCYGHRZJKiYQKhx9 + +output <- oai_file_content(completions_status$output_file_id) + +oai_batch_parse_completions(output) |> + purrr::pluck("content", 1) +``` + +### With Schema + +```{r} +joke_schema <- create_json_schema( + name = "joke_schema", + description = "A set up and a punchline", + schema = schema_object( + setup = schema_string("The set up for the joke"), + punchline = schema_string("The punchline of the joke, make it pop"), + required = c("setup", "punchline") + ) +) + +joke_schema + +completions_req_w_schema <- oai_batch_build_completions_req( + input = "Tell me a joke about my country, the United Kingdom", + id = "id_1", + schema = joke_schema, + temperature = 1 +) + +.file <- oai_batch_upload( + 
+  completions_req_w_schema
+)
+
+.batch <- oai_batch_start(.file$id, endpoint = "/v1/chat/completions")
+
+oai_batch_status(.batch$id)[["status"]]
+
+oai_batch_status(.batch$id)[["output_file_id"]] |>
+  oai_file_content()
+```
+
+To deal with a batch output that had structured outputs, we need to first parse the batch, and then use the same approach we use elsewhere in EndpointR - may work on this as it still feels quite clunky
+```{r}
+.content <- oai_batch_status(.batch$id)[["output_file_id"]] |>
+  oai_file_content()
+
+parsed_batch <- oai_batch_parse_completions(.content) # custom_id, content, .error, .error_msg
+
+parsed_batch |>
+  dplyr::mutate(
+    parsed = purrr::map(content,
+                        \(x) safely_from_json(x))
+  ) |>
+  tidyr::unnest_wider(parsed)
+```
+
+
+# Files API
+
+Write the generic oai_file_upload func and then call it in oai_batch_upload
+Rename oai_batch_file_upload --> oai_batch_upload(?)
+
+```{r}
+tmp <- tempfile(fileext = ".jsonl")
+writeLines("Hello!", tmp)
+readLines(tmp)
+file.path(tmp)
+
+test_upload <- oai_file_upload(
+  file = tmp,
+  purpose = "user_data"
+) # file must be a path to an existing file
+
+file(tmp)
+```
+
+```{r}
+
+```
\ No newline at end of file
diff --git a/man/EndpointR-package.Rd b/man/EndpointR-package.Rd
index 1935951..de47fdd 100644
--- a/man/EndpointR-package.Rd
+++ b/man/EndpointR-package.Rd
@@ -18,5 +18,10 @@ Useful links:
 \author{
 \strong{Maintainer}: Jack Penzer \email{Jack.penzer@sharecreative.com}
 
+Authors:
+\itemize{
+  \item Claude AI
+}
+
 }
 \keyword{internal}
diff --git a/man/oai_batch_build_completions_req.Rd b/man/oai_batch_build_completions_req.Rd
new file mode 100644
index 0000000..15bb799
--- /dev/null
+++ b/man/oai_batch_build_completions_req.Rd
@@ -0,0 +1,55 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/openai_batch_api.R
+\name{oai_batch_build_completions_req}
+\alias{oai_batch_build_completions_req}
+\title{Create a Single OpenAI Batch API - Chat Completions Request}
+\usage{
+oai_batch_build_completions_req(
+  input,
+  id,
+  model = "gpt-4o-mini",
+  system_prompt = NULL,
+  temperature = 0,
+  max_tokens = 500L,
+  schema = NULL,
+  method = "POST",
+  endpoint = "/v1/chat/completions"
+)
+}
+\arguments{
+\item{input}{Text input (user message) for the completion}
+
+\item{id}{A custom, unique row ID}
+
+\item{model}{The chat completion model to use}
+
+\item{system_prompt}{Optional system prompt to guide the model's behaviour}
+
+\item{temperature}{Sampling temperature (0 = deterministic, higher = more random)}
+
+\item{max_tokens}{Maximum number of tokens to generate}
+
+\item{schema}{Optional JSON schema for structured output (json_schema object or list)}
+
+\item{method}{The HTTP request type, usually 'POST'}
+
+\item{endpoint}{The API endpoint path, e.g. /v1/chat/completions}
+}
+\value{
+A row of JSON suitable for the Batch API
+}
+\description{
+This function prepares a single row of data for the OpenAI Batch/Files APIs,
+where each row should be valid JSON. The APIs do not guarantee the results
+} +\examples{ +\dontrun{ +req <- oai_batch_build_completions_req( + input = "What is the capital of France?", + id = "query_1", + model = "gpt-4o-mini", + temperature = 0 +) +} +} diff --git a/man/oai_batch_build_embed_req.Rd b/man/oai_batch_build_embed_req.Rd new file mode 100644 index 0000000..d30b31b --- /dev/null +++ b/man/oai_batch_build_embed_req.Rd @@ -0,0 +1,44 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/openai_batch_api.R +\name{oai_batch_build_embed_req} +\alias{oai_batch_build_embed_req} +\title{Create a single OpenAI Batch API - Embedding request} +\usage{ +oai_batch_build_embed_req( + input, + id, + model = "text-embedding-3-small", + dimensions = NULL, + method = "POST", + encoding_format = "float", + endpoint = "/v1/embeddings" +) +} +\arguments{ +\item{input}{Text input to embed} + +\item{id}{A custom, unique row ID} + +\item{model}{The embedding model to use} + +\item{dimensions}{Number of embedding dimensions (NULL uses model default)} + +\item{method}{The HTTP request type, usually 'POST'} + +\item{encoding_format}{Data type of the embedding values} + +\item{endpoint}{The API endpoint path, e.g. /v1/embeddings} +} +\value{ +a row of JSON +} +\description{ +This function prepares a single row of data for the OpenAI Batch/Files APIs, where each row should be valid JSON. The APIs do not guarantee the results will be in the same order, so we need to provide an ID with each request. +} +\examples{ +\dontrun{ +text <- "embed_me" +id <- "id_1" +batch_req <- oai_batch_build_embed_req(text, id) +} +} diff --git a/man/oai_batch_cancel.Rd b/man/oai_batch_cancel.Rd new file mode 100644 index 0000000..9340942 --- /dev/null +++ b/man/oai_batch_cancel.Rd @@ -0,0 +1,27 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/openai_batch_api.R +\name{oai_batch_cancel} +\alias{oai_batch_cancel} +\title{Cancel a Running Batch Job on the OpenAI Batch API} +\usage{ +oai_batch_cancel(batch_id, key_name = "OPENAI_API_KEY") +} +\arguments{ +\item{batch_id}{Batch identifier (starts with 'batch_'), returned by oai_batch_start()} + +\item{key_name}{Name of the environment variable containing your API key} +} +\value{ +Metadata about the cancelled batch job +} +\description{ +Cancels an in-progress batch job. The batch will stop processing new +requests, but requests already being processed may still complete. +} +\examples{ +\dontrun{ +# Cancel a batch job that's taking too long +cancelled <- oai_batch_cancel("batch_abc123") +cancelled$status # Will be "cancelling" or "cancelled" +} +} diff --git a/man/oai_batch_list.Rd b/man/oai_batch_list.Rd new file mode 100644 index 0000000..38f87e7 --- /dev/null +++ b/man/oai_batch_list.Rd @@ -0,0 +1,30 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/openai_batch_api.R +\name{oai_batch_list} +\alias{oai_batch_list} +\title{List Batch Jobs on the OpenAI Batch API} +\usage{ +oai_batch_list(limit = 20L, after = NULL, key_name = "OPENAI_API_KEY") +} +\arguments{ +\item{limit}{Maximum number of batch jobs to return} + +\item{after}{Cursor for pagination; batch ID to start after} + +\item{key_name}{Name of the environment variable containing your API key} +} +\value{ +A list containing batch job metadata and pagination information +} +\description{ +Retrieve a paginated list of batch jobs associated with your API key. 
+}
+\examples{
+\dontrun{
+# List recent batch jobs
+batches <- oai_batch_list(limit = 10)
+
+# Paginate through results
+next_page <- oai_batch_list(after = batches$last_id)
+}
+}
diff --git a/man/oai_batch_parse_completions.Rd b/man/oai_batch_parse_completions.Rd
new file mode 100644
index 0000000..e4aa101
--- /dev/null
+++ b/man/oai_batch_parse_completions.Rd
@@ -0,0 +1,36 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/openai_batch_api.R
+\name{oai_batch_parse_completions}
+\alias{oai_batch_parse_completions}
+\title{Parse a Completions Batch Job into a Data Frame}
+\usage{
+oai_batch_parse_completions(content, original_df = NULL, id_var = NULL)
+}
+\arguments{
+\item{content}{Character string of JSONL content from the batch output file}
+
+\item{original_df}{Optional original data frame to rename custom_id column}
+
+\item{id_var}{If original_df provided, the column name to rename custom_id to}
+}
+\value{
+A tibble with custom_id (or renamed), content, .error, and .error_msg
+}
+\description{
+Parses the JSONL content returned from a completed chat completions batch
+job and converts it into a tidy data frame with one row per response.
+}
+\examples{
+\dontrun{
+# After downloading batch results with oai_file_content()
+content <- oai_file_content(status$output_file_id)
+completions_df <- oai_batch_parse_completions(content)
+
+# Optionally rename the ID column to match original data
+completions_df <- oai_batch_parse_completions(
+  content,
+  original_df = my_df,
+  id_var = query_id
+)
+}
+}
diff --git a/man/oai_batch_parse_embeddings.Rd b/man/oai_batch_parse_embeddings.Rd
new file mode 100644
index 0000000..a977c88
--- /dev/null
+++ b/man/oai_batch_parse_embeddings.Rd
@@ -0,0 +1,37 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/openai_batch_api.R
+\name{oai_batch_parse_embeddings}
+\alias{oai_batch_parse_embeddings}
+\title{Parse an Embeddings Batch Job into a Data Frame}
+\usage{
+oai_batch_parse_embeddings(content, original_df = NULL, id_var = NULL)
+}
+\arguments{
+\item{content}{Character string of JSONL content from the batch output file}
+
+\item{original_df}{Optional original data frame to rename custom_id column}
+
+\item{id_var}{If original_df provided, the column name to rename custom_id to}
+}
+\value{
+A tibble with custom_id (or renamed), .error, .error_msg, and
+embedding dimensions (V1, V2, ..., Vn)
+}
+\description{
+Parses the JSONL content returned from a completed embeddings batch job
+and converts it into a tidy data frame with one row per embedding.
+}
+\examples{
+\dontrun{
+# After downloading batch results with oai_file_content()
+content <- oai_file_content(status$output_file_id)
+embeddings_df <- oai_batch_parse_embeddings(content)
+
+# Optionally rename the ID column to match original data
+embeddings_df <- oai_batch_parse_embeddings(
+  content,
+  original_df = my_df,
+  id_var = doc_id
+)
+}
+}
diff --git a/man/oai_batch_prepare_completions.Rd b/man/oai_batch_prepare_completions.Rd
new file mode 100644
index 0000000..98bc0fd
--- /dev/null
+++ b/man/oai_batch_prepare_completions.Rd
@@ -0,0 +1,66 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/openai_batch_api.R
+\name{oai_batch_prepare_completions}
+\alias{oai_batch_prepare_completions}
+\title{Prepare a Data Frame for the OpenAI Batch API - Chat Completions}
+\usage{
+oai_batch_prepare_completions(
+  df,
+  text_var,
+  id_var,
+  model = "gpt-4o-mini",
+  system_prompt = NULL,
+  temperature = 0,
+  max_tokens = 500L,
+  schema = NULL,
+  method = "POST",
+  endpoint = "/v1/chat/completions"
+)
+}
+\arguments{
+\item{df}{A data frame containing text to process}
+
+\item{text_var}{Name of the column containing input text}
+
+\item{id_var}{Name of the column to use as row ID}
+
+\item{model}{The chat completion model to use}
+
+\item{system_prompt}{Optional system prompt to guide the model's behaviour}
+
+\item{temperature}{Sampling temperature (0 = deterministic, higher = more random)}
+
+\item{max_tokens}{Maximum number of tokens to generate}
+
+\item{schema}{Optional JSON schema for structured output (json_schema object or list)}
+
+\item{method}{The HTTP request type, usually 'POST'}
+
+\item{endpoint}{The API endpoint path, e.g. /v1/chat/completions}
+}
+\value{
+A character string of newline-separated JSON requests
+}
+\description{
+Takes an entire data frame and turns each row into a valid line
+of JSON ready for a .jsonl file upload to the OpenAI Files API + Batch API
+job trigger.
+}
+\details{
+Each request must have its own ID, as the Batch API makes no
+guarantees about the order the results will be returned in.
+}
+\examples{
+\dontrun{
+df <- data.frame(
+  id = c("q1", "q2"),
+  prompt = c("What is 2+2?", "Explain gravity briefly.")
+)
+jsonl_content <- oai_batch_prepare_completions(
+  df,
+  text_var = prompt,
+  id_var = id,
+  system_prompt = "You are a helpful assistant."
+)
+}
+}
diff --git a/man/oai_batch_prepare_embeddings.Rd b/man/oai_batch_prepare_embeddings.Rd
new file mode 100644
index 0000000..045768b
--- /dev/null
+++ b/man/oai_batch_prepare_embeddings.Rd
@@ -0,0 +1,56 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/openai_batch_api.R
+\name{oai_batch_prepare_embeddings}
+\alias{oai_batch_prepare_embeddings}
+\title{Prepare a Data Frame for the OpenAI Batch API - Embeddings}
+\usage{
+oai_batch_prepare_embeddings(
+  df,
+  text_var,
+  id_var,
+  model = "text-embedding-3-small",
+  dimensions = NULL,
+  method = "POST",
+  encoding_format = "float",
+  endpoint = "/v1/embeddings"
+)
+}
+\arguments{
+\item{df}{A data frame containing text to process}
+
+\item{text_var}{Name of the column containing input text}
+
+\item{id_var}{Name of the column to use as row ID}
+
+\item{model}{The embedding model to use}
+
+\item{dimensions}{Number of embedding dimensions (NULL uses model default)}
+
+\item{method}{The HTTP request type, usually 'POST'}
+
+\item{encoding_format}{Data type of the embedding values}
+
+\item{endpoint}{The API endpoint path, e.g. /v1/embeddings}
+}
+\value{
+A character string of newline-separated JSON requests
+}
+\description{
+Prepare a Data Frame for the OpenAI Batch API - Embeddings
+}
+\details{
+Takes an entire data frame and turns each row into a valid line of JSON ready for a .jsonl file upload to the OpenAI Files API + Batch API job trigger.
+
+Each request must have its own ID, as the Batch API makes no guarantees about the order the results will be returned in.
+
+To reduce the overall size of the embeddings (at some cost to their expressive power), you can set dimensions lower than the default (which varies by model).
+}
+\examples{
+\dontrun{
+df <- data.frame(
+  id = c("doc_1", "doc_2", "doc_3"),
+  text = c("Hello world", "Embedding text", "Another document")
+)
+jsonl_content <- oai_batch_prepare_embeddings(df, text_var = text, id_var = id)
+}
+}
diff --git a/man/oai_batch_start.Rd b/man/oai_batch_start.Rd
new file mode 100644
index 0000000..592b67a
--- /dev/null
+++ b/man/oai_batch_start.Rd
@@ -0,0 +1,48 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/openai_batch_api.R
+\name{oai_batch_start}
+\alias{oai_batch_start}
+\title{Trigger a batch job to run on an uploaded file}
+\usage{
+oai_batch_start(
+  file_id,
+  endpoint = c("/v1/embeddings", "/v1/chat/completions"),
+  completion_window = "24h",
+  metadata = NULL,
+  key_name = "OPENAI_API_KEY"
+)
+}
+\arguments{
+\item{file_id}{File ID returned by oai_batch_upload()}
+
+\item{endpoint}{The API endpoint path, e.g. /v1/embeddings}
+
+\item{completion_window}{Time window for batch completion (OpenAI guarantees 24h only)}
+
+\item{metadata}{Optional list of metadata to tag the batch with}
+
+\item{key_name}{Name of the environment variable containing your API key}
+}
+\value{
+Metadata about an OpenAI Batch Job, including the batch ID
+}
+\description{
+Trigger a batch job to run on an uploaded file
+}
+\details{
+Once a file has been uploaded to the OpenAI Files API it's necessary to trigger the batch job. This ensures that your file is processed, and that processing is finalised within OpenAI's 24-hour completion window.

+It's important to choose the right endpoint. If processing should be done by the Completions API, be sure to route to /v1/chat/completions, and this must match each row in your uploaded file.
+
+Batch job IDs start with "batch_"; you'll receive an error from the API if you try to check batch status with a Files API file ID (the Files/Batch API split is a little clumsy).
+}
+\examples{
+\dontrun{
+# After uploading a file with oai_batch_upload()
+batch_job <- oai_batch_start(
+  file_id = "file-abc123",
+  endpoint = "/v1/embeddings"
+)
+batch_job$id # Use this to check status later
+}
+}
diff --git a/man/oai_batch_status.Rd b/man/oai_batch_status.Rd
new file mode 100644
index 0000000..635b467
--- /dev/null
+++ b/man/oai_batch_status.Rd
@@ -0,0 +1,26 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/openai_batch_api.R
+\name{oai_batch_status}
+\alias{oai_batch_status}
+\title{Check the Status of a Batch Job on the OpenAI Batch API}
+\usage{
+oai_batch_status(batch_id, key_name = "OPENAI_API_KEY")
+}
+\arguments{
+\item{batch_id}{Batch identifier (starts with 'batch_'), returned by oai_batch_start()}
+
+\item{key_name}{Name of the environment variable containing your API key}
+}
+\value{
+}
+\description{
+Check the Status of a Batch Job on the OpenAI Batch API
+}
+\examples{
+\dontrun{
+status <- oai_batch_status("batch_abc123")
+status$status # e.g., "completed", "in_progress", "failed"
+status$output_file_id # File ID for results when completed
+}
+}
diff --git a/man/oai_batch_upload.Rd b/man/oai_batch_upload.Rd
new file mode 100644
index 0000000..220348f
--- /dev/null
+++ b/man/oai_batch_upload.Rd
@@ -0,0 +1,43 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/openai_batch_api.R
+\name{oai_batch_upload}
+\alias{oai_batch_upload}
+\title{Prepare and upload a file to the OpenAI Batch API}
+\usage{
+oai_batch_upload(
+  jsonl_rows,
+  purpose = c("batch", "fine-tune", "assistants", "vision", "user_data", "evals"),
+  key_name = "OPENAI_API_KEY",
+  endpoint_url = "https://api.openai.com/v1/files"
+)
+}
+\arguments{
+\item{jsonl_rows}{Rows of valid JSON, output of an oai_batch_prepare* function}
+
+\item{purpose}{The intended purpose of the uploaded file. Must be one of
+"batch", "fine-tune", "assistants", "vision", "user_data", or "evals".}
+
+\item{key_name}{Name of the environment variable containing your API key}
+
+\item{endpoint_url}{OpenAI API endpoint URL (default: OpenAI's Files API V1)}
+}
+\value{
+Metadata for an upload to the OpenAI Files API
+}
+\description{
+Prepare and upload a file to the OpenAI Batch API
+}
+\examples{
+\dontrun{
+df <- data.frame(
+  id = c("doc_1", "doc_2"),
+  text = c("Hello world", "Goodbye world")
+)
+jsonl_content <- oai_batch_prepare_embeddings(df, text_var = text, id_var = id)
+file_info <- oai_batch_upload(jsonl_content)
+file_info$id # Use this ID to create a batch job
+}
+}
+\seealso{
+\code{oai_file_upload()}, \code{oai_file_list()}
+}
diff --git a/man/oai_file_content.Rd b/man/oai_file_content.Rd
new file mode 100644
index 0000000..471c658
--- /dev/null
+++ b/man/oai_file_content.Rd
@@ -0,0 +1,39 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/openai_files_api.R
+\name{oai_file_content}
+\alias{oai_file_content}
+\title{Retrieve Content from a File on the OpenAI Files API}
+\usage{
+oai_file_content(file_id, key_name = "OPENAI_API_KEY")
+}
+\arguments{
+\item{file_id}{File identifier (starts with 'file-'), typically the
+output_file_id from \code{\link[=oai_batch_status]{oai_batch_status()}}}
+
+\item{key_name}{Name of the environment variable containing your API key}
+}
+\value{
+A character string containing the file contents. For batch outputs,
+this is JSONL format (one JSON object per line).
+}
+\description{
+Downloads and returns the content of a file stored on the OpenAI Files API.
+For batch job outputs, this returns JSONL content that can be parsed with
+\code{\link[=oai_batch_parse_embeddings]{oai_batch_parse_embeddings()}} or \code{\link[=oai_batch_parse_completions]{oai_batch_parse_completions()}}.
+} +\examples{ +\dontrun{ +# Get batch job status and download results +status <- oai_batch_status("batch_abc123") + +if (status$status == "completed") { + content <- oai_file_content(status$output_file_id) + results <- oai_batch_parse_embeddings(content) +} +} +} +\seealso{ +\code{\link[=oai_batch_status]{oai_batch_status()}} to get output_file_id from completed batches, +\code{\link[=oai_batch_parse_embeddings]{oai_batch_parse_embeddings()}} and \code{\link[=oai_batch_parse_completions]{oai_batch_parse_completions()}} to +parse batch results +} diff --git a/man/oai_file_delete.Rd b/man/oai_file_delete.Rd new file mode 100644 index 0000000..b552eba --- /dev/null +++ b/man/oai_file_delete.Rd @@ -0,0 +1,35 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/openai_files_api.R +\name{oai_file_delete} +\alias{oai_file_delete} +\title{Delete a File from the OpenAI Files API} +\usage{ +oai_file_delete(file_id, key_name = "OPENAI_API_KEY") +} +\arguments{ +\item{file_id}{File identifier (starts with 'file-'), returned by +\code{\link[=oai_batch_upload]{oai_batch_upload()}} or \code{\link[=oai_file_list]{oai_file_list()}}} + +\item{key_name}{Name of the environment variable containing your API key} +} +\value{ +A list containing the file id, object type, and deletion status +(deleted = TRUE/FALSE) +} +\description{ +Permanently deletes a file from the OpenAI Files API. This action cannot +be undone. Note that files associated with active batch jobs cannot be +deleted until the job completes. +} +\examples{ +\dontrun{ +# Delete a specific file +result <- oai_file_delete("file-abc123") +result$deleted # TRUE if successful + +} +} +\seealso{ +\code{\link[=oai_file_list]{oai_file_list()}} to find file IDs, +\code{\link[=oai_file_content]{oai_file_content()}} to retrieve file contents before deletion +} diff --git a/man/oai_file_list.Rd b/man/oai_file_list.Rd new file mode 100644 index 0000000..55c25d4 --- /dev/null +++ b/man/oai_file_list.Rd @@ -0,0 +1,42 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/openai_files_api.R +\name{oai_file_list} +\alias{oai_file_list} +\title{List Files on the OpenAI Files API} +\usage{ +oai_file_list( + purpose = c("batch", "fine-tune", "assistants", "vision", "user_data", "evals"), + key_name = "OPENAI_API_KEY" +) +} +\arguments{ +\item{purpose}{The intended purpose of the uploaded file. Must be one of +"batch", "fine-tune", "assistants", "vision", "user_data", or "evals".} + +\item{key_name}{Name of the environment variable containing your API key} +} +\value{ +A list containing file metadata and pagination information. Each +file entry includes id, filename, purpose, bytes, created_at, and status. +} +\description{ +Retrieve a list of files that have been uploaded to the OpenAI Files API, +filtered by purpose. Files are retained for 30 days after upload. 
+}
+\examples{
+\dontrun{
+# List all batch files
+batch_files <- oai_file_list(purpose = "batch")
+
+# List fine-tuning files
+ft_files <- oai_file_list(purpose = "fine-tune")
+
+# Access file IDs
+file_ids <- purrr::map_chr(batch_files$data, "id")
+}
+}
+\seealso{
+\code{\link[=oai_file_content]{oai_file_content()}} to retrieve file contents,
+\code{\link[=oai_file_delete]{oai_file_delete()}} to remove files,
+\code{\link[=oai_batch_upload]{oai_batch_upload()}} to upload batch files
+}
diff --git a/man/oai_file_upload.Rd b/man/oai_file_upload.Rd
new file mode 100644
index 0000000..9386629
--- /dev/null
+++ b/man/oai_file_upload.Rd
@@ -0,0 +1,43 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/openai_files_api.R
+\name{oai_file_upload}
+\alias{oai_file_upload}
+\title{Upload a file to the OpenAI Files API}
+\usage{
+oai_file_upload(
+  file,
+  purpose = c("batch", "fine-tune", "assistants", "vision", "user_data", "evals"),
+  key_name = "OPENAI_API_KEY",
+  endpoint_url = "https://api.openai.com/v1/files"
+)
+}
+\arguments{
+\item{file}{Path to the file you wish to upload}
+
+\item{purpose}{The intended purpose of the uploaded file. Must be one of
+"batch", "fine-tune", "assistants", "vision", "user_data", or "evals".}
+
+\item{key_name}{Name of the environment variable containing your API key}
+
+\item{endpoint_url}{OpenAI API endpoint URL (default: OpenAI's Files API V1)}
+}
+\value{
+File upload status and metadata including id, purpose, filename, created_at etc.
+}
+\description{
+Upload a file to the OpenAI Files API
+}
+\examples{
+\dontrun{
+tmp <- tempfile(fileext = ".jsonl")
+writeLines("Hello!", tmp)
+oai_file_upload(
+  file = tmp,
+  purpose = "user_data"
+)
+}
+}
+\seealso{
+\url{https://platform.openai.com/docs/api-reference/files?lang=curl}
+}
diff --git a/tests/testthat/test-openai_batch_api.R b/tests/testthat/test-openai_batch_api.R
new file mode 100644
index 0000000..e1e368f
--- /dev/null
+++ b/tests/testthat/test-openai_batch_api.R
@@ -0,0 +1,306 @@
+test_that("oai_batch_build_embed_req creates a row of JSON and responds to its input arguments", {
+  no_dims <- expect_no_error(
+    oai_batch_build_embed_req(
+      "hello",
+      "1234"
+    )
+  )
+
+  no_dims_str <- jsonlite::fromJSON(no_dims)
+
+  with_dims <- expect_no_error(
+    oai_batch_build_embed_req(
+      "hello",
+      "134",
+      dimensions = 124
+    )
+  )
+
+  with_dims_str <- jsonlite::fromJSON(with_dims)
+
+  expect_equal(with_dims_str$body$dimensions, 124)
+  expect_setequal(names(no_dims_str), names(with_dims_str))
+
+  expect_true(no_dims_str$method == "POST")
+  expect_equal(no_dims_str$url, "/v1/embeddings")
+  expect_equal(no_dims_str$body$model, "text-embedding-3-small")
+})
+
+test_that("oai_batch_build_completions_req creates valid JSON structure", {
+  result <- oai_batch_build_completions_req(
+    input = "Hello",
+    id = "test_1",
+    model = "gpt-4o-mini"
+  )
+
+  parsed <- jsonlite::fromJSON(result, simplifyVector = FALSE)
+
+  expect_equal(parsed$custom_id, "test_1")
+  expect_equal(parsed$method, "POST")
+  expect_equal(parsed$url, "/v1/chat/completions")
+  expect_equal(parsed$body$model, "gpt-4o-mini")
+  expect_equal(length(parsed$body$messages), 1)
+  expect_equal(parsed$body$messages[[1]]$role, "user")
+  expect_equal(parsed$body$messages[[1]]$content, "Hello")
+})
+
+test_that("oai_batch_build_completions_req handles system_prompt", {
+  result <- oai_batch_build_completions_req(
+    input = "Hello",
+    id = "test_2",
+    system_prompt = "You are helpful"
+  )
+
+  parsed <- jsonlite::fromJSON(result,
simplifyVector = FALSE) + + expect_equal(length(parsed$body$messages), 2) + expect_equal(parsed$body$messages[[1]]$role, "system") + expect_equal(parsed$body$messages[[1]]$content, "You are helpful") + expect_equal(parsed$body$messages[[2]]$role, "user") +}) + +test_that("oai_batch_build_completions_req handles schema as list", { + test_schema <- list( + type = "json_schema", + json_schema = list( + name = "test", + schema = list(type = "object", properties = list(sentiment = list(type = "string"))) + ) + ) + + result <- oai_batch_build_completions_req( + input = "Hello", + id = "test_3", + schema = test_schema + ) + + parsed <- jsonlite::fromJSON(result) + + expect_true("response_format" %in% names(parsed$body)) + expect_equal(parsed$body$response_format$type, "json_schema") +}) + +test_that("oai_batch_build_completions_req handles json_schema S7 object", { + test_schema <- create_json_schema( + name = "sentiment_schema", + description = "Sentiment analysis result", + schema = schema_object( + sentiment = schema_string("The sentiment", enum = c("positive", "negative", "neutral")), + required = c("sentiment") + ) + ) + + result <- oai_batch_build_completions_req( + input = "Analyse the sentiment of this text", + id = "test_s7_schema", + schema = test_schema + ) + + parsed <- jsonlite::fromJSON(result, simplifyVector = FALSE) + + expect_true("response_format" %in% names(parsed$body)) + expect_equal(parsed$body$response_format$type, "json_schema") + expect_equal(parsed$body$response_format$json_schema$name, "sentiment_schema") + expect_equal(parsed$body$response_format$json_schema$strict, TRUE) +}) + +test_that("oai_batch_build_completions_req respects temperature and max_tokens", { + result <- oai_batch_build_completions_req( + input = "Hello", + id = "test_4", + temperature = 0.7, + max_tokens = 1000L + ) + + parsed <- jsonlite::fromJSON(result) + + expect_equal(parsed$body$temperature, 0.7) + expect_equal(parsed$body$max_tokens, 1000) +}) + +test_that("oai_batch_prepare_completions creates valid JSONL", { + test_df <- tibble::tibble( + id = c("a", "b"), + text = c("Hello", "World") + ) + + result <- oai_batch_prepare_completions( + df = test_df, + text_var = text, + id_var = id + ) + + lines <- strsplit(result, "\n")[[1]] + expect_equal(length(lines), 2) + + parsed <- purrr::map(lines, \(x) jsonlite::fromJSON(x, simplifyVector = FALSE)) + expect_equal(parsed[[1]]$custom_id, "a") + expect_equal(parsed[[2]]$custom_id, "b") + expect_equal(parsed[[1]]$body$messages[[1]]$content, "Hello") +}) + +test_that("oai_batch_prepare_completions handles system_prompt across all rows", { + test_df <- tibble::tibble( + id = c("a", "b"), + text = c("Hello", "World") + ) + + result <- oai_batch_prepare_completions( + df = test_df, + text_var = text, + id_var = id, + system_prompt = "Be brief" + ) + + lines <- strsplit(result, "\n")[[1]] + parsed <- purrr::map(lines, \(x) jsonlite::fromJSON(x, simplifyVector = FALSE)) + + expect_equal(parsed[[1]]$body$messages[[1]]$role, "system") + expect_equal(parsed[[2]]$body$messages[[1]]$role, "system") +}) + + +test_that("oai_batch_parse_embeddings handles success response", { + mock_content <- '{"custom_id":"1","response":{"body":{"data":[{"embedding":[0.1,0.2,0.3]}]}},"error":null}' + + result <- oai_batch_parse_embeddings(mock_content) + + expect_equal(nrow(result), 1) + expect_equal(result$custom_id, "1") + expect_false(result$.error) + expect_true("V1" %in% names(result)) + expect_equal(result$V1, 0.1) + expect_equal(result$V2, 0.2) + expect_equal(result$V3, 0.3) 
+}) + +test_that("oai_batch_parse_embeddings handles error response", { + mock_content <- '{"custom_id":"1","response":null,"error":{"message":"Rate limit exceeded"}}' + + result <- oai_batch_parse_embeddings(mock_content) + + expect_equal(nrow(result), 1) + expect_true(result$.error) + expect_equal(result$.error_msg, "Rate limit exceeded") +}) + +test_that("oai_batch_parse_embeddings handles multiple rows", { + mock_content <- paste0( + '{"custom_id":"1","response":{"body":{"data":[{"embedding":[0.1,0.2]}]}},"error":null}', + '\n', + '{"custom_id":"2","response":{"body":{"data":[{"embedding":[0.3,0.4]}]}},"error":null}' + ) + + result <- oai_batch_parse_embeddings(mock_content) + + expect_equal(nrow(result), 2) + expect_equal(result$custom_id, c("1", "2")) + expect_equal(result$V1, c(0.1, 0.3)) +}) + + +test_that("oai_batch_parse_completions handles success response", { + mock_content <- '{"custom_id":"1","response":{"body":{"choices":[{"message":{"content":"Hello back"}}]}},"error":null}' + + result <- oai_batch_parse_completions(mock_content) + + expect_equal(nrow(result), 1) + expect_equal(result$custom_id, "1") + expect_equal(result$content, "Hello back") + expect_false(result$.error) +}) + +test_that("oai_batch_parse_completions handles error response", { + mock_content <- '{"custom_id":"1","response":null,"error":{"message":"API error"}}' + + result <- oai_batch_parse_completions(mock_content) + + expect_equal(nrow(result), 1) + expect_true(result$.error) + expect_equal(result$.error_msg, "API error") + expect_true(is.na(result$content)) +}) + +test_that("oai_batch_parse_completions handles JSON schema content", { + mock_content <- '{"custom_id":"1","response":{"body":{"choices":[{"message":{"content":"{\\"sentiment\\":\\"positive\\"}"}}]}},"error":null}' + + result <- oai_batch_parse_completions(mock_content) + + expect_equal(result$content, '{"sentiment":"positive"}') + parsed_content <- jsonlite::fromJSON(result$content) + expect_equal(parsed_content$sentiment, "positive") +}) + +test_that("oai_batch_parse_completions renames id column when original_df provided", { + mock_content <- '{"custom_id":"doc_1","response":{"body":{"choices":[{"message":{"content":"test"}}]}},"error":null}' + + original_df <- tibble::tibble( + my_id = "doc_1", + text = "Hello" + ) + + result <- oai_batch_parse_completions(mock_content, original_df, id_var = "my_id") + + expect_true("my_id" %in% names(result)) + expect_false("custom_id" %in% names(result)) + expect_equal(result$my_id, "doc_1") +}) + +test_that("oai_batch_prepare_embeddings rejects duplicate IDs", { + test_df <- tibble::tibble( + id = c("a", "a", "b"), + text = c("Text 1", "Text 2", "Text 3") + ) + + expect_error( + oai_batch_prepare_embeddings(test_df, text, id), + "custom_id values must be unique" + ) +}) + +test_that("oai_batch_prepare_completions rejects duplicate IDs", { + test_df <- tibble::tibble( + id = c("x", "y", "x"), + text = c("Hello", "World", "Again") + ) + + expect_error( + oai_batch_prepare_completions(test_df, text, id), + "custom_id values must be unique" + ) +}) + +test_that("oai_batch_prepare_embeddings handles empty dataframe with warning", { + test_df <- tibble::tibble(id = character(), text = character()) + + expect_warning( + result <- oai_batch_prepare_embeddings(test_df, text, id), + "Input is empty" + ) + expect_equal(result, "") +}) + +test_that("oai_batch_prepare_completions handles empty dataframe with warning", { + test_df <- tibble::tibble(id = character(), text = character()) + + expect_warning( + result 
<- oai_batch_prepare_completions(test_df, text, id),
+    "Input is empty"
+  )
+  expect_equal(result, "")
+})
+
+test_that("oai_batch_parse_embeddings handles empty input", {
+  result <- oai_batch_parse_embeddings("")
+  expect_equal(nrow(result), 0)
+  expect_true("custom_id" %in% names(result))
+  expect_true(".error" %in% names(result))
+})
+
+test_that("oai_batch_parse_completions handles empty input", {
+  result <- oai_batch_parse_completions("")
+
+  expect_equal(nrow(result), 0)
+  expect_true("custom_id" %in% names(result))
+  expect_true("content" %in% names(result))
+})
diff --git a/tests/testthat/test-openai_files_api.R b/tests/testthat/test-openai_files_api.R
new file mode 100644
index 0000000..8ce10c1
--- /dev/null
+++ b/tests/testthat/test-openai_files_api.R
@@ -0,0 +1,15 @@
+test_that("oai_file_upload errors when given inappropriate inputs", {
+  expect_error(
+    oai_file_upload("tmp"),
+    "must be a file"
+  )
+
+  .tmp <- tempfile()
+  writeLines("Hello!", .tmp)
+
+  expect_error(
+    oai_file_upload(.tmp, purpose = "life"),
+    "should be one of"
+  )
+
+})
diff --git a/todos.qmd b/todos.qmd
index 65db2ae..d7d710e 100644
--- a/todos.qmd
+++ b/todos.qmd
@@ -2,13 +2,19 @@
 
 # Versions
 
+## 0.2.1
+
+- [ ] OpenAI Batch API
+  - [ ] Embeddings
+  - [ ] Completions
+
 ## 0.2
 
 - [ ] Support for Anthropic API
   - [ ] Batches
-  - [ ] Messages (Completions)
+  - [x] Messages (Completions)
   - [x] Structured Outputs
 
-- [ ] Support for Gemini API
+- [ ] Support for Gemini API (moving to later release)
   - [ ] Embeddings
   - [ ] Completions
   - [ ] Structured Outputs
diff --git a/vignettes/hugging_face_inference.Rmd b/vignettes/hugging_face_inference.Rmd
index f985801..17ae967 100644
--- a/vignettes/hugging_face_inference.Rmd
+++ b/vignettes/hugging_face_inference.Rmd
@@ -370,7 +370,7 @@ embedding_result |> count(.error)
 # View any failures (column names match your original data frame)
 failures <- embedding_result |>
   filter(.error == TRUE) |>
-  select(id, .error_message)
+  select(id, .error_msg)
 
 # Extract just the embeddings for successful rows
 embeddings_only <- embedding_result |>
@@ -416,7 +416,7 @@ The result includes:
 - Your original ID and text columns (with their original names preserved)
 - Classification labels (e.g., POSITIVE, NEGATIVE)
 - Confidence scores
-- Error tracking columns (`.error`, `.error_message`)
+- Error tracking columns (`.error`, `.error_msg`)
 - Chunk tracking (`.chunk`)
 
 > **NOTE**: Classification labels are model and task specific. Check the model card on Hugging Face for label mappings.
diff --git a/vignettes/sync_async.Rmd b/vignettes/sync_async.Rmd
new file mode 100644
index 0000000..7f4b2ac
--- /dev/null
+++ b/vignettes/sync_async.Rmd
@@ -0,0 +1,268 @@
+---
+title: "Synchronous vs Asynchronous/Batch APIs"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Synchronous vs Asynchronous APIs}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>"
+)
+```
+
+```{r setup}
+library(EndpointR)
+```
+
+# Introduction
+
+Most of EndpointR's integrations are with synchronous APIs such as [Completions](https://platform.openai.com/docs/api-reference/completions) by OpenAI, Hugging Face's [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/en/index), and Messages by [Anthropic](https://platform.claude.com/docs/en/api/messages). When using these APIs, we send an HTTP request, wait a second or two, and receive a response.
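+
+For illustration, a single synchronous call looks something like the sketch below. This is a minimal sketch: `oai_embed_text()` is exported by EndpointR, but we are assuming here that the text is its first argument and that the API key is read from the `OPENAI_API_KEY` environment variable -- check the function's help page for the exact signature.
+
+```{r, eval = FALSE}
+# one request in, one response out; the R session waits in between
+embedding <- oai_embed_text("The quick brown fox")
+```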
+
+However, data scientists often need to process an entire data frame, resulting in thousands or millions of HTTP requests. This is inefficient because:
+
+1. Cost - Providers don't offer discounts for synchronous requests
+2. Session Blocking - Our R session is blocked for hours at a time while requests complete
+3. Rate Limits - Providers enforce stricter rate limits on synchronous APIs
+
+A solution to these problems is to use providers' 'Batch APIs', which process requests asynchronously. These often come with a 50% discount and higher rate limits, plus a guarantee of results within a fixed time frame, e.g. 24 hours.
+
+> **TIP**: Results are often ready much faster than the guaranteed window; consider checking 1-2 hours after triggering the batch.
+
+# Quickstart
+
+The OpenAI Batch API workflow follows three stages: **prepare**, **submit**, and **retrieve**. Below are complete examples for embeddings and completions.
+
+## Batch Embeddings
+
+```{r batch-embeddings, eval = FALSE}
+# 1. Prepare your data
+df <- data.frame(
+  id = c("doc_1", "doc_2", "doc_3"),
+  text = c(
+    "The quick brown fox jumps over the lazy dog",
+    "Machine learning is transforming data science",
+    "R is a powerful language for statistical computing"
+  )
+)
+
+# 2. Prepare requests for the Batch API
+jsonl_content <- oai_batch_prepare_embeddings(
+  df,
+  text_var = text,
+  id_var = id,
+  model = "text-embedding-3-small",
+  dimensions = 256
+)
+
+# 3. Upload to the Files API
+file_info <- oai_batch_upload(jsonl_content)
+file_info$id
+#> "file-abc123..."
+
+# 4. Trigger the batch job
+batch_job <- oai_batch_start(
+  file_id = file_info$id,
+  endpoint = "/v1/embeddings"
+)
+batch_job$id
+#> "batch_xyz789..."
+
+# 5. Check status (repeat until completed; see the polling sketch
+# at the end of this vignette)
+status <- oai_batch_status(batch_job$id)
+status$status
+#> "in_progress" ... later ... "completed"
+
+# 6. Download and parse results
+content <- oai_file_content(status$output_file_id)
+embeddings_df <- oai_batch_parse_embeddings(content)
+
+# Result: tidy data frame with id and embedding dimensions (V1, V2, ..., V256)
+embeddings_df
+#> # A tibble
+#>   custom_id .error .error_msg     V1     V2     V3 ...
+#>   ...
+#> 1 doc_1     FALSE  NA          0.023 -0.041  0.018 ...
+#> 2 doc_2     FALSE  NA         -0.015  0.032  0.044 ...
+#> 3 doc_3     FALSE  NA          0.008 -0.027  0.031 ...
+```
+
+## Batch Completions
+
+```{r batch-completions, eval = FALSE}
+# 1. Prepare your data
+df <- data.frame(
+  id = c("q1", "q2", "q3"),
+  prompt = c(
+    "What is the capital of France?",
+    "Explain photosynthesis in one sentence.",
+    "What is 2 + 2?"
+  )
+)
+
+# 2. Prepare requests
+jsonl_content <- oai_batch_prepare_completions(
+  df,
+  text_var = prompt,
+  id_var = id,
+  model = "gpt-4o-mini",
+  system_prompt = "You are a helpful assistant. Be concise.",
+  temperature = 0,
+  max_tokens = 100
+)
+
+# 3. Upload and trigger batch job
+file_info <- oai_batch_upload(jsonl_content)
+batch_job <- oai_batch_start(
+  file_id = file_info$id,
+  endpoint = "/v1/chat/completions"
+)
+
+# 4. Check status and retrieve results
+status <- oai_batch_status(batch_job$id)
+# ... wait for status$status == "completed" ...
+
+content <- oai_file_content(status$output_file_id)
+completions_df <- oai_batch_parse_completions(content)
+
+completions_df
+#> # A tibble
+#>   custom_id content                                       .error .error_msg
+#>
+#> 1 q1        The capital of France is Paris.               FALSE  NA
+#> 2 q2        Photosynthesis converts sunlight into energy  FALSE  NA
+#> 3 q3        2 + 2 equals 4.
FALSE  NA
+```
+
+## Batch Completions with Structured Output
+
+For classification tasks, or when you need structured data back, combine the Batch API with JSON schemas:
+
+```{r batch-completions-schema, eval = FALSE}
+# 1. Define a schema for sentiment classification
+sentiment_schema <- create_json_schema(
+  name = "sentiment_analysis",
+  schema_object(
+    sentiment = schema_string(
+      "The sentiment of the text",
+      enum = c("positive", "negative", "neutral")
+    ),
+    confidence = schema_number(
+      description = "Confidence score between 0 and 1"
+    )
+  )
+)
+
+# 2. Prepare data
+df <- data.frame(
+  id = c("review_1", "review_2", "review_3"),
+  text = c(
+    "This product is absolutely fantastic! Best purchase ever.",
+    "Terrible quality, broke after one day. Complete waste of money.",
+    "It's okay, nothing special but does the job."
+  )
+)
+
+# 3. Prepare requests with schema
+jsonl_content <- oai_batch_prepare_completions(
+  df,
+  text_var = text,
+  id_var = id,
+  model = "gpt-4o-mini",
+  system_prompt = "Analyse the sentiment of the following text.",
+  schema = sentiment_schema,
+  temperature = 0
+)
+
+# 4. Upload and trigger batch job
+file_info <- oai_batch_upload(jsonl_content)
+batch_job <- oai_batch_start(
+  file_id = file_info$id,
+  endpoint = "/v1/chat/completions"
+)
+
+# 5. Retrieve and parse results
+status <- oai_batch_status(batch_job$id)
+content <- oai_file_content(status$output_file_id)
+results_df <- oai_batch_parse_completions(content)
+
+# The content column contains JSON that can be parsed
+results_df$content
+#> [1] "{\"sentiment\":\"positive\",\"confidence\":0.95}"
+#> [2] "{\"sentiment\":\"negative\",\"confidence\":0.92}"
+#> [3] "{\"sentiment\":\"neutral\",\"confidence\":0.78}"
+
+# Parse the JSON content into columns
+results_df |>
+  dplyr::mutate(
+    parsed = purrr::map(content, jsonlite::fromJSON)
+  ) |>
+  tidyr::unnest_wider(parsed)
+#> # A tibble
+#>   custom_id sentiment confidence .error .error_msg
+#>
+#> 1 review_1  positive        0.95 FALSE  NA
+#> 2 review_2  negative        0.92 FALSE  NA
+#> 3 review_3  neutral         0.78 FALSE  NA
+```
+
+> **Limits**: Each batch file can contain up to 50,000 requests or 200 MB, whichever is reached first. For larger datasets, split into multiple batches.
+
+# When to choose Synchronous vs Asynchronous
+
+> For a more comprehensive treatment and motivating examples, [OpenAI's official documentation/guide](https://platform.openai.com/docs/guides/batch) is a good place to start.
+
+| | Synchronous | Asynchronous (Batch) |
+|----|----|----|
+| Cost | Full price per token | \~50% discount per token |
+| Latency | Real-time | Up to 24 hours |
+| Use Case | Experimentation, prompt testing, schema development, user-facing applications | Recurrent workflows (evals etc.), embedding large datasets, classifying large datasets |
+| Data Size | Up to \~10,000 requests | \~10,000+ requests |
+
+> **Recommendation**: Use the synchronous API when you need immediate feedback, e.g. prompt or schema development, and for small datasets where cost savings are irrelevant. Once everything is figured out, move to the Batch API to save on cost.
+
+# Cleaning Up
+
+Once the batch job has completed, the associated files remain on the OpenAI Files API. Your OpenAI account will be charged for storage, so it's best to download the results and save them in your org's own cloud storage.
+
+```{r, eval = FALSE}
+oai_file_delete(file_info$id) # delete the input file
+
+oai_file_delete(status$output_file_id) # delete the output file
+oai_file_delete(status$error_file_id) # delete the error file
+```
+
+
+> **NOTE**: At the time of writing, OpenAI stores information in both the Batch API and the Files API. You must delete your input, output, and error files from the *Files API*; you cannot delete them via the Batch API.
+
+# Technical Details
+
+## Batch Limits
+
+The OpenAI Batch API enforces specific limits per batch file. If your data exceeds these, you must split it into multiple batch jobs.
+
+- Max Requests per Batch: 50,000
+- Max File Size: 200 MB
+
+  > **Warning**: When using Structured Outputs, the JSON schema is repeated for every single request in the batch file. For complex schemas, you may hit the 200 MB file size limit well before you reach the 50,000 row limit.
+
+## Underlying Request Format
+
+EndpointR handles the JSON formatting for you, but for debugging purposes, it is helpful to know what the API expects. Each line in the batch file is a JSON object containing a custom_id and the request body.
+
+
+```{json}
+{
+  "custom_id": "doc_1",
+  "method": "POST",
+  "url": "/v1/embeddings",
+  "body": {
+    "input": "The quick brown fox...",
+    "model": "text-embedding-3-small",
+    "encoding_format": "float"
+  }
+}
+```
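+
+You can reproduce and inspect a line like this yourself before uploading, using the exported request builder. A quick debugging sketch; `jsonlite::prettify()` simply pretty-prints the JSON string:
+
+```{r, eval = FALSE}
+req <- oai_batch_build_embed_req(
+  input = "The quick brown fox...",
+  id = "doc_1"
+)
+jsonlite::prettify(req)
+```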
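+
+## Polling for Completion
+
+EndpointR does not currently export a polling helper, but one is easy to sketch with `oai_batch_status()`. The helper below, its name, and the 10-minute interval are illustrative assumptions, not part of the package:
+
+```{r, eval = FALSE}
+# poll the batch until it reaches a terminal state, checking every 10 minutes
+wait_for_batch <- function(batch_id, interval = 600, max_checks = 144) {
+  terminal <- c("completed", "failed", "expired", "cancelled")
+  for (i in seq_len(max_checks)) {
+    status <- oai_batch_status(batch_id)
+    if (status$status %in% terminal) {
+      return(status)
+    }
+    Sys.sleep(interval)
+  }
+  status # return the last seen status if we run out of checks
+}
+
+status <- wait_for_batch(batch_job$id)
+```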