[Python] Add agent-framework-azure-ai-contentunderstanding package #4829
Open: yungshinlintw wants to merge 71 commits into microsoft:main from yungshinlintw:yslin/contentunderstanding-context-provider
Commits
- d173f9a feat: add agent-framework-azure-contentunderstanding package (yungshinlintw)
- 0c2ea7b fix: update CU fixtures with real API data, fix test assertions (yungshinlintw)
- 8e6e73b chore: add connector .gitignore, update uv.lock (yungshinlintw)
- 3fbb8f7 refactor: rename to azure-ai-contentunderstanding, fix CI issues (yungshinlintw)
- 36ee6a4 feat: add samples (document_qa, invoice_processing, multimodal_chat) (yungshinlintw)
- dec37e8 feat: add remaining samples (devui_multimodal_agent, large_doc_file_s… (yungshinlintw)
- dd918bb feat: add file_search integration for large document RAG (yungshinlintw)
- c4fe308 fix: add key-based auth support to all samples (yungshinlintw)
- f995d7a FEATURE(python): add analyzer auto-detection, file_search RAG, and la… (yungshinlin)
- 85c8999 feat(cu): MIME sniffing, media-aware formatting, unified timeout, vec… (yungshinlin)
- fcd04f1 fix: merge all CU content segments for video/audio analysis (yungshinlin)
- 03073a5 refactor: improve CU context provider docs and remove ContentLimits (yungshinlintw)
- 4e8a8cc feat: support user-provided vector store in FileSearchConfig (yungshinlintw)
- 14234d2 Merge upstream/main into yslin/contentunderstanding-context-provider (yungshinlintw)
- 04e8dce fix: remove ContentLimits from README code block (yungshinlintw)
- 637a3a4 refactor: create CU client in __init__ instead of __aenter__ (yungshinlintw)
- 1f451b6 docs: add file_search param to class docstring (yungshinlintw)
- d914fbc feat: introduce FileSearchBackend abstraction for cross-client support (yungshinlintw)
- cb9b5b6 refactor: FileSearchBackend abstraction + caller-owned vector store (yungshinlintw)
- 478731e fix: file_search reliability and sample improvements (yungshinlintw)
- 90284e6 perf: set max_num_results=10 for file_search to reduce token usage (yungshinlintw)
- 67975c6 fix: move import to top of file (E402 lint) (yungshinlintw)
- 4345cbc chore: remove unused imports (yungshinlintw)
- 0403365 fix: align azure-ai-contentunderstanding with MAF coding conventions (yungshinlin)
- a3c50a2 refactor: improve CU context provider API surface and fix CI (yungshinlin)
- c6b1cc7 Merge remote-tracking branch 'origin/main' into yslin/contentundersta… (yungshinlin)
- 123bfdf fix: improve file_search samples and move tool guidelines to context … (yungshinlin)
- b1ce674 feat: improve source_id, integration tests, and content assertions (yungshinlin)
- 29975c4 feat: reject duplicate filenames, add integration tests and sample co… (yungshinlin)
- cd72233 chore: improve doc key derivation, comments, and README (yungshinlin)
- 6285d36 Merge branch 'main' into yslin/contentunderstanding-context-provider (yungshinlintw)
- c3fb1c7 test: strengthen _format_result assertions with exact expected strings (yungshinlin)
- df382a9 refactor: move invoice.pdf to shared sample_assets directory (yungshinlin)
- b06a34e refactor: reorganize samples into numbered dirs and simplify auth (yungshinlin)
- b78bf9c fix: resolve CI lint errors (D205, RUF001, E501) (yungshinlin)
- 4eef541 refactor: overhaul samples — FoundryChatClient, sessions, remove get_… (yungshinlin)
- f8fe7c8 feat: add 05_background_analysis sample and fix 04 session/max_wait (yungshinlin)
- 3d10a7c docs: update README and fix sample 06 (yungshinlin)
- b635de9 docs: rewrite README — concise format, prerequisites, CU link (yungshinlin)
- 443b4c4 fix: resolve pyright errors in _format_result segment cast (yungshinlin)
- 91a7410 docs: add numbered section comments and fresh sample output to all sa… (yungshinlin)
- ef7e378 feat(devui): add video file upload support (yungshinlin)
- 6856a27 feat: add load_settings support for env var configuration (yungshinlin)
- c620a93 docs: polish README — fix duplicate env var, add Next steps, service … (yungshinlin)
- b9edeaf chore: trim invoice fixture from 199K to 33 lines (yungshinlin)
- aa3f71c revert: remove devui video upload changes (will be in separate PR) (yungshinlin)
- ee341e2 feat: per-file analyzer_id override via additional_properties (yungshinlin)
- d3c4047 Trim PDF test fixture and clarify unique filename requirement (yungshinlin)
- 6ee5d98 Merge branch 'main' into yslin/contentunderstanding-context-provider (yungshinlintw)
- 5ee0514 Update python/packages/azure-ai-contentunderstanding/agent_framework_… (yungshinlintw)
- d0e98b3 Update python/packages/azure-ai-contentunderstanding/agent_framework_… (yungshinlintw)
- dd1fffb Update python/packages/azure-ai-contentunderstanding/samples/02-devui… (yungshinlintw)
- 0714d17 Update python/packages/azure-ai-contentunderstanding/samples/02-devui… (yungshinlintw)
- c456327 Update python/packages/azure-ai-contentunderstanding/samples/01-get-s… (yungshinlintw)
- ebca922 Fix AGENTS.md to match implementation; remove unused variable in test… (yungshinlin)
- 48c31d9 Fix premature file_search instruction for background-completed docs (yungshinlin)
- d288fc6 fix: wrap long line in devui agent instructions (E501) (yungshinlin)
- 053bca5 Fix Copilot review: unused logger, stray code in README, await cancel… (yungshinlin)
- e52d28d Sanitize doc keys and fix duplicate filename re-injection (yungshinlin)
- 0afc812 fix: add type annotation to tasks_to_cancel for pyright (yungshinlin)
- 860ba4e Move per-session mutable state to state dict for session isolation (yungshinlin)
- 898478f Remove unused AnalysisSection enum values (yungshinlin)
- b376ad8 Merge branch 'main' into yslin/contentunderstanding-context-provider (yungshinlintw)
- 7f5ff2e Recursively flatten object/array field values for cleaner LLM output (yungshinlin)
- a5cb199 Preserve sub-field confidence; compare full expected JSON in tests (yungshinlin)
- dd707a0 Remove incorrect MIME aliases (audio/mp4, video/x-matroska) (yungshinlin)
- 9f31124 feat: add AnalysisInput, content_range, warnings, and category support (yungshinlin)
- 42b5ed1 fix: falsy-0 bug in duration calc; improve test coverage (yungshinlin)
- b930827 refactor: split _context_provider.py into focused modules (yungshinlin)
- b73e2b8 Merge branch 'main' into yslin/contentunderstanding-context-provider (yungshinlintw)
- 2e9f952 docs: update AGENTS.md with DocumentStatus, FileSearchBackend, and _f… (yungshinlin)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| # Local-only files (not committed) | ||
| _local_only/ | ||
| *_local_only* |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,72 @@ | ||
| # AGENTS.md — azure-ai-contentunderstanding | ||
|
|
||
| ## Package Overview | ||
|
|
||
| `agent-framework-azure-ai-contentunderstanding` integrates Azure Content Understanding (CU) | ||
| into the Agent Framework as a context provider. It automatically analyzes file attachments | ||
| (documents, images, audio, video) and injects structured results into the LLM context. | ||
|
|
||
| ## Public API | ||
|
|
||
| | Symbol | Type | Description | | ||
| |--------|------|-------------| | ||
| | `ContentUnderstandingContextProvider` | class | Main context provider — extends `BaseContextProvider` | | ||
| | `AnalysisSection` | enum | Output section selector (MARKDOWN, FIELDS, etc.) | | ||
| | `DocumentStatus` | enum | Document lifecycle state (ANALYZING, UPLOADING, READY, FAILED) | | ||
| | `FileSearchBackend` | ABC | Abstract vector store file operations interface | | ||
| | `FileSearchConfig` | dataclass | Configuration for CU + vector store RAG mode | | ||
|
|
||
| ## Architecture | ||
|
|
||
| - **`_context_provider.py`** — Main provider implementation. Overrides `before_run()` to detect | ||
| file attachments, call the CU API, manage session state with multi-document tracking, | ||
| and auto-register retrieval tools for follow-up turns. | ||
| - **Analyzer auto-detection** — When `analyzer_id=None` (default), `_resolve_analyzer_id()` | ||
| selects the CU analyzer based on media type prefix: `audio/` → `prebuilt-audioSearch`, | ||
| `video/` → `prebuilt-videoSearch`, everything else → `prebuilt-documentSearch`. | ||
| - **Multi-segment output** — CU splits long video/audio into multiple scene segments | ||
| (each a separate `contents[]` entry with its own `startTimeMs`, `endTimeMs`, `markdown`, | ||
| and `fields`). `_extract_sections()` produces: | ||
|   - `segments`: list of per-segment dicts, each with `markdown`, `fields`, `start_time_s`, `end_time_s` | ||
|   - `markdown`: concatenated at top level with `---` separators (for file_search uploads) | ||
|   - `duration_seconds`: computed from global `min(startTimeMs)` → `max(endTimeMs)` | ||
|   - Metadata (`kind`, `resolution`): taken from the first segment | ||
| - **Speaker diarization (not identification)** — CU transcripts label speakers as | ||
| `<Speaker 1>`, `<Speaker 2>`, etc. CU does **not** identify speakers by name. | ||
| - **file_search RAG** — When `FileSearchConfig` is provided, CU-extracted markdown is | ||
| uploaded to an OpenAI vector store and a `file_search` tool is registered on the context | ||
| instead of injecting the full document content. This enables token-efficient retrieval | ||
| for large documents. | ||
| - **`_models.py`** — `AnalysisSection` enum, `DocumentStatus` enum, `DocumentEntry` TypedDict, | ||
| `FileSearchConfig` dataclass. | ||
| - **`_file_search.py`** — `FileSearchBackend` ABC, `OpenAIFileSearchBackend`, | ||
| `FoundryFileSearchBackend`. | ||
|
|
||
| ## Key Patterns | ||
|
|
||
| - Follows the Azure AI Search context provider pattern (same lifecycle, config style). | ||
| - Uses provider-scoped `state` dict for multi-document tracking across turns. | ||
| - Auto-registers `list_documents()` tool via `context.extend_tools()`. | ||
| - Configurable timeout (`max_wait`) with `asyncio.create_task()` background fallback. | ||
| - Strips supported binary attachments from `input_messages` to prevent LLM API errors. | ||
| - Explicit `analyzer_id` always overrides auto-detection (user preference wins). | ||
| - Vector store resources are cleaned up in `close()` / `__aexit__`. | ||
|
|
||
| ## Samples | ||
|
|
||
| | Sample | Description | | ||
| |--------|-------------| | ||
| | `01_document_qa.py` | Upload a PDF via URL, ask questions about it | | ||
| | `02_multi_turn_session.py` | AgentSession persistence across turns | | ||
| | `03_multimodal_chat.py` | PDF + audio + video parallel analysis | | ||
| | `04_invoice_processing.py` | Structured field extraction with `prebuilt-invoice` analyzer | | ||
| | `05_background_analysis.py` | Non-blocking analysis with `max_wait` + status tracking | | ||
| | `06_large_doc_file_search.py` | CU extraction + OpenAI vector store RAG | | ||
| | `02-devui/01-multimodal_agent/` | DevUI web UI for CU-powered chat | | ||
| | `02-devui/02-file_search_agent/` | DevUI web UI combining CU + file_search RAG | | ||
|
|
||
| ## Running Tests | ||
|
|
||
| ```bash | ||
| uv run poe test -P azure-ai-contentunderstanding | ||
| ``` |
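The multi-segment merging described under Architecture in AGENTS.md can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's `_extract_sections()`: the function name `merge_segments`, the exact spacing around the `---` separator, and the plain-dict input shape are hypothetical. Only the `startTimeMs`, `endTimeMs`, `markdown`, `fields`, and `kind` keys come from the text above. Times are compared against `None` rather than tested for truthiness, so a segment that starts at 0 ms still counts toward the duration.

```python
from __future__ import annotations

from typing import Any


def merge_segments(contents: list[dict[str, Any]]) -> dict[str, Any]:
    """Merge CU-style ``contents[]`` entries into one result (hypothetical helper)."""
    segments: list[dict[str, Any]] = []
    starts: list[int] = []
    ends: list[int] = []

    for entry in contents:
        start_ms = entry.get("startTimeMs")
        end_ms = entry.get("endTimeMs")
        # Use "is not None" so a legitimate 0 ms timestamp is not discarded.
        if start_ms is not None:
            starts.append(start_ms)
        if end_ms is not None:
            ends.append(end_ms)
        segments.append({
            "markdown": entry.get("markdown", ""),
            "fields": entry.get("fields", {}),
            "start_time_s": start_ms / 1000 if start_ms is not None else None,
            "end_time_s": end_ms / 1000 if end_ms is not None else None,
        })

    # Duration spans the earliest start to the latest end across all segments.
    duration = (max(ends) - min(starts)) / 1000 if starts and ends else None
    first = contents[0] if contents else {}

    return {
        "segments": segments,
        # Separator spacing is an assumption; the source only says "--- separators".
        "markdown": "\n\n---\n\n".join(s["markdown"] for s in segments if s["markdown"]),
        "duration_seconds": duration,
        # Metadata is taken from the first segment only.
        "kind": first.get("kind"),
    }
```

Keeping both the per-segment list and the concatenated markdown lets the same result serve direct context injection and a single-file vector store upload.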
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| MIT License | ||
|
|
||
| Copyright (c) Microsoft Corporation. | ||
|
|
||
| Permission is hereby granted, free of charge, to any person obtaining a copy | ||
| of this software and associated documentation files (the "Software"), to deal | ||
| in the Software without restriction, including without limitation the rights | ||
| to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
| copies of the Software, and to permit persons to whom the Software is | ||
| furnished to do so, subject to the following conditions: | ||
|
|
||
| The above copyright notice and this permission notice shall be included in all | ||
| copies or substantial portions of the Software. | ||
|
|
||
| THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
| IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
| FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
| AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
| LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
| OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
| SOFTWARE. |
128 changes: 128 additions & 0 deletions
python/packages/azure-ai-contentunderstanding/README.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,128 @@ | ||
| # Get Started with Azure Content Understanding in Microsoft Agent Framework | ||
|
|
||
| Please install this package via pip: | ||
|
|
||
| ```bash | ||
| pip install agent-framework-azure-ai-contentunderstanding --pre | ||
| ``` | ||
|
|
||
| ## Azure Content Understanding Integration | ||
|
|
||
| ### Prerequisites | ||
|
|
||
| Before using this package, you need an Azure Content Understanding resource: | ||
|
|
||
| 1. An active **Azure subscription** ([create one for free](https://azure.microsoft.com/pricing/purchase-options/azure-account)) | ||
| 2. A **Microsoft Foundry resource** created in a [supported region](https://learn.microsoft.com/azure/ai-services/content-understanding/language-region-support) | ||
| 3. **Default model deployments** configured for your resource (GPT-4.1, GPT-4.1-mini, text-embedding-3-large) | ||
|
|
||
| Follow the [prerequisites section](https://learn.microsoft.com/azure/ai-services/content-understanding/quickstart/use-rest-api?tabs=portal%2Cdocument&pivots=programming-language-rest#prerequisites) in the Azure Content Understanding quickstart for setup instructions. | ||
|
|
||
| ### Introduction | ||
|
|
||
| The Azure Content Understanding integration provides a context provider that automatically analyzes file attachments (documents, images, audio, video) using [Azure Content Understanding](https://learn.microsoft.com/azure/ai-services/content-understanding/) and injects structured results into the LLM context. | ||
|
|
||
| - **Document & image analysis**: State-of-the-art OCR with markdown extraction, table preservation, and structured field extraction — handles scanned PDFs, handwritten content, and complex layouts | ||
| - **Audio & video analysis**: Transcription, speaker diarization, and per-segment summaries | ||
| - **Background processing**: Configurable timeout with async background fallback for large files | ||
| - **file_search integration**: Optional vector store upload for token-efficient RAG on large documents | ||
|
|
||
| > Learn more about Azure Content Understanding capabilities at [https://learn.microsoft.com/azure/ai-services/content-understanding/](https://learn.microsoft.com/azure/ai-services/content-understanding/) | ||
|
|
||
| ### Basic Usage Example | ||
|
|
||
| See the [samples directory](samples/), which demonstrates: | ||
|
|
||
| - Single PDF upload and Q&A ([01_document_qa](samples/01-get-started/01_document_qa.py)) | ||
| - Multi-turn sessions with cached results ([02_multi_turn_session](samples/01-get-started/02_multi_turn_session.py)) | ||
| - PDF + audio + video parallel analysis ([03_multimodal_chat](samples/01-get-started/03_multimodal_chat.py)) | ||
| - Structured field extraction with prebuilt-invoice ([04_invoice_processing](samples/01-get-started/04_invoice_processing.py)) | ||
| - Non-blocking background analysis with status tracking ([05_background_analysis](samples/01-get-started/05_background_analysis.py)) | ||
| - CU extraction + OpenAI vector store RAG ([06_large_doc_file_search](samples/01-get-started/06_large_doc_file_search.py)) | ||
| - Interactive web UI with DevUI ([02-devui](samples/02-devui/)) | ||
|
|
||
| ```python | ||
| import asyncio | ||
| from agent_framework import Agent, AgentSession, Message, Content | ||
| from agent_framework.foundry import FoundryChatClient | ||
| from agent_framework_azure_ai_contentunderstanding import ContentUnderstandingContextProvider | ||
| from azure.identity import AzureCliCredential | ||
|
|
||
| credential = AzureCliCredential() | ||
|
|
||
| cu = ContentUnderstandingContextProvider( | ||
|     endpoint="https://my-resource.cognitiveservices.azure.com/", | ||
|     credential=credential, | ||
|     max_wait=None,  # block until CU extraction completes before sending to LLM | ||
| ) | ||
|
|
||
| client = FoundryChatClient( | ||
|     project_endpoint="https://your-project.services.ai.azure.com", | ||
|     model="gpt-4.1", | ||
|     credential=credential, | ||
| ) | ||
|
|
||
| async def main(): | ||
|     async with cu: | ||
|         agent = Agent( | ||
|             client=client, | ||
|             name="DocumentQA", | ||
|             instructions="You are a helpful document analyst.", | ||
|             context_providers=[cu], | ||
|         ) | ||
|         session = AgentSession() | ||
|
|
||
|         response = await agent.run( | ||
|             Message(role="user", contents=[ | ||
|                 Content.from_text("What's on this invoice?"), | ||
|                 Content.from_uri( | ||
|                     "https://raw.githubusercontent.com/Azure-Samples/" | ||
|                     "azure-ai-content-understanding-assets/main/document/invoice.pdf", | ||
|                     media_type="application/pdf", | ||
|                     additional_properties={"filename": "invoice.pdf"}, | ||
|                 ), | ||
|             ]), | ||
|             session=session, | ||
|         ) | ||
|         print(response.text) | ||
|
|
||
| asyncio.run(main()) | ||
| ``` | ||
|
|
||
| ### Supported File Types | ||
|
|
||
| | Category | Types | | ||
| |----------|-------| | ||
| | Documents | PDF, DOCX, XLSX, PPTX, HTML, TXT, Markdown | | ||
| | Images | JPEG, PNG, TIFF, BMP | | ||
| | Audio | WAV, MP3, M4A, FLAC, OGG | | ||
| | Video | MP4, MOV, AVI, WebM | | ||
|
|
||
| For the complete list of supported file types and size limits, see [Azure Content Understanding service limits](https://learn.microsoft.com/azure/ai-services/content-understanding/service-limits#input-file-limits). | ||
|
|
||
| ### Environment Variables | ||
|
|
||
| The provider supports automatic endpoint resolution from environment variables. | ||
| When `endpoint` is not passed to the constructor, it is loaded from | ||
| `AZURE_CONTENTUNDERSTANDING_ENDPOINT`: | ||
|
|
||
| ```python | ||
| # Endpoint auto-loaded from AZURE_CONTENTUNDERSTANDING_ENDPOINT env var | ||
| cu = ContentUnderstandingContextProvider(credential=credential) | ||
| ``` | ||
|
|
||
| Set these in your shell or in a `.env` file: | ||
|
|
||
| ```bash | ||
| AZURE_CONTENTUNDERSTANDING_ENDPOINT=https://your-cu-resource.cognitiveservices.azure.com/ | ||
| AZURE_AI_PROJECT_ENDPOINT=https://your-project.services.ai.azure.com | ||
| AZURE_OPENAI_DEPLOYMENT_NAME=gpt-4.1 | ||
| ``` | ||
|
|
||
| You also need to be logged in with `az login` (for `AzureCliCredential`). | ||
|
|
||
| ### Next steps | ||
|
|
||
| - Explore the [samples directory](samples/) for complete code examples | ||
| - Read the [Azure Content Understanding documentation](https://learn.microsoft.com/azure/ai-services/content-understanding/) for detailed service information | ||
| - Learn more about the [Microsoft Agent Framework](https://aka.ms/agent-framework) | ||
28 changes: 28 additions & 0 deletions
...s/azure-ai-contentunderstanding/agent_framework_azure_ai_contentunderstanding/__init__.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,28 @@ | ||
| # Copyright (c) Microsoft. All rights reserved. | ||
|
|
||
| """Azure Content Understanding integration for Microsoft Agent Framework. | ||
|
|
||
| Provides a context provider that analyzes file attachments (documents, images, | ||
| audio, video) using Azure Content Understanding and injects structured results | ||
| into the LLM context. | ||
| """ | ||
|
|
||
| import importlib.metadata | ||
|
|
||
| from ._context_provider import ContentUnderstandingContextProvider | ||
| from ._file_search import FileSearchBackend | ||
| from ._models import AnalysisSection, DocumentStatus, FileSearchConfig | ||
|
|
||
| try: | ||
|     __version__ = importlib.metadata.version(__name__) | ||
| except importlib.metadata.PackageNotFoundError: | ||
|     __version__ = "0.0.0" | ||
|
|
||
| __all__ = [ | ||
|     "AnalysisSection", | ||
|     "ContentUnderstandingContextProvider", | ||
|     "DocumentStatus", | ||
|     "FileSearchBackend", | ||
|     "FileSearchConfig", | ||
|     "__version__", | ||
| ] |
78 changes: 78 additions & 0 deletions
...azure-ai-contentunderstanding/agent_framework_azure_ai_contentunderstanding/_constants.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,78 @@ | ||
| # Copyright (c) Microsoft. All rights reserved. | ||
|
|
||
| """Constants for Azure Content Understanding context provider. | ||
|
|
||
| Supported media types, MIME aliases, and analyzer mappings used by | ||
| the file detection and analysis pipeline. | ||
| """ | ||
|
|
||
| from __future__ import annotations | ||
|
|
||
| # MIME types used to match against the resolved media type for routing files to CU analysis. | ||
| # The media type may be provided via Content.media_type or inferred (e.g., via sniffing or filename) | ||
| # when missing or generic (such as application/octet-stream). Only files whose resolved media type is | ||
| # in this set will be processed; others are skipped. | ||
| # | ||
| # Supported input file types: | ||
| # https://learn.microsoft.com/azure/ai-services/content-understanding/service-limits#input-file-limits | ||
| SUPPORTED_MEDIA_TYPES: frozenset[str] = frozenset({ | ||
|     # Documents and images | ||
|     "application/pdf", | ||
|     "image/jpeg", | ||
|     "image/png", | ||
|     "image/tiff", | ||
|     "image/bmp", | ||
|     "image/heif", | ||
|     "image/heic", | ||
|     "application/vnd.openxmlformats-officedocument.wordprocessingml.document", | ||
|     "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", | ||
|     "application/vnd.openxmlformats-officedocument.presentationml.presentation", | ||
|     # Text | ||
|     "text/plain", | ||
|     "text/html", | ||
|     "text/markdown", | ||
|     "text/rtf", | ||
|     "text/xml", | ||
|     "application/xml", | ||
|     "message/rfc822", | ||
|     "application/vnd.ms-outlook", | ||
|     # Audio | ||
|     "audio/wav", | ||
|     "audio/mpeg", | ||
|     "audio/mp3", | ||
|     "audio/mp4", | ||
|     "audio/m4a", | ||
|     "audio/flac", | ||
|     "audio/ogg", | ||
|     "audio/opus", | ||
|     "audio/webm", | ||
|     "audio/x-ms-wma", | ||
|     "audio/aac", | ||
|     "audio/amr", | ||
|     "audio/3gpp", | ||
|     # Video | ||
|     "video/mp4", | ||
|     "video/quicktime", | ||
|     "video/x-msvideo", | ||
|     "video/webm", | ||
|     "video/x-flv", | ||
|     "video/x-ms-wmv", | ||
|     "video/x-ms-asf", | ||
|     "video/x-matroska", | ||
| }) | ||
|
|
||
| # Mapping from filetype's MIME output to our canonical SUPPORTED_MEDIA_TYPES values. | ||
| # filetype uses some x-prefixed variants that differ from our set. | ||
| MIME_ALIASES: dict[str, str] = { | ||
|     "audio/x-wav": "audio/wav", | ||
|     "audio/x-flac": "audio/flac", | ||
|     "video/x-m4v": "video/mp4", | ||
| } | ||
|
|
||
| # Mapping from media type prefix to the appropriate prebuilt CU analyzer. | ||
| # Used when analyzer_id is None (auto-detect mode). | ||
| MEDIA_TYPE_ANALYZER_MAP: dict[str, str] = { | ||
|     "audio/": "prebuilt-audioSearch", | ||
|     "video/": "prebuilt-videoSearch", | ||
| } | ||
| DEFAULT_ANALYZER: str = "prebuilt-documentSearch" |
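To show how these constants might combine during auto-detection, here is a minimal sketch. It is not the package's `_resolve_analyzer_id()`, whose body is not shown in this diff: the standalone functions `normalize_media_type` and `resolve_analyzer_id` are hypothetical, and stripping MIME parameters (anything after `;`) is an added assumption. The constant values are copied from `_constants.py` above, and the rule that an explicit `analyzer_id` wins follows AGENTS.md.

```python
from __future__ import annotations

# Values copied from _constants.py in this PR.
MIME_ALIASES: dict[str, str] = {
    "audio/x-wav": "audio/wav",
    "audio/x-flac": "audio/flac",
    "video/x-m4v": "video/mp4",
}
MEDIA_TYPE_ANALYZER_MAP: dict[str, str] = {
    "audio/": "prebuilt-audioSearch",
    "video/": "prebuilt-videoSearch",
}
DEFAULT_ANALYZER: str = "prebuilt-documentSearch"


def normalize_media_type(media_type: str) -> str:
    """Lower-case, drop MIME parameters, and map x-prefixed aliases (hypothetical helper)."""
    base = media_type.split(";", 1)[0].strip().lower()
    return MIME_ALIASES.get(base, base)


def resolve_analyzer_id(media_type: str, analyzer_id: str | None = None) -> str:
    """Pick a prebuilt analyzer by media type prefix unless one is given explicitly."""
    if analyzer_id is not None:
        # An explicit analyzer always overrides auto-detection.
        return analyzer_id
    canonical = normalize_media_type(media_type)
    for prefix, analyzer in MEDIA_TYPE_ANALYZER_MAP.items():
        if canonical.startswith(prefix):
            return analyzer
    # Documents, images, and text all fall through to the default analyzer.
    return DEFAULT_ANALYZER
```

For example, `resolve_analyzer_id("audio/x-wav")` maps through the alias table to `audio/wav` and selects `prebuilt-audioSearch`, while `resolve_analyzer_id("application/pdf", "prebuilt-invoice")` returns the explicit `prebuilt-invoice`.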