Extract information from unstructured documents at scale with Amazon Bedrock
Open-source asset published in the aws-samples GitHub organization
Converting documents into structured databases is a recurring business need. Common use cases include creating product feature tables from descriptions, extracting metadata from legal contracts, and analyzing customer reviews.
This repository provides an AWS CDK solution for intelligent document processing (IDP) using generative AI.
Key features:
- Extract structured information:
  - Well-defined entities (names, titles, etc.)
  - Numeric scores (sentiment, urgency, etc.)
  - Free-form content (summaries, responses, etc.)
- Simply describe attributes to extract without data annotation or model training
- Leverage Amazon Bedrock Data Automation and multi-modal LLMs (including VLMs for complex diagrams)
- Process PDFs, MS Office files, images, and text via Python API or web interface
- Deploy as MCP server to equip AI agents with document processing capabilities
See the Security section below before deployment.
- Demo
- Architecture
- Deployment
- Usage
- Team
- Contributing
- Security
- License
API Example
See the demo notebook for the complete implementation:

```python
docs = ['doc1', 'doc2']

features = [
    {"name": "delay", "description": "delay of the shipment in days"},
    {"name": "shipment_id", "description": "unique shipment identifier"},
    {"name": "summary", "description": "one-sentence summary of the doc"},
]

run_idp_bedrock_api(documents=docs, features=features)
# [{'delay': 2, 'shipment_id': '123890', 'summary': 'summary1'},
#  {'delay': 3, 'shipment_id': '678623', 'summary': 'summary2'}]
```
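Since the solution also processes PDFs, MS Office files, and images, the same call can be pointed at document files instead of raw text. A minimal sketch, assuming the API accepts local file paths in `documents` (the file names below are hypothetical; see the demo notebook for the exact supported inputs and optional parameters):

```python
# Sketch: extracting attributes from document files rather than raw strings.
docs = ["data/shipment_report.pdf", "data/customer_email.png"]  # hypothetical paths

features = [
    {"name": "delay", "description": "delay of the shipment in days"},
    {"name": "carrier", "description": "name of the shipping carrier"},
]

results = run_idp_bedrock_api(documents=docs, features=features)
for doc, result in zip(docs, results):
    print(doc, "->", result)
```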
Web Interface

Demo video: idp_demo.mp4
Deployment

Deploy to your AWS account using a local IDE or a SageMaker Notebook instance.
Recommendation: Use SageMaker (ml.t3.large) to avoid local setup. Ensure the IAM role has CloudFormation deployment permissions.
```bash
git clone https://github.com/aws-samples/intelligent-document-processing-with-amazon-bedrock.git
cd intelligent-document-processing-with-amazon-bedrock
```

SageMaker Notebook:

```bash
sh install_deps.sh
```

Local Development: Ensure you have:
- AWS CLI with configured profile
- Node.js
- AWS CDK Toolkit
- Python 3.9+
- uv
- Docker Desktop
```bash
sh install_env.sh
source .venv/bin/activate
```

Copy config-example.yml to config.yml and customize the settings:
```yaml
stack_name: idp-test  # Stack name and resource prefix (<16 chars, no "aws" prefix)
# ... other settings
frontend:
  deploy_ecs: True  # Deploy web interface
```

Add your email to the Cognito users list in the authentication section.
- Open the Amazon Bedrock console in your target region
- Navigate to "Model Access"
- Request access for the models specified in config.yml (the snippet below shows one way to verify access)
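To confirm that access has been granted, you can list the foundation models visible in the region and send a minimal test request with boto3. This is an optional sanity check, not part of the repository; the model ID below is only an example, substitute the models from your config.yml:

```python
import boto3

# Adjust the profile and region to your deployment target.
session = boto3.Session(profile_name="PROFILE_NAME", region_name="us-east-1")

# List foundation models available in this region.
bedrock = session.client("bedrock")
model_ids = [m["modelId"] for m in bedrock.list_foundation_models()["modelSummaries"]]
print(f"{len(model_ids)} foundation models visible in this region")

# Minimal invocation to verify access for a specific model
# (raises an AccessDeniedException if model access has not been granted).
runtime = session.client("bedrock-runtime")
response = runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example only; use a model from config.yml
    messages=[{"role": "user", "content": [{"text": "ping"}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```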
Bootstrap CDK:

```bash
cdk bootstrap --profile [PROFILE_NAME]
```

Ensure Docker is running, then deploy:

```bash
cdk deploy --profile [PROFILE_NAME]
```

To remove the solution, run:

```bash
cdk destroy --profile [PROFILE_NAME]
```

Or delete the CloudFormation stack from the AWS console.
Troubleshooting:

- Permission Issues: CDK deployment requires near-admin permissions. See the minimal required permissions.
- S3 Bucket Deletion: Empty the S3 bucket manually before stack deletion if it contains documents.
- Python Path Error: If you see /bin/sh: python3: command not found, update the Python path in cdk.json.
Usage

We offer three ways to interact with the IDP solution: API calls, a web application, and an MCP server.
Python API

Follow the demo notebook to process documents programmatically:
- Provide input documents
- Define attributes to extract
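For example, the same name/description attribute format used in the API example above can be adapted to other use cases such as legal contract metadata. The attribute names and file path below are purely illustrative:

```python
# Illustrative attribute definitions for contract metadata extraction;
# each attribute only needs a name and a natural-language description.
contract_features = [
    {"name": "party_names", "description": "names of the contracting parties"},
    {"name": "effective_date", "description": "date when the contract becomes effective"},
    {"name": "termination_clause", "description": "one-sentence summary of the termination clause"},
]

results = run_idp_bedrock_api(
    documents=["contracts/msa_2024.pdf"],  # hypothetical input document
    features=contract_features,
)
print(results[0])
```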
Web Application

Access:
- URL appears in CDK deployment output as "CloudfrontDistributionName"
- Or find the CloudFront distribution domain in AWS console
Login:
- Username: Your email from config.yml
- Password: Temporary password sent to your email after deployment
Local testing:
```bash
cd src/ecs
# Set STACK_NAME in .env file
# Configure AWS credentials in .env or export AWS_PROFILE
uv venv && source .venv/bin/activate
uv sync --extra dev
streamlit run src/Home.py
```

Access the UI at http://localhost:8501.
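Before starting the app, you can optionally confirm that the configured credentials resolve to the expected account. A small check using boto3 (not part of the repository, just a convenience):

```python
import boto3

# Uses the same credential chain as the Streamlit app (AWS_PROFILE / .env settings).
identity = boto3.client("sts").get_caller_identity()
print("Account:", identity["Account"])
print("Caller ARN:", identity["Arn"])
```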
MCP Server

⚠️ SECURITY NOTICE: The MCP server package is NOT available on PyPI. Only use the local installation methods described below. Any package named idp-bedrock-mcp-server on PyPI is not official and may be malicious.
Deploy MCP servers to expose document processing as standardized tools for AI agents.
Two Options Available:
Local Stdio Server (recommended for development)
- Supports local file upload
- Uses your AWS credentials
- Easy installation via local deployment script
```bash
cd mcp/local_server/
sh deploy_stdio_server.sh
```

Bedrock AgentCore Server (recommended for production)
- Remote hosted service
- Cognito authentication
- Scalable infrastructure
```bash
cd mcp/bedrock_server/
python deploy_idp_bedrock_mcp.py
```

Available Tools:

- extract_document_attributes: Process documents with custom attributes
- get_extraction_status: Check job status
- list_supported_models: Get available Bedrock models
- get_bucket_info: Get S3 bucket details
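As a quick smoke test for the local stdio server, an MCP client can list and call these tools. A sketch using the official MCP Python SDK; the server command, entry-point path, and tool argument names are assumptions here, so check the output of deploy_stdio_server.sh and the tool schemas for the exact values:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Assumed entry point for the locally deployed stdio server; adjust to the
# command printed by deploy_stdio_server.sh.
server = StdioServerParameters(command="python", args=["mcp/local_server/server.py"])


async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # List the tools exposed by the IDP MCP server.
            tools = await session.list_tools()
            print("Tools:", [t.name for t in tools.tools])

            # Hypothetical call; argument names must match the tool's input schema.
            result = await session.call_tool(
                "extract_document_attributes",
                {
                    "documents": ["doc1"],
                    "features": [{"name": "summary", "description": "one-sentence summary"}],
                },
            )
            print(result)


asyncio.run(main())
```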
See the detailed documentation for each option in the corresponding mcp/ subdirectory.
Team

Core Team: Nikita Kozodoi, Nuno Castro
Contributors: Romain Besombes, Zainab Afolabi, Egor Krasheninnikov, Huong Vu, Aiham Taleb, Elizaveta Zinovyeva, Babs Khalidson, Ennio Pastore
Acknowledgements:
Contributing

We welcome contributions to improve this project!

To keep coding standards and formatting consistent, we use pre-commit. Run it from the terminal via uv run pre-commit run -a.
See CONTRIBUTING for more information.
Security

Note: this asset represents a proof-of-value for the services included and is not intended as a production-ready solution. The asset is not scoped for handling regulated and/or PII data. You must determine how the AWS Shared Responsibility Model applies to your specific use case and implement the needed controls to achieve your desired security outcomes. AWS offers a broad set of security tools and configurations to enable our customers.
- Network & Delivery:
  - Amazon CloudFront:
    - Use geography-aware rules to block or allow access to CloudFront distributions where required.
    - Use AWS WAF on public CloudFront distributions.
    - Ensure that the solution's CloudFront distributions use a security policy with minimum TLSv1.1 or TLSv1.2 and appropriate security ciphers for HTTPS viewer connections. Currently, the CloudFront distribution allows SSLv3 or TLSv1 for HTTPS viewer connections and uses SSLv3 or TLSv1 for communication to the origin (see the CDK sketch after this list).
  - Amazon API Gateway:
    - Activate request validation on API Gateway endpoints to do first-pass input validation.
    - Use AWS WAF on public-facing API Gateway endpoints.
- AI & Data:
  - Amazon Bedrock:
    - Enable model invocation logging and set alerts to ensure adherence to any responsible AI policies. Model invocation logging is disabled by default; see the Bedrock model invocation logging documentation.
    - Consider enabling Bedrock Guardrails to add baseline protections against analyzing documents or extracting attributes covering certain protected topics, as well as to detect and mask PII data in user-uploaded inputs.
    - Note that the solution is not scoped for processing regulated data.
- Security & Compliance:
  - Amazon Cognito:
    - Implement multi-factor authentication (MFA) in each Cognito User Pool.
    - Consider setting AdvancedSecurityMode to ENFORCED in Cognito User Pools (see the CDK sketch after this list).
  - Amazon KMS:
    - Implement KMS key rotation for regulatory compliance or other specific cases.
    - Configure, monitor, and alert on KMS events according to lifecycle policies.
- Serverless:
  - AWS Lambda:
    - Periodically scan all AWS Lambda container images for vulnerabilities according to lifecycle policies. Amazon Inspector can be used for this.
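The CloudFront and Cognito recommendations above can be expressed in CDK. The following is a generic sketch (CDK v2, Python), not the constructs shipped in this stack; origin, resource names, and IDs are placeholders:

```python
from aws_cdk import Stack
from aws_cdk import aws_cloudfront as cloudfront
from aws_cdk import aws_cloudfront_origins as origins
from aws_cdk import aws_cognito as cognito
from constructs import Construct


class HardeningSketchStack(Stack):
    """Illustrative security settings only; not the stack deployed by this repository."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Enforce a modern TLS viewer policy on the CloudFront distribution.
        cloudfront.Distribution(
            self,
            "FrontendDistribution",
            default_behavior=cloudfront.BehaviorOptions(
                origin=origins.HttpOrigin("example-origin.example.com"),  # placeholder origin
                viewer_protocol_policy=cloudfront.ViewerProtocolPolicy.REDIRECT_TO_HTTPS,
            ),
            minimum_protocol_version=cloudfront.SecurityPolicyProtocol.TLS_V1_2_2021,
        )

        # Require MFA and enforce advanced security features on the user pool.
        cognito.UserPool(
            self,
            "UserPool",
            mfa=cognito.Mfa.REQUIRED,
            mfa_second_factor=cognito.MfaSecondFactor(sms=False, otp=True),
            advanced_security_mode=cognito.AdvancedSecurityMode.ENFORCED,
        )
```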
License

Licensed under the MIT-0 License. See the LICENSE file.











