Feature Request: Filesystem Access to Pinned Files for Data Processing Workflows
Problem Statement
Currently, files pinned in PinShare are only accessible via:
- IPFS Gateway: http://gateway:8080/ipfs/<CID>
- IPFS CLI: ipfs get <CID>
- IPFS API calls
This creates friction for data processing workflows (ML training, batch processing, analysis tools) that expect direct filesystem access. Many tools and frameworks cannot easily integrate with IPFS APIs and need traditional file paths.
Use Cases
- Machine Learning Training
  - ML frameworks (TensorFlow, PyTorch) expect dataset files at paths like /data/train/image001.jpg
  - Current workaround: manually export files before training
- Batch Data Processing
  - ETL pipelines, video processing, document conversion
  - Tools expect input directories with predictable file structures
- Analysis Tools
  - Scientific computing tools, data analysis frameworks
  - Often require file paths, not URLs or CID-based access
- Development/Testing
  - Easier to inspect and work with files during development
  - Standard filesystem tools (ls, find, grep) work normally
Current Behavior
Files are stored in IPFS's content-addressed block storage (/data/ipfs/blocks/):
- Not accessible as regular files
- Blocks are named by hash, not original filename
- Requires IPFS tooling to retrieve
Proposed Solutions
Option 1: IPFS FUSE Mount (Recommended for MVP)
Implementation:
- Use IPFS's built-in FUSE support to mount /ipfs as a filesystem
- Files accessible at /ipfs/<CID>
- Optionally create symlinks with original filenames → CID paths
Example:
```bash
# Mount IPFS
ipfs mount

# Access files
cat /ipfs/bafkreib566otjk54vgjqrz44xfcgdqmjgwbgatligkned7kl5qmilzvnwq

# Or with symlinks:
# /data/exports/mydocument.pdf → /ipfs/bafkreib566...
```
Pros:
- ✅ No code changes to PinShare backend
- ✅ No storage duplication
- ✅ Read-only by default (safety)
- ✅ Automatic - all pinned files accessible
Cons:
- ❌ Requires FUSE support in container/OS
- ❌ Files accessed by CID, need symlinks for friendly names
- ❌ Can be slower than local files for large datasets
- ❌ Requires privileged container permissions
Configuration Changes:
```yaml
# k8s/base/ipfs/statefulset.yaml
securityContext:
  privileged: true  # Required for FUSE
  capabilities:
    add:
      - SYS_ADMIN
volumeMounts:
  - name: ipfs-mount
    mountPath: /ipfs
    mountPropagation: Bidirectional
```
Option 2: Export Directory with Auto-Sync
Implementation:
- Add background job to export pinned files to /data/exports/
- Maintain mapping of SHA256 → filesystem path
- Use original filenames from metadata
- Handle filename conflicts (append hash suffix)
Example:
```
/data/exports/
├── mydocument.pdf
├── photo.jpg
├── report_a3f2b9.pdf   # Conflict resolution
└── dataset/
    ├── train/
    └── test/
```
Pros:
- ✅ Familiar file paths with original names
- ✅ Fast access (no IPFS overhead)
- ✅ Works with any tool/framework
- ✅ Can organize into subdirectories (by tag, date, etc.)
Cons:
- ❌ Doubles storage usage
- ❌ Needs sync logic (export on pin, delete on unpin)
- ❌ Filename conflict handling complexity
- ❌ Potential for desync between IPFS and exports
API Endpoints:
```
POST   /files/export/{sha256}   # Export single file
POST   /files/export-all        # Export all pinned files
DELETE /files/export/{sha256}   # Remove exported file
GET    /files/export-status     # Get export directory status
```
Configuration:
```go
type ExportConfig struct {
    Enabled            bool
    ExportDir          string
    AutoExportOnPin    bool
    OrganizeByTag      bool
    ConflictResolution string // "append-hash", "error", "overwrite"
}
```
Option 3: Selective Export with Tags
Implementation:
- Only export files with specific tags (e.g., export:ml-dataset)
- Export to tag-specific directories
- Manual or automatic export based on tags
Example:
```
/data/exports/
├── ml-training/           # Tag: export:ml-training
│   ├── images/
│   └── labels/
└── analysis-datasets/     # Tag: export:analysis
    └── data.csv
```
Pros:
- ✅ Storage efficient (only export what's needed)
- ✅ Organized by use case
- ✅ Clear intent via tags
Cons:
- ❌ Requires manual tagging workflow
- ❌ Still needs export sync logic
Recommended Approach
Phase 1: IPFS FUSE Mount + Symlinks
- Enable IPFS FUSE mount in Kubernetes deployment
- Add API endpoint to create symlinks: /data/exports/(unknown) → /ipfs/{CID}
- Symlinks created automatically on pin, removed on unpin
- No storage duplication, minimal code changes
Phase 2: Optional Full Export
- Add export directory feature for users who need:
  - Faster access (no IPFS overhead)
  - Offline access
  - Tools that don't support symlinks
Implementation Considerations
Storage Space
- FUSE mount: No additional storage
- Export directory: 2x storage (IPFS blocks + exported files)
- Consider storage limits and cleanup policies
Filename Conflicts
Multiple files with same name but different content (different SHA256):
```
mydocument.pdf (sha256: abc123...)
mydocument.pdf (sha256: def456...)

# Resolution strategies:
mydocument.pdf          # First one
mydocument_def456.pdf   # Second one with hash suffix
```
Sync Consistency
- Export on pin: File appears in export directory
- Export on unpin: File removed from export directory
- Handle failures: What if export fails but pin succeeds?
Kubernetes Considerations
- FUSE requires privileged containers or device plugins
- Export directory should use PersistentVolume
- Consider sharing exports with other pods (ReadOnlyMany PVC)
Security
- Read-only exports prevent modification
- Symlinks prevent accidental deletion of IPFS content
- Export directory permissions (who can read/write?)
Example Use Case: ML Training Pipeline
Without this feature:
```bash
# Manual steps before training
ipfs get bafk... -o /tmp/dataset/image1.jpg
ipfs get bafk... -o /tmp/dataset/image2.jpg
# ... repeat for thousands of files
python train.py --data-dir /tmp/dataset
```
With FUSE + symlinks:
```bash
# Files automatically available
ls /data/exports/ml-training/
# image1.jpg → /ipfs/bafk...
# image2.jpg → /ipfs/bafk...
python train.py --data-dir /data/exports/ml-training
```
With export directory:
```bash
# Files automatically exported as regular files
ls /data/exports/ml-training/
# image1.jpg (real file)
# image2.jpg (real file)
python train.py --data-dir /data/exports/ml-training
```
Alternatives Considered
1. S3-compatible API
- Expose IPFS via S3-compatible API (like MinIO gateway)
- Tools that support S3 could access files
- Complexity: Need S3 gateway implementation
2. WebDAV
- Mount IPFS via WebDAV
- Cross-platform filesystem access
- Complexity: Need WebDAV server implementation
3. NFS Export
- Export IPFS via NFS
- Good for multi-node access
- Complexity: Need NFS server setup
Open Questions
- Default behavior: Should exports be opt-in or automatic?
- Directory structure: Flat or organized (by tag, date, type)?
- Cleanup policy: Keep exports forever or expire based on last access?
- Multi-user: How to handle exports for different users/tenants?
- Large files: Should very large files (>1GB) be excluded from auto-export?
- Performance: What's the performance impact of FUSE vs direct files?
Success Metrics
- ML training pipeline can access dataset files without manual export
- Batch processing jobs can run against pinned files
- Reduced friction for data scientists and engineers
- No significant performance degradation for normal PinShare operations
Related Issues
- #TBD: Tag-based file organization
- #TBD: File metadata search and filtering
- #TBD: Kubernetes persistent storage improvements
Priority: Medium-High (blocks ML/data processing use cases)
Effort: Medium (Phase 1: FUSE+symlinks), High (Phase 2: Full export)
Labels: enhancement, data-access, ml-workflows, ipfs