Feature Request: Filesystem Access to Pinned Files for Data Processing Workflows
Problem Statement
Currently, files pinned in PinShare are only accessible via:
- IPFS Gateway: http://gateway:8080/ipfs/<CID>
- IPFS CLI: ipfs get <CID>
- IPFS API calls
This creates friction for data processing workflows (ML training, batch processing, analysis tools) that expect direct filesystem access. Many tools and frameworks cannot easily integrate with IPFS APIs and need traditional file paths.
Use Cases
- Machine Learning Training
  - ML frameworks (TensorFlow, PyTorch) expect dataset files at paths like /data/train/image001.jpg
  - Current workaround: manually export files before training
- Batch Data Processing
  - ETL pipelines, video processing, document conversion
  - Tools expect input directories with predictable file structures
- Analysis Tools
  - Scientific computing tools, data analysis frameworks
  - Often require file paths, not URLs or CID-based access
- Development/Testing
  - Easier to inspect and work with files during development
  - Standard filesystem tools (ls, find, grep) work normally
Current Behavior
Files are stored in IPFS's content-addressed block storage (/data/ipfs/blocks/):
- Not accessible as regular files
- Blocks are named by hash, not original filename
- Requires IPFS tooling to retrieve
Proposed Solutions
Option 1: IPFS FUSE Mount (Recommended for MVP)
Implementation:
- Use IPFS's built-in FUSE support to mount /ipfs as a filesystem
- Files accessible at /ipfs/<CID>
- Optionally create symlinks with original filenames → CID paths
Example:
```bash
# Mount IPFS
ipfs mount

# Access files
cat /ipfs/bafkreib566otjk54vgjqrz44xfcgdqmjgwbgatligkned7kl5qmilzvnwq

# Or with symlinks:
# /data/exports/mydocument.pdf → /ipfs/bafkreib566...
```
Pros:
- ✅ No code changes to PinShare backend
- ✅ No storage duplication
- ✅ Read-only by default (safety)
- ✅ Automatic - all pinned files accessible
Cons:
- ❌ Requires FUSE support in container/OS
- ❌ Files accessed by CID, need symlinks for friendly names
- ❌ Can be slower than local files for large datasets
- ❌ Requires privileged container permissions
Configuration Changes:
```yaml
# k8s/base/ipfs/statefulset.yaml
securityContext:
  privileged: true  # Required for FUSE
  capabilities:
    add:
      - SYS_ADMIN
volumeMounts:
  - name: ipfs-mount
    mountPath: /ipfs
    mountPropagation: Bidirectional
```
Option 2: Export Directory with Auto-Sync
Implementation:
- Add background job to export pinned files to /data/exports/
- Maintain mapping of SHA256 → filesystem path
- Use original filenames from metadata
- Handle filename conflicts (append hash suffix)
Example:
```
/data/exports/
├── mydocument.pdf
├── photo.jpg
├── report_a3f2b9.pdf   # Conflict resolution
└── dataset/
    ├── train/
    └── test/
```
Pros:
- ✅ Familiar file paths with original names
- ✅ Fast access (no IPFS overhead)
- ✅ Works with any tool/framework
- ✅ Can organize into subdirectories (by tag, date, etc.)
Cons:
- ❌ Doubles storage usage
- ❌ Needs sync logic (export on pin, delete on unpin)
- ❌ Filename conflict handling complexity
- ❌ Potential for desync between IPFS and exports
API Endpoints:
```
POST   /files/export/{sha256}   # Export single file
POST   /files/export-all        # Export all pinned files
DELETE /files/export/{sha256}   # Remove exported file
GET    /files/export-status     # Get export directory status
```
Configuration:
```go
type ExportConfig struct {
    Enabled            bool
    ExportDir          string
    AutoExportOnPin    bool
    OrganizeByTag      bool
    ConflictResolution string // "append-hash", "error", "overwrite"
}
```
Option 3: Selective Export with Tags
Implementation:
- Only export files with specific tags (e.g., export:ml-dataset)
- Export to tag-specific directories
- Manual or automatic export based on tags
Example:
```
/data/exports/
├── ml-training/           # Tag: export:ml-training
│   ├── images/
│   └── labels/
└── analysis-datasets/     # Tag: export:analysis
    └── data.csv
```
Pros:
- ✅ Storage efficient (only export what's needed)
- ✅ Organized by use case
- ✅ Clear intent via tags
Cons:
- ❌ Requires manual tagging workflow
- ❌ Still needs export sync logic
Recommended Approach
Phase 1: IPFS FUSE Mount + Symlinks
- Enable IPFS FUSE mount in Kubernetes deployment
- Add API endpoint to create symlinks: /data/exports/(unknown) → /ipfs/{CID}
- Symlinks created automatically on pin, removed on unpin
- No storage duplication, minimal code changes
Phase 2: Optional Full Export
- Add export directory feature for users who need:
  - Faster access (no IPFS overhead)
  - Offline access
  - Tools that don't support symlinks
Implementation Considerations
Storage Space
- FUSE mount: No additional storage
- Export directory: 2x storage (IPFS blocks + exported files)
- Consider storage limits and cleanup policies
Filename Conflicts
Multiple files with same name but different content (different SHA256):
```
mydocument.pdf (sha256: abc123...)
mydocument.pdf (sha256: def456...)

# Resolution strategies:
mydocument.pdf          # First one
mydocument_def456.pdf   # Second one with hash suffix
```
Sync Consistency
- Export on pin: File appears in export directory
- Export on unpin: File removed from export directory
- Handle failures: What if export fails but pin succeeds?
Kubernetes Considerations
- FUSE requires privileged containers or device plugins
- Export directory should use PersistentVolume
- Consider sharing exports with other pods (ReadOnlyMany PVC)
Security
- Read-only exports prevent modification
- Symlinks prevent accidental deletion of IPFS content
- Export directory permissions (who can read/write?)
Example Use Case: ML Training Pipeline
Without this feature:
```bash
# Manual steps before training
ipfs get bafk... -o /tmp/dataset/image1.jpg
ipfs get bafk... -o /tmp/dataset/image2.jpg
# ... repeat for thousands of files
python train.py --data-dir /tmp/dataset
```
With FUSE + symlinks:
```bash
# Files automatically available
ls /data/exports/ml-training/
# image1.jpg → /ipfs/bafk...
# image2.jpg → /ipfs/bafk...
python train.py --data-dir /data/exports/ml-training
```
With export directory:
```bash
# Files automatically exported as regular files
ls /data/exports/ml-training/
# image1.jpg (real file)
# image2.jpg (real file)
python train.py --data-dir /data/exports/ml-training
```
Alternatives Considered
1. S3-compatible API
- Expose IPFS via S3-compatible API (like MinIO gateway)
- Tools that support S3 could access files
- Complexity: Need S3 gateway implementation
2. WebDAV
- Mount IPFS via WebDAV
- Cross-platform filesystem access
- Complexity: Need WebDAV server implementation
3. NFS Export
- Export IPFS via NFS
- Good for multi-node access
- Complexity: Need NFS server setup
Open Questions
- Default behavior: Should exports be opt-in or automatic?
- Directory structure: Flat or organized (by tag, date, type)?
- Cleanup policy: Keep exports forever or expire based on last access?
- Multi-user: How to handle exports for different users/tenants?
- Large files: Should very large files (>1GB) be excluded from auto-export?
- Performance: What's the performance impact of FUSE vs direct files?
Success Metrics
- ML training pipeline can access dataset files without manual export
- Batch processing jobs can run against pinned files
- Reduced friction for data scientists and engineers
- No significant performance degradation for normal PinShare operations
Related Issues
- #TBD: Tag-based file organization
- #TBD: File metadata search and filtering
- #TBD: Kubernetes persistent storage improvements
Priority: Medium-High (blocks ML/data processing use cases)
Effort: Medium (Phase 1: FUSE+symlinks), High (Phase 2: Full export)
Labels: enhancement, data-access, ml-workflows, ipfs