Merged
6 changes: 6 additions & 0 deletions .github/workflows/ci.yml
```diff
@@ -18,6 +18,8 @@ jobs:
 
     steps:
       - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0 # Fetch full history for setuptools-scm
 
       - name: Set up Python ${{ matrix.python-version }}
         uses: actions/setup-python@v5
@@ -58,6 +60,8 @@ jobs:
 
     steps:
       - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0 # Fetch full history for setuptools-scm
 
       - name: Install build dependencies
         run: |
@@ -86,6 +90,8 @@ jobs:
 
     steps:
       - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0 # Fetch full history for setuptools-scm
 
       - name: Set up Python
         uses: actions/setup-python@v5
```
8 changes: 8 additions & 0 deletions .github/workflows/release.yml
```diff
@@ -22,6 +22,8 @@ jobs:
 
     steps:
       - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0 # Fetch full history for setuptools-scm
 
       - name: Install build dependencies
         run: |
@@ -60,6 +62,8 @@ jobs:
 
     steps:
       - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0 # Fetch full history for setuptools-scm
 
       - name: Create bin directory
         run: mkdir -p dalla_data_processing/deduplication/bin
@@ -111,6 +115,8 @@ jobs:
 
     steps:
       - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0 # Fetch full history for setuptools-scm
 
       - name: Set up Python
         uses: actions/setup-python@v5
@@ -167,6 +173,8 @@ jobs:
 
     steps:
       - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0 # Fetch full history for setuptools-scm
 
       - uses: actions/download-artifact@v4
         with:
```
3 changes: 3 additions & 0 deletions .gitignore
```diff
@@ -20,6 +20,9 @@ wheels/
 .installed.cfg
 *.egg
 
+# setuptools-scm version file
+dalla_data_processing/_version.py
+
 # Virtual environments
 venv/
 env/
```
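The `fetch-depth: 0` checkouts and the new `.gitignore` entry above both serve setuptools-scm, which derives the package version from git tags and history and writes it to `dalla_data_processing/_version.py` at build time. The project's actual `pyproject.toml` is not part of this diff, so the following is only an illustrative sketch of a configuration consistent with these changes:

```toml
[build-system]
requires = ["setuptools>=64", "setuptools-scm>=8"]
build-backend = "setuptools.build_meta"

[tool.setuptools_scm]
# Generated at build/install time; ignored by git per the .gitignore entry above
version_file = "dalla_data_processing/_version.py"
```

Note that with setuptools-scm 7 and earlier the key is `write_to` rather than `version_file`.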
59 changes: 50 additions & 9 deletions README.md
@@ -10,49 +10,90 @@ A comprehensive Arabic data processing pipeline with deduplication, stemming, qu

## Installation

### Quick Start (All Features)

For most users, install with all features enabled:

<b>Using uv</b>

```bash
# Install the package
uv pip install dalla-data-processing
uv pip install "dalla-data-processing[all]"
```


<b>Using pip</b>

```bash
# Install the package
pip install "dalla-data-processing[all]"
```

### Modular Installation (Advanced)

Install only the components you need to keep dependencies minimal:

```bash
# Base installation (no processing features, only core dependencies)
pip install dalla-data-processing

# Install specific features
pip install "dalla-data-processing[dedup]" # Deduplication only
pip install "dalla-data-processing[stem]" # Stemming only
pip install "dalla-data-processing[quality]" # Quality checking only
pip install "dalla-data-processing[readability]" # Readability scoring only
pip install "dalla-data-processing[pack]" # Dataset packing only

# Combine multiple features
pip install "dalla-data-processing[dedup,stem,quality]"
```

### Development Installation

<b>From Source</b>
<b>From Source (with uv - recommended)</b>

```bash
git clone https://github.com/U4RASD/dalla-data-processing.git
cd dalla-data-processing

# Using uv
uv pip install -e .
# Install all features and dev dependencies
uv sync --all-extras

# Or using pip
pip install -e .
# Or install with specific extras only
uv sync --extra dedup --extra stem
```

<b>From Source (with pip)</b>

```bash
git clone https://github.com/U4RASD/dalla-data-processing.git
cd dalla-data-processing

# Install with all features for development
pip install -e ".[all,dev]"
```

## Components

> **Note:** Each component requires its corresponding extra to be installed. Install with `[all]` to enable all features, or see [Modular Installation](#modular-installation-advanced) to install only what you need.

### 1. [Deduplication](dalla_data_processing/deduplication/README.md)
Detect and remove duplicate or near-duplicate documents from your datasets using the Onion algorithm.
- **Requires:** `[dedup]` extra

### 2. [Stemming](dalla_data_processing/stemming/README.md)
Apply morphological analysis and stemming using CAMeL Tools.
- **Requires:** `[stem]` extra

### 3. [Quality Checking](dalla_data_processing/quality/README.md)
Check text quality using morphological analysis to detect errors and foreign words.
- **Requires:** `[quality]` extra

### 4. [Readability Scoring](dalla_data_processing/readability/README.md)
Calculate readability scores using the Flesch Reading Ease and Osman methods.
Also ranks documents according to both scores.
- **Requires:** `[readability]` extra

### 5. [Dataset Packing](dalla_data_processing/packing/README.md)
Pack and prepare datasets for training.
- **Requires:** `[pack]` extra

## Links

60 changes: 39 additions & 21 deletions dalla_data_processing/__init__.py
```diff
@@ -8,31 +8,49 @@
 - Readability scoring
 """
 
-__version__ = "0.0.1"
-
 try:
-    from dalla_data_processing.core.dataset import DatasetManager
-
-    _has_dataset = True
+    from dalla_data_processing._version import version as __version__
 except ImportError:
-    _has_dataset = False
-    DatasetManager = None
+    # Fallback for development without installation
+    try:
+        from importlib.metadata import PackageNotFoundError, version
 
-try:
-    from dalla_data_processing.utils.tokenize import simple_word_tokenize
+        __version__ = version("dalla-data-processing")
+    except PackageNotFoundError:
+        __version__ = "0.0.0+unknown"
 
-    _has_tokenize = True
-except ImportError:
-    _has_tokenize = False
-    simple_word_tokenize = None
 
-try:
-    from dalla_data_processing.stemming import stem, stem_dataset
+# Lazy imports - only import when actually used, not at package load time
+def __getattr__(name):
+    """Lazy load heavy modules only when accessed."""
+    if name == "DatasetManager":
+        from dalla_data_processing.core.dataset import DatasetManager
+
+        return DatasetManager
+    elif name == "simple_word_tokenize":
+        from dalla_data_processing.utils.tokenize import simple_word_tokenize
+
+        return simple_word_tokenize
+    elif name == "stem":
+        from dalla_data_processing.stemming import stem
+
+        return stem
+    elif name == "stem_dataset":
+        from dalla_data_processing.stemming import stem_dataset
+
+        return stem_dataset
+    elif name == "DatasetPacker":
+        from dalla_data_processing.packing import DatasetPacker
+
+        return DatasetPacker
+    raise AttributeError(f"module '{__name__}' has no attribute '{name}'")
 
-    _has_stemming = True
-except ImportError:
-    _has_stemming = False
-    stem = None
-    stem_dataset = None
 
-__all__ = ["DatasetManager", "simple_word_tokenize", "stem", "stem_dataset", "__version__"]
+__all__ = [
+    "DatasetManager",
+    "simple_word_tokenize",
+    "stem",
+    "stem_dataset",
+    "DatasetPacker",
+    "__version__",
+]
```
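The version-resolution chain introduced in `__init__.py` above (the setuptools-scm-generated `_version.py` first, then installed package metadata, then a sentinel) can be sketched as a standalone helper; the function name here is illustrative, not part of the PR:

```python
# Sketch of the version fallback: ask installed package metadata for the
# distribution's version, and fall back to a development sentinel when the
# package is not installed (e.g. running straight from a source checkout).
from importlib.metadata import PackageNotFoundError, version


def resolve_version(dist_name: str) -> str:
    """Return the installed version of dist_name, or a dev sentinel."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return "0.0.0+unknown"
```
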