LEKSARA

Transforming Text, Empowering Insights Instantly

Indonesian-language text preparation toolkit for production review pipelines: clean, mask, and normalize in a single pass.

The library ships linguistic resources, preset-driven orchestration, and modular helpers so data teams can audit raw text, remediate sensitive content, and standardize noisy reviews without rebuilding the pipeline from scratch.

Why Leksara

Capability	What you get	Key modules
CartBoard review intake	Dataset audits with PII flags, rating detection, and noise diagnostics ready for dashboards.	`leksara.frames.cartboard`
Composable cleaning utilities	Reusable HTML stripping, casing, stopword, emoji, punctuation, and numeric cleanup helpers.	`leksara.function`
PII masking & redaction	Regex-backed replacement for Indonesian phones, emails, addresses, and national IDs with configurable modes.	`leksara.pattern` plus `leksara/resources/regex_patterns/`
Review-focused normalization	Slang/acronym expansion, contraction repair, rating extraction, and elongation trimming for Bahasa Indonesia.	`leksara.functions.review`
ReviewChain orchestrator	`leksara(...)` wrapper and `ReviewChain` class for preset pipelines, benchmarking, and hybrid custom flows.	`leksara.core.chain`
Resource-driven customization	Drop in your own dictionaries and regex rules to adapt cleaners for new verticals.	`leksara/resources/`

Deep dives live in docs/features.md alongside API tables, dependencies, and ready-to-run notebooks.

Quickstart

Create and activate a virtual environment
```
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install leksara
```
Optional extras and troubleshooting tips are documented in docs/installation.md.

Clean a review column with the ecommerce preset

import pandas as pd
from leksara import leksara

df = pd.DataFrame(
    {
        "review_id": [101, 102],
        "review_text": [
            "<p>Barangnya mantul!!! Email saya user@mail.id, WA 0812-3456-7890</p>",
            "Kualitasnya ⭐⭐⭐⭐, pengiriman 4/5. Hubungi +62 812 8888 7777",
        ],
    }
)

df["clean_text"] = leksara(df["review_text"], preset="ecommerce_review")
print(df[["review_id", "clean_text"]])

review_id	clean_text
101	`barang mantap email [EMAIL] wa [PHONE_NUMBER]`
102	`kualitas 4.0 kirim 4.0 hubung [PHONE_NUMBER]`

Audit raw text with CartBoard

from leksara.frames.cartboard import get_flags, get_stats

flags = get_flags(df, text_column="review_text")
stats = get_stats(df, text_column="review_text")

print(flags[["review_id", "pii_flag", "rating_flag", "non_alphabetical_flag"]])
print(stats.iloc[0]["stats"])  # nested histogram of noise sources

Compose a tailored pipeline with ReviewChain

from leksara import ReviewChain
from leksara.function import case_normal, remove_punctuation, remove_stopwords
from leksara.pattern import replace_email, replace_phone

chain = ReviewChain.from_steps(
    patterns=[
        (replace_phone, {"mode": "replace"}),
        (replace_email, {"mode": "replace"}),
    ],
    functions=[case_normal, remove_stopwords, remove_punctuation],
)

cleaned, metrics = chain.transform(df["review_text"], benchmark=True)

metrics includes per-step timings so you can spot bottlenecks or confirm PII masks run before downstream cleaners.

How Leksara Fits Together

Layer	Purpose	Entry points
Pipelines	High-level orchestration that accepts raw sequences and returns cleaned text plus optional benchmarks.	`leksara(...)`, `ReviewChain`
Frames	Review-table utilities for bulk audits and dashboard-friendly stats.	`leksara.frames.cartboard`
Functions	Composable cleaning helpers mirrored from the implementation modules.	`leksara.function`
Patterns	Opt-in masking utilities with regex rules for PII.	`leksara.pattern`
Resources	Dictionaries, regex rules, and whitelists that drive domain knowledge.	`leksara/resources/`
Logging & Benchmarking	Optional hooks for throughput tuning and step-level visibility.	`leksara.core.logging`, `benchmark=True`

Architectural notes, data contracts, and extension points for each layer are captured in docs/features.md.

Documentation Map

Topic	When to read	Location
Installation & environment	Provisioning a workstation or CI agent	`docs/installation.md`
Feature deep dives	Behavioral details, configuration knobs, and dependencies	`docs/features.md`
Public API reference	Signatures, argument descriptions, return payloads	`docs/api.md`
Worked examples	Copy/paste recipes for notebooks or pipelines	`docs/examples.md`
Dependency matrix	Optional packages and enterprise alignment	`docs/dependencies.md`
Contributing guide	Environment, style, testing, documentation expectations	`docs/contributing.md`

Contributing & Support

Read docs/contributing.md before opening a pull request; it outlines environment setup, code style, testing, and documentation requirements.
File issues on GitHub with reproducible examples, presets used, optional dependencies, and OS details when reporting pipeline differences.
Commercial or large-scale users should wrap ReviewChain with automated smoke tests to detect upstream dictionary or regex changes early.

Leksara is licensed under the terms specified in LICENSE.

For more details, you can also visit our official documentation website.

Name		Name	Last commit message	Last commit date
Latest commit History 62 Commits
.github/workflows		.github/workflows
docs		docs
leksara		leksara
notebooks		notebooks
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LEKSARA

Table of Contents

Why Leksara

Quickstart

How Leksara Fits Together

Documentation Map

Contributing & Support

About

Uh oh!

Releases 4

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

LEKSARA

Table of Contents

Why Leksara

Quickstart

How Leksara Fits Together

Documentation Map

Contributing & Support

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages