Code Review Observations & Questions from a Developer Exploring the Project

## Overview

I spent time reading through the core source — `prompt_compressor.py` (~2,455 lines), `utils.py`, and the examples — as someone integrating LLMLingua into a RAG pipeline. Overall this is impressively well-documented research code. Here are observations, questions, and a few concrete suggestions.

---

## Strengths Worth Calling Out

- **Docstrings are thorough.** The parameter-level documentation in `compress_prompt` and `structured_compress_prompt` is genuinely useful — the math definition of compression rate and the caveats about tokenizer mismatch are the kind of context users actually need.
- **The `__call__` delegation to `compress_prompt` is a nice UX touch** — lets users treat the compressor as a callable without remembering method names.
- **`rank_method` extensibility** is well-designed; adding a new ranking backend (BM25, Cohere, sentence-BERT, etc.) is a clean one-function addition inside `get_rank_results`.

---

## Questions & Observations

### 1. Mutable default arguments in function signatures

Several public methods use mutable defaults:

```python
def compress_prompt(self, ..., force_tokens: List[str] = [], ...):
def compress_prompt_llmlingua2(self, ..., force_context_ids: List[int] = [], force_tokens: List[str] = [], ...):
```

Python shares the same list object across all calls when the default is a mutable literal. This is a classic footgun — if any call path appends to `force_tokens`, subsequent calls with the default will see the accumulated state. The standard fix is `force_tokens: List[str] = None` with an early `if force_tokens is None: force_tokens = []` guard. Given the library's public API surface, this seems worth hardening.

### 2. `trust_remote_code=True` on by default

In `load_model`:

```python
trust_remote_code = model_config.get("trust_remote_code", True)
```

The default is `True`, meaning any model loaded from HuggingFace Hub will execute arbitrary remote code without the user opting in. For a library with a broad user base (LangChain, LlamaIndex integrations), this seems like a surprising default. Most downstream frameworks default this to `False` and require explicit opt-in. Would a change to default `False` break any of the recommended models, or is it only needed for specific quantized variants?

### 3. `TokenClfDataset` hardcodes model name string matching

```python
if "bert-base-multilingual-cased" in model_name:
    ...
elif "xlm-roberta-large" in model_name:
    ...
else:
    raise NotImplementedError()
```

The same string-matching pattern appears in `is_begin_of_new_word` and `get_pure_token` in `utils.py`. This means any new model (even a fine-tune like `my-org/xlm-roberta-large-finetuned`) may silently route to the wrong branch or raise. A model family enum or a config-driven approach would make adding new model families less error-prone and avoid the current reliance on substring matching.

### 4. `saving` is hardcoded to GPT-4 pricing

```python
saving = (origin_tokens - compressed_tokens) * 0.06 / 1000
```

The `$0.06/1K tokens` figure (GPT-4 input pricing circa 2023) is baked into the return value of both `compress_prompt` and `compress_prompt_llmlingua2`. GPT-4o pricing is now ~10x cheaper, and users routing to other models will see a misleading savings estimate. Options: accept an optional `cost_per_token` parameter, document the assumption clearly in the return value docstring, or drop this field in favor of letting callers compute savings from the raw token counts.

### 5. `process_sequence_data` has a typo — dict values use `[` brackets

```python
elif value_type == "dict" or value_type == "dictionary":
    return (
        "<llmlingua, compress=False>"
        + f'"{k}": {process_sequence_data(rate, "[", "]", v, is_dict=True)}'
    )
```

A JSON dict should be wrapped in `{`/`}`, not `[`/`]`. This may be intentional for the internal markup format, but if so it would benefit from a comment explaining why, since it looks like a bug to a reader unfamiliar with the internal representation.

### 6. `get_rank_results` nests all ranking functions as inner closures

The twelve `get_distance_*` functions are all defined as closures inside `get_rank_results` and re-created on every call. This isn't a correctness issue, but it makes the function ~200 lines long and prevents unit-testing individual ranking methods in isolation. Extracting them as private methods or a strategy registry would improve testability and readability.

---

## Minor Suggestions

- The `context_budget` parameter accepts a string like `"+100"` but this is not validated — a malformed string would produce a silent arithmetic error downstream. A brief validation or an explicit `int | str` union type with parsing docs would help.
- `seed_everything(42)` is called unconditionally in `init_llmlingua2`, which globally sets `torch`, `numpy`, and `random` seeds. This is fine for reproducibility in research but is a side-effect that library users may not expect. Worth noting in the docstring.

---

## Summary

The library is well-written research code with production integration already in LangChain and LlamaIndex. The main areas I'd prioritize for a stability pass are the mutable default argument issue (correctness risk), the `trust_remote_code` default (security), and the hardcoded pricing constant (staleness). The string-matching model dispatch is a longer-term maintainability concern.

Thanks for the work on this — the LongLLMLingua context-reordering approach in particular is elegant.

— A developer exploring LLMLingua for RAG cost optimization

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Code Review Observations & Questions from a Developer Exploring the Project #249

Overview

Strengths Worth Calling Out

Questions & Observations

1. Mutable default arguments in function signatures

2. `trust_remote_code=True` on by default

3. `TokenClfDataset` hardcodes model name string matching

4. `saving` is hardcoded to GPT-4 pricing

5. `process_sequence_data` has a typo — dict values use `[` brackets

6. `get_rank_results` nests all ranking functions as inner closures

Minor Suggestions

Summary

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Code Review Observations & Questions from a Developer Exploring the Project #249

Description

Overview

Strengths Worth Calling Out

Questions & Observations

1. Mutable default arguments in function signatures

2. trust_remote_code=True on by default

3. TokenClfDataset hardcodes model name string matching

4. saving is hardcoded to GPT-4 pricing

5. process_sequence_data has a typo — dict values use [ brackets

6. get_rank_results nests all ranking functions as inner closures

Minor Suggestions

Summary

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

2. `trust_remote_code=True` on by default

3. `TokenClfDataset` hardcodes model name string matching

4. `saving` is hardcoded to GPT-4 pricing

5. `process_sequence_data` has a typo — dict values use `[` brackets

6. `get_rank_results` nests all ranking functions as inner closures