Skip to content

Code Review Observations & Questions from a Developer Exploring the Project #249

Description

@horsleyb

Overview

I spent time reading through the core source — prompt_compressor.py (~2,455 lines), utils.py, and the examples — as someone integrating LLMLingua into a RAG pipeline. Overall this is impressively well-documented research code. Here are observations, questions, and a few concrete suggestions.


Strengths Worth Calling Out

  • Docstrings are thorough. The parameter-level documentation in compress_prompt and structured_compress_prompt is genuinely useful — the math definition of compression rate and the caveats about tokenizer mismatch are the kind of context users actually need.
  • The __call__ delegation to compress_prompt is a nice UX touch — lets users treat the compressor as a callable without remembering method names.
  • rank_method extensibility is well-designed; adding a new ranking backend (BM25, Cohere, sentence-BERT, etc.) is a clean one-function addition inside get_rank_results.

Questions & Observations

1. Mutable default arguments in function signatures

Several public methods use mutable defaults:

def compress_prompt(self, ..., force_tokens: List[str] = [], ...):
def compress_prompt_llmlingua2(self, ..., force_context_ids: List[int] = [], force_tokens: List[str] = [], ...):

Python shares the same list object across all calls when the default is a mutable literal. This is a classic footgun — if any call path appends to force_tokens, subsequent calls with the default will see the accumulated state. The standard fix is force_tokens: List[str] = None with an early if force_tokens is None: force_tokens = [] guard. Given the library's public API surface, this seems worth hardening.

2. trust_remote_code=True on by default

In load_model:

trust_remote_code = model_config.get("trust_remote_code", True)

The default is True, meaning any model loaded from HuggingFace Hub will execute arbitrary remote code without the user opting in. For a library with a broad user base (LangChain, LlamaIndex integrations), this seems like a surprising default. Most downstream frameworks default this to False and require explicit opt-in. Would a change to default False break any of the recommended models, or is it only needed for specific quantized variants?

3. TokenClfDataset hardcodes model name string matching

if "bert-base-multilingual-cased" in model_name:
    ...
elif "xlm-roberta-large" in model_name:
    ...
else:
    raise NotImplementedError()

The same string-matching pattern appears in is_begin_of_new_word and get_pure_token in utils.py. This means any new model (even a fine-tune like my-org/xlm-roberta-large-finetuned) may silently route to the wrong branch or raise. A model family enum or a config-driven approach would make adding new model families less error-prone and avoid the current reliance on substring matching.

4. saving is hardcoded to GPT-4 pricing

saving = (origin_tokens - compressed_tokens) * 0.06 / 1000

The $0.06/1K tokens figure (GPT-4 input pricing circa 2023) is baked into the return value of both compress_prompt and compress_prompt_llmlingua2. GPT-4o pricing is now ~10x cheaper, and users routing to other models will see a misleading savings estimate. Options: accept an optional cost_per_token parameter, document the assumption clearly in the return value docstring, or drop this field in favor of letting callers compute savings from the raw token counts.

5. process_sequence_data has a typo — dict values use [ brackets

elif value_type == "dict" or value_type == "dictionary":
    return (
        "<llmlingua, compress=False>"
        + f'"{k}": {process_sequence_data(rate, "[", "]", v, is_dict=True)}'
    )

A JSON dict should be wrapped in {/}, not [/]. This may be intentional for the internal markup format, but if so it would benefit from a comment explaining why, since it looks like a bug to a reader unfamiliar with the internal representation.

6. get_rank_results nests all ranking functions as inner closures

The twelve get_distance_* functions are all defined as closures inside get_rank_results and re-created on every call. This isn't a correctness issue, but it makes the function ~200 lines long and prevents unit-testing individual ranking methods in isolation. Extracting them as private methods or a strategy registry would improve testability and readability.


Minor Suggestions

  • The context_budget parameter accepts a string like "+100" but this is not validated — a malformed string would produce a silent arithmetic error downstream. A brief validation or an explicit int | str union type with parsing docs would help.
  • seed_everything(42) is called unconditionally in init_llmlingua2, which globally sets torch, numpy, and random seeds. This is fine for reproducibility in research but is a side-effect that library users may not expect. Worth noting in the docstring.

Summary

The library is well-written research code with production integration already in LangChain and LlamaIndex. The main areas I'd prioritize for a stability pass are the mutable default argument issue (correctness risk), the trust_remote_code default (security), and the hardcoded pricing constant (staleness). The string-matching model dispatch is a longer-term maintainability concern.

Thanks for the work on this — the LongLLMLingua context-reordering approach in particular is elegant.

— A developer exploring LLMLingua for RAG cost optimization

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions