Query rewriting phase 1 #75

Open
kelockhart wants to merge 4 commits into adsabs:master from kelockhart:retreat2603

Member

@kelockhart kelockhart commented Mar 4, 2026

Implemented a new citation-style query rewrite strategy for unfielded q searches to improve relevance for ADS/SciX input patterns.

What changed

  • Added a dedicated query rewriter:
    solr/query_rewrite.py
  • Integrated rewrite into search request preprocessing:
    solr/views.py
  • Added a feature flag (enabled by default):
    config.py
    SOLR_SERVICE_ENABLE_CITATION_STYLE_REWRITE = True
  • Added unit and integration tests:
    solr/tests/unittests/test_query_rewrite.py
    solr/tests/unittests/test_solr.py

Supported rewrite patterns

  • Lastname YYYY and Lastname YYYYa/b
    (first_author:"lastname" OR author:"lastname") year:YYYY
  • Lastname1 & Lastname2 YYYY and suffix year variants
    first_author:"lastname1" author:"lastname2" year:YYYY
  • Lastname1 Lastname2 & Lastname3 YYYY (comma optional before second name)
    ((first_author:"lastname1" author:("lastname2" "lastname3")) OR (first_author:"lastname1 lastname2" author:"lastname3")) year:YYYY
    (supports both three-author interpretation and double-barreled first-author interpretation)
  • Lastname et al YYYY (including et al.)
    first_author:"lastname" author_count:[2 TO 10000] year:YYYY
  • Lastname1 Lastname2 et al YYYY
    ((first_author:"lastname1 lastname2" author_count:[2 TO 10000]) OR (first_author:"lastname1" author:"lastname2" author_count:[3 TO 10000])) year:YYYY
  • Name1 Name2 YYYY (for both Firstname Lastname and Lastname Lastname)
    (author:"name1 name2" OR author:("name1" "name2")) year:YYYY
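
To make the mapping above concrete, here is a minimal sketch covering three of the listed patterns (single author, two authors joined by `&`, and `et al`). It is not the code in solr/query_rewrite.py: the function name `rewrite_citation`, the regexes, and the lowercasing of names are assumptions made for illustration only.

```python
import re

# Hypothetical name token: Unicode letters with internal apostrophes or hyphens.
NAME = r"[^\W\d_]+(?:['\u2019-][^\W\d_]+)*"
# Four-digit year with an optional trailing suffix letter (2000a -> 2000).
YEAR = r"(?P<year>\d{4})[a-z]?"

SINGLE = re.compile(rf"^(?P<a>{NAME})\s+{YEAR}$")
PAIR = re.compile(rf"^(?P<a>{NAME})\s*&\s*(?P<b>{NAME})\s+{YEAR}$")
ET_AL = re.compile(rf"^(?P<a>{NAME})\s+et\s+al\.?\s+{YEAR}$", re.IGNORECASE)


def rewrite_citation(q):
    """Return a fielded query for a few citation shapes, or None if none match.

    Illustrative only: lowercasing the names is an assumption, not confirmed
    behavior of the PR.
    """
    q = " ".join(q.split())  # collapse runs of whitespace

    m = ET_AL.match(q)
    if m:
        a, year = m["a"].lower(), m["year"]
        return f'first_author:"{a}" author_count:[2 TO 10000] year:{year}'

    m = PAIR.match(q)
    if m:
        a, b, year = m["a"].lower(), m["b"].lower(), m["year"]
        return f'first_author:"{a}" author:"{b}" year:{year}'

    m = SINGLE.match(q)
    if m:
        a, year = m["a"].lower(), m["year"]
        return f'(first_author:"{a}" OR author:"{a}") year:{year}'

    return None  # leave anything else untouched
```

Under these assumptions, `rewrite_citation("Kurtz et al. 2000a")` yields the `et al` form with `year:2000`, and an already-fielded query such as `author:"kurtz" year:2000` returns `None` and is left alone.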

Additional behavior details

  • Trailing year suffix letters are normalized (2000a -> year:2000).
  • Query normalization handles extra whitespace and Unicode dash variants.
  • Name parsing is Unicode-aware (not ASCII-only), so non-ASCII names are matched and rewritten correctly (for example: Nuñoz, García).
  • Name token regex supports internal apostrophes and hyphens (for example: O'Neil, Blanco-Cuaresma).
  • Rewriting is conservative: explicit fielded/advanced Solr syntax is not rewritten.
  • Works with both string and list-style q values in request preprocessing.
  • Feature can be disabled via config flag.
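
The normalization and "conservative" checks described above could look roughly like the following. This is a sketch under stated assumptions: the helper names (`normalize_query`, `is_plain_unfielded`, `apply_to_q`), the set of dash characters, and the definition of "advanced syntax" are all hypothetical and may differ from what the PR actually does.

```python
import re

# Unicode dash variants mapped to an ASCII hyphen (illustrative, not exhaustive).
DASHES = dict.fromkeys(map(ord, "\u2010\u2011\u2012\u2013\u2014\u2212"), "-")

# Assumed signs of explicit Solr syntax: a field prefix, double quotes,
# grouping/range brackets, wildcards, boosts, or boolean operators.
ADVANCED = re.compile(r'\w+:|["()\[\]{}*?^~]|\b(?:AND|OR|NOT)\b')


def normalize_query(q):
    """Replace dash variants with '-' and collapse extra whitespace."""
    return " ".join(q.translate(DASHES).split())


def is_plain_unfielded(q):
    """True when the query shows no sign of fielded or advanced syntax."""
    return ADVANCED.search(q) is None


def apply_to_q(q, fn):
    """Apply fn to a string q, or to each item of a list-style q."""
    if isinstance(q, (list, tuple)):
        return [fn(item) for item in q]
    return fn(q)
```

A rewriter built this way would only attempt a rewrite when `is_plain_unfielded(normalize_query(q))` holds, which keeps queries like `author:"Kurtz" year:2000` untouched.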

Validation

Full unit test suite run:

  • pytest -q solr/tests/unittests
  • Result: 30 passed, 0 failed.

@kelockhart
Member Author

Added a reference-string resolution branch to the search preprocessing flow, adapted from the adssh PR approach, while preserving existing unfielded citation rewriting.

What was added

  • Bibliographic reference detection heuristic
    • New helper in solr/query_rewrite.py: is_likely_bibliographic_reference(...)
    • Detects likely full reference strings (for example journal/volume/page/year style), and excludes already-fielded queries.
    • Keeps short inline citation queries (Kurtz 2000, etc.) on the existing citation-rewrite path.
  • Reference resolution branch in request preprocessing
    • In solr/views.py preprocess_request, for q on search requests:
      • Try existing citation-style rewrite first.
      • If no citation rewrite applies and the query looks like a full reference, call the reference resolver endpoint.
      • If the resolver returns exactly one high-confidence match, rewrite the query to bibcode:<resolved_bibcode>.
      • If resolution fails, confidence is low, or there is no unique match, fall back to the original query behavior.
  • Resolver configuration
    • Added to config.py:
      • SOLR_SERVICE_ENABLE_REFERENCE_RESOLUTION = True
      • REFERENCE_RESOLVER_ENDPOINT = API_URL + '/reference/text'
      • REFERENCE_RESOLVER_MIN_SCORE = 0.8
      • REFERENCE_RESOLVER_TIMEOUT = 5
  • Auth/header forwarding for resolver calls
    • Resolver requests forward Authorization and X-Forwarded-Authorization when present.
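
The decision order described above can be sketched as follows. Everything here is illustrative: the heuristic is a crude stand-in for `is_likely_bibliographic_reference`, `choose_query` and `forwarded_headers` are hypothetical names, and the resolver is passed in as a callable that is assumed to return a list of dicts with `bibcode` and `score` keys. The real response shape of the resolver endpoint is not shown in this PR description, and "exactly one high-confidence match" is read here as exactly one match at or above the minimum score.

```python
import re

MIN_SCORE = 0.8  # mirrors REFERENCE_RESOLVER_MIN_SCORE from the PR description


def is_likely_bibliographic_reference(q):
    """Crude stand-in heuristic, not the PR's implementation."""
    if re.search(r"\w+:", q):  # already fielded, so not a raw reference
        return False
    has_year = re.search(r"\b(?:1[5-9]|20)\d{2}\b", q) is not None
    numbers = re.findall(r"\b\d+\b", q)  # year, volume, page, ...
    return has_year and len(numbers) >= 3 and len(q.split()) >= 5


def forwarded_headers(incoming):
    """Copy only the auth headers named in the PR, when present."""
    names = ("Authorization", "X-Forwarded-Authorization")
    return {name: incoming[name] for name in names if name in incoming}


def choose_query(q, rewrite, resolve):
    """Pick the query to send on, following the order described above.

    rewrite(q) returns a rewritten query or None.
    resolve(q) returns a list of {"bibcode": ..., "score": ...} dicts
    (an assumed shape) and may raise on network or service errors.
    """
    rewritten = rewrite(q)
    if rewritten is not None:  # citation-style rewrite wins
        return rewritten
    if not is_likely_bibliographic_reference(q):
        return q
    try:
        matches = resolve(q)
    except Exception:  # resolver failure: keep the original query
        return q
    confident = [m for m in matches if m.get("score", 0.0) >= MIN_SCORE]
    if len(confident) == 1 and confident[0].get("bibcode"):
        return "bibcode:" + confident[0]["bibcode"]
    return q  # low confidence or no unique match
```

Injecting `rewrite` and `resolve` as callables keeps the branch logic testable without a live resolver, which is similar in spirit to the success and no-match cases listed in the tests.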

Tests added

  • solr/tests/unittests/test_query_rewrite.py
    • Positive and negative tests for is_likely_bibliographic_reference.
  • solr/tests/unittests/test_solr.py
    • Resolver success case rewrites to bibcode:<...>.
    • Resolver no-match case preserves original query.

Validation

Ran full unittest suite:

  • pytest -q solr/tests/unittests
  • 35 passed, 0 failed.
