Skip to content

Account for scrapy-poet Zyte API parameters when fingerprinting#281

Draft
AdrianAtZyte wants to merge 10 commits into
mainfrom
anchor-fingerprinting
Draft

Account for scrapy-poet Zyte API parameters when fingerprinting#281
AdrianAtZyte wants to merge 10 commits into
mainfrom
anchor-fingerprinting

Conversation

@AdrianAtZyte
Copy link
Copy Markdown
Contributor

Before this change, one of the 2 URLs in the example below would be skipped:

from scrapy import Request, Spider
from scrapy_poet import DummyResponse
from zyte_common_items import JobPostingNavigation

class TestSpider(Spider):
    name = "test"
    custom_settings = {
        "ZYTE_API_LOG_REQUESTS": True,
    }

    def start_requests(self):
        for slug in ("foo", "bar"):
            yield Request(
                f"https://toscrape.com/#/{slug}",
                meta={
                    "zyte_api_provider": {
                        "jobPostingNavigationOptions": {"extractFrom": "browserHtml"},
                    },
                },
            )

    async def parse(self, response: DummyResponse, item: JobPostingNavigation):
        yield item

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 96.87500% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.40%. Comparing base (a449fde) to head (c55d263).

Files with missing lines Patch % Lines
scrapy_zyte_api/providers.py 94.18% 2 Missing and 3 partials ⚠️
scrapy_zyte_api/_provider_fingerprint_cache.py 96.72% 1 Missing and 1 partial ⚠️
scrapy_zyte_api/_request_fingerprinter.py 99.08% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #281      +/-   ##
==========================================
+ Coverage   97.33%   97.40%   +0.07%     
==========================================
  Files          15       16       +1     
  Lines        2027     2200     +173     
  Branches      370      391      +21     
==========================================
+ Hits         1973     2143     +170     
- Misses         26       27       +1     
- Partials       28       30       +2     
Files with missing lines Coverage Δ
scrapy_zyte_api/_request_fingerprinter.py 98.89% <99.08%> (-0.02%) ⬇️
scrapy_zyte_api/_provider_fingerprint_cache.py 96.72% <96.72%> (ø)
scrapy_zyte_api/providers.py 94.44% <94.18%> (-0.38%) ⬇️

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@AdrianAtZyte AdrianAtZyte marked this pull request as ready for review April 23, 2026 12:25
@AdrianAtZyte AdrianAtZyte marked this pull request as draft April 23, 2026 12:26
@AdrianAtZyte AdrianAtZyte marked this pull request as ready for review April 27, 2026 15:10
@AdrianAtZyte AdrianAtZyte requested a review from wRAR April 27, 2026 15:11
@wRAR wRAR requested a review from Copilot April 28, 2026 11:26
fingerprint += serialized_page_params
return hashlib.sha1(fingerprint, usedforsecurity=False).digest()

def _get_provider_request_fingerprint(self, request: Request) -> bytes | None:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only question is whether this makes costly calculations for requests passed to it that are already made for them somewhere else

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates request fingerprinting to incorporate scrapy-poet/Zyte API provider parameters so that requests to the same URL with different provider options no longer share the same fingerprint (and therefore won’t be incorrectly deduplicated/skipped).

Changes:

  • Extend ScrapyZyteAPIRequestFingerprinter to compute a provider-aware fingerprint (via scrapy-poet’s dependency plan) and combine it with the “regular” Zyte API fingerprint when appropriate.
  • Refactor Zyte API provider meta construction into a reusable helper (_build_zyte_api_provider_meta) and add coverage around AnyResponse behavior when HttpResponse is already available.
  • Add/adjust tests around the new fingerprint combination logic and provider validation behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
scrapy_zyte_api/_request_fingerprinter.py Adds provider-aware fingerprinting and combines provider + regular fingerprints, including special handling for provider-only requests.
scrapy_zyte_api/providers.py Introduces _build_zyte_api_provider_meta helper and reuses it inside ZyteApiProvider.__call__.
tests/test_request_fingerprinter.py Adds unit tests for provider/regular fingerprint combination and provider-only behavior.
tests/test_providers.py Imports and tests _build_zyte_api_provider_meta, plus adds a test for unannotated Actions dependencies.
tests/test_sessions_enabled.py Sets RETRY_TIMES = 0 for test stability/reproducibility.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/test_providers.py
Comment thread scrapy_zyte_api/_request_fingerprinter.py Outdated
@AdrianAtZyte AdrianAtZyte marked this pull request as draft April 29, 2026 11:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants