
Commit 8035e27

Merge remote-tracking branch 'origin/main' into close-inactive-contexts
2 parents: 5acff74 + 72060d2

File tree

15 files changed: +298 −110 lines changed


.bumpversion.cfg

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 [bumpversion]
-current_version = 0.0.26
+current_version = 0.0.28
 commit = True
 tag = True

README.md

Lines changed: 22 additions & 9 deletions
@@ -128,7 +128,8 @@ Type `dict`, default `{}`
 
 A dictionary with options to be passed as keyword arguments when launching the
 Browser. See the docs for
-[`BrowserType.launch`](https://playwright.dev/python/docs/api/class-browsertype#browser-type-launch).
+[`BrowserType.launch`](https://playwright.dev/python/docs/api/class-browsertype#browser-type-launch)
+for a list of supported keyword arguments.
 
 ```python
 PLAYWRIGHT_LAUNCH_OPTIONS = {
@@ -524,7 +525,7 @@ class AwesomeSpiderWithPage(scrapy.Spider):
 
 Multiple [browser contexts](https://playwright.dev/python/docs/browser-contexts)
 to be launched at startup can be defined via the
-[`PLAYWRIGHT_CONTEXTS`](#PLAYWRIGHT_CONTEXTS) setting.
+[`PLAYWRIGHT_CONTEXTS`](#playwright_contexts) setting.
 
 ### Choosing a specific context for a request
 
@@ -548,7 +549,12 @@ context can also be customized on startup via the `PLAYWRIGHT_CONTEXTS` setting.
 Pass a value for the `user_data_dir` keyword argument to launch a context as
 persistent. See also [`BrowserType.launch_persistent_context`](https://playwright.dev/python/docs/api/class-browsertype#browser-type-launch-persistent-context).
 
-### Creating a context during a crawl
+Note that persistent contexts are launched independently from the main browser
+instance, hence keyword arguments passed in the
+[`PLAYWRIGHT_LAUNCH_OPTIONS`](#playwright_launch_options)
+setting do not apply.
+
+### Creating contexts while crawling
 
 If the context specified in the `playwright_context` meta key does not exist, it will be created.
 You can specify keyword arguments to be passed to
@@ -583,7 +589,7 @@ Specifying a non-negative integer value for the
 [`PLAYWRIGHT_CLOSE_CONTEXT_INTERVAL`](#playwright_close_context_interval)
 setting enables closing browser contexts which have no active pages.
 
-### Closing a context during a crawl
+### Closing contexts while crawling
 
 After [receiving the Page object in your callback](#receiving-page-objects-in-callbacks),
 you can access a context though the corresponding [`Page.context`](https://playwright.dev/python/docs/api/class-page#page-context)
@@ -605,21 +611,28 @@ def parse(self, response):
 async def parse_in_new_context(self, response):
     page = response.meta["playwright_page"]
     title = await page.title()
+    await page.close()
     await page.context.close()
     return {"title": title}
 
 async def close_context_on_error(self, failure):
     page = failure.request.meta["playwright_page"]
+    await page.close()
     await page.context.close()
 ```
 
+### Avoid race conditions & memory leaks when closing contexts
+Make sure to close the page before closing the context. See
+[this comment](https://github.com/scrapy-plugins/scrapy-playwright/issues/191#issuecomment-1548097114)
+in [#191](https://github.com/scrapy-plugins/scrapy-playwright/issues/191)
+for more information.
+
 ### Maximum concurrent context count
 
 Specify a value for the `PLAYWRIGHT_MAX_CONTEXTS` setting to limit the amount
 of concurent contexts. Use with caution: it's possible to block the whole crawl
-if contexts are not closed after they are no longer used (refer to the above
-section to dinamically close contexts, see also the section about
-[automatically closing inactive contexts](#automatically-closing-inactive-contexts)).
+if contexts are not closed after they are no longer used (refer to
+[this section](#closing-contexts-while-crawling) to dinamically close contexts).
 Make sure to define an errback to still close contexts even if there are errors.
 
 
@@ -636,7 +649,7 @@ class ProxySpider(Spider):
     custom_settings = {
         "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "proxy": {
-                "server": "http://myproxy.com:3128"
+                "server": "http://myproxy.com:3128",
                 "username": "user",
                 "password": "pass",
             },
@@ -671,7 +684,7 @@ PLAYWRIGHT_CONTEXTS = {
 }
 ```
 
-Or passing a `proxy` key when [creating a context during a crawl](#creating-a-context-during-a-crawl).
+Or passing a `proxy` key when [creating contexts while crawling](#creating-contexts-while-crawling).
 
 See also:
 * [`zyte-smartproxy-playwright`](https://github.com/zytedata/zyte-smartproxy-playwright):
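The documentation changes above converge on one pattern: create contexts on demand via the `playwright_context` meta key, and always close the page before its context, both in the callback and in an errback. A minimal spider sketch assembled from the README snippets in this diff (spider name and URL are placeholders):

```python
# Minimal sketch assembled from the README snippets above; the spider
# name, start URL and context name are placeholders.
import scrapy


class ContextSpider(scrapy.Spider):
    name = "context_example"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",
            callback=self.parse_in_new_context,
            errback=self.close_context_on_error,
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_context": "new",  # created on the fly if missing
            },
        )

    async def parse_in_new_context(self, response):
        page = response.meta["playwright_page"]
        title = await page.title()
        # Close the page *before* the context to avoid the race condition
        # described in #191.
        await page.close()
        await page.context.close()
        return {"title": title}

    async def close_context_on_error(self, failure):
        # Errback: release resources even on error, so PLAYWRIGHT_MAX_CONTEXTS
        # cannot block the crawl.
        page = failure.request.meta["playwright_page"]
        await page.close()
        await page.context.close()
```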

docs/changelog.md

Lines changed: 13 additions & 0 deletions
@@ -1,5 +1,18 @@
 # scrapy-playwright changelog
 
+
+### [v0.0.28](https://github.com/scrapy-plugins/scrapy-playwright/releases/tag/v0.0.28) (2023-08-05)
+
+* Retry page.content if necessary (#218)
+
+
+### [v0.0.27](https://github.com/scrapy-plugins/scrapy-playwright/releases/tag/v0.0.27) (2023-07-24)
+
+* Override method only for navigation requests (#177)
+* Pass spider argument to _create_browser_context (#212)
+* await AsyncPlaywright.stop on close (#214)
+
+
 ### [v0.0.26](https://github.com/scrapy-plugins/scrapy-playwright/releases/tag/v0.0.26) (2023-02-01)
 
 * Fix logging (pass extra args instead of updating log record factory)

examples/contexts.py

Lines changed: 1 addition & 0 deletions
@@ -99,6 +99,7 @@ async def parse(self, response):
        page = response.meta["playwright_page"]
        context_name = response.meta["playwright_context"]
        storage_state = await page.context.storage_state()
+       await page.close()
        await page.context.close()
        return {
            "url": response.url,

scrapy_playwright/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.0.26"
+__version__ = "0.0.28"

scrapy_playwright/_utils.py

Lines changed: 43 additions & 2 deletions
@@ -1,11 +1,17 @@
+import logging
 from typing import Awaitable, Iterator, Optional, Tuple
 
+from playwright.async_api import Error, Page
+from scrapy import Spider
 from scrapy.http.headers import Headers
 from scrapy.settings import Settings
 from scrapy.utils.python import to_unicode
 from w3lib.encoding import html_body_declared_encoding, http_content_type_encoding
 
 
+logger = logging.getLogger("scrapy-playwright")
+
+
 async def _maybe_await(obj):
     if isinstance(obj, Awaitable):
         return await obj
@@ -30,9 +36,9 @@ def _encode_body(headers: Headers, text: str) -> Tuple[bytes, str]:
     return text.encode("utf-8"), "utf-8"  # fallback
 
 
-def _is_safe_close_error(error: Exception) -> bool:
+def _is_safe_close_error(error: Error) -> bool:
     """
-    Taken verbatim from
+    Taken almost verbatim from
     https://github.com/microsoft/playwright-python/blob/v1.20.0/playwright/_impl/_helper.py#L234-L238
     """
     message = str(error)
@@ -41,6 +47,41 @@ def _is_safe_close_error(error: Exception) -> bool:
     )
 
 
+_NAVIGATION_ERROR_MSG = (
+    "Unable to retrieve content because the page is navigating and changing the content."
+)
+
+
+async def _get_page_content(
+    page: Page,
+    spider: Spider,
+    context_name: str,
+    scrapy_request_url: str,
+    scrapy_request_method: str,
+) -> str:
+    """Wrapper around Page.content to retry if necessary.
+    Arguments other than the page are only for logging.
+    """
+    try:
+        return await page.content()
+    except Error as err:
+        if err.message == _NAVIGATION_ERROR_MSG:
+            logger.debug(
+                "Retrying to get content from page '%s', error: '%s'",
+                page.url,
+                _NAVIGATION_ERROR_MSG,
+                extra={
+                    "spider": spider,
+                    "context_name": context_name,
+                    "scrapy_request_url": scrapy_request_url,
+                    "scrapy_request_method": scrapy_request_method,
+                    "playwright_page_url": page.url,
+                },
+            )
+            return await page.content()
+        raise
+
+
 def _read_float_setting(settings: Settings, key: str) -> Optional[float]:
     try:
         return float(settings[key])
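The new `_get_page_content` helper retries `Page.content()` exactly once when it fails with Playwright's "page is navigating" error, and re-raises anything else. A sketch of that behavior using a mocked page; it assumes `playwright.async_api.Error` can be constructed directly with a message string (normally the driver raises it):

```python
# Sketch (not part of the commit): exercising the retry with a mocked page.
# Assumes playwright.async_api.Error accepts a message argument.
import asyncio
from unittest.mock import AsyncMock

from playwright.async_api import Error

from scrapy_playwright._utils import _NAVIGATION_ERROR_MSG, _get_page_content


async def main():
    page = AsyncMock()
    page.url = "https://example.org"
    # First call raises the navigation error, second call succeeds.
    page.content.side_effect = [Error(_NAVIGATION_ERROR_MSG), "<html></html>"]
    html = await _get_page_content(
        page=page,
        spider=None,  # passed through for logging only
        context_name="default",
        scrapy_request_url="https://example.org",
        scrapy_request_method="GET",
    )
    assert html == "<html></html>"
    assert page.content.await_count == 2  # retried exactly once


asyncio.run(main())
```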

scrapy_playwright/handler.py

Lines changed: 20 additions & 6 deletions
@@ -10,6 +10,7 @@
     Browser,
     BrowserContext,
     BrowserType,
+    Error as PlaywrightError,
     Page,
     PlaywrightContextManager,
     Request as PlaywrightRequest,
@@ -31,6 +32,7 @@
 from scrapy_playwright.page import PageMethod
 from scrapy_playwright._utils import (
     _encode_body,
+    _get_page_content,
     _is_safe_close_error,
     _maybe_await,
     _read_float_setting,
@@ -119,8 +121,8 @@ async def _launch(self) -> None:
         """Launch Playwright manager and configured startup context(s)."""
         logger.info("Starting download handler")
         self.playwright_context_manager = PlaywrightContextManager()
-        playwright_instance = await self.playwright_context_manager.start()
-        self.browser_type: BrowserType = getattr(playwright_instance, self.browser_type_name)
+        self.playwright = await self.playwright_context_manager.start()
+        self.browser_type: BrowserType = getattr(self.playwright, self.browser_type_name)
         if self.startup_context_kwargs:
             logger.info("Launching %i startup context(s)", len(self.startup_context_kwargs))
             await asyncio.gather(
@@ -209,7 +211,9 @@ async def _create_page(self, request: Request, spider: Spider) -> Page:
         ctx_wrapper = self.context_wrappers.get(context_name)
         if ctx_wrapper is None:
             ctx_wrapper = await self._create_browser_context(
-                name=context_name, context_kwargs=request.meta.get("playwright_context_kwargs")
+                name=context_name,
+                context_kwargs=request.meta.get("playwright_context_kwargs"),
+                spider=spider,
             )
 
         await ctx_wrapper.semaphore.acquire()
@@ -295,6 +299,7 @@ async def _close(self) -> None:
         logger.info("Closing browser")
         await self.browser.close()
         await self.playwright_context_manager.__aexit__()
+        await self.playwright.stop()
 
     def download_request(self, request: Request, spider: Spider) -> Deferred:
         if request.meta.get("playwright"):
@@ -379,7 +384,13 @@ async def _download_request_with_page(
         headers = Headers(await response.all_headers())
         headers.pop("Content-Encoding", None)
         await self._apply_page_methods(page, request, spider)
-        body_str = await page.content()
+        body_str = await _get_page_content(
+            page=page,
+            spider=spider,
+            context_name=context_name,
+            scrapy_request_url=request.url,
+            scrapy_request_method=request.method,
+        )
         request.meta["download_latency"] = time() - start_time
 
         if not request.meta.get("playwright_include_page"):
@@ -533,7 +544,10 @@ async def _request_handler(route: Route, playwright_request: PlaywrightRequest)
 
            # if the request is triggered by scrapy, not playwright
            original_playwright_method: str = playwright_request.method
-           if playwright_request.url == url:
+           if (
+               playwright_request.url.rstrip("/") == url.rstrip("/")
+               and playwright_request.is_navigation_request()
+           ):
                if method.upper() != playwright_request.method.upper():
                    overrides["method"] = method
                if body:
@@ -559,7 +573,7 @@ async def _request_handler(route: Route, playwright_request: PlaywrightRequest)
                        "playwright_request_method_new": overrides["method"],
                    },
                )
-        except Exception as ex:
+        except PlaywrightError as ex:
             if _is_safe_close_error(ex):
                 logger.warning(
                     "Failed processing Playwright request: <%s %s> exc_type=%s exc_msg=%s",

tests/__init__.py

Lines changed: 11 additions & 0 deletions
@@ -1,5 +1,7 @@
 from contextlib import asynccontextmanager
 
+from scrapy import Request
+from scrapy.http.response.html import HtmlResponse
 from scrapy.utils.test import get_crawler
 
 
@@ -8,6 +10,7 @@ async def make_handler(settings_dict: dict):
     """Convenience function to obtain an initialized handler and close it gracefully"""
     from scrapy_playwright.handler import ScrapyPlaywrightDownloadHandler
 
+    settings_dict.setdefault("TELNETCONSOLE_ENABLED", False)
     crawler = get_crawler(settings_dict=settings_dict)
     handler = ScrapyPlaywrightDownloadHandler(crawler=crawler)
     try:
@@ -18,3 +21,11 @@ async def make_handler(settings_dict: dict):
         yield handler
     finally:
         await handler._close()
+
+
+def assert_correct_response(response: HtmlResponse, request: Request) -> None:
+    assert isinstance(response, HtmlResponse)
+    assert response.request is request
+    assert response.url == request.url
+    assert response.status == 200
+    assert "playwright" in response.flags

tests/conftest.py

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ def _is_coroutine(obj):
     return asyncio.iscoroutinefunction(obj) or inspect.isgeneratorfunction(obj)
 
 
-@pytest.mark.tryfirst
+@pytest.hookimpl(tryfirst=True)
 def pytest_pycollect_makeitem(collector, name, obj):
     """A pytest hook to collect asyncio coroutines."""
     if collector.funcnamefilter(name) and _is_coroutine(obj):
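For context: applying marks such as `tryfirst` to hook functions was deprecated by pytest; `pytest.hookimpl(tryfirst=True)` is the supported way to ask for this hook to run before other implementations, so this is a modernization with no behavioral difference.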

tests/site/redirect.html

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <title>Page should redirect</title>
+  <link rel="canonical" href="index.html">
+  <meta name="robots" content="noindex">
+  <meta charset="utf-8">
+  <meta http-equiv="refresh" content="0; url=index.html">
+</head>
+<body>
+  <p>You should not see this because you are immediately redirected.</p>
+</body>
+</html>
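This new fixture page presumably supports the navigation-related changes above: the `meta http-equiv="refresh"` tag triggers an immediate client-side redirect to `index.html`. A standalone sketch of the mechanism; the server, port, and directory are assumptions, not part of the commit:

```python
# Hypothetical sketch (not from the commit): a browser that loads
# redirect.html ends up on index.html via the meta refresh.
import asyncio
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler
from threading import Thread

from playwright.async_api import async_playwright


async def main():
    # Serve tests/site on a local port (port and directory are assumptions).
    handler_cls = partial(SimpleHTTPRequestHandler, directory="tests/site")
    server = HTTPServer(("127.0.0.1", 8000), handler_cls)
    Thread(target=server.serve_forever, daemon=True).start()
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto("http://127.0.0.1:8000/redirect.html")
        await page.wait_for_url("**/index.html")  # the meta refresh fired
        await browser.close()
    server.shutdown()


asyncio.run(main())
```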
