[Feature Request]: Only extract visible content #1865
SimonMayerhofer started this conversation in Feature requests
Replies: 1 comment
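@SimonMayerhofer

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Runs inside the page: remove every element whose computed style hides it.
# Removing an element also removes all of its descendants.
remove_hidden_js = """
(() => {
  const allElements = document.querySelectorAll('body *');
  for (const el of allElements) {
    const style = window.getComputedStyle(el);
    if (style.display === 'none' || style.visibility === 'hidden') {
      el.remove();
    }
  }
})();
"""

config = CrawlerRunConfig(
    js_code=remove_hidden_js,
    wait_for="js:() => true",  # ensure JS runs first
)


async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.markdown)  # only visible content


asyncio.run(main())
```

This runs in the browser (Playwright), so `getComputedStyle` sees each element's final computed style. `display` itself is not inherited, but removing a hidden element also removes its descendants, so content nested under a `display: none` parent is dropped as well, which is exactly what you need.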
What needs to be done?
I'd like an option so that the final markdown contains only text that is visible on the page. That is, content inside an element that has, or whose ancestor has, e.g. `display: none;` or `visibility: hidden;` set should not be extracted.
What problem does this solve?
We contact companies and reference content from their website. Hidden content might be outdated, so they sometimes wonder where we found that info.
Target users/beneficiaries
Marketing/Sales teams
Current alternatives/workarounds
None that I'm aware of. One option might be to remove non-visible elements with JS after the page has loaded.
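For illustration, the rule described under "What needs to be done?" can be sketched offline in plain Python. This is a minimal sketch, not crawl4ai functionality: it only looks at inline `style` attributes, so anything hidden through stylesheets or scripts would still need a real browser. The names `VisibleTextExtractor` and `visible_text` are hypothetical.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Content of these elements is never rendered as page text.
NON_TEXT_TAGS = {"script", "style"}


def parse_inline_style(style: str) -> dict[str, str]:
    """Turn 'display: none; color: red' into {'display': 'none', 'color': 'red'}."""
    props = {}
    for decl in style.split(";"):
        name, sep, value = decl.partition(":")
        if sep:
            value = value.lower().replace("!important", "").strip()
            props[name.strip().lower()] = value
    return props


class VisibleTextExtractor(HTMLParser):
    """Collects text that is not inside an element hidden via inline style."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one (tag, hidden) entry per open element
        self.chunks = []  # visible text fragments, in document order

    def _inside_hidden(self) -> bool:
        return self.stack[-1][1] if self.stack else False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = parse_inline_style(dict(attrs).get("style") or "")
        # Hidden state is sticky: once an ancestor is hidden, everything
        # below it is dropped, even a child that resets visibility: visible.
        hidden = (
            self._inside_hidden()
            or tag in NON_TEXT_TAGS
            or style.get("display") == "none"
            or style.get("visibility") == "hidden"
        )
        self.stack.append((tag, hidden))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating
        # children that were never closed.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and not self._inside_hidden():
            self.chunks.append(text)


def visible_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = """
<div>
  <p>Current opening hours</p>
  <div style="display: none">
    <p>Old opening hours</p>
  </div>
  <span style="visibility:hidden">Legacy phone number</span>
  <p>Contact us</p>
</div>
"""
print(visible_text(sample))  # prints: Current opening hours Contact us
```

Treating the hidden state as sticky matches the request as worded (any hidden ancestor excludes the content). It is stricter than CSS itself, where a child can opt back in with `visibility: visible`.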
Proposed approach
No response