Web Scraping with Python and Playwright

A hands-on guide to setting up Python and Playwright to start web scraping

Harun Sevinc on 23 December 2024

Web scraping has become an essential tool for extracting data from websites, whether for personal projects or business applications. While there are several popular web scraping libraries in Python, Playwright stands out due to its speed, support for multiple browser contexts, and automation capabilities. In this guide, we will walk you through how to use Playwright for web scraping with Python.

What is Playwright?

Playwright is a powerful automation framework that supports multiple browser engines: Chromium (Chrome, Edge), Firefox, and WebKit (Safari). Unlike traditional scraping libraries such as BeautifulSoup or requests, Playwright drives a real browser to simulate user interactions, which helps with JavaScript-heavy websites and with detection mechanisms that websites use to block simple scrapers.

Playwright allows scraping dynamic content, making it an excellent choice for scraping modern web applications built with frameworks like React, Vue, or Angular.

Why Use Playwright for Web Scraping?

  1. Handles JavaScript Rendering: Many websites rely on JavaScript to render their content. Playwright can interact with a page just like a real user, ensuring you can scrape dynamic content.
  2. Multiple Browser Support: Playwright supports Chromium, Firefox, and WebKit, giving you flexibility and better browser emulation.
  3. Headless and Headed Modes: It can run headless (without a visible browser window) to save resources, or headed so you can watch and debug the scrape.
  4. Advanced Automation: You can automate complex workflows, take screenshots, emulate devices, and much more.

Setting Up Playwright in Python

Let’s start by setting up Playwright and building a simple scraper.

Step 1: Install Playwright

You can install Playwright for Python using pip:

pip install playwright

After installation, you’ll need to install the necessary browser binaries:

python -m playwright install

This command will install Chromium, Firefox, and WebKit.

Step 2: Writing a Simple Web Scraper

Once Playwright is installed, you can write your first Python script. Let’s scrape a website and extract data:

from playwright.sync_api import sync_playwright

def scrape_with_playwright():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # Run headless
        page = browser.new_page()
        page.goto('https://example.com')

        # Extract content using Playwright’s selectors
        content = page.text_content('h1')  # Extract text from the first H1 tag
        print("Scraped content:", content)

        browser.close()

scrape_with_playwright()

In this script:

  • We launch a Chromium browser in headless mode.
  • Navigate to the website https://example.com.
  • Extract the text content from the first h1 tag.
  • Close the browser after scraping.

Step 3: Handling JavaScript-Rendered Pages

Unlike traditional libraries that may fail to scrape JavaScript-heavy websites, Playwright can wait for content to load and interact with elements.

Here’s an example of scraping a JavaScript-heavy site:

from playwright.sync_api import sync_playwright

def scrape_dynamic_page():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto('https://example.com', wait_until='networkidle')  # Wait for the page to fully load

        # Interact with dynamic elements
        page.click('button#load-more')  # Click a button to load more content

        # Wait for additional content to load
        page.wait_for_selector('div.loaded-content')

        # Extract the new content
        content = page.text_content('div.loaded-content')
        print("New content:", content)

        browser.close()

scrape_dynamic_page()

In this example:

  • We wait for the page to fully load using wait_until='networkidle'.
  • Simulate clicking a button that loads more content dynamically.
  • Wait for the new content to be loaded and then scrape it.
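Under the hood, waiting for a selector boils down to polling until a condition holds or a timeout expires. The same pattern can be sketched in plain Python (wait_until is a hypothetical helper, shown only to illustrate the idea; Playwright's real implementation raises a TimeoutError instead of returning False):

```python
import time

def wait_until(predicate, timeout=5.0, interval=0.1):
    """Poll predicate() until it returns truthy or the timeout expires.

    Returns True on success, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False
```

In Playwright you would simply call page.wait_for_selector() and let it raise on timeout, but keeping this mental model helps when tuning timeout values for slow pages.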

Advanced Features in Playwright for Web Scraping

1. Handling Pagination

If you need to scrape data across multiple pages, you can automate pagination with Playwright:

from playwright.sync_api import sync_playwright

def scrape_multiple_pages():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        for page_num in range(1, 6):  # Scrape first 5 pages
            url = f'https://example.com/page/{page_num}'
            page.goto(url)
            content = page.text_content('div.article-content')
            print(f"Page {page_num} content:", content)

        browser.close()

scrape_multiple_pages()
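The URL construction in the loop above can be factored into a small pure function, which keeps the scraping loop short and makes the pagination scheme easy to test on its own (page_urls is a hypothetical helper, assuming the /page/<n> URL scheme from the example):

```python
def page_urls(base: str, last_page: int) -> list:
    """Build paginated URLs of the form <base>/page/<n> for pages 1..last_page."""
    return [f"{base}/page/{n}" for n in range(1, last_page + 1)]

print(page_urls('https://example.com', 3))
# → ['https://example.com/page/1', 'https://example.com/page/2', 'https://example.com/page/3']
```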

2. Taking Screenshots

Taking screenshots during scraping can be useful for debugging or monitoring. Playwright makes this easy:

from playwright.sync_api import sync_playwright

def take_screenshot():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto('https://example.com')

        # Take a screenshot of the entire page
        page.screenshot(path='screenshot.png', full_page=True)

        browser.close()

take_screenshot()

3. Emulating Devices and Geolocation

You can scrape websites as if you’re browsing from a mobile device or a specific location:

from playwright.sync_api import sync_playwright

def emulate_mobile():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        iphone_11 = p.devices['iPhone 11']
        page = browser.new_page(**iphone_11)

        page.goto('https://example.com')
        page.screenshot(path='mobile_screenshot.png')

        browser.close()

emulate_mobile()

4. Bypassing CAPTCHA and Detection

Playwright can be combined with third-party stealth plugins and proxy integration to reduce the chance of being blocked. Solving CAPTCHAs themselves usually requires external services, but because Playwright drives a real browser, its traffic is harder for websites to distinguish from an ordinary user than plain HTTP requests are. These setups are more advanced and beyond the scope of this guide.

Best Practices for Web Scraping with Playwright

  1. Respect Robots.txt: Always check the website’s robots.txt file and respect their scraping policies.
  2. Throttle Requests: Avoid sending too many requests too quickly. Use page.wait_for_timeout() to introduce delays between requests.
  3. Use Proxies: For large-scale scraping, use rotating proxies to avoid IP bans.
  4. Headless Browsing: Use headless mode to save resources; for debugging, headed (full-browser) mode can be useful.
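The robots.txt check from point 1 can be automated with Python's standard library. A minimal sketch using urllib.robotparser (the rules are parsed from a string here for demonstration; in practice you would fetch https://<site>/robots.txt first, and is_allowed is an illustrative helper name):

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether user_agent may fetch url under the given robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

rules = "User-agent: *\nDisallow: /private/"
print(is_allowed(rules, "MyScraper", "https://example.com/public/page"))   # True
print(is_allowed(rules, "MyScraper", "https://example.com/private/data"))  # False
```

Running this check before each scrape keeps your scraper within the site's stated policy.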

Conclusion

Playwright is a robust tool for web scraping with Python, especially for modern websites with dynamic content. It combines ease of use with advanced capabilities like device emulation, automation, and JavaScript rendering, making it a perfect choice for complex scraping projects.

Whether you’re scraping data for analysis, building a web scraper for business, or automating workflows, Playwright provides a powerful framework to achieve your goals efficiently.