
Scraping Data using Playwright with Apache Airflow and Docker

Hi, everyone. I managed to build an application that uses Airflow, Docker and Playwright.

It runs fine, and several scrapers are running without issues. The problem is with one of them. Its code is as follows:

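import os

# Assumed alias: the post does not show its imports, but the code uses p.sync_playwright().
import playwright.sync_api as p


def run_scraper(url, nome_arquivo):
    with p.sync_playwright() as pw:
        # launch() runs headless unless headless=False is passed.
        browser = pw.chromium.launch()
        context = browser.new_context()
        page = context.new_page()
        print(url)
        # Save every download into the local "downloads" folder under the given name.
        page.on("download", lambda download: download.save_as(os.path.join("downloads", nome_arquivo)))

        page.goto(url)
        page.wait_for_load_state()

        # Run the search ("Buscar"); this is the click that times out under Airflow.
        page.click("button:text('Buscar')")
        page.wait_for_load_state()

        # Block until the click actually triggers a download.
        with page.expect_download():
            page.click("a span:text('DOWNLOAD')")
        browser.close()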

This scraper works as expected when I run it on my computer as a script. It clicks "Buscar" and then downloads a file, which is saved to a "downloads" folder next to the scraper.

The problem appears when I run it via Airflow, where it fails with this error:

=========================== logs ===========================
waiting for locator("button:text('Buscar')")

It seems to still be waiting for the 'Buscar' button, but when I run it on my machine it finds the button almost immediately. Does anyone know what the issue is here?
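When a locator times out only inside the container, it can help to capture what the browser actually rendered there. The following is a minimal sketch of that idea, not part of the original scraper: the helper names `debug_artifact_paths` and `click_or_dump`, the "debug" output folder, and the 10-second timeout are all assumptions made for illustration.

```python
import os


def debug_artifact_paths(nome_arquivo, out_dir="debug"):
    """Return (screenshot_path, html_path) derived from the download file name."""
    base = os.path.splitext(os.path.basename(nome_arquivo))[0]
    return (
        os.path.join(out_dir, base + ".png"),
        os.path.join(out_dir, base + ".html"),
    )


def click_or_dump(page, selector, nome_arquivo, timeout_ms=10_000):
    """Click the selector; on timeout, save a screenshot and the page HTML, then re-raise."""
    # Imported lazily so the path helper above works without Playwright installed.
    from playwright.sync_api import TimeoutError as PlaywrightTimeoutError

    try:
        page.click(selector, timeout=timeout_ms)
    except PlaywrightTimeoutError:
        screenshot_path, html_path = debug_artifact_paths(nome_arquivo)
        os.makedirs(os.path.dirname(screenshot_path), exist_ok=True)
        # Record what the browser inside the container actually sees.
        page.screenshot(path=screenshot_path, full_page=True)
        with open(html_path, "w", encoding="utf-8") as fh:
            fh.write(page.content())
        raise
```

In the scraper, `click_or_dump(page, "button:text('Buscar')", nome_arquivo)` would stand in for the plain `page.click(...)` call. If the saved screenshot shows a blank page, a consent banner, or a block page, the difference is in what the site serves to the container rather than in the selector.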

This thread is trying to answer the question: "What is the issue with the scraper that uses Playwright, Apache Airflow, and Docker, and how is it resolved?"

1 reply

Edit: For some reason it only works when pw.chromium.launch is called with headless=False.
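Since the fix depends on the launch mode, one option is to make that mode configurable instead of hard-coding it, so the same scraper can run headed or headless per environment. This is only a sketch under assumptions: `SCRAPER_HEADLESS` is a hypothetical environment variable name (not a Playwright setting), and `launch_kwargs` and `open_browser` are illustrative helpers.

```python
import os


def launch_kwargs(env=None):
    """Build keyword arguments for pw.chromium.launch() from the environment."""
    env = os.environ if env is None else env
    # SCRAPER_HEADLESS is a hypothetical variable name chosen for this sketch.
    raw = env.get("SCRAPER_HEADLESS", "true").strip().lower()
    return {"headless": raw not in ("0", "false", "no")}


def open_browser(pw):
    """Launch Chromium with the settings above; pw comes from sync_playwright()."""
    return pw.chromium.launch(**launch_kwargs())
```

Note that a headed browser needs a display, which a Docker container usually lacks. A common approach, assuming Xvfb is installed in the image, is to start the worker process under `xvfb-run` so that headless=False has a virtual display to draw on.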
