Elevate Your Scraping Project With Puppeteer Extra

Elevate Your Scraping Project With Puppeteer Extra
2024-9-5 07:58:15 Author: hackernoon.com(查看原文) 阅读量:5 收藏

As highlighted in our guide to web scraping with Puppeteer, this browser automation library is a fantastic ally for extracting data from dynamic content sites. Still, like any other tool, it has its shortcomings. That's where Puppeteer Extra steps in!

In this guide, we’ll introduce you to puppeteer-extra—a library that wraps puppeteer to extend it with plugin support. Get ready to take your Puppeteer scraping project to the next level! 🚀

Puppeteer Extra is a lightweight wrapper around puppeteer that enables plugin integration through a clean interface. Although it's not developed by the team behind Puppeteer, this community-driven project has hundreds of thousands of weekly downloads and over 6k stars on GitHub 📈.

Check out the GitHub stars chart below—it’s clear that the puppeteer-extra repo has been on a steady rise in popularity over the years: The rise of Puppeteer Extra on GitHub

The plugins officially supported by Puppeteer Extra are:

puppeteer-extra-plugin-stealth: To make it harder to detect Puppeteer as a bot.
puppeteer-extra-plugin-recaptcha: To solve reCAPTCHAs and hCaptchas automatically.
puppeteer-extra-plugin-adblocker: To reduce bandwidth and load times by applying a fast and efficient blocker for ads and trackers.
puppeteer-extra-plugin-devtools: To make debugging the Puppeteer browser possible from anywhere.
puppeteer-extra-plugin-repl: To make Puppeteer debugging and exploration easier and more enjoyable with an interactive REPL.
puppeteer-extra-plugin-block-resources: To programmatically block resources like images, media files, CSS stylesheets, and more while loading pages.
puppeteer-extra-plugin-flash: To allow Adobe Flash content to run on all sites without user interaction.
puppeteer-extra-plugin-anonymize-ua: To anonymize the User-Agent header on all pages, with support for dynamic replacing.
puppeteer-extra-plugin-user-preferences: To set custom Chrome/Chromium user preferences.

On top of those, it integrates with the following community plugins:

puppeteer-extra-plugin-minmax: To minimize and maximize the Puppeteer browser window in real time.
puppeteer-extra-plugin-portal: To remotely view and interact with puppeteer sessions via the Chromium screencast API.

No doubt, Puppeteer is one of the top headless browser libraries for scraping and testing. But let’s be honest—it has its limits, especially when facing anti-bot tech like browser fingerprinting and CAPTCHAs. Read our guide to learn how to deal with reCAPTCHA automation.

Websites armed with anti-bot defenses can easily detect and block Puppeteer scripts. If only there was a way to extend and customize Puppeteer’s default behavior...

…well, that’s exactly what Puppeteer Extra is all about!

Puppeteer Extra vs Puppeteer

Puppeteer Extra is like a power-up for Puppeteer, adding plugin support to tackle those major drawbacks. Instead of overriding or extending everything for you, it wraps Puppeteer and lets you register only the plugins you need. 🦸

`puppeteer-extra`: Setup and Plugins for Web Scraping

You can add Puppeteer Extra to your project's npm dependencies with:

npm install puppeteer-extra

⚠️ Note: puppeteer-extra requires puppeteer to work, so make sure both packages are installed in your project.

Then, you have to import the puppeteer object from puppeteer-extra instead of the puppeteer library:

const puppeteer = require("puppeteer-extra")
// for ESM users:
// const { puppeteer } from "puppeteer-extra"

Everything in the Puppeteer API stays the same, but you get a little extra magic ✨. The puppeteer object now exposes a use() method to plug in Puppeteer Extra plugins.

Time to dive into what these plugins can do, and see how they’ll level up your web scraping game!

Puppeteer Extra Plugin Stealth, also known simply as Puppeteer Stealth, includes a set of configurations designed to reduce bot detection. It overrides Puppeteer’s detectable properties and settings that might expose it as a bot.

For more details, check out our guide on how to avoid getting blocked with Puppeteer Stealth.

⚙️ Installation:

npm install puppeteer-extra-plugin-stealth

💡 Usage:

const StealthPlugin = require("puppeteer-extra-plugin-stealth")
// for ESM users:
// import StealthPlugin from "puppeteer-extra-plugin-stealth"

puppeteer.use(StealthPlugin())

A plugin to prevent the Puppeteer browser from loading specific resources. The supported resource types include document, stylesheet, image, media, font, script, texttrack, xhr, fetch, eventsource, websocket, manifest, other.

Resource blocking can be configured both globally and locally.

⚙️ Installation:

npm install puppeteer-extra-plugin-block-resources

💡 Usage:

const BlockResourcesPlugin = require("puppeteer-extra-plugin-block-resources")
// for ESM users:
// import BlockResourcesPlugin from "puppeteer-extra-plugin-block-resources"

You can then configure the resources to block globally on all pages:

puppeteer.use(BlockResourcesPlugin({
  blockedTypes: new Set(["image", "stylesheet"]),
}))

Similarly, you can locally select the resources to be blocked:

puppeteer.use(BlockResourcesPlugin()

const browser = await puppeteer.launch()
const page = await browser.newPage()

blockResourcesPlugin.blockedTypes.add("stylesheet")
await page.goto("https://www.example.com/", { waitUntil: "domcontentloaded" })

A plugin to anonymize the User-Agent set by the browser controlled by Puppeteer. 🎭

It gives you the ability to strip the 'Headless' string from the Chrome user agent in headless mode and supports dynamic replacement of the user agent through a custom function. See it in action in our Puppeteer user agent guide.

Discover what’s the best user agent for web scraping!

⚙️ Installation:

npm install puppeteer-extra-plugin-anonymize-ua

💡 Usage:

const AnonymizeUAPlugin = require("puppeteer-extra-plugin-anonymize-ua")
// for ESM users:
// import AnonymizeUAPlugin from "puppeteer-extra-plugin-anonymize-ua"

Next, you can configure the anonymous user agent:

puppeteer.use(AnonymizeUAPlugin({
  stripHeadless: true,
}))

Also, you can set a dynamic user agent via a custom function:

puppeteer.use(AnonymizeUAPlugin({
  customFn: (ua) => ua.replace("Chrome", "Chromium")})
}))

Just like with Playwright, no matter how slick and customized your Puppeteer script is, advanced anti-bot systems can still sniff you out and shut you down. But how is that even possible? 🤔

The puppeteer-extra-stealth-plugin documentation breaks it down for you:

Please note: I consider this a friendly competition in a rather interesting cat and mouse game. If the other team (👋) wants to detect headless chromium there are still ways to do that (at least I noticed a few, which I'll tackle in future updates).

It's probably impossible to prevent all ways to detect headless chromium, but it should be possible to make it so difficult that it becomes cost-prohibitive or triggers too many false-positives to be feasible.

So, while Puppeteer Extra can dodge most basic bot detection like Neo in Matrix, it can’t surely bypass Cloudflare. Sure, you could integrate a proxy into Puppeteer, but even that might not be enough.

The problem isn't Puppeteer itself (because let's be real, Puppeteer rocks! 🤘), but the browser it's controlling. The real solution? A powerful browser that:

Operates in headed mode like a regular browser to reduce bot detection.
Scales in the cloud for you, saving you time and costs in infrastructure management.
Offers rotating IPs powered by one of the largest and most reliable proxy networks on the market.
Automatically handles CAPTCHA solving, browser fingerprinting, cookie and header customization, and retries for optimal efficiency.
Seamlessly integrates with leading browser automation libraries like Playwright, Selenium, and Puppeteer.

Believe it or not, this isn't some distant dream. It's real, and it's exactly what Bright Data’s Scraping Browser has to offer!

Final Thoughts

Puppeteer is one of the most widely used browser automation tools in the tech world, but even superheroes have their limits. The community stepped in with puppeteer-extra, a package that gives Puppeteer some seriously cool new abilities through custom plugins.

But here's the thing: while these plugins can make your scraping operation way stronger, they won’t magically turn you into a ghost 👻. Sites with advanced bot detection might still be able to block you!

Bypass all anti-bots with Bright Data’s Scraping Browser—a non-detectable cloud browser that integrates seamlessly with Puppeteer. Join our mission to make the Web a public space for everyone, everywhere, even through automated scripts.

Until next time, keep exploring the Internet with freedom! 🌐

文章来源: https://hackernoon.com/elevate-your-scraping-project-with-puppeteer-extra?source=rss
如有侵权请联系:admin#unsafe.sh

puppeteer-extra: Setup and Plugins for Web Scraping

Final Thoughts

`puppeteer-extra`: Setup and Plugins for Web Scraping