Web Scraping with Scrapy: Complete Guide

TL;DR

Build a Scrapy spider (Lonely Planet demo) that crawls “experiences” pages and yields structured items (name, type, price, duration, description, link).
Run and export results directly from the CLI using -o to CSV / JSON / XML, and pass dynamic targets via -a arguments (like continent).
Plain Scrapy can get blocked fast (CAPTCHAs, IP bans, rate limits), especially on real-world sites.
Fix that by routing requests through Scrapingdog (wrap target URLs with the Scrapingdog endpoint + api_key, use dynamic=false/dynamic=true).
Result: you keep Scrapy’s speed + structure, while Scrapingdog handles rotating proxies, CAPTCHA bypass, and optional JS rendering with minimal code changes.

Scrapy is one of the most powerful web scraping frameworks in Python, but the moment you point it at a real-world website, you’ll hit a wall. CAPTCHAs, IP bans, and rate limiting can stop your spider dead in its tracks within the first few requests.

In this tutorial, you’ll learn how to build a Scrapy spider from scratch, integrate a web scraping API to bypass CAPTCHAs, and export your scraped data to CSV, JSON, or XML, all in under 50 lines of Python code.

For this walkthrough, we’ll scrape data from Lonely Planet, a travel guide website, focusing on its experiences section. We’ll extract this data and store it in various formats such as JSON, CSV, and XML. The data can then be analyzed and used to plan our next trip!

Here’s what we’ll cover:

What Scrapy is and why it’s the right tool for large-scale scraping
How to set up a Scrapy project and write your first spider
How to configure a scraping API to beat CAPTCHAs
How to export scraped data to CSV, JSON, and XML

What’s Scrapy and Why Should I Use It

Scrapy is a fast, high-level web crawling and scraping framework built for extracting structured data from websites at scale. It’s widely used across a range of applications, from data mining and price monitoring to automated testing and research.

What makes Scrapy stand out is its asynchronous architecture. It handles your requests concurrently without requiring you to manage threads, processes, or synchronization manually, so the spider doesn’t sit idle waiting for one response before sending the next request. This makes it significantly faster than a crawler that fetches pages one at a time. If you were building something similar from scratch, you’d either have to implement async logic yourself or rely on a third-party library. Scrapy gives you all of that out of the box, and it’s completely open-source.
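To make the idea of concurrent requests concrete, here is a minimal sketch. It is not how Scrapy works internally (Scrapy is built on Twisted); it uses Python’s asyncio and a simulated fetch_page function, both chosen here purely for illustration, to show several "requests" being in flight at once without any threads.

```python
import asyncio
import time


async def fetch_page(url, delay=0.2):
    """Stand-in for a network request: wait, then return a fake body."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def crawl(urls):
    # All simulated requests are in flight at once; no threads or locks.
    return await asyncio.gather(*(fetch_page(url) for url in urls))


def timed_crawl(urls):
    """Run the crawl and return the pages plus the elapsed wall time."""
    start = time.perf_counter()
    pages = asyncio.run(crawl(urls))
    return pages, time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical URLs; nothing is actually downloaded.
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    pages, elapsed = timed_crawl(urls)
    print(f"Fetched {len(pages)} pages in {elapsed:.2f}s")
```

Because the waits overlap, the total time stays close to a single delay rather than the sum of all five.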

Now that you know what Scrapy offers, let’s walk through how to build a crawler with it.

Setup

To get started, we’ll need to install the Scrapy library. Remember to separate your Python dependencies by using virtual environments. Once you’ve set up a virtual environment and activated it, run:

pip install scrapy

Next, head over to the ScrapingDog website and grab your API key. You’ll need this to access their CAPTCHA-solving service, which we’ll integrate into our spider later in this walkthrough.

Initializing the project

With the installation steps complete, you’re ready to set up the web crawler. Run the following command to initialize a new Scrapy project:

scrapy startproject trips

This generates a Scrapy project with the following structure:

├── scrapy.cfg
└── trips
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

Next, create a new file inside the spiders folder and name it destinations.py. This file will contain the core logic for your web scraper.

Here’s what the initial source code in destinations.py looks like:

from scrapy import Request, Spider

from ..items import TripsItem


class DestinationsCrawl(Spider):
    name = 'destinations'

    allowed_domains = ['lonelyplanet.com']
    url_link = 'https://www.lonelyplanet.com/europe/activities'
    start_urls = [url_link]

    def __init__(self, name=None, continent=None, **kwargs):
        # Both arguments are optional so the spider also runs without -a flags.
        super().__init__(name=name, **kwargs)
        self.continent = continent

    def start_requests(self):
        if self.continent:  # taking input from command line parameters
            url = f'https://www.lonelyplanet.com/{self.continent}/activities'
            yield Request(url, self.parse)
        else:
            for url in self.start_urls:
                yield Request(url, dont_filter=True)

    def parse(self, response):
        experiences = response.css("article.rounded.shadow-md")
        for experience in experiences:
            # Create a fresh item per listing so records don't share state.
            items = TripsItem()
            items["name"] = experience.css(
                'h2.text-xl.leading-tight::text').extract()
            items["experience_type"] = experience.css(
                'span.mr-4::text').extract()
            items["price"] = experience.css("span.text-green::text").extract()
            items["duration"] = experience.css(
                "p.text-secondary.text-xs::text").extract()
            items["description"] = experience.css(
                "p.text-sm.leading-relaxed::text").extract()
            # .get() returns None instead of raising when no link is found.
            href = experience.css("a::attr(href)").get()
            items["link"] = (
                f'https://{self.allowed_domains[0]}{href}' if href else None
            )
            yield items

The code might look like a lot at first glance, but don’t worry, we’ll walk through it step by step.

DestinationsCrawl(Spider) — Defines the spider class, inheriting Scrapy's built-in Spider functionality.
name = 'destinations' — Unique identifier used to run the spider via scrapy crawl destinations.
allowed_domains — Restricts crawling to lonelyplanet.com only, blocking any external links.
start_urls — The default URL the spider visits if no custom input is provided.
__init__ — Accepts a continent argument from the command line and stores it for later use.
start_requests — Dynamically builds the target URL based on the continent argument, or falls back to start_urls if none is provided.
parse — Core scraping method that extracts name, experience type, price, duration, description, and link from each listing on the page.
TripsItem — A structured data container defined in items.py that holds all the scraped fields for each record.
yield items — Passes each scraped record to Scrapy's pipeline for storage or further processing.

Web Scraping with Scrapy

Let’s walk through the full logic of the destinations.py file to understand how the spider works.

class DestinationsCrawl(Spider):
    name = 'destinations'

    allowed_domains = ['lonelyplanet.com']
    url_link = 'https://www.lonelyplanet.com/europe/activities'
    start_urls = [url_link]

    def __init__(self, name=None, continent=None, **kwargs):
        # Both arguments are optional so the spider also runs without -a flags.
        super().__init__(name=name, **kwargs)
        self.continent = continent

    def start_requests(self):
        if self.continent:  # taking input from command line parameters
            url = f'https://www.lonelyplanet.com/{self.continent}/activities'
            yield Request(url, self.parse)
        else:
            for url in self.start_urls:
                yield Request(url, dont_filter=True)

The DestinationsCrawl class inherits from Scrapy's Spider class and serves as the blueprint for our web scraper.

The name variable identifies the spider and is used when executing it from the command line. The url_link variable points to the default URL, while start_urls holds the list of URLs Scrapy uses to generate the initial requests.

We override __init__() to accept a continent argument from the command line and start_requests() to build the target URL from it, making the spider dynamic. Instead of writing a separate scraper for every continent, you can simply pass the continent name as a parameter and the spider builds the target URL automatically.
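The URL-building rule can be isolated in a few lines of plain Python. This is a sketch rather than part of the spider: build_start_url and DEFAULT_URL are names made up for illustration, and the lowercasing of the input is an added assumption, not something the spider above does.

```python
DEFAULT_URL = "https://www.lonelyplanet.com/europe/activities"


def build_start_url(continent=None):
    """Return the listing URL for a continent, or the default Europe URL."""
    if continent:
        # Normalize casual input such as " Asia " to "asia" (an assumption).
        slug = continent.strip().lower()
        return f"https://www.lonelyplanet.com/{slug}/activities"
    return DEFAULT_URL


if __name__ == "__main__":
    print(build_start_url("asia"))
    print(build_start_url())
```

Keeping this logic in one small function makes the fallback to Europe easy to test without starting a crawl.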

Now let’s look at the core of the spider, the parse() method.

def parse(self, response):
    experiences = response.css("article.rounded.shadow-md")
    for experience in experiences:
        # Create a fresh item per listing so records don't share state.
        items = TripsItem()
        items["name"] = experience.css(
            'h2.text-xl.leading-tight::text').extract()
        items["experience_type"] = experience.css(
            'span.mr-4::text').extract()
        items["price"] = experience.css("span.text-green::text").extract()
        items["duration"] = experience.css(
            "p.text-secondary.text-xs::text").extract()
        items["description"] = experience.css(
            "p.text-sm.leading-relaxed::text").extract()
        # .get() returns None instead of raising when no link is found.
        href = experience.css("a::attr(href)").get()
        items["link"] = (
            f'https://{self.allowed_domains[0]}{href}' if href else None
        )
        yield items

As Scrapy’s documentation describes, the parse() method is responsible for processing the response and returning scraped data and/or additional URLs to follow.

By identifying patterns in the page’s underlying HTML, we can automate data extraction. We use a browser extension like SelectorGadget or the browser’s built-in developer tools to inspect the page and identify the right elements. In this case, each destination is wrapped in an article element with the classes rounded and shadow-md, so response.css("article.rounded.shadow-md") retrieves all matching elements.

We then loop through each result, extracting the trip name, experience type, price, duration, description, and a full link to the listing page on Lonely Planet.
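One practical detail: extract() returns a list of text fragments, often with stray whitespace, so each field arrives as a list rather than a clean string. The sketch below shows one possible way to normalize such lists. The clean_text helper and the sample record are invented for illustration and are not taken from the Lonely Planet page.

```python
def clean_text(parts):
    """Join text fragments into one whitespace-normalized string."""
    # str.split() with no arguments collapses spaces, tabs, and newlines.
    return " ".join(" ".join(parts).split())


# Made-up values shaped like the lists that extract() returns.
raw_record = {
    "name": ["\n  Kyoto Tea Ceremony "],
    "price": ["from ", "$45"],
    "duration": [],
}

cleaned = {field: clean_text(values) for field, values in raw_record.items()}

if __name__ == "__main__":
    print(cleaned)
```

An empty list becomes an empty string, which keeps exported columns consistent when a listing is missing a field.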

Before moving on, let’s take a quick look at the TripsItem class imported at the top of the script:

import scrapy


class TripsItem(scrapy.Item):
    name = scrapy.Field()
    experience_type = scrapy.Field()
    description = scrapy.Field()
    price = scrapy.Field()
    duration = scrapy.Field()
    link = scrapy.Field()

TripsItem is a structured data container that holds all the scraped fields for each record. By mapping extracted values to these fields, Scrapy can automatically export the data in CSV, JSON, or XML format.
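To picture how fields turn into columns, here is a standard-library sketch that writes one record with the same field names to CSV. It is not Scrapy’s exporter, only an illustration of the mapping; records_to_csv is a hypothetical helper and the sample record is made up.

```python
import csv
import io

# Same field names as TripsItem; each one becomes a CSV column.
FIELDS = ["name", "experience_type", "description", "price", "duration", "link"]


def records_to_csv(records):
    """Serialize dict records to CSV text, one column per field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Invented sample data, not scraped from the site.
sample = [{
    "name": "Kyoto Tea Ceremony",
    "experience_type": "Food & Drink",
    "description": "A guided tea ceremony in a traditional townhouse.",
    "price": "$45",
    "duration": "2 hours",
    "link": "https://lonelyplanet.com/example-listing",
}]

if __name__ == "__main__":
    print(records_to_csv(sample))
```

Because every record shares the same declared fields, the header row stays stable from one run to the next.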

Running the Spider

To run the spider and export data in CSV format:

scrapy crawl destinations -a continent=asia -a name=asia -o asia.csv

The -a flag passes arguments to the spider's __init__ method. These arguments are optional; you can also run the spider without them, and it will fall back to the default Europe URL:

scrapy crawl destinations -o europe.csv

For other output formats:

scrapy crawl destinations -a continent=africa -a name=africa -o africa.json

scrapy crawl destinations -a continent=pacific -a name=pacific -o pacific.xml
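If you’d rather not repeat -o on every run, Scrapy’s FEEDS setting can declare the outputs once in settings.py. The excerpt below is a sketch: the output/ paths are placeholders chosen for illustration, not names used elsewhere in this guide.

```python
# settings.py (excerpt): write the same items to three feeds on every run.
# The file paths are hypothetical; pick whatever fits your project.
FEEDS = {
    "output/experiences.csv": {"format": "csv"},
    "output/experiences.json": {"format": "json", "encoding": "utf8", "indent": 2},
    "output/experiences.xml": {"format": "xml"},
}
```

On the command line, keep in mind that -o appends to an existing file while -O overwrites it, which matters for JSON because appending a second run to the same file produces invalid JSON.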

Limitations of This Approach

While this setup works well for straightforward scraping tasks, it has limitations. Many websites employ aggressive bot-detection mechanisms, including IP-based rate limiting, browser fingerprinting, and CAPTCHA challenges, that can block your spider after just a few requests. Managing rotating proxies, handling CAPTCHAs, and mimicking real browser behavior adds significant complexity to your scraper. This is where a dedicated scraping API like ScrapingDog comes in. ScrapingDog handles proxy rotation, CAPTCHA solving, and JavaScript rendering under the hood, so you can focus on extracting data rather than fighting bot detection.
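Whichever route you take, Scrapy’s own settings can reduce how quickly a spider runs into rate limits. They do nothing against CAPTCHAs or IP bans. The values below are illustrative starting points rather than recommendations from this guide.

```python
# settings.py (excerpt): slow down and retry instead of hammering the site.
# All values are example choices; tune them for the site you crawl.
ROBOTSTXT_OBEY = True                 # respect the site's robots.txt rules
DOWNLOAD_DELAY = 1.0                  # seconds to wait between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 4    # cap parallel requests per domain
AUTOTHROTTLE_ENABLED = True           # adapt the delay to server latency
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
RETRY_TIMES = 3                       # retry failed requests a few times
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]
```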

Integrating ScrapingDog with Scrapy

To integrate ScrapingDog with your Scrapy spider, you’ll need your ScrapingDog API key. If you haven’t grabbed one yet, head over to the ScrapingDog dashboard and sign up for a free account.

The integration is simple: instead of sending requests directly to your target URL, you route them through ScrapingDog’s API endpoint, which handles all the heavy lifting for you.

Start by creating a config.py file in the same directory as destinations.py to store your API key:

# Don't commit this to version control.
# Use a tool like python-decouple and store sensitive data in a .env file.

API_KEY = 'your_scrapingdog_api_key'
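As the comment in config.py suggests, the key is safer outside the source tree. A minimal standard-library alternative reads it from an environment variable. The variable name SCRAPINGDOG_API_KEY and the load_api_key helper are assumptions made for this sketch, not names defined by ScrapingDog.

```python
import os


def load_api_key(env_var="SCRAPINGDOG_API_KEY"):
    """Read the API key from the environment and fail loudly if missing."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return api_key


if __name__ == "__main__":
    # Placeholder value so the example runs; use your real key in practice.
    os.environ.setdefault("SCRAPINGDOG_API_KEY", "your_scrapingdog_api_key")
    print(load_api_key())
```

With this in place, config.py could set API_KEY = load_api_key() so the rest of the spider stays unchanged.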

Next, update your destinations.py to route requests through ScrapingDog's API:

from urllib.parse import urlencode

from scrapy import Request, Spider

from ..items import TripsItem
from .config import API_KEY


class DestinationsCrawl(Spider):
    name = 'destinations'

    # Requests now go to the API host, so it must be allowed as well;
    # recent Scrapy versions apply allowed_domains to start requests too.
    allowed_domains = ['lonelyplanet.com', 'api.scrapingdog.com']
    url_link = 'https://www.lonelyplanet.com/europe/activities'
    start_urls = [url_link]

    def __init__(self, name=None, continent=None, **kwargs):
        super().__init__(name=name, **kwargs)
        self.continent = continent

    def _build_url(self, target_url):
        # urlencode() escapes the target URL so its own "?" and "&" survive.
        params = urlencode(
            {'api_key': API_KEY, 'url': target_url, 'dynamic': 'false'})
        return f'https://api.scrapingdog.com/scrape?{params}'

    def start_requests(self):
        if self.continent:
            target_url = f'https://www.lonelyplanet.com/{self.continent}/activities'
            yield Request(self._build_url(target_url), self.parse)
        else:
            for url in self.start_urls:
                yield Request(self._build_url(url), dont_filter=True)

    def parse(self, response):
        experiences = response.css("article.rounded.shadow-md")
        for experience in experiences:
            # Create a fresh item per listing so records don't share state.
            items = TripsItem()
            items["name"] = experience.css(
                'h2.text-xl.leading-tight::text').extract()
            items["experience_type"] = experience.css(
                'span.mr-4::text').extract()
            items["price"] = experience.css("span.text-green::text").extract()
            items["duration"] = experience.css(
                "p.text-secondary.text-xs::text").extract()
            items["description"] = experience.css(
                "p.text-sm.leading-relaxed::text").extract()
            # .get() returns None instead of raising when no link is found.
            href = experience.css("a::attr(href)").get()
            items["link"] = (
                f'https://{self.allowed_domains[0]}{href}' if href else None
            )
            yield items

The key change here is the _build_url() helper method, which wraps any target URL into a ScrapingDog API request. Setting dynamic=false is sufficient for pages that don't require JavaScript rendering. If your target site is JavaScript-heavy, switch this to dynamic=true and ScrapingDog will handle the rendering for you.
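The wrapping step can be tried on its own, outside Scrapy. The sketch below uses the endpoint and parameter names shown in this guide (api_key, url, dynamic). The function name build_scrapingdog_url, the demo key, and the ?page=2 query string on the target URL are illustrative assumptions.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SCRAPINGDOG_ENDPOINT = "https://api.scrapingdog.com/scrape"


def build_scrapingdog_url(target_url, api_key, dynamic=False):
    """Wrap target_url in a ScrapingDog request URL with encoded parameters."""
    params = {
        "api_key": api_key,
        "url": target_url,
        # Sent as the literal strings "true" / "false", as in this guide.
        "dynamic": "true" if dynamic else "false",
    }
    # urlencode() percent-escapes the target URL, so its own query string
    # is not mistaken for extra API parameters.
    return f"{SCRAPINGDOG_ENDPOINT}?{urlencode(params)}"


wrapped = build_scrapingdog_url(
    "https://www.lonelyplanet.com/asia/activities?page=2",  # hypothetical
    "demo-key",
    dynamic=True,
)
# Decoding the query string recovers the original target URL intact.
query = parse_qs(urlsplit(wrapped).query)

if __name__ == "__main__":
    print(wrapped)
    print(query["url"][0])
```

Encoding matters as soon as the target URL carries its own query parameters; without it, anything after the first & would be read as a parameter of the API call rather than of the target page.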

With this integration in place, every request your spider makes is automatically routed through ScrapingDog’s infrastructure, giving you rotating proxies, CAPTCHA bypass, and bot detection evasion without any additional configuration on your end. You can now run the spider exactly as before:

scrapy crawl destinations -a continent=asia -a name=asia -o asia.csv

Key Takeaways

Scrapy’s asynchronous architecture makes it faster and more efficient than custom-built crawlers.
A well-structured Scrapy project separates logic across spiders, middlewares, pipelines, and settings for easier maintenance.
Command-line arguments make your spider dynamic and reusable across multiple targets.
CSS selectors combined with tools like SelectorGadget make data extraction straightforward.
Scrapy’s Item containers ensure scraped data is consistently structured and export-ready.
Basic scraping without evasion leaves your spider exposed to IP bans, rate limiting, and CAPTCHAs.
ScrapingDog handles rotating proxies, CAPTCHA bypass, and JavaScript rendering with minimal code changes.

Conclusion

Web scraping can be a great way to automate tasks or gather data for analysis. Scrapy and ScrapingDog can make this process easier and more efficient. With a little practice, anyone can learn to use these tools to their advantage.

In this article, we saw how to scrape data using Python’s Scrapy framework together with a web scraping API.

Feel free to comment and ask our team anything. Our Twitter DM is open for inquiries and general questions.

Thank you for your time.
