Web Scraping


Last Updated on 25 November 2025

Web scraping is the practice of writing a script that automatically collects data from websites. A script can move through the web far faster than any human, reaching pages, pulling their contents, and organizing the results into clean, structured data. This saves time, reduces errors, and makes it possible to work with information on a scale that would be impossible by hand.

With scraping, you can gather prices across e-commerce sites, collect articles for research, build datasets for training machine learning models, or monitor changes in financial markets in real time. In short, it transforms the open web into a usable, searchable database that fuels analysis, insight, and innovation.

Foreword

Web scraping is the action of methodically drawing data from the internet using a script written in a programming language. The extracted data can later be used in data mining.

Generally, everything that is available to a web browser can also be reached directly using a programming language; its contents can then be copied and stored for easy access and analysis.

We can scrape various kinds of data for various purposes, such as the text of articles or numerical data from tables, whether for analysis or to train a machine learning model. We should be able to scrape everything our browser can reach, whether it is visibly displayed or not.

In this text I will cover web scraping using Python. Web scraping and internet communication form a very technical field, and Python is a good choice for the task because it lets us perform complicated operations easily without getting into their fine details. Python grants many people who do not come from a computer science background the ability to get more out of their PC- sort of like driving a car without needing to understand exactly how the engine generates power.

Introduction

Web scraping involves the creation of a script that communicates with a website address that stores the data we are interested in. It needs to identify the data and bring it back to us for processing.

In Python, this action can be done using the following packages (also called “libraries” and “modules”):

  • Requests- a high-level HTTP client interface for contacting a web address and drawing information from static pages. It is an easy way to do so, built on top of the urllib3 package. See documentation.
  • BeautifulSoup- a library for pulling data out of HTML and XML files. It takes the data that Requests fetches and identifies the HTML or XML tags in it, so we can easily access the contents of each tag. Data is stored in web pages within tags: in HTML it can be wrapped by <table>, <tr>, <td>, <div>, <p> etc. See documentation.
  • Selenium- an open-source framework introduced in 2004 at ThoughtWorks, designed for automating browsers by simulating real user actions. Unlike simpler tools such as Requests, which can only fetch static HTML, Selenium can interact with dynamic websites, reaching content generated through clicks, form submissions, or JavaScript execution. It supports multiple browsers in both headless and headed modes, and remains one of the most established tools for testing and scraping tasks that require authentic user-like behavior. See documentation.
  • Playwright- a newer package developed and maintained by Microsoft, designed for automating browsers with greater speed and reliability. Similar in purpose to Selenium, it comes with built-in drivers, automatic waits, and powerful support for handling multiple sessions and dynamic, modern websites. Playwright can run Chromium, Firefox, and WebKit browsers in both headless and headed modes, and is especially useful for large-scale scraping tasks where performance and concurrency are important. See documentation.
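As a quick taste of how the parsing side works, here is a minimal sketch using BeautifulSoup on a hard-coded HTML snippet. The snippet stands in for a page that Requests would fetch; the tag names and classes are made up for illustration, and the beautifulsoup4 package is assumed to be installed:

```python
from bs4 import BeautifulSoup

# A tiny HTML snippet standing in for a page fetched with requests.get().
html = """
<html><body>
  <div class="product">
    <p class="name">Laptop</p>
    <span class="price">999.99</span>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate tags by name and attributes, then read their text.
name = soup.find("p", class_="name").get_text(strip=True)
price = soup.find("span", class_="price").get_text(strip=True)
print(name, price)  # Laptop 999.99
```

We will revisit this pattern in detail later; the point here is that the tags we saw wrapped around the data become handles for extracting it.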

Definitions

API (Application Programming Interface)- many sites provide official APIs that let us access data in a structured way without scraping. Using an API is usually more reliable and efficient than parsing HTML, but often requires authentication or has usage limits.

API Endpoint- this is a specific URL that provides access to structured data from a web service. Instead of scraping HTML, we can often request data directly from these endpoints, usually in JSON or XML format. Using endpoints is faster and more reliable, though sometimes access requires authentication or keys.

Backpressure- this is a control mechanism that prevents producers from overwhelming consumers. If producers (e.g. fetchers) generate work faster than consumers (e.g. parsers) can handle, backpressure slows the producers. This can happen automatically or when explicit limits are set, such as rate limiting, semaphores and sleeps. In scraping, backpressure keeps request volume and parsing speed balanced to avoid overloads.

Bytecode- this is an intermediate, low-level set of instructions that Python generates after compiling our source code (.py) but before executing it. When we run a Python program, the interpreter first translates our human-readable code into bytecode (which is stored in the __pycache__ folder as .pyc files). This bytecode is then executed by the Python Virtual Machine (PVM), step by step. Bytecode is a platform-independent representation that makes Python portable, and allows the interpreter to run code consistently across different operating systems and hardware.
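We can inspect this bytecode ourselves with Python's standard dis module:

```python
import dis

def add(a, b):
    return a + b

# List the bytecode instructions the Python Virtual Machine will execute.
# The exact opcode names vary between Python versions.
opnames = [instr.opname for instr in dis.Bytecode(add)]
print(opnames)
```

On recent CPython versions this prints instructions such as loading the two arguments, applying the addition, and returning the result.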

Class (Object)- in programming, a class is a self-contained unit that combines data (attributes) with behavior (methods). It is used to model real-world entities or systems in a structured way, so instead of just writing functions that act on loose data, we can bundle the data with the operations that belong to it. This makes code easier to organize, reuse, and extend, especially in large projects. Compared to just using functions, classes are preferred when you need to manage complex relationships, maintain state across operations, or create many objects that share the same structure but hold different values. A class is a core building block of object-oriented programming (OOP).

CAPTCHA- short for “Completely Automated Public Turing test to tell Computers and Humans Apart.” These are challenges websites use to distinguish between humans and bots (like selecting images or typing distorted text). Scrapers should avoid triggering CAPTCHAs by reducing bot-like behavior.

Concurrency- structuring a program as multiple tasks that overlap in time (may be interleaved on one core). Tools: threads, async/await, coroutines. Goal: responsiveness/throughput for I/O-bound work. Discussed in Scraping Faster hereunder.

CPython- this is the standard, base implementation of Python, written in the C programming language. It compiles our .py files to bytecode and runs them in the CPython virtual machine. It’s the one we download from python.org and what most packages target.

DOM (Document Object Model)- this is a structured, tree-like representation of an HTML or XML document. When a browser loads a web page, it reads the HTML code and builds the DOM, turning every tag (<html>, <head>, <body>, <div>, <p> etc.) into a node in this tree. Each node can then be accessed, modified, or removed by scripts (like JavaScript) or tools we use in scraping.

The DOM is not always identical to the raw HTML source code. This is because modern websites often use JavaScript to dynamically load or change content after the initial HTML is delivered. In those cases, if you only fetch the source code with Requests, you might not see all the data. The browser, however, will build the DOM after executing the scripts, and that’s what Selenium or Playwright can access.

In scraping, the DOM is our “map” of the page. With it, we can locate elements by their position in the tree, their attributes, or their text.

Headless Browser- this is a web browser that runs without a graphical user interface (GUI). It behaves like a normal browser (executing JavaScript, rendering the DOM, loading dynamic content), but everything happens in the background. Headless browsers such as Chrome Headless or Firefox Headless are essential in scraping modern, interactive websites.

HTML (Hypertext Markup Language)- this is the standard markup language used to structure and display content on the web. Web pages are written in HTML, with data stored inside tags such as <div>, <p>, <table> and <a>. In scraping, we parse these tags to locate and extract the information we need.

HTTP (Hypertext Transfer Protocol)- this is the standard protocol used for communication on the web. It defines how a client (such as a browser or a Python script) requests resources from a server, and how the server responds with content and status codes. Each interaction consists of a request (containing the method, URL, headers, and sometimes data) and a response (containing the status code, headers, and the requested content such as HTML, JSON, or images). In web scraping, every time we fetch data from a website, we are performing an HTTP request and receiving an HTTP response.

HTTP Headers- these are pieces of metadata included in every HTTP request and response, carrying information such as the client’s identity, accepted formats, authentication tokens, caching instructions, and more. They follow a simple key–value format (for example, User-Agent: Mozilla/5.0). According to the HTTP specification, header names are case-insensitive, so user-agent, USER-AGENT, and User-Agent are all valid, but the convention is to use the canonical form (capitalizing each word, like User-Agent, Content-Type, Accept-Language). Sticking to canonical names makes code clearer, matches browser/network logs, and avoids confusion even though the protocol itself doesn’t enforce casing.

Interpreter- an interpreter is a program that executes code within a runtime environment, rather than producing a standalone native executable (e.g., .exe). In Python’s reference implementation CPython, a .py file is first compiled to bytecode and optionally cached as a .pyc under __pycache__, then executed by the Python virtual machine (the bytecode evaluation loop). In short: the interpreter reads source -> compiles to bytecode -> executes.

IP (Internet Protocol) Address- this is a unique identifier assigned to a device connected to the internet. Servers use IP addresses to track where requests are coming from. In scraping, rotating IP addresses with proxies helps prevent detection and blocking.

JSON (JavaScript Object Notation)- JSON is a lightweight, text-based data format used to store and exchange information in key-value pairs. It is widely used in modern web APIs because it is easy for both humans and machines to read. Many web scraping tasks involve retrieving JSON directly instead of parsing HTML.
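Python's standard json module turns such text into native dictionaries and lists. A small example with made-up data:

```python
import json

# A JSON string like one an API might return.
raw = '{"title": "Example Article", "tags": ["news", "tech"], "views": 1024}'

data = json.loads(raw)   # parse JSON text into a Python dict
print(data["title"])     # Example Article
print(data["tags"][1])   # tech
```

Once parsed, the data can be filtered, stored, or converted without any HTML parsing at all.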

Latency- this is the time delay between the initiation of an action and the observable result. In computing, it usually refers to how long it takes for a request to travel through a system and return a response. For example, the time between sending a network packet and receiving a reply, or the gap between a user clicking and the system reacting. Lower latency means faster responsiveness, while high latency indicates noticeable lag.

A Markup Language- this is a way of structuring text by surrounding it with tags that describe its meaning or organization. Unlike programming languages, it doesn’t execute logic but instead marks content for interpretation, such as <p> for a paragraph in HTML. In web scraping, markup languages like HTML and XML are the foundation of the data we extract.

Object-oriented programming (OOP)- this is a programming paradigm that organizes code around objects. Instead of writing separate functions and passing data between them, OOP models real-world entities or abstract concepts as objects, making programs more modular, reusable, and easier to maintain. Its key principles are encapsulation (hiding implementation details inside objects), inheritance (creating new classes from existing ones to reuse and extend code), and polymorphism (allowing different objects to respond differently to the same method call). OOP is widely used because it helps manage complexity, especially in large or evolving systems.

Parallelism- executing multiple tasks simultaneously on multiple cores/processors. Tools: multiprocessing, ProcessPoolExecutor, vectorized/native code. Goal: speed for CPU-bound work. Defined in Scraping Faster hereunder.

To Parse- this means to take unstructured or semi-structured text (like HTML source code) and break it down into a structured format that a program can understand and work with. In web scraping, parsing usually refers to analyzing the HTML of a web page, identifying its elements (such as tags, attributes, and text), and then extracting the specific pieces of data we are interested in. Tools like BeautifulSoup or lxml perform this parsing step, turning raw strings of code into objects we can navigate and query easily.

Proxy Server- a proxy acts as an intermediary between your scraper and the target server, masking the user’s real IP address. By rotating proxies, scrapers can avoid detection and distribute requests across multiple identities. Proxies are essential for large-scale scraping projects.

Scraper- in web scraping, this is a program or script that automatically visits websites, retrieves their content, and extracts specific information from it. It mimics what a human would do- looking at a page and picking out relevant data- but does so at scale and with speed, often converting unstructured web content like HTML into structured formats such as CSV or JSON for further use.

Selector- a selector is a pattern that specifies how to target certain elements in a page. It is not the tag itself, but an instruction that points to tags based on their name (p), class (.price) or ID (#main-title). Web developers use selectors in CSS and JavaScript so the browser knows how to style or manipulate elements. Web scrapers then reuse those same selectors in tools like BeautifulSoup, Selenium, or Playwright to precisely locate and extract data from the DOM.

Status Codes- these are numeric codes returned in HTTP responses that indicate the outcome of a request. For example, 200 means success, 301 or 302 means a redirect, 403 Forbidden means access denied, and 404 Not Found means the resource doesn’t exist. In scraping, checking status codes helps us understand if our request succeeded or if we are being blocked.

Synchronization- this is the coordination of concurrent tasks, achieved using low-level primitives such as locks, semaphores, events, queues and barriers. Synchronization allows threads, processes, and coroutines to work side by side safely, enforcing ordering, mutual exclusion, signaling, and backpressure.

Throttling- this is the practice of deliberately slowing down scraping by adding delays between requests. Throttling reduces the load on the target server, makes the scraping behavior appear more “human,” and helps avoid IP bans or CAPTCHAs. It is part of respectful and sustainable scraping.

User-Agent- this is a string sent in the HTTP request headers that identifies the client (browser or scraper) making the request. Servers use the User-Agent to understand who is connecting and may serve different content depending on it. In scraping, rotating User-Agents- shifting between a group of them- is a common technique to avoid detection and blocking.

XML (Extensible Markup Language)- this is a markup language used to store and transport data in a structured format that is both human- and machine-readable. Like HTML, it uses tags enclosed in angle brackets, but unlike HTML, the tags are user-defined and describe the meaning of the data rather than how it should be displayed. In web scraping, XML is commonly encountered in sitemaps, RSS or Atom feeds, and some APIs that deliver data in XML format instead of JSON.

XPath- a query language used to navigate and locate elements within XML or HTML documents. XPath allows precise selection of elements by their hierarchy, attributes, or text content. Tools like Selenium and Playwright often use XPath to find elements that cannot be easily reached with CSS selectors.
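Python's standard library supports a limited XPath subset through xml.etree.ElementTree. A small sketch with made-up XML, selecting elements by hierarchy and attribute:

```python
import xml.etree.ElementTree as ET

# A tiny XML document with user-defined tags, as might come from a feed or API.
xml_data = """
<catalog>
  <book id="b1"><title>Python Basics</title><price>10.50</price></book>
  <book id="b2"><title>Web Scraping</title><price>15.00</price></book>
</catalog>
"""

root = ET.fromstring(xml_data)

# Select the title of the book whose id attribute is "b2".
title = root.find(".//book[@id='b2']/title").text

# Collect the text of every price element, anywhere in the tree.
prices = [p.text for p in root.findall(".//price")]
print(title, prices)  # Web Scraping ['10.50', '15.00']
```

Full XPath (with functions, axes, and text predicates) requires a third-party library such as lxml, which Selenium and Playwright rely on internally.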

Web Scraping Ethics

Before we continue, we should discuss ethics. Is web scraping ethical? When we scrape a web page, we use some resources of the hosting server, which needs to facilitate our connection and data transfer. The website owner is happy to share its information with human visitors that click on ads, but may be unwilling to share it with bots, which ignore ads and can use a lot of resources since they do their work automatically and potentially in large scale.

Since a server’s resources are precious on one hand and since any information that is openly available on the internet should be free to access on the other, a system of voluntary compliance was established to try and balance the two principles.  

This is called the “Robots exclusion standard”. The system is based on a file that informs crawlers and bots what kind of bot activity the website owner is willing to allow. This information appears in the “robots.txt” file, which should be stored in the website’s root directory on the server. “robots.txt” contains instructions on which pages crawlers should and shouldn’t access, the allowed rate of requests, visit times and more.

Since it is a voluntary system, it shouldn’t be treated as a set of binding rules, but more as a guide for how to fetch data without harming the website’s owner. It is a guide for respectful scraping, such as not overwhelming the server with requests while taking data. Because compliance is voluntary, the website owner takes it upon himself to protect his data from scraping, often employing methods to detect and block bots, such as blocking a user who sends too many requests.
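Python's standard library can read these rules for us via urllib.robotparser. A sketch with hypothetical rules parsed from a string; in real use we would point set_url() at the site's robots.txt and call read():

```python
from urllib import robotparser

# A hypothetical robots.txt, parsed directly from text for illustration.
rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Check what the rules allow before scraping.
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/articles"))      # True
print(rp.crawl_delay("*"))                                    # 5
```

Checking can_fetch() before each request is a simple way to stay on the respectful side of the standard.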

At this point it is important to mention that while this standard is voluntary, causing harm to a website owner’s business in one way or another may expose the scraper to legal action. The same goes for the commercial use of copyrighted or proprietary materials. So, we use this standard as a guide and perform our scraping in an efficient and respectful way, doing all we can not to harm the business or cause any extra costs.

Other considerations to keep in mind are privacy concerns: scraping and holding data that includes ways to identify people is a serious matter. It’s best practice to follow the European data-protection rules- see more here and here.

Since the internet was built to be free and should remain so for the benefit of humanity, respectful scraping and responsible handling of data should remain part of it- even when we are sophisticated enough to bypass the obstacles put up by the website’s owner.

We will now explore the main components of web scraping and the Python libraries that make it possible. Starting with inspecting web pages using a browser and facilitating Requests, we will learn how to connect to a website and fetch its content. Then we will see how tools like BeautifulSoup can parse and organize the data, and how Selenium or Playwright can handle more complex, dynamic pages that require interaction. Along the way, we will also look at methods to make scraping more efficient and respectful, such as using headers, proxies, and concurrency.

Inspecting Web Pages

Before writing code, we first need to see how a web page actually delivers its data, and identify the data we want to get. Every scraping project starts with opening the browser’s developer tools. This lets us understand what is going on behind the scenes- whether the data is already in the HTML, hidden in the Document Object Model (DOM), or coming from a separate request in the background.

We open the inspector by right-clicking anywhere on a page and choosing Inspect in the dropdown menu. The inspector has several tabs, and the most useful for scraping are Elements and Network.

The Elements tab- this shows us the page’s structure. Every piece of content is wrapped inside HTML tags like <div>, <p>, <table>. When we highlight parts of the page in this tab, the corresponding tag is shown. For example, the price on an e-commerce site might be inside <span class="price">. If we see the data directly in the HTML, it means we can fetch it with Requests and parse it with BeautifulSoup.

The Network tab- this is where we see how the page loads. We reload the page with the Network tab open and watch the requests scroll by. Many modern sites don’t serve data directly in the HTML, but through background requests. We may see JSON or XML files being fetched, or even an API endpoint like /api/products?page=2. In the case of an API, our communication with the website essentially amounts to requesting structured data directly. It’s much more straightforward than scraping.

Clicking on any request in the Network tab shows its Headers and Cookies. These are the details the browser sends with the request: things like User-Agent, accepted language, or a login cookie. When we copy these values into our script, our scraper looks more like a normal browser, which is our goal. This is also where we learn what information the server requires before sending data back.

Finally, under the same request, we can check the Response or Preview tabs. Here the server’s answer is shown exactly as the script will receive it. If it is HTML, we know to parse it. If it is JSON, we can load it directly into a Python dictionary. For example, a news site might return a JSON file with article titles and links, which is a perfect example for scraping without touching HTML at all.

Inspecting a web page before scraping saves a lot of time and effort. We don’t want to build a Selenium script to click buttons if the clean data already appears in a JSON response. We also don’t want to scrape HTML only to realize the site blocks us without cookies. Inspect is the diagnostic step- it gives us the roadmap from website to data.

Understanding Web Pages

When we inspect a page, we are looking at HTML and CSS. Knowing the basics of these languages is important. HTML tells us the structure of the page as its tags wrap the data we want. CSS defines how these tags are styled and, more importantly for us, how they are identified. Scraping depends on targeting the right selector, so a basic grasp of HTML and CSS makes the process much easier.

We also need to recognize the common data formats that websites use. HTML itself is markup text. JSON and XML are structured data formats, often returned by APIs. Cookies are small pieces of text that servers give to browsers and expect to get back. Headers are key–value metadata that shape the request. Understanding what each of these is, and how they look, lets us decide quickly what kind of response we’re dealing with and how to process it.

Other technical basics help too. For example, knowing what an HTTP request is, how status codes work (200 means success, 403 means blocked), or how character encoding can affect text. These are not difficult concepts, but they come up in almost every scraping project. The more we know about them, the smoother our work will be.
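As a quick illustration of the encoding point, the same bytes decoded with the wrong encoding produce garbled text (mojibake):

```python
# "café" encoded as UTF-8 becomes the bytes b'caf\xc3\xa9'.
raw = "café".encode("utf-8")

print(raw.decode("utf-8"))    # café     (correct decoding)
print(raw.decode("latin-1"))  # cafÃ©    (wrong decoding, garbled text)
```

If scraped text looks like the second line, the response was decoded with the wrong encoding; in Requests, the response.encoding attribute controls this.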

Requests

The Requests library is the starting point for most scraping projects in Python. It makes it simple for us to talk to a web server: we send a request and get a response. With just a few lines of code, we can fetch a page’s contents and start working with the data. Requests is lightweight, reliable and easy to use, which makes it the natural first tool for scraping static websites.

Basic Usage

A request is simply an attempt to connect to a web address and draw content. When a connection is made, the package attempts to pull data using one of several methods. The target server can then approve or deny access and provide us with some kind of response, which we store in a variable to process. Generally, it looks like this:

import requests

url = "https://example.com"
response = requests.get(url)
print(response.text)
Python

Block 1: requests.get().

Breaking it down:

  • response- a variable holding the server’s reply, stored as a requests.Response object. It includes attributes such as:
    • .status_code → the server’s reply code (200 means success).
    • .text → the raw HTML returned by the server. This is what we parse using the BeautifulSoup module.
    • .json() → structured data, if the server responds with JSON.
  • get()- the method we call to fetch data from a URL (Uniform Resource Locator).
  • url- the web address we want to request.

This is the foundation. One request, one response. Everything else we’ll do- adding headers, rotating proxies, using sessions- builds on top of this simple pattern.

Scraping with Requests

To actually scrape a page, we first need to inspect the website and review its source code, to see how the information we’re after is stored and organized. This is done by right-clicking anywhere on the page and selecting “View Page Source” (or the equivalent option in your browser).

Notice that some of the information may be divided into multiple pages- in that case we iterate over them with a loop. Some content is available only after logging in with a username and password, and some appears only after a user action (like clicking a button).

We scrape a web address by using .get() to contact its server. If the response has a status code in the 200 family, our request succeeded and the contents are now inside the requests.Response object.

If we repeat this many times, the server may detect unusual activity. It can block our IP and start returning status codes in the 400 family (401 Unauthorized, 403 Forbidden, 404 Not Found etc.).

When this happens, we need to take it up a notch. We enhance our requests by adding headers and proxies to disguise our identity. With these, the server sees us as different users, making it much harder to connect all our requests to one source.
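Another simple courtesy before reaching for disguises is to retry failed requests with growing delays. A hedged sketch- the function name and structure are illustrative, and the fetch callable is passed in so any client (such as requests.get) can be used:

```python
import time

def fetch_with_retries(fetch, url, retries=3, base_delay=1.0):
    """Call fetch(url) up to `retries` times, sleeping longer after each failure.

    `fetch` is any callable returning an object with a .status_code attribute,
    e.g. requests.get. Real code would also catch network exceptions.
    """
    response = None
    for attempt in range(retries):
        response = fetch(url)
        if 200 <= response.status_code < 300:
            return response
        # Exponential backoff: wait base_delay, then 2x, 4x, ... before retrying.
        time.sleep(base_delay * (2 ** attempt))
    return response  # give up and return the last (failed) response
```

Used as fetch_with_retries(requests.get, url), this waits out temporary blocks instead of hammering the server.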

Headers

Headers allow us to send extra information with our request to the server. The server may require this information before approving a connection, so we should first understand what kinds of headers it expects.

Headers can include the browser and operating system we’re pretending to use, the languages we accept, the encoding we support and other details about our connection.

When scraping, we often pretend to be a real browser by sending these headers, so we won’t get blocked right away. Therefore, when scraping we’ll focus on two main aspects:

  1. Finding what information the server expects to see.
  2. Updating the User-Agent, which identifies the browser and OS to the server.

How do we find what headers the server expects to get? Sometimes the server wants to see specific details before it allows a request. To discover them, we can inspect what our browser sends during a normal visit to the website:

  1. Open the website in your browser.
  2. Right-click anywhere and choose Inspect.
  3. Go to the Network tab.
  4. Reload the page. The browser will log the new request.
  5. Find the main request (usually at the top of the list) and click it.
  6. Open the Headers tab.

There we see three sections: General Information, Response Headers and Request Headers. The Request Headers show exactly what our browser sent to the server- the data that got the request approved. Not all of it is strictly required, but it’s a good reference for what the server expects.

We can take the relevant {key: value} pairs from the Request Headers and copy them into our headers variable to send along with .get(). This dictionary structure is the same one used in JSON, the most common format for data exchange on the web.

User-Agent

At the bottom of the Request Headers we usually find the User-Agent string. It identifies the operating system and browser, and looks something like:

Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko
Python

Block 2: an example User-Agent string.

This is one of the most important headers for scraping. Together with proxies, it helps disguise our true identity.

Sending Headers in Python

We send headers in Python as a dictionary:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko", 
           "Content-Type": "...", 
           "Accept-Language": "...",
           "Accept-Encoding": "...",
           "Referer": "..."}
Python

Block 3: the Headers dictionary.

We pass this dictionary to requests.get() like so:

response = requests.get(url, headers=headers)
Python

Block 4: passing the headers to requests.get().

Header keys are case-insensitive, but it’s best to follow the standard capitalization for clarity.

We can rotate the User-Agent string by keeping a list and selecting one at random for each request. This makes our scraper look like many different users instead of one.

Rotating Headers

To avoid detection, we can rotate the User-Agent value. The simplest way is to keep a list of User-Agents, randomly select one, and include it in each request:

import random

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko)",
    # more user agents...
]

headers = {"User-Agent": random.choice(user_agents)}
response = requests.get(url, headers=headers)
Python

Block 5: rotating headers.

This way our scraper looks less like a single bot and more like different real users.

Proxy Servers

A proxy server is simply another device on the internet that connects on our behalf. It has its own IP address, and when we use it, websites see the proxy’s IP instead of ours. This is what VPN services do.

For scraping, proxies are used mainly to hide identity and spread requests across multiple IP addresses. Some proxies are free, but usually they’re unreliable. Paid services provide more stable options. Once we have a list of working proxies, we rotate them- using a different one for each request or switching when one fails.

A proxy is passed into requests.get() as a dictionary:

proxy = {
    "http": "http://123.45.67.89:8080",
    "https": "http://123.45.67.89:8080"
}
response = requests.get(url, proxies=proxy)
Python

Block 6: proxy servers.

The IP above is an example: each group of numbers can be 1–3 digits, and sometimes we see “:port” added at the end. Ports identify applications. We usually prepare a list of proxies, pick one randomly, and replace the IP in the dict above.

We test proxies with a HEAD request (lighter than GET since it only asks for a response, not content):

response = requests.head("https://www.google.com", proxies=proxy, timeout=5)
Python

Block 7: a HEAD request using a proxy.

In practice, we prepare a list of proxies and test each one by sending a head() request to a reliable site such as https://www.google.com, using the dictionary format shown above. We then check the response, and if it is one of the OK codes (2xx, 3xx) we keep the proxy for use in our scraping task. A head() request is similar to get() but asks only for the server’s response headers, not the page’s full contents, which makes the testing more efficient.

Every once in a while a proxy will stop working, either because it got blocked, disconnected or suffered some error. In that case we should stop using it, but still keep it somewhere and try it again in the future, since proxies sometimes come back online.

We should therefore store our proxies for the long term after use, so we can try them again after some time passes. A good way to do this is to keep a file (such as a .csv) that stores all the proxies we used in the past that stopped working. Each time we want to scrape, we can get a new list of proxies and add to it some proxies from our existing repository. We then test them together, use the working ones for scraping, and repeat the process.
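A minimal way to maintain such a repository is a pair of CSV helpers; the file name and function names here are illustrative:

```python
import csv

def save_proxies(path, proxies):
    # Store one proxy address per row for later reuse.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for proxy in proxies:
            writer.writerow([proxy])

def load_proxies(path):
    # Read the stored proxy addresses back into a list.
    with open(path, newline="") as f:
        return [row[0] for row in csv.reader(f) if row]
```

Before each scraping run we could load this file, merge it with a fresh proxy list, test everything, and save the results back.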

There is a lot more to know about proxy servers. There are several types of protocols and several levels of anonymity. For more information see here and here.

In our request, we prefer to use a combination of random User-Agent and an elite (level 1 anonymity) proxy. This way, our target will not know it is being contacted by the same user over and over again.

After making a successful request and receiving a valid response, we can add a time.sleep(x) command in order to be courteous and not overload the server. Here, x is the number of seconds to wait before the next scraping attempt. Remember to add an “import time” line at the head of the script to load the “time” module in advance.

Rotating Proxies

Using one proxy again and again is only a temporary solution. If the server sees too many requests coming from the same IP, it can still block us. To prevent this, we rotate proxies, which means that we switch to a different IP for each request, or every few requests.

We usually pair this with Headers rotation, so our identity keeps changing, making the server think it is being contacted by totally different devices from different locations.

In practice, we keep a list of working proxies and pick one at random each time:

import random, requests

proxies_list = [
    {"http": "http://111.22.33.44:8080", "https": "http://111.22.33.44:8080"},
    {"http": "http://222.33.44.55:8080", "https": "http://222.33.44.55:8080"},
    # more proxies...
]

proxy = random.choice(proxies_list)
response = requests.get("https://example.com", proxies=proxy)
Python

Block 8: rotating proxies.

Rotating proxies can be done in several ways:

  • Random selection → pick a proxy from the list each time.
  • Round-robin → cycle through the list in order.
  • Adaptive rotation → drop failing proxies, retry them later, and keep only the reliable ones active.

Just like with headers, rotation reduces the chance of detection. Many serious scraping setups combine rotating User-Agent + rotating proxies to survive longer without being blocked.
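The three strategies above can be sketched with the standard library alone (the proxy addresses are placeholders):

```python
import itertools
import random
from collections import deque

proxies = [
    "http://111.22.33.44:8080",
    "http://222.33.44.55:8080",
    "http://33.44.55.66:8080",
]

# Random selection: pick any proxy each time
random_pick = random.choice(proxies)

# Round-robin: cycle through the list in order, wrapping around
rotation = itertools.cycle(proxies)
round_robin_picks = [next(rotation) for _ in range(4)]

# Adaptive rotation: bench a failing proxy, retry it later
active, bench = deque(proxies), deque()

def report_failure(proxy):
    # Move a failing proxy from the active pool to the bench
    if proxy in active:
        active.remove(proxy)
        bench.append(proxy)

def retry_benched():
    # Give every benched proxy another chance
    while bench:
        active.append(bench.popleft())
```

In a real scraper, report_failure() would run whenever a request through a proxy raises an error, and retry_benched() would run periodically.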

Sessions

So far, we’ve learned how to disguise ourselves using headers and proxies. But sometimes the server expects continuity- the idea that the same visitor stays logged in, keeps cookies, and browses multiple pages. That’s where sessions come in.

In these cases, we need to carry certain parameters across several requests in order to achieve our scraping task, such as when we want to go over several pages on the same website or when it uses cookies.

Sessions make scraping look less like a series of random bots and more like a single real user clicking around.

The benefits of sessions:

  • Performance → reuses the same TCP connection instead of opening a new one for every request.
  • Persistence → headers, proxies, and cookies carry across multiple requests automatically.
  • Realism → makes our scraper behave more like a single human visitor, reducing suspicion.

With headers and proxies, we change who we look like and where we seem to be. With sessions, we also add memory- the ability to stay logged in, carry cookies, and browse several pages in sequence. This combination makes our scraper much closer to real user behavior. Cookies are especially important for Sessions.

Cookies

Cookies are small pieces of data that the server sends to the user’s browser for later identification. The user’s browser keeps those pieces of data and sends them back to the server in a later connection, allowing the server to recognize the user. Websites usually use cookies for session management (keeping the user logged in), personalization, tracking and website security. Cookies are a part of the headers that we feed into our request, and are managed automatically by the Requests package. Cookies are different from the rest of the headers since in their case, the website is the one determining the value, not us. The website then expects to see this data in each of our requests. This is handled automatically within the session.

Servers can use cookies as part of their defense mechanism against bots, expecting to receive a certain cookie in a certain sequence of requests. If we want to keep our bot running and avoid our IP getting blocked, we need to provide the server with the data it expects to receive using headers, just like we did when we created our “requests.Response” object above. More on cookies here.
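As a small sketch of how Requests carries cookies, we can set one on a session by hand and inspect the request it would send (normally the server sets the cookie via a Set-Cookie response header; the cookie name and value here are invented):

```python
import requests

with requests.Session() as s:
    # Normally the server sets this via a Set-Cookie response header;
    # we set it by hand here just to inspect the session's behavior
    s.cookies.set("sessionid", "abc123")

    # Prepare (without sending) a request to see what would go out
    prep = s.prepare_request(requests.Request("GET", "https://example.com"))

    # The session automatically attaches the cookie to the outgoing headers
    print(prep.headers.get("Cookie"))
```

This is exactly what happens behind the scenes on every request made through a session: stored cookies are merged into the outgoing headers automatically.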

Creating a Session

According to the docs, if we make several requests to the same host using a session, the underlying TCP connection will be reused, resulting in significant performance increase.

A session is a way to carry such headers over many requests, mimicking the behavior of a human user and reducing the chances our IP will get blocked.

For efficiency, it’s best to use a session as part of a context manager (Python context managers allow us to allocate and release resources precisely when we want to):

# Create the session object
with requests.Session() as s:

  # Update headers
  s.headers.update({"User-Agent": user_agent})
  
  # Get a response
  response = s.get(url, proxies = proxy)
  
  # Process the response with another function
  process_response(response)
Python

Block 8: requests.Session().

We create a session object, call it “s”, and make a get() request to the website while also sending the headers and a proxy. We keep the requests.Response object in a variable and can then process this data before making another request (e.g. by including a loop within the context manager with changing parameters, such as the page to scrape). We need to do this inside the “with” block, since the context manager makes sure the session is closed as soon as the block is exited. Notice that on a Session object, we update its attributes with the data we want to persist. This is unlike the requests.get() method, where we had to pass these values into each call, as they do not persist.

Notice that in the session example above we pass a proxy directly to the get() method, rather than loading it through the session’s proxies attribute. We can either set default proxies on a requests.Session via s.proxies or pass “proxies=” per request. When both are present, the per-request proxies override the session defaults for that call. This lets you keep a baseline proxy and swap it dynamically when needed.

The data stored on a server can take several shapes. The data we pull from the website can be stored either as a string under the Response attribute “text”, i.e. reachable through “response.text”, or as JSON (basically nested dicts and lists) through the “response.json()” method. So we need to understand how the data is stored on the server to know how to get it.
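A small helper can sketch this choice by checking the response’s Content-Type header (the function name is ours, not part of Requests):

```python
def parse_response(response):
    # Return parsed JSON if the server says it sent JSON, otherwise raw text
    content_type = response.headers.get("Content-Type", "")
    if "application/json" in content_type:
        return response.json()
    return response.text
```

This way the same processing function can handle both API endpoints that serve JSON and plain HTML pages.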

Retry

We can add a “retry” functionality to our get() method using a Transport Adapter, by including the following code into our script:

from requests.adapters import HTTPAdapter
from urllib3.util import Retry

# Add a "retry" functionality
retries = Retry(total = 5, backoff_factor = 0.2, status_forcelist = [429, 500, 502, 503, 504])
s.mount('https://', HTTPAdapter(max_retries = retries))
Python

Block 9: retry.

The import part goes at the head of the script and the rest of the code inside the session context manager, above the get() method. Given the status codes listed in the status_forcelist parameter, the method will retry the connection up to 5 times before giving up and moving on. Notice that the Retry capability is mounted onto the Session object, and must be placed before the get() method. Retry is an important feature to implement when using proxies.

Debugging

Sometimes our connection attempts fail and we would like to get more information on the reason, so we can understand what to fix. We can see the request and response log by printing the HTTP headers, which allows us to see the details of our communication with the server. This is especially useful when interacting with an API to get its detailed response. The code looks like this:

import http.client

# Set debug level to any number other than 0 to get the http headers printed
http.client.HTTPConnection.debuglevel = 1
Python

Block 10: debugging.

Setting the logging debug level to any number other than 0 will print the http headers. As always, we should insert the import part at the top of our script and insert the http.client line before initializing our session object.

BeautifulSoup

Most of the time, scraping website data returns a long string (text) that is basically the page’s source code. Our next step is to parse this string and extract only the data we need. We do this using BeautifulSoup, which can spot HTML or XML tags within a string, making them easily accessible to us.

Using BeautifulSoup by passing a string of source code into it under the “markup” attribute creates a BeautifulSoup object, which represents the source code as a nested data structure, i.e. textual data wrapped with HTML or XML tags. The tags are identified and used as an address for reaching the data they wrap.

Basic Usage

We pass our response object’s “text” attribute, which stores our target URL’s textual data, into BeautifulSoup, and gain the ability to reach the data by its tags.

It looks like this:

from bs4 import BeautifulSoup

# Create a beautifulsoup object
soup = BeautifulSoup(markup = response.text, features = "html.parser")
Python

Block 11: BeautifulSoup.

We provide this object with our URL response text (the “markup” attribute) and can specify a parser for extracting the appropriate tags, depending on the structure of the data in our target URL. A parser takes the textual data, identifies tags and returns a “parsed” object, which allows us to point to data by addressing specific tags. If we don’t provide a parser, BeautifulSoup will use Python’s built-in “html.parser” by default. Each parser comes with its advantages and disadvantages and should be matched to the way our data is stored on the server.

Finding Elements

The resulting BeautifulSoup object now allows us to easily reach specific data points within the target URL response text by calling them directly. Here are a few examples, accessing various data stored in a BeautifulSoup object called “soup”, just like the one we created in Block 11:

# Get the page's title block (i.e. get the opening and closing tags as well)
soup.title

# Get the title's text as a string
soup.title.string

# Get a list containing all the page's "<a>" tags, including their opening and closing tags
soup.find_all(name = "a")

# Get a specific tag by using its id, if such attribute exists in the page
soup.find(id = "link3")

# Get a specific "<span>" tag with a "class" attribute, if such attribute exists in the page
soup.find(name = "span", attrs = {"class": ...})
Python

Block 12: BeautifulSoup object use.

As we’ve done before, we can store each method’s result in a variable and further use it in any way we wish. The bottom line is that BeautifulSoup provides us access to specific parts in the page “soup”, focusing only on the data we seek.

Integration with Requests

Most often, BeautifulSoup is used together with the requests package. Requests handles the connection to the website and retrieves the raw HTML, while BeautifulSoup parses it into a form that is easy to analyze.

url = "https://example.com"
response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser")

# Extract all links
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
Python

Block 13: BeautifulSoup with Requests.

This combination is one of the most common scraping patterns for static websites.

Practical Use Cases

BeautifulSoup is especially useful for:

  • Extracting text from paragraphs or headers.
  • Pulling tables or lists from HTML.
  • Navigating nested structures where data is wrapped in multiple tags.
  • Cleaning up messy or inconsistent HTML before processing.
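For instance, pulling a small HTML table into a list of dicts might look like this (the table contents are invented for the example):

```python
from bs4 import BeautifulSoup

html = """
<table id="movies">
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>Alien</td><td>1979</td></tr>
  <tr><td>Heat</td><td>1995</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Skip the header row, then read each data row's cells
rows = []
for tr in soup.find("table", id="movies").find_all("tr")[1:]:
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append({"title": cells[0], "year": cells[1]})

print(rows)
```

The same pattern, find the enclosing tag and iterate over its children, works for lists, menus, and most other repeated structures.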

For a deeper guide to BeautifulSoup, see the docs.

Selenium

The Selenium package was created as a tool for automating web browser actions. We can use it to closely mimic the behavior of a real user and reach data that is harder to get, such as content generated only after a button is pressed. Selenium can drive a “headless browser”, which is a web browser without a graphical user interface (GUI), to get a page’s contents.

While a headless browser can be used for automating layout testing and performance testing, we will use it for data extraction.

According to the docs:

“Selenium supports the automation of all the major browsers in the market through the use of WebDriver. WebDriver is an API and protocol that defines a language-neutral interface for controlling the behaviour of web browsers. Each browser is backed by a specific WebDriver implementation, called a driver. The driver is the component responsible for delegating down to the browser, and handles communication to and from Selenium and the browser.”

Basic Usage

Selenium requires a browser driver to interface with the chosen browser, with each browser using a different driver. This driver needs to be installed before we use Selenium, and its location should be included in our system’s PATH (an environment variable that tells the operating system where to find executable programs).

Implementing a simple get() method, to get a page’s contents with Selenium looks like this:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # preferred over options.headless = True

service = Service(path)  # path = path to your chromedriver

url = "https://www.google.com"
with webdriver.Chrome(service=service, options=options) as driver:
    driver.get(url)
    print(driver.title)
Python

Block 14: Selenium WebDriver.

Here we import the package, configure options to run Chrome in headless mode, and load a URL. The page’s contents are stored directly in the driver instance, so we can access them immediately.

Just like with Session(), we use a context manager to frame our use of the WebDriver object. In this example we are using the Google Chrome WebDriver. We start by importing the package and creating an “options” class instance for managing the WebDriver instance’s special options. We instruct our WebDriver to run in “headless browser” mode by adding the “--headless=new” argument to our “options” instance. We then provide the driver file’s path through a Service object (not needed if it is included in our system’s PATH), and finally perform the get() method. Under Selenium, we don’t need to store the response in a variable, as it is stored within the WebDriver object.

Finding Elements

This takes care of getting raw data from a web page. Now we can access this information and draw the data we need. We can do that with the driver instance’s “find_element” method. An instance is a specific Python object that belongs to a class (like how a certain apple belongs to the “fruit” class). This method finds the first appearance of an element in the web page based on provided locator values (various properties that identify HTML tags) and returns a reference to it. This value can be stored in a variable and used for future element actions. To find multiple elements, we can use find_elements, which returns a list of all matching web elements.

Some examples:

# Before looking for elements in a web page, import the By object for this task
from selenium.webdriver.common.by import By

# Get all the <p> elements in a web page
elements = driver.find_elements(by = By.TAG_NAME, value = "p")

# Print the string that is found between each <p> and </p> tags
for e in elements:
  print(e.text)
    
# Find a table with the id "movies" and get its rows' contents
table = driver.find_elements(by = By.XPATH, value = '//table[@id = "movies"]/tbody/tr')

# Find a button with the label "Fruits"
fruit_button = driver.find_element(by = By.XPATH, value = '//*[text() = "Fruits"]')
Python

Block 15: finding web page elements.

Notice the use of the XPATH attribute on the By object, more on that here. We should choose the “by” attribute that best serves our needs, after examining the web page’s source and understanding its structure.

Also notice the single quotes wrapping the XPath strings above. If a string contains double quotes, the quotes wrapping it must be single quotes, and vice versa.

Interacting with Elements

After finding an element, we can interact with it. Say that after pressing our “Fruits” button, a list of fruits appears somewhere on the web page and we are interested in the data it holds. So we instruct the headless browser to click the button and then expect its data to show:

# Click the "Fruits" button we found. We previously stored the reference to it as fruit_button
fruit_button.click()

# If it takes a few seconds for the data to show after clicking, instruct the headless browser to wait a bit before getting the newly appeared list. First, import this property
from selenium.webdriver.support.ui import WebDriverWait

# Also import the Expected Condition package and give it the alias "EC"
from selenium.webdriver.support import expected_conditions as EC

# Instruct the browser to wait up to 20 seconds or until the text we are interested in appears inside a newly created <div> tag
fruit_list_element = WebDriverWait(driver, timeout = 20).until(EC.visibility_of_element_located((By.XPATH, '//div[@id = "fruit_data"]')))

# Get the string inside the element we found
fruit_list = fruit_list_element.text
Python

Block 16: interacting with elements.

The WebDriverWait line uses expected_conditions (through its alias “EC”) to wait for an element to become visible on the site. Our driver instance then stores a reference to the element in a variable. If nothing happens within 20 seconds, it throws a TimeoutException, which is an error notice that stops our code from executing. We should therefore handle it using “try” and “except” blocks to keep our code running.

Headers and Proxies

When scraping web pages using Selenium, we should use artificial headers and proxies, just as we used in Requests and Session. We do that by adding arguments into our “options” object instance and then perform our get() method. See the following example:

# Get a user agent, usually chosen randomly from a repository we prepared in advance
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:48.0) Gecko/20100101 Firefox/48.0"

# Get a proxy, same as the user agent. Notice it is given as a simple string and not as a dict like in Requests
proxy = "http://18.159.181.93:8088"

# Use Google Chrome as our browser
options = webdriver.ChromeOptions()

# Add the user agent to our driver's options
options.add_argument(f"--user-agent={user_agent}")

# Add the proxy server to our driver's options
options.add_argument(f"--proxy-server={proxy}")

# Then we create a driver instance and execute a get() method
# See Block 14 above
Python

Block 17: adding headers to our driver instance.

By calling the add_argument() method on our “options” class instance, we load these values onto our driver instance, making them available to the get() method.

Notice the “f” before some of the strings. This is Python’s f-string formatting syntax, which replaces the expression inside the “{}” with its value, allowing us to insert variable values into strings.

For more information on how to configure a ChromeDriver for use in Selenium, see here.

Playwright

Playwright is a modern tool created and maintained by Microsoft for automating browsers. Like Selenium, it allows us to programmatically control a browser and interact with websites, but it is designed to be faster, more reliable, and more suited for complex modern web applications. Playwright supports all major browsers (Chromium, Firefox, and WebKit) and can run them in both headed and headless modes. It also has official bindings for Python, Node.js, C#, and Java, making it highly flexible for different development environments.

Playwright is capable of handling dynamic websites, multiple browser contexts, and even multiple user sessions in parallel, all while being more lightweight than Selenium. Unlike Selenium, Playwright ships with its own drivers, so there is no need for separate browser driver installations.

To begin using Playwright in Python, we need to install it and initialize the browser:

from playwright.sync_api import sync_playwright

# Create a context manager for Playwright
with sync_playwright() as p:
    # Launch a Chromium browser in headless mode
    browser = p.chromium.launch(headless=True)
    
    # Create a new browser page (tab)
    page = browser.new_page()
    
    # Visit a URL
    url = "https://example.com"
    page.goto(url)
    
    # Extract the page's title
    print(page.title())
    
    # Close the browser
    browser.close()
Python

Block 18: Playwright basic usage.

This short example shows the typical Playwright workflow: launching a browser, creating a page, visiting a URL, and interacting with it. Playwright exposes many useful methods on the page object, which represents a browser tab. We can navigate, click elements, fill in forms, and extract data, all within the same session.

Finding Elements

Similar to Selenium, Playwright allows us to locate elements using selectors. It supports CSS selectors, XPath, and text-based matching. For example:

# Find all links on the page
links = page.query_selector_all("a")

# Print their href values
for link in links:
    print(link.get_attribute("href"))

# Find an element by text
button = page.get_by_text("Sign In")
button.click()
Python

Block 19: Playwright element selection.

Selectors in Playwright are powerful. We can chain them, scope them, and wait for elements to appear before interacting with them. This helps overcome one of the biggest challenges in scraping modern websites- ensuring the element is actually loaded and visible before we try to use it.

Waiting and Timing

Modern websites often load content asynchronously. To handle this, Playwright comes with automatic waiting. Most actions, like page.click() or page.fill(), will automatically wait until the element is ready. But we can also explicitly wait for elements:

# Wait for an element to appear by CSS selector
page.wait_for_selector("#content")

# Wait for a network response before continuing
page.wait_for_response("**/api/data")
Python

Block 20: Playwright waiting.

This functionality greatly reduces the amount of manual time.sleep() code, making scraping scripts faster and more reliable.

Handling Multiple Pages and Contexts

Playwright allows us to run multiple isolated sessions in a single browser instance. This is useful when scraping websites that require logins or different cookies. We can create multiple contexts, each with its own set of cookies and storage:

# Create a new context
context = browser.new_context()

# Each context can create independent pages
page1 = context.new_page()
page2 = context.new_page()

page1.goto("https://example.com/user1")
page2.goto("https://example.com/user2")
Python

Block 21: Playwright multiple contexts.

This way, we can simulate different users without interference, something much harder to achieve with Selenium.

Speed and Concurrency

Another major benefit of Playwright is built-in support for parallel scraping. We can launch multiple browser contexts or pages and scrape them concurrently, without needing external libraries for concurrency. This makes Playwright especially useful for large-scale data collection projects.

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()

    # Create one page per URL
    pages = [context.new_page() for _ in urls]

    # Navigate each page (with the sync API, each goto() call blocks until that page loads)
    for page, url in zip(pages, urls):
        page.goto(url)

    # Collect titles
    titles = [page.title() for page in pages]
    print(titles)

    browser.close()
Python

Block 22: Playwright concurrency.

Here, three pages are loaded within the same browser. Note that with the synchronous API each goto() call blocks until its page finishes loading, so the navigations actually run one after another; for truly parallel loading we would use Playwright’s async API. Still, reusing one browser across dozens of tabs gives us significant speed improvements when scraping many URLs.

Headers and Proxies

Just like with Requests and Selenium, we may need to disguise our scraper using headers and proxies. Playwright makes this straightforward:

# Launch a browser with a proxy
browser = p.chromium.launch(proxy={"server": "http://18.159.181.93:8088"})

# Create a new context with custom headers
context = browser.new_context(extra_http_headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Firefox/91.0"
})

page = context.new_page()
page.goto("https://example.com")
Python

Block 23: Playwright headers and proxy.

This gives us the same disguise capabilities we had in Requests and Selenium, while maintaining Playwright’s robust browser automation.

So basically, Playwright is becoming one of the most important tools for web scraping. It is considered to be faster and more reliable than Selenium, and easier to use for complex sites. Its ability to handle multiple contexts, automatic waits, and built-in concurrency makes it a strong choice for modern scraping tasks.

When building a scraper today, we can think of Playwright as an evolution beyond Selenium- offering more power, more flexibility, and fewer headaches.

Scraping Faster

Key Concepts

Web scraping is mostly I/O-bound: our code spends far more time waiting on the network than using CPU. To go faster, we structure the program so work overlaps (concurrency) and, when needed, runs simultaneously on multiple cores (parallelism).

Asynchronous execution- this is a way to run tasks without them blocking each other. Instead of waiting for one task (like an HTTP request) to finish, the program yields control so another can run. In Python, this is powered by asyncio, a single-threaded event loop that schedules tasks, and coroutines, which are functions defined with async def. Coroutines pause at await, letting the event loop run other tasks while they wait (for example, on a network response). This model is ideal for high-throughput scraping: thousands of requests can be active in one thread with very little overhead. Heavy CPU work, however, must still be offloaded to processes or native code.

Concurrency- this code execution model allows tasks to overlap in time, so one task waiting (e.g. an HTTP request) doesn’t block others. In Python we get concurrency with threads or asyncio. It doesn’t require multiple cores, as tasks can interleave on a single core.

Executors- these are high-level tools that manage pools (groups) of threads or processes. We use ThreadPoolExecutor for I/O-bound scraping (lots of HTTP requests) and ProcessPoolExecutor for CPU-heavy work (parsing, data analysis). The “max_workers” setting controls how many run at once. Python defaults to min(32, os.cpu_count() + 4) for threads. We tune pool size for our workload and respect site rate limits.

Future- this is an object that represents a result which will be available later. It’s returned when we submit work to a thread or process pool (concurrent.futures). We can check its state, wait for it, or call .result() to get the value once the work has finished. A future is a “delayed result object” used in threads/processes.

GIL (Global Interpreter Lock)- Python has a global lock that lets only one thread execute Python code at a time. This simplifies memory management but blocks true CPU parallelism with threads. Threads are still valuable for I/O-bound scraping, since one can run while another waits. For CPU-bound work, we use processes or vectorized libraries (like NumPy) that bypass the GIL.

I/O (Input/Output)- this term refers to the transfer of data between a program and the outside world (files, network requests and databases). In scraping, “I/O-bound” means most time is spent waiting for server responses, not using CPU. In such cases, concurrency (threads or async) boosts speed by letting many requests progress while others wait.

Lock- a lock is a tool that ensures only one thread or process touches shared data at a time. Other threads wait until the lock is released, preventing race conditions. In scraping, locks help when many workers write to the same file or data structure.
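A minimal sketch of a lock protecting a shared list while several threads write to it:

```python
import threading

results = []
lock = threading.Lock()

def worker(item):
    value = item * 2          # stand-in for per-thread scraping work
    with lock:                # only one thread may append at a time
        results.append(value)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, concurrent appends to a shared structure can interleave unpredictably; with it, each write completes before the next begins.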

Parallelism- this code execution model runs tasks at the same time on multiple CPU cores. In Python this usually means using multiple processes or vectorized libraries that bypass the GIL. Useful for CPU-heavy scraping steps, like parsing large pages or processing data. Parallelism helps only when the bottleneck is CPU, not network.

Processes- these are independent Python interpreters with their own memory that are spawned and work at the same time. They bypass the GIL and give true parallelism on multiple cores. Best for CPU-heavy scraping steps like parsing large pages, data transformations, or ML inference. The downside is higher startup cost and slower data exchange between processes.

Synchronization- this is a set of tools that coordinate tasks in concurrent or parallel programs. They prevent conflicts, enforce order, and control flow. Examples include locks, semaphores, events, barriers, and queues. In scraping, synchronization ensures workers don’t overwrite each other’s results, and can also enforce limits like “no more than 10 requests at once.”

Synchronous execution- this code execution model runs one task at a time: we call an operation, the caller blocks until it completes, then we move on. It’s simple to reason about and debug, but wastes time when most of the work is just waiting on HTTP responses.

Task- this is an object that represents a coroutine scheduled to run on the asyncio event loop. Created with asyncio.create_task, it runs in the background until complete. We can await it for the result, cancel it, or group multiple tasks together. In short, a task is a coroutine wrapper scheduled by the asyncio event loop.
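A toy example of tasks on the event loop, with asyncio.sleep standing in for a network wait:

```python
import asyncio

async def fetch(i):
    # A coroutine: pauses at await, letting other tasks run meanwhile
    await asyncio.sleep(0.01)   # stand-in for waiting on a server response
    return f"page-{i}"

async def main():
    # Schedule three coroutines as tasks; they overlap while waiting
    tasks = [asyncio.create_task(fetch(i)) for i in range(3)]
    return await asyncio.gather(*tasks)

pages = asyncio.run(main())
print(pages)  # ['page-0', 'page-1', 'page-2']
```

All three waits overlap in a single thread, which is exactly why the async model suits I/O-bound scraping.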

Threads- these are lightweight execution units within one process that share memory. They’re great for I/O-bound scraping (many HTTP requests, DB or disk waits) because while one thread waits, another can run. But due to the GIL, only one thread executes Python bytecode at a time, so threads don’t speed up CPU-heavy code.

Applied Faster Scraping

For scraping pipelines, we fetch data concurrently (threads or async) and process it in parallel (processes or native/vectorized code). If the task must run in strict order, we keep it synchronous. If tasks can overlap safely, we run them concurrently. If they’re CPU-heavy and independent, we run them in parallel.

If we want to keep things simple, remain unblocked and be allowed to scrape freely, we need to scrape politely. This means we adopt several scraping conventions to make our effort efficient and easy on the server: we reuse a single HTTP client/session with connection pooling, set timeouts on every call, retry idempotent requests with exponential backoff + jitter, cap per-host concurrency (semaphore), prefer caching (ETag/If-Modified-Since), and always follow robots/ToS and data-protection rules.
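For example, the exponential backoff + jitter part could be sketched as a small helper (the function name and default values are ours):

```python
import random

def backoff_delay(attempt, base=0.2, cap=10.0):
    # Exponential backoff: double the wait on each attempt, up to a cap
    delay = min(cap, base * (2 ** attempt))
    # Jitter: add a random fraction so many workers don't retry in lockstep
    return delay + random.uniform(0, delay / 2)
```

Between retry attempts we would call time.sleep(backoff_delay(attempt)), with attempt counting the failures so far.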

Concurrency/parallelism is our device’s ability to perform many tasks at the same time (rather than starting a task only after its predecessor is complete). It can be done either using “multithreading” or “multiprocessing”, with each fitting best to a different task:

  • Multithreading- this process is called “concurrency” and it generally refers to a single process running several threads that take turns on the CPU, which enables it to run several tasks at the same time, increasing performance.
  • Multiprocessing- this process is called “parallelism” and it generally refers to utilizing a device’s multiple CPU cores, each having its own threads, allowing it to greatly increase performance. We can split our task between cores and not just threads on a single core.

How to know which concurrent computing method to use? Basically:

“Multithreading is useful for IO-bound processes, such as reading files from a network or database since each thread can run the IO-bound process concurrently. Multiprocessing is useful for CPU-bound processes, such as computationally heavy tasks since it will benefit from having multiple processors; similar to how multicore computers work faster than computers with a single core.”

Using a concurrency method for a task that it is not optimal for may result in performance costs.

More on the difference between multithreading and multiprocessing can be found here.

Web scraping is an IO-bound task, since most of its time is spent waiting for the network and for responses from servers. Therefore, we can greatly enhance our web scraper’s efficiency by utilizing multithreading.

We implement concurrency in our program by importing concurrent.futures and using the ThreadPoolExecutor class:

# Import the concurrency package
import concurrent.futures

# A context manager ensures threads are cleaned up promptly by implicitly
# calling executor.shutdown() when we're done
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:

  # Option 1- map a function over an iterable (e.g. a list of URLs) and
  # perform the tasks without receiving any data back
  executor.map(scrape_function, list_of_urls)
  
  # Option 2- map over the iterable, get the results back as a list and
  # show progress using the tqdm module
  from tqdm import tqdm
  feedback_lst = list(tqdm(executor.map(scrape_function, list_of_urls)))
  
  # Option 3- when we want more control over our tasks: submit tasks,
  # receive Future objects and then process the results as they become available
  futures = [executor.submit(task, i) for i in range(10)]
  
  for future in concurrent.futures.as_completed(futures):
        
        # Process the asynchronous task results: either keep them in a variable for further processing or print them
        print(future.result())
        
  """ Generally, we use the concurrent.futures model to perform a group of tasks asynchronously (i.e. each thread works without waiting for the previous task to finish). """

Block 24: concurrency.

Executor is an abstract base for running calls asynchronously. We do not instantiate Executor directly, but use concrete subclasses that implement all required methods and can be instantiated. ThreadPoolExecutor creates a pool of reusable threads for I/O-bound or latency-heavy tasks. Its max_workers sets the thread count. In modern Python the default is min(32, os.cpu_count() + 4). Tune max_workers empirically for your workload. For CPU-bound work prefer ProcessPoolExecutor.

When designing concurrency, decide whether tasks must run in order or can overlap. Use synchronous execution when tasks must be sequential. Use concurrent execution when tasks can overlap. With concurrent.futures, use executor.map for ordered results, or executor.submit with as_completed to consume results as they finish.
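A small illustration of the difference between the two (square and the inputs are placeholders):

```python
import concurrent.futures

def square(x: int) -> int:
    return x * x

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # executor.map preserves input order, no matter which thread finishes first
    ordered = list(executor.map(square, [1, 2, 3, 4]))

    # submit + as_completed yields results in completion order instead
    futures = [executor.submit(square, x) for x in [1, 2, 3, 4]]
    completed = [f.result() for f in concurrent.futures.as_completed(futures)]

print(ordered)            # always [1, 4, 9, 16]
print(sorted(completed))  # same values, possibly a different arrival order
```

Use map when the output order must match the input; use submit with as_completed when you want to react to each result as soon as it is ready.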

See more on the difference between executor.map() and executor.submit() here.

Conclusion

Web scraping has become an indispensable tool for extracting data from the vast resources available online. As we have explored in this document, Python offers several powerful libraries, such as Requests, BeautifulSoup, and Selenium, to help automate the process of retrieving and parsing data from websites. Whether you’re gathering data for market analysis, machine learning, or business intelligence, web scraping opens up a world of possibilities by allowing you to access information that would otherwise be difficult to collect manually.

While the technical details of scraping are important, it is equally crucial to be aware of the ethical and legal considerations surrounding the practice. Ensuring that you comply with a website’s terms of service, respecting robots.txt files, and understanding when it is appropriate to scrape are all key factors in responsible web scraping.

With this knowledge, you are now equipped to start developing your own web scraping projects. From handling simple static pages to more complex dynamic content with Selenium, you can leverage these skills to automate data collection tasks at scale. As you continue to explore this field, experiment with proxies, User-Agent rotation, and multithreading to improve your scraping efficiency and avoid common obstacles.

I encourage you to apply the techniques and best practices shared here to your own projects, and remember that there is always room to enhance and expand your capabilities in this fast-evolving field. The more you experiment, the more proficient you will become in unlocking valuable insights from the web’s vast data.

* * *

Join my mailing list and stay updated.

I will notify you about material additions to the blog, along with other interesting news.
