
Web Scraping

This document provides a practical guide to web scraping with Python. It covers sending HTTP requests, parsing HTML, extracting structured data, and using libraries like requests, BeautifulSoup, and pandas. Readers will learn how to automate data collection and follow ethical scraping practices.


Introduction

Web scraping is the process of programmatically extracting information from websites. It is widely used for data collection, price monitoring, content aggregation, and research. Python offers powerful libraries for web scraping, enabling efficient and automated data extraction from web pages.


How Web Scraping Works

| Step | Description |
| --- | --- |
| HTTP Request | Send a GET request to the target URL using requests. |
| Retrieve HTML | Receive the HTML content of the web page from the server. |
| Parse HTML | Use BeautifulSoup to parse and navigate the HTML structure. |
| Extract Data | Locate and extract required data using tags, attributes, or CSS selectors. |
| Transform & Store | Clean, format, and save the data for further analysis or use. |
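
The steps in the table can be sketched end to end. To keep the sketch runnable offline, the HTTP request is replaced by an inline HTML string. The markup, the product, name, and price class names, and the CSV field names are assumptions made for illustration, not taken from any real site.

```python
import csv
import io

from bs4 import BeautifulSoup

# Stand-in for response.text; a real run would use requests.get(url).text
html = """
<ul>
  <li class="product"><span class="name">Widget</span><span class="price">$1,250.00</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">$99.50</span></li>
</ul>
"""

# Parse HTML
soup = BeautifulSoup(html, "html.parser")

# Extract data with CSS selectors
records = []
for item in soup.select("li.product"):
    name = item.select_one(".name").get_text(strip=True)
    raw_price = item.select_one(".price").get_text(strip=True)
    # Transform: drop the currency symbol and thousands separator
    price = float(raw_price.replace("$", "").replace(",", ""))
    records.append({"name": name, "price": price})

# Store: write CSV to an in-memory buffer
# (swap the buffer for open("out.csv", "w", newline="") to write a file)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping each step separate makes it easier to swap one part, for example a different parser or storage format, without touching the rest.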

HTML Structure Overview

Understanding HTML is essential for web scraping. Key elements include:

  • <html>: Root element of the page
  • <head>: Metadata and resources
  • <body>: Main content
  • <p>: Paragraphs of text
  • <a>: Hyperlinks
  • <img>: Images
  • <div>: Generic container for grouping elements
  • <span>: Inline container for styling
  • <table>, <tr>, <th>, <td>: Used for tabular data
  • <ul>, <ol>, <li>: Lists for structured content
  • <form>, <input>: Used for user input and forms

Tags Composition

  • Tags define the structure of the HTML document.
  • Tags can have attributes that provide additional information.
  • Attributes are key-value pairs within tags, such as class, id, and href.
  • Example of a simple HTML structure:
<!DOCTYPE html>
<html>
  <head>
    <title>Web Scraping Example</title>
  </head>
  <body>
    <h1>Web Scraping with Python</h1>
    <p>This is an example of a simple HTML document.</p>
  </body>
</html>
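
As a small illustration of attributes as key-value pairs, the sketch below parses a single tag with BeautifulSoup and reads its attributes. The tag and its attribute values are invented for the example.

```python
from bs4 import BeautifulSoup

# A single, made-up anchor tag with three attributes
html = '<a id="docs-link" class="external nav" href="https://example.com/docs">Docs</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

tag_name = link.name          # the tag name, here "a"
attributes = link.attrs       # all attributes as a dictionary
href = link["href"]           # dictionary-style access; raises KeyError if missing
title = link.get("title")     # .get() returns None when the attribute is absent
classes = link["class"]       # class is multi-valued, so BeautifulSoup returns a list

print(tag_name, href, title, classes)
```

Using .get() rather than square brackets is usually the safer choice when an attribute may be missing on some tags.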

Document Tree

  • The hierarchy of HTML elements can be visualized as a tree:
	graph TD
	subgraph HTML_Document["HTML Document"]
	HTML["html"]:::root --> HEAD["head"]:::head
	HTML --> BODY["body"]:::body
	
	HEAD --> TITLE["title"]:::head
	HEAD --> META["meta"]:::head
	
	BODY --> HEADER["header"]:::section
	BODY --> MAIN["main"]:::section
	BODY --> FOOTER["footer"]:::section
	
	HEADER --> H1["h1"]:::text
	MAIN --> SECTION["section"]:::section
	SECTION --> ARTICLE["article"]:::section
	ARTICLE --> H2["h2"]:::text
	ARTICLE --> P["p"]:::text
	FOOTER --> P2["p"]:::text
	
	classDef root fill:#b58900,stroke:#586e75,color:#fff;
	classDef head fill:#268bd2,stroke:#586e75,color:#fff;
	classDef body fill:#2aa198,stroke:#586e75,color:#fff;
	classDef section fill:#6c71c4,stroke:#586e75,color:#fff;
	classDef text fill:#dc322f,stroke:#586e75,color:#fff;
	end
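
The same hierarchy can be walked in code. The following is a minimal sketch, assuming a small inline document that mirrors the diagram; outline is a hypothetical helper written for this example, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline document shaped like the tree in the diagram (content is invented)
html = (
    "<html><head><title>T</title><meta name='description' content='demo'></head>"
    "<body><header><h1>Title</h1></header>"
    "<main><section><article><h2>Sub</h2><p>Text</p></article></section></main>"
    "<footer><p>Footer</p></footer></body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def outline(tag, depth=0):
    """Return one indented line per element, depth-first."""
    lines = ["  " * depth + tag.name]
    for child in tag.children:
        # Skip text nodes; only descend into element nodes
        if isinstance(child, Tag):
            lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(soup.html)
print("\n".join(tree_lines))
```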
	

Tools Required

  • requests: For sending HTTP requests.
  • BeautifulSoup: For parsing and navigating HTML.
  • Scrapy (optional): An open-source web crawling framework for Python, suited to more advanced scraping tasks such as handling complex sites and data pipelines.
  • lxml (optional): For faster parsing of HTML and XML documents.
  • html5lib (optional): For parsing HTML5 documents.
  • Selenium (optional): For scraping dynamic content rendered by JavaScript.
  • PyQuery (optional): For jQuery-like syntax to navigate and manipulate HTML documents.
  • re or regex (optional): For advanced text processing and pattern matching. The re module ships with Python; regex is a third-party package.
  • pandas (optional): For extracting tables directly into DataFrames.

Install the libraries used in this guide (lxml is the parser that pandas.read_html uses by default):

pip install requests beautifulsoup4 pandas lxml

Fetching and Parsing HTML

The following example demonstrates how to fetch a web page and parse its content:

import requests
from bs4 import BeautifulSoup

# Specify the URL of the webpage you want to scrape
url = 'https://en.wikipedia.org/wiki/IBM'
# Send an HTTP GET request to the webpage (give up after 10 seconds)
response = requests.get(url, timeout=10)
# Raise an exception if the server returned an error status
response.raise_for_status()
# Store the HTML content in a variable
html_content = response.text
# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')
# Display a snippet of the HTML content
print(html_content[:500])

# Find all anchor tags and print their href attributes
for link in soup.find_all('a'):
    print(link.get('href'))
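
The href values printed above appear exactly as written in the markup, so many of them are relative paths, and anchors without an href yield None. The sketch below shows one way to handle both cases, along with a fetch helper that sets a timeout and checks the status code. fetch_html and absolute_links are hypothetical helper names, and the User-Agent string is a made-up placeholder.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page, raising requests.HTTPError on 4xx/5xx responses."""
    # Placeholder User-Agent; replace it with a string identifying your script
    headers = {"User-Agent": "example-scraper/0.1 (contact@example.com)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.text


def absolute_links(html, base_url):
    """Return every link in the HTML as an absolute URL."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchors that have no href attribute at all
    for anchor in soup.find_all("a", href=True):
        links.append(urljoin(base_url, anchor["href"]))
    return links


# Offline demonstration on an inline snippet
sample = (
    '<a href="/wiki/Python">Python</a> '
    '<a name="top">no href</a> '
    '<a href="https://example.org/x">x</a>'
)
links = absolute_links(sample, "https://en.wikipedia.org/wiki/IBM")
print(links)
```

For a live page, the two helpers would be combined as absolute_links(fetch_html(url), url).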

Extracting Tables with pandas

If the page contains HTML tables, you can extract them directly into pandas DataFrames:

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
# read_html returns a list with one DataFrame per <table> found on the page
tables = pd.read_html(url)
print(tables[0].head())  # Print the first rows of the first table
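
pandas.read_html also accepts HTML text wrapped in a file-like object, which is convenient for trying it without network access. The sketch below assumes lxml is installed; the table contents are invented placeholder data, not real figures.

```python
import io

import pandas as pd

# Invented placeholder table; names and numbers are not real data
html = """
<table>
  <tr><th>Country</th><th>GDP (USD million)</th></tr>
  <tr><td>Aland</td><td>1,500</td></tr>
  <tr><td>Borduria</td><td>2,750</td></tr>
</table>
"""

# Wrap the string in StringIO so pandas treats it as a file-like object
tables = pd.read_html(io.StringIO(html))
df = tables[0]

print(df)
print(df.dtypes)
```

By default read_html treats the comma as a thousands separator, so the second column is parsed as numbers rather than text.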

Best Practices and Ethics

  • Always check the website’s robots.txt and terms of service before scraping.
  • Avoid making too many requests in a short time (use delays if needed).
  • Do not scrape personal or sensitive data.
  • Respect copyright and data usage policies.
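
The first two points can be sketched with the standard library alone. The example below parses an inline robots.txt with urllib.robotparser and spaces out requests with time.sleep. The robots.txt content, the example-scraper user agent, and the polite_fetch_all helper are assumptions made for illustration.

```python
import time
from urllib.robotparser import RobotFileParser

# Inline robots.txt so the sketch runs offline; a live script would call
# parser.set_url("https://example.com/robots.txt") followed by parser.read()
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

user_agent = "example-scraper"
allowed = parser.can_fetch(user_agent, "https://example.com/articles/1")
blocked = parser.can_fetch(user_agent, "https://example.com/private/report")
delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay is given


def polite_fetch_all(urls, fetch, delay_seconds):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


print(allowed, blocked, delay)
```

Passing the fetch function in as an argument keeps the delay logic separate from the HTTP code, so it can be tried with a stub before pointing it at a real site.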

Applications of Web Scraping

Web scraping is used in various fields and has many applications:

  1. Price Comparison: Tools such as ParseHub use web scraping to collect data from online shopping websites and compare product prices.

  2. Email Address Gathering: Many companies that use email for marketing use web scraping to collect email addresses and then send bulk emails.

  3. Social Media Scraping: Web scraping is used to collect data from social media websites such as Twitter to find out what’s trending.


Web Scraping Example with BeautifulSoup

Web scraping automates the extraction of information from websites, making it possible to collect large amounts of data efficiently. Python, with the Requests and BeautifulSoup libraries, provides a powerful toolkit for this task.


What is Web Scraping

Web scraping is the process of programmatically extracting data from web pages. Instead of manually copying and pasting information, scripts can retrieve and parse HTML content, saving time and reducing errors.

Importance of Web Scraping in Data Science

In the field of data science, web scraping plays an integral role. It is used for various purposes such as:

  • Data Collection: Web scraping is a primary method of collecting data from the internet. This data can be used for analysis, research, etc.
  • Real-time Application: Web scraping is used for real-time applications like weather updates, price comparison, etc.
  • Machine Learning: Web scraping provides the data needed to train machine learning models.


The Role of BeautifulSoup Objects

BeautifulSoup is a Python library that parses HTML and XML documents into a tree of Python objects. The BeautifulSoup object represents the whole document, and each HTML tag becomes a Tag object within it, allowing for easy navigation and data extraction.


  • Each tag object corresponds to an HTML tag in the document.
  • The first occurrence of a tag can be accessed directly as an attribute; for example, soup.h3 returns the first <h3> tag.
  • Tags can have children (nested tags) and siblings (tags at the same level).
  • The parent attribute allows navigation up the tree, while next_sibling and previous_sibling allow movement between siblings.
  • Tag attributes can be accessed as key-value pairs, and a tag's text content can be retrieved as a NavigableString.
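
The navigation attributes listed above can be tried on a small inline snippet. This is a minimal sketch with invented markup. The tags are written without whitespace between them because, in real pages, whitespace between tags becomes a text node that shows up as a sibling.

```python
from bs4 import BeautifulSoup

html = "<div id='box'><h3>First</h3><p class='note'>Body text</p><h3>Second</h3></div>"
soup = BeautifulSoup(html, "html.parser")

first_h3 = soup.h3                        # first <h3> in the document
heading_text = first_h3.string            # NavigableString with the tag's text
parent_id = first_h3.parent["id"]         # move up to the enclosing <div>
next_tag = first_h3.next_sibling          # the <p> that follows the first <h3>
previous_tag = soup.p.previous_sibling    # back from <p> to the first <h3>
child_names = [child.name for child in soup.div.children]

print(heading_text)           # First
print(parent_id)              # box
print(next_tag["class"])      # ['note']
print(child_names)            # ['h3', 'p', 'h3']
```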

Using the find_all Method

The find_all() method retrieves all descendants of a tag that match specified filters, such as tag name, attributes, or text content. This is useful for extracting lists or tables from web pages.

Example:

from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Name</th><th>Salary</th></tr>
  <tr><td>Player 1</td><td>$1,000,000</td></tr>
  <tr><td>Player 2</td><td>$2,000,000</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
rows = soup.find_all('tr')
for row in rows:
    cells = row.find_all(['th', 'td'])
    print([cell.get_text(strip=True) for cell in cells])
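
Beyond tag names, find_all() accepts attribute, class, and text filters, and select() offers the CSS-selector route to the same elements. The sketch below shows a few of these filters on invented markup.

```python
from bs4 import BeautifulSoup

html = """
<ul id="menu">
  <li class="item active"><a href="/home">Home</a></li>
  <li class="item"><a href="/docs">Docs</a></li>
  <li class="item"><a href="https://example.com/blog">Blog</a></li>
</ul>
<p class="item">Not a list entry</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag name plus class; class_ matches any one of a tag's classes
by_class = soup.find_all("li", class_="item")

# Attribute filter using a function; the function receives the href value
internal = soup.find_all("a", href=lambda value: value and value.startswith("/"))

# Stop after the first two matches
first_two = soup.find_all("a", limit=2)

# Match on the tag's exact text
exact_text = soup.find_all("a", string="Docs")

# The same kind of query written as a CSS selector
via_css = soup.select("ul#menu li.active > a")

print(len(by_class), [a["href"] for a in internal], [a.get_text() for a in via_css])
```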

Scraping a Web Page with Requests and BeautifulSoup

To scrape a web page:

  1. Import the required modules.
  2. Use requests.get() to download the web page.
  3. Access the HTML content with the .text attribute.
  4. Create a BeautifulSoup object to parse the HTML.
  5. Use BeautifulSoup methods to extract the desired data.

Example:

import requests
from bs4 import BeautifulSoup

# Placeholder URL; replace it with the page you want to scrape
url = 'https://example.com/data'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Extract all table rows
rows = soup.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    # Header rows contain <th> cells only, so skip rows without <td> cells
    if cells:
        print([cell.text for cell in cells])
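
Lists of cell text are often easier to work with once each row is paired with the column headers. The sketch below reuses the small salary table from the find_all example as inline HTML and builds one dictionary per data row. It assumes the first row holds the headers.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Name</th><th>Salary</th></tr>
  <tr><td>Player 1</td><td>$1,000,000</td></tr>
  <tr><td>Player 2</td><td>$2,000,000</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# Assumption: the first row contains the column headers
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

records = []
for row in rows[1:]:
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) != len(headers):
        continue  # skip rows that do not line up with the header
    records.append(dict(zip(headers, cells)))

print(records)
```

A list of dictionaries like this can be passed directly to pandas.DataFrame() or csv.DictWriter for the storage step.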

Conclusion

Web scraping with Python enables efficient, automated data extraction from web pages. By combining requests, BeautifulSoup, and pandas, you can collect and analyze online data for a variety of applications. Understanding the HTML structure is the starting point, and find_all() is particularly useful for extracting multiple elements, such as links or table rows, from a web page. Scraping responsibly means complying with website policies and ethical standards. To learn more about web scraping, you can take this Coursera course on Web Scraping with Python.

