This document provides a practical guide to web scraping with Python. It covers sending HTTP requests, parsing HTML, extracting structured data, and using libraries like requests, BeautifulSoup, and pandas. Readers will learn how to automate data collection and follow ethical scraping practices.
Web scraping is the process of programmatically extracting information from websites. It is widely used for data collection, price monitoring, content aggregation, and research. Python offers powerful libraries for web scraping, enabling efficient and automated data extraction from web pages.
| Step | Description |
|---|---|
| HTTP Request | Send a GET request to the target URL using requests. |
| Retrieve HTML | Receive the HTML content of the web page from the server. |
| Parse HTML | Use BeautifulSoup to parse and navigate the HTML structure. |
| Extract Data | Locate and extract required data using tags, attributes, or CSS selectors. |
| Transform & Store | Clean, format, and save the data for further analysis or use. |
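The steps in the table can be sketched end to end on a small inline document. This is a minimal sketch, not a definitive implementation: the markup, class names, and prices are invented for illustration, and the HTML string stands in for what a GET request would return.

```python
import csv
import io

from bs4 import BeautifulSoup

# Steps 1-2 stand-in: HTML that a GET request would normally return.
# The markup, class names, and prices are invented for this sketch.
html = """
<ul>
  <li class="product"><span class="name">Widget</span><span class="price">$9.99</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">$24.50</span></li>
</ul>
"""

# Step 3: parse the HTML.
soup = BeautifulSoup(html, "html.parser")

# Step 4: extract data with CSS selectors.
items = []
for product in soup.select("li.product"):
    name = product.select_one(".name").get_text(strip=True)
    price_text = product.select_one(".price").get_text(strip=True)
    # Step 5a: transform the price string into a number.
    items.append({"name": name, "price": float(price_text.lstrip("$"))})

# Step 5b: store the records as CSV (in memory here; a file works the same way).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(items)

print(buffer.getvalue())
```

Replacing the inline string with `requests.get(url).text` would turn the sketch into a live scraper for a page with the same structure.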
Understanding HTML is essential for web scraping. Key elements include:
- `<html>`: Root element of the page
- `<head>`: Metadata and resources
- `<body>`: Main content
- `<p>`: Paragraphs of text
- `<a>`: Hyperlinks
- `<img>`: Images
- `<div>`: Generic container for grouping elements
- `<span>`: Inline container for styling
- `<table>`, `<tr>`, `<th>`, `<td>`: Used for tabular data
- `<ul>`, `<ol>`, `<li>`: Lists for structured content
- `<form>`, `<input>`: Used for user input and forms

Elements can also carry attributes such as `class`, `id`, and `href`.

A simple HTML document looks like this:

    <!DOCTYPE html>
    <html>
      <head>
        <title>Web Scraping Example</title>
      </head>
      <body>
        <h1>Web Scraping with Python</h1>
        <p>This is an example of a simple HTML document.</p>
      </body>
    </html>
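Attributes such as `class`, `id`, and `href` are often what a scraper actually reads. The sketch below shows one way to access them with BeautifulSoup; the snippet of markup is hypothetical and exists only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, invented for this illustration.
html = """
<div id="links" class="nav main">
  <a href="/about" class="internal">About</a>
  <a href="https://example.com/" id="ext">Example</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

container = soup.find("div", id="links")
# class is a multi-valued attribute, so BeautifulSoup returns it as a list.
container_classes = container["class"]

# .get() returns None instead of raising KeyError when an attribute is absent.
links = [
    {"text": a.get_text(strip=True), "href": a.get("href"), "id": a.get("id")}
    for a in container.find_all("a")
]

print(container_classes)
for link in links:
    print(link)
```

Using `.get()` rather than square brackets is a reasonable default when attributes may be missing on some elements.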
Diagram (HTML Document): the elements of a page form a tree.

- html
  - head
    - title
    - meta
  - body
    - header
      - h1
    - main
      - section
        - article
          - h2
          - p
    - footer
      - p
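The tree structure can be made visible by walking a parsed document recursively. The following is a minimal sketch: the `outline` helper is a hypothetical name introduced here, and the inline document simply mirrors the element names in the diagram.

```python
from bs4 import BeautifulSoup, Tag

# Inline document that mirrors the element names in the diagram.
html = (
    "<html><head><title>T</title><meta charset='utf-8'></head>"
    "<body><header><h1>H</h1></header>"
    "<main><section><article><h2>S</h2><p>Text</p></article></section></main>"
    "<footer><p>F</p></footer></body></html>"
)

def outline(node, depth=0):
    """Return one indented line per tag, in document order (hypothetical helper)."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        if isinstance(child, Tag):  # skip text nodes
            lines.extend(outline(child, depth + 1))
    return lines

soup = BeautifulSoup(html, "html.parser")
tree_lines = outline(soup.html)
print("\n".join(tree_lines))
```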
Install the required libraries. `pandas.read_html` also needs an HTML parser backend such as lxml, so it is included here:

    pip install requests beautifulsoup4 pandas lxml
The following example demonstrates how to fetch a web page and parse its content:

    import requests
    from bs4 import BeautifulSoup

    # Specify the URL of the webpage you want to scrape
    url = 'https://en.wikipedia.org/wiki/IBM'

    # Send an HTTP GET request to the webpage (time out rather than hang)
    response = requests.get(url, timeout=10)

    # Stop early if the server returned an error status (4xx or 5xx)
    response.raise_for_status()

    # Store the HTML content in a variable
    html_content = response.text

    # Create a BeautifulSoup object to parse the HTML
    soup = BeautifulSoup(html_content, 'html.parser')

    # Display a snippet of the HTML content
    print(html_content[:500])

Once the page is parsed, the links it contains can be listed:

    # Find all anchor tags and print their href attributes
    for link in soup.find_all('a'):
        print(link.get('href'))

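The `href` values on a real page are often relative paths, and some anchors have no `href` at all. One common way to handle both cases is sketched below; the base URL and markup are hypothetical, chosen only to show the behavior of `urljoin`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page URL and markup, used only for illustration.
base_url = "https://example.com/docs/index.html"
html = """
<a href="/about">About</a>
<a href="guide.html">Guide</a>
<a href="https://example.org/">External</a>
<a name="top">Anchor without href</a>
"""

soup = BeautifulSoup(html, "html.parser")

absolute_links = []
# href=True keeps only <a> tags that actually have an href attribute.
for link in soup.find_all("a", href=True):
    absolute_links.append(urljoin(base_url, link["href"]))

print(absolute_links)
```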
If the page contains HTML tables, you can extract them directly into pandas DataFrames:

    import pandas as pd

    url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'

    # read_html returns a list with one DataFrame per table found on the page
    tables = pd.read_html(url)

    # Print the first rows of the first table
    print(tables[0].head())

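Before pointing a scraper at a real site, it is worth checking whether the site's `robots.txt` permits the paths involved. The standard library's `urllib.robotparser` can do this. In the sketch below the rules, the `is_allowed` helper, and the `my-scraper` user agent are all assumptions made for illustration; a real check would load the file from the site with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one would be downloaded from
# the site's /robots.txt (RobotFileParser.set_url() plus read() does that).
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

def is_allowed(url, user_agent="my-scraper"):
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)

print(is_allowed("https://example.com/public/page.html"))
print(is_allowed("https://example.com/private/data.html"))
```

A `robots.txt` check does not replace reading the terms of service, which may restrict scraping in ways the file does not express.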
Always review a website's `robots.txt` and terms of service before scraping.

Web scraping is used in various fields and has many applications:
- Price Comparison: Services such as ParseHub use web scraping to collect data from online shopping websites and compare product prices.
- Email Address Gathering: Many companies that use email for marketing use web scraping to collect email addresses and then send bulk emails.
- Social Media Scraping: Web scraping is used to collect data from social media websites such as Twitter to find out what's trending.
Web scraping automates the extraction of information from websites, making it possible to collect large amounts of data efficiently. Python, with the Requests and BeautifulSoup libraries, provides a powerful toolkit for this task.
Web scraping is the process of programmatically extracting data from web pages. Instead of manually copying and pasting information, scripts can retrieve and parse HTML content, saving time and reducing errors.
In the field of data science, web scraping plays an integral role. It is used for various purposes such as:
- Data Collection: Web scraping is a primary method of collecting data from the internet. This data can be used for analysis, research, etc.
- Real-time Application: Web scraping is used for real-time applications like weather updates, price comparison, etc.
- Machine Learning: Web scraping provides the data needed to train machine learning models.
BeautifulSoup is a Python library that parses HTML and XML documents into a tree of Python objects. Each tag in the document, for example an `<h3>` tag, becomes a `Tag` object, which allows for easy navigation and data extraction.

The `parent` attribute allows navigation up the tree, while `next_sibling` and `previous_sibling` allow movement between siblings.

The `find_all()` method retrieves all descendants of a tag that match specified filters, such as tag name, attributes, or text content. This is useful for extracting lists or tables from web pages.
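The navigation attributes `parent`, `next_sibling`, and `previous_sibling` can be tried on a small fragment. This is a minimal sketch with hypothetical markup, written without whitespace between tags so that the sibling attributes land on tags rather than on text nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with no whitespace between tags, so that
# next_sibling / previous_sibling return tags rather than text nodes.
html = "<div><h3>Title</h3><p>First</p><p>Second</p></div>"

soup = BeautifulSoup(html, "html.parser")
heading = soup.find("h3")

parent_name = heading.parent.name          # up the tree
first_para = heading.next_sibling          # sideways, to the next node
second_para = first_para.next_sibling
back_again = second_para.previous_sibling  # and back

print(parent_name)
print(first_para.get_text(), second_para.get_text())
print(back_again is first_para)
```

On real pages, line breaks between tags show up as text nodes, so `find_next_sibling()` and `find_previous_sibling()` are usually the safer choice because they skip to the next matching tag.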
Example of `find_all()` applied to a small table:

    from bs4 import BeautifulSoup

    html = """
    <table>
      <tr><th>Name</th><th>Salary</th></tr>
      <tr><td>Player 1</td><td>$1,000,000</td></tr>
      <tr><td>Player 2</td><td>$2,000,000</td></tr>
    </table>
    """

    soup = BeautifulSoup(html, 'html.parser')

    # Each <tr> is one table row
    rows = soup.find_all('tr')
    for row in rows:
        # Header cells are <th>, data cells are <td>
        cells = row.find_all(['th', 'td'])
        print([cell.get_text(strip=True) for cell in cells])

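Extracted cell values are strings, so a cleaning step is usually needed before analysis. The sketch below reuses the table from the example and converts the salary column to integers; `parse_salary` is a hypothetical helper that assumes the `$1,000,000` format shown above and would need adjusting for other formats.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Name</th><th>Salary</th></tr>
  <tr><td>Player 1</td><td>$1,000,000</td></tr>
  <tr><td>Player 2</td><td>$2,000,000</td></tr>
</table>
"""

def parse_salary(text):
    """Convert a string like '$1,000,000' to the integer 1000000."""
    return int(text.replace("$", "").replace(",", "").strip())

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# The first row holds the <th> header cells; the rest hold <td> data cells.
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = []
for row in rows[1:]:
    name, salary = [td.get_text(strip=True) for td in row.find_all("td")]
    records.append({header[0]: name, header[1]: parse_salary(salary)})

print(records)
```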
To scrape a web page:

1. Use `requests.get()` to download the web page.
2. Read the HTML content from the response's `.text` attribute.

Example:
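
    import requests
    from bs4 import BeautifulSoup

    url = 'https://example.com/data'
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Raise an exception on 4xx/5xx responses
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract all table rows
    rows = soup.find_all('tr')
    for row in rows:
        cells = row.find_all('td')
        # Header rows contain only <th> cells, so skip rows with no <td>
        if cells:
            print([cell.get_text(strip=True) for cell in cells])
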
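Rows collected this way can be handed to pandas for analysis. The following sketch assumes a table whose first row contains the column headers; the city names and figures are invented, and the inline string stands in for `response.text`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical table standing in for a downloaded page.
html = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Alpha</td><td>1200</td></tr>
  <tr><td>Beta</td><td>3400</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# Column names come from the <th> cells of the first row.
columns = [th.get_text(strip=True) for th in rows[0].find_all("th")]
data = [
    [td.get_text(strip=True) for td in row.find_all("td")]
    for row in rows[1:]
]

df = pd.DataFrame(data, columns=columns)
df["Population"] = df["Population"].astype(int)  # text to integer

total_population = int(df["Population"].sum())
print(df)
print(total_population)
```

From here, `df.to_csv()` or any other pandas writer can be used to save the result.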
Web scraping with Python enables efficient and automated data extraction from web pages. By combining requests, BeautifulSoup, and pandas, it is possible to collect and analyze online data for a variety of applications. Responsible scraping ensures compliance with website policies and ethical standards.

The `find_all()` method is particularly useful for extracting multiple elements, such as links or table rows, from a web page. By understanding the HTML structure and using BeautifulSoup's parsing capabilities, you can effectively gather and manipulate data from the web. To learn more about web scraping, you can take the Coursera course on Web Scraping with Python.