Combined Example: Using Both Requests and Beautiful Soup

Web scraping is the process of extracting data from websites. In Python, the two most commonly used libraries for this purpose are requests and BeautifulSoup. The requests library allows you to send HTTP/1.1 requests, and BeautifulSoup allows you to parse HTML or XML documents efficiently. This document provides a detailed explanation of how to use these libraries together for effective web scraping, covering everything from installation to advanced scraping examples.

Introduction to Requests and BeautifulSoup

Why Combine Requests and BeautifulSoup?

While requests is used for fetching the HTML content of a webpage, BeautifulSoup is used for parsing and navigating that HTML content. Together, they form a powerful toolkit for scraping static web pages.

Basic Workflow of Web Scraping

  1. Send an HTTP request to the target website using requests. 
  2. Receive the HTML content from the response.
  3. Parse the HTML using BeautifulSoup.
  4. Extract the required data by navigating the parsed HTML tree.
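
The four steps above can be sketched as one minimal script (using quotes.toscrape.com, the practice site introduced below):

```python
import requests
from bs4 import BeautifulSoup

url = "http://quotes.toscrape.com"

# Steps 1-2: send the HTTP request and receive the HTML content
response = requests.get(url, timeout=10)

# Step 3: parse the HTML
soup = BeautifulSoup(response.text, "html.parser")

# Step 4: navigate the parsed tree (here, just the page title)
print(soup.title.string)
```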

Installing the Required Libraries

Using pip

pip install requests
pip install beautifulsoup4

Optional Parser Installation

pip install lxml
pip install html5lib

These optional parsers may improve speed and accuracy of HTML parsing.
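
Beautiful Soup lets you select a parser by name when constructing the soup. The built-in 'html.parser' is always available; 'lxml' and 'html5lib' work only if installed:

```python
from bs4 import BeautifulSoup

html = "<html><body><p class='msg'>Hello</p></body></html>"

# 'html.parser' ships with Python; 'lxml' and 'html5lib' are the optional extras
soup = BeautifulSoup(html, 'html.parser')   # always available
# soup = BeautifulSoup(html, 'lxml')        # typically faster, if installed
# soup = BeautifulSoup(html, 'html5lib')    # most browser-like, if installed

print(soup.find('p', class_='msg').text)  # → Hello
```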

Step-by-Step Example: Scraping Quotes from a Website

Website Used: quotes.toscrape.com

This website is specifically built for practicing web scraping. It displays quotes from famous people, along with authors and tags.

Step 1: Import Libraries


import requests
from bs4 import BeautifulSoup

Step 2: Make an HTTP Request


url = "http://quotes.toscrape.com"
response = requests.get(url)
print(response.status_code)  # Should return 200 if successful

Step 3: Parse HTML with BeautifulSoup


soup = BeautifulSoup(response.text, 'html.parser')

Step 4: Extract Quotes


quotes = soup.find_all('div', class_='quote')

for quote in quotes:
    text = quote.find('span', class_='text').text
    author = quote.find('small', class_='author').text
    tags = [tag.text for tag in quote.find_all('a', class_='tag')]
    
    print(f"Quote: {text}")
    print(f"Author: {author}")
    print(f"Tags: {', '.join(tags)}\n")

Analyzing the HTML Structure

Inspecting the HTML

Each quote is wrapped in a <div class="quote"> element. Inside this div:

  • The quote text is inside <span class="text">
  • The author is inside <small class="author">
  • Tags are nested within <div class="tags"> under <a class="tag">
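
You can verify this structure from Python by pretty-printing a quote element. The snippet below uses a small stand-in fragment with the same layout as the live page:

```python
from bs4 import BeautifulSoup

# Stand-in fragment mirroring the structure described above
html = '''
<div class="quote">
    <span class="text">“Some quote.”</span>
    <small class="author">Someone</small>
    <div class="tags">
        <a class="tag">life</a>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
quote = soup.find('div', class_='quote')
print(quote.prettify())  # shows the nested span/small/a elements with indentation
```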

Handling Multiple Pages

Paginated Content

The site includes pagination using a "Next" button. We can loop through all pages until there is no "next" button.

Code Example for Pagination


url = "http://quotes.toscrape.com"
while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    quotes = soup.find_all('div', class_='quote')
    for quote in quotes:
        text = quote.find('span', class_='text').text
        author = quote.find('small', class_='author').text
        print(f"{text} - {author}")
    
    # Follow the "Next" link if present; otherwise stop the loop
    next_button = soup.find('li', class_='next')
    if next_button:
        next_url = next_button.find('a')['href']  # relative path, e.g. /page/2/
        url = f"http://quotes.toscrape.com{next_url}"
    else:
        break

Handling Errors Gracefully

HTTP Error Codes


response = requests.get(url)
if response.status_code != 200:
    print(f"Failed to retrieve the page: {response.status_code}")

Using Try-Except for Robustness


try:
    response = requests.get(url)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print("Error fetching data:", e)
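
Building on the try/except pattern, you can retry failed requests before giving up. The sketch below is one reasonable approach; the retry count and delay are arbitrary choices, not fixed rules:

```python
import time

import requests


def fetch_with_retries(url, retries=3, delay=2):
    """Fetch a URL, retrying on network errors. Returns None if all attempts fail."""
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(delay)
    return None
```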

Working with Headers and User-Agent

Why Set Headers?

Some websites block requests that lack typical browser headers. Setting a browser-style User-Agent can help your script mimic a real browser.


headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
response = requests.get(url, headers=headers)

Extracting and Saving Data

Storing in a CSV File


import csv

with open('quotes.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Text', 'Author', 'Tags'])
    
    for quote in quotes:
        text = quote.find('span', class_='text').text
        author = quote.find('small', class_='author').text
        tags = [tag.text for tag in quote.find_all('a', class_='tag')]
        writer.writerow([text, author, ', '.join(tags)])

Using pandas for Data Analysis


import pandas as pd

data = []

for quote in quotes:
    text = quote.find('span', class_='text').text
    author = quote.find('small', class_='author').text
    tags = [tag.text for tag in quote.find_all('a', class_='tag')]
    data.append({'Text': text, 'Author': author, 'Tags': ', '.join(tags)})

df = pd.DataFrame(data)
df.to_csv('quotes_pandas.csv', index=False)
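
Once the records are in a DataFrame, simple analysis is a one-liner. The example below uses stand-in data in the same shape as the scraped records above:

```python
import pandas as pd

# Stand-in records in the same shape as those built from the scraped quotes
data = [
    {'Text': 'Quote one', 'Author': 'Einstein', 'Tags': 'life'},
    {'Text': 'Quote two', 'Author': 'Einstein', 'Tags': 'science'},
    {'Text': 'Quote three', 'Author': 'Austen', 'Tags': 'love'},
]
df = pd.DataFrame(data)

# Count how many quotes each author has
counts = df['Author'].value_counts()
print(counts)  # Einstein appears twice, Austen once
```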

Advanced Techniques

Using CSS Selectors with select()


quotes = soup.select('div.quote')
for q in quotes:
    print(q.select_one('span.text').text)
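
CSS selectors also support descendant combinators, which can replace nested find()/find_all() calls. A small self-contained example:

```python
from bs4 import BeautifulSoup

html = '''
<div class="quote">
    <span class="text">“A quote.”</span>
    <div class="tags"><a class="tag">books</a><a class="tag">reading</a></div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Descendant combinator: every a.tag anywhere inside a div.quote
tags = [a.text for a in soup.select('div.quote a.tag')]
print(tags)  # → ['books', 'reading']

# select_one returns only the first match, like find()
print(soup.select_one('div.quote span.text').text)
```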

Scraping with Authentication

Some websites require login credentials. Use requests.Session() to manage login sessions.


session = requests.Session()
login_data = {'username': 'your_username', 'password': 'your_password'}
session.post('http://example.com/login', data=login_data)
response = session.get('http://example.com/protected-page')

Limitations of BeautifulSoup and Requests

  • Cannot scrape JavaScript-rendered content (use Selenium for this)
  • Some sites detect and block bots using rate limiting or captchas
  • Not suitable for heavy-duty parallel scraping (use Scrapy for large-scale tasks)

Ethical and Legal Considerations

Respect robots.txt

Always check if the site allows scraping:


http://example.com/robots.txt
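
Python's standard library can parse robots.txt rules for you via urllib.robotparser. The rules below are a hypothetical example in the same format a real site would serve:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, in the format served at http://example.com/robots.txt
rules = [
    "User-agent: *",
    "Disallow: /private/",
]
rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "http://example.com/page"))       # → True
print(rp.can_fetch("*", "http://example.com/private/x"))  # → False

# Against a live site you would instead fetch the file:
# rp.set_url("http://example.com/robots.txt"); rp.read()
```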

Be Polite: Add Delays


import time

time.sleep(2)  # Sleep for 2 seconds between requests

Avoid Personal Data Collection

Do not collect sensitive data or violate privacy policies. Ensure your scraping complies with local data protection laws like GDPR or CCPA.

Combining Python's requests and BeautifulSoup libraries provides a powerful, flexible, and accessible way to extract data from static websites. This approach is ideal for beginners and intermediate users who want to automate data collection and analysis tasks. Whether you're scraping quotes for inspiration, job listings for research, or tables for analysis, understanding how to use these tools together is a foundational skill in modern data science and automation workflows.

Always remember to respect the legal and ethical boundaries of web scraping. Use public APIs when available, read website terms of use, and never overload a site’s server with unnecessary requests. With responsible and efficient use, web scraping can unlock tremendous potential for data extraction and insight generation.


Copyrights © 2024 letsupdateskills All rights reserved