Web scraping is the process of extracting data from websites. In Python, the two most commonly used libraries for this purpose are requests and BeautifulSoup. The requests library allows you to send HTTP/1.1 requests, and BeautifulSoup allows you to parse HTML or XML documents efficiently. This document provides a detailed explanation of how to use these libraries together for effective web scraping, covering everything from installation to advanced scraping examples.
While requests is used for fetching the HTML content of a webpage, BeautifulSoup is used for parsing and navigating that HTML content. Together, they form a powerful toolkit for scraping static web pages.
pip install requests
pip install beautifulsoup4
pip install lxml
pip install html5lib
These optional parsers (lxml for speed, html5lib for browser-like handling of malformed markup) can improve the speed and accuracy of HTML parsing.
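As a minimal sketch of how the parser is chosen, BeautifulSoup takes the parser name as its second argument. The snippet below uses the built-in 'html.parser' so it runs without extra installs; the HTML string and the 'msg' class are made up for illustration.

```python
from bs4 import BeautifulSoup

html = "<p class='msg'>Hello, <b>world</b></p>"

# 'html.parser' ships with Python; pass 'lxml' or 'html5lib' here instead
# (once installed) to trade raw speed (lxml) against very lenient,
# browser-like handling of broken markup (html5lib).
soup = BeautifulSoup(html, "html.parser")
print(soup.find("p", class_="msg").get_text())  # Hello, world
```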
The example site, http://quotes.toscrape.com, is built specifically for practicing web scraping. It displays quotes from famous people, along with their authors and tags.
import requests
from bs4 import BeautifulSoup

url = "http://quotes.toscrape.com"
response = requests.get(url)
print(response.status_code)  # Should print 200 if successful

soup = BeautifulSoup(response.text, 'html.parser')
quotes = soup.find_all('div', class_='quote')

for quote in quotes:
    text = quote.find('span', class_='text').text
    author = quote.find('small', class_='author').text
    tags = [tag.text for tag in quote.find_all('a', class_='tag')]
    print(f"Quote: {text}")
    print(f"Author: {author}")
    print(f"Tags: {', '.join(tags)}\n")
Each quote is wrapped in a <div class="quote"> element. Inside this div, the quote text sits in a <span class="text">, the author name in a <small class="author">, and each tag in an <a class="tag"> link.
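To see how those selectors map onto the markup, the sketch below parses a simplified, hypothetical reproduction of one quote block (the text, author, and tags are placeholders, not real content from the site):

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup mirroring the structure of one quote block
sample = """
<div class="quote">
  <span class="text">An example quote.</span>
  <small class="author">Example Author</small>
  <div class="tags">
    <a class="tag" href="/tag/example/">example</a>
    <a class="tag" href="/tag/demo/">demo</a>
  </div>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")
quote = soup.find("div", class_="quote")
print(quote.find("span", class_="text").text)            # An example quote.
print(quote.find("small", class_="author").text)         # Example Author
print([a.text for a in quote.find_all("a", class_="tag")])  # ['example', 'demo']
```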
The site includes pagination using a "Next" button. We can loop through all pages until there is no "next" button.
url = "http://quotes.toscrape.com"

while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    quotes = soup.find_all('div', class_='quote')
    for quote in quotes:
        text = quote.find('span', class_='text').text
        author = quote.find('small', class_='author').text
        print(f"{text} - {author}")
    next_button = soup.find('li', class_='next')
    if next_button:
        next_url = next_button.find('a')['href']
        url = f"http://quotes.toscrape.com{next_url}"
    else:
        break
response = requests.get(url)
if response.status_code != 200:
    print(f"Failed to retrieve the page: {response.status_code}")
try:
    response = requests.get(url)
    response.raise_for_status()  # Raises HTTPError for 4xx/5xx responses
except requests.exceptions.RequestException as e:
    print("Error fetching data:", e)
Some websites block requests without proper headers. Setting a fake User-Agent can help mimic a real browser.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
response = requests.get(url, headers=headers)
import csv

with open('quotes.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Text', 'Author', 'Tags'])
    for quote in quotes:
        text = quote.find('span', class_='text').text
        author = quote.find('small', class_='author').text
        tags = [tag.text for tag in quote.find_all('a', class_='tag')]
        writer.writerow([text, author, ', '.join(tags)])
import pandas as pd

data = []
for quote in quotes:
    text = quote.find('span', class_='text').text
    author = quote.find('small', class_='author').text
    tags = [tag.text for tag in quote.find_all('a', class_='tag')]
    data.append({'Text': text, 'Author': author, 'Tags': ', '.join(tags)})

df = pd.DataFrame(data)
df.to_csv('quotes_pandas.csv', index=False)
quotes = soup.select('div.quote')
for q in quotes:
    print(q.select_one('span.text').text)
Some websites require login credentials. Use requests.Session() to manage login sessions.
session = requests.Session()  # Persists cookies across requests
login_data = {'username': 'your_username', 'password': 'your_password'}
session.post('http://example.com/login', data=login_data)
response = session.get('http://example.com/protected-page')
Always check whether the site allows scraping by reading its robots.txt file:
http://example.com/robots.txt
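The standard library's urllib.robotparser can evaluate those rules programmatically. The sketch below parses a hypothetical robots.txt body inline so it runs offline; in practice you would point the parser at the live file with rp.set_url(...) followed by rp.read().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; normally fetched from
# http://example.com/robots.txt via rp.set_url(...) and rp.read()
robots_txt = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check whether a given user agent may fetch a given URL
print(rp.can_fetch("*", "http://example.com/private/page"))  # False
print(rp.can_fetch("*", "http://example.com/public/page"))   # True
```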
import time
time.sleep(2) # Sleep for 2 seconds between requests
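One way to apply that delay consistently is to wrap it in a small throttling helper. This is an illustrative sketch, not part of the original: polite_get and MIN_DELAY are made-up names, and the real requests.get call is left as a comment so the example runs without network access.

```python
import time

MIN_DELAY = 2.0       # Seconds to wait between consecutive requests
_last_request = None  # Monotonic timestamp of the previous request

def polite_get(url):
    """Sleep if needed so at least MIN_DELAY seconds separate requests."""
    global _last_request
    if _last_request is not None:
        wait = MIN_DELAY - (time.monotonic() - _last_request)
        if wait > 0:
            time.sleep(wait)
    _last_request = time.monotonic()
    # return requests.get(url)   # The real fetch would go here
    return url                   # Placeholder so the sketch runs offline

start = time.monotonic()
for page in ["/page/1/", "/page/2/"]:
    polite_get("http://quotes.toscrape.com" + page)
elapsed = time.monotonic() - start
print(f"Two throttled calls took {elapsed:.1f}s")  # roughly MIN_DELAY apart
```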
Do not collect sensitive data or violate privacy policies. Ensure your scraping complies with local data protection laws like GDPR or CCPA.
Combining Python's requests and BeautifulSoup libraries provides a powerful, flexible, and accessible way to extract data from static websites. This approach is ideal for beginners and intermediate users who want to automate data collection and analysis tasks. Whether you're scraping quotes for inspiration, job listings for research, or tables for analysis, understanding how to use these tools together is a foundational skill in modern data science and automation workflows.
Always remember to respect the legal and ethical boundaries of web scraping. Use public APIs when available, read website terms of use, and never overload a site's server with unnecessary requests. With responsible and efficient use, web scraping can unlock tremendous potential for data extraction and insight generation.
Copyright © 2024 letsupdateskills. All rights reserved.