🧠 Introduction
Web scraping starts with the most essential skill:
extracting data from static HTML pages. These are pages whose content
doesn't change dynamically via JavaScript. If you can view the full data using
the browser’s “View Page Source” feature, it's a static page — and it can
usually be scraped with simple tools like requests and BeautifulSoup.
In this chapter, you'll:
- Fetch a static page with requests
- Parse its HTML with BeautifulSoup
- Extract specific tags and content (titles, dates, links)
- Combine everything into a clean, reusable scraper
- Apply basic error handling and best practices

Let's get started.
🛠️ Tools You’ll Need
You only need two libraries: requests to download pages and BeautifulSoup to parse them. Install both with pip:

pip install requests beautifulsoup4
📄 Sample Static Web Page
Let’s imagine we want to scrape a blog listing page. Here’s
a simplified HTML structure of such a page:
<html>
<head><title>Tech Blog</title></head>
<body>
  <h1>Latest Posts</h1>
  <div class="post">
    <h2>How to Learn Python</h2>
    <p>Published on: 2024-01-01</p>
    <a href="/posts/python">Read More</a>
  </div>
  <div class="post">
    <h2>Web Scraping Basics</h2>
    <p>Published on: 2024-01-05</p>
    <a href="/posts/web-scraping">Read More</a>
  </div>
</body>
</html>
📥 Step 1: Fetch the Page Using Requests
import requests

url = "https://example.com/blog"
headers = {'User-Agent': 'Mozilla/5.0'}

response = requests.get(url, headers=headers)
print(response.status_code)  # 200 means OK
print(response.text)         # Raw HTML content
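A slightly more defensive version of this step is worth knowing. The sketch below (using the same hypothetical example.com URL) adds a request timeout and uses raise_for_status() so that 4xx/5xx responses are treated as errors rather than silently parsed:

```python
import requests

def fetch_html(url, timeout=10):
    """Fetch a page's HTML, returning None on any network or HTTP failure."""
    headers = {'User-Agent': 'Mozilla/5.0'}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # raises on 4xx/5xx status codes
        return response.text
    except requests.exceptions.RequestException as exc:
        print(f"Request failed: {exc}")
        return None

html = fetch_html("https://example.com/blog")
```

Wrapping the call this way keeps the rest of the scraper simple: downstream code only has to check for None.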
🧾 Step 2: Parse HTML with BeautifulSoup
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')

# Title of the page
print(soup.title.text)  # Output: Tech Blog
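BeautifulSoup doesn't care where the HTML comes from: any string works, not just a live response. That makes it easy to experiment offline with the sample page from earlier, as in this minimal sketch:

```python
from bs4 import BeautifulSoup

# The sample blog page, inlined as a plain string
sample_html = """
<html>
<head><title>Tech Blog</title></head>
<body>
<h1>Latest Posts</h1>
<div class="post">
<h2>How to Learn Python</h2>
<p>Published on: 2024-01-01</p>
<a href="/posts/python">Read More</a>
</div>
</body>
</html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
print(soup.title.text)  # Tech Blog
print(soup.h1.text)     # Latest Posts
```

Parsing from strings like this is also how you would unit-test a scraper without hitting the real site.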
📚 Step 3: Extracting Specific Tags and Content
➤ Extracting Post Titles:
titles = soup.find_all('h2')
for title in titles:
    print(title.text)
Output:
How to Learn Python
Web Scraping Basics
➤ Extracting Published Dates:
dates = soup.find_all('p')
for date in dates:
    print(date.text)

Note: find_all('p') matches every <p> on the page. On our simple sample that's fine, but on a real site you should search within each post's <div> instead, as the combined scraper below does.
➤ Extracting Post Links:
links = soup.find_all('a')
for link in links:
    print(link['href'])
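The href values above are relative paths like /posts/python. To turn them into full URLs you can combine them with the page's base URL using the standard library's urljoin, sketched here with the example.com address from earlier:

```python
from urllib.parse import urljoin

base_url = "https://example.com/blog"
relative_links = ["/posts/python", "/posts/web-scraping"]

# urljoin resolves each relative href against the base URL
absolute_links = [urljoin(base_url, href) for href in relative_links]
print(absolute_links)
# ['https://example.com/posts/python', 'https://example.com/posts/web-scraping']
```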
🧠 Understanding HTML Parsing with BeautifulSoup
| Method | Description |
|---|---|
| find() | Finds the first matching tag |
| find_all() | Finds all matching tags |
| select() | Uses CSS selectors to find elements |
| get_text() | Extracts just the text content of a tag |
| attrs or [] | Extracts specific tag attributes (like href) |
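The find_all() and select() rows in the table often overlap: the same elements can usually be reached either way. A small sketch on markup like our sample page's shows the two styles side by side:

```python
from bs4 import BeautifulSoup

html = """
<div class="post"><h2>How to Learn Python</h2></div>
<div class="post"><h2>Web Scraping Basics</h2></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Tag-based search: find the post divs, then the h2 inside each
via_find_all = [h2.get_text()
                for div in soup.find_all('div', class_='post')
                for h2 in div.find_all('h2')]

# CSS-selector search: one descendant selector does the same job
via_select = [h2.get_text() for h2 in soup.select('.post h2')]

print(via_find_all == via_select)  # True
```

Which one to prefer is mostly taste; select() tends to be more compact once queries involve classes and nesting.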
📦 Combine Everything into a Clean Web Scraper
import requests
from bs4 import BeautifulSoup

def scrape_blog_posts(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    posts = soup.find_all('div', class_='post')
    for post in posts:
        title = post.find('h2').text
        date = post.find('p').text
        link = post.find('a')['href']
        print(f"Title: {title}")
        print(f"Date: {date}")
        print(f"Link: {link}\n")

scrape_blog_posts("https://example.com/blog")
📝 Output:
Title: How to Learn Python
Date: Published on: 2024-01-01
Link: /posts/python
Title: Web Scraping Basics
Date: Published on: 2024-01-05
Link: /posts/web-scraping
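In practice it's usually more useful for a scraper to return data than to print it. A variant sketch that separates parsing from fetching and returns a list of dicts (so it can be tested against an HTML string, with no live site needed):

```python
from bs4 import BeautifulSoup

def parse_blog_posts(html):
    """Extract title/date/link from each post div in the page's HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    posts = []
    for post in soup.find_all('div', class_='post'):
        posts.append({
            'title': post.find('h2').text,
            'date': post.find('p').text,
            'link': post.find('a')['href'],
        })
    return posts

sample = """
<div class="post">
  <h2>How to Learn Python</h2>
  <p>Published on: 2024-01-01</p>
  <a href="/posts/python">Read More</a>
</div>
"""
print(parse_blog_posts(sample))
```

The returned list can then be printed, saved to CSV, or fed into a database, without changing the parsing logic.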
📊 Table: Comparison of Parsing Techniques
| Technique | Syntax Example | Use Case |
|---|---|---|
| .find() | soup.find('h2') | Get the first match |
| .find_all() | soup.find_all('a') | Get all <a> tags |
| .select() | soup.select('.post h2') | CSS-style querying |
| .get_text() | tag.get_text() | Extract plain text |
| tag['href'] | link['href'] | Access attribute (e.g., URL) |
🚧 Error Handling and Best Practices
# Check for valid response
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
else:
    print("Failed to retrieve page")

# Always include headers
headers = {'User-Agent': 'Mozilla/5.0'}
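When a scraper makes several requests to the same site, a requests.Session lets you set the headers once and reuses the underlying connection. A minimal sketch:

```python
import requests

# Create one session and attach the header to it once
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})

# Every request made through this session now carries the header
# automatically, e.g. session.get("https://example.com/blog")
print(session.headers['User-Agent'])  # Mozilla/5.0
```

Combine this with a short time.sleep() between requests to avoid hammering the server.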
🧑‍💻 Bonus Challenge
Scrape all product titles and prices from this HTML:
<div class="product">
  <h3 class="title">Laptop</h3>
  <span class="price">$1200</span>
</div>
<div class="product">
  <h3 class="title">Mouse</h3>
  <span class="price">$25</span>
</div>
Tip: Use .find_all('div', class_='product') and then extract
h3 and span.
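Try it yourself first; one possible solution is sketched below, with the challenge HTML inlined as a string:

```python
from bs4 import BeautifulSoup

html = """
<div class="product">
  <h3 class="title">Laptop</h3>
  <span class="price">$1200</span>
</div>
<div class="product">
  <h3 class="title">Mouse</h3>
  <span class="price">$25</span>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
products = []
for product in soup.find_all('div', class_='product'):
    name = product.find('h3', class_='title').text
    price = product.find('span', class_='price').text
    products.append((name, price))

print(products)  # [('Laptop', '$1200'), ('Mouse', '$25')]
```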
🧠 Summary
By the end of this chapter, you should be able to:
- Recognize a static page (its full data is visible in "View Page Source")
- Fetch it with requests and check the response status
- Parse the HTML with BeautifulSoup and extract titles, dates, and links
- Combine the steps into a clean, reusable scraper with basic error handling

❓ FAQs

Q: Which libraries are commonly used for web scraping in Python?
A: The most popular ones are requests, BeautifulSoup, lxml, Selenium, and more recently Playwright for dynamic websites.

Q: What is the difference between BeautifulSoup and Selenium?
A: BeautifulSoup is used for parsing static HTML content, while Selenium is used for scraping JavaScript-heavy websites by simulating a browser.

Q: How do I scrape a site with multiple pages?
A: Use looped requests where you modify URL parameters (e.g., ?page=2) or parse "next" links from the HTML dynamically.

Q: Am I always allowed to scrape a website?
A: Not always. You should always check the website's robots.txt file and Terms of Service. Many sites restrict scraping or require permission.

Q: What errors do scrapers commonly run into?
A: Typical problems include 403 Forbidden, 404 Not Found, CAPTCHAs, and broken selectors due to dynamic content.

Q: Can websites detect and block scrapers?
A: Yes. Websites may detect bots through headers, request frequency, or missing JavaScript execution. Using User-Agent headers and delays helps.

Q: How do I scrape pages that load content as you scroll?
A: These require using Selenium or Playwright to simulate scroll events, wait for content to load, and then extract the data.
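The pagination approach mentioned above (modifying a ?page=2 style query parameter) can be sketched like this, assuming a hypothetical page parameter on the target site:

```python
def page_urls(base_url, pages):
    """Build URLs for a site paginated via a ?page=N query parameter."""
    return [f"{base_url}?page={n}" for n in range(1, pages + 1)]

urls = page_urls("https://example.com/blog", 3)
print(urls)
# ['https://example.com/blog?page=1', 'https://example.com/blog?page=2',
#  'https://example.com/blog?page=3']

# Each URL would then be fetched and parsed in turn, ideally with a
# short time.sleep() between requests to stay polite.
```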