Most useful datasets aren’t found on a single page. Think
product listings on Amazon, articles on a blog, or job postings on LinkedIn.
These use pagination to divide results across multiple pages.
In this chapter, you’ll learn how to detect pagination patterns, scrape results spread across multiple pages, and store everything in a CSV file.
📄 What is Pagination?
Pagination splits content into parts, usually with:

- a URL parameter such as ?page=2 or &offset=10
- a "Next" button or link
- a total page count such as "Page 1 of 10"
🔍 Example 1: Pagination via URL Parameter
Let’s say we’re scraping blog articles from a site like:
```
https://example.com/blog?page=1
https://example.com/blog?page=2
...
```
Each page shows a list of posts like:

```html
<div class="post">
  <h2>Post Title</h2>
  <p>Published: 2024-01-01</p>
</div>
```
✅ Step-by-Step Scraper for Multiple Pages
```python
import requests
from bs4 import BeautifulSoup
import time

base_url = "https://example.com/blog?page="

for page in range(1, 6):  # scrape pages 1 to 5
    print(f"Scraping page {page}...")
    url = base_url + str(page)
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    posts = soup.find_all('div', class_='post')
    for post in posts:
        title = post.find('h2').text
        date = post.find('p').text
        print(f"{title} | {date}")

    time.sleep(1)  # be polite!
```
📊 Output (Sample):

```
Scraping page 1...
Learn Python Fast | Published: 2024-01-01
Master Data Science | Published: 2024-01-02
...
Scraping page 2...
...
```
🔄 Example 2: Paginate Until No More Results
Some sites don’t limit the number of pages explicitly.
Instead, you keep scraping until no results appear.
```python
page = 1

while True:
    url = f"https://example.com/blog?page={page}"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    posts = soup.find_all('div', class_='post')
    if not posts:
        break  # no more results

    for post in posts:
        print(post.find('h2').text)

    page += 1
    time.sleep(1)
```
🧠 Detecting Pagination Patterns
| Clue | Example |
| --- | --- |
| URL parameter | `?page=2`, `?start=20`, `&offset=10` |
| "Next" button | `<a href="/page/2">Next</a>` |
| Total pages shown | `<span>Page 1 of 10</span>` |
Inspect the pagination area in your browser’s DevTools, or look at the network requests.
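The "Page 1 of 10" clue can tell you up front how many pages to loop over. Here is a minimal sketch, assuming the total appears in text like `Page 1 of 10` (the `total_pages` helper is hypothetical, not part of any library):

```python
import re

def total_pages(html):
    """Hypothetical helper: extract the total page count from markup
    containing text like '<span>Page 1 of 10</span>'."""
    match = re.search(r'Page\s+\d+\s+of\s+(\d+)', html)
    return int(match.group(1)) if match else None

print(total_pages('<span>Page 1 of 10</span>'))   # -> 10
print(total_pages('<span>no pager here</span>'))  # -> None
```

You could then loop with `range(1, total + 1)` instead of a hard-coded `range(1, 6)`.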
🧑‍💻 Example 3: Follow “Next” Links Dynamically
Some sites don’t expose predictable page numbers in the URL; instead, each page links to the next one. In that case, follow the “Next” link until there isn’t one.
```python
url = "https://example.com/blog"

while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    for post in soup.select('.post h2'):
        print(post.text)

    # In modern BeautifulSoup, use string= (the older text= argument is deprecated)
    next_btn = soup.find('a', string='Next')
    if next_btn:
        url = "https://example.com" + next_btn['href']
    else:
        url = None

    time.sleep(1)
```
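Concatenating `"https://example.com"` with `next_btn['href']` only works when the href is root-relative. The standard-library `urllib.parse.urljoin` resolves relative, root-relative, and absolute hrefs uniformly; a small sketch:

```python
from urllib.parse import urljoin

page_url = "https://example.com/blog"

# Root-relative href, as in the example above:
print(urljoin(page_url, "/page/2"))  # -> https://example.com/page/2

# An absolute href is returned unchanged:
print(urljoin(page_url, "https://example.com/blog?page=3"))
```

In the loop, you would replace the string concatenation with `url = urljoin(url, next_btn['href'])`.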
📁 Store Results to CSV
```python
import csv
import requests
from bs4 import BeautifulSoup

with open('posts.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Date'])

    for page in range(1, 6):
        url = f"https://example.com/blog?page={page}"
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        posts = soup.find_all('div', class_='post')
        for post in posts:
            title = post.find('h2').text
            date = post.find('p').text
            writer.writerow([title, date])
```
💡 Best Practices

| Tip | Why it matters |
| --- | --- |
| Use delays (`time.sleep()`) | Prevents server overload |
| Add headers (User-Agent) | Reduces bot detection |
| Stop if no data is returned | Prevents infinite loops |
| Save frequently | Prevents data loss if interrupted |
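A fixed `time.sleep(1)` after every page is the simplest form of throttling, but it ignores how long the request itself took. Below is a sketch of a hypothetical `Throttle` helper (not part of any library) that enforces a minimum gap between consecutive requests:

```python
import time

class Throttle:
    """Hypothetical helper: enforce a minimum interval between calls."""
    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()

throttle = Throttle(min_interval=0.2)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # in a real scraper, requests.get(...) would follow
elapsed = time.monotonic() - start
print(f"3 calls took at least {elapsed:.2f}s")  # two enforced gaps -> >= 0.4s
```

Calling `throttle.wait()` before each `requests.get()` gives consistent pacing even when some responses are slow.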
⚠️ Common Mistakes

| Mistake | How to Fix |
| --- | --- |
| Not handling empty pages | Use `if not results: break` |
| Hitting rate limits | Add `time.sleep()`, use proxies |
| Scraping too many pages too fast | Add throttling, scrape in batches |
| Not testing the pagination pattern | Always manually test at least 3 pages |
📦 Real-World Use Cases

- Product listings on e-commerce sites like Amazon
- Articles on a blog
- Job postings on sites like LinkedIn
✅ Summary
In this chapter, you learned:

- What pagination is and how to recognize it (URL parameters, "Next" links, page counts)
- How to scrape a known range of pages by incrementing a URL parameter
- How to loop until no more results appear
- How to follow "Next" links dynamically
- How to store scraped results in a CSV file
- Best practices: delays, User-Agent headers, and clear stopping conditions
❓ FAQ

**Which Python libraries are commonly used for web scraping?**
The most popular ones are requests, BeautifulSoup, lxml, Selenium, and more recently Playwright for dynamic websites.

**When should I use BeautifulSoup versus Selenium?**
BeautifulSoup is used for parsing static HTML content, while Selenium is used for scraping JavaScript-heavy websites by simulating a browser.

**How do I scrape paginated content?**
Use looped requests where you modify URL parameters (e.g., ?page=2), or parse "next" links from the HTML dynamically.

**Is web scraping always allowed?**
Not always. You should always check the website's robots.txt file and Terms of Service. Many sites restrict scraping or require permission.

**What errors should I expect?**
Typical errors include 403 Forbidden, 404 Not Found, CAPTCHAs, and broken selectors due to dynamic content.

**Can websites detect my scraper?**
Yes. Websites may detect bots through headers, request frequency, or missing JavaScript execution. Using User-Agent headers and delays helps.

**How do I scrape infinite-scroll pages?**
These require Selenium or Playwright to simulate scroll events, wait for content to load, and then extract the data.