Top 4 Web Scraping Interview Questions in Python (With Solutions and Pro Tips)


Chapter 5: Scraping Paginated Data (Multiple Pages)


Most useful datasets aren’t found on a single page. Think product listings on Amazon, articles on a blog, or job postings on LinkedIn. These use pagination to divide results across multiple pages.

In this chapter, you’ll learn:

  • How to detect and work with URL-based pagination
  • How to scrape multiple pages with a for loop
  • How to use BeautifulSoup to combine results
  • When to stop pagination (next button or last page)
  • Best practices for looping, throttling, and saving multi-page data

📄 What is Pagination?

Pagination splits content into parts, usually with:

  • Page numbers in URLs
    E.g., example.com/page/1, page=2, etc.
  • "Next" buttons or links
  • Load more buttons (infinite scroll – dynamic; handled in another chapter)
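To check which URL-based pattern a site uses, you can inspect the query string with Python's standard library. A quick sketch (the URL is illustrative):

```python
from urllib.parse import urlparse, parse_qs

# Break a paginated URL into its query parameters
params = parse_qs(urlparse("https://example.com/blog?page=2").query)
print(params.get("page"))  # parse_qs returns a list of values per parameter
```

If `page`, `start`, or `offset` shows up here and changes as you click through pages, you have URL-based pagination.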

🔍 Example 1: Pagination via URL Parameter

Let’s say we’re scraping blog articles from a site like:

https://example.com/blog?page=1

https://example.com/blog?page=2

...

Each page shows a list of posts like:

<div class="post">
  <h2>Post Title</h2>
  <p>Published: 2024-01-01</p>
</div>
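Before looping over pages, it helps to confirm your selectors work on a single post. A minimal check against the snippet above (assuming BeautifulSoup is installed):

```python
from bs4 import BeautifulSoup

# The sample post markup from above, as a string
html = '''<div class="post">
  <h2>Post Title</h2>
  <p>Published: 2024-01-01</p>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
post = soup.find('div', class_='post')
print(post.find('h2').text)  # the post title
print(post.find('p').text)   # the publication line
```

Once these selectors return what you expect, the same code works inside the page loop.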


Step-by-Step Scraper for Multiple Pages

import requests
from bs4 import BeautifulSoup
import time

base_url = "https://example.com/blog?page="

for page in range(1, 6):  # scrape pages 1 to 5
    print(f"Scraping page {page}...")
    url = base_url + str(page)
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)

    soup = BeautifulSoup(response.text, 'html.parser')
    posts = soup.find_all('div', class_='post')

    for post in posts:
        title = post.find('h2').text
        date = post.find('p').text
        print(f"{title} | {date}")

    time.sleep(1)  # be polite!


📊 Output (Sample):

Scraping page 1...
Learn Python Fast | Published: 2024-01-01
Master Data Science | Published: 2024-01-02
...
Scraping page 2...
...
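The URL-building step in the loop above can be factored into a small helper, which is handy when you reuse the same pagination logic across scrapers (`page_urls` is a hypothetical name, not part of any library):

```python
def page_urls(base_url, first, last):
    """Yield the paginated URLs that the for loop above requests."""
    for page in range(first, last + 1):
        yield base_url + str(page)

urls = list(page_urls("https://example.com/blog?page=", 1, 3))
print(urls)
```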


🔄 Example 2: Paginate Until No More Results

Some sites don’t limit the number of pages explicitly. Instead, you keep scraping until no results appear.

page = 1
headers = {'User-Agent': 'Mozilla/5.0'}

while True:
    url = f"https://example.com/blog?page={page}"
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    posts = soup.find_all('div', class_='post')
    if not posts:
        break  # no more results

    for post in posts:
        print(post.find('h2').text)

    page += 1
    time.sleep(1)
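An open-ended `while True` loop is risky if a site keeps echoing results past the last page. A safety cap guarantees termination; the limit and the helper name below are assumptions you would tune per site:

```python
MAX_PAGES = 100  # arbitrary safety cap (an assumption), tune per site

def should_continue(posts, page, max_pages=MAX_PAGES):
    """Keep paginating only while results arrive and the cap isn't hit."""
    return bool(posts) and page <= max_pages

print(should_continue([], 3))        # empty page: stop
print(should_continue(["post"], 3))  # results and under the cap: continue
```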


🧠 Detecting Pagination Patterns

  • URL parameter — ?page=2, ?start=20, &offset=10
  • "Next" button — <a href="/page/2">Next</a>
  • Total pages shown — <span>Page 1 of 10</span>

Inspect the pagination area in browser DevTools or look at the network requests.
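When the site shows a "Page 1 of 10" label, you can read the total straight out of that text with a regular expression (`total_pages` is a hypothetical helper):

```python
import re

def total_pages(label):
    """Extract N from text like 'Page 1 of N'; return None if absent."""
    match = re.search(r'Page\s+\d+\s+of\s+(\d+)', label)
    return int(match.group(1)) if match else None

print(total_pages("Page 1 of 10"))  # the total page count
```

Knowing the total up front lets you use a bounded `for` loop instead of probing for an empty page.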


🧑‍💻 Example 3: Follow “Next” Links Dynamically

Some sites don’t expose a predictable URL pattern; instead, each page links to the next one.

url = "https://example.com/blog"

while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    for post in soup.select('.post h2'):
        print(post.text)

    # 'string=' replaces the deprecated 'text=' argument in newer BeautifulSoup
    next_btn = soup.find('a', string='Next')
    if next_btn:
        url = "https://example.com" + next_btn['href']
    else:
        url = None  # no "Next" link: we've reached the last page

    time.sleep(1)
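One caveat with the string concatenation above: it breaks if the `href` is already an absolute URL. `urllib.parse.urljoin` from the standard library handles relative and absolute hrefs uniformly:

```python
from urllib.parse import urljoin

base = "https://example.com/blog"
# A relative href is resolved against the base URL
print(urljoin(base, "/page/2"))
# An absolute href is returned unchanged
print(urljoin(base, "https://example.com/page/3"))
```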


📁 Store Results to CSV

import csv
import requests
from bs4 import BeautifulSoup

with open('posts.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Date'])

    for page in range(1, 6):
        url = f"https://example.com/blog?page={page}"
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        posts = soup.find_all('div', class_='post')
        for post in posts:
            title = post.find('h2').text
            date = post.find('p').text
            writer.writerow([title, date])

💡 Best Practices

  • Use delays (time.sleep()) — prevents server overload
  • Add headers (User-Agent) — reduces bot detection
  • Stop if no data is returned — prevents infinite loops
  • Save frequently — prevents data loss if interrupted
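The delay advice above pairs naturally with retries: transient errors often succeed on a second attempt after a pause. A sketch of a retry wrapper (`get_with_retries` is a hypothetical helper, not part of requests):

```python
import time

def get_with_retries(fetch, url, retries=3, backoff=1.0):
    """Call fetch(url); on failure, wait increasingly longer and retry."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            time.sleep(backoff * (attempt + 1))

# With requests, usage would be: get_with_retries(requests.get, url)
```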


⚠️ Common Mistakes

  • Not handling empty pages — use if not results: break
  • Hitting rate limits — add time.sleep(), use proxies
  • Scraping too many pages too fast — add throttling, scrape in batches
  • Not testing the pagination pattern — always manually test at least 3 pages
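"Scrape in batches" from the table above can look like this: split the page numbers into chunks, scrape one chunk, then pause longer before the next (a sketch; the batch size is an assumption you tune per site):

```python
def batches(pages, size):
    """Split a list of page numbers into fixed-size chunks."""
    for i in range(0, len(pages), size):
        yield pages[i:i + size]

for chunk in batches(list(range(1, 8)), 3):
    print(chunk)  # scrape this chunk, then pause longer between chunks
```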


📦 Real-World Use Cases

  • Scraping all product listings from e-commerce sites
  • Extracting all job posts from a hiring board
  • Collecting all articles in a news archive
  • Gathering all course listings from an online platform

Summary

In this chapter, you learned:


  • How to scrape multiple pages using loops
  • How to detect and use pagination in URLs
  • How to follow “Next” links dynamically
  • Best practices for performance, ethics, and error handling
  • How to save multi-page data into CSV


FAQs


1. What are the most commonly used Python libraries for web scraping?

The most popular ones are requests, BeautifulSoup, lxml, Selenium, and recently Playwright for dynamic websites.

2. What’s the difference between BeautifulSoup and Selenium?

BeautifulSoup is used for parsing static HTML content, while Selenium is used for scraping JavaScript-heavy websites by simulating a browser.

3. How do I handle pagination while scraping a website?

Use looped requests where you modify URL parameters (e.g., ?page=2) or parse "next" links from HTML dynamically.

4. Is it legal to scrape data from any website?

Not always. You should always check the website's robots.txt file and Terms of Service. Many sites restrict scraping or require permission.
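Python's standard library can check robots.txt rules for you. This sketch parses rules supplied as strings rather than fetching them from a site (the rules shown are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(rp.can_fetch("*", "https://example.com/blog"))       # allowed
print(rp.can_fetch("*", "https://example.com/private/x"))  # disallowed
```

In a real scraper you would call `rp.set_url(".../robots.txt")` and `rp.read()` to fetch the live rules before crawling.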

5. What are some common errors encountered during web scraping?

Some typical errors include 403 Forbidden, 404 Not Found, Captchas, and broken selectors due to dynamic content.

6. How do I scrape content behind a login wall?

Use a session with the requests library to log in, or automate the login with Selenium if JavaScript is involved.

7. Can web scraping be detected by the target site?

Yes. Websites may detect bots through headers, request frequency, or missing JavaScript execution. Using User-Agent headers and delays helps.

8. How do you scrape data from infinite scrolling pages?

These require using Selenium or Playwright to simulate scroll events, wait for content to load, and then extract the data.

Tutorials are for educational purposes only, with no guarantees of comprehensiveness or error-free content; TuteeHUB disclaims liability for outcomes from reliance on the materials, recommending verification with official sources for critical applications.