Top 4 Web Scraping Interview Questions in Python (With Solutions and Pro Tips)


Chapter 2: Scraping Static HTML Pages Using Requests and BeautifulSoup

🧠 Introduction

Web scraping starts with the most essential skill: extracting data from static HTML pages. These are pages whose content doesn't change dynamically via JavaScript. If you can view the full data using the browser’s “View Page Source” feature, it's a static page — and it can usually be scraped with simple tools like requests and BeautifulSoup.

In this chapter, you’ll:

  • Understand how HTML is structured
  • Learn how to use requests to fetch web pages
  • Use BeautifulSoup to parse HTML
  • Extract tags, attributes, text, tables, and links
  • Build a basic web scraper that can be extended for real-world use

Let’s get started.


🛠️ Tools You’ll Need

pip install requests beautifulsoup4


📄 Sample Static Web Page

Let’s imagine we want to scrape a blog listing page. Here’s a simplified HTML structure of such a page:

<html>
  <head><title>Tech Blog</title></head>
  <body>
    <h1>Latest Posts</h1>
    <div class="post">
      <h2>How to Learn Python</h2>
      <p>Published on: 2024-01-01</p>
      <a href="/posts/python">Read More</a>
    </div>
    <div class="post">
      <h2>Web Scraping Basics</h2>
      <p>Published on: 2024-01-05</p>
      <a href="/posts/web-scraping">Read More</a>
    </div>
  </body>
</html>


📥 Step 1: Fetch the Page using Requests

import requests

url = "https://example.com/blog"
headers = {'User-Agent': 'Mozilla/5.0'}

response = requests.get(url, headers=headers)

print(response.status_code)  # 200 means OK
print(response.text)         # Raw HTML content


🧾 Step 2: Parse HTML with BeautifulSoup

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')

# Title of the page
print(soup.title.text)  # Output: Tech Blog


📚 Step 3: Extracting Specific Tags and Content

Extracting Post Titles:

titles = soup.find_all('h2')
for title in titles:
    print(title.text)

Output:

How to Learn Python
Web Scraping Basics

Extracting Published Dates:

dates = soup.find_all('p')
for date in dates:
    print(date.text)

Note: find_all('p') returns every <p> on the page, not just the post dates; the full scraper below scopes each search to its parent post div, which is the safer pattern.

Extracting Post Links:

links = soup.find_all('a')
for link in links:
    print(link['href'])


🧠 Understanding HTML Parsing with BeautifulSoup

Method        Description
find()        Finds the first matching tag
find_all()    Finds all matching tags
select()      Uses CSS selectors to find elements
get_text()    Extracts just the text content of a tag
attrs or []   Extracts specific tag attributes (like href)
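To see these methods side by side, here is a minimal sketch that parses a small inline HTML string (so it runs without any network request):

```python
from bs4 import BeautifulSoup

html = """
<div class="post">
  <h2>How to Learn Python</h2>
  <a href="/posts/python">Read More</a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

print(soup.find('h2'))              # first matching tag
print(soup.find_all('a'))           # list of all matching tags
print(soup.select('.post h2'))      # CSS selector query
print(soup.find('h2').get_text())   # just the text: How to Learn Python
print(soup.find('a')['href'])       # attribute access: /posts/python
```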


📦 Combine Everything into a Clean Web Scraper

import requests
from bs4 import BeautifulSoup

def scrape_blog_posts(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)

    soup = BeautifulSoup(response.text, 'html.parser')
    posts = soup.find_all('div', class_='post')

    for post in posts:
        title = post.find('h2').text
        date = post.find('p').text
        link = post.find('a')['href']
        print(f"Title: {title}")
        print(f"Date: {date}")
        print(f"Link: {link}\n")

scrape_blog_posts("https://example.com/blog")


📝 Output:

Title: How to Learn Python
Date: Published on: 2024-01-01
Link: /posts/python

Title: Web Scraping Basics
Date: Published on: 2024-01-05
Link: /posts/web-scraping


📊 Table: Comparison of Parsing Techniques

Technique     Syntax Example            Use Case
.find()       soup.find('h2')           Get the first match
.find_all()   soup.find_all('a')        Get all <a> tags
.select()     soup.select('.post h2')   CSS-style querying
.get_text()   tag.get_text()            Extract plain text
tag['href']   link['href']              Access attribute (e.g., URL)


🚧 Error Handling and Best Practices

# Check for a valid response before parsing
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
else:
    print("Failed to retrieve page")

# Always include headers
headers = {'User-Agent': 'Mozilla/5.0'}
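A slightly more defensive pattern, shown here as a sketch rather than the chapter's canonical code, wraps the request in try/except and calls response.raise_for_status(), which turns 4xx/5xx responses into exceptions so every failure mode funnels into one handler:

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page HTML, or None if the request fails for any reason."""
    headers = {'User-Agent': 'Mozilla/5.0'}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # raises for 4xx/5xx status codes
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Failed to retrieve {url}: {e}")
        return None

# The .invalid TLD never resolves, so this prints an error and returns None
print(fetch_html("http://nonexistent.invalid/"))
```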


🧑‍💻 Bonus Challenge

Scrape all product titles and prices from this HTML:

<div class="product">
  <h3 class="title">Laptop</h3>
  <span class="price">$1200</span>
</div>
<div class="product">
  <h3 class="title">Mouse</h3>
  <span class="price">$25</span>
</div>

Tip: Use .find_all('div', class_='product') and then extract h3 and span.
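If you want to check your answer, here is one possible solution sketch, parsing the challenge HTML from an inline string so it runs without a network request:

```python
from bs4 import BeautifulSoup

html = """
<div class="product">
  <h3 class="title">Laptop</h3>
  <span class="price">$1200</span>
</div>
<div class="product">
  <h3 class="title">Mouse</h3>
  <span class="price">$25</span>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
products = []
for product in soup.find_all('div', class_='product'):
    title = product.find('h3', class_='title').get_text()
    price = product.find('span', class_='price').get_text()
    products.append((title, price))

print(products)  # [('Laptop', '$1200'), ('Mouse', '$25')]
```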


🧠 Summary

By the end of this chapter, you should be able to:

  • Use requests to fetch static HTML content
  • Parse HTML using BeautifulSoup
  • Extract tag content, links, and attributes
  • Build a simple scraper that works on static websites




FAQs


1. What are the most commonly used Python libraries for web scraping?

The most popular ones are requests, BeautifulSoup, lxml, Selenium, and recently Playwright for dynamic websites.

2. What’s the difference between BeautifulSoup and Selenium?

BeautifulSoup is used for parsing static HTML content, while Selenium is used for scraping JavaScript-heavy websites by simulating a browser.

3. How do I handle pagination while scraping a website?

Use looped requests where you modify URL parameters (e.g., ?page=2) or parse "next" links from HTML dynamically.
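As a sketch of the URL-parameter approach (the base URL and the ?page= parameter name are placeholders; inspect the real site's pagination to find the actual ones). Building the URLs is kept separate from fetching so the logic is easy to verify without network access:

```python
# Build the sequence of page URLs for a paginated listing.
def page_urls(base_url, last_page):
    return [f"{base_url}?page={n}" for n in range(1, last_page + 1)]

urls = page_urls("https://example.com/blog", 3)
print(urls)
# ['https://example.com/blog?page=1',
#  'https://example.com/blog?page=2',
#  'https://example.com/blog?page=3']

# In a real scraper you would then loop over these URLs:
# for url in urls:
#     response = requests.get(url, headers=headers)
#     ...parse each page with BeautifulSoup...
```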

4. Is it legal to scrape data from any website?

Not always. You should always check the website's robots.txt file and Terms of Service. Many sites restrict scraping or require permission.

5. What are some common errors encountered during web scraping?

Some typical errors include 403 Forbidden, 404 Not Found, Captchas, and broken selectors due to dynamic content.

6. How do I scrape content behind a login wall?

Use a session with the requests library to log in, or automate the login with Selenium if JavaScript is involved.
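A minimal sketch of the session approach (the login URL and form field names below are hypothetical; inspect the real login form to find them). A requests.Session persists cookies and headers across requests, so a successful login carries over into later page fetches:

```python
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})

# Hypothetical login endpoint and form fields -- check the site's actual form.
login_data = {'username': 'me', 'password': 'secret'}
# session.post("https://example.com/login", data=login_data)

# Any later request through the same session reuses the login cookies:
# response = session.get("https://example.com/account")

# The session also keeps its headers for every request it makes:
print(session.headers['User-Agent'])  # Mozilla/5.0
```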

7. Can web scraping be detected by the target site?

Yes. Websites may detect bots through headers, request frequency, or missing JavaScript execution. Using User-Agent headers and delays helps.

8. How do you scrape data from infinite scrolling pages?

These require using Selenium or Playwright to simulate scroll events, wait for content to load, and then extract the data.
