Web Scraping with Beautiful Soup: A Beginner’s Guide

Web scraping is the process of extracting data from a website. It can be used for a variety of purposes, such as gathering market data, tracking prices, or even just keeping up with the latest news.

There are many ways to scrape the web, but one of the most popular is Beautiful Soup, a Python library that makes it easy to parse HTML and XML documents.

In this guide, we’ll show you how to use Beautiful Soup to scrape a single web page. We’ll also cover some of the basics of web scraping, such as how to find the data you want to extract and how to format it for your needs.

Getting Started

The first thing you’ll need to do is install Beautiful Soup. The examples in this guide also use the requests library to download pages. You can install both using the following command:

pip install beautifulsoup4 requests

Once you have Beautiful Soup installed, you can start scraping websites.
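
Before touching a live site, it can help to confirm that the install works. The sketch below parses a tiny made-up HTML string (the page content is an assumption for illustration, not taken from any real site) and reads two values back out of it:

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only to confirm the install works.
sample_html = (
    "<html><head><title>Test page</title></head>"
    "<body><p>Hello, soup!</p></body></html>"
)

# "html.parser" is the parser bundled with Python, so nothing else is needed.
soup = BeautifulSoup(sample_html, "html.parser")

page_title = soup.title.string        # text inside <title>
first_paragraph = soup.p.get_text()   # text inside the first <p>

print(page_title)
print(first_paragraph)
```

If both lines print without an ImportError, Beautiful Soup is ready to use.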

Finding the Data You Want

The first step in any web scraping project is to find the data you want to extract. You can do this by looking at the HTML or XML document of the website you’re scraping.
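
One way to explore a document is to list the labels it contains before deciding what to extract. The sketch below does this on a small hypothetical table (the sample HTML is an assumption made for illustration; real pages are much larger and may be structured differently):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a page's summary table.
sample_html = """
<table class="infobox">
  <tr><th scope="row">Country</th><td>England</td></tr>
  <tr><th scope="row">Population</th><td>9,787,426 (2023 est.)</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Collect every row-header label so we can see which fields are available.
row_labels = [th.get_text(strip=True) for th in soup.find_all("th", scope="row")]

print(row_labels)
```

Scanning the labels this way shows which rows exist and how they are spelled, which is what a later search has to match.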

In our example, we’re going to scrape the Wikipedia page for the city of London: https://en.wikipedia.org/wiki/London.

If you view the source of this page (in most browsers, right-click and choose "View Page Source"), you’ll find the city’s population in a table row like the following:

HTML
<th scope="row">Population</th>
<td>9,787,426 (2023 est.)</td>

To extract this data using Beautiful Soup, we can use the following code:

Python
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/London"

response = requests.get(url)
html_content = response.content

soup = BeautifulSoup(html_content, "html.parser")

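# Find the <th> whose text is exactly "Population", then read the <td> next to it.
# (The string= argument replaces the older, deprecated text= argument.)
population = soup.find("th", string="Population").find_next_sibling("td").text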

print(population)

This code will print the following output:

9,787,426 (2023 est.)
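
Chained lookups like the one above raise an AttributeError if the page does not contain the expected element, because find() returns None when nothing matches. A more defensive version might look like the sketch below. The helper name find_row_value is hypothetical, and the sample HTML simply repeats the element shown earlier so the example runs without a network connection:

```python
from bs4 import BeautifulSoup


def find_row_value(soup, label):
    """Return the text of the <td> that follows the <th> labelled `label`, or None."""
    header = soup.find("th", string=label)
    if header is None:
        return None  # no such row header on this page
    cell = header.find_next_sibling("td")
    if cell is None:
        return None  # header exists but has no data cell beside it
    return cell.get_text(strip=True)


# Offline copy of the element from the example, wrapped in a table row.
sample_html = """<table><tr><th scope="row">Population</th>
<td>9,787,426 (2023 est.)</td></tr></table>"""

soup = BeautifulSoup(sample_html, "html.parser")

print(find_row_value(soup, "Population"))
print(find_row_value(soup, "Elevation"))  # not in the sample, so None
```

Returning None instead of crashing lets the calling code decide what to do when a page layout changes.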

Formatting the Data

The data you extract from a website may not be in the format you need. In our example, the population data is in a string format. If we want to use this data in a spreadsheet or database, we’ll need to format it as a number.

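We can do this using the following code. The raw text contains thousands separators and a trailing note, so both must be removed before calling int():

Python
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/London"

response = requests.get(url)
html_content = response.content

soup = BeautifulSoup(html_content, "html.parser")

population = soup.find("th", string="Population").find_next_sibling("td").text

# Keep the leading number ("9,787,426"), dropping the "(2023 est.)" note,
# then remove the commas so int() can parse it.
population = int(population.split()[0].replace(",", ""))

print(population)

This code will print the following output:

9787426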
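
Since the goal is to use the number in a spreadsheet or database, the cleanup step can be wrapped in a small helper and the result written out as CSV. This is a minimal sketch using only the standard library. The function name parse_population and the column names are assumptions chosen for illustration, and the CSV is written to an in-memory buffer rather than a file:

```python
import csv
import io
import re


def parse_population(raw_text):
    """Return the first comma-grouped integer in raw_text, or None if there is none."""
    match = re.search(r"\d[\d,]*", raw_text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


population = parse_population("9,787,426 (2023 est.)")

# Write one header row and one data row to an in-memory CSV.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["city", "population"])
writer.writerow(["London", population])
csv_text = buffer.getvalue()

print(csv_text)
```

To save to disk instead, the same writer can be pointed at a file opened with open("population.csv", "w", newline="").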

Summary

In this guide, we showed you how to use Beautiful Soup to scrape a single web page. We also covered some of the basics of web scraping, such as how to find the data you want to extract and how to format it for your needs.

With a little practice, you’ll be able to use Beautiful Soup to extract data from many other websites.

As an extra, I wrote code for scraping trending news from detik.com and kompas.com here

