Web Scraping

Web scraping is the process of extracting data from a website. It can be used for a variety of purposes, such as gathering market data, tracking prices, or even just keeping up with the latest news.

There are many ways to scrape the web, and one of the most popular is Beautiful Soup, a Python library that makes it easy to parse HTML and XML documents.

In this guide, we’ll show you how to use Beautiful Soup to scrape a simple web page. We’ll also cover some of the basics of web scraping, such as how to find the data you want to extract and how to convert it into the format you need.

Getting Started

The first thing you’ll need to do is install Beautiful Soup, along with the requests library that the examples below use to download pages. You can do this using the following command:

pip install beautifulsoup4 requests

Once you have Beautiful Soup installed, you can start scraping websites.

Finding the Data You Want

The first step in any web scraping project is to find the data you want to extract. You can do this by looking at the HTML or XML document of the website you’re scraping.
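One way to look at a document's structure from Python is to parse it and list the elements of interest. The sketch below uses a made-up HTML fragment (an assumption for illustration, not the real markup of any page) and pairs each table header with the data cell beside it:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for illustration
SAMPLE_HTML = """
<table>
  <tr><th scope="row">Country</th><td>England</td></tr>
  <tr><th scope="row">Population</th><td>9,787,426 (2023 est.)</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Collect (header, value) pairs for every header that has a data cell next to it
rows = []
for th in soup.find_all("th"):
    td = th.find_next_sibling("td")
    if td is not None:
        rows.append((th.get_text(strip=True), td.get_text(strip=True)))

for label, value in rows:
    print(f"{label}: {value}")
```

Listing the headers this way makes it easier to spot which label sits next to the value you are after.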

In our example, we’re going to scrape the Wikipedia page for the city of London: https://en.wikipedia.org/wiki/London.

If you open the HTML document of this page, you’ll see that the city’s population is located in the following element:

HTML
<th scope="row">Population</th>
<td>9,787,426 (2023 est.)</td>

To extract this data using Beautiful Soup, we can use the following code:

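Python
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/London"

# Download the page and stop early if the request failed
response = requests.get(url)
response.raise_for_status()
html_content = response.content

# Parse the HTML with Python's built-in parser
soup = BeautifulSoup(html_content, "html.parser")

# Find the "Population" header cell, then read the data cell next to it.
# `string=` replaces the older `text=` argument in current Beautiful Soup versions.
population = soup.find("th", string="Population").find_next_sibling("td").text

print(population)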

This code will print the following output:

9,787,426 (2023 est.)
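Keep in mind that find returns None when no element matches, in which case a chained call like the one above raises an AttributeError. Here is a minimal sketch of a more defensive lookup. It parses the element shown earlier from an inline string, so it runs without a network connection. The helper name find_row_value is hypothetical, not part of Beautiful Soup:

```python
from bs4 import BeautifulSoup

# The element from the example above, wrapped in a minimal table
HTML = """
<table><tr>
<th scope="row">Population</th>
<td>9,787,426 (2023 est.)</td>
</tr></table>
"""


def find_row_value(soup, label):
    """Return the text of the <td> beside the <th> whose text is `label`, or None."""
    header = soup.find("th", string=label)
    if header is None:
        return None
    cell = header.find_next_sibling("td")
    return cell.get_text(strip=True) if cell is not None else None


soup = BeautifulSoup(HTML, "html.parser")
print(find_row_value(soup, "Population"))  # the population cell's text
print(find_row_value(soup, "Area"))        # None, since no such header exists
```

Returning None instead of raising lets the calling code decide how to handle a page whose layout has changed.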

Formatting the Data

The data you extract from a website may not be in the format you need. In our example, the population is a string that contains thousands separators and a trailing note. If we want to use this data in a spreadsheet or database, we’ll need to convert it to a number.

We can do this using the following code:

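Python
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/London"

response = requests.get(url)
html_content = response.content

soup = BeautifulSoup(html_content, "html.parser")

population = soup.find("th", string="Population").find_next_sibling("td").text

# int() cannot parse "9,787,426 (2023 est.)" directly, so keep only the
# leading figure and remove its thousands separators first
population = int(population.split()[0].replace(",", ""))

print(population)

This code will print the following output:

9787426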
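Scraped figures often carry thousands separators and trailing notes such as "(2023 est.)", which int() rejects on its own. The conversion can be factored into a small reusable helper. This is a minimal sketch; the name parse_count is hypothetical, and it assumes the first run of digits and commas in the string is the number you want:

```python
import re


def parse_count(raw: str) -> int:
    """Return the first integer found in `raw`, ignoring thousands separators."""
    match = re.search(r"\d[\d,]*", raw)
    if match is None:
        raise ValueError(f"no number found in {raw!r}")
    return int(match.group().replace(",", ""))


print(parse_count("9,787,426 (2023 est.)"))  # 9787426
```

Raising ValueError when no digits are present keeps a changed page layout from silently producing bad data.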

Summary

In this guide, we showed you how to use Beautiful Soup to scrape a simple web page. We also covered some of the basics of web scraping, such as how to find the data you want to extract and how to convert it into the format you need.

With a little practice, you’ll be able to use Beautiful Soup to extract data from a wide range of websites.

As an extra, I wrote code for scraping trending news from detik.com and kompas.com here.