Extracting selected data from a single page using lxml.html.xpath

Mulianaraul
3 min read · Dec 19, 2020

Fast and simple ways to learn scraping!

In this exercise, we will use lxml and XPath to collect information from a provided URL. But before we start, please review how to find XPath and CSS selectors using DevTools.

Here is a little documentation about XPath expressions:

Part 1
Part 2
Part 3
Part 4

We’re going to use this link for doing the web scraping:

First, we have to install the module in our environment. If you are using macOS, install it through the terminal; then we can import the module in our IDE. It’s up to you whether you use Jupyter Notebook, PyCharm, or VS Code.
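A quick sanity check after installing might look like this (I'm assuming a pip-based environment; use conda or a notebook `!pip` cell if that's your setup):

```python
# After "pip install lxml pandas" in a terminal (or "!pip install lxml pandas"
# in a Jupyter cell), confirm that the imports resolve in your environment:
import lxml.html
import pandas as pd

print(lxml.html.__name__)  # lxml.html
print(pd.__name__)         # pandas
```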

Step-1

A musicURL string object contains a link to the main page. musicURL is parsed using the parse() function, which returns a doc object of type lxml.etree.ElementTree.
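As a sketch of this step: on the live page you would pass musicURL straight to `parse()`. To keep the example self-contained (and avoid a network call), I parse an inline snippet shaped like the page instead; the snippet's structure is an assumption modeled on a typical product listing.

```python
from io import StringIO
from lxml.html import parse

# On the live page you would write:
#   doc = parse(musicURL)   # musicURL holds the link to the main page
# Self-contained stand-in: parse an inline HTML snippet instead.
snippet = StringIO("""
<html><body>
  <article class="product_pod">
    <h3><a title="The Example Book">The Example Book</a></h3>
    <p class="price_color">£23.88</p>
  </article>
</body></html>
""")

doc = parse(snippet)       # returns an lxml.etree.ElementTree
print(doc.getroot().tag)   # html
```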

Let’s get a base element through the <article> tag.

Get the <article> path

Make a new object called articles and put the <article> XPath into a cell using the xpath() function. Do not forget to add [0] at the end of the code, because we only want the first <article> area. The XPath for articles possesses all of the fields that are available inside <article>, such as title, price, availability, imageUrl, and starRating.

First <article> path
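The step above can be sketched like this (again using an inline stand-in for the parsed page, so the snippet's markup is an assumption):

```python
from lxml.html import fromstring

# Inline stand-in for the parsed page (doc), with two product cards:
doc = fromstring("""
<html><body>
  <article class="product_pod"><h3><a title="First Book">First Book</a></h3></article>
  <article class="product_pod"><h3><a title="Second Book">Second Book</a></h3></article>
</body></html>
""")

# xpath() returns a list of every matching <article>; [0] keeps only the first
articles = doc.xpath('//article')[0]
print(articles.tag)                        # article
print(articles.xpath('./h3/a/@title')[0])  # First Book
```

Dropping the `[0]` would keep the whole list, which is what you'd iterate over to scrape every book on the page.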

After this, we have to set up the individual expressions. If we are going to get information about title, price, availability, and imageUrl, we should declare an individual XPath expression for each, such as:

Set up Individual XPath Expression
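One way this could look — the markup and the exact class names below are assumptions modeled on a typical product card, so adapt the expressions to what DevTools shows you:

```python
from lxml.html import fromstring

# Inline stand-in for one product card (classes are assumptions):
articles = fromstring("""
<article class="product_pod">
  <div class="image_container"><img src="../media/cache/ab/cd/example.jpg"></div>
  <p class="star-rating Three"></p>
  <h3><a title="The Example Book" href="example.html">The Example Book</a></h3>
  <div class="product_price">
    <p class="price_color">£23.88</p>
    <p class="instock availability">
      <i class="icon-ok"></i>
      In stock
    </p>
  </div>
</article>
""")

# One relative XPath expression per field:
title        = articles.xpath('./h3/a/@title')[0]
price        = articles.xpath('.//p[@class="price_color"]/text()')[0]
availability = articles.xpath('.//p[contains(@class, "availability")]/text()')
imageUrl     = articles.xpath('.//img/@src')[0]
starRating   = articles.xpath('.//p[contains(@class, "star-rating")]/@class')[0]

print(title)         # The Example Book
print(price)         # £23.88
print(availability)  # whitespace-heavy list of text nodes
print(imageUrl)      # ../media/cache/ab/cd/example.jpg
print(starRating)    # star-rating Three
```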

Check the result by printing each object. There is still messy data in the availability, imageUrl, and starRating objects, so let’s clean it up.

Checking
Cleaning Process
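The cleaning could go roughly like this — the raw values and the base URL below are assumptions standing in for what the expressions above would return:

```python
from urllib.parse import urljoin

# Assumed messy raw values, like those printed in the checking step:
availability = ['\n        ', '\n        In stock\n    ']
imageUrl     = '../media/cache/ab/cd/example.jpg'
starRating   = 'star-rating Three'

# availability: join the text nodes, then strip the surrounding whitespace
availability_clean = ''.join(availability).strip()

# imageUrl: resolve the relative path against the site (base URL is an assumption)
imageUrl_clean = urljoin('http://books.toscrape.com/catalogue/', imageUrl)

# starRating: drop the leading "star-rating " class, keeping just the rating word
starRating_clean = starRating.replace('star-rating ', '')

print(availability_clean)  # In stock
print(imageUrl_clean)      # http://books.toscrape.com/media/cache/ab/cd/example.jpg
print(starRating_clean)    # Three
```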

So, here is the final step: we are going to make a DataFrame after doing the web scraping using lxml.html and XPath.

Make a list of number
DataFrame from Web Scrapping
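Assembling the DataFrame might look like this — the example values are placeholders for the cleaned lists you'd collect from every `<article>` on the page:

```python
import pandas as pd

# Assumed cleaned field lists, one entry per <article> on the page:
titles         = ['First Book', 'Second Book']
prices         = ['£23.88', '£13.99']
availabilities = ['In stock', 'In stock']

df = pd.DataFrame({
    'title': titles,
    'price': prices,
    'availability': availabilities,
})
df.index = range(1, len(df) + 1)   # the "list of numbers" used as the index
print(df)
```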

This article is meant as quick-learning documentation about web scraping using lxml.html and XPath. Hope you enjoy it!
