Extracting selected data from a single page using lxml.html.xpath

Mulianaraul
3 min read · Dec 19, 2020

Fast and simple ways to learn scraping!

In this exercise, we will use lxml and XPath to collect information from a provided URL. But before we start, please review how to find XPath and CSS selectors using DevTools.

Here is a little documentation about XPath expressions:

Part 1
Part 2
Part 3
Part 4

We’re going to use this link for doing the web scraping:

First, we have to install the module in our environment. If you are using macOS, install it through the terminal; then we can import the module in our IDE. It’s up to you whether you use Jupyter Notebook, PyCharm, or VS Code.
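A quick sanity check after installing might look like this (I'm assuming a pip-based environment; use conda or a notebook `!pip` cell if that's your setup):

```python
# After "pip install lxml pandas" in a terminal (or "!pip install lxml pandas"
# in a Jupyter cell), confirm that the imports resolve in your environment:
import lxml.html
import pandas as pd

print(lxml.html.__name__)  # lxml.html
print(pd.__name__)         # pandas
```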

Step-1

A musicURL string object contains a link to the main page. musicURL is parsed using the parse() function, which returns a doc object of type lxml.etree.ElementTree.
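As a sketch of this step: on the live page you would pass musicURL straight to `parse()`. To keep the example self-contained (and avoid a network call), I parse an inline snippet shaped like the page instead; the snippet's structure is an assumption modeled on a typical product listing.

```python
from io import StringIO
from lxml.html import parse

# On the live page you would write:
#   doc = parse(musicURL)   # musicURL holds the link to the main page
# Self-contained stand-in: parse an inline HTML snippet instead.
snippet = StringIO("""
<html><body>
  <article class="product_pod">
    <h3><a title="The Example Book">The Example Book</a></h3>
    <p class="price_color">£23.88</p>
  </article>
</body></html>
""")

doc = parse(snippet)       # returns an lxml.etree.ElementTree
print(doc.getroot().tag)   # html
```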

Let’s get a base element through the <article> tag.

Get the <article> path

Make a new object called articles and put the <article> XPath into a cell using the xpath() function. Do not forget to add [0] at the end of the code, because we only want the first <article> area. The XPath for articles possesses all of the fields that are available inside <article>, such as title, price, availability, imageUrl, and starRating.

First <article> path
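The step above can be sketched like this (again using an inline stand-in for the parsed page, so the snippet's markup is an assumption):

```python
from lxml.html import fromstring

# Inline stand-in for the parsed page (doc), with two product cards:
doc = fromstring("""
<html><body>
  <article class="product_pod"><h3><a title="First Book">First Book</a></h3></article>
  <article class="product_pod"><h3><a title="Second Book">Second Book</a></h3></article>
</body></html>
""")

# xpath() returns a list of every matching <article>; [0] keeps only the first
articles = doc.xpath('//article')[0]
print(articles.tag)                        # article
print(articles.xpath('./h3/a/@title')[0])  # First Book
```

Dropping the `[0]` would keep the whole list, which is what you'd iterate over to scrape every book on the page.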

After this, we have to set up the individual expressions. If we are going to get information about title, price, availability, and imageUrl, we should declare an individual XPath expression for each, such as:

Set up Individual XPath Expression
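One way this could look — the markup and the exact class names below are assumptions modeled on a typical product card, so adapt the expressions to what DevTools shows you:

```python
from lxml.html import fromstring

# Inline stand-in for one product card (classes are assumptions):
articles = fromstring("""
<article class="product_pod">
  <div class="image_container"><img src="../media/cache/ab/cd/example.jpg"></div>
  <p class="star-rating Three"></p>
  <h3><a title="The Example Book" href="example.html">The Example Book</a></h3>
  <div class="product_price">
    <p class="price_color">£23.88</p>
    <p class="instock availability">
      <i class="icon-ok"></i>
      In stock
    </p>
  </div>
</article>
""")

# One relative XPath expression per field:
title        = articles.xpath('./h3/a/@title')[0]
price        = articles.xpath('.//p[@class="price_color"]/text()')[0]
availability = articles.xpath('.//p[contains(@class, "availability")]/text()')
imageUrl     = articles.xpath('.//img/@src')[0]
starRating   = articles.xpath('.//p[contains(@class, "star-rating")]/@class')[0]

print(title)         # The Example Book
print(price)         # £23.88
print(availability)  # whitespace-heavy list of text nodes
print(imageUrl)      # ../media/cache/ab/cd/example.jpg
print(starRating)    # star-rating Three
```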

Check the result by printing each object. There is still messy data in the availability, imageUrl, and starRating objects, so let’s clean it up.

Checking
Cleaning Process
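The cleaning could go roughly like this — the raw values and the base URL below are assumptions standing in for what the expressions above would return:

```python
from urllib.parse import urljoin

# Assumed messy raw values, like those printed in the checking step:
availability = ['\n        ', '\n        In stock\n    ']
imageUrl     = '../media/cache/ab/cd/example.jpg'
starRating   = 'star-rating Three'

# availability: join the text nodes, then strip the surrounding whitespace
availability_clean = ''.join(availability).strip()

# imageUrl: resolve the relative path against the site (base URL is an assumption)
imageUrl_clean = urljoin('http://books.toscrape.com/catalogue/', imageUrl)

# starRating: drop the leading "star-rating " class, keeping just the rating word
starRating_clean = starRating.replace('star-rating ', '')

print(availability_clean)  # In stock
print(imageUrl_clean)      # http://books.toscrape.com/media/cache/ab/cd/example.jpg
print(starRating_clean)    # Three
```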

So, here is the final step: we are going to make a DataFrame after doing the web scraping using lxml.html and XPath.

Make a list of number
DataFrame from Web Scrapping
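Assembling the DataFrame might look like this — the example values are placeholders for the cleaned lists you'd collect from every `<article>` on the page:

```python
import pandas as pd

# Assumed cleaned field lists, one entry per <article> on the page:
titles         = ['First Book', 'Second Book']
prices         = ['£23.88', '£13.99']
availabilities = ['In stock', 'In stock']

df = pd.DataFrame({
    'title': titles,
    'price': prices,
    'availability': availabilities,
})
df.index = range(1, len(df) + 1)   # the "list of numbers" used as the index
print(df)
```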

This article is meant as quick-learning documentation about web scraping using lxml.html and XPath. Hope you enjoy it!
