
Scraping a Web Page in Browser using XPath and Javascript

 2 years ago
source link: https://dev.to/dendihandian/scrapping-a-web-page-in-browser-using-xpath-and-javascript-3m17

As programmers, we should look for ways to automate our daily tasks whenever possible. For instance, when you need to gather a lot of data from a web page, you could do some simple web scraping rather than copying the text item by item.

The Case

I will demonstrate how to scrape the YouTube playlist of PyCon ID 2020 talks on this page: https://www.youtube.com/playlist?list=PLIv0V1YCmEi3A6H6mdsoxh4RDpzvnJpMq. The result will be a list of the video titles.

The XPath

XPath is a query language for selecting nodes/elements in an XML or HTML document. You can learn more about it from other resources such as W3Schools: https://www.w3schools.com/xml/xpath_intro.asp. A simple query for getting the nodes that contain the video titles is this:

//a[@class="yt-simple-endpoint style-scope ytd-playlist-video-renderer"]

The XPath expression above may stop working if the structure of the web page changes in the future.
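
The query above compares the whole class attribute as one exact string, so it also breaks if a class is added, removed, or reordered. A minimal sketch of a slightly more tolerant query, which matches a single class token using the standard XPath 1.0 functions contains(), concat(), and normalize-space(), follows. The helper name xpathForClass is hypothetical, and the choice of ytd-playlist-video-renderer as the token to match is an assumption; a looser query may match more elements than the exact one, so the count should be checked again.

```javascript
// Build an XPath that matches <tag> elements whose class attribute contains
// `className` as a whole token, regardless of the other classes present.
// Assumes `className` contains no double quotes (hypothetical helper).
function xpathForClass(tag, className) {
  return `//${tag}[contains(concat(" ", normalize-space(@class), " "), " ${className} ")]`;
}

// Assumption: this single class token is enough to identify the title links.
const query = xpathForClass("a", "ytd-playlist-video-renderer");
console.log(query);
// → //a[contains(concat(" ", normalize-space(@class), " "), " ytd-playlist-video-renderer ")]
```

Padding both the attribute value and the token with spaces keeps the match on whole class names, so a class such as ytd-playlist-video-renderer-extra would not be picked up by accident.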

You can also try this yourself in the Chrome/Edge developer tools: open the Elements tab and press Ctrl + F to search with the XPath. The search reports 39 matches, which appears to be right.

The XPath Utility Function in Javascript

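After finding the right XPath for the elements, open the Console tab in the browser developer tools to start typing some JavaScript. The console provides a built-in XPath utility function, $x(). It is a helper of the developer tools console rather than part of the JavaScript language, so it is not available to scripts running on the page. We can pass the XPath string to the function and check the length of the result: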

$x('//a[@class="yt-simple-endpoint style-scope ytd-playlist-video-renderer"]').length
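
Because $x() only exists in the developer tools console, a script that runs on the page needs the standard document.evaluate() API instead. The following is a rough sketch of an equivalent helper, not the console's actual implementation; the name xpathAll is hypothetical, and it assumes a browser environment where document and XPathResult are defined.

```javascript
// Collect every node matching an XPath expression into a plain array.
// `root` defaults to the page's document; pass another node to restrict
// the search to a subtree. (Hypothetical helper, browser environment assumed.)
function xpathAll(expression, root = document) {
  // The owning document performs the evaluation; `root` is the context node.
  const doc = root.ownerDocument || root;
  const snapshot = doc.evaluate(
    expression,
    root,
    null, // no namespace resolver is needed for plain HTML
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, // static list in document order
    null
  );
  const nodes = [];
  for (let i = 0; i < snapshot.snapshotLength; i++) {
    nodes.push(snapshot.snapshotItem(i));
  }
  return nodes;
}
```

Called with the same XPath string as the $x() example, xpathAll(...).length should report the same count, since both return the matching nodes as an array.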

If the output length matches the number of items we want to scrape, the query works. Now we just get the list of titles and print it to the console:

$x('//a[@class="yt-simple-endpoint style-scope ytd-playlist-video-renderer"]').map(function(el){return el.text.trim()}).join("\n")

The output in the console may look odd because of the \n characters, but if you copy the string contents and paste them into an editor such as Visual Studio Code, you will get a clean result with one title per line.
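
The trim-and-join step can also be wrapped in a small reusable helper. This is only a sketch: the name titlesToText is hypothetical, it reads textContent instead of the anchor-specific text property used above, and the sample objects with made-up titles are stand-ins so that it runs outside a browser.

```javascript
// Turn a list of elements (or any objects with a `textContent` string) into
// one title per line, skipping entries that are empty after trimming.
function titlesToText(elements) {
  return Array.from(elements)
    .map((el) => (el.textContent || "").trim())
    .filter((title) => title.length > 0)
    .join("\n");
}

// Stand-in objects with placeholder titles; in the console, pass the array
// returned by $x(...) instead.
const sample = [
  { textContent: "  Opening Talk \n" },
  { textContent: "" },
  { textContent: "Closing Talk" },
];
console.log(titlesToText(sample));
// prints:
// Opening Talk
// Closing Talk
```

In the Chrome developer tools console, wrapping the call in the copy() console utility, as in copy(titlesToText($x(...))), should place the text on the clipboard directly, which avoids copying the escaped string by hand.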

Hope this will be useful for you.

