
Gotta Catch 'Em All!

 4 years ago
source link: https://www.tuicool.com/articles/z6VBVnf

Web Scraping With Selenium, MongoDB, and Beautiful Soup: Part I


Hello! In this series of blog posts, I will share my web scraping experience and show how to implement some powerful tools for conducting a web scrape. But first…

WHAT IS WEB SCRAPING?

Web scraping, also known as web data extraction, web harvesting, or screen scraping, is a technique used to extract data from websites and save it locally on a computer. A very rudimentary way to accomplish this would be to manually click through pages and copy-paste information, which is slow, tedious, and prone to error. I’m here to tell you there are ways to automate the process!

One method is to use a web browser automation tool such as Selenium. Originally developed for automating web applications for testing purposes, it can also be used for web scraping. It lets you open a browser window and perform many of the tasks a living, breathing human would, for example:

  • Clicking buttons
  • Browsing through elements and multiple pages
  • Entering information into forms
  • Searching for specific information on web pages

Selenium can drive your favorite web browser (Chrome and Firefox, for example). Before we start navigating through webpages to scrape, one approach is to use the browser’s developer tools to identify the elements on the page that hold the data we want.

But, what if you don’t know what you want when scraping? What if you just wanted to grab everything now, and parse through the information later?

THE SAVIOR: MongoDB, a NoSQL database

Rather than being specific during the scrape, we can store everything in a MongoDB database. Instead of keeping information in tables and rows as relational (SQL) databases do, MongoDB’s architecture consists of collections of JSON-like documents. Together with Selenium, we can save a website’s entire HTML and sort through it at a later time.

Although this can seem like a LOT of information now, there are ways to parse through this data easily.


Pretty Porridge would also have been a cool name.

Beautiful Soup is a Python library for extracting data from HTML and XML files. It is the last piece of the puzzle: it pulls the desired information out of the HTML we scraped. To target the right data, we again turn to the browser’s developer tools to identify the elements that lead us to it.
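Here is a small sketch of that last step. The HTML snippet stands in for a document fetched with Selenium and stored in MongoDB; the tag names and class are made up for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in for HTML retrieved earlier with Selenium
html = """
<html><body>
  <h1 class="title">Gotta Catch 'Em All!</h1>
  <ul>
    <li>Bulbasaur</li>
    <li>Charmander</li>
    <li>Squirtle</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Grab a single element by tag and class
title = soup.find("h1", class_="title").get_text()

# Grab every matching element at once
names = [li.get_text() for li in soup.find_all("li")]
```

`find` returns the first match while `find_all` returns every match, which is usually all you need to turn a saved page back into structured data.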

Conclusion

Selenium, MongoDB, and Beautiful Soup are powerful tools for any data scientist to master. Together, they can save you time and effort in any web scraping project. Next week, I will demonstrate how to install the necessary packages and libraries in Python using Jupyter Notebooks.

Until then, if you have any questions, comments, or critiques, I can be reached via LinkedIn or the comment section.

