
Gotta Catch 'Em All!

 4 years ago
source link: https://www.tuicool.com/articles/z6VBVnf

Web Scraping With Selenium, MongoDB, and Beautiful Soup: Part I


Hello! In this series of blog posts, I will share my web scraping experience and show how to implement some powerful tools for conducting a web scrape. But first…

WHAT IS WEB SCRAPING?

Web scraping, also known as web data extraction, web harvesting, or screen scraping, is a technique used to extract data from websites and save it locally on a computer. A very rudimentary way to accomplish this would be to manually click through pages and copy-paste information, which is slow, tedious, and prone to error. I’m here to tell you there are ways to automate the process!

One method is to use a web browser automation tool such as Selenium. Originally developed for automating web applications for testing purposes, it can also be used for web scraping. It lets you open a browser window and perform many of the tasks a living, breathing human would, for example:

  • Clicking buttons
  • Browsing through elements and multiple pages
  • Entering information into forms
  • Searching for specific information on web pages

Selenium can drive your favorite web browser (Chrome and Firefox, for example). Before we start navigating through webpages to scrape, one approach is to use the browser’s developer tools to identify the elements on the page that hold the data we want.

But, what if you don’t know what you want when scraping? What if you just wanted to grab everything now, and parse through the information later?

THE SAVIOR: MongoDB, a NoSQL database

Rather than being specific during the scrape, we can store everything in a MongoDB database. Instead of keeping information in tables and rows as relational (SQL) databases do, MongoDB’s architecture consists of collections of JSON-like documents. Together with Selenium, we can save a website’s entire HTML and sort through it at a later time.

Although this can seem like a LOT of information now, there are ways to parse through this data easily.


Pretty Porridge would also have been a cool name.

Beautiful Soup is a Python library for extracting data from HTML and XML files. It is the last piece of the puzzle: it pulls the desired information out of the HTML we scraped. To target the right data, we again turn to the browser’s developer tools to identify the elements that lead us to it.
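Here is a small sketch of that last step. The HTML snippet stands in for a document fetched with Selenium and stored in MongoDB; the tag names and class are made up for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in for HTML retrieved earlier with Selenium
html = """
<html><body>
  <h1 class="title">Gotta Catch 'Em All!</h1>
  <ul>
    <li>Bulbasaur</li>
    <li>Charmander</li>
    <li>Squirtle</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Grab a single element by tag and class
title = soup.find("h1", class_="title").get_text()

# Grab every matching element at once
names = [li.get_text() for li in soup.find_all("li")]
```

`find` returns the first match while `find_all` returns every match, which is usually all you need to turn a saved page back into structured data.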

Conclusion

Selenium, MongoDB, and Beautiful Soup are powerful tools for any data scientist to master. Together, they can save you time and effort in any web scraping project. Next week, I will demonstrate how to install the necessary packages and libraries in Python using Jupyter Notebooks.

Until then, if you have any questions, comments, or critiques, I can be reached via LinkedIn or the comment section.

