All Roger Ebert's Great Movies That You Can Watch on Amazon Prime

 3 years ago
source link: https://www.linisnil.com/articles/scraping-roger-ebert-reviews-and-amazon/
My wife and I are big fans of the late film critic Roger Ebert. We also share an Amazon Prime membership.

I wondered: which of Roger Ebert’s favorite movies are available to watch for free on Prime? Since there are hundreds of reviews by Roger Ebert, I had the perfect excuse for writing a web scraper!

In this article, I will:

  • Show my not-so-pretty scraping code
  • Discuss some roadblocks / gotchas I ran into along the way
  • Share with you the list of movies rated as great by Roger Ebert. That’s what you’re here for, right?

PS: If you only want to see the list of movies, jump to the end of this article.

Code Quality Warning: I hacked this together as fast as I could without much refactoring, so it’s not the most readable or optimized. But it works… for now.

Roadblocks

I hit a few roadblocks while working on this. They are worth calling out because they explain some of the decisions I made in the implementation.

scraping rogerebert.com

Performing a regular GET with an Accept: text/html header against the URL assigned to the variable ebert_url will always return the first page of movies, regardless of what you set the page query parameter to. (I assumed text/html was the default for the requests library; its actual default is Accept: */*.)

Solution? The Accept header field needs to be set to application/json for the server to return JSON containing movies for that specific page.
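A minimal sketch of that fix, using only the standard library so it runs without extra dependencies (the original scraper used requests, where the equivalent is passing headers={"Accept": "application/json"} to requests.get). The value of ebert_url below is a placeholder, since the real URL is not shown here, and build_page_request and fetch_page are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Placeholder only: the real value of ebert_url is not shown in this article.
ebert_url = "https://example.com/great-movies"


def build_page_request(page, base_url=ebert_url):
    """Build a GET request for one page that asks the server for JSON."""
    query = urlencode({"page": page})
    return Request(
        f"{base_url}?{query}",
        # Without this header the server reportedly ignores the page parameter.
        headers={"Accept": "application/json"},
    )


def fetch_page(page):
    """Send the request and decode the JSON body (performs network I/O)."""
    with urlopen(build_page_request(page), timeout=30) as resp:
        return json.load(resp)
```

Calling fetch_page(2) would then request the second page; the shape of the returned JSON depends on the server and is not assumed here.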

scraping amazon.com

No public API

First, there is no publicly available Amazon API for their catalog search. It seems like you could email them to get authorization, but I didn’t want to spend my time doing that.

Not automation friendly

I started off using the requests library. It turns out that if you don’t set a proper browser user agent, you’ll get a 503 and a message about how automation isn’t welcome. If you do fake a proper user agent but don’t send back the cookies set in the server’s response, you’ll get:

Sorry, we just need to make sure you’re not a robot. For best results, please make sure your browser is accepting cookies.

I got frustrated and switched over to a more stateful HTTP tool: mechanize.

That worked.
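The two requirements above (a browser-like user agent, plus cookies that persist between requests) can be sketched with the standard library alone. This is a stand-in for illustration, not the mechanize code I actually used; the user agent string and the make_stateful_opener helper are assumptions.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener

# Assumed example value; any realistic browser User-Agent string would do.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def make_stateful_opener(user_agent=BROWSER_UA):
    """Return an opener that replays server cookies and sends a browser UA."""
    jar = CookieJar()
    # HTTPCookieProcessor stores Set-Cookie values in the jar and sends them
    # back on later requests made through the same opener.
    opener = build_opener(HTTPCookieProcessor(jar))
    # Replace the default "Python-urllib/x.y" user agent.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar
```

Every request made with opener.open(url) then shares the same cookie jar, which is the stateful behavior that a bare one-off request lacks.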

Bad HTML …

You’ll notice that I’m using some regex in the function amazon_search to parse out the movie title search results on the page. The reason is that when I tried using beautifulsoup’s find_all function on the search result tags, I got nothing. My guess is that there’s some invalid HTML on the page that confused the beautifulsoup html.parser parser, which isn’t super lenient.
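To give a feel for the regex approach, here is a rough sketch. The markup, the result-title class name, and the extract_titles helper are all hypothetical; they are not Amazon’s real page structure or the actual pattern inside amazon_search.

```python
import html
import re

# Hypothetical markup: the class name "result-title" is made up for this sketch.
TITLE_RE = re.compile(
    r'<h2[^>]*class="[^"]*result-title[^"]*"[^>]*>(.*?)</h2>',
    re.IGNORECASE | re.DOTALL,
)


def extract_titles(page_html):
    """Pull title text out of matching <h2> tags without building a DOM."""
    titles = []
    for raw in TITLE_RE.findall(page_html):
        text = re.sub(r"<[^>]+>", "", raw)  # drop any nested tags
        titles.append(html.unescape(text).strip())
    return titles


# The stray, unclosed <p> stands in for the kind of invalid HTML suspected above.
sample = (
    '<div><h2 class="result-title"><a href="#">Paths of Glory</a></h2>'
    '<p><h2 class="result-title">Psycho</h2></div>'
)
print(extract_titles(sample))  # ['Paths of Glory', 'Psycho']
```

Because the pattern only looks at the text between the tags it names, the malformed markup around them does not matter, which is also why this approach is brittle if the page layout changes.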

It turns out that, rather than using regex, I could have switched over to the html5lib parser.

For example: BeautifulSoup(match, features="html5lib").

The html5lib parser is the most lenient parser, much more lenient than html.parser. So if I needed to make additional changes to this function, I’d refactor it to use that parser and get rid of the nasty-looking regex.

Results

Without further ado, here are all the great movies that are included with Prime! I included the full list via Google Drive at the very end.

Title | Review
The Battle of Algiers | https://www.rogerebert.com/reviews/great-movie-the-battle-of-algiers-1967
The Gospel According to St. Matthew | https://www.rogerebert.com/reviews/great-movie-gospel-according-to-st-matthew-1964
Atlantic City | https://www.rogerebert.com/reviews/great-movie-atlantic-city-1980
Fitzcarraldo | https://www.rogerebert.com/reviews/great-movie-fitzcarraldo-1982
Howards End | https://www.rogerebert.com/reviews/great-movie-howards-end-1992
Paths of Glory | https://www.rogerebert.com/reviews/great-movie-paths-of-glory-1957
The Adventures of Robin Hood | https://www.rogerebert.com/reviews/great-movie-the-adventures-of-robin-hood-1938
The Good, the Bad and the Ugly | https://www.rogerebert.com/reviews/great-movie-the-good-the-bad-and-the-ugly-1968
Breathless | https://www.rogerebert.com/reviews/great-movie-breathless-1960
Moonstruck | https://www.rogerebert.com/reviews/great-movie-moonstruck-1987
Snow White and the Seven Dwarfs | https://www.rogerebert.com/reviews/great-movie-snow-white-and-the-seven-dwarfs-1937
Make Way for Tomorrow | https://www.rogerebert.com/reviews/make-way-for-tomorrow-1937
Pickpocket | https://www.rogerebert.com/reviews/great-movie-pickpocket-1959
The Big Sleep | https://www.rogerebert.com/reviews/great-movie-the-big-sleep-1946
The General | https://www.rogerebert.com/reviews/great-movie-the-general-1927
The Night of the Hunter | https://www.rogerebert.com/reviews/great-movie-the-night-of-the-hunter-1955
Orpheus | https://www.rogerebert.com/reviews/great-movie-orpheus-1949
Some Like It Hot | https://www.rogerebert.com/reviews/great-movie-some-like-it-hot-1959
Beauty and the Beast | https://www.rogerebert.com/reviews/great-movie-beauty-and-the-beast-1946
House of Games | https://www.rogerebert.com/reviews/great-movie-house-of-games-1987
Dracula | https://www.rogerebert.com/reviews/great-movie-dracula-1931
Spirited Away | https://www.rogerebert.com/reviews/great-movie-spirited-away-2002
The Man Who Shot Liberty Valance | https://www.rogerebert.com/reviews/great-movie-the-man-who-shot-liberty-valance-1962
Nosferatu the Vampyre | https://www.rogerebert.com/reviews/great-movie-nosferatu-the-vampyre-1979
Sunset Boulevard | https://www.rogerebert.com/reviews/great-movie-sunset-boulevard-1950
Aguirre, the Wrath of God | https://www.rogerebert.com/reviews/great-movie-aguirre-the-wrath-of-god-1972
The Bicycle Thief | https://www.rogerebert.com/reviews/great-movie-the-bicycle-thief--bicycle-thieves-1949
It’s a Wonderful Life | https://www.rogerebert.com/reviews/great-movie-its-a-wonderful-life-1946
Psycho | https://www.rogerebert.com/reviews/great-movie-psycho-1960
Pinocchio | https://www.rogerebert.com/reviews/great-movie-pinocchio-1940
Trouble in Paradise | https://www.rogerebert.com/reviews/great-movie-trouble-in-paradise-1932
The Silence | https://www.rogerebert.com/reviews/the-silence-1963
My Man Godfrey | https://www.rogerebert.com/reviews/great-movie-my-man-godfrey-1936
Johnny Guitar | https://www.rogerebert.com/reviews/johnny-guitar-1954
What Ever Happened to Baby Jane? | https://www.rogerebert.com/reviews/great-movie-what-ever-happened-to-baby-jane-1962
Detour | https://www.rogerebert.com/reviews/great-movie-detour-1945
Woman in the Dunes | https://www.rogerebert.com/reviews/great-movie-woman-in-the-dunes-1964
The Sweet Smell of Success | https://www.rogerebert.com/reviews/great-movie-the-sweet-smell-of-success-1957
A Christmas Story | https://www.rogerebert.com/reviews/great-movie-a-christmas-story-1983
Beat the Devil | https://www.rogerebert.com/reviews/great-movie-beat-the-devil-1954
The Long Goodbye | https://www.rogerebert.com/reviews/great-movie-the-long-goodbye-1973
Night Moves | https://www.rogerebert.com/reviews/great-movie-night-moves-1975

Here’s the FULL data set of movies (not available on Amazon, available but not free with Prime, and free with Prime): https://docs.google.com/spreadsheets/d/1XkdEqzXbhivEGhty_hVV8nNeJBhd4HKKSCSIM97MbjA/edit?usp=sharing

Enjoy.

