
Web crawler in Rust

 4 years ago
source link: https://rolisz.ro/2020/03/01/web-crawler-in-rust/

I have heard many good things about Rust for several years now. A couple of months ago, I finally decided to start learning it. I skimmed through the Book and did the exercises from rustlings. While they helped me get started, I learn best by doing projects. So I decided to replace the crawler that I used for my Ghost blog, which had been written in bash with wget, with something written in Rust.

And I was pleasantly surprised. I am by no means very knowledgeable in Rust: I still have to look up most of the operations on the Option and Result types, and I have to DuckDuckGo how to make HTTP requests, read and write files and so on. Still, I was able to write a minimal crawler in about 2-3 hours, and after about 10 hours of total work I had something that was both faster and had fewer bugs than the wget script.

So let's start writing a simple crawler that downloads all the HTML pages from a blog.

Initializing a Rust project

After installing Rust, let's create a project somewhere:

> cargo new rust_crawler

This initializes a Hello World program. We can verify that it runs with:

> cargo run
   Compiling rust_crawler v0.1.0 (D:\Programming\rust_crawler)
    Finished dev [unoptimized + debuginfo] target(s) in 9.31s
     Running `target\debug\rust_crawler.exe`
Hello, world!

Making HTTP requests

Let's make our first HTTP request. For this, we will use the reqwest library. It has both blocking and asynchronous APIs for making HTTP calls. We'll start off with the blocking API, because it's easier.

use std::io::Read;

fn main() {
    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";
    let mut res = client.get(origin_url).send().unwrap();
    println!("Status for {}: {}", origin_url, res.status());

    let mut body  = String::new();
    res.read_to_string(&mut body).unwrap();
    println!("HTML: {}", &body[0..40]);
}
> cargo run
   Compiling rust_crawler v0.1.0 (D:\Programming\rust_crawler)
    Finished dev [unoptimized + debuginfo] target(s) in 2.30s
     Running `target\debug\rust_crawler.exe`
Status for https://rolisz.ro/: 200 OK
HTML: <!DOCTYPE html>
<html lang="en">
<head>

We create a new reqwest blocking client, build a GET request and send it. The send call returns a Result, which we just unwrap for now. We print the status code to make sure the request succeeded, then we read the body of the response into a mutable String and print its first 40 bytes. So far so good.
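One caveat about the &body[0..40] slice: it indexes by bytes, so it panics if the response is shorter than 40 bytes or if byte 40 falls inside a multi-byte character. A minimal sketch of a more forgiving preview, assuming a hypothetical helper called preview that is not part of the original crawler:

```rust
/// Returns at most `max_chars` characters from the start of `body`.
/// `preview` is a hypothetical helper, not part of the original crawler.
fn preview(body: &str, max_chars: usize) -> String {
    // Iterating over chars (not bytes) never splits a UTF-8 sequence,
    // and `take` simply stops early when the body is short.
    body.chars().take(max_chars).collect()
}
```

With this, the println! call could print preview(&body, 40) instead of slicing directly.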

Extracting links

Now let's parse the HTML and extract all the links we find. For this we will use the select crate, which can parse HTML and allows us to search through the nodes.

use std::io::Read;
use select::document::Document;
use select::predicate::Name;

fn main() {
    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";
    let mut res = client.get(origin_url).send().unwrap();
    println!("Status for {}: {}", origin_url, res.status());

    let mut body  = String::new();
    res.read_to_string(&mut body).unwrap();
   
    Document::from(body.as_str())
        .find(Name("a"))
        .filter_map(|n| n.attr("href"))
        .for_each(|x| println!("{}", x));
}
> cargo run --color=always --package rust_crawler --bin rust_crawler
   Compiling rust_crawler v0.1.0 (D:\Programming\rust_crawler)
    Finished dev [unoptimized + debuginfo] target(s) in 2.65s
     Running `target\debug\rust_crawler.exe`
Status for https://rolisz.ro/: 200 OK
https://rolisz.ro
https://rolisz.ro
https://rolisz.ro/projects/
https://rolisz.ro/about-me/
https://rolisz.ro/uses/
https://rolisz.ro/tag/trips/
https://rolisz.ro/tag/reviews/
#subscribe
/2020/02/13/lost-in-space/
/2020/02/13/lost-in-space/
/author/rolisz/
/author/rolisz/
...
/2020/02/07/interview-about-wfh/
/2020/02/07/interview-about-wfh/
/2019/01/30/nas-outage-1/
/2019/01/30/nas-outage-1/
/author/rolisz/
/author/rolisz/
https://rolisz.ro
https://rolisz.ro
https://www.facebook.com/rolisz
https://twitter.com/rolisz
https://ghost.org
javascript:;
#

We search for all the anchor tags, keep only those that have an href attribute and print the value of those attributes.

We see all the links in the output, but there are some issues. First, some of the links are absolute, some are relative, and some are pseudo-links used for doing JavaScript things. Second, the links that point towards posts are duplicated. Third, there are links that don't point towards something on my blog.
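The kinds of href values in that output can be told apart with plain string checks. Here is a rough sketch, where LinkKind and classify are hypothetical names introduced only for illustration; the article itself goes on to use a real URL parser instead.

```rust
/// Rough categories for the href values seen in the output.
/// `LinkKind` and `classify` are hypothetical names used for illustration.
#[derive(Debug, PartialEq)]
enum LinkKind {
    Absolute, // starts with http:// or https://
    Relative, // starts with a single '/', i.e. a path on the same site
    Fragment, // "#" or "#subscribe": a position inside the current page
    Script,   // "javascript:..." pseudo-links
    Other,    // anything else, e.g. protocol-relative "//host/path"
}

fn classify(href: &str) -> LinkKind {
    let href = href.trim();
    if href.starts_with("http://") || href.starts_with("https://") {
        LinkKind::Absolute
    } else if href.starts_with('#') {
        LinkKind::Fragment
    } else if href.to_ascii_lowercase().starts_with("javascript:") {
        LinkKind::Script
    } else if href.starts_with('/') && !href.starts_with("//") {
        LinkKind::Relative
    } else {
        LinkKind::Other
    }
}
```

Only the Absolute and Relative kinds are candidates for crawling; the others can be dropped right away.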

The duplicate problem is easy to fix: we put everything into a HashSet and then we'll get only a unique collection of URLs.

use std::io::Read;
use select::document::Document;
use select::predicate::Name;
use std::collections::HashSet;

fn main() {
    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";
    let mut res = client.get(origin_url).send().unwrap();
    println!("Status for {}: {}", origin_url, res.status());

    let mut body  = String::new();
    res.read_to_string(&mut body).unwrap();

    let found_urls = Document::from(body.as_str())
        .find(Name("a"))
        .filter_map(|n| n.attr("href"))
        .map(str::to_string)
        .collect::<HashSet<String>>();
    println!("URLs: {:#?}", found_urls)
}

First we have to convert the URLs from &str to String, so we get objects whose lifetime is independent of the original string which contains the whole HTML. Then we insert all the strings into a hash set using Rust's collect function, which handles insertion into all kinds of containers, in all kinds of situations.
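To see why the owned String matters, here is a small std-only sketch. The function names are made up for this example, and the "HTML" is just a whitespace-separated list of paths standing in for parsed href values.

```rust
use std::collections::HashSet;

/// Collects hrefs into a set of owned strings; duplicates collapse automatically.
fn unique_links(hrefs: &[&str]) -> HashSet<String> {
    hrefs.iter().map(|href| href.to_string()).collect()
}

/// `body` is dropped when this function returns, so the returned set could not
/// hold `&str` slices borrowed from it. Owned `String`s have no such tie.
fn links_from_temporary_body() -> HashSet<String> {
    let body = String::from("/about-me/ /uses/ /about-me/");
    let hrefs: Vec<&str> = body.split_whitespace().collect();
    unique_links(&hrefs)
}
```

Returning a HashSet<&str> from links_from_temporary_body would be rejected by the borrow checker, which is the situation the crawler runs into once link extraction moves into its own function.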

To solve the other two problems we have to parse the URLs, using the Url type that reqwest exposes.

use std::io::Read;
use select::document::Document;
use select::predicate::Name;
use std::collections::HashSet;
use reqwest::Url;

fn get_links_from_html(html: &str) -> HashSet<String> {
    Document::from(html)
        .find(Name("a").or(Name("link")))
        .filter_map(|n| n.attr("href"))
        .filter_map(normalize_url)
        .collect::<HashSet<String>>()
}

fn normalize_url(url: &str) -> Option<String> {
    let new_url = Url::parse(url);
    match new_url {
        Ok(new_url) => {
            if new_url.has_host() && new_url.host_str().unwrap() == "ghost.rolisz.ro" {
                Some(url.to_string())
            } else {
                None
            }
        },
        Err(_e) => {
            // Relative urls are not parsed by Reqwest
            if url.starts_with('/') {
                Some(format!("https://rolisz.ro{}", url))
            } else {
                None
            }
        }
    }
}

fn main() {
    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";
    let mut res = client.get(origin_url).send().unwrap();
    println!("Status for {}: {}", origin_url, res.status());

    let mut body = String::new();
    res.read_to_string(&mut body).unwrap();

    let found_urls = get_links_from_html(&body);
    println!("URLs: {:#?}", found_urls)
}

We moved all the logic to a function get_links_from_html . We apply another filter_map to the links we find, in which we check if we can parse the URL. If we can, we check if there is a host and if it's equal to my blog. Otherwise, if we can't parse, we check if it starts with a / , in which case it's a relative URL. All other cases lead to rejection of the URL.
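One edge case worth keeping in mind: as far as I can tell, a protocol-relative href such as //cdn.example.com/a.js cannot be parsed without a base URL and also starts with a slash, so the starts_with('/') branch would treat it as a path on the blog. The sketch below shows the same accept/reject rules with plain string operations and that case handled explicitly. It is a std-only stand-in, not the article's implementation, and normalize_with_origin is a hypothetical name.

```rust
/// Std-only stand-in for the normalization rules described above.
/// `normalize_with_origin` is a hypothetical name; `origin` is assumed to be
/// scheme plus host without a trailing slash, e.g. "https://rolisz.ro".
fn normalize_with_origin(origin: &str, href: &str) -> Option<String> {
    if href.starts_with("//") {
        // Protocol-relative link that points at some other host: reject.
        return None;
    }
    if href.starts_with('/') {
        // Site-relative link: prefix it with the origin.
        return Some(format!("{}{}", origin, href));
    }
    // Absolute link: keep it only if it is the origin itself or lies below it.
    match href.strip_prefix(origin) {
        Some(rest) if rest.is_empty() || rest.starts_with('/') => Some(href.to_string()),
        _ => None,
    }
}
```

Fragments such as #subscribe and pseudo-links such as javascript:; fall through to the last branch and are rejected, just like links to other sites.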

Iterating over all links

Now it's time to start going over these links that we get so that we crawl the whole blog. We'll do a breadth first traversal and we'll have to keep track of the visited URLs.

use std::time::Instant;

fn fetch_url(client: &reqwest::blocking::Client, url: &str) -> String {
    let mut res = client.get(url).send().unwrap();
    println!("Status for {}: {}", url, res.status());

    let mut body  = String::new();
    res.read_to_string(&mut body).unwrap();
    body
}

fn main() {
    let now = Instant::now();

    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";

    let body = fetch_url(&client, origin_url);

    let mut visited = HashSet::new();
    visited.insert(origin_url.to_string());
    let found_urls = get_links_from_html(&body);
    let mut new_urls = found_urls
        .difference(&visited)
        .map(|x| x.to_string())
        .collect::<HashSet<String>>();

    while !new_urls.is_empty() {
        // Fetch every page of the current level and merge the links found on them.
        let found_urls: HashSet<String> = new_urls
            .iter()
            .map(|url| {
                let body = fetch_url(&client, url);
                let links = get_links_from_html(&body);
                println!("Visited: {} found {} links", url, links.len());
                links
            })
            .fold(HashSet::new(), |mut acc, x| {
                acc.extend(x);
                acc
            });
        visited.extend(new_urls);

        // The next level consists of the links we have not visited yet.
        new_urls = found_urls
            .difference(&visited)
            .map(|x| x.to_string())
            .collect::<HashSet<String>>();
        println!("New urls: {}", new_urls.len())
    }
    println!("URLs: {:#?}", found_urls);
    println!("{}", now.elapsed().as_secs());

}

First, we moved the code to fetch a URL to its own function, because we will be using it in two places.

Then the idea is that we have a HashSet containing all the pages we have visited so far. When we visit a new page, we find all the links in that page and we subtract from them all the links that we have previously visited. These will be new links that we will have to visit. We repeat this as long as we have new links to visit.
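The same loop can be tried out without any network access. In this sketch the "site" is a HashMap from a page to the links on it, which stands in for fetching and parsing; crawl and site are names invented for the example.

```rust
use std::collections::{HashMap, HashSet};

/// Walks an in-memory "site" level by level and returns every page reached.
/// `site` maps a page to the links found on it; looking a page up replaces
/// the fetch-and-parse step of the real crawler.
fn crawl(site: &HashMap<&str, Vec<&str>>, origin: &str) -> HashSet<String> {
    let mut visited: HashSet<String> = HashSet::new();
    let mut new_urls: HashSet<String> = HashSet::new();
    new_urls.insert(origin.to_string());

    while !new_urls.is_empty() {
        // Gather all links found on the pages of the current level.
        let mut found_urls: HashSet<String> = HashSet::new();
        for url in &new_urls {
            if let Some(links) = site.get(url.as_str()) {
                found_urls.extend(links.iter().map(|link| link.to_string()));
            }
        }
        // Everything in this level has now been visited.
        visited.extend(new_urls);
        // The next level is whatever we found and have not visited yet.
        new_urls = found_urls.difference(&visited).cloned().collect();
    }
    visited
}
```

Because the visited set only grows and the next level never contains visited pages, the loop terminates even when pages link back to each other.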

So we run this and we get the following output:

Status for https://rolisz.ro/favicon.ico: 200 OK
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Custom { kind: InvalidData, error: "stream did not contain valid UTF-8" }', src\libcore\result.rs:1165:5
stack backtrace:
   0: core::fmt::write
             at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14\/src\libcore\fmt\mod.rs:1028
   1: std::io::Write::write_fmt<std::sys::windows::stdio::Stderr>
             at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14\/src\libstd\io\mod.rs:1412
   2: std::sys_common::backtrace::_print
             at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14\/src\libstd\sys_common\backtrace.rs:65
   3: std::sys_common::backtrace::print
             at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14\/src\libstd\sys_common\backtrace.rs:50
...

The problem is that our crawler tries to download pictures and other binary files as text. A Rust String has to be valid UTF-8, so reading arbitrary bytes into one fails and the unwrap panics. We could solve this in two ways: either download every URL as bytes and convert to a string only those that we know are HTML, or skip the URLs that are not HTML. Because I am interested only in the textual content of my blog, I will implement the latter solution.
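For comparison, the first option could start from something like the sketch below. It assumes the response body is already available as a Vec<u8>, and decode_text is a hypothetical helper that does not appear in the article.

```rust
/// Hypothetical helper: turns a downloaded body into text only when it is valid UTF-8.
fn decode_text(bytes: Vec<u8>) -> Option<String> {
    // String::from_utf8 validates the bytes and returns an Err for binary data
    // instead of panicking, so images and other binaries become None.
    String::from_utf8(bytes).ok()
}
```

Valid UTF-8 is not the same thing as HTML, so this alone would still let CSS or JSON through. The article takes the second route, shown next.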

use std::path::Path;

// Despite its name, this returns true when the URL has no file extension,
// which we take to mean "probably an HTML page".
fn has_extension(url: &&str) -> bool {
    Path::new(url).extension().is_none()
}

fn get_links_from_html(html: &str) -> HashSet<String> {
    Document::from(html)
        .find(Name("a").or(Name("link")))
        .filter_map(|n| n.attr("href"))
        .filter(has_extension)
        .filter_map(normalize_url)
        .collect::<HashSet<String>>()
}

To determine whether a URL points to an HTML page, we check whether it has a file extension, and we add that check as a filter to the function that retrieves links from the HTML.
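It helps to see what the extension check does on a few concrete inputs. The sketch below wraps the same Path call in a function with a hypothetical name, looks_like_html, and lists some cases in the comments.

```rust
use std::path::Path;

/// Returns true when the last path segment of `url` has no file extension,
/// the cheap "probably HTML" test described above.
/// `looks_like_html` is a hypothetical name for the same check.
fn looks_like_html(url: &str) -> bool {
    // Path ignores a trailing slash, so "/a/b/" is judged by its segment "b".
    // "/2020/02/13/lost-in-space/" -> kept (no extension)
    // "/favicon.ico"               -> skipped (extension "ico")
    // "https://rolisz.ro"          -> skipped, because Path sees the extension "ro"
    Path::new(url).extension().is_none()
}
```

The last case is a side effect of treating a URL as a file path: a bare host looks like a file name with an extension. It does little harm here because the origin URL is fetched before any filtering happens, but it is worth knowing about.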

Writing the HTML to disk

We are now getting all the HTML we want, time to start writing it to disk.

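use std::fs;

fn write_file(path: &str, content: &str) {
    // Create the directory that mirrors the URL path, then write the page into it.
    fs::create_dir_all(format!("static{}", path)).unwrap();
    fs::write(format!("static{}/index.html", path), content).unwrap();
}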

fn main() {
    let now = Instant::now();

    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";

    let body = fetch_url(&client, origin_url);

    write_file("", &body);
    let mut visited = HashSet::new();
    visited.insert(origin_url.to_string());
    let found_urls = get_links_from_html(&body);
    let mut new_urls = found_urls
        .difference(&visited)
        .map(|x| x.to_string())
        .collect::<HashSet<String>>();

    while !new_urls.is_empty() {
        let found_urls: HashSet<String> = new_urls
            .iter()
            .map(|url| {
                let body = fetch_url(&client, url);
                // Strip the origin (but keep the leading slash) to get the path on disk.
                write_file(&url[origin_url.len() - 1..], &body);
                let links = get_links_from_html(&body);
                println!("Visited: {} found {} links", url, links.len());
                links
            })
            .fold(HashSet::new(), |mut acc, x| {
                acc.extend(x);
                acc
            });
        visited.extend(new_urls);
        new_urls = found_urls
            .difference(&visited)
            .map(|x| x.to_string())
            .collect::<HashSet<String>>();
        println!("New urls: {}", new_urls.len())
    }
    println!("URLs: {:#?}", found_urls);
    println!("{}", now.elapsed().as_secs());

}

We use the create_dir_all function, which works like mkdir -p in Linux to create the nested folder structure. We write the HTML page to the index.html file in the same folder structure as the URL structure. Most web servers will then serve the index.html file when going to the URL, so the output in the browser will be the same as the one from Ghost serving dynamic pages.
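The mapping from URL to file path can also be written as a separate function, which makes it easy to check in isolation. This is a sketch under my own assumptions, not the article's code: output_path is a hypothetical helper, it builds the path with PathBuf instead of string formatting, and it returns None for URLs outside the origin rather than slicing blindly.

```rust
use std::path::PathBuf;

/// Maps a page URL to the file it would be written to under "static".
/// `output_path` is a hypothetical helper; returns None for URLs outside `origin_url`.
fn output_path(origin_url: &str, url: &str) -> Option<PathBuf> {
    let origin = origin_url.trim_end_matches('/');
    let relative = url.strip_prefix(origin)?;
    if !(relative.is_empty() || relative.starts_with('/')) {
        // A different host that merely shares the prefix.
        return None;
    }
    let mut path = PathBuf::from("static");
    for segment in relative.split('/') {
        // Skip empty segments (leading, trailing or doubled slashes) and
        // "." / ".." so the result stays inside the output directory.
        if segment.is_empty() || segment == "." || segment == ".." {
            continue;
        }
        path.push(segment);
    }
    path.push("index.html");
    Some(path)
}
```

For example, https://rolisz.ro/about-me/ maps to static/about-me/index.html, and the origin itself maps to static/index.html.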

Speeding it up

Letting this run on my blog takes about 110 seconds. Let's see if we can speed it up by downloading the pages in parallel.

use rayon::prelude::*;

fn main() {
    let now = Instant::now();

    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";

    let body = fetch_url(&client, origin_url);

    write_file("", &body);
    let mut visited = HashSet::new();
    visited.insert(origin_url.to_string());
    let found_urls = get_links_from_html(&body);
    let mut new_urls = found_urls
        .difference(&visited)
        .map(|x| x.to_string())
        .collect::<HashSet<String>>();

    while !new_urls.is_empty() {
        let found_urls: HashSet<String> = new_urls
            .par_iter()
            .map(|url| {
                let body = fetch_url(&client, url);
                write_file(&url[origin_url.len() - 1..], &body);

                let links = get_links_from_html(&body);
                println!("Visited: {} found {} links", url, links.len());
                links
            })
            .reduce(HashSet::new, |mut acc, x| {
                acc.extend(x);
                acc
            });
        visited.extend(new_urls);
        new_urls = found_urls
            .difference(&visited)
            .map(|x| x.to_string())
            .collect::<HashSet<String>>();
        println!("New urls: {}", new_urls.len())
    }
    println!("URLs: {:#?}", found_urls);
    println!("{}", now.elapsed().as_secs());
}

In Rust there is this awesome library called Rayon, which provides a very simple primitive for running functions in parallel: par_iter, which is short for parallel iterator. It's an almost drop-in replacement for iter, which is part of the standard library for collections, and it runs the provided closure in parallel, taking care of boring stuff like thread scheduling. Besides changing iter to par_iter, we have to change the fold to reduce and provide a closure that returns the "zero" element, so that Rayon can create one for each piece of work it processes in parallel.
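To get a feeling for why a parallel reduce needs a way to create several "zero" elements, here is a hand-rolled sketch that uses std::thread::scope from the standard library rather than Rayon. Each worker starts from its own empty HashSet and the partial sets are merged at the end. The names extract_links and parallel_union are invented for this example, and splitting a string on whitespace stands in for fetching and parsing a page.

```rust
use std::collections::HashSet;
use std::thread;

/// Stand-in for "fetch the page and extract its links".
fn extract_links(page: &str) -> HashSet<String> {
    page.split_whitespace().map(str::to_string).collect()
}

/// Splits `pages` into chunks, processes each chunk on its own thread and
/// merges the partial results. This uses std threads, not Rayon.
fn parallel_union(pages: &[&str], workers: usize) -> HashSet<String> {
    let workers = workers.max(1);
    // Ceiling division; at least 1, because chunks(0) would panic.
    let chunk_size = ((pages.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        let handles: Vec<_> = pages
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    // Every worker needs its own "zero" element to start from.
                    let mut partial = HashSet::new();
                    for page in chunk {
                        partial.extend(extract_links(page));
                    }
                    partial
                })
            })
            .collect();

        // Merge the partial sets once all workers are done.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .fold(HashSet::new(), |mut acc, partial| {
                acc.extend(partial);
                acc
            })
    })
}
```

Rayon does this bookkeeping for us and balances the work dynamically, which is why switching from iter to par_iter is nearly all it takes.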

This reduces the running time to 70 seconds, down from 110 seconds.

Proper error handling

One more thing to fix in our program: error handling. Rust helps us a lot with error handling through its built-in Option and Result types, but so far we've been ignoring them, liberally sprinkling unwrap everywhere. unwrap returns the inner value, or panics if there is an error (for Result) or a None value (for Option). To handle these correctly, we should create our own error type.

One appearance of unwrap that we can get rid of easily is in the normalize_url function. In the if condition we have new_url.has_host() && new_url.host_str().unwrap() == "ghost.rolisz.ro". This can't panic, because we first check that the host string exists, but there is a nicer Rust way to express this:

if let Some("ghost.rolisz.ro") = new_url.host_str() {
    Some(url.to_string())
} else {
    None
}

To my Rust newbie eyes, it looks really weird at first glance, but it does make sense eventually.
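The pattern reads more naturally once you see that it matches on the shape and the content at the same time. A small sketch with the host passed in directly as an Option<&str>; both function names are made up for the comparison.

```rust
/// The explicit form used in the first version of normalize_url.
fn is_blog_host_verbose(host: Option<&str>) -> bool {
    host.is_some() && host.unwrap() == "ghost.rolisz.ro"
}

/// The pattern-matching form: the pattern only matches when the Option is
/// Some and its content equals the string literal.
fn is_blog_host(host: Option<&str>) -> bool {
    if let Some("ghost.rolisz.ro") = host {
        true
    } else {
        false
    }
}
```

Both functions give the same answer for every input. When only a boolean is needed, comparing with host == Some("ghost.rolisz.ro") is a third way to write it.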

For the other cases we need to define our own Error type, which will be a wrapper around the other types, providing a uniform interface to all of them:

use std::io::Error as IoErr;

#[derive(Debug)]
enum Error {
    Write { url: String, e: IoErr },
    Fetch { url: String, e: reqwest::Error },
}

type Result<T> = std::result::Result<T, Error>;

impl<S: AsRef<str>> From<(S, IoErr)> for Error {
    fn from((url, e): (S, IoErr)) -> Self {
        Error::Write {
            url: url.as_ref().to_string(),
            e,
        }
    }
}

impl<S: AsRef<str>> From<(S, reqwest::Error)> for Error {
    fn from((url, e): (S, reqwest::Error)) -> Self {
        Error::Fetch {
            url: url.as_ref().to_string(),
            e,
        }
    }
}

We have two kinds of errors in our crawler: IoErr and reqwest::Error. The first is returned when trying to write a file, the second when we try to fetch a URL. Besides the original error, we add some context, such as the URL or path that was accessed when we got the error. We provide implementations to convert from each library error to our own error type, and we also define a Result helper type so that we don't always have to type out our Error type.

fn fetch_url(client: &reqwest::blocking::Client, url: &str) -> Result<String> {
    let mut res = client.get(url).send().map_err(|e| (url, e))?;
    println!("Status for {}: {}", url, res.status());

    let mut body = String::new();
    res.read_to_string(&mut body).map_err(|e| (url, e))?;
    Ok(body)
}

fn write_file(path: &str, content: &str) -> Result<()> {
    let dir = format!("static{}", path);
    fs::create_dir_all(&dir).map_err(|e| (&dir, e))?;
    let index = format!("static{}/index.html", path);
    fs::write(&index, content).map_err(|e| (&index, e))?;

    Ok(())
}

Our two functions that can produce errors now return a Result type. All the operations that can return an error have a map_err applied to the result, and we generate our own Error from the existing error.
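The combination of map_err, a (context, error) tuple and the ? operator can be tried out with the standard library alone. The sketch below uses a made-up CrawlError with a single variant and a made-up read_page function. It also adds Display and std::error::Error implementations, which the crawler's Error type does not have, so treat those as an optional extra.

```rust
use std::fmt;
use std::fs;
use std::io;

/// Hypothetical error type with one variant, mirroring the shape used above.
#[derive(Debug)]
enum CrawlError {
    Read { path: String, source: io::Error },
}

/// Converting from a (context, error) tuple lets `?` attach the context for us.
impl<S: AsRef<str>> From<(S, io::Error)> for CrawlError {
    fn from((path, source): (S, io::Error)) -> Self {
        CrawlError::Read {
            path: path.as_ref().to_string(),
            source,
        }
    }
}

impl fmt::Display for CrawlError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CrawlError::Read { path, source } => {
                write!(f, "could not read {}: {}", path, source)
            }
        }
    }
}

impl std::error::Error for CrawlError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            CrawlError::Read { source, .. } => Some(source),
        }
    }
}

fn read_page(path: &str) -> Result<String, CrawlError> {
    // map_err builds the tuple; `?` then calls From to turn it into CrawlError.
    let content = fs::read_to_string(path).map_err(|e| (path, e))?;
    Ok(content)
}
```

Calling read_page on a path that does not exist returns the Read variant, and printing it with {} gives a one-line message that includes the path.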

let (found_urls, errors): (Vec<Result<HashSet<String>>>, Vec<_>) = new_urls
    .par_iter()
    .map(|url| -> Result<HashSet<String>> {
        let body = fetch_url(&client, url)?;
        write_file(&url[origin_url.len() - 1..], &body)?;

        let links = get_links_from_html(&body);
        println!("Visited: {} found {} links", url, links.len());
        Ok(links)
    })
    .partition(Result::is_ok);

Our main loop to download new URLs changes a bit. Our closure now returns either a set of URLs or an error. To separate the two kinds of results, we partition the iterator based on Result::is_ok, which gives us two vectors: one with HashSets and one with Errors, both still wrapped in Results.

visited.extend(new_urls);
new_urls = found_urls
    .into_par_iter()
    .map(Result::unwrap)
    .reduce(HashSet::new, |mut acc, x| {
        acc.extend(x);
        acc
    })
    .difference(&visited)
    .map(|x| x.to_string())
    .collect::<HashSet<String>>();
println!("New urls: {}", new_urls.len());

We handle each vector separately. For the successful one we have to unwrap each element and then merge all the HashSets into one.

println!(
    "Errors: {:#?}",
    errors
        .into_iter()
        .map(Result::unwrap_err)
        .collect::<Vec<Error>>()
);

For the Vec containing the Errors, we have to unwrap the errors and then we just print them out.
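If the extra unwrap and unwrap_err calls feel awkward, the split can also be done in one pass that produces the inner values directly. This is a sequential, std-only sketch with a hypothetical split_results helper; it is an alternative to the partition approach, not what the crawler above uses.

```rust
/// Splits an iterator of Results into the success values and the errors,
/// without leaving either side wrapped in a Result.
/// `split_results` is a hypothetical helper for illustration.
fn split_results<T, E>(results: impl IntoIterator<Item = Result<T, E>>) -> (Vec<T>, Vec<E>) {
    let mut oks = Vec::new();
    let mut errs = Vec::new();
    for result in results {
        match result {
            Ok(value) => oks.push(value),
            Err(error) => errs.push(error),
        }
    }
    (oks, errs)
}
```

For example, feeding it the results of parsing "1", "x" and "3" as numbers yields the values 1 and 3 on one side and a single parse error on the other.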

And with that we have a small and simple web crawler, which runs fairly fast and which handles most (all?) errors correctly. The final version of the code can be found here .

Special thanks to Cedric Hutchings and lights0123 who reviewed my code on Code Review .

