I’m not rusty anymore; how do I learn new technologies?

I have heard a lot of positive things about Rust over the last few years, which is why I recently decided to start learning it. In this post, I’d like to share some of my progress.

Rust vs. C++

When it comes to efficiency, C++ is one of the most performant programming languages available. It was, and still is, the top choice for operating systems and the gaming industry. In essence, C++ was your best option whenever you needed something that works close to the metal.

Rust is a relatively new programming language (developed at Mozilla since 2010) that focuses mainly on performance and safety. What do I mean by safety? Rust places more emphasis on safe concurrency than C++ does, and its ownership-based memory management rules out null and dangling pointers at compile time. CLI tools, WebAssembly, networking, and embedded devices are the most common application areas for Rust, but it can be used anywhere memory safety and performance are crucial.
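To illustrate the "no null pointers" point, here is a minimal sketch using only the standard library (find_heading is a hypothetical helper, not a standard function): where other languages would return null when nothing is found, Rust returns an Option<T> that the compiler forces you to handle.

```rust
// Rust has no null: "maybe absent" values are expressed with Option<T>.
// find_heading is a hypothetical helper for this example.
fn find_heading<'a>(lines: &[&'a str]) -> Option<&'a str> {
    // Return the first line that looks like a heading, or None if there is none.
    lines.iter().copied().find(|line| line.starts_with('#'))
}

fn main() {
    let lines = ["intro text", "# Getting started", "more text"];
    // The compiler forces us to handle the "no heading" case explicitly.
    match find_heading(&lines) {
        Some(heading) => println!("found: {}", heading),
        None => println!("no heading found"),
    }
}
```

Forgetting the None arm is a compile error, which is exactly the kind of mistake that becomes a null-pointer crash at runtime in other languages.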

Getting started with Rust

Starting with Rust is pretty easy. Visit the official Rust website and follow the installation instructions. On Windows, download the installer, run it, and follow the on-screen instructions. You might be prompted to install the Visual Studio C++ Build Tools; if so, you can install them through the Visual Studio installer by selecting the MSVC compiler and the Windows XY SDK (XY being the version of Windows you are running).

This installer also includes Cargo, Rust’s package manager, which I will talk about next.

Hello, Cargo!

Python has pip (or conda), Node has npm, and Rust has Cargo. Cargo is Rust’s package manager and build tool: it downloads third-party libraries and builds our own crates.

Creating a new project

We will create a new project named headings_crawler using this Cargo command.

cargo new headings_crawler --bin

This way, we have created a new binary crate. If we wanted to create a library crate instead, the only difference would be using --lib instead of --bin. This command has created a project structure which looks like this:

headings_crawler/
├── src/
│   └── main.rs
├── Cargo.toml
└── Cargo.lock

Cargo TOML and LOCK

You might have noticed the two Cargo files: Cargo.toml and Cargo.lock. These files hold configuration information and references to the libraries our project depends on. Cargo.lock is generated and maintained automatically by Cargo and shouldn’t be edited by hand.

We only modify Cargo.toml, most of the time by adding dependencies under the [dependencies] section. The file uses the TOML format, which is becoming more popular because it is easily readable by humans. Before every build, Cargo checks Cargo.toml for dependencies that haven’t been resolved yet; if it finds any, it records their exact versions in Cargo.lock and downloads them before compiling.

This is an example of my TOML file.

[package]
name = "headings_crawler"
version = "0.1.0"
edition = "2021"

[dependencies]
regex = "1"
reqwest = { version = "0.11", features = ["blocking"] } # reqwest with the blocking (synchronous) client
scraper = "0.12.0"

Run your first program

Our code is located in the main.rs file. This is what our main file currently looks like:

fn main() {
    println!("Hello, world!");
}

We can compile and run our program using:

cargo run

This should print Hello, world! to the terminal.

Learning by doing is the preferred method

I usually prefer to learn a new technology by doing, whether it’s a programming language or a framework. This entails doing some research before getting started on the task at hand. I also enjoy learning by (re)creating tools that I or others can use. Doing this keeps me motivated even when things don’t go well right away (which is often the case when starting to learn a new technology).

Project: Article headings parser

I spend a lot of time reading articles, whether it’s to complete a school assignment, expand my knowledge of cybersecurity and software engineering, or simply pass the time on my daily commute. Because I read dozens of articles every day, I frequently skim just the headings before deciding whether or not to spend time reading the entire piece. So I used my newly acquired Rust skills to build a simple program that parses headings from web articles.

I would like to pass the URL of the article as one of the command line arguments and have the result displayed in the console.

Getting the arguments

Here is the code I use to handle command-line arguments. I assign them to the args variable of type Vec<String>, a vector (similar to a list) that holds String elements. Rust is statically typed; a type annotation is written after a colon following the variable name.

I also make sure that args has the proper length. It should be exactly 2: the first argument is the path to the executable, and the second should be the URL we will scrape. This approach is known as “defensive programming”, as we try to fail early and fail often. At the end, I bind the URL argument to the local variable source_url.

use std::env;
use std::process;

fn main() {
    let args: Vec<String> = env::args().collect();

    if args.len() != 2 {
        eprintln!("Insufficient amount of arguments!");
        process::exit(1);
    }

    // The URL we will scrape; used in the next step
    let source_url = &args[1];
}
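The same check can also be factored into a small, testable function. Here is a sketch with a hypothetical helper, url_from_args(), that returns None instead of exiting, mirroring the length check above.

```rust
// Hypothetical helper extracting the URL from an argument vector.
// Returning Option instead of exiting makes the logic easy to test.
fn url_from_args(args: &[String]) -> Option<&str> {
    if args.len() != 2 {
        return None;
    }
    Some(args[1].as_str())
}

fn main() {
    let args: Vec<String> = std::env::args().collect();
    match url_from_args(&args) {
        Some(url) => println!("Scraping {}", url),
        None => {
            eprintln!("Insufficient amount of arguments!");
            std::process::exit(1);
        }
    }
}
```

Keeping the exit call in main and the decision logic in a pure function is a common Rust pattern that pays off once the program grows.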

Fetching the HTML document

We now know which URL to process. The next goal is to fetch the HTML document located at that address. We’ll use the reqwest crate, which provides built-in support for making HTTP requests. Then, using the parse_fragment() method, we parse the raw HTML into an Html structure, which we assign to the html_document variable.

The next step is to create selectors: Selector objects used to query the Html structure. We pass a reference to html_document to a separate function, select_element(), where the parsing is carried out.

use reqwest::blocking;
use scraper::{Html, Selector};
use std::env;
use std::process;

fn main() {
    let args: Vec<String> = env::args().collect();

    if args.len() != 2 {
        eprintln!("Insufficient amount of arguments!");
        process::exit(1);
    }

    let source_url = &args[1];

    // Creating HTTP request (GET); unwrap() panics on network errors,
    // which is acceptable for a small tool like this
    let raw_html = blocking::get(source_url).unwrap().text().unwrap();

    // Parsing HTML document
    let html_document = Html::parse_fragment(&raw_html);

    // Selectors for the heading tags we care about
    let h2_selector = Selector::parse("h2").unwrap();
    let h3_selector = Selector::parse("h3").unwrap();

    println!("H2: ");
    select_element(&html_document, h2_selector);

    println!("\nH3: ");
    select_element(&html_document, h3_selector);
}

Parsing the HTML document

The select_element() function does the actual extraction. Since we are only looking for h2 and h3 tags (which most articles use for headings), inner_html() gives us everything between the opening and closing tag, including any nested HTML tags. That is why we use a regex to strip the remaining tags and keep only the text.

use regex::Regex;
use scraper::{Html, Selector};

fn select_element(document: &Html, selector: Selector) {
    // Matches any HTML tag so we can strip it from the heading text
    let regex = Regex::new(r"<[^>]*>").unwrap();

    for element in document.select(&selector) {
        let header_text = element.inner_html();
        let header_parsed = regex.replace_all(&header_text, "");
        println!("{}", header_parsed);
    }
}
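For comparison, here is a dependency-free sketch of the same tag-stripping idea, using only the standard library instead of the regex crate. It assumes well-formed tags and ignores edge cases such as a > character inside attribute values.

```rust
// A std-only alternative to the regex approach: skip characters
// between '<' and '>' and keep everything else.
fn strip_tags(html: &str) -> String {
    let mut out = String::new();
    let mut in_tag = false;
    for ch in html.chars() {
        match ch {
            '<' => in_tag = true,
            '>' => in_tag = false,
            c if !in_tag => out.push(c),
            _ => {} // inside a tag: drop the character
        }
    }
    out
}

fn main() {
    let heading = "Getting <em>started</em> with Rust";
    println!("{}", strip_tags(heading)); // prints "Getting started with Rust"
}
```

The regex version is shorter and more robust; this variant mainly shows that the underlying idea is a simple two-state scan over the string.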

Conclusion and what’s next?

Isn’t Rust pretty fascinating? I have to admit that learning it is a little more difficult than I had anticipated. However, I really like Rust’s compiler messages, precisely because they are stricter than those of other languages. Definitely keep an eye out for the upcoming article, as I plan to expand my current scraping tool!

If you have read so far, you might want to follow me here on Hashnode. Feel free to connect with me over at LinkedIn or Mastodon.