Web Scraping with Java and Jsoup

Web scraping is the process of automatically extracting structured information from websites. The amount of information available online grows every day, but it is often hard to get at programmatically when no API is available. If you work in software development, you have very likely been in this situation before, or are in it right now. If so, this article gives you a quick start on scraping web pages with Java and the Jsoup library.

Introduction

Jsoup is a Java library for parsing and manipulating HTML content. It provides methods that let you interact with the DOM of a page much as you would in JavaScript (e.g., you can use CSS selectors in Jsoup just as you would in jQuery). Besides parsing the HTML content of a page, Jsoup can also retrieve the content of a given URL, so you do not have to implement that part yourself.

Setup

We first need to include the Maven dependency for Jsoup in our pom.xml file.

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>

If you are not familiar with Maven, I suggest reading up on it. In the meantime, you can add the Jsoup JAR to your project manually from your project configuration.

Basic Jsoup functionality

Retrieving a web page

You can easily retrieve the content of a web page using the following code:

Document document = Jsoup.connect("http://example.com").get();

The Document object that you get as a result contains the parsed version of the web page and allows you to perform all different kinds of processing.

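Note: this method throws an HttpStatusException if the server responds with an error status (e.g., 404 Not Found), so you should decide how to handle it. The get() method also declares the checked IOException, which covers other failures such as timeouts, so that needs to be caught (or declared) as well:

try {
    Document document = Jsoup.connect("http://repubblica.it/asdasd").get();
} catch (HttpStatusException e) {
    // The server answered with an error status.
    // You can handle it differently according to the HTTP status code:
    // e.getStatusCode();
} catch (IOException e) {
    // Any other I/O problem, e.g. a timeout or a connection failure.
}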
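One way to act on the status code is to map it to a decision in a small helper. The sketch below is only an illustration: the `StatusPolicy` class, the `Action` enum and the choice of which codes to skip or retry are assumptions of this example, not part of Jsoup. It builds the `HttpStatusException` by hand so that it runs without a network connection.

```java
import org.jsoup.HttpStatusException;

public class StatusPolicy {

    // Hypothetical set of outcomes for a failed request.
    public enum Action { SKIP, RETRY, FAIL }

    // Decides what to do based on the HTTP status carried by the exception.
    public static Action decide(HttpStatusException e) {
        int status = e.getStatusCode();
        if (status == 404 || status == 410) {
            return Action.SKIP;   // the page does not exist, move on
        }
        if (status == 429 || status >= 500) {
            return Action.RETRY;  // rate limiting or server trouble, try again later
        }
        return Action.FAIL;       // anything else is treated as a hard error
    }

    public static void main(String[] args) {
        HttpStatusException notFound =
                new HttpStatusException("Not Found", 404, "http://example.com/missing");
        System.out.println(decide(notFound)); // prints SKIP
    }
}
```

In a real scraper you would call `decide(e)` inside the `catch (HttpStatusException e)` block and act on the result.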

Parsing an HTML file

If you have already downloaded the content of a given URL, you can get the Document object as follows:

Document document = Jsoup.parse(htmlContent);

Once you have your Document object, you can start operating on the DOM much as you would in JavaScript.
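As a minimal, self-contained sketch of parsing HTML that you already hold as a string: the `ParseExample` class and the sample HTML are made up for illustration. It uses the `Jsoup.parse` overload that also takes a base URI, which lets Jsoup resolve relative links when you did not fetch the page through `Jsoup.connect`.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseExample {

    // Parses an HTML string; the base URI lets Jsoup resolve relative links.
    public static Document parse(String html, String baseUri) {
        return Jsoup.parse(html, baseUri);
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Sample page</title></head>"
                + "<body><a href=\"/about\">About</a></body></html>";
        Document document = parse(html, "http://example.com/");

        System.out.println(document.title()); // prints Sample page
        // absUrl resolves the href against the base URI given to parse()
        System.out.println(document.select("a").first().absUrl("href")); // prints http://example.com/about
    }
}
```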

Selecting DOM elements

The easiest and fastest way to select DOM elements is to use the Document::select method which takes a CSS selector as an argument:

document.select("a");        // Select all links
document.select("p.text");   // Select all paragraphs with class 'text'
document.select("#primary"); // Select the element with ID 'primary'

The select method will return an object of class Elements, which is basically just a list of Element objects. For this reason, you can use methods that you would normally use with a list, such as forEach:

document.select("a").forEach(a -> {
    System.out.println(a.attr("href")); // the link target
    System.out.println(a.text());       // the visible link text
});
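Putting selection and attribute access together, here is a small runnable sketch that collects every link of a page into a map. The `LinkExtractor` class, its `extractLinks` method and the sample HTML are hypothetical names chosen for this example. The selector `a[href]` is used so that anchors without an `href` attribute are skipped.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

public class LinkExtractor {

    // Maps each link's href to its visible text, in document order.
    public static Map<String, String> extractLinks(Document document) {
        Map<String, String> links = new LinkedHashMap<>();
        for (Element a : document.select("a[href]")) {
            links.put(a.attr("href"), a.text());
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p class=\"text\">Read the <a href=\"/docs\">docs</a> or "
                + "<a href=\"/blog\">our <b>blog</b></a>.</p><a>no href</a>";
        Document document = Jsoup.parse(html);
        extractLinks(document)
                .forEach((href, text) -> System.out.println(href + " -> " + text));
    }
}
```

Because Elements is a list, a plain for-each loop works just as well as forEach with a lambda. Note that a map keeps only one entry per href, so a list would be a better fit if the same URL can appear more than once and you need every occurrence.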

Editing the DOM

There are methods that you can use to edit the content of the HTML page. These methods can be applied to any Element object that you have previously selected. The most useful/commonly used methods are:

appendChild/prependChild: appends/prepends an Element object to the current one
append/prepend: appends/prepends some raw HTML to the current object:

// Appends a span to every link in the page
document.select("a").forEach(a -> {
    a.append("<span>Hello!</span>");
});

remove: removes a node from the DOM:

// Removes all the non-HTTPS links in the page
document.select("a").forEach(a -> {
    if (a.attr("href").startsWith("http://"))
        a.remove();
});

text/html: sets the text or the inner HTML of a given element (called without arguments, they return the current value instead):

// Makes the content of every link bold
document.select("a").forEach(a -> {
    a.html("<strong>" + a.html() + "</strong>");
});

After you are finished editing the document you can get the full, edited HTML by calling document.html().
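The editing methods above can be combined in a single pass. The following sketch is an illustration under assumed names (`EditExample`, `cleanLinks` and the sample HTML are not from any library): it removes plain-HTTP links, wraps the content of the remaining links in a strong tag, and then prints the edited body.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class EditExample {

    // Drops plain-HTTP links and wraps the content of the remaining ones in <strong>.
    public static Document cleanLinks(String html) {
        Document document = Jsoup.parse(html);
        for (Element a : document.select("a")) {
            if (a.attr("href").startsWith("http://")) {
                a.remove(); // detaches the element from the DOM
            } else {
                a.html("<strong>" + a.html() + "</strong>");
            }
        }
        return document;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com\">insecure</a>"
                + "<a href=\"https://example.com\">secure</a>";
        Document document = cleanLinks(html);
        // Print only the body; document.html() would print the whole page.
        System.out.println(document.body().html());
    }
}
```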

Conclusions

Jsoup is a powerful and intuitive library for parsing HTML content in Java, especially if you are already familiar with CSS selectors. If you are not, there are countless tutorials on CSS selectors that will get you started in a matter of minutes.

I hope you found this post useful. If you have any questions, leave a comment and I’ll do my best to answer promptly. See you next time!

Using Java for Web Scraping and Data Extraction

Published April 15, 2020 | AILEF in Tutorials