Web Scraping with Java and Jsoup

Web scraping is the process of automatically extracting structured information from websites. The amount of information available online grows every day, but it is often hard to use programmatically when a site offers no API. If you work in software development, you have very likely been in this situation before, or are in it right now. If so, this article will give you a quick start on scraping web pages with Java and the Jsoup library.

Introduction

Jsoup is a Java library for parsing and manipulating HTML content. It provides methods that let you interact with the DOM of a page much as you would in JavaScript (e.g., you can use CSS selectors in Jsoup just as you would in jQuery). Besides parsing HTML, Jsoup can also retrieve the content of a given URL, so you don't have to implement that part yourself.

Setup

We first need to include the Maven dependency for Jsoup in our pom.xml file.

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>

If you are not familiar with Maven, I suggest reading up on it. In the meantime, you can also add the Jsoup JAR to your project manually through your project configuration.

Basic Jsoup functionality

Retrieving a web page

You can easily retrieve the content of a web page using the following code:

Document document = Jsoup.connect("http://example.com").get();

The Document object that you get back contains the parsed version of the web page, which you can then query and modify.

Note: get() throws an HttpStatusException if the server responds with an error status (e.g., 404 Not Found), so you should decide how to handle it. It also declares the checked IOException, which HttpStatusException extends, so that has to be caught (or declared) as well:

try {
    Document document = Jsoup.connect("http://repubblica.it/asdasd").get();
} catch (HttpStatusException e) {
    // The server answered with an error status.
    // You can handle it differently according to the HTTP status code:
    // e.getStatusCode();
} catch (IOException e) {
    // Any other I/O problem (connection failure, timeout, ...).
}
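As a slightly fuller sketch of the same idea, the helper below sets a user agent and a timeout before fetching and turns failures into an empty result. The class name PageFetcher, the user-agent string, and the 10-second timeout are assumptions chosen for illustration, not values recommended by Jsoup.

```java
import java.io.IOException;
import java.util.Optional;
import org.jsoup.Connection;
import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class PageFetcher {
    // Builds the connection without sending anything yet.
    // The user agent and timeout are illustrative assumptions.
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; ExampleBot/1.0)")
                .timeout(10_000); // milliseconds
    }

    // Returns the parsed page, or an empty Optional if the request failed.
    static Optional<Document> fetch(String url) {
        try {
            return Optional.of(configure(url).get());
        } catch (HttpStatusException e) {
            // The server answered with an error status such as 404.
            System.err.println("HTTP " + e.getStatusCode() + " for " + e.getUrl());
        } catch (IOException e) {
            // Connection failure, timeout, and similar I/O problems.
            System.err.println("Could not fetch " + url + ": " + e.getMessage());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        fetch("http://example.com")
                .ifPresent(document -> System.out.println(document.title()));
    }
}
```

Returning an Optional keeps the error handling in one place, so callers only have to deal with the "page available" and "page missing" cases.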

Parsing an HTML file

If you have already downloaded the content of a given URL, you can get the Document object as follows:

Document document = Jsoup.parse(htmlContent);

Once you have your Document object, you can start operating on the DOM much as you would in JavaScript.
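To make this concrete, here is a minimal sketch that parses an inline HTML string instead of a downloaded page. The sample markup and the class name ParseExample are made up for this example. The second argument to Jsoup.parse is the base URI, which Jsoup uses later to resolve relative links.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseExample {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
            "<html><head><title>Sample page</title></head>"
            + "<body><p class=\"text\">Hello, <a href=\"/about\">about us</a></p></body></html>";

    static Document parseSample() {
        // The base URI lets Jsoup resolve relative links such as "/about".
        return Jsoup.parse(HTML, "https://example.com/");
    }

    public static void main(String[] args) {
        Document document = parseSample();
        System.out.println(document.title()); // prints: Sample page
        System.out.println(document.select("p.text").size());
    }
}
```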

Selecting DOM elements

The easiest and fastest way to select DOM elements is to use the Document::select method which takes a CSS selector as an argument:

document.select("a");        // Select all links
document.select("p.text");   // Select all paragraphs with class 'text'
document.select("#primary"); // Select the element with ID 'primary'

The select method returns an object of class Elements, which extends ArrayList<Element>. For this reason, you can use the methods you would normally use with a list, such as forEach:

document.select("a").forEach(a -> {
    System.out.println(a.attr("href"));
    System.out.println(a.text());
});
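A common next step is to collect the links of a page into a list. The sketch below does that for an inline HTML string; the class name LinkExtractor, the sample markup, and the "text -> URL" output format are assumptions for illustration. It uses the a[href] selector to skip anchors without a target and absUrl to resolve relative links against the base URI.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LinkExtractor {
    // Returns one "text -> absolute URL" entry per link that has an href.
    static List<String> extractLinks(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        List<String> links = new ArrayList<>();
        for (Element a : document.select("a[href]")) {
            // absUrl resolves relative hrefs against the base URI.
            links.add(a.text() + " -> " + a.absUrl("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup.
        String html = "<a href=\"/docs\">Docs</a>"
                + "<a href=\"https://jsoup.org/\">Jsoup</a>"
                + "<a>No target</a>";
        extractLinks(html, "https://example.com/").forEach(System.out::println);
    }
}
```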

Editing the DOM

There are methods that you can use to edit the content of the HTML page. These methods can be applied to any Element object that you have previously selected. The most useful/commonly used methods are:

  • appendChild/prependChild: appends/prepends an Element object to the current one
  • append/prepend: appends/prepends some raw HTML to the current object:
// Appends a span to every link in the page
document.select("a").forEach(a -> {
    a.append("<span>Hello!</span>");
});
  • remove: removes a node from the DOM:
// Removes every link whose href uses plain HTTP
document.select("a").forEach(a -> {
    if (a.attr("href").startsWith("http://"))
        a.remove();
});
  • text/html: sets the text or HTML of a given element:
// Wraps the content of every link in bold
document.select("a").forEach(a -> {
    a.html("<strong>" + a.html() + "</strong>");
});

After you are finished editing the document you can get the full, edited HTML by calling document.html().
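Putting the editing methods together, the following sketch removes plain-HTTP links, makes the remaining ones bold, and returns the edited markup. The class name DomEditExample and the sample input are hypothetical. It returns document.body().html() rather than document.html() only to keep the output short.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class DomEditExample {
    // Drops links that use plain HTTP and wraps the content of the rest in <strong>.
    static String keepSecureLinks(String html) {
        Document document = Jsoup.parse(html);
        for (Element a : document.select("a")) {
            if (a.attr("href").startsWith("http://")) {
                a.remove(); // detach the link from the DOM
            } else {
                a.html("<strong>" + a.html() + "</strong>");
            }
        }
        // Only the body is returned here; document.html() would give the full page.
        return document.body().html();
    }

    public static void main(String[] args) {
        // Hypothetical sample markup.
        String html = "<p><a href=\"http://example.com\">old</a>"
                + "<a href=\"https://example.com\">new</a></p>";
        System.out.println(keepSecureLinks(html));
    }
}
```

Removing an element inside the loop is safe here because select returns a separate list: detaching a node from the DOM does not change the list being iterated.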

Conclusions

Jsoup is a powerful and very intuitive library to parse HTML content in Java, especially if you are already familiar with CSS selectors. Even if this is not the case, there are countless tutorials on CSS selectors that will get you started in a matter of minutes.

I hope you found this post useful. If you have any questions, leave a comment and I’ll do my best to answer promptly. See you next time!
