Web Scraping with Java and Jsoup

Web scraping is the process of automatically extracting structured information from websites. The amount of information available online grows every day, but it is often hard to use programmatically when a site offers no API. If you work in software development, you have very likely been in this situation before, or are in it right now. If so, this article gives you a quick start on scraping web pages with Java and the Jsoup library.

Introduction

Jsoup is a Java library for parsing and manipulating HTML content. It provides methods that let you interact with the DOM of a page much as you would in JavaScript (e.g., you can use CSS selectors in Jsoup just as you would in jQuery). Besides parsing HTML, Jsoup can also retrieve the content of a given URL, so you do not need to implement the download step yourself.

Setup

We first need to include the Maven dependency for Jsoup in our pom.xml file.

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>

If you are not familiar with Maven, I suggest you read up on it; in the meantime, you can add the Jsoup JAR to your project manually through your project configuration.

Basic Jsoup functionality

Retrieving a web page

You can easily retrieve the content of a web page using the following code:

Document document = Jsoup.connect("http://example.com").get();

The Document object that you get as a result contains the parsed version of the web page and allows you to perform all different kinds of processing.

Note: by default this method throws an HttpStatusException if the server responds with an HTTP error status (e.g., 404 Not Found), so you should decide how to handle it. The get() method also declares the checked IOException, which covers network failures such as timeouts:

try {
    Document document = Jsoup.connect("http://repubblica.it/asdasd").get();
} catch (HttpStatusException e) {
    // The server returned an error status.
    // You can handle it differently according to the HTTP status code:
    // e.getStatusCode();
} catch (IOException e) {
    // Handle connection problems here (e.g., timeouts, unreachable host).
}

Parsing an HTML file

If you have already downloaded the content of a given URL, you can get the Document object as follows:

Document document = Jsoup.parse(htmlContent);

Once you have your Document object, you can start operating on the DOM much as you would in JavaScript.
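As a minimal sketch of parsing an in-memory string, the following example parses a small HTML snippet and reads back its title and body text. The class name ParseExample and the snippet itself are made up for illustration; only Jsoup.parse, Document.title and Element.text come from Jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseExample {

    // Hypothetical HTML snippet, used only for illustration.
    static final String HTML =
            "<html><head><title>Sample page</title></head>"
          + "<body><p class=\"text\">Hello, Jsoup!</p></body></html>";

    // Parses the snippet and returns the content of its <title> element.
    public static String pageTitle() {
        Document document = Jsoup.parse(HTML);
        return document.title();
    }

    // Parses the snippet and returns the visible text of its <body>.
    public static String bodyText() {
        Document document = Jsoup.parse(HTML);
        return document.body().text();
    }

    public static void main(String[] args) {
        System.out.println(pageTitle());
        System.out.println(bodyText());
    }
}
```

No network access is involved here, which makes this form convenient for trying out selectors before pointing the code at a real page.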

Selecting DOM elements

The easiest and fastest way to select DOM elements is to use the Document::select method which takes a CSS selector as an argument:

document.select("a");        // Select all links
document.select("p.text");   // Select all paragraphs with class 'text'
document.select("#primary"); // Select the element with ID 'primary'

The select method returns an Elements object, which is essentially a list of Element objects (it extends ArrayList<Element>). You can therefore use the usual list methods, such as forEach:

// Print the target and the text of every link in the page
document.select("a").forEach(a -> {
    System.out.println(a.attr("href"));
    System.out.println(a.text());
});
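When scraping links, relative href values are a common stumbling block. The sketch below shows one possible way to collect the text and the absolute URL of each link, assuming you know the base URI of the page. The class LinkExtractor, its method name, and the sample HTML are hypothetical; Jsoup.parse(html, baseUri), select and absUrl are Jsoup calls.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractor {

    // Returns one "text -> absolute URL" string per link that has an href attribute.
    public static List<String> extractLinks(String html, String baseUri) {
        // The base URI lets Jsoup resolve relative href values.
        Document document = Jsoup.parse(html, baseUri);
        List<String> links = new ArrayList<>();
        // "a[href]" matches only the <a> elements that carry an href attribute.
        for (Element a : document.select("a[href]")) {
            links.add(a.text() + " -> " + a.absUrl("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical snippet and base URI, for illustration only.
        String html = "<a href=\"/about\">About</a> <a href=\"https://example.org/\">Home</a>";
        extractLinks(html, "https://example.com/").forEach(System.out::println);
    }
}
```

When the document comes from Jsoup.connect(...).get(), the base URI is already set from the request URL, so absUrl can be used on it directly.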

Editing the DOM

There are methods that you can use to edit the content of the HTML page. These methods can be applied to any Element object that you have previously selected. The most useful/commonly used methods are:

  • appendChild/prependChild: appends/prepends an Element object to the current one
  • append/prepend: appends/prepends some raw HTML to the current object:
// Appends a span to every link in the page
document.select("a").forEach(a -> {
    a.append("<span>Hello!</span>");
});
  • remove: removes a node from the DOM:
// Removes every link whose href starts with http:// (plain-HTTP absolute links)
document.select("a").forEach(a -> {
    if (a.attr("href").startsWith("http://"))
        a.remove();
});
  • text/html: sets the text or HTML of a given element:
// Makes the text of every link bold
document.select("a").forEach(a -> {
    a.html("<strong>" + a.html() + "</strong>");
});

After you are finished editing the document you can get the full, edited HTML by calling document.html().
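The editing methods above can be combined in a single pass. The following is a small sketch, not a definitive recipe: it removes the plain-HTTP links, wraps the remaining link contents in a strong element, and returns the edited document. The class EditExample, the method cleanLinks, and the sample HTML are assumptions made for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class EditExample {

    // Removes links whose href starts with http:// and makes the other links bold.
    public static Document cleanLinks(String html) {
        Document document = Jsoup.parse(html);
        // select returns its own list, so detaching nodes from the DOM
        // while iterating over that list is safe.
        for (Element a : document.select("a")) {
            if (a.attr("href").startsWith("http://")) {
                a.remove();
            } else {
                a.html("<strong>" + a.html() + "</strong>");
            }
        }
        return document;
    }

    public static void main(String[] args) {
        // Hypothetical input, for illustration only.
        String html = "<p><a href=\"http://a.example/\">old</a>"
                    + "<a href=\"https://b.example/\">new</a></p>";
        // Print the edited HTML of the body.
        System.out.println(cleanLinks(html).body().html());
    }
}
```

Returning the Document rather than a string keeps the result open to further selection or editing; call html() only when you need the final markup.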

Conclusions

Jsoup is a powerful and very intuitive library to parse HTML content in Java, especially if you are already familiar with CSS selectors. Even if this is not the case, there are countless tutorials on CSS selectors that will get you started in a matter of minutes.

I hope you found this post useful. If you have any questions, leave a comment and I’ll do my best to answer promptly. See you next time!
