Web scraping is the process of automatically extracting structured information from websites. The amount of information available online grows every day, but it is often hard to get at when no API exists to access the data programmatically. If you work in software development, you have very likely been in this situation before, or are in it right now. If so, this article will give you a quick start on scraping web pages with Java and the Jsoup library.
Introduction
Jsoup is a great Java library for parsing and manipulating HTML content. It provides methods that let you interact with the DOM of a page much as you would in JavaScript (e.g., you can use CSS selectors in Jsoup just as you would in jQuery). Besides parsing the HTML content of a page, Jsoup also provides methods to retrieve the content of a given URL, so you don't have to implement the download step yourself.
Setup
We first need to include the Maven dependency for Jsoup in our pom.xml file:
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
If you are not familiar with Maven, I suggest you read up on it, but in the meantime you can also add the Jsoup JAR manually through your project configuration.
Basic Jsoup functionality
Retrieving a web page
You can easily retrieve the content of a web page using the following code:
Document document = Jsoup.connect("http://example.com").get();
The Document object that you get as a result contains the parsed version of the web page and lets you perform all kinds of processing on it.
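Note: with the default settings, this method throws an HttpStatusException if the server responds with an error status (e.g., 404 Not Found), so you should think about how to handle it. HttpStatusException is a subclass of IOException, which get() also declares for other I/O failures, so that one has to be handled as well:
try {
    Document document = Jsoup.connect("http://repubblica.it/asdasd").get();
} catch (HttpStatusException e) {
    // Handle the exception here.
    // You can handle it differently according to the HTTP status code:
    // e.getStatusCode();
} catch (IOException e) {
    // Handle other I/O failures (e.g., timeouts or connection errors) here.
}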
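As a rough sketch of how the status code could drive the handling, the helper below wraps the call and returns an empty Optional when the page cannot be retrieved. The class name FetchExample, the method fetch, and the 10-second timeout are my own choices for illustration and are not part of Jsoup:

```java
import java.io.IOException;
import java.util.Optional;
import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchExample {
    // Hypothetical helper: returns the parsed page, or empty if it could not be fetched.
    static Optional<Document> fetch(String url) {
        try {
            Document document = Jsoup.connect(url)
                    .timeout(10_000) // assumed value, in milliseconds
                    .get();
            return Optional.of(document);
        } catch (HttpStatusException e) {
            // The server answered, but with an error status.
            if (e.getStatusCode() == 404) {
                System.err.println("Page not found: " + e.getUrl());
            } else {
                System.err.println("HTTP error " + e.getStatusCode() + " for " + e.getUrl());
            }
            return Optional.empty();
        } catch (IOException e) {
            // No usable response at all (connection refused, timeout, ...).
            System.err.println("Could not fetch " + url + ": " + e.getMessage());
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        fetch("http://example.com")
                .ifPresent(document -> System.out.println(document.title()));
    }
}
```

Returning an Optional keeps the error handling in one place, so the calling code only has to deal with the "page available" and "page not available" cases.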
Parsing an HTML file
If you have already downloaded the content of a given URL, you can get the Document object as follows:
Document document = Jsoup.parse(htmlContent);
Once you have your Document object, you can start operating on the DOM the same way you would in JavaScript.
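To make this concrete, here is a small self-contained sketch that parses an HTML string and reads a couple of values from the resulting Document. The class name ParseExample, its helper methods, and the sample HTML are made up for this example:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseExample {
    // Returns the content of the <title> element (empty string if there is none).
    static String titleOf(String html) {
        Document document = Jsoup.parse(html);
        return document.title();
    }

    // Returns the text of the page body with all tags stripped.
    static String visibleText(String html) {
        Document document = Jsoup.parse(html);
        return document.body().text();
    }

    public static void main(String[] args) {
        // Sample HTML, standing in for content you downloaded earlier.
        String htmlContent = "<html><head><title>Sample page</title></head>"
                + "<body><p>Hello <b>world</b></p></body></html>";
        System.out.println(titleOf(htmlContent));
        System.out.println(visibleText(htmlContent));
    }
}
```

Jsoup is tolerant of incomplete markup, so the same calls also work on HTML fragments that lack the html, head, or body tags.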
Selecting DOM elements
The easiest and fastest way to select DOM elements is to use the Document::select method, which takes a CSS selector as an argument:
document.select("a"); // Select all links
document.select("p.text"); // Select all paragraphs with class 'text'
document.select("#primary"); // Select the element with ID 'primary'
The select method returns an object of class Elements, which is basically just a list of Element objects. For this reason, you can use the methods you would normally use with a list, such as forEach:
document.select("a").forEach(a -> {
    System.out.println(a.attr("href"));
    System.out.println(a.text());
});
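Putting selection together with the parsing step, the following sketch collects the target of every link in a page into a list. The class name SelectExample, the method linkTargets, and the sample HTML are hypothetical; the a[href] selector is an ordinary CSS attribute selector that skips anchors without an href:

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectExample {
    // Returns the href attribute of every <a> element that has one, in document order.
    static List<String> linkTargets(String html) {
        Document document = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<>();
        for (Element a : document.select("a[href]")) {
            hrefs.add(a.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Sample HTML made up for this example.
        String html = "<p class=\"text\">See <a href=\"/docs\">the docs</a> "
                + "or <a href=\"/faq\">the FAQ</a>.</p>";
        linkTargets(html).forEach(System.out::println);
    }
}
```

Since Elements is a list, a plain for-each loop works just as well as forEach with a lambda; which one you pick is mostly a matter of taste.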
Editing the DOM
There are methods that you can use to edit the content of the HTML page. These methods can be applied to any Element object that you have previously selected. The most useful/commonly used methods are:
appendChild/prependChild: appends/prepends an Element object to the current one
append/prepend: appends/prepends some raw HTML to the current object:
// Appends a span to every link in the page
document.select("a").forEach(a -> {
    a.append("<span>Hello!</span>");
});
remove: removes a node from the DOM:
// Removes all the non-HTTPS links in the page
document.select("a").forEach(a -> {
    if (a.attr("href").startsWith("http://")) {
        a.remove();
    }
});
text/html: sets the text or HTML of a given element:
// Makes the text of every link bold
document.select("a").forEach(a -> {
    a.html("<strong>" + a.html() + "</strong>");
});
After you are finished editing the document, you can get the full, edited HTML by calling document.html().
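As an end-to-end illustration, the sketch below parses a snippet, removes the non-HTTPS links, makes the remaining ones bold, and returns the edited markup. The class name EditExample, the method cleanLinks, and the sample HTML are invented for this example; it returns only the body markup via document.body().html() to keep the output short, whereas document.html() would give you the whole page:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class EditExample {
    // Removes plain-HTTP links, wraps the text of the remaining links in <strong>,
    // and returns the edited HTML of the page body.
    static String cleanLinks(String html) {
        Document document = Jsoup.parse(html);
        for (Element a : document.select("a")) {
            if (a.attr("href").startsWith("http://")) {
                a.remove();
            } else {
                a.html("<strong>" + a.html() + "</strong>");
            }
        }
        return document.body().html();
    }

    public static void main(String[] args) {
        // Sample HTML made up for this example.
        String html = "<p><a href=\"https://example.com\">Safe</a> "
                + "<a href=\"http://example.org\">Unsafe</a></p>";
        System.out.println(cleanLinks(html));
    }
}
```

Removing elements inside the loop is safe here because select returns its own list of matches: remove detaches the node from the document but does not modify the list being iterated.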
Conclusions
Jsoup is a powerful and very intuitive library to parse HTML content in Java, especially if you are already familiar with CSS selectors. Even if this is not the case, there are countless tutorials on CSS selectors that will get you started in a matter of minutes.
I hope you found this post useful. If you have any questions, leave a comment and I’ll do my best to answer promptly. See you next time!