Web scraping using jsoup

Posted on April 29, 2013 by Siva Prasad Rao Janapati

In this article, we will see how to scrape the web using jsoup. Before getting into the details, let us look at what web scraping is and what its use cases are.

What is web scraping?

Web scraping (also called web harvesting or web data extraction) is a data mining technique for extracting data from websites. Put another way, it converts unstructured data on the web into structured data for analysis.

What are the use cases for web scraping?

Web scraping has many use cases. The major ones are:

For Research

To visualize unstructured data from multiple sources for analysis.

Market Analysis

To watch the services or products provided by competitors.

Lead Generation

To gather contact details of businesses or individuals, such as email addresses, phone numbers, and website URLs, from justdial.com, yellowpages.com or linkedin.com.

To Avoid XSS

To inspect user-submitted data for XSS attacks.

Note: Web scraping may be against the terms of use of some websites.

Now, we will see how to set up the open source Java HTML parser called “jsoup”. First, download the latest jsoup jar from http://jsoup.org/download. In this article, I am using jsoup version 1.7.2.

To demonstrate jsoup, I have created a Java application and added the jsoup jar file to the classpath.

Once the project is set up, connect to the URL using jsoup and get the HTML content as a Document.


Document doc = Jsoup.connect("http://www.amazon.com/Samsung-XE303C12-A01US-Chromebook-Wi-Fi-11-6-Inch/dp/B009LL9VDG/ref=sr_1_1?ie=UTF8&qid=1366683807&sr=8-1&keywords=laptop").get();
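The one-line call above can be wrapped in a small class that adds the required imports, an explicit user agent, a timeout, and error handling. This is only a minimal sketch: the class name PageFetcher, the user agent string, the 10 second timeout, and the example.com URL are illustrative assumptions and are not part of the sample project.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class, not part of the original sample project.
class PageFetcher {

    // Builds a connection with an explicit user agent and timeout.
    // Both values are illustrative assumptions; jsoup does not require them.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; SampleScraper/1.0)")
                .timeout(10000); // milliseconds
    }

    // Fetches and parses the page. get() throws IOException on network
    // failures and, by default, on HTTP error responses.
    static Document fetch(String url) throws IOException {
        return prepare(url).get();
    }

    public static void main(String[] args) {
        // Placeholder URL; pass the page to scrape as the first argument.
        String url = args.length > 0 ? args[0] : "http://example.com/";
        try {
            Document doc = fetch(url);
            System.out.println(doc.title());
        } catch (IOException e) {
            System.err.println("Could not fetch " + url + ": " + e.getMessage());
        }
    }
}
```

Keeping the connection setup in its own method makes it easy to reuse the same settings for every page that is fetched.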

Now, view the source of the page at that URL to find the HTML tags to extract. In this case, we want to extract the product name and the price. The HTML source shows that the product name is inside the span tag given below.

<h1>
<span id="btAsinTitle">Samsung Chromebook (Wi-Fi, 11.6-Inch)</span></h1>

So, from the document, we need to extract the span tag by calling the select method.

Elements titleElements = doc.select("span[id=btAsinTitle]");

The select method returns all the matching elements. With the above CSS selector, however, we will get only one element, since an id should be unique within a page. So, from the first element, we can extract the text between the span tags by calling the text method.

String title = titleElements.get(0).text();
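Note that get(0) throws an IndexOutOfBoundsException when the selector matches nothing, which can happen if the page layout changes. The following sketch shows a more defensive variant that can be run offline against the HTML fragment shown above. The class name TitleExtractor is hypothetical, and the fragment is parsed from a string with Jsoup.parse instead of being downloaded.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of the original sample project.
class TitleExtractor {

    // Returns the text of the element with id "btAsinTitle",
    // or null when the document contains no such element.
    static String extractTitle(Document doc) {
        // "#btAsinTitle" selects by id; on this markup it matches the
        // same span as "span[id=btAsinTitle]".
        Element title = doc.select("#btAsinTitle").first(); // null if no match
        return title == null ? null : title.text();
    }

    public static void main(String[] args) {
        // The HTML fragment from the article, parsed without a network call.
        String html = "<h1><span id=\"btAsinTitle\">"
                + "Samsung Chromebook (Wi-Fi, 11.6-Inch)</span></h1>";
        Document doc = Jsoup.parse(html);
        System.out.println(extractTitle(doc));
    }
}
```

Returning null for a missing element is one possible design choice; throwing a descriptive exception would work equally well if a missing title should stop the scrape.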

In the same way, we can extract the price of the product. In the HTML source, the price is available as shown below.

<span id="actualPriceValue">
<b class="priceLarge">$249.00</b>
</span>

So, we need to extract the price from the <b> tag. The extraction code snippet is given below.

Elements priceElements = doc.select("b[class=priceLarge]");

This CSS selector also matches only one element, so we can extract the price from the first element.


String price = priceElements.get(0).text();
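The extracted price is still a string such as "$249.00". Since the goal of scraping is structured data, it is often useful to convert it to a number. The sketch below uses only the Java standard library; the class name PriceParser is hypothetical, and the parsing assumes a US-style price format with a dot as the decimal separator.

```java
import java.math.BigDecimal;

// Hypothetical helper class, not part of the original sample project.
class PriceParser {

    // Removes everything except digits and the decimal point, then parses the rest.
    // Assumes a US-style format such as "$1,249.00"; other locales need different handling.
    static BigDecimal parsePrice(String text) {
        String digits = text.replaceAll("[^0-9.]", "");
        if (digits.isEmpty()) {
            throw new IllegalArgumentException("No numeric price in: " + text);
        }
        return new BigDecimal(digits);
    }

    public static void main(String[] args) {
        System.out.println(parsePrice("$249.00")); // prints 249.00
    }
}
```

BigDecimal is used here instead of double so that currency values keep their exact decimal representation.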

The source code used in this article is available at https://github.com/2013techsmarts/Web-Harvester/tree/master/Sample_Jsoup_Proj

In the next article, we will look at one more web harvesting tool.

Keep reading…

About Siva Prasad Rao Janapati

Siva Janapati is an Architect with experience in building Cloud Native Microservices architectures, Reactive Systems, Large scale distributed systems, and Serverless Systems. Siva has hands-on experience in the architecture, design, and implementation of scalable systems using Cloud, Java, Go, Apache Kafka, Apache Solr, Spring, Spring Boot, the Lightbend reactive tech stack, APIGEE edge and on-premise, and other open-source and proprietary technologies. He has expertise in working with and building RESTful and GraphQL APIs. He has successfully delivered multiple applications in the retail, telco, and financial services domains. He maintains a GitHub account (https://github.com/2013techsmarts) where he publishes the source code related to his blog posts.


2 comments on “Web scraping using jsoup”

AlexB says:

November 6, 2013 at 2:26 AM

Are any other web scraping tools available?

dujggp@gmail.com says:

November 4, 2013 at 5:45 AM

Good explanation with example.
