In this article, we will see how to scrape the web using jsoup. Before getting into the details, we will look at what web scraping is and what its use cases are.
What is web scraping?
Web scraping (also called web harvesting or web data extraction) is a data mining technique for extracting data from websites. In other words, it converts unstructured data on the web into structured data for analysis.
What are the use cases for web scraping?
Web scraping has many use cases. The major ones are:
- For Research
To visualize unstructured data from multiple sources for analysis.
- Market Analysis
To monitor the services or products offered by competitors.
- Lead Generation
To gather contact details such as email addresses, phone numbers, and website URLs of businesses or individuals from sites like justdial.com, yellowpages.com or linkedin.com.
- To Avoid XSS
To inspect user-submitted data for XSS attacks.
Note: Web scraping may be against the terms of use of some websites.
Now, we will see how to set up the open source Java HTML parser called “jsoup”. First, download the latest jsoup jar from http://jsoup.org/download (this article uses jsoup version 1.7.2).
To demonstrate jsoup, I have created a Java application and put the jsoup jar file on the classpath.
Once the project setup is done, connect to the URL using jsoup and get the HTML content as a document.
Document doc = Jsoup.connect("http://www.amazon.com/Samsung-XE303C12-A01US-Chromebook-Wi-Fi-11-6-Inch/dp/B009LL9VDG/ref=sr_1_1?ie=UTF8&qid=1366683807&sr=8-1&keywords=laptop").get();
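The one-line call above relies on jsoup's default settings. As a minimal sketch, assuming you want an explicit timeout and user agent, the connection could be configured as shown below. The 10-second timeout, the user-agent string, the default URL, and the names PageFetcher, prepare, and fetch are illustrative assumptions, not part of the original project.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper; the names and values here are illustrative assumptions.
class PageFetcher {

    static final int TIMEOUT_MILLIS = 10_000;
    static final String USER_AGENT = "Mozilla/5.0 (compatible; jsoup-demo)";

    // Builds the request without sending it, so the settings can be inspected.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent(USER_AGENT)
                .timeout(TIMEOUT_MILLIS);
    }

    // Sends a GET request and parses the response body into a Document.
    static Document fetch(String url) throws IOException {
        return prepare(url).get();
    }

    public static void main(String[] args) {
        String url = args.length > 0 ? args[0] : "http://jsoup.org/";
        try {
            Document doc = fetch(url);
            System.out.println(doc.title());
        } catch (IOException e) {
            // Covers network failures, timeouts, and HTTP error responses.
            System.err.println("Could not fetch " + url + ": " + e.getMessage());
        }
    }
}
```

Catching IOException in one place keeps the scraping code from crashing when a site is slow or unreachable.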
Now, look at the page source of the above URL to find the HTML tags to be extracted. In this case, we are trying to extract the product name and the price. From the HTML source, we can see that the product name is available in the span tag given below.
<h1> <span id="btAsinTitle">Samsung Chromebook (Wi-Fi, 11.6-Inch)</span></h1>
So, from the document, we need to select that span element by calling the select method.
Elements titleElements = doc.select("span[id=btAsinTitle]");
The above code returns all the matching elements. Since an id should be unique within a page, this CSS selector matches only one element. So, we can take the first element and extract the text between the span tags by calling the text method.
String title = titleElements.get(0).text();
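Note that get(0) throws an IndexOutOfBoundsException when nothing matches, which can happen if the page layout changes. Here is a minimal sketch of a more defensive lookup, run against the HTML fragment shown earlier rather than the live page. The class name TitleExtractor and the method extractTitle are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper, not part of the original project.
class TitleExtractor {

    // Returns the product title, or null when the span is not present.
    static String extractTitle(Document doc) {
        // "span#btAsinTitle" is the id shorthand for span[id=btAsinTitle].
        // first() returns null instead of throwing when there is no match.
        Element titleElement = doc.select("span#btAsinTitle").first();
        return titleElement == null ? null : titleElement.text();
    }

    public static void main(String[] args) {
        // Parse the fragment from a string, so no network access is needed.
        String html = "<h1> <span id=\"btAsinTitle\">Samsung Chromebook (Wi-Fi, 11.6-Inch)</span></h1>";
        Document doc = Jsoup.parse(html);
        System.out.println(extractTitle(doc));
    }
}
```

Parsing from a string with Jsoup.parse is also handy for trying out selectors before running them against a real site.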
In the same way, we can extract the price of the product. In the HTML source, the price is available as shown below.
<span id="actualPriceValue"> <b>$249.00</b> </span>
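So, we need to extract the price from the <b> tag inside that span. The <b> tag in the markup above has no attributes of its own, so we select it through the id of its parent span. The extraction code snippet is given below.
Elements priceElements = doc.select("span#actualPriceValue b");
With the above CSS selector, we will again get only one element. So, from the first element, we can extract the price.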
String price = priceElements.get(0).text();
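The extracted price is still a string such as "$249.00". For analysis, it usually has to be converted to a number. Here is a minimal sketch, assuming US-style formatting with a leading dollar sign and optional thousands separators. The class PriceParser and its parse method are hypothetical and not part of jsoup.

```java
import java.math.BigDecimal;

// Hypothetical helper, not part of jsoup or the original project.
class PriceParser {

    // Converts text such as "$1,249.00" to a BigDecimal.
    // Throws NumberFormatException if no valid number remains after cleanup.
    static BigDecimal parse(String priceText) {
        // Drop everything except digits and the decimal point.
        String digits = priceText.replaceAll("[^0-9.]", "");
        return new BigDecimal(digits);
    }

    public static void main(String[] args) {
        System.out.println(parse("$249.00")); // prints 249.00
    }
}
```

BigDecimal is used here instead of double to avoid rounding errors when prices are later summed or compared.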
The source code used in this article is available at https://github.com/2013techsmarts/Web-Harvester/tree/master/Sample_Jsoup_Proj
In the next article, we will look at one more web harvesting tool.
Keep reading…