I was working on a task to parse some of the Amazon Web Services. There are many ways to parse markup, such as DOM, SAX, or StAX, but all of them require a fair amount of code. I wanted a quick fix and finally landed on jsoup, an open source HTML parser (another HTML parser I like is HTMLParser). In this article, I explain how to parse DZone HTML links in Java.
The code in this article retrieves the description of every link on a DZone links page.
Note: This is not the best way to read links from DZone (you can use the RSS feeds instead). The point of this tutorial is to walk you through jsoup's CSS selectors in Java.
All DZone pagination queries look like this: http://www.dzone.com/links/?type=html&p=2.
I used jsoup, an open source Java library, to parse this page and extract the description text of each link.
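The pagination URL shown above can be built for any page number with a small helper. This is only a sketch: the class and method names are hypothetical, and it assumes that pages are numbered from 1 and that only the p parameter changes between pages, which the single example URL suggests but does not confirm.

```java
// Hypothetical helper, not part of the original project.
final class DzonePageUrls {

    // Prefix taken from the example URL in the article; only "p" is assumed to vary.
    private static final String BASE = "http://www.dzone.com/links/?type=html&p=";

    // Builds the URL for the given page number (assumed to start at 1).
    static String pageUrl(int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1, got " + page);
        }
        return BASE + page;
    }

    public static void main(String[] args) {
        // Print the URLs for the first three pages.
        for (int page = 1; page <= 3; page++) {
            System.out.println(pageUrl(page));
        }
    }
}
```

The resulting string can be used as the base URI argument when parsing a saved copy of that page.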
Here are sample tags we have in a DZone response:
<a name="link-613399">
</a>
<div class="linkblock frontpage " id="link-613399">
<div id="thumb_613399" class="thumb">
<a onmouseup="track(this, 'twitter4j_oauth_on_android', ''); "
href="http://www.xoriant.com/blog/mobile-application-development/
twitter4j-oauth-on-android.html">
<img width="120" height="90"
src="http://cdn.dzone.com/links/images/thumbs/120x90/613399-1307624607000.jpg"
class="thumbnail" alt="Link 613399 thumbnail"
onmouseover="return OLgetAJAX('/links/themes/reader/jsps/
nodecoration/thumb-load.jsp?linkId=613399',
OLcmdExT1, 300, 'bigThumbBody');"
onmouseout="OLclearAJAX(); nd(100);" />
</a>
</div>
<div id="hidden_thumb_613399">
</div>
<div class="tools">
</div>
<div class="details">
<div class="vwidget" id="vwidget-613399">
<a id="upcount-613399" href="#" class="upcount"
onclick="showLoginDialog(613399, null); return false">7</a>
<a id="downcount-613399" href="#"
onclick="showLoginDialog(613399, null); return false;"
class="downcount">0</a>
</div>
<h3>
<a onmouseup="track(this, 'twitter4j_oauth_on_android', ''); "
href="http://www.xoriant.com/blog/mobile-application-development/
twitter4j-oauth-on-android.html"
rel="bookmark"> Twitter4j OAuth on Android</a>
</h3>
<p class="voteblock">
<a href="http://www.codeproject.com/links/users/profile/811805.html">
<img width="24" height="24"
src="http://cdn.dzone.com/links/images/std/avatars/default_24.gif"
class="avatar" alt="User 811805 avatar" />
</a>
</p>
<p class="fineprint byline">
<a href="http://www.codeproject.com/links/users/profile/811805.html">RituR</a>
via
<a href="http://www.codeproject.com/links/search.html?
query=domain%3Axoriant.com">xoriant.com</a>
</p>
<p class="fineprint byline">
<b>Promoted: </b>
Jun 08 / 17:27. Views:
520, Clicks: 266
</p>
<p class="description">
OAuth is an open protocol
which allows the users to share their private information and assets
like photos, videos etc. with another site...
<a href='http://www.codeproject.com/links/
twitter4j_oauth_on_android.html'>more »
</a>
</p>
<p class="fineprint stats">
<a
href="http://twitter.com/home?status=RT+%40DZone+%22Twitter4j+OAuth+on+
Android%22+http%3A%2F%2Fdzone.com%2FTBxR"
class="twitter">Tweet</a>
<a href="http://www.codeproject.com/links/twitter4j_oauth_on_android.html"
class="comment">0
Comments</a>
<span class="linkUnsaved" id="save-link-613399"
onclick="showLoginDialog(613399); return false;">Save</span>
<span class="linkUnshared" id="share-link-613399"
onclick="showLoginDialog(613399); return false;">Share</span>
Tags:
<a href="http://www.codeproject.com/links/tag/mobile.html"
class="tags" rel="tag">mobile</a>
,
<a href="http://www.codeproject.com/links/tag/standards.html"
class="tags" rel="tag">standards</a>
</p>
</div>
</div>
To get the description, we have to read the data from the P element with class “description”, which sits inside a DIV with class “details”. Here is how we can do that in Java:
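package com.linkwithweb.parser;

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HTMLParser {

    public static void main(String[] args) {
        try {
            // A saved copy of the DZone links page
            File input = new File("input/dZoneLinks.xml");

            // The third argument is the base URI used to resolve relative links
            Document doc = Jsoup.parse(input, "UTF-8",
                    "http://www.dzone.com/links/?type=html&p=2");

            // Every P with class "description" that is a direct child of a DIV with class "details"
            Elements descriptions = doc.select("div.details > p.description");

            for (Element element : descriptions) {
                // ownText() skips the text of child elements, such as the trailing "more" link
                System.out.println(element.ownText());
                System.out.println("--------------");
            }
        } catch (IOException e) {
            // Thrown if the input file cannot be found or read
            e.printStackTrace();
        }
    }
}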
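To try the selector without a saved copy of the page, the same logic can be run against an in-memory string. The sketch below is illustrative only: the class name, method name, and trimmed-down HTML fragment are made up for the example, and it assumes jsoup is on the classpath. It shows two things: the child combinator (>) only matches a description paragraph that is a direct child of the details DIV, and ownText() leaves out the text of the nested "more" link.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper, not part of the original project.
final class DescriptionSelectorSketch {

    // Collects the own text of every p.description that is a direct child of div.details.
    static List<String> ownDescriptions(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element p : doc.select("div.details > p.description")) {
            result.add(p.ownText());
        }
        return result;
    }

    public static void main(String[] args) {
        // Simplified fragment modelled on the DZone markup shown earlier.
        String html =
                "<div class=\"details\">"
              + "<p class=\"description\">OAuth is an open protocol... "
              + "<a href=\"#\">more</a></p>"
              + "</div>"
              + "<p class=\"description\">This paragraph is outside div.details</p>";

        for (String description : ownDescriptions(html)) {
            System.out.println(description);
        }
    }
}
```

If the link text should be included as well, element.text() can be used in place of ownText(); it returns the combined text of the element and all of its children.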
Mavenized code has been checked in to SVN at the following location: http://code.google.com/p/linkwithweb/source/browse/trunk/Utilities/HTMLParser.
Enjoy parsing anything easily using jsoup.