jsoup HTML parser and parsing DZone links using CSS selectors in Java

AshwinRayaprolu


10 Jun 2011 | Apache | 1 min read


Retrieving descriptions of all links in DZone using jsoup.

I was working on a task to parse output from some of the Amazon Web Services. There are many ways to parse markup, such as DOM, SAX, or StAX, but all of them require a fair amount of code. I wanted something quicker and landed on jsoup, an open source HTML parser (another HTML parser I like is HTMLParser). In this article, I explain how to parse DZone HTML links in Java.

The code below retrieves the description of every link on a DZone links page.

Note: This is not the best way to read links from DZone (you can use the RSS feeds instead). The purpose of this tutorial is to walk through jsoup's CSS selectors in Java.

All DZone pagination queries look like this: http://www.dzone.com/links/?type=html&p=2.
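
As a rough illustration of that pagination pattern, the sketch below builds the URLs for a range of pages by changing only the p parameter. The class name DzonePageUrls and the helper pageUrls are hypothetical names chosen for this example, and the code only constructs strings; it does not fetch anything.

```java
import java.util.ArrayList;
import java.util.List;

public class DzonePageUrls {

    // Base of the pagination query quoted above; only the page number varies.
    private static final String BASE = "http://www.dzone.com/links/?type=html&p=";

    /** Builds the URLs for pages first..last, both inclusive. */
    static List<String> pageUrls(int first, int last) {
        List<String> urls = new ArrayList<>();
        for (int p = first; p <= last; p++) {
            urls.add(BASE + p);
        }
        return urls;
    }

    public static void main(String[] args) {
        // The page range 1..3 is an arbitrary choice for the demo.
        for (String url : pageUrls(1, 3)) {
            System.out.println(url);
        }
    }
}
```

Each URL produced this way could be saved to a file or passed to jsoup for parsing, one page at a time.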

I used jsoup, an open source Java library, to parse this page and extract the description text of each link.

Here is a sample of the markup for one link in a DZone response:

HTML

<a name="link-613399">
</a>

<div class="linkblock frontpage " id="link-613399">
    <div id="thumb_613399" class="thumb">
        <a onmouseup="track(this, 'twitter4j_oauth_on_android', ''); "
            href="http://www.xoriant.com/blog/mobile-application-development/
                   twitter4j-oauth-on-android.html">
            <img width="120" height="90"
                src="http://cdn.dzone.com/links/images/thumbs/120x90/613399-1307624607000.jpg"
                class="thumbnail" alt="Link 613399 thumbnail"
                onmouseover="return OLgetAJAX('/links/themes/reader/jsps/
                              nodecoration/thumb-load.jsp?linkId=613399', 
                              OLcmdExT1, 300, 'bigThumbBody');"
                onmouseout="OLclearAJAX(); nd(100);" />
        </a>
    </div>
    <div id="hidden_thumb_613399">

    </div>
    <div class="tools">
    </div>
    <div class="details">
        <div class="vwidget" id="vwidget-613399">
            <a id="upcount-613399" href="#" class="upcount"
                onclick="showLoginDialog(613399, null); return false">7</a>

            <a id="downcount-613399" href="#"
                onclick="showLoginDialog(613399, null); return false;" 
                class="downcount">0</a>
        </div>
        <h3>
            <a onmouseup="track(this, 'twitter4j_oauth_on_android', ''); "
                href="http://www.xoriant.com/blog/mobile-application-development/
                       twitter4j-oauth-on-android.html"
                rel="bookmark"> Twitter4j OAuth on Android</a>
        </h3>
        <p class="voteblock">
            <a href="http://www.codeproject.com/links/users/profile/811805.html">
                <img width="24" height="24"
                    src="http://cdn.dzone.com/links/images/std/avatars/default_24.gif"
                    class="avatar" alt="User 811805 avatar" />
            </a>
        </p>
        <p class="fineprint byline">
            <a href="http://www.codeproject.com/links/users/profile/811805.html">RituR</a>
            via
            <a href="http://www.codeproject.com/links/search.html?
                        query=domain%3Axoriant.com">xoriant.com</a>
        </p>
        <p class="fineprint byline">
            <b>Promoted: </b>
            Jun 08 / 17:27. Views:
            520, Clicks: 266
        </p>
        <p class="description">
            OAuth is an open protocol
            which allows the users to share their private information and assets
            like photos, videos etc. with another site... 
            <a href='http://www.codeproject.com/links/
                     twitter4j_oauth_on_android.html'>more &raquo;
            </a>
        </p>
        <p class="fineprint stats">
            <a
                href="http://twitter.com/home?status=RT+%40DZone+%22Twitter4j+OAuth+on+
                      Android%22+http%3A%2F%2Fdzone.com%2FTBxR"
                class="twitter">Tweet</a>
            <a href="http://www.codeproject.com/links/twitter4j_oauth_on_android.html" 
                class="comment">0
                Comments</a>
            <span class="linkUnsaved" id="save-link-613399"
                onclick="showLoginDialog(613399); return false;">Save</span>
            <span class="linkUnshared" id="share-link-613399"
                onclick="showLoginDialog(613399); return false;">Share</span>
            Tags:
            <a href="http://www.codeproject.com/links/tag/mobile.html" 
              class="tags" rel="tag">mobile</a>
            ,
            <a href="http://www.codeproject.com/links/tag/standards.html" 
              class="tags" rel="tag">standards</a>
        </p>

    </div>
</div>

To get the description, we need the text of the p element with class “description”, which is a direct child of the div with class “details”. Here is how we can do that in Java:

Java

/**
 *
 */
package com.linkwithweb.parser;

import java.io.File;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/****************************************************************
 * Description
 * jsoup elements support a CSS (or jquery) like selector syntax to find
 * matching elements, that allows very powerful and robust queries.
 *
 * The select method is available in a Document, Element, or in Elements.
 * It is contextual, so you can filter by selecting from a specific element, or
 * by chaining select calls.
 *
 * Select returns a list of Elements (as Elements), which provides
 * a range of methods to extract and manipulate the results.
 *
 * Selector overview
 * tagname: find elements by tag, e.g. a
 * ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
 * #id: find elements by ID, e.g. #logo
 * .class: find elements by class name, e.g. .masthead
 * [attribute]: elements with attribute, e.g. [href]
 * [^attr]: elements with an attribute name prefix,
 * e.g. [^data-] finds elements with HTML5 dataset attributes
 * [attr=value]: elements with attribute value, e.g. [width=500]
 * [attr^=value], [attr$=value], [attr*=value]: elements with attributes
 * that start with, end with, or contain the value, e.g. [href*=/path/]
 * [attr~=regex]: elements with attribute values that match
 * the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
 * *: all elements, e.g. *
 * Selector combinations
 * el#id: elements with ID, e.g. div#logo
 * el.class: elements with class, e.g. div.masthead
 * el[attr]: elements with attribute, e.g. a[href]
 * Any combination, e.g. a[href].highlight
 * ancestor child: child elements that descend from ancestor, e.g. .body p
 * finds p elements anywhere under a block with class "body"
 * parent > child: child elements that descend directly from parent,
 * e.g. div.content > p finds p elements; and body > * finds the direct children of
 * the body tag
 * siblingA + siblingB: finds sibling B element immediately
 *                      preceded by sibling A, e.g. div.head + div
 * siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
 * el, el, el: group multiple selectors, find unique elements that
 * match any of the selectors; e.g. div.masthead, div.logo
 * Pseudo selectors
 * :lt(n): find elements whose sibling index (i.e. its position
 * in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
 * :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
 * :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
 * :has(selector): find elements that contain elements matching the selector; e.g. div:has(p)
 * :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
 * :contains(text): find elements that contain the given text.
 * The search is case-insensitive; e.g. p:contains(jsoup)
 * :containsOwn(text): find elements that directly contain the given text
 * :matches(regex): find elements whose text matches the
 * specified regular expression; e.g. div:matches((?i)login)
 * :matchesOwn(regex): find elements whose own text matches the specified regular expression
 * Note that the above indexed pseudo-selectors are 0-based, that is,
 * the first element is at index 0, the second at 1, etc
 * See the Selector API reference for the full supported list and details.
 *
 * @author Ashwin Kumar
 *
 */
public class HTMLParser {

    /**
     * @param args
     */
    public static void main(String[] args) {
        try {
            File input = new File("input/dZoneLinks.xml");
            Document doc = Jsoup.parse(input, "UTF-8",
                    "http://www.dzone.com/links/?type=html&p=2");

            // get all description elements in this HTML file
            Elements descriptions = doc.select("div.details > p.description");

            /*
             * Other selector examples:
             *
             * Elements pngs = doc.select("img[src$=.png]");
             * // img with src ending .png
             *
             * Element masthead = doc.select("div.masthead").first();
             * // div with class=masthead
             *
             * Elements resultLinks = doc.select("h3.r > a"); // direct a after h3
             */

            /**
             * Iterate over all descriptions and display them
             */
            for (Element element : descriptions) {
                System.out.println(element.ownText());
                System.out.println("--------------");
            }

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}

Mavenized code has been checked in to SVN at the following location: http://code.google.com/p/linkwithweb/source/browse/trunk/Utilities/HTMLParser.
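
As a further sketch of contextual selection, the following example first selects each div.linkblock and then runs selectors inside it to pull out the title, target URL, and description together. The class LinkBlockSketch, the summarize helper, and the SAMPLE markup are hypothetical: SAMPLE is a cut-down stand-in modeled on the tags shown earlier, not a real DZone response.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkBlockSketch {

    // Trimmed-down stand-in for one link block (illustrative, not real data).
    static final String SAMPLE =
          "<div class=\"linkblock\" id=\"link-1\">"
        + "<div class=\"details\">"
        + "<h3><a href=\"http://example.com/post.html\" rel=\"bookmark\"> Sample title</a></h3>"
        + "<p class=\"description\">Sample description text... "
        + "<a href=\"http://example.com/more.html\">more &raquo;</a></p>"
        + "</div></div>";

    /** Returns one "title | href | description" row per link block in the HTML. */
    static List<String> summarize(String html) {
        Document doc = Jsoup.parse(html);
        List<String> rows = new ArrayList<>();
        for (Element block : doc.select("div.linkblock")) {
            // select() on an Element searches only within that element.
            Element title = block.select("div.details > h3 > a").first();
            Element desc = block.select("div.details > p.description").first();
            if (title == null || desc == null) {
                continue; // skip blocks that lack a title or a description
            }
            rows.add(title.text() + " | " + title.attr("href") + " | " + desc.ownText());
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : summarize(SAMPLE)) {
            System.out.println(row);
        }
    }
}
```

Because ownText() returns only the text that belongs directly to the p element, the "more »" text of the nested link is left out; text() would include it.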

Enjoy parsing anything easily using jsoup.

License

This article, along with any associated source code and files, is licensed under The Apache License, Version 2.0.