Introduction
This article describes a User JavaScript (User JS) script that lets desktop browsers mimic the reading mode typically found in ebook reader apps on mobile devices. It tries to display only the main content of a web page and to remove all or most of the extraneous material. It is intended for web pages that display news or articles. It works best on HTML5 pages, but it should also perform reasonably well on structurally well-organized pages that use plain old HTML.
Background
User JS scripts are also known as browser scripts and are similar to bookmarklets. They are also known as GreaseMonkey scripts, after the Firefox add-on that lets users customize their favorite websites by injecting their own JavaScript on the client side.
I wrote this script for inclusion in a namesake Android browser app that I developed. The app has special support for User CSS and User JS files. During development, I used a Firefox browser with the GreaseMonkey add-on, as coding and testing on an Android device emulator is cumbersome. I use this version of the script when reading articles on my desktop or laptop using the Bamboo RSS reader add-on for Firefox.
Using the Code
My Android app requires the JavaScript code to be converted to bookmarklet style. So, much of the code is in that style. For the Firefox (desktop) browser, I added an event handler for the "DOMContentLoaded" event. It runs the book reader mode after a 5-second delay. This gives me enough time to click on links even on home pages of websites where this script is not expected to work very well.
Here is the GreaseMonkey JavaScript. If the code gets mangled when it is published, then the raw JavaScript source code text can be copied from its GitHub location.
if (subhash_browser_js == null) {
var subhash_browser_js = {};
}
subhash_browser_js.book_reader_js = {
sHtml: "",
bTitleFound: false,
arYucks: [ "-ads", "_ads", "advert", "adcode", "adselect", "addthis", "alsoread", "comment",
"discuss", "email", "facebook", "float", "follow", "franchise", "googlead",
"hide_", "hidden", "hover", "jump", "lazy", "linkedin", "navig", "notifi",
"outbrain", "partner", "popular", "popup", "print", "reddit", "share", "sharing",
"short-url", "social", "sponsor", "sprite", "subscribe", "taboola", "trend",
"twitter", "url-short", "zipr" ],
createHeader: function() {
subhash_browser_js.book_reader_js.sHtml = "<head>\n" +
" <meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />\n" +
" <title>" + document.title + "</title>\n" +
" <style>\n" +
" a { border-bottom: 1px dotted navy; }\n" +
" body { background-color: rgb(200,200,220); color: black; font-family: sans-serif;\n" +
" font-size: 0.5cm; margin: 1em auto; padding: 1em; max-width: 9in; }\n" +
" code { font-family: monospace; }\n" +
" h1 { text-align: center; border-bottom: 1px solid black; padding-bottom: 0.2em; }\n" +
" a h1, a h2, a h3, a h4, a h5, a h6, h1 a, h2 a, h3 a, h4 a, h5 a, h6 a\n" +
" { color: black; border-bottom: 1px dotted black; }\n" +
" pre, figure { margin: 1em auto; padding: 1em; }\n" +
" img { display: block; margin: 1em auto; max-height: 40%; max-width: 40%; }\n" +
" img[src*='.svg'] { display: none!important; }\n" +
" figcaption { font-weight: bold; font-size: 0.8em; text-align: center; }\n" +
" header, footer, aside, nav { display: none; }\n" +
" </style>\n" +
"</head>\n" +
"<body>\n";
},
removeUnwantedTags: function() {
var arrTagsToHide = [ "aside", "footer", "iframe", "nav", "noscript", "script"];
for (var i = 0; i < arrTagsToHide.length; i++) {
var arElementsToHide = document.getElementsByTagName(arrTagsToHide[i]);
var j = arElementsToHide.length;
while (j > 0) {
arElementsToHide[j-1].parentNode.removeChild(arElementsToHide[j-1]);
--j;
}
}
},
addNoYuckiesStyle: function() {
var sStyle = "\n<style>";
for (var i = 0; i < subhash_browser_js.book_reader_js.arYucks.length; i++) {
sStyle += "*[class*=\"" + subhash_browser_js.book_reader_js.arYucks[i] + "\"], *[id*=\"" + subhash_browser_js.book_reader_js.arYucks[i] + "\"] ";
if (i < (subhash_browser_js.book_reader_js.arYucks.length-1)) {
sStyle += ",";
}
}
sStyle += " { display: none!important; }\n</style>\n";
document.getElementsByTagName("body")[0].innerHTML += sStyle;
},
parseFiniteElement: function(aoEl) {
subhash_browser_js.book_reader_js.isTitleTag(aoEl.tagName.toLowerCase());
if (!subhash_browser_js.book_reader_js.bTitleFound ||
!subhash_browser_js.book_reader_js.hasNoYuckiness(aoEl)) { return; }
var sElTag = aoEl.tagName.toLowerCase();
if (!subhash_browser_js.book_reader_js.isUsefulTag(sElTag)) { return; }
if (sElTag == "a" && aoEl.href) {
// aoEl.href is the resolved absolute URL, so test the raw attribute for "#"
if (aoEl.getAttribute("href").indexOf("#") == 0) {
subhash_browser_js.book_reader_js.sHtml += aoEl.textContent;
} else {
subhash_browser_js.book_reader_js.sHtml += "<a href=\"" + aoEl.getAttribute("href") + "\">" + aoEl.textContent + "</a>";
}
} else if (sElTag == "abbr") {
subhash_browser_js.book_reader_js.sHtml += aoEl.textContent +
" (" + aoEl.getAttribute("title") + ") " + "\n";
} else if ((sElTag == "b") || (sElTag == "em") || (sElTag == "strong")) {
subhash_browser_js.book_reader_js.sHtml += "<b>" + aoEl.textContent + "</b>";
} else if (sElTag == "br") {
subhash_browser_js.book_reader_js.sHtml += "<br />";
} else if ((sElTag == "cite") || (sElTag == "i") || (sElTag == "time")) {
subhash_browser_js.book_reader_js.sHtml += "<i>" + aoEl.textContent + "</i>";
} else if ((sElTag == "ins") || (sElTag == "kbd") || (sElTag == "mark") || (sElTag == "u")) {
subhash_browser_js.book_reader_js.sHtml += "<u>" + aoEl.textContent + "</u>";
} else if (sElTag == "img") {
subhash_browser_js.book_reader_js.sHtml += "<img src=\"" +
aoEl.getAttribute("src") + "\" />";
} else if ((sElTag == "s") || (sElTag == "strike")) {
subhash_browser_js.book_reader_js.sHtml += "<s>" + aoEl.textContent + "</s>";
} else if ((sElTag == "code") || (sElTag == "samp") || (sElTag == "var")) {
subhash_browser_js.book_reader_js.sHtml += "<code>" + aoEl.textContent + "</code>";
} else if ((sElTag == "sub")) {
subhash_browser_js.book_reader_js.sHtml += "<sub>" + aoEl.textContent + "</sub>";
} else if (sElTag == "sup") {
subhash_browser_js.book_reader_js.sHtml += "<sup>" + aoEl.textContent + "</sup>";
} else if ((sElTag == "label") || (sElTag == "span") || (sElTag == "wbr")) {
subhash_browser_js.book_reader_js.sHtml += aoEl.textContent;
} else if ((sElTag == "h1") || (sElTag == "h2") || (sElTag == "h3") ||
(sElTag == "h4") || (sElTag == "h5") || (sElTag == "h6") ||
(sElTag == "figcaption") || (sElTag == "p")) {
subhash_browser_js.book_reader_js.sHtml += "<" + sElTag + ">" +
aoEl.textContent + "</" + sElTag + ">";
}
},
isUsefulTag: function(asTag) {
var arTags = [ "a", "b", "i", "s", "u", "abbr", "article", "br", "code",
"cite", "em", "figure", "figcaption", "h1", "h2", "h3", "h4", "h5",
"h6", "img", "ins", "kbd", "label", "li", "main", "mark", "ol",
"p", "pre", "samp", "strike", "sub", "sup", "span",
"strong", "time", "ul", "var", "wbr" ];
for (var i = 0; i < arTags.length; i++) {
if (asTag == arTags[i]) {
return(true);
}
}
return(false);
},
isTitleTag: function(asTag) {
if ((!subhash_browser_js.book_reader_js.bTitleFound) &&
((asTag == "h1") || (asTag == "h2") || (asTag == "h3"))) {
subhash_browser_js.book_reader_js.bTitleFound = true;
}
return(subhash_browser_js.book_reader_js.bTitleFound);
},
hasNoYuckiness: function(aoNode) {
if (aoNode.className) {
if (aoNode.className.indexOf) {
for (var i = 0; i < subhash_browser_js.book_reader_js.arYucks.length; i++) {
if (aoNode.className.toLowerCase().indexOf(subhash_browser_js.book_reader_js.arYucks[i]) > -1) {
return(false);
}
}
}
}
if (aoNode.getAttribute) {
if (aoNode.getAttribute("id")) {
if (aoNode.getAttribute("id").indexOf) {
for (var i = 0; i < subhash_browser_js.book_reader_js.arYucks.length; i++) {
if (aoNode.getAttribute("id").toLowerCase().indexOf(subhash_browser_js.book_reader_js.arYucks[i]) > -1) {
return(false);
}
}
}
}
}
return(true);
},
parseNode: function(aoNode) {
var sTag = aoNode.nodeName.toLowerCase();
subhash_browser_js.book_reader_js.isTitleTag(sTag);
// Decide once whether this element's own tag is written out, so that the
// opening and closing tags always stay balanced.
var bEmitTag = subhash_browser_js.book_reader_js.bTitleFound &&
subhash_browser_js.book_reader_js.isUsefulTag(sTag) &&
subhash_browser_js.book_reader_js.hasNoYuckiness(aoNode);
if (bEmitTag) {
if (sTag == "a" && (aoNode.href)) {
subhash_browser_js.book_reader_js.sHtml += "<" + sTag +
" href=\"" + aoNode.href + "\">";
} else {
subhash_browser_js.book_reader_js.sHtml += "<" + sTag + ">";
}
}
// Walk the children: elements are parsed recursively, text is copied as-is.
for (var i = 0; i < aoNode.childNodes.length; i++) {
var oNode = aoNode.childNodes[i];
subhash_browser_js.book_reader_js.isTitleTag(oNode.nodeName.toLowerCase());
if (oNode.nodeType == Node.ELEMENT_NODE) {
subhash_browser_js.book_reader_js.parseElement(oNode);
} else if (oNode.nodeType == Node.TEXT_NODE) {
if (subhash_browser_js.book_reader_js.bTitleFound) {
subhash_browser_js.book_reader_js.sHtml += oNode.nodeValue;
}
}
}
if (bEmitTag) {
subhash_browser_js.book_reader_js.sHtml += "</" + sTag + ">";
}
},
parseElement: function(aoEl) {
if (window.getComputedStyle(aoEl)) {
if (window.getComputedStyle(aoEl).getPropertyValue("display") == "none") {
try { console.error("Ignoring hidden element: " +
aoEl.outerHTML.substr(0,300)); } catch (e) {}
return;
}
}
subhash_browser_js.book_reader_js.isTitleTag(aoEl.tagName.toLowerCase());
if (aoEl.children.length > 0) {
subhash_browser_js.book_reader_js.parseNode(aoEl);
} else if (subhash_browser_js.book_reader_js.bTitleFound) {
subhash_browser_js.book_reader_js.parseFiniteElement(aoEl);
}
},
changeToReader: function() {
try {
subhash_browser_js.book_reader_js.addNoYuckiesStyle();
subhash_browser_js.book_reader_js.removeUnwantedTags();
subhash_browser_js.book_reader_js.createHeader();
var oEl = document.getElementsByTagName("body")[0];
subhash_browser_js.book_reader_js.parseElement(oEl);
subhash_browser_js.book_reader_js.sHtml += "</body>\n";
document.getElementsByTagName("html")[0].innerHTML = subhash_browser_js.book_reader_js.sHtml;
} catch (e) {
console.error("Subhash Browser BRV Error: " + e);
}
},
handle_DOMLoaded: function() {
try {
window.setTimeout(
function() {
subhash_browser_js.book_reader_js.changeToReader();
},
5*1000);
} catch (e) {
console.error("Subhash Browser BRV Error: " + e);
}
}
};
document.addEventListener("DOMContentLoaded", subhash_browser_js.book_reader_js.handle_DOMLoaded, false);
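The listener registration only helps if the script is injected before the "DOMContentLoaded" event fires. Depending on when the user-script manager runs the script, the event may already have passed. The following is a minimal sketch of a guard for that case, not part of the original script. The name runWhenDomReady and its return values are hypothetical, chosen only for illustration.

```javascript
// Hypothetical helper, not part of the original script.
// oDoc is any object that exposes readyState and addEventListener
// (normally the global document).
function runWhenDomReady(oDoc, fnCallback) {
  if (oDoc.readyState === "loading") {
    // The document is still being parsed, so the event has not fired yet.
    oDoc.addEventListener("DOMContentLoaded", fnCallback, false);
    return "deferred";
  }
  // "interactive" or "complete": the document has already been parsed.
  fnCallback();
  return "immediate";
}
```

In the browser, it could be called as runWhenDomReady(document, subhash_browser_js.book_reader_js.handle_DOMLoaded), which keeps the 5-second delay because handle_DOMLoaded still sets the timer.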
Some Samples
Many websites serve HTML5 but they do not use the tags in a contextually meaningful way. This is how a Verge website article looked. It did not have many problems.
Some websites put their H1 (title) tag inside the header tag instead of the article or main tag. Their reasoning is, obviously, that the article heading should be in the header! Some websites have no use for the h-tag hierarchy at all and everything is dumped in a DIV box styled with inline CSS. It is definitely not worth using this JavaScript on those sites.
Here is a randomly chosen CodeProject article page, incidentally written by me.
I had to make a change to my original code because CodeProject places the entire article inside a form tag. The original version deleted form tags along with others such as script, footer and aside. Why does CodeProject put its article inside a form tag? Is it an ASP.NET thing?
The popular blogging tool WordPress is much worse. It takes the tags and categories that a blogger adds to a blog post and appends them to the class attribute of the div containing the post contents. So, if the blogger adds a category such as "social" (a substring listed in arYucks) to a blog post, then this JavaScript will discard the entire post. Hence, the scope for failure can be high with some website software.
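One possible mitigation is to skip the class tokens that WordPress derives from categories and tags before applying the substring test. The sketch below is not part of the original script. The function name hasYuckyClass is hypothetical, and it assumes that WordPress prefixes such tokens with "category-" or "tag-", for example category-social.

```javascript
// Hypothetical helper, not part of the original script.
// Splits the class attribute into tokens, ignores tokens that WordPress
// derives from categories and tags, and applies the same substring test
// that hasNoYuckiness() uses to the remaining tokens.
function hasYuckyClass(sClassName, arYucks) {
  var arTokens = sClassName.toLowerCase().split(/\s+/);
  for (var i = 0; i < arTokens.length; i++) {
    var sToken = arTokens[i];
    // Assumption: category and tag classes carry these prefixes.
    if (sToken.indexOf("category-") === 0 || sToken.indexOf("tag-") === 0) {
      continue;
    }
    for (var j = 0; j < arYucks.length; j++) {
      if (sToken.indexOf(arYucks[j]) > -1) {
        return true; // this token looks like an ad or social widget
      }
    }
  }
  return false;
}
```

The attribute selectors generated by addNoYuckiesStyle match substrings of the whole class attribute as well, so they would need a similar exemption before such a post becomes visible to the parser.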
Here is an AP News article.
It seems rough in some places but that is how the web designers are presenting the content to search engines and accessibility applications. This is another reason why you should think carefully before choosing your content management system or server scripting technology.
Here is how it looks in my Android app.
Do note that when you browse from a mobile device, the server might send a different version of the same page that is customized for the device. So, this JavaScript may not work in the same way as on the desktop. However, if the HTML tag organization of the mobile page is content-aware, then there should be no problem.
Points of Interest
The code eliminates an HTML element based on its tag, class name and ID attribute. To make the code customizable, it uses JavaScript arrays in which the tags, classes and IDs can be easily added or removed. The code works initially by discarding all tags until it finds the title. It assumes that the title might be in an H1, H2 or H3 tag. Then it parses the remaining tags, limiting itself to paragraphs, lists, text nodes, images and hyperlinks, and discarding everything else.
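The title-gating idea can be illustrated on a simplified tree of plain objects instead of the DOM. This is only a sketch under stated assumptions, not the script's actual traversal: the node shape ({ tag, text, children }) and the names arKeepTags, extractContent and visit are hypothetical, and class or ID filtering is left out.

```javascript
// Simplified sketch on plain objects, not the original DOM-based code.
// Each node is { tag: string, text: string (optional), children: array (optional) }.
var arKeepTags = ["h1", "h2", "h3", "p", "ul", "ol", "li", "a", "img"];

function extractContent(oRoot) {
  var oState = { bTitleFound: false, arOut: [] };
  visit(oRoot, oState);
  return oState.arOut;
}

function visit(oNode, oState) {
  // The first h1, h2 or h3 switches the extractor on.
  if (!oState.bTitleFound && /^h[1-3]$/.test(oNode.tag)) {
    oState.bTitleFound = true;
  }
  var arChildren = oNode.children || [];
  if (arChildren.length === 0) {
    // Leaf node: keep it only after the title and only if whitelisted.
    if (oState.bTitleFound && arKeepTags.indexOf(oNode.tag) > -1) {
      oState.arOut.push(oNode.tag + ": " + (oNode.text || ""));
    }
    return;
  }
  for (var i = 0; i < arChildren.length; i++) {
    visit(arChildren[i], oState);
  }
}

// Sample page: everything before the h1 and the trailing div are dropped.
var oPage = { tag: "body", children: [
  { tag: "div", text: "Site menu" },
  { tag: "p", text: "Teaser above the title" },
  { tag: "h1", text: "Headline" },
  { tag: "p", text: "First paragraph" },
  { tag: "div", text: "Related links" }
] };
var arResult = extractContent(oPage);
// arResult is ["h1: Headline", "p: First paragraph"]
```

Adding or removing entries in arKeepTags changes what survives, which mirrors how the arrays in the script are meant to be customized.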
This exercise made me look into the guts of popular websites and frankly, I am disgusted by the amount of useless junk that popular CMS software and JavaScript frameworks dump into a web page. Social media plugins are a major culprit. They not only increase the weight of the page but also block the loading of the contents. The server farms of the world would consume a lot less energy if there were not so much JavaScript inside web pages.