Creating a service for storing your Pocket articles as PDFs

Martin-Hallonqvist

19 Aug 2013, CPOL

A small project for creating a service which downloads web articles saved to Pocket for later reading as PDFs.

Introduction

I currently handle my news flow by browsing and reading articles in Feedly, and I save the ones I want to keep for future reference to my Pocket account.

Having them saved in Pocket is great for passing on links to friends and so on, but now that I have built up quite a large reference library there, it's hard to find the article I'm looking for, and it's just as hard to search all my reference material at once.

To solve this, I've built a Windows service that uses the Pocket API to download a list of all my saved links, and from that list, it downloads all new entries as PDFs and saves them to my local computer. That way, I end up with all the articles as searchable PDFs, which I then can organize and cite using my favorite citation program.

Using the code

Prerequisites

Before you begin using the code, you need to create your own Pocket Application and create an access_token for your Pocket user. Since this is more of a hack than a completed project, I'll just guide you through how you can do it without coding.

Here are the steps to create your needed keys:

Create a Pocket application.

This is a simple step. All you need to do is to go to this page and enter the needed information. When that is done, you will get your own application "consumer key", which you will use later to authorize your user for your new application.

Create your access_token.

This takes a bit more hacking. You can save the following text to an HTML-file and follow the steps there to get your access_token. It is documented here on the Pocket developers site, and basically you are doing the steps with a bit of cheating with the callbacks.

<html><head></head><body>
1. Enter your consumer_key and hit "Get code". You should get a response saying "code=XXXX".
<form action="https://getpocket.com/v3/oauth/request" method="post">Consumer_key<input type="text" 
name="consumer_key" value=""><input type="hidden" name="redirect_uri" 
value="fobar://test"><input type="submit" value="Get code"></form>
2. Go to the following link and accept the application authorization request.
<p>Go to https://getpocket.com/auth/authorize?request_token=&lt;code from step 1&gt;&amp;redirect_uri=fobar://test</p>
3. Enter your consumer_key and the code from step 1. Hit "Authorize" and you should get a response with "access_token=YYYYY".
<form action="https://getpocket.com/v3/oauth/authorize" method="post">Consumer_key<input 
type="text" name="consumer_key" value=""> Code<input type="text" 
name="code" value=""><input type="submit" value="Authorize"></form>
</body></html>

Now you've got your access_token and your consumer_key, which are used in the service to download your Pocket information.
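If you would rather script the exchange than click through the form, the same three steps can be sketched in C#. This is only a rough sketch: the class and method names are made up for illustration and are not part of the download, and it assumes Pocket answers in the form-encoded style shown above ("code=XXXX", "access_token=YYYYY").

```csharp
using System;
using System.Collections.Specialized;
using System.Net;
using System.Text;

// Hypothetical helper class; not part of the article's download.
public static class PocketAuthSketch
{
    // Same dummy callback as in the HTML form above.
    const string RedirectUri = "fobar://test";

    // Step 1: ask Pocket for a request code.
    public static string RequestCode(string consumerKey)
    {
        NameValueCollection fields = new NameValueCollection();
        fields.Add("consumer_key", consumerKey);
        fields.Add("redirect_uri", RedirectUri);
        return GetField(Post("https://getpocket.com/v3/oauth/request", fields), "code");
    }

    // Step 2: the URL to open in a browser to approve the application.
    public static string AuthorizeUrl(string code)
    {
        return "https://getpocket.com/auth/authorize?request_token=" +
            Uri.EscapeDataString(code) +
            "&redirect_uri=" + Uri.EscapeDataString(RedirectUri);
    }

    // Step 3: trade the approved code for an access_token.
    public static string GetAccessToken(string consumerKey, string code)
    {
        NameValueCollection fields = new NameValueCollection();
        fields.Add("consumer_key", consumerKey);
        fields.Add("code", code);
        return GetField(Post("https://getpocket.com/v3/oauth/authorize", fields), "access_token");
    }

    // Posts form fields and returns the raw response body.
    static string Post(string url, NameValueCollection fields)
    {
        using (WebClient client = new WebClient())
        {
            byte[] response = client.UploadValues(url, "POST", fields);
            return Encoding.UTF8.GetString(response);
        }
    }

    // Picks one value out of a form-encoded body such as "access_token=YYYY&username=me".
    // Returns null when the field is missing.
    public static string GetField(string body, string name)
    {
        foreach (string pair in body.Split('&'))
        {
            string[] parts = pair.Split(new[] { '=' }, 2);
            if (parts.Length == 2 && parts[0] == name)
                return Uri.UnescapeDataString(parts[1]);
        }
        return null;
    }
}
```

You would call RequestCode with your consumer key, open the result of AuthorizeUrl in a browser and accept the request, and then call GetAccessToken with the same code.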

You also need to download the iTextSharp library, since it is referenced in the code.

General logic

The general idea of this project is to periodically get a list of articles (URLs) saved to a specific Pocket account and compare this list to a stored list of articles already downloaded by the service. If there are any new articles, these are downloaded and converted to PDF format and then saved to a designated directory. After the article has been downloaded, the stored list of articles downloaded by the service is updated with the new entry.
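The comparison step described above can be sketched roughly as follows. The class and method names are hypothetical, and the sketch assumes that two articles are the same when their URLs match exactly; the code in the download may decide this differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; illustrates the "what is new?" step only.
public static class NewArticleFilterSketch
{
    // Returns the Pocket URLs that are not in the downloaded list yet,
    // keeping Pocket's order and dropping duplicates.
    public static List<string> FindNew(IEnumerable<string> pocketUrls,
                                       IEnumerable<string> downloadedUrls)
    {
        // Start with everything that was already downloaded
        HashSet<string> seen = new HashSet<string>(downloadedUrls, StringComparer.Ordinal);
        List<string> fresh = new List<string>();
        foreach (string url in pocketUrls)
        {
            // Add returns false when the URL is already known
            if (seen.Add(url))
                fresh.Add(url);
        }
        return fresh;
    }
}
```

Each URL in the result is a candidate for PDF conversion, and it should only be appended to the stored list once its PDF has actually been written.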

Connecting to Pocket

If you've created your keys correctly, obtaining a list of all your Pocket articles/URLs is pretty straightforward. Everything you need is documented at this page, and the relevant section of my code is shown below.

using (WebClient test = new WebClient())
{
    // Form fields for the Pocket "get" endpoint
    System.Collections.Specialized.NameValueCollection reqparm = 
      new System.Collections.Specialized.NameValueCollection();
    reqparm.Add("consumer_key", consumer_key);
    reqparm.Add("access_token", access_token);
    reqparm.Add("state", "all");            // both unread and archived items
    reqparm.Add("detailType", "complete");  // include tags and other details
    // POST the fields and decode the JSON response body
    byte[] responsebytes = test.UploadValues("https://getpocket.com/v3/get", "POST", reqparm);
    string responsebody = Encoding.UTF8.GetString(responsebytes);
...
}

This piece of code retrieves a JSON document describing every URL saved in the given Pocket account, including tags, dates added and so on.

From this JSON text, I use a regular expression to extract the information that I need, rather than using a more sophisticated JSON parser.

MatchCollection matches = Regex.Matches(responsebody, 
  "\"given_title\":\"(.*?)\".*?" +
  "\"resolved_title\":\"(.*?)\".*?" +
  "\"resolved_url\":\"(.*?)\"");

I then add each URL to a list for comparison with the saved list of already fetched articles. That list, by the way, is saved in an XML file and serialized/deserialized with an XmlSerializer object.
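A minimal sketch of that XML round trip could look like the following. The DownloadedUrl class here has a single assumed Url property and the store class name is made up; the real class in the download may carry more fields. Serializing a List<DownloadedUrl> produces the ArrayOfDownloadedUrl root element used by the empty configuration file shown under Settings.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Assumed shape of one stored entry; XmlSerializer needs a public type
// with a parameterless constructor.
public class DownloadedUrl
{
    public string Url { get; set; }
}

// Hypothetical helper for loading and saving the list.
public static class DownloadedListStoreSketch
{
    static readonly XmlSerializer Serializer =
        new XmlSerializer(typeof(List<DownloadedUrl>));

    // Reads the stored list from the XML file.
    public static List<DownloadedUrl> Load(string path)
    {
        using (FileStream stream = File.OpenRead(path))
        {
            return (List<DownloadedUrl>)Serializer.Deserialize(stream);
        }
    }

    // Overwrites the XML file with the current list.
    public static void Save(string path, List<DownloadedUrl> items)
    {
        using (FileStream stream = File.Create(path))
        {
            Serializer.Serialize(stream, items);
        }
    }
}
```

Saving after every successful download, rather than once at the end of a run, keeps the file in step with what is actually on disk if the service stops halfway through.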

Two ways of creating PDFs

Now, for each URL that is not in the list of already downloaded articles, a PDF is created in the designated directory.

I started out using WKHTMLToPDF to create the PDFs, but I found that the resulting quality was not quite as good as that of the online service PDFCrowd, which I had been using before, so I hacked together a small routine for using that site as well. With the latter, though, I suspect I might be in violation of the user agreement, but since I only do low-volume conversion, I hope it will be all right.

Using WKHTMLToPDF

This approach is pretty straightforward. I just start a new process with the command line options I want to use for the application (make sure you've installed the program and that your app.config is pointing to the right path!) and let it do the rest. Afterwards, I just check the output from the program to see if it finished without problems.

System.Diagnostics.ProcessStartInfo procStartInfo =
    new System.Diagnostics.ProcessStartInfo("\""+ wkhtmlToPDFExePath + "\"");
procStartInfo.Arguments = "--load-error-handling ignore --zoom 1.33 --javascript-delay 5000 \"" + url + 
  "\" \"" + tempFileName + "\"";

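Starting the process and collecting its output could be sketched as below. The helper name is hypothetical, and treating a non-zero exit code as a failure is an assumption about the converter rather than something taken from the download. Both output streams are read asynchronously so that a chatty program cannot fill a pipe and stall the service.

```csharp
using System;
using System.Diagnostics;
using System.Text;

// Hypothetical helper; runs a command line tool and captures what it prints.
public static class ProcessRunnerSketch
{
    // Returns the exit code; everything written to stdout/stderr ends up in output.
    public static int Run(string exePath, string arguments, out string output)
    {
        ProcessStartInfo info = new ProcessStartInfo(exePath, arguments);
        info.UseShellExecute = false;        // required for redirection
        info.CreateNoWindow = true;          // a service has no desktop anyway
        info.RedirectStandardOutput = true;
        info.RedirectStandardError = true;

        StringBuilder buffer = new StringBuilder();
        DataReceivedEventHandler collect = delegate(object sender, DataReceivedEventArgs e)
        {
            if (e.Data != null)
            {
                lock (buffer) { buffer.AppendLine(e.Data); }
            }
        };

        using (Process process = new Process())
        {
            process.StartInfo = info;
            process.OutputDataReceived += collect;
            process.ErrorDataReceived += collect;
            process.Start();
            process.BeginOutputReadLine();
            process.BeginErrorReadLine();
            process.WaitForExit();           // also waits for the readers to finish
            output = buffer.ToString();
            return process.ExitCode;
        }
    }
}
```

With this in place, the arguments string built above would be passed as the second parameter, and the captured output could be written to the log file whenever the exit code is not zero.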
Using PDFCrowd

Using PDFCrowd involves filling in your account username and password in the app.config file, and then simulating a web surfing session to create and download the PDF. You can see the details of how the session is simulated in the code.

One part that I should mention, though, is the use of a CookieAwareWebClient class, which keeps the cookies between requests so that the server can match the client to its session, mainly for the login status.
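A class like this is commonly written as a small WebClient subclass that attaches one shared CookieContainer to every request. The version below is a sketch of that common pattern and is not necessarily identical to the class in the download; the Cookies property is an addition for illustration.

```csharp
using System;
using System.Net;

// Sketch of the usual pattern: every request made through this client
// shares the same cookie container, so a login cookie set by one
// response is sent along with the following requests.
public class CookieAwareWebClient : WebClient
{
    private readonly CookieContainer cookies = new CookieContainer();

    // Exposed so that callers can inspect or pre-load cookies.
    public CookieContainer Cookies
    {
        get { return cookies; }
    }

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        HttpWebRequest http = request as HttpWebRequest;
        if (http != null)
        {
            // Only HTTP(S) requests can carry cookies
            http.CookieContainer = cookies;
        }
        return request;
    }
}
```

Posting the login form with UploadValues and then downloading the PDF with the same client instance is what keeps the session alive between the two calls.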

Adding PDF metadata

As a small addition, I use iTextSharp to add some metadata to the PDF so that it is easier to import into document management systems. I first wanted to add keywords to the file as well, and keep these synced with Pocket tags, but since my DMS does not support that feature, I left it unimplemented. I do add the source URL, the web page as author and the title as title in the PDF metadata, though.

private static void AddMetadataToPDF(string filePath, string url, 
        string author, string title, string keywords)
{
    // Write the stamped copy to a temp file, then swap it in for the original
    string tempFileName = System.IO.Path.GetTempFileName();
    PdfReader pdfReader = new PdfReader(filePath);
    pdfReader.SelectPages("1-" + pdfReader.NumberOfPages);
    using (PdfStamper stamper = 
      new PdfStamper(pdfReader, new FileStream(tempFileName, FileMode.Create)))
    {
        Dictionary<String, String> info = new Dictionary<string,string>();
        info.Add("Keywords", keywords);
        info.Add("Title", title);
        info.Add("Author", author);
        info.Add("OriginalUrl", url);   // custom entry holding the source URL
        // PDF dates are D:YYYYMMDDHHmmSS followed by the UTC offset as +HH'mm'
        DateTime now = DateTime.Now;
        TimeSpan offset = TimeZoneInfo.Local.GetUtcOffset(now);
        info.Add("CreationDate", "D:" + 
          now.ToString("yyyyMMddHHmmss") + 
          (offset < TimeSpan.Zero ? "-" : "+") + 
          Math.Abs(offset.Hours).ToString("00") + "'" + 
          Math.Abs(offset.Minutes).ToString("00") + "'");
        stamper.MoreInfo = info;
        stamper.Close();
    }
    pdfReader.Close();
    File.Delete(filePath);
    File.Move(tempFileName, filePath);
}

Settings

The app.config file holds configurationFile, logFile, downloadFolder, interval, accessToken, consumerKey, createUsingPDFCrowd, pdfCrowdUserName, pdfCrowdPassword, createUsingWKHTMLToPDF and wkhtmlToPDFExePath. All of these are pretty self-explanatory, but you need to set them correctly for the program to work.

The configuration.xml file pointed to by the configurationFile entry must exist as an XML file with the following content. This is the file that will hold the list of downloaded URLs, by the way.

<?xml version="1.0" encoding="utf-8"?>
<ArrayOfDownloadedUrl xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
          xmlns:xsd="http://www.w3.org/2001/XMLSchema">
</ArrayOfDownloadedUrl>

Error handling

This is more of a "proof of concept" kind of project than a release candidate, so error handling is bad at best :)

If you want to get it into some kind of production state, you really need to take some steps to make it more resilient to errors.

Considering that this is a Windows service with no user interface and no way of communicating with the user/administrator other than by logging of different sorts, you really need to make it failproof.
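One possible first step, sketched here with made-up names, is to wrap the work for each article in its own try/catch, so that one broken page is logged and skipped instead of taking the whole service down. This is a suggestion and not what the download currently does.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper; shows per-article error isolation only.
public static class ResilientLoopSketch
{
    // Runs convert for each URL. Failures are written to the log and skipped.
    // Returns the URLs that were converted successfully.
    public static List<string> ProcessAll(IEnumerable<string> urls,
                                          Action<string> convert,
                                          TextWriter log)
    {
        List<string> succeeded = new List<string>();
        foreach (string url in urls)
        {
            try
            {
                convert(url);
                succeeded.Add(url);
            }
            catch (Exception ex)
            {
                // Log and carry on with the next article
                log.WriteLine("{0:u} FAILED {1}: {2}", DateTime.UtcNow, url, ex.Message);
            }
        }
        return succeeded;
    }
}
```

If only the returned URLs are added to the stored list, a failed article is simply tried again at the next interval.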

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)