gbCapture - Extract Protected Document Content

gbeene

21 Mar 2023 · CPOL · 5 min read

Capture protected document content as text without the need for breaking DRM

Many users want to convert documents for personal use into formats suited to the viewing applications they are familiar with. Several conversion tools are available, but for DRM-protected documents the number of free tools is particularly limited. Calibre is one of those free tools and, with a DeDRM extension, is used to convert Kindle books to a variety of other formats. In the case of Kindle books, however, the DeDRM tool requires that the book be downloaded with an older copy of the Amazon Kindle app, one that provides a file format the DeDRM extension supports. gbCapture instead captures a document’s text content from the document’s native viewer, without the need for DRM removal.

Download documentation, source code and demo project

For a simple capture, gbCapture is run in “mini” mode, as shown in this image:

gbCapture also supports the following full user interface. It is used to evaluate all of the gbCapture settings, extract pages selectively, view details of the extracted text and images, and generally get a better understanding of how gbCapture works.

Introduction

I work with low vision users, particularly folks with macular degeneration. Their loss of eyesight is a great disappointment for them, in no small part because it prevents them from reading books, such as those they buy from Amazon. Native book reading apps such as the Kindle app do not allow low vision users to adjust the way the book content is displayed, at least not to the degree that low vision users require. A product of mine called EZReader allows low vision users to reformat books so they can better see the content, and it can also read the content out loud. However, EZReader works only with text content, so a Kindle book must first be converted to text before it can be displayed in EZReader.

So, the purpose of this article is to demonstrate a straightforward technique for converting protected documents, such as DRM protected Kindle books, to text. I call the utility gbCapture and its interface is designed so that even low vision users can successfully convert their DRM-protected books to text for import into EZReader or for conversion to other formats by other conversion applications.

One of the best-known free solutions for converting DRM-protected books to other formats is Calibre. Like other available solutions, it requires the use of a DeDRM extension. In the case of Kindle books, it also requires that the book be downloaded with an earlier version of the Kindle app, one that downloads books in a format that DeDRM supports.

With gbCapture, the text is extracted by viewing the book in its native viewer (such as any version of the Kindle app), taking a picture of each page in the document and using OCR (Tesseract) to extract the text. gbCapture automates the process, capturing one page at a time and then turning the page in the native viewer until the end of the document is reached. When the last page has been captured, the extracted text is merged into a single text file.

The Article Body

gbCapture is written in the PowerBASIC language, which can call the Win32 API directly. That makes it easy to supplement the built-in PowerBASIC statements, and gbCapture makes heavy use of the Win32 API.

Let’s go through the basic operation of gbCapture, then I’ll highlight several areas and provide code examples for the more critical gbCapture procedures. I’ll use the Kindle app for the discussion, but gbCapture can also work with other applications that display documents (Word, WordPad, Notepad, etc.).

To begin a book capture, open the Kindle app to display a book. The app should be placed in the center of the desktop and sized to ensure that the pages to be captured have margins around the text. gbCapture automatically detects the app that covers the center of the desktop. It also detects the window within the app that contains the displayed text.

With gbCapture opened in “mini” mode off to the side of the Kindle app, simply press Capture on the gbCapture toolbar. gbCapture will take a picture of the currently viewed page, extract the text, turn the page and then repeat the process until it reaches the end of the book. “End of Book” is determined by two consecutive captures returning the same text.

Here’s an image that gbCapture provides to confirm to the user that the window containing the text has been identified.

Once Capture is pressed, gbCapture begins the automated capture/extract text/turn page procedure. The gbCapture statusbar will indicate progress. Capture will stop automatically when the end of the book is reached. The user can manually stop the capture at any time.
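
The capture, extract, turn-page cycle and its “End of Book” test can be sketched independently of any viewer. The following is a minimal illustration in Python rather than PowerBASIC, so that it can be run without the Kindle app or Tesseract. The names capture_document, FakeViewer and max_pages are hypothetical and do not come from the gbCapture source, and the max_pages safety cap is an assumption added for the sketch.

```python
from typing import Callable, List, Optional


def capture_document(capture_page: Callable[[], str],
                     turn_page: Callable[[], None],
                     max_pages: int = 10_000) -> List[str]:
    """Collect page texts until two consecutive captures return the same text."""
    pages: List[str] = []
    last_text: Optional[str] = None
    for _ in range(max_pages):          # safety cap (assumption, not in the article)
        text = capture_page()
        if text == last_text:           # same text twice in a row: end of book
            break
        pages.append(text)
        last_text = text
        turn_page()
    return pages


class FakeViewer:
    """Stand-in for a document viewer: paging past the end stays on the last page."""

    def __init__(self, pages: List[str]) -> None:
        self.pages = pages
        self.position = 0

    def capture(self) -> str:
        return self.pages[self.position]

    def turn(self) -> None:
        self.position = min(self.position + 1, len(self.pages) - 1)


viewer = FakeViewer(["Page one", "Page two", "Page three"])
captured = capture_document(viewer.capture, viewer.turn)
print("\n".join(captured))
```

One consequence of this stopping rule is worth keeping in mind: two genuinely identical consecutive pages, such as two blank pages, would also end the capture early.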

Key Code Sections

The majority of the code is straightforward and does not require in-depth discussion. However, I’ve selected a few of the procedures for additional discussion. The window containing the text is found with the Win32 API: WindowFromPoint returns the window at the center of the desktop, and GetParent returns that window’s parent.

PowerBasic

Sub GetHandles
   Local pt As Point
   Desktop Get Client To pt.x, pt.y
   pt.x = pt.x/2 : pt.y = pt.y/2
   hCenterWindow = WindowFromPoint(pt)
   hCenterApp    = GetParent(hCenterWindow)
End Sub

gbCapture provides two ways to capture an image of the text: one captures only the container window, and the other captures the entire desktop and then extracts the container window from it. Not all viewing apps respond to both approaches, so gbCapture allows the user to select which one to use. PowerBASIC provides a number of “Graphic” statements to make it easier to work with images. In particular, “Graphic Bitmap” is a PowerBASIC statement that creates an in-memory bitmap structure, whose DC is used in the following code for capturing the container window.

PowerBasic

Sub CapturePageImage_Window
   'Capture Image to ImgName
   Local hPageDC, hBMP, hBMPDC As Dword, w,h As Long

   GetWindowRect hCenterWindow, rcCenter
   hPageDC = GetDC(hCenterWindow)

   w = rcCenter.Right - rcCenter.Left
   h = rcCenter.Bottom - rcCenter.Top

   Graphic Bitmap New w,h To hBMP
   Graphic Attach hBMP, 0
   Graphic Get DC To hBMPDC

   BitBlt hBMPDC, 0, 0, w, h, hPageDC, 0, 0, %SRCCopy
   ReleaseDC hCenterWindow, hPageDC   'release with the same window handle that was passed to GetDC

   Graphic Save ImgName
   Graphic Bitmap End
   Statusbar Set Text hDlg, %IDC_Statusbar, 1,0, " Image: " + PathName$(Namex,ImgName)
End Sub
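
The second approach, capturing the whole desktop and then extracting the container window, comes down to computing which rectangle of the desktop image to keep. The sketch below shows only that arithmetic, in Python rather than PowerBASIC. The names Rect, crop_box and size are hypothetical, and clipping the rectangle to the desktop bounds is an assumption made here for the case where the window extends past the screen edge. It is not taken from the gbCapture source.

```python
from typing import NamedTuple, Tuple


class Rect(NamedTuple):
    """Screen rectangle in pixels, same field order as a Win32 RECT."""
    left: int
    top: int
    right: int
    bottom: int


def crop_box(window: Rect, desktop_w: int, desktop_h: int) -> Rect:
    """Clip the window's screen rectangle to the desktop bounds."""
    return Rect(
        max(window.left, 0),
        max(window.top, 0),
        min(window.right, desktop_w),
        min(window.bottom, desktop_h),
    )


def size(box: Rect) -> Tuple[int, int]:
    """Width and height of a rectangle, as computed in CapturePageImage_Window."""
    return box.right - box.left, box.bottom - box.top


# Hypothetical numbers: a viewer window centered on a 1920 x 1080 desktop.
window = Rect(left=400, top=150, right=1500, bottom=1000)
box = crop_box(window, desktop_w=1920, desktop_h=1080)
print(box, size(box))
```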

With the page image captured, the Tesseract OCR engine, run as an external program, is used to extract the text. The time to capture a page image and extract its text depends on the capability of the user’s PC, but in general it might take up to 1 s per page. At that rate, a complete book of 600 pages would take about 10 minutes (600 pages × 1 s = 600 s). That is much slower than other available tools, but it does not require removing the document’s DRM protection.

PowerBasic

Sub ExtractTextFromImage
   Local ExtractedText$
   'extract and clean the text using Tesseract
   Shell ($Tesseract + " --psm " + IIf$(MultiColumn,"4 ","1 ") + _
   ImgName + " " + PathName$(Path,TextName) + _
   PathName$(Name,TextName), 0) 'wait for it to finish
   Statusbar Set Text hDlg, %IDC_Statusbar, 1,0, " Text: " + PathName$(Namex,TextName)
   Open TextName For Binary As #1            'get extracted text
   Get$ #1, Lof(1), ExtractedText$
   Close #1
   CleanText(ExtractedText$)
   If LastText$ = ExtractedText$ Then        'same text as the previous page: end of document
      If IsFile(ImgName)  Then Kill ImgName
      If IsFile(TextName) Then Kill TextName
      StopCapture = 1
      Exit Sub
   Else
      LastText$ = ExtractedText$        'for comparison, to know when stop
   End If
   'save cleaned, but not formatted, extracted text
   Open TextName For Output As #1
   Print #1, ExtractedText$;
   Close #1
   'append the extracted text to $Document
   Open $Document For Append As #1
   If AutoParagraphFormatting Then ParagraphFormatting(ExtractedText$)
   Print #1, $CrLf + ExtractedText$ + IIf$(UsePageNumbers, _
             $CrLf + "Page: " + Str$(ActionCount) + $CrLf, "") ;
   Close #1
   Statusbar Set Text hDlg, %IDC_Statusbar, 1,0, " Text: " + PathName$(Namex,TextName)
End Sub
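
The command line that the Shell statement assembles follows Tesseract’s usual form: the image file, an output base name without extension (Tesseract appends .txt itself) and a page segmentation mode. As a rough illustration, here is the same assembly in Python rather than PowerBASIC. The function name tesseract_command and the default executable name are hypothetical, and the --psm values simply mirror the listing above (4 when MultiColumn is set, otherwise 1).

```python
import os
from typing import List


def tesseract_command(image: str, text_file: str, multi_column: bool,
                      exe: str = "tesseract") -> List[str]:
    """Build the argument list for one OCR run."""
    # Tesseract adds ".txt" to the output base, so strip the extension first.
    output_base = os.path.splitext(text_file)[0]
    # Page segmentation mode, mirroring the article's listing.
    psm = "4" if multi_column else "1"
    return [exe, image, output_base, "--psm", psm]


cmd = tesseract_command("page001.png", "out/page001.txt", multi_column=False)
print(" ".join(cmd))
```

Such a list could be passed to subprocess.run(cmd, check=True), which blocks until Tesseract exits. The recognized text would then be read from the output base plus .txt, as the PowerBASIC code does with TextName.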

Once the current page is captured and its text extracted, gbCapture turns the page by giving focus to the center application and sending it a Page Down keystroke (%VK_NEXT). Both the key-down and key-up events are sent.

PowerBasic

Sub NextPage
   SetFocus hCenterApp
   SetForeGroundWindow hCenterApp
   keybd_event(%VK_NEXT, 0, 0, 0)
   keybd_event(%VK_NEXT, 0, %KEYEVENTF_KEYUP, 0)
End Sub

Conclusion and Points of Interest

For capturing text from a document, gbCapture offers an alternative to existing solutions. In particular, it can extract text from protected documents without having to enable document editing, break the document’s DRM protection or use a specific version of the native file viewer. With the text in hand, users can turn to other conversion applications to create documents in other formats.

History

20th March, 2023: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)