Papers 3 and PDF Pen – a match made in Heaven (a workflow)

Overview

The problem: PDFs as scanned images (the text can’t be ‘selected’)
The workflow: Extracting a PDF from Papers 3 for Mac, OCRing it with PDF Pen and putting the OCR’d version back in exactly the spot where the original was
The software: Papers 3 for Mac (reference management software) and PDF Pen for Mac (PDF manipulation)

The Problem

Managing PDFs is the bane of my existence – well, at least on the ‘academic’ side. Fortunately, the developers of my favourite reference manager, Papers 3 for Mac, have recently made some improvements that make things a bit easier. Finally.

Anyone who has been using Papers for any length of time will be aware that Papers 2 was awesome… and then they released Papers 3. We don’t talk about the early period of this release except in hushed tones. I think everyone will agree it is best forgotten. However, with the most recent update (v 3.3.2), Papers seems to be back on track again. Many of the features that were so dearly loved in versions 2.x are back. They also seem to have sorted out local Wi-Fi sync, which means that I can sync my library with an iPad without having to use a third-party service like Dropbox (which would do weird things to my file directory/naming conventions).

Like most PDF/reference managers, Papers allows me to ‘mark up’ the PDF being read – including extracting quotes, highlighting, underlining etc. This works because most PDFs downloaded directly from the publisher already contain text that the computer can read. However, if I get a PDF from my University Library Document Delivery service, it usually arrives as a scanned image. This means no OCR. It also means no clicky, selecty, highlighty, extracty goodness.

Enter PDF Pen by the good folk at Smile Software.

PDF Pen is a powerful piece of software that allows me to alter PDFs in many different ways, but the one I rely upon the most is the ability to OCR a scanned PDF.

OCR stands for Optical Character Recognition. What this means is that the software will look at an image file (in this case a scanned PDF) and if it recognises words in the image, it can convert those images of the words to actual words that the computer can read[Footnote 1].

The Workflow

How to magically OCR a PDF in Papers 3
In the past it has been a nightmare: I had to find the actual PDF file within the (hidden) library of Papers 3, extract it, open PDF Pen, OCR the document, save it somewhere and then replace the original image file in Papers 3 with the new OCR version. Even then, Papers 3 would see the ‘new’ version of the PDF and add it as a supplementary paper, rather than replacing the primary paper. Now, with the latest release of Papers 3, the process is much easier[Footnote 2]:

  1. Import scanned image of PDF into Papers 3
  2. Make sure all the metadata is correct using the inspector
  3. Save as a new record in my Papers 3 library
  4. In the inspector panel within Papers 3, right-click on the PDF file (see screenshot below)
  5. Choose the option to “Open with PDF Pen”
  6. PDF Pen will recognise the PDF as a scanned image and will offer to OCR it for me. Click yes.
  7. PDF Pen does its OCR magic and, when it has finished, overwrites the original PDF image file in Papers 3 with the new OCR version, saving it in the same location as the previous image file.

[Screenshot: Right-click on the PDF image in the inspector panel and then select “Open with PDF Pen”]

It works like magic and the PDF is now searchable: sections of text can be highlighted, direct quotes can be extracted, and so on, all without the messy business of trying to find the original file and making sure that the delicate file structure that I’ve set up is not screwed up.
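
For the curious, here is roughly what ‘saving it in the same location’ amounts to, sketched in a few lines of Python. This is only an illustration of the idea, not how PDF Pen or Papers 3 actually do it. The function name `replace_in_place`, the `.bak` backup and any file paths you pass in are my own assumptions. The point is simply that the new OCR version ends up with exactly the same path and file name as the old scan, so anything that points at that file still finds it.

```python
import os
import shutil
import tempfile
from pathlib import Path


def replace_in_place(original, ocr_version):
    """Overwrite `original` with the contents of `ocr_version`.

    The path and file name of `original` do not change, so anything that
    refers to that file keeps working. Returns the path of a backup copy
    of the untouched scan (hypothetical convention: same name plus ".bak").
    """
    original = Path(original)
    ocr_version = Path(ocr_version)

    # Keep a copy of the original scan next to it, just in case.
    backup = original.with_name(original.name + ".bak")
    shutil.copy2(original, backup)

    # Copy the OCR'd file into the same folder under a temporary name first,
    # so the final swap is a rename within a single directory.
    fd, tmp_name = tempfile.mkstemp(dir=original.parent, suffix=".tmp")
    os.close(fd)
    shutil.copyfile(ocr_version, tmp_name)

    # os.replace overwrites the destination if it already exists.
    os.replace(tmp_name, original)
    return backup
```

Calling `replace_in_place("library/paper.pdf", "Desktop/paper-ocr.pdf")` (both paths made up) would leave `library/paper.pdf` holding the OCR’d file and `library/paper.pdf.bak` holding the old scan. The whole appeal of the workflow above is that I never have to do any of this by hand.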

Because the most recent release of Papers 3 now allows Spotlight to index the text of the PDFs within its library, I can search for text within any of my PDFs in Papers 3 right from the desktop.

Wow.


  1. It doesn’t actually convert the image itself, but rather places another layer on top of the document which mimics the underlying text. This makes the text readable by a machine/software. Crucially, it also means that the PDF now becomes searchable.
  2. If you already have the image of the PDF within your Papers 3 database, you can ignore steps 1 to 3.