Text and Image Documents

There are two basic types of documents you can use in a paperless office.  The first type of document is text based.  These are formats like .TXT .RTF .DOC, etc.  These store text as editable information.  You can go in and change the document, fix spelling, copy sentences, etc.

scanner-main

The second type of document is an image based document.  This includes formats like .TIF, .JPG, .PNG, .GIF, etc.  These documents just represent a bunch of pixels.  The computer can’t edit the words themselves other than by deleting pixels and putting new pixels down.  You can’t copy a sentence and paste it into another program if you are using this format.

The advantage of the text-based formats is the fact that they can be searched.  If the document contains the word “Smith Contract,” a search on your computer for those words should show the document in the results.  With image-based documents you don’t have that luxury.  If you want to be able to find it, you had better name it using the keywords you might use for your search, put it in a directory with the name you will search for, or associate meta information with the document containing all the keywords you might use.

The advantage of image-based documents is the way they preserve the layout and non-text elements.  If you have to go to court to show someone signed a contract, you are going to want to have an image-based document with their signature. (There are some ways to do things with digital PKI signatures that will stand up in court, but that gets quite a bit more complicated.)

Of course, the problem is, you may have a hard time locating the particular contract unless you were particularly careful about where and how you saved it.

The PDF format solves many of these issues.  PDFs allow you to store a document as an image AND as text.  Think of it as two layers: you have a text layer that contains the words in a computer-readable format and you have the image layer that contains a picture of the document–including any pictures, annotations, etc.  So if you want to search for a keyword, it acts as a text-based document.  If you need to print out a copy of the document, it acts as an image-based document.

doc-types.png

When you scan your document, you want to make sure both types of information are recorded.  To get text information from a scanned document, you need to use some type of optical character recognition.  Usually your scanner will come with some type of OCR software.  Many scanning programs will automatically add the text layer into a PDF.  The newer versions of Acrobat have OCR capabilities built in so you can take image-based documents and add the text layer with a few clicks.

In my work-flow, my scanner sends the image of each document directly to a program that  performs some optimizations, does OCR and then saves the results as a PDF in my document repository.

If you are looking at setting up a paperless office, you will need to consider how the character recognition takes place.  The more you are able to automate the process, the easier it will be to work with.

Note: If you are creating a PDF directly from your computer, there is a way to skip the image layer while still preserving the layout of the page.  If you start adding signatures and markups, it will create an image layer to put those items in.

Originally published January 17, 2008.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>