Scanning & OCR Technical Documentation

Scanning involves digitizing texts or images by using a scanner. Scanning is an easy way to produce PDFs of originally print publications for broader dissemination. THL relies heavily on scanning to digitize journals and books that are either out of print or for which we have secured copyright permission to reproduce. In addition, OCR (Optical Character Recognition) can be used to actually interpret the text and produce word processing documents with the scanned text. A separate use of scanning technology is to scan print pictures, negatives, or slides in order to digitize photography that wasn't originally digital.

Scanning equipment:

Local resources for scanning books and documents are:

A small A4 flatbed scanner is available at Institut d'Asie Orientale.

A Canon XXXX is available at Institut d'Asie Orientale for formats up to A3. The scanner-photocopier can digitize both B&W and color documents at a resolution of XXXdpi. It has a sheet-feed capability that we recommend for large quantities of documents, provided they meet the quality standards for sheet feeding.

A Digibook XXXXXX is available in the Media Department of ENS de Lyon. We recommend its use for loose-sheet documents, especially when paper quality does not allow sheet-feeding. It can also be used for digitizing books, provided the book opens well, although some deformation will remain that can be corrected digitally. For thick books, however, we recommend using the XXXXX Digibook co-owned by ISH and MOM and located at the MOM (see below). To use the Digibook of the Media Department, a reservation needs to be made with the department.

The most sophisticated scanning device is the XXXXXX Digibook located at the MOM. It requires training before one can use it. ISH can provide a digitization assistant, but this is a fee-based service. Use of the XXXXX Digibook itself is not free either: there is a daily, weekly, or monthly fee, depending on the quantity of materials to be processed.

For scanning images:

Scanning Guidelines For Images

The general guidelines for digitizing images by scanning negatives, slides, or prints are as follows:

Resolution: 3000 pixels on the long side. The software may then ask you to specify resolution in addition, but in fact specifying the long side is what determines the actual resolution. This subsequent question just determines what goes into the image header metadata and does not actually influence the image’s resolution.
Bit depth: 24 bit (full color), which can also be understood as 8 bits per channel. Your scanning software may offer other choices such as 12 or 16 bits per channel, but it is important that you scan the images at 8 bits per channel.
Editing: Uncorrected; you may do gamma adjustment before scanning, not after.
Color space: If you are given a choice, select Adobe RGB (1998).
File format: TIFF
Compression: LZW is okay.
File size: The images will probably wind up around 20 MB per scan.
Cropping: Please crop the scan to the size of the picture. It is a waste of space and resources to scan an area larger than the desired object.
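
As a rough check on the figures above, the uncompressed size of a scan is simply the pixel count times the bytes per pixel. The sketch below is only an illustration: the 3:2 aspect ratio (typical of a 35 mm frame) is an assumption not stated in these guidelines, and the function name is hypothetical.

```python
def uncompressed_size_bytes(long_side_px, aspect_ratio=3 / 2, bits_per_channel=8, channels=3):
    """Estimate the uncompressed size of a scan in bytes.

    aspect_ratio is long side / short side; 3:2 is an assumption
    matching a 35 mm frame, not a value from the guidelines.
    """
    short_side_px = round(long_side_px / aspect_ratio)
    bytes_per_pixel = channels * bits_per_channel // 8
    return long_side_px * short_side_px * bytes_per_pixel


size = uncompressed_size_bytes(3000)
print(f"{size / 1_000_000:.0f} MB")  # 3000 x 2000 px x 3 bytes = 18 MB
```

Under the same assumptions, 16 bits per channel would double the result to about 36 MB, which is one practical reason to stay at 8 bits per channel.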

Scanning Negatives

Is scanning negatives much better than scanning prints from the negatives?

Almost all sources would say that the negative is the better source. However, going against current “archival wisdom”, prints are sometimes a better source than the negative, especially for newer material. The process of creating prints often corrects for incorrect exposures, under-contrasted images, etc. Much of the answer to this question lies in the quality of the originals. My recommendation is to scan both and see which one (a) has more detail when zoomed in, and (b) most closely resembles the actual thing photographed.

For scanning texts in general:

Scanning Texts

Choice of Equipment

You can use scanners or digital cameras to produce images of texts. Scanners can be either flatbed or have automatic feeders. Flatbed scanners are good for not damaging paper, but also take more time since you have to manually place each piece of paper and remove it. In addition, they are large and bulky, especially if you have one large enough to accommodate long pages. Scanners with automated feeding are much more compact and thus easy to transport as well as accommodate long pages, but they may also damage paper in the process - especially if the paper is torn, fragile, or of irregular texture. Either way, there is a natural wear and tear on the glass as paper passes through scanners.

The requirements for scanning documents and books will depend on several factors:

- are the documents loose sheets or bound into a single volume or as a sewn stack (a typical situation in Chinese archives)?

- nature of mount, thickness of the volume, nature of the paper.

For loose sheets, if the quality of the paper is good and there is no risk of damaging the document, you can use a sheet-fed scanner. If the paper is fragile, scanning needs to be done by hand, sheet by sheet, and if necessary with a transparent plastic sheet to prevent damage during handling.

It should be stressed that any serious conservator will strongly resist any use of automated feeders because of the damage that they potentially do to a manuscript. If you are working with a unique or uncommon manuscript, we strongly advise against using an automated feed system for this reason. However, one may very well be scanning a common print so that damage is not a concern.

Scanners are generally easier to use, but digital cameras can be particularly good when you are worried about fragile or torn paper being damaged while it is fed through a scanner. They also avoid the problem of rough paper scratching the glass of a scanner and causing lines in the resulting scans, which can be a major problem.

The main challenge with cameras is having a tripod or related setup that allows you to easily keep the camera pointed at the pages with a perfectly straight shot looking down. There are special structures for doing this called copy stands, but they tend to be expensive and heavy. Various library or archival repositories, like the Library of Congress in the US or the UK National Archives, provide either camera stands (Library of Congress) or even cameras on stands that deliver images to an email account. This facilitates a relatively rapid work process despite the inherent slowness of using a camera.

Preparing to Scan

Before you scan, assess the paper. How long are the pages? How clear and consistent is the print in terms of the darkness of the ink? What color is the ink? What is the texture of the paper? Depending on the answers, choose what equipment and what settings to use. Experiment with different settings, especially the dpi and color, to determine which to use.

In addition, you have to be concerned about missing pages and pages out of order. Since archival documents or similar unbound documents are loose-leaf, it is easy for pages to go missing or get out of order. This can either be checked prior to scanning, or you can try to sort out the digital images afterwards. If there is time, inspecting and fixing the paper as far as possible prior to scanning can be more efficient, since paper is easier to look at than digital images.

It may also be that paper has become creased, so that it will present an irregular surface for scanning. In this case, you may want to experiment with exposing the paper to humidity to essentially “stretch it out” to make it more flat for scanning.

Open a target folder for the new scans. Output files should be TIFF, with no compression or increase. Set the scan type to long page duplex and specify the exact size of the pages. Setting the scan type to duplex is important: it ensures the autofeed scanner will scan both sides of the page simultaneously.

In terms of determining the scanner settings, the real litmus test is this: when you print the scanned image, does it look good, and is it readable?

File format: TIFF. For printing, you may want to subsequently convert to PDFs and print from the PDF. However, you should always save the scans as TIFFs.

Compression: none

Resolution: 300 dots per inch (dpi) minimum. If the source material is really rich in detail, then do a test scan at 300 and at 600 dpi. If you see more detail in the 600 dpi scan (for example, if the scanned text is more readable), then use the higher setting.

Color/Grayscale/bitonal: variable. This really depends on the source material. If it's yellow parchment-colored paper with faint gray writing on it, then you definitely want to scan in 24-bit color. In fact, you can't go wrong if you scan in color, since the image can always be converted later if it makes sense to do so. One drawback is that scanning in color can be much slower and take up much more storage, both of which can pose challenges. Still, color can be crucial where detail is necessary; everything depends on the document.
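
To make the trade-off between resolution, color mode, and storage concrete, here is a minimal sketch that estimates pixel dimensions and uncompressed size for one page. The A4 page size is used only as an example, the function names are hypothetical, and the sizes ignore file headers and row padding.

```python
MM_PER_INCH = 25.4

# Bits stored per pixel for each scan mode discussed above.
BITS_PER_PIXEL = {"color": 24, "grayscale": 8, "bitonal": 1}


def page_pixels(width_mm, height_mm, dpi):
    """Return (width_px, height_px) for a page scanned at the given dpi."""
    return (round(width_mm / MM_PER_INCH * dpi), round(height_mm / MM_PER_INCH * dpi))


def page_bytes(width_mm, height_mm, dpi, mode):
    """Estimate the uncompressed size in bytes of one scanned page."""
    width_px, height_px = page_pixels(width_mm, height_mm, dpi)
    return width_px * height_px * BITS_PER_PIXEL[mode] // 8


# Compare every mode at the two resolutions suggested for test scans.
for dpi in (300, 600):
    for mode in BITS_PER_PIXEL:
        mb = page_bytes(210, 297, dpi, mode) / 1_000_000
        print(f"A4 at {dpi} dpi, {mode}: {mb:.1f} MB")
```

Doubling the dpi roughly quadruples the size, and 24-bit color takes three times the space of grayscale, so a short test run like this can help in planning disk space before a large job.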

How to Scan

If you use an autofeed scanner that scans both sides of a page simultaneously, the scanning process entails little more than feeding pages into the scanner. It is recommended that you feed the pages into the scanner in succession yourself rather than stack a few pages into the tray and assume the autofeed will scan them in the correct order and without grabbing multiple pages at a time. Most autofeed scanners have guides in the feed tray which you can adjust according to the size of the pages you're scanning. These are extremely helpful in ensuring pages feed through the scanner evenly. As one page feeds through, prepare to insert the next page as soon as the page being scanned is nearly finished. Autofeed scanners typically have a short delay between when one scan finishes and another starts. Thus, it's not difficult to maintain a steady rhythm of scanning many pages in succession, even if you fumble for a moment in preparing the next page to be fed into the scanner.

It is helpful to intermittently save a group of scans as you go so as not to lose all of your work if the computer crashes. Initially, you should number the files in sync with the text's page numbers themselves. Thus in case a page goes un-scanned, etc., it is easy to figure out exactly where you made a mistake. Your scanning equipment may automatically apply a custom name and number to scanned images as it saves them. This can greatly save time, though be careful to make sure the numbers it applies and the text's own page numbers are in sync.

Later, you can easily batch rename scans according to whatever makes the most sense.
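
A batch rename along these lines can be sketched with Python's standard library. This is only an illustration: the prefix, the four-digit numbering, and the function names are assumptions, and the demonstration runs on throwaway files in a temporary folder.

```python
import tempfile
from pathlib import Path


def plan_renames(folder, prefix, first_page=1, pattern="*.tif"):
    """Pair each scan (sorted by current name) with a page-numbered name."""
    scans = sorted(Path(folder).glob(pattern))
    return [
        (scan, scan.with_name(f"{prefix}_{first_page + i:04d}{scan.suffix}"))
        for i, scan in enumerate(scans)
    ]


def apply_renames(plan):
    """Rename the files, refusing to overwrite anything that already exists."""
    for old, new in plan:
        if new.exists():
            raise FileExistsError(f"{new} already exists")
        old.rename(new)


# Demonstration on throwaway files so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("scan_a.tif", "scan_b.tif", "scan_c.tif"):
        (Path(tmp) / name).touch()
    plan = plan_renames(tmp, "mytext", first_page=12)
    apply_renames(plan)
    renamed = sorted(p.name for p in Path(tmp).iterdir())
    print(renamed)
```

Building the plan separately from applying it lets you inspect the old and new names, and compare them with the text's own page numbers, before any file is changed.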

We advise making one image for each folio side, rather than storing back and front (recto and verso) of a given folio on a single image. The reason for this is that by having one folio side on one image, you have preserved maximum flexibility for how you later deliver them.

Scanning Pages With Background Distortion

When scanning pages, there may occasionally be thin pages where the back-page text actually bleeds partially through to the front-page, distorting the text. This is different from pages so thin that the back-page text can be seen, an issue easily resolved with a thick piece of white paper. In the case of bleed-through, there are two solutions depending on severity. Either of these solutions should successfully lighten the distorting bleed-through in the background and make the front-page text more readable.

Under “Advanced Settings” > “Image Quality” set the brightness to the lightest possible setting before scanning.
Do the above and photocopy the pages. Then scan the photocopies while repeating the above step, thus effectively doubling the lightness. This solution is more time-consuming and results can be somewhat grainy, so it is advised only for severe instances.

Processing

After scanning, processing may entail renaming files, organizing images into folders by text or volume, converting images to JPEGs or PDF, and, in special cases, editing the image itself to remove blemishes or other distortions. For processing purposes it is helpful to use one or two programs that allow for (1) easily browsing a large number of TIFF files, (2) batch renaming, and (3) editing. Adobe Photoshop can perform all of these functions, though it tends to perform slowly. “Thumbs Up” is a program that allows for easily viewing and renaming TIFF images. Kodak Imaging is a lightweight, fast Windows application for editing images.

Storing images

High resolution scans take a large amount of disk space. You may want to use some form of lossless compression like “zip” if you need a temporary solution for conserving disk space. “Lossless” means that the image will compress without any loss of information. (By contrast, a “lossy” image is one in which compression resulted in loss of information. JPEG is a lossy image format.)

Zipping a file is acceptable for a temporary on-site storage solution, but a zipped file is not considered an acceptable archival format. For the long term, it is advisable to store the files uncompressed on a dedicated hard disk.

IMPORTANT: Compress a file using zip after you have saved the image as an uncompressed .tiff file! Do NOT save the file as a compressed tiff! This may be tempting as it does save disk space. Your scanning software in fact may offer this as a default option. However, you will lose information and be left with an inferior quality image.

As an alternative to the .zip format, it's fine to use other compression techniques (like .7z or other lossless methods) if they provide better results – as long as, again, these solutions are only temporary. Note however that .7z and other formats may not always work across platforms. (For example, you may not be able to decompress a .7z file on Mac).
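
As an illustration of what “lossless” means in practice, the following minimal sketch zips a file with Python's standard zipfile module and confirms via a checksum that the restored bytes are identical to the original. The file names are hypothetical and the file content is a stand-in, not a real TIFF.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def sha256_of(data):
    """Return the SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def zip_files(paths, archive_path):
    """Store the given files in a zip archive using lossless DEFLATE."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            archive.write(path, arcname=Path(path).name)


# Demonstration with a stand-in file; a real run would pass uncompressed TIFFs.
with tempfile.TemporaryDirectory() as tmp:
    scan = Path(tmp) / "page_0001.tif"
    scan.write_bytes(b"stand-in for TIFF data " * 1000)
    archive_path = Path(tmp) / "scans.zip"
    zip_files([scan], archive_path)

    # Read the file back out of the archive without unpacking to disk.
    with zipfile.ZipFile(archive_path) as archive:
        restored = archive.read("page_0001.tif")

    round_trip_ok = sha256_of(restored) == sha256_of(scan.read_bytes())
    smaller = archive_path.stat().st_size < scan.stat().st_size
    print(round_trip_ok, smaller)
```

Comparing checksums before deleting any original is a cheap safeguard, whichever archive format is used.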

Delivery options

Typically, final scans are delivered on DVDs or a hard drive. We generally recommend use of a hard drive if possible, since then it is easier to transfer the images as a whole rather than having to serially process a large number of DVDs. Of course this becomes less of an issue as high capacity DVD storage becomes more common. In addition, it is essential to keep two entirely separate copies of all scans in case the media of storage gets corrupted or lost.

Training

The training given to the people doing the scanning is crucial. It is important to closely supervise their initial work, and also stress to them not to change agreed upon settings later on because they decide they have to go faster, or because they switch staff, etc.

Scanning Journals

There are several methods to scan journals for the creation of PDFs. Which one you will use depends on the physical characteristics of the journal, and the desired quality of the finished PDF. There are two types of scanners at the Institut d'Asie Orientale and two different digibooks at ENS de Lyon and MOM respectively (see Scanning Equipment).

Brief directions for auto-feed scanning using IAO XXXXXX flatbed Scanner

The most time-efficient option for scanning is to use the Fujitsu ScanPartner because of its auto-feed feature. However, this can only be used if you have a loose-leaf document, or you are able to cut the binding off the journal. If so, proceed in the following way:

Press on the “Send” button.
Select your email address in address directory.
Change resolution to “300 x 300” dpi (or higher if necessary). Define the output format as “PDF” unless you want an image format (TIFF). You may also want to change the brightness setting depending on the darkness of the text in your document.
Click “Start” and begin scanning. It may be necessary to place pages one by one into the auto-feed tray to avoid mis-feeds if working with thin or low quality paper.
If scanning a double-sided document, when the front sides finish scanning, flip the document over and place in the autofeed tray.
When the scan is complete your document will be emailed to your computer. Scroll through the document to make sure the pages are all there and in the correct order.
Crop pages if necessary by opening the crop tool from the tool menu, or right clicking a page thumbnail in the Pages sidebar.
Add metadata in the Document Properties (Control+D). Under Description enter “full issue” in the title field and in the subject field enter the name of the journal, volume, number, and date. For example: Shilin, Volume 7, Number 2, June-Oct 1983. Click “OK”
Save the file according to file naming conventions (see below).
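
When a double-sided document is scanned in two passes as described above, the resulting file may hold all front sides followed by all back sides. The sketch below shows one way to work out the reading order. It assumes that the device does not interleave the sides itself and that the flipped stack feeds its back sides in reverse order; both are assumptions to check against your own scanner, and the function name is hypothetical.

```python
def restore_page_order(scanned):
    """Reorder pages from a two-pass scan of a double-sided document.

    Assumes `scanned` holds all front sides in order, followed by all back
    sides in reverse order (because the stack was flipped over as a whole).
    """
    if len(scanned) % 2:
        raise ValueError("expected an even number of scanned sides")
    half = len(scanned) // 2
    fronts = scanned[:half]
    backs = scanned[half:][::-1]  # undo the reversal caused by flipping
    ordered = []
    for front, back in zip(fronts, backs):
        ordered.extend([front, back])
    return ordered


# Three sheets (pages 1-6): fronts are 1, 3, 5; flipped backs arrive as 6, 4, 2.
print(restore_page_order([1, 3, 5, 6, 4, 2]))  # [1, 2, 3, 4, 5, 6]
```

The odd-count check catches the most common symptom of a misfeed, where one side is missing from the second pass.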

Brief directions for scanning using IAO XXXXXX flatbed A4 Scanner

Open Scanning XXXXXXX assistant.
Select “Create PDF,” and inside this window choose the desired scanner (Fujitsu Limited TWAIN Driver), the original document's format (Double-sided or Single-sided), and select adapt compression page to content for “Adobe Acrobat 6.0 or later”. Before clicking “OK” make sure the document, or at least the first page, is loaded into the auto-feed tray face down with top of page loading first.
In the scan configuration window, most of the default settings are fine. Change resolution to “300 x 300” dpi. You may also want to change the brightness setting depending on the darkness of the text in your document.
When the scan is complete your document will appear in a new window. Scroll through the document to make sure the pages are all there and in the correct order.
Crop pages if necessary by opening the crop tool from the tool menu, or right clicking a page thumbnail in the Pages sidebar.
Add metadata in the Document Properties (Control+D). Under Description enter “full issue” in the title field and in the subject field enter the name of the journal, volume, number, and date. For example: Kailash, Volume 7, Number 2, June-Oct 1983. Click “OK”
Save the file according to file naming conventions.

NOTE: Poor quality paper, or very thin paper (such as that commonly used for journals produced in Asia), may not always auto-feed correctly. With these types of paper, if loaded into the tray all at once, the auto-feed has a tendency to take more than one page at a time. To avoid misfeeds, it may be necessary to place the pages one by one or a few at a time into the auto-feed tray. Misfeeds can ruin the scan job (because the pages will collate incorrectly, which is not easily fixed).

Separating a Journal Issue into Individual Article Files

Once a full issue has been scanned, it needs to be broken down into smaller files containing front matter, articles, back matter, and any other sections. While working on these steps, be sure to keep your whole issue file intact.

Open the whole issue file you created from scanning.
Click on the “Pages” side tab of your document's window. The Pages sidebar makes it easy to select the pages of the various sections.
Click on the very first page (or cover as the case may be) in the sidebar. This will select that page and mark it as such with a blue highlight ring around the thumbnail.
Scroll down, still in the Pages sidebar, to where the front matter ends. This may include things such as the cover, title page, editorial data, contents, list of illustrations or plates, notes about contributors, and preface or foreword. It is generally everything up to the first page of the first article.
Hold down the Shift key and click on the last thumbnail page of this section. This will select and highlight all the pages in the section.
Right click on one of the highlighted pages. A menu of tools will pop up. Select “Extract Pages” from the menu. Another window will open verifying the pages to be extracted. Click “OK.” A new window will open with the extracted pages.
Open the Pages sidebar in this window, and scroll through to make sure all your pages are there. At this point, you can delete any blank pages in the section (I have used the convention of leaving blank pages in the whole document file, but deleting them from the separated files). Just select the blank page, or pages, right click on one, and select “Delete Pages.” A window will appear confirming your deletion. Click “OK.”
Now add metadata to this document. You can select “Document Properties” from the File menu, or just press Control+D; this will bring up the Document Properties window. Select “Description” from the left sidebar. Then fill out the fields for Title, Author, and Subject. The Subject field is used for the journal title, volume and number of issue, and date. For articles also include page numbers. For example: Bulletin of Tibetology, Volume 3, Number 2, June 1966, pp 8-19. In the Title field enter the title of the article or a description of the section, like “front matter” or “full issue” (for whole issue files). In the Author field enter the author(s) of the article, first name first then last name, with multiple authors separated by a comma or “and”. For example: John Henry and Polly Ann Henry. Or, James Madison (trans.).
When finished adding Document Properties, click “OK.”
Now save your file using correct file name standards.
Go on to the next section, and repeat the process. Select the first page of the section by clicking on the thumbnail of that page. Then scroll down to the last page of the section and click on it while holding down the Shift key. Right click on a page and select “Extract Pages.” Add the necessary metadata to the Document Properties (Control+D) and then save the file with the correct file name.

Tip: If you fill out the Subject field within Document Properties for the whole issue first, then whenever you extract pages from it, this field will already be filled out in the extracted pages file and you only need to add the relevant page numbers to the Subject field. Another Tip: I find it helpful to leave the front matter file open and put it down in the corner of the screen with the table of contents page showing as I separate the rest of the issue. This is a nice little reference to guide you as you extract articles from the full issue.
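
To keep the Subject and Author fields consistent across many files, it can help to generate the strings rather than type them each time. The following is a minimal sketch of the conventions described above; the function names are hypothetical, and it only builds text to paste into Document Properties without touching any PDF.

```python
def subject_field(journal, volume=None, number=None, date=None, pages=None):
    """Build the Subject string: journal, volume, number, date, page range."""
    parts = [journal]
    if volume is not None:
        parts.append(f"Volume {volume}")
    if number is not None:
        parts.append(f"Number {number}")
    if date:
        parts.append(date)
    if pages:
        parts.append(f"pp {pages[0]}-{pages[1]}")
    return ", ".join(parts)


def author_field(authors):
    """Join author names (first name first) with commas and a final 'and'."""
    if len(authors) <= 2:
        return " and ".join(authors)
    return ", ".join(authors[:-1]) + " and " + authors[-1]


print(subject_field("Bulletin of Tibetology", 3, 2, "June 1966", pages=(8, 19)))
print(author_field(["John Henry", "Polly Ann Henry"]))
```

Leaving out `pages` gives the whole-issue form of the Subject, which matches the tip about filling it in once before extracting sections.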

Optimize the PDF

Optimizing the PDF in most cases will improve the quality and readability of the scan.

Save the PDF with a different name, by adding “-opt” before the .pdf
Pull down the Document menu and select Optimize
After it finishes optimizing, check the quality against the original PDF and use whichever is better.
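
If many files need the “-opt” name, the new name can be derived mechanically. This is a small sketch using Python's pathlib; the helper name is hypothetical and the file name is only an example.

```python
from pathlib import Path


def with_tag(pdf_path, tag):
    """Insert '-tag' before the file extension, e.g. issue.pdf -> issue-opt.pdf."""
    path = Path(pdf_path)
    return path.with_name(f"{path.stem}-{tag}{path.suffix}")


print(with_tag("ejeas_03_02_01.pdf", "opt").name)  # ejeas_03_02_01-opt.pdf
```

The same helper would produce the “-ocr” names used when running OCR on a PDF.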

File Naming Conventions

Files should be given short descriptive names in the following format:

  JournalName_VolumeNumber_IssueNumber_ArticleNumber(or a descriptive word)

Use underscores between the pieces of information. If a journal has a long title, it sometimes helps to abbreviate it. For example, you have scanned volume 3, number 2 of EJEAS, which has a cover and contents section, several articles, a notes and topics section, a book review, and then the back material. You would name these sections as follows:

  ejeas_03_02_front
  ejeas_03_02_01
  ejeas_03_02_02
  ejeas_03_02_03
  ejeas_03_02_notes
  ejeas_03_02_reviews
  ejeas_03_02_back

Sometimes a journal, like Shilin 史林, only has volumes, so then just put the volume number.

  shilin_02_front
  shilin_02_01
  shilin_02_02

Sometimes a journal uses the year or issue number like a volume number and then has numbers for each year. For these put the year instead of the volume number, and then the issue number:

  JournalName_Year_IssueNumber_ArticleNumber, or
  JournalName_Number_ArticleNumber

If an issue spans more than one volume or number, use a hyphen. For example, an issue of Shilin is designated as numbers 53-56, so name the files:

  shilin_53-56_front
  shilin_53-56_01
  shilin_53-56_02, and so forth
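
The naming patterns in this section can be expressed as a small helper. This is a sketch under stated assumptions, not an official tool: the function names are hypothetical, and lower-casing the journal abbreviation simply follows the ejeas examples.

```python
def two_digits(value):
    """Zero-pad integers to two digits; pass strings such as '53-56' through."""
    return f"{value:02d}" if isinstance(value, int) else str(value)


def scan_file_name(journal, *numbers, section):
    """Join journal abbreviation, volume/issue/year numbers, and section with underscores.

    `section` is an article number (int) or a descriptive word such as 'front'.
    """
    parts = [journal.lower()]  # lower case is an assumption based on the examples
    parts.extend(two_digits(n) for n in numbers)
    parts.append(two_digits(section))
    return "_".join(parts)


print(scan_file_name("EJEAS", 3, 2, section="front"))  # ejeas_03_02_front
print(scan_file_name("EJEAS", 3, 2, section=1))        # ejeas_03_02_01
print(scan_file_name("Shilin", 2, section=2))          # shilin_02_02
print(scan_file_name("Shilin", "53-56", section=1))    # shilin_53-56_01
```

Passing a year such as 1983 as the first number gives the year-based pattern, since four-digit values are left unchanged by the padding.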

Finally, remember that the scanner is your friend, even when it crumples your document and jams.

  Converting Multiple TIFFs to PDF
  Processing Multi-Image TIFFs
  Extracting Individual TIFF Files from a PDF
  Processing Finished Scans for Inclusion into Online Catalog

Converting Multiple TIFFs to PDF

Basic instructions

Open Adobe Acrobat 11.0 Standard.
Choose the menu option Create PDF > From Multiple Files.
Under Add Files, select the Browse button. Go to the appropriate folder with all the TIFF files. Highlight and select all the files you want to convert and click Add. Press Okay.
The TIFF files will be added into a single PDF file and you will be prompted to save it; however, press the Cancel button. To properly save the file with compression, go to the File menu and select Reduce File Size, then press yes on the dialogue box that pops up.
Save the file following proper file naming convention. Example: Dzd-Gt-v05.pdf
A Reduce File Size dialogue box will appear. In the “Make Compatible With” drop down menu, select Adobe 7.0 and later. Press okay. Again, save the file. The file is now a single compressed PDF file.
Make any necessary adjustments to the file. To rotate pages and align as necessary go to Document > Rotate Pages; options include rotating one, a set, or all of the images 90 degrees clockwise or counterclockwise, or a full 180 degrees, as well as rotating odd and/or even pages. Other useful features under the Document menu are Insert Pages, Extract Pages, Delete Pages, and Crop Pages.

Processing Multi-Image TIFFs

If you have a TIFF file consisting of multiple individual images (individual scans of multiple pages, for example) and you can't open the file with Photoshop:

Open the tiff file in Adobe Acrobat Professional 11.0 (for large files this will take some time).
Pull down the Acrobat menu and select Preferences.
Under categories, click on Convert From PDF
Under Converting from PDF, click on JPEG
Click on the Edit Settings button
Under File Settings, pull down Grayscale and select JPEG (Quality: Maximum)
Under File Settings, pull down Color and select JPEG (Quality: Maximum)
Click on the “Picture Tasks” button and select Export Pictures.
If in the toolbar the “Picture Tasks” button doesn’t display, click on the “Pages” tab on the left side of the window. The Picture Tasks button will now appear in the toolbar.
In the dialogue box that opens
Click on the Select All button
Drag the slider (image quality) all the way to the right
Under File Names, select Common Base Name, and type in the directory into which you want the individual files to go, with a dash (-) at the end. Example: MDo-32-32-
Under Save In, click the Change button and select the directory you want the files to go.
Click on the Export button. Individual jpg files will be created for each page in the directory you specified.

Extracting Individual TIFF Files from a PDF

In order to extract the individual images that make up a single PDF and save them as TIFF files, do the following:

Open the PDF file in Adobe Acrobat Professional 11.0 (for large files this will take some time).
Pull down the Edit menu and select Preferences.
Under Categories, click on Convert From PDF.
Under Converting from PDF, click on TIFF.
Click on the Edit Settings button.
Under File Settings, pull down Monochrome and select LZW.
Under File Settings, pull down Grayscale and select LZW.
Under File Settings, pull down Color and select LZW.
Press Okay to close out the “Save As TIFF Settings” dialog box.
Press Okay to close out the “Preferences” dialog box.
Pull down the File menu and select Save As.
Type in a desired file name (the PDF file name will already be entered by default. If this is not desired then follow appropriate THL file naming conventions).
In the “Save as type” drop box, select: TIFF (*.tif,*.tiff).
Choose an appropriate folder in which to save all the extracted TIFF files.
Press Save.

Adobe Acrobat will spend a few minutes saving every page of the document as a TIFF file in the chosen folder. It will further order them correctly by adding a basic numbering convention to the end of the file name provided; i.e., filename_Page_001. For a 600-page PDF file, it should take 5-7 minutes for the individual images to be extracted.
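
After an extraction like this, it is worth confirming that no page is missing from the output folder. The sketch below checks a list of file names against the numbering convention described above. The “.tif” extension and the function name are assumptions, and the file names are only examples.

```python
import re


def missing_pages(file_names, expected_count):
    """Return page numbers absent from names like 'base_Page_001.tif'."""
    page_pattern = re.compile(r"_Page_(\d+)\.tiff?$", re.IGNORECASE)
    found = set()
    for name in file_names:
        match = page_pattern.search(name)
        if match:
            found.add(int(match.group(1)))
    return [n for n in range(1, expected_count + 1) if n not in found]


names = ["Dzd-Gt-v05_Page_001.tif", "Dzd-Gt-v05_Page_002.tif", "Dzd-Gt-v05_Page_004.tif"]
print(missing_pages(names, expected_count=4))  # [3]
```

In real use the names would come from listing the output folder, and the expected count from the page count of the source PDF.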

For scanning microfiche:

The following instructions are for digitizing texts from microfiche documents.

Location

1. Go to ISH (Institut des sciences de l'homme), but contact the Informatique division in advance.

2. ISH can provide the following equipment:

Micro-film/micro-fiche scanner. Resolution: 600 dpi (greyscale)
Micro-film reader (16 & 35 mm)
Micro-fiche reader
Output on A3 printer or digital files

For Doing OCR to Make Digitized Texts Searchable

How to OCR a PDF Using Adobe Acrobat Professional

This is the process for running OCR on a PDF so that it is searchable:

For most PDFs, you want to run Optimize after you scan them. First rename the file; then pull down the Document menu and select Optimize.
Then, to run OCR: open the PDF file you want to run OCR on.
Pull down the File menu, choose “Save as,” and add “-ocr.pdf” to the file name
Pull down the Document menu, point to “OCR Text Recognition,” and then point to “Recognize Text Using OCR…” and “start”
The OCR process will start. It will take some time, depending on the number of pages in the PDF.
When it finishes, save the file. Be sure to check by doing a search on “the” or another word in the file and make sure it returns results.
