scanning_ocr:scanning_texts


Choice of Equipment

You can use scanners or digital cameras to produce images of texts. Scanners are either flatbed or equipped with an automatic feeder. Flatbed scanners are gentle on paper, but they take more time since you have to place and remove each sheet by hand. They are also large and bulky, especially models big enough to accommodate long pages. Scanners with automatic feeders are much more compact, so they are easy to transport, and they can accommodate long pages, but they may damage paper in the process, especially if the paper is torn, fragile, or of irregular texture. Either way, the glass suffers natural wear and tear as paper passes through the scanner.

The requirements for scanning documents and books depend on several factors:

  - Are the documents loose sheets, bound into a single volume, or sewn into a stack (a typical situation in Chinese archives)?
  - What is the nature of the mount, the thickness of the volume, and the nature of the paper?

For loose sheets, if the quality of the paper is good and there is no risk of damaging the document, you can use a sheet-fed scanner. If the paper is fragile, scanning needs to be done by hand, sheet by sheet, and if necessary with a transparent plastic sheet to prevent damage during handling.

It should be stressed that any serious conservator will strongly resist any use of automated feeders because of the damage that they potentially do to a manuscript. If you are working with a unique or uncommon manuscript, we strongly advise against using an automated feed system for this reason. However, one may very well be scanning a common print so that damage is not a concern.

Scanners are generally easier to use, but digital cameras can be particularly good when you are worried about fragile or torn paper being damaged as it is fed through a scanner. They also avoid the problem of rough paper scratching the scanner glass and causing lines in the resulting scans, which can be a major problem. The main challenge with cameras is having a tripod or similar setup that allows you to easily keep the camera pointed at the pages with a perfectly straight shot looking down. There are special structures for doing this called copy stands, but they tend to be expensive and heavy. Various libraries and archival repositories provide equipment for this: the Library of Congress in the US provides camera stands, and the UK National Archives provides cameras on stands that deliver images to an email account. This facilitates a relatively rapid work process despite the inherent slowness of using a camera.

Preparing to Scan

Before you scan, assess the paper - how long are the pages? how clear and consistent is the print in terms of the darkness of the ink? what color is the ink? what is the texture of the paper? Depending on the answers, you should choose what equipment to use and what settings. Experiment with different settings, especially the dpi, and color, to determine which to use.

In addition, you have to watch for missing pages and pages out of order. Since archival documents and similar unbound documents are loose-leaf, it is easy for pages to go missing or to fall out of order. You can either check this prior to scanning or try to sort it out afterwards in the digital images. If there is time, inspecting the paper and fixing what you can prior to scanning is usually more efficient, since paper is easier to look through than digital images.
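If you do end up checking the digital images instead, a small script can flag gaps once you have noted the page number shown on each image. This is only a sketch: the function name `check_page_sequence` and the sample page list are made up for illustration.

```python
def check_page_sequence(scanned_pages, first_page, last_page):
    """Report pages missing from a scan run and pages that appear out of order.

    scanned_pages: page numbers in the order the images were produced.
    """
    expected = set(range(first_page, last_page + 1))
    missing = sorted(expected - set(scanned_pages))
    # A page counts as out of order if it is lower than the page just before it.
    out_of_order = [
        page
        for previous, page in zip(scanned_pages, scanned_pages[1:])
        if page < previous
    ]
    return missing, out_of_order


# Hypothetical run: page 5 was never scanned and page 3 came after page 4.
missing, out_of_order = check_page_sequence([1, 2, 4, 3, 6], first_page=1, last_page=6)
print("missing:", missing)
print("out of order:", out_of_order)
```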

It may also be that paper has become creased, so that it will present an irregular surface for scanning. In this case, you may want to experiment with exposing the paper to humidity to essentially “stretch it out” to make it more flat for scanning.

Open a target folder for the new scans. Output files should be TIFF, with no compression or increase. Set the scan type to long page duplex and specify the exact size of the pages. Setting the scan type to duplex is important: it ensures the autofeed scanner will scan both sides of the page simultaneously.

When determining the scanner settings, the real litmus test is the printed result: when you print the scanned image, does it look good, and is it readable?

File format: tiff. For printing, you may want to subsequently convert to PDFs and print from the PDF. However, you should always save the scans as TIFFs.

Compression: none

Resolution: 300 dots per inch (dpi) minimum. If the source material is really rich in detail, then do a test scan at 300 dpi and at 600 dpi. If you see more detail in the 600 dpi scan, for example if the scanned text is more readable, then use the higher setting.
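To get a feel for what the two settings mean in practice, the pixel dimensions of a scan can be worked out from the page size. The sketch below assumes a page of 8.5 x 11 inches purely as an example; substitute the measured size of your own pages.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Return (width_px, height_px) for a page of the given size in inches."""
    return round(width_in * dpi), round(height_in * dpi)


# Example page size (an assumption, not a recommendation): 8.5 x 11 inches.
for dpi in (300, 600):
    width_px, height_px = pixel_dimensions(8.5, 11, dpi)
    print(f"{dpi} dpi: {width_px} x {height_px} pixels")
```

Doubling the dpi doubles both dimensions, so a 600 dpi scan holds four times as many pixels as a 300 dpi scan of the same page.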

Color/Grayscale/bitonal: variable. This really depends on the source material. If it is yellow, parchment-colored paper with faint gray writing on it, then you definitely want to scan in 24-bit color. You generally cannot go wrong scanning in color, since a color scan can always be converted later if it makes sense to do so. The drawback is that scanning in color can be much slower and take up much more storage, both of which can pose challenges. Still, color can be crucial where detail is necessary; everything depends on the document.
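The storage cost of each mode can be estimated before you commit to one. The sketch below assumes 1 bit per pixel for bitonal, 8 for grayscale, and 24 for color, and a hypothetical scan of 2550 x 3300 pixels (an 8.5 x 11 inch page at 300 dpi). It ignores the small TIFF header, so treat the figures as approximations.

```python
# Assumed bit depths for the three scan modes.
BITS_PER_PIXEL = {"bitonal": 1, "grayscale": 8, "color": 24}


def uncompressed_bytes(width_px, height_px, mode):
    """Approximate size of the raw image data for an uncompressed scan."""
    bits_per_row = width_px * BITS_PER_PIXEL[mode]
    bytes_per_row = (bits_per_row + 7) // 8  # each row is padded to whole bytes
    return bytes_per_row * height_px


# Hypothetical page: 2550 x 3300 pixels.
for mode in ("bitonal", "grayscale", "color"):
    size = uncompressed_bytes(2550, 3300, mode)
    print(f"{mode}: about {size / 1_000_000:.1f} MB per page side")
```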

How to Scan

If you use an autofeed scanner that scans both sides of a page simultaneously, the scanning process entails little more than feeding pages into the scanner. We recommend feeding the pages into the scanner one after another yourself, rather than stacking a few pages in the tray and assuming the autofeed will scan them in the correct order without grabbing multiple pages at a time. Most autofeed scanners have guides in the feed tray which you can adjust according to the size of the pages you're scanning. These are extremely helpful in ensuring pages feed through the scanner evenly. As one page feeds through, prepare to insert the next page as soon as the page being scanned is nearly finished. Autofeed scanners typically have a short delay between when one scan finishes and another starts. Thus, it's not difficult to maintain a steady rhythm of scanning many pages in succession, even if you fumble for a moment in preparing the next page to be fed into the scanner.

It is helpful to save a group of scans at intervals as you go, so as not to lose all of your work if the computer crashes. Initially, you should number the files in sync with the text's own page numbers. That way, if a page goes unscanned, it is easy to figure out exactly where you made the mistake. Your scanning equipment may automatically apply a custom name and number to scanned images as it saves them. This can save a great deal of time, though be careful to make sure the numbers it applies and the text's own page numbers are in sync. Later, you can easily batch rename scans according to whatever makes the most sense.
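A batch rename along these lines can be planned in a few lines of Python. This is a sketch under two assumptions: the scanner's own file names sort in the order the pages were scanned, and a `prefix_0001.tif` pattern suits your project. The function name `plan_renames` and the sample names are hypothetical.

```python
def plan_renames(scanner_files, prefix, first_page, digits=4):
    """Pair each scanner-generated name with a name based on the page number.

    Assumes the scanner's names sort in the order the pages were scanned.
    """
    plan = []
    for offset, old_name in enumerate(sorted(scanner_files)):
        page_number = first_page + offset
        new_name = f"{prefix}_{page_number:0{digits}d}.tif"
        plan.append((old_name, new_name))
    return plan


# Hypothetical scanner output for the first three pages of a volume.
for old_name, new_name in plan_renames(
    ["IMG_0009.tif", "IMG_0007.tif", "IMG_0008.tif"], prefix="vol1", first_page=1
):
    print(old_name, "->", new_name)
```

Reviewing the printed plan before renaming anything makes it easier to catch a mismatch between the scanner's numbering and the text's page numbers. The pairs can then be applied with `pathlib.Path.rename`.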

We advise making one image for each folio side, rather than storing the front and back (recto and verso) of a given folio on a single image. By keeping one folio side per image, you preserve maximum flexibility for how you later deliver them.
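With one image per folio side, each image can carry a folio label in its file name. The sketch below assumes a duplex scan that produces images in the order 1 recto, 1 verso, 2 recto, 2 verso, and so on. The `r`/`v` suffixes and three-digit padding are illustrative choices only.

```python
def folio_label(image_index):
    """Map a 0-based image index from a duplex scan to a folio label.

    Assumes the images arrive in the order 1r, 1v, 2r, 2v, ...
    """
    folio = image_index // 2 + 1
    side = "r" if image_index % 2 == 0 else "v"
    return f"{folio:03d}{side}"


print([folio_label(index) for index in range(4)])
```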

Scanning Pages With Background Distortion

When scanning pages, there may occasionally be thin pages where the back-page text actually bleeds partially through to the front-page, distorting the text. This is different from pages so thin that the back-page text can be seen, an issue easily resolved with a thick piece of white paper. In the case of bleed-through, there are two solutions depending on severity. Either of these solutions should successfully lighten the distorting bleed-through in the background and make the front-page text more readable.

  1. Under “Advanced Settings” > “Image Quality”, set the brightness to the lightest possible setting before scanning.
  2. Do the above and photocopy the pages. Then scan the photocopies while repeating the above step, thus effectively doubling the lightness. This solution is more time-consuming and results can be somewhat grainy, so it is advised only for severe instances.
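To see why a lighter setting helps, it can be useful to think of the page as grayscale values from 0 (black) to 255 (white). The sketch below is a simplified model, not how any particular scanner implements its brightness control: it adds a fixed offset to every pixel and clips at white. The sample values are invented.

```python
def lighten(pixels, offset):
    """Add `offset` to every 8-bit grayscale value, clipping at 255 (white)."""
    return [[min(255, value + offset) for value in row] for row in pixels]


# Invented sample row: dark front-page ink (20, 35) next to
# faint bleed-through from the back of the page (200, 230).
sample = [[20, 200, 230, 35]]
print(lighten(sample, 60))
```

In this model the faint bleed-through is pushed all the way to white, while the front-page ink gets lighter but stays clearly darker than the background.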

Processing

After scanning, processing may entail renaming files, organizing images into folders by text or volume, converting images to JPEG or PDF, and, in special cases, editing the image itself to remove blemishes or other distortions. For processing purposes it is helpful to use one or two programs that allow for (1) easily browsing a large number of TIFF files, (2) batch renaming, and (3) editing. Adobe Photoshop can perform all of these functions, though it tends to perform slowly. “Thumbs Up” is a program that allows for easily viewing and renaming TIFF images. Kodak Imaging is a lightweight, fast Windows application for editing images.
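Sorting images into folders by volume can also be scripted once the files follow a consistent naming pattern. The sketch below assumes names of the form `volume_page.tif` (for example `vol1_0001.tif`), which is a hypothetical convention. It only works out the grouping and leaves the actual moving of files to you.

```python
from collections import defaultdict


def group_by_volume(filenames):
    """Group file names by the part before the last underscore.

    Names without an underscore are collected under "unsorted".
    """
    groups = defaultdict(list)
    for name in sorted(filenames):
        volume, _, _ = name.rpartition("_")
        groups[volume or "unsorted"].append(name)
    return dict(groups)


# Hypothetical file names from two volumes plus one stray file.
names = ["vol2_0001.tif", "vol1_0002.tif", "vol1_0001.tif", "stray.tif"]
for volume, files in group_by_volume(names).items():
    print(volume, files)
```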

Storing images

High-resolution scans take a large amount of disk space. You may want to use some form of lossless compression like “zip” if you need a temporary solution for conserving disk space. “Lossless” means that the image is compressed without any loss of information. (By contrast, a “lossy” image is one in which compression has resulted in loss of information. JPEG is a lossy image format.)

Zipping a file is acceptable for a temporary on-site storage solution, but a zipped file is not considered an acceptable archival format. For the long term, it is advisable to store the files uncompressed on a dedicated hard disk.

IMPORTANT: Compress a file using zip only after you have saved the image as an uncompressed .tiff file! Do NOT save the file as a compressed tiff! This may be tempting as it does save disk space, and your scanning software may in fact offer it as a default option. However, depending on the compression method the software applies, you can lose information and be left with an inferior quality image.
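The order of operations (save the uncompressed TIFF first, zip it afterwards) can be sketched with Python's standard `zipfile` module. The file names and the stand-in bytes below are invented so the example runs on its own. The check at the end confirms that the archived bytes are identical to the original file.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_losslessly(tiff_path, zip_path):
    """Store one file in a zip archive using DEFLATE, a lossless method."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.write(tiff_path, arcname=Path(tiff_path).name)


def zip_matches_original(tiff_path, zip_path):
    """Return True if the archived bytes are identical to the file on disk."""
    with zipfile.ZipFile(zip_path) as archive:
        archived = archive.read(Path(tiff_path).name)
    return archived == Path(tiff_path).read_bytes()


# Demonstration with stand-in bytes rather than a real scan.
with tempfile.TemporaryDirectory() as folder:
    scan = Path(folder) / "page_0001.tif"
    scan.write_bytes(b"stand-in for uncompressed TIFF data " * 1000)
    archive_path = Path(folder) / "page_0001.zip"
    zip_losslessly(scan, archive_path)
    round_trip_ok = zip_matches_original(scan, archive_path)
    saved_space = archive_path.stat().st_size < scan.stat().st_size

print("identical after unzip:", round_trip_ok)
print("archive is smaller:", saved_space)
```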

As an alternative to the .zip format, it's fine to use other compression techniques (like .7z or other lossless methods) if they provide better results – as long as, again, these solutions are only temporary. Note, however, that .7z and other formats may not always work across platforms. (For example, you may not be able to decompress a .7z file on a Mac.)

Delivery options

Typically, final scans are delivered on DVDs or a hard drive. We generally recommend using a hard drive if possible, since it is then easier to transfer the images as a whole rather than having to process a large number of DVDs one by one. Of course this becomes less of an issue as high-capacity DVD storage becomes more common. In addition, it is essential to keep two entirely separate copies of all scans in case the storage medium gets corrupted or lost.
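One way to confirm that a second copy is still identical to the first is to compare checksums. The sketch below uses SHA-256 from Python's standard `hashlib` module. The file names and contents are stand-ins created in a temporary folder so the example runs by itself.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original, copy):
    """Return True if both files have the same checksum."""
    return sha256_of(original) == sha256_of(copy)


# Demonstration with stand-in files: one faithful copy, one damaged copy.
with tempfile.TemporaryDirectory() as folder:
    original = Path(folder) / "page_0001.tif"
    backup = Path(folder) / "backup_page_0001.tif"
    damaged = Path(folder) / "damaged_page_0001.tif"
    original.write_bytes(b"stand-in scan data")
    backup.write_bytes(b"stand-in scan data")
    damaged.write_bytes(b"stand-in scan dat4")
    good_copy_ok = copies_match(original, backup)
    damaged_copy_ok = copies_match(original, damaged)

print("backup matches:", good_copy_ok)
print("damaged copy matches:", damaged_copy_ok)
```

Recording the checksums alongside the scans at delivery time lets either copy be re-checked later without access to the other.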

Training

The training given to the people doing the scanning is crucial. It is important to closely supervise their initial work, and also stress to them not to change agreed upon settings later on because they decide they have to go faster, or because they switch staff, etc.

scanning_ocr/scanning_texts.txt · Last modified: 2013/04/06 23:14 (external edit)