Scanning involves digitizing texts or images by using a scanner. It is an easy way to produce PDFs of publications originally issued in print for broader dissemination. THL relies heavily on scanning to digitize journals and books that are either out of print or for which we have secured copyright permission to reproduce. In addition, OCR (Optical Character Recognition) can be used to actually interpret the text and produce word processing documents from the scanned text. A separate use of scanning technology is to scan print pictures, negatives, or slides in order to digitize photography that wasn't originally digital.
Local resources for scanning books and documents are:
A small A4 flatbed scanner is available at Institut d'Asie Orientale.
A Canon XXXX is available at Institut d'Asie Orientale for formats up to A3. The scanner-photocopier can digitize both B&W and color documents at a resolution of XXXdpi. It has a sheet feeder, which we recommend for large quantities of documents, provided they meet the quality standards for sheet feeding.
A Digibook XXXXXX is available in the Media Department of ENS de Lyon. We recommend its use for loose-sheet documents, especially when paper quality does not allow sheet-feeding. It can also be used for digitizing books, provided the book opens well. Some deformation will nevertheless remain, which can be corrected digitally. For thick books, however, we recommend using the XXXXX Digibook co-owned by ISH and MOM and located at the MOM (see below). To use the Digibook of the Media Department, a reservation needs to be made with the department.
The most sophisticated scanning device is the XXXXXX Digibook located at the MOM. It requires training before one can use it. ISH can provide a digitization assistant, but this is a fee-based service. Use of the XXXXX Digibook is not free: there is a daily, weekly, or monthly fee, depending on the quantity of materials to be processed.
The general guidelines for digitizing images by scanning negatives, slides, or prints are as follows:
Is scanning negatives much better than scanning prints from the negatives?
Almost all sources would say that the negative is a better source. Going against this current “archival wisdom”, however, the prints are sometimes a better source than the negative, especially for newer material. The process of creating prints often corrects for incorrect exposures, low-contrast images, etc. Much of the answer to this question depends on the quality of the originals. My recommendation is to scan both and see which one a) has more detail when zoomed in, and b) most closely resembles the actual thing photographed.
You can use scanners or digital cameras to produce images of texts. Scanners can be either flatbed or have automatic feeders. Flatbed scanners are good for not damaging paper, but they also take more time since you have to manually place and remove each piece of paper. In addition, they are large and bulky, especially if you have one large enough to accommodate long pages. Scanners with automated feeding are much more compact, and thus easier to transport, and they can accommodate long pages, but they may also damage paper in the process - especially if the paper is torn, fragile, or of irregular texture. Either way, there is a natural wear and tear on the glass as paper passes through scanners.
The requirements for scanning documents and books will depend on several factors:
- are the documents loose sheets or bound into a single volume or as a sewn stack (a typical situation in Chinese archives)?
- nature of mount, thickness of the volume, nature of the paper.
For loose sheets, if the quality of the paper is good and there is no risk of damaging the document, you can use a sheet-fed scanner. If the paper is fragile, scanning needs to be done by hand, sheet by sheet, and if necessary with a transparent plastic sheet to prevent damage during handling.
It should be stressed that any serious conservator will strongly resist any use of automated feeders because of the damage that they potentially do to a manuscript. If you are working with a unique or uncommon manuscript, we strongly advise against using an automated feed system for this reason. However, one may very well be scanning a common print so that damage is not a concern.
Scanners are generally easier to use, but digital cameras can be particularly good when you are worried about fragile or torn paper being damaged as it is fed through a scanner. They also avoid the problem of rough paper scratching the glass of a scanner and causing lines in the resulting scans, which can be a major problem.
The main challenge with cameras is having a tripod or related setup that allows you to easily keep the camera pointed at the pages with a perfectly straight shot looking down. There are special structures for doing this called copy stands, but they tend to be expensive and heavy. Various libraries and archival repositories, like the Library of Congress in the US or the UK National Archives, provide either camera stands (Library of Congress) or even cameras on stands that deliver images to an email account. This facilitates a relatively rapid work process despite the inherent slowness of using a camera.
Before you scan, assess the paper - how long are the pages? how clear and consistent is the print in terms of the darkness of the ink? what color is the ink? what is the texture of the paper? Depending on the answers, you should choose what equipment to use and what settings. Experiment with different settings, especially the dpi, and color, to determine which to use.
In addition, you have to be concerned about missing pages and pages out of order. Since archival documents and similar loose-leaf documents have no binding, it is easy for pages to go missing or to get out of order. This can either be checked prior to scanning, or you can try to sort out the digital images afterwards. If there is time, inspecting the paper and fixing what you can prior to scanning can be more efficient, since paper is easier to look at than digital images.
It may also be that paper has become creased, so that it will present an irregular surface for scanning. In this case, you may want to experiment with exposing the paper to humidity to essentially “stretch it out” to make it more flat for scanning.
Open a target folder for the new scans. Output files should be TIFF, with no compression or increase. Set the scan type to long page duplex and specify the exact size of the pages. Setting the scan type to duplex is important: it ensures the autofeed scanner will scan both sides of the page simultaneously.
In terms of determining the scanner settings, the real litmus test is: when you print the scanned image, does it look good, is it readable, etc.?
File format: TIFF. For printing, you may want to subsequently convert to PDFs and print from the PDF. However, you should always save the scans as TIFFs.
Resolution: 300 dots per inch (dpi) minimum. If the source material is really rich in detail, then do a test scan at 300 and at 600 dpi. If you see more detail in the 600 dpi scan, then I would use the higher setting. As an example, is the scanned text more readable at 600 dpi? If so, then use 600 dpi.
Color/Grayscale/bitonal: variable. This really depends on the source material. If it's yellow parchment-colored paper with faint gray writing on it, then you definitely want to scan in 24-bit color. Actually, you can't go wrong if you scan in color, and if it makes sense to do so, the scan can always be converted later. One issue is that scanning in color can be much slower and take up much more storage, both of which can pose challenges. Still, color can be crucial where detail is necessary - everything depends on the document.
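To get a feel for the storage trade-off between resolution and color mode, a rough estimate can be computed before scanning. The following is only a sketch: the function name and the bit depths (1 bit for bitonal, 8 for grayscale, 24 for color) are assumptions chosen for illustration, and the result covers raw pixel data only, ignoring file headers and row padding.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate size of the raw pixel data for one scanned page.

    Hypothetical helper: dimensions are in inches, and headers and
    per-row padding in the real file are ignored.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed bit depths for the three modes discussed above.
BITS = {"bitonal": 1, "grayscale": 8, "color": 24}

if __name__ == "__main__":
    # An A4 page is roughly 8.27 x 11.69 inches.
    for mode, bits in BITS.items():
        for dpi in (300, 600):
            size = uncompressed_size_bytes(8.27, 11.69, dpi, bits)
            print(f"{mode:9s} {dpi} dpi: {size / 1_000_000:.1f} MB")
```

Doubling the resolution quadruples the size, so an A4 page in 24-bit color goes from roughly 26 MB at 300 dpi to four times that at 600 dpi.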
If you use an autofeed scanner that scans both sides of a page simultaneously, the scanning process entails little more than feeding pages into the scanner. It is recommended that you feed the pages into the scanner one after another yourself rather than stack a few pages into the tray and assume the autofeed will scan them in the correct order and without grabbing multiple pages at a time. Most autofeed scanners have guides in the feed tray which you can adjust according to the size of the pages you're scanning. These are extremely helpful in ensuring pages feed through the scanner evenly. As one page feeds through, prepare to insert the next page as soon as the page being scanned is nearly finished. Autofeed scanners typically have a short delay between when one scan finishes and another starts. Thus, it's not difficult to maintain a steady rhythm of scanning many pages in succession, even if you fumble for a moment in preparing the next page to be fed into the scanner.
It is helpful to intermittently save a group of scans as you go so as not to lose all of your work if the computer crashes. Initially, you should number the files in sync with the text's page numbers themselves. Thus in case a page goes un-scanned, etc., it is easy to figure out exactly where you made a mistake. Your scanning equipment may automatically apply a custom name and number to scanned images as it saves them. This can greatly save time, though be careful to make sure the numbers it applies and the text's own page numbers are in sync.
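If the file numbers do follow the text's page numbers, a quick check for skipped pages can be automated. This is a minimal sketch, assuming file names that end in a page number just before the extension (for example the hypothetical `page_0003.tif`); the function name is illustrative only.

```python
import re


def find_page_gaps(filenames):
    """Return the page numbers missing from a list of numbered scan files.

    Assumes each name ends in digits right before the extension,
    e.g. 'page_0003.tif' (a hypothetical naming pattern).
    """
    pages = set()
    for name in filenames:
        match = re.search(r"(\d+)(?=\.\w+$)", name)
        if match:
            pages.add(int(match.group(1)))
    if not pages:
        return []
    expected = range(min(pages), max(pages) + 1)
    return [page for page in expected if page not in pages]
```

For example, `find_page_gaps(["page_0001.tif", "page_0002.tif", "page_0004.tif"])` reports page 3 as missing. It cannot detect pages missing before the first or after the last scanned page.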
Later, you can easily batch rename scans according to whatever makes the most sense.
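Batch renaming can be done with a dedicated program or with a short script. The sketch below is one possible approach, not a prescribed tool: the function names and the `prefix_0001.tif` pattern are assumptions, it only looks at files ending in `.tif`, and it assumes the scanner's own names already sort in scan order. The plan is built first so you can inspect it before anything is renamed.

```python
from pathlib import Path


def plan_renames(folder, prefix, first_page=1, width=4):
    """Map each .tif file in `folder` (sorted by name) to a numbered name."""
    scans = sorted(Path(folder).glob("*.tif"))
    plan = []
    for offset, old in enumerate(scans):
        number = first_page + offset
        new = old.with_name(f"{prefix}_{number:0{width}d}.tif")
        plan.append((old, new))
    return plan


def apply_renames(plan):
    """Rename the files, refusing to overwrite anything that already exists."""
    for old, new in plan:
        if new.exists():
            raise FileExistsError(f"{new} already exists")
        old.rename(new)
```

A typical use would be `plan = plan_renames("scans", "shilin_02", first_page=5)`, followed by a look at the plan, and then `apply_renames(plan)`.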
We advise making one image for each folio side, rather than storing the front and back (recto and verso) of a given folio on a single image. The reason is that by having one folio side per image, you preserve maximum flexibility for how you later deliver them.
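With one image per folio side, each image can be labeled by folio and side. The following sketch assumes a duplex scan that produces images in strict recto, verso order starting from the recto of folio 1; the function name and the `001r` / `001v` label style are illustrative assumptions, not an established convention of this guide.

```python
def folio_label(image_index):
    """Convert a 1-based image index from a duplex scan into a folio label.

    Assumes images alternate recto, verso, beginning with folio 1 recto.
    """
    if image_index < 1:
        raise ValueError("image_index must be 1 or greater")
    folio = (image_index + 1) // 2
    side = "r" if image_index % 2 == 1 else "v"
    return f"{folio:03d}{side}"
```

Images 1, 2, and 3 would be labeled `001r`, `001v`, and `002r`. If a blank verso was skipped during scanning, the assumption no longer holds and the labels need to be checked by hand.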
When scanning pages, there may occasionally be thin pages where the back-page text actually bleeds partially through to the front-page, distorting the text. This is different from pages so thin that the back-page text can be seen, an issue easily resolved with a thick piece of white paper. In the case of bleed-through, there are two solutions depending on severity. Either of these solutions should successfully lighten the distorting bleed-through in the background and make the front-page text more readable.
1. Under “Advanced Settings” > “Image Quality” set the brightness to the lightest possible setting before scanning.
2. Do the above and photocopy the pages. Then scan the photocopies while repeating the above step, thus effectively doubling the lightness. This solution is more time-consuming and results can be somewhat grainy, so it is advised only for severe instances.
After scanning, processing may entail renaming files, organizing images into folders by text or volume, converting images to JPEGs or PDF, and, in special cases, editing the image itself to remove blemishes or other distortions. For processing purposes it is helpful to use one or two programs that allow for (1) easily browsing a large number of TIFF files, (2) batch renaming, and (3) editing. Adobe Photoshop can perform all of these functions, though it tends to perform slowly. “Thumbs Up” is a program that allows for easily viewing and renaming TIFF images. Kodak Imaging is a lightweight, fast Windows application for editing images.
High resolution scans take a large amount of disk space. You may want to use some form of lossless compression technique like “zip” if you are in need of a temporary solution for conserving disk space. “Lossless” means that the image will compress without any loss of information. (By contrast, a “lossy” image is one in which compression resulted in loss of information. JPEG is a lossy image format.)
Zipping a file is acceptable for a temporary on-site storage solution, but a zipped file is not considered an acceptable archival format. For the long term, it is advisable to store the files uncompressed on a dedicated hard disk.
IMPORTANT: Compress a file using zip only after you have saved the image as an uncompressed .tiff file! Do NOT save the file as a compressed TIFF! This may be tempting as it does save disk space. Your scanning software may in fact offer this as a default option. However, some TIFF compression options (JPEG compression, for example) are lossy, so you risk losing information and being left with an inferior quality image.
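One way to check that a saved scan really is uncompressed is to read the Compression tag (tag 259) from the TIFF header. The sketch below handles only classic TIFF files and only inspects the first image in the file; the function names are illustrative. In the TIFF 6.0 specification a value of 1 means no compression, 5 means LZW, 7 means JPEG, and 32773 means PackBits.

```python
import struct


def tiff_compression(data):
    """Return the Compression tag value from the first IFD of a TIFF.

    1 means uncompressed. If the tag is absent, the TIFF default of 1
    is returned. Only classic (non-BigTIFF) files are handled.
    """
    if data[:2] == b"II":
        order = "<"  # little-endian
    elif data[:2] == b"MM":
        order = ">"  # big-endian
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (entry_count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i
        tag, _field_type, _count = struct.unpack(order + "HHI", data[start:start + 8])
        if tag == 259:  # Compression, stored as a SHORT in the value field
            (value,) = struct.unpack(order + "H", data[start + 8:start + 10])
            return value
    return 1


def is_uncompressed(path):
    """Read a file from disk and report whether its first image is uncompressed."""
    with open(path, "rb") as handle:
        return tiff_compression(handle.read()) == 1
```

Running `is_uncompressed` over a folder of new scans before archiving them is a cheap way to catch a scanner default that was left on a compressed setting.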
As an alternative to the .zip format, it's fine to use other compression techniques (like .7z or other lossless methods) if they provide better results, as long as, again, these solutions are only temporary. Note however that .7z and other formats may not always work across platforms. (For example, you may not be able to decompress a .7z file on a Mac.)
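The idea of temporary lossless compression can be sketched with Python's standard `zipfile` module. This is an illustration under assumptions, not a required workflow: the function names are hypothetical and only files ending in `.tif` are packed. The second function confirms that what is stored in the archive is byte-for-byte identical to the uncompressed originals.

```python
import zipfile
from pathlib import Path


def zip_scans(folder, archive_path):
    """Pack every .tif file in `folder` into a zip archive using DEFLATE."""
    folder = Path(folder)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for scan in sorted(folder.glob("*.tif")):
            archive.write(scan, arcname=scan.name)


def verify_archive(folder, archive_path):
    """Return the names whose archived bytes differ from the files in `folder`."""
    folder = Path(folder)
    mismatched = []
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            if archive.read(name) != (folder / name).read_bytes():
                mismatched.append(name)
    return mismatched
```

An empty list from `verify_archive` means the archive reproduces every original exactly, which is what "lossless" promises. Run it before deleting any originals.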
Typically, final scans are delivered on DVDs or a hard drive. We generally recommend use of a hard drive if possible, since then it is easier to transfer the images as a whole rather than having to serially process a large number of DVDs. Of course this becomes less of an issue as high capacity DVD storage becomes more common. In addition, it is essential to keep two entirely separate copies of all scans in case the media of storage gets corrupted or lost.
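When keeping two separate copies, it helps to confirm from time to time that the second copy is complete and uncorrupted. A minimal sketch using SHA-256 checksums from Python's standard library follows; the function names and the two-folder layout are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_copies(primary, backup):
    """Return relative paths that are missing from `backup` or differ from `primary`."""
    primary, backup = Path(primary), Path(backup)
    problems = []
    for source in sorted(primary.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(primary)
        copy = backup / relative
        if not copy.is_file() or sha256_of(source) != sha256_of(copy):
            problems.append(str(relative))
    return problems
```

An empty result means every file in the primary copy has an identical counterpart in the backup. The check is one-directional: extra files that exist only in the backup are not reported.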
The training given to the people doing the scanning is crucial. It is important to closely supervise their initial work, and also stress to them not to change agreed upon settings later on because they decide they have to go faster, or because they switch staff, etc.
There are several methods to scan journals for the creation of PDFs. Which one you will use depends on the physical characteristics of the journal, and the desired quality of the finished PDF. There are two types of scanners at the Institut d'Asie Orientale and two different digibooks at ENS de Lyon and MOM respectively (see Scanning Equipment).
The most time-efficient option for scanning is to use the Fujitsu ScanPartner because of its auto-feed feature. However, this can only be used if you have a loose-leaf document, or you are able to cut the binding off the journal. If so, proceed in the following way:
NOTE: Poor quality paper, or very thin paper (such as that commonly used for journals produced in Asia), may not always auto-feed correctly. With these types of paper, if the pages are loaded into the tray all at once, the auto-feed has a tendency to take more than one page at a time. To avoid misfeeds, it may be necessary to place the pages one by one or a few at a time into the auto-feed tray. Misfeeds can ruin the scan job (because the pages will collate incorrectly, which is not easily fixed).
Once a full issue has been scanned, it needs to be broken down into smaller files containing the front matter, the articles, the back matter, and any other sections. While working on these steps, be sure to keep your whole issue file intact.
Tip: If you fill out the Subject field within Document Properties for the whole issue first, then whenever you extract pages from it, this field will already be filled out in the extracted pages file and you only need to add the relevant page numbers to the Subject field. Another Tip: I find it helpful to leave the front matter file open and put it down in the corner of the screen with the table of contents page showing as I separate the rest of the issue. This is a nice little reference to guide you as you extract articles from the full issue.
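Before extracting sections, it can help to work out the page range of each one from the table of contents. The sketch below is purely illustrative: the function name is hypothetical, the page numbers are positions within the scanned PDF (not the printed page numbers), and it assumes every section starts on a new page.

```python
def section_ranges(start_pages, last_page):
    """Turn section start pages into inclusive (name, first, last) ranges.

    start_pages: list of (section_name, first_page) in reading order.
    last_page: the final page of the whole issue file.
    """
    ranges = []
    for index, (name, first) in enumerate(start_pages):
        if index + 1 < len(start_pages):
            last = start_pages[index + 1][1] - 1
        else:
            last = last_page
        if last < first:
            raise ValueError(f"section {name!r} has no pages")
        ranges.append((name, first, last))
    return ranges
```

For an issue of 44 pages whose front matter starts on page 1, first article on page 5, and back matter on page 40, `section_ranges([("front", 1), ("01", 5), ("back", 40)], 44)` gives the ranges 1-4, 5-39, and 40-44, which can then be entered when extracting pages.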
Optimizing the PDF in most cases will improve the quality and readability of the scan.
Files should be given short descriptive names in the following format:
JournalName_VolumeNumber_IssueNumber_ArticleNumber(or a descriptive word)
Use underscores between the pieces of information. If a journal has a long title, it sometimes helps to abbreviate it. For example, suppose you have scanned volume 3, number 2 of EJEAS, which has a cover and contents section, several articles, a notes and topics section, a book review, and then the back material. You would name these sections as follows:
ejeas_03_02_front
ejeas_03_02_01
ejeas_03_02_02
ejeas_03_02_03
ejeas_03_02_notes
ejeas_03_02_reviews
ejeas_03_02_back
Sometimes a journal, like Shilin 史林, only has volumes, so then just put the volume number:
shilin_02_front
shilin_02_01
shilin_02_02
Sometimes a journal uses the year or issue number like a volume number and then has numbers for each year. For these, put the year instead of the volume number, and then the issue number:
JournalName_Year_IssueNumber_ArticleNumber, or JournalName_Number_ArticleNumber
If an issue spans more than one volume or number, use a hyphen. For example, an issue of Shilin is designated as numbers 53-56, so name the files:
shilin_53-56_front
shilin_53-56_01
shilin_53-56_02
and so forth.
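The naming convention can be captured in a small helper so that names are generated consistently. This is a sketch under assumptions: the function names are hypothetical, whole numbers are zero-padded to two digits as in the examples, and the journal name is lowercased to match the earlier examples. Ranges such as "53-56" are passed as text and kept as they are.

```python
def _pad(value):
    """Zero-pad whole numbers to two digits; leave words and ranges unchanged."""
    if isinstance(value, int):
        return f"{value:02d}"
    return str(value)


def scan_file_name(journal, *numbers, part):
    """Build a name such as ejeas_03_02_01 from its components.

    journal: short journal name (lowercased here by assumption).
    numbers: volume, year, and/or issue numbers, in order.
    part: article number or a descriptive word such as 'front'.
    """
    pieces = [journal.lower().replace(" ", "")]
    pieces.extend(_pad(number) for number in numbers)
    pieces.append(_pad(part))
    return "_".join(pieces)
```

For instance, `scan_file_name("EJEAS", 3, 2, part=1)` gives `ejeas_03_02_01`, and `scan_file_name("shilin", "53-56", part="front")` gives `shilin_53-56_front`.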
Finally, remember that the scanner is your friend, even when it crumples your document and jams.
- Converting Multiple TIFFs to PDF
- Processing Multi-Image TIFFs
- Extracting Individual TIFF Files from a PDF
- Processing Finished Scans for Inclusion into Online Catalog
If you have a TIFF file consisting of multiple individual images (individual scans of multiple pages, for example) and you can't open the file with Photoshop:
In order to extract the individual images that make up a single PDF and save them as TIFF files, do the following:
Adobe Acrobat will spend a few minutes saving every page of the document as a TIFF file in the chosen folder. It will further order them correctly by adding a basic numbering convention to the end of the file name provided; i.e., filename_Page_001. For a 600-page PDF file, it should take 5-7 minutes for the individual images to be extracted.
The following instructions are for digitizing texts from microfiche documents.
1. Go to ISH (Institut des sciences de l'homme), but contact the Informatique division in advance.
2. ISH can provide the following equipment:
- Micro-film/micro-fiche scanner. Resolution: 600 dpi (greyscale)
- Micro-film reader (16 & 35 mm)
- Micro-fiche reader
- Output on A3 printer or digital files
This is the process for running OCR on a PDF so that it is searchable: