D-Lib Magazine
March/April 2009

Volume 15 Number 3/4

ISSN 1082-9873

Going Grey?

Comparing the OCR Accuracy Levels of Bitonal and Greyscale Images

 

Tracy Powell
National Library of New Zealand
<Tracy.Powell@natlib.govt.nz>

Gordon Paynter
National Library of New Zealand
<Gordon.Paynter@natlib.govt.nz>

Abstract

Newspaper collections are the subject of an increasing number of large-scale digitisation projects. In Papers Past (http://paperspast.natlib.govt.nz), a collection of over a million newspaper pages, the introduction of full-text search has made a wealth of information findable that was previously hidden. The search feature is dependent on text extracted from the newspaper page images with Optical Character Recognition (OCR), so any improvement in OCR accuracy will add value to the collection by improving our users' chances of finding useful information.

The Papers Past newspapers were digitised from microfilm as 400 DPI bitonal images over a period of several years. For future newspapers, we wondered whether OCR accuracy would be improved by "going grey", and digitising to 8-bit greyscale instead. Accepted wisdom is that greyscale digitisation produces higher OCR accuracy than bitonal digitisation. To test this assumption, we digitised three reels of microfilmed historic newspapers in both bitonal and greyscale, had them OCRed, and carried out a hand-count of the OCR accuracy on a random set of text samples. The experiment had a clear and surprising outcome: using our existing business processes, there was no evidence of any improvement in OCR accuracy from greyscale digitisation.

1. Introduction

The National Library of New Zealand ("the Library") has made a collection of digitised newspapers available through its Papers Past website at <http://paperspast.natlib.govt.nz>. At the time of writing the site provides access to nearly 1.2 million newspaper pages, comprising 225 thousand issues from 45 titles published between 1839 and 1920.[1]

The original Papers Past website debuted in 2001, and gave users access to scans of the newspaper pages, which could be viewed and printed, but not searched. In 2005 the Library ran a pilot project to investigate using optical character recognition (OCR) to generate full text and make the newspapers in Papers Past searchable. The pilot was successful, and the Library decided that all future content in Papers Past should be OCRed.

Papers Past was re-launched in September 2007 with a new interface and the ability to search the text of those titles whose content has been OCRed. Currently 24 of 45 titles (representing about 56% of the collection) are searchable, and the remainder will be OCRed over the next twelve months as part of our digitisation programme.

Offering full-text search has increased the usage of Papers Past dramatically, from 9,275 unique visitors and 14,680 visits in August 2007 to 117,000 unique visitors and 252,000 visits in June 2008. We attribute the site's increased popularity to three factors: improved user interface, improved functionality (i.e., full text search), and increased referrals from the Google search engine, which has indexed much of the OCRed text.

During the OCR pilot project we spoke to many vendors who recommended that we start using greyscale digitisation as it would increase OCR accuracy. While this was not an option for the million pages already digitised as bitonal images, the Library decided to investigate digitising newspapers as 400 DPI 8-bit greyscale images in future years. A new project was initiated to gather evidence that greyscale input images really do benefit OCR accuracy.

This article describes the new project, known internally as the Greyscale Evaluation Project. We began with the expectation that greyscale digitisation would deliver obvious improvements in OCR accuracy, but this proved not to be the case in our situation. This article describes how we evaluated the effect of greyscale digitisation on OCR accuracy, summarises our results, and discusses our thoughts about the outcome and the lessons we have drawn from it.

Table 1: Some terms used in this article.

Digitisation: The process of converting information into a digital format.
Scanning: Digitisation of paper documents, such as newspapers, by using a scanner or camera to capture the document directly or via a surrogate (such as a microfilm copy of the document).
Dots Per Inch (DPI) or Points Per Inch (PPI): A measure of the resolution of a digitised image.
Bit Depth or Depth: A measure of the number of colours or levels of grey in a digitised image.
Bitonal scanning: Scanning process that uses one bit per pixel to represent black or white (i.e., bit depth = 1).
Greyscale scanning: Scanning process that uses multiple bits per pixel to represent shades of grey (all greyscale scanning in this report uses a bit depth of 8).
Binarisation: The process of converting a greyscale image into a bitonal image.
Optical Character Recognition (OCR): A machine process for analysing an image file that depicts a page of text, and extracting the textual content of that image.

2. Experiment

We decided that the best way of determining whether changing to greyscale scanning would have a positive impact on the overall level of OCR accuracy was to process some microfilmed newspapers in both bitonal and greyscale and compare the outputs. We concluded that a hand-count of samples taken at random would be the best way of making a fair comparison.

2.1. Hypothesis

We started our experiment with the following hypothesis:

Greyscale scanning produces a significantly higher level of OCR accuracy than bitonal scanning.

2.2. Data preparation

Three reels of microfilm were digitised and OCRed multiple times using a process that was identical except for the digitisation parameters (bit depth and resolution). The full data preparation process was as follows:

  • Step 1: The Library asked its digitisation vendor to digitise three reels of microfilm from two different newspapers as bitonal images at 400 DPI and as greyscale images at 400 DPI. One reel was also digitised in greyscale at 300 DPI. A SunRise S2500 scanner was used to digitise these reels.
  • Step 2: The Library asked its OCR vendor to process the pages using our existing agreed processes and to supply us with the outputs to our existing specifications, including METS XML files containing issue-level metadata; ALTO XML files containing the OCR output; and TIFF production master images (cropped, deskewed, and at the same resolution and bit-depth as the source image). The CCS docWORKs software was used to produce these outputs.
  • Step 3: The Library randomly selected three sets of pages from the newspaper images, totalling 160 pages. For each page a target region numbered 1 through 9 was randomly chosen from the page (the nine regions were defined by dividing each page into a three-by-three grid).
  • Step 4: The Library asked its OCR vendor for a hand-count of the OCR accuracy of each region. Operators employed by the vendor chose a sample approximately 500 characters long, of uniform font and without illustrations, from each target region, and performed a hand-count of the OCR accuracy of the sample. The hand-count was repeated for each bit-depth and resolution. Two operators performed each count independently, and their results were then compared and (if different) resolved by a third operator. The OCR vendor supplied the Library with the full set of results, and scans of every sample counted. Examples are shown in Figures 1 and 2 below.
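The random selection in Step 3 can be sketched as follows. This is a minimal illustration rather than the procedure the Library actually ran: the page identifiers, the seed, and the row-by-row numbering of the nine regions are all assumptions made for the example.

```python
import random

def select_samples(page_ids, n_pages, seed=None):
    """Pick n_pages pages at random and one grid region (1-9) for each."""
    rng = random.Random(seed)
    chosen = rng.sample(page_ids, n_pages)
    return [(page, rng.randint(1, 9)) for page in chosen]

def region_bounds(region, width, height):
    """Return (left, top, right, bottom) pixel bounds of a region in a 3x3 grid.

    Assumption: regions are numbered row by row, starting at the top left.
    """
    row, col = divmod(region - 1, 3)
    left, right = col * width // 3, (col + 1) * width // 3
    top, bottom = row * height // 3, (row + 1) * height // 3
    return left, top, right, bottom

# Hypothetical page identifiers for one reel (not the real file names).
pages = [f"reel28258_page{n:04d}" for n in range(1, 401)]
samples = select_samples(pages, n_pages=40, seed=1)
```

Fixing the seed makes the selection repeatable, so the same pages and regions can be located again in both the bitonal and the greyscale outputs.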

Figure 1: Daily Southern Cross (DSC) bitonal sample (Reel 28258)

[Image: an example of the results of DSC bitonal digitisation]

Figure 2: Daily Southern Cross (DSC) greyscale sample (Reel 28258)

[Image: an example of the results of DSC greyscale digitisation]

2.3. Results

The Library compared the hand-counted OCR accuracy rate for bitonal and greyscale for each sample, and then calculated the average accuracy rates. Table 2 summarises the data and results.

Table 2: The sample data and hand-count accuracy results.

                                                    Average hand-count accuracy
Newspaper title       Reel number  Pages selected   Bitonal 400 DPI  Greyscale 400 DPI  Greyscale 300 DPI
Daily Southern Cross  28258        40               98.27%           95.93%             -
Daily Southern Cross  28259        40               98.49%           97.73%             -
The Colonist          35903        80               95.83%           88.64%             83.88%
Combined                           160              97.53%           94.10%             -

The overall average bitonal accuracy rate was 97.53%, whereas that of greyscale was 3.43 percentage points lower at 94.10%. The images scanned at 300 DPI had a lower average OCR accuracy than the 400 DPI images of the same reel, so the 300 DPI data has been excluded from further analyses.
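As a check on the arithmetic, the combined figures in Table 2 appear to be the simple mean of the three reel averages rather than a mean weighted by page count. The sketch below recomputes both from the per-reel figures; the function names are illustrative only.

```python
# Per-reel figures copied from Table 2: (pages, bitonal %, greyscale 400 DPI %).
reels = {
    "28258": (40, 98.27, 95.93),
    "28259": (40, 98.49, 97.73),
    "35903": (80, 95.83, 88.64),
}

def mean_of_reels(column):
    """Simple (unweighted) mean of the per-reel averages."""
    values = [row[column] for row in reels.values()]
    return sum(values) / len(values)

def page_weighted_mean(column):
    """Mean of the per-reel averages weighted by the number of pages sampled."""
    total_pages = sum(row[0] for row in reels.values())
    return sum(row[0] * row[column] for row in reels.values()) / total_pages

bitonal = mean_of_reels(1)
greyscale = mean_of_reels(2)
print(f"mean of reel averages: bitonal {bitonal:.2f}%, greyscale {greyscale:.2f}%")
print(f"difference: {bitonal - greyscale:.2f} percentage points")
print(f"page-weighted: bitonal {page_weighted_mean(1):.2f}%, "
      f"greyscale {page_weighted_mean(2):.2f}%")
```

The unweighted means reproduce 97.53% and 94.10%. Weighting by pages gives more influence to The Colonist reel, which widens the gap between the two methods slightly.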

We also performed a direct comparison for the 400DPI scans for each reel and found that in reel 28258 bitonal had the higher accuracy rate 34 times (85%), in reel 28259 bitonal had the higher accuracy rate 31 times (77.5%), and in reel 35903 bitonal had the higher accuracy rate 72 times (90%).
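One way to judge whether a split this lopsided could arise by chance is an exact sign test on the paired samples. This test was not part of the original analysis; the sketch below is offered only as an illustration. It assumes the samples are independent and, because ties were not reported, counts every sample that bitonal did not win as a greyscale win, which is the reading least favourable to bitonal.

```python
from math import comb

def sign_test_p_value(wins, n):
    """Two-sided exact sign test.

    Returns the probability of a split at least this lopsided if each
    method were equally likely to score higher on any given sample.
    """
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

bitonal_wins = 34 + 31 + 72   # counts reported for the three reels
samples = 40 + 40 + 80        # pages sampled per reel
p = sign_test_p_value(bitonal_wins, samples)
print(f"bitonal higher in {bitonal_wins} of {samples} samples, p = {p:.3g}")
```

A very small p-value here would indicate only that the per-sample differences in this workflow are unlikely to be chance. It would not show that bitonal scanning is better than greyscale scanning in general.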

2.4. Analysis of outliers

Following this experiment, we examined the variation by looking at every instance of 5% or more variation in accuracy between bitonal and greyscale.

The two Daily Southern Cross reels (28258 and 28259) had few samples in which the variation between greyscale and bitonal was greater than 5% – there were six instances out of 40 in 28258 and two out of 40 in 28259. The Colonist reel (35903) had twenty-five samples out of 80 in which the variation between greyscale and bitonal was greater than 5%.

In general, in cases where greyscale digitisation was more than 5% worse, the scans were characterised by pale text, blurry text, and poor contrast. This was a particular problem for reel 35903. Overall, there are only two instances where greyscale is more than 5% better.
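The outlier screen described above can be sketched as a simple filter over the paired results. The sample values below are invented for illustration and are not the measured results; the threshold is expressed in percentage points.

```python
def flag_outliers(samples, threshold=5.0):
    """Split samples into those where greyscale is markedly worse or better.

    samples: iterable of (sample_id, bitonal_accuracy, greyscale_accuracy),
    with accuracies in percent. threshold is in percentage points.
    """
    worse, better = [], []
    for sample_id, bitonal, greyscale in samples:
        diff = greyscale - bitonal
        if diff <= -threshold:
            worse.append((sample_id, diff))
        elif diff >= threshold:
            better.append((sample_id, diff))
    return worse, better

# Invented values for illustration only; these are not the measured results.
example = [("a", 98.2, 97.9), ("b", 96.0, 88.5), ("c", 91.0, 97.5)]
worse, better = flag_outliers(example)
```

Keeping the two directions separate makes it easy to pull up the scans behind each flagged sample and look for shared characteristics such as pale text or poor contrast.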

3. (Lack of) conclusions

The greyscale evaluation was based on the assumption stated in our hypothesis – that greyscale scanning produces a significantly higher level of OCR accuracy than bitonal scanning – and our results were therefore quite unexpected.

Our hypothesis is unsupported by the available evidence. The hand-count provided no evidence that greyscale digitisation improves OCR accuracy, and the analysis of outliers did not have enough data to draw any conclusions.

Based on vendor recommendations, and our own understanding of the OCR process, we expected to see significant and obvious evidence that greyscale was superior. However, this evidence did not materialise.

4. Discussion

This experiment was designed to yield a practical outcome that could be applied in a large-scale and largely automated digitisation and OCR workflow. We are primarily interested in solutions that will help us in affordable ways in our commercial setting with current technology. We are less interested in ways that OCR can be improved on a page-by-page basis under the guidance of a human expert.

4.1. Why greyscale scanning is theoretically better

Prior to the experiment we reviewed the available commercial and academic literature. We also looked at best practice and the choices made by other similar projects. However, it was difficult to find much material published on the topic.

We found that there was general consensus among OCR vendors and other experts that greyscale images result in higher OCR accuracy than bitonal images, for several reasons:

  • Deskewing, despeckling and other image processing. Greyscale allows more effective image processing, resulting in versions of the page with reduced speckling, cleaner backgrounds, and sharper text characters and illustrations. In particular, greyscale is thought to allow greatly superior deskewing as images can be rotated without causing breaks in the letter forms that might mislead the OCR program.
  • Specialised binarisation algorithms. Scanners are optimised for human viewing. OCR programs, on the other hand, work better with higher-contrast images. Supplying greyscale images lets the OCR software apply its own binarisation algorithms, which are optimised for OCR rather than for viewing, and these should therefore give better OCR accuracy.
  • Additional information. Greyscale scans inherently hold more information that can be used by OCR algorithms to process text.

All these reasons suggest that greyscale scanning will improve OCR accuracy, though most authorities we contacted noted these improvements will vary depending on the type of material. Several experts identified potential improvements that greyscale may offer in the future, but that are not yet commercially available, such as the ability to binarise a greyscale image after it has been zoned into articles to reduce the effects of variation within each newspaper page.
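To make the binarisation step concrete, the sketch below applies a single global threshold chosen by Otsu's method, a standard textbook technique. It is not the algorithm used by any scanner or OCR product discussed in this article, and it assumes dark text on a light background. The tiny image is invented for the example.

```python
def otsu_threshold(pixels):
    """Return the grey level t that best separates pixels into two classes
    (values <= t and values > t) by maximising between-class variance."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    count_dark, sum_dark = 0, 0
    for t in range(256):
        count_dark += hist[t]
        sum_dark += t * hist[t]
        count_light = total - count_dark
        if count_dark == 0 or count_light == 0:
            continue  # one class is empty, so there is nothing to separate
        mean_dark = sum_dark / count_dark
        mean_light = (sum_all - sum_dark) / count_light
        score = count_dark * count_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t

def binarise(rows, threshold):
    """Map an 8-bit greyscale image (list of rows) to 1 = ink, 0 = paper."""
    return [[1 if value <= threshold else 0 for value in row] for row in rows]

# Tiny invented image: dark strokes (about 40) on a light background (about 200).
image = [
    [200, 210, 40, 190],
    [195, 30, 50, 205],
    [200, 45, 200, 210],
]
t = otsu_threshold([v for row in image for v in row])
bitonal = binarise(image, t)
```

A single global threshold is the simplest case. The per-article binarisation mentioned above would instead choose a threshold separately for each zoned region, which is one way variation within a page could be reduced.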

4.2. Possible explanations

In our experiment greyscale scanning was no better than bitonal scanning. We have considered a number of reasons that this might be the case.

  1. Problems with our scanning methods. It is possible, even likely, that we simply do not know how we are supposed to scan to get good greyscale results. This situation could be exacerbated by the fact that the Library uses different scanning and OCR vendors.
  2. Our data is special. The three reels we used in the sample may have been significantly better or worse than typical OCR material, or than normal Papers Past material. We attempted to find microfilm that was typical of Papers Past to use in this experiment. However, it may be that we have chosen high quality scans, and that the major benefit of greyscale scanning is reaped from poor quality inputs. In the end, we chose reels that were typical of our newspaper data, since we hope to apply the results to similar data in the future.[2] If greyscale is better only in some cases, it would be very useful to know what cases those are.
  3. Binarisation algorithms in scanners are just as good as those in OCR programs. The argument for this explanation is that human-readable text is good for OCR, and hardware scanners are tuned to produce very good human-readable bitonal text. Also, some scanners digitise in 12-bit greyscale and then downscale to 8-bit, so even specialised OCR binarisation algorithms are working on data that has already been processed by the scanner, and may be limited in the techniques they can apply, even to supposedly raw scans. This conjecture is, however, unsupported by evidence either way as far as we know.
  4. Sample too small. Our sample may just have been too small.
  5. We should have considered X when choosing the samples. There are several other factors we could have taken into account when choosing samples, such as the curvature of the page or the presence of backgrounds. We chose not to do so, and to rely on a random set of pages and regions, because our goal was to perform a direct comparison of methods (not to measure the level of accuracy of each method), and because we were not confident that we knew which factors affect OCR accuracy and which do not.

In conclusion, we believe that the explanation is probably a mixture of these. The major benefits of greyscale digitisation are in handling pages that need substantial preparation, such as deskewing, whilst our pages are generally of good and consistent quality and therefore will not benefit from these. The other obvious benefits of greyscale – superior binarisation and additional information – are both unproven with current technology, and apparently had little impact on OCR accuracy.

4.3. The costs and benefits of going grey

An obvious long-term advantage of changing to greyscale scanning, with regard to OCR quality, is that there is more information in the scans for the OCR program to work from. This additional information has not helped in the present experiment, which uses the best technology currently commercially available, but in the future advanced OCR software may be able to make better use of it. Greyscale scanning can therefore be seen as a hedge against future technology improvements.

A second advantage of greyscale scanning is the better representation of pictorial content, such as photographs and illustrations, that may appear in historic newspapers. While this is out of scope for the current experiment, it may drive our thinking in the future: Papers Past currently does not contain very much pictorial content, but this may change as we start including more 20th century material.

However, going grey also means more cost, effort and inconvenience. The OCR processing costs more as it requires vendors to use more resources (CPU and storage). Both storage and transport requirements increase because 8-bit greyscale TIFFs can be up to 80 times[3] as large as bitonal TIFFs when uncompressed, and even when compressed (with some loss of detail), they can be 20 to 40 times as large. This would dramatically increase our long-term storage costs and make the transfer of data to vendors problematic.
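The raw pixel data behind these figures can be estimated with simple arithmetic. The sketch below is illustrative only: the page dimensions are a hypothetical value, and it counts uncompressed pixel data without TIFF headers or compression.

```python
def raw_image_bytes(width_in, height_in, dpi, bit_depth):
    """Uncompressed pixel data size in bytes, with each row padded to a whole byte."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = (width_px * bit_depth + 7) // 8
    return bytes_per_row * height_px

# Hypothetical newspaper page of 17 x 23 inches, scanned at 400 DPI.
greyscale = raw_image_bytes(17, 23, 400, bit_depth=8)
bitonal = raw_image_bytes(17, 23, 400, bit_depth=1)
print(f"greyscale: {greyscale / 1e6:.1f} MB, bitonal: {bitonal / 1e6:.1f} MB, "
      f"ratio: {greyscale / bitonal:.0f}x")
```

For raw pixel data the ratio is simply the ratio of bit depths. Larger ratios such as those quoted above presumably depend on the compression applied to each format, which this sketch does not model.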

In light of New Zealand's low rate of broadband uptake, there may be other costs in the delivery of large greyscale images to Papers Past users. As they are much larger than bitonal images, they would either take much longer to download, or would have to be transformed in some way to be made accessible. With increasing access to broadband, this problem should diminish over time.

Going grey would also incur a certain amount of risk as we replace a known and successful workflow with one that is relatively unknown to us. For example, the OCR quality from greyscale images appears to be quite sensitive to changes in the digitisation parameters. In a preliminary experiment we digitised the same microfilm reels using a scanning process that sharpened the images (resulting in a halo effect around the letters), and the resulting OCR accuracy of the greyscale images was on average 10% worse than the bitonal equivalents. This suggests that the process is vulnerable to scanning errors that decrease image and OCR quality, and that changing to greyscale scanning may require ongoing monitoring above and beyond our current quality assurance procedures.

Finally, we believe that there are other ways to increase OCR accuracy that may be cheaper or more effective than changing to greyscale digitisation, such as re-filming. In 2006/2007 we re-filmed and re-digitised the New Zealand Gazette and Wellington Spectator, providing cleaner inputs to OCR, and resulting in an improvement in OCR accuracy of nearly 10% (as measured by average machine-estimated accuracy rates, not by hand-count). The National Library of Australia has reached similar conclusions, determining that the only way to significantly improve OCR accuracy is to improve the quality of the source materials or make manual adjustments to the process for each file, and ultimately attempting to solve the problem laterally, by asking users to voluntarily correct the OCR output after the fact.[4]

4.3.1. Future considerations

Most major historical newspaper projects have chosen to digitise in greyscale, and most disagreement is over the resolution: for example, the British Library and the Bibliothèque nationale de France consider that 300 DPI is sufficient, while the Library of Congress and the National Library of Australia require 400 DPI. The Databank of Digital Daily Newspapers (DDD) project from the Koninklijke Bibliotheek found that scanning at 300 PPI was the consensus.[5] In the meantime the debate has moved on to the benefits of colour scanning and direct digitisation.

As noted by Edwin Klijn, "Scanning from the originals is generally acknowledged to produce higher quality master images. There is some disagreement among the survey respondents as to whether one should scan in colour or greyscale. Scanning in colour produces a master that is closer to the original newspaper (more 'authentic') than greyscale. Also, according to some respondents colour images may lead to better OCR results, or at least provide better 'raw materials' to improve the OCR in due course."[6]

4.4. Open questions

We were surprised to find little documented evidence for the claim that greyscale digitisation provides a higher level of OCR accuracy than bitonal. This may be a result of commercial sensitivity on the part of OCR vendors, or the claim may be an assumption based on the undisputed advantage of greyscale scanning: the better representation of pictorial content. Whatever the reason, it has left us with several questions to place before the newspaper digitisation community:

  • How much benefit should we expect to see from using greyscale digitisation instead of bitonal, and does this apply to all digitised material, or only to some?
  • How is it that OCR software tunes its binarisation to produce results that are (supposedly) superior to the algorithms used in scanning software?
  • How do OCR vendors recommend that binarisation be done to produce optimum OCR?

The last of these questions is particularly important in high-volume settings. One of the major costs of greyscale OCR is the transportation of large quantities of data, but this can be reduced significantly by binarising at the time of digitisation and sending the resulting bitonal images to the OCR vendor. This approach has been adopted by the National Library of Australia, who have run experiments to compare binarisation programs and techniques and select the software that yields the best OCR results.[7]

Another argument used in favour of greyscale is that users who are reading digitised newspapers on computer screens prefer to read text from greyscale scans. However, we suspect this is another assumption that is not tested and documented, and that may be untrue. In our limited experience, it is true of advanced users (such as image processing professionals), but regular users often express a preference for bitonal. We are considering a follow-up experiment to test this issue.

5. Conclusions

In this article, we have described an experiment to test the immediate benefit to our users of "going grey" by scanning historic newspapers for Papers Past in greyscale rather than bitonal.

Given our existing selection policy, digitisation methods, and vendors, we could find no evidence that using greyscale digitisation in Papers Past would increase OCR accuracy (which is not the same as saying we found bitonal scans are better than greyscale scans). As a result, the project team recommended that the Library continue its practice of bitonal digitisation for Papers Past for now, but that we be prepared to review this decision as more information becomes available, and as more pictorial content is selected for digitisation.

In the future we will investigate other ways of improving OCR accuracy for historic newspapers that are robust and reliable in a high-volume setting.

Our key message to anyone else with a treasure trove of bitonal scans is not to assume that their quality is too poor to OCR. You might be pleasantly surprised at the value of OCRing what you have now, rather than re-scanning in greyscale.

Acknowledgements

The authors would like to thank the following people:

David Adams
Rose Holley
Bronwyn Lee
Steve Knight
Frederick Zarndt
Andy Fenton
Simon Gotlieb

Notes and References

1. Current collection statistics are available from: <http://paperspast.natlib.govt.nz/cgi-bin/paperspast?a=p&p=about>.

2. One way to estimate whether the chosen reels are "typical quality" is to compare their machine-estimated OCR accuracy for the bitonal scans to that of the wider Papers Past collection. The machine-estimated accuracy for the Daily Southern Cross was 95.526% for reel 28258 and 95.277% for reel 28259. This can be compared to the machine-estimated OCR accuracy rates for the first 13 titles OCRed in Papers Past, which were a mixture of poor and high quality titles. These range from 72.90% to 99.20%, with an average of 93.16%. Eleven of the 13 had average estimates in the 90-99% range. This suggests (but does not confirm) that the Daily Southern Cross data in the current experiment were slightly better than average quality.

3. Utah Digital Newspapers Digital Newspaper Project Handbook. Slide 34.

4. Holley, Rose. "Increasing the Accuracy of OCR". <http://www.nla.gov.au/ndp/project_details/documents/ANDP_IncreasingOCRaccuracy.pdf>.

5. Klijn, Edwin. "The current state-of-art in newspaper digitization : a market perspective". D-Lib Magazine, January/February 2008, 14(1/2), <doi:10.1045/january2008-klijn>.

6. Ibid.

7. Holley, Rose. Personal communication.

Copyright © 2009 Tracy Powell and Gordon Paynter

doi:10.1045/march2009-powell