Validating Quality in Large-Scale Digitization


The Institute of Museum and Library Services (IMLS) awarded a grant of $674,000 to the School of Information (SI) at the University of Michigan (Ann Arbor) to define, test, and apply measures of image and text quality in a very large collection of digitized books and serials. The two-year project (2011-12) is a collaboration between SI and the University of Michigan Libraries, with important contributions from the HathiTrust Digital Library and the University of Minnesota Libraries.

The large-scale digitization of books and serials is generating extraordinary collections of intellectual content that are transforming teaching and scholarship. Questions are being raised, however, regarding the quality and usefulness of digital surrogates produced by third-party vendors and deposited for preservation in digital repositories. For preservation repositories and their communities of users to trust that digital documents have the capacity to meet the uses envisioned for them, repositories must validate the quality and fitness for use of the objects they preserve. This research project is designed to develop and test a methodology for assessing the quality of digital surrogates and to validate the findings with groups of end-users.

The project builds upon a planning effort in 2009-10, sponsored by the Andrew W. Mellon Foundation, to formulate a research methodology for evaluating the quality of digitized books and journals that are held by research libraries but were produced in large-scale digitization programs by Google, the Internet Archive, and other third-party vendors.

The HathiTrust Digital Library serves as the source of digitized books and serials for the project, which has two overlapping phases.
  • Research Phase 1 (2011) - focuses on defining a model of digitization error and a scale for recording consistently and accurately the severity of observed error.
  • Research Phase 2 (2011-12) - focuses on applying the research methodology developed in Phase 1 and validating the results of the error analysis within the context of specific use case scenarios.
Representative samples of volumes from sub-populations of the overall HathiTrust collection are drawn and reviewed through a process of manual inspection. The evaluation of coding reliability is an important part of the analysis. Use case studies explore the relationship between quality measures and the use of HathiTrust content in four contexts: reading images online, printing content on demand, mining the underlying text, and managing print collections in research libraries.
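
As an illustration only, the sampling and reliability steps described above could be sketched as follows. This is a minimal sketch under stated assumptions, not the project's actual procedure: the function names, the per-stratum sample size, the integer severity codes, and the choice of Cohen's kappa as the agreement statistic are all hypothetical and do not come from the project description.

```python
import random
from collections import Counter


def stratified_sample(volumes_by_stratum, n_per_stratum, seed=0):
    """Draw up to n_per_stratum volume IDs from each sub-population.

    volumes_by_stratum maps a hypothetical stratum label (for example a
    language or date range) to a list of volume identifiers.
    """
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    sample = {}
    for stratum, volume_ids in volumes_by_stratum.items():
        k = min(n_per_stratum, len(volume_ids))
        sample[stratum] = rng.sample(sorted(volume_ids), k)
    return sample


def cohen_kappa(codes_a, codes_b):
    """Chance-corrected agreement between two reviewers' codes.

    codes_a and codes_b are equal-length lists holding the code each
    reviewer assigned to the same items (assumed here to be severity
    levels; the project's real coding scheme is not specified).
    """
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("code lists must be non-empty and equal in length")
    n = len(codes_a)
    # Share of items on which the two reviewers gave the same code.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from each reviewer's code frequencies.
    freq_a = Counter(codes_a)
    freq_b = Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical code throughout
    return (observed - expected) / (1 - expected)
```

In this sketch, a kappa near 1 would indicate that two reviewers code the same pages almost identically, while a value near 0 would indicate agreement no better than chance. Other agreement statistics could be substituted without changing the overall workflow.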