CDLA’s Large Scale Digitization Program. Concept and planning document (Kirill Fesenko, 2009)

Large scale digitization programs in the context of the UNC Library

The development of large-scale digitization programs began in the mid-2000s with the onset of large book digitization projects initiated by Google, Microsoft, Yahoo! and the Internet Archive. As pioneers of an industrial-strength approach to digital conversion, they recognized the scholarly community's growing reliance on online access to information and quickly moved to make thousands of books accessible on the Internet. Although libraries have been digitizing their collections since the mid-1990s, this has routinely been done on a project-by-project basis and mostly through grant funding. The Google and Internet Archive projects became the first examples of large production-type programs for digitizing library collections. These examples, along with the increased pressure on libraries to broaden online access to paper collections, stimulated a more active exploration of avenues for mass digitization. Among the many challenging questions faced by libraries are: Do we partner with Google or with the Internet Archive, or both? Should we develop our own large scale digitization operations? If so, will we be able to sustain them in the long run? How does our investment in large scale digitization efforts correspond with other library priorities? How do we successfully transition from project based digitization to a streamlined programmatic approach?

Investing in large scale digitization programs is not a simple choice for any library. In the case of Google, the vendor pays for the digitization, selects and digitizes the material, and provides online access to it. The participating library is responsible for pulling the books from shelves and shipping them to one of Google's digitization centers. As the Google program evolved, libraries also engaged more actively in storage of the digital files, and in uploading the electronic books to local servers for access with some restrictions on indexing by Internet search engines. The Internet Archive model did not pay for the digitization, but it provided libraries with a greater degree of control over the selection and digitization process. For the UNC Library, the opportunity to gain hands-on experience in the selection of material while also balancing the digitization workload was attractive because it fits our long term strategy for developing and sustaining a large scale in-house digitization program. Our strategy goes beyond the digitization of general and rare book collections: it also includes the large scale digitization of manuscript, archival, photographic, institutional repository and other collections and formats (still image, audio, video, etc.).


Focus on quality rather than quantity, and development of a programmatic approach to digitization

The term "large-scale digitization" places a strong emphasis on the quantity of material. Although we recognize that the ability to digitize large quantities of material is our essential long term goal, at this early stage in our digitization program's development we choose to focus more on the quality of material, the research and development of cost-effective and sustainable technologies and workflows, and on building solid materials selection strategies in close collaboration with scholars, curators, bibliographers and other institutions. In order to highlight this emphasis on quality and on the development of sustainable production-type digitization operations, we will use the term Programmatic Digitization rather than Large Scale Digitization for the purpose of this paper.


Project Based vs. Programmatic Digitization

As was mentioned earlier, libraries have been using digitization as a method to improve access to collections for the past 10-15 years, starting with early grants through the Library of Congress in the mid-1990s. Grant funded projects have certainly stimulated experimentation and innovation in libraries, and have helped improve access to many collections. However, as is the case with any temporary funding, the nature of these projects has raised concerns regarding their long term sustainability, the fragmentation of access to resources caused by an increasing number of one-off projects and databases, long term storage and migration of digital archives, and the difficulties in unifying technological infrastructures. More libraries now realize that sustainable Programmatic Digitization will require high productivity rates and cost effective equipment and operations, better long term planning of data storage capacities and technical infrastructure development, stable sources of funding, and overall integration of the digital component with traditional library operations. In this approach, Programmatic Digitization appears as a concept which highlights interdependencies between many library departments and activities, and integrates various parts such as selection, digitization, storage, and online access into one natural continuous process.


Programmatic Digitization workflows resemble paper library workflows

Viewed from the perspective of one continuous streamlined process, Programmatic Digitization resembles the traditional life cycle of library and archival materials processing, which is independent of format (paper or digital).

Programmatic Digitization emphasizes several interlinked steps which represent various stages of one process. The process starts with the selection of materials and results in online access. It involves other necessary stages such as technical processing and preservation of material to support the final goal – improved online access to library collections. 


Components of UNC Library Programmatic Digitization and Plans for their Development

The above framework for Programmatic Digitization is also helpful for examining its individual parts and for program planning in the context of the UNC Library.

1. Selection of material

Selection of materials in the Programmatic Digitization loop is the first and the most meaningful part of the process. In the project based approach to digitization, selection of material is viewed mainly as a one time effort which is focused on the needs of a particular project and includes a given set of material. Programmatic Digitization gives curators and bibliographers a new degree of flexibility to cherry-pick smaller or larger parts of their collections for digitization and online access as part of their daily routine. This requires a new level of curator/bibliographer understanding of the needs of their constituencies, and of campus instructional and research priorities with regard to online access to collections. This approach also means an increased level of communication between the curators and faculty about new possibilities of Programmatic Digitization for instruction and research, and gives curators the ability to convert faculty needs into a long term materials selection strategy for digitization. Collaborative collection development with other repositories is another attractive area of joint effort, as demonstrated by our Scribe projects with Duke and ECU/State libraries. Other areas include gradual integration of the library's traditional on-demand, ILL and document delivery services with streamlined digitization operations, and other projects such as the transfer of paper collections to remote storage.

Our efforts to develop a solid long term strategy for materials selection and prioritization should be matched with equally vigorous efforts to identify sources of funding for the support of Programmatic Digitization. These tasks are interlinked: the more skillful we become at developing a strong long term selection strategy, the better the chances that we will also succeed at generating growing financial support for our digitization efforts.

Plans for the selection of material for Programmatic Digitization

a. Integrate our digitization efforts more fully with the UNC curriculum and research projects, increase visibility of the program and communications with scholars.

b. Invite more curators and bibliographers to participate in our digitization efforts

    - Davis and branch library collections

    - Begin Programmatic Digitization of manuscript collections

c. Continue efforts to incorporate Programmatic Digitization more fully with traditional library services and operations:

    - Scanning of unique books for remote storage

    - ILL

    - document delivery

    - preservation scanning

d. Pursue more collaborative projects with other repositories in NC and beyond

e. Improve and expand long term selection and prioritization strategy by including new streams of content generated by new projects such as ECU/State library and NC ECHO.

f. Participate more actively in grant funded projects.

g. Seek new library internal and collaborative projects which more actively employ non-book formats (Duke/index cards, MESDA/index cards, Ackland Art Museum, etc.)

2. Digital conversion, descriptive and technical metadata (“Technical processing”)

During this stage we digitize selected materials and link available metadata and/or create new metadata.

Digitization equipment and workflows

Our strategy here includes the development of the Digital Production Center (DPC), which can effectively manage a comprehensive set of high productivity digital conversion equipment. This process started with the addition of the high speed Scribe scanning station, the replacement of the old digital back camera with a newer, faster one, the addition of the Zeutschel and Kodak slide/film scanners, and most recently, the Fujitsu sheet fed scanner. With the latter addition, the DPC is now fully equipped with a minimal, but sufficient, set of high productivity digital equipment which can handle a large variety of material and formats in a production manner.

Plan for the equipment

a. incorporate second Scribe into DPC's workflows;

b. incorporate new scanning equipment into DPC's workflows as may be required by the NC ECHO program.

Descriptive and technical metadata, Optical Character Recognition (OCR)

- linking/creation of descriptive metadata with the digitized objects

- generation of technical and preservation metadata

- OCR as a way to improve access through keyword searching

Plan for the metadata

- increased use of available minimal descriptive metadata

- develop technologies and workflows to generate technical and preservation metadata for objects as part of the digitization process (checksums, etc.)

- continue efforts to develop production OCR operations.
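
As an illustration of what generating technical and preservation metadata during digitization might look like, the following is a minimal sketch in Python that computes a checksum for a scanned file, stores it in a small record, and later re-checks the file against that record. The function names, the record fields and the choice of SHA-256 are assumptions made for this example; they do not describe the tools or schema actually used by the DPC.

```python
import hashlib
import os
from datetime import datetime, timezone


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def technical_record(path, algorithm="sha256"):
    """Build a minimal technical/preservation record for one digitized file.

    The field names are illustrative only, not an established schema.
    """
    return {
        "filename": os.path.basename(path),
        "size_bytes": os.path.getsize(path),
        "checksum_algorithm": algorithm,
        "checksum": file_checksum(path, algorithm),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def verify(path, record):
    """Re-compute the checksum and compare it with the stored record."""
    current = file_checksum(path, record["checksum_algorithm"])
    return current == record["checksum"]
```

A record of this kind could be created when a file leaves the scanner and checked again after each copy or migration, so that silent corruption is detected before the file is deposited.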

3. Archiving and online publishing of digital objects (“Shelving”)

- Digital Archiving of objects and related descriptive, tech/preservation metadata

- uploading of digital objects and metadata to online access/publishing platforms (CDM, DocSouth, etc.)

- administration, support and long term maintenance of the Digital Archive, including data back up, migration, and network space (“Stacks management”).

Plans/improvements

- continue efforts to improve long term planning for the DA and network space;

- start adding technical and preservation metadata to files for the DA deposit.

4. Online access to digital collections and objects

- interface to online access/publishing platforms (out of the box and developed in-house)

- functionality – keyword searching, browsing, value added features (out of the box and developed in-house)

- administration, support, long term maintenance and migration of online databases (CONTENTdm, DocSouth, RBR, Blake Archive, etc.).

Plans/improvements

- consolidate access to resources and develop an access platform

    - effective keyword searching and browsing across heterogeneous resources

    - links from citations to full texts

- development of value added functionality (geo linking, commenting, tagging, transcription, etc.)

- directions for future work

   - Systems dept. guidelines for production systems

   - development of an infrastructure for comprehensive access to digital resources as a collaborative project of CDLA and the Systems dept.


Comments from library fellows:

> 1- What is your position on quality control and quality assurance in the
> context of large-scale digitization?  (my CLIR report has some comments on
> this issue)
> 2- How will you encourage and use user feedback in your development cycle in
> regard to image quality, metadata, search/view features etc.?
> 3- What is your storage policy and how will you address redundancy needs
> (including geographic)? 
> 4- How would you facilitate the discovery of digitized collections - such as
> through harvesting or OAI etc.?  What are your push and pull strategies to
> ensure the usage and use of digital materials?
> 5- How does the digital collection relate to (or integrated with) other UNC
> collections (both digital and print - such as finding resources from the
> online catalog?) - you mention this in 1a on page 3 but would be nice to
> expand it elsewhere.
> 6- How about fiscal management or accountability?  Do you need to have an
> ongoing business planning activity to make sure that you are developing a
> sustainable program?  For instance, will you explore any revenue-generating
> opportunities (such as PoD)?
>
> Some of the issues I've raised may appear to be broader than your mandate
> but I think the core principle of moving from a project to program mode
> should be taking an integrative approach that address technical, public
> services, and collection development issues.