By Euan Cochrane
On Digital Preservation Day 2017, Euan Cochrane, Digital Preservation Manager at Yale University Library, discusses the challenges of scale associated with vast collections of born-digital content
Today marks the first annual ‘Digital Preservation Day,’ when preservation practitioners around the world can reflect on the importance of ensuring that critical digital records, whether of cultural or business importance, remain accessible, secure and future-proofed. It’s a day for us to consider not only the value of the collections we have, but also the scale of many quickly growing born-digital archives.
The preservation and scalability challenges of using CD-ROMs to store unique digital collections bring into sharp relief the difficulty of managing both native and born-digital content. A few years ago at Yale University Library we started to mitigate the risk of media degradation by actively preserving the born-digital content held on external media in our general collections.
We began by identifying the CD-ROMs and floppy disks residing in our collections, and then systematically recalled the media to make disk images of them using BitCurator and KryoFlux tools.
After the first 200 CD-ROMs had been imaged, I decided to analyze them and produce some metrics on what we had. The aim was to help with understanding their preservation needs, discussing their value and preservation context, reporting on progress, and explaining them more generally. I hoped this would show us what we had to do to best meet our preservation needs, and open up a discussion on the value of our digital assets. In the first 200 successfully imaged CD-ROMs there were:
152 distinct “formats” identified by DROID (a file format identification tool) and 43 format “versions” amongst those
52,361 files whose formats couldn’t be identified
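As a rough illustration of how such metrics can be tallied, here is a minimal Python sketch that summarizes a format-identification report. It assumes a CSV export with columns named TYPE, PUID, FORMAT_NAME and FORMAT_VERSION, and it treats a row with an empty PUID as an unidentified file; both are assumptions for the example, not a description of the exact process used at Yale. The sample rows are made up purely to exercise the function.

```python
import csv
import io


def summarize_formats(rows):
    """Tally distinct formats, distinct versions and unidentified files.

    `rows` is an iterable of dicts keyed by the assumed column names
    TYPE, PUID, FORMAT_NAME and FORMAT_VERSION.
    """
    formats = set()
    versions = set()
    unidentified = 0
    total_files = 0
    for row in rows:
        # Skip folder entries so that only files are counted.
        if (row.get("TYPE") or "").strip() == "Folder":
            continue
        total_files += 1
        puid = (row.get("PUID") or "").strip()
        if not puid:
            # Assumption: no identifier means the format was not identified.
            unidentified += 1
            continue
        name = (row.get("FORMAT_NAME") or "").strip()
        version = (row.get("FORMAT_VERSION") or "").strip()
        formats.add(name)
        if version:
            versions.add((name, version))
    return {
        "files": total_files,
        "formats": len(formats),
        "versions": len(versions),
        "unidentified": unidentified,
    }


# Made-up sample report, for illustration only.
SAMPLE_REPORT = """TYPE,PUID,FORMAT_NAME,FORMAT_VERSION
Folder,,,
File,example/1,Example Format A,1.0
File,example/2,Example Format A,2.0
File,example/3,Example Format B,
File,,,
"""

summary = summarize_formats(csv.DictReader(io.StringIO(SAMPLE_REPORT)))
print(summary)
```

To run it against a real report, the `io.StringIO` sample would be replaced with an open file handle, and the column names adjusted to match whatever the export actually contains.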
To communicate this volume and complexity to colleagues more familiar with analogue content, I thought it would be useful to convert these numbers to “paper equivalents”. The majority of the 243,714 files were PDF files (in 9 different versions), with an average size of 343KB per file, while a normal archival box can fit a maximum of 2,500 sheets of paper.
With a conservative set of conversion assumptions, where each file, if printed, consists of only one page and we can fit 2,500 pages per box, we would have the equivalent of 98 boxes of printed paper within the first 200 CD-ROMs (243,714 pages divided by 2,500 pages per box, rounded up). Scaled up, this means that for each 1,000 CD-ROMs we have, at a minimum, the equivalent of 487 boxes of printed paper, or approximately 487 shelf-feet at 1 foot per box. Using more reasonable (but still potentially conservative) conversion assumptions of 1,500 pages per box and 211KB per page, so that an average 343KB file runs to about 1.6 pages, we get at least 1,321 printed-paper box-equivalents per 1,000 CD-ROMs, or 1,321 shelf-feet of boxes.
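The arithmetic behind these paper equivalents can be reproduced with a short Python sketch. It assumes, as the figures above appear to, that all 243,714 files are treated as average-sized 343KB documents; the function and variable names are illustrative only.

```python
import math


def box_equivalents(file_count, avg_file_kb, kb_per_page, pages_per_box):
    """Estimate how many archival boxes the files would fill if printed."""
    pages = file_count * avg_file_kb / kb_per_page
    return pages / pages_per_box


def per_thousand_discs(boxes, discs_imaged):
    """Scale a box count from the imaged sample up to 1,000 discs."""
    return boxes * 1000 / discs_imaged


FILES = 243_714      # files found on the first 200 CD-ROMs
AVG_FILE_KB = 343    # average file size reported above
DISCS = 200          # size of the imaged sample

# Conservative case: one page per file, i.e. a "page" is as large as a file.
conservative = box_equivalents(
    FILES, AVG_FILE_KB, kb_per_page=AVG_FILE_KB, pages_per_box=2500
)

# More reasonable case: 211KB per page and 1,500 pages per box.
reasonable = box_equivalents(
    FILES, AVG_FILE_KB, kb_per_page=211, pages_per_box=1500
)

print(math.ceil(conservative))                         # boxes in the first 200 discs
print(round(per_thousand_discs(conservative, DISCS)))  # boxes per 1,000 discs
print(round(per_thousand_discs(reasonable, DISCS)))    # boxes per 1,000 discs
```

Under these assumptions the three printed values are 98, 487 and 1321, matching the figures quoted above. Changing `kb_per_page` or `pages_per_box` shows how sensitive the shelf-feet estimate is to the conversion assumptions.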
Why is this important? It illustrates that, likely without intending to, many libraries have already built large collections of born-digital content, collections that may well rival their analogue collections in volume, complexity and number of objects.
Euan Cochrane @yalelibrary