Celebrating World Digital Preservation Day – 3 November 2022

A blog by Jan Whalen, Digital Preservation and Systems Manager, The University of Manchester Library.

Organized by the Digital Preservation Coalition and supported by digital preservation networks around the globe, World Digital Preservation Day is dedicated to showing the benefits and opportunities enabled by the work of the dynamic digital preservation community.

Digital Preservation at The University of Manchester Library – An update from the Digital Preservation and Systems team

In 2017, we purchased Preservica as the Library’s digital preservation system. Preservica is essentially a management system, and we chose Amazon S3 as the storage environment it would manage. The aim was to provide a secure home for the large number of photographs our Imaging team had amassed on shared drives, as well as the increasing acquisitions of born digital archives such as the Carcanet Press email archive.

One of the roles of the Digital Preservation Team is to make sure the transfer of born digital material into Preservica happens with appropriate fixity checks along the way and that everything ends up safely in Preservica.

Preservica now holds a wealth of Special Collections material in many different formats including audio and AV. Recent ingests include footage of writer Rosie Garland reading from one of her novels and soundtracks by musician and composer Delia Derbyshire.

Spotlight on Email Archives

Every 4 years we acquire an accrual of Carcanet emails (Outlook PST files). The files are created by logging in to the email accounts at Carcanet and exporting the required emails. Before lockdown, a visit to Carcanet Press in Manchester was required; during lockdown we logged in remotely. Some of the issues we faced were:

  • Unreliability of Outlook PST exports: we would get a different size/fixity for the PST if we repeated the export. We had to repeat the process until we got three matching fixity checks.
  • The email accounts were used in ‘Cached’ mode, i.e., only the last 12 months were on the PC (the rest on the server), so the download took a long time.
  • There is no straightforward way of selecting individual components within a mailbox (e.g. omitting calendar, contacts etc.), so we had to export the whole mailbox each time, which was again time-consuming.
  • Even just opening the PST file in an Outlook account would cause the file to change (resulting in new fixity). This meant the files had to be handled very carefully and copies kept.
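
The repeated-export check described above can be sketched in a few lines of Python. This is only an illustration, not the tooling we actually used: the SHA-256 algorithm, the function names and the file names in the usage note are all assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def exports_match(paths: list[Path]) -> bool:
    """True only if every export in `paths` has the same checksum."""
    return len({sha256_of(p) for p in paths}) == 1
```

For example, `exports_match([Path("export1.pst"), Path("export2.pst"), Path("export3.pst")])` would return True only when all three (hypothetical) exports are byte-for-byte identical. Reading the file in binary mode like this does not open it in Outlook, so the check itself should not alter the PST.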

Once we had the files, they were appraised and listed using Paraben’s E3 (‘Email Examiner’) software. Outlook was again used to open, edit and finally compact the PSTs as required. Going forward, this stage of the process will be made much easier by ePADD, open source software developed by Stanford University to support the processing and delivery of email archives. The above link provides a good introduction to the tool and shows UoM Curator Paul Carlyle demonstrating the latest version.

Issues Around Ingesting Email Files to Preservica

A PST is a container file and as such does not need to be packaged up for ingest; we can simply move it into Preservica’s upload folder and start the ingest workflow, letting the transfer agent do its thing. We have an ingest workflow linked to our Carcanet Press email collection in Preservica to make things easy.

Preservica unpacks the PST and carries out a virus, fixity, metadata, and content integrity check. It also identifies any attachments. After storing the files, it creates an index of both email and any attachments.

An ingest would typically take several hours, and it would often take 3-4 hours to get an error message if something was wrong, which was quite frustrating. Some issues that occurred were:

  • Size of ingest / network limitations led to failure early on in our efforts.
  • Emojis: storing these requires the database to be encoded in utf8mb4 (originally it was utf8). The Preservica team reconfigured the database to solve this issue. 😊
  • Control characters: Preservica makes each email a separate eml file and names the file using the subject header. In two cases the email subject contained a control character, so it failed as a filename. In the end we narrowed the offending email down to one folder, and then to the email itself (a lengthy process).
  • OneDrive links: Preservica saw these (when encountered in emails) as a potential threat – the Preservica support team ascertained how we wanted them dealt with and amended the ingest workflow to neutralise the links.
  • Length of subject header: If the subject was too long, again the ingest would fail. The Preservica team again resolved this issue for us.
  • Viruses: Preservica identifies these and stops the ingest if it finds any. Fortunately, Preservica identifies the infected file for us, enabling us to deal with it locally before trying again.
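
Several of these failures only surfaced hours into an ingest, so a quick check of subject lines beforehand could save time. The sketch below is hypothetical and is not part of Preservica: the function name and the 200-character limit are assumptions, since we do not know the actual length threshold. It flags control characters, overly long subjects, and characters outside the Basic Multilingual Plane (such as emojis), which take four bytes in UTF-8 and so need utf8mb4 rather than MySQL's three-byte utf8.

```python
import unicodedata

MAX_SUBJECT_LENGTH = 200  # assumed limit for illustration only


def subject_problems(subject: str, max_length: int = MAX_SUBJECT_LENGTH) -> list[str]:
    """List the reasons a subject line might cause trouble at ingest."""
    problems = []
    # Unicode category "Cc" covers control characters such as \x07 or \t.
    if any(unicodedata.category(ch) == "Cc" for ch in subject):
        problems.append("control character")
    if len(subject) > max_length:
        problems.append("too long")
    # Code points above U+FFFF are encoded as four bytes in UTF-8.
    if any(ord(ch) > 0xFFFF for ch in subject):
        problems.append("4-byte character")
    return problems
```

An empty list means none of the three checks fired; it does not guarantee that an ingest will succeed.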

What Happens When the Emails are in Preservica?

Unpacked from the PST, the messages are now eml files and are named by the subject header (so there are many called ‘no subject.eml’).

Attachments will be in a range of formats, some of which might need to be migrated to new/more preservation-friendly formats. An automated preservation workflow can carry out this migration on several files at once if required. The original is kept, and we can save the new migrated version (along with its associated email).

The metadata we get is extracted from the usual email fields: to, from, cc, subject, size, attachments, date, etc. The indexing Preservica does means it is easy to search the emails in Preservica using any of the indexed fields as well as text in the body of emails. Multiple filters can be applied to narrow down the search. In the case of the Carcanet Press archive, we can pinpoint emails to/from a specific writer, pick up book titles and names of writers mentioned in the body of messages, and emails of individuals from a particular time-period.
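
To give a feel for where these fields come from, here is a minimal sketch that reads the same headers from a single eml file using Python's standard `email` package. It is not how Preservica does its extraction or indexing; the function name and the shape of the returned dictionary are assumptions made for illustration.

```python
from email import message_from_bytes, policy


def eml_metadata(raw: bytes) -> dict:
    """Pull the usual header fields out of the raw bytes of an eml file."""
    msg = message_from_bytes(raw, policy=policy.default)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "cc": msg["Cc"],  # None when the header is absent
        "subject": msg["Subject"],
        "date": msg["Date"],
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
        "size": len(raw),  # size in bytes of the whole message
    }
```

A list of such dictionaries could then be filtered, for instance by sender or by a date range, which is a much simpler version of the kind of search described above.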

We can also add other metadata. An EAD metadata file might contain information about the email and the archive more generally. A PREMIS file might contain information about how the file got to Preservica (who ingested it, what checksums, what virus checks, appraisal actions, etc.).

How Safe is Preservica?

Preservica is certified against ISO 27001, and so can demonstrate compliance with internationally recognised standards of information security. It has secure login access, and we can manage access rights so that each collection is accessible only in the way we intend. Deleting an item requires a two-step confirmation involving two different approvers. Preservica stores and maintains four copies of any item we ingest. On ingest:

  • A checksum is calculated that uniquely identifies all aspects of the object. 
  • The object is saved to a server in a data centre and then copied to another server.
  • Both of these are then replicated to two other data centres.  
  • The checksums of all four copies are re-calculated to confirm they all match and Preservica can then report the object ‘safely stored’. 
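
The final verification step can be pictured with a small in-memory sketch. This is a simplified illustration and not Preservica's implementation: the use of SHA-256, the function names and the location labels in the usage note are assumptions.

```python
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 hex digest of an object's bytes."""
    return hashlib.sha256(data).hexdigest()


def mismatched_copies(expected: str, copies: dict[str, bytes]) -> list[str]:
    """Names of the storage locations whose copy no longer matches."""
    return sorted(name for name, data in copies.items() if checksum(data) != expected)


def safely_stored(expected: str, copies: dict[str, bytes]) -> bool:
    """True when there are four copies and every one matches the checksum."""
    return len(copies) == 4 and not mismatched_copies(expected, copies)
```

Here `copies` maps a made-up location label (say `"dc1-server1"`) to the bytes held there. Any location returned by `mismatched_copies` is one whose copy would need to be replaced from a good copy held elsewhere.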

Going forward, this checksum calculation is repeated at regular intervals; if a checksum changes, the disk is re-commissioned and a new copy is created from one of the copies held elsewhere. This process is called “self-healing”. Amazon can boast durability of 99.999999999% over a year (so if storing 10,000 objects you can theoretically expect to lose one every 10,000,000 years). AWS (Amazon Cloud) stores trillions of objects and (according to Amazon) has never lost one!
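
The arithmetic behind that figure is easy to check. The sketch below treats the durability number as an average annual loss rate of 1 in 10^11 per object, which is a simplifying assumption about how to read Amazon's figure; the function name is ours.

```python
from fractions import Fraction

# 99.999999999% durability leaves a loss rate of 0.000000001%,
# i.e. 1 in 10**11 per object per year (read as an average).
ANNUAL_LOSS_RATE = Fraction(1, 10**11)


def years_per_expected_loss(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / (object_count * ANNUAL_LOSS_RATE)
```

With 10,000 objects the expected loss is 10,000 × 10^-11 = 10^-7 objects per year, so `years_per_expected_loss(10_000)` gives 10,000,000 years. Equivalently, someone storing 10,000,000 objects could expect to lose one every 10,000 years.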

Summary

The above process was very much a voyage of discovery for the Digital Preservation and Systems team. It was often a case of working it out as we went along, working from copious notes left by Rylands archivist Fran Baker and adapting to the changes thrown up by new versions of Microsoft Office and Outlook. The University of Manchester Library is part of the ongoing ePADD+ project. ePADD currently exports to ‘Bag’ format (which can be ingested into Preservica), and the ePADD team hope to build a direct integration between ePADD and Preservica. Hopefully, by the time of our next Carcanet Press accrual in 2024, the process will be much smoother.