We are nearing the final portion of our Palladium email archive project, and with the assistance of Jochen Farwer, University of Manchester Library developer, we have begun to introduce some alterations to the ePADD software that we’re using to manage our email archives.
Below is an update on the development work, which so far has covered the ingest of files into the system, the display of information about email account folder structures, split messages, duplication reports and message redaction.
New functionality
Folders
We discovered that for appraisal, it is useful to be able to see each level of the folder structure of the email account when viewing an individual email. Some Carcanet accounts have very complex folder structures which only really make sense if you can see the whole chain, and we’ve been successful in displaying this data. This was achieved with a script for manipulating the names of the MBOX files before import so that they represent the folder structures.
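As a rough illustration of the renaming idea, here is a minimal Python sketch. It is not the script we used: the function name, the `.` separator and the example paths are all hypothetical, and it assumes each MBOX file sits in a directory tree that mirrors the account's folders.

```python
from pathlib import Path

# Hypothetical separator; any character that cannot occur in folder names would do.
SEPARATOR = "."


def flattened_name(mbox_path: Path, account_root: Path, sep: str = SEPARATOR) -> str:
    """Build a file name that carries the whole folder chain of an MBOX file."""
    relative = mbox_path.relative_to(account_root)
    parts = list(relative.parts)
    parts[-1] = relative.stem  # drop the file extension from the last element
    return sep.join(parts) + ".mbox"


if __name__ == "__main__":
    example = Path("account/Inbox/Authors/Poetry.mbox")
    print(flattened_name(example, Path("account")))
```

Renaming each file to the value returned here before import would mean the full chain (Inbox, Authors, Poetry) travels with the file name rather than being lost when the directory structure is flattened.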
Redaction
The other big piece of development work we’re undertaking is an exploration of the potential to redact selected content within an email, rather than restrict access to an entire message. Please see a very short demo of what we’ve been working on here:
This is a work in progress, and we’re still considering the parameters for using this functionality in a way in which we can still be confident in record integrity, but it does have lots of potential for increasing access to records where only a portion of the text is sensitive. Our plan at the moment is that ePADD will retain two copies of the message, one full and one redacted, which can be marked up appropriately for transfer or restricted as required.
Search results as downloadable CSV files
The most recent piece of development work requested is the option to download the header metadata of the emails returned by a search within ePADD as a CSV file. Exporting this data greatly widens the range of data sets available for creating data visualisations, as discussed in a previous blog post.
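To give a sense of what such an export might contain, the sketch below writes one CSV row of header metadata per message using Python's standard library. It does not reflect ePADD's own code; the choice of header fields and the function name are assumptions made for illustration.

```python
import csv
import io
from email import message_from_string

# Assumed set of header fields; the real export may include others.
HEADER_FIELDS = ["Date", "From", "To", "Cc", "Subject", "Message-ID"]


def headers_to_csv(raw_messages, fields=HEADER_FIELDS):
    """Return CSV text with one row of header values per raw message."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(fields)
    for raw in raw_messages:
        msg = message_from_string(raw)
        # Missing headers become empty cells rather than raising an error.
        writer.writerow([msg.get(field, "") for field in fields])
    return buffer.getvalue()


if __name__ == "__main__":
    sample = "From: a@example.org\nTo: b@example.org\nSubject: Proofs\n\nBody text\n"
    print(headers_to_csv([sample]))
```

Only headers are written, so the message bodies never leave the system in this kind of export.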
Challenges
Ingest
One of the first challenges we encountered was that importing large files caused memory issues, although the latest ePADD upgrade has improved this. Our largest MBOX file was 15GB, and ePADD was crashing at some point during ingest. This was partly because the messages were being imported in batches of 10,000, and the process wasn’t taking into account the size of the messages within each batch. To fix this, Jochen reduced the import batch size to 100 messages. ePADD was then able to read the file with no issues, and there doesn’t seem to be an impact on performance.
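The batching problem can be sketched in a few lines of Python. This is an illustration only, not the change made in ePADD: the count limit of 100 matches the fix described above, while the byte limit is an extra, hypothetical safeguard showing how message size could also be taken into account.

```python
def batches(messages, max_count=100, max_bytes=50 * 1024 * 1024):
    """Yield lists of messages, capped by count and by total size in bytes.

    max_bytes is an assumed value for illustration. A single message larger
    than max_bytes is still yielded, in a batch of its own.
    """
    batch, batch_bytes = [], 0
    for msg in messages:
        size = len(msg)
        # Close the current batch before it would exceed either limit.
        if batch and (len(batch) >= max_count or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(msg)
        batch_bytes += size
    if batch:
        yield batch


if __name__ == "__main__":
    fake_messages = [b"x" * 10 for _ in range(5)]
    for group in batches(fake_messages, max_count=2):
        print(len(group), "messages in this batch")
```

A fixed count of 10,000 says nothing about memory use when some messages carry large attachments, which is why a smaller count, or a size cap like the one assumed here, keeps each batch predictable.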
Import/Export encoding errors
When Carcanet’s emails were imported into ePADD, a number of unexpected Unicode characters appeared in the messages. When the emails were exported to MBOX files and displayed in Thunderbird, the number of these characters increased. We discovered that the import problem arose because the format of the MBOX files wasn’t being read correctly in ePADD, and we were able to amend this. The export issue was solved by updating the version of the decoder ePADD was using.
Split messages
We also discovered that in some circumstances, ePADD was picking up the word ‘From’ followed by a space in the text of an email, thinking that it was the start of a new email message, and making two entries for it. So there were email messages with no header info, and others with truncated text. This was partly a format issue and partly an issue with ePADD’s parser, which Jochen was able to fix.
The solution we came up with for the formatting issue was to add a hash after the word to ensure that ePADD’s reader doesn’t get confused. This seems to be working fine, and doesn’t interfere too much with our ability to read the text.
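For readers unfamiliar with the underlying problem: in an MBOX file each message begins with a separator line starting with "From " (the word followed by a space), so a body line that starts the same way can be mistaken for a new message. The Python sketch below shows the general shape of the workaround. It is an assumption-laden illustration rather than our actual script: it assumes the hash goes directly after the word ("From# ") and that only occurrences at the start of a line need changing.

```python
import re

# Matches "From " only at the very start of a line within the body text.
_FROM_LINE = re.compile(r"^From ", flags=re.MULTILINE)


def escape_from_lines(body: str) -> str:
    """Insert a hash after a line-initial 'From' so it no longer looks like a separator."""
    return _FROM_LINE.sub("From# ", body)


if __name__ == "__main__":
    text = "Dear Michael,\nFrom next week the proofs will be ready.\nBest wishes"
    print(escape_from_lines(text))
```

This would be applied to message bodies only, never to the genuine separator lines. A related convention in some MBOX variants is to prefix such lines with ">" instead.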
Deduplication figures in reports
ePADD’s reports give you important information on duplicate emails: if an email appears in the MBOX file more than once because it was part of a chain, then ePADD will de-duplicate it. We were getting different figures for this process on the summary report and on the individual MBOX file reports, owing to a couple of bugs in translating the text and in the import process. We also noticed that the figures were being calculated differently for different reports, and we’re working on making that clearer in the report terminology.
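To show how two reports can legitimately arrive at different duplicate figures, here is a small Python sketch. It does not describe how ePADD counts duplicates: it assumes, purely for illustration, that a message is identified by a single ID string, and compares counting duplicates within each file against counting them across the whole archive.

```python
def duplicate_counts(files):
    """Count duplicates two ways for a mapping of file name -> list of message IDs.

    Returns (within_each_file, across_archive), both keyed by file name.
    """
    seen_in_archive = set()
    within_each_file, across_archive = {}, {}
    for name, message_ids in files.items():
        seen_in_file = set()
        within = across = 0
        for message_id in message_ids:
            if message_id in seen_in_file:
                within += 1  # repeated inside this one file
            seen_in_file.add(message_id)
            if message_id in seen_in_archive:
                across += 1  # already seen anywhere in the archive so far
            seen_in_archive.add(message_id)
        within_each_file[name] = within
        across_archive[name] = across
    return within_each_file, across_archive


if __name__ == "__main__":
    archive = {"inbox": ["a", "b", "a"], "sent": ["a", "c"]}
    within, across = duplicate_counts(archive)
    print("within files:", sum(within.values()))
    print("across archive:", sum(across.values()))
```

In this made-up example the per-file view finds one duplicate and the archive-wide view finds two, because message "a" also appears in a second file. Neither figure is wrong, but the report wording needs to say which one is being shown.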
More updates as we progress
We’re excited about the progress that we’ve been able to make with development work through Palladium, and we hope that our suggestions will prove relevant and applicable to other ePADD users. We’re also hoping to discuss their inclusion in the generally released version of ePADD so that they may be used by other institutions as well.