April 30, 2014
Chronicles Interoperability Tools
The Chronicles project focused on developing strategies, workflows, and tools for ingesting preservation-ready news content. Early in the project, the PI, PM, and Technical Advisor hosted a set of concentrated discussions across the three DDP partners (MetaArchive, UNT Coda, and Chronopolis). In these conversations, we formalized a strategy for accomplishing content exchanges using common tools and mechanisms wherever possible.
Kurt Nordstrom and Matt Schultz produced an inventory and performed a gap analysis between existing and needed interoperability tools. The resulting Interoperability Tools Inventory and Gap Analysis (LINK) provided the basis for the project’s technical development work, priorities, and milestones.
In the project, we prioritized existing technologies such as BagIt. Thanks to its lightweight specification and user-friendly utilities (especially Bagger), BagIt proved enormously successful for these exchanges. The various test exchanges also revealed room, and justification, for developing supporting utilities that make better use of BagIt and its corresponding workflows. Below, we describe each of the five Chronicles Interoperability Tools that we developed during the project and provide links to the GitHub location where each can be accessed and used.
The first Chronicles Interoperability Tool is the BagIt Split/Unsplit utility. This utility is a Python-based script that works hand-in-glove with the core Java BagIt utilities to split a large Bag of digital content into several smaller Bags. The utility stores the original Bag's details alongside the smaller Bags so that, if the original large Bag ever needs to be recompiled, it can be validated after the script performs its Unsplit function. The BagIt Split/Unsplit utility optimizes two processes that will be useful for large collection exchanges such as those involving digital newspapers. First, breaking large collection sets into smaller units makes their network transfer less susceptible to server-to-server process timeouts. Second, storing large collections as smaller units of data on a file system supports more efficient archival management of that data, such as audit and fixity checks.
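To illustrate the splitting step, the sketch below (an illustration, not the project's actual code) greedily groups payload files into sub-Bag groups under a size cap, so that each group can then be written out as its own smaller Bag; the function name and grouping strategy are assumptions.

```python
def plan_split(file_sizes, max_bag_bytes):
    """file_sizes maps payload path -> size in bytes. Returns a list of
    path groups; each group's total stays under max_bag_bytes (a file
    larger than the cap gets a group of its own)."""
    groups, current, current_size = [], [], 0
    # Place the largest files first so groups pack more evenly.
    for path, size in sorted(file_sizes.items(), key=lambda kv: -kv[1]):
        if current and current_size + size > max_bag_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

Each resulting group would become one smaller Bag, with the original manifest retained so the recombined whole can still be validated after an Unsplit.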
The second Chronicles Interoperability Tool is the Find Bad Files utility. The ability to retrieve Bags of digital newspaper collections via HTTP was extremely useful in the project. Depending on file naming conventions (e.g., spaces or other troublesome characters) and the presence of hidden files, config files, .DS_Store files, tilde-prefixed backups, and other likely-unwanted files, the contents of Bags can, under some conditions, be augmented during a download (read/write) operation to a remote file system. Such changes, if they persist, would interfere with any subsequent Bag validations. The Find Bad Files utility, another Python-based script, recursively scans a directory for filenames that violate a set of naming standards meant to prevent these problems when ingesting collections.
The third Chronicles Interoperability Tool is the Bag Describe utility. Because one goal of the project was to lower the barrier for digital newspaper curators to preserve their collections in more standards-oriented ways, we worked to enhance a Bagged collection with technical metadata prior to its transfer into a preservation system. This technical metadata is useful to both the owner and the preservation system for ongoing content characterization and for any future migrations or normalizations. The Bag Describe utility, also a Python-based script, recursively runs the DAITSS Description Service over a set of Bagged data and produces PREMIS metadata on a per-file basis. The resulting records are saved in a new tag directory named “premis” inside the Bag’s root.
The fourth Chronicles Interoperability Tool is the Bag Diff utility. The team developed the Bag Diff utility to make it easy to detect any content that has been added, changed, or deleted within a previously ingested Bag. It does this by comparing a recently updated Bag of digital content against the manifest of the same Bag prior to its updates, and it loads any changes it finds into a new Bag that can then be ingested alongside the original. The project team successfully tested the utility and the performance of this versioning tool; however, at the time of the content exchanges, none of the Chronicles Committee partners had active change management scenarios to which the utility could be applied. The MetaArchive Cooperative, as one group of interested stakeholders, plans to use the utility for its ongoing BagIt-based ingests.
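The comparison at the heart of such a diff can be sketched as below, assuming standard BagIt manifest lines of the form `<checksum>  <relative path>`; the function names are illustrative, not the utility's actual interface.

```python
def parse_manifest(lines):
    """Parse BagIt manifest lines into a dict of path -> checksum."""
    entries = {}
    for line in lines:
        line = line.strip()
        if line:
            checksum, path = line.split(None, 1)
            entries[path] = checksum
    return entries

def diff_manifests(old_lines, new_lines):
    """Return (added, changed, deleted) payload paths between the stored
    manifest of the original Bag and the manifest of its updated copy."""
    old, new = parse_manifest(old_lines), parse_manifest(new_lines)
    added = sorted(set(new) - set(old))
    deleted = sorted(set(old) - set(new))
    changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return added, changed, deleted
```

The added and changed paths are the ones a new "delta" Bag would carry forward for ingest alongside the original.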
The final Interoperability Tool developed by the Chronicles team is the LOCKSSsum Validator. The LOCKSSsum Validator is a web service designed to facilitate on-demand audits of content that has been ingested into a LOCKSS network in the form of Bags. It can run as a background script or as a browser-supported web service. It accepts the upload of a Bag’s manifest-md5.txt or manifest-sha256.txt along with a list of LOCKSS-produced checksums for the corresponding Bag. The Validator then compares the two lists for any missing files or checksum mismatches and reports its findings. The LOCKSSsum Validator brings the MetaArchive’s work with BagIt full circle and has been integrated into the functionality of the next-generation Conspectus collection database. The success or failure of each on-demand audit is also logged to a running instance of the PREMIS Event Service (developed by the University of North Texas Libraries), where those outcomes can be referenced for ongoing archival management. Though the LOCKSSsum Validator is geared toward comparing Bags in a LOCKSS environment, its code could easily be adapted to compare Bag manifests with other formatted lists of per-file checksums.
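The audit comparison itself reduces to the sketch below, assuming both inputs have already been parsed into dicts mapping file path to checksum (the uploaded manifest on one side, the LOCKSS-produced list on the other); the function name and report shape are assumptions for illustration.

```python
def audit(manifest_sums, lockss_sums):
    """Report files missing from the LOCKSS side and checksum
    mismatches between the two lists."""
    missing = sorted(set(manifest_sums) - set(lockss_sums))
    # Compare case-insensitively, since hex digests vary in casing.
    mismatched = sorted(
        path for path in set(manifest_sums) & set(lockss_sums)
        if manifest_sums[path].lower() != lockss_sums[path].lower())
    return {"ok": not missing and not mismatched,
            "missing": missing,
            "mismatched": mismatched}
```

A failed report here is exactly the kind of outcome the Validator logs to the PREMIS Event Service for later reference.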
All tools and documentation were released in April 2014 on the project’s GitHub account.