PBMP Bibliography: Excel → RIS → Zotero → Omeka.

An online, searchable bibliography of 13,000+ references on Pompeii is tantalizingly close. With the expertise of two great UMass librarians, Aaron Rubinstien (University and Digital Archivist) and Ron Peterson (Discovery and Integrated Systems Coordinator), the PBMP has moved our massive spreadsheet of citations into bibliographic formats readable by the content platforms we intend to use. Our first attempt to publish the bibliography is now available on our Zotero and Omeka sites. Migrating those citations to the web appeared to be a simple process, but it has not been easy.

In no small part, this difficulty is the legacy of the ‘bootstrap’ beginnings of the PBMP. In 2009, before this project was funded by the NEH, ACLS, UMass DHI or CHFA (again I thank them all!), and before Garcia y Garcia partnered with Arbor Sapiente to update his work and publish it online as PDFs, I began scanning the Nova Bibliotheca Pompeiana and correcting the terrible OCR transcripts in Microsoft Word. With the generous funding from UMass, it became possible to parse those Word documents into tabular form and hire students to continue correcting the data. Originally, I had intended to use Microsoft Access to produce easy-to-use forms with which students could continue correcting the raw citation text and splitting it into appropriate fields. Ironically, “Access” was not easily accessible for students (it is not included in Microsoft Office for Students). For this reason, we shifted to Excel.

Doubtless because I am not a librarian and am not educated in their best practices, I was surprised to learn that neither Zotero nor Omeka would import from Excel, .csv, .tsv, or .txt. Surely this is to protect the specifically structured contents from being regularly fed into the wrong fields. Our task was therefore to convert our spreadsheet-formatted data into one of the formats that our platforms would accept. Zotero will import from Zotero RDF, MODS, RIS, BibTeX, Refer/BibIX, and unqualified Dublin Core RDF, while Omeka, importantly, can import from Zotero. It therefore seemed appropriate to create a chain of transformations: Excel → RIS → Zotero → Omeka. Aaron, Ron, and I mapped the fields to be transferred from Excel to RIS, and then Aaron wrote the scripts that performed that translation. He then imported the resulting records into Zotero with its native import tool, getting 12,804 records online. It was obvious at this point, however, that the encoding of special characters in Excel and their re-expression in Zotero was going to be problematic. Universal character and symbol recognition and translation is an endemic issue. For example, the title of this post was first translated into the body of this post by WordPress as “Excel à RIS à Zotero à Omeka”. Continuing our transformation chain, Aaron then used the “Zotero Import” plugin to import the Zotero records into Omeka. 10,479 records were imported before some error halted the import.
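For readers curious what the Excel → RIS step involves, here is a minimal sketch in Python. It is not Aaron's script. It assumes the spreadsheet has first been saved as CSV, that multiple authors share one cell separated by semicolons, and that the columns are named as in the hypothetical `FIELD_TO_RIS` table. The sample row is invented placeholder data. The RIS tags themselves (TY, AU, TI, PY, CY, PB, ER) are standard.

```python
import csv
import io

# Hypothetical mapping from spreadsheet column names to standard RIS tags.
# The PBMP's real column names and field mapping are not shown in this post.
FIELD_TO_RIS = {
    "Author": "AU",
    "Title": "TI",
    "Year": "PY",
    "Place": "CY",
    "Publisher": "PB",
}


def row_to_ris(row, ref_type="BOOK"):
    """Render one spreadsheet row (a dict) as a single RIS record."""
    lines = [f"TY  - {ref_type}"]  # every RIS record opens with a type tag
    for column, tag in FIELD_TO_RIS.items():
        value = (row.get(column) or "").strip()
        if not value:
            continue  # skip empty cells rather than emit blank tags
        # Assumption: several authors are stored in one cell, split by ";".
        parts = value.split(";") if tag == "AU" else [value]
        for part in parts:
            part = part.strip()
            if part:
                lines.append(f"{tag}  - {part}")
    lines.append("ER  - ")  # every RIS record closes with an end tag
    return "\n".join(lines)


def csv_to_ris(csv_text):
    """Convert CSV text with a header row into blank-line-separated RIS records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n\n".join(row_to_ris(row) for row in reader)


# Invented placeholder row, for illustration only.
sample = (
    "Author,Title,Year,Place,Publisher\n"
    '"Surname, A.; Other, B.",An Example Title,1990,Napoli,Example Press\n'
)
ris_text = csv_to_ris(sample)
print(ris_text)
```

When saving the result to disk, opening the file with an explicit `encoding="utf-8"` at least makes the character encoding a deliberate choice rather than a platform default, which is one place special characters can go astray.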

For a first attempt, our process of translation and upload was remarkably successful, but these results are obviously not good enough. Beyond the problems already mentioned – special character issues and missing records in the Omeka import – there are other issues to overcome. For example, we discovered that some elements of the field mapping were faulty. Sometimes this was a problem with the translation script, but more often the original data were inconsistent. With complex bibliographic citations (e.g., items with multiple authors in an edited volume that is part of a series of books), students were often excusably confused while working on the data, and they parsed some citations incorrectly. There are also the differences in Italian publishing standards and Garcia y Garcia’s own personal idiosyncrasies (understandable on such a large project), which meant information did not always go in the right places. One strange issue, however, is that the RIS field for “Place”, that is, the location where an item was published, just won’t read into Zotero’s related field. BibTeX seems to have a greater range of fields, so we will try that format on our second attempt. Another item to overcome is the absence of a unique handle for each citation that our GIS system can use. Adding one is just a global application of a serial identifier, in this case (e.g.) “PBMP_BIB_000001”.
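The serial identifier is simple enough to sketch. The following Python is only an illustration, assuming each citation is held as a dictionary and that a six-digit, zero-padded counter is wanted, as the “PBMP_BIB_000001” example suggests. The function names and the `handle` key are hypothetical.

```python
def make_handle(n, prefix="PBMP_BIB_", width=6):
    """Build a zero-padded serial identifier such as PBMP_BIB_000001."""
    return f"{prefix}{n:0{width}d}"


def assign_handles(records, start=1):
    """Return copies of the records, each with a sequential 'handle' added.

    The original records are left untouched; 'handle' is a hypothetical key.
    """
    return [
        dict(record, handle=make_handle(i))
        for i, record in enumerate(records, start=start)
    ]


# Placeholder records, for illustration only.
records = [{"title": "First example"}, {"title": "Second example"}]
numbered = assign_handles(records)
print(numbered[0]["handle"])
```

Because the handle depends only on row order, it should be assigned once and then stored with the citation, so that later sorting or deletion in the spreadsheet does not renumber items the GIS already points to.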

To help overcome these issues, we are enlisting the help of one of my senior undergraduate students, Juliana van Roggen, whose Guardstones blog you should also check out for some rugged data analysis and visualization of street stones in Pompeii, a topic dear to my heart. Dedicated to fixing the bibliography, Juliana is working to resolve many of the inconsistencies in the data as well as preparing those data for remapping, multiple imports, and life online. Her current tasks include:

  1. Using conditional formatting to assign the language of the work and to define its object type (e.g., book, journal, dissertation).
  2. Sorting out the journal number issues and preparing to map journal abbreviations to their full names.
  3. Joining the struggle to figure out how to preserve the character encoding as citations move from Excel to online.
  4. Connecting to full-text objects online, including those 2,953 items the PBMP has recently received from HathiTrust and others previously received from Internet Archive.
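As an illustration of the second task, mapping journal abbreviations to full names amounts to a lookup table plus a report of whatever the table does not yet cover. The Python below is a sketch under assumptions: the two entries are merely examples of well-known abbreviations, not the project's actual list, and the function names are hypothetical.

```python
# Example entries only; a real table would be compiled from the project's data.
JOURNAL_NAMES = {
    "RStPomp": "Rivista di Studi Pompeiani",
    "AJA": "American Journal of Archaeology",
}


def normalize(abbrev):
    """Trim surrounding whitespace and any trailing period before lookup."""
    return abbrev.strip().rstrip(".")


def expand_journal(abbrev, table=JOURNAL_NAMES):
    """Return the full journal name, or the trimmed input if it is unknown."""
    return table.get(normalize(abbrev), abbrev.strip())


def unmapped(abbrevs, table=JOURNAL_NAMES):
    """List the distinct abbreviations that still need a table entry."""
    return sorted({a.strip() for a in abbrevs if normalize(a) not in table})


print(expand_journal(" AJA. "))
print(unmapped(["AJA", "XYZ", "XYZ "]))
```

Leaving unknown abbreviations unchanged, and listing them separately, keeps the original data intact while showing how much of the mapping remains to be done.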

Once these corrections are made, we will be in a good position to run a second import into Zotero and Omeka. It is my hope that at this point the first part of this process, moving from Excel to Zotero, will be finished and not repeated. We should then be able to make changes online directly in Zotero as needed. This also means that the second import into Zotero is unlikely to coincide with the final import into Omeka. It should be noted that Zotero is not merely a stepping stone in our process, but rather is envisioned as an integral tool in our larger bibliographic resource. Although we run the risk of redundancy and of parallel systems falling out of sync, the different functionalities of Zotero and Omeka make keeping them both the preferred option. Omeka offers a much more customizable experience of the data. Individual items can be more fully manipulated, and groups can be cultivated not only as collections but also curated as exhibits, turning the bibliography from mere catalog into a platform to illustrate and even to make arguments from its contents. On the other hand, with the robustness and rigidity of Zotero’s design comes a greater ability to create and share individual citations and collections. Most importantly, however, Zotero is a more collaborative space where the PBMP can find, collect, and incorporate new or previously unknown references to Pompeii.