Documentation on Linking Data

With the first GIS map of Pompeii now available online, we are turning more of our attention to the problem of connecting our spatial data to our bibliographic data. While there is still some important spatial work to be done with the current map, the planning and documentation for the bibliographic integration serves as a worthwhile distraction. To that end and following a discussion last week with Alexander Stepanov, the PBMP’s GIS architect, I’ve decided to write up some very quick documentation for our data and their connections as a blog post. I’ve also decided to try something else new. Below is a Google Slide with the designs and discussions we drew on a whiteboard as the background. Over this are shapes representing files we need to link together with their names hyperlinked to their locations on the web (as hosted sites or Dropbox objects). In this way, the blog post operates in three different dimensions:

  1. As a public discussion
  2. As a living, internal document
  3. As an interface to the repository of files we’re using.

The files listed are as follows:

A single file of spatial data to start, the Properties by Eschebach (Prop_ESCH), representing all the buildings and occupied spaces in the city. Later this will expand to include other, more generalized features of the landscape, such as the City Blocks, Gates, and Fortification Walls.

Three files from the Nova Bibliotheca Pompeiana are given here:

  1. The first 10,000 citations (GYG Citations_BIBLIOGRAPHY) completed from the NBP as they were prepared for uploading to Zotero (and then to Omeka). This shows how the data were divided and might be recombined.
  2. A list of property addresses from the Spatial Index of the NBP (GYG Citations_INDEX). This gives, as a one-to-many relationship, the address of a property and the one or more citations that relate to it.
  3. A list of addresses per citation as extracted from the full text of the first two volumes of the NBP (GYG Citations_TEXT). This gives, as a one-to-many relationship, the bibliographic citation as given by Garcia y Garcia and the one or more addresses that relate to it.

Naturally, there will be a significant overlap between #2 and #3, which will reduce the total number of connections but also offer a chance to perform quality control tests on the data as extracted from the NBP.
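As a minimal sketch of that quality control test, assuming both tables have already been read into Python dictionaries: each table is flattened into (address, citation) pairs, and the pairs are then compared. The function names, the citation keys, and the sample addresses are all hypothetical and are not taken from our actual files.

```python
def pairs_from_index(index):
    """Flatten table #2 (address -> citations) into (address, citation) pairs."""
    return {(address, cit) for address, cits in index.items() for cit in cits}


def pairs_from_text(text):
    """Flatten table #3 (citation -> addresses) into (address, citation) pairs."""
    return {(address, cit) for cit, addresses in text.items() for address in addresses}


def compare_tables(index, text):
    """Report which links the two sources agree on and which appear in only one."""
    a = pairs_from_index(index)
    b = pairs_from_text(text)
    return {
        "both": a & b,         # confirmed by both sources
        "index_only": a - b,   # candidates for proofing against the full text
        "text_only": b - a,    # candidates for proofing against the index
        "combined": a | b,     # the deduplicated table to join to the map
    }


# Invented sample data, for illustration only.
index = {"I.10.4": ["c1", "c2"], "VI.1.7": ["c3"]}
text = {"c1": ["I.10.4"], "c3": ["VI.1.7", "VI.1.10"]}
report = compare_tables(index, text)
```

The pairs that appear in only one source are the ones worth checking by hand; the combined set is what would eventually be joined to the spatial data.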

If we think of this as merely a spatial data problem, the work to be done is non-trivial, but also not conceptually difficult. That is, if all we wanted to do was to connect the bibliographic data to the map so that users could click on it and access that information, the process would be straightforward: combine and proof tables #2 and #3, then join them to the spatial data of Properties by Eschebach. Indeed, that *is* our primary goal, but we also want those bibliographic citations to be linked to their full references on our other platforms (i.e., Zotero and Omeka). Moreover, we want users to be able to use search functions in the map – beyond navigating and clicking – to both find and leverage bibliographic information. For example, we want people to be able to search for an author in the map and have the sites and buildings associated with that author appear highlighted. The user should then also be able to create a new search from this subset of data, using either additional bibliographic criteria or spatial definitions. To make these functions possible, however, the data stored in the map cannot be only reference numbers linked out to other resources. Finally, we would eventually like searches in our bibliography to be passed to, and answered in, the map, so that the results of regular bibliographic searches might be visualized in the map as well as in the listing of citations.
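To make the author search concrete, here is a rough sketch in Python of the two steps: find the properties linked to an author, then refine that subset spatially. It assumes the map layer carries at least the author alongside each reference number, which is exactly why reference numbers alone are not enough. The `Citation` fields, the function names, and the sample records are invented for illustration and do not describe our actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    bib_id: str   # reference number shared with the other platforms
    author: str   # stored in the map data so that it can be searched there
    year: int


def search_author(name, citations, links):
    """Return the property addresses linked to any citation by this author."""
    wanted = {c.bib_id for c in citations if name.lower() in c.author.lower()}
    return {address for address, bib_id in links if bib_id in wanted}


def within_region(addresses, region):
    """Refine a result set spatially, here by the Regio part of the address."""
    return {a for a in addresses if a.split(".")[0] == region}


# Invented sample data, for illustration only.
citations = [
    Citation("PBMP_BIB_000001", "Rossi, M.", 1901),
    Citation("PBMP_BIB_000002", "Bianchi, L.", 1955),
    Citation("PBMP_BIB_000003", "Rossi, M.", 1910),
]
links = {
    ("I.10.4", "PBMP_BIB_000001"),
    ("VI.1.10", "PBMP_BIB_000002"),
    ("VI.1.7", "PBMP_BIB_000003"),
}

hits = search_author("rossi", citations, links)   # properties to highlight
subset = within_region(hits, "VI")                # a new search on that subset
```

In a real GIS the second step would be a spatial query rather than a string test on the address, but the shape of the problem is the same.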

As you can see from the image, we’ve got an outline of how we’ll do this. Nonetheless, if you are a GIS architect, a digital collections librarian, a data designer, or an all-around smart person and have an opinion on how this might be done, in whole or in part, please do email me: Pompeiana[AT]gmail.com

– EP

PBMP Bibliography: Excel → RIS → Zotero → Omeka.

The existence of an online, searchable, 13,000+ reference bibliography on Pompeii is tantalizingly close. With the expertise of two great UMass Librarians, Aaron Rubinstien (University and Digital Archivist) and Ron Peterson (Discovery and Integrated Systems Coordinator), the PBMP has moved our massive spreadsheet of citations into bibliographic formats readable by the content platforms we intend to use. Our first attempt to publish the bibliography is now available on our Zotero and Omeka sites. The process of migrating those citations to the web, although it appeared to be a simple one, has not been easy.

In no small part, this difficulty is the legacy of the ‘bootstrap’ beginnings of the PBMP. In 2009, before this project was funded by the NEH, ACLS, UMass DHI or CHFA (again I thank them all!), and before Garcia y Garcia partnered with Arbor Sapiente to update his work and publish it online as PDFs, I began scanning the Nova Bibliotheca Pompeiana and correcting the terrible OCR transcripts in Microsoft Word. With the generous funding from UMass, it became possible to parse those Word documents into tabular form and hire students to continue to correct the data. Originally, I had intended to use Microsoft Access to produce easy-to-use forms for students to continue the process of correcting the raw citation text and splitting it into appropriate fields. Ironically, “Access” was not easily accessible for students (it was not included in Microsoft Office for Students). For this reason, we shifted to Excel.

Doubtless because I am not a librarian and am not educated in their best practices, I was surprised to learn that neither Zotero nor Omeka would import from Excel, .csv, .tsv, or .txt. Surely this is to protect the specifically structured contents from being regularly fed into the wrong fields. Our task was therefore to convert our spreadsheet-formatted data into one of the formats that our platforms would accept. Zotero will import from Zotero RDF, MODS, RIS, BibTeX, Refer/BibIX, and unqualified Dublin Core RDF, while Omeka, importantly, can import from Zotero. It therefore seemed appropriate to create a chain of transformations: Excel → RIS → Zotero → Omeka. Aaron, Ron, and I mapped the fields to be transferred from Excel to RIS and then Aaron wrote the scripts that processed that translation. He then imported the resulting records to Zotero with its native import tool, getting 12,804 records online. It was obvious at this point, however, that the encoding of special characters in Excel and their re-expression in Zotero was going to be problematic. Universal character and symbol recognition and translation is an endemic issue. For example, the title of this post was first translated into the body of this post by WordPress as “Excel à RIS à Zotero à Omeka”. Continuing our transformation chain, Aaron then applied the “Zotero Import” plugin to import the Zotero records into Omeka. 10,479 records were imported before some error was introduced that halted the import.
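For readers who have not seen RIS, the Excel → RIS step can be sketched roughly as follows. This is not Aaron's script, only an illustration of the idea: each spreadsheet row becomes a block of tagged lines that opens with `TY` and closes with `ER`. The column names (`type`, `authors`, `title`, `year`, `place`, `publisher`) and the sample row are hypothetical; the tags themselves (`TY`, `AU`, `TI`, `PY`, `CY`, `PB`, `ER`) are standard RIS.

```python
# Hypothetical mapping from spreadsheet column names to RIS tags.
FIELD_TAGS = (("title", "TI"), ("year", "PY"), ("place", "CY"), ("publisher", "PB"))


def row_to_ris(row):
    """Turn one spreadsheet row (a dict) into the text of one RIS record."""
    lines = [f"TY  - {row.get('type', 'GEN')}"]   # every record opens with its type
    for author in row.get("authors", []):
        lines.append(f"AU  - {author}")           # one AU line per author
    for column, tag in FIELD_TAGS:
        value = row.get(column)
        if value:                                 # skip empty cells
            lines.append(f"{tag}  - {value}")
    lines.append("ER  - ")                        # every record closes with ER
    return "\n".join(lines) + "\n"


# Invented sample row, for illustration only.
row = {
    "type": "BOOK",
    "authors": ["Rossi, M."],
    "title": "Le città vesuviane",
    "year": 1901,
    "place": "Napoli",
}
record = row_to_ris(row)

# Encoding explicitly as UTF-8 keeps accented characters such as "à" intact.
data = record.encode("utf-8")
```

Writing the output with an explicit encoding, for example `open(path, "w", encoding="utf-8")`, removes one common source of mangled characters, though it would not by itself fix symbols that were already corrupted upstream.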

For a first attempt, our process of translation and upload was remarkably successful, but these results are obviously not good enough. Beyond the problems already mentioned – special character issues and missing records in the Omeka import – there are other issues to overcome. For example, we discovered that some elements of the field mapping were faulty. Sometimes this was a problem with the translation script, but more often it was a problem with the original data being inconsistent. In complex bibliographic citations (e.g., items with multiple authors in an edited volume that is part of a series of books), students were often excusably confused while working on the data, and some citations they parsed incorrectly. There are also the differences in Italian publishing standards and Garcia y Garcia’s own personal idiosyncrasies (understandable on such a large project), which meant that information did not always go in the right places. One strange issue, however, is that the RIS field for “Place”, that is, the location where an item was published, just won’t read into Zotero’s related field. BibTeX seems to have a greater range of fields, so we will try that format on our second attempt. Another item to overcome is the absence of a unique handle for each citation that our GIS system can use. That’s just a global application of a serial identifier, in this case, (e.g.) “PBMP_BIB_000001”.
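Applying such a serial identifier is simple enough to sketch in a few lines of Python. The prefix and the six-digit zero padding follow the example above; the function names and the `bib_id` column are hypothetical.

```python
def make_bib_id(n, prefix="PBMP_BIB", width=6):
    """Format the n-th citation's handle as a zero-padded serial number."""
    if n < 1:
        raise ValueError("serial numbers start at 1")
    return f"{prefix}_{n:0{width}d}"


def assign_ids(rows, start=1):
    """Return copies of the rows with a 'bib_id' column added, in row order."""
    return [
        {**row, "bib_id": make_bib_id(i)}
        for i, row in enumerate(rows, start=start)
    ]


# Invented sample rows, for illustration only.
rows = assign_ids([{"title": "First"}, {"title": "Second"}])
first_id = make_bib_id(1)
```

The important design point is that the identifier is assigned once and then travels with the citation, so that the map, Zotero, and Omeka all refer to the same record by the same handle.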

To help overcome these issues, we are enlisting the help of one of my senior undergraduate students, Juliana van Roggen, whose Guardstones blog you should also check out for some rugged data analysis and visualization of street stones in Pompeii, a topic dear to my heart. Dedicated to fixing the bibliography, Juliana is working to resolve many of the inconsistencies in the data as well as preparing those data for remapping, multiple imports, and life online. Her current tasks include:

  1. Using conditional formatting to assign the language of the work and to define its object type (e.g., book, journal, diss).
  2. Sorting out the journal number issues and preparing to map journal abbreviations to their full names.
  3. Joining the struggle to figure out how to keep the character encoding as citations move from Excel to online.
  4. Connecting to full-text objects online, including those 2953 items the PBMP has recently received from Hathi Trust and others previously received from Internet Archive.
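As one illustration of the second task, mapping journal abbreviations to full names amounts to a lookup table plus a list of whatever the table fails to cover. The sketch below is an assumption about how this might be done in Python rather than a description of Juliana's workflow in Excel; the `journal` and `journal_full` column names and the two sample entries in the table are illustrative only.

```python
# Illustrative lookup table; the real one would be built from the NBP's own list.
JOURNAL_NAMES = {
    "AJA": "American Journal of Archaeology",
    "JRA": "Journal of Roman Archaeology",
}


def expand_journals(rows, names=JOURNAL_NAMES):
    """Add a 'journal_full' column where the abbreviation is known.

    Returns the new rows plus a sorted list of abbreviations that still
    need to be added to the lookup table.
    """
    expanded, unmapped = [], set()
    for row in rows:
        abbreviation = row.get("journal", "").strip()
        if abbreviation in names:
            expanded.append({**row, "journal_full": names[abbreviation]})
        else:
            expanded.append(dict(row))
            if abbreviation:
                unmapped.add(abbreviation)
    return expanded, sorted(unmapped)


# Invented sample rows, for illustration only.
rows = [{"journal": "AJA"}, {"journal": " XYZ "}, {"journal": ""}]
expanded, unmapped = expand_journals(rows)
```

Collecting the unmapped abbreviations, instead of silently skipping them, turns the lookup into a to-do list for the next round of corrections.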

Once these corrections are made, we will be in a good position to run a second import into Zotero and Omeka. It is my hope that at this point the first part of this process – moving from Excel to Zotero – will be finished and not repeated. We should then be able to make changes online directly in Zotero as needed. This means that the second import into Zotero will likely not also be the final import into Omeka. It should be noted that Zotero is not merely a stepping stone in our process, but rather is envisioned as an integral tool in our larger bibliographic resource. Although we run the risk of redundancy and of parallel systems falling out of sync, the different functionalities of Zotero and Omeka make keeping them both the preferred option. For Omeka, this means a much more customizable experience of the data. Individual items can be more fully manipulated, and groups can be cultivated not only as collections but also curated as exhibits, turning the bibliography from a mere catalog into a platform for illustrating and even making arguments from its contents. On the other hand, with the robustness and rigidity of Zotero’s design comes a greater ability to create and share individual citations and collections. Most importantly, however, it is a more collaborative space where the PBMP can find, collect, and incorporate new or previously unknown references to Pompeii.