Hosting the Digital Rāmamālā Library at Penn, or, thinking about open licenses for non-Western digitized manuscripts

This talk was presented at the Global Digital Humanities Symposium at Michigan State University, March 16-17, 2017, as part of the ARC Panel: Access, Data, and Collaboration in the Global Digital Humanities.

My story begins in 2012, when Dr. Benjamin Fleming, Visiting Scholar in Religious Studies and Cataloger of Indic Manuscripts for the Kislak Center for Special Collections at the University of Pennsylvania, proposed and was awarded an Endangered Archives grant from the British Library. The main purpose of the grant was to write a catalog for the Ramamala Library, which is one of the oldest still-active traditional libraries in Bangladesh. A secondary part of this grant was to digitize around 150 of the most fragile manuscripts from the Ramamala Library, and an agreement was made that the University of Pennsylvania Libraries would be responsible for hosting these digital images. At this time, someone from the Penn Libraries recommended to Dr. Fleming that they get a Creative Commons license, and the non-commercial license was given as an example. The proposal went forward with a CC-NC license, which both the Penn Libraries and the Ramamala Library agreed to, and everything was fine.

So a bit of historical logistics might be helpful here. 2012, the year this proposal was agreed to, was one year before the Schoenberg Institute for Manuscript Studies (SIMS) was founded at Penn. One of the very first things that SIMS was tasked with doing – one of the things it was designed to do – was to create some kind of open access portal to enable the reuse of digital images of our medieval manuscripts. Penn has been digitizing our manuscripts and posting them online since the late 1990s, and in 2013 all of them were online in a system called Penn in Hand. Penn in Hand is a kind of black box – you can see the manuscripts in there, search for them, navigate them, but if you want to publish an image in a book or use it in a project, you have to do some work to figure out what’s allowed in terms of licensing, and then figure out how to get access to high-resolution images that will be usable for your needs.

It took us a couple of years, but on May 1, 2015, we launched OPenn: Primary Digital Resources Available for Everyone, as a platform not for viewing images, but explicitly for downloading and reusing images and metadata. OPenn includes high-resolution master TIFF images and smaller JPEG derivatives, as well as robust metadata in TEI/XML using the manuscript description (msDesc) element. We started by hosting our own data, but today we host manuscript data for 17 institutions in the Philadelphia area, with others in the US and Europe (including Hebrew manuscripts from the John Rylands Library at the University of Manchester) to come online in the next year. One of the things we wanted was for users of OPenn to always be certain about what they could do with the data, so we decided that anything that goes into OPenn must carry one of the licenses that Creative Commons has approved for Free Cultural Works:

  • the CC Public Domain mark
  • CC0 (“CC-zero”), the Public Domain dedication for copyrighted works
  • CC-BY, the Creative Commons Attribution license
  • CC-BY-SA, the Creative Commons Attribution-Share Alike license

Note that licenses with a non-commercial clause are not approved for Free Cultural Works, and thus OPenn, by policy, is not able to host them.

You see where this is going.

So in March 2016, a year after OPenn was launched, and well after the Ramamala Library manuscripts had been photographed, Dr. Fleming asked about adding the Ramamala Library material to OPenn, in addition to having it in Penn in Hand (where it was already going online). It wasn’t until he asked this question that we realized that, under current policy, we couldn’t include that material in OPenn because of the license. Over the next few weeks we (we being representatives of OPenn, and Dr. Fleming) had several conversations during which we floated various ideas:

  1. We could loosen the “Free Cultural Works” requirement and allow inclusion of the Ramamala data with the noncommercial license.
  2. We could build a parallel OPenn to contain data with a noncommercial license.
  3. We could use OPenn as a kind of carrot, to encourage the Ramamala Library administration to loosen the noncommercial clause on the license and release the data as Free Cultural Works.

The third option was struck down almost as soon as it was suggested. There were, it turns out, highly sensitive discussions that had been happening at the Ramamala Library during the course of the project that would have made such a request difficult, to say the least. As Dr. Fleming said in an email to me as we were talking about this talk, “It would be highly inappropriate and complex to try and revisit the copyright agreement as, even as it was, the act of a Western organization making digital copies of a small set of mss unraveled a dense set of internal issues related to private property and government control over cultural property (digital or otherwise).”

Before I return to our other two options, I want to take a quick detour to talk a bit about how this conversation changed my thinking about Open Access in general, and about open access to non-Western material specifically.

I am by all accounts an evangelist for open access to medieval manuscript material. I like to complain: about institutions that keep their images under strict licenses, that make their images difficult or impossible to download, that charge hefty fees for manuscripts that have been digitized for years. OPenn is a reaction against that kind of thinking. We say: Here are our manuscript images! Here is our metadata! Here’s how you can download them. Do whatever you like with them. We own the books, but we acknowledge that they are our shared cultural heritage and in fact they belong to all of us. So the very least we can do is give you digital copies.

Until Ramamala, I would have told you that digital images of every manuscript in every culture, written before modern times, should be in the public domain and available to everyone. The people who wrote them are dead, and they wouldn’t have had the same conception of copyright ownership in any case, so why not? But suddenly I wasn’t so sure. I was forced to move beyond thinking in very black and white terms about “old vs. new” to thinking in a more nuanced way about “old vs. new”, sure, but also about “what is yours vs. what is ours” – and what ownership of the physical means for ownership of the digital. Again, until Ramamala I would have told you that physical owners owe it to the rest of us to allow the digital to sit in the public domain. But what does this mean for countries that have suffered under colonialism, and that have been forced for the past however many hundred years or more to share their cultural heritage with the west? In my somewhat unstructured thoughts, I keep coming back to the Elgin Marbles, which are just one example of cultural vandalism by the west (in this case, with the assistance of the Ottoman Empire, which ruled Greece at the time the marbles were taken), but a particularly egregious one. I’m sure you’re all familiar with the Elgin Marbles, which used to decorate buildings on the Acropolis, including the Parthenon, before they were removed to Britain between 1801 and 1812 and later purchased by the British Museum, where they can be seen today, although the government of Greece has urged their return for many years.

Now the situation of the manuscripts in the Ramamala Library isn’t the same as that of the Elgin Marbles – we aren’t suggesting moving the physical collection to Penn – but I can’t help but believe there is a parallel here, and particularly a case to be made for respecting the ownership of cultural heritage by the cultures that created it. But what does that mean? Dr. Fleming’s comment above hints at the complexities around this question. Does “the culture that created the heritage” mean the current government? Or the cultural institutions? Or the citizens of the countries, or the citizens of the culture in those countries, no matter where they live now? And once we figure out the who, how do we ensure they have the power to make decisions about their heritage objects? But I have taken my tangent far enough, and I want to get back to talking about our ideas for making the Ramamala Library data available in OPenn.

When we left off, we had two other ideas:

  1. We could loosen the “Free Cultural Works” requirement and allow inclusion of the Ramamala data with the noncommercial license.
  2. We could build a parallel OPenn to contain data with a noncommercial license.

We considered the first suggestion but decided very quickly that we didn’t want to open that can of worms. Our concern was that if we started allowing one collection with a noncommercial license, no matter what the circumstances, other institutions that wanted to include such a clause could point to it and say “Well, them, why not us?” We have in fact used our “Free Cultural Works”-only policy as a carrot for other institutions we host data for, including several museums and libraries in Philadelphia, and it works remarkably well (free hosting in exchange for an open license is apparently an attractive prospect). We don’t want to lose that leverage – particularly when it comes to institutions that own materials from formerly colonized countries, we have the ability, and thus the responsibility, to make that data available again, and it’s not a responsibility we take lightly.

The second idea, building a parallel OPenn to allow noncommercially licensed data, was more attractive but, we decided, just too much work at this time, with only one collection. However, it did seem like something that would be a good community project: an open access portal, similar to OPenn in design, but with policies designed for the concerns surrounding access and reuse of cultural heritage data from formerly colonized countries. If something like this is currently being designed I would love to hear about it, and I expect we would be very happy to put our Ramamala data into such a platform.

So what did we do? We decided to take the path of least resistance and not do anything. The Ramamala manuscripts are available on Penn in Hand, and, since the license information was entered into the MARC record notes field, the license is obvious (unlike most other materials in Penn in Hand). The images are still not easily downloadable, although we are working on a new page for the project through which people can request free access to the high-resolution TIFF files. However, they aren’t available on OPenn. And I’m not sorry about that. I’m not sorry that OPenn’s policy is strictly for “Free Cultural Works,” because I think within our community, serving mainly institutions in Philadelphia and other US and Western European cities, the policy helps us leverage collections into Open Access that might otherwise be under more strict licenses. But I do think it’s important for us to keep talking and thinking about how we can serve other communities with respect.

Edit: after posting this talk, I was contacted by Caroline Schroeder who pointed me to an article she wrote about similar issues with Coptic manuscripts. I share a link to that article here: Caroline T. Schroeder, “Shenoute in Code: Digitizing Coptic Cultural Heritage For Collaborative Online Research and Study” Coptica 14 (2015), 21-36.

Manuscript PDFs: Update

My last post was an announcement that I’d posted the University of Pennsylvania’s Schoenberg Collection manuscripts on Google Drive as PDF files, along with details on how I did it. This is a follow-up to announce that I’ve since added PDF files for UPenn’s Medieval and Renaissance Manuscript collection, AND for the Walters Art Museum manuscripts (which are available for download through The Digital Walters).

As with the Schoenberg Manuscripts, these two other collections are in their own folders, along with a spreadsheet you can search and browse to aid in discovery. You are free to download the PDF files and redistribute them as you wish. They are in the public domain.

The main directory for the manuscripts is here.

Enjoy!

Title: Initial “C” with St. Paul trampling Agrippa
Form: Historiated initial “C,” 12 lines
Text: Psalm 97
Comment: The inscriptions on the scrolls read “Paulus/Agr[ipp]a.” The second inscription is partially obliterated.
Source: Walters Ms. W.36, Touke Psalter, fol. 89r

It’s been a while since I rapped at ya

I’m not dead! I’m just really bad when it comes to blogging. I’m better at Facebook, and somewhat better at Twitter (and Twitter), and I do my best to update Tumblr.

The stated purpose of this blog is to give technical details of my work. This mostly involves finding data, and moving it around from one format to another. I use XSLT, because it’s what I know, although I’ve seen the promise of Python and I may eventually take the time to learn it well. I don’t know when that will happen, though.

I’ve taken to posting files and documentation on GitHub, so if you’re curious you can look there. If you’re familiar with my interests, and you share them, the most interesting things will be VisColl, a developing system for generating visualizations of manuscript codices showing elements of physical construction; DistributionVis, which is, as described on GitHub, “a wee script to visualize the distribution of illustration in manuscripts from the Walters Art Museum”; and ebooks, files I use to start the process of building ebooks from our digitized collection. (Finished ebooks are archived in UPenn’s institutional repository, where you can download them.)

VisColl – quire with an added leaf
DistributionVis – Different color lines refer to different types of illustrations or texts.

VisColl has legs – there is a lot of interest in the community, and it is part of a major grant from the Mellon Foundation to the University of Toronto. Woohoo! DistributionVis is something I threw together in an afternoon because I wanted to see if I could. I thought ebooks were a nice way to provide a different way for people to access our collection. I’ve no idea if either of these two is any use to anyone, but I put them out there, because why not?

I do a lot of putting-things-out-there-because-why-not – it seems to be what I’m best at. So I’m going to continue doing that. And when I do, I shall try my very best to put it here too!

Until next time…

Disbinding Some Manuscripts, and Rebinding Some Others (presented at ICMS, Kalamazoo, MI, May 2014)

I presented my collaborative project on visualizing collation at the International Congress on Medieval Studies in Kalamazoo, Michigan, last week, and it was really well received. Also last week I discovered the Screen Recording function in QuickTime on my Mac. So, I thought it might be interesting to re-present the Kalamazoo talk in my office and record it so people who weren’t able to make the talk could still see what we are up to. I think this is longer than the original presentation – 23 minutes! – so feel free to skip around if it gets boring. Also there is no editing, so um ah um sorry about that. (Watch out for a noise at 18:16, I think my hand brushed the microphone, it’s unnerving if you’re not expecting it)

We’ll also be presenting this work as a poster/demo at the Digital Humanities 2014 Conference in Lausanne this July.

How to get MODS using the NYPL Digital Collections API

Last week I figured out how to batch-download MODS records from the NYPL Digital Collections API (http://api.repo.nypl.org/) using my limited set of technical skills, so I thought I would share my process here.

I had a few tools at my disposal. First, I’m on a MacBook. I’m not sure how I would have done this had I been on a Windows machine. Second, I’m pretty good with XSLT. Although I have some experience with a few other languages (JavaScript, Python, Perl), I’m not really good at them. It’s possible one could do something like this using other languages and it would be more effective – but I use what I know. I also had a browser, which came in handy in the first step.

The first thing I had to do was find all the objects that I wanted to get the MODS for. I wanted all the medieval objects (surprise!), so to get as broad a search as possible I opted for the “Search across all MODS fields” option (Method 4 in the API Documentation), which involves constructing a URL to stick in a browser. Because the API will return at most 500 results for a single search, I included that limit in my search. I ended up constructing four URLs, since it turned out there were between 1500 and 2000 objects:
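(Roughly speaking, each of the four URLs looked like the line below – the parameter names here are approximate and the query term is just illustrative, so check the API documentation for the exact search syntax:)

http://api.repo.nypl.org/api/v1/items/search?q=medieval&per_page=500&page=1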

I plugged these into my browser, then just saved the result pages as XML files in a directory on my Mac. Each of these result pages had a brief set of fields for each object: UUID (the unique identifier for each object, and the thing I needed to use to get the MODS), title, typeOfResource, imageID, and itemLink (the URL for the object on the NYPL Digital Collections website).

Next, I had to figure out how to feed the UUIDs back into the API. I thought about this for most of a day, and an evening, and then a morning. I tapped my network for some suggestions, and it wasn’t until Conal Tuohy suggested using document() in XSLT that I thought XSLT might actually work.

To get the MODS record for any UUID, you simply construct a URL that points to the MODS record in the NYPL file directory. These URLs look like this:

http://api.repo.nypl.org/api/v1/items/mods/[UUID].xml

For my first attempt, I wrote an XSLT document that used document(), constructing pointers to each MODS record as it processed the result documents I had saved from my browser. Had this worked, it would have pulled all the MODS records into a new document during processing. I use Oxygen for almost all of my XML work, including processing, but when I tried to process my first result document I got an I/O error. Of course I did – the API doesn’t allow just any old person in. You need to authenticate, and when you sign up with the API they send you an authentication token. There may be some way to authenticate through Oxygen, but if so I couldn’t figure it out. So, back to the drawing board.
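(For the record, a minimal sketch of that document()-based stylesheet looks something like this – note that <item> and <uuid> are stand-ins for whatever elements the saved result pages actually use:)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- for each object in a saved search-results file, pull its MODS record in via document() -->
  <xsl:template match="/">
    <allMods>
      <xsl:for-each select="//item">
        <xsl:copy-of select="document(concat('http://api.repo.nypl.org/api/v1/items/mods/', uuid, '.xml'))"/>
      </xsl:for-each>
    </allMods>
  </xsl:template>
</xsl:stylesheet>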

Over lunch on the second day, I picked the brain of my colleague Doug Emery. Doug and I worked together on the Walters BookReaders (which are elsewhere on this site), and I trust him to give good advice. We didn’t have much time, but he suggested using a curl request through the terminal on my Mac – maybe I could do something like that? I had seen curl mentioned in the API documentation as well, but I hadn’t heard of it and certainly hadn’t used it before. But I went back to my office and did some research.

Basically, curl is a command-line tool for grabbing the content of whatever is at the other end of a URL. You give it a URL, and it sends back whatever is on the other end. So, if you send out the URL for an NYPL MODS record, it will send the MODS record back. There’s an example on the NYPL API documentation page which incorporates the authentication token. Score!

curl "http://api.repo.nypl.org/api/v1/items?identifier_type=local_bnumber&identifier_val=b11722689" -H 'Authorization: Token token="abcdefghijklmn"'
where "abcdefghijklmn" is the authentication token you receive when you sign up (link coming soon).

Next, I needed to figure out how to send between 1500 and 2000 URLs through my terminal, without having to do them one by one. Through a bit of Google searching I discovered that it’s possible to replace the URL in the command with a pointer to a text file containing a list of URLs, in the format url = [url]. So I wrote a short XSLT that I used to process over all four of the result documents, pulling out the UUIDs, constructing URLs that pointed to the corresponding MODS records, and putting them in the correct format. Then I put pointers to those documents in my curl command:

curl -K "nypl_medieval_4_forCurl.txt" -H 'Authorization: Token token="[my_token]"' > test.xml
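(And here is roughly what that short XSLT looked like – again a sketch, with <item> and <uuid> as placeholders for the element names in the saved result pages. It just writes one url = line per object, in the format curl -K expects:)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//item">
      <!-- one "url = ..." line per UUID, pointing at its MODS record -->
      <xsl:text>url = "http://api.repo.nypl.org/api/v1/items/mods/</xsl:text>
      <xsl:value-of select="uuid"/>
      <xsl:text>.xml"&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>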

Voila – four documents chock full of MODS goodness. And I was able to do it with a Mac terminal and just a little bit of XSLT.

Q: How do you teach TEI in an hour?

A: You don’t! But you can provide a substantial introduction to the concept of the TEI, and explain how it functions.

On June 4 I participated in PhillyDH@Penn, a day of workshops and unconference sessions sponsored by PhillyDH and held in my own beloved Van Pelt-Dietrich Library on the University of Pennsylvania campus. I was sick, so I wasn’t able to participate fully, but I was able to lead a one-hour Introduction to TEI. I aimed it at absolute beginners, with the intention to a) give the audience an idea of what TEI is and what it’s for (to help them answer the question, Is TEI really what I need?) and b) explain enough about the TEI that they will know a bit of something walking into their first “real” (multi-hour, hands-on) TEI workshop. I got a lot of good feedback, so hopefully it did its job. And I do hope to have the opportunity to follow this up with more substantial workshops.

Slides (in PDF format) are posted here.

EDIT: Need to add that these slides owe a ton to James Cummings, with whom I have taught TEI and to whom I owe much of what I know about it!

New Job

I have a new job! As of April 1, I am the Curator, Digital Research Services, Schoenberg Institute for Manuscript Studies, Special Collections Center, University of Pennsylvania.

Now say that five times fast.

In this role, I’m responsible for the digital initiatives coming out of SIMS. I’m also a curator for the Schoenberg Manuscript collection (as are all SIMS staffers in the SCC), and I manage the Vitale Special Collections Digital Media Lab (Vitale 2). It’s also clear that I will have some broad role in the digital humanities at UPenn, although that part of the job isn’t so clear to me yet. I’m the inaugural DRS Curator, so there is still a lot to be worked out, although after three weeks on the job I’m feeling really confident that things are under control.

I report to Will Noel, Director of the SCC and Director of SIMS. Will may be best known as the Project Director for the Archimedes Palimpsest Project at the Walters Art Museum. Will is awesome, and I’m thrilled to be working for him. I’m also thrilled to be working at the University of Pennsylvania, in a beautiful brand new space, in the great city of Philadelphia.

I plan to blog more in this new position, so keep your eyes here and on the MESA blog.

Medieval Electronic Scholarly Alliance

It’s not that I haven’t been doing anything; I just haven’t been doing anything here.

In June, Tim Stinson at NCSU and I were awarded a grant from the Andrew W. Mellon Foundation to implement the Medieval Electronic Scholarly Alliance, MESA, and since we started preparing metadata for indexing in, oh, about mid-September, all of my technical energy has been going into that work. MESA is a node of the Advanced Research Consortium (ARC), and follows the model of scholarly federation spearheaded by NINES and followed by 18thconnect, using Collex as its underlying software support.

What does it involve? It involves taking metadata already created by digital collections or projects (like, for example, the Walters Art Museum, Parker on the Web, or e-Codices), mapping their fields to the Collex and MESA fields, and then generating RDF that follows the Collex guidelines. It’s these three collections (and, just in the last week and with a ton of help from Lisa McAulay at UCLA, the St. Gall Project) that I’ve spent the most time on, and here are some key things I’ve learned so far that might be helpful for people considering adding their collections to MESA or to any of the other ARC nodes – or, for that matter, for anyone who is interested in sharing their data, any time ever.

  1. If your metadata is in an XML format, put unique identifiers on all the tags. Just do it. (There’s a small example after this list.)
  2. Follow best practices for file and folder naming. In this case, best practice probably just means BE CONSISTENT. My favorite example for illustrating consistency is probably the Digital Walters at the Walters Art Museum. One of the reasons that it’s a great example is that their data (image files and metadata) has been released under Open Access licensing, with the intention that people (like me) will come along and grab them, and work with them. What this means is that everything was designed with this usage in mind, so it’s simple, and that all their practices are well-documented, so it’s easy to take them and run without having to figure a lot out. Their file and folder naming practices are documented here.
  3. Be thoughtful about how you format your dates. Yeah, dates are a big deal. The metadata of one of the projects (I won’t say which one) had dates formatted as, e.g., “xii early-xiii late (?)”, with no ISO (or any other even slightly computer-readable) version in, e.g., an attribute value. I ended up mapping every individual date value to a computer-readable form. As always, I figure there was a better way to do it, but at the time… well, it took a while, but in the end it did what I needed it to do. If you are in the process of setting up a new project, consider including computer-readable dates in your metadata. Because Collex/MESA includes values for both date label and date value, you can include both a computer-readable date to be used for searching and a human-readable version, which may have more nuance than can be included in a computer-readable value. For example, LABEL: xii early-xiii late (?); VALUE: 1100,1300.
  4. Is whatever you are describing in English? Great! If it’s not, find someplace in your metadata to indicate what language it is in. Use a controlled vocabulary, such as the ISO 639-2 Language Code List, and as with dates it helps to include a computer-readable code in addition to a human-readable version. (This is less important for MESA, as we can map between the columns in the ISO 639-2 Language Code List, but it’s a good general rule to follow)
  5. You can actually do a lot with a little. In the past few weeks the metadata I’ve worked with has ranged from really amazing full manuscript descriptions (in TEI P5) from the Walters Art Museum to very simple and basic descriptions in Dublin Core from e-Codices. Although the full descriptions provide more information (helpful for full-text searching), the simple DC records, assuming they include the basic information you’d want for searching (title, date, language, maybe incipits), are just fine. So don’t let a lack of detail put you off from contributing to MESA!
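To illustrate the first point: in TEI (or any other XML format) this can be as simple as making sure every element carries an xml:id. The identifier values below are made up – the point is just that each tag has one, so anyone harvesting your metadata can point at any piece of it.

<div xml:id="ms-w34-f12r">
  <p xml:id="ms-w34-f12r-p1">Description of the opening miniature…</p>
  <p xml:id="ms-w34-f12r-p2">Description of the border decoration…</p>
</div>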

I would also encourage you to release your data under an Open Access license (something from the Creative Commons), and to make your data easy to grab (either by posting everything online, as the Walters Art Museum has done, or by releasing it through an OAI provider, as e-Codices has done).

Manuscript-level records in Omeka

I had some time tonight to figure out how to get the manuscript-level records into Omeka.

I’d already worked out the (very simple) XPath needed to pull out the few bits of information I wanted from the TEI msDesc: siglum, title, URL to the Walters data, BookReader URL, and URL to the first thumbnail image in the ms (which I’m just using as an illustration, a placeholder for the full ms). This evening I did the final checks to make sure the output was right, then just ran the XSLT against all the msDesc files and created .csv files for each one (there’s an Omeka plug-in that will accept CSV input, in order to bulk ingest metadata and files). I used the “insert file contents” function in TextWrangler (bless that program) to pull all of the individual CSV rows into a single document, then ingested that into Omeka. There were a few bugs, of course, but generally it was smooth.

I’ve made a few of the records live, just those that already have illumination records tagged in Omeka too. What this means is that you can now go to the Omeka site, go to “Browse Items by Tag” (http://www.dotporterdigital.org/omeka/items/tags), and click on one of the larger tags (each ms and illumination is tagged with the ms siglum; the more illustrations there are, the larger the tag will appear in the browse list). At the moment the first entry in the list will be the record for the manuscript, followed by the records for the illuminations… although I don’t know if that’s just because the manuscript records are newer.
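(For reference, the XSLT behind those CSV rows amounted to something like the sketch below. The element names and paths are simplified, and the real output also includes the Walters data URL, BookReader URL, and thumbnail URL columns:)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">
  <xsl:output method="text"/>
  <!-- emit one CSV row per msDesc file: siglum, then title -->
  <xsl:template match="/">
    <xsl:value-of select="//tei:msDesc/tei:msIdentifier/tei:idno[1]"/>
    <xsl:text>,"</xsl:text>
    <xsl:value-of select="normalize-space((//tei:titleStmt/tei:title)[1])"/>
    <xsl:text>"&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>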

I would like to include each illumination in the ms records (HasPart) and the ms in the illumination records too (IsPartOf), but I am not certain that’s something I’ll be able to do programmatically. Anyway, I think that is the next thing on my list. That and tagging all of the other illumination records with the siglum (so they will be browseable with the manuscript).