Zombie Manuscripts: Digital Facsimiles in the Uncanny Valley

This is a version of a paper presented at the International Congress on Medieval Studies, May 12, 2018, in session 482, Digital Skin II: ‘Franken-Manuscripts’ and ‘Zombie Books’: Digital Manuscript Interfaces and Sensory Engagement, sponsored by Information Studies (HATII), Univ. of Glasgow, and organized by Dr. Johanna Green.

The uncanny valley was described by Masahiro Mori in a 1970 article in the Japanese journal Energy, and it wasn’t translated into English completely until 2012.[1] In this article, Mori discusses how he envisions people responding to robots as they become more like humans. The article is a thought piece – that is, it’s not based on any data or study. In the article, which we’ll walk through closely over the course of this presentation, Mori posits a graph, with human likeness on the x axis and affinity on the y axis. Mori’s proposition is that, as robots become more human-like, we have greater affinity for them, until they reach a point at which the likeness becomes creepy, or uncanny, leading to a sudden dip into negative affinity – the uncanny valley.

Now, Mori defined the uncanny valley specifically in relation to robotics, but I think it’s an interesting thought exercise to see how we can plot various presentations of digitized medieval manuscripts along the affinity/likeness axes, and think about where the uncanny valley might fall.

In 2009 I presented a paper, “Reading, Writing, Building: the Old English Illustrated Hexateuch,” (unpublished but archived in the Indiana University institutional repository) in which I considered the uncanny valley in relation to digital manuscript editions. This consideration followed a long description of the “Turning the Pages Virtualbook” technology which was then being developed at the British Library, of which I was quite critical. At that time, I said:

In my mind, the models created by Turning the Pages™ fall at the nadir of the “uncanny valley of digital texts” – which has perhaps a plain text transcription at one end and the original manuscript at the other end, with print facsimiles and editions, and the various digital displays and visualizations presented earlier in this paper falling somewhere between the plain text and the lip above the chasm.

Which would plot out something like this on the graph. (Graph was not included in the original 2009 paper)

Dot’s 2009 Conception of the Uncanny Valley of Manuscripts

After nine years of thinking on this and learning more about how digital manuscripts are created and how they function, I’m no longer happy with this arrangement. Additionally, in 2009 I was working with imperfect knowledge of Mori’s proposition: the translation of the article I referred to then was an incomplete translation from 2005, and it included a single, simplified graph in place of the two graphs from the original article, which we will look at later in this talk.

Manuscripts aren’t people, and digitized manuscripts aren’t robots, so before we start I want to be clear about what exactly I’m thinking about here. Out of Mori’s proposition I distill four points relevant to our manuscript discussion:

First, robots are physical objects that resemble humans more or less (this is the x-axis of the graph)

Second, as robots become more human-like, people have greater affinity for them (until they don’t – uncanny valley) – this is the y-axis of the graph

Third, the peak of the graph is a human, not the most human robot

Fourth, the graph refers to robots and to humans generally, not robots compared to a specific human.

Four parallel points can be drawn to manuscripts:

First, digitized manuscripts are data about manuscripts (digital images + structural metadata + additional data) that are presented on computers. Digitized manuscripts are pieces, and in visualizing the manuscript on a computer we are reconstructing them in various ways. (Given the theme of the session I want to point out that this description makes digitized manuscripts sound a lot more like Frankenstein’s creature than like a traditional zombie, and I’m distraught that I don’t have time to investigate this concept further today) These presentations resemble the parent manuscript more or less (this is the x-axis)

Second, as presentations of digitized manuscripts become more manuscript-like, people have greater affinity for them (until they don’t – uncanny valley) – this is the y-axis

Third, the peak of the graph is the parent manuscript, not the most manuscript-like digital presentation

Fourth, the graph refers to a specific manuscript, not to manuscripts generally

I think that this is going to be the major difference in applying the concept of the uncanny valley to manuscripts vs. robots: while robots are general, not specific (i.e., they are designed and built to imitate humans and not specific people), the ideal (i.e., most manuscript-like) digital presentation of a manuscript would need to be specific, not general (i.e., it would need to be designed to look and act like the parent manuscript, not like any old manuscript).

Now let’s move on to Affinity

A Valley in One’s Sense of Affinity

Mori’s article is divided into four sections, the first being “A Valley in One’s Sense of Affinity”. In this section Mori describes what he means by affinity and how affinity is affected by sensory input. Figure one in this section is the graph we saw before, which starts with an Industrial Robot (little likeness, little affinity), then a Toy Robot (more likeness, more affinity), then drops to negative affinity at about 80-85% likeness, with Prosthetic Hand at negative affinity and Bunraku Puppet on the steep rise to positive affinity and up to Healthy Person.

For Mori, sensory input beyond the visual is important for an object’s placement on the x-axis. An object might look very human, but if it feels strange, that doesn’t only send the affinity into the negative, but it also lessens the likeness. Mori’s original argument focuses on prosthetic hands, specifically realistic prosthetic hands, which cannot be distinguished at a glance from real ones. I’m afraid the language in his example is ableist so I don’t want to quote him,

Luke Skywalker’s prosthetic hand in The Empire Strikes Back

but his argument is essentially that a very realistic prosthetic hand, when one touches it and realizes it is not a real hand (as one had been led to believe), it becomes uncanny. Relating this feeling to the graph, Mori says, “In mathematical terms, this can be represented by a negative value. Therefore, in this case, the appearance of the prosthetic hand is quite humanlike, but the level of affinity is negative, thus placing the hand near the bottom of the valley in Figure 1.”

The character Osono, from the play Hade Sugata Onna Maiginu (艶容女舞衣), in a performance by the Tonda Puppet Troupe of Nagahama, Shiga Prefecture. https://en.wikipedia.org/wiki/Bunraku#/media/File:Osonowiki.jpg (CC:BY:SA)

Bunraku puppets, while not actually resembling humans physically as strongly as a very realistic prosthetic hand visually resembles a human hand, fall farther up the graph both in terms of likeness and in affinity. Mori makes it clear that likeness is not only, or even mostly, a visual thing. He says:

I don’t think that, on close inspection, a bunraku  puppet appears similar to a human being. Its realism in terms of size, skin texture, and so on, does not even reach that of a realistic prosthetic hand. But when we enjoy a puppet show in the theater, we are seated at a certain distance from the stage. The puppet’s absolute size is ignored, and its total appearance, including hand and eye movements, is close to that of a human being. So, given our tendency as an audience to become absorbed in this form of art, we might feel a high level of affinity for the puppet.

So it’s not that bunraku puppets look like humans in great detail, but when we experience them within the context of the puppet show they have the effect of being very human-like, thus they are high on the human likeness scale.

For a book-related parallel I want to quote briefly a blog post, brought to my attention earlier this week, by Sean Gilmore. Sean is an undergraduate student at Colby College and this past semester took Dr. Megan Cook’s Book History course, for which he wrote this post, “Zombie Books; Digital Facsimiles for the Dotty Dimple Stories.” There’s nothing in this post to suggest that Sean is familiar with the uncanny valley, but I was tickled with his description of reading a digital facsimile of a printed book. Sean says:

In regards to reading experience, reading a digital facsimile could not be farther from the experience of reading from the Dotty Dimple box set. The digital facsimile does in truth feel like reading a “zombie book”. While every page is exactly the same as the original copy in the libraries of the University of Minnesota, it feels as though the book has lost its character. When I selected my pet book from Special Collection half of the appeal of the Dotty Stories was the small red box they came in, the gold spines beckoning, almost as if they were shouting out to be read. This facsimile, on the other hand, feels like a taxidermy house cat; it used to be a real thing, but now it feels hollow, and honestly a little weird.

Sean has found the uncanny valley without even knowing it exists.

The Effect of Movement

The second section of Mori’s article, and where I think it really gets interesting for thinking about digitized manuscripts, is The Effect of Movement. In the first section we were talking in generalities, but here we see what happens when we consider movement alongside general appearance. Manuscripts, after all, are complex physical objects, much as humans are complex physical objects. Manuscripts have multiple leaves, which are connected to each other across quires, the quires which are then bound together and, often, connected to a binding. So moving a page doesn’t just move a page, much as bending your leg doesn’t just move your leg. Turning the leaf of a manuscript might tug on the conjoined leaf, push against the binding, tug on the leaves preceding and following – a single movement provoking a tiny chain reaction through the object, and one which, with practice, we are conditioned to recognize and expect.

Mori says:

Movement is fundamental to animals— including human beings—and thus to robots as well. Its presence changes the shape of the uncanny valley graph by amplifying the peaks and valleys (Figure 2). For illustration, when an industrial robot is switched off, it is just a greasy machine. But once the robot is programmed to move its gripper like a human hand, we start to feel a certain level of affinity for it.

And here, finally, we find our zombie, at the nadir of the “Moving” line of the uncanny valley. The lowest point of the “Still” line is the Corpse, and you can see the arrow Mori has drawn from “Healthy Person” at the pinnacle of the graph down to “Corpse” at the bottom. As Mori says, “We might be glad that this arrow leads down into the still valley of the corpse and not the valley animated by the living dead.” A zombie is thus, in this proposition, an animated corpse. So what is a “dead” manuscript? What is the corpse? And what is the zombie? (I don’t actually have answers, but I think Johanna might be addressing these or similar questions in her talk)

Reservoir Dogs (not zombies)
The Walking Dead (shuffling zombies)
28 Days Later (manic zombies)

I expect most of us here have seen zombie movies, so, in the same way we’ve been conditioned to recognize how manuscripts move, we’ve been conditioned to understand when we’re looking at “normal” humans and when we’re looking at zombies. They move differently from normal humans. It’s part of the fun of watching a zombie film – when that person comes around the corner, we (along with the human characters in the film) are watching carefully. Are they shuffling or just limping? Are they running towards us or away from something else? It’s the movement that gives away a zombie, and it’s the movement that will give away a zombie manuscript.

 

I want to take a minute to look at a manuscript in action. This is a video of me turning the pages of Ms. Codex 1056, a Book of Hours from the University of Pennsylvania. This will give you an idea of what this manuscript is like (its size, what its pages look like, how it moves, how it sounds), although within Mori’s conception this video is more similar to a bunraku puppet than it is like the manuscript itself.

It’s a copy of the manuscript, showing just a few pages, and the video was taken in a specific time and space with a specific person. If you came to our reading room and paged through this manuscript, it would not look and act the same for you.

e-codices manuscript viewer
e-codices viewed through Mirador

Now let’s take a look at a few examples of different page-turning interfaces. The first is from e-codices, and is their regular, purpose-built viewer. When you select the next page, the opening is simply replaced with the next opening (after a few seconds for loading). The second is also e-codices, but is from the Mirador viewer, a IIIF viewer that is being adopted by institutions and that can also be used by individuals. Similar to the other viewer, when you select the next page the opening is replaced with the next opening (and you can also track through the pages using the image strip along the bottom of the window). The next example is a Bible from Swarthmore College near Philadelphia, presented in the Internet Archive BookReader. This one is designed to mimic a physical page turning, but it simply tilts and moves the image. This would be fine (maybe a bit weird) if the image were text-only, but as the image includes the edges of the text-block and you can see a bit of the binding, the effect here is very odd. Finally, my old friend Turning the Pages (a newer version than the one I complained about in my 2009 paper), which works very hard to mimic the movement of a page turning, but does so in a way that is unlike any manuscript I’ve ever seen.

Escape by Design

In the third section of his article, Mori proposes that designers focus their work in the area just before the uncanny valley, creating robots that have lower human likeness but maximum affinity (similar to how he discussed bunraku puppets in the section on affinity, although they are on the other side of the valley). He says:

In fact, I predict that it is possible to create a safe level of affinity by deliberately pursuing a nonhuman design. I ask designers to ponder this. To illustrate the principle, consider eyeglasses. Eyeglasses do not resemble real eyeballs, but one could say that their design has created a charming pair of new eyes. So we should follow the same principle in designing prosthetic hands. In doing so, instead of pitiful looking realistic hands, stylish ones would likely become fashionable.

Floral Porcelain Leg from the Alternative Limb Project (http://www.thealternativelimbproject.com/project/floral-porcelain-leg/)

And here’s an example of a very stylish prosthetic leg from the Alternative Limb Project, which specializes in beautiful and decidedly not realistic prosthetic limbs (and realistic ones too). This is definitely a leg, and it’s definitely not her real leg.

 

In the world of manuscripts, there are a few approaches that would, I think, keep digitized manuscript presentations in that nice bump before the valley:

 

“Page turning” interfaces that don’t try too hard to look like they are actually turning pages (see the two e-codices examples above)

Alternative interfaces that are obviously not attempting to show the whole manuscript but still illustrate something important about them (for example, RTI, MSI, or 3D models of single pages). This example is an interactive 3D image of the miniature of St. Luke from Bill Endres’s Manuscripts of Lichfield Cathedral project.

Visualizations that illustrate physical aspects of the manuscript without trying to imitate them (for example, VisColl visualizations with collation diagrams and bifolia)

 

I think these would plot out something like this on the graph.

Dot’s 2018 Conception of the Uncanny Valley of Digitized Manuscripts

This is all I have to say about the uncanny valley and zombie books, but I’m looking forward to Johanna, Bridget and Angie’s contributions and to our discussion at the end. I also want to give a huge shout-out to Johanna and Bridget, to Johanna for conceiving of this session and inviting me to contribute, and both of them for being immensely supportive colleagues and friends as I worked through my thoughts about frankenbooks and zombie manuscripts, many of which, sadly, didn’t make it into the presentation, but which I look forward to investigating in future papers.

[1] M. Mori, “The uncanny valley,” Energy, vol. 7, no. 4, pp. 33–35, 1970 (in Japanese);  M. Mori, K. F. MacDorman and N. Kageki, “The Uncanny Valley [From the Field],” in IEEE Robotics & Automation Magazine, vol. 19, no. 2, pp. 98-100, June 2012. (translated into English) (https://ieeexplore.ieee.org/document/6213238/)

Data for Curators: OPenn and Bibliotheca Philadelphiensis as Use Cases

Following are my remarks from the Collections as Data National Forum 2 event held at the University of Nevada, Las Vegas, on May 7, 2018. Collections as Data is an Institute of Museum and Library Services supported effort that aims to foster a strategic approach to developing, describing, providing access to, and encouraging reuse of collections that support computationally-driven research and teaching in areas including but not limited to Digital Humanities, Public History, Digital History, data driven Journalism, Digital Social Science, and Digital Art History. The event was organized by Thomas Padilla, and I thank him for inviting me. It was a great event and I was honored to participate.

Today I’m going to be talking about curators as an audience for collections as data, using two projects from the University of Pennsylvania’s Kislak Center for Special Collections, Rare Books and Manuscripts as use cases. I am a curator in the Kislak Center, and most of my time I work on projects under the aegis of the Schoenberg Institute for Manuscript Studies, which is a unit under the Kislak Center. SIMS is a kind of research and development group (our director likes to refer to it as a think tank) that focuses on manuscript studies writ large, mostly but by no means only focused on medieval manuscripts from Europe, and that specializes in examining the relationship between manuscripts as physical objects and their digitized counterparts.

For this session, we’ve been asked to react to this assertion from the Collections as Data Santa Barbara Statement: Collections as data designed for everyone serve no one, and to discuss the audiences that our collections as data are built for.

I’ll start with OPenn, which launched in May 2015 as an open access collection of Penn’s digitized manuscript material. Penn started digitizing its manuscripts in the mid 1990s, but they had been virtually locked in a black box system. To create OPenn we cracked open the box, generated new derivative images from the master TIFF files, generated TEI/XML manuscript description files using the data from our catalog and supporting databases, and put it all on a fully public file server. The collection navigation is provided by HTML pages – one that lists all the repositories, pages listing the manuscripts in each repository, and finally HTML pages for each manuscript presenting the catalog data and links to the image files. At the time OPenn launched, there was no search facility, although one has recently been added.

OPenn’s developer, Doug Emery, describes the access that OPenn provides as friction-free access, referring both to the licensing (the image files are in the public domain, the metadata is licensed cc:by) and to the technical access. There’s no login and no API. You can navigate to the site in a browser and download images, or you can point wget at the server and bulk download entire manuscripts.
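To give a sense of what friction-free technical access makes possible, here is a minimal sketch of the same idea as the wget approach, written in Python. It reads a directory listing page, picks out the links that look like image files, and turns them into absolute URLs. The listing markup, the file names, and the example.org base URL are all assumptions made up for illustration; they are not OPenn's actual directory layout.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlretrieve


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def image_urls(listing_html, base_url, extensions=(".tif", ".jpg")):
    """Return absolute URLs for links in a listing page that look like images."""
    collector = LinkCollector()
    collector.feed(listing_html)
    return [
        urljoin(base_url, href)
        for href in collector.hrefs
        if href.lower().endswith(extensions)
    ]


def download_all(urls, target_dir):
    """Download each URL into target_dir (needs network access; not called here)."""
    os.makedirs(target_dir, exist_ok=True)
    for url in urls:
        filename = os.path.join(target_dir, url.rsplit("/", 1)[-1])
        urlretrieve(url, filename)


# Hypothetical listing page standing in for one manuscript's image directory.
SAMPLE_LISTING = """
<html><body>
<a href="../">Parent Directory</a>
<a href="0001.tif">0001.tif</a>
<a href="0002.tif">0002.tif</a>
<a href="manifest.txt">manifest.txt</a>
</body></html>
"""

# Hypothetical base URL; substitute the real directory you want to mirror.
urls = image_urls(SAMPLE_LISTING, "https://example.org/ms/data/")
```

Because there is no login and no API, nothing more than this is needed: any tool that can read a web page and follow links can do the job.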

When we were designing OPenn, we weren’t thinking that much about the audience, honestly. We were thinking about pushing the envelope with fully available, openly licensed, high resolution, robustly described and well-organized digitized medieval manuscripts. We did imagine who might use our collections, and how, and you can read the statement from our readme here on the screen.

But I can’t say that we built the system to serve any audience in particular. We did build the system in a way that we thought would be generally useful and usable. But it became clear after OPenn launched that our lack of an audience made it difficult for us to “sell” OPenn to any group of people. Medievalists, faculty and students, who might want to use the material, were put off by the relatively high technical learning curve, the simple interface (lacking the expected page-turning view) and by the lack of search (we do have a Google Search now, but it was only added to the site in the past month). Data analysts who might want to visualize the collection-wide data were put off by the formatting of each manuscript having its own TEI file. Indeed data designed for everyone does seem to serve no one.

But wait! Don’t lose hope! An accidental audience did present itself. In the months and into the first year after OPenn launched, it was slowly used as a source for projects. The Philadelphia Area Consortium of Special Collections Libraries, PACSCL, undertook a collaborative project whereby each member institution digitized five diaries from their collections, which were put on OPenn, the PACSCL Diaries Project.

When the project went live, the folks at PACSCL wanted a user-friendly way to make the diaries available, so I generated page-turning interfaces using the Internet Archive BookReader that pulled in metadata from the TEI files and that pointed to the image files served on OPenn.

At some point I decided that I wanted to get a better sense of one of our manuscript collections, the Lawrence J. Schoenberg Collection, so again I wrote a script to generate a CSV file pulling from all the collection’s TEI files. Jessie Dummer, the Kislak Center’s Digitization Project Coordinator, cleaned up the data in the CSV, and we were able to load the CSV into Palladio for visualization and analysis (on github)
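A script of this kind can be quite short. What follows is a sketch of the general approach, not the script I actually used: it pulls a handful of fields out of each TEI manuscript description and writes one CSV row per manuscript. The column names and the XPath expressions are assumptions chosen for illustration, and a real collection would need them adjusted to match how its TEI files are actually structured.

```python
import csv
import io
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Column name -> path within the TEI file. These paths are illustrative
# assumptions, not a description of any particular collection's TEI.
FIELDS = {
    "shelfmark": ".//tei:msIdentifier/tei:idno",
    "title": ".//tei:msItem/tei:title",
    "origin_date": ".//tei:origin/tei:origDate",
    "origin_place": ".//tei:origin/tei:origPlace",
}


def tei_to_row(tei_xml):
    """Extract one CSV row (as a dict) from a TEI document given as a string."""
    root = ET.fromstring(tei_xml)
    row = {}
    for column, path in FIELDS.items():
        element = root.find(path, TEI_NS)
        # Use the first match only; leave the cell empty if nothing matches.
        row[column] = "".join(element.itertext()).strip() if element is not None else ""
    return row


def write_csv(tei_documents, out):
    """Write a header plus one row per TEI document to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=list(FIELDS))
    writer.writeheader()
    for tei_xml in tei_documents:
        writer.writerow(tei_to_row(tei_xml))


# A made-up, heavily abbreviated manuscript description for demonstration.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><sourceDesc><msDesc>
    <msIdentifier><idno>MS Example 1</idno></msIdentifier>
    <msContents><msItem><title>Book of Hours</title></msItem></msContents>
    <history><origin>
      <origDate>ca. 1450</origDate>
      <origPlace>France</origPlace>
    </origin></history>
  </msDesc></sourceDesc></fileDesc></teiHeader>
</TEI>"""

buffer = io.StringIO()
write_csv([SAMPLE_TEI], buffer)
```

Taking only the first match for each field is a simplification. Manuscripts with several texts or several places of origin would need a rule for combining values, which is exactly the kind of thing that gets sorted out in the cleanup stage.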

I combined the links to images on OPenn with data gathered through another SIMS project, VisColl (which I’ll describe in a bit more detail later) to generate a visualization of the gathering structure of manuscripts with the bifolia, or sheets, laid alongside. And last but not least, I experimented with setting up a IIIF image server that could serve the images from OPenn as IIIF-compatible images (this is a screenshot of the github site where I published IIIF manifests I generated as part of that project, but they don’t work because the server no longer exists).

The accidental audience? It was me.

I don’t remember thinking about or discussing with the rest of the team as we planned for OPenn how I might use it as part of my regular work. I was familiar with the concept of an open collection of metadata and image files online; OPenn was based on The Digital Walters, which both the Director of the Kislak Center Will Noel and Doug Emery had built when they were employed at the Walters Art Museum in Baltimore, and I had been playing with that data for a year before I was even hired at Penn. I must have known that I would use it, I just didn’t realize how much I would use it, or how having it available to me would change the way I thought about my work, and the way I worked with the collections. The things that made it difficult for other people to use OPenn – the lack of a search facility, the dependence on XML – didn’t affect me negatively. I already knew the collection, so a search wasn’t necessary; at the time OPenn launched I had been working with XML technologies for 10 years or so, so I was very comfortable with it.

Having OPenn as a source for data gives me so much in my curatorial role: flexibility, easy access, and familiar formats. I can build the interfaces I want using tools I can understand.

At the very end of 2015, several months after OPenn was launched, we, along with PACSCL, Lehigh University, and the Free Library of Philadelphia, were awarded a grant from the Council on Library and Information Resources under the “Digitizing Hidden Collections” program to digitize western Medieval manuscripts in 15 Philadelphia area libraries. We call the project Bibliotheca Philadelphiensis, the “library of Philadelphia”, or BiblioPhilly for short. Working from my experience working with data on OPenn, during the six-month lead up to cataloging and digitization I was able to build the requirements for the BiblioPhilly metadata in a way to guarantee that the resulting data would be useful to me and to the curators and librarians at the other institutions. Some of the things we implemented include a closed list of keywords (based on the keyword list developed for the Digital Walters), in contrast with the Library of Congress subject headings in OPenn, and four different date fields (date range start, date range end, single date, and narrative date) with strict instructions for each (except for narrative date) to ensure that the dates will be computer readable.
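To show why strict rules for the date fields matter, here is a rough sketch of the kind of check that computer-readable dates make possible. The field names and the specific rules (plain years only, a range needs both ends, a record uses either a single date or a range) are my assumptions for the sake of the example; they are not the actual BiblioPhilly cataloging instructions.

```python
import re

# A plain year, optionally negative, with no "ca." or other qualifiers.
YEAR = re.compile(r"^-?\d{1,4}$")

# Hypothetical field names for the three strictly controlled date fields.
STRICT_FIELDS = ("date_range_start", "date_range_end", "single_date")


def check_dates(record):
    """Return a list of problems with a record's date fields (empty if none)."""
    problems = []
    years = {}
    for field in STRICT_FIELDS:
        value = record.get(field, "").strip()
        if not value:
            continue
        if YEAR.match(value):
            years[field] = int(value)
        else:
            problems.append(f"{field}: {value!r} is not a plain year")

    has_start = "date_range_start" in years
    has_end = "date_range_end" in years
    if has_start != has_end:
        problems.append("date range needs both a start and an end")
    if has_start and has_end and years["date_range_start"] > years["date_range_end"]:
        problems.append("date range start is later than its end")
    if "single_date" in years and (has_start or has_end):
        problems.append("use either a single date or a date range, not both")
    return problems


# The narrative date is free text, so it is deliberately never checked.
good = check_dates({
    "date_range_start": "1425",
    "date_range_end": "1450",
    "narrative_date": "second quarter of the 15th century",
})
bad = check_dates({"single_date": "ca. 1450"})
```

The narrative date field is where the "ca." and "second quarter of" belong. Keeping that kind of language out of the other three fields is what lets a timeline or a sort work without any hand cleanup.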

We have also integrated data from VisColl into BiblioPhilly, both into the data itself, and in combination with the data in the interfaces. VisColl, as I mentioned before, is a system to model and visualize the quire structure of manuscripts. (A manuscript’s quire structure is called its collation, hence the name VisColl – visualizing collation.) VisColl models are XML files that describe each leaf in a manuscript and how those leaves relate to each other (if they are in the same quire, or if they are conjoined, if a leaf is missing or has been added, etc.). From a model we’re able to generate a concise description of a manuscript’s construction, in a format referred to as a collation formula, and this formula is included in the manuscript’s cataloging and becomes part of the TEI manuscript description. However we’re also able to combine the information from the collation model with the links to the image files on OPenn to generate views of a collation diagram alongside the sheets that make up the quires.
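The step from model to formula can be illustrated with a toy version. To be clear about what is assumed here: this is not the VisColl schema and not VisColl's code. It is a deliberately simplified stand-in that uses Python objects instead of XML, handles only regular quires with missing leaves, and prints one simple formula notation among the several that catalogers use.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Leaf:
    position: int            # position within the quire, starting at 1
    conjoin: Optional[int]   # position of the leaf it is conjoined with, if any
    missing: bool = False    # True if the leaf has been lost or removed


@dataclass
class Quire:
    number: int
    leaves: List[Leaf]


def regular_quire(number, leaf_count, missing=()):
    """Build a quire of nested bifolia: leaf 1 is conjoined with the last leaf, etc."""
    leaves = [
        Leaf(position=p, conjoin=leaf_count + 1 - p, missing=p in missing)
        for p in range(1, leaf_count + 1)
    ]
    return Quire(number, leaves)


def quire_formula(quire):
    """Describe one quire as number(leaf count, -missing positions)."""
    parts = [str(len(quire.leaves))]
    parts += [f"-{leaf.position}" for leaf in quire.leaves if leaf.missing]
    return f"{quire.number}({', '.join(parts)})"


def collation_formula(quires):
    """Join the per-quire descriptions into a single formula string."""
    return ", ".join(quire_formula(q) for q in quires)


# Two quires of eight leaves; the third leaf of the second quire is missing.
manuscript = [regular_quire(1, 8), regular_quire(2, 8, missing={3})]
formula = collation_formula(manuscript)
```

The point of the sketch is that the formula is derived from the model rather than typed by hand, so the same model can also drive a diagram or be matched up with image files.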

For BiblioPhilly, because of the experimentation we did with Penn manuscripts on OPenn, we’ve been able to make the digitized BiblioPhilly manuscripts available online in ways that are more user-friendly to non-technical users than OPenn is, even before we have an “official” project interface. We did this by building an In Progress Viewer relatively early on. The aim of the In Progress viewer was 1) to provide technically simple, user-friendly ways to search, browse, and view the manuscripts, and 2) to make available information both about the manuscripts that were online, and about the manuscripts that had yet to go online (including the date they were photographed, so users can track manuscripts of particular interest through the process).

The first In Progress Viewer was built in the Library of Congress’s Viewshare, which provided faceted browsing for all the fields in our records, along with a timeline and simple mapping facility. Unfortunately the Library of Congress is no longer supporting Viewshare, and when it went offline on March 20 we moved to an Omeka platform, which is more attractive but lacks the faceted searching that made Viewshare so compelling. From Omeka (and Viewshare before it) we link to the manuscript data on OPenn, to Internet Archive BookReader page-turners, and to VisColl collation views. Both the BookReaders and VisColl views are generated locally from scripts and hosted on a Digital Ocean droplet. This is a temporary system, and is not built to last beyond the end of the project. It will be replaced by an official, longer-lived interface.

We’re also able to leverage the OPenn design of BiblioPhilly and VisColl for this “official” interface, which is currently under development with Byte Studios of Milwaukee, Wisconsin. While our In Progress Viewer has both page-turning facility and collation views, those elements are separate and are not designed to interact. The interface that we are designing with Byte Studios incorporates the collation data with the page-turning and will allow a user to switch seamlessly between page openings and full sheets.

It’s exciting that we’ve been able to leverage what was essentially an audience-less platform into something that can so well serve its curator, but there is a question that this approach pushes wide open: What does it mean to be a curator? With a background in digital humanities focused on the development of editions of medieval manuscripts I was basically the perfect curator for OPenn. But that was a happy accident. Most special collections curators don’t have my background or my technical training, so access to something like OPenn wouldn’t help them, and I’m very hesitant to suggest that every curator be trained in programming. I do think that every special collections department should have some in-house digital expertise, and maybe that’s the direction to go. Anyway, I’m very happy being in my current situation and I only wish we’d considered the curator as an audience for OPenn earlier in the process.

Workflow: MS Word to TEI

For the past couple of years I’ve been refining a workflow to convert MS Word files into publishable TEI. By “publishable” I mean TEI that can be loaded into some existing publication system (something like TEI Publisher, Edition Visualization Technology (EVT), or TEI Boilerplate), or that you could process yourself in some other way.

Why might you want to use such a workflow? In my experience, it’s useful when you have a person or people who are designated as transcribers, but who aren’t comfortable or interested in encoding in XML. Microsoft Word is ubiquitous, so pretty much everyone in academia uses it and has access to it. For people who don’t want to work with pointy brackets but still want to collaborate on a digital editing project, a workflow that converts Microsoft Word to TEI can be very useful. (I have also used this workflow myself, even though I’m capable of hand-encoding XML, just because there are times when I’d rather just do it in Word. YMMV!)

I think the workflow works best when there is one person designated to do all the conversion at the end (steps 4 and 5) and any number of people involved in the first three steps. The workflow could be used in the classroom as a group project (where the students model the TEI, plan the pseudocodes, and do the encoding, and one student or the instructor does the conversion work at the end) although I’ve only used it for non-classroom editing projects.

There are a few things you need in order to be successful with this workflow:

  1. You need a team that knows TEI. This doesn’t mean they need to know XML! (Although yes, you will need someone on your team who knows XML, but that’s not related to TEI) You need to know TEI basics –  what tags and attributes are, how modules and classes work – and you need these because you need to know what TEI tags you want in your final document before you start transcribing.
  2. Microsoft Word (obviously)
  3. OxGarage conversion tools. OxGarage is a service of the Text Encoding Initiative, which provides scripts for converting between a variety of text formats, including MS Word to TEI.
  4. OxygenXML Editor (or an XML editor of your choice). OxygenXML is popular with the TEI community, and it has the find & replace functionality that is required by this workflow. BBEdit is another XML/text editor that I use a lot, and it has a great find & replace functionality, but it doesn’t work as well for this workflow for reasons I’ll describe later in this post.

The steps of the workflow are (briefly):

  1. Model your TEI.
  2. Create pseudocodes to “tag” in MS Word.
  3. Transcribe in MS Word, using the pseudocode “tags” to indicate those things that will eventually be converted into TEI.
  4. Convert the finished MS Word document into TEI using OxGarage.
  5. Use find & replace in OxygenXML to convert the pseudocode tags into TEI tags, resulting in well-formed and complete TEI.

In more detail:

Model your TEI

The very first thing you need to do is to decide what you need your finished TEI to be able to do. If you’re working with an existing system (e.g., if you know you’ll be publishing in EVT at the end) some of your decisions will be made for you, because you will need to have TEI code that the system can use.[1] Are you encoding abbreviations, and if so are you going to tag the entire word or just the abbreviation and expansion? Are you going to normalize spelling, and if so are you going to do it silently or tag it? Are there marginalia you want to include in your TEI code? Do you want to include editorial notes?

Make a list of everything you need in your TEI, which TEI tags you plan to use, and how you plan to use them. You’ll need this to do the next step of the workflow, the creation of pseudocodes.

Create pseudocodes and tag them in MS Word

Pseudocodes are what I call non-TEI formatting elements that are used to set text apart, and are later processed into TEI tags. Pseudocodes can be divided into two main types: native MS Word formatting (italics, underlining, superscript, etc.) and punctuation marks.

Native MS Word Formatting

Native formatting in Word

MS Word formatting is converted by OxGarage into TEI <hi> tags with the relevant values for @rend. For example, Italics converts to <hi rend="italic">Italics</hi>, Bold converts to <hi rend="bold">Bold</hi>, Underscore converts to <hi rend="underline">Underscore</hi>, Strikethrough converts to <hi rend="strikethrough">Strikethrough</hi>, Red text converts to <hi rend="color(FF0000)">Red text</hi>, Yellow highlight (not an option in WordPress) converts to <hi rend="background(yellow)">Yellow highlight</hi>, Superscript converts to <hi rend="superscript">Superscript</hi>, and Subscript converts to <hi rend="subscript">Subscript</hi>.

 

This is of course useful if you want these exact tags reflected in your final TEI, but once the TEI comes out of OxGarage, you can use the find and replace function in OxygenXML (or some other text/XML editor) to convert these tags into other tags. More on this below.

Native MS Word formatting works very well and can represent a very large number of TEI tags (using just text color and highlight would give you 75 pseudocodes mapping to 75 TEI tags or tag/attribute combinations), but there are definitely cases when you would want to use punctuation marks instead.

Punctuation Marks

You can use punctuation marks to set text apart that might not correspond 1:1 with a TEI tag. These are cases, such as expanded abbreviations or corrected readings, where you need tags nested within tags. Brackets work particularly well for this, especially various combinations of brackets. You do need to be careful about configuring bracket combinations, particularly when you’ll have brackets nested within brackets, and (as will also be mentioned later) the order in which you find & replace brackets later will also be relevant. This isn’t a matter to be taken lightly. You should test your pseudocodes and find & replace expressions on a section of text before encoding a full text.

Here is an example using the first line of Genesis 3, from University of Pennsylvania MS. Codex 236, fol. 31r.

Genesis 3:1, UPenn Ms. Codex 236, fol. 31r

The text in this line reads:

sed et serpens erat callidior cūctis aīantib t̄

This includes a number of abbreviations that we could expand silently, or we could encode them in TEI in a few different ways. (For more information see the TEI Guidelines 11.3.1.2, “Abbreviations and Expansion”) Options include:

Noting that a word contains an abbreviation, without expanding it. In this example we put <abbr> tags around the complete word, and <am> tags around the abbreviated letter:

sed et serpens erat callidior <abbr>c<am>ū</am>ctis</abbr> <abbr>a<am>ī</am>anti<am>b</am></abbr> <abbr><am>t̄</am></abbr>

In Word, you might choose pseudocodes using nested brackets. In this case, [[]] will be converted later into <abbr></abbr>, and [] (nested within [[]]) will be converted to <am></am>:

sed et serpens erat callidior [[c[ū]ctis]] [[a[ī]anti[b]]] [[[t̄]…]]

Alternatively, you might choose to encode both abbreviation and expansion, and enable the system to choose between them. In this example we add <expan> and <ex> tags to the mix alongside <abbr> and <am>, and then include <choice> to make it clear that the abbreviations and expansions come in pairs:

sed et serpens erat callidior
<choice>
  <abbr>c<am>ū</am>ctis</abbr>
  <expan>c<ex>un</ex>ctis</expan>
</choice>
<choice>
  <abbr>a<am>ī</am>anti<am>b</am></abbr>
  <expan>a<ex>nim</ex>anti<ex>bus</ex></expan>
</choice>
<choice>
  <abbr><am>t̄</am></abbr>
  <expan><ex>ter</ex></expan>
</choice>

As above, you can come up with combinations of marks that you can use to indicate the encoding. In this case <abbr> and <am> are encoded as above, || will later be converted to <choice>, {{}} will be converted later into <expan></expan>, and {} (nested within {{}}) will be converted to <ex></ex>:

sed et serpens erat callidior |[[c[ū]ctis]]{{c{un}ctis}}| |[[a[ī]anti[b]]]{{a{nim}anti{bus}}}| |[[[t̄]]]{{{ter}…}}|

I like to group brackets of the same type together (as here, where square brackets are used for abbreviations, curly brackets for expansions, and pipes for choice) but you can also combine them in various ways for more options. For example, here are the bracketing options for a project I’m currently working on:

In all cases you need to be very careful that the punctuation marks you use don’t appear in your text, or only use them in combinations that don’t appear in your text, or else you will accidentally create TEI tags where you don’t want them.
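One way to test a transcription before conversion is to check that every pseudocode bracket is closed. What follows is a minimal sketch in Python, offered as an illustration rather than as part of the workflow itself. The function name check_pseudocodes is hypothetical, and the sketch assumes the scheme used above: square and curly brackets that nest, and pipes that come in pairs.

```python
def check_pseudocodes(text):
    """Return a list of problems found in the bracket pseudocodes of `text`."""
    pairs = {"]": "[", "}": "{"}
    problems = []
    stack = []  # opening brackets still waiting for their close
    for position, char in enumerate(text):
        if char in "[{":
            stack.append((char, position))
        elif char in pairs:
            if stack and stack[-1][0] == pairs[char]:
                stack.pop()
            else:
                problems.append(f"unexpected {char} at position {position}")
    for char, position in stack:
        problems.append(f"unclosed {char} at position {position}")
    # Pipes open and close with the same character, so just count them.
    if text.count("|") % 2 != 0:
        problems.append("odd number of | characters")
    return problems


line = "|[[c[ū]ctis]]{{c{un}ctis}}| |[[a[ī]anti[b]]]{{a{nim}anti{bus}}}|"
print(check_pseudocodes(line))           # an empty list: nothing to fix
print(check_pseudocodes("[[c[ū]ctis]"))  # reports the bracket left open
```

A check like this will not tell you whether a bracket in the text is a pseudocode or a character from the source, so it complements, rather than replaces, choosing marks that don't appear in your text.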

Convert Word to TEI in OxGarage

Once you’ve transcribed and entered pseudocodes in MS Word, it’s time to convert your Word file into TEI. You can do this using OxGarage, a conversion service provided by the TEI. OxGarage has an online interface where you can convert one document at a time, described here, but you can also download the XSLTs from GitHub and run bulk conversion processes (converting multiple files at one time).

The online OxGarage interface is at http://www.tei-c.org/oxgarage/. You need to indicate that you are converting Documents, then select your options from (Microsoft Word doc or docx) and to (TEI P5), then load in your Word document and click “Convert”. Here is my input file (so you can download it and try this yourself), and a screenshot: 
OxGarage will generate a TEI file with a template header (including information gleaned from the Word document) and the textual content of the Word doc converted into very basic TEI (file here; you’ll need to change the file extension to .xml):

You can see here how the Word comment is converted into a <note> nested in <hi>, with a <date> included. I also included some red text (to indicate the Tironian et symbol) which has converted as expected. The punctuation mark pseudocodes are unchanged.

Replace Pseudocodes with TEI Tags

This is where we convert those pseudocodes – the <hi> color tags and the combinations of punctuation marks – into TEI. I like to do this in OxygenXML, because that software has an advanced find & replace that enables you to search using regular expressions, including the ability to save pieces of what is being searched and reuse them in the replace (a bit like setting a variable in the search).[2]

As mentioned above, the order in which you replace tags matters. You will always want to replace the outermost pseudotags first, then the interior ones, because the find & replace will always match from the first instance of a character in the regular expression to the last instance. This means that if you have […] (for <am>) nested inside [[…]] (for <abbr>) you need to replace the [[…]] before the […] or else you will end up with <am> around the word, and there will be no match when you then search for [[…]].

For example, to find [[…]] and replace it with <abbr>, using Find/Replace with the “Regular Expression” and “Dot matches all” boxes checked, you would search for:

\[\[([^\s]*)\]\] (this is a regular expression that will find every instance of a string of any characters except whitespace ([^\s]), enclosed by [[ and ]]. The central part of the expression is enclosed in parentheses because we’re going to reuse it in the replace. The square brackets are preceded by \ to ensure they are treated as literal characters and not as part of the regular expression syntax)

And replace that with

<abbr>$1</abbr> (This will replace the [[ and ]] with the opening and closing tags, and copy everything else in the middle – $1 refers to the piece of the search that was enclosed in parentheses)
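To see why the outer pseudocode has to be replaced first, the same search can be tried outside OxygenXML. This is a minimal sketch using Python's re module, where the replacement group is written \1 rather than $1. The single-bracket expression is an assumed counterpart of the double-bracket one above, written only for this demonstration.

```python
import re

word = "[[c[ū]ctis]]"

# Inner pseudocode first: the single-bracket search runs from the first [
# to the last ], so it swallows the outer pair as well.
inner_first = re.sub(r"\[([^\s]*)\]", r"<am>\1</am>", word)
print(inner_first)

# Outer pseudocode first: only the double brackets are consumed, and the
# inner [ ] pair is left in place for a later replace.
outer_first = re.sub(r"\[\[([^\s]*)\]\]", r"<abbr>\1</abbr>", word)
print(outer_first)
```

In the first result the whole word is wrapped in <am> and no [[ is left to search for, which is the failure described above.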

Unfortunately, if you have an abbreviated word that starts on one line and ends on the following line (as we do here – the last word on this line is terre, but the re are on the next line) this regular expression won’t catch it because it stops at any space or line break. So I do two sets of finds for each set of pseudocodes: one using the expression above, which excludes spaces, and a second one which includes spaces.

\[\[(.*)\]\] (replace with <abbr>$1</abbr>) as above

You don’t want to include spaces in your first search because if you have multiple sets of the same pseudocode in your document (which you probably do), the regular expression will include all the spaces so will only find the very first and the very last instance of the double brackets and you’ll end up with this:

The regular expression has matched the first [[ (on line 43) and the last ]] (on line 46), but there are many in between that are missed because spaces are included.
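The difference between the two searches can be reproduced on a small sample. This is a sketch in Python under stated assumptions: the sample string is hypothetical (one abbreviated word on a single line, one broken across two lines), and re.DOTALL stands in for OxygenXML's “Dot matches all” option.

```python
import re

# Hypothetical sample: the second word breaks across a line ending.
text = "[[c[ū]ctis]] et [[ter\nre]]"

no_spaces = r"\[\[([^\s]*)\]\]"
with_spaces = r"\[\[(.*)\]\]"

# The space-inclusive search on its own runs from the first [[ to the last ]].
greedy_only = re.sub(with_spaces, r"<abbr>\1</abbr>", text, flags=re.DOTALL)
print(repr(greedy_only))

# The first pass handles every word without whitespace; the second pass
# then picks up the one word that spans the line break.
first_pass = re.sub(no_spaces, r"<abbr>\1</abbr>", text)
second_pass = re.sub(with_spaces, r"<abbr>\1</abbr>", first_pass, flags=re.DOTALL)
print(repr(second_pass))
```

If more than one line-spanning pair is still left after the first pass, the second search would again stretch from the first to the last. A non-greedy (.*?) is one way around that, though it is a substitution and not what the workflow above uses.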

Starting with the first search followed immediately with the second gives you:

Similarly, tag abbreviations by replacing |…| with <choice></choice> and {{…}} with <expan></expan> – all the outer nesting has been replaced with TEI tags:

When you have multiple codes that may be nested in a single tag (as the multiple [] and {} now within <abbr> and <expan>) you need to modify the regular expression again, so it catches every matching pair of brackets.

\{([^\s\}]*)\} (Note the \} now within the square brackets. This will keep the expression from moving past the first closing bracket)
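Here is a short Python sketch of that last step, run on one word whose outer pseudocodes have already been replaced. The curly-bracket expression is the one above; the square-bracket expression is an assumed counterpart built the same way, and \1 is Python's spelling of $1.

```python
import re

text = "<abbr>a[ī]anti[b]</abbr><expan>a{nim}anti{bus}</expan>"

# Excluding the closing bracket from the character class stops each match
# at the first ] or }, so every pair is replaced separately.
text = re.sub(r"\[([^\s\]]*)\]", r"<am>\1</am>", text)
text = re.sub(r"\{([^\s\}]*)\}", r"<ex>\1</ex>", text)
print(text)
```

Without the \] and \} inside the character classes, the two pairs inside each word would be treated as a single match running from the first opening bracket to the last closing one.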

The result is a complete set of TEI tags encoding abbreviations and expansions (result file here, change the file extension to .txt).

 

You can also use OxygenXML’s find & replace function to replace the pseudocode TEI tags, or you can be fancy and write an XSLT to do that work. In this example, I want to replace the <hi rend="color(FF0000)"> with <g ref="#t_et"> (I’ll add a corresponding <glyph> tag to the <charDecl> section of the header as described in 5.5.2 of the TEI Guidelines). This is fairly straightforward: since I know the content of the tag will always be “et”, I can do a find & replace for the whole thing. If the content of the tag varies, I can use a regular expression as I did above to copy content from find to replace.
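For the case where the content of the tag varies, the replacement might look something like this in Python. It is only a sketch: the sample line is hypothetical, and the non-greedy (.*?) is my own choice so that several red passages on one line are handled separately.

```python
import re

# Hypothetical line of OxGarage output containing one red-text pseudocode.
tei = 'sed <hi rend="color(FF0000)">et</hi> serpens'

# The parentheses in color(FF0000) are escaped so they are read literally;
# the captured content is copied into the new <g> element.
pattern = r'<hi rend="color\(FF0000\)">(.*?)</hi>'
converted = re.sub(pattern, r'<g ref="#t_et">\1</g>', tei)
print(converted)
```

Any other <hi> tags in the file are left alone, because the search names the red color value explicitly.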

And that’s the workflow. It’s still a lot of work: you need a strong handle on the TEI and you need to plan everything in advance. But if you are working with a large number of people transcribing, and advanced TEI training isn’t possible or desirable, it can be very useful.

[1] EVT for example has specific requirements for tags it can process and how those tags need to be formatted, and if your TEI doesn’t meet those requirements it won’t work out of the box – you’ll need to modify the EVT code to suit your TEI.

[2] For more information and tutorials on Regular Expressions, visit https://regexone.com/.

Ceci n’est pas un manuscrit: Summary of Mellon Seminar, February 19th 2018

This post is a summary of a Mellon Seminar I presented at the Price Lab for Digital Humanities at the University of Pennsylvania on February 19th, 2018. I will be presenting an expanded version of this talk at the Rare Book School in Philadelphia, PA, on June 12th, 2018.

In my talk for the Mellon Seminar I presented on three of my current projects, talked about what we gain and lose through digitization, and made a valiant attempt to relate my talk to the theme of the seminars for this semester, which is music and sound. (The page for the Mellon Seminars is here, although it only shows upcoming seminars.) I’m not sure how well that went, but I tried!

I started my talk by pointing out that medieval manuscripts are physical objects – sometimes very large objects! They have weight and size and heft, and unlike static objects like sculptures, manuscripts move. They need to move in order for us to read them. But digitized manuscripts – the ones you find for example in Penn in Hand, the page-turning interface for Penn’s digitized manuscript collection – don’t really move. Sure, we have an interface that gives the impression of turning the pages of the book, but those images are flat, static files that are just the latest version in a long history of facsimile copies of manuscripts. A page-turning interface for medieval manuscripts is the equivalent of taking a book, cutting the pages out, and then pasting those pages into a photo album. You can read the pages but you lose the sense of the book as a physical object.

It sounds like I’m complaining, but I’m really not. I like that digital photographs of manuscripts are readily available and relatively standard, but I do think it’s vitally important that people using them are aware of how they’re different from the “real” manuscript. So in my talk I spent some time deconstructing a screenshot from a manuscript in Penn in Hand (see above). It presents itself as a manuscript opening (that is, two facing pages), but it should be immediately apparent that this is a fake. This isn’t the opening in the book, it’s two photos placed side-by-side to give the impression of the opening of the book. There is a dark line down the center of the window which clearly delineates the photo on the left and the one on the right. You can see two gutters – the book only has one, of course, but each photo includes it – and you can also see a bit of the text on the facing page in each photo. From the way the text is angled you can tell that this book was not laid flat when it was photographed – it was held at or near a 90 degree angle (and here’s another lie – the impression that the page-turning interface gives us is that of a book laid flat. Very few manuscripts lie flat. So many lies!).

We can see in the left-hand photo the line of the edge of the glass, to the right of the gutter and just to the left of the black line. In our digitization lab we use a table with a spring-loaded top and a glass plate that lies on the page to hold it flat. (You can see a two-part demo of the table on Facebook, Part One and Part Two.) This means the photographer will always know where to focus the camera (that is, at the level of the glass plate), and as each page of the book is turned the pages are the same distance from the camera (hence the spring under the table top). I think it’s also important to know that when you’re looking at an opening in a digital manuscript, the two photos in that composite view were not taken one after the other; they were possibly taken hours apart. In SCETI, the digitization lab in the Penn Libraries, all the rectos (that is, the front of the page) are taken at one time, and then the versos (the back of the page) are taken, and then the system interleaves them. (For an excellent description of digital photography of books and issues around it please see Dr. Sarah Werner’s Pforzheimer Lecture at the Harry Ransom Center on Early Digital Facsimiles.)

I moved from talking about how digital images served through page-turning interfaces provide one kind of mediated (~fake~) view of manuscripts to one of my ongoing projects that provides another kind of mediated (also fake?) view of manuscripts: video. I could talk and write for a long time about manuscript videos, and I am trying to summarize my talk and not present it in full, so I’ll just say that one advantage that videos have over digitized images is that they do give an impression of the “real” manuscript: the size of them, the way they move (Is it stiff? How far can it open? Is the binding loose or tight?), and – relevant to the Seminar theme! – how they sound. I didn’t really think about it when I started making the videos four years ago, but if you listen carefully in any of the videos you can hear the pages (and in some cases the bindings), and if you listen to several of them you can really tell the difference between how different types of parchment and paper sound. Our complete YouTube playlist of video orientations is here, but I’ll embed one of my favorites here. This is LJS 280, a 13th century copy of Decretales Gregorii IX in a 15th century chain binding that makes a lot of noise.

I don’t want to imply that videos are better than digital images – they just tell us something that digital images can’t. And digital images are useful in ways that videos aren’t. For one thing, if you’re watching a video you can see the way the book moves, but I’m the one moving it. It’s still a mediated experience, it’s just mediated in a different way. You can see how it moved at a specific time, in a specific situation, with a specific person. If you want to see folio 45v, you’re out of luck, because I didn’t turn to that page (and even if I had, the video resolution might not be high enough for you to read it; the video isn’t for reading – that’s why we have the digital images).

There’s another problem with videos.

In four years of the video orientation program, we have 74 videos online. We could have more if we made it a higher priority (and arguably we should), but each one takes time: for research, to set up and take down equipment, for the recording (sometimes multiple takes), and then for the processing. The videos are also part of the official record of the manuscript (we load them into the library’s institutional repository and link them to records in the library’s catalog) and doing that means additional work.

At this point I left videos behind and went back to digital images, but a specific project: Bibliotheca Philadelphiensis, which we call BiblioPhilly. BiblioPhilly is a major collaborative project to digitize medieval manuscripts from institutions across Philadelphia, organized by the Philadelphia Area Consortium of Special Collections Libraries (PACSCL) and funded by the Council on Library and Information Resources (CLIR). We’re just entering year three of a three-year grant, and when we’re done we’ll have 476 manuscripts online (we have around 130 online now). If you’re interested in checking out the manuscripts that are online, and to see what’s coming, you can visit our search and browse site here.

The relevance of BiblioPhilly in my talk is that we’re being experimental with the kind of data we’re creating in the cataloging work, and with how we use that data to provide new and different manuscript views.

Manuscript catalogers traditionally examine and describe the physical structure of the codex. Codex manuscripts start as sheets of parchment or paper, which are stacked and folded to create booklets called quires. Quires are then gathered together and sewn together to make a text block, then that is bound to make the codex. So describing the physical structure means answering a few questions: How many quires? How many leaves in each quire? Are there leaves that are missing? Are there leaves that are singletons (i.e., were never part of a sheet)? When a cataloger has answered these questions they traditionally describe the structure using a collation formula. The formula will list the quires, number of leaves in a quire, and any variations. For example, a manuscript with 10 quires, all of which have eight leaves except for quire six which has four, and there are some missing leaves, might have a formula like this:

1-4(8), 5(8, -4,5), 6(4), 7-10(8)

(Quires 1 through 4 have eight leaves, quire 5 had eight leaves but four and five are now missing, quire 6 has four leaves, and quires 7-10 have eight leaves)
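To make the notation concrete, here is a minimal Python sketch that expands the example formula into one record per quire. It handles only the pattern used in this one example (a quire or range of quires, a leaf count in parentheses, and an optional list of missing leaves after a minus sign). The function name parse_collation is hypothetical, and this is not the VisColl data model.

```python
import re


def parse_collation(formula):
    """Expand a formula like the example above into one record per quire."""
    quires = []
    # Each group looks like "1-4(8)" or "5(8, -4,5)".
    groups = re.findall(r"(\d+)(?:-(\d+))?\(([^)]*)\)", formula)
    for first, last, inside in groups:
        # The leaf count comes before the first comma; any numbers after
        # it are the leaves now missing from the quire.
        leaves, _, variations = inside.partition(",")
        missing = [int(n) for n in re.findall(r"\d+", variations)]
        for number in range(int(first), int(last or first) + 1):
            quires.append({"quire": number,
                           "leaves": int(leaves),
                           "missing": list(missing)})
    return quires


quires = parse_collation("1-4(8), 5(8, -4,5), 6(4), 7-10(8)")
surviving = sum(q["leaves"] - len(q["missing"]) for q in quires)
print(len(quires), "quires,", surviving, "surviving leaves")
```

For the example this counts 10 quires and 74 surviving leaves: eight quires of eight leaves, one quire of eight with two leaves missing, and one quire of four.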

The formula is standardized for printed books, but not for manuscripts.

Using tools developed through the research project VisColl, which is designing a data model and system for describing and visualizing the physical construction of manuscripts, we’re building models for the manuscripts as part of the BiblioPhilly cataloging process, and then using those models to generate the formulas that go into our records. This itself is good, but once we have models we can use them to visualize the manuscripts in other ways too. So if you go to the BiblioPhilly search and browse site and peek into the records, you’ll find that some of them include links to a “Collation View”

Following that link will take you to a page where you can see diagrams showing each quire, and image files organized to show how the leaves are physically connected through the quire (that is, the sheets that were originally bound together to form the quire).
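The way leaves connect through a quire can be sketched in a few lines of Python. This assumes a regular quire in which every leaf was originally half of a folded sheet, so the first leaf is joined to the last, the second to the second-to-last, and so on. The function name conjoined_pairs is hypothetical, and quires with singletons would need a richer model than this.

```python
def conjoined_pairs(leaf_count, missing=()):
    """Pair up the leaves of a regular quire, outermost sheet first.

    A leaf listed in `missing` is shown as None.
    """
    if leaf_count % 2 != 0:
        raise ValueError("a quire with singletons needs a richer model")
    pairs = []
    for left in range(1, leaf_count // 2 + 1):
        right = leaf_count + 1 - left  # the leaf on the other half of the sheet
        pairs.append((None if left in missing else left,
                      None if right in missing else right))
    return pairs


# Quire 5 from the example formula: eight leaves, the fourth and fifth missing.
print(conjoined_pairs(8, missing=(4, 5)))
```

In that example the two missing leaves turn out to be the two halves of the innermost sheet, which is the kind of thing a diagram makes visible at a glance and a formula does not.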

Like the page-turning interface, this is giving us a false impression of what it would be like to deconstruct the manuscript and view it in a different way, but like the video it is also giving us a view of the manuscript that is based in some way on its physicality.

And this is where my talk ended. We had a really excellent question and answer session, which included a question about why I don’t wear gloves in the videos (my favorite question, which I answer here with a link to this blog post at the British Library) but also a lot of great discussion about why we digitize, and how, and why it matters, and how we can do it best.

Thanks so much to Glenda Goodman and Stewart Varner for inviting me, and to everyone who showed up.

 

Dot’s Twitter Bots

I made some Twitter bots! It was mostly very easy.

The bots I made use Zach Whalen’s SSBot, documented in “How to Make A Twitter Bot with Google Spreadsheets version 0.4” which includes all the information you need about how to link your Twitter account to the spreadsheet and start the bot tweeting. The only thing I’ll note is that the Spreadsheet’s “Project Key” (asked for in Step 4) is deprecated; you’ll need to use the Script ID instead (it’s located directly under the Project Key in the Spreadsheet’s Project Properties).

Once you link the Twitter account to the SSBot, you enter data in the spreadsheet and that data is what gets tweeted.

Here’s a list of bots I made:

For all but WhyBeBot I generated a list of strings of 140 characters or fewer and pasted it into column one of the “Select from Columns” tab in the SSBot spreadsheet. This was really the most difficult and interesting part, because in each case I had to figure out how to download and process the texts. For example, for CollationBot I had to figure out how to pull out just the collation formulas from the records, while for the full-text bots I had to download the texts, find the sentences, and ideally find sentences that were 140 characters or fewer (if you pay attention you can see that these bots were created over time, and I got much better later on about including only complete sentences). Clearly most of these bots were made before Twitter increased to 280 characters; I may go back and lengthen the strings someday.
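The sentence-finding step for the full-text bots might look something like the following in Python. This is a rough sketch and not the processing actually used for these bots: the function name tweetable_sentences is hypothetical, and the split on end punctuation is naive (it will trip over abbreviations such as “fol.”).

```python
import re


def tweetable_sentences(text, limit=140):
    """Split `text` into rough sentences and keep those within `limit`."""
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s and len(s) <= limit]


sample = "I made some Twitter bots! It was mostly very easy. " + "x" * 150 + "."
for sentence in tweetable_sentences(sample):
    print(sentence)
```

The resulting list can be written out one string per line and pasted into the spreadsheet column; raising limit to 280 would be the change needed for the longer tweets.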

WhyBeBot is a bit different. It takes advantage of SSBot’s ability to mix content among columns. Instead of just one column, WhyBeBot has four columns. The first contains only “Why be ” while the third contains only “when you can be “, and the second and fourth both have a randomly-generated list of a few hundred adjectives.
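The column mixing can be imitated in Python to see what kind of lines it produces. This is only a sketch of the idea and not SSBot's own code: the function name why_be and the three placeholder adjectives are hypothetical, and joining the cells with single spaces is an assumption.

```python
import random


def why_be(adjectives, rng=random):
    """Build one line by taking one cell from each of four columns."""
    columns = [
        ["Why be"],           # column one: a single fixed cell
        adjectives,           # column two: the adjective list
        ["when you can be"],  # column three: a single fixed cell
        adjectives,           # column four: the same adjective list
    ]
    return " ".join(rng.choice(column) for column in columns)


# Placeholder adjectives; the real list has a few hundred.
print(why_be(["medieval", "digital", "uncanny"]))
```

Because columns two and four draw independently from the same list, the same adjective can turn up on both sides of a line.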

There are many other ways to make Twitter bots (I know that a lot of people have had good luck with Cheap Bots Done Quick – I’ve never tried it, maybe someday). I would like to do more bots; the setup is pretty simple and getting the content situated is a fun challenge.

Slides from OPenn Demo at the American Historical Association Meeting

This week I participated in a workshop organized by the Collections as Data project at the annual meeting of the American Historical Association in Washington, DC. The session was organized by Stewart Varner and Laurie Allen, who introduced the session, and the other participants were Clifford Anderson and Alex Galarza.

The stated aim of the session was “to spark conversations about using emerging digital approaches to study cultural heritage collections,” (I’ll copy the full workshop description at the end of this post) but all of our presentations ended up focusing on the labor involved in developing our projects. This was not planned, but it was good, and also interesting that all of us independently came to this conclusion.

Clifford’s presentation was about work being done by the Scholarly Communications team at Vanderbilt University Libraries as they convert data from legacy projects (which have tended to be purpose built, siloed, and bespoke) into more tractable, reusable open data, and Alex told us about the GAM Digital Archive Project, which is digitizing materials related to human rights violations in Guatemala. Both Clifford and Alex stressed the amount of time and effort it takes to do the work behind their projects. The audience was mainly history faculty and maybe a few graduate students, and I expect they, like me, wanted to make sure the audience understood that the issue of where data comes from is arguably more important than the existence of the data itself.

My own talk was about the University of Pennsylvania’s OPenn (Primary Digital Resources for Everyone), which if you know me you probably already know about. OPenn is the website in which the Kislak Center for Special Collections, Rare Books and Manuscripts publishes its digitized collections in the public domain, as well as hosting collections for many other institutions. This includes several libraries and archives around Philadelphia who are partners on the CLIR-funded Bibliotheca Philadelphiensis project (a collaboration with Lehigh University, the Free Library of Philadelphia, Penn, and the Philadelphia Area Consortium of Special Collections Libraries), which I always mention in talks these days (I’m a co-PI and much of the work of the project is being done at Penn). I also focused my talk on the labor of OPenn, mentioning the people involved and including slides on where the data in OPenn comes from, which I haven’t mentioned in a public talk before.

Ironically I ended up spending so much time talking about what OPenn is and how it works that I didn’t have time to show much of the data, or what you can do with it. But that ended up fitting the (unplanned) theme of the workshop, and the attendees seemed to appreciate it, so I consider it a success.

Here are my slides:

Workshop abstract (from this page):

The purpose of this workshop is to spark conversations about using emerging digital approaches to study cultural heritage collections. It will include a few demonstrations of history projects that make use of collection materials from galleries, libraries, archives, or museums (GLAM) in computational ways, or that address those materials as data. The group will also discuss a range of ways that historical collections can be transformed and creatively re-imagined as data. The workshop will include conversations about the ethical aspects of these kinds of transformations, as well as the potential avenues of exploration that are opened by historical materials treated as data. Part of an IMLS-funded National Digital Forum grant, this workshop will ultimately inform the development of recommendations that aim to support cultural heritage community efforts to make collections more readily amenable to computational use.

Reaction, a Mémoire

For the #madememedieval hashtag currently going around Twitter, here’s the story of how I became a medievalist (although I didn’t realize it until much later). This is part of the Preface to Reactions Medieval/Modern, the catalog for the exhibit I curated at the University of Pennsylvania Libraries in Fall 2016.

When I was eleven years old, my parents brought my brother (who would have been thirteen at that time) and me to England for two weeks during the summer. They rented a house in the southwest corner of the country, not far from Bath, and borrowed a car. We went all over the place; I remember Salisbury and Stonehenge, Wells Cathedral and Bath Abbey. I also remember riding in the back seat down a particularly narrow road surrounded by trees and fields and pointing out the funny stones the cows were grazing around, at which point my father remarked that we were probably getting close to Avebury. But one memory of that trip stands out above all the rest: The Castle. Over the years, The Castle in my mind has grown to almost mythical proportions as I’ve come to realize (even more so over the past couple of years as I have been preparing for this exhibition) that it marks the point at which I was destined to become a medievalist. My reaction to The Castle was an epiphany, my path set in childhood—and I didn’t realize it until almost thirty years later.

In my memory, we visited The Castle toward the end of the afternoon. I was probably tired and grouchy, although I don’t remember that. (I spent much of this trip tired and grouchy.) I do remember a small town, walking through a residential area with lots of houses, turning the corner, and all of a sudden there it was. It was very different from Warwick Castle, which we’d visited earlier in our trip and which I’d found dull and crowded and ugly. This one was small. I remember a tower, and a demolished wall; it was a ruin. There was no one else around, so my brother and I climbed on the broken walls and ran around and basically acted like kids.

At some point, I noticed that the interior of the tower, which was several meters tall, had regular sets of holes around the perimeter, several feet apart horizontally and several more running vertically all the way to the top. I asked about the holes, and someone told me that wooden beams would have gone through those holes, serving as supports for floors. And I remember being struck very suddenly that people had lived here. I was standing in this ruined tower, we were using it essentially as a playground, and yet hundreds of years ago people had made this place their home.

That experience was the first time I can remember having a visceral reaction to a physical object, a reminder that this object was not just the thing we have today, but a thing that has existed over time and been touched by so many hands and lives before it came to us, and will continue touching people long after we are gone.

As my personal experience attests, reactions are both immediate and ongoing, with potentially long-term effects (on both people and objects). Not all premodern book owners wrote in their books, and not all modern artists look to medieval manuscripts for inspiration, but by looking at the various ways that medieval and modern people have reacted to manuscripts, we may come to appreciate these objects as more than simply bearers of information, or beautiful things for us to enjoy. They bear the marks of their own history, and they still have the potential to make history today and in the future.

The Castle: Nunney Castle in Southwest England, not far from Bath. Photograph by Hugh Llewelyn, licensed CC:BY:SA.

Hosting the Digital Rāmamālā Library at Penn, or, thinking about open licenses for non-Western digitized manuscripts

This talk was presented as part of a panel at the Global Digital Humanities Symposium at Michigan State University, March 16-17 2017: ARC Panel: Access, Data, and Collaboration in the Global Digital Humanities

My story begins in 2012, when Dr. Benjamin Fleming, Visiting Scholar in Religious Studies and Cataloger of Indic Manuscripts for the Kislak Center for Special Collections at the University of Pennsylvania, proposed and was awarded an Endangered Archives grant from the British Library. The main purpose of the grant was to write a catalog for the Ramamala Library, which is one of the oldest still-active traditional libraries in Bangladesh. A secondary part of this grant was to digitize around 150 of the most fragile manuscripts from the Ramamala Library, and an agreement was made that the University of Pennsylvania Libraries would be responsible for hosting these digital images. At this time, someone from the Penn libraries recommended to Dr. Fleming that they get a Creative Commons license, and the non-commercial license was given as an example. The proposal went forward with a CC-NC license, which both the Penn Libraries and the Ramamala Library agreed to, and everything was fine.

So a bit of historical logistics might be helpful here. 2012, the year this proposal was agreed to, was one year before the Schoenberg Institute for Manuscript Studies (SIMS) was founded at Penn. One of the very first things that SIMS was tasked with doing – one of the things it was designed to do – was to create some kind of open access portal to enable the reuse of digital images of our medieval manuscripts. Penn has been digitizing our manuscripts and posting them online since the late 1990s, and in 2013 all of them were online in a system called Penn in Hand. Penn in Hand is a kind of black box – you can see the manuscripts in there, search for them, navigate them, but if you want to publish an image in a book or use them in a project, you have to do some work to figure out what’s allowed in terms of licensing, and then figure out how to get access to high-resolution images that are going to be usable for your needs.

It took us a couple of years, but on May 1, 2015, we launched OPenn: Primary Digital Resources Available for Everyone, as a platform not for viewing images, but explicitly for downloading and reusing images and metadata. OPenn includes high-resolution master TIFF images and smaller JPEG derivatives, as well as robust metadata in TEI/XML using the Manuscript Description element. We started hosting our own data, but today we host manuscript data for 17 institutions in the Philadelphia area with others in the US and Europe (including Hebrew manuscripts from the John Rylands Library at the University of Manchester) to come online in the next year. One of the things we wanted was for users of OPenn to always be certain about what they could do with the data, so we decided that anything that goes into OPenn must follow those licenses that Creative Commons has approved for Free Cultural Works:

  • the CC Public Domain mark
  • CC0 (“CC-zero”), the Public Domain dedication for copyrighted works
  • CC-BY, the Creative Commons Attribution license
  • CC-BY-SA, the Creative Commons Attribution-Share Alike license

Note that licenses with a non-commercial clause are not approved for Free Cultural Works, and thus OPenn, by policy, is not able to host them.

You see where this is going.

So in March 2016, a year after OPenn was launched, and well after the Ramamala Library manuscripts had been photographed, Dr. Fleming asked about adding the Ramamala Library material to OPenn, in addition to having it in Penn in Hand (where it was already going online). It wasn’t until he asked this question that we realized, under current policy, we couldn’t include that material in OPenn because of the license. Over the next few weeks we (we being representatives of OPenn, and Dr. Fleming) had several conversations during which we floated various ideas:

  1. We could loosen the “Free Cultural Works” requirement and allow inclusion of the Ramamala data with the noncommercial license.
  2. We could build a parallel OPenn to contain data with a noncommercial license.
  3. We could use OPenn as a kind of carrot, to encourage the Ramamala library administration to loosen the noncommercial clause on the license and release the data as Free Cultural Works.

The third option was struck down almost as soon as it was suggested. There were, it turns out, highly sensitive discussions that had been happening at the Ramamala Library during the course of the project that would have made such a request difficult to say the least. As Dr. Fleming said in an email to me as we were talking about this talk, “It would be highly inappropriate and complex to try and revisit the copyright agreement as, even as it was, the act of a Western organization making digital copies of a small set of mss unraveled a dense set of internal issues related to private property and government control over cultural property (digital or otherwise).”

Before I return to our other two options I want to take a quick detour to talk a bit about how this conversation changed my thinking about Open Access in general, and about open access of non-Western material specifically.

I am by all accounts an evangelist for open access to medieval manuscript material. I like to complain: about institutions that keep their images under strict licenses, that make their images difficult or impossible to download, that charge hefty fees for manuscripts that have been digitized for years. OPenn is a reaction against that kind of thinking. We say: Here are our manuscript images! Here is our metadata! Here’s how you can download them. Do whatever you like with them. We own the books, but we acknowledge that they are our shared cultural heritage and in fact they belong to all of us. So the very least we can do is give you digital copies.

Until Ramamala, I would have told you that it was necessary that digital images of every manuscript in every culture that was written before modern times should be in the public domain and available for everyone. The people who wrote them are dead, and they wouldn’t have had the same conception of copyright ownership in any case, so why not? But suddenly I wasn’t so sure. I was forced to move beyond thinking in very black and white terms about “old vs. new” to thinking in a more nuanced way about “old vs. new”, sure, but also about “what is yours vs. what is ours” – and what ownership of the physical means for ownership of the digital. Again, until Ramamala I would have told you that physical owners owe it to the rest of us to allow the digital to sit in the public domain. But what does this mean for countries who have suffered under colonialism, and who have been forced for the past however many hundred years or more to share their cultural heritage with the west? In my somewhat unstructured thoughts, I keep coming back to the Elgin Marbles, which is just one example of cultural vandalism by the west (in this case, with the assistance of the Ottoman Empire, which ruled Greece at the time the marbles were taken) but is a particularly egregious one. I’m sure you’re all familiar with the Elgin Marbles, which used to decorate buildings on the Acropolis, including the Parthenon, before they were removed to Britain between 1801 and 1812. They were later purchased by the British Museum, where they can be seen today, although the current government of Greece has urged their return for many years.

Now the situation of the manuscripts in the Ramamala Library isn’t the same as that of the Elgin Marbles – we aren’t suggesting to move the physical collection to Penn – but I can’t help but believe there is a parallel here, and particularly a case to be made for respecting the ownership of cultural heritage by the cultures that created the heritage. But what does that mean? Dr. Fleming’s comment above implies the complexities around this question. Does “the culture that created the heritage” mean the current government? Or the cultural institutions? Or the citizens of the countries, or the citizens of the culture in those countries, no matter where they live now? And once we figure out the who, how do we ensure they have power to make decisions about their heritage objects? But I have taken my tangent far enough and I want to get back to talking about our ideas for making the Ramamala Library data available in OPenn.

When we left off, we had two other ideas:

  1. We could loosen the “Free Cultural Works” requirement and allow inclusion of the Ramamala data with the noncommercial license.
  2. We could build a parallel OPenn to contain data with a noncommercial license.

We considered the first suggestion but decided very quickly that we didn’t want to open that can of worms. Our concern was that if we started allowing one collection with a noncommercial license, no matter what the circumstances, other institutions that would want to include such a clause could point to it and say “Well, them, why not us?” We have in fact used our “Free Cultural Works” only policy as a carrot for other institutions we host data for, including several museums and libraries in Philadelphia, and it works remarkably well (free hosting in exchange for an open license is apparently an attractive prospect). We don’t want to lose that leverage – particularly when it comes to institutions that own materials from former colonial countries, we have the ability and thus the responsibility to make that data available again, and it’s not a responsibility we take lightly.

The second idea, building a parallel OPenn to allow noncommercially licensed data, was more attractive but, we decided, just too much work at this time, with only one collection. However, it did seem like something that would be a good community project: an open access portal, similar to OPenn in design, but with policies designed for the concerns surrounding access and reuse for cultural heritage data from former colonial countries. If something like this is currently being designed I would love to hear about it, and I expect we would be very happy to put our Ramamala data into such a platform.

So what did we do? We decided to take the path of least resistance and not do anything. The Ramamala manuscripts are available on Penn in Hand, and, since the license information was entered into the MARC record notes field, the license is obvious (unlike most other materials in Penn in Hand). The images are still not easily downloadable, although we are working on a new page for the project through which people can request free access to the high-resolution TIFF files. However, they aren’t available on OPenn. And I’m not sorry about that. I’m not sorry that OPenn’s policy is strictly for “Free Cultural Works,” because I think within our community, serving mainly institutions in Philadelphia and other US and Western European cities, the policy helps us leverage collections into Open Access that might otherwise be under more strict licenses. But I do think it’s important for us to keep talking and thinking about how we can serve other communities with respect.

Edit: after posting this talk, I was contacted by Caroline Schroeder who pointed me to an article she wrote about similar issues with Coptic manuscripts. I share a link to that article here: Caroline T. Schroeder, “Shenoute in Code: Digitizing Coptic Cultural Heritage For Collaborative Online Research and Study” Coptica 14 (2015), 21-36.

The Historiography of Medieval Manuscripts in England (and the USA)

The text of a lightning talk originally presented at The Futures of Medieval Historiography, a conference at the University of Pennsylvania organized by Jackie Burek and Emily Steiner. Keep in mind that this was very lightly researched; please be kind.

Rather than the originally proposed topic, the historiography of medieval manuscript descriptions, I will instead be talking about the historiography of medieval manuscripts specifically in England and the USA, as perceived through the lens of manuscript descriptions.

We’ll start in the late 12th into the 15th century, when monastic houses cataloged the books in their care using little more than a shelf-list. Such a list would be practical in nature: the community needs to be able to know what books they own, so as books are borrowed internally or loaned to other houses (or perhaps sold) they have a way to keep track of them. Entries on the list would be very simple: a brief statement of contents, and perhaps a note on the number of volumes. There is, of course, an entire field of study around reconstructing medieval libraries using these lists, and as the descriptions are quite simple it is not an easy task.

c. 1190-1200. Cambridge, Jesus College MS 34, fol. 1r. First catalogue of the library of Rievaulx. (Plate 3 from The Libraries of the Cistercians, etc. Vol. 3 in Corpus of British Medieval Library Catalogues, 1992)
Late 13th c. Oxford, Bodleian Library MS Rawlinson B. 336, page 187. Catalogue of the library of St Radegund’s abbey at Bradsole. (Plate 5 from The Libraries of the Cistercians, etc. Vol. 3 in Corpus of British Medieval Library Catalogues, 1992)
1400. London, BL MS Additional 70507, fol. 2r. Description of the library at Titchfield (Plate 6 from The Libraries of the Cistercians, etc. Vol. 3 in Corpus of British Medieval Library Catalogues, 1992)

In the 15th and 16th centuries there were two major historical events that I expect played a major role both in a change in the reception of manuscripts, and in the development of manuscript descriptions moving forward: those are the invention of the printing press in the mid-15th century, and the dissolution of the monasteries in the mid-16th century. The first made it possible to relatively easily print multiple copies of the same book, and also began the long process that rendered manuscripts obsolete. The second led to the transfer of monastic books from institutional into private hands, and the development of private collections with singular owners. When it came to describing their books, these collectors seemed to be interested in describing for themselves and other collectors, and not only for the practical purpose of keeping track of them. Here is a 1697 reprint of a catalog published in 1600 of Matthew Parker’s private collection (bequeathed to Corpus Christi College Cambridge in 1574). You can see that the descriptions themselves are not much different from those in the manuscript lists, but the technology for sharing the catalog – and thus the audience for the catalog – is different.

1600. Ecloga Oxonio–Cantabrigiensis, tributa in libros duos, quorum prior continet catalogum confusum librorum manuscriptorum in illustrissimis bibliothecis, duarum florentissimarum Academiarum, Oxoniae et Cantabrigiae (London, 1600; reprinted in 1697)

In the later 16th and into the 17th century these private manuscript collections began to be donated back to institutions (educational and governmental), leading to descriptions for yet other audiences and for a new purpose: for institutions to inform scholars of what they have available for their use. The next three examples, from three catalogs of the Cotton Collection (now at the British Library), reflect this movement. The first is from a catalogue published in 1696. The content description is perhaps a bit longer than the earlier examples, and barely visible in the margin is a bit of a physical description: this is a codex with 155 folios. Notably this is the first description we’ve looked at that mentions the size of the book at all, so we are moving beyond a focus only on content. This next example, from 1777, is notable because it completely forefronts the contents. This catalog as a whole is organized by theme, not by manuscript (you can see below the contents listed out for Cotton Nero A. i), so we might describe it as a catalog of the collection, rather than a catalog of the manuscripts comprising the collection.

1696. Catalogue of the manuscripts in the Cottonian Library, 1696 (facsimile 1984)
1777. A catalogue of the manuscripts in Cottonian library: to which are added many emendations and additions. With an appendix containing an account of the damage sustained by the fire in 1731; and also a catalogue of the charters preserved in the same library. British Museum Dept. of Manuscripts, 1777

The third example is from the 1802 catalog, and although it’s still in Latin we can see that there is more physical description as well as more detail about the contents and appearance of the manuscript. There is also a citation to a book in which the preface on the manuscript has been published – the manuscript description is beginning to look a bit scholarly.

1802. A catalogue of the manuscripts in the Cottonian library deposited in the British museum : printed by command of His Majesty King George III. &c. &c. &c. in pursuance of an address of the House of Commons of Great Britain. British Museum Dept. of Manuscripts, 1802

We’ll jump ahead 150 years, and we can see in that time that concern with manuscripts has spread out from the institution to include the realm of the scholar. This example is from N.R. Ker’s Catalogue of Manuscripts Containing Anglo-Saxon. Rather than focusing on the books in a particular collection, it is focused on a class of manuscripts, regardless of where they are physically located. The description is in the vernacular, and has more detail in every regard. The text is divided into sections as well: General description; codicological description; discussion of the hands; and provenance.

1957. N. R. Ker, Catalogue of Manuscripts Containing Anglo-Saxon. Oxford, 1957.

And now we arrive at today, and to the next major change to come to manuscript descriptions, again due to new technology. Libraries around the world, including here at Penn, are writing our manuscript descriptions using code instead of on paper, and publishing them online along with digital images of the manuscript pages, so people can not only read about our manuscripts, but also see images of them and use our data to create new things. We use the data ourselves: for example, in OPenn (Primary digital resources available to everyone!) we build websites from our manuscript descriptions to make them available to the widest possible audience.

I want to close by giving a shout-out to the Schoenberg Database of Manuscripts, directed by Lynn Ransom, which is pushing the definition of manuscript descriptions in new scholarly directions. In the SDBM, a manuscript is described temporally, through entries that describe where a book was at particular moments in time (either in published catalogs, or through personal observation). As scholarly needs continue to change, and technology makes new things possible, the description of manuscripts will likewise continue to change around these, even as they have already over the last 800 years.

“Freely available online”: What I really want to know about your new digital manuscript collection

So you’ve just digitized medieval manuscripts from your collection and you’re putting them online. Congratulations! That’s great. Online access to manuscripts is so important, for scholars and students and lots of other people, too (I know a tattoo artist who depends on digital images for design ideas). As the number of collections available online has grown in recent years (DMMAP lists 545 institutions offering at least one digitized manuscript), the use of digital manuscripts by medievalists has grown right along with supply.[1] If you’re a medievalist and you study manuscripts, I’m confident that you regularly use digital images of manuscripts. So every new manuscript online is a celebration. But now, you who are making digitized medieval manuscripts available online, tell us more. How, exactly, are you making your manuscripts available? And please don’t say you’re making them freely available online.

I hate this phrase. It makes my teeth clench and my heart beat faster. It makes me feel this way because it doesn’t actually tell me anything at all. I know you are publishing your images online, because where else would you publish them (the age of CD-ROM for these things is long gone) and I know they are going to be free, because otherwise you’d be making a very different kind of announcement and I would be making a very different kind of complaint (I’m looking at you, Codices Vossiani Latini Online). What else can you tell me?

Here are the questions I want answered when I read about an online manuscript collection.

  1. How are your images licensed? This is going to be my first question, and for me it’s the most important because it defines what I can do with your images. Are you placing them in the public domain, licensing them CC0? This is what we do at my institution, and it’s what I like to see, since, you know, medieval manuscripts are not in copyright, at least not in the USA (I understand things are more complicated in Europe). If not CC0, then what restrictions are you placing on them? Creative Commons has a tool where you can select the restrictions you want and then gives you license options. Consider using it as part of your decision-making process. A clear license is a good license.
  2. How can I find your manuscripts? Is there a search and browse function on your site, or do I have to know what I’m looking for when I come in?
  3. Will your images be served through the International Image Interoperability Framework (IIIF)? IIIF has become very popular recently, and for good reason – it enables users to pull manuscripts from any IIIF-compliant repository into a single interface, for example comparing manuscripts from different institutions in a single browser window. A user will need access to the IIIF manifests to make this work – the manifest is essentially a file containing metadata about the manuscript and a list of links to image files. So, if you are using IIIF, will the manifests be easily accessible so I can use them for my own purposes? (For reference, e-codices links IIIF manifests to each manuscript record, and it couldn’t be easier to find them.)
  4. What kind of interface will you have? I usually assume that a page-turning interface will be provided, but if there is some other interface (like, for example, Yale University, which links individual images from a thumbnail strip on the manuscript record) I’d like to know that. Will users be able to build collections or make annotations on page images, or contribute transcriptions? I’d like to know that, too.
  5. How can I get your images? I know you’re proud of your interface, but I might want to do something else with your images, either download them to my own machine or point to them from an interface I’ve built myself or borrowed from someone else (maybe using IIIF, but maybe not). If you provide IIIF manifests I have a list of URLs I can use to point to or download your image files (more or less, depending on how your server works), but if you’re not using IIIF, is there some other way I can easily get a list of image URLs for a manuscript? For example, OPenn and The Digital Walters publish TEI documents with facsimile lists. If you can’t provide a list, can you at least share how your URLs are constructed? If I know how they’re made I can probably figure out how to build them myself.
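
To make questions 3 and 5 a little more concrete, here is a minimal sketch in Python of what "getting a list of image URLs from a manifest" can look like. It assumes a manifest that follows the IIIF Presentation API 2.x layout (sequences, canvases, images) and an image server that speaks the IIIF Image API 2.x. The sample manifest, the example.org URLs, and the function name list_image_urls are all invented for illustration; they do not describe any real repository.

```python
import json

# Hypothetical, heavily trimmed manifest in the IIIF Presentation 2.x shape.
# The first canvas advertises an Image API service; the second only gives
# a plain image URL, which some servers do.
SAMPLE_MANIFEST = json.loads("""
{
  "@id": "https://example.org/iiif/ms-0001/manifest.json",
  "label": "Example MS 1",
  "sequences": [
    {
      "canvases": [
        {
          "label": "fol. 1r",
          "images": [
            {"resource": {
              "@id": "https://example.org/iiif/ms-0001_0001/full/full/0/default.jpg",
              "service": {"@id": "https://example.org/iiif/ms-0001_0001"}
            }}
          ]
        },
        {
          "label": "fol. 1v",
          "images": [
            {"resource": {
              "@id": "https://example.org/images/ms-0001_0002.jpg"
            }}
          ]
        }
      ]
    }
  ]
}
""")


def list_image_urls(manifest, size="full"):
    """Return (canvas label, image URL) pairs for every image in a manifest.

    When a canvas image advertises an Image API service, the URL is built as
    {service}/{region}/{size}/{rotation}/{quality}.{format}, here always
    full region, no rotation, default quality, JPEG. Otherwise the plain
    resource URL from the manifest is used unchanged.
    """
    urls = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                resource = image.get("resource", {})
                service = resource.get("service")
                if isinstance(service, dict) and "@id" in service:
                    base = service["@id"].rstrip("/")
                    url = f"{base}/full/{size}/0/default.jpg"
                else:
                    url = resource.get("@id")
                if url:
                    urls.append((canvas.get("label", ""), url))
    return urls


for label, url in list_image_urls(SAMPLE_MANIFEST, size="600,"):
    print(label, url)
```

In practice the manifest would be downloaded from whatever URL the repository publishes rather than pasted into the script, which is exactly why it matters that the manifest link is easy to find. The size value "600," asks for an image 600 pixels wide. Servers on version 3 of the Image API use "max" where this sketch uses "full", and manifests in the Presentation 3.0 format are laid out differently, so treat this as a starting point, not a universal recipe.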

Those are the big five questions I like to have answered when I read about a new digital manuscript collection, and they very rarely are. Please, please, please, next time you announce a new collection, try to go beyond freely available online and tell us all more about how your collection will be made available, and what users will be able and allowed to do with it.

[1] In 2002, 33% of survey respondents reported using manuscript facsimiles “print mostly, electronic sometimes” and 47% reported using “print only”. In 2011, 44% reported using them “electronic mostly, print sometimes” and 17% reported using “electronic only”. This is an enormous shift. From Dot Porter, “Medievalists and the Scholarly Digital Edition,” Scholarly Editing: The Annual of the Association for Documentary Editing Volume 34, 2013. http://www.scholarlyediting.org/2013/essays/essay.porter.html