Zombie Manuscripts: Digital Facsimiles in the Uncanny Valley

This is a version of a paper presented at the International Congress on Medieval Studies, May 12, 2018, in session 482, Digital Skin II: ‘Franken-Manuscripts’ and ‘Zombie Books’: Digital Manuscript Interfaces and Sensory Engagement, sponsored by Information Studies (HATII), Univ. of Glasgow, and organized by Dr. Johanna Green.

The uncanny valley was described by Masahiro Mori in a 1970 article in the Japanese journal Energy, and a complete English translation did not appear until 2012.[1] In this article, Mori discusses how he envisions people responding to robots as they become more like humans. The article is a thought piece – that is, it’s not based on any data or study. In the article, which we’ll walk through closely over the course of this presentation, Mori posits a graph, with human likeness on the x axis and affinity on the y axis. Mori’s proposition is that, as robots become more human-like, we have greater affinity for them, until they reach a point at which the likeness becomes creepy, or uncanny, leading to a sudden dip into negative affinity – the uncanny valley.

Now, Mori defined the uncanny valley specifically in relation to robotics, but I think it’s an interesting thought exercise to see how we can plot various presentations of digitized medieval manuscripts along the affinity/likeness axes, and think about where the uncanny valley might fall.

In 2009 I presented a paper, “Reading, Writing, Building: the Old English Illustrated Hexateuch” (unpublished but archived in the Indiana University institutional repository), in which I considered the uncanny valley in relation to digital manuscript editions. This consideration followed a long description of the “Turning the Pages Virtualbook” technology which was then being developed at the British Library, of which I was quite critical. At that time, I said:

In my mind, the models created by Turning the Pages™ fall at the nadir of the “uncanny valley of digital texts” – which has perhaps a plain text transcription at one end and the original manuscript at the other end, with print facsimiles and editions, and the various digital displays and visualizations presented earlier in this paper falling somewhere between the plain text and the lip above the chasm.

That would plot out something like this on the graph. (This graph was not included in the original 2009 paper.)

Dot’s 2009 Conception of the Uncanny Valley of Manuscripts

After nine years of thinking about this and learning more about how digital manuscripts are created and how they function, I’m no longer happy with this arrangement. Additionally, in 2009 I was working with imperfect knowledge of Mori’s proposition – the translation of the article I referred to then was an incomplete translation from 2005, and included a single, simplified graph in place of the two graphs from the original article – which we will look at later in this talk.

Manuscripts aren’t people, and digitized manuscripts aren’t robots, so before we start I want to be clear about what exactly I’m thinking about here. Out of Mori’s proposition I distill four points relevant to our manuscript discussion:

First, Robots are physical objects that resemble humans more or less (that is the x-axis of the graph)

Second, as robots become more human-like, people have greater affinity for them (until they don’t – uncanny valley) – this is the y-axis of the graph

Third, the peak of the graph is a human, not the most human robot

Fourth, the graph refers to robots and to humans generally, not robots compared to a specific human.

Four parallel points can be drawn to manuscripts:

First, digitized manuscripts are data about manuscripts (digital images + structural metadata + additional data) that are presented on computers. Digitized manuscripts are pieces, and in visualizing the manuscript on a computer we are reconstructing them in various ways. (Given the theme of the session I want to point out that this description makes digitized manuscripts sound a lot more like Frankenstein’s creature than like a traditional zombie, and I’m distraught that I don’t have time to investigate this concept further today) These presentations resemble the parent manuscript more or less (this is the x-axis)

Second, as presentations of digitized manuscripts become more manuscript-like, people have greater affinity for them (until they don’t – uncanny valley) – this is the y-axis

Third, the peak of the graph is the parent manuscript, not the most manuscript-like digital presentation

Fourth, the graph refers to a specific manuscript, not to manuscripts generally

I think that this is going to be the major difference in applying the concept of the uncanny valley to manuscripts vs. robots: while Robots are general, not specific (i.e., they are designed and built to imitate humans and not specific people), the ideal (i.e., most manuscript-like) digital presentation of a manuscript would need to be specific, not general (i.e., it would need to be designed to look and act like the parent manuscript, not like any old manuscript)

Now let’s move on to Affinity

A Valley in One’s Sense of Affinity

Mori’s article is divided into four sections, the first being “A Valley in One’s Sense of Affinity”. In this section Mori describes what he means by affinity and how affinity is affected by sensory input. Figure one in this section is the graph we saw before, which starts with an Industrial Robot (little likeness, little affinity), then a Toy Robot (more likeness, more affinity), then drops to negative affinity at about 80-85% likeness, with Prosthetic Hand at negative affinity and Bunraku Puppet on the steep rise to positive affinity and up to Healthy Person.

For Mori, sensory input beyond the visual is important for an object’s placement on the x-axis. An object might look very human, but if it feels strange, that doesn’t only send the affinity into the negative, but it also lessens the likeness. Mori’s original argument focuses on prosthetic hands, specifically on realistic prosthetic hands, which cannot be distinguished at a glance from real ones. I’m afraid the language in his example is ableist so I don’t want to quote him,

Luke Skywalker’s prosthetic hand in The Empire Strikes Back

but his argument is essentially that when one touches a very realistic prosthetic hand and realizes it is not a real hand (as one had been led to believe), the hand becomes uncanny. Relating this feeling to the graph, Mori says, “In mathematical terms, this can be represented by a negative value. Therefore, in this case, the appearance of the prosthetic hand is quite humanlike, but the level of affinity is negative, thus placing the hand near the bottom of the valley in Figure 1.”

The character Osono, from the play Hade Sugata Onna Maiginu (艶容女舞衣), in a performance by the Tonda Puppet Troupe of Nagahama, Shiga Prefecture. https://en.wikipedia.org/wiki/Bunraku#/media/File:Osonowiki.jpg (CC:BY:SA)

Bunraku puppets, while not actually resembling humans physically as strongly as a very realistic prosthetic hand visually resembles a human hand, fall farther up the graph both in terms of likeness and in affinity. Mori makes it clear that likeness is not only, or even mostly, a visual thing. He says:

I don’t think that, on close inspection, a bunraku puppet appears similar to a human being. Its realism in terms of size, skin texture, and so on, does not even reach that of a realistic prosthetic hand. But when we enjoy a puppet show in the theater, we are seated at a certain distance from the stage. The puppet’s absolute size is ignored, and its total appearance, including hand and eye movements, is close to that of a human being. So, given our tendency as an audience to become absorbed in this form of art, we might feel a high level of affinity for the puppet.

So it’s not that bunraku puppets look like humans in great detail, but when we experience them within the context of the puppet show they have the effect of being very human-like, and thus they are high on the human likeness scale.

For a book-related parallel I want to quote briefly a blog post, brought to my attention earlier this week, by Sean Gilmore. Sean is an undergraduate student at Colby College and this past semester took Dr. Megan Cook’s Book History course, for which he wrote this post, “Zombie Books; Digital Facsimiles for the Dotty Dimple Stories.” There’s nothing in this post to suggest that Sean is familiar with the uncanny valley, but I was tickled with his description of reading a digital facsimile of a printed book. Sean says:

In regards to reading experience, reading a digital facsimile could not be farther from the experience of reading from the Dotty Dimple box set. The digital facsimile does in truth feel like reading a “zombie book”. While every page is exactly the same as the original copy in the libraries of the University of Minnesota, it feels as though the book has lost its character. When I selected my pet book from Special Collection half of the appeal of the Dotty Stories was the small red box they came in, the gold spines beckoning, almost as if they were shouting out to be read. This facsimile, on the other hand, feels like a taxidermy house cat; it used to be a real thing, but now it feels hollow, and honestly a little weird.

Sean has found the uncanny valley without even knowing it exists.

The Effect of Movement

The second section of Mori’s article, and where I think it really gets interesting for thinking about digitized manuscripts, is The Effect of Movement. In the first section we were talking in generalities, but here we see what happens when we consider movement alongside general appearance. Manuscripts, after all, are complex physical objects, much as humans are complex physical objects. Manuscripts have multiple leaves, which are connected to each other across quires, the quires which are then bound together and, often, connected to a binding. So moving a page doesn’t just move a page, much as bending your leg doesn’t just move your leg. Turning the leaf of a manuscript might tug on the conjoined leaf, push against the binding, tug on the leaves preceding and following – a single movement provoking a tiny chain reaction through the object, and one which, with practice, we are conditioned to recognize and expect.

Mori says:

Movement is fundamental to animals— including human beings—and thus to robots as well. Its presence changes the shape of the uncanny valley graph by amplifying the peaks and valleys (Figure 2). For illustration, when an industrial robot is switched off, it is just a greasy machine. But once the robot is programmed to move its gripper like a human hand, we start to feel a certain level of affinity for it.

And here, finally, we find our zombie, at the nadir of the “Moving” line of the uncanny valley. The lowest point of the “Still” line is the Corpse, and you can see the arrow Mori has drawn from “Healthy Person” at the pinnacle of the graph down to “Corpse” at the bottom. As Mori says, “We might be glad that this arrow leads down into the still valley of the corpse and not the valley animated by the living dead.” A zombie is thus, in this proposition, an animated corpse. So what is a “dead” manuscript? What is the corpse? And what is the zombie? (I don’t actually have answers, but I think Johanna might be addressing these or similar questions in her talk)

Reservoir Dogs (not zombies)
The Walking Dead (shuffling zombies)
28 Days Later (manic zombies)

I expect most of us here have seen zombie movies, so, in the same way we’ve been conditioned to recognize how manuscripts move, we’ve been conditioned to understand when we’re looking at “normal” humans and when we’re looking at zombies. They move differently from normal humans. It’s part of the fun of watching a zombie film – when that person comes around the corner, we (along with the human characters in the film) are watching carefully. Are they shuffling or just limping? Are they running towards us or away from something else? It’s the movement that gives away a zombie, and it’s the movement that will give away a zombie manuscript.

 

I want to take a minute to look at a manuscript in action. This is a video of me turning the pages of Ms. Codex 1056, a Book of Hours from the University of Pennsylvania. This will give you an idea of what this manuscript is like (its size, what its pages look like, how it moves, how it sounds), although within Mori’s conception this video is more similar to a bunraku puppet than it is to the manuscript itself.

It’s a copy of the manuscript, showing just a few pages, and the video was taken in a specific time and space with a specific person. If you came to our reading room and paged through this manuscript, it would not look and act the same for you.

e-codices manuscript viewer
e-codices viewed through Mirador

Now let’s take a look at a few examples of different page-turning interfaces. The first is from e-codices, and is their regular, purpose-built viewer. When you select the next page, the opening is simply replaced with the next opening (after a few seconds for loading). The second is also e-codices, but is from the Mirador viewer, a IIIF viewer that is being adopted by institutions and that can also be used by individuals. Similar to the other viewer, when you select the next page the opening is replaced with the next opening (and you can also track through the pages using the image strip along the bottom of the window). The next example is a Bible from Swarthmore College near Philadelphia, presented in the Internet Archive BookReader. This one is designed to mimic a physical page turning, but it simply tilts and moves the image. This would be fine (maybe a bit weird) if the image were text-only, but as the image includes the edges of the text-block and you can see a bit of the binding, the effect here is very odd. Finally, my old friend Turning the Pages (a newer version than the one I complained about in my 2009 paper), which works very hard to mimic the movement of a page turning, but does so in a way that is unlike any manuscript I’ve ever seen.

Escape by Design

In the third section of his article, Mori proposes that designers focus their work in the area just before the uncanny valley, creating robots that have lower human likeness but maximum affinity (similar to how he discussed bunraku puppets in the section on affinity, although they are on the other side of the valley). He says:

In fact, I predict that it is possible to create a safe level of affinity by deliberately pursuing a nonhuman design. I ask designers to ponder this. To illustrate the principle, consider eyeglasses. Eyeglasses do not resemble real eyeballs, but one could say that their design has created a charming pair of new eyes. So we should follow the same principle in designing prosthetic hands. In doing so, instead of pitiful looking realistic hands, stylish ones would likely become fashionable.

Floral Porcelain Leg from the Alternative Limb Project (http://www.thealternativelimbproject.com/project/floral-porcelain-leg/)

And here’s an example of a very stylish prosthetic leg from the Alternative Limb Project, which specializes in beautiful and decidedly not realistic prosthetic limbs (and realistic ones too). This is definitely a leg, and it’s definitely not her real leg.

 

In the world of manuscripts, there are a few approaches that would, I think, keep digitized manuscript presentations in that nice bump before the valley:

 

“Page turning” interfaces that don’t try too hard to look like they are actually turning pages (see the two e-codices examples above)

Alternative interfaces that are obviously not attempting to show the whole manuscript but still illustrate something important about them (for example, RTI, MSI, or 3D models of single pages). This example is an interactive 3D image of the miniature of St. Luke from Bill Endres’s Manuscripts of Lichfield Cathedral project.

Visualizations that illustrate physical aspects of the manuscript without trying to imitate them (for example, VisColl visualizations with collation diagrams and bifolia)

 

I think these would plot out something like this on the graph.

Dot’s 2018 Conception of the Uncanny Valley of Digitized Manuscripts

This is all I have to say about the uncanny valley and zombie books, but I’m looking forward to Johanna, Bridget and Angie’s contributions and to our discussion at the end. I also want to give a huge shout-out to Johanna and Bridget, to Johanna for conceiving of this session and inviting me to contribute, and both of them for being immensely supportive colleagues and friends as I worked through my thoughts about frankenbooks and zombie manuscripts, many of which, sadly, didn’t make it into the presentation, but which I look forward to investigating in future papers.

[1] M. Mori, “The uncanny valley,” Energy, vol. 7, no. 4, pp. 33–35, 1970 (in Japanese); English translation: M. Mori, K. F. MacDorman, and N. Kageki, “The Uncanny Valley [From the Field],” IEEE Robotics & Automation Magazine, vol. 19, no. 2, pp. 98–100, June 2012. https://ieeexplore.ieee.org/document/6213238/

Workflow: MS Word to TEI

For the past couple of years I’ve been refining a workflow to convert MS Word files into publishable TEI. By “publishable” I mean TEI that can be loaded into some existing publication system (something like TEI Publisher, Edition Visualization Technology (EVT), or TEI Boilerplate), or that you could process yourself in some other way.

Why might you want to use such a workflow? In my experience, it’s useful when you have a person or people who are designated as transcribers, but who aren’t comfortable or interested in encoding in XML. Microsoft Word is ubiquitous, so pretty much everyone in academia uses it and has access to it. For people who don’t want to work with pointy brackets but still want to collaborate on a digital editing project, a workflow that converts Microsoft Word to TEI can be very useful. (I have also used this workflow myself, even though I’m capable of hand-encoding XML, just because there are times when I’d rather just do it in Word. YMMV!)

I think the workflow works best when there is one person designated to do all the conversion at the end (steps 4 and 5) and any number of people involved in the first three steps. The workflow could be used in the classroom as a group project (where the students model the TEI, plan the pseudocodes, and do the encoding, and one student or the instructor does the conversion work at the end) although I’ve only used it for non-classroom editing projects.

There are a few things you need in order to be successful with this workflow:

  1. You need a team that knows TEI. This doesn’t mean they need to know XML! (Although yes, you will need someone on your team who knows XML, but that’s not related to TEI.) You need to know TEI basics – what tags and attributes are, how modules and classes work – because you need to know which TEI tags you want in your final document before you start transcribing.
  2. Microsoft Word (obviously)
  3. OxGarage conversion tools. OxGarage is a service of the Text Encoding Initiative, which provides scripts for converting between a variety of text formats, including MS Word to TEI.
  4. OxygenXML Editor (or an XML editor of your choice). OxygenXML is popular with the TEI community, and it has the find & replace functionality that is required by this workflow. BBEdit is another XML/text editor that I use a lot, and it has a great find & replace functionality, but it doesn’t work as well for this workflow for reasons I’ll describe later in this post.

The steps of the workflow are (briefly):

  1. Model your TEI.
  2. Create pseudocodes to “tag” in MS Word.
  3. Transcribe in MS Word, using the pseudocode “tags” to indicate those things that will eventually be converted into TEI.
  4. Convert the finished MS Word document into TEI using OxGarage.
  5. Use find & replace in OxygenXML to convert the pseudocode tags into TEI tags, resulting in well-formed and complete TEI.

In more detail:

Model your TEI

The very first thing you need to do is to decide what you need your finished TEI to be able to do. If you’re working with an existing system (e.g., if you know you’ll be publishing in EVT at the end) some of your decisions will be made for you, because you will need to have TEI code that the system can use.[1] Are you encoding abbreviations, and if so are you going to tag the entire word or just the abbreviation and expansion? Are you going to normalize spelling, and if so are you going to do it silently or tag it? Are there marginalia you want to include in your TEI code? Do you want to include editorial notes?

Make a list of everything you need in your TEI, which TEI tags you plan to use, and how you plan to use them. You’ll need this to do the next step of the workflow, the creation of pseudocodes.

Create pseudocodes and tag them in MS Word

Pseudocodes are what I call non-TEI formatting elements that are used to set text apart, and are later processed into TEI tags. Pseudocodes can be divided into two main types: native MS Word formatting (italics, underlining, superscript, etc.) and punctuation marks.

Native MS Word Formatting

Native formatting in Word

MS Word formatting is converted by OxGarage into TEI <hi> tags with the relevant values for @rend. For example, Italics converts to <hi rend="italic">Italics</hi>, Bold converts to <hi rend="bold">Bold</hi>, Underscore converts to <hi rend="underline">Underscore</hi>, Strikethrough converts to <hi rend="strikethrough">Strikethrough</hi>, Red text converts to <hi rend="color(FF0000)">Red text</hi>, Yellow highlight (not an option in WordPress) converts to <hi rend="background(yellow)">Yellow highlight</hi>, Superscript converts to <hi rend="superscript">Superscript</hi>, and Subscript converts to <hi rend="subscript">Subscript</hi>.

 

This is of course useful if you want these exact tags reflected in your final TEI, but once the TEI comes out of OxGarage, you can use the find and replace function in OxygenXML (or some other text/XML editor) to convert these tags into other tags. More on this below.

Native MS Word formatting works very well and can represent a very large number of TEI tags (using just text color and highlight would give you 75 pseudocodes mapping to 75 TEI tags or tag/abbreviation combinations), but there are definitely cases when you would want to use punctuation marks instead.

Punctuation Marks

You can use punctuation marks to set text apart that might not correspond 1:1 with a TEI tag. These are cases, such as expanded abbreviations or corrected readings, where you need tags nested within tags. Brackets work particularly well for this, especially various combinations of brackets. You do need to be careful about configuring bracket combinations, particularly when you’ll have brackets nested within brackets, and (as will also be mentioned later) the order in which you find & replace brackets later will also be relevant. This isn’t a matter to be taken lightly. You should test your pseudocodes and find & replace expressions on a section of text before encoding a full text.

Here is an example using the first line of Genesis 3, from University of Pennsylvania MS. Codex 236, fol. 31r

Genesis 3:1, UPenn Ms. Codex 236, fol. 31r


The text in this line reads:

sed et serpens erat callidior cūctis aīantib t̄

This includes a number of abbreviations that we could expand silently, or we could encode them in TEI in a few different ways. (For more information see the TEI Guidelines 11.3.1.2, “Abbreviations and Expansion”) Options include:

Noting that a word contains an abbreviation, without expanding it. In this example we put <abbr> tags around the complete word, and <am> tags around the abbreviated letter:

sed et serpens erat callidior <abbr>c<am>ū</am>ctis</abbr> <abbr>a<am>ī</am>anti<am>b</am></abbr> <abbr><am>t̄</am></abbr>

In Word, you might choose pseudocodes using nested brackets. In this case, [[]] will be converted later into <abbr></abbr>, and [] (nested within [[]]) will be converted to <am></am>:

sed et serpens erat callidior [[c[ū]ctis]] [[a[ī]antib]] [[[t̄]…]]

Alternatively, you might choose to encode both abbreviation and expansion, and enable the system to choose between them. In this example we add <expan> and <ex> tags to the mix alongside <abbr> and <am>, and then include <choice> to make it clear that the abbreviations and expansions come in pairs:

sed et serpens erat callidior

<choice>

<abbr>c<am>ū</am>ctis</abbr>

<expan>c<ex>un</ex>ctis</expan>

</choice>

<choice>

<abbr>a<am>ī</am>anti<am>b</am></abbr>

<expan>a<ex>nim</ex>antib</expan>

</choice>

<choice>

<abbr><am>t̄</am></abbr>

<expan><ex>ter</ex></expan>

</choice>

As above, you can come up with combinations of marks that you can use to indicate the encoding. In this case <abbr> and <am> are encoded as above, |…| will later be converted to <choice></choice>, {{}} will be converted into <expan></expan>, and {} (nested within {{}}) will be converted to <ex></ex>:

sed et serpens erat callidior |[[c[ū]ctis]]{{c{un}ctis}}| |[[a[ī]anti[b]]]{{a{nim}anti{bus}}}| |[[[t̄]]]{{{ter}…}}|

I like to group brackets of the same type together (as here, where square brackets are used for abbreviations, curly brackets for expansions, and pipes for choice) but you can also combine them in various ways for more options. For example, here are the bracketing options for a project I’m currently working on:

In all cases you need to be very careful that the punctuation marks you use don’t appear in your text, or only use them in combinations that don’t appear in your text, or else you will accidentally create TEI tags where you don’t want them.
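If you want a quick way to catch stray marks before converting, a rough balance check helps. Here is a minimal sketch in Python, assuming you have saved the transcription out as plain text; the filename and the bracket set are just examples matching the scheme described above:

# Rough sanity check before conversion: make sure the characters used as
# pseudocodes are balanced, so stray marks in the text don't turn into
# spurious TEI tags after find & replace. The filename and bracket pairs
# are placeholders for whatever scheme your project actually uses.

PAIRS = [("[", "]"), ("{", "}")]

with open("transcription.txt", encoding="utf-8") as f:
    text = f.read()

for opener, closer in PAIRS:
    n_open, n_close = text.count(opener), text.count(closer)
    if n_open != n_close:
        print(f"Unbalanced pseudocodes: {n_open} '{opener}' vs {n_close} '{closer}'")

# The pipes used for <choice> should appear an even number of times.
if text.count("|") % 2:
    print("Odd number of '|' marks - check the choice pseudocodes.")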

Convert Word to TEI in OxGarage

Once you’ve transcribed and entered pseudocodes in MS Word, it’s time to convert your Word file into TEI. You can do this using OxGarage, a conversion service provided by the TEI. OxGarage has an online interface where you can convert one document at a time, described here, but you can also download the XSLTs from GitHub and run bulk conversion processes (converting multiple files at one time).

The online OxGarage interface is at http://www.tei-c.org/oxgarage/. You need to indicate that you are converting Documents, then select your “from” format (Microsoft Word doc or docx) and your “to” format (TEI P5), then load in your Word document and click “Convert”. Here is my input file (so you can download it and try this yourself), and a screenshot:
OxGarage will generate a TEI file with a template header (including information gleaned from the Word document) and the textual content of the Word doc converted into very basic TEI (file here; you’ll need to change the file extension to .xml):

You can see here how the Word comment is converted into a <note> nested in <hi>, with a <date> included. I also included some red text (to indicate the tyronian et symbol) which has converted as expected. The punctuation mark pseudocodes are unchanged.

Replace Pseudocodes with TEI Tags

This is where we replace those pseudocodes – the <hi> color tags and the combinations of punctuation marks – with TEI tags. I like to do this in OxygenXML, because that software has an advanced find & replace that enables you to search using regular expressions, including the ability to save pieces of what is being searched and reuse them in the replace (a bit like setting a variable in the search).[2]

As mentioned above, the order in which you replace tags matters. You will always want to replace the outermost pseudocodes first, then the interior ones, because a greedy find & replace will match from the first instance of a character in the regular expression to the last. This means that if you have […] (for <am>) nested inside [[…]] (for <abbr>), you need to replace the [[…]] before the […], or else you will end up with <am> tags in the wrong places and no remaining matches when you then search for [[…]].

For example, to find [[…]] and replace it with <abbr>, using Find/Replace with the “Regular Expression” and “Dot matches all” boxes checked, you would search for:

\[\[([^\s]*)\]\] (this is a regular expression that will find every string of characters, excluding whitespace (\s), enclosed by [[ and ]]. The central part of the expression is enclosed in parentheses because we’re going to reuse it in the replace. The square brackets are preceded by \ to ensure they are treated as literal characters and not as part of the regular expression)

And replace that with

<abbr>$1</abbr> (This will replace the [[ and ]] with the opening and closing tags, and copy everything else in the middle – $1 refers to the piece of the search that was enclosed in parentheses)

Unfortunately, if you have an abbreviated word that starts on one line and ends on the following line (as we do here – the last word on this line is terre, but the re is on the next line), this regular expression won’t catch it, because it excludes all whitespace. So I do two finds for each set of pseudocodes: one using the expression above, which excludes whitespace, and a second one which allows it.

\[\[(.*)\]\] (replace with <abbr>$1</abbr>) as above

You don’t want to allow whitespace in your first search because, if you have multiple sets of the same pseudocode in your document (which you probably do), the greedy expression will match from the very first instance of the double brackets to the very last, swallowing everything in between, and you’ll end up with this:

The regular expression has matched the first [[ (on line 43) and the last ]] (on line 46), but the many pairs in between are missed because they have been swallowed into that single match.

Starting with the first search, followed immediately by the second, gives you:

Similarly, replace |…| with <choice></choice> and {{…}} with <expan></expan> – now all the outer nesting has been replaced with TEI tags:

When you have multiple codes that may be nested in a single tag (as the multiple [] and {} now within <abbr> and <expan>) you need to modify the regular expression again, so it catches every matching pair of brackets.

\{([^\s\}]*)\} (Note the \} now within the square brackets. This will keep the expression from moving past the first closing bracket)

The result is a complete set of TEI tags encoding abbreviations and expansions (result file here, change the file extension to .txt).
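For anyone who would rather script step 5 than run the replacements interactively, here is a minimal Python sketch of the same sequence. Treat it as a sketch rather than a drop-in script: the patterns mirror the ones described above (Python writes the backreference as \1 where Oxygen uses $1), the pattern for the pipes is an assumption since I didn’t spell one out above, and the filenames are placeholders.

# Sketch of the find & replace sequence from this post as a script.
# Outer pseudocodes are replaced before inner ones, as explained above.
import re

with open("oxgarage-output.xml", encoding="utf-8") as f:
    tei = f.read()

replacements = [
    (r"\[\[([^\s]*)\]\]", r"<abbr>\1</abbr>"),   # pass 1: pairs with no whitespace inside
    (r"\[\[(.*)\]\]",     r"<abbr>\1</abbr>"),   # pass 2: words split across a line break (greedy - check the output, as above)
    (r"\{\{([^\s]*)\}\}", r"<expan>\1</expan>"),
    (r"\{\{(.*)\}\}",     r"<expan>\1</expan>"),
    (r"\|([^\|]*)\|",     r"<choice>\1</choice>"),  # assumes pipes only ever mark <choice>
    # Inner pseudocodes last; the class excludes the closing bracket so the
    # match stops at the first one (see the \} note above).
    (r"\[([^\s\]]*)\]",   r"<am>\1</am>"),
    (r"\{([^\s\}]*)\}",   r"<ex>\1</ex>"),
]

for pattern, repl in replacements:
    tei = re.sub(pattern, repl, tei, flags=re.DOTALL)  # DOTALL = Oxygen's "Dot matches all"

with open("tei-with-tags.xml", "w", encoding="utf-8") as f:
    f.write(tei)

The same caveat about order applies here as in OxygenXML: the outer patterns run before the inner ones, and the line-spanning passes are greedy, so the output still deserves a check by eye.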

 

You can also use OxygenXML’s find & replace function to replace the remaining pseudocode <hi> tags, or you can be fancy and write an XSLT to do that work. In this example, I want to replace the <hi rend="color(FF0000)"> with <g ref="#t_et"> (I’ll add a corresponding <glyph> tag to the <charDecl> section of the header as described in 5.5.2 of the TEI Guidelines). This is fairly straightforward: since I know the content of the tag will always be “et”, I can do a find & replace for the whole thing. If the content of the tag varies, I can use a regular expression as I did above to copy content from find to replace.
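If you script the conversion as sketched above, this particular swap can be appended just before the file is written back out. It is a sketch under this example’s assumptions: that OxGarage serialized the attribute exactly as shown, that the red text only ever marks the tyronian et, and that keeping “et” inside the <g> element is the editorial choice you want.

# Appended to the replacement sketch above, before the file is written out.
tei = tei.replace('<hi rend="color(FF0000)">et</hi>', '<g ref="#t_et">et</g>')

# If the content of the tag varied, a grouped expression would carry it over:
# tei = re.sub(r'<hi rend="color\(FF0000\)">(.*?)</hi>', r'<g ref="#t_et">\1</g>', tei)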

And that’s the workflow. It’s still a lot of work: you need a strong handle on the TEI, and you need to plan everything in advance. But if you are working with a large number of people transcribing, and advanced TEI training isn’t possible or desirable, it’s an approach well worth considering.

[1] EVT for example has specific requirements for tags it can process and how those tags need to be formatted, and if your TEI doesn’t meet those requirements it won’t work out of the box – you’ll need to modify the EVT code to suit your TEI.

[2] For more information and tutorials on Regular Expressions, visit https://regexone.com/.

“What is an edition anyway?” My Keynote for the Digital Scholarly Editions as Interfaces conference, University of Graz

This week I presented this talk as the opening keynote for the Digital Scholarly Editions as Interfaces conference at the University of Graz. The conference is hosted by the Centre for Information Modelling, Graz University, the programme chair is Georg Vogeler, Professor of Digital Humanities and the program is endorsed by Dixit – Scholarly Editions Initial Training Network. Thanks so much to Georg for inviting me! And thanks to the audience for the discussion after. I can’t wait for the rest of the conference.

What is an edition anyway?

Thank you to Georg Vogeler for inviting me to present the keynote at the symposium, thank you to Dixit for making this conference possible, and danke to the welcoming speakers for welcoming us so warmly. I’m excited to be here and looking forward to hearing what the speakers have to say about digital scholarly editions as interfaces. Georg invited me here to talk about my work on medievalists’ use of digital editions. But first, I have a question.

What is an edition? I think we all know what an edition is, but it’s still fun, I find, to investigate the different ways that people define edition, or think about editions, so despite the title of this talk, most of what I’m going to be talking about is various ways that people think about editions and why that matters to those of us in the room who spend our time building editions, and at the end I’m going to share my thoughts on directions I’d like to see medieval editions in particular take in the future.

screen-shot-2016-09-21-at-12-07-15-pm

I’ll admit that when I need a quick answer to a question, often the first place I turn to is Google. Preparing for this talk was no different. So, I asked Google to define edition for me, and this is what I got. No big surprise. Two definitions, the first “a particular form or version of a published text,” and the second “the total number of copies of a book, newspaper, or other published material issued at one time.” The first definition here is one that’s close to the way I would answer this question myself. I think I’d generally say that an edition is a particular version of a text. It might be a version compiled together from other versions, like in a scholarly critical edition, but need not be. I’m a medievalist, so this got me thinking about texts written over time, and what might make a text rise to the level of being an “edition”, or not.

screen-shot-2016-09-19-at-10-48-59-am
Bankes Papyrus, British Museum Papyrus 114.

So here is some text from the Iliad, written on a papyrus scroll in the 2nd century BC. The scroll is owned by the British Museum, Papyrus 114, also known as the Bankes Papyrus. The Iliad, you probably know, is an ancient Greek epic poem set during the Trojan war, which focuses on a series of battles between King Agamemnon and the warrior Achilles. If you are a Classicist, I apologize in advance for simplifying a complex textual situation. If you aren’t a Classicist and you’ve read the Iliad, you probably read it in a translation from Greek into your native language, and this text most likely would have been presented to you as “The Text of The Iliad” – that is, a single text. That text, however, is built from many small material fragments that were written over a thousand years, and which represent the written form of a text that was composed through oral performance. The Bankes Papyrus is actually one of the most complete examples of the Iliad in papyrus form – most surviving examples are much more fragmentary than this.

Venetus A, aka Marcianus Graecus Z. 454 [=822] (ca. 950), fol. 12r
As far as we know, the text of the Iliad was only compiled into a single version in the 10th century, in what is known as the Venetus A manuscript, now at the Marciana Library in Venice. I have an image of the first page of the first book of the Iliad here. You can see that this presents much more than the text, which is the largest writing in the center-left of the page. This compiled text is surrounded by layers of glossing, which include commentary as well as textual variants.

UPenn Ms Codex 1058, fol. 12r.

The Venetus A is just one example of a medieval glossed manuscript. Another, more common, genre of glossed manuscripts is the Glossed Psalter – texts of the Psalter written with glosses, quotations from the Church Fathers, included to comment on specific lines. Here is an example of a Glossed Psalter from the University of Pennsylvania’s collection. This is a somewhat early example, dated to around 1100, which is before the Glossa Ordinaria was compiled (the Glossa Ordinaria was the standard commentary on the scriptures into the 14th century). Although this isn’t as complex as the Venetus A, you can still see at least two levels of glossing, both in the text and around the margins.

UPenn Ms. Codex 1640, fol. 114r.

One more example, a manipulus florum text from another University of Pennsylvania manuscript. Thomas of Ireland’s Manipulus florum (“Handful of flowers”), compiled in the early 14th century, belongs to the genre of medieval texts known as florilegia, collections of authoritative quotations that are the forerunners of modern reference works such as Bartlett’s Familiar Quotations and The Oxford Dictionary of Quotations. This particular florilegium contains approximately 6000 Latin proverbs and textual excerpts attributed to a variety of classical, patristic and medieval authors. The flora are organized under alphabetically-ordered topics; here we see magister, or teacher. The red text is citation information, and the brown text is the quotes.

marsden

Now let’s take a look at a modern edition, Richard Marsden’s 2008 edition of The Old English Heptateuch published with the Early English Text Society. A glance at the table of contents reveals an introduction with various sections describing the history of editions of the text, the methodology behind this edition, and a description of the manuscripts and the relationships among them. This is followed by the edited texts themselves, which are presented in the traditional manner: with “the text” at the top of the page, and variant readings and other notes – the apparatus – at the bottom. In this way you can both read the text the editor has decided is “the text” and also check to see how individual manuscripts differ in their readings. It is, I’ll point out, very similar to the presentation of the Iliad text in the Venetus A.

screen-shot-2016-09-21-at-12-22-22-pm

Electronic and digital editions have traditionally (as far as we can talk about there being a tradition of these types of editions) presented the same type of information as print editions, although the expansiveness of hypertext has allowed us to present this information interactively, selecting only what we want to see at any given moment and enabling us to follow trails of information via links and pop-ups. For example, I have here Prue Shaw’s edition of Dante’s Commedia, published by Scholarly Digital Editions. Here we have a basic table of contents, which informs us of the sections included in the edition.

screen-shot-2016-09-21-at-12-26-08-pm

Here we have the edited text from one manuscript, with the page image displayed alongside (this of course being one of the main differences between digital and print editions), with variant readings and other notes available at the click of the mouse.

screen-shot-2016-09-21-at-12-29-37-pm

A more extensive content list is also available via dropdown, and with another click I can be anywhere in the edition I wish to be.

screen-shot-2016-09-21-at-12-30-41-pm

Here I am at the same point in the text, except the base text is now this early printed edition, and again the page image is here displayed so I can double-check the editor’s choices should I wish to.

With the possible exception of the Bankes Papyrus, all of these examples are editions. They reflect the purpose of the editor, someone who is not writing original text but is compiling existing text to suit some present desire or need. The only difference is the material through which the edition is presented – handwritten parchment or papyrus, usually considered “primary material”, vs. a printed book or digital media, or “secondary material”. And I could even make an argument that the papyrus is an edition as well, if I posit that the individual who wrote the text on the papyrus was compiling it from some other written source or even from the oral tradition.


I want to take a step back now from the question of what is an edition and talk a bit about why, although the answer to this may not matter to me personally, it does matter very much when you start asking people their opinions about editions. (I am not generally a fan of labels and prefer to let things be whatever they are without worrying too much about what I should call them. I’m no fun at parties.)

I’ve been studying the attitudes of medievalists towards digital resources, including editions, since I was a library science graduate student back in 2002. In May 2001 I graduated with an MA from the Medieval Institute at Western Michigan University, with a focus on Anglo-Saxon language, literature, and religious culture. I had taken a traditional course of work, including courses in paleography and codicology, Old English, Middle English, and Latin language and literature, and several courses on the reading of religious texts, primarily hagiographical texts. I was keenly aware of the importance of primary source materials to the study of the middle ages, and I was also aware that there were CD-ROMs available that made primary materials, and scholarly editions of them, available at the fingertips. There were even at this time the first online collections of medieval manuscripts (notably the Early Medieval Manuscript Collection at the Bodleian Library at Oxford). But I was curious about how much these new electronic editions (and electronic journals and databases, too) were actually being used by scholars. I conducted a survey of medievalists, asking them about their attitudes toward, and use of, electronic resources. I wrote my findings in a research paper, “Medievalists’ Use of Electronic Resources: The Results of a National Survey of Faculty Members in Medieval Studies,” which is still available if you want to read it, in the IU Bloomington institutional repository.

I conducted a second survey in 2011, and compared findings from the two surveys in an article published in 2013 in Scholarly Editing, “Medievalists and the Scholarly Digital Edition.” The methodologies for these surveys were quite different (the first was mailed to a preselected group of respondents, while the second was sent to a group but also advertised on listservs and social media), and I’m hesitant to call either of them scientific, but with these caveats they do show a general trend of usage in the 9 years between, and this trend reflects what I have seen anecdotally.

2002

In this chart from 2002, we see that 7% of respondents reported using electronic and print editions the same, 44% print mostly, and 48% print only.

2009

Nine years later, while still no-one reports using only electronic editions, 7% report using electronic mostly, 12% electronic and print the same, 58% print mostly, and 22% print only. The largest shift is from “print only” to “print mostly”, and it’s most clearly seen on this chart.

screen-shot-2016-09-23-at-12-19-31-pm

Now this is all well and good, and you’d be forgiven for looking at this chart and coming to the conclusion that all these folks had finally “seen the light” and were hopping online and on CD Rom to check out the latest high-tech digital editions in their field. But the written comments show that this is clearly not the case, at least not for all respondents, and that any issues with the survey data come from a disconnect between how I conceive of a “digital edition” and how the respondents conceive of the same.

screen-shot-2016-09-23-at-12-20-20-pm

Exhibit A: Comments from four different respondents explaining when they use digital editions and how they find them useful. I won’t read these to you, but I will point out that the phrase Google Books has been bolded in three of them, and while the other one doesn’t mention Google Books by name, the description strongly implies it.

I have thought about this specific disconnect a lot in the past five years, because I think that it does reflect a general disconnect between how we who create digital editions think about editing and editions, and how more traditional scholars and those who consume editions think about them. Out of curiosity, as I was working on this lecture I asked on Facebook for my “friends” to give me their own favorite definition of edition (not digital edition, just edition), and here are two that reflected the general consensus. The first is very material, a bibliographic description that would be favored by early modernists (as a medievalist I was actually a bit shocked by this definition; although I know what an edition is, bibliographically speaking, I wasn’t thinking in that direction at that point – I was really thinking of a “textual edition”), while the second focuses not so much on how the text was edited but on the apparatus that comes along with it. Thus, an edited text by itself isn’t properly an edition; it requires material explaining the text to be a “real” edition. Interestingly, this second definition arguably includes the Venetus A manuscript we looked at earlier.

This spring, in preparation for this lecture, I created a new survey, based on the earlier surveys (which were more or less identical) but taking as a starting place Patrick Sahle’s definition of a Digital Scholarly Edition:

Digital scholarly editions are not just scholarly editions in digital media. I distinguish between digital and digitized. A digitized print edition is not a “digital edition” in the strict sense used here. A digital edition can not be printed without a loss of information and/or functionality. The digital edition is guided by a different paradigm. If the paradigm of an edition is limited to the two-dimensional space of the “page” and to typographic means of information representation, than it’s not a digital edition.

In this definition Sahle differentiates between a digital edition, which essentially isn’t limited by typography and thus can’t be printed, and a digitized edition, which is and which can. In practice most digitized editions will be photographic copies of print editions, although of course they could just be very simple text rendered fully in HTML pages with no links or pop-ups. While the results of these lines of questioning aren’t directly comparable with the 2002 and 2011 results, I think it’s possible to see a general continuing trend towards a use of digitized editions, if not towards digital editions following Sahle’s definition.

First, a word about methodology. This year’s respondents were entirely self-selecting, and the announcement of the survey, which was online, went out through social media and listservs. I didn’t have a separate selected group. There were 337 total respondents although not every respondent answered every question.

digital

This year, I asked respondents about their use of editions – digital, digitized, and print – over the past year, focusing on the general number of times they had used each. Over 90% of respondents report at least some use of digital editions, although only just over 40% report using them “more times than I can count”.

screen-shot-2016-09-21-at-12-02-58-pm

When asked about digitized editions, however, over 75% report using them “more times than I can count”, and only two respondents (0.6%) report not using them at all.

screen-shot-2016-09-21-at-12-05-01-pm

Print edition usage is similar to digitized edition usage, with about 78% reporting they use them “more times than I can count” and no respondents reporting that they do not use them at all. A chart comparing the three types of editions side-by-side shows clearly how similar the numbers are for digitized and print editions vs. digital editions.

Comparing usage of Digital Editions, Digitized Editions, and Print Editions.

What can we make of this? Questions that come immediately to my mind include: are we building the editions that scholars need? That they will find useful? Are there editions that people want that aren’t getting made? But also: Does it matter? If we are creating our editions as a scholarly exercise, for our own purposes, does it matter if other people use them or not? It might hurt to think that someone is downloading a 19th century edition from Google Books instead of using my new one, but is it okay? And if it’s not, what can we do about that? (I’m not going to try to answer that, but maybe we can think about it this week)


I want to change gears and come back now to this question, what is an edition. I’ve talked a bit about how I conceive of editions, and how others do, and how if I’m going to have a productive conversation about editions with someone (or ask people questions on a survey) it’s important to make sure we’re on the same page – or at least in the same book – regarding what we mean when we say “edition”. But now I want to take a step back – way back – and think about what an edition is at the most basic level. On the Platonic level. If an edition is a shadow on the wall, what is casting that shadow? Some people will say “the urtext”, which I think of (not unkindly, I assure you) as the floating text, the text in the sky. The text that never existed until some editor got her hands on it and brought it to life as Victor Frankenstein brought to life that poor, wretched monster in the pages of Mary Shelley’s classic horror story. I say, we know texts because someone cared enough to write them down, and some of that survives, so what we have now is a written record that is intimately connected to material objects: text doesn’t float, text is ink on skin and ink on paper and notches in stone, paint on stone, and whatever else borne on whatever material was handy. So perhaps we can posit editions that are cast from manuscripts and the other physical objects on which text is borne, not simply being displayed alongside text, or pointed to from text, or described in a section “about the manuscript”, but flipping the model and organizing the edition according to the physical object.

I didn’t come up with this idea, I am sad to say. In 2015, Christoph Flüeler presented a talk at the International Congress on Medieval Studies titled “Digital Manuscripts as Critical Edition,” later posted to the Schoenberg Institute for Manuscript Studies blog. In this essay Flüeler asks: “… how [does] a digital manuscript [stand] in relation to a critical edition of a text. Can the publication of a digital manuscript on the internet be understood as an edition? Further: could such an edition even be regarded as a critical edition?” – His answer being, of course, yes. I won’t go into his arguments, instead I’m going to use them as a jumping-off point, but I encourage you to read his essay.

This concept is very appealing to me. I suppose I should admit now, almost at the end of my keynote, that I am not presently doing any textual editing, and I haven’t in a few years. My current position is “Curator, Digital Research Services” in the Schoenberg Institute for Manuscript Studies at the University of Pennsylvania Libraries in Philadelphia. This position is a great deal of fun and encompasses many different responsibilities. I am involved in the digitization efforts of the unit and I’m currently co-PI of Bibliotheca Philadelphiensis, a grant funded project that will digitize all the medieval manuscripts in Philadelphia, which I can only mention now but I’ll be glad to talk about more later to anyone interested in hearing about it. All our digital images are released in the public domain, and published openly on our website, OPenn, along with human readable HTML descriptions, links to download the images, and robust TEI manuscript descriptions available for download and reuse.

I also do a fair amount of what I think of as experimental work, including new ways to make manuscripts available to scholars and the public. I’ve created electronic facsimiles in the epub format, a project currently being expanded by the Penn Libraries metadata group, which are published in our institutional repository, and I also make short video orientations to our manuscripts which are posted on YouTube and also made available through the repository. In the spring I presented on OPenn for a mixed group of librarians and faculty at Vanderbilt University in Tennessee, after which an art historian said to me, “this open data thing is great and all, but why can’t we just have the manuscripts as PDFs?” So I held my nose and generated PDF files for all our manuscripts, then I did it for the Walters Art Museum as well for good measure. I posted them all to Google Docs, along with spreadsheets as a very basic search facility.

Collation visualization via VisColl

I’ve also been working for the past few years on developing a system for modeling and visualizing the physical collation of medieval manuscripts (this is distinct from textual collation, which involves comparing versions of texts). With a bit of funding from the Mellon Foundation and the collaboration of Alexandra Gillespie and her team at the University of Toronto, I am very excited about the next version of that system, which we call VisColl (it is on GitHub if you’d like to check it out – you can see the code and there are instructions for creating your own models and visualizations). The next version will include facilities for connecting tags, and perhaps transcriptions, to the deconstructed manuscript. I hadn’t thought of the thing that this system generates as an edition, but perhaps it is. But instead of being an edition of a text, you might think of it as an edition of a manuscript that happens to have text on it (or sometimes, perhaps, won’t).

I am aware that I’m reaching the end of my time, so I just want to take a few minutes to mention something that I see playing an enormous role in the future of digital-manuscripts-as-editions, and that’s the International Image Interoperability Framework, or IIIF. I think Jeffrey Witt may mention IIIF in his presentation tomorrow, and perhaps others will as well, although I don’t see any IIIF-specific papers in the schedule. At the risk of oversimplifying, IIIF is a set of Application Programming Interfaces (APIs) – sets of routines, protocols, and tools – that enable the interoperability of image repositories. This means you can use images from different repositories in the same browser or other tool. Here, quickly, is an example of how that can work.

screen-shot-2016-09-21-at-5-25-06-pm

e-codices publishes links to IIIF manifests for each of their manuscripts. A manifest is a JSON file that contains descriptive and structural metadata for a manuscript, including links to images that are served through a IIIF server. You can look at it. It is human readable, kind of, but it’s a mess.
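If you want to see what is actually in there, a few lines of code will walk it. Here is a minimal sketch, assuming a IIIF Presentation API 2.x manifest; the manifest URL is a placeholder – substitute the manifest link e-codices publishes for the manuscript you are interested in:

# List each canvas (page) label and the URL of the image painted onto it,
# reading a IIIF Presentation API 2.x manifest. The URL below is a placeholder.
import json
import urllib.request

MANIFEST_URL = "https://example.org/iiif/manuscript/manifest.json"

with urllib.request.urlopen(MANIFEST_URL) as response:
    manifest = json.load(response)

print(manifest.get("label"))
for canvas in manifest["sequences"][0]["canvases"]:
    image_url = canvas["images"][0]["resource"]["@id"]
    print(canvas.get("label"), image_url)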

Two e-codices manuscripts (and others) in Mirador.

However, if you copy that link and paste it into a IIIF-conformant tool such as Mirador (a simple IIIF browser which I have installed on my laptop) you can create your own collection and then view and manipulate the images side-by-side. Here I’ve pulled in two manuscripts from e-codices, both copies of the Roman de la Rose.

screen-shot-2016-09-21-at-5-38-49-pm

And here I can view them side by side: I can compare the images, compare the text, and I can make annotations on them too. Here is a tool for creating editions of manuscripts.

(A quick side note: of course there are other tools that offer image tagging ability, including the DM project at SIMS, but what IIIF offers is not a single tool but a system for building and viewing editions, and all sorts of other unnamable things, using manuscripts in different institutions without having to move the images around. I cannot stress enough how radical this is for medieval manuscript studies.)

However, as fond as I am of IIIF, and as promising I think it is for my future vision, my support for it comes with some caveats. If you don’t know, I am a huge proponent of open data, particularly open manuscript data. The Director of the Schoenberg Institute is Will Noel, an open data pioneer in his own right who has been named a White House Champion of Change, and I take him as my example. I believe that in most cases, when institutions digitize their manuscript collections they are obligated to release those images into the public domain, or at the very least under a Creative Commons Attribution (CC BY) license (to be clear, a license that would allow commercial use), and that manuscript metadata should be licensed for reuse. My issue with IIIF is that it presents the illusion of openness without actual openness. That is, if images are published under a closed license but you have the IIIF manifest, you can use them to do whatever you want, as long as you’re doing it through IIIF-compliant software. You can’t download them and use them outside of the system (to, say, generate PDF or epub facsimiles, or collation visualizations). I love IIIF for what it makes possible, but I also think it’s vital to keep data open so people can use it outside of any given system.

DATA OVER INTERFACE

We have a saying around the Schoenberg Institute, Data Over Interface. It was introduced to us by Doug Emery, our data programmer who was also responsible for the curation of the data of the Archimedes Palimpsest Project and the Walters Art Museum manuscripts. We like it so much we had it put on teeshirts (You can order your own here!). I like it, not because I necessarily agree that the data is always more important than the interface, but because it makes me think about whether or not the data is always more important than the interface. Excellent, robust data with no interface isn’t easily usable (although a creative person will always find a way), but an excellent interface with terrible data or no data at all is useless as anything other than a show piece. And then inevitably my mind turns to manuscripts, and I begin to wonder, in the case of a manuscript, what is the data and what is the interface? Is a manuscript simply an interface for the text and whatever else it bears, or is the physical object data of its own that begs for an interface to present it, to pull it apart and put it back together in some way to help us make sense of it or the time it was created? Is it both? Is it neither?

I am so excited to be here and to hear about what everyone in this room is thinking about editions, and interfaces, and what editions are, and what interfaces are and are for. Thank you so much for your time, and enjoy the conference.

Presented September 23, 2016.