The stated aim of the session was “to spark conversations about using emerging digital approaches to study cultural heritage collections,” (I’ll copy the full workshop description at the end of this post) but all of our presentations ended up focusing on the labor involved in developing our projects. This was not planned, but it was good, and also interesting that all of us independently came to this conclusion.
Clifford’s presentation was about work being done by the Scholarly Communications team at Vanderbilt University Libraries as they convert data from legacy projects (which have tended to be purpose built, siloed, and bespoke) into more tractable, reusable open data, and Alex told us about the GAM Digital Archive Project, which is digitizing materials related to human rights violations in Guatemala. Both Clifford and Alex stressed the amount of time and effort it takes to do the work behind their projects. The audience was mainly history faculty and maybe a few graduate students, and I expect Clifford and Alex, like me, wanted to make sure the audience understood that the issue of where data comes from is arguably more important than the existence of the data itself.
My own talk was about the University of Pennsylvania’s OPenn (Primary Digital Resources for Everyone), which if you know me you probably already know about. OPenn is the website in which the Kislak Center for Special Collections, Rare Books and Manuscripts publishes its digitized collections in the public domain, as well as hosting collections for many other institutions. This includes several libraries and archives around Philadelphia who are partners on the CLIR-funded Bibliotheca Philadelphiensis project (a collaboration with Lehigh University, the Free Library of Philadelphia, Penn, and the Philadelphia Area Consortium of Special Collections Libraries), which I always mention in talks these days (I’m a co-PI and much of the work of the project is being done at Penn). I also focused my talk on the labor of OPenn, mentioning the people involved and including slides on where the data in OPenn comes from, which I haven’t mentioned in a public talk before.
Ironically I ended up spending so much time talking about what OPenn is and how it works that I didn’t have time to show much of the data, or what you can do with it. But that ended up fitting the (unplanned) theme of the workshop, and the attendees seemed to appreciate it, so I consider it a success.
The purpose of this workshop is to spark conversations about using emerging digital approaches to study cultural heritage collections. It will include a few demonstrations of history projects that make use of collection materials from galleries, libraries, archives, or museums (GLAM) in computational ways, or that address those materials as data. The group will also discuss a range of ways that historical collections can be transformed and creatively re-imagined as data. The workshop will include conversations about the ethical aspects of these kinds of transformations, as well as the potential avenues of exploration that are opened by historical materials treated as data. Part of an IMLS-funded National Digital Forum grant, this workshop will ultimately inform the development of recommendations that aim to support cultural heritage community efforts to make collections more readily amenable to computational use.
When I was eleven years old, my parents brought my brother (who would have been thirteen at that time) and me to England for two weeks during the summer. They rented a house in the southwest corner of the country, not far from Bath, and borrowed a car. We went all over the place; I remember Salisbury and Stonehenge, Wells Cathedral and Bath Abbey. I also remember riding in the back seat down a particularly narrow road surrounded by trees and fields and pointing out the funny stones the cows were grazing around, at which point my father remarked that we were probably getting close to Avebury. But one memory of that trip stands out above all the rest: The Castle. Over the years, The Castle in my mind has grown to almost mythical proportions as I’ve come to realize (even more so over the past couple of years as I have been preparing for this exhibition) that it marks the point at which I was destined to become a medievalist. My reaction to The Castle was an epiphany, my path set in childhood—and I didn’t realize it until almost thirty years later.
In my memory, we visited The Castle toward the end of the afternoon. I was probably tired and grouchy, although I don’t remember that. (I spent much of this trip tired and grouchy.) I do remember a small town, walking through a residential area with lots of houses, turning the corner, and all of a sudden there it was. It was very different from Warwick Castle, which we’d visited earlier in our trip and which I’d found dull and crowded and ugly. This one was small. I remember a tower, and a demolished wall; it was a ruin. There was no one else around, so my brother and I climbed on the broken walls and ran around and basically acted like kids.
At some point, I noticed that the interior of the tower, which was several meters tall, had regular sets of holes around the perimeter, several feet apart horizontally and several more running vertically all the way to the top. I asked about the holes, and someone told me that wooden beams would have gone through those holes, serving as supports for floors. And I remember being struck very suddenly that people had lived here. I was standing in this ruined tower, we were using it essentially as a playground, and yet hundreds of years ago people had made this place their home.
That experience was the first time I can remember having a visceral reaction to a physical object, a reminder that this object was not just the thing we have today, but a thing that has existed over time and been touched by so many hands and lives before it came to us, and will continue touching people long after we are gone.
As my personal experience attests, reactions are both immediate and ongoing, with potentially long-term effects (on both people and objects). Not all premodern book owners wrote in their books, and not all modern artists look to medieval manuscripts for inspiration, but by looking at the various ways that medieval and modern people have reacted to manuscripts, we may come to appreciate these objects as more than simply bearers of information, or beautiful things for us to enjoy. They bear the marks of their own history, and they still have the potential to make history today and in the future.
My story begins in 2012, when Dr. Benjamin Fleming, Visiting Scholar in Religious Studies and Cataloger of Indic Manuscripts for the Kislak Center for Special Collections at the University of Pennsylvania, proposed and was awarded an Endangered Archives grant from the British Library. The main purpose of the grant was to write a catalog for the Ramamala Library, which is one of the oldest still-active traditional libraries in Bangladesh. A secondary part of this grant was to digitize around 150 of the most fragile manuscripts from the Ramamala Library, and an agreement was made that the University of Pennsylvania Libraries would be responsible for hosting these digital images. At this time, someone from the Penn libraries recommended to Dr. Fleming that they get a Creative Commons license, and the non-commercial license was given as an example. The proposal went forward with a CC-NC license, which both the Penn Libraries and the Ramamala Library agreed to, and everything was fine.
So a bit of historical logistics might be helpful here. 2012, the year this proposal was agreed to, was one year before the Schoenberg Institute for Manuscript Studies (SIMS) was founded at Penn. One of the very first things that SIMS was tasked with doing – one of the things it was designed to do – was to create some kind of open access portal to enable the reuse of digital images of our medieval manuscripts. Penn has been digitizing our manuscripts and posting them online since the late 1990s, and in 2013 all of them were online in a system called Penn in Hand. Penn in Hand is a kind of black box – you can see the manuscripts in there, search for them, navigate them, but if you want to publish an image in a book or use them in a project, you have to do some work to figure out what’s allowed in terms of licensing, and then figure out how to get access to high-resolution images that are going to be usable for your needs.
It took us a couple of years, but on May 1, 2015, we launched OPenn: Primary Digital Resources Available for Everyone, as a platform not for viewing images, but explicitly for downloading and reusing images and metadata. OPenn includes high-resolution master TIFF images and smaller JPEG derivatives, as well as robust metadata in TEI/XML using the Manuscript Description element. We started hosting our own data, but today we host manuscript data for 17 institutions in the Philadelphia area with others in the US and Europe (including Hebrew manuscripts from the John Rylands Library at the University of Manchester) to come online in the next year. One of the things we wanted was for users of OPenn to always be certain about what they could do with the data, so we decided that anything that goes into OPenn must follow those licenses that Creative Commons has approved for Free Cultural Works:
the CC Public Domain mark
CC0 (“CC-zero”), the Public Domain dedication for copyrighted works
CC-BY, the Creative Commons Attribution license
CC-BY-SA, the Creative Commons Attribution-Share Alike license
Note that licenses with a non-commercial clause are not approved for Free Cultural Works, and thus OPenn, by policy, is not able to host them.
You see where this is going.
So in March 2016, a year after OPenn was launched, and well after the Ramamala Library manuscripts had been photographed, Dr. Fleming asked about adding the Ramamala Library material to OPenn, in addition to having it in Penn in Hand (where it was already going online). It wasn’t until he asked this question that we realized that, under current policy, we couldn’t include that material in OPenn because of the license. Over the next few weeks we (we being representatives of OPenn, and Dr. Fleming) had several conversations during which we floated various ideas:
We could loosen the “Free Cultural Works” requirement and allow inclusion of the Ramamala data with the noncommercial license.
We could build a parallel OPenn to contain data with a noncommercial license.
We could use OPenn as a kind of carrot, to encourage the Ramamala library administration to loosen the noncommercial clause on the license and release the data as Free Cultural Works.
The third option was struck down almost as soon as it was suggested. There were, it turns out, highly sensitive discussions that had been happening at the Ramamala Library during the course of the project that would have made such a request difficult to say the least. As Dr. Fleming said in an email to me as we were talking about this talk, “It would be highly inappropriate and complex to try and revisit the copyright agreement as, even as it was, the act of a Western organization making digital copies of a small set of mss unraveled a dense set of internal issues related to private property and government control over cultural property (digital or otherwise).”
Before I return to our other two options I want to take a quick detour to talk a bit about how this conversation changed my thinking about Open Access in general, and about open access of non-Western material specifically.
I am by all accounts an evangelist for open access to medieval manuscript material. I like to complain: about institutions that keep their images under strict licenses, that make their images difficult or impossible to download, that charge hefty fees for manuscripts that have been digitized for years. OPenn is a reaction against that kind of thinking. We say: Here are our manuscript images! Here is our metadata! Here’s how you can download them. Do whatever you like with them. We own the books, but we acknowledge that they are our shared cultural heritage and in fact they belong to all of us. So the very least we can do is give you digital copies.
Until Ramamala, I would have told you that it was necessary that digital images of every manuscript in every culture that was written before modern times should be in the public domain and available for everyone. The people who wrote them are dead, and they wouldn’t have had the same conception of copyright ownership in any case, so why not? But suddenly I wasn’t so sure. I was forced to move beyond thinking in very black and white terms about “old vs. new” to thinking in a more nuanced way about “old vs. new”, sure, but also about “what is yours vs. what is ours” – and what ownership of the physical means for ownership of the digital. Again, until Ramamala I would have told you that physical owners owe it to the rest of us to allow the digital to sit in the public domain. But what does this mean for countries who have suffered under colonialism, and who have been forced for the past however many hundred years or more to share their cultural heritage with the west? In my somewhat unstructured thoughts, I keep coming back to the Elgin Marbles, which is just one example of cultural vandalism by the west (in this case, with the assistance of the Ottoman Empire, which ruled Greece at the time the marbles were taken) but is a particularly egregious one. I’m sure you’re all familiar with the Elgin Marbles, which used to decorate buildings on the Acropolis, including the Parthenon, before they were removed to Britain between 1801 and 1812. They were later purchased by the British Museum, where they can be seen today, although the government of Greece has urged their return for many years.
Now the situation of the manuscripts in the Ramamala Library isn’t the same as that of the Elgin Marbles – we aren’t suggesting to move the physical collection to Penn – but I can’t help but believe there is a parallel here, and particularly a case to be made for respecting the ownership of cultural heritage by the cultures that created the heritage. But what does that mean? Dr. Fleming’s comment above implies the complexities around this question. Does “the culture that created the heritage” mean the current government? Or the cultural institutions? Or the citizens of the countries, or the citizens of the culture in those countries, no matter where they live now? And once we figure out the who, how do we ensure they have power to make decisions about their heritage objects? But I have taken my tangent far enough and I want to get back to talking about our ideas for making the Ramamala Library data available in OPenn.
When we left off, we had two other ideas:
We could loosen the “Free Cultural Works” requirement and allow inclusion of the Ramamala data with the noncommercial license.
We could build a parallel OPenn to contain data with a noncommercial license.
We considered the first suggestion but decided very quickly that we didn’t want to open that can of worms. Our concern was that if we started allowing one collection with a noncommercial license, no matter what the circumstances, other institutions that would want to include such a clause could point to it and say “Well, them, why not us?” We have in fact used our “Free Cultural Works” only policy as a carrot for other institutions we host data for, including several museums and libraries in Philadelphia, and it works remarkably well (free hosting in exchange for an open license is apparently an attractive prospect). We don’t want to lose that leverage – particularly when it comes to institutions that own materials from former colonial countries, we have the ability and thus the responsibility to make that data available again, and it’s not a responsibility we take lightly.
The second idea, building a parallel OPenn to allow noncommercially licensed data, was more attractive but, we decided, just too much work at this time, with only one collection. However, it did seem like something that would be a good community project: an open access portal, similar to OPenn in design, but with policies designed for the concerns surrounding access and reuse for cultural heritage data from former colonial countries. If something like this is currently being designed I would love to hear about it, and I expect we would be very happy to put our Ramamala data into such a platform.
So what did we do? We decided to take the path of least resistance and not do anything. The Ramamala manuscripts are available on Penn in Hand, and, since the license information was entered into the MARC record notes field, the license is obvious (unlike most other materials in Penn in Hand). The images are still not easily downloadable, although we are working on a new page for the project through which people can request free access to the high-resolution TIFF files. However, they aren’t available on OPenn. And I’m not sorry about that. I’m not sorry that OPenn’s policy is strictly for “Free Cultural Works,” because I think within our community, serving mainly institutions in Philadelphia and other US and Western European cities, the policy helps us leverage collections into Open Access that might otherwise be under more strict licenses. But I do think it’s important for us to keep talking and thinking about how we can serve other communities with respect.
The text of a lightning talk originally presented at The Futures of Medieval Historiography, a conference at the University of Pennsylvania organized by Jackie Burek and Emily Steiner. Keep in mind that this was very lightly researched; please be kind.
Rather than the originally proposed topic, the historiography of medieval manuscript descriptions, I will instead be talking about the historiography of medieval manuscripts specifically in England and the USA, as perceived through the lens of manuscript descriptions.
We’ll start in the late 12th into the 15th century, when monastic houses cataloged the books in their care using little more than a shelf-list. Such a list would be practical in nature: the community needs to be able to know what books they own, so as books are borrowed internally or loaned to other houses (or perhaps sold) they have a way to keep track of them. Entries on the list would be very simple: a brief statement of contents, and perhaps a note on the number of volumes. There is, of course, an entire field of study around reconstructing medieval libraries using these lists, and as the descriptions are quite simple it is not an easy task.
In the 15th and 16th centuries there were two major historical events that I expect played a major role both in a change in the reception of manuscripts, and in the development of manuscript descriptions moving forward: those are the invention of the printing press in the mid-15th century, and the dissolution of the monasteries in the mid-16th century. The first made it possible to relatively easily print multiple copies of the same book, and also began the long process that rendered manuscripts obsolete. The second led to the transfer of monastic books from institutional into private hands, and the development of private collections with singular owners. When it came to describing their books, these collectors seemed to be interested in describing for themselves and other collectors, and not only for the practical purpose of keeping track of them. Here is a 1697 reprint of a catalog published in 1600 of Matthew Parker’s private collection (bequeathed to Corpus Christi College Cambridge in 1574). You can see that the descriptions themselves are not much different from those in the manuscript lists, but the technology for sharing the catalog – and thus the audience for the catalog – is different.
In the later 16th and into the 17th century these private manuscript collections began to be donated back to institutions (educational and governmental), leading to descriptions for yet other audiences and for a new purpose: for institutions to inform scholars of what they have available for their use. The next three examples, from three catalogs of the Cotton Collection (now at the British Library), reflect this movement. The first is from a catalogue published in 1696. The content description is perhaps a bit longer than the earlier examples, and barely visible in the margin is a bit of a physical description: this is a codex with 155 folios. Notably this is the first description we’ve looked at that mentions the size of the book at all, so we are moving beyond a focus only on content. This next example, from 1777, is notable because it completely foregrounds the contents. This catalog as a whole is organized by theme, not by manuscript (you can see below the contents listed out for Cotton Nero A. i), so we might describe it as a catalog of the collection, rather than a catalog of the manuscripts comprising the collection.
The third example is from the 1802 catalog, and although it’s still in Latin we can see that there is more physical description as well as more detail about the contents and appearance of the manuscript. There is also a citation to a book in which the preface on the manuscript has been published – the manuscript description is beginning to look a bit scholarly.
We’ll jump ahead 150 years, and we can see in that time that concern with manuscripts has spread out from the institution to include the realm of the scholar. This example is from N.R. Ker’s Catalogue of Manuscripts Containing Anglo-Saxon. Rather than focusing on the books in a particular collection, it is focused on a class of manuscripts, regardless of where they are physically located. The description is in the vernacular, and has more detail in every regard. The text is divided into sections as well: general description; codicological description; discussion of the hands; and provenance.
And now we arrive at today, and to the next major change to come to manuscript descriptions, again due to new technology. Libraries around the world, including here at Penn, are writing our manuscript descriptions using code instead of on paper, and publishing them online along with digital images of the manuscript pages, so people can not only read about our manuscripts, but also see images of them and use our data to create new things. We use the data ourselves, for example in OPenn (Primary digital resources available to everyone!) we build websites from our manuscript descriptions to make them available to the widest possible audience.
I want to close by giving a shout-out to the Schoenberg Database of Manuscripts, directed by Lynn Ransom, which is pushing the definition of manuscript descriptions in new scholarly directions. In the SDBM, a manuscript is described temporally, through entries that describe where a book was at particular moments in time (either in published catalogs, or through personal observation). As scholarly needs continue to change, and technology makes new things possible, the description of manuscripts will likewise continue to change around these, even as they have already over the last 800 years.
So you’ve just digitized medieval manuscripts from your collection and you’re putting them online. Congratulations! That’s great. Online access to manuscripts is so important, for scholars and students and lots of other people, too (I know a tattoo artist who depends on digital images for design ideas). As the number of collections available online has grown in recent years (DMMAP lists 545 institutions offering at least one digitized manuscript), the use of digital manuscripts by medievalists has grown right along with supply. If you’re a medievalist and you study manuscripts, I’m confident that you regularly use digital images of manuscripts. So every new manuscript online is a celebration. But now, you who are making digitized medieval manuscripts available online, tell us more. How, exactly, are you making your manuscripts available? And please don’t say you’re making them freely available online.
I hate this phrase. It makes my teeth clench and my heart beat faster. It makes me feel this way because it doesn’t actually tell me anything at all. I know you are publishing your images online, because where else would you publish them (the age of CD-ROM for these things is long gone) and I know they are going to be free, because otherwise you’d be making a very different kind of announcement and I would be making a very different kind of complaint (I’m looking at you, Codices Vossiani Latini Online). What else can you tell me?
Here are the questions I want answered when I read about an online manuscript collection.
How can I find your manuscripts? Is there a search and browse function on your site, or do I have to know what I’m looking for when I come in?
Will your images be served through the International Image Interoperability Framework (IIIF)? IIIF has become very popular recently, and for good reason – it enables users to pull manuscripts from any IIIF-compliant repository into a single interface, for example comparing manuscripts from different institutions in a single browser window. A user will need access to the IIIF manifests to make this work – the manifest is essentially a file containing metadata about the manuscript and a list of links to image files. So, if you are using IIIF, will the manifests be easily accessible so I can use them for my own purposes? (For reference, e-codices links IIIF manifests to each manuscript record, and it couldn’t be easier to find them.)
How can I get your images? I know you’re proud of your interface, but I might want to do something else with your images, either download them to my own machine or point to them from an interface I’ve built myself or borrowed from someone else (maybe using IIIF, but maybe not). If you provide IIIF manifests I have a list of URLs I can use to point to or download your image files (more or less, depending on how your server works), but if you’re not using IIIF, is there some other way I can easily get a list of image URLs for a manuscript? For example, OPenn and The Digital Walters publish TEI documents with facsimile lists. If you can’t provide a list, can you at least share how your urls are constructed? If I know how they’re made I can probably figure out how to build them myself.
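To make the point about image lists concrete, here is a minimal Python sketch of what I mean by getting a list of image URLs. It assumes a IIIF Presentation 2.x manifest that has already been loaded as a dictionary, and a TEI document with a facsimile section. The function names and the example.org URLs are hypothetical, made up for illustration; they are not taken from e-codices, OPenn, or The Digital Walters, and a real repository's files may be organized differently.

```python
import xml.etree.ElementTree as ET

# The TEI namespace, written in ElementTree's "{uri}tag" form.
TEI = "{http://www.tei-c.org/ns/1.0}"


def image_urls_from_manifest(manifest):
    """Collect image URLs from a IIIF Presentation 2.x manifest (a dict).

    In 2.x, images hang off sequences -> canvases -> images -> resource.
    Presentation 3.x uses a different layout and is not handled here.
    """
    urls = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                resource = image.get("resource", {})
                if "@id" in resource:
                    urls.append(resource["@id"])
    return urls


def image_urls_from_tei(tei_xml):
    """Collect the url attribute of every <graphic> inside <facsimile>."""
    root = ET.fromstring(tei_xml)
    urls = []
    for facsimile in root.iter(TEI + "facsimile"):
        for graphic in facsimile.iter(TEI + "graphic"):
            if graphic.get("url"):
                urls.append(graphic.get("url"))
    return urls


# Hypothetical sample data, just to show the shape of the input.
sample_manifest = {
    "sequences": [
        {
            "canvases": [
                {"images": [{"resource": {"@id": "https://example.org/ms1/0001.jpg"}}]},
                {"images": [{"resource": {"@id": "https://example.org/ms1/0002.jpg"}}]},
            ]
        }
    ]
}

sample_tei = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><facsimile>'
    '<surface><graphic url="master/0001.tif"/><graphic url="web/0001.jpg"/></surface>'
    "</facsimile></TEI>"
)

for url in image_urls_from_manifest(sample_manifest) + image_urls_from_tei(sample_tei):
    print(url)
```

Either way the result is a plain list of URLs, which is all I really need in order to download the images or point my own interface at them.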
Those are the big five questions I like to have answered when I read about a new digital manuscript collection, and they very rarely are. Please, please, please, next time you announce a new collection, try to go beyond freely available online and tell us all more about how your collection will be made available, and what users will be able and allowed to do with it.
In 2002, 33% of survey respondents reported using manuscript facsimiles “print mostly, electronic sometimes” and 47% reported using “print only”. In 2011, 44% reported using them “electronic mostly, print sometimes” and 17% reported using “electronic only”. This is an enormous shift. From Dot Porter, “Medievalists and the Scholarly Digital Edition,” Scholarly Editing: The Annual of the Association for Documentary Editing Volume 34, 2013. http://www.scholarlyediting.org/2013/essays/essay.porter.html
Thank you to Georg Vogeler for inviting me to present the keynote at the symposium, thank you to DiXiT for making this conference possible, and danke to the welcoming speakers for welcoming us so warmly. I’m excited to be here and looking forward to hearing what the speakers have to say about digital scholarly editions as interfaces. Georg invited me here to talk about my work on medievalists’ use of digital editions. But first, I have a question.
What is an edition? I think we all know what an edition is, but it’s still fun, I find, to investigate the different ways that people define edition, or think about editions, so despite the title of this talk, most of what I’m going to be talking about is various ways that people think about editions and why that matters to those of us in the room who spend our time building editions, and at the end I’m going to share my thoughts on directions I’d like to see medieval editions in particular take in the future.
I’ll admit that when I need a quick answer to a question, often the first place I turn to is Google. Preparing for this talk was no different. So, I asked Google to define edition for me, and this is what I got. No big surprise. Two definitions, the first “a particular form or version of a published text,” and the second “the total number of copies of a book, newspaper, or other published material issued at one time.” The first definition here is one that’s close to the way I would answer this question myself. I think I’d generally say that an edition is a particular version of a text. It might be a version compiled together from other versions, like in a scholarly critical edition, but need not be. I’m a medievalist, so this got me thinking about texts written over time, and what might make a text rise to the level of being an “edition”, or not.
So here is some text from the Iliad, written on a papyrus scroll in the 2nd century BC. The scroll is owned by the British Museum, Papyrus 114 also known as the Bankes Papyrus. The Iliad, you probably know, is an ancient Greek epic poem set during the Trojan war, which focuses on a series of battles between King Agamemnon and the warrior Achilles. If you are a Classicist, I apologize in advance for simplifying a complex textual situation. If you aren’t a Classicist, if you’ve read the Iliad you probably read it in a translation from Greek into your native language, and this text most likely would have been presented to you as “The Text of The Iliad” – that is, a single text. That text, however, is built from many small material fragments that were written over a thousand years, and which represent the written form of a text that was composed through oral performance. The Bankes Papyrus is actually one of the most complete examples of the Iliad in papyrus form – most surviving examples are much more fragmentary than this.
As far as we know the text of the Iliad was only compiled into a single version in the 10th century, in what is known as the Venetus A manuscript, now at the Marciana Library in Venice. I have an image of the first page of the first book of the Iliad here. You can see that this presents much more than the text, which is the largest writing in the center-left of the page. This compiled text is surrounded by layers of glossing, which includes commentary as well as textual variants.
The Venetus A is just one example of a medieval glossed manuscript. Another, more common genre of glossed manuscript is the Glossed Psalter, that is, a text of the Psalter written with glosses (quotes from the Church Fathers) included to comment on specific lines. Here is an example of a Glossed Psalter from the University of Pennsylvania’s collection. This is a somewhat early example, dated to around 1100, which is before the Glossa Ordinaria was compiled (the Glossa Ordinaria was the standard commentary on the scriptures into the 14th century). Although this isn’t as complex as the Venetus A, you can still see at least two levels of glossing, both in the text and around the margins.
One more example, a manipulus florum text from another University of Pennsylvania manuscript. Thomas of Ireland’s Manipulus florum (“Handful of flowers”), compiled in the early 14th century, belongs to the genre of medieval texts known as florilegia, collections of authoritative quotations that are the forerunners of modern reference works such as Bartlett’s Familiar Quotations and The Oxford Dictionary of Quotations. This particular florilegium contains approximately 6000 Latin proverbs and textual excerpts attributed to a variety of classical, patristic and medieval authors. The flora are organized under alphabetically-ordered topics; here we see magister, or teacher. The red text is citation information, and the brown text is the quotes.
Now let’s take a look at a modern edition, Richard Marsden’s 2008 edition of The Old English Heptateuch published with the Early English Text Society. A glance at the table of contents reveals an introduction with various sections describing the history of editions of the text, the methodology behind this edition, and a description of the manuscripts and the relationships among them. This is followed by the edited texts themselves, which are presented in the traditional manner: with “the text” at the top of the page, and variant readings and other notes – the apparatus – at the bottom. In this way you can both read the text the editor has decided is “the text” and check to see how individual manuscripts differ in their readings. It is, I’ll point out, very similar to the presentation of the Iliad text in the Venetus A.
Electronic and digital editions have traditionally (as far as we can talk about there being a tradition of these types of editions) presented the same type of information as print editions, although the expansiveness of hypertext has allowed us to present this information interactively, selecting only what we want to see at any given moment and enabling us to follow trails of information via links and pop-ups. For example I have here Prue Shaw’s edition of Dante’s Commedia, published by Scholarly Digital Editions. Here we have a basic table of contents, which informs us of the sections included in the edition.
Here we have the edited text from one manuscript, with the page image displayed alongside (this of course being one of the main differences between digital and print editions), with variant readings and other notes available at the click of the mouse.
A more extensive content list is also available via dropdown, and with another click I can be anywhere in the edition I wish to be.
Here I am at the same point in the text, except the base text is now this early printed edition, and again the page image is here displayed so I can double-check the editor’s choices should I wish to.
With the possible exception of the Bankes Papyrus, all of these examples are editions. They reflect the purpose of the editor, someone who is not writing original text but is compiling existing text to suit some present desire or need. The only difference is the material through which the edition is presented – handwritten parchment or papyrus, usually considered “primary material”, vs. a printed book or digital media, or “secondary material”. And I could even make an argument that the papyrus is an edition as well, if I posit that the individual who wrote the text on the papyrus was compiling it from some other written source or even from the oral tradition.
I want to take a step back now from the question of what is an edition and talk a bit about why, although the answer to this may not matter to me personally, it does matter very much when you start asking people their opinions about editions. (I am not generally a fan of labels and prefer to let things be whatever they are without worrying too much about what I should call them. I’m no fun at parties.)
I’ve been studying the attitudes of medievalists towards digital resources, including editions, since I was a library science graduate student back in 2002. In May 2001 I graduated with an MA from the Medieval Institute at Western Michigan University, with a focus on Anglo-Saxon language, literature, and religious culture. I had taken a traditional course of work, including courses in paleography and codicology, Old English, Middle English, and Latin language and literature, and several courses on the reading of religious texts, primarily hagiographical texts. I was keenly aware of the importance of primary source materials to the study of the middle ages, and I was also aware that there were CD-ROMs available that made primary materials, and scholarly editions of them, available at the fingertips. There were even at this time the first online collections of medieval manuscripts (notably the Early Medieval Manuscript Collection at the Bodleian Library at Oxford). But I was curious about how much these new electronic editions (and electronic journals and databases, too) were actually being used by scholars. I conducted a survey of medievalists, asking them about their attitudes toward, and use of, electronic resources. I wrote my findings in a research paper, “Medievalists’ Use of Electronic Resources: The Results of a National Survey of Faculty Members in Medieval Studies,” which is still available if you want to read it, in the IU Bloomington institutional repository.
I conducted a second survey in 2011, and compared findings from the two surveys in an article published in 2013 in Scholarly Editing, “Medievalists and the Scholarly Digital Edition.” The methodologies for these surveys were quite different (the first was mailed to a preselected group of respondents, while the second was sent to a group but also advertised on listservs and social media), and I’m hesitant to call either of them scientific, but with these caveats they do show a general trend of usage in the 9 years between, and this trend reflects what I have seen anecdotally.
In this chart from 2002, we see that 7% of respondents reported using electronic and print editions the same, 44% print mostly, and 48% print only.
Nine years later, while still no-one reports using only electronic editions, 7% report using electronic mostly, 12% electronic and print the same, 58% print mostly, and 22% print only. The largest shift is from “print only” to “print mostly”, and it’s most clearly seen on this chart.
Now this is all well and good, and you’d be forgiven for looking at this chart and coming to the conclusion that all these folks had finally “seen the light” and were hopping online and on CD Rom to check out the latest high-tech digital editions in their field. But the written comments show that this is clearly not the case, at least not for all respondents, and that any issues with the survey data come from a disconnect between how I conceive of a “digital edition” and how the respondents conceive of the same.
Exhibit A: Comments from four different respondents explaining when they use digital editions and how they find them useful. I won’t read these to you, but I will point out that the phrase Google Books has been bolded in three of them, and while the other one doesn’t mention Google Books by name, the description strongly implies it.
I have thought about this specific disconnect a lot in the past five years, because I think that it does reflect a general disconnect between how we who create digital editions think about editing and editions, and how more traditional scholars and those who consume editions think about them. Out of curiosity, as I was working on this lecture I asked on Facebook for my “friends” to give me their own favorite definition of edition (not digital edition, just edition), and here are two that reflected the general consensus. The first is very material, a bibliographic description that would be favored by early modernists (as a medievalist I was actually a bit shocked by this definition, although I know what an edition is, bibliographically speaking, I wasn’t thinking in that direction at that point, I was really thinking of a “textual edition”), while the second focuses not so much on how the text was edited but on the apparatus that comes along with it. Thus, an edited text by itself isn’t properly an edition, it requires material explaining the text to be a “real” edition. Interestingly, this second definition arguably includes the Venetus A manuscript we looked at earlier.
Digital scholarly editions are not just scholarly editions in digital media. I distinguish between digital and digitized. A digitized print edition is not a “digital edition” in the strict sense used here. A digital edition cannot be printed without a loss of information and/or functionality. The digital edition is guided by a different paradigm. If the paradigm of an edition is limited to the two-dimensional space of the “page” and to typographic means of information representation, then it’s not a digital edition.
In this definition Sahle differentiates between a digital edition, which essentially isn’t limited by typography and thus can’t be printed, and a digitized edition, which is and which can. In practice most digitized editions will be photographic copies of print editions, although of course they could just be very simple text rendered fully in HTML pages with no links or pop-ups. While the results of these lines of questioning aren’t directly comparable with the 2002 and 2011 results, I think it’s possible to see a general continuing trend towards a use of digitized editions, if not towards digital editions following Sahle’s definition.
First, a word about methodology. This year’s respondents were entirely self-selecting, and the announcement of the survey, which was online, went out through social media and listservs. I didn’t have a separate selected group. There were 337 total respondents although not every respondent answered every question.
This year, I asked respondents about their use of editions – digital, digitized, and print – over the past year, focusing on the general number of times they had used the editions. Over 90% of respondents report using digital editions at all, although only just over 40% report using them “more times than I can count”.
When asked about digitized editions, however, over 75% report using them “more times than I can count”, and only 2 respondents – .6% – report using them not at all.
Print edition usage is similar to digitized edition usage, with about 78% reporting they use them “more times than I can count” and no respondents reporting they use them not at all. A chart comparing the three types of editions side-by-side shows clearly how similar numbers are for digitized and print editions vs. digital editions.
What can we make of this? Questions that come immediately to my mind include: are we building the editions that scholars need? That they will find useful? Are there editions that people want that aren’t getting made? But also: Does it matter? If we are creating our editions as a scholarly exercise, for our own purposes, does it matter if other people use them or not? It might hurt to think that someone is downloading a 19th century edition from Google Books instead of using my new one, but is it okay? And if it’s not, what can we do about that? (I’m not going to try to answer that, but maybe we can think about it this week)
I want to change gears and come back now to this question, what is an edition. I’ve talked a bit about how I conceive of editions, and how others do, and how if I’m going to have a productive conversation about editions with someone (or ask people questions on a survey) it’s important to make sure we’re on the same page – or at least in the same book – regarding what we mean when we say “edition”. But now I want to take a step back – way back – and think about what an edition is at the most basic level. On the Platonic level. If an edition is a shadow on the wall, what is casting that shadow? Some people will say “the urtext” which I think of (not unkindly, I assure you) as the floating text, the text in the sky. The text that never existed until some editor got her hands on it and brought it to life as Victor Frankenstein brought to life that poor, wretched monster in the pages of Mary Shelley’s classic horror story. I say, we know texts because someone cared enough to write them down, and some of that survives, so what we have now is a written record that is intimately connected to material objects: text doesn’t float, text is ink on skin and ink on paper and notches in stone, paint on stone, and whatever else borne on whatever material was handy. So perhaps we can posit editions that are cast from manuscripts and the other physical objects on which text is borne, not simply being displayed alongside text, or pointed to from text, or described in a section “about the manuscript”, but flipping the model and organizing the edition according to the physical object.
I didn’t come up with this idea, I am sad to say. In 2015, Christoph Flüeler presented a talk at the International Congress on Medieval Studies titled “Digital Manuscripts as Critical Edition,” later posted to the Schoenberg Institute for Manuscript Studies blog. In this essay Flüeler asks: “… how [does] a digital manuscript [stand] in relation to a critical edition of a text. Can the publication of a digital manuscript on the internet be understood as an edition? Further: could such an edition even be regarded as a critical edition?” – His answer being, of course, yes. I won’t go into his arguments, instead I’m going to use them as a jumping-off point, but I encourage you to read his essay.
This concept is very appealing to me. I suppose I should admit now, almost at the end of my keynote, that I am not presently doing any textual editing, and I haven’t in a few years. My current position is “Curator, Digital Research Services” in the Schoenberg Institute for Manuscript Studies at the University of Pennsylvania Libraries in Philadelphia. This position is a great deal of fun and encompasses many different responsibilities. I am involved in the digitization efforts of the unit and I’m currently co-PI of Bibliotheca Philadelphiensis, a grant funded project that will digitize all the medieval manuscripts in Philadelphia, which I can only mention now but I’ll be glad to talk about more later to anyone interested in hearing about it. All our digital images are released in the public domain, and published openly on our website, OPenn, along with human readable HTML descriptions, links to download the images, and robust TEI manuscript descriptions available for download and reuse.
I also do a fair amount of what I think of as experimental work, including new ways to make manuscripts available to scholars and the public. I’ve created electronic facsimiles in the epub format, a project currently being expanded by the Penn Libraries metadata group, which are published in our institutional repository, and I also make short video orientations to our manuscripts which are posted on YouTube and also made available through the repository. In the spring I presented on OPenn for a mixed group of librarians and faculty at Vanderbilt University in Tennessee, after which an art historian said to me, “this open data thing is great and all, but why can’t we just have the manuscripts as PDFs?” So I held my nose and generated PDF files for all our manuscripts, then I did it for the Walters Art Museum as well for good measure. I posted them all to Google Docs, along with spreadsheets as a very basic search facility.
I’ve also been working for the past few years on developing a system for modeling and visualizing the physical collation of medieval manuscripts (this is distinct from textual collation, which involves comparing versions of texts). With a bit of funding from the Mellon Foundation and the collaboration of Alexandra Gillespie and her team at the University of Toronto I am very excited for the next version of that system, which we call VisColl (it is on GitHub if you’d like to check it out – you can see the code and there are instructions for creating your own models and visualizations). The next version will include facilities for connecting tags, and perhaps transcriptions, to the deconstructed manuscript. I hadn’t thought of the thing that this system generates as an edition, but perhaps it is. But instead of being an edition of a text, you might think of it as an edition of a manuscript that happens to have text on it (or sometimes, perhaps, won’t).
I am aware that I’m reaching the end of my time, so I just want to take a few minutes to mention something that I see playing an enormous role in the future of digital-manuscripts-as-editions, and that’s the International Image Interoperability Framework, or IIIF. I think Jeffrey Witt may mention IIIF in his presentation tomorrow, and perhaps others will as well although I don’t see any IIIF-specific papers in the schedule. At the risk of simplifying it, IIIF is a set of Application Programming Interfaces (APIs) – sets of routines, protocols, and tools – to enable the interoperability of image repositories. This means you can use images from different repositories in the same browser or other tool. Here, quickly, is an example of how that can work.
e-codices publishes links to IIIF manifests for each of their manuscripts. A manifest is a json file that contains descriptive and structural metadata for a manuscript, including links to images that are served through a IIIF server. You can look at it. It is human readable, kind of, but it’s a mess.
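To make that "mess" a little less mysterious, here is a minimal Python sketch of reading one. It assumes a manifest in the IIIF Presentation 2.x shape (sequences containing canvases, each with images whose resource has an "@id"). The sample manifest and its example.org URLs are invented for illustration and are not taken from e-codices.

```python
import json

# A tiny, made-up manifest in the IIIF Presentation 2.x shape.
# The label and example.org URLs are hypothetical.
SAMPLE_MANIFEST = json.loads("""
{
  "@type": "sc:Manifest",
  "label": "Example manuscript",
  "sequences": [{
    "canvases": [
      {"label": "1r",
       "images": [{"resource": {"@id": "https://example.org/iiif/ms1_001r/full/full/0/default.jpg"}}]},
      {"label": "1v",
       "images": [{"resource": {"@id": "https://example.org/iiif/ms1_001v/full/full/0/default.jpg"}}]}
    ]
  }]
}
""")


def list_canvas_images(manifest):
    """Return (canvas label, image URL) pairs in the order the manifest gives them."""
    pairs = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                pairs.append((canvas.get("label", ""), image["resource"]["@id"]))
    return pairs


# Print one line per page image: the canvas label, then the image URL.
for label, url in list_canvas_images(SAMPLE_MANIFEST):
    print(label, url)
```

Each canvas stands for one page or opening, which is how a viewer knows the order of the images and what to call each one.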
However, if you copy that link and paste it into a IIIF-conformant tool such as Mirador (a simple IIIF browser which I have installed on my laptop) you can create your own collection and then view and manipulate the images side-by-side. Here I’ve pulled in two manuscripts from e-codices, both copies of the Roman de la Rose.
And here I can view them side by side, I can compare the images, compare the text, and I can make annotations on them too. Here is a tool for creating editions of manuscripts.
(A quick side note: of course there are other tools that offer image tagging ability, including the DM project at SIMS, but what IIIF offers is not a single tool but a system for building and viewing editions, and all sorts of other unnamable things, using manuscripts in different institutions without having to move the images around. I cannot stress enough how radical this is for medieval manuscript studies.)
However, as fond as I am of IIIF, and as promising as I think it is for my future vision, my support for it comes with some caveats. If you don’t know, I am a huge proponent of open data, particularly open manuscript data. The Director of the Schoenberg Institute is Will Noel, an open data pioneer in his own right who has been named a White House Champion of Change, and I take him as my example. I believe that in most cases, when institutions digitize their manuscript collections they are obligated to release those images into the public domain, or at the very least under a Creative Commons BY license (to be clear, a license that would allow commercial use), and that manuscript metadata should be licensed for reuse. My issue with IIIF is that it presents the illusion of openness without actual openness. That is, if images are published under a closed license, if you have the IIIF manifest you can use them to do whatever you want, as long as you’re doing it through IIIF-compliant software. You can’t download them and use them outside of the system (to, say, generate PDF or epub facsimiles, or collation visualizations). I love IIIF for what it makes possible but I also think it’s vital to keep data open so people can use it outside of any given system.
We have a saying around the Schoenberg Institute, Data Over Interface. It was introduced to us by Doug Emery, our data programmer who was also responsible for the curation of the data of the Archimedes Palimpsest Project and the Walters Art Museum manuscripts. We like it so much we had it put on teeshirts (You can order your own here!). I like it, not because I necessarily agree that the data is always more important than the interface, but because it makes me think about whether or not the data is always more important than the interface. Excellent, robust data with no interface isn’t easily usable (although a creative person will always find a way), but an excellent interface with terrible data or no data at all is useless as anything other than a show piece. And then inevitably my mind turns to manuscripts, and I begin to wonder, in the case of a manuscript, what is the data and what is the interface? Is a manuscript simply an interface for the text and whatever else it bears, or is the physical object data of its own that begs for an interface to present it, to pull it apart and put it back together in some way to help us make sense of it or the time it was created? Is it both? Is it neither?
I am so excited to be here and to hear about what everyone in this room is thinking about editions, and interfaces, and what editions are, and what interfaces are and are for. Thank you so much for your time, and enjoy the conference.
Last week I posted on how to use a Firefox plugin called Down Them All to download all the files from an e-codices IIIF manifest (there’s also a tutorial video on YouTube, one of a small but growing collection that will soon include a video outlining the process described here), but not all manifests include direct links to images. The manifests published by the Vatican Digital Library are a good example of this. The URLs in those manifests don’t link directly to images; you need to add criteria at the end of the URLs to hit the images. What can you do in that case? You need to build a list of URLs pointing to images; then you can use Down Them All (or other tools) to download them.
In addition to Down Them All I like to use a combination of TextWrangler and a website called Multilinkr, which takes text URLs and turns them into hot links. Why this is important will become clear momentarily.
Next, we need to pull all the base URLs out of the Vatican manifest.
Search the Vatican Digital Library for the manuscript you want. Once you’ve found one, download the IIIF manifest (click the “Bibliographic Information” button on the far left, which opens a menu, then click on the IIIF manifest link)
Open the manifest you just downloaded in TextWrangler. When it opens, it will appear as a single long string:
You need to get all the URLs on separate lines. The easiest way to do this is to find and replace all commas with a comma followed by a hard return. Do this using the “grep” option, using “\r” to add the return. Your find and replace box will look like this (don’t forget to check the “grep” box at the bottom!):
Your manifest will now look something like this:
Now we’re going to search this file to find everything that starts with “http” and ends with “jp2” (what I’m calling the base URLs). We’ll use the “grep” function again, and a little regular expression that will match everything between the beginning of the URL and the end (.*). Your Find window should look like this (again, don’t forget to check “grep”). Click “Find All”:
Your results will appear in a new window, and will look something like this:
Now we want to export these results as text, and then remove anything in the file that isn’t a URL. First, go to TextWrangler’s File menu and select “Export as Text”:
Save that text file wherever you’d like. Then open it in TextWrangler. You now need to do some finding and replacing, using “grep” (again!) and the .* regular expression to remove anything that is not http…jp2. I had to do two runs to get everything, first the stuff before the URLs, then the stuff after:
You will notice (I hope!) that there are backslashes (\) before every forward slash (/) in the URLs, so we need to remove them too. Just do a regular find and replace; DO NOT check the “grep” box:
Hooray! We have our list of base URLs. Now we need to add the criteria necessary to turn these base URLs into direct links to images.
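If you are comfortable with a little scripting, the find-and-replace passes above can be approximated in a few lines of Python. This is only a sketch under assumptions: the sample text is invented (example.org, not the Vatican's real URLs), and the pattern assumes every base URL sits inside double quotes and ends in ".jp2".

```python
import re

# Matches text that starts with http(s) and runs up to the first ".jp2".
# Stopping at quotes and whitespace keeps each match to a single URL.
BASE_URL_PATTERN = re.compile(r'https?:[^"\s]*?\.jp2')

# Hypothetical manifest fragment with JSON-escaped slashes (\/).
SAMPLE_TEXT = (
    r'{"@id":"http:\/\/example.org\/iiif\/MS_1\/pa_0001.jp2","height":10,'
    r'"service":{"@id":"http:\/\/example.org\/iiif\/MS_1\/pa_0001.jp2"},'
    r'"next":{"@id":"http:\/\/example.org\/iiif\/MS_1\/pa_0002.jp2"}}'
)


def extract_base_urls(manifest_text):
    """Return the unique base URLs in manifest_text, in order of first appearance."""
    # Same job as the plain (non-grep) find and replace: drop the escaping backslashes.
    unescaped = manifest_text.replace("\\/", "/")
    matches = BASE_URL_PATTERN.findall(unescaped)
    # dict.fromkeys removes duplicates while keeping the original order.
    return list(dict.fromkeys(matches))


for url in extract_base_urls(SAMPLE_TEXT):
    print(url)
```

The script removes the backslashes first and searches second, which is the reverse of the order used in TextWrangler, but the resulting list is the same.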
I keep mentioning the criteria required to turn these links from error-throwers to image files. If you go to the Vatican Digital Library website and mouse over the “Download” button for any image file, you’ll see what I mean. As you mouse over that button, a bar will appear at the very bottom of your window, and if you look carefully you’ll see that the URL there is the base URL (ending in “jp2”) followed by four things separated by slashes:
So in this case, we have the full region (the entire image, not a piece of it), a size of 1047 pixels across by however tall (since there is nothing after the comma), a rotation of 0 degrees, and a quality of native (aka default, I think – one could also use bitonal or gray to get images of those qualities). I like to get the “full” image size, so what I’m going to add to the end of the URLs is “/full/full/0/native.jpg”.
We’ll just do this using another find and replace in TextWrangler.
We’re just adding the additional criteria after the file extension, so all I do is find the file extension – jp2 – and replace all with “jp2/full/full/0/native.jpg”.
Test one, just to make sure it works. Copy and paste the URL into a browser.
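The same step can be written as a small Python function, which also makes the four criteria explicit: region, size, rotation, and quality plus format. This is a sketch, not anything the Vatican publishes. The function name and the example.org base URL are my own placeholders, and "native" is used as the quality name only because that is what appears in the URLs described above.

```python
def iiif_image_url(base_url, region="full", size="full", rotation=0,
                   quality="native", fmt="jpg"):
    """Build an image URL of the form base/region/size/rotation/quality.format."""
    return f"{base_url.rstrip('/')}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Hypothetical base URL ending in jp2, standing in for one line of the list.
BASE = "http://example.org/iiif/MS_1/pa_0002.jp2"

# Full-size image, as in the find and replace above.
print(iiif_image_url(BASE))

# 1047 pixels wide, height left to the server (nothing after the comma).
print(iiif_image_url(BASE, size="1047,"))
```

Changing only the size argument is an easy way to ask for smaller derivatives when you don't need the full-resolution files.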
Now – finally! promise! – you can use Down Them All to download all those lovely image files. In order to do that you need to turn the text links into hot links. When I was testing this I first tried opening the text file in Firefox and pointing Down Them All to it, but it broke Down Them All – and I mean BROKE it. I had to uninstall Down Them All and delete everything out of my Firefox profile before I could get it to work again. Happily I found a tool that made it easy to turn those text links into hot links: Multilinkr. So now open a new tab in Firefox and open Multilinkr. Copy all the URLs from TextWrangler and paste them into the Multilinkr box. Click the “Links” button and gasp as the text links turn into hot links:
Now go up to the Firefox “Tools” menu and select “Down Them All Tools > Down Them All” from the dropdown: Down Them All should automatically recognize all the files and highlight them. Two things to be careful about here. One is that you need to specify a download location. It will default to your Downloads folder, but I like to indicate a new folder using the shelfmark of the manuscript I’m downloading. You can also browse to download the files wherever you’d like. The second one is that Down Them All will keep file names the same unless you tell it to do something different. In the case of the Vatican that’s not ideal, since all the files are named “native.jpg”, so if you don’t do something with the “Renaming Mask” you’ll end up with native.jpg native.jpg(1) native.jpg(2) etc. I like to change the Renaming Mask from the default *name*.*ext* to *flatsubdirs*.*ext* – “flatsubdirs” stands for “flat subdirectories”, and it means the downloaded files will be named according to the path of subdirectories wherever they are being downloaded from. In the case of the Vatican files, a file that lives here:
This is still a mouthful, but both the shelfmark (Vat.lat.3773) and the page number or folio number are there (here it’s pa_0002.jp2 = page 2, in other manuscripts you’ll see for example fr_0003r.jp2), so it’s simple enough to use Automator or another tool to batch rename the files by removing all the other bits and just leaving the shelfmark and folio or page number.
There are other ways you could do this, too, using Excel to construct the URLs and wget to download, but I think the method outlined here is relatively simple for people who don’t have strong coding skills. Don’t hesitate to ask if you have trouble or questions about this! And please remember that the Vatican manuscript images are not licensed for reuse, so only download them for your own scholarly work.
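For anyone who does want to try the scripted route, here is a rough Python sketch that stands in for both the download and the batch rename. It rests on an assumption you should check against your own list: that in each image URL the path segment ending in ".jp2" is the page or folio, and the segment just before it carries the shelfmark. The example.org URL and the function names are hypothetical.

```python
import time
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def local_name_for(url):
    """Name a file after the folder before the .jp2 segment plus the page name.

    Assumes a layout like .../<shelfmark folder>/<page>.jp2/full/full/0/native.jpg.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    for index, segment in enumerate(segments):
        if segment.endswith(".jp2") and index > 0:
            page = segment[: -len(".jp2")]
            return f"{segments[index - 1]}_{page}.jpg"
    raise ValueError(f"No .jp2 segment found in {url}")


def download_all(urls, folder, pause_seconds=1.0):
    """Download each URL into folder, skipping files that already exist."""
    target = Path(folder)
    target.mkdir(parents=True, exist_ok=True)
    for url in urls:
        destination = target / local_name_for(url)
        if not destination.exists():
            urllib.request.urlretrieve(url, str(destination))
            time.sleep(pause_seconds)  # be polite to the image server


# Example (not run here): download_all(list_of_image_urls, "Vat.lat.3773")
```

Because the files are named as they are saved, nothing ever lands on disk as “native.jpg”, so there is no Renaming Mask step to remember.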
IIIF manifests are great, but what if you want to work with digital images outside of a IIIF interface? There are a few different ways I’ve figured out that I can use IIIF manifests to download all the images from a manuscript. The exact approach will vary since different institutions construct their image URLs in different ways. Here’s the first approach, which is fairly straightforward and uses e-codices as an example. Tomorrow I’ll post a second post on the Vatican Digital Library. Please remember that most institutions license their images, so don’t repost or publish images unless the institution specifically allows this in their license.
Method 1: The manifest has urls that resolve directly to image files
This is the easiest method, but it only works if the manifest contains urls that resolve directly to image files. If you can copy a url and paste it into a browser and an image displays, you can use this method. The manifests provided by e-codices follow this approach.
Install DownThemAll, a Firefox browser plugin that allows you to download all the files linked to from a webpage. (There is a similar browser plugin for Chrome, called Get Them All, but it did not recognize the image files linked from the manifest)
Go to e-codices, search for a manuscript, and click the “IIIF manifest” link on the Overview page.
The manifest will open in the browser. It will look like a mess, but it doesn’t need to look good.
Open DownThemAll. It will recognize all the files linked from the manifest (including .json files, .jpg, .jp2, and anything else) and list them. Click the box next to “JPEG Images” at the bottom of the page (under “Filters”). It will highlight all the JPEG images in the list, including the various “default.jpg” images and files ending with “.jp2”.
Now, we only want the images that are named “default.jpg”. These are the “regular” jpeg files; the .jp2 files are the masters and, although you could download them, your browser wouldn’t know what to do with them. So we need to create a new filter so we get only the default.jpg files. To do this, first click “Preferences” in the lower right-hand corner, then click the “Filters” button in the resulting window.
There they are. To create a new filter, click the “Add New Filter” button, and call the new filter “Default Jpg” (or whatever you like). In the Filtered Extensions field, type “/\/default.jpg” – the filter will select only those files that end with “default.jpg” (yes you do need three slashes there!). Note that you do not need to press save or anything, the filter list updates and saves automatically.
Return to the main Down Them All view and check the box next to your newly-created filter. Be amazed as all the “default.jpg” files are highlighted.
Don’t hit download just yet. If you do, it will download all the files with their given names, and since they are all named “default.jpg” it won’t end well. It will also download them all directly to whatever is specified under “Save Files in” (in my case, my Downloads folder) which also may not be ideal. So you need to change the Renaming Mask to at least give you unique names for each one, and specify where to download all those files. In the case of e-codices the manifest urls include both the manuscript shelfmark and the folio number for each image, so let’s use the Renaming Mask to name the files according to the file page. Simply change *name* to *flatsubdirs* (flat subdirectories). Under “Save Files in”, browse to wherever you want to download all these files.
Press “Start” and wait for everything to download.
Congratulations, you have downloaded all the images from this manuscript! You’ll probably want to rename them (if you’re on Mac you can use Automator to do this fairly easily), and you should also save the manifest alongside the images.
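The filtering and renaming steps above can also be sketched in Python, for anyone who would rather not depend on a browser plugin. The sketch makes the same assumption as Method 1, that the manifest contains URLs ending in “default.jpg” that resolve directly to images. The sample data is invented (example.org), and flat_name only loosely imitates what the flatsubdirs Renaming Mask does.

```python
from urllib.parse import urlparse

# Hypothetical manifest fragment: two image links plus other links to ignore.
SAMPLE_MANIFEST = {
    "@id": "https://example.org/metadata/iiif/lib-0001/manifest.json",
    "sequences": [{"canvases": [
        {"label": "1r", "images": [{"resource": {
            "@id": "https://example.org/iiif/lib-0001/lib-0001_001r.jp2/full/full/0/default.jpg",
            "service": {"@id": "https://example.org/iiif/lib-0001/lib-0001_001r.jp2"}}}]},
        {"label": "1v", "images": [{"resource": {
            "@id": "https://example.org/iiif/lib-0001/lib-0001_001v.jp2/full/full/0/default.jpg",
            "service": {"@id": "https://example.org/iiif/lib-0001/lib-0001_001v.jp2"}}}]},
    ]}],
}


def iter_strings(node):
    """Yield every string value found anywhere inside nested dicts and lists."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def default_jpg_urls(manifest):
    """Return unique URLs ending in /default.jpg, in order of first appearance."""
    urls = []
    for text in iter_strings(manifest):
        if text.startswith("http") and text.endswith("/default.jpg") and text not in urls:
            urls.append(text)
    return urls


def flat_name(url):
    """Join the URL's folders with dashes so every file gets a unique name."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    return "-".join(parts[:-1]) + ".jpg"


for url in default_jpg_urls(SAMPLE_MANIFEST):
    print(flat_name(url), url)
```

Walking every string in the file, rather than following the manifest's structure, mirrors what Down Them All does: it looks at all the links on the page and keeps only those that pass the filter.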
As with the Schoenberg Manuscripts, these two other collections are in their own folders, along with a spreadsheet you can search and browse to aid in discovery. You are free to download the PDF files and redistribute them as you wish. They are in the public domain.
Hi everyone! It’s been almost a year since my last blog post (in which I promised to post more frequently, haha) so I guess it’s time for another one. I actually have something pretty interesting to report!
Last week I gave an invited talk at the Cultural Heritage at Scale symposium at Vanderbilt University. It was amazing. I spoke on OPenn: Primary Digital Resources Available to Everyone, which is the platform we use in the Schoenberg Institute for Manuscript Studies at the University of Pennsylvania Libraries to publish high-resolution digital images and accompanying metadata for all our medieval manuscripts (I also talked for a few minutes about the Schoenberg Database of Manuscripts, which is a provenance database of pre-1600 manuscripts). The philosophy of OPenn is centered on openness: all our manuscript images are in the public domain and our metadata is licensed with Creative Commons licenses, and none of those licenses prohibit commercial use. Next to openness, we embrace simplicity. There is no search facility or fancy interface to the data. The images and metadata files are on a file system (similar to the file systems on your own computer) and browse pages for each manuscript are presented in HTML that is processed directly from the metadata file. (Metadata files are in TEI/XML using the manuscript description element)
This approach is actually pretty novel. Librarians and faculty scholars alike love their interfaces! And, indeed, after my talk someone came up to me and said, “I’m a humanities faculty member, and I don’t want to have to download files. I just want to see the manuscripts. So why don’t you make them available as PDF so I can use them like that?”
This gave me the opportunity to talk about what OPenn is, and what it isn’t (something I didn’t have time to do in my talk). The humanities scholar who just wants to look at manuscripts is really not the audience for OPenn. If you want to search for and page through manuscripts, you can do that on Penn in Hand, our longstanding page-turning interface. OPenn is about data, and it’s about access. It isn’t for people who want to look at manuscripts, it’s for people who want to build things with manuscript data. So it wouldn’t make sense for us to have PDFs on OPenn – that’s just not what it’s for.
HOWEVER. However. I’m sympathetic. Many, many people want to look at manuscripts, and PDFs are convenient, and I want to encourage them to see our manuscripts as available to them! So, even if Penn isn’t going to make PDFs available institutionally (at least, not yet – we may in the future), maybe this is something I could do myself. And since all our manuscript data is available on OPenn and licensed for reuse, there is no reason for me not to do it.
If you click that link, you’ll find yourself in a Google Drive folder titled “OPenn manuscript PDFs”. In there is currently one folder, “LJS Manuscripts.” In that folder you’ll find a link to a Google spreadsheet and over 400 PDF files. The spreadsheet lists all the LJS manuscripts (LJS = Laurence J. Schoenberg, who gifted his manuscripts to Penn in 2012) including catalog descriptions, origin dates, origin locations, and shelfmarks. Let’s say you’re interested in manuscripts from France. You can highlight the Origin column and do a “Find” for “France.” It’s not a fancy search so you’ll have to write down the shelfmarks of the manuscripts as you find them, but it works. Once you know the shelfmarks, go back into the “LJS Manuscripts” folder and find and download the PDF files you want. Note that some manuscripts may have two PDF files, one with “_extra” in the file name. These are images that are included on OPenn but not part of the front-to-back digitization of a manuscript. They might include things like extra shots of the binding, or reference shots.
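If writing down shelfmarks by hand gets tedious, the same lookup could be scripted against a CSV export of the spreadsheet. This is only a rough Python sketch: the column names “Shelfmark” and “Origin” and the three rows are placeholders I am assuming for illustration, not the real spreadsheet contents.

```python
import csv
import io

# Placeholder rows standing in for a CSV export of the spreadsheet.
# The column names and values are assumptions, not the real data.
SAMPLE_CSV = """Shelfmark,Origin
ms_a,France
ms_b,Italy
ms_c,Northern France
"""


def shelfmarks_from(csv_text: str, place: str) -> list[str]:
    """Return shelfmarks whose Origin cell mentions `place` (case-insensitive)."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        row["Shelfmark"]
        for row in rows
        if place.lower() in row["Origin"].lower()
    ]


print(shelfmarks_from(SAMPLE_CSV, "France"))
```

Like the “Find” approach, this is a plain substring match, so it also picks up cells that mention France alongside other text.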
If you are interested in knowing how I did this, please read on. If not, enjoy the PDFs!
How I did it
I’ll be honest, this is my favorite part of the exercise so thank you for sticking with me for it! There won’t be a pop quiz at the end although if you want to try this out yourself you are most welcome to.
First I downloaded all the web jpeg files from the LJS collection on OPenn. I used wget to do this, because with wget I am able to get only the web jpeg files from all the collection folders at once. My wget command looked like this:
wget -r -np -A "_web.jpg" http://openn.library.upenn.edu/Data/0001/
wget = use the wget program
-r = “recursive”, basically means go into all the child folders, not just the folder I’m pointing to
-np = “no parent”, basically means don’t go into the parent folders, no matter what
-A "_web.jpg" = “accept list”, in this case I specified that I only want those files whose names end in _web.jpg (which all the web jpeg files on OPenn do)
http://openn.library.upenn.edu/Data/0001/ = where all the LJS manuscript data lives
I didn’t use the -nd option, which I usually do. (-nd = “no directories”; if you don’t use this option you get the entire file structure for the file server starting from root, which in this case is openn.library.upenn.edu. What this means, practically, is that if you use wget to download one file from a directory five levels down, you get empty folders four levels deep, then the last directory with one file in it. Not fun.) But in this case it’s helpful, and you’ll see why later.
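To make the difference concrete, here is a small Python sketch of where a recursively downloaded file ends up with and without -nd. It only imitates wget’s default naming (host name plus path) and is not wget’s own logic; the image file name is a made-up example, while the folder layout follows the OPenn paths described in this post.

```python
from urllib.parse import urlsplit


def local_path(url: str, no_directories: bool = False) -> str:
    """Approximate where a recursive wget download saves the file for `url`.

    Without -nd the host name and every folder are recreated locally;
    with -nd only the bare file name is kept, in the current folder.
    """
    parts = urlsplit(url)
    if no_directories:
        return parts.path.rsplit("/", 1)[-1]
    return parts.netloc + parts.path


# The file name below is a made-up example, not a real OPenn image.
url = "http://openn.library.upenn.edu/Data/0001/ljs101/data/web/example_web.jpg"
print(local_path(url))
print(local_path(url, no_directories=True))
```

Keeping the folders means the shelfmark folder (ljs101 here) stays in the path of every image, so each file can still be traced back to its manuscript.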
At my house, with a pretty good wireless connection, it took about 5 hours to download everything.
I used Automator to batch create the PDF files. After a bit of googling I found this post on batch creating multipage PDF files from jpeg files. There are some different suggestions, but I opted to use Mac’s Automator. There is a workflow linked from that post. I downloaded that and (because all of the folders of jpeg images I was going to process are in different parent folders) I replaced the first step in the workflow, which was Get Selected Finder Items, with Get Specified Finder Items. This allowed me to search in Automator for exactly what I wanted. So I added all the folders called “web” that were located in the ancestor folder “openn.library.upenn.edu” (which was created when I downloaded all the images from OPenn in the previous step). In this step Automator creates one PDF file named “output.pdf” for each manuscript in the same location as that manuscript’s web jpeg images (in a folder called web – which is important to know later).
Once I created the PDFs, I no longer needed the web jpeg files. So I took some time to delete all the web jpegs. I did this by searching in Finder for “_web.jpg” in openn.library.upenn.edu and then sending them all to Trash. This took ages, but when it was done the only things in those folders were output.pdf files.
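A script would probably be faster than Finder for this cleanup. The following is a minimal Python sketch of that idea, not the route taken here. It assumes the download folder is called openn.library.upenn.edu, and it defaults to a dry run because deleting this way is permanent and skips the Trash.

```python
from pathlib import Path


def delete_web_jpegs(root: Path, dry_run: bool = True) -> list[Path]:
    """Find every file ending in _web.jpg under `root`.

    The files are only deleted when dry_run is False.
    """
    if not root.is_dir():
        return []
    matches = sorted(p for p in root.rglob("*_web.jpg") if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()  # permanent: this does not go through the Trash
    return matches


# Assumed location of the download; adjust to wherever wget saved the files.
download_root = Path("openn.library.upenn.edu")
for path in delete_web_jpegs(download_root, dry_run=True):
    print("would delete", path)
```

Running it once with dry_run=True lists what would go; switching to dry_run=False removes the jpegs and leaves the output.pdf files alone.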
I still had more work to do. I needed to change the names of the PDF files so I would know which manuscripts they represented. Again, after a bit of Googling, I chanced upon this post which includes an AppleScript that did exactly what I needed: it renames files according to the path of their location on the file system. For example, the file “output.pdf” located in Macintosh HD/Users/dorp/Downloads/openn/openn.library.upenn.edu/Data/0001/ljs101/data/web would be renamed “Macintosh HD_Users_dorp_Downloads_openn_openn.library.upenn.edu_Data_0001_ljs101_data_web_001.pdf”. I’d never used AppleScript before so I had to figure that out, but once I did it was smooth sailing – just took a while. (To run the script I copied it into Apple’s Script Editor, hit the play button, and selected openn.library.upenn.edu/Data/0001 when it asked me where I wanted to point the script.)
Finally, I had to remove all the extraneous pieces of the file names to leave just the shelfmark (or shelfmark + “extra” for those files that represent the extra images). Automator to the rescue again!
Get Specified Finder Items (adding all PDF files located in the ancestor folder “openn.library.upenn.edu”)
Rename Finder Items to replace text (replacing “Macintosh HD_Users_dorp_Downloads_openn_openn.library.upenn.edu_Data_0001_” with nothing)
Rename Finder Items to replace text (replacing “_data_web_001” with nothing)
Rename Finder Items to replace text (replacing “_data_extra_web_001” with “_extra” – this identifies PDFs that are for “extra” images)
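The AppleScript rename and the Automator clean-up together amount to one rule: take the shelfmark folder from the path and add “_extra” when the PDF sits under the extra folder. Here is a hedged Python sketch of that rule, not the scripts actually used. It assumes the folder layout implied by the file names above (shelfmark/data/web and shelfmark/data/extra/web).

```python
from pathlib import PurePosixPath


def pdf_name(pdf_path: str, collection_root: str) -> str:
    """Build "<shelfmark>.pdf" or "<shelfmark>_extra.pdf" from the PDF's location.

    Assumes <root>/<shelfmark>/data/web/output.pdf for the main images and
    <root>/<shelfmark>/data/extra/web/output.pdf for the extra images.
    """
    relative = PurePosixPath(pdf_path).relative_to(collection_root)
    shelfmark = relative.parts[0]
    suffix = "_extra" if "extra" in relative.parts[1:-1] else ""
    return f"{shelfmark}{suffix}.pdf"


root = "openn.library.upenn.edu/Data/0001"
print(pdf_name(f"{root}/ljs101/data/web/output.pdf", root))        # ljs101.pdf
print(pdf_name(f"{root}/ljs101/data/extra/web/output.pdf", root))  # ljs101_extra.pdf
```

Working from the path directly avoids the long intermediate file names, though it only works because wget kept the folder structure.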
The last thing I had to do was to move them into Google Drive. Again, I just searched for “.pdf” in Finder (taking only those in openn.library.upenn.edu/Data/0001) and dragged them in.
I generated the spreadsheet by running an XSLT script over the TEI manuscript descriptions (it’s a spreadsheet I created a couple of years ago when I first uploaded data about the Penn manuscripts on Viewshare). Leave a comment or send me a note if that sounds interesting and I’ll make a post on it.
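In the meantime, here is a rough idea of what turning a TEI manuscript description into a spreadsheet row involves. This sketch uses Python rather than XSLT, so it is not the script mentioned above. The sample record is a tiny made-up fragment with placeholder values, and the three fields chosen are my assumption about what such a row might hold.

```python
import csv
import io
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# A tiny made-up TEI fragment; real OPenn files are far richer and may differ.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><sourceDesc><msDesc>
    <msIdentifier><idno>MS Placeholder 1</idno></msIdentifier>
    <history><origin>
      <origDate>1400</origDate>
      <origPlace>Placeholder Place</origPlace>
    </origin></history>
  </msDesc></sourceDesc></fileDesc></teiHeader>
</TEI>"""


def text_at(root: ET.Element, path: str) -> str:
    """Return the stripped text of the first match for `path`, or "" if absent."""
    node = root.find(path, TEI)
    return (node.text or "").strip() if node is not None else ""


def row_from_tei(xml_text: str) -> dict[str, str]:
    """Pull shelfmark, origin date, and origin place out of one msDesc."""
    root = ET.fromstring(xml_text)
    return {
        "shelfmark": text_at(root, ".//tei:msIdentifier/tei:idno"),
        "origin_date": text_at(root, ".//tei:origin/tei:origDate"),
        "origin_place": text_at(root, ".//tei:origin/tei:origPlace"),
    }


out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["shelfmark", "origin_date", "origin_place"])
writer.writeheader()
writer.writerow(row_from_tei(SAMPLE))
print(out.getvalue())
```

Looping the same function over every TEI file in a collection would give one CSV row per manuscript, which is the shape a spreadsheet import expects.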