How to get MODS using the NYPL Digital Collections API

Last week I figured out how to batch-download MODS records from the NYPL Digital Collections API (http://api.repo.nypl.org/) using my limited set of technical skills, so I thought I would share my process here.

I had a few tools at my disposal. First, I’m on a MacBook. I’m not sure how I would have done this had I been on a Windows machine. Second, I’m pretty good with XSLT. Although I have some experience with other languages (JavaScript, Python, Perl), I’m not really good at them. It’s possible one could do something like this using other languages and it would be more effective – but I use what I know. I also had a browser, which came in handy in the first step.

The first thing I had to do was find all the objects that I wanted to get the MODS for. I wanted all the medieval objects (surprise!), so to get as broad a search as possible I opted for the “Search across all MODS fields” option (Method 4 in the API Documentation), which involves constructing a URL to stick in a browser. Because the API will return at most 500 results for a single search, I included that limit in my search. I ended up constructing four URLs, since it turned out there were between 1500 and 2000 objects.

I plugged these into my browser, then just saved these result pages as XML files in a directory on my Mac. Each of these results pages had a brief set of fields for each object: UUID (the unique identifier for the objects, and the thing I needed to use to get the MODS), title, typeOfResource, imageID, and itemLink (the URL for the object in the NYPL Digital Collections website).
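For anyone who would rather generate the paged search URLs than type them by hand, here is a minimal sketch in JavaScript (Node). It is not the set of URLs used above: the `q`, `per_page`, and `page` parameter names, the search path, and the search term "medieval" are assumptions for illustration, so check them against the API documentation before relying on them.

```javascript
// Sketch: build one search URL per page of results.
// ASSUMPTIONS: the search path and the q / per_page / page parameter
// names are illustrative; verify them in the NYPL API documentation.
const SEARCH_BASE = "http://api.repo.nypl.org/api/v1/items/search";

function buildSearchUrls(term, totalResults, perPage = 500) {
  // 500 is the per-search maximum mentioned in the post.
  const pageCount = Math.ceil(totalResults / perPage);
  const urls = [];
  for (let page = 1; page <= pageCount; page++) {
    const params = new URLSearchParams({
      q: term,
      per_page: String(perPage),
      page: String(page),
    });
    urls.push(`${SEARCH_BASE}?${params.toString()}`);
  }
  return urls;
}

// Up to 2000 objects at 500 per page gives four URLs.
console.log(buildSearchUrls("medieval", 2000).join("\n"));
```

Each URL still needs the authentication token when requested outside a logged-in browser session.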

Next, I had to figure out how to feed the UUIDs back into the API. I thought about this for most of a day, and an evening, and then a morning. I tapped my network for some suggestions, and it wasn’t until Conal Tuohy suggested using document() in XSLT that I thought XSLT might actually work.

To get the MODS record for any UUID, you simply need to construct a URL that points to the MODS record on the NYPL file directory. The URLs look like this:

http://api.repo.nypl.org/api/v1/items/mods/[UUID].xml

For my first attempt, I wrote an XSLT document that used document(), constructing pointers to each MODS record when processed over the result documents I saved from my browser. Had this worked, it would have pulled all the MODS records into a new document during processing. I use Oxygen for almost all of my XML work, including processing, but when I tried to process my first result document I got an I/O error. Of course I did – the API doesn’t allow just any old person in. You need to authenticate, and when you sign up with the API they send you an authentication token. There may be some way to authenticate through Oxygen, but if so I couldn’t figure it out. So, back to the drawing board.

Over lunch on the second day, I picked the brain of my colleague Doug Emery. Doug and I worked together on the Walters BookReaders (which are elsewhere on this site), and I trust him to give good advice. We didn’t have much time, but he suggested using a curl request through the terminal on my Mac – maybe I could do something like that? I had seen curl mentioned on the API documentation as well, but I hadn’t heard of it and certainly hadn’t used it before. But I went back to my office and did some research.

Basically, curl is a command-line tool for grabbing the content of whatever is at the other end of a URL. You give it a URL, and it sends back whatever is on the other end. So, if you send out the URL for an NYPL MODS record, it will send the MODS record back. There’s an example on the NYPL API documentation page which incorporates the authentication token. Score!

curl "http://api.repo.nypl.org/api/v1/items?identifier_type=local_bnumber&identifier_val=b11722689" -H 'Authorization: Token token="abcdefghijklmn"'

where ‘abcdefghijklmn’ is the authentication token you receive when you sign up (link coming soon).

Next, I needed to figure out how to send between 1500 and 2000 URLs through my terminal, without having to do them one by one. Through a bit of Google searching I discovered that it’s possible to replace the URL in the command with a pointer to a text file containing a list of URLs, in the format url = [url]. So I wrote a short XSLT that I used to process over all four of the result documents, pulling out the UUIDs, constructing URLs that pointed to the corresponding MODS records, and putting them in the correct format. Then I put pointers to those documents in my curl command:

curl -K "nypl_medieval_4_forCurl.txt" -H 'Authorization: Token token="[my_token]"' > test.xml
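The URL-list step was done with XSLT, but the same idea can be sketched in JavaScript (Node) for readers who prefer it. This is an illustration under assumptions, not the stylesheet used above: it assumes each UUID in a saved result page sits in a `<uuid>` element, and the function names are made up for the example.

```javascript
// Sketch: turn a saved search-result page into a curl config file.
const MODS_BASE = "http://api.repo.nypl.org/api/v1/items/mods/";

// ASSUMPTION: each UUID appears as <uuid>...</uuid> in the result XML.
function extractUuids(resultXml) {
  const matches = resultXml.matchAll(/<uuid>\s*([^<\s]+)\s*<\/uuid>/g);
  return Array.from(matches, (m) => m[1]);
}

// curl config files (read with -K) take one `url = "..."` line per request.
function toCurlConfig(uuids) {
  return uuids.map((uuid) => `url = "${MODS_BASE}${uuid}.xml"`).join("\n") + "\n";
}

// Made-up sample input; real result pages carry more fields per object.
const sampleResult =
  "<result><uuid>aaaa-1111</uuid><title>One</title></result>" +
  "<result><uuid>bbbb-2222</uuid><title>Two</title></result>";

console.log(toCurlConfig(extractUuids(sampleResult)));
```

Writing that text to a file (for example with `fs.writeFileSync`) gives something that can be passed to `curl -K` as in the command above.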

Voila – four documents chock full of MODS goodness. And I was able to do it with a Mac terminal and just a little bit of XSLT.

Q: How do you teach TEI in an hour?

A: You don’t! But you can provide a substantial introduction to the concept of the TEI, and explain how it functions.

On June 4 I participated in PhillyDH@Penn, a day of workshops and unconference sessions sponsored by PhillyDH and held in my own beloved Van Pelt-Dietrich Library on the University of Pennsylvania campus. I was sick, so I wasn’t able to participate fully, but I was able to lead a one-hour Introduction to TEI. I aimed it at absolute beginners, with the intention to a) give the audience an idea of what TEI is and what it’s for (to help them answer the question, Is TEI really what I need?) and b) explain enough about the TEI so they will know a bit of something walking into their first “real” (multi-hour, hands-on) TEI workshop. I got a lot of good feedback, so hopefully it did its job. And I do hope to have the opportunity to follow this up with more substantial workshops.

Slides (in PDF format) are posted here.

EDIT: Need to add that these slides owe a ton to James Cummings, with whom I have taught TEI and to whom I owe much of what I know about it!

New Job

I have a new job! As of April 1, I am the Curator, Digital Research Services, Schoenberg Institute for Manuscript Studies, Special Collections Center, University of Pennsylvania.

Now say that five times fast.

In this role, I’m responsible for the digital initiatives coming out of SIMS. I’m also a curator for the Schoenberg Manuscript collection (as are all SIMS staffers in the SCC), and I manage the Vitale Special Collections Digital Media Lab (Vitale 2). It’s also clear that I will have some broad role in the digital humanities at UPenn, although that part of the job isn’t so clear to me yet. I’m the inaugural DRS Curator, so there is still a lot to be worked out, although after three weeks on the job I’m feeling really confident that things are under control.

I report to Will Noel, Director of the SCC and Director of SIMS. Will may be best known as the Project Director for the Archimedes Palimpsest Project at the Walters Art Museum. Will is awesome, and I’m thrilled to be working for him. I’m also thrilled to be working at the University of Pennsylvania, in a beautiful brand new space, in the great city of Philadelphia.

I plan to blog more in this new position, so keep your eyes here and on the MESA blog.

Medieval Electronic Scholarly Alliance

It’s not that I haven’t been doing anything; I just haven’t been doing anything here.

In June, Tim Stinson at NCSU and I were awarded a grant from the Andrew W. Mellon Foundation to implement the Medieval Electronic Scholarly Alliance, MESA, and since we started preparing metadata for indexing in, oh, about mid-September, all of my technical energy has been going into that work. MESA is a node of the Advanced Research Consortium (ARC), and follows the model of scholarly federation spearheaded by NINES and followed by 18thconnect, using Collex as its underlying software support.

What does it involve? It involves taking metadata already created by digital collections or projects (like, for example, the Walters Art Museum, Parker on the Web, or e-Codices), mapping their fields to the Collex and MESA fields, and then generating RDF that conforms to the Collex guidelines. It’s these three collections (and, just in the last week and with a ton of help from Lisa McAulay at UCLA, the St. Gall Project) that I’ve spent the most time on, and here are some key things I’ve learned so far that might be helpful for people considering adding their collections to MESA (or to any of the other ARC nodes)… or, for that matter, for anyone who is interested in sharing their data, any time ever.

  1. If your metadata is in an XML format, put unique identifiers on all the tags. Just do it.
  2. Follow best practices for file and folder naming. In this case, best practice probably just means BE CONSISTENT. My favorite example for illustrating consistency is probably the Digital Walters at the Walters Art Museum. One of the reasons that it’s a great example is that their data (image files and metadata) has been released under Open Access licensing, with the intention that people (like me) will come along and grab them, and work with them. What this means is that everything was designed with this usage in mind, so it’s simple, and that all their practices are well-documented, so it’s easy to take them and run without having to figure a lot out. Their file and folder naming practices are documented here.
  3. Be thoughtful about how you format your dates. Yeah, dates are a big deal. The metadata of one of the projects (I won’t say which one) had dates formatted as, e.g., “xii early-xiii late (?)”, and with no ISO (or any other even slightly computer-readable) version in, e.g., an attribute value. I ended up mapping every individual date value to a computer-readable form. As always, I figure there was a better way to do it, but at the time… well, it took a while, but in the end it did what I needed it to do. If you are in the process of setting up a new project, consider including computer-readable dates in your metadata. Because Collex/MESA includes values for both date label and date value, you can include both a computer-readable date to be used for searching, and a human-readable version, which may have more nuance than can be included in a computer-readable value. For example, LABEL: xii early-xiii late (?); VALUE: 1100,1300.
  4. Is whatever you are describing in English? Great! If it’s not, find someplace in your metadata to indicate what language it is in. Use a controlled vocabulary, such as the ISO 639-2 Language Code List, and as with dates it helps to include a computer-readable code in addition to a human-readable version. (This is less important for MESA, as we can map between the columns in the ISO 639-2 Language Code List, but it’s a good general rule to follow)
  5. You can actually do a lot with a little. In the past few weeks the metadata I’ve worked with has ranged from really amazing full manuscript descriptions (in TEI P5) from the Walters Art Museum to very simple and basic descriptions in Dublin Core from e-Codices. Although the full descriptions provide more information (helpful for full-text searching), the simple DC records, assuming they include the basic information you’d want for searching (title, date, language, maybe incipits), are just fine. So don’t let a lack of detail put you off from contributing to MESA!
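As an illustration of the date point in item 3, here is a minimal sketch in JavaScript (Node) of how a century label written in Roman numerals might be turned into a start,end year pair. It is not the mapping used for MESA: the function names are hypothetical, qualifiers such as "early" and "late" are deliberately ignored, and the century-to-year convention (the xii century as 1100 to 1200) is an assumption.

```javascript
// Sketch: convert a label such as "xii early-xiii late (?)" to "1100,1300".
const ROMAN_VALUES = { i: 1, v: 5, x: 10, l: 50, c: 100 };

function romanToInt(numeral) {
  const chars = numeral.toLowerCase();
  let total = 0;
  for (let i = 0; i < chars.length; i++) {
    const current = ROMAN_VALUES[chars[i]];
    const next = ROMAN_VALUES[chars[i + 1]] || 0;
    if (current === undefined) {
      throw new Error(`Not a Roman numeral: ${numeral}`);
    }
    // A smaller value before a larger one is subtracted (ix = 9).
    total += current < next ? -current : current;
  }
  return total;
}

// ASSUMPTION: a century n spans the years (n - 1) * 100 to n * 100, and
// "early" / "mid" / "late" are ignored. A stray "c" (as in "c. 1200")
// would be misread as a numeral, so real data needs more care.
function centuryLabelToRange(label) {
  const numerals = label.toLowerCase().match(/\b[ivxlc]+\b/g) || [];
  if (numerals.length === 0) return null;
  const centuries = numerals.map(romanToInt);
  const start = (Math.min(...centuries) - 1) * 100;
  const end = Math.max(...centuries) * 100;
  return `${start},${end}`;
}

console.log(centuryLabelToRange("xii early-xiii late (?)"));
```

Keeping the original label alongside the computed value preserves the nuance ("early", "late", the question mark) that the numeric range drops.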

I would also encourage you to release your data under an Open Access license (something from the Creative Commons), and to make your data easy to grab (either by posting everything online, as the Walters Art Museum has done, or by releasing it through an OAI provider, as e-Codices has done).

Manuscript-level records in Omeka

I had some time tonight to figure out how to get the manuscript-level records into Omeka.

I’d already worked out the (very simple) XPath needed to pull out the few bits of information I wanted from the TEI msDesc: siglum, title, URL to the Walters data, BookReader URL, and URL to the first thumbnail image in the ms (which I’m just using as an illustration, a placeholder for the full ms). This evening I did the final checks to make sure the output was right, then just ran the XSLT against all the msDesc files and created .csv files for each one (there’s an Omeka plug-in that will accept CSV input, in order to bulk ingest metadata and files). I used the “insert file contents” function in TextWrangler (bless that program) to pull all of the individual CSV rows into a single document, then ingested that into Omeka. There were a few bugs, of course, but generally it was smooth. I’ve made a few of the records live, just those that already have illumination records tagged in Omeka too. What this means is that you can now go to the Omeka site, go to “Browse Items by Tag” (http://www.dotporterdigital.org/omeka/items/tags), and click on one of the larger tags (each ms and illumination is tagged with the ms siglum; the more illustrations there are, the larger the tag will appear in the browse list). At the moment the first entry in the list will be the record for the manuscript, followed by the record for the illuminations… although I don’t know if that’s just because the manuscript records are newer.
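The CSV step above was done with XSLT and TextWrangler. For readers who want to see the shape of the output, here is a minimal sketch in JavaScript (Node) of building one CSV document from several manuscript records. The column names and the sample record are hypothetical, and the quoting rules follow common CSV practice rather than anything specific to the Omeka plug-in.

```javascript
// Sketch: write manuscript-level records as CSV rows.
// Quote a field only when it contains a comma, quote, or line break;
// embedded quotes are doubled, as in common CSV practice.
function csvField(value) {
  const text = String(value ?? "");
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

function toCsv(records, columns) {
  const header = columns.map(csvField).join(",");
  const rows = records.map((record) =>
    columns.map((column) => csvField(record[column])).join(",")
  );
  return [header, ...rows].join("\n") + "\n";
}

// HYPOTHETICAL column names and values, for illustration only.
const columns = ["siglum", "title", "dataUrl", "bookReaderUrl", "thumbnailUrl"];
const records = [
  {
    siglum: "MS-1",
    title: 'Gospels, with "commentary"',
    dataUrl: "http://example.org/data/MS-1/",
    bookReaderUrl: "http://example.org/bookreader/MS-1/",
    thumbnailUrl: "http://example.org/data/MS-1/thumb.jpg",
  },
];

console.log(toCsv(records, columns));
```

Titles are the field most likely to contain commas or quotation marks, which is why the quoting helper matters even for a handful of columns.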

I would like to include each illumination in the ms records (HasPart) and the ms in the illumination records too (IsPartOf), but I am not certain that’s something I’ll be able to do programmatically. Anyway, I think that is the next thing on my list. That and tagging all of the other illumination records with the siglum (so they will be browseable with the manuscript).

Walters in Omeka

This evening I spent some time thinking about the best way to organize the Digital Walters data into Omeka.

I’ve already experimented with bulk ingesting all of the illuminations (pulling all the decoDesc tags from the TEI manuscript descriptions, and creating a record in Omeka for each one). You can see these in the Omeka instance (although it’s not very pretty). I realized that, as fun as that experiment was, in order for it to be useful I need to take a step back and reevaluate how best to move forward.

I created a record for one of the manuscripts: http://www.dotporterdigital.org/omeka/items/show/2618. It’s basic, including the Title (and the siglum under Alternative Title), links to the manuscript’s home on the Digital Walters site and its Bookreader version on this site (both under Description), and under Has Part, links to an illumination record in Omeka, an illumination that appears in that manuscript.

I want to do a few things to start out:

1) Create one record for each manuscript. I will do this using Omeka’s CSV plug-in… I’ve figured out how to pull all of the information I need from each of the TEI MS Description files; now I need to figure out how to pull all of it into one file and make that file a CSV file. I think I can use XInclude to do that, but I need to experiment more than I had time to tonight.

2) I would like to have a way to automatically attach the illumination records that are already in Omeka to the new manuscript records. The link that’s in the test record is one I added by hand, but Omeka has a collection hierarchy and I need to play with that to see if there might be something in there that can be used for this purpose. What I fear is that the hierarchy is only at the full level – that is, I can say that all of the illuminations are under all of the manuscripts, but I can’t say that some subset of the illuminations are under some particular manuscript. I will need to find out more!

It’s good to be back in the site. I still have an article to finish (and another already started) but I would like to make some progress on the Omeka catalog in the next month or so.

What I’m up to

I know there was a while, when we were in the throes of the Digital Walters Bookreader project, when I was updating this blog every night! Then I had to slow down to go to the Digital Humanities 2012 conference, and the ESU Culture and Technologies Summer School, and I had to finish an article (submitted! yay!) and I haven’t really gotten back to this. I did post a long-promised update this evening describing the process of the BookReaders. And I’m planning to continue working on the Digital Walters Omeka – which if it works will be a full catalog of the works and illustrations in all of the Digital Walters manuscripts (including links both to the DW site and to the BookReaders). Hopefully Doug Emery will be willing and able to help with that as he did with the BookReaders project.

I’m currently on research leave, working on an article (maybe two, if I can swing it) on medievalists’ use of digital resources (the topic of my paper at Kalamazoo, my poster at DH2012, and related to my lecture at the ESU School and the article I submitted last month). There’s just too much interesting stuff to say about that; it’s a bit overwhelming. And then of course there is the Medieval Electronic Scholarly Alliance (MESA), which I am co-directing with Tim Stinson at NCSU… just gearing up… and my day job as well (which I’ll be back to on September 4) to keep me busy. But I will keep working here, because… well, because it’s fun.

How to set up your own Bookreader

Now that the Digital Walters BookReaders are all updated and online, I wanted to make a post documenting how you too can create BookReaders along the same lines.

The original BookReader source is available from the Internet Archive. That was the starting point. Doug Emery and I worked together to modify that code to pull most of the information needed by the BookReader out of a TEI Manuscript Description – although the same process could potentially be followed to read from some other XML file containing the appropriate information. I did have to write an XSLT to create the BookReader files themselves (one for each manuscript). Of course it would also be possible to create one BookReader file for all manuscripts to share! But I wanted the BookReaders to be able to be grabbed and used separately, rather than depending on server-side scripting in order to work.

You will need:

The .zip file from Internet Archive contains a License, a readme text file, and three folders: BookReader, BookReaderDemo, and BookReaderIA. The modified BookReader.js file available here replaces the BookReaderJSSimple.js file contained in the BookReaderDemo folder. The rest of the files in the folder should be unchanged. The BookReader folder is required for the system to work. We don’t use the BookReaderIA folder.

To run the sample modified BookReader.js file:

  • Replace the BookReaderJSSimple.js file in the BookReaderDemo folder with the .js file from the link above
  • Place the TEI document in the BookReaderDemo folder
  • Open the index.html document in the BookReaderDemo folder

It really is that easy. If you want to take the files I have on this site and host them yourself, please do! The relevant URLs are all formatted as above.

Now, you may want to use these files as a basis to build BookReaders for your own collection. If you have TEI Manuscript Description files you should be able to do it. The file will need to have (or you will need to be able to generate somehow):

  • Title of the book or manuscript
    • TEI/teiHeader/fileDesc/titleStmt/title[@type='common']
  • a URL to the official webpage of the manuscript (if there is one)
    • We generated this by supplying the base URL (http://www.thedigitalwalters.org/Data/WaltersManuscripts/) and then filling in the rest of the URL from the TEI with .concat(siglum, '/data/', idno, '/'), where siglum and idno were pulled from different areas of the file
  • The number of leaves / pages
    • We generated this by counting <surface> tags – but because some of the images were duplicates (one with flap closed, one with flap open) and also included fore-edge, tail, spine, and head images, we had to do a bit of work to keep those from being counted.
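    • // select the surfaces, skipping the binding views and the "flap closed" duplicates
      var surfaces = $(file).find("surface[n!='Fore-edge'][n!='Tail'][n!='Spine'][n!='Head']")
          .not("[n*='flap closed']");
      // .length works in every jQuery version (.size() was removed in jQuery 3)
      var leafCount = surfaces.length;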
  • An indication of whether the manuscript / book is to be read left to right or right to left (we generated this by searching for the language code and specifying which languages are r-l)
    • var rtlLangs = [ 'ara', 'heb', 'jpr', 'jrb', 'per', 'tuk', 'syc', 'syr', 'sam', 'arc', 'ota' ];
      // get the lang from the TEI
      var lang = $(file).find('textLang').attr('mainLang');
      // set pageProgression if lang is in rtlLangs
      if (jQuery.inArray(lang, rtlLangs) > -1) {
        br.pageProgression = 'rl';
      }
  • URLs of the location of the page / leaf files
    • These were generated using the file names that were provided in the TEI document (@url on the third <graphic> tag, which was the resolution file we wanted for the page turning)
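    • // assumed wrapper: in the Internet Archive demo, page URLs are returned by br.getPageURI
      br.getPageURI = function(index, reduce, rotate) {
        // @url on the third <graphic> of the surface at this index
        var path = $(file).find('surface').eq(index).find('graphic').eq(2).attr('url');
        var graphicurl = url + path;
        return graphicurl;
      }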
  • The height and width of page/leaf files
    • I tried many different ways to get these. In the first version of the Digital Walters BookReaders I hard-coded the height and width into the .js file (this is what is done in the demo version available from Internet Archive). Unfortunately the image files in Digital Walters are different sizes – although always 1800px on the long edge, the short edge will vary page by page, and the long edge is not always the vertical side. Eventually, the Digital Walters team very kindly generated new TEI files for me to use, with the height and width hard-coded. Ideally there would be some way to automatically generate height and width from the files themselves but if there is some way to do that, I don’t know it!
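    • br.getPageWidth = function(index) {
        var widthpx = $(file).find('surface').eq(index).find('graphic').eq(2).attr('width');
        if (widthpx) {
          // strip the "px" suffix and parse as a base-10 integer
          var width = parseInt(widthpx.replace("px", ""), 10);
          return width;
        } else {
          // fall back to a default width when the attribute is missing
          return 1200;
        }
      }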
    • And again for height
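To make the "and again for height" step concrete, here is a minimal sketch in plain JavaScript of the width and height lookups sharing one parsing helper. It is not the code in the Digital Walters BookReaders: the helper and factory names are hypothetical, plain objects stand in for the third `<graphic>` of each `<surface>`, and the 1800 fallback height is an assumption (the original only shows a 1200 fallback for width).

```javascript
// Sketch: parse "1200px"-style attribute values into numbers.
function parsePx(value, fallback) {
  const parsed = parseInt(String(value ?? "").replace("px", ""), 10);
  // Always return a number, never text.
  return Number.isNaN(parsed) ? fallback : parsed;
}

// `graphics` stands in for the width/height attributes read from the TEI.
// ASSUMPTION: the fallback height of 1800 is illustrative only.
function makePageSizeGetters(graphics, fallbackWidth = 1200, fallbackHeight = 1800) {
  return {
    getPageWidth: (index) => parsePx(graphics[index]?.width, fallbackWidth),
    getPageHeight: (index) => parsePx(graphics[index]?.height, fallbackHeight),
  };
}

// Made-up sample data: one complete entry and one with nothing recorded.
const sizes = makePageSizeGetters([{ width: "1200px", height: "1800px" }, {}]);
console.log(sizes.getPageWidth(0), sizes.getPageHeight(0));
console.log(sizes.getPageWidth(1), sizes.getPageHeight(1));
```

In a real BookReader file the two returned functions would be assigned to `br.getPageWidth` and `br.getPageHeight`, with the values read from the TEI via jQuery as in the snippets above.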

One last thing: because I wanted to generate many files at the same time, one per manuscript, I set up an XSLT that I could use to create those files based on information from the TEI documents. That XSLT is available here: http://dotporterdigital.org/walters/TEImsdesc2js.xsl. Aside from the body of the .js there are just a few transformations, and they are (I think) sufficiently documented.

I hope this is useful. I certainly learned a lot working on this project. Thanks to Doug Emery for all his technical help, to Will Noel for his moral support and interest in the project (and for putting me in touch with Doug!). And finally, thanks to the Trustees of the Walters Art Museum for making all of this great data available under Open Access licenses so people like me can do fun and cool things with it!

Walters Bookreaders updated!

Just a quick post to say that, thanks to Doug Emery, the Walters Digital Bookreaders have all been updated to remove the bug that was causing all of the 1-up images to appear as thumbnail size. Doug noticed that in our code, image sizes were being parsed as text and not as numbers, so the Bookreader code couldn’t figure out how to process them. It was a simple change, I was able to make it globally, and everything has been updated as of Wednesday night.

In the process of doing the global replace, however, I discovered that Oxygen (which I use for all of my XML and JavaScript encoding) doesn’t recognize .js files when doing a replace across files (at least, it was not recognizing my .js files). So I downloaded a tool called TextWrangler (http://www.barebones.com/products/TextWrangler/) and it got the job done in no time. I’d actually heard of TextWrangler before: about 18 months ago I did a consulting gig with some folks down at Southern Louisiana University and they were using TextWrangler for all of their find-and-replace needs in XML. I’m happy to report that it does work very well.

Over the weekend I’m planning to write up a post documenting how the Walters Bookreaders work, along with the code, so others can try setting up their own page-turning versions of open access page images.