How to download images using IIIF manifests, Part II: Hacking the Vatican

Tags

, , ,

Last week I posted on how to use a Firefox plugin called Down them All to download all the files from an e-codices IIIF manifest (there’s also a tutorial video on YouTube, one of a small but growing collection that will soon include a video outlining the process described here), but not all manifests include direct links to images. The manifests published by the Vatican Digital Library are a good example of this. The URLs in manifests don’t link directly to images; you need to add criteria at the end of the URLs to hit the images. What can you do in that case? In that case, what you need to do it build a list of urls pointing to images, then you can use Down Them All (or other tools) to download them.

In addition to Down Them All I like to use a combination of TextWrangler and a website called Multilinkr, which takes text URLs and turns them into hot links. Why this is important will become clear momentarily.

Let’s go!

First, make sure you have all the software you’ll need: Firefox, Down Them All, and TextWrangler.

Next, we need to pull all the base URLs out of the Vatican manifest.

  1. Search the Vatican Digital Library for the manuscript you want. Once you’ve found one, download the IIIF manifest (click the “Bibliographic Information” button on the far left, which opens a menu, then click on the IIIF manifest link)
    Vatican Digital Library.

    Vatican Digital Library.

    Screen Shot 2016-07-19 at 9.40.49 AM

    Viewing Bibliographic Information. IIIF manifest is on the bottom of the list

  2. Open the manifest you just downloaded in TextWrangler. When it opens, it will appear as a single long string:
    Manifest open in TextWrangler.

    Manifest open in TextWrangler.

    You need to get all the URLs on separate lines. The easiest way to do this is to find and replace all commas with a comma followed by a hard return. Do this using the “grep” option, using “\r” to add the return. Your find and replace box will look like this (don’t forget to check the “grep” box at the bottom!):

    Find and replace. Don't forget grep!

    Find and replace. Don’t forget grep!

    Your manifest will now look something like this:

    Manifest, now with returns!

    Manifest, now with returns!

  3. Now we’re going to search this file to find everything that starts with “http” and ends with “jp2” (what I’m calling the base URLs). We’ll use the “grep” function again, and a little regular expression that will match everything between the beginning of the URL and the end( .* ). Your Find window should look like this (again, don’t forget to check “grep”). Click “Find All”:
    Find the URLs that end with "jp2"

    Find the URLs that end with “jp2”

    Your results will appear in a new window, and will look something like this:

    Search results.

    Search results.

  4. Now we want to export these results as text, and then remove anything in the file that isn’t a URL. First, go to TextWrangler’s File menu and select “Export as Text”:
    Export as Text.

    Export as Text.

    Save that text file wherever you’d like. Then open it in TextWrangler. You now need to do some finding and replacing, using “grep” (again!) and the .* regular expression to remove anything that is not http…jp2. I had to do two runs to get everything, first the stuff before the URLs, then the stuff after:

    Before first search.

    Before first find and replace.

    After first search.

    After first find and replace.

    Before second find and replace.

    Before second find and replace.

    After second find and replace.

    After second find and replace.

  5. You will notice (I hope!) that there are forward slashes (\) before every backslash/regular slash (/) in the URLs. So we need to remove them too. Just to a regular find and replace, DO NOT check the “grep” box:
    Before the slash find and replace.

    Before the slash find and replace.

    After the slash find and replace.

    After the slash find and replace.

    Hooray! We have our list of base URLs. Now we need to add the criteria necessary to turn these base URLs into direct links to images.

    I keep mentioning the criteria required to turn these links from error-throwers to image files. If you go to the Vatican Digital Library website and mouse over the “Download” button for any image file, you’ll see what I mean. As you mouse that button over a bar will appear at the very bottom of your window, and if you look carefully you’ll see that the URL there is the base URL (ending in “jp2”) followed by four things separated by slashes:

    Check out the bits after "jp2" in the URL.

    Check out the bits after “jp2” in the URL.

    There is a detailed description of what exactly these mean in the IIIF Image API Documentation on the IIIF website, but basically:

    [baseurl]/region/size/rotation/quality

    So in this case, we have the full region (the entire image, not a piece of it), size 1047 pixels across by however tall (since there is nothing after the comma), rotation of 0 degrees, and a quality native (aka default, I think – one could also use bitonal or gray to get those quality of images). I like to get the “full” image size, so what I’m going to add to the end of the URLS is:

    [baseurl]/full/full/0/native.jpg

    We’ll just do this using another find and replace in TextWrangler.

  6. We’re just adding the additional criteria after the file extension, so all I do is find the file extension – jp2 – and replace all with “jp2/full/full/0/native.jpg”.
    Adding criteria: before find and replace.

    Adding criteria: before find and replace.

    Adding criteria: After find and replace.

    Adding criteria: After find and replace.

    Test one, just to make sure it works. Copy and paste the URL into a browser.

    Works for me.

    Works for me.

  7. Now – finally! promise! – you can use Down them All to download all those lovely image files. In order to do that you need to turn the text links into hot links. When I was testing this I first tried opening the text file in Firefox and pointing Down Them All to it, but it broke Down Them All – and I mean BROKE it. I had to uninstall Down Them All and delete everything out of my Firefox profile before I could get it to work again. Happily I found a tool that made it easy to turn those text links into hot links: Multilinkr. So now open a new tab in Firefox and open Multilinkr. Copy all the URLs from TextWrangler and paste them into the Multilinkr box. Click the “Links” button and gasp as the text links turn into hot links:
    Text links.

    Text links.

    Hot links *gasp*

    Hot links *gasp*

    Now go up to the Firefox “Tools” menu and select “Down Them All Tools > Down Them All” from the dropdown: Screen Shot 2016-07-21 at 1.51.15 PMDown Them All should automatically recognize all the files and highlight them. Two things to be careful about here. One is that you need to specify a download location. It will default to your Downloads folder, but I like to indicate a new folder using the shelfmark of the manuscript I’m downloading. You can also browse to download the files wherever you’d like. The second one is that Down Them All will keep file names the same unless you tell it to do something different. In the case of the Vatican that’s not ideal, since all the files are named “native.jpg”, so if you don’t do something with the “Renaming Mask” you’ll end up with native.jpg native.jpg(1) native.jpg(2) etc. I like to change the Renaming Mask from the default *name*.*ext* to *flatsubdirs*.*ext* – “flatsubdirs” stands for “flat subdirectories”, and it means the downloaded files will be named according to the path of subdirectories wherever they are being downloaded from. In the case of the Vatican files, a file that lives here:

    http://digi.vatlib.it/iiifimage/MSS_Vat.lat.3773/3396_0-AD0_f11fd975a99e2b099ee569f7667f8b8d0fd922dbc0cf5cd6730cda1a00626794_1469208118412_Vat.lat.3773_0003_pa_0002.jp2/full/full/0/native.jpg

    will be renamed

    iiifimage-MSS_Vat.lat.3773-3396_0-AD0_f11fd975a99e2b099ee569f7667f8b8d0fd922dbc0cf5cd6730cda1a00626794_1469208118412_Vat.lat.3773_0003_pa_0002.jp2-full-full-0.jpg

    This is still a mouthful, but both the shelfmark (Vat.lat.3773) and the page number or folio number are there (here it’s pa_0002.jp2 = page 2, in other manuscripts you’ll see for example fr_0003r.jp2), so it’s simple enough to use Automator or another tool to batch rename the files by removing all the other bits and just leaving the shelfmark and folio or page number.

There are other ways you could do this, too, using Excel to construct the URLs and wget to download, but I think the method outlined here is relatively simple for people who don’t have strong coding skills. Don’t hesitate to ask if you have trouble or questions about this! And please remember that the Vatican manuscript images are not licensed for reuse, so only download them for your own scholarly work.

How to download images using IIIF manifests, Part I: DownThemAll

Tags

,

Edit: Down Them All has stopped working, as of at least July 19, 2016. I will be updating this post to use a different tool. Until then please be aware that these instructions will not work. Sorry!

IIIF manifests are great, but what if you want to work with digital images outside of a IIIF interface? There are a few different ways I’ve figured out that I can use IIIF manifests to download all the images from a manuscript. The exact approach will vary since different institutions construct their image URLs in different ways. Here’s the first approach, which is fairly straightforward and uses e-codices as an example. Tomorrow I’ll post a second post using on the Vatican Digital Library. Please remember that most institutions license their images, so don’t repost or publish images unless the institution specifically allows this in their license.

Method 1: The manifest has urls that resolve directly to image files

This is the easiest method, but it only works if the manifest contains urls that resolve directly to image files. If you can copy a url and paste it into a browser and an image displays, you can use this method. The manifests provided by e-codices follow this approach.

  1. Install DownThemAll, a Firefox browser plugin that allows you to download all the files linked to from a webpage. (There is a similar browser plugin for Chrome, called Get Them All, but it did not recognize the image files linked from the manifest)
  2. Go to e-codices, search for a manuscript, and click the “IIIF manifest” link on the Overview page.
    IIIF manifest link (look for the colorful IIIF logo)

    IIIF manifest link (look for the colorful IIIF logo)

    The manifest will open in the browser. It will look like a mess, but it doesn’t need to look good.

    Messy manifest.

    Messy manifest.

  3. Open DownThemAll. It will recognize all the files linked from the manifest (including .json files, .jpg, .j2, and anything else) and list them. Click the box next to “JPEG Images” at the bottom of the page (under “Filters”). It will highlight all the JPEG images in the list, including the various “default.jpg” images and files ending with “.jp2”

    Screen Shot 2016-07-14 at 4.46.47 PM

    JPEG images highlighted in Down Them All

  4. Now, we only want the images that are named “default.jpg”. These are the “regular” jpeg files; the .jp2 files are the masters and, although you could download them, your browser wouldn’t know what to do with them. So we need to create a new filter so we get only the default.jpg files. To do this, first click “Preferences” in the lower right-hand corner, then click the “Filters” button in the resulting window.
    Filters.

    Filters.

    There they are. To create a new filter, click the “Add New Filter” button, and call the new filter “Default Jpg” (or whatever you like). In the Filtered Extensions field, type “/\/default.jpg” – the filter will select only those files that end with “default.jpg” (yes you do need three slashes there!). Note that you do not need to press save or anything, the filter list updates and saves automatically.

    New filter

    New filter

  5. Return to the main Down Them All view and check the box next to your newly-created filter. Be amazed as all the “default.jpg” files are highlighted.

    AMAZE

    AMAZE

  6. Don’t hit download just yet. If you do, it will download all the files with their given names, and since they are all named “default.jpg” it won’t end well. It will also download them all directly to whatever is specified under “Save Files in” (in my case, my Downloads folder) which also may not be ideal. So you need to change the Renaming Mask to at least give you unique names for each one, and specify where to download all those files. In the case of e-codices the manifest urls include both the manuscript shelfmark and the folio number for each image, so let’s use the Renaming Mask to name the files according to the file page. Simply change *name* to *flatsubdirs* (flat subdirectories). Under “Save Files in”, browse to wherever you want to download all these files.

    Renaming Mask and Save Files in, read to go

    Renaming Mask and Save Files in, read to go

  7. Press “Start” and wait for everything to download.
    Downloading...

    Downloading…

    Congratulations, you have downloaded all the images from this manuscript! You’ll probably want to rename them (if you’re on Mac you can use Automator to do this fairly easily), and you should also save the manifest alongside the images.

    TOMORROW: THE VATICAN!

Manuscript PDFs: Update

My last post was an announcement that I’d posted the University of Pennsylvania’s Schoenberg Collection manuscripts on Google Drive as PDF files, along with details on how I did it. This is a follow-up to announce that I’ve since added PDF files for UPenn’s Medieval and Renaissance Manuscript collection, AND for the Walters Art Museum manuscripts (which are available for download through The Digital Walters).

As with the Schoenberg Manuscripts, these two other collections are in their own folders, along with a spreadsheet you can search and brows to aid in discovery. You are free to download the PDF files and redistribute them as you wish. They are in the public domain.

The main directory for the manuscripts is here.

Enjoy!

Title: Initial "C" with St. Paul trampling Agrippa Form: Historiated initial "C," 12 lines Text: Psalm 97 Comment: The inscriptions on the scrolls read "Paulus/Agr[ipp]a." The second inscription is partially obliterated.

Source: Walters Ms. W.36, Touke Psalter, fol. 89r
Title: Initial “C” with St. Paul trampling Agrippa
Form: Historiated initial “C,” 12 lines
Text: Psalm 97
Comment: The inscriptions on the scrolls read “Paulus/Agr[ipp]a.”
The second inscription is partially obliterated.

 

UPenn’s Schoenberg Manuscripts, now in PDF

Tags

Hi everyone! It’s been almost a year since my last blog post (in which I promised to post more frequently, haha) so I guess it’s time for another one. I actually have something pretty interesting to report!

Last week I gave an invited talk at the Cultural Heritage at Scale symposium at Vanderbilt University. It was amazing. I spoke on OPenn: Primary Digital Resources Available to Everyone, which is the platform we use in the Schoenberg Institute for Manuscript Studies at the University of Pennsylvania Libraries to publish high-resolution digital images and accompanying metadata for all our medieval manuscripts (I also talked for a few minutes about the Schoenberg Database of Manuscripts, which is a provenance database of pre-1600 manuscripts). The philosophy of OPenn is centered on openness: all our manuscript images are in the public domain and our metadata is licensed with Creative Commons licenses, and none of those licenses prohibit commercial use. Next to openness, we embrace simplicity. There is no search facility or fancy interface to the data. The images and metadata files are on a file system (similar to the file systems on your own computer) and browse pages for each manuscript are presented in HTML that is processed directly from the metadata file. (Metadata files are in TEI/XML using the manuscript description element)

Screen Shot 2016-06-10 at 2.20.26 PM

This approach is actually pretty novel. Librarians and faculty scholars alike love their interfaces! And, indeed, after my talk someone came up to me and said, “I’m a humanities faculty member, and I don’t want to have to download files. I just want to see the manuscripts. So why don’t you make them available as PDF so I can use them like that?”

This gave me the opportunity to talk about what OPenn is, and what it isn’t (something I didn’t have time to do in my talk). The humanities scholar who just wants to look at manuscripts is really not the audience for OPenn. If you want to search for and page through manuscripts, you can do that on Penn in Hand, our longstanding page-turning interface. OPenn is about data, and it’s about access. It isn’t for people who want to look at manuscripts, it’s for people who want to build things with manuscript data. So it wouldn’t make sense for us to have PDFs on OPenn – that’s just not what it’s for.

Landing page for Penn in Hand.

Landing page for Penn in Hand.

HOWEVER. However. I’m sympathetic. Many, many people want to look at manuscripts, and PDFs are convenient, and I want to encourage them to see our manuscripts as available to them! So, even if Penn isn’t going to make PDFs available institutionally (at least, not yet – we may in the future), maybe this is something I could do myself. And since all our manuscript data is available on OPenn and licensed for reuse, there is no reason for me not to do it.

So here they are.

If you click that link, you’ll find yourself in a Google Drive folder titled “OPenn manuscript PDFs”. In there is currently one folder, “LJS Manuscripts.” In that folder you’ll fine a link to a Google spreadsheet and over 400 PDF files. The spreadsheet lists all the LJS manuscripts (LJS = Laurence J. Schoenberg, who gifted his manuscripts to Penn in 2012) including catalog descriptions, origin dates, origin locations, and shelfmarks. Let’s say you’re interested in manuscripts from France. You can highlight the Origin column and do a “Find” for “France.” It’s not a fancy search so you’ll have to write down the shelfmarks of the manuscripts as you find them, but it works. Once you know the shelfmarks, go back into the “LJS Manuscripts” folder and find and download the PDF files you want. Note that some manuscripts may have two PDF files, one with “_extra” in the file name. These are images that are included on OPenn but not part of the front-to-back digitization of a manuscript. They might include things like extra shots of the binding, or reference shots.

If you are interested in knowing how I did this, please read on. If not, enjoy the PDFs!

Screen Shot 2016-06-10 at 2.24.32 PM

How I did it

I’ll be honest, this is my favorite part of the exercise so thank you for sticking with me for it! There won’t be a pop quiz at the end although if you want to try this out yourself you are most welcome to.

First I downloaded all the web jpeg files from the LJS collection on OPenn. I used wget to do this, because with wget I am able to get only the web jpeg files from all the collection folders at once. My wget command looked like this:

wget -r -np -A “_web.jpg” http://openn.library.upenn.edu/Data/0001/

Brief translation:

wget = use the wget program
-r = “recursive”, basically means go into all the child folders, not just the folder I’m pointing to
-np = “no parent”, basically means don’t go into the parent folders, no matter what
-A “_web.jpg” = “accept list”, in this case I specified that I only want those files that contain _web.jpg (which all the web jpeg files on OPenn do)
http://openn.library.upenn.edu/Data/0001/ = where all the LJS manuscript data lives

I didn’t use the -nd command, which I usually do (-nd = “no directory”, if you don’t use this command you get the entire file structure for the file server starting from root, which in this case is openn.library.upenn.edu. What this means, practically, is that if you use wget to download one file from a directory five levels up, you get empty folders four levels deep then the top director with one file in it. Not fun. But in this case it’s helpful, and you’ll see why later.

At my house, with a pretty good wireless connection, it took about 5 hours to download everything.

I used Automator to batch create the PDF files. After a bit of googling I found this post on batch creating multipage PDF files from jpeg files. There are some different suggestions, but I opted to use Mac’s Automator. There is a workflow linked from that post. I downloaded that and (because all of the folders of jpeg images I was going to process are in different parent folders) I replaced the first step in the workflow, which was Get Selected Finder Items, with Get Specified Finder Items. This allowed me to search in Automator for exactly what I wanted. So I added all the folders called “web” that were located in the ancestor folder “openn.library.upenn.edu” (which was created when I downloaded all the images from OPenn in the previous step). In this step Automator creates one PDF file named “output.pdf” for each manuscript in the same location as that manuscript’s web jpeg images (in a folder called web – which is important to know later).

Once I created the PDFs, I no longer needed the web jpeg files. So I took some time to delete all the web jpegs. I did this by searching in Finder for “_web.jpg” in openn.library.upenn.edu and then sending them all to Trash. This took ages, but when it was done the only thing in those folders were output.pdf files.

I still had more work to do. I needed to change the names of the PDF files so I would know which manuscripts they represented. Again, after a bit of Googling, I chanced upon this post which includes an AppleScript that did exactly what I needed: it renames files according to the path of their location on the file system. For example, the file “output.pdf” located in Macintosh HD/Users/dorp/Downloads/openn/openn.library.upenn.edu/Data/0001/ljs101/data/web would be renamed “Macintosh HD_Users_dorp_Downloads_openn_openn.library.upenn.edu_Data_0001_ljs101_data_web_001.pdf”. I’d never used AppleScript before so I had to figure that out, but once I did it was smooth sailing – just took a while. (To run the script I copied it into Apple’s Script Editor, hit the play button, and selected openn.library.upenn.edu/Data/0001 when it asked me where I wanted to point the script)

Finally, I had to remove all the extraneous pieces of the file names to leave just the shelfmark (or shelfmark + “extra” for those files that represent the extra images). Automator to the rescue again!

  1. Get Specified Finder Items (adding all PDF files located in the ancestor folder “http://openn.library.upenn.edu”)
  2. Rename Finder Items to replace text (replacing “Macintosh HD_Users_dorp_Downloads_openn_openn.library.upenn.edu_Data_0001_” with nothing) –
  3. Rename Finder Items to replace text (replacing “_data_web_001” with nothing)
  4. Rename Finder Items to replace text (replacing “_data_extra_web_001” with “_extra” – this identifies PDFs that are for “extra” images)

The last thing I had to do was to move them into Google Docs. Again, I just searched for “.pdf” in Finder (just taking those that are in openn.libraries.upenn.edu/Data/0001) and dragged them into Google Drive.

All done!

The spreadsheet I generated by running an XSLT script over the TEI manuscript descriptions (it’s a spreadsheet I created a couple of years ago when I first uploaded data about the Penn manuscripts on Viewshare. Leave a comment or send me a note if that sounds interesting and I’ll make a post on it.

 

It’s been a while since I rapped at ya

I’m not dead! I’m just really bad when it comes to blogging. I’m better at Facebook, and somewhat better at Twitter (and Twitter), and I do my best to update Tumblr.

The stated purpose of this blog is to give technical details of my work. This mostly involves finding data, and moving it around from one format to another. I use XSLT, because it’s what I know, although I’ve seen the promise of Python and I may eventually take the time to learn it well. I don’t know when that will happen, though.

I’ve taken to posting files and documentation on Github, so if you’re curious you can look there. If you’re familiar with my interests, and you share them, the most interesting things will be VisColl, a developing system for generating visualizations of manuscript codices showing elements of physical construction, DistributionVis which is, as described on Github, “a wee script to visualize the distribution of illustration in manuscripts from the Walters Art Museum,” and ebooks, files I use to start the process of building ebooks from our digitized collection. (Finished ebooks are archived in UPenn’s institutional repository, you can download them there)

VisColl – quire with an added leaf

DistributionVis – Different color lines refer to different types of illustrations or texts.

VisColl has legs, a lot of interest in the community, and is part of a major grant from the Mellon Foundation to the University of Toronto. Woohoo! DistributionVis is something I threw together in an afternoon because I wanted to see if I could. I thought ebooks were a nice way to provide a different way for people to access our collection. I’ve no idea if either of these two are any use to anyone, but I put them out there, because why not?

I do a lot of putting-things-out-there-because-why-not – it seems to be what I’m best at. So I’m going to continue doing that. And when I do, I shall try my very best to put it here too!

Until next time…

Disbinding Some Manuscripts, and Rebinding Some Others (presented at ICMS, Kalamazoo, MI, May 2014)

I presented my collaborative project on visualizing collation at the International Congress on Medieval Studies in Kalamazoo, Michigan, last week, and it was really well received. Also last week I discovered the Screen Recording function in QuickTime on my Mac. So, I thought it might be interesting to re-present the Kalamazoo talk in my office and record it so people who weren’t able to make the talk could still see what we are up to. I think this is longer than the original presentation – 23 minutes! – so feel free to skip around if it gets boring. Also there is no editing, so um ah um sorry about that. (Watch out for a noise at 18:16, I think my hand brushed the microphone, it’s unnerving if you’re not expecting it)

We’ll also be presenting this work as a poster/demo at the Digital Humanities 2014 Conference in Lausanne this July.

How to get MODS using the NYPL Digital Collections API

Last week I figured out how to batch-download MODS records from the NYPL Digital Collections API (http://api.repo.nypl.org/) using my limited set of technical skills, so I thought I would share my process here.

I had a few tools at my disposal. First, I’m on a Macbook. I’m not sure how I would have done this had I been on a Windows machine. Second, I’m pretty good with XSLT. Although I have some experience with some other languages (javascript, python, perl) I’m not really good at them. It’s possible one could do something like this using other languages and it would be more effective – but I use what I know. I also had a browser, which came in handy in the first step.

The first thing I had to do is find all the objects that I wanted to get the MODS for. I wanted all the medieval objects (surprise!), so to get as broad a search as possible I opted for the “Search across all MODS fields” option (Method 4 in the API Documentation), which involves constructing a URL to stick in a browser. Because the most results the API will return on a single search is 500, I included that limit in my search. I ended up constructing four URLs, since it turned out there were between 1500 and 2000 objects:

I plugged these into my browser, then just saved these result pages as XML files in a directory on my Mac. Each of these results pages had a brief set of fields for each object: UUID (the unique identifier for the objects, and the thing I needed to use to get the MODS), title, typeOfResource, imageID, and itemLink (the URL for the object in the NYPL Digital Collections website).

Next, I had to figure out how to feed the UUIDs back into the API. I thought about this for most of a day, and an evening, and then a morning. I tapped my network for some suggestions, and it wasn’t until Conal Tuohy suggested using document() in XSLT that I thought XSLT might actually work.

To get the MODS record for any UUID, you need to simply construct a URL that points to the MODS record on the NYPL file directory. They look like this:

http://api.repo.nypl.org/api/v1/items/mods/[UUID].xml

For my first attempt, I wrote an XSLT document that used document(), constructing pointers to each MODS record when processed over the result documents I saved from my browser. Had this worked, it would have pulled all the MODS records into a new document during processing. I use Oxygen for most all of my XML work, including processing, but when I tried to process my first result document I got an I/O error. Of course I did – the API doesn’t allow just any old person in. You need to authenticate, and when you sign up with the API they send you an authentication token. There may be some way to authenticate through Oxygen, but if so I couldn’t figure it out. So, back to the drawing board.

Over lunch on the second day, I picked the brain of my colleague Doug Emery. Doug and I worked together on the Walters BookReaders (which are elsewhere on this site), and I trust him to give good advice. We didn’t have much time, but he suggested using a curl request through the terminal on my Mac – maybe I could do something like that? I had seen curl mentioned on the API documentation as well, but I hadn’t heard of it and certainly hadn’t used it before. But I went back to my office and did some research.

Basically, curl is a command-line tool for grabbing the content of whatever is at the other end of a URL. You give it a URL, and it sends back whatever is on the other end. So, if you send out the URL for an NYPL MODS record, it will send the MODS record back. There’s an example on the NYPL API documentation page which incorporates the authentication token. Score!

curl “http://api.repo.nypl.org/api/v1/items?identifier_type=local_bnumber&identifier_val=b11722689” -H ‘Authorization: Token token=”abcdefghijklmn”‘ where ‘abcdefghijklmn’ is the authentication token you receive when you sign up (link coming soon).

Next, I needed to figure out how to send between 1500 and 2000 URLs through my terminal, without having to do them one by one. Through a bit of Google searching I discovered that it’s possible to replace the URL in the command with a pointer to a text file containing a list of URLs, in the format url = [url]. So I wrote a short XSLT that I used to process over all four of the result documents, pulling out the UUIDs, constructing URLs that pointed to the corresponding MODS records, and putting them in the correct format. Then I put pointers to those documents in my curl command:

curl -K “nypl_medieval_4_forCurl.txt” -H ‘Authorization: Token token=”[my_token]”‘> test.xml

Voila – four documents chock full of MODS goodness. And I was able to do it with a Mac terminal and just a little bit of XSLT.

Q: How do you teach TEI in an hour?

A: You don’t! But you can provide a substantial introduction to the concept of the TEI, and explain how it functions.

On June 4 I participated in PhillyDH@Penn, a day of workshops and unconference sessions sponsored by PhillyDH and held in my own beloved Van Pelt-Dietrich Library on the University of Pennsylvania campus. I was sick, so I wasn’t able to participate fully, but I was able to lead a one-hour Introduction to TEI. I aimed it at absolute beginners, with the intention to a) Give the audience an idea of what TEI is and what it’s for (to help them answer the question, Is TEI really what I need?) and b) explain enough about the TEI so they will know a bit of something walking into their first “real” (multi-hour, hands-on) TEI workshop. I got a lot of good feedback, so hopefully it did its job. And I do hope to have the opportunity to follow this up with more substantial workshops.

Slides (in PDF format) are posted here.

EDIT: Need to add that these slides owe a ton to James Cummings, with whom I have taught TEI and to whom I owe much of what I know about it!

New Job

I have a new job! As of April 1, I am the Curator, Digital Research Services, Schoenberg Institute for Manuscript Studies, Special Collections Center, University of Pennsylvania.

Now say that five times fast.

In this role, I’m responsible for the digital initiatives coming out of SIMS. I’m also a curator for the Schoenberg Manuscript collection (as are all SIMS staffers in the SCC), and I manage the Vitale Special Collections Digital Media Lab (Vitale 2). It’s also clear that I will have some broad role in the digital humanities at UPenn, although that part of the job isn’t so clear to me yet. I’m the inaugural DRS Curator, so there is still a lot to be worked out although after three weeks on the job, I’m feeling really confident that things are under control.

I report to Will Noel, Director of the SCC and Director of SIMS. Will may be best known as the Project Director for the Archimedes Palimpsest Project at the Walters Art Museum. Will is awesome, and I’m thrilled to be working for him. I’m also thrilled to be working at the University of Pennsylvania, in a beautiful brand new space, in the great city of Philadelphia.

I plan to blog more in this new position, so keep your eyes here and on the MESA blog.