Last week I figured out how to batch-download MODS records from the NYPL Digital Collections API (http://api.repo.nypl.org/) using my limited set of technical skills, so I thought I would share my process here.
I had a few tools at my disposal. First, I’m on a Macbook. I’m not sure how I would have done this had I been on a Windows machine. Second, I’m pretty good with XSLT. Although I have some experience with some other languages (javascript, python, perl) I’m not really good at them. It’s possible one could do something like this using other languages and it would be more effective – but I use what I know. I also had a browser, which came in handy in the first step.
The first thing I had to do is find all the objects that I wanted to get the MODS for. I wanted all the medieval objects (surprise!), so to get as broad a search as possible I opted for the “Search across all MODS fields” option (Method 4 in the API Documentation), which involves constructing a URL to stick in a browser. Because the most results the API will return on a single search is 500, I included that limit in my search. I ended up constructing four URLs, since it turned out there were between 1500 and 2000 objects:
- http://api.repo.nypl.org/api/v1/items/search.xml?q=medieval&per_page=500&page=1
- http://api.repo.nypl.org/api/v1/items/search.xml?q=medieval&per_page=500&page=2
- http://api.repo.nypl.org/api/v1/items/search.xml?q=medieval&per_page=500&page=3
- http://api.repo.nypl.org/api/v1/items/search.xml?q=medieval&per_page=500&page=4
I plugged these into my browser, then just saved these result pages as XML files in a directory on my Mac. Each of these results pages had a brief set of fields for each object: UUID (the unique identifier for the objects, and the thing I needed to use to get the MODS), title, typeOfResource, imageID, and itemLink (the URL for the object in the NYPL Digital Collections website).
Next, I had to figure out how to feed the UUIDs back into the API. I thought about this for most of a day, and an evening, and then a morning. I tapped my network for some suggestions, and it wasn’t until Conal Tuohy suggested using document() in XSLT that I thought XSLT might actually work.
To get the MODS record for any UUID, you need to simply construct a URL that points to the MODS record on the NYPL file directory. They look like this:
http://api.repo.nypl.org/api/v1/items/mods/[UUID].xml
For my first attempt, I wrote an XSLT document that used document(), constructing pointers to each MODS record when processed over the result documents I saved from my browser. Had this worked, it would have pulled all the MODS records into a new document during processing. I use Oxygen for most all of my XML work, including processing, but when I tried to process my first result document I got an I/O error. Of course I did – the API doesn’t allow just any old person in. You need to authenticate, and when you sign up with the API they send you an authentication token. There may be some way to authenticate through Oxygen, but if so I couldn’t figure it out. So, back to the drawing board.
Over lunch on the second day, I picked the brain of my colleague Doug Emery. Doug and I worked together on the Walters BookReaders (which are elsewhere on this site), and I trust him to give good advice. We didn’t have much time, but he suggested using a curl request through the terminal on my Mac – maybe I could do something like that? I had seen curl mentioned on the API documentation as well, but I hadn’t heard of it and certainly hadn’t used it before. But I went back to my office and did some research.
Basically, curl is a command-line tool for grabbing the content of whatever is at the other end of a URL. You give it a URL, and it sends back whatever is on the other end. So, if you send out the URL for an NYPL MODS record, it will send the MODS record back. There’s an example on the NYPL API documentation page which incorporates the authentication token. Score!
curl “http://api.repo.nypl.org/api/v1/items?identifier_type=local_bnumber&identifier_val=b11722689” -H ‘Authorization: Token token=”abcdefghijklmn”‘ where ‘abcdefghijklmn’ is the authentication token you receive when you sign up (link coming soon).
Next, I needed to figure out how to send between 1500 and 2000 URLs through my terminal, without having to do them one by one. Through a bit of Google searching I discovered that it’s possible to replace the URL in the command with a pointer to a text file containing a list of URLs, in the format url = [url]. So I wrote a short XSLT that I used to process over all four of the result documents, pulling out the UUIDs, constructing URLs that pointed to the corresponding MODS records, and putting them in the correct format. Then I put pointers to those documents in my curl command:
curl -K “nypl_medieval_4_forCurl.txt” -H ‘Authorization: Token token=”[my_token]”‘> test.xml
Voila – four documents chock full of MODS goodness. And I was able to do it with a Mac terminal and just a little bit of XSLT.