Yesterday I set up my hosting and installed some tools that I’m going to use to start my project. I also explained a bit about the project, just my initial thoughts.
Today I did a bit of looking at code, and thinking in more detail about what I want to get out of the first stage of the project: creating Internet Archive BookReader objects for every manuscript digitized by the Walters Art Museum (http://www.thedigitalwalters.org/).
I’ve selected the IA BookReader (http://openlibrary.org/dev/docs/bookreader) for this project because it’s a very simple bit of code and it presents an attractive and functional finished product. I’m using the Walters Art Museum data because it’s been released under a Creative Commons license and because the Curator of Manuscripts, Will Noel, has been vocal about inviting people to take and reuse the data – so that is what I’m going to do.
Yesterday I selected one of the Walters manuscripts and created a BookReader for it – you can access it by linking the image on this page. It’s certainly not perfect, notably there is no way to navigate using page numbers, and I think foliation would be more problematic than pagination to figure out. Anyway, that’s something that I may tackle later on. But for now I’m pretty happy with it.
What do I need to do to get from here to there?
- width of the page image in pixels
- height of the page image in pixels
- ms siglum – in a couple of different places
- number of page images
- manuscript name
HERE. What do I have here? I have a TEI manuscript description! It is a beautiful thing. The TEI ms desc gives me much of the information I need over there.
- the siglum, bless its heart, isn’t noted on its own anywhere in the XML, although it is part of all the graphic@url that are in the facsimile section. Happily each of the ms desc files is named according to the siglum, so I should be able to extract it from the file name.
- the name of the manuscript can be extracted from TEI/teiHeader/fileDesc/titleStmt/title@type=’common’. Many of the manuscripts have additional titles, but for this exercise I’ll just be taking the common one.
- the number of pages is a bit more difficult. I think the easiest way to do this will be to extract the number from the last TEI/facsimile/surface/graphic/@url in the document, where the number is part of that URL value.
- width and height of the page image in pixels – I have no idea. The BookReader really depends on these numbers being there, if you leave them out the reader will load the image in actual size, which looks awful (and breaks the tile view), and if you don’t get the width to height ratio correct it will skew the images. So they really need to be exact. Of course the same measurement is applied to every page in the manuscript, and that only works as long as you don’t have fold-outs or anything like that (more common in printed books, but not completely unheard of in medieval manuscripts) – so for some documents the BookReader just won’t be usable. ANYWAY, I think these numbers will need to be extracted from files. I haven’t done that before, but I know it can be done. I just need to figure out how to do it.
So, this is my plan. Tomorrow and over the weekend I will start playing around with code to see what I can get easily and what will need more investigation. For now, I sleep.