Channel: inlustre monumentum est » data

How to retrieve ancient text data from Perseus

In my last post I described a problem with the Perseus URL schema: it is not entirely predictable, so URLs cannot be computed from one body of text to the next (e.g. from Livy to Caesar). By "schema" I mean the way the URLs are formed, what constitutes a ‘body of text’, what you might expect a request to return, and how all of that varies from one textual work to another.

Update: Schema will now include a ‘urn’ attribute

Warning: this is a long and somewhat technical post about using the Perseus CTS API to fetch classical texts as XML data

This stuff is important for software developers and “digital classicists” (that is, classicists who work with computer-information systems for analysing information about the classical world).

My queries on the Digital Classics mailing list drew some helpful hints. The first is that the Perseus XML interface I was using (the one behind the helpful “XML” button at the bottom of each passage in the HTML version you typically use in your web browser) is probably on its last legs.

CTS Overview

The more up-to-date (but still in beta) version is Perseus CTS, where “CTS” stands for Canonical Text Services. CTS is built on work done by the Homer Multitext Project.

CTS appears to have three main functional components:

  • A catalogue service (actually called “getCapabilities”)
  • A reference validation and exploration service
  • A service that retrieves text

Some commentary on its limitations

What it is missing is a search service. The catalogue is huge: it lists every available Greek and Roman text in the Perseus database, with details of all editions and translations of each text. It’s available at http://www.perseus.tufts.edu/hopper/CTS?request=GetCapabilities. I’m deliberately not linking that URL, because you shouldn’t click on it just yet: it’s 2.1 MB of XML, and your browser may not especially like it. Mine only manages to load it properly half the time.

When you do manage to download it and save it on your local disk (highly recommended), you’ll see it’s a pretty comprehensive catalogue of the data. But it is unordered, it has no links to the texts in either the reference validation or text retrieval services, and it has no obvious field that gives you the unique identifier you need.

What the references are constructed from

The reference validation service assumes you already know the reference you want to validate (and whose sub-components you want to discover), so you need some other way to find that first-level reference. Perseus uses the Thesaurus Linguae Graecae (TLG) referencing system for Greek texts and the Packard Humanities Institute (PHI) Latin Texts system for Latin texts. Both organise their corpora principally around authors, assigning each an index number. Thus Homer is ‘tlg0012’ and Livy is ‘phi0914’.

The references are formatted into a type of reference called a URN.

How to create the references

Now I’m going to tell you how to construct a functional reference ID for the CTS system.

First, load the catalogue URL into your browser. I’m not going to link it, but cut and paste this one: http://www.perseus.tufts.edu/hopper/CTS?request=GetCapabilities. If you know how to use Wget or Curl, use that instead.

Save the file to a convenient location on your disk. I called mine “CTS.xml”.
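If you would rather script the download than use a browser, here is a minimal Python sketch. It only assumes the endpoint and the “CTS.xml” filename given above; the function names `cts_url` and `download_catalogue` are my own, not part of any Perseus tooling, and the timeout is an arbitrary guess for a 2 MB file.

```python
import shutil
import urllib.request
from urllib.parse import urlencode

# Endpoint as given in this post.
CTS_ENDPOINT = "http://www.perseus.tufts.edu/hopper/CTS"


def cts_url(request, urn=None):
    """Build a CTS request URL. Colons are left unescaped so URNs stay readable."""
    params = {"request": request}
    if urn is not None:
        params["urn"] = urn
    return CTS_ENDPOINT + "?" + urlencode(params, safe=":")


def download_catalogue(path="CTS.xml", timeout=120):
    """Stream the GetCapabilities response to disk instead of holding it in memory."""
    url = cts_url("GetCapabilities")
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(path, "wb") as out:
        shutil.copyfileobj(resp, out)
    return path
```

Calling `download_catalogue()` needs network access and writes the file to the current directory.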

Open the file in a text editor. Notepad won’t cut it. Word most certainly will not (it’s not even a text editor!). On the Mac, I recommend BBEdit. [Update: it's been pointed out on the mailing list that the Oxygen XML editor is an ideal tool. I use it at work and have it on my Mac at home. An academic licence is $99, a full one nearly $500. Unless you do extensive work in XML I would not recommend buying it. On Windows, Internet Explorer is probably the default program for an XML file. It, or Safari on the Mac, will suffice to read the document. Google's Chrome also works pretty well. Browsers will also "pretty print" the XML to make it easier to view.]

Use your editor’s search capability to find the author you want.

The ‘textgroup’ (normally the author) identifies the first level

You’ll find that the author’s work is contained in an XML element called “textgroup”. Here’s the text group for Livy, along with the groupname element identifying it:

<textgroup projid="latinLit:phi0914">
  <groupname xml:lang="en">Titus Livius (Livy)</groupname>
  <!-- ... (thousands of lines omitted) -->
</textgroup>

Pay careful attention to the ‘projid’ attribute of the textgroup. This helps form the root of the URN used to identify the text in Perseus. The URN always starts with ‘urn:cts:’. Add the projid to that, like this:

urn:cts:latinLit:phi0914
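The author lookup can also be scripted. The sketch below is one possible approach, not anything Perseus provides: it scans a parsed catalogue for ‘textgroup’ elements whose ‘groupname’ contains a search string and prefixes the ‘projid’ with ‘urn:cts:’. I don’t know what namespaces the real catalogue declares, so the code compares local tag names only. The `<catalogue>` wrapper in the sample is made up.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any '{namespace}' prefix that ElementTree puts on a tag name."""
    return tag.rsplit("}", 1)[-1]


def find_textgroups(root, name_fragment):
    """Yield (root URN, group name) for textgroups whose groupname contains the fragment."""
    needle = name_fragment.lower()
    for el in root.iter():
        if local(el.tag) != "textgroup":
            continue
        for child in el:
            if local(child.tag) == "groupname" and needle in (child.text or "").lower():
                yield "urn:cts:" + el.get("projid"), child.text.strip()


# Tiny stand-in for the real catalogue; the <catalogue> wrapper is hypothetical.
SAMPLE = """<catalogue>
  <textgroup projid="latinLit:phi0914">
    <groupname xml:lang="en">Titus Livius (Livy)</groupname>
  </textgroup>
</catalogue>"""

matches = list(find_textgroups(ET.fromstring(SAMPLE), "livy"))
print(matches)
```

For the saved file, replace `ET.fromstring(SAMPLE)` with `ET.parse("CTS.xml").getroot()`.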

Check it in the reference validation service

That URN covers every text, edition and translation of Livy in the Perseus database. Here’s a link to the reference validation service: http://www.perseus.tufts.edu/hopper/CTS?request=GetValidReff&urn=urn:cts:latinLit:phi0914. If you open that link you’ll see, in XML, a list of the available URNs for every edition and translation of Livy in the database. Unfortunately, there is no descriptive information saying what each edition or translation is!

So we still need the catalogue file. Go back to it.

The ‘work’ identifies the next level of reference

Search for a book. In my case, let’s look for “Book 1” of Livy. You’ll see the catalogue file is unordered. In the version I looked at, Livy’s books started at Book 11 (“What? One of the missing books is miraculously in the Perseus database?” I hear you say. Unfortunately, it’s just the periocha of Book 11). The unordered nature of the catalogue makes it especially annoying: you have to search, and you can’t browse.

Anyway the entry for Book 1 looks something like this:

<textgroup projid="latinLit:phi0914">
  <groupname xml:lang="en">Titus Livius (Livy)</groupname>
  <!-- ... (thousands of lines omitted) -->
  <work projid="latinLit:phi0011" xml:lang="lat">
    <title xmlns="http://purl.org/dc/elements/1.1/" xml:lang="en">
    The History of Rome, Book 1</title>
    <!-- ... (thousands of lines omitted) -->
  </work>
</textgroup>

See how the book is contained in an XML element called “work”? Note the “projid” attribute of the work. In this case we don’t need the “latinLit:” part; the interesting part of the id is “phi0011”, the ID for Book 1 of Livy. We add it to the URN we’ve been constructing, separated by a dot, as follows:

urn:cts:latinLit:phi0914.phi0011

The ‘edition’ and/or ‘translation’ identifies a specific version of the work

While that’s supposed to be a valid reference to Livy’s Book 1, Perseus contains at least two Latin editions of the text and three English translations. These are listed inside the “work” element in either “edition” or “translation” elements, like so (for brevity I have omitted some lines that give data about the citation system of each edition):

<work projid="latinLit:phi0011" xml:lang="lat">
  <title xmlns="http://purl.org/dc/elements/1.1/" xml:lang="en">
   The History of Rome, Book 1</title>
  <edition projid="latinLit:perseus-lat1">
    <label xml:lang="en">The History of Rome, Book 1</label>
    <description xmlns="" xml:lang="en">Titi Livi ab urbe condita libri 
     editionem priman curavit Guilelmus Weissenborn editio altera auam
     curavit Mauritius Mueller Pars I. Libri I-X. Editio Stereotypica.
     Titus Livius. W. Weissenborn. H. J. M&amp;#252;ller. Leipzig. 
     Teubner. 1898. 1.
    </description>
    <!-- some lines omitted -->         
  </edition>
  <translation projid="latinLit:perseus-eng1">
    <label xml:lang="en">The History of Rome, Book 1</label>
    <description xmlns="" xml:lang="en">Livy. Books I and II With An
     English Translation. Cambridge. Cambridge, Mass., Harvard 
     University Press; London, William Heinemann, Ltd. 1919.
    </description>
    <!-- some lines omitted -->         
  </translation>
  <edition projid="latinLit:perseus-lat2">
    <label xml:lang="en">The History of Rome, Book 1</label>
    <description xmlns="" xml:lang="en">Livy. Books I and II With An
     English Translation. Cambridge. Cambridge, Mass., Harvard 
     University Press; London, William Heinemann, Ltd. 1919.
    </description>
    <!-- some lines omitted -->         
  </edition>
  <edition projid="latinLit:perseus-lat3">
    <label xml:lang="en">The History of Rome, Book 1</label>
    <description xmlns="" xml:lang="en">Livy. Ab urbe condita. Robert
     Seymour Conway. Charles Flamstead Walters. Oxford. Oxford 
     University Press. 1914. 1.</description>
    <!-- some lines omitted -->         
    <memberof collection="Perseus:collection:Greco-Roman"></memberof>
  </edition>
  <translation projid="latinLit:perseus-eng2">
    <label xml:lang="en">The History of Rome, Book 1</label>
    <description xmlns="" xml:lang="en">Livy. History of Rome by Titus
     Livius, the first eight Books. literally translated, with notes 
     and illustrations, by. D. Spillan. York Street, Covent Garden,
     London. Henry G. Bohn. John Child and son, printers. 1857. 1.
    </description>
    <!-- some lines omitted -->         
  </translation>
  <translation projid="latinLit:perseus-eng3">
    <label xml:lang="en">The History of Rome, Book 1</label>
    <description xmlns="" xml:lang="en">Perseus:bib:oclc,2311635, Livy.
     History of Rome. English. Translation by. Rev. Canon Roberts. New
     York, New York. E. P. Dutton and Co. 1912. 1. Livy. History of 
     Rome. English Translation. Rev. Canon Roberts. New York, New York.
     E.P. Dutton and Co. 1912. 2.</description>
    <!-- some lines omitted -->         
  </translation>
</work>
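Reading a listing like that by eye gets tedious, so here is a rough Python sketch that pulls the ‘projid’ and description out of each ‘edition’ and ‘translation’ child of a ‘work’ element. It matches on local tag names because I have not checked which namespaces the real catalogue uses. The sample is a heavily abridged version of the listing above, and `list_versions` is a name I made up.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any '{namespace}' prefix that ElementTree puts on a tag name."""
    return tag.rsplit("}", 1)[-1]


def list_versions(work):
    """Return (kind, projid, description) for each edition/translation child of a work."""
    versions = []
    for child in work:
        kind = local(child.tag)
        if kind not in ("edition", "translation"):
            continue
        desc = ""
        for sub in child:
            if local(sub.tag) == "description":
                # Collapse the line breaks and indentation inside the description.
                desc = " ".join((sub.text or "").split())
        versions.append((kind, child.get("projid"), desc))
    return versions


# Abridged sample: the descriptions are shortened, not the catalogue's full text.
SAMPLE_WORK = """<work projid="latinLit:phi0011" xml:lang="lat">
  <edition projid="latinLit:perseus-lat1">
    <description xml:lang="en">W. Weissenborn. Leipzig.
      Teubner. 1898.</description>
  </edition>
  <translation projid="latinLit:perseus-eng3">
    <description xml:lang="en">Rev. Canon Roberts. New York.
      E. P. Dutton and Co. 1912.</description>
  </translation>
</work>"""

versions = list_versions(ET.fromstring(SAMPLE_WORK))
for kind, projid, desc in versions:
    print(kind, projid, desc, sep=" | ")
```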

Now, assuming we’re after the Teubner edition of the text (the first one), we take that edition’s ‘projid’ attribute as before, strip the ‘latinLit:’ prefix from it, and add it to the URN we’ve been building up. We get:

urn:cts:latinLit:phi0914.phi0011.perseus-lat1
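The assembly steps so far can be written down as a small Python function. This is only a sketch of the procedure as I’ve described it for Livy; the names `strip_namespace` and `build_urn` are mine, and other authors may need different handling.

```python
def strip_namespace(projid):
    """'latinLit:phi0011' -> 'phi0011'; an id without a prefix is returned unchanged."""
    return projid.split(":", 1)[-1]


def build_urn(textgroup_projid, work_projid=None, version_projid=None):
    """Join catalogue projid values into a CTS URN.

    The textgroup projid keeps its namespace (e.g. 'latinLit:phi0914');
    the work and edition/translation projids have theirs stripped.
    """
    parts = [textgroup_projid]
    if work_projid:
        parts.append(strip_namespace(work_projid))
        if version_projid:
            parts.append(strip_namespace(version_projid))
    return "urn:cts:" + ".".join(parts)


print(build_urn("latinLit:phi0914", "latinLit:phi0011", "latinLit:perseus-lat1"))
```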

This is the complete reference to the Weissenborn & Mueller edition of Livy’s Book 1 published by Teubner.

Check it in the reference service

We can hit up the reference validation service with that URN as follows: http://www.perseus.tufts.edu/hopper/CTS?request=GetValidReff&urn=urn:cts:latinLit:phi0914.phi0011.perseus-lat1 – you will see a complete collection of URNs for the distinct parts of Book 1 in the Teubner edition of the text.

URNs for specific passages

This URN is all of the preface that’s found at the start of Book 1:

urn:cts:latinLit:phi0914.phi0011.perseus-lat1:pr

This URN is all of Chapter 1 of Book 1 (not including the preface):

urn:cts:latinLit:phi0914.phi0011.perseus-lat1:1

You can also get parts of chapters. Here is 1.4.2 (Book 1, chapter 4, section 2):

urn:cts:latinLit:phi0914.phi0011.perseus-lat1:4.2
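Appending the passage reference is mechanical, so a helper like the following can generate these URNs. It is a sketch based only on the three examples above; `passage_urn` is a name I made up, and it does not check that the passage actually exists (that is what the reference validation service is for).

```python
def passage_urn(version_urn, *ref):
    """Append a passage reference such as ('pr',) or (4, 2) to an edition-level URN."""
    if not ref:
        return version_urn
    return version_urn + ":" + ".".join(str(part) for part in ref)


TEUBNER_BOOK_1 = "urn:cts:latinLit:phi0914.phi0011.perseus-lat1"

print(passage_urn(TEUBNER_BOOK_1, "pr"))   # the preface
print(passage_urn(TEUBNER_BOOK_1, 1))      # chapter 1
print(passage_urn(TEUBNER_BOOK_1, 4, 2))   # chapter 4, section 2
```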

Fetch the text chunk you want

These URNs are passed as the ‘urn’ parameter of the Perseus text retrieval service, like this: http://www.perseus.tufts.edu/hopper/CTS?request=GetPassage&urn=urn:cts:latinLit:phi0914.phi0011.perseus-lat1:pr (that’s the preface).
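A rough Python sketch of the same request follows. I have not documented the structure of the GetPassage reply here, so `plain_text` simply flattens every text node it finds, which will include any metadata the reply carries. The function names and the tiny sample reply are my own inventions for illustration.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Endpoint as given in this post.
CTS_ENDPOINT = "http://www.perseus.tufts.edu/hopper/CTS"


def passage_url(urn):
    """URL for a GetPassage request; colons in the URN are left unescaped."""
    return CTS_ENDPOINT + "?" + urlencode({"request": "GetPassage", "urn": urn}, safe=":")


def fetch_passage(urn, timeout=60):
    """Return the raw XML bytes of the reply (needs network access)."""
    with urllib.request.urlopen(passage_url(urn), timeout=timeout) as resp:
        return resp.read()


def plain_text(xml_bytes):
    """Flatten all text nodes in a reply into one whitespace-normalised string."""
    root = ET.fromstring(xml_bytes)
    return " ".join("".join(root.itertext()).split())


# Hypothetical reply shape, only to show what plain_text does.
SAMPLE_REPLY = b"<reply><p>Facturusne <hi>operae</hi>\n  pretium sim</p></reply>"
print(plain_text(SAMPLE_REPLY))
```

With network access, `plain_text(fetch_passage("urn:cts:latinLit:phi0914.phi0011.perseus-lat1:pr"))` would give the flattened preface.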

Anatomy of the URN format used by Perseus

    urn:cts:latinLit:phi0914.phi0011.perseus-lat1:4.2
    {1}:{2}:   {3}  :   {4} . {5}   .  {6}       :{7}
  • {1} It’s a urn. This part is fixed.
  • {2} The urn is part of the ‘cts’ namespace. This part is fixed.
  • {3} The Latin Literature namespace. Would be ‘greekLit’ for Greek texts, and possibly other values.
  • {4} The textgroup’s identifier. It’s normally either the TLG or PHI author index value. In the catalogue it’s contained in the ‘projid’ attribute of the ‘textgroup’ element, stripped of the namespace.
  • {5} The work’s identifier. This may map to an author’s title or to an individual book in a larger collection of texts. This also apparently comes from either TLG or PHI indices (I’ve not verified this fact for sure). In the catalogue it’s contained in the ‘projid’ attribute of the ‘work’ element, stripped of the namespace.
  • {6} The edition of the work. This may also be a translation. This is a Perseus-specific value. In the catalogue it’s contained in the ‘projid’ attribute of the ‘edition’ or ‘translation’ element, stripped of the namespace.
  • {7} The text reference. This will be specific to the work and edition you are referencing. You can get a simple, unadorned list of what’s available by querying the reference validation service with the URN up to this point as the argument.

Note how the textgroup, work and edition are separated by dots, while every other component is delimited by a colon.
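The anatomy above translates directly into a small parser: split on colons first, then split the fourth field on dots. This Python sketch covers only the URN shapes shown in this post. The `CtsUrn` type and `parse_urn` function are my own names, and the full CTS URN scheme may allow forms this does not handle.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str            # {3}, e.g. 'latinLit'
    textgroup: str            # {4}, e.g. 'phi0914'
    work: Optional[str]       # {5}, e.g. 'phi0011'
    version: Optional[str]    # {6}, e.g. 'perseus-lat1'
    passage: Optional[str]    # {7}, e.g. '4.2'


def parse_urn(urn):
    """Split a CTS URN of the shape described above into its components."""
    fields = urn.split(":")
    if len(fields) not in (4, 5) or fields[:2] != ["urn", "cts"]:
        raise ValueError(f"not a CTS URN: {urn!r}")
    passage = fields[4] if len(fields) == 5 else None
    ids = fields[3].split(".")
    if not 1 <= len(ids) <= 3:
        raise ValueError(f"unexpected work component: {fields[3]!r}")
    ids += [None] * (3 - len(ids))  # pad missing work/version with None
    return CtsUrn(fields[2], ids[0], ids[1], ids[2], passage)


print(parse_urn("urn:cts:latinLit:phi0914.phi0011.perseus-lat1:4.2"))
```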

Commentary

There are still problems:

  • You cannot get all of book 1 in a single hit (at least for Livy).
  • If you want book 2, you have to repeat this process (it’s phi0012)
    • So, Chapter 1 of book 2 of the Teubner text looks like this URN: urn:cts:latinLit:phi0914.phi0012.perseus-lat1:1
    • Rinse and repeat for other books/editions
  • Entirely different authors and works may have different results or slightly different algorithms for building URNs.
  • The catalogue elements ‘textgroup’, ‘work’, ‘edition’ and ‘translation’ should each have a child element, ‘urn’, that builds this URN for you, so that such explanations as I’ve attempted are unnecessary.
  • The reference checking service needs to include a modicum of descriptive information about the URNs that are returned.
  • There needs to be a search service that stitches all this together.
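Until something like that exists, a local stopgap is possible. The Python sketch below walks a saved catalogue, computes the URN for every edition and translation using the procedure from this post, and lets you search the resulting rows by author or title. It is an illustration under my own assumptions: tags are matched by local name because I have not checked the catalogue’s namespaces, the `<catalogue>` wrapper in the sample is made up, and other authors may not follow the Livy pattern.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any '{namespace}' prefix that ElementTree puts on a tag name."""
    return tag.rsplit("}", 1)[-1]


def child_text(el, name):
    """Whitespace-normalised text of the first child with the given local name."""
    for child in el:
        if local(child.tag) == name:
            return " ".join("".join(child.itertext()).split())
    return ""


def index_catalogue(root):
    """Return (urn, kind, author, title) rows for every edition and translation."""
    rows = []
    for group in root.iter():
        if local(group.tag) != "textgroup":
            continue
        author = child_text(group, "groupname")
        for work in group:
            if local(work.tag) != "work":
                continue
            title = child_text(work, "title")
            work_id = work.get("projid").split(":", 1)[-1]
            for version in work:
                kind = local(version.tag)
                if kind in ("edition", "translation"):
                    version_id = version.get("projid").split(":", 1)[-1]
                    urn = f"urn:cts:{group.get('projid')}.{work_id}.{version_id}"
                    rows.append((urn, kind, author, title))
    return rows


def search(rows, text):
    """Case-insensitive substring match against author and title."""
    needle = text.lower()
    return [row for row in rows if needle in f"{row[2]} {row[3]}".lower()]


# Abridged stand-in for CTS.xml; the <catalogue> wrapper is hypothetical.
SAMPLE = """<catalogue>
  <textgroup projid="latinLit:phi0914">
    <groupname xml:lang="en">Titus Livius (Livy)</groupname>
    <work projid="latinLit:phi0011" xml:lang="lat">
      <title xmlns="http://purl.org/dc/elements/1.1/" xml:lang="en">
      The History of Rome, Book 1</title>
      <edition projid="latinLit:perseus-lat1"/>
      <translation projid="latinLit:perseus-eng1"/>
    </work>
  </textgroup>
</catalogue>"""

rows = index_catalogue(ET.fromstring(SAMPLE))
for row in search(rows, "livy"):
    print(row)
```

Since the index is just a list of tuples, it could also be dumped to CSV once and browsed in a spreadsheet, which gets around the unordered catalogue.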

I hope someone can find this of use.

