Creating a Digital Concordance for Early Chinese Historiography: 
a DHSI success story, 
with a note on tools and skills

(This entry was created from a poster I presented at this year’s DHSI. Unless you are interested in the digital concordance, go ahead and jump down to the note on tools and skills.)

My dissertation is a study of early Chinese historiography, focusing on the Shiji 史記 [Historian’s Records] by Sima Qian 司馬遷 (c. 145—c. 86 BCE) et al. and the Han shu 漢書 [History of the (Western) Han Dynasty] by Ban Gu 班固 (32—92 CE) et al. Together, these two massive historical texts have more than a million words in 230 ‘chapters’ [juan 卷] (in Classical Chinese, a very terse language). Thus these great histories would be the ancient equivalent of ‘big data’ if anything were, and they are fine candidates for DH methods.

While there are pre-packaged tools available for text analysis, they do not always play nicely with Classical Chinese. The Chinese script demands that tools comply with the Unicode standard, but not all do. At the same time, the digital versions of my sources that are available online seem simply to have been machine-read (OCR) and pasted into a website, without any attention to proper structural markup (e.g. as described in the standards of the Text Encoding Initiative). There is a standardization shortfall on both ends, and useful, pre-packaged tools for Chinese texts are as rare as a unicorn’s horn, to borrow from a Chinese proverb.

But why choose XSLT? … These problems were on my mind when I attended the XSLT workshop at the Digital Humanities Summer Institute (DHSI) last June, but my motivations for enrolling in that course were different. Having some experience with XML and some of its applications (TEI, HTML, and KML), and having already begun to offer XML workshops for historians, I thought I should develop my skills in that same area: to level-up in the same skill tree, so to speak. XSLT, a particularly powerful XML application well-suited for working with encoded literature, seemed a good choice. So I went to DHSI last year without a specific project, just the goal of expanding my knowledge of XML. One week later, I had a prototype of a keyword in context (KWIC) visualization tool for early Chinese texts. Over the past year, I was able to work out the bugs, and now I have a working version.

The Application

Overall, the process is fairly simple. It takes a lightly marked-up base text and produces a report that shows the concordance list.

The XML Input (A)

Assuming an electronic version of the text exists, the input file requires very little preparation. At a minimum, it should be well-formed XML, and the text should be divided into chapters, each with attributes to label the chapter’s number and title, both of which will be included in the output file. That’s it. Because the input is very simple, it is a trivial matter to run the transformation on any early Chinese text.
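
To make those input requirements concrete, here is a minimal sketch of what such a file might look like and how its well-formedness and chapter labels could be checked. It is written in Python rather than XSLT, purely for illustration, and the element and attribute names (text, chapter, n, title) are illustrative assumptions, not a required schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the element and attribute names are assumptions
# made for this illustration only.
SAMPLE = """<text>
  <chapter n="1" title="五帝本紀">黃帝者，少典之子，姓公孫，名曰軒轅。</chapter>
  <chapter n="2" title="夏本紀">夏禹，名曰文命。</chapter>
</text>"""


def load_chapters(xml_string):
    """Parse the XML and return a list of (number, title, text) tuples."""
    # fromstring() raises xml.etree.ElementTree.ParseError if the
    # input is not well-formed XML.
    root = ET.fromstring(xml_string)
    chapters = []
    for chapter in root.iter("chapter"):
        # itertext() collects the text even if inline markup is present.
        text = "".join(chapter.itertext()).strip()
        chapters.append((chapter.get("n"), chapter.get("title"), text))
    return chapters


chapters = load_chapters(SAMPLE)
for number, title, text in chapters:
    print(number, title, len(text))
```

Anything beyond the chapter divisions and their two labels is optional in this sketch, which is the sense in which the input is "lightly marked-up."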

The XSLT Transformation (B)

The XSLT file transforms the input text into the output concordance. The transformation can be run from Oxygen XML Editor or the command line (with Saxon); currently, I’m using Oxygen. It is here that the keywords are entered (two at a time).
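
For readers unfamiliar with the keyword-in-context idea, the core logic can be sketched in a few lines. This is a Python stand-in, not the XSLT stylesheet itself; the function name kwic, the width parameter, and the sample data are all hypothetical. Python 3 strings are indexed by Unicode code point, so each Chinese character in this sample counts as one unit of context.

```python
def kwic(chapters, keywords, width=5):
    """Return (chapter number, title, left, keyword, right) for every hit.

    `width` is the number of characters of context kept on each side.
    """
    hits = []
    for number, title, text in chapters:
        for keyword in keywords:
            start = text.find(keyword)
            while start != -1:
                end = start + len(keyword)
                left = text[max(0, start - width):start]
                right = text[end:end + width]
                hits.append((number, title, left, keyword, right))
                # Continue searching after the start of this match.
                start = text.find(keyword, start + 1)
    return hits


# Hypothetical sample data: (chapter number, chapter title, chapter text).
chapters = [
    ("1", "五帝本紀", "黃帝者，少典之子，姓公孫，名曰軒轅。"),
    ("2", "夏本紀", "夏禹，名曰文命。"),
]

# Two keywords at a time, as in the description above.
hits = kwic(chapters, ["名曰", "黃帝"], width=3)
for hit in hits:
    print(hit)
```

Each hit carries its chapter number and title along with it, so the report can say where in the text a keyword occurs as well as what surrounds it.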

The Output (C)

Once the transformation is run, Oxygen produces an HTML file that can be opened for perusal in a web browser. With some adjusting of settings in the browser, useful add-ons such as Chinese pop-up dictionaries (e.g. the Zhongwen Chinese Pop-up Dictionary for Chrome and Perapera Chinese for Firefox) can be used to aid the reader.
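
As a rough picture of what such a report involves, the following sketch turns a list of concordance hits into a simple HTML table. Again this is Python rather than XSLT, the function name render_html and the table layout are hypothetical, and the actual report may look quite different. The one detail worth copying is the explicit UTF-8 declaration, without which a browser may garble the Chinese text.

```python
from html import escape


def render_html(hits):
    """Return a complete HTML page with one table row per concordance hit."""
    rows = []
    for number, title, left, keyword, right in hits:
        # Escape every field so stray markup characters cannot break the page.
        cells = [escape(part) for part in (number, title, left, keyword, right)]
        rows.append(
            "<tr><td>{0}</td><td>{1}</td>"
            '<td style="text-align:right">{2}</td>'
            "<td><b>{3}</b></td><td>{4}</td></tr>".format(*cells)
        )
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        '<meta charset="utf-8">\n<title>Concordance</title>\n</head>\n<body>\n'
        "<table>\n" + "\n".join(rows) + "\n</table>\n</body>\n</html>\n"
    )


# Hypothetical hits: (chapter number, title, left context, keyword, right context).
hits = [
    ("1", "五帝本紀", "公孫，", "名曰", "軒轅。"),
    ("2", "夏本紀", "夏禹，", "名曰", "文命。"),
]
page = render_html(hits)

# To view the result in a browser, save it with an explicit encoding, e.g.:
# with open("concordance.html", "w", encoding="utf-8") as f:
#     f.write(page)
```

Because the keyword sits in its own column, the hits line up vertically, which is what makes a KWIC list easy to scan.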

Next Steps

My ultimate goal is to combine this project with a mapping tool and a word-frequency tool, the prototypes of which I created two summers ago, into a multifaceted research tool for early Chinese historical texts. The combination of these tools will mean I have to step beyond the capabilities of XSLT alone, and so I’ve been learning the query language XQuery (designed to work with XML) and the eXist-db open-source XML-based database platform. XML, XSLT, and XQuery together will give me everything I need to make a research tool that suits my purposes. Eventually, I hope to make something that is useful to others as well, and to put it up on the web for free use under a CC license.


A Note on Tools and Skills

Now, I must admit that this digital concordance is not particularly innovative or ground-breaking as a tool. It won’t usher in a new age of scholarship, it won’t revolutionize research, and it won’t make any money. But none of this matters, because it does what I need it to do.

This year we saw the publication of a special issue of the journal differences discussing the meaning and value of the digital humanities, including essays by DH scholars and critics alike. Responses, many and various, quickly proliferated. Among these, Matthew Lincoln, a PhD student in art history at the University of Maryland, penned an insightful blog post, “Tool Trouble,” which noted that an overemphasis on ‘tools’ tends to obstruct any discussion of methodology and theory. This is something that I have been pondering as well, both in the context of this project and more broadly, and I think there is something to be gleaned from my own experience as a workshop leader and event organizer, as an attendee of workshops and unconference sessions, and as a researcher using digital methods, on the topic of tools and skills.

My point is less a theoretical one than a pedagogical one—for those academics who wish to use DH in their own idiosyncratic research, it is better to learn skills than tools. In other words, teaching pre-packaged tools, no matter how good they might be, is less helpful than teaching skills, literacies, or competencies, such as languages (from XML to Python) and programming concepts (e.g. data structures). Why? Because these are extensible. That’s the X in XML, and it makes all the difference. This is not to say that there aren’t reasons to learn specific applications. When a researcher is brought into a project that uses a certain tool extensively, then of course it makes sense to learn it. Pre-fabricated tools can be extremely powerful for certain projects, but they are rarely extensible. Speaking purely from my own experience, for idiosyncratic research of the type that PhD students conduct for their dissertations, I think extensible skills are the better investment of time and effort, and these should be emphasized in DH pedagogy.

* Trout fly image adapted from one found on the website of Grays of Kilsyth.