forum: Interview: Paul Rayson WMatrix, text mining
Submitted by Torsten Reimer on Mon, 01/10/2007 - 15:14.
Dr. Paul Rayson is director of the UCREL research centre and a teaching fellow in the Computing Department at Lancaster University in Lancaster, UK. He has research interests in the area of corpus linguistics (and related subjects within natural language processing and language engineering) and the application of corpus-based methods in word frequency dictionaries, semantic analysis, information extraction, systems engineering and decision management. Paul is the developer of the Wmatrix corpus analysis and comparison tool which this forum was set up to support.

disciplines, text mining, tools
Hi Paul,
You took an active part in the Methods Network workshop Text mining for historians in July 2007 and organized an earlier one on Historical text mining.
Is this focus on History just a coincidence, or are historians especially interested in text mining? And, to make the question more general, are there many differences regarding the application of text mining techniques in different humanities disciplines? If so, I would be interested to hear how this influences the development of tools for text mining.
--
Torsten Reimer
http://www.methodsnetwork.ac.uk
text mining workshops
Hi Torsten,
The first workshop at Lancaster (Historical text mining) grew out of an overlap of interests that Dawn Archer and I had: historical linguistics, computational and corpus linguistics. Part of the reason that we organised the first workshop was to form a network of scholars working at the intersection of the same areas. We wanted to extend the group in both directions, i.e. towards text mining researchers and towards historians. We were more successful in the former direction than the latter, with the exception of Stephen Pumfrey at Lancaster, who presented an account of his early explorations into corpus-based methods.
For the second workshop (Text mining for historians), the focus was squarely on the historians and on seeing what research questions they had that could be answered by exploiting text mining and corpus (linguistic) methods.
More generally, I think we need further networking events like those to explore the level of adoption or possibilities for use of text mining techniques in other disciplines.
Paul.
re: text mining workshops
This week I had two meetings with members of the e-Uptake project, which exists to understand potential barriers that hinder wider adoption of e-infrastructures in research (and make recommendations about how to address these issues). Returning to our conversation about text mining and corpus methods, this made me wonder what the 'barriers' would be in this field of research. As text is still the most important source for historical research, you would expect historians to focus on these methods. Now, there is definitely interest, but not as much as you might expect.
Unless you disagree with the last statement, would you say this is due to historians lacking the skills or the discipline specific tools? Or is it a more basic problem, i.e. that historians are maybe not really aware of what text mining could do for them? If raising the awareness is a major issue in this (not only in History), are there any groups or projects that could take this agenda forward? Where would you point researchers with an interest in this field?
barriers
I don't really agree with your statement there, because historians would use a different definition of "text" to that used in corpus linguistics. This came up at the text mining workshops and again in a recent meeting that I had at Lancaster with colleagues who are using Wmatrix to support stylistics research. The view of a text provided through a corpus tool tends to obscure the structure or flow of a text. It permits the user to focus in on specific sentences or parts of sentences without necessarily providing the wider context of the full text, or even the context outside the text.
Raising awareness of these tools and techniques is one issue and it looks like the e-Uptake project will address this through discipline-based case studies. However, two points that need to be addressed from my perspective on the computer science side are awareness of the requirements of the historian and the ease of use of the tools that are provided. We talked about Google-style ease of use in the e-science scoping study of Linguistics last year in London: http://www.ahds.ac.uk/e-science/e-science-scoping-study.htm - these are the key barriers in my opinion.
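To make the point about context concrete, here is a minimal keyword-in-context (KWIC) sketch of the kind of view a corpus tool typically offers. It is only an illustration, not how Wmatrix or Corpus Workbench implement concordancing; the function name `kwic`, the `width` parameter and the sample sentences are all assumptions made up for this example.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per hit, showing `width` characters either side of the keyword."""
    lines = []
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        # Only a fixed window around the hit is kept; the rest of the text is dropped.
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

# Invented sample text, used only to show the shape of the output.
sample = ("The wooden walls of England were praised in Parliament. "
          "Pamphleteers wrote that the walls must be kept in repair.")

for line in kwic(sample, "walls", width=20):
    print(line)
```

Each output line shows the search word with a few words either side, which is exactly why the structure and flow of the full text are no longer visible to the reader.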
history rant
You are right that corpus tools make you look at texts in a different way. Being a historian myself I do think that this can actually be very useful for the (usually) more qualitative approach my discipline takes.
For my Ph.D. research, the access to corpus tools and a wide sample of early modern English prints would have allowed me to see how insights gained through a qualitative analysis of one genre of texts could be weighted against a more representative sample of texts from a specific period (I was, among other things, concerned with how topoi such as the Royal Navy as the "Wooden Walls" were used in political discourse - it would have been interesting to see how popular this topos was beyond politicians, naval officials etc. and their pamphleteers).
A large corpus of, for instance, early modern texts could help historians to test theories developed through qualitative analysis against a wider range of sources and help the discipline to avoid the "tunnel vision" that a qualitative approach can sometimes have. Here I would like to see more activity among my colleagues, which is what I meant by my comment.
Corpus WorkBench
One of the things that we are planning to do internally at Lancaster is put our full text copy of the EEBO dataset inside a corpus tool such as Corpus Workbench.
CWB was originally developed at [[http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/|Stuttgart]] but has recently been made [[http://cwb.sourceforge.net/|open source]].
This would allow local historians to see the full benefit of corpus methods on texts that they already use online through the standard EEBO interface.
re: Corpus WorkBench
That would be an extremely useful resource and incredibly helpful for researchers!
EEBO
Unfortunately, due to the licence restrictions on EEBO, we can only make it available internally. However, all UK academic institutions can obtain a copy of the full text EEBO. This was part of the JISC licence agreement; see the [[http://www.jisc-collections.ac.uk/eebo|JISC website for more information]].
re: EEBO
Having worked intensively with EEBO, I am (unfortunately) aware of the licensing problems. Germany has recently acquired a national license for EEBO, but even that requires registration and access control. Still, it is good to have the resource available and I am sure that researchers at Lancaster will be able to make good use of your full text copy!
tools development
You have mentioned ease of use as something to focus on. Interestingly, a historian with whom I discussed a similar issue earlier this year said more or less exactly the same:
I remember the great success of a simple Visual-Basic-Tool connecting MS-Word with MS-Access to build a glossary of phrases in the critical edition of the charters of Frederic II. (1194-1250): The principal editor Walter Koch – he stands somewhere between "computers are dangerous" and the "everyday historian" - loved the tool (with its football-button) because it made index entries and linked them to phrases he formed out of the texts of his charters. There wasn't any linguistic process, maybe some simple string functions, just a relational database as the most complex programming part in it. My conclusion from this experience was: A good computer tool for a historian has to be easy to use, reliable, looking atomic ("It's just one operation I think I can control, not a collection of operations that make the result blurry") and have a fancy user interface. The main task to disseminate computing methods among historians is to build tools for their needs.
The question is, how does one build such a tool? If it is specifically designed for the needs of a certain discipline then you may have a relatively clear set of requirements. Text mining and corpus linguistic tools, however, can be used in many different disciplines. How did you approach this task when you started work on Wmatrix? Was it mainly built for your own needs, did you focus on a specific discipline or did you aim at developing a tool with good "general" ergonomics?