In ePol we build and maintain the Leipzig Corpus Miner (LCM) [1], an infrastructure that supports qualitative and quantitative Content Analysis (CA). This integrated application of different technologies was built by the NLP Group at the University of Leipzig. The infrastructure aims to integrate “close reading” procedures on individual documents with “distant reading” procedures, e.g. the analysis of lexical characteristics of large document collections. To this end, information retrieval systems, lexicometric statistics and machine learning procedures are combined in a coherent framework that enables qualitative data analysts to use state-of-the-art Natural Language Processing (NLP) techniques on very large document collections. The framework is applicable in fields ranging from the social sciences to media studies and market research.

The LCM is an infrastructure rather than a complete software package. It combines different technologies into a qualitative data analysis environment that is accessed through an interface aimed at domain experts unfamiliar with NLP. Analysts can thus work on their data with a methodological rather than a technical understanding of the algorithms. The technologies behind the user interface need to support analysts in tasks such as data storage, retrieval, processing and presentation. We integrate technologies such as UIMA, SOLR, MongoDB and Glassfish to create a distributed multi-tier environment capable of processing and storing the 3.5 million text documents of our research corpus.
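As a rough illustration of how “distant reading” and “close reading” can complement each other, the following minimal Python sketch counts term frequencies across a small collection and then retrieves keyword-in-context snippets for one term. It is not LCM code: the function names, the three-document toy corpus and the simple regular-expression tokenizer are assumptions made purely for illustration, and a real deployment would delegate retrieval and storage to systems such as those named above.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(corpus):
    """Distant reading: count each term across the whole collection."""
    counts = Counter()
    for text in corpus.values():
        counts.update(tokenize(text))
    return counts


def keyword_in_context(corpus, term, window=3):
    """Close reading: return (doc_id, snippet) pairs around each hit."""
    hits = []
    for doc_id, text in corpus.items():
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == term:
                left = max(0, i - window)
                snippet = " ".join(tokens[left:i + window + 1])
                hits.append((doc_id, snippet))
    return hits


# Hypothetical toy corpus, invented for this sketch.
corpus = {
    "doc1": "The market reform was debated in parliament.",
    "doc2": "Critics argue the reform weakens democratic control of the market.",
    "doc3": "Parliament postponed the vote.",
}

freqs = term_frequencies(corpus)
hits = keyword_in_context(corpus, "reform", window=2)

print(freqs.most_common(3))
for doc_id, snippet in hits:
    print(doc_id, "|", snippet)
```

The corpus-wide counts point the analyst to terms worth inspecting, and the snippets lead back to the individual documents in which those terms occur.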

We also apply such methods using R and its Text Mining capabilities [2]. This open source platform allows for rapid prototyping of ideas and for immediate discussion of results and methodology. In addition, we have used R to train and educate scholars in Text Mining methods for the social sciences [3].

