One central objective of the LCM is to enable analysts to perform text mining tasks without explicit guidance from NLP experts. For that reason we implemented a middleware (UIMAWS), a web service to start, stop and manage UIMA processes for certain tasks. The LCM integrates several procedures for retrieving, annotating and mining textual data. Flexibility in combining these tools supports analysis interests ranging from quantitative corpus linguistics to qualitative reconstructivist methodologies.
Information retrieval: Assuming the availability of a large document collection, e.g. complete volumes of a daily newspaper over several decades, a common need is to identify documents of interest for certain research questions.
- Search results as a heatmap and a time series.
- Screenshot of the document search in LCM. The search capabilities are provided by a Solr cloud.
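To make the retrieval step more concrete, the following is a minimal sketch of how a keyword search with yearly hit counts (the basis for a time series plot) could be requested from a Solr index. It is not the LCM's actual implementation: the host, the collection name `newspaper`, and the field names `text` and `date` are assumptions made for illustration. Only standard Solr range-facet parameters are used.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical endpoint: host and collection name are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/newspaper/select"


def build_search_url(query, start_year, end_year, date_field="date"):
    """Build a Solr select URL that asks for hit counts per year.

    `date_field` is an assumed name for a date field in the index.
    """
    params = {
        "q": query,
        "rows": 0,                 # only the counts are needed, not the documents
        "wt": "json",
        "facet": "true",
        "facet.range": date_field,
        "facet.range.start": f"{start_year}-01-01T00:00:00Z",
        "facet.range.end": f"{end_year + 1}-01-01T00:00:00Z",
        "facet.range.gap": "+1YEAR",
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


# Assumed field name `text`; the phrase is just an example query.
url = build_search_url('text:"minimum wage"', 1990, 1999)
print(url)
```

The yearly counts returned by such a request could then be plotted directly as a time series or, with one request per search term, arranged as rows of a heatmap.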
Lexicometrics: The LCM implements the computation and visualization of basic corpus linguistic measures on stored collections. It supports frequency analysis, co-occurrence analysis and automatic extraction of key terms.
- Graph view of a co-occurrence analysis. Word semantics can be visualized. This gives access to complex concepts and patterns in a distant reading process.
- Screenshot of a diachronic frequency analysis. Multiple words can be compared in a time series plot.
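As an illustration of what a co-occurrence analysis computes, here is a small sketch that counts sentence-level co-occurrences and scores a word pair with the Dice coefficient. The choice of the Dice measure, the whitespace tokenization and the toy sentences are assumptions for this example; the significance measure actually used in the LCM is not stated here.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(sentences):
    """Count in how many sentences each word and each word pair occurs."""
    word_freq = Counter()
    pair_freq = Counter()
    for sentence in sentences:
        # Use word types, so a word repeated in a sentence counts once.
        types = sorted(set(sentence.lower().split()))
        word_freq.update(types)
        # Pairs are stored in alphabetical order, e.g. ("minimum", "wage").
        pair_freq.update(combinations(types, 2))
    return word_freq, pair_freq


def dice(word_a, word_b, word_freq, pair_freq):
    """Dice coefficient: 2 * n_ab / (n_a + n_b)."""
    a, b = sorted((word_a, word_b))
    return 2 * pair_freq[(a, b)] / (word_freq[a] + word_freq[b])


# Toy sentences, invented for illustration only.
sentences = [
    "the minimum wage was raised",
    "unions demand a higher minimum wage",
    "the wage gap persists",
]
word_freq, pair_freq = cooccurrences(sentences)
score = dice("wage", "minimum", word_freq, pair_freq)
print(round(score, 2))
```

Pairs with high scores would become the edges of a co-occurrence graph, with the score used as edge weight.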
Topic models: For the analysis of topical structures in large text collections, Topic Models have been shown to be useful in recent studies. Topic Models are statistical models which infer probability distributions over latent variables, assumed to represent topics, in text collections as well as in single documents.
- Screenshot of a Topic Model result. Different latent semantic topics can be identified within a collection of documents. It is possible to compare topics in a diachronic analysis of topic-related documents.
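The diachronic comparison of topics can be sketched as follows, assuming a topic model has already been fitted and yields one topic distribution per document. The function name `topic_share_by_year` and the input numbers are hypothetical and serve only to show the aggregation step, not the LCM's own code.

```python
from collections import defaultdict


def topic_share_by_year(docs):
    """Average the per-document topic distributions for each year.

    `docs` is an iterable of (year, topic_distribution) pairs, where each
    distribution is a list of probabilities, one per topic.
    """
    sums = {}
    counts = defaultdict(int)
    for year, theta in docs:
        if year not in sums:
            sums[year] = [0.0] * len(theta)
        for k, p in enumerate(theta):
            sums[year][k] += p
        counts[year] += 1
    return {
        year: [s / counts[year] for s in sums[year]]
        for year in sorted(sums)
    }


# Invented document-topic distributions for three documents and three topics.
docs = [
    (1990, [0.7, 0.2, 0.1]),
    (1990, [0.5, 0.4, 0.1]),
    (1991, [0.2, 0.7, 0.1]),
]
shares = topic_share_by_year(docs)
for year, share in shares.items():
    print(year, [round(p, 2) for p in share])
```

Plotting one line per topic over the years then shows how the prominence of each topic changes over time.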
Classification: Supervised learning from annotated text to assist the coding of documents or parts of documents promises to be one major innovation for Content Analysis applications. The LCM allows for manual annotation of complete documents or snippets of documents with category labels. The analyst may initially develop a hierarchical category system and/or refine it during the process of annotation. Annotated text parts are used as training examples for automatic classification processes which output category labels for unseen analysis units (e.g. sentences, paragraphs or documents).
- Detailed view of documents. Analysts can create and annotate with custom category systems.
- Classification results and manual verification of results in an active learning process. Analysts can refine the training set until the classification returns sufficiently accurate results.
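The active learning process described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the classifier used in the LCM: it uses a small multinomial Naive Bayes model with add-one smoothing and uncertainty sampling as the selection strategy, and the `oracle` dictionary stands in for the analyst who verifies or corrects a label. All example texts and labels are invented.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict_proba(self, text):
        total_docs = sum(self.class_counts.values())
        log_scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in text.lower().split():
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            log_scores[label] = score
        # Normalize in a numerically stable way.
        top = max(log_scores.values())
        exp = {l: math.exp(s - top) for l, s in log_scores.items()}
        z = sum(exp.values())
        return {l: v / z for l, v in exp.items()}


def most_uncertain(model, pool):
    """Pick the pool item whose most probable class has the lowest probability."""
    return min(pool, key=lambda text: max(model.predict_proba(text).values()))


labeled = [("wages rise after strike", "labour"), ("team wins cup final", "sport")]
pool = [
    "union calls strike over wages",
    "striker scores in final",
    "strike in the final minute",
]
# Stand-in for the analyst's manual decisions.
oracle = {
    "union calls strike over wages": "labour",
    "striker scores in final": "sport",
    "strike in the final minute": "sport",
}

queried = []
for _ in range(2):
    model = NaiveBayes().fit(*zip(*labeled))
    query = most_uncertain(model, pool)   # item the model is least sure about
    pool.remove(query)
    labeled.append((query, oracle[query]))  # the analyst supplies the label
    queried.append(query)

print(queried)
```

In practice the loop would stop once the analyst judges the classification quality on held-out or manually verified examples to be sufficient, rather than after a fixed number of rounds.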