Leipzig Corpus Miner (LCM)

One central objective of the LCM is to enable analysts to perform text mining tasks without explicit guidance from NLP experts. For that reason we implemented a middleware (UIMAWS), a web service to start, stop and manage UIMA processes for certain tasks. The LCM integrates several procedures for retrieving, annotating and mining textual data. Flexibility in combining these tools supports a range of analysis interests, from quantitative corpus linguistics to qualitative reconstructivist methodologies.

Information retrieval: Assuming the availability of a large document collection, e.g. complete volumes of a daily newspaper over several decades, a common need is to identify documents of interest for certain research questions.

Lexicometrics: The LCM implements the computation and visualization of basic corpus-linguistic measures on stored collections. It supports frequency analysis, co-occurrence analysis and automatic extraction of key terms.
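To make the first two measures concrete, here is a minimal sketch in plain Python. It is not the LCM's code: the tokenizer, the function names and the three toy documents are assumptions made for illustration, and co-occurrence is counted at document level, which is only one of several possible window choices.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(docs):
    """Count how often each term occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def cooccurrences(docs):
    """Count unordered term pairs that appear together in the same document."""
    pairs = Counter()
    for doc in docs:
        terms = sorted(set(tokenize(doc)))  # each pair counted once per document
        pairs.update(combinations(terms, 2))
    return pairs


# Hypothetical toy collection
docs = [
    "the minister announced a new tax",
    "the tax reform divided the parliament",
    "parliament debated the tax reform",
]

freq = term_frequencies(docs)
pairs = cooccurrences(docs)

print(freq.most_common(3))
print(pairs[("reform", "tax")])
```

Because the terms of each document are sorted before pairing, every pair is stored under a single key, with the alphabetically smaller term first.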

Topic models: Topic models have been shown in recent studies to be useful for analyzing topical structures in large text collections. They are statistical models which infer probability distributions over latent variables, assumed to represent topics, in text collections as well as in single documents.
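The post does not name a specific model, but the idea of latent topic distributions can be written in the generic mixture form shared by common topic models such as pLSA and LDA. The symbols below are assumptions introduced here: $w$ is a word, $d$ a document, $z$ the latent topic variable and $K$ the number of topics.

```latex
p(w \mid d) \;=\; \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)
```

Here $p(w \mid z = k)$ describes a topic as a distribution over words for the whole collection, while $p(z = k \mid d)$ describes a single document as a distribution over topics.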

Classification: Supervised learning from annotated text to assist the coding of documents or parts of documents promises to be one major innovation for Content Analysis applications. The LCM allows for manual annotation of complete documents or snippets of documents with category labels. The analyst may develop a hierarchical category system up front, refine it during the process of annotation, or both. Annotated text parts are used as training examples for automatic classification processes which output category labels for unseen analysis units (e.g. sentences, paragraphs or documents).
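The step from annotated snippets to labels for unseen units can be sketched as follows. The post does not say which classifier the LCM uses, so this example swaps in a small multinomial Naive Bayes with add-one smoothing written in plain Python. The class name, the category labels and the training snippets are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative, not the LCM's classifier)."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior of the category
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical snippets annotated by an analyst
annotated = [
    ("the government raised taxes on income", "economy"),
    ("unemployment and inflation hurt the economy", "economy"),
    ("the team won the championship match", "sports"),
    ("the striker scored in the final match", "sports"),
]
texts, labels = zip(*annotated)
model = NaiveBayes().fit(texts, labels)

print(model.predict("inflation and taxes keep rising"))
print(model.predict("the striker won the match"))
```

In practice an analyst would review the predicted labels and feed corrections back in as further training examples, which matches the iterative annotation process described above.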
