Tools for Corpus Linguistics

Tool Description
@nnotate Semi-automatic annotation of corpus data
aConCorde Multilingual concordance tool (English and Arabic)
Shalmaneser / SALTA Semantic Parser/POS Tagger for English
AMALGAM Tool for grammatical annotation (POS and phrase structure). Tags a text submitted via email.
ANNIS Search and visualization tool for multi-layer linguistic corpora with diverse types of annotation
AntCLAWSGUI Front-end interface for CLAWS tagger
AntConc Corpus analysis toolkit
AntCorGen A freeware discipline-specific corpus creation tool.
AntFileConverter Freeware tool to convert PDF and Word (DOCX) files into plain text
AntFileSplitter A freeware text file splitting tool.
AntGram A freeware n-gram and p-frame (open-slot n-gram) generation tool.
AntMover Tool for text structure (moves) analysis
AntPConc Corpus analysis toolkit for files encoded with UTF-8
AntWordProfiler Tool for profiling vocabulary level and text complexity
Atomic Multi-layer corpus annotation platform.
BFSU Collocator A collocation analysis toolkit
BFSU English Sentence Segmenter A simple sentence segmenter
BFSU Qualitative Coder A tool for manual coding of corpora
BFSU Sentence Collector A pedagogic concordancer
BFSU Stanford Parser A simple parser
BNCWeb BNCweb is a web-based client program for searching and retrieving lexical, grammatical and textual data from the British National Corpus (BNC).
BootCat Tool for crawling and compiling data from the web with a list of seed words.
Bow Statistical Language Modeling, Text Retrieval, Classification and Clustering
BFSU ParaConc A parallel concordancer
BFSU PowerConc A fairly powerful concordancer
BFSU Stanford POS Tagger A PoS tagger
CasualConc CasualConc is a concordance program that runs natively on Mac OS X 10.9 or later
Chared Tool for detecting the character encoding of a text
Chi-Square and Log Likelihood Calculator A simple tool for calculating Chi-squared and LL
CLaRK XML Based System For Corpora Development
CLiC A corpus tool to support the analysis of literary texts.
Colligator 2.0 A colligation query/analysis toolkit
Collocate Tool for the extraction of concordances and collocations
Concordance Randomizer A concordance randomizer
Concordancer Online tool for frequency counts and text clouds
CorpKit An advanced modern corpus toolkit with an emphasis on visualization and annotated corpora.
CorporaCoCo A set of R functions used to compare co-occurrence between corpora
Corpus Presenter Tree tagger and corpus analysis software
Corpus-Tools Text annotation and analysis tool
CorpusExplorer A complex corpus analysis toolkit combining 45 interactive tools.
CorpusSearchLite Searches parsed corpora in the Penn Treebank format
CQPweb Overview of and access to a wide range of corpora
DART An annotation tool and research environment for annotating dialogues.
DeTagging Tool A tool that strips annotation/tags from files
Dexter Tool for text annotation
DISCO Corpus pre-processing tool for a variety of languages that allows retrieval of the semantic similarity between arbitrary words and phrases
ELAN Transcription and annotation of sound or video files
EncodeAnt Tool for the detection and conversion of character encodings
EXMARaLDA Tool for transcription, annotation, corpus analysis of spoken data
FireAnt Social media analysis toolkit
Flesh PC Calculating Flesch scores
FrameNet Dictionary of more than 10,000 word senses, tagged for semantic roles (according to Fillmorean Frame Semantics)
gensim Deep learning via word2vec
Google Ngrams An ngram-viewer for the whole of Google Books
GraphColl Tool for building and exploring networks of linguistic collocations
Gsearch Tool for syntactic pattern matching
HeidelGram Web-Based Tools Basic corpus analysis toolkit for the HeidelGram Corpus
HeidelTime A multilingual, domain-sensitive temporal tagger
HGSimpleCorpusNetwork Batch frequency analysis on corrupted (e.g. OCR) corpus data and generation of network analysis data.
HTST Samuels Historical Thesaurus Semantic Tagger via web-interface
ICARUS Search and visualization tool for dependency trees
ICEweb A tool for compiling, downloading, and analyzing web corpora in accordance with the ICE
IMS Corpus Workbench Tool for sorting frequencies in corpora
jTokenizer Tokenizing natural language
JusText Tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages
Kaleidographic A dynamic and interactive visualization tool for multivariate data.
KAT Tool Grouping patterns based on search terms
kdiff3 KDiff3 is a diff and merge program.
Keyword Plus A keyword generation/analysis tool
Khepri A view-based tool for exploring (historical sociolinguistic) data
KoGra-R An R-based online tool that provides statistical measures for corpus-based frequencies
LancsBox The Lancaster Desktop Corpus Toolbox; Software package for the analysis of language data and corpora
LEXA A complex lemmatizer.
LexisNexis A database containing (new and old) news articles. They also have other (business) data.
LightSide A machine learning workbench.
Linguistica Word segmentation and morphological analysis?
MALLET Package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text
MAT – Multidimensional Analysis Tagger A tagger for MDA (Biber et al.)
MLCT Tool for building and processing corpora
MonoConc Esy Concordancing and text search tool that allows primary and secondary concordancing
MorphAdorner Tool for performing morphological tagging of texts
Natural Language Toolkit Platform for building Python programs to work with human language data
NooJ Tags texts and corpora (i.e. sets of text files) at the Orthographical, Lexical, Morphological, Syntactic and Semantic levels
NoSketch Engine Word sketches, thesaurus, keyword computation, corpus creation
Onion Tool for removing duplicate parts from large collections of texts
Online Graded Text Editor Tool for profiling a text’s vocabulary level and complexity
OpenConc Tool for concordancing
PALinkA Annotation tool
ParaConc A bilingual/multilingual concordancer
Pareidoscope Pareidoscope is a collection of tools for determining the association between arbitrary linguistic structures (e.g. collocations, collostructions, or associations between structures).
PatCount A pattern counting tool with powerful statistical capabilities and regex support
Pattern Builder A tool helping with regular expressions and PoS tags
Pepper Conversion between linguistic formats, e.g. from TEI to ANNIS to Tiger XML to EXMARaLDA.
Phonological CorpusTools (PCT) Phonological analysis on transcribed corpora
PhraseContext Tool for wordlists, concordancing, collocation, TTR
Praaline Praaline is a system for metadata management, annotation, visualisation and analysis of spoken language corpora.
PRAAT A tool for doing phonetics by computer
ProtAnt Tool for prototypical text analysis
pysupersensetagger Analyses texts for MWE and supersenses.
PyXMLConc Concordancer for XML files with automatic tag and attribute detection.
Query Tool for the Edinburgh Associative Thesaurus A query tool for the EAT
Readability Analyzer A tool for generating various readability statistics
RSTTool Tool that can annotate texts for constituency and rhetorical structure
Salt Meta models for linguistic data.
SarAnt Tool for batch search and replacing
SegmentAnt Tool for the segmentation of Japanese and Chinese
Shinyconc ShinyConc is a framework for generating custom web-based concordancers and is written in R and R Shiny.
Simple Concordance Program Tool for concordance and word listing that works with many languages
SketchEngine Word sketches, thesaurus, keyword computation, corpus creation
SpiderLing Software for obtaining text from the web useful for building text corpora
SPre Tool for segmenting and annotating texts
Stanford Log-linear POS Tagger POS Tagger (with Penn Treebank Tagset) for English, Arabic, Chinese, German
Stanford Topic Modeling Toolbox The Stanford Topic Modeling Toolbox (TMT) allows users to perform topic modeling on texts imported from spreadsheets. It supports both LDA and labelled LDA.
Stylo for R Tool for computational stylistic analysis (authorship attribution, genre analysis)
Sub-Corpus Creator A tool for creating sub-corpora based on searches and metadata
Synpathy Tool for manual syntactic annotation
TAACO TAACO is a tool that calculates 150 indices of textual/lexical cohesion.
TAALES TAALES measures over 400 indices of lexical sophistication.
TagAnt Part-of-speech tagging tool built on Tree Tagger
Tagxedo A tool for generating word clouds.
TASX-Annotator Tool for multilevel annotation and transcription of (multi-channel) video and audio data.
Text Analysis Computing Tools (TACT) A simple, fairly old concordancer.
Text Variation Explorer The Text Variation Explorer (TVE) is a tool for exploring the effect of window size on various common linguistic measures. It visualizes these measures and allows for PCA/Cluster analysis.
Text Visualization Browser A survey/gallery of text visualizations
Textanz Language analysis program that produces frequency lists, word lists, parts of speech tags.
TextArc A tool for visualizing the structure of texts.
TextDirectory TextDirectory is a tool for aggregating text files based on various filters and transformation functions.
Textplot A tool for mapping a document into a network of terms in order to visualize the topic structure.
TextSmith Tools A tool for genre-informed phraseological profiles
TextSTAT Tool for creation and manipulation of linguistic data from different languages
The (Phonetic) Transcription Editor An editor for creating phonetic transcriptions
The Simple Corpus Tool A corpus analysis toolkit that supports XML annotations.
The Simple PoS Tagger A simple PoS tagger utilizing Perl's Lingua::EN::Tagger
The SPAADIA concordancer A concordancer for the SPAADIA corpus
The Text Feature Analyser A tool for investigating textual features and various measures. English language thesaurus with links to English dictionary and translation sites.
TigerSearch Tool for searching syntactically and POS-tagged corpora
TnT – Thorsten Brants’s PoS Tagger A simple PoS-Tagger
Tree Editor TrEd 2.0 Graphical editor and viewer for tree-like structures.
TreeTagger Tool for annotating text with part-of-speech and lemma information
TurboParser Multilingual dependency parser with linear programming
Tweet NLP Tweet tokenizer, POS Tagger, hierarchical word clusters, and a dependency parser for tweets, along with annotated corpora and web-based annotation tools.
TXM XML & TEI compatible text analysis software based on TreeTagger, the CQP search engine and the R statistical environment.
UAM CorpusTool Text annotation tool and statistics for various types of linguistic analysis and multilayer annotation
UAM ImageTool Image annotation tool for visual data corpora
Unitok Tool that splits texts into tokens
VARD Spelling variant detection and deletion in historical corpora (particularly EModE)
VariAnt Tool for the detection of spelling variants
Voyant A web-based reading/analysis toolkit for digital texts.
VU Amsterdam Metaphor Identification Corpus Corpus tool for metaphor identification
WConcord 3.0 A full featured concordancer
WebLicht WebLicht is an execution environment for automatic annotation of text corpora embedded within the CLARIN-D project.
Wmatrix Tool for corpus analysis and comparison
WordCruncher A tool for analyzing ebooks.
WordFish Extract political positions from text documents.
Wordscores A tool (approach) to extract dimensional information from political texts
Wordsmith One of the most established corpus toolkits providing a variety of functionality
Wordstatix Corpus analysis tool
Worldbuilder Tool for annotation and visualisation in analysis applying text-world-theory
Wordle A tool for generating word clouds.
Xaira Indexing and analysis of XML resources
YACSI Chinese Tokeniser / PoS Tagger A Chinese tokenizer and PoS tagger
Gephi A toolkit for network analysis
DocuScope A tool for computer-aided rhetorical analysis
juxta Comparing and collating multiple witnesses to single textual works
WordHoard Close reading and scholarly analysis of deeply tagged texts
Intelligent Archive Managing corpora for stylometry
Twarc A command line tool (and Python library) for archiving Twitter JSON
WebAnno A web-based annotation tool
Coh-Metrix A web-based system to compute cohesion and coherence metrics.
LIWC A tool that tries to compute scores for different emotions, thinking styles, and social concerns.
ANVIL A tool for video annotation.
LDA-Toolkit A toolkit for linguistic discourse and image analysis.
FLAIR (2.0) An online tool for language teachers and learners that analyzes grammatical constructions and readability on the fly.
DisMo An automatic multi-level annotator for spoken language corpora.
TagCrowd A simple tool for generating tag/word clouds online
MMAX2 A multi-level annotation tool
KorAP A complex platform for corpus analysis developed at the IDS in Mannheim
kfNgram A simple tool for generating n-grams
MAXQDA Sophisticated QDA software that works with multimodal data and supports mixed methods approaches
ATLAS.ti Sophisticated QDA software for mixed methods approaches
Pipoca (formerly openQDA) Web-based QDA software
f4analyse QDA software specifically geared towards interview (spoken) data
f4transkript Software for transcribing audio data
CATMA (Computer Assisted Text Markup and Analysis) A complex annotation and analysis package
ANC2go A web service that allows users to create custom sub-corpora of the ANC
CoMOn A tool for corpus matching analysis

