Software related to Text/Corpus Linguistics

Subcategory: Software: Natural Language Processing 

  • AGTK: An annotation graph toolkit. Also available for Mac OS X.
  • CLaRK – an XML-based System for Corpora Development: CLaRK is an XML-based software system for corpora development. The main aim behind the design of the system is the minimization of human intervention during the creation of language resources.
  • DKPro WSD: DKPro WSD is a modular, extensible Java framework for word sense disambiguation. It is based on Apache UIMA, an industry standard for text processing.
  • GATE: General Architecture for Text Engineering. A domain-specific software architecture and development environment that supports researchers in Natural Language Processing and Computational Linguistics and developers who are producing and delivering Language Engineering systems.
  • HFST – Helsinki Finite-State Transducer Technology: The Helsinki Finite-State Transducer software is intended for the implementation of morphological analysers and other tools which are based on weighted and unweighted finite-state transducer technology. This work is licensed under a GNU Lesser General Public License v3.0. The feasibility of the HFST toolkit is demonstrated by a full-fledged open source implementation of a Finnish morphological analyzer as well as analyzers and generators for a number of other languages of varying morphological complexity, e.g. English, French, German, Italian, Northern Sámi, Swedish, and Turkish. Many more languages are also available as spellers and hyphenators.
  • openNLP: The Open Natural Language Processing website with many software packages that also run on Mac OS X.
  • Tesla (Text Engineering Software Laboratory): Tesla is a client-server-based, virtual research environment for text engineering – a framework to create experiments in corpus linguistics, and to develop new algorithms for natural language processing. It is being developed at the Department of Computational Linguistics, University of Cologne, Germany, and licensed under the Eclipse Public License (EPL). Tesla was implemented in Java (with an Eclipse-based Client and IDE) and is available for Windows, Linux, and Mac OS X.
  • UAM CorpusTool: The UAM CorpusTool is a text annotation tool, allowing annotation of a plain text corpus (collections of text files) at multiple linguistic levels. The annotation scheme at each level is provided by the user in terms of a hierarchical tree of features (allowing cross-classification). The tool allows complex search of the corpus, including concordancing. Another interface allows you to produce statistical analyses of the corpus (descriptive, comparative). Windows and Macintosh are supported; full Unicode support is available on Windows.

Subcategory: Software: Concordances 

  • aConCorde: aConCorde is a multilingual concordance tool. Originally developed for native Arabic concordance, it possesses basic concordance functionality, as well as English and Arabic interfaces. It is written in Java, so it will run on any platform that has the Java Runtime Environment installed.
  • Apple Pie Parser: MonoConc Pro 2.0 and MonoConc 1.5: Two concordance programs for linguists and other language researchers.
  • A Simple Concordance Program: Windows-based program for creating word lists and concordances.
  • CLaRK – an XML-based System for Corpora Development: CLaRK is an XML-based software system for corpora development. The main aim behind the design of the system is the minimization of human intervention during the creation of language resources.
  • Conc: Concordance software for the Macintosh, developed by the Summer Institute of Linguistics.
  • Concordance – the program: Flexible text analysis software. Lets you gain better insight into e-texts. Make concordances, word lists, indexes. Count word frequencies, find phrases, and more. Publish results to the Web with one click. For Windows XP/2000/NT/ME/98/95.
  • KH Coder: Quantitative Text Analysis: KH Coder is free software for quantitative analysis of Japanese, English, French, German, Italian, Portuguese and Spanish language text. KH Coder provides these functions using back-end tools such as Stanford POS Tagger, Snowball stemmer, MySQL and R. Just input raw texts and you can use these functions. Originally, KH Coder was developed for content analysis in the sociological field and only supported analysis of Japanese language data. However, a significant amount of linguistic research has been conducted with this software, and it now supports other languages.
  • UAM CorpusTool: The UAM CorpusTool is a text annotation tool, allowing annotation of a plain text corpus (collections of text files) at multiple linguistic levels. The annotation scheme at each level is provided by the user in terms of a hierarchical tree of features (allowing cross-classification). The tool allows complex search of the corpus, including concordancing. Another interface allows you to produce statistical analyses of the corpus (descriptive, comparative). Windows and Macintosh are supported; full Unicode support is available on Windows.
  • WordSmith Tools: A suite of PC software for lexical analysis of corpora in a very wide variety of languages. Offers concordancing, wordlisting, key words analysis and a number of other utilities. WordSmith 3.0 (OUP, 1999) handles Windows 3.1 and better and is restricted to ASCII/ANSI text; WS 4.0 (2002) requires Windows 98B or better and handles Unicode as well as ASCII/ANSI text. Version 4.0 was issued in 2004. This is a complete new edition with many limitations removed and numerous additional features, such as sound concordancing, use of Unicode, tools for obtaining text from the Internet, etc.
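
The concordance programs listed above all produce keyword-in-context (KWIC) lines in some form. As a rough illustration of the idea, and not the method of any tool in this list, a minimal Python sketch might look as follows. The function name `kwic` and the `width` parameter are assumptions made for this example.

```python
import re

def kwic(text, keyword, width=30):
    """Return keyword-in-context lines for each whole-word match of `keyword`.

    Illustrative only: `kwic` and `width` are hypothetical names, not part
    of any concordancer listed on this page.
    """
    lines = []
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        # Take up to `width` characters of context on each side.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

sample = "The cat sat on the mat. A cat is a small animal. Cats like milk."
for line in kwic(sample, "cat", width=20):
    print(line)
```

Because the pattern uses word boundaries, "Cats" in the sample is not reported as a hit for "cat"; a real concordancer would typically offer lemma or wildcard search for that.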

Subcategory: Software: Parsers 

Subcategory: Software: Taggers 

  • Adsotrans Chinese-English Annotation Engine: Adsotrans is a collaborative open source Chinese-English annotation project designed to assist learners of Chinese as a second language. It comes with a large database of semantically-tagged Chinese word information.
  • CLaRK – an XML-based System for Corpora Development: CLaRK is an XML-based software system for corpora development. The main aim behind the design of the system is the minimization of human intervention during the creation of language resources.
  • SALTO Semantic Annotation Tool: SALTO is a graphical tool that supports manual annotation of text corpora with (frame) semantic argument structures. The tool was developed within the SALSA project (http://www.coli.uni-saarland.de/projects/salsa/) at Saarland University. SALTO can be used to add a second (typically semantic) layer of annotation to corpora that are already syntactically analyzed (through manual annotation or automatically). Main features are: Query-based creation of subcorpora for annotation, Distribution of corpora to different annotators, Definition of Items and Classes/Tags to be annotated, Comfortable annotation with visual editor and mouse-menus, and Semi-automatic merging and adjudication of parallel annotations in same editor.
  • UAM CorpusTool: The UAM CorpusTool is a text annotation tool, allowing annotation of a plain text corpus (collections of text files) at multiple linguistic levels. The annotation scheme at each level is provided by the user in terms of a hierarchical tree of features (allowing cross-classification). The tool allows complex search of the corpus, including concordancing. Another interface allows you to produce statistical analyses of the corpus (descriptive, comparative). Windows and Macintosh are supported; full Unicode support is available on Windows.
  • WordStat: WordStat is a text analysis module specifically designed to study textual information such as responses to open-ended questions, interviews, titles, journal articles, public speeches, electronic communications, etc. WordStat may be used for automatic categorization of text using a dictionary approach or text mining. WordStat can apply existing categorization dictionaries to a new text corpus. It also may be used in the development and validation of taxonomies. When used in conjunction with manual coding, this module can provide assistance for a more systematic application of coding rules, help uncover differences in word usage between subgroups of individuals, assist in the revision of existing coding using KWIC (Keyword-In-Context) tables, and assess the reliability of coding by the computation of inter-rater agreement statistics. WordStat includes numerous exploratory data analysis and graphical tools that may be used to explore the relationship between the content of documents and information stored in categorical or numeric variables such as the gender or the age of the respondent, year of publication, etc. Relationships among words or categories as well as document similarity may be identified using hierarchical clustering and multidimensional scaling analysis. Correspondence analysis and heatmap plots may be used to explore the relationship between keywords and different groups of individuals.

Subcategory: Software: Other Software Tools 

  • CINTIL Concordancer and Corpus: CINTIL Online Concordancer is now available at http://cintil.ul.pt. This is an online concordancing service that supports the research usage of the CINTIL Corpus. CINTIL (Corpus Internacional do Português) is a linguistically interpreted corpus of Portuguese, developed at the University of Lisbon. At present it is composed of 1 million annotated tokens, manually verified by linguistic experts. The annotation comprises information on part-of-speech, on lemma and inflection of tokens from open classes, on multi-word expressions pertaining to the class of adverbs and to the closed POS classes, and on multi-word proper names (for named entity recognition). Feedback is very welcome at cintil@di.fc.ul.pt
  • Helpful add-in to MS Word: repetition counter and approximate matching search tools: Fore Words is a plugin (add-in) for Microsoft Word that provides tools for text analysis. The add-in currently contains two items: Repetyler and K-Diff Search. Repetyler counts all repetitions (words or phrases) in a text. It can help to improve, or simply examine, the writing style in business documentation, literary texts, correspondence, etc. Excessively frequent constructions and so-called parasite words (fillers) can be invisible at first glance but can badly affect the reader’s impression. Repetition analysis can also help you build a portrait of a writer or find implicit messages in formal language. Webmasters may find Repetyler useful for analyzing word density and choosing keywords for search engines. The professional version, Fore Words Pro, can additionally count repetitions of word parts, so you can find repeatedly used words in all their forms (in particular, with different suffixes). The length of the word part being searched for is configurable, as is its position: the beginning of the word (prefix) or the middle. K-Diff Search searches by approximate matching.
  • ISIS: Indian Scripts Input System (ISIS) is a set of easy-to-use, mnemonic software keyboards for Indian scripts. ISIS is Unicode-compliant and covers almost all major Indian scripts with a single keyboard layout.
  • LingPipe Java API: LingPipe is a Java API for linguistic processing tasks that include tokenization, sentence detection, part-of-speech tagging, phrase chunking, entity detection, and within-document coreference. It also has efficient language-model-based classifiers and noisy-channel spell correction. Source included.
  • LIWC – Linguistic Inquiry and Word Count: LIWC calculates the percentage of words within each file along 72+ dimensions. Categories include negative emotions (including anger, anxiety, sadness), positive emotions, cognitive processing, standard linguistic dimensions (pronouns, prepositions, articles), and common content categories (death, sex, occupation, etc). This is a sound program from a psychometric perspective, both in the creation of categories and the validation of the dictionaries. Dictionaries in English, Spanish, German, Dutch, Italian, and Norwegian are available, along with partial dictionaries in Korean, Hungarian, and French.
  • Online text summarization for French texts: Pertinence is automatic text summarization software that lets users quickly and easily extract the important information from a text. Pertinence acts as on- and off-ramps to the information superhighway, allowing friendly access to the relevant information. This is useful for tasks such as accessing very large, unstructured collections such as the World Wide Web, an intranet, or text databases stored on your own computer. The Pertinence demo is free (French texts only) for the document types ASCII, HTML, and PDF. Try the automatic text summarization free online: http://www.pertinence.net/register_en.html
  • Redet: Redet is a tool for performing regular expression matching and substitution. It is useful for performing complex searches of corpora and lexica as well as for transforming data. It permits the user to define named character classes and to take their intersection, with the result that it is possible to run searches on feature matrices. It provides considerable assistance for the user, including a palette of regular expression constructions, a history list that persists across sessions, extensive help, and a set of character entry tools including IPA charts and a simple facility for defining custom character charts. Numerous aspects of the program are configurable. Unicode is fully supported.
  • SEMANA software for interactive semantic data mining: Knowledge acquisition using the Knowledge Discovery in Databases (KDD) technology (with data mining in its core) is situated halfway between Database Management and Automated Discovery. It is computationally possible today to reveal usually “invisible” or “hidden” remarkably compound (lattice) structures analyzing very simple tabular representations of gathered atomic data, and much more… See also the Mailing list CASK (Computer-aided Acquisition of Semantic Knowledge)
  • SenseClusters: SenseClusters is a suite of Perl programs that supports unsupervised clustering of similar contexts. It relies on its own native methodology, and also provides support for Latent Semantic Analysis. SenseClusters is a complete system that takes users from preprocessing of text to clustered output. It supports the selection of features, the creation of various kinds of context representations, dimensionality reduction via Singular Value Decomposition, clustering, and analysis of results.
  • System Quirk: The System Quirk family of applications is designed to aid in the production and maintenance of texts and terminologies. These applications are of specific relevance to computational linguists and language engineers.
  • Textanz: Textanz builds a list of word and phrase frequencies from a text. This information allows you to detect excessive use of words and expressions. Such stylistic checking is no less important than the spell checking that has already become standard, and it is especially advisable for business documentation: the first impression that a reader gets from your commercial offer, project, resume, contract, report, etc. depends in many respects on the writing style. It is also useful to analyze frequencies in your informal writing, and in general in any text that you intend to give someone to read. Textanz also helps when you are the reader: the most frequently used phrases suggest which idea was central for the author at the moment of writing, and may reveal implicit psychological aspects. A word frequency list is part of the so-called stylistic portrait of a writer, which linguistic research often uses to identify authorship (somewhat like handwriting). Developers and webmasters can also benefit from Textanz when choosing keywords for a web page or searching for repeated fragments of program source code.
  • Topicalizer: Topicalizer is a text analysis, topic extraction and keyword analysis tool. Based on methods of computational linguistics it provides various analyses for a given URL or plain text. These comprise, amongst others, language recognition, lexical density, keywords, collocations, word and phrase frequencies, readability and a short abstract. Topicalizer also is able to find similar pages according to the keywords it has extracted from a document. Moreover, Topicalizer provides an API for use by external applications.
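
Several of the tools above (for example Textanz and Topicalizer) report word and phrase frequencies. The general technique can be sketched in a few lines of Python; this is only an illustration under simple assumptions (lowercasing, letters-only tokens), not the algorithm used by any listed product, and the function name `ngram_frequencies` is hypothetical.

```python
import re
from collections import Counter

def ngram_frequencies(text, n=1):
    """Count lowercase word n-grams in `text` (n=1 gives plain word counts).

    Hypothetical helper for illustration; tokenization is deliberately naive.
    """
    # Treat each run of Unicode letters as one word.
    words = re.findall(r"[^\W\d_]+", text.lower())
    # Slide a window of n words across the token list.
    ngrams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in ngrams)

text = "The cat and the dog and the bird"
print(ngram_frequencies(text).most_common(2))
print(ngram_frequencies(text, n=2).most_common(1))
```

Calling the function with `n=2` or `n=3` gives phrase counts, which is the kind of list a repetition checker would sort and display.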

Subcategory: Software: Phonetic Analysis 

  • AGTK: An annotation graph toolkit. Also available for Mac OS X.
  • Audiamus: Audiamus builds a corpus of linked text and media. It is a cross-platform tool that allows presentation of textual material linked to unsegmented media files, using QuickTime to instantiate links. It was developed as a means of working interactively with field recordings and of presenting texts and example sentences as playable media with a dissertation.
  • Child Phonology Analyzer: This user-friendly and easy-to-use tool provides phonological and lexical analyses of child speech (but can be adjusted for other types of corpora). It is intended to be used with corpora stored in Microsoft Excel files. The tool offers a detailed phonological analysis, allowing you to count instances of different segments, articulatory features, syllabic structures and strings (words and parts of words) within a given age range. It also offers an account of lexical development, portraying stages of development (by cumulated attempted target words).
  • Metricalizer²: The release-version of a program for automated metrical analysis of German poetry. Copy any poem inside and get back information about prosody, meter, rhyme and metrical complexity.

Subcategory: Software: Transcription 

  • AGTK: An annotation graph toolkit. Also available for Mac OS X.
  • Audiamus: Audiamus builds a corpus of linked text and media. It is a cross-platform tool that allows presentation of textual material linked to unsegmented media files, using QuickTime to instantiate links. It was developed as a means of working interactively with field recordings and of presenting texts and example sentences as playable media with a dissertation.
  • IPAKLICK: a freely accessible tool that makes it easy to insert strings of IPA-symbols (Unicode) into a text.
  • ONZE Miner: ONZE Miner is a browser-based linguistics research tool that stores audio recordings and text transcripts of interviews. The transcripts can be searched for particular text or regular expressions. The search results, or entire transcripts, can be viewed or saved in a variety of formats, and the related parts of the audio recordings can be played or opened in acoustic analysis software, all directly through the web-browser.
  • WordStat: WordStat is a text analysis module specifically designed to study textual information such as responses to open-ended questions, interviews, titles, journal articles, public speeches, electronic communications, etc. WordStat may be used for automatic categorization of text using a dictionary approach or text mining. WordStat can apply existing categorization dictionaries to a new text corpus. It also may be used in the development and validation of taxonomies. When used in conjunction with manual coding, this module can provide assistance for a more systematic application of coding rules, help uncover differences in word usage between subgroups of individuals, assist in the revision of existing coding using KWIC (Keyword-In-Context) tables, and assess the reliability of coding by the computation of inter-rater agreement statistics. WordStat includes numerous exploratory data analysis and graphical tools that may be used to explore the relationship between the content of documents and information stored in categorical or numeric variables such as the gender or the age of the respondent, year of publication, etc. Relationships among words or categories as well as document similarity may be identified using hierarchical clustering and multidimensional scaling analysis. Correspondence analysis and heatmap plots may be used to explore the relationship between keywords and different groups of individuals.

Subcategory: Software: Lexicons 

  • CLaRK – an XML-based System for Corpora Development: CLaRK is an XML-based software system for corpora development. The main aim behind the design of the system is the minimization of human intervention during the creation of language resources.
  • LIWC – Linguistic Inquiry and Word Count: LIWC calculates the percentage of words within each file along 72+ dimensions. Categories include negative emotions (including anger, anxiety, sadness), positive emotions, cognitive processing, standard linguistic dimensions (pronouns, prepositions, articles), and common content categories (death, sex, occupation, etc). This is a sound program from a psychometric perspective, both in the creation of categories and the validation of the dictionaries. Dictionaries in English, Spanish, German, Dutch, Italian, and Norwegian are available, along with partial dictionaries in Korean, Hungarian, and French.
  • TAMS: Text Analysis Markup System; for Linux and Mac OS X
  • TshwaneLex Lexicography Software: TshwaneLex is a professional software application for the compilation of monolingual, bilingual or semi-bilingual dictionaries. TshwaneLex contains various innovative features designed to optimise the process of producing dictionaries, and to improve consistency and quality of the final dictionary product. TshwaneLex supports Unicode throughout, allowing it to handle virtually all of the world’s languages, and includes features such as immediate article preview, customisable fields, automatic cross-reference tracking, automated lemma reversal, online and electronic dictionary modules, export to MS Word format, and teamwork (network) support.
  • WordSmith Tools: A suite of PC software for lexical analysis of corpora in a very wide variety of languages. Offers concordancing, wordlisting, key words analysis and a number of other utilities. WordSmith 3.0 (OUP, 1999) handles Windows 3.1 and better and is restricted to ASCII/ANSI text; WS 4.0 (2002) requires Windows 98B or better and handles Unicode as well as ASCII/ANSI text. Version 4.0 was issued in 2004. This is a complete new edition with many limitations removed and numerous additional features, such as sound concordancing, use of Unicode, tools for obtaining text from the Internet, etc.
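
Dictionary-based tools such as LIWC, listed above, report the percentage of words in a text that fall into each category of a lexicon. The following Python sketch shows that calculation in its simplest form. The toy dictionary `CATEGORIES` and the function `category_percentages` are invented for this example and are not taken from LIWC or its dictionaries.

```python
import re

# Hypothetical toy lexicon for illustration only; not LIWC's categories or word lists.
CATEGORIES = {
    "negative_emotion": {"sad", "angry", "afraid"},
    "pronoun": {"i", "you", "we", "they"},
}

def category_percentages(text, categories=CATEGORIES):
    """Return, per category, the percentage of words in `text` found in its word set."""
    # Naive tokenization: lowercase runs of ASCII letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {name: 0.0 for name in categories}
    return {
        name: 100.0 * sum(word in vocab for word in words) / total
        for name, vocab in categories.items()
    }

print(category_percentages("I am sad you are angry they left"))
```

Real category dictionaries are far larger and usually support word stems and multi-category membership, which this sketch leaves out.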

Subcategory: Software: Morphological Analysis 

  • AGTK: An annotation graph toolkit. Also available for Mac OS X.
  • Child Phonology Analyzer: This user-friendly and easy-to-use tool provides phonological and lexical analyses of child speech (but can be adjusted for other types of corpora). It is intended to be used with corpora stored in Microsoft Excel files. The tool offers a detailed phonological analysis, allowing you to count instances of different segments, articulatory features, syllabic structures and strings (words and parts of words) within a given age range. It also offers an account of lexical development, portraying stages of development (by cumulated attempted target words).
  • Emdros text database engine for analyzed or annotated text: Emdros is an Open Source text database engine specializing in linguistic analyses of text. Emdros comes with a powerful query language for asking linguistically relevant questions of the data.
  • Helpful add-in to MS Word: repetition counter and approximate matching search tools: Fore Words is a plugin (add-in) for Microsoft Word that provides tools for text analysis. The add-in currently contains two items: Repetyler and K-Diff Search. Repetyler counts all repetitions (words or phrases) in a text. It can help to improve, or simply examine, the writing style in business documentation, literary texts, correspondence, etc. Excessively frequent constructions and so-called parasite words (fillers) can be invisible at first glance but can badly affect the reader’s impression. Repetition analysis can also help you build a portrait of a writer or find implicit messages in formal language. Webmasters may find Repetyler useful for analyzing word density and choosing keywords for search engines. The professional version, Fore Words Pro, can additionally count repetitions of word parts, so you can find repeatedly used words in all their forms (in particular, with different suffixes). The length of the word part being searched for is configurable, as is its position: the beginning of the word (prefix) or the middle. K-Diff Search searches by approximate matching.
  • HFST – Helsinki Finite-State Transducer Technology: The Helsinki Finite-State Transducer software is intended for the implementation of morphological analysers and other tools which are based on weighted and unweighted finite-state transducer technology. This work is licensed under a GNU Lesser General Public License v3.0. The feasibility of the HFST toolkit is demonstrated by a full-fledged open source implementation of a Finnish morphological analyzer as well as analyzers and generators for a number of other languages of varying morphological complexity, e.g. English, French, German, Italian, Northern Sámi, Swedish, and Turkish. Many more languages are also available as spellers and hyphenators.
  • Linguistica: Linguistica is an ongoing research project developing software for the unsupervised learning of natural language morphology. It takes an untagged text corpus as its input, and attempts to determine the stems, affixes, and morphological structure of the words with no prior knowledge of the language.
  • TAMS: Text Analysis Markup System; for Linux and Mac OS X

Subcategory: Software: Fieldwork 

  • Audiamus: Audiamus builds a corpus of linked text and media. It is a cross-platform tool that allows presentation of textual material linked to unsegmented media files, using QuickTime to instantiate links. It was developed as a means of working interactively with field recordings and of presenting texts and example sentences as playable media with a dissertation.
  • Pacx: Platform for Annotated Corpora in XML. Integrated tool for corpus linguistics built on Eclipse, Vex, Subversive, etc. for creating and editing transcriptions and annotations, for querying, for managing version controlled data, and for building a shippable corpus.

Subcategory: Software: Historical Reconstruction 

  • LingPy: LingPy is a suite of open-source Python modules for sequence comparison, distance analyses, data operations and visualization methods in quantitative historical linguistics. The main idea of LingPy is to provide a software package which, on the one hand, integrates different methods for data analysis in quantitative historical linguistics within a single framework, and, on the other hand, serves as an interface for the preparation and analysis of linguistic data using biological software packages.

Subcategory: Software: Computer Aided Translation 

  • Enhanced Ottoman Turkish Keyboard: Enhanced Ottoman Turkish Keyboard can translate between Latin and Arabic via Internet Explorer. It can process the transcription and adjoin. You can use the program to transfer the text to word processors such as Word for further editing. Additionally, you can copy and paste Latin text into the text field of the program to get instant translation.
  • Keyboard of Modern Turkic Languages: Keyboard of Modern Turkic Languages can translate any text between Latin and Cyrillic, in either direction, via a browser such as Internet Explorer. You can use the program to transfer the text to word processors such as Word for further editing. Additionally, you can copy and paste Latin or Cyrillic text into the text field of the program to get instant translation.

 

