Corpora Resources

Online Corpus


PolyU Language Bank	Over 36 million words of multilingual, multi-genre corpora	free
RCPCE Profession-specific Corpora	A large collection of texts used in different professions in Hong Kong	free
A Query to Internet Corpora (Leeds U)	Updated general-purpose online corpora with different languages
British National Corpus (1980-1993)	A standard English corpus often used as a reference corpus.
British Academic Written English corpus (BAWE)	A 6-million-word collection of student essays in different disciplines
Business Letter Corpus	A corpus with different English letters
BYU Corpora	A collection of mega-corpora, including BNC and NOW (News on the Web, from 2010 to yesterday)
The Corpus of Contemporary American English (COCA, 1990-present)	Representative of modern American English
Time Magazine (1923-2006)	A corpus for diachronic language study	free
GloWbE (Global Web-Based English)	1.9 billion words of English used in 20 countries	free
MICASE	Transcripts of a wide range of spoken academic texts from the University of Michigan.	free
The Oxford Text Archive	The Archive develops, collects, catalogues and preserves a variety of electronic literary and linguistic resources	free
WebCorp	Allows corpus-type searches of documents in English on the Internet.	free
CQP Web for Language Corpora	A collection of corpora created by the Language and Multimodal Analysis Lab (LAMAL), Department of English, The Hong Kong Polytechnic University	free
Fashion Communication Corpus (FCC)	A 1-million-word collection of texts obtained from fashion magazines, literature, journals, websites, etc.	free
Enron email corpus	Enron email data sets compiled at UC Berkeley	free
Corpora maintained by Geoffrey Sampson	A collection of different texts

Parallel Corpus


Bilingual Parallel Corpora of Chinese Classics	Parallel texts of Chinese classic novels and government documents
English-Chinese parallel concordancer	A collection of novels, fables and essays	free

Text Archive


Project Gutenberg	The pioneering project designed to make out-of-copyright texts available electronically	free
Internet Archive	The Internet Archive Text Archive contains a wide range of fiction, popular books, children’s books, historical texts and academic books.	free
Internet Archive: Wayback Machine	The Wayback Machine is a digital archive of the World Wide Web and other information on the Internet. You can check the Wayback Machine for archives of a website.	free

Word Cloud


Voyant Tools	Creates word clouds based on word frequency	free
Wordle	Wordle is a tool for generating “word clouds” from text that you provide.	free
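
Word clouds of this kind are driven by a word frequency list. The following is a minimal sketch of that counting step, not a description of how Voyant Tools or Wordle are implemented. The function name word_frequencies, the tokenising pattern and the stopword list are assumptions made for illustration.

```python
import re
from collections import Counter


def word_frequencies(text, stopwords=frozenset()):
    """Count lower-cased word tokens, skipping any listed stopwords.

    Hypothetical helper: the regex is a crude tokeniser that keeps runs
    of letters and apostrophes only.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)


sample = "The corpus is large. The corpus is free."
freqs = word_frequencies(sample, stopwords={"the", "is"})

# The most frequent words would be drawn largest in a word cloud.
for word, count in freqs.most_common():
    print(word, count)
```

A real tool would normally use a fuller stopword list and a tokeniser suited to the language of the texts.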

Corpus Tools


AntConc	A freeware concordance program for Windows. Please visit Laurence Anthony’s Website for the complete list of software.	free
AntCorGen	A freeware discipline-specific corpus creation tool.	free
ConcGram 1.0	ConcGram 1.0 is a corpus linguistics software package which is specifically designed to find all the co-occurrences of words in a text or corpus irrespective of variation.
ConcGramCore	ConcGramCore is an open-source corpus linguistics software package for corpus linguists to find all the co-occurrences of words in a text or corpus irrespective of variation. The software is in continuous development.	free
ParaConc	A bilingual or multilingual concordancer that can be used in contrastive analyses and translation studies	free trial
WordSmith Tools	Concordancing, word lists, key words
Leximancer	Lexical analysis	free trial
WMatrix	In addition to frequency lists and concordances, WMatrix extends the keywords method to key grammatical categories and key semantic domains.	free trial
Sketch Engine	Sketch Engine can provide a one-page summary of the word’s grammatical and collocational behavior, showing the word’s collocates categorised by grammatical relations.
ATLAS.ti (7)	For qualitative data analysis and discourse analysis	free trial
NVivo (10)	For qualitative data analysis and discourse analysis
kfNgram	kfNgram makes n-gram indices of any text(s) you give it, similar to WordSmith Tools’ Cluster function.	free
The IMS Open Corpus Workbench		free
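
Several of the tools above build n-gram (word cluster) lists. As a rough illustration of the idea only, and not the algorithm used by kfNgram or WordSmith Tools, the sketch below counts contiguous n-grams in a list of tokens. The function name ngram_counts and the whitespace tokenisation are assumptions.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens (hypothetical helper)."""
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


# Whitespace splitting stands in for a proper tokeniser here.
tokens = "to be or not to be".split()
bigrams = ngram_counts(tokens, 2)

print(bigrams[("to", "be")])  # 2
print(bigrams[("be", "or")])  # 1
```

If the text is shorter than n tokens, the function simply returns an empty count.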

Lexical Analysers


The Ultimate Research Assistant	Lexical semantic thematic analysis of web documents	free

Taggers


CLAWS	Word class (part-of-speech) tagger	free
Stanford Log-linear Part Of Speech tagger	Different software for POS tagging	free
Stanford CoreNLP online engine	Online interface to the Stanford CoreNLP software.	free
GUM	The Georgetown University Multilayer Corpus	free

Phonetic Analysis


Praat	Praat (the Dutch word for “talk”) is a free scientific software program for the analysis of speech in phonetics.	free
EMU (The Emu Speech Database System)	EMU is a collection of software tools for the creation, manipulation and analysis of speech databases.	free
WaveSurfer	WaveSurfer is an Open Source tool for sound visualization and manipulation.	free
SpeechAnalyzer	Speech Analyzer is a computer program for acoustic analysis of speech sounds.	free

Development Workbench


KPML	Workbench for developing grammatical descriptions and defining computational grammars	free
TermBase	Database for developing and storing terminologies	free

Descriptive Resources


WordNet	A lexical database organizing nouns, verbs, adjectives and adverbs into synonym sets, each representing one underlying lexical concept.	free
FrameNet	A lexical database containing around 1,200 semantic frames, 13,000 lexical units and over 190,000 example sentences.	free

Statistical Tools


SPSS	A widely used advanced statistical and analytics tool.
R Project	A free package for statistical computing and graphics	free
GNU PSPP	A free program for statistical analysis. It is a free (as in freedom) replacement for the proprietary program SPSS, and appears very similar to it with a few exceptions.	free
Sample Size Calculator	An online calculator to find out the sample size based on the set confidence level and confidence interval. Useful for quantitative research sampling.	free
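
For readers who want to see what such a calculation involves, the sketch below uses the common formula for estimating a proportion, n0 = z² · p(1 − p) / e², with an optional finite population correction. It is an assumption that the linked calculator works this way; the function name sample_size and its parameters are hypothetical, and "confidence interval" is taken here to mean the margin of error e.

```python
import math
from statistics import NormalDist


def sample_size(confidence=0.95, margin=0.05, proportion=0.5, population=None):
    """Estimate the sample size needed to estimate a proportion.

    confidence: confidence level, e.g. 0.95
    margin:     margin of error as a fraction, e.g. 0.05 for +/-5%
    proportion: expected proportion; 0.5 is the most conservative choice
    population: population size for the finite population correction,
                or None for a very large population
    """
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = (z ** 2) * proportion * (1 - proportion) / (margin ** 2)
    if population is not None:
        # Finite population correction.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


print(sample_size())                 # 385
print(sample_size(population=1000))  # 278
```

Rounding up rather than to the nearest integer keeps the achieved margin of error within the requested value.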
