Corpora Resources

Online Corpus

PolyU Language Bank Over 36 million words of multilingual, multi-genre corpora free
RCPCE Profession-specific Corpora A large collection of texts used in different professions in Hong Kong free
A Query to Internet Corpora (Leeds U) Updated general-purpose online corpora in different languages
British National Corpus (1980-1993) A standard English corpus often used as a reference corpus.
British Academic Written English corpus (BAWE) A 6-million-word collection of student essays in different disciplines
Business Letter Corpus A corpus of English business letters
BYU Corpora A collection of mega-corpora, including BNC and NOW (News on the Web; texts from 2010 to yesterday)
The Corpus of Contemporary American English (COCA, 1990–present) Representative of modern American English
Time Magazine (1923-2006) A corpus for diachronic language study free
GloWbE (Global Web-Based English) 1.9 billion words of English used in 20 countries free
MICASE Transcripts of a wide range of spoken academic texts from the University of Michigan. free
The Oxford Text Archive The Archive develops, collects, catalogues and preserves a variety of electronic literary and linguistic resources free
WebCorp Allows corpus-type searches of documents in English on the Internet. free
CQP Web for Language Corpora A collection of corpora created by the Language and Multimodal Analysis Lab (LAMAL), Department of English, The Hong Kong Polytechnic University free
Fashion Communication Corpus (FCC) A 1-million-word collection of texts from fashion magazines, literature, journals, websites, etc. free
Enron email corpus Enron email data sets compiled at UC Berkeley free
Corpora maintained by Geoffrey Sampson A collection of different texts


Parallel Corpus

Bilingual Parallel Corpora of Chinese Classics Parallel texts of Chinese classic novels and government documents
English-Chinese parallel concordancer A collection of novels, fables and essays free


Text Archive

The Gutenberg Project The pioneering project designed to make out-of-copyright texts available electronically free
Internet Archive The Internet Archive Text Archive contains a wide range of fiction, popular books, children’s books, historical texts and academic books. free
Internet Archive: Wayback Machine The Wayback Machine is a digital archive of the World Wide Web and other information on the Internet. You can check the Wayback Machine for archives of a website. free


Word Cloud

Voyant Tools Creates word clouds based on word frequency free
Wordle Wordle is a tool for generating “word clouds” from text that you provide. free
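Both tools start from raw word-frequency counts over the text you supply. A minimal sketch of that counting step (the tokenisation rule here is an illustrative assumption, not what either tool actually does):

```python
from collections import Counter
import re

def word_frequencies(text, top_n=10):
    """Count word frequencies the way word-cloud tools do as a first step:
    lowercase the text, extract word tokens, and tally occurrences."""
    words = re.findall(r"[a-z']+", text.lower())  # crude tokeniser (assumption)
    return Counter(words).most_common(top_n)

sample = "Corpus tools count words; corpus tools rank words by frequency."
print(word_frequencies(sample, top_n=3))
```

The word cloud then simply scales each word's display size by its count.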


Corpus Tools

AntConc A freeware concordance program for Windows. Please visit Laurence Anthony’s Website for the complete list of software. free
AntCorGen A freeware discipline-specific corpus creation tool. free
ConcGram 1.0 ConcGram 1.0 is a corpus linguistics software package which is specifically designed to find all the co-occurrences of words in a text or corpus irrespective of variation.
ConcGramCore ConcGramCore is an open-source corpus linguistics software package that finds all the co-occurrences of words in a text or corpus irrespective of variation. The software is in continuous development. free
ParaConc A bilingual or multilingual concordancer that can be used in contrastive analyses and translation studies free trial
WordSmith Tools Concordancing, word lists, key words
Leximancer Lexical analysis free trial
WMatrix In addition to frequency lists and concordances, WMatrix extends the keywords method to key grammatical categories and key semantic domains. free trial
Sketch Engine Sketch Engine can provide a one-page summary of the word’s grammatical and collocational behavior, showing the word’s collocates categorised by grammatical relations.
ATLAS.ti (7) For qualitative data analysis and discourse analysis free trial
NVivo (10) For qualitative data analysis and discourse analysis
kfNgram kfNgram makes n-gram indices of any text(s) you give it, similar to WordSmith Tools' Cluster function. free
The IMS Open Corpus Workbench A collection of open-source tools for managing and querying large text corpora free
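The concordancers listed above (AntConc, ParaConc, WordSmith Tools) all centre on Key Word In Context (KWIC) displays: every hit for a search term shown with its surrounding words aligned. A minimal sketch of the idea, with hypothetical function and parameter names, assuming whitespace-tokenised input:

```python
def kwic(tokens, keyword, context=4):
    """Key Word In Context: for each occurrence of keyword, return a line
    showing the surrounding tokens, roughly as a concordancer displays hits."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            lines.append(f"{left:>30} | {tok} | {right}")
    return lines

text = "the corpus shows that the word corpus occurs twice in this corpus".split()
for line in kwic(text, "corpus", context=3):
    print(line)
```

Real concordancers add regex search, sorting on left/right context, and collocation statistics on top of this basic display.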


Lexical Analysers

The Ultimate Research Assistant Lexical semantic thematic analysis of web documents free



CLAWS Word class (part-of-speech) tagger free
Stanford Log-linear Part-of-Speech Tagger Software for POS tagging free
Stanford CoreNLP online engine Online interface of the Stanford CoreNLP software.
GUM The Georgetown University Multilayer Corpus free


Phonetic Analysis

Praat Praat (the Dutch word for “talk”) is a free scientific software program for the analysis of speech in phonetics. free
EMU (The Emu Speech Database System) EMU is a collection of software tools for the creation, manipulation and analysis of speech databases. free
WaveSurfer WaveSurfer is an Open Source tool for sound visualization and manipulation. free
SpeechAnalyzer Speech Analyzer is a computer program for acoustic analysis of speech sounds. free


Development Workbench

KPML Workbench for developing grammatical descriptions and defining computational grammars free
TermBase Database for developing and storing terminologies free


Descriptive Resources

WordNet A lexical database organizing nouns, verbs, adjectives and adverbs into synonym sets, each representing one underlying lexical concept. free
FrameNet A lexical database containing around 1,200 semantic frames, 13,000 lexical units and over 190,000 example sentences. free


Statistical Tools

SPSS A widely used suite of advanced statistical and analytic tools.
R Project A free package for statistical computing and graphics free
GNU PSPP A free program for statistical analysis. It is a free-as-in-freedom replacement for the proprietary program SPSS, and appears very similar to it with a few exceptions. free
Sample Size Calculator An online calculator to find out the sample size based on the set confidence level and confidence interval. Useful for quantitative research sampling. free
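Such calculators typically implement Cochran's formula for an infinite population, n = z^2 * p * (1 - p) / e^2, where z is the z-score for the chosen confidence level, p the expected proportion (0.5 when unknown), and e the margin of error. A sketch in Python (the z-score lookup table and function name are illustrative assumptions):

```python
import math

# z-scores for common confidence levels (hypothetical lookup for illustration)
Z = {90: 1.645, 95: 1.96, 99: 2.576}

def sample_size(confidence_pct, margin_of_error, p=0.5):
    """Cochran's sample-size formula for an infinite population:
    n = z^2 * p * (1 - p) / e^2, rounded up to a whole respondent."""
    z = Z[confidence_pct]
    return math.ceil(z * z * p * (1 - p) / margin_of_error ** 2)

# 95% confidence with a ±5% margin gives the familiar n = 385
print(sample_size(95, 0.05))
```

Using p = 0.5 is the conservative default, since it maximises p(1 - p) and therefore the required sample size.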


Define success on your own terms, achieve it by your own rules, and build a life you’re proud to live. (Anne Sweeney)

The Effective Teacher

Impressing others by complicating simple things is easy, but simplifying complicated things to make others understand is more challenging. It requires a higher level of comprehension, empathy, and the ability to deliver meanings/messages clearly.



Writing a Scientific Manuscript

Scientific manuscript writing requires writers (researchers) to use various genres (cognitive genres – Bruce, 2008). In this post, I arrange the sample practices provided by UEfAP based on the basic structure of a scientific article. The arrangement will help you understand how particular genres are used in article sections, and allow you to compare genres employed in different sections (by doing the practices). You can start with the introduction [1], and then finish the other sections.

[1] Writing an Introduction

[2] Writing the Method Section

[3] Writing the Results Section

[4] Writing the Discussion 

[5] Writing the Conclusions

[6] Writing an Abstract

Why are scientific manuscripts rejected?

Summarizing various reasons for rejection of scientific manuscripts, Lucey (2015) proposed 12 dominant factors:


  1. Clarity: the paper needs to tell us what it is doing. If there are a host of good ideas all crowding each other out, then within the confines of the space available in a modern journal article this is going to present a problem. Without getting into salami slicing, where a host of papers are created from one base, each differing only minutely from the others, the rubric of "One for One", one major idea per paper, is one to live by. That way you can present a tightly argued, clear, organised paper. The other ideas go in other papers. Ask yourself: is this tightly coherent?
  2. Fit: it still astonishes me how often I see papers that are simply not within the aims and scope of the journal. How hard can it be to check whether similar papers have been published in the last few years, to read the journal homepage, or perhaps even to email the editor or an associate editor? Again, this is not to say that journals shouldn't, and perhaps even have a responsibility to, go outside the box a little, but sending a theory paper to an empirical journal, or a paper on international trade to one focusing on corporate finance, suggests sloppy preparation and a lack of clarity. Check if it fits.
  3. Contribution: I mentioned salami slicing. In empirical papers this most often appears where one or two variables or approaches are changed and a new paper produced: one paper uses one methodology and another a similar one, with essentially the same explanatory variable set. Really this is down to referees and editors, where the dreaded "robustness checks" are required to be shown. Let the reader feel they learned something.
  4. Triviality: some things, if not known (can anything be known, really, in social science?), are well accepted. A paper that demonstrates already well-attested findings in another setting is hard to publish. In my area this usually manifests itself as a paper that takes a concept or finding from developed or increasingly emerging markets, applies it to a frontier market, and finds the same results. Salami slicing works this way also. Give the reader a solid reason for reading the paper.
  5. Coherence: some papers are a mess. There is a good reason for (and again, this is in my area) a conventional layout: introduction, previous literature, data and methodology, findings, robustness checks, conclusion and recommendations. It aids the writer and, more importantly, the reader in understanding the flow of the paper. Too many or too few sections, lack of integration across them, a sense of multiple voices rather than one: all these make a paper hard to read and hard to understand. Remember, this is a discourse, a communication. Make the paper clear.
  6. Completeness: some papers are simply not complete. For most publishers now there is a technical screening before the paper hits the editor: are the manuscript, tables, figures, data, etc. in the submission? Has it passed the plagiarism screening? Is it legible? Sometimes people simply forget to include material. It is uncommon but not unknown to see papers that contain <to be added – Jim> or something similar. If it's not complete, it's not going anywhere. Complete the paper.
  7. Legibility: at times I feel like channelling Samuel L. Jackson, discussing linguistics with Brett, in Pulp Fiction. English is overwhelmingly the language of academic publishing. If the language is poorly structured and riddled with syntactical and lexical errors, the paper is going to be rejected. Get it proofread, even if you are a native English speaker.
  8. Correctness: a paper needs, especially if it is going to challenge established wisdom, to be very well constructed and to leave the reader feeling that yes, there is a solid challenge. If the paper misses a whole pile of literature, has bad statistics, draws overambitious conclusions from fuzzy data, or is in general riddled with poor science, then it's going to go down. Alternative perspectives are great, but being wrong is an alternative to being right. Check your science.
  9. Strength: this is often the case when papers are from junior researchers or are driving forward a new area. At the end we want to know: so now what do we do, or where do we go? If the paper can't tell us that, perhaps because of some of the other issues noted here, or because it took too long or too rambling a route to get to the point, then it is not going to prosper. Make it strong but grounded.
  10. Replicability: data integrity and replicability are becoming key concerns of journal editors. Some have adopted a policy of having data and commands deposited with the paper. In general, however, the paper should be complete in its descriptions so that someone with the same or similar data can reproduce the flow. Explain what data, where sourced, what cleaning, etc.; outline the nature of the theoretical steps; explain the experiments. Many of these explanations, which can be quite long, can now be placed as supplemental appendices, and should be. That way the paper itself can be short and pointed, and the interested replicator can go to the appendices for detail. If there is a sense that the work cannot be replicated, then the paper is incomplete and poorly written, and will crash. Make it reproducible.
  11. Courtesy: the academy is quite small once you get into paper writing and reviewing. I have had occasion to reject a paper from a journal knowing, as I had been the reviewer just two weeks before, that the authors had not made any effort to address my previous concerns. That doesn't mean agreeing with them; it does mean addressing them. Sending a literally identical, sequentially rejected paper to multiple journals will get you a bad reputation, and you WILL meet, as editors or other gatekeepers, people whose views you have blown off. Address the concerns.
  12. Bad luck: ideas and topics go into and out of vogue. It is not uncommon to see two or more similar papers addressing similar areas being submitted. In that case there is an element of luck. Generally I will try to track back, via working-paper dates, and see who has some claim to priority. This, by the way, is another reason why working papers and conference presentations are useful: they show intellectual priority. At any rate, Solomonic judgements are sometimes required. Be swift, but sure, I suggest.


