The Brown Corpus

In the early 1960s, two linguists created the first computer-readable text collection (or ‘corpus’) of American English: the Brown Corpus of Standard American English. Compiled at Brown University by W. Nelson Francis and Henry Kučera, the corpus consisted of one million words from works published in 1961, sampled across 15 different text categories. By creating an electronic reference resource, the two linguists had unlocked a vast potential for comparing, grouping and analysing language.

Francis and Kučera used the Brown Corpus to run a wide array of computer-based analyses, and in 1967 they published Computational Analysis of Present-Day American English, widely regarded as a landmark in linguistic research. Although the Brown Corpus is small by today’s standards, it is still in use, because later corpora such as the LOB Corpus (the Lancaster-Oslo/Bergen Corpus, its British English counterpart) copied its structure, which means the corpora can be cross-referenced directly.
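Indeed, the corpus remains easy to work with today: it is bundled, for example, with the NLTK library for Python. Below is a minimal sketch, assuming NLTK is installed, that loads the corpus and confirms the 15-category, roughly one-million-word structure described above.

```python
# A minimal sketch, assuming the NLTK library (https://www.nltk.org) is installed.
# NLTK bundles the Brown Corpus, preserving its original category structure.
import nltk

nltk.download("brown")  # one-time download of the corpus data
from nltk.corpus import brown

print(len(brown.categories()))   # -> 15 text categories (news, fiction, ...)
print(len(brown.words()))        # -> roughly one million word tokens
print(brown.words(categories="news")[:10])  # first words of the 'news' category
```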

Much corpus-linguistics research has been built on the Brown and LOB corpora, as they long remained the only computer-readable corpora readily available to researchers. By studying the same data from different angles, with different methodologies, researchers can compare their findings directly, without having to allow for variation arising from the use of different data.

The Brown Corpus ushered in the age of computational linguistics, pioneering a field in which electronic text and speech corpora now exist for a vast range of the world’s languages. It launched a true revolution in linguistics.