Given a corpus of texts C, a concordance for a certain word form W (taken as a type) is the set of all occurrences (tokens) of W in C, arranged in a table such that
- each table row houses one occurrence of W,
- the first column refers to the place in C where the token occurs,
- the second column contains the context preceding the token,
- the central column contains the token of W itself,
- the last column contains the context following the token.
Derivatively, a concordance of corpus C is the set of the concordances of all word types Wi in C.
The following is a concordance of ‘of’ in the first paragraph of the section ‘Reference entries’ of the page ‘Lemmatization’.
| reference | preceding context | target | following context |
|---|---|---|---|
| lemm 2.4, 01 | Lemmatization is a decision in favor | of | one form of an expression which |
| lemm 2.4, 01 | decision in favor of one form | of | an expression which is considered its |
| lemm 2.4, 04 | destined for users with imperfect knowledge | of | the language in question |
The reference to the place of the token identifies
- the particular text in the corpus,
- the chapter or section number,
- the line or sentence number.
The context of the token in question must be limited in some sensible way. Even if, for many applications, it would seem desirable to reproduce the entire containing sentence, this would be too expensive and also unnecessary in the case of long sentences. In principle, one could reproduce some suitable construction (phrase) containing the token in question. However, that would require a human analyst, which is undesirable not only for economic reasons, but also because the concordance is supposed to serve as a theory-free analytical tool in the first place: it does not presuppose the analysis of syntactic constructions, but instead helps to carry out such analyses on an empirical basis. Therefore the context in a concordance is mostly clipped mechanically, e.g. by limiting it to a certain number of text words on either side of the target. The user can always find the full context by following up the reference.
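The mechanical clipping just described can be illustrated with a minimal Python sketch. It rests on several assumptions that are not part of the text above: the corpus is given as (reference, text) pairs, text words are found with a simple regular expression, matching is case-insensitive, and the names `Row`, `concordance` and `window` are purely illustrative.

```python
import re
from typing import Iterable, List, NamedTuple, Tuple


class Row(NamedTuple):
    """One concordance row: one occurrence (token) of the word form."""
    reference: str   # place in the corpus where the token occurs
    preceding: str   # clipped context before the token
    target: str      # the token itself
    following: str   # clipped context after the token


def concordance(units: Iterable[Tuple[str, str]], word: str,
                window: int = 6) -> List[Row]:
    """Collect all tokens of `word`, clipping the context mechanically
    to at most `window` text words on either side of the target.

    Assumption: `units` yields (reference, text) pairs, e.g. one pair
    per sentence or line of the corpus.
    """
    rows = []
    for reference, text in units:
        # Crude tokenization: a text word is a run of word characters.
        tokens = re.findall(r"\w+", text)
        for i, token in enumerate(tokens):
            if token.lower() == word.lower():
                rows.append(Row(
                    reference,
                    " ".join(tokens[max(0, i - window):i]),
                    token,
                    " ".join(tokens[i + 1:i + 1 + window]),
                ))
    return rows
```

With a window of six words, a call such as `concordance([("lemm 2.4, 01", text)], "of")` yields one row per token of ‘of’ in `text`, each with up to six words of context on either side. The window size is a free parameter; a real tool would also have to decide how to treat punctuation, hyphenation and case.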
A concordance is based on a word list of a corpus, for which see the section on lemmatization.
Concordances may be produced for many different purposes. For the lexicographer, they show the range of contextual variation of each word form in his corpus. He needs that for the following analytic steps:
- The tokens show semantic variation. Thus, the lexicographer can see the senses that the item has in his corpus, and can base decisions on homonymy vs. polysemy on the concordance.
- The tokens exhibit syntactic variation in the sense of occurring in different constructions. A concordance of a given word form may be sorted by the following context or – preferably in retrograde order – by the preceding context. Thus, the lexicographer can categorize the contexts, make a distributional analysis of the item in question and base decisions on its syntactic categorization on the concordance.
- If the concordances of all the word forms of a given lexeme are considered together, they display the range of inflectional variants documented in the corpus. Here, lemmatization and concordancing work hand in hand: On the one hand, the survey concordance of all the word forms of a given lexeme presupposes lemmatization. On the other hand, lemmatization as regards morphological variation presupposes the concordance of the whole corpus, as one can assign a morphological form to a lexeme only if one sees its syntactic context.
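The two sort orders mentioned above for syntactic analysis can be sketched as follows. This is only an illustration under stated assumptions: rows are represented by a hypothetical `Row` tuple, comparison is case-insensitive, and ‘retrograde order’ is interpreted as comparing the preceding context word by word, starting with the word adjacent to the target. The sample rows are taken from the example table above.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    reference: str
    preceding: str
    target: str
    following: str


def _words(text: str) -> List[str]:
    """Lower-cased text words, used as a case-insensitive sort key."""
    return [w.lower() for w in text.split()]


def sort_by_following(rows: List[Row]) -> List[Row]:
    """Sort rows by the following context, word by word, left to right."""
    return sorted(rows, key=lambda row: _words(row.following))


def sort_by_preceding_retrograde(rows: List[Row]) -> List[Row]:
    """Sort rows by the preceding context in retrograde order, i.e.
    starting with the word immediately before the target."""
    return sorted(rows, key=lambda row: _words(row.preceding)[::-1])


# Rows of the example concordance of 'of' shown earlier.
ROWS = [
    Row("lemm 2.4, 01", "Lemmatization is a decision in favor", "of",
        "one form of an expression which"),
    Row("lemm 2.4, 01", "decision in favor of one form", "of",
        "an expression which is considered its"),
    Row("lemm 2.4, 04", "destined for users with imperfect knowledge", "of",
        "the language in question"),
]
```

Sorting by the following context groups tokens that take the same kind of complement; sorting the preceding context in retrograde order groups tokens that depend on the same kind of head, such as ‘favor of’, ‘form of’ and ‘knowledge of’ in the sample rows.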