Collostructional analysis

Theoretical framework

Under the framework of Construction Grammar (Goldberg 1995):

grammatical constructions are pairings of forms and meanings

grammatical constructions vary in complexity and schematicity, ranging from morphemes to words, partially lexically filled idioms, and clause-level argument structure constructions

Methodological framework

Quantitative corpus linguistics

  1. Linguistic data from corpus
  2. Exhaustive retrieval of target linguistic phenomenon
  3. Strict quantification + statistical evaluation

Three methods of collostructional analysis (CA)

  • collexeme analysis (Stefanowitsch & Gries 2003): identify the interactive relation between a particular construction (C) and the lexemes occurring in a given slot of C; investigate which lexemes are strongly attracted to or repelled by a particular slot in the construction.

Contingency table for a collexeme analysis*

| | +target lexeme | -target lexeme | |
|---|---|---|---|
| +target construction | **frequency of L in C** (cell 1) | frequency of C with lexemes other than L (cell 3) | **row total (= freq of C in the corpus)** |
| -target construction | frequency of L in all other constructions in the corpus (cell 2) | frequency of all other constructions with lexemes other than L (cell 4) | row total |
| | **column total (= freq of L in the corpus)** | column total | **grand total** |

*Bold frequency counts can be obtained directly from the corpus; the remaining cells follow by subtraction.
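
Because only four of these counts come straight from the corpus, the other cells have to be derived. A minimal Python sketch of that step, assuming the cell numbering used later in these notes (cell 1 = L in C, cell 2 = L in other constructions, cell 3 = other lexemes in C, cell 4 = everything else); the function name and the counts are made up for illustration:

```python
def collexeme_table(freq_l_in_c, freq_c, freq_l, total_constructions):
    """Fill the 2x2 table from the four counts taken from the corpus."""
    cell1 = freq_l_in_c                      # L in C
    cell2 = freq_l - freq_l_in_c             # L in all other constructions
    cell3 = freq_c - freq_l_in_c             # C with lexemes other than L
    cell4 = total_constructions - cell1 - cell2 - cell3  # everything else
    if min(cell1, cell2, cell3, cell4) < 0:
        raise ValueError("counts are inconsistent with each other")
    return cell1, cell2, cell3, cell4


# Hypothetical counts: L occurs 10 times, C occurs 20 times, 8 of them
# together, in a corpus containing 100 constructions overall.
print(collexeme_table(8, 20, 10, 100))  # (8, 2, 12, 78)
```

The value passed as `total_constructions` is the hard part in practice, since it alone determines cell 4.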

  • distinctive-collexeme analysis (Gries & Stefanowitsch 2004): identify the words that best distinguish between semantically or functionally near-equivalent constructions (or syntactic alternations)

  • covarying collexeme analysis (Stefanowitsch & Gries 2005): investigate the correlation between lexemes occurring in two different slots of a particular grammatical construction; lexical items in one slot covary with those in another slot.

Contingency table for covarying collexeme analysis*

| | +Wslot2 | -Wslot2 | |
|---|---|---|---|
| +Wslot1 | **freq (+Wslot1, +Wslot2)** | freq (+Wslot1, -Wslot2) | **row total (= freq of the collexeme in slot 1)** |
| -Wslot1 | freq (-Wslot1, +Wslot2) | freq (-Wslot1, -Wslot2) | row total |
| | **column total (= freq of the collexeme in slot 2)** | column total | **grand total** |

*Bold frequency counts can be obtained directly from the corpus; the remaining cells follow by subtraction.
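
For the covarying case, the table can be built from a list of slot-filler pairs, one pair per instance of the construction. A rough Python sketch; the function name and the toy pairs are hypothetical and not taken from any corpus:

```python
def covarying_table(pairs, word1, word2):
    """2x2 counts for word1 in slot 1 and word2 in slot 2.

    `pairs` holds one (slot-1 lexeme, slot-2 lexeme) tuple per instance
    of the construction.
    """
    pairs = list(pairs)
    both = pairs.count((word1, word2))
    slot1_total = sum(1 for w1, _ in pairs if w1 == word1)
    slot2_total = sum(1 for _, w2 in pairs if w2 == word2)
    return {
        "+W1,+W2": both,
        "+W1,-W2": slot1_total - both,
        "-W1,+W2": slot2_total - both,
        "-W1,-W2": len(pairs) - slot1_total - slot2_total + both,
    }


# Made-up instances of a two-slot construction (10 in total).
example_pairs = (
    [("trick", "believe")] * 3 + [("trick", "buy")] * 1
    + [("force", "believe")] * 2 + [("force", "accept")] * 4
)
print(covarying_table(example_pairs, "trick", "believe"))
```

Here the grand total is simply the number of instances of the construction, so all four cells are well defined.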

Statistical tests:

Fisher-Yates exact test (p-values) + logarithmic transformation (commonly the negative base-10 logarithm of the p-value) --> `collostruction strength`
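
To make this step concrete, here is a minimal Python sketch using only the standard library. It assumes a one-sided test for attraction (how likely a count at least as high as cell 1 is, with all marginals fixed) and a negative base-10 log transformation; both choices, the function names, and the counts are my own assumptions, not necessarily what the original studies or scripts do:

```python
from math import comb, log10


def fisher_exact_attraction(cell1, cell2, cell3, cell4):
    """One-sided Fisher exact p-value, P(X >= cell1), marginals fixed.

    cell1 = L in C, cell2 = L elsewhere, cell3 = other lexemes in C,
    cell4 = other lexemes in other constructions.
    """
    lexeme_total = cell1 + cell2            # freq of L in the corpus
    construction_total = cell1 + cell3      # freq of C in the corpus
    grand_total = lexeme_total + cell3 + cell4
    favourable = sum(
        comb(lexeme_total, k)
        * comb(grand_total - lexeme_total, construction_total - k)
        for k in range(cell1, min(lexeme_total, construction_total) + 1)
    )
    return favourable / comb(grand_total, construction_total)


def collostruction_strength(p_value):
    """Negative base-10 log of the p-value (larger = stronger)."""
    return float("inf") if p_value == 0 else -log10(p_value)


# Hypothetical counts for one lexeme and one construction.
p = fisher_exact_attraction(8, 2, 12, 78)
print(p, collostruction_strength(p))
```

For real corpus sizes an established implementation is the safer choice; the point here is only to show where the p-value and the log transformation come from.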

Issues in collostructional analysis and their potential solutions

(Schmid & Küchenhoff 2013)


  1. Null-hypothesis testing is based on the randomness assumption, while linguistic data in a corpus are “never, ever, ever, random” (Kilgarriff 2005)
  2. Instances of a linguistic phenomenon collected from a corpus can never be ‘independent observations’ (sampling issue)
  3. P-values do NOT measure the strength of association between lexemes and constructions (i.e., collostruction strength), but rather the confidence with which the assumption that there is no attraction (i.e., the null hypothesis) can be rejected.
  4. Larger samples yield smaller p-values than smaller samples with the same internal distribution (Kilgarriff 2005)
  5. The challenge of filling cell no. 4 (as well as the other three cells): definitional challenges concerning the target lexeme (cell no. 1 + cell no. 2), the target construction (cell no. 1 + cell no. 3), and all other constructions with lexemes other than the target lexeme (cell no. 4); the frequency score in cell no. 4 has a strong effect on the Fisher exact p-values

    Two criteria for the frequency count in cell no. 4:

    • it must reflect the number of constructions in the corpus which feature the value intersection (-target lexeme, -target construction)
    • all other constructions should be somehow comparable to the target construction
  6. The directionality of association: construction -> lexeme vs. construction <- lexeme
  7. The Fisher exact test relies only on cell no. 1 and the marginals in the contingency table, while the relations (cell 1 × cell 2) and (cell 1 × cell 3) are neglected

    • The effect of the marginal conditioning on p-values is particularly strong when the score in cell 2 is very high, because this leads to a high marginal in the first column

NOTE: #1 and #2 challenge a wide range of well-established corpus-linguistic statistics, not only collostructional analysis
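
The sample-size and cell 4 points can be seen in a small experiment. The Python sketch below uses a one-sided Fisher exact test (my assumption) and invented counts: the second table has the same proportions as the first but ten times the data, and the third differs from the first only in cell 4.

```python
from math import comb


def fisher_p(cell1, cell2, cell3, cell4):
    """One-sided Fisher exact p-value, P(X >= cell1), marginals fixed."""
    lexeme_total = cell1 + cell2
    construction_total = cell1 + cell3
    grand_total = lexeme_total + cell3 + cell4
    favourable = sum(
        comb(lexeme_total, k)
        * comb(grand_total - lexeme_total, construction_total - k)
        for k in range(cell1, min(lexeme_total, construction_total) + 1)
    )
    return favourable / comb(grand_total, construction_total)


def odds_ratio(cell1, cell2, cell3, cell4):
    """Cross-product ratio, used here as a size-independent reference."""
    return (cell1 * cell4) / (cell2 * cell3)


small = (8, 2, 12, 78)                    # hypothetical counts
large = tuple(10 * n for n in small)      # same proportions, 10x the data
bigger_cell4 = (8, 2, 12, 7800)           # only cell 4 changed

for name, table in [("small", small), ("large", large),
                    ("bigger cell 4", bigger_cell4)]:
    print(name, odds_ratio(*table), fisher_p(*table))
```

The first two tables share the same odds ratio, yet the larger one gets a much smaller p-value; inflating cell 4 alone also pushes the p-value down, which is why the way cell 4 is counted matters so much.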

Alternative approaches

  1. Attraction and Reliance (Schmid 2000)

    Attraction: construction -> lexeme

    Reliance: construction <- lexeme

    • Advantages: cell 4 is not required for the calculation; straightforward descriptive measures; no assumptions (such as random distribution of corpus data) have to be made
    • Downside: negative effect of ignoring cell 4 (not frequency-adjusted): in small corpora, a rare noun that happens to occur relatively frequently in the target construction receives a very high reliance score; two measures instead of one simple, unifying rank ordering.
  2. Delta P (∆P) (Ellis & Ferreira-Junior 2009)

    ∆P Attraction: construction -> lexeme

    ∆P Reliance: construction <- lexeme

    • Advantages: takes cell 4 into account; yields effect sizes
    • Downside: the long-standing cell 4 issue remains; the Attraction and Reliance approach and the ∆P approach yield very similar results, so the less demanding measures (i.e., the former) are preferred
  3. Odds Ratio

    Odds ratio (simpler version) = (cell 1 × cell 4) / (cell 2 × cell 3)

    • Advantages: one unifying measure; frequency-adjusted + bi-directional; yield effect sizes; no randomness assumption has to be made
    • Downside: cell 4 problem remains unsolved
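
The three alternatives can be computed side by side from the same four cells. A minimal Python sketch, assuming the cell numbering used above (cell 1 = L in C, cell 2 = L elsewhere, cell 3 = other lexemes in C, cell 4 = everything else) and the usual definitions of ∆P as a difference of conditional proportions. Attraction and Reliance are returned as proportions rather than percentages; the function name and counts are invented:

```python
def association_measures(cell1, cell2, cell3, cell4):
    """Descriptive association measures for one lexeme-construction pair."""
    attraction = cell1 / (cell1 + cell3)   # share of C filled by L
    reliance = cell1 / (cell1 + cell2)     # share of L occurring in C
    return {
        "attraction": attraction,
        "reliance": reliance,
        # Proportion of L inside C minus proportion of L outside C.
        "delta_p_attraction": attraction - cell2 / (cell2 + cell4),
        # Proportion of C given L minus proportion of C given other lexemes.
        "delta_p_reliance": reliance - cell3 / (cell3 + cell4),
        # Cross-product ratio; undefined if cell 2 or cell 3 is zero.
        "odds_ratio": (cell1 * cell4) / (cell2 * cell3),
    }


# Hypothetical counts.
for name, value in association_measures(8, 2, 12, 78).items():
    print(f"{name}: {value:.3f}")
```

Only Attraction and Reliance can be computed without cell 4; both ∆P values and the odds ratio change whenever cell 4 is counted differently.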

External experimental evidence

Interesting experiments attempt to test the psychological plausibility of CA

(further reading needed)

  1. (Gries, Hampe & Schönefeld 2005)
    • sentence-completion task (active + passive sentences)
    • four classes of verb items
    • statistics: ANOVA
    • results: collostruction strength is a much better predictor of the experimental data than Attraction and Reliance
    • issues: verb classification (may introduce confounding variables); the sentence-completion task (verbs are the stimuli while the construction has to be filled in) runs counter to CA, which proceeds from an Attraction perspective (i.e., the construction is given while the lexemes are the variables)
  2. (Gries, Hampe & Schönefeld 2010)
    • extension of Gries et al. (2005)
    • improved data-collection method (reading time)
    • results: collostruction strength had only a marginally significant effect, though its effect size was high; no significant interactions; reliance was not taken into consideration
  3. (Wiechmann 2008)
    • tested a wide range of the association measures in Evert (2004)
    • did not carry out his own experiment
    • compared the predictions of the statistical measures with the results of an eye-tracking and reading-time study by Kennison (2001)
  4. (Ellis & Ferreira-Junior 2009)

    • compared the predictions of several statistical measures with the use of certain verbs in given constructions by L2 learners

    the first-learned verbs in each construction will be those which are more distinctively associated with that construction in the input (p. 202)

    when a construction cues a particular word, that word occurs very often in that construction and it tends to be very generic. When a word cues a particular construction, it may be a lower frequency word, quite specific in its […] semantics and thus very selective of that construction (p. 203)

Cognitive underpinnings

(I do not understand most of this section; further reading needed)

The more frequently a given linguistic stimulus has been processed by a speaker, the more routinized the corresponding association becomes in his or her mind. Different strengths of association can be understood as representing different degrees of entrenchment in the network. More deeply entrenched associations are reflected behaviorally in higher degrees of routinization and automatization and lower levels of cognitive effort required for processing.