
Bridging Collocational and Syntactic Analysis

Seretan (2018)


  • Collocational analysis: the analysis of words through techniques like association measures and concordancing
  • Collocational expressions: represent ‘the way words combine in a language to produce natural-sounding speech and writing’
  • History of collocational analysis: manual compilation – Gross (1984), Hornby et al. (1948), Palmer.

    looking for needles in a haystack (Choueka 1988)

  • Collocations are important: computerized lexicography, CALL, NLP

    The importance of collocations lies in their omnipresence (Mel’čuk, 2003, 26)

  • Collocation extraction: COBUILD –> other languages; Sketch Engine, Antidote

  • Appropriate statistical methods –> good collocation candidates

  • However, ‘the usefulness of pure statistical approaches in practical NLP applications is limited’

  • Syntax-based approaches to collocation extraction: accurate selection of the candidate dataset – build input dataset carefully; take into account the syntactic relationship between the candidate words

    focusing on optimizing the haystack and transforming it into a much smaller pile, containing less hay and more needles

Using syntactic info for collocation identification

Procedure of collocation extraction

  1. Linguistic preprocessing (optional): split into sentences and words, linguistic filters (discard uninteresting items, such as conjunctions), lemmatization
  2. Candidate selection: specific filters –> collocation candidate list

    Traditional     VS.    New
        ↓                   ↓
    'window method' VS. 'syntactic patterns'
  3. Candidate ranking (statistical procedure): raw frequency (+ freq threshold), MI, t-test, chi-squared test, log-likelihood ratio, etc.

    NB: No one-size-fits-all solution
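
The three steps above can be sketched in a few lines of Python. This is only an illustration of the traditional 'window method' with raw-frequency ranking, not Seretan's implementation: the stop list, the function names, and the toy text are all assumptions made for the example, and lemmatization is skipped.

```python
from collections import Counter

# Hypothetical stop list standing in for the "linguistic filters" of step 1.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on"}

def preprocess(text):
    """Step 1: naive sentence/word splitting plus a stop-word filter."""
    sentences = [s for s in text.split(".") if s.strip()]
    return [[w for w in s.lower().split() if w not in STOPWORDS]
            for s in sentences]

def window_candidates(sentences, window=2):
    """Step 2 (window method): pair each word with the next `window` words."""
    counts = Counter()
    for tokens in sentences:
        for i, left in enumerate(tokens):
            for right in tokens[i + 1 : i + 1 + window]:
                counts[(left, right)] += 1
    return counts

def rank_by_frequency(counts, min_freq=2):
    """Step 3: apply a frequency threshold, then sort by descending count."""
    kept = [(pair, n) for pair, n in counts.items() if n >= min_freq]
    return sorted(kept, key=lambda item: (-item[1], item[0]))

text = "She made a decision. He made the decision quickly. They made a quick decision."
ranked = rank_by_frequency(window_candidates(preprocess(text)))
print(ranked)
```

On this toy text only the pair (made, decision) survives the threshold. The window method finds it because the two words happen to be close together; it has no notion of the syntactic relation between them, which is the gap the 'syntactic patterns' alternative targets.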

Statistical Processing

  • Collocation candidate ranking –> lexical association measure (Daille 1994; Evert 2004; Pecina 2008)
  • Contingency table
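
As a rough illustration of how a contingency table feeds an association measure, the sketch below builds the 2×2 table of observed counts for a candidate pair and derives two common scores from it, pointwise mutual information and the log-likelihood ratio. The function names and the example counts are assumptions chosen for the example, not figures from the paper.

```python
import math

def contingency(pair_freq, w1_freq, w2_freq, total):
    """Observed 2x2 counts for a candidate pair (w1, w2).

    o11: w1 with w2      o12: w1 without w2
    o21: w2 without w1   o22: neither word
    """
    o11 = pair_freq
    o12 = w1_freq - pair_freq
    o21 = w2_freq - pair_freq
    o22 = total - w1_freq - w2_freq + pair_freq
    return o11, o12, o21, o22

def expected(table):
    """Expected counts under independence, from the row and column totals."""
    o11, o12, o21, o22 = table
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    return r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n

def pmi(table):
    """Pointwise mutual information; assumes the pair was observed (o11 > 0)."""
    return math.log2(table[0] / expected(table)[0])

def log_likelihood(table):
    """Log-likelihood ratio; cells with zero observations contribute nothing."""
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(table, expected(table)) if o > 0)

# Hypothetical counts: the pair occurs 10 times, w1 20 times, w2 30 times,
# out of 1000 candidate pairs in total.
table = contingency(10, 20, 30, 1000)
print(table, pmi(table), log_likelihood(table))
```

Both scores are zero when the observed counts match the expected ones exactly, and grow as the pair co-occurs more often than independence would predict.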

Linguistic Preprocessing and Candidate Selection

  • The quality of the candidate dataset –> the quality of a collocation extraction system
  • How-to: syntactic parsing
  • Early work: Lafon (1984) – Breidt (1993) – Smadja (1993) – Daille (1994) (hybrid approach: lemmatization + POS tagging + shallow parsing + AMs: MI and LLR) – Krenn (2000) – Pearce (2001) (‘with recent significant increases in parsing efficiency and accuracy, there is no reason why explicit parse information should not be used’) – Shimohata et al. (1997) – Villada-Moirón (2005)

Syntax-based extractors

Recent work: syntactic parser for improving the performance of collocation extraction

  • Lin (1998, 1999): English, dependency parser, exclude long sentences (>25 words)
  • Wu & Zhou (2003), Lü & Zhou (2004): English and Chinese, syntactic pairs (V-O, N-Adj, V-Adv), LLR score
  • Orliac & Dillinger (2003): English; handles active, passive, infinitive and gerundive constructions, but not relative constructions (–> misses many candidate pairs)
  • Villada-Moirón (2005): Dutch, P-N-P collocation, exclude long sentences (>20 words), partial parser
  • Seretan & Wehrli (2006), Seretan (2008, 2011), Wehrli et al. (to appear): multilingual, priority to the candidate selection, complex syntactic environments, anaphora identification
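
To make the idea of syntax-based candidate selection concrete, here is a minimal sketch that picks candidates from dependency triples instead of a word window. It assumes the parser output is already available as lemmatized (head, relation, dependent) triples; the relation labels, the pattern mapping, the `max_len` cut-off, and the hand-written parses are all hypothetical and do not reproduce any of the systems listed above.

```python
from collections import Counter

# Hypothetical mapping from dependency relations to collocation patterns.
PATTERNS = {"obj": "V-O", "amod": "N-Adj", "advmod": "V-Adv"}

def syntactic_candidates(parsed_sentences, max_len=None):
    """Count (pattern, head, dependent) candidates licensed by PATTERNS.

    Sentences longer than `max_len` tokens are skipped, mimicking the
    length cut-offs some extractors apply to keep parsing reliable.
    """
    counts = Counter()
    for sent in parsed_sentences:
        if max_len is not None and sent["length"] > max_len:
            continue
        for head, rel, dep in sent["deps"]:
            if rel in PATTERNS:
                counts[(PATTERNS[rel], head, dep)] += 1
    return counts

# Hand-written parses (assumed, lemmatized) for two sentences:
# "She made a difficult decision."
# "They quickly made the decision after a long and tiring discussion about budget."
parsed = [
    {"length": 5,
     "deps": [("make", "nsubj", "she"), ("make", "obj", "decision"),
              ("decision", "det", "a"), ("decision", "amod", "difficult")]},
    {"length": 13,
     "deps": [("make", "nsubj", "they"), ("make", "advmod", "quickly"),
              ("make", "obj", "decision"), ("discussion", "amod", "long")]},
]

for candidate, freq in syntactic_candidates(parsed).most_common():
    print(candidate, freq)
```

Only pairs in a listed syntactic relation enter the candidate set, so pairs such as (she, make) never reach the ranking step. The resulting counts can then be scored with any association measure.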

Using Collocations (and Other MWEs) for Parsing

“Feeding-back effect” (反哺效应)¹

  • Phraseological knowledge (such as collocations) improves the performance of NLP tasks and applications: POS tagging and parsing, word sense disambiguation, information extraction, information retrieval, paraphrase recognition, question answering, sentiment analysis
  • Relevant work: Brun (1998), Nivre & Nilsson (2004), Zhang & Kordoni (2006), Villavicencio et al. (2007), Korkontzelos & Manandhar (2010), PARSEME
  • 'words-with-spaces' pre-recognition approaches (complex lexical items are treated as single tokens) have been proven useful in guiding parsing attachments

    Two shortcomings:

    • Does not cover syntactically flexible items
    • question asked: 1. verb-object; 2. subject-verb
  • Deadlock: collocational knowledge ⇋ parsing?

    Solution: synergetic approach (Wehrli et al. 2010): collocation identification + parsing attachment decision – ‘hand in hand’
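
The 'words-with-spaces' idea and its difficulty with flexible items can be illustrated with a small sketch. The code below merges contiguous known expressions into single tokens before parsing, using a greedy longest match; the lexicon, the function name, and the example sentences are assumptions for illustration only, and this is the pre-recognition baseline, not the synergetic approach of Wehrli et al. (2010).

```python
def merge_mwes(tokens, lexicon):
    """Replace contiguous lexicon entries with one token (greedy longest match)."""
    max_n = max((len(entry) for entry in lexicon), default=1)
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_n, 1, -1):
            chunk = tuple(tokens[i : i + n])
            if chunk in lexicon:
                out.append("_".join(chunk))
                i += len(chunk)
                break
        else:
            # No multiword entry starts here: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out

# Hypothetical lexicon of lemmatized multiword expressions.
lexicon = {("by", "and", "large"), ("make", "decision")}

fixed = merge_mwes(["by", "and", "large", "they", "make", "decision"], lexicon)
flexible = merge_mwes(["they", "make", "a", "difficult", "decision"], lexicon)
print(fixed)
print(flexible)
```

The first sentence is merged as intended, but in the second the intervening words break the contiguous match, so the verb-object pair goes unrecognized. Recognizing it requires knowing the syntactic relation between the two words, which is what the parser is supposed to produce in the first place.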


  • Despite the development of fast and robust parsers for an increasing number of languages, collocation extraction work remains mostly focused on improving candidate ranking methods, instead of candidate selection methods – ‘garbage in, garbage out’ principle
  • Limitations of ‘words-with-spaces’ pre-recognition approaches on parsing decisions

These [recent work] are bricks laid at the end of the bridge that aims to fill the gap between the two sides. Even though the research community has made particular efforts to unite the two ends, the bridge is not yet complete.

  1. The term is used based on my own understanding.