Morphological Paradigms: Computational Structure and Unsupervised Learning
My doctoral research is broadly concerned with morphological paradigms (e.g., jump-jumps-jumping-jumped as a morphological paradigm) in both theoretical and computational terms. On the one hand, my work addresses linguistic issues such as the nature of morphemes, allomorphy, and inflection classes. On the other, these topics are pieced together into a fully unsupervised and incremental approach to the problem of morphological paradigm induction. The major components of my dissertation are as follows.
Part 1: Stem identification
The first part studies structure within a morphological paradigm. It focuses on a central question in morphological analysis: given a paradigm, how exactly do we know what the stem is? The conventional view sees English-type languages (concatenative; e.g., the English stem jump-) and Arabic-type languages (nonconcatenative; e.g., the root k-t-b for words relating to writing) as distinct. In the computational literature, a similarly disjunctive treatment of concatenative and nonconcatenative morphology is observed. This part examines in detail the computational properties of words and strings, particularly in terms of linearity and contiguity, and proposes general, language-independent algorithms for stem identification that handle both concatenative and nonconcatenative morphology.
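One way to make this concrete is to treat the stem as material shared by all forms of a paradigm, where "shared" is defined over subsequences rather than substrings, so that discontinuous material like k-t-b is recoverable. The sketch below (my own illustration, not the dissertation's actual algorithm) folds a pairwise longest-common-subsequence computation over the forms of a paradigm; the function names `lcs` and `stem` are hypothetical:

```python
from functools import reduce

def lcs(a, b):
    """Longest common subsequence of two strings (classic dynamic programming)."""
    m, n = len(a), len(b)
    # table[i][j] holds an LCS of a[:i] and b[:j]
    table = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + a[i - 1]
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1], key=len)
    return table[m][n]

def stem(paradigm):
    """Heuristic stem: fold pairwise LCS over all forms of a paradigm."""
    return reduce(lcs, paradigm)

print(stem(["jump", "jumps", "jumping", "jumped"]))    # jump (concatenative)
print(stem(["kataba", "kutiba", "kaatib", "kitaab"]))  # ktb (nonconcatenative)
```

Because a subsequence need not be contiguous, the same procedure recovers a contiguous stem for English and a discontinuous root for the Arabic-style forms. Note that folding pairwise LCS is only a heuristic: it is not guaranteed to find the longest subsequence common to all forms at once.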
Part 2: Paradigm similarity
The second part studies structure across morphological paradigms. Cross-linguistically, it is common to observe what are known as inflection classes: groupings of lexical items (e.g., conjugation classes for verbs) that share distinct patterns of inflectional exponence. It is also common for inflection classes to be far from completely distinct from one another (e.g., Spanish present indicative verbs in the first person singular all have the suffix -o regardless of conjugation class). This part of my dissertation develops the notion of paradigm similarity and proposes an unsupervised clustering algorithm that computationally formalizes the similarities and differences across morphological paradigms, capturing allomorphy and inflection-class patterns.
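To illustrate the intuition (this is a simplified sketch of my own, not the clustering algorithm itself), one can represent each paradigm by its slot-by-slot exponents and measure how many slots two paradigms share. The Spanish present indicative data below show the partial overlap between conjugation classes; the helper names `exponents` and `similarity` are hypothetical:

```python
def exponents(paradigm, stem):
    """Strip the stem to expose each slot's exponent (here, a suffix)."""
    return tuple(form[len(stem):] for form in paradigm)

def similarity(exps_a, exps_b):
    """Fraction of paradigm slots whose exponents agree."""
    shared = sum(a == b for a, b in zip(exps_a, exps_b))
    return shared / len(exps_a)

# Spanish present indicative: 1sg, 2sg, 3sg, 1pl, 2pl, 3pl
hablar = exponents(["hablo", "hablas", "habla", "hablamos", "habláis", "hablan"], "habl")  # -ar class
comer  = exponents(["como", "comes", "come", "comemos", "coméis", "comen"], "com")         # -er class
vivir  = exponents(["vivo", "vives", "vive", "vivimos", "vivís", "viven"], "viv")          # -ir class

print(similarity(comer, vivir))   # 4/6: -er and -ir classes largely overlap
print(similarity(hablar, comer))  # 1/6: only 1sg -o is shared
```

A clustering algorithm operating over such similarity scores would group the -er and -ir classes closely while keeping all three classes connected through the shared 1sg slot.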
Part 3: Morphological paradigm induction
While the previous two parts of my dissertation study morphological paradigms when the paradigms are given, this part asks where morphological paradigms come from in the first place. Here I focus on two problems, taking Goldsmith's (2001) Linguistica morphological learner as the point of departure. First, I propose an algorithm for learning morphophonological patterns using syntactic knowledge induced from the data. Second, I pursue the idea that morphological patterns can be induced incrementally from child-directed speech data. In computational linguistics and natural language processing, a fair amount of research has focused on unsupervised learning, but most systems operate as batch learners rather than incremental ones. Since humans are not born with huge linguistic datasets loaded immediately after birth, a more realistic and cognitively plausible model of learning must be incremental in nature.
From raw text to syntax
Given only raw text, how do we learn which words are similar in terms of syntactic distribution? Machine learning methods can be used to find the closest word neighbors of each word in a raw text; e.g., the nine closest neighbors of he in the Brown corpus are she, I, they, we, you, who, it, there, I've. Word neighbors of different words are interconnected, which calls for network visualization:
This graph is for the 1,000 most frequent words in the Brown corpus. Do you see clusters of nodes for what we would call bare verb forms, singular nouns, determiners, prepositions, and so on? Knowledge analogous to word categories as induced here is employed in part of my dissertation work, e.g., solving problems in learning morphophonology.
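The core idea behind such neighbor-finding can be sketched in a few lines: represent each word by counts of the words appearing around it, and compare those context vectors with cosine similarity. This is a minimal toy illustration under my own assumptions (a one-word context window and a tiny corpus), not the actual method used to produce the graph above:

```python
from collections import Counter
from math import sqrt

def context_vectors(tokens, window=1):
    """Map each word type to counts of (offset, neighbor) contexts."""
    vecs = {}
    for i, w in enumerate(tokens):
        ctx = vecs.setdefault(w, Counter())
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                ctx[(j - i, tokens[j])] += 1  # position-sensitive context
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = lambda c: sqrt(sum(x * x for x in c.values()))
    return dot / (norm(u) * norm(v))

def nearest_neighbor(word, vecs):
    """The word whose context vector is most similar to `word`'s."""
    return max((w for w in vecs if w != word),
               key=lambda w: cosine(vecs[word], vecs[w]))

tokens = "he runs . she runs . he jumps . she jumps .".split()
vecs = context_vectors(tokens)
print(nearest_neighbor("he", vecs))  # she
```

Even on this toy corpus, he and she come out as nearest neighbors because they occur in the same positions relative to the same words, which is the distributional signal the Brown-corpus graph visualizes at scale.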
Given a morphological paradigm, how exactly do we know what the stem is? I have been working on this problem of stem identification, exploring different mathematical notions of what it means to be a "subpart" of a string: substrings, multisets, and subsequences. The goal is to develop language-independent algorithms for stem identification. Results for English and Arabic:
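The three notions of "subpart" differ in whether they require contiguity and linear order, and each can be checked directly. A small sketch of my own (the function names are hypothetical):

```python
from collections import Counter

def is_substring(part, word):
    """Contiguous and linear: 'jump' occurs intact inside 'jumping'."""
    return part in word

def is_subsequence(part, word):
    """Linear but not necessarily contiguous: 'ktb' inside 'kitaab'."""
    it = iter(word)
    return all(ch in it for ch in part)  # each char consumes the iterator

def is_submultiset(part, word):
    """Neither linear nor contiguous: only the letter counts matter."""
    return not (Counter(part) - Counter(word))

print(is_substring("jump", "jumping"))   # True: concatenative stem
print(is_substring("ktb", "kitaab"))     # False: root is discontinuous
print(is_subsequence("ktb", "kitaab"))   # True: order preserved, gaps allowed
print(is_submultiset("tbk", "kitaab"))   # True: order ignored entirely
```

Substrings suffice for English-type concatenative stems, subsequences capture Arabic-type discontinuous roots while still respecting linear order, and multisets discard order altogether; which notion is appropriate is exactly what the comparison across English and Arabic probes.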