Jackson L. Lee | Computational Linguist

I'm a computational linguist. I'm broadly interested in understanding how natural language is or can be learned, both by humans and machines. While my day job is to lead a team of software and data engineers, after work I morph into my underlying form of a linguist. Topics I've worked on include inflection classes, morphological segmentation, word segmentation, grapheme-to-phoneme conversion, truncation, tone, and reduplication. I hold a PhD in Linguistics from the University of Chicago, where I worked with John Goldsmith on computational morphology and phonology.

Publications

2024: Multi-Tiered Cantonese Word Segmentation. Charles Lam, Chaak-ming Lau, and Jackson L. Lee. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).
@inproceedings{lam-etal-2024-multi-tiered, title = "Multi-Tiered {C}antonese Word Segmentation", author = "Lam, Charles and Lau, Chaak-ming and Lee, Jackson L.", editor = "Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen", booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)", month = may, year = "2024", address = "Torino, Italy", publisher = "ELRA and ICCL", url = "https://aclanthology.org/2024.lrec-main.1047", pages = "11993--12002", abstract = "Word segmentation for Chinese text data is essential for compiling corpora and any other tasks where the notion of {``}word{''} is assumed, since Chinese orthography does not have conventional word boundaries as languages such as English do. A perennial issue, however, is that there is no consensus about the definition of {``}word{''} in Chinese, which makes word segmentation challenging. Recent work in Chinese word segmentation has begun to embrace the idea of multiple word segmentation possibilities. In a similar spirit, this paper focuses on Cantonese, another major Chinese variety. We propose a linguistically motivated, multi-tiered word segmentation system for Cantonese, and release a Cantonese corpus of 150,000 characters word-segmented by this proposal. Our work will be of interest to researchers whose work involves Cantonese corpus data.", }

Word segmentation for Chinese text data is essential for compiling corpora and any other tasks where the notion of "word" is assumed, since Chinese orthography does not have conventional word boundaries as languages such as English do. A perennial issue, however, is that there is no consensus about the definition of "word" in Chinese, which makes word segmentation challenging. Recent work in Chinese word segmentation has begun to embrace the idea of multiple word segmentation possibilities. In a similar spirit, this paper focuses on Cantonese, another major Chinese variety. We propose a linguistically motivated, multi-tiered word segmentation system for Cantonese, and release a Cantonese corpus of 150,000 characters word-segmented by this proposal. Our work will be of interest to researchers whose work involves Cantonese corpus data.
2023: The SIGMORPHON 2022 Shared Task on Cross-lingual and Low-Resource Grapheme-to-Phoneme Conversion. Arya D. McCarthy, Jackson L. Lee, Alexandra DeLucia, Travis Bartley, Milind Agarwal, Lucas F.E. Ashby, Luca Del Signore, Cameron Gibson, Reuben Raff, Winston Wu. Proceedings of the 20th SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology.
@inproceedings{mccarthy-etal-2023-sigmorphon, title = "The {SIGMORPHON} 2022 Shared Task on Cross-lingual and Low-Resource Grapheme-to-Phoneme Conversion", author = "McCarthy, Arya D. and Lee, Jackson L. and DeLucia, Alexandra and Bartley, Travis and Agarwal, Milind and Ashby, Lucas F.E. and Del Signore, Luca and Gibson, Cameron and Raff, Reuben and Wu, Winston", editor = {Nicolai, Garrett and Chodroff, Eleanor and Mailhot, Frederic and {\c{C}}{\"o}ltekin, {\c{C}}a{\u{g}}r{\i}}, booktitle = "Proceedings of the 20th SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology", month = jul, year = "2023", address = "Toronto, Canada", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2023.sigmorphon-1.27", doi = "10.18653/v1/2023.sigmorphon-1.27", pages = "230--238", abstract = "Grapheme-to-phoneme conversion is an important component in many speech technologies, but until recently there were no multilingual benchmarks for this task. The third iteration of the SIGMORPHON shared task on multilingual grapheme-to-phoneme conversion features many improvements from the previous year{'}s task (Ashby et al., 2021), including additional languages, three subtasks varying the amount of available resources, extensive quality assurance procedures, and automated error analyses. Three teams submitted a total of fifteen systems, at best achieving relative reductions of word error rate of 14{\%} in the crosslingual subtask and 14{\%} in the very-low resource subtask. The generally consistent result is that cross-lingual transfer substantially helps grapheme-to-phoneme modeling, but not to the same degree as in-language examples.", }

Grapheme-to-phoneme conversion is an important component in many speech technologies, but until recently there were no multilingual benchmarks for this task. The third iteration of the SIGMORPHON shared task on multilingual grapheme-to-phoneme conversion features many improvements from the previous year’s task (Ashby et al., 2021), including additional languages, three subtasks varying the amount of available resources, extensive quality assurance procedures, and automated error analyses. Three teams submitted a total of fifteen systems, at best achieving relative reductions of word error rate of 14% in the crosslingual subtask and 14% in the very-low resource subtask. The generally consistent result is that cross-lingual transfer substantially helps grapheme-to-phoneme modeling, but not to the same degree as in-language examples.
2022: PyCantonese: Cantonese Linguistics and NLP in Python. Jackson L. Lee, Litong Chen, Charles Lam, Chaak Ming Lau, Tsz-Him Tsui. Proceedings of the 13th Language Resources and Evaluation Conference.
@inproceedings{lee-etal-2022-pycantonese, title = "{P}y{C}antonese: {C}antonese Linguistics and {NLP} in {P}ython", author = "Lee, Jackson L. and Chen, Litong and Lam, Charles and Lau, Chaak Ming and Tsui, Tsz-Him", booktitle = "Proceedings of The 13th Language Resources and Evaluation Conference", month = jun, year = "2022", publisher = "European Language Resources Association", abstract = "This paper introduces PyCantonese, an open-source Python library for Cantonese linguistics and natural language processing. After the library design, implementation, corpus data format, and key datasets included are introduced, the paper provides an overview of the currently implemented functionality: stop words, handling Jyutping romanization, word segmentation, part-of-speech tagging, and parsing Cantonese text.", language = "English", }

This paper introduces PyCantonese, an open-source Python library for Cantonese linguistics and natural language processing. After the library design, implementation, corpus data format, and key datasets included are introduced, the paper provides an overview of the currently implemented functionality: stop words, handling Jyutping romanization, word segmentation, part-of-speech tagging, and parsing Cantonese text.
2020: Massively Multilingual Pronunciation Modeling with WikiPron. Jackson L. Lee, Lucas F. E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, Kyle Gorman. Proceedings of the 12th Language Resources and Evaluation Conference.
@inproceedings{lee-etal-2020-massively, title = "Massively Multilingual Pronunciation Modeling with {W}iki{P}ron", author = "Lee, Jackson L. and Ashby, Lucas F.E. and Garza, M. Elizabeth and Lee-Sikka, Yeonju and Miller, Sean and Wong, Alan and McCarthy, Arya D. and Gorman, Kyle", booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference", month = may, year = "2020", address = "Marseille, France", publisher = "European Language Resources Association", url = "https://www.aclweb.org/anthology/2020.lrec-1.521", pages = "4223--4228", abstract = "We introduce WikiPron, an open-source command-line tool for extracting pronunciation data from Wiktionary, a collaborative multilingual online dictionary. We first describe the design and use of WikiPron. We then discuss the challenges faced scaling this tool to create an automatically-generated database of 1.7 million pronunciations from 165 languages. Finally, we validate the pronunciation database by using it to train and evaluate a collection of generic grapheme-to-phoneme models. The software, pronunciation data, and models are all made available under permissive open-source licenses.", language = "English", ISBN = "979-10-95546-34-4", }

We introduce WikiPron, an open-source command-line tool for extracting pronunciation data from Wiktionary, a collaborative multilingual online dictionary. We first describe the design and use of WikiPron. We then discuss the challenges faced scaling this tool to create an automatically-generated database of 1.7 million pronunciations from 165 languages. Finally, we validate the pronunciation database by using it to train and evaluate a collection of generic grapheme-to-phoneme models. The software, pronunciation data, and models are all made available under permissive open-source licenses.
2018: Shaping Phonology. Edited by Diane Brentari and Jackson L. Lee. University of Chicago Press. (This volume is in honor of John Goldsmith.)
@book{brentari.lee.shaping.phonology, title = {Shaping {P}honology}, editor = {Diane Brentari and Jackson L. Lee}, year = {{2018}}, publisher = {The University of Chicago Press}, isbn = {9780226562452}, }

Within the past forty years, the field of phonology -- a branch of linguistics that explores both the sound structures of spoken language and the analogous phonemes of sign language, as well as how these features of language are used to convey meaning -- has undergone several important shifts in theory that are now part of standard practice. Drawing together contributors from a diverse array of subfields within the discipline, and honoring the pioneering work of linguist John Goldsmith, this book reflects on these shifting dynamics and their implications for future phonological work. Divided into two parts, Shaping Phonology first explores the elaboration of abstract domains (or units of analysis) that fall under the purview of phonology. These chapters reveal the increasing multidimensionality of phonological representation through such analytical approaches as autosegmental phonology and feature geometry. The second part looks at how the advent of machine learning and computational technologies has allowed for the analysis of larger and larger phonological data sets, prompting a shift from using key examples to demonstrate that a particular generalization is universal to striving for statistical generalizations across large corpora of relevant data. Now fundamental components of the phonologist’s tool kit, these two shifts have inspired a rethinking of just what it means to do linguistics.
2018: On the discovery procedure. Jackson L. Lee. Shaping Phonology. Edited by Diane Brentari and Jackson L. Lee.
2018: Mincing words: balancing recovery and deletion in word truncation. Mike Pham and Jackson L. Lee. Glossa.
@article{pham.lee.truncation.glossa, title = {Mincing words: balancing recovery and deletion in word truncation}, author = {Mike Pham and Jackson L. Lee}, journal = {Glossa}, volume = {3}, number = {1}, pages = {36}, year = {{2018}}, }

Brazilian Portuguese exhibits word truncation: e.g., the word cruzeiro 'cruise' results in the truncated form cruza, where the vowel -a is added to the truncated stem cruz. Gonçalves (2011) claims that truncated words preserve the onset of the rightmost syllable of the first binary foot. We argue from a corpus-based perspective instead that the truncated stem is better predicted by optimizing two opposing forces: original word recovery and phonological deletion. These are formalized and implemented as right-complete counts (RC) and left-complete counts (LC), based primarily on the analysis of blends and subtractive word formation in Gries (2006) and taking into consideration the informativity of the deleted material as well as the preserved material. Specifically, a model incorporating both RC and LC outperforms one that uses only one or the other, as well as prosodic models based on binary feet, in predicting truncated stems in Brazilian Portuguese. Beyond truncation, our model has implications for morpheme segmentation as well as the mechanics of morphological reanalysis.
2017: Computational learning of morphology. John A. Goldsmith, Jackson L. Lee, and Aris Xanthos. Annual Review of Linguistics 3, 85-106.
@article{goldsmith.lee.xanthos.overview, title = {Computational learning of morphology}, author = {John A. Goldsmith and Jackson L. Lee and Aris Xanthos}, journal = {Annual Review of Linguistics}, volume = {3}, pages = {85--106}, year = {{2017}}, }

This paper reviews work on the unsupervised learning of morphology, that is, the induction of morphological knowledge with no prior knowledge of the language beyond the training texts. This is an area of considerable activity over the period from the mid-1990s, continuing to the present moment. It is of particular interest to linguists because it provides a good example of a domain in which complex structures must be induced by the language learner, and successes in this area have all relied on quantitative models that in various ways focus on model complexity and on goodness of fit to the data.
2016: Linguistica 5: Unsupervised Learning of Linguistic Structure. Jackson L. Lee and John A. Goldsmith. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics, San Diego, California, June 2016. Association for Computational Linguistics.
@inproceedings{lee-goldsmith:2016:naacl, title = {{L}inguistica 5: {U}nsupervised {L}earning of {L}inguistic {S}tructure}, author = {Lee, Jackson L. and Goldsmith, John A.}, booktitle = {Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics}, year = {2016}, address = {San Diego, California}, month = {June}, publisher = {Association for Computational Linguistics}, }

This paper introduces Linguistica 5, a software package for unsupervised learning of linguistic structure. It is a descendant of Goldsmith's (2001, 2006) Linguistica. Open-source and written in Python, the new Linguistica 5 is both a graphical user interface application and a Python library. While Linguistica 5 inherits its predecessors' strength in unsupervised learning of natural language morphology, it incorporates significant improvements in multiple ways. Notable new features include tools for data visualization as well as straightforward extensions for both its components and embedding in other programs.
2016: Working with CHAT transcripts in Python. Jackson L. Lee, Ross Burkholder, Gallagher B. Flinn, and Emily R. Coppess. Technical Report TR-2016-02, Department of Computer Science, University of Chicago, January 2016.
@techreport{lee.ea:2016:pylangacq, title = {{W}orking with {CHAT} transcripts in {P}ython}, author = {Lee, Jackson L. and Burkholder, Ross and Flinn, Gallagher B. and Coppess, Emily R.}, institution = {Department of Computer Science, University of Chicago}, year = {2016}, month = {January}, number = {TR-2016-02}, }

This report introduces the Python library PyLangAcq for working with CHAT transcription data in Python. The library interfaces with speech data transcribed in the CHAT format, which is adopted by the CHILDES database for child language development research. Built in a Python infrastructure, PyLangAcq has direct access to a multitude of computational and statistical tools for language acquisition research. As the CHAT format is also used for other speech transcription databases, PyLangAcq will be useful for researchers in other linguistically related fields such as conversational analysis, corpus linguistics, and clinical linguistics.
2015: Morphological Paradigms: Computational Structure and Unsupervised Learning. Jackson L. Lee. In Proceedings of NAACL-HLT 2015 Student Research Workshop (SRW), pages 161-167, Denver, Colorado, June 2015. Association for Computational Linguistics.
@inproceedings{lee:2015:naacl, title = {{M}orphological {P}aradigms: {C}omputational {S}tructure and {U}nsupervised {L}earning}, author = {Lee, Jackson L.}, booktitle = {Proceedings of NAACL-HLT 2015 Student Research Workshop (SRW)}, year = {2015}, month = {June}, address = {Denver, Colorado}, pages = {161--167}, publisher = {Association for Computational Linguistics}, }

This thesis explores the computational structure of morphological paradigms from the perspective of unsupervised learning. Three topics are studied: (i) stem identification, (ii) paradigmatic similarity, and (iii) paradigm induction. All three topics progress in terms of the scope of data in question. The first and second topics explore structure when morphological paradigms are given, first within a paradigm and then across paradigms. The third topic asks where morphological paradigms come from in the first place, and explores strategies of paradigm induction from child-directed speech. This research is of interest to linguists and natural language processing researchers, for both theoretical questions and applied areas.
2015: Great pizzas, ghost negations: The emergence and persistence of mixed expressives. Andrea Beltrama and Jackson L. Lee. In Proceedings of Sinn und Bedeutung 19. 2015.
@incollection{beltrama-lee:2015:sub, title = {{G}reat pizzas, ghost negations: {T}he emergence and persistence of mixed expressives}, author = {Beltrama, Andrea and Lee, Jackson L.}, booktitle = {Proceedings of {S}inn und {B}edeutung 19}, year = {2015}, }

Two novel cases of mixed expressives are reported and analyzed: Italian gran "big" and Cantonese gwai2 "ghost". We argue that (i) contrary to models of language change predicting the diachronic volatility of mixed expressivity, mixed expressives can be diachronically stable, and that (ii) expressive meaning and at-issue meaning diachronically proceed in a parallel fashion, interacting very little in the process. The case studies provide empirical support to current synchronic models of mixed expressivity, which assign separate semantic representations to expressive and descriptive meaning. The data also provide important insights into the hitherto poorly understood questions with regard to the diachrony and interaction of truth-conditional and expressive meaning.
2015: When French becomes tonal: Prosodic transfer from L1 Cantonese and L2 English. Jackson L. Lee and Stephen Matthews. In The 6th Annual Proceedings of the Pronunciation in Second Language Learning and Teaching Conference, 2015.
@inproceedings{lee-matthews:2015:psllt, title = {{W}hen {F}rench becomes tonal: {P}rosodic transfer from {L1} {C}antonese and {L2} {E}nglish}, author = {Lee, Jackson L. and Matthews, Stephen}, booktitle = {The 6th {A}nnual {P}roceedings of the {P}ronunciation in {S}econd {L}anguage {L}earning and {T}eaching {C}onference}, year = {2015}, }

This paper describes and analyzes the prosodic properties of L3 French of Hong Kong speakers. The L3 French variety in question is like a tone language, where syllables of content words bear the Cantonese high-level tone and those of function words have the Cantonese low-level tone instead. L2 English mediates the interlanguage transfer.
2014: Combining successor and predecessor frequencies to model truncation in Brazilian Portuguese. Mike Pham and Jackson L. Lee. Technical Report TR-2014-15, Department of Computer Science, University of Chicago, October 2014.
@techreport{phamlee:2014, title = {{C}ombining successor and predecessor frequencies to model truncation in {B}razilian {P}ortuguese}, author = {Pham, Mike and Lee, Jackson L.}, institution = {Department of Computer Science, University of Chicago}, month = {October}, year = {2014}, number = {TR-2014-15}, }

This paper describes and evaluates an algorithm to predict Brazilian Portuguese truncation stems using Zellig Harris's successor frequencies and predecessor frequencies. The algorithm can be more generally regarded as a simple morpheme segmentation algorithm, one which finds the best morpheme boundary within a word against a given lexicon and does not a priori assume morpheme consistency.
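As a rough illustration of the idea in this abstract, the following minimal sketch computes Harris-style successor and predecessor frequencies for one word against a lexicon and picks a single boundary. The function names, the toy English lexicon, and the choice to combine the two counts by simple addition are all assumptions made for illustration; the report's actual scoring may differ.

```python
def successor_frequencies(word, lexicon):
    """For each split point i (1 <= i < len(word)), count the distinct
    letters that follow the prefix word[:i] among the lexicon words."""
    counts = []
    for i in range(1, len(word)):
        prefix = word[:i]
        next_letters = {
            w[i] for w in lexicon if len(w) > i and w.startswith(prefix)
        }
        counts.append(len(next_letters))
    return counts


def predecessor_frequencies(word, lexicon):
    """Mirror image of successor frequencies: count the distinct letters
    that precede the suffix word[i:] among the lexicon words."""
    reversed_lexicon = [w[::-1] for w in lexicon]
    # Counts on the reversed word are indexed from the right edge,
    # so flip them to line up with the left-to-right split points.
    return successor_frequencies(word[::-1], reversed_lexicon)[::-1]


def best_boundary(word, lexicon):
    """Return the split point with the highest combined count.
    Summing the two counts is an assumption, not the report's formula."""
    sf = successor_frequencies(word, lexicon)
    pf = predecessor_frequencies(word, lexicon)
    scores = [s + p for s, p in zip(sf, pf)]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index + 1  # split points are 1-based offsets into the word


# Hypothetical toy lexicon, for illustration only.
LEXICON = ["walk", "walks", "walked", "walking",
           "talk", "talks", "talked", "talking"]

if __name__ == "__main__":
    split = best_boundary("walked", LEXICON)
    print("walked"[:split], "+", "walked"[split:])
```

Note that nothing here assumes the two pieces are morphemes attested elsewhere in the lexicon; the boundary is chosen purely from the letter counts on either side.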
2014: Automatic morphological alignment and clustering. Jackson L. Lee. Technical Report TR-2014-07, Department of Computer Science, University of Chicago, May 2014.
@techreport{lee:2014:techreport, title = {{A}utomatic morphological alignment and clustering}, author = {Lee, Jackson L.}, institution = {Department of Computer Science, University of Chicago}, month = {May}, year = {2014}, number = {TR-2014-07}, }

This paper proposes an algorithm which takes a list of morphological paradigms and explores cross-paradigmatic structure.
2014: Variability in perceived duration: pitch dynamics and vowel quality. Alan C. L. Yu, Hyunjung Lee, and Jackson L. Lee. In Proceedings of the 4th International Symposium on Tonal Aspects of Languages, May 2014.
@inproceedings{yu.ea:2014:tal4, title = {{V}ariability in perceived duration: pitch dynamics and vowel quality}, author = {Yu, Alan C. L. and Lee, Hyunjung and Lee, Jackson L.}, booktitle = {Proceedings of the 4th International Symposium on Tonal Aspects of Languages}, month = {May}, year = {2014}, }

This paper explores the hypothesis that differential phonologization might arise as a result of variability in how the perceptual system copes with variation in the speech signal.
2014: The representation of contour tones in Cantonese. Jackson L. Lee. In Proceedings of the 38th Annual Meeting of the Berkeley Linguistics Society. 2014.
@incollection{lee:2014:bls, title = {{T}he representation of contour tones in {C}antonese}, author = {Lee, Jackson L.}, booktitle = {Proceedings of the 38th {A}nnual {M}eeting of the {B}erkeley {L}inguistics {S}ociety}, year = {2014}, }

This paper argues that Cantonese contour tones are tone clusters but not contour tone units. In light of the mainstream assumption that Asian tone languages are contour tone unit languages, Cantonese is typologically exceptional.
2014: Proceedings of the Forty-eighth Annual Meeting of the Chicago Linguistic Society. Andrea Beltrama, Tasos Chatzikonstantinou, Jackson L. Lee, Mike Pham, and Diane Rak, editors. Chicago Linguistic Society, 2014.
@book{cls48, editor = {Andrea Beltrama and Tasos Chatzikonstantinou and Jackson L. Lee and Mike Pham and Diane Rak}, title = {{Proceedings of the Forty-eighth Annual Meeting of the Chicago Linguistic Society}}, publisher = {Chicago Linguistic Society}, year = {2014}, }

Table of contents, BibTeX entries for all papers
2012: Fixed-tone reduplication in Cantonese. Jackson L. Lee. In McGill Working Papers in Linguistics 22(1). Proceedings from the Montreal-Ottawa-Toronto (MOT) Phonology Workshop 2011: Phonology in the 21st Century: In Honour of Glyne Piggott. 2012.
@incollection{lee:2012;mcgill, title = {{F}ixed-tone reduplication in {C}antonese}, author = {Lee, Jackson L.}, booktitle = {{McGill} {W}orking {P}apers in {L}inguistics 22(1). {P}roceedings from the {M}ontreal-{O}ttawa-{T}oronto ({MOT}) {P}honology {W}orkshop 2011: {P}honology in the 21st {C}entury: {I}n {H}onour of {G}lyne {P}iggott}, year = {2012}, }

This paper provides a phonological subcategorization analysis of the attenuative constructions A-A-dei2 (from A) and A-A-dei2-B (from A-B) involving reduplication, infixation, and tonal alternation. On reduplication, the analysis is in terms of Morphological Doubling Theory, argued to be superior to Base-Reduplicant Correspondence Theory or Phonological Copying for the range of data considered.