Publications

[Find up-to-date info on my Google Scholar page]


Papers

LawInstruct: A Resource for Studying Language Model Adaptation to the Legal Domain
Joel Niklaus, Lucia Zheng, Arya D. McCarthy, Christopher Hahn, Brian M. Rosen, Peter Henderson, Daniel E. Ho, Garrett Honke, Percy Liang, Christopher D. Manning. Findings of NAACL 2025. To appear.
Long-Form Speech Translation through Segmentation with Finite-State Decoding Constraints on Large Language Models
Arya D. McCarthy, Hao Zhang, Shankar Kumar, Felix Stahlberg, and Ke Wu. Findings of EMNLP 2023.
Theory-Grounded Computational Text Analysis
Arya D. McCarthy and Giovanna Maria Dora Dore. ACL 2023.
TLDR: A framework for integrative research in text analysis by connecting findings to social science theories. Honorable mention for best paper.
The SIGMORPHON 2022 Shared Task on Cross-Lingual and Low-Resource Grapheme-to-Phoneme Conversion
Arya D. McCarthy, Jackson L. Lee, Alexandra DeLucia, Travis Bartley, Milind Agarwal, Lucas F.E. Ashby, Luca Del Signore, Cameron Gibson, Reuben Raff, and Winston Wu. SIGMORPHON 2023.
TLDR: Transfer learning of grapheme-to-phoneme conversion in truly low-resource settings.
Meeting the Needs of Low-Resource Languages: The Value of Automatic Alignments via Pretrained Models
Abteen Ebrahimi, Arya D. McCarthy, Arturo Oncevay, John Ortega, Luis Chiruzzo, Rolando Coto-Solano, Gustavo A. Giménez-Lugo and Katharina Kann. EACL 2023.
TLDR: Large language models extract useful representations for alignment, annotation projection, and downstream tasks like POS tagging and NER.
A Major Obstacle for NLP Research: Let’s Talk about Time Allocation!
Katharina Kann, Shiran Dudy, Arya D. McCarthy. EMNLP 2022.
TLDR: In recent years, subpar time allocation by the field has been a major obstacle for NLP research. We propose remedies to improve the status quo.
Morphological Processing of Low-Resource Languages: Where We Are and What’s Next
Adam Wiemerslage, Miikka Silfverberg, Changbing Yang, Arya D. McCarthy, Garrett Nicolai, Eliana Colunga, Katharina Kann. Findings of ACL 2022.
TLDR: We learn a language's morphology from raw text alone, without even external word lists.
Pre-Trained Multilingual Sequence-to-Sequence Models: A Hope for Low-Resource Language Translation?
E. Lee, Sarubi Thillainathan, Shravan Nayak, Surangika Ranathunga, David Ifeoluwa Adelani, Ruisi Su, Arya D. McCarthy. Findings of ACL 2022.
TLDR: Considering five dimensions of training data and domain, we show that mBART and mT5 cannot learn to translate languages phylogenically unrelated to their pre-training languages.
On the Uncomputability of Partition Functions in Energy-Based Sequence Models
Chu-Cheng Lin, Arya D. McCarthy. ICLR 2022.
TLDR: Energy-based sequence models backed by expressive parametric families can result in uncomputable and inapproximable partition functions. Statistical procedures with asymptotic guarantees and sheer amounts of compute are not the only things that make sequence modeling work. Spotlight.
Deciphering and Characterizing Out-of-Vocabulary Words for Morphologically Rich Languages
G. Botev, Arya D. McCarthy, Winston Wu, David Yarowsky. COLING 2022.
TLDR: A multi-faceted distributional analysis of the underlying word-formation processes that can aid in OOV words' compositional translation, tagging, parsing, language modeling, and other NLP tasks.
Hong Kong: Longitudinal and Synchronic Characterisations of Protest News between 1998 and 2020
Arya D. McCarthy, G. Dore. LREC 2022.
TLDR: Construction of a useful and timely corpus: the Hong Kong Protest News Dataset, along with investigation of patterns of journalistic practice.
UniMorph 4.0: Universal Morphology
Khuyagbaatar Batsuren, Omer Goldman, and 92 others including Arya D. McCarthy. LREC 2022.
TLDR: A push towards inclusion of derivational morphology in UniMorph, expanding on UniMorph 3.0 (McCarthy et al., 2020).
Learning to play with the machines in social science research: Bringing the theory back in.
Giovanna Maria Dora Dore*, Arya D. McCarthy*. ICML Workshop on Human–Machine Collaboration and Teaming (HMCaT) 2022.
TLDR: Computational and social science can benefit from greater interaction, considering the questions of big data versus theory, prediction versus explanation, and praxis versus substance.
Findings of the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering
Adam Wiemerslage, Arya D. McCarthy, Alexander Erdmann, Garrett Nicolai, Manex Agirrezabal, Miikka Silfverberg, Mans Hulden, Katharina Kann. SIGMORPHON 2021.
TLDR: The goal of the shared task is to cluster word types from a raw text corpus into paradigms, and the best performing system is based on adaptor grammars.
Jump-Starting Item Parameters for Adaptive Language Tests
Arya D. McCarthy, K. Yancey, Geoffrey T. LaFlair, Jesse Egbert, Manqian Liao, Burr Settles. EMNLP 2021.
TLDR: A multi-task generalized linear model with BERT features enables new item difficulty estimates without piloting them first, which in turn limits item exposure and thus enhances test item security.
A Mixed-Methods Analysis of Western and Hong Kong–based Reporting on the 2019–2020 Protests
Arya D. McCarthy, James Scharf, G. Dore. LaTeCH-CLfL @EMNLP 2021.
TLDR: NLP discovers different media portrayals of the Hong Kong protests between the Hong Kong and Western presses, and analysis interprets them.
Characterizing News Portrayal of Civil Unrest in Hong Kong, 1998–2020
James Scharf, Arya D. McCarthy, G. Dore. CASE @ACL 2021.
TLDR: We apply statistical techniques from natural language processing to a collection of Western and Hong Kong–based English-language newspaper articles spanning the years 1998–2020.
Neural Transduction for Multilingual Lexical Translation
D. Lewis, Winston Wu, Arya D. McCarthy, David Yarowsky. COLING 2020.
TLDR: We design a probabilistic approach that synthesizes new word forms, so it can operate in settings where correct translations have not been observed in text.
Measuring the Similarity of Grammatical Gender Systems by Comparing Partitions
Arya D. McCarthy, Adina Williams, Shijia Liu, David Yarowsky, Ryan Cotterell. EMNLP 2020.
TLDR: We craft novel information theoretic metrics for measuring similarity between gender systems, then use these to test hypotheses about Indo-European phylogeny.
The JHU Submission to the 2020 Duolingo Shared Task on Simultaneous Translation and Paraphrase for Language Education
Huda Khayrallah, Jacob Bremerman, Arya D. McCarthy, Kenton Murray, Winston Wu, Matt Post. Workshop on Neural Generation and Translation @ACL 2020.
TLDR: Fine-tuning strong machine translation models with in-domain data and filtering n-best lists yields top performance in the shared task. First-place system.
The SIGMORPHON 2020 Shared Task on Multilingual Grapheme-to-Phoneme Conversion
Kyle Gorman, Lucas F. E. Ashby, Aaron Goyzueta, Arya D. McCarthy, Shijie Wu, Daniel You. SIGMORPHON 2020.
TLDR: Design and findings of the SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion.
Addressing Posterior Collapse with Mutual Information for Improved Variational Neural Machine Translation
Arya D. McCarthy, Xian Li, Jiatao Gu, Ning Dong. EMNLP 2020.
TLDR: A simple and effective approach to address posterior collapse in conditional variational autoencoders (CVAEs). Yields improved translation with noisy or monolingual data.
The SIGMORPHON 2020 Shared Task on Unsupervised Morphological Paradigm Completion
Katharina Kann*, Arya D. McCarthy*, Garrett Nicolai, Mans Hulden. SIGMORPHON 2020.
TLDR: A novel task in the field of inflectional morphology, where none of the submitted systems was able to improve over the baseline on average over all 9 test languages.
Unsupervised Morphological Paradigm Completion
Huiming Jin, Liwei Cai, Yihui Peng, Chen Xia, Arya D. McCarthy, Katharina Kann. ACL 2020.
TLDR: Discovery of morphological paradigms from raw text, with the potential to improve tools for low-resource languages or to assist linguistic annotators.
Predicting Declension Class from Form and Meaning
Adina Williams, Tiago Pimentel, Arya D. McCarthy, Hagen Blix, Eleanor Chodroff, Ryan Cotterell. ACL 2020.
TLDR: An information-theoretic method supporting the finding that form and meaning are relevant to grouping nouns into declension classes, with variation across languages.
SkinAugment: Auto-Encoding Speaker Conversions for Automatic Speech Translation
Arya D. McCarthy, Liezl Puzon, J. Pino. ICASSP 2020.
TLDR: Autoencoding speaker conversion is combined with augmentation by machine-translated transcripts to obtain a competitive end-to-end AST model that outperforms a very strong cascade model.
Fine-grained Morphosyntactic Analysis and Generation Tools for More Than One Thousand Languages
Garrett Nicolai, D. Lewis, Arya D. McCarthy, Aaron Mueller, Winston Wu, David Yarowsky. LREC 2020.
TLDR: Kilolanguage generation, lemmatization, and fine-grained tagging, supplemented and improved through ensembling and dictionary-based reranking, for 1000+ languages.
The Johns Hopkins University Bible Corpus: 1600+ Tongues for Typological Exploration
Arya D. McCarthy, R. Wicks, D. Lewis, Aaron Mueller, Winston Wu, Oliver Adams, Garrett Nicolai, Matt Post, David Yarowsky. LREC 2020.
TLDR: Findings from the creation of a massively parallel corpus in over 1600 languages, the Johns Hopkins University Bible Corpus (JHUBC).
An Analysis of Massively Multilingual Neural Machine Translation for Low-Resource Languages
Aaron Mueller, Garrett Nicolai, Arya D. McCarthy, D. Lewis, Winston Wu, David Yarowsky. LREC 2020.
TLDR: Adding more languages to a training set is often better, but too many harms performance—the best number depends on the source language.
Massively Multilingual Pronunciation Modeling with WikiPron
Jackson L. Lee, Lucas F. E. Ashby, M. E. Garza, Yeonju Lee-Sikka, Sean Miller, A. Wong, Arya D. McCarthy, Kyle Gorman. LREC 2020.
TLDR: We introduce WikiPron, an open-source command-line tool for extracting pronunciation data from Wiktionary, a collaborative multilingual online dictionary.
UniMorph 3.0: Universal Morphology
Arya D. McCarthy, Christo Kirov, Matteo Grella, Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekaterina Vylomova, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, 11 others, and David Yarowsky. LREC 2020.
TLDR: More than 2x the number of languages, as well as new parts of speech and validation tools, in the latest UniMorph morphological inventory release.
An Exact No Free Lunch Theorem for Community Detection
Arya D. McCarthy, Tongfei Chen, Seth Ebner. Complex Networks 2019.
TLDR: By using the correct random model for the problem domain, this work provides a stronger, exact No Free Lunch theorem for community detection and generalizes to other set-partitioning tasks including core/periphery separation, k-clustering, and graph partitioning.
Metrics Matter in Community Detection
Arya D. McCarthy, Tongfei Chen, Rachel Rudinger, D. Matula. Complex Networks 2019.
TLDR: A critical evaluation of normalized mutual information (NMI) as an evaluation metric for community detection is presented, and one-sided AMI under the 𝕄ₐₗₗ model (all partitions of 𝑛 nodes) is advised.
Modeling Color Terminology Across Thousands of Languages
Arya D. McCarthy, Winston Wu, Aaron Mueller, Bill Watson, David Yarowsky. EMNLP 2019.
TLDR: This paper employs a set of diverse measures on massively cross-linguistic data to operationalize and critique the Berlin and Kay color term hypotheses, showing the value of learning from weak supervision.
Weird Inflects but OK: Making Sense of Morphological Generation Errors
Kyle Gorman, Arya D. McCarthy, Ryan Cotterell, Ekaterina Vylomova, Miikka Silfverberg, M. Markowska. CoNLL 2019.
TLDR: A manual error analysis of the CoNLL-SIGMORPHON Shared Task on Morphological Reinflection, along with a proposed error taxonomy, which is used to annotate errors made by the top two systems across twelve languages.
Harnessing Indirect Training Data for End-to-End Automatic Speech Translation: Tricks of the Trade
J. Pino, Liezl Puzon, Jiatao Gu, Xutai Ma, Arya D. McCarthy, D. Gopinath. IWSLT 2019.
TLDR: Among a variety of approaches, data augmentation by translating ASR transcripts proves most effective on the English–French augmented LibriSpeech dataset, closing the performance gap from 8.2 to 1.4 BLEU relative to a very strong cascade that could directly utilize copious ASR and MT data.
Meaning to Form: Measuring Systematicity as Information
Tiago Pimentel, Arya D. McCarthy, Damián E. Blasi, Brian Roark, Ryan Cotterell. ACL 2019.
TLDR: A holistic quantification of the systematicity of the sign using mutual information and recurrent neural networks. Best paper nominee.
The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection
Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sabrina J. Mielke, Jeffrey Heinz, Ryan Cotterell, Mans Hulden. SIGMORPHON 2019.
TLDR: The shared task examined transfer learning of inflection between 100 language pairs, as well as contextual lemmatization and morphosyntactic description in 66 languages.
Marrying Universal Dependencies and Universal Morphology
Arya D. McCarthy, Miikka Silfverberg, Ryan Cotterell, Mans Hulden, David Yarowsky. UDW @EMNLP 2018.
TLDR: A deterministic mapping from Universal Dependencies v2 features into the UniMorph schema to ease interoperability, along with a critical evaluation of the foundations, strengths, and weaknesses of the two annotation projects.
Freezing Subnetworks to Analyze Domain Adaptation in Neural Machine Translation
Brian Thompson, Huda Khayrallah, Antonios Anastasopoulos, Arya D. McCarthy, Kevin Duh, Rebecca Marvin, Paul McNamee, Jeremy Gwinnup, Tim Anderson, Philipp Koehn. WMT 2018.
TLDR: A precursor to modern parameter-efficient finetuning approaches: performance is surprisingly good when a single component is adapted while holding the rest of the model fixed.
The CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection
Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, David Yarowsky, Jason Eisner, Mans Hulden. CoNLL 2018.
TLDR: Morphological generation in 103 languages and inflecting words in sentential context, similar to a cloze task.
UniMorph 2.0: Universal Morphology
Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sabrina J. Mielke, Arya D. McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner, Mans Hulden. LREC 2018.
TLDR: A project hosted at the Johns Hopkins University. Advances made to the schema, tooling, and dissemination of project resources since the UniMorph 1.0 release described at LREC 2016 are detailed.

Books

Giovanna Maria Dora Dore, Arya D. McCarthy, and James A. Scharf. A Free Press, If You Can Keep It: What Natural Language Processing Reveals About Freedom of the Press in Hong Kong. Springer 2022.
TLDR: Interdisciplinary analyses of protest coverage in print media that will “appeal to a wide range of readers with interests in computational social science, public policy, political sciences as well as policy-makers, think tanks, and practitioners who focus on the China-Hong Kong nexus”.

Preprints

Joel Niklaus, Lucia Zheng, Arya D. McCarthy, Christopher Hahn, Brian M. Rosen, Peter Henderson, Daniel E. Ho, Garrett Honke, Percy Liang, Christopher D. Manning. FLawN-T5: An Empirical Examination of Effective Instruction-Tuning Data Mixtures for Legal Reasoning. arXiv.
Arya D. McCarthy, Hao Zhang, Shankar Kumar, Felix Stahlberg, Axel H. Ng. Improved Long-Form Spoken Language Translation with Large Language Models. arXiv.
Nibhrat Lohia, Raunak Mundada, Arya D. McCarthy, Eric C. Larson. AirWare: Utilizing Embedded Audio and Infrared Signals for In-Air Hand-Gesture Recognition. arXiv.
Arya D. McCarthy, Xian Li, Jiatao Gu, and Ning Dong. Improved variational neural machine translation by promoting mutual information. arXiv.

Refereed Abstracts

David Yarowsky, Arya D. McCarthy, Garrett Nicolai, Winston Wu, Aaron Mueller, Dylan Lewis, Yingqi Ding, Abhinav Nigam, Emre Ozgu, Debanik Purkayastha, James Scharf and Kenneth Zheng. A 1000-language Collaborative Universal Dictionary and Universal Translator. UNESCO Language Technologies for All (LT4All) 2019.
Xian Li, Jiatao Gu, Ning Dong, and Arya D. McCarthy. Improved variational neural machine translation via promoting mutual information. EMNLP Workshop on Neural Generation and Translation 2019.
Xian Li, Jiatao Gu, Arya D. McCarthy, and Ning Dong. Improving variational NMT by promoting mutual information. West Coast NLP Summit (WeCNLP) 2019.
Arya D. McCarthy. Community Detection, The No Free Lunch Theorem, and Attack Games. SIAM Workshop on Network Science 2019.
Arya D. McCarthy. An exact No Free Lunch theorem for community detection. Complex Networks (ICCNA) 2018.
Arya D. McCarthy and David W. Matula. Evaluating the leximin method for community detection. Complex Networks (ICCNA) 2018.
Arya D. McCarthy and David W. Matula. Normalized mutual information exaggerates community detection performance. SIAM Workshop on Network Science 2018.

Master’s thesis

Arya D. McCarthy. Gridlock in networks: The leximin method for hierarchical community detection. Master’s thesis, SMU. 2017.

Arya McCarthy
