Alexandra Schofield


2022

pdf bib
More Than Words: Collocation Retokenization for Latent Dirichlet Allocation Models
Jin Cheevaprawatdomrong | Alexandra Schofield | Attapol Rutherford
Findings of the Association for Computational Linguistics: ACL 2022

Traditionally, Latent Dirichlet Allocation (LDA) ingests words in a collection of documents to discover their latent topics using word-document co-occurrences. Previous studies show that representing bigrams collocations in the input can improve topic coherence in English. However, it is unclear how to achieve the best results for languages without marked word boundaries such as Chinese and Thai. Here, we explore the use of retokenization based on chi-squared measures, t-statistics, and raw frequency to merge frequent token ngrams into collocations when preparing input to the LDA model. Based on the goodness of fit and the coherence metric, we show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.

2021

pdf bib
Learning How To Learn NLP: Developing Introductory Concepts Through Scaffolded Discovery
Alexandra Schofield | Richard Wicentowski | Julie Medero
Proceedings of the Fifth Workshop on Teaching NLP

We present a scaffolded discovery learning approach to introducing concepts in a Natural Language Processing course aimed at computer science students at liberal arts institutions. We describe some of the objectives of this approach, as well as presenting specific ways that four of our discovery-based assignments combine specific natural language processing concepts with broader analytic skills. We argue this approach helps prepare students for many possible future paths involving both application and innovation of NLP technology by emphasizing experimental data navigation, experiment design, and awareness of the complexities and challenges of analysis.

pdf bib
How effective is BERT without word ordering? Implications for language understanding and data privacy
Jack Hessel | Alexandra Schofield
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Ordered word sequences contain the rich structures that define language. However, it’s often not clear if or how modern pretrained language models utilize these structures. We show that the token representations and self-attention activations within BERT are surprisingly resilient to shuffling the order of input tokens, and that for several GLUE language understanding tasks, shuffling only minimally degrades performance, e.g., by 4% for QNLI. While bleak from the perspective of language understanding, our results have positive implications for cases where copyright or ethics necessitates the consideration of bag-of-words data (vs. full documents). We simulate such a scenario for three sensitive classification tasks, demonstrating minimal performance degradation vs. releasing full language sequences.

2020

pdf bib
Integrating Ethics into the NLP Curriculum
Emily M. Bender | Dirk Hovy | Alexandra Schofield
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

To raise awareness among future NLP practitioners and prevent inertia in the field, we need to place ethics in the curriculum for all NLP students—not as an elective, but as a core part of their education. Our goal in this tutorial is to empower NLP researchers and practitioners with tools and resources to teach others about how to ethically apply NLP techniques. We will present both high-level strategies for developing an ethics-oriented curriculum, based on experience and best practices, as well as specific sample exercises that can be brought to a classroom. This highly interactive work session will culminate in a shared online resource page that pools lesson plans, assignments, exercise ideas, reading suggestions, and ideas from the attendees. Though the tutorial will focus particularly on examples for university classrooms, we believe these ideas can extend to company-internal workshops or tutorials in a variety of organizations. In this setting, a key lesson is that there is no single approach to ethical NLP: each project requires thoughtful consideration about what steps can be taken to best support people affected by that project. However, we can learn (and teach) what issues to be aware of, what questions to ask, and what strategies are available to mitigate harm.

2017

pdf bib
Quantifying the Effects of Text Duplication on Semantic Models
Alexandra Schofield | Laure Thompson | David Mimno
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Duplicate documents are a pervasive problem in text datasets and can have a strong effect on unsupervised models. Methods to remove duplicate texts are typically heuristic or very expensive, so it is vital to know when and why they are needed. We measure the sensitivity of two latent semantic methods to the presence of different levels of document repetition. By artificially creating different forms of duplicate text we confirm several hypotheses about how repeated text impacts models. While a small amount of duplication is tolerable, substantial over-representation of subsets of the text may overwhelm meaningful topical patterns.

pdf bib
Pulling Out the Stops: Rethinking Stopword Removal for Topic Models
Alexandra Schofield | Måns Magnusson | David Mimno
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

It is often assumed that topic models benefit from the use of a manually curated stopword list. Constructing this list is time-consuming and often subject to user judgments about what kinds of words are important to the model and the application. Although stopword removal clearly affects which word types appear as most probable terms in topics, we argue that this improvement is superficial, and that topic inference benefits little from the practice of removing stopwords beyond very frequent terms. Removing corpus-specific stopwords after model inference is more transparent and produces similar results to removing those words prior to inference.

2016

pdf bib
Comparing Apples to Apple: The Effects of Stemmers on Topic Models
Alexandra Schofield | David Mimno
Transactions of the Association for Computational Linguistics, Volume 4

Rule-based stemmers such as the Porter stemmer are frequently used to preprocess English corpora for topic modeling. In this work, we train and evaluate topic models on a variety of corpora using several different stemming algorithms. We examine several different quantitative measures of the resulting models, including likelihood, coherence, model stability, and entropy. Despite their frequent use in topic modeling, we find that stemmers produce no meaningful improvement in likelihood and coherence and in fact can degrade topic stability.

pdf bib
Gender-Distinguishing Features in Film Dialogue
Alexandra Schofield | Leo Mehr
Proceedings of the Fifth Workshop on Computational Linguistics for Literature