Tarek Sakakini


2020

pdf bib
Context-Aware Automatic Text Simplification of Health Materials in Low-Resource Domains
Tarek Sakakini | Jong Yoon Lee | Aditya Duri | Renato F.L. Azevedo | Victor Sadauskas | Kuangxiao Gu | Suma Bhat | Dan Morrow | James Graumlich | Saqib Walayat | Mark Hasegawa-Johnson | Thomas Huang | Ann Willemsen-Dunlap | Donald Halpin
Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis

Healthcare systems have increased patients’ exposure to their own health materials to enhance patients’ health levels, but this has been impeded by patients’ lack of understanding of their health material. We address potential barriers to their comprehension by developing a context-aware text simplification system for health material. Given the scarcity of annotated parallel corpora in healthcare domains, we design our system to be independent of a parallel corpus, complementing the availability of data-driven neural methods when such corpora are available. Our system compensates for the lack of direct supervision using a biomedical lexical database: Unified Medical Language System (UMLS). Compared to a competitive prior approach that uses a tool for identifying biomedical concepts and a consumer-directed vocabulary list, we empirically show the enhanced accuracy of our system due to improved handling of ambiguous terms. We also show the enhanced accuracy of our system over directly-supervised neural methods in this low-resource setting. Finally, we show the direct impact of our system on laypeople’s comprehension of health material via a human subjects’ study (n=160).

2019

pdf bib
Equipping Educational Applications with Domain Knowledge
Tarek Sakakini | Hongyu Gong | Jong Yoon Lee | Robert Schloss | JinJun Xiong | Suma Bhat
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

One of the challenges of building natural language processing (NLP) applications for education is finding a large domain-specific corpus for the subject of interest (e.g., history or science). To address this challenge, we propose a tool, Dexter, that extracts a subject-specific corpus from a heterogeneous corpus, such as Wikipedia, by relying on a small seed corpus and distributed document representations. We empirically show the impact of the generated corpus on language modeling, estimating word embeddings, and consequently, distractor generation, resulting in better performances than while using a general domain corpus, a heuristically constructed domain-specific corpus, and a corpus generated by a popular system: BootCaT.

2018

pdf bib
Document Similarity for Texts of Varying Lengths via Hidden Topics
Hongyu Gong | Tarek Sakakini | Suma Bhat | JinJun Xiong
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Measuring similarity between texts is an important task for several applications. Available approaches to measure document similarity are inadequate for document pairs that have non-comparable lengths, such as a long document and its summary. This is because of the lexical, contextual and the abstraction gaps between a long document of rich details and its concise summary of abstract information. In this paper, we present a document matching approach to bridge this gap, by comparing the texts in a common space of hidden topics. We evaluate the matching algorithm on two matching tasks and find that it consistently and widely outperforms strong baselines. We also highlight the benefits of the incorporation of domain knowledge to text matching.

2017

pdf bib
MORSE: Semantic-ally Drive-n MORpheme SEgment-er
Tarek Sakakini | Suma Bhat | Pramod Viswanath
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present in this paper a novel framework for morpheme segmentation which uses the morpho-syntactic regularities preserved by word representations, in addition to orthographic features, to segment words into morphemes. This framework is the first to consider vocabulary-wide syntactico-semantic information for this task. We also analyze the deficiencies of available benchmarking datasets and introduce our own dataset that was created on the basis of compositionality. We validate our algorithm across datasets and present state-of-the-art results.