Yugo Murawaki


2023

pdf bib
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Student Research Workshop
Dongfang Li | Rahmad Mahendra | Zilu Peter Tang | Hyeju Jang | Yugo Murawaki | Derek Fai Wong
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Student Research Workshop

pdf bib
KWJA: A Unified Japanese Analyzer Based on Foundation Models
Nobuhiro Ueda | Kazumasa Omura | Takashi Kodama | Hirokazu Kiyomaru | Yugo Murawaki | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

We present KWJA, a high-performance unified Japanese text analyzer based on foundation models.KWJA supports a wide range of tasks, including typo correction, word segmentation, word normalization, morphological analysis, named entity recognition, linguistic feature tagging, dependency parsing, PAS analysis, bridging reference resolution, coreference resolution, and discourse relation analysis, making it the most versatile among existing Japanese text analyzers.KWJA solves these tasks in a multi-task manner but still achieves competitive or better performance compared to existing analyzers specialized for each task.KWJA is publicly available under the MIT license at https://github.com/ku-nlp/kwja.

2022

pdf bib
Addressing Segmentation Ambiguity in Neural Linguistic Steganography
Jumon Nozaki | Yugo Murawaki
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Previous studies on neural linguistic steganography, except Ueoka et al. (2021), overlook the fact that the sender must detokenize cover texts to avoid arousing the eavesdropper’s suspicion. In this paper, we demonstrate that segmentation ambiguity indeed causes occasional decoding failures at the receiver’s side. With the near-ubiquity of subwords, this problem now affects any language. We propose simple tricks to overcome this problem, which are even applicable to languages without explicit word boundaries.

2021

pdf bib
Japanese Zero Anaphora Resolution Can Benefit from Parallel Texts Through Neural Transfer Learning
Masato Umakoshi | Yugo Murawaki | Sadao Kurohashi
Findings of the Association for Computational Linguistics: EMNLP 2021

Parallel texts of Japanese and a non-pro-drop language have the potential of improving the performance of Japanese zero anaphora resolution (ZAR) because pronouns dropped in the former are usually mentioned explicitly in the latter. However, rule-based cross-lingual transfer is hampered by error propagation in an NLP pipeline and the frequent lack of transparency in translation correspondences. In this paper, we propose implicit transfer by injecting machine translation (MT) as an intermediate task between pretraining and ZAR. We employ a pretrained BERT model to initialize the encoder part of the encoder-decoder model for MT, and eject the encoder part for fine-tuning on ZAR. The proposed framework empirically demonstrates that ZAR performance can be improved by transfer learning from MT. In addition, we find that the incorporation of the masked language model training into MT leads to further gains.

pdf bib
Frustratingly Easy Edit-based Linguistic Steganography with a Masked Language Model
Honai Ueoka | Yugo Murawaki | Sadao Kurohashi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

With advances in neural language models, the focus of linguistic steganography has shifted from edit-based approaches to generation-based ones. While the latter’s payload capacity is impressive, generating genuine-looking texts remains challenging. In this paper, we revisit edit-based linguistic steganography, with the idea that a masked language model offers an off-the-shelf solution. The proposed method eliminates painstaking rule construction and has a high payload capacity for an edit-based model. It is also shown to be more secure against automatic detection than a generation-based method while offering better control of the security/payload capacity trade-off.

2020

pdf bib
Latent Geographical Factors for Analyzing the Evolution of Dialects in Contact
Yugo Murawaki
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Analyzing the evolution of dialects remains a challenging problem because contact phenomena hinder the application of the standard tree model. Previous statistical approaches to this problem resort to admixture analysis, where each dialect is seen as a mixture of latent ancestral populations. However, such ancestral populations are hardly interpretable in the context of the tree model. In this paper, we propose a probabilistic generative model that represents latent factors as geographical distributions. We argue that the proposed model has higher affinity with the tree model because a tree can alternatively be represented as a set of geographical distributions. Experiments involving synthetic and real data suggest that the proposed method is both quantitatively and qualitatively superior to the admixture model.

pdf bib
A System for Worldwide COVID-19 Information Aggregation
Akiko Aizawa | Frederic Bergeron | Junjie Chen | Fei Cheng | Katsuhiko Hayashi | Kentaro Inui | Hiroyoshi Ito | Daisuke Kawahara | Masaru Kitsuregawa | Hirokazu Kiyomaru | Masaki Kobayashi | Takashi Kodama | Sadao Kurohashi | Qianying Liu | Masaki Matsubara | Yusuke Miyao | Atsuyuki Morishima | Yugo Murawaki | Kazumasa Omura | Haiyue Song | Eiichiro Sumita | Shinji Suzuki | Ribeka Tanaka | Yu Tanaka | Masashi Toyoda | Nobuhiro Ueda | Honai Ueoka | Masao Utiyama | Ying Zhong
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

The global pandemic of COVID-19 has made the public pay close attention to related news, covering various domains, such as sanitation, treatment, and effects on education. Meanwhile, the COVID-19 condition is very different among the countries (e.g., policies and development of the epidemic), and thus citizens would be interested in news in foreign countries. We build a system for worldwide COVID-19 information aggregation containing reliable articles from 10 regions in 7 languages sorted by topics. Our reliable COVID-19 related website dataset collected through crowdsourcing ensures the quality of the articles. A neural machine translation module translates articles in other languages into Japanese and English. A BERT-based topic-classifier trained on our article-topic pair dataset helps users find their interested information efficiently by putting articles into different categories.

pdf bib
Adapting BERT to Implicit Discourse Relation Classification with a Focus on Discourse Connectives
Yudai Kishimoto | Yugo Murawaki | Sadao Kurohashi
Proceedings of the Twelfth Language Resources and Evaluation Conference

BERT, a neural network-based language model pre-trained on large corpora, is a breakthrough in natural language processing, significantly outperforming previous state-of-the-art models in numerous tasks. However, there have been few reports on its application to implicit discourse relation classification, and it is not clear how BERT is best adapted to the task. In this paper, we test three methods of adaptation. (1) We perform additional pre-training on text tailored to discourse classification. (2) In expectation of knowledge transfer from explicit discourse relations to implicit discourse relations, we add a task named explicit connective prediction at the additional pre-training step. (3) To exploit implicit connectives given by treebank annotators, we add a task named implicit connective prediction at the fine-tuning step. We demonstrate that these three techniques can be combined straightforwardly in a single training pipeline. Through comprehensive experiments, we found that the first and second techniques provide additional gain while the last one did not.

pdf bib
Building a Japanese Typo Dataset from Wikipedia’s Revision History
Yu Tanaka | Yugo Murawaki | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

User generated texts contain many typos for which correction is necessary for NLP systems to work. Although a large number of typo–correction pairs are needed to develop a data-driven typo correction system, no such dataset is available for Japanese. In this paper, we extract over half a million Japanese typo–correction pairs from Wikipedia’s revision history. Unlike other languages, Japanese poses unique challenges: (1) Japanese texts are unsegmented so that we cannot simply apply a spelling checker, and (2) the way people inputting kanji logographs results in typos with drastically different surface forms from correct ones. We address them by combining character-based extraction rules, morphological analyzers to guess readings, and various filtering methods. We evaluate the dataset using crowdsourcing and run a baseline seq2seq model for typo correction.

pdf bib
Native-like Expression Identification by Contrasting Native and Proficient Second Language Speakers
Oleksandr Harust | Yugo Murawaki | Sadao Kurohashi
Proceedings of the 28th International Conference on Computational Linguistics

We propose a novel task of native-like expression identification by contrasting texts written by native speakers and those by proficient second language speakers. This task is highly challenging mainly because 1) the combinatorial nature of expressions prevents us from choosing candidate expressions a priori and 2) the distributions of the two types of texts overlap considerably. Our solution to the first problem is to combine a powerful neural network-based classifier of sentence-level nativeness with an explainability method that measures an approximate contribution of a given expression to the classifier’s prediction. To address the second problem, we introduce a special label neutral and reformulate the classification task as complementary-label learning. Our crowdsourcing-based evaluation and in-depth analysis suggest that our method successfully uncovers linguistically interesting usages distinctive of native speech.

2019

pdf bib
Bayesian Learning of Latent Representations of Language Structures
Yugo Murawaki
Computational Linguistics, Volume 45, Issue 2 - June 2019

We borrow the concept of representation learning from deep learning research, and we argue that the quest for Greenbergian implicational universals can be reformulated as the learning of good latent representations of languages, or sequences of surface typological features. By projecting languages into latent representations and performing inference in the latent space, we can handle complex dependencies among features in an implicit manner. The most challenging problem in turning the idea into a concrete computational model is the alarmingly large number of missing values in existing typological databases. To address this problem, we keep the number of model parameters relatively small to avoid overfitting, adopt the Bayesian learning framework for its robustness, and exploit phylogenetically and/or spatially related languages as additional clues. Experiments show that the proposed model recovers missing values more accurately than others and that some latent variables exhibit phylogenetic and spatial signals comparable to those of surface features.

pdf bib
Minimally Supervised Learning of Affective Events Using Discourse Relations
Jun Saito | Yugo Murawaki | Sadao Kurohashi
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Recognizing affective events that trigger positive or negative sentiment has a wide range of natural language processing applications but remains a challenging problem mainly because the polarity of an event is not necessarily predictable from its constituent words. In this paper, we propose to propagate affective polarity using discourse relations. Our method is simple and only requires a very small seed lexicon and a large raw corpus. Our experiments using Japanese data show that our method learns affective events effectively without manually labeled data. It also improves supervised learning results when labeled data are small.

pdf bib
Diversity-aware Event Prediction based on a Conditional Variational Autoencoder with Reconstruction
Hirokazu Kiyomaru | Kazumasa Omura | Yugo Murawaki | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing

Typical event sequences are an important class of commonsense knowledge. Formalizing the task as the generation of a next event conditioned on a current event, previous work in event prediction employs sequence-to-sequence (seq2seq) models. However, what can happen after a given event is usually diverse, a fact that can hardly be captured by deterministic models. In this paper, we propose to incorporate a conditional variational autoencoder (CVAE) into seq2seq for its ability to represent diverse next events as a probabilistic distribution. We further extend the CVAE-based seq2seq with a reconstruction mechanism to prevent the model from concentrating on highly typical events. To facilitate fair and systematic evaluation of the diversity-aware models, we also extend existing evaluation datasets by tying each current event to multiple next events. Experiments show that the CVAE-based models drastically outperform deterministic models in terms of precision and that the reconstruction mechanism improves the recall of CVAE-based models without sacrificing precision.

2018

pdf bib
A Knowledge-Augmented Neural Network Model for Implicit Discourse Relation Classification
Yudai Kishimoto | Yugo Murawaki | Sadao Kurohashi
Proceedings of the 27th International Conference on Computational Linguistics

Identifying discourse relations that are not overtly marked with discourse connectives remains a challenging problem. The absence of explicit clues indicates a need for the combination of world knowledge and weak contextual clues, which can hardly be learned from a small amount of manually annotated data. In this paper, we address this problem by augmenting the input text with external knowledge and context and by adopting a neural network model that can effectively handle the augmented text. Experiments show that external knowledge did improve the classification accuracy. Contextual information provided no significant gain for implicit discourse relations, but it did for explicit ones.

pdf bib
Universal Dependencies Version 2 for Japanese
Masayuki Asahara | Hiroshi Kanayama | Takaaki Tanaka | Yusuke Miyao | Sumire Uematsu | Shinsuke Mori | Yuji Matsumoto | Mai Omura | Yugo Murawaki
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Annotating Modality Expressions and Event Factuality for a Japanese Chess Commentary Corpus
Suguru Matsuyoshi | Hirotaka Kameko | Yugo Murawaki | Shinsuke Mori
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Improving Crowdsourcing-Based Annotation of Japanese Discourse Relations
Yudai Kishimoto | Shinnosuke Sawada | Yugo Murawaki | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Analyzing Correlated Evolution of Multiple Features Using Latent Representations
Yugo Murawaki
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Statistical phylogenetic models have allowed the quantitative analysis of the evolution of a single categorical feature and a pair of binary features, but correlated evolution involving multiple discrete features is yet to be explored. Here we propose latent representation-based analysis in which (1) a sequence of discrete surface features is projected to a sequence of independent binary variables and (2) phylogenetic inference is performed on the latent space. In the experiments, we analyze the features of linguistic typology, with a special focus on the order of subject, object and verb. Our analysis suggests that languages sharing the same word order are not necessarily a coherent group but exhibit varying degrees of diachronic stability depending on other features.

2017

pdf bib
Diachrony-aware Induction of Binary Latent Representations from Typological Features
Yugo Murawaki
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Although features of linguistic typology are a promising alternative to lexical evidence for tracing evolutionary history of languages, a large number of missing values in the dataset pose serious difficulties for statistical modeling. In this paper, we combine two existing approaches to the problem: (1) the synchronic approach that focuses on interdependencies between features and (2) the diachronic approach that exploits phylogenetically- and/or spatially-related languages. Specifically, we propose a Bayesian model that (1) represents each language as a sequence of binary latent parameters encoding inter-feature dependencies and (2) relates a language’s parameters to those of its phylogenetic and spatial neighbors. Experiments show that the proposed model recovers missing values more accurately than others and that induced representations retain phylogenetic and spatial signals observed for surface features.

2016

pdf bib
Wikification for Scriptio Continua
Yugo Murawaki | Shinsuke Mori
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The fact that Japanese employs scriptio continua, or a writing system without spaces, complicates the first step of an NLP pipeline. Word segmentation is widely used in Japanese language processing, and lexical knowledge is crucial for reliable identification of words in text. Although external lexical resources like Wikipedia are potentially useful, segmentation mismatch prevents them from being straightforwardly incorporated into the word segmentation task. If we intentionally violate segmentation standards with the direct incorporation, quantitative evaluation will be no longer feasible. To address this problem, we propose to define a separate task that directly links given texts to an external resource, that is, wikification in the case of Wikipedia. By doing so, we can circumvent segmentation mismatch that may not necessarily be important for downstream applications. As the first step to realize the idea, we design the task of Japanese wikification and construct wikification corpora. We annotated subsets of the Balanced Corpus of Contemporary Written Japanese plus Twitter short messages. We also implement a simple wikifier and investigate its performance on these corpora.

pdf bib
Statistical Modeling of Creole Genesis
Yugo Murawaki
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Contrasting Vertical and Horizontal Transmission of Typological Features
Kenji Yamauchi | Yugo Murawaki
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Linguistic typology provides features that have a potential of uncovering deep phylogenetic relations among the world’s languages. One of the key challenges in using typological features for phylogenetic inference is that horizontal (spatial) transmission obscures vertical (phylogenetic) signals. In this paper, we characterize typological features with respect to the relative strength of vertical and horizontal transmission. To do this, we first construct (1) a spatial neighbor graph of languages and (2) a phylogenetic neighbor graph by collapsing known language families. We then develop an autologistic model that predicts a feature’s distribution from these two graphs. In the experiments, we managed to separate vertically and/or horizontally stable features from unstable ones, and the results are largely consistent with previous findings.

2015

pdf bib
Continuous Space Representations of Linguistic Typology and their Application to Phylogenetic Inference
Yugo Murawaki
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2013

pdf bib
Global Model for Hierarchical Multi-Label Text Classification
Yugo Murawaki
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2012

pdf bib
Semi-Supervised Noun Compound Analysis with Edge and Span Features
Yugo Murawaki | Sadao Kurohashi
Proceedings of COLING 2012

2011

pdf bib
Non-parametric Bayesian Segmentation of Japanese Noun Phrases
Yugo Murawaki | Sadao Kurohashi
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

2010

pdf bib
Online Japanese Unknown Morpheme Detection using Orthographic Variation
Yugo Murawaki | Sadao Kurohashi
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

To solve the unknown morpheme problem in Japanese morphological analysis, we previously proposed a novel framework of online unknown morpheme acquisition and its implementation. This framework poses a previously unexplored problem, online unknown morpheme detection. Online unknown morpheme detection is a task of finding morphemes in each sentence that are not listed in a given lexicon. Unlike in English, it is a non-trivial task because Japanese does not delimit words by white space. We first present a baseline method that simply uses the output of the morphological analyzer. We then show that it fails to detect some unknown morphemes because they are over-segmented into shorter registered morphemes. To cope with this problem, we present a simple solution, the use of orthographic variation of Japanese. Under the assumption that orthographic variants behave similarly, each over-segmentation candidate is checked against its counterparts. Experiments show that the proposed method improves the recall of detection and contributes to improving unknown morpheme acquisition.

pdf bib
Semantic Classification of Automatically Acquired Nouns using Lexico-Syntactic Clues
Yugo Murawaki | Sadao Kurohashi
Coling 2010: Posters

2008

pdf bib
Online Acquisition of Japanese Unknown Morphemes using Morphological Constraints
Yugo Murawaki | Sadao Kurohashi
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing