Buzhou Tang


2019

HITSZ-ICRC: A Report for SMM4H Shared Task 2019-Automatic Classification and Extraction of Adverse Effect Mentions in Tweets
Shuai Chen | Yuanhang Huang | Xiaowei Huang | Haoming Qin | Jun Yan | Buzhou Tang
Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task

This is the system description of the Harbin Institute of Technology Shenzhen (HITSZ) team for the first and second subtasks of the fourth Social Media Mining for Health Applications (SMM4H) shared task in 2019. The two subtasks are automatic classification and extraction of adverse effect mentions in tweets. The systems for the two subtasks are based on bidirectional encoder representations from transformers (BERT) and achieve promising results. Among the systems we developed for subtask 1, the best F1-score was 0.6457; for subtask 2, the best relaxed F1-score and the best strict F1-score were 0.614 and 0.407, respectively. Our system ranks first among all systems on subtask 1.
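The abstract reports both a relaxed and a strict F1-score for the extraction subtask without defining them. The sketch below illustrates one common reading, and it is an assumption rather than the shared task's official scorer: a strict match requires identical span boundaries, while a relaxed match only requires the predicted and gold spans to overlap. The function name `span_f1` and the `(start, end)` offset format are hypothetical.

```python
def span_f1(gold, pred, relaxed=False):
    """F1 over spans given as (start, end) character offsets, end exclusive.

    Assumption: strict = identical boundaries, relaxed = any overlap.
    This is an illustration, not the official SMM4H evaluation script.
    """
    def match(g, p):
        if relaxed:
            return g[0] < p[1] and p[0] < g[1]  # spans share at least one character
        return g == p

    # Predicted spans that hit some gold span (for precision).
    hit_pred = sum(any(match(g, p) for g in gold) for p in pred)
    # Gold spans that are hit by some predicted span (for recall).
    hit_gold = sum(any(match(g, p) for p in pred) for g in gold)

    precision = hit_pred / len(pred) if pred else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold_spans = [(0, 5), (10, 20)]
pred_spans = [(0, 5), (12, 18), (30, 35)]
strict_score = span_f1(gold_spans, pred_spans)
relaxed_score = span_f1(gold_spans, pred_spans, relaxed=True)
```

Under this reading the relaxed score can never be lower than the strict one, which is consistent with the gap between the two figures reported above.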

2018

LCQMC: A Large-scale Chinese Question Matching Corpus
Xin Liu | Qingcai Chen | Chong Deng | Huajun Zeng | Jing Chen | Dongfang Li | Buzhou Tang
Proceedings of the 27th International Conference on Computational Linguistics

The lack of large-scale question matching corpora greatly limits the development of matching methods in question answering (QA) systems, especially for non-English languages. To ameliorate this situation, in this paper, we introduce a large-scale Chinese question matching corpus (named LCQMC), which is released to the public. LCQMC is more general than a paraphrase corpus as it focuses on intent matching rather than paraphrase. The key point in constructing such a corpus is how to collect a large number of question pairs in variant linguistic forms that may express the same intent. In this paper, we first use a search engine to collect large-scale question pairs related to high-frequency words from various domains, then filter irrelevant pairs by the Wasserstein distance, and finally recruit three annotators to manually check the remaining pairs. After this process, a question matching corpus that contains 260,068 question pairs is constructed. In order to verify the LCQMC corpus, we split it into three parts, i.e., a training set containing 238,766 question pairs, a development set with 8,802 question pairs, and a test set with 12,500 question pairs, and test several well-known sentence matching methods on it. The experimental results not only demonstrate the good quality of LCQMC but also provide solid baseline performance for further research on this corpus.

The BQ Corpus: A Large-scale Domain-specific Chinese Corpus For Sentence Semantic Equivalence Identification
Jing Chen | Qingcai Chen | Xin Liu | Haijun Yang | Daohe Lu | Buzhou Tang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

This paper introduces the Bank Question (BQ) corpus, a Chinese corpus for sentence semantic equivalence identification (SSEI). The BQ corpus contains 120,000 question pairs from 1-year online bank customer service logs. To efficiently process and annotate questions from such a large volume of logs, this paper proposes a clustering-based annotation method to obtain questions with the same intent. First, the deduplicated questions with the same answer are clustered into stacks by the Word Mover’s Distance (WMD) based Affinity Propagation (AP) algorithm. Then, the annotators are asked to assign the clustered questions into different intent categories. Finally, the positive and negative question pairs for SSEI are selected in the same intent category and between different intent categories, respectively. We also present the benchmark performance of six SSEI methods on our corpus, including state-of-the-art algorithms. As the largest manually annotated public Chinese SSEI corpus in the bank domain, the BQ corpus is not only useful for Chinese question semantic matching research, but also a significant resource for cross-lingual and cross-domain SSEI research. The corpus is publicly available.
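The final step described above, selecting positive pairs within an intent category and negative pairs across categories, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the function name `build_pairs`, the dictionary input format, and the example questions are all hypothetical, and the sketch enumerates every possible pair, whereas the abstract only says that pairs are "selected".

```python
from itertools import combinations


def build_pairs(categories):
    """Build labeled question pairs from annotated intent categories.

    categories: dict mapping an intent label to a list of questions
    (a hypothetical format chosen for this sketch).
    Returns (positives, negatives) as lists of (question_a, question_b, label).
    """
    positives = []
    for questions in categories.values():
        # Two questions in the same intent category form a positive pair.
        positives.extend((a, b, 1) for a, b in combinations(questions, 2))

    negatives = []
    for (_, qs_a), (_, qs_b) in combinations(categories.items(), 2):
        # Questions from two different intent categories form negative pairs.
        negatives.extend((a, b, 0) for a in qs_a for b in qs_b)

    return positives, negatives


example = {
    "reset_password": ["How do I reset my password?", "I forgot my password"],
    "card_fee": ["What is the annual card fee?"],
}
positives, negatives = build_pairs(example)
```

Exhaustive enumeration grows quadratically with category size, so a corpus of this scale would presumably need sampling on top of this; the abstract does not say how the selection was done.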

2017

Investigating Different Syntactic Context Types and Context Representations for Learning Word Embeddings
Bofang Li | Tao Liu | Zhe Zhao | Buzhou Tang | Aleksandr Drozd | Anna Rogers | Xiaoyong Du
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

The number of word embedding models is growing every year. Most of them are based on the co-occurrence information of words and their contexts. However, it is still an open question what the best definition of context is. We provide a systematic investigation of 4 different syntactic context types and context representations for learning word embeddings. Comprehensive experiments are conducted to evaluate their effectiveness on 6 extrinsic and intrinsic tasks. We hope that this paper, along with the published code, will be helpful for choosing the best context type and representation for a given task.

2016

Incorporating Label Dependency for Answer Quality Tagging in Community Question Answering via CNN-LSTM-CRF
Yang Xiang | Xiaoqiang Zhou | Qingcai Chen | Zhihui Zheng | Buzhou Tang | Xiaolong Wang | Yang Qin
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In community question answering (cQA), the quality of answers is determined by the matching degree between question-answer pairs and the correlation among the answers. In this paper, we show that the dependency between the answer quality labels also plays a pivotal role. To validate the effectiveness of label dependency, we propose two neural network-based models, with different combination modes of Convolutional Neural Networks, Long Short Term Memory and Conditional Random Fields. Extensive experiments are conducted on the dataset released by the SemEval-2015 cQA shared task. The first model is a stacked ensemble of the networks. It achieves 58.96% on macro averaged F1, which improves on the state-of-the-art neural network-based method by 2.82% and outperforms the Top-1 system in the shared task by 1.77%. The second is a simple attention-based model whose input is the connection of the question and its corresponding answers. It produces promising results with 58.29% on overall F1 and gains the best performance on the Good and Bad categories.

2015

Answer Sequence Learning with Neural Networks for Answer Selection in Community Question Answering
Xiaoqiang Zhou | Baotian Hu | Qingcai Chen | Buzhou Tang | Xiaolong Wang
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2014

UTH_CCB: A report for SemEval 2014 – Task 7 Analysis of Clinical Text
Yaoyun Zhang | Jingqi Wang | Buzhou Tang | Yonghui Wu | Min Jiang | Yukun Chen | Hua Xu
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

2010

A Cascade Method for Detecting Hedges and their Scope in Natural Language Text
Buzhou Tang | Xiaolong Wang | Xuan Wang | Bo Yuan | Shixi Fan
Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task

2009

A Joint Syntactic and Semantic Dependency Parsing System based on Maximum Entropy Models
Buzhou Tang | Lu Li | Xinxin Li | Xuan Wang | Xiaolong Wang
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task

2008

Chunking with Max-Margin Markov Networks
Buzhou Tang | Xuan Wang | Xiaolong Wang
Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation