Xiaonan Li


2023

pdf bib
MoT: Memory-of-Thought Enables ChatGPT to Self-Improve
Xiaonan Li | Xipeng Qiu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) have shown impressive abilities on various tasks. However, fundamentally improving them depends on high-quality datasets or computationally expensive fine-tuning. On the contrary, humans can easily improve themselves by self-thinking and memory, without external resources. In this paper, we propose a framework, **MoT**, to let the LLM self-improve through **M**emory **o**f **T**houghts, without annotated datasets and parameter updates. Specifically, MoT is divided into two stages: 1. before the test stage, the LLM pre-thinks on the unlabeled dataset and saves the high-confidence thoughts as external memory; 2. During the test stage, given a test question, the LLM recalls relevant memory to help itself reason and answer it. Experimental results show that MoT can help ChatGPT significantly improve its abilities in arithmetic reasoning, commonsense reasoning, factual reasoning, and natural language inference. Further analyses show that each component contributes critically to the improvements and MoT can lead to consistent improvements across various CoT methods and LLMs.

pdf bib
UTC-IE: A Unified Token-pair Classification Architecture for Information Extraction
Hang Yan | Yu Sun | Xiaonan Li | Yunhua Zhou | Xuanjing Huang | Xipeng Qiu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Information Extraction (IE) spans several tasks with different output structures, such as named entity recognition, relation extraction and event extraction. Previously, those tasks were solved with different models because of diverse task output structures. Through re-examining IE tasks, we find that all of them can be interpreted as extracting spans and span relations. They can further be decomposed into token-pair classification tasks by using the start and end token of a span to pinpoint the span, and using the start-to-start and end-to-end token pairs of two spans to determine the relation. Based on the reformulation, we propose a Unified Token-pair Classification architecture for Information Extraction (UTC-IE), where we introduce Plusformer on top of the token-pair feature matrix. Specifically, it models axis-aware interaction with plus-shaped self-attention and local interaction with Convolutional Neural Network over token pairs. Experiments show that our approach outperforms task-specific and unified models on all tasks in 10 datasets, and achieves better or comparable results on 2 joint IE datasets. Moreover, UTC-IE speeds up over state-of-the-art models on IE tasks significantly in most datasets, which verifies the effectiveness of our architecture.

pdf bib
Unified Demonstration Retriever for In-Context Learning
Xiaonan Li | Kai Lv | Hang Yan | Tianyang Lin | Wei Zhu | Yuan Ni | Guotong Xie | Xiaoling Wang | Xipeng Qiu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In-context learning is a new learning paradigm where a language model conditions on a few input-output pairs (demonstrations) and a test input, and directly outputs the prediction. It has been shown sensitive to the provided demonstrations and thus promotes the research of demonstration retrieval: given a test input, relevant examples are retrieved from the training set to serve as informative demonstrations for in-context learning. While previous works train task-specific retrievers for several tasks separately, these methods are hard to transfer and scale on various tasks, and separately trained retrievers will cause a lot of parameter storage and deployment cost. In this paper, we propose Unified Demonstration Retriever (UDR), a single model to retrieve demonstrations for a wide range of tasks. To train UDR, we cast various tasks’ training signals into a unified list-wise ranking formulation by language model’s feedback. Then we propose a multi-task list-wise ranking training framework with an iterative mining strategy to find high-quality candidates, which can help UDR fully incorporate various tasks’ signals. Experiments on 30+ tasks across 13 task families and multiple data domains show that UDR significantly outperforms baselines. Further analyses show the effectiveness of each proposed component and UDR’s strong ability in various scenarios including different LMs (1.3B 175B), unseen datasets, varying demonstration quantities, etc. We will release the code and model checkpoint after review.

pdf bib
An Embarrassingly Easy but Strong Baseline for Nested Named Entity Recognition
Hang Yan | Yu Sun | Xiaonan Li | Xipeng Qiu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Named entity recognition (NER) is the task to detect and classify entity spans in the text. When entity spans overlap between each other, the task is named as nested NER. Span-based methods have been widely used to tackle nested NER. Most of these methods get a score matrix, where each entry corresponds to a span. However, previous work ignores spatial relations in the score matrix. In this paper, we propose using Convolutional Neural Network (CNN) to model these spatial relations. Despite being simple, experiments in three commonly used nested NER datasets show that our model surpasses several recently proposed methods with the same pre-trained encoders. Further analysis shows that using CNN can help the model find more nested entities. Besides, we find that different papers use different sentence tokenizations for the three nested NER datasets, which will influence the comparison. Thus, we release a pre-processing script to facilitate future comparison.

pdf bib
Finding Support Examples for In-Context Learning
Xiaonan Li | Xipeng Qiu
Findings of the Association for Computational Linguistics: EMNLP 2023

In-context learning is a new learning paradigm where a language model observes a few examples and directly outputs the test input’s prediction. Previous works have shown that it is sensitive to the provided examples and randomly sampled examples probably cause inferior performance. In this paper, we propose finding “support examples” for in-context learning: Given a training dataset, it aims to select one permutation of a few examples, which can well characterize the task for in-context learning and thus lead to superior performance. Although for traditional gradient-based training, there are extensive methods to find a coreset from the entire dataset, they struggle to find important in-context examples, because in-context learning occurs in the language model’s forward process without gradients or parameter updates and thus has a significant gap with traditional training. Additionally, the strong dependence among in-context examples makes it an NP-hard combinatorial optimization problem and enumerating all permutations is infeasible. Hence we propose **LENS**, a fi**L**ter-th**EN**-**S**earch method to tackle this challenge in two stages: irst we filter the dataset to obtain individually informative in-context examples. Specifically, we propose a novel metric, InfoScore, to evaluate the example’s in-context informativeness based on the language model’s feedback, and further propose a progressive filtering process to filter out uninformative examples. Then we propose diversity-guided example search which iteratively refines and evaluates the selected example permutations, to find examples that fully depict the task. The experimental results show that LENS significantly outperforms a wide range of baselines and further analyses show that each component contribute critically to the improvements and shed light on the principles of supporting examples and in-context learning.

2022

pdf bib
CodeRetriever: A Large Scale Contrastive Pre-Training Method for Code Search
Xiaonan Li | Yeyun Gong | Yelong Shen | Xipeng Qiu | Hang Zhang | Bolun Yao | Weizhen Qi | Daxin Jiang | Weizhu Chen | Nan Duan
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

In this paper, we propose the CodeRetriever model, which learns the function-level code semantic representations through large-scale code-text contrastive pre-training. We adopt two contrastive learning schemes in CodeRetriever: unimodal contrastive learning and bimodal contrastive learning. For unimodal contrastive learning, we design an unsupervised learning approach to build semantic-related code pairs based on the documentation and function name. For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build code-text pairs. Both contrastive objectives can fully leverage large-scale code corpus for pre-training. Extensive experimental results show that CodeRetriever achieves new state-of-the-art with significant improvement over existing code pre-trained models, on eleven domain/language-specific code search tasks with six programming languages in different code granularity (function-level, snippet-level and statement-level).These results demonstrate the effectiveness and robustness of CodeRetriever.The codes and resources are available at https://github.com/microsoft/AR2/tree/main/CodeRetriever.

pdf bib
Soft-Labeled Contrastive Pre-Training for Function-Level Code Representation
Xiaonan Li | Daya Guo | Yeyun Gong | Yun Lin | Yelong Shen | Xipeng Qiu | Daxin Jiang | Weizhu Chen | Nan Duan
Findings of the Association for Computational Linguistics: EMNLP 2022

Code contrastive pre-training has recently achieved significant progress on code-related tasks. In this paper, we present SCodeR, a Soft-labeled contrastive pre-training framework with two positive sample construction methods to learn functional-level Code Representation. Considering the relevance between codes in a large-scale code corpus, the soft-labeled contrastive pre-training can obtain fine-grained soft-labels through an iterative adversarial manner and use them to learn better code representation. The positive sample construction is another key for contrastive pre-training. Previous works use transformation-based methods like variable renaming to generate semantically equal positive codes. However, they usually result in the generated code with a highly similar surface form, and thus mislead the model to focus on superficial code structure instead of code semantics. To encourage SCodeR to capture semantic information from the code, we utilize code comments and abstract syntax sub-trees of the code to build positive samples. We conduct experiments on four code-related tasks over seven datasets. Extensive experimental results show that SCodeR achieves new state-of-the-art performance on all of them, which illustrates the effectiveness of the proposed pre-training method.

2021

pdf bib
Accelerating BERT Inference for Sequence Labeling via Early-Exit
Xiaonan Li | Yunfan Shao | Tianxiang Sun | Hang Yan | Xipeng Qiu | Xuanjing Huang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Both performance and efficiency are crucial factors for sequence labeling tasks in many real-world scenarios. Although the pre-trained models (PTMs) have significantly improved the performance of various sequence labeling tasks, their computational cost is expensive. To alleviate this problem, we extend the recent successful early-exit mechanism to accelerate the inference of PTMs for sequence labeling tasks. However, existing early-exit mechanisms are specifically designed for sequence-level tasks, rather than sequence labeling. In this paper, we first propose a simple extension of sentence-level early-exit for sequence labeling tasks. To further reduce the computational cost, we also propose a token-level early-exit mechanism that allows partial tokens to exit early at different layers. Considering the local dependency inherent in sequence labeling, we employed a window-based criterion to decide for a token whether or not to exit. The token-level early-exit brings the gap between training and inference, so we introduce an extra self-sampling fine-tuning stage to alleviate it. The extensive experiments on three popular sequence labeling tasks show that our approach can save up to 66%∼75% inference cost with minimal performance degradation. Compared with competitive compressed models such as DistilBERT, our approach can achieve better performance under the same speed-up ratios of 2×, 3×, and 4×.

pdf bib
Backdoor Attacks on Pre-trained Models by Layerwise Weight Poisoning
Linyang Li | Demin Song | Xiaonan Li | Jiehang Zeng | Ruotian Ma | Xipeng Qiu
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Pre-Trained Models have been widely applied and recently proved vulnerable under backdoor attacks: the released pre-trained weights can be maliciously poisoned with certain triggers. When the triggers are activated, even the fine-tuned model will predict pre-defined labels, causing a security threat. These backdoors generated by the poisoning methods can be erased by changing hyper-parameters during fine-tuning or detected by finding the triggers. In this paper, we propose a stronger weight-poisoning attack method that introduces a layerwise weight poisoning strategy to plant deeper backdoors; we also introduce a combinatorial trigger that cannot be easily detected. The experiments on text classification tasks show that previous defense methods cannot resist our weight-poisoning method, which indicates that our method can be widely applied and may provide hints for future model robustness studies.

2020

pdf bib
FLAT: Chinese NER Using Flat-Lattice Transformer
Xiaonan Li | Hang Yan | Xipeng Qiu | Xuanjing Huang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Recently, the character-word lattice structure has been proved to be effective for Chinese named entity recognition (NER) by incorporating the word information. However, since the lattice structure is complex and dynamic, the lattice-based models are hard to fully utilize the parallel computation of GPUs and usually have a low inference speed. In this paper, we propose FLAT: Flat-LAttice Transformer for Chinese NER, which converts the lattice structure into a flat structure consisting of spans. Each span corresponds to a character or latent word and its position in the original lattice. With the power of Transformer and well-designed position encoding, FLAT can fully leverage the lattice information and has an excellent parallel ability. Experiments on four datasets show FLAT outperforms other lexicon-based models in performance and efficiency.

pdf bib
BERT for Monolingual and Cross-Lingual Reverse Dictionary
Hang Yan | Xiaonan Li | Xipeng Qiu | Bocao Deng
Findings of the Association for Computational Linguistics: EMNLP 2020

Reverse dictionary is the task to find the proper target word given the word description. In this paper, we tried to incorporate BERT into this task. However, since BERT is based on the byte-pair-encoding (BPE) subword encoding, it is nontrivial to make BERT generate a word given the description. We propose a simple but effective method to make BERT generate the target word for this specific task. Besides, the cross-lingual reverse dictionary is the task to find the proper target word described in another language. Previous models have to keep two different word embeddings and learn to align these embeddings. Nevertheless, by using the Multilingual BERT (mBERT), we can efficiently conduct the cross-lingual reverse dictionary with one subword embedding, and the alignment between languages is not necessary. More importantly, mBERT can achieve remarkable cross-lingual reverse dictionary performance even without the parallel corpus, which means it can conduct the cross-lingual reverse dictionary with only corresponding monolingual data. Code is publicly available at https://github.com/yhcc/BertForRD.git.

2019

pdf bib
Does it Make Sense? And Why? A Pilot Study for Sense Making and Explanation
Cunxiang Wang | Shuailong Liang | Yue Zhang | Xiaonan Li | Tian Gao
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Introducing common sense to natural language understanding systems has received increasing research attention. It remains a fundamental question on how to evaluate whether a system has the sense-making capability. Existing benchmarks measure common sense knowledge indirectly or without reasoning. In this paper, we release a benchmark to directly test whether a system can differentiate natural language statements that make sense from those that do not make sense. In addition, a system is asked to identify the most crucial reason why a statement does not make sense. We evaluate models trained over large-scale language modeling tasks as well as human performance, showing that there are different challenges for system sense-making.