Tu Nguyen


2024

Noise Contrastive Estimation-based Matching Framework for Low-Resource Security Attack Pattern Recognition
Tu Nguyen | Nedim Šrndić | Alexander Neth
Findings of the Association for Computational Linguistics: EACL 2024

Techniques, Tactics and Procedures (TTP) mapping is an important and difficult task in the application of cyber threat intelligence (CTI) extraction for threat reports. TTPs are typically expressed in semantic forms within security knowledge bases like MITRE ATT&CK, serving as textual high-level descriptions for sophisticated attack patterns. Conversely, attacks in CTI threat reports are detailed in a combination of natural and technical language forms, presenting a significant challenge even for security experts to establish correlations or mappings with the corresponding TTPs. Conventional learning approaches often target the TTP mapping problem in the classical multiclass/label classification setting. This setting hinders the learning capabilities of the model, due to the large number of classes (i.e., TTPs), the inevitable skewness of the label distribution and the complex hierarchical structure of the label space. In this work, we approach the problem in a different learning paradigm, such that the assignment of a text to a TTP label is essentially decided by the direct semantic similarity between the two, thus reducing the complexity of competing solely over the large labeling space. To that end, we propose a neural matching architecture that incorporates a sampling-based learn-to-compare mechanism, facilitating the learning process of the matching model despite constrained resources.
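As a rough illustration of the idea in this abstract (assigning a label by direct similarity, and learning by comparing a true pair against sampled negatives), the following is a minimal sketch and not the authors' architecture. The toy vectors, the label names, the `match_label` helper and the softmax-style `contrastive_loss` are all assumptions made for the example; the paper's actual encoders and loss may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def match_label(text_vec, label_vecs):
    """Return the label whose vector is most similar to text_vec."""
    return max(label_vecs, key=lambda name: cosine(text_vec, label_vecs[name]))

def contrastive_loss(pos_score, neg_scores):
    """Softmax-style contrastive loss over one positive and sampled negatives.

    Assumption: a generic learn-to-compare objective, used here only to
    illustrate the idea of comparing against a few sampled labels instead
    of competing over the full label space.
    """
    denom = math.exp(pos_score) + sum(math.exp(s) for s in neg_scores)
    return -math.log(math.exp(pos_score) / denom)

# Hypothetical hand-made vectors; a real system would use learned encoders.
label_vecs = {
    "phishing": [1.0, 0.0, 0.0],
    "credential_dumping": [0.0, 1.0, 0.0],
    "data_exfiltration": [0.0, 0.0, 1.0],
}
text_vec = [0.1, 0.9, 0.2]

best = match_label(text_vec, label_vecs)

# Loss for the true label against the two other labels used as negatives.
pos = cosine(text_vec, label_vecs["credential_dumping"])
negs = [cosine(text_vec, label_vecs[n]) for n in ("phishing", "data_exfiltration")]
loss = contrastive_loss(pos, negs)
```

In this sketch the loss shrinks as the true pair scores higher than the sampled negatives, which is the behaviour a matching model would be trained toward.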

2023

Generative Spoken Language Model based on continuous word-sized audio tokens
Robin Algayres | Yossi Adi | Tu Nguyen | Jade Copet | Gabriel Synnaeve | Benoît Sagot | Emmanuel Dupoux
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard input of spoken LMs is 20ms or 40ms-long discrete units (shorter than a phoneme). Taking inspiration from word-based LMs, we introduce a Generative Spoken Language Model (GSLM) based on word-size continuous-valued audio tokens that can generate diverse and expressive language output. This is obtained by replacing the lookup table for lexical types with a Lexical Embedding function, the cross entropy loss by a contrastive loss, and multinomial sampling by k-NN sampling. The resulting model is the first generative language model based on word-size continuous tokens. Its performance is on par with discrete unit GSLMs regarding generation quality as measured by automatic metrics and subjective human judgements. Moreover, it is five times more memory efficient thanks to its large 200ms units. In addition, the embeddings before and after the Lexical Embedder are phonetically and semantically interpretable.
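To make the k-NN sampling step mentioned in this abstract concrete, here is a minimal sketch under stated assumptions: the model is taken to output a continuous vector, the candidate tokens live in a small hypothetical `bank` of embeddings, and one of the k nearest candidates is chosen uniformly at random. The uniform choice, the Euclidean distance and all names are illustrative and not taken from the paper.

```python
import math
import random

def knn_sample(query, bank, k, rng):
    """Pick one of the k bank entries closest to query, uniformly at random."""
    # Rank candidate tokens by Euclidean distance to the predicted vector.
    ranked = sorted(bank, key=lambda name: math.dist(query, bank[name]))
    return rng.choice(ranked[:k])

# Hypothetical 2-d embeddings standing in for word-sized audio tokens.
bank = {
    "tok_a": [0.0, 0.0],
    "tok_b": [0.1, 0.1],
    "tok_c": [5.0, 5.0],
    "tok_d": [9.0, 9.0],
}
predicted = [0.05, 0.0]          # continuous output of a hypothetical model
rng = random.Random(0)           # seeded for repeatability
sampled = knn_sample(predicted, bank, k=2, rng=rng)
```

With k=1 this reduces to a deterministic nearest-neighbour lookup; larger k trades fidelity to the prediction for diversity in the generated sequence.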

2020

A Relational Memory-based Embedding Model for Triple Classification and Search Personalization
Dai Quoc Nguyen | Tu Nguyen | Dinh Phung
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Knowledge graph embedding methods often suffer from a limitation of memorizing valid triples to predict new ones for triple classification and search personalization problems. To this end, we introduce a novel embedding model, named R-MeN, that explores a relational memory network to encode potential dependencies in relationship triples. R-MeN considers each triple as a sequence of 3 input vectors that recurrently interact with a memory using a transformer self-attention mechanism. Thus R-MeN encodes new information from interactions between the memory and each input vector to return a corresponding vector. Consequently, R-MeN feeds these 3 returned vectors to a convolutional neural network-based decoder to produce a scalar score for the triple. Experimental results show that our proposed R-MeN obtains state-of-the-art results on SEARCH17 for the search personalization task, and on WN11 and FB13 for the triple classification task.

2018

A Trio Neural Model for Dynamic Entity Relatedness Ranking
Tu Nguyen | Tuan Tran | Wolfgang Nejdl
Proceedings of the 22nd Conference on Computational Natural Language Learning

Measuring entity relatedness is a fundamental task for many natural language processing and information retrieval applications. Prior work often studies entity relatedness in a static setting and in an unsupervised manner. However, entities in the real world are often involved in many different relationships; consequently, entity relations are highly dynamic over time. In this work, we propose a neural network-based approach that leverages public attention as supervision. Our model is capable of learning rich and different entity representations in a joint framework. Through extensive experiments on large-scale datasets, we demonstrate that our method achieves better results than competitive baselines.