|
TITLE |
Topic Models, Latent Space Models, Sparse Coding, and All That: A
systematic understanding of probabilistic semantic extraction in
large corpora
|
PRESENTERS |
Eric Xing |
ABSTRACT |
Probabilistic topic models have recently gained much popularity in
information retrieval and related areas. Via such models, one can
project high-dimensional objects such as text documents into a
low-dimensional space where their latent semantics are captured and
modeled; one can integrate multiple sources of information, "sharing
statistical strength" among the components of a hierarchical
probabilistic model; and one can structurally display and classify
otherwise unstructured object collections. However, to many
practitioners, several questions remain unclear: how topic models
work; what to expect, and not to expect, from a topic model; how topic
models differ from and relate to classical matrix-algebraic techniques
in NLP such as LSI and NMF; how to empower topic models to deal, in a
principled way, with complex scenarios such as multimodal data,
conversational text in social media, evolving corpora, or the presence
of supervision such as labels and ratings; and how to make topic
modeling computationally tractable even on web-scale data. In this
tutorial, I will demystify the conceptual, mathematical, and
computational issues behind all such problems surrounding topic models
and their applications by presenting a systematic overview of the
mathematical foundations of topic modeling and its connections to a
number of related methods popular in other fields, such as LDA,
admixture models, mixed membership models, latent space models, and
sparse coding. I will offer a simple and unifying view of all these
techniques under the framework of multi-view latent space embedding,
and outline a roadmap of model extensions and algorithmic designs
toward different applications in IR and NLP. A main theme of this
tutorial that ties together this wide range of issues and problems
builds on the "probabilistic graphical model" formalism, a formalism
that exploits the conjoined talents of graph theory and probability
theory to build complex models out of simpler pieces. I will use this
formalism as the main aid to discuss both the mathematical
underpinnings of the models and the related computational issues in a
unified, simple, transparent, and actionable fashion. |
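
As a concrete illustration of the latent space projection described in
the abstract, the following is a minimal sketch of fitting a topic model
and reading off each document's low-dimensional topic mixture, assuming
the gensim library is available; the toy corpus and the choice of two
topics are hypothetical and for illustration only.

    # Minimal sketch: project documents into a low-dimensional topic
    # space with LDA via gensim (assumed installed). The toy corpus and
    # the num_topics=2 setting are hypothetical illustrations.
    from gensim import corpora, models

    texts = [["topic", "model", "corpus", "inference"],
             ["image", "pixel", "segmentation", "vision"],
             ["topic", "model", "image", "caption"]]

    dictionary = corpora.Dictionary(texts)            # word <-> id map
    bows = [dictionary.doc2bow(t) for t in texts]     # sparse word counts

    lda = models.LdaModel(bows, num_topics=2, id2word=dictionary,
                          passes=10, random_state=0)

    # Each document is now a point on a 2-dimensional topic simplex:
    for bow in bows:
        print(lda.get_document_topics(bow, minimum_probability=0.0))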
OUTLINE |
1. Overview of topic models
1.1: From LSI to pLSI to LDA: a historical account
1.2: Related models in other domains
1.3: On latent space embedding
2. Computational Challenges and two classical algorithmic paths
2.1: Variational inference and variants
2.2: Basic MCMC (see the collapsed Gibbs sketch after this outline)
3. Scenario I: Multimodal data
3.1: text and photos
3.2: text and social networks
3.3: on general multi-view modeling and data integration
4. Scenario II: when supervision is available
4.1: discrete labels: text categories
4.2: continuous scores: user rating
4.3: on general supervised topic models: a matter of choosing the
loss function and model constraints
5. Scenario III: what if I don't know the total number of topics, and the number changes?
5.1: Dirichlet Process and Hierarchical DP for infinite topic models
5.2: Nested CRP for infinite topic hierarchy
6. Scenario IV: Topic evolution in streaming corpora
6.1: Dynamic topic models for streaming text
6.2: Story-line detection with dynamic infinite hierarchical topic
models: a unified approach
7. Advanced subjects: Sparsity in topic modeling (Optional)
7.1: needs and challenges
7.2: Probabilistic approaches: sparse additive generative models
7.3: Optimization and sparse coding approach: sparse topic coding
8. Scalability, Complexity, and Fast Algorithms
8.1: complexity and learnability of topic models
8.2: online inference and learning
8.3: distributed inference and learning
9. Other applications (Optional)
9.1: Machine translation
9.2: Image segmentation
9.3: Genetic ancestry
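
To make outline item 2.2 concrete, here is a compact sketch of the
collapsed Gibbs sampler for LDA in plain numpy; the toy corpus, the
hyperparameters alpha and beta, and the iteration count are
illustrative assumptions rather than recommendations from the tutorial.

    # Collapsed Gibbs sampling for LDA: resample each token's topic
    # from p(z=k | rest) with that token's counts removed.
    import numpy as np

    def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
        rng = np.random.default_rng(seed)
        D = len(docs)
        ndk = np.zeros((D, K))   # doc-topic counts
        nkw = np.zeros((K, V))   # topic-word counts
        nk = np.zeros(K)         # topic totals
        z = []                   # current topic assignment per token
        for d, doc in enumerate(docs):
            zd = rng.integers(K, size=len(doc))
            z.append(zd)
            for w, k in zip(doc, zd):
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
        for _ in range(iters):
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    k = z[d][i]   # remove this token's assignment
                    ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                    # p(z=k|rest) ~ (ndk+alpha)*(nkw+beta)/(nk+V*beta)
                    p = (ndk[d] + alpha) * (nkw[:, w] + beta) \
                        / (nk + V * beta)
                    k = rng.choice(K, p=p / p.sum())
                    z[d][i] = k   # resample and restore the counts
                    ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
        # posterior mean estimates of topics and document mixtures
        phi = (nkw + beta) / (nk[:, None] + V * beta)
        theta = (ndk + alpha) / (ndk.sum(1, keepdims=True) + K * alpha)
        return phi, theta

    # toy usage: three documents over a five-word vocabulary
    docs = [[0, 1, 0, 2], [3, 4, 3], [0, 3, 1, 4]]
    phi, theta = gibbs_lda(docs, V=5, K=2)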
|
PRESENTER BIOS |
Dr. Eric Xing is an associate professor in the School of Computer
Science at Carnegie Mellon University. His principal research
interests lie in the development of machine learning and statistical
methodology; especially for solving problems involving automated
learning, reasoning, and decision-making in high-dimensional,
multimodal, and dynamic possible worlds in social and biological
systems. Professor Xing received a Ph.D. in Molecular Biology from
Rutgers University, and another Ph.D. in Computer Science from UC
Berkeley. His current work involves: 1) foundations of statistical
learning, including theory and algorithms for estimating time/space
varying-coefficient models, sparse structured input/output models, and
nonparametric Bayesian models; 2) computational and statistical
analysis of gene regulation, genetic variation, and disease
associations; and 3) large-scale information and intelligent systems in
social networks, computer vision, and natural language
processing. Professor Xing has published over 150 peer-reviewed
papers, and is an associate editor of the Annals of Applied
Statistics, the IEEE Transactions on Pattern Analysis and Machine
Intelligence (PAMI), and PLoS Computational Biology, an
Action Editor of the Machine Learning journal, and a member of the
DARPA Information Science and Technology (ISAT) Advisory Group. He is
a recipient of the NSF CAREER Award, the Alfred P. Sloan Research
Fellowship in Computer Science, the IBM Collaborative Research Award,
the United States Air Force Young Investigator Award, and best paper
awards in a number of premier conferences including UAI, ACL, SDM, and
ISMB. |