University of Pittsburgh

Leveraging Human Knowledge for Better Statistical Generalization

Assistant Professor
Date: Friday, February 16, 2018 - 12:30pm - 1:30pm

Although we live in the era of big data, when billions or trillions of words are easy to obtain from the web, there are many scenarios in which we cannot afford to be too data hungry. For example, even in a billion-word corpus, there is a long tail of rare and out-of-vocabulary words. Next, language is not always paired with correlated events: corpora contain what people said, but not what they meant, how they understood things, or what they did in response to the language. Finally, the vast majority of the world's languages barely exist on the web at all. I'll present model-based approaches that incorporate prior knowledge in novel ways to alleviate the problem of missing or skewed data. I'll show (1) how neural language models can benefit from cross-linguistic knowledge; (2) how insight into linguistic coherence, prototypicality, simplicity, and diversity of data helps improve learning in non-convex NLP models; and (3) how knowledge about a speaker can be used for domain and style transfer. I'll conclude with an overview of ongoing research projects.

Bio: Yulia Tsvetkov is an assistant professor in the Language Technologies Institute at Carnegie Mellon University. Her research interests lie at or near the intersection of machine learning, natural language processing, social science, and linguistics. Her current research projects focus on language technologies for social good, including advancing NLP technologies for resource-poor languages spoken by millions of people, developing approaches to promote civility in communication (e.g., modeling gender bias in texts and debiasing), and identifying strategies that undermine the democratic process (e.g., political framing and agenda-setting in digital media). Prior to joining CMU, Yulia was a postdoc in the Stanford NLP Group; she received her PhD from Carnegie Mellon University.
