This book focuses on methods that are unsupervised or require minimal
supervision, which is vital in the low-resource domain. Over the past few years,
rapid growth in Internet access across the globe has resulted in an
explosion in user-generated text content on social media platforms. This
effect is especially pronounced in linguistically diverse areas of
the world like South Asia, where over 400 million people regularly
access social media platforms. YouTube, Facebook, and Twitter report a
monthly active user base in excess of 200 million from this region.
Natural language processing (NLP) research and publicly available
resources such as models and corpora prioritize Web content authored
primarily by a Western user base. Such content is authored in English by
a user base fluent in the language and can be processed by a broad range
of off-the-shelf NLP tools. In contrast, text from linguistically
diverse regions features high levels of multilinguality, code-switching,
and varied language skill levels. Resources like corpora and models are
also scarce. Because of these factors, new methods are needed to process
such text.
This book is designed for NLP practitioners well versed in recent
advances in the field but unfamiliar with the landscape of low-resource
multilingual NLP. It introduces the various challenges associated with
social media content, quantifies these issues, and provides solutions
and the intuition behind them. When possible, the methods
discussed are evaluated on real-world social media data sets to
emphasize their robustness to the noisy nature of the social media
environment.
On completion of the book, the reader will be well versed in the
complexity of text mining in multilingual, low-resource environments;
will be aware of a broad set of off-the-shelf tools that can be applied
to various problems; and will be able to conduct sophisticated analyses
of such text.