This textbook explains SQL within the context of data science and
introduces the different parts of SQL as they are needed for the tasks
usually carried out during data analysis. Using the framework of the
data life cycle, it focuses on the steps that are often given short
shrift in traditional textbooks, such as data loading, cleaning, and
pre-processing.
The book is organized as follows. Chapter 1 describes the data life
cycle, i.e., the sequence of stages, from acquisition to archiving,
that data goes through as it is prepared and then analyzed, together
with the activities that take place at each stage.
Chapter 2 gets into databases proper, explaining how relational
databases organize data. Non-traditional data, like XML and text, are
also covered. Chapter 3 introduces SQL queries but, unlike traditional
textbooks, organizes queries and their parts around typical data
analysis tasks such as data exploration, cleaning, and transformation.
Chapter 4 introduces some basic techniques for data analysis and shows
how SQL can be used for some simple analyses without too much
complication. Chapter 5 introduces additional SQL constructs that are
important in a variety of situations and thus completes the coverage of
SQL queries. Lastly, Chapter 6 briefly explains how to use SQL from
within R and from within Python programs. It focuses on how these
languages can interact with a database, and how what has been learned
about SQL can be leveraged to make life easier when using R or Python.
All chapters contain many examples and exercises along the way, and
readers are encouraged to install the two open-source database systems
(MySQL and Postgres) that are used throughout the book in order to
practice and work on the exercises, because simply reading the book is
much less useful than actually working through it.
This book is for anyone interested in data science and/or databases. It
requires only a bit of computer fluency and no specific background in
databases or data analysis. All concepts are introduced intuitively and
with a minimum of specialized jargon. After going through this book,
readers should be able to profitably learn more about data mining,
machine learning, and database management from more advanced textbooks
and courses.