Foundations of Statistics for Data Scientists: With R and Python is
designed as a textbook for a one- or two-term introduction to
mathematical statistics for students training to become data scientists.
It is an in-depth presentation of the topics in statistical science with
which any data scientist should be familiar, including probability
distributions, descriptive and inferential statistical methods, and
linear modeling. The book assumes knowledge of basic calculus, so the
presentation can focus on "why it works" as well as "how to do it."
Compared to traditional "mathematical statistics" textbooks, however,
the book has less emphasis on probability theory and more emphasis on
using software to implement statistical methods and to conduct
simulations to illustrate key concepts. All statistical analyses in the
book use R software, with an appendix showing the same analyses with
Python.

Key Features:
- Shows the elements of statistical science that are important for
students who plan to become data scientists.
- Includes Bayesian and regularized fitting of models (e.g., an example
using the lasso), classification and clustering, and implementation of
methods with modern software (R and Python).
- Contains nearly 500 exercises.

The book also introduces modern topics that do not normally appear in
mathematical statistics texts but are highly relevant for data
scientists, such as Bayesian inference, generalized linear models for
non-normal responses (e.g., logistic regression and Poisson loglinear
models), and regularized model fitting. The nearly 500 exercises are
grouped into "Data Analysis and Applications" and "Methods and
Concepts." Appendices introduce R and Python and contain solutions for
odd-numbered exercises. The book's website
(http://stat4ds.rwth-aachen.de/) has expanded R, Python, and Matlab
appendices and all data sets from the examples and exercises.