This book takes a fresh approach in two respects: it provides a
comprehensive, up-to-date review of the challenging problems that
imbalanced data cause in prediction and classification, and it
introduces several recent statistical methods for dealing with these
problems. The book discusses the imbalance of data from
two points of view. The first is quantitative imbalance, meaning that
the sample size in one population greatly exceeds that in another
population. It includes presence-only data as an extreme case, where the
presence of a species is confirmed but information on its absence is
uncertain; such data are especially common in ecology when predicting
habitat distribution. The second is qualitative imbalance,
meaning that the data distribution of one population can be well
specified whereas that of the other is highly heterogeneous. A typical
case is the presence of outliers, commonly observed in gene expression
data; another is the heterogeneous characteristics often observed in
the case group of case-control studies. The extension of
the logistic regression model, maxent, and AdaBoost to imbalanced data
is discussed, providing a new framework for improving prediction,
classification, and variable selection performance. Weight functions
introduced in the methods play an important role in alleviating the
imbalance of data. This book also furnishes a new perspective on these
problems and shows some applications of the recently developed
statistical methods to real data sets.