Holger Frohlich (Author)

This thesis is devoted to finding solutions to several machine learning problems in modern chemo- and bioinformatics by means of so-called kernel methods. Kernel methods are a family of learning algorithms that have attracted growing interest in recent years owing to their sound theoretical foundation and many successful practical applications in various disciplines. At the core of every kernel method is a kernel function, which can be thought of as a special similarity measure between arbitrary objects. The thesis begins with a review of the fundamentals and principles of kernel machines. Afterwards, a novel algorithm for model selection for Support Vector Machines (SVMs) in classification and regression is proposed, which is based on ideas from global optimization theory. It makes no assumptions about special properties of the kernel function, such as differentiability, and is highly efficient. Experimental comparisons to existing algorithms yield good results.

We then turn to applications of kernel methods in chemo- and bioinformatics. For the in silico ADME prediction problem in modern drug discovery, descriptor-based and graph-based representations of molecules are investigated. A descriptor selection algorithm is proposed that can improve the statistical stability of an existing method. Furthermore, a novel class of specialized kernel functions is introduced that allows a pair of molecules to be compared at the graph level. Various combinations of graph-based and descriptor-based representations are investigated, which allow, on the one hand, the incorporation of expert domain knowledge and, on the other hand, the integration of different notions of molecular similarity in one SVM model. In addition, a reduced graph representation for molecular structures is proposed, in which certain structural elements are condensed into a single node of the graph.
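The idea of integrating different notions of molecular similarity in one kernel can be illustrated with a minimal sketch. It is not the kernel proposed in the thesis: the Gaussian RBF kernel on descriptor vectors, the simple label-overlap kernel standing in for a structure-based comparison, the weighting scheme, and all names and toy values are assumptions chosen for illustration. The sketch relies only on the standard fact that a convex combination of valid kernels is again a valid kernel.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian RBF kernel on numeric descriptor vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def overlap_kernel(s, t):
    """Number of shared structural labels.

    This equals a dot product of binary indicator vectors, so it is a
    valid kernel. It is only a stand-in for a real graph-based kernel.
    """
    return float(len(set(s) & set(t)))


def combined_kernel(mol_a, mol_b, weight=0.5):
    """Convex combination of the descriptor and structure kernels."""
    k_desc = rbf_kernel(mol_a["descriptors"], mol_b["descriptors"])
    k_struct = overlap_kernel(mol_a["fragments"], mol_b["fragments"])
    return weight * k_desc + (1.0 - weight) * k_struct


def gram_matrix(items, kernel):
    """Pairwise kernel values, as consumed by an SVM with a precomputed kernel."""
    return [[kernel(a, b) for b in items] for a in items]


# Hypothetical toy molecules: descriptor values and fragment labels are made up.
molecules = [
    {"descriptors": [1.0, 0.2], "fragments": {"ring", "OH"}},
    {"descriptors": [0.9, 0.3], "fragments": {"ring", "NH2"}},
    {"descriptors": [3.0, 2.0], "fragments": {"COOH"}},
]

K = gram_matrix(molecules, combined_kernel)
for row in K:
    print(["%.3f" % value for value in row])
```

In this toy setting the first two molecules, which share a fragment label and have nearby descriptor vectors, receive a much larger kernel value than either does with the third. The `weight` parameter is one simple way in which prior knowledge about the relative importance of the two representations could enter the model.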
Our experiments indicate that our method can improve prediction performance compared to state-of-the-art modelling approaches. At the same time, it is computationally rather cheap, unified and highly flexible.

Another question examined in this thesis is which features of the membrane potential (MP) determine the generation of action potentials (APs) in cortical neurons in vivo. SVMs are trained to predict the occurrence of an AP before its onset based on several features extracted from the MP. A specialized feature selection algorithm is then used to select the most important features simultaneously across several in vivo recordings. We find that the occurrence of an AP depends not only on the value of the MP shortly before AP onset, but also on the MP rate of change, the increase of the MP several milliseconds before AP onset, and the long-range mean MP. Our findings systematically extend investigations by other researchers and are in part confirmed by their results.

As a last application of kernel methods, we address the problem of clustering genes by function based on their Gene Ontology (GO) annotation. For this purpose, specialized kernel functions are developed that measure the similarity between gene products with respect to the structure of the GO graph. Using several clustering algorithms, such as kernel k-means, spectral clustering and average linkage, we detect meaningful clusters with our method. Applications to other ontologies or taxonomies are possible in principle.
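How a clustering algorithm such as kernel k-means operates on nothing but a kernel matrix can be sketched as follows. This is a generic textbook formulation, not the implementation used in the thesis. The gene names, annotation labels, shared-term kernel and initial assignment are hypothetical; in particular, the shared-term kernel is only a simple stand-in for the GO-graph-based kernels described above.

```python
import math


def shared_term_kernel(terms_a, terms_b):
    """Number of annotation terms two genes have in common (a valid kernel)."""
    return float(len(terms_a & terms_b))


def normalized_gram(items, kernel):
    """Gram matrix scaled so that every item has unit self-similarity.

    Assumes kernel(x, x) > 0 for every item (no empty annotation sets).
    """
    raw = [[kernel(a, b) for b in items] for a in items]
    n = len(items)
    return [
        [raw[i][j] / math.sqrt(raw[i][i] * raw[j][j]) for j in range(n)]
        for i in range(n)
    ]


def kernel_kmeans(K, initial_labels, n_clusters, max_iter=100):
    """Kernel k-means on a precomputed Gram matrix K.

    The squared feature-space distance of item i to the mean of cluster C is
    K[i][i] - 2/|C| * sum_j K[i][j] + 1/|C|^2 * sum_{j,l} K[j][l], with j, l in C.
    """
    n = len(K)
    labels = list(initial_labels)
    for _ in range(max_iter):
        members = [[i for i in range(n) if labels[i] == c] for c in range(n_clusters)]
        within = [
            sum(K[j][l] for j in m for l in m) / len(m) ** 2 if m else 0.0
            for m in members
        ]
        new_labels = []
        for i in range(n):
            best_cluster, best_dist = labels[i], float("inf")
            for c, m in enumerate(members):
                if not m:
                    continue  # an empty cluster has no mean to compare against
                cross = sum(K[i][j] for j in m) / len(m)
                dist = K[i][i] - 2.0 * cross + within[c]
                if dist < best_dist:
                    best_cluster, best_dist = c, dist
            new_labels.append(best_cluster)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels


# Hypothetical genes with made-up annotation labels (not real GO terms).
annotations = {
    "geneA": {"root", "metabolism", "lipid"},
    "geneB": {"root", "metabolism", "lipid", "sterol"},
    "geneC": {"root", "signaling", "kinase"},
    "geneD": {"root", "signaling", "kinase", "mapk"},
}
names = list(annotations)
K = normalized_gram([annotations[name] for name in names], shared_term_kernel)

# Kernel k-means only finds a local optimum, so the result depends on the start.
labels = kernel_kmeans(K, initial_labels=[0, 0, 0, 1], n_clusters=2)
print(dict(zip(names, labels)))
```

Because the algorithm touches the data only through `K`, the same routine could in principle be applied to any ontology or taxonomy for which a suitable kernel is available. In practice one would typically run it from several initial assignments and keep the best result.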