In this book, the authors first introduce the research issues through a
motivating scenario and then explore the principles and techniques
behind these challenging topics. They then address the issues raised by
developing a series of methodologies. More specifically, the authors
study query optimization and query performance prediction for knowledge
retrieval. They also address unstructured data processing and data
clustering for knowledge extraction.
To optimize the queries issued through interfaces against knowledge
bases, the authors propose a cache-based optimization layer between
consumers and the querying interface to facilitate querying and
address the latency issue. The cache depends on a novel learning method
that considers the querying patterns in an individual's historical
queries without requiring knowledge of the systems backing the knowledge
base. To predict query performance for appropriate query scheduling,
the authors examine the queries' structural and syntactical features and
apply multiple widely adopted prediction models. Their feature modelling
approach requires no knowledge of either the query language or the
underlying system.
To extract knowledge from unstructured Web sources, the authors examine
two kinds of Web sources containing unstructured data: the source code
from Web repositories and the posts in programming question-answering
communities. They use natural language processing techniques to
pre-process the source code and extract its natural language elements.
Then they apply traditional knowledge extraction techniques to extract
knowledge. For the data from programming question-answering communities,
the authors take a step towards building a programming knowledge base
by starting with the paraphrase identification problem and developing
novel features to accurately identify duplicate posts. For
domain-specific knowledge extraction, the authors propose using a
clustering technique to separate knowledge into different groups. They
focus on developing a
new clustering algorithm that uses manifold constraints in the
optimization task and achieves fast and accurate performance.
For each model and approach presented in this book, the authors have
conducted extensive experiments to evaluate it using either public
datasets or synthetic data that they generated.