"The rigorous training provided by the ISP stimulates students' ideas for building powerful AI systems for biomedical informatics."
Ye Ye, student
PhD DISSERTATION DEFENSE
The notion of similarity between observations plays a very fundamental role in many Machine Learning and Data Mining algorithms. In many of these methods, the fundamental problem of prediction, which is making assessments and/or inferences about the future observations from the past ones, boils down to how "similar" the future cases are to the already observed ones. However, similarity is not always obtained through the traditional distance metrics. Data-driven similarity metrics, in particular, come into play where the traditional absolute metrics are not sufficient for the task in hand due to special structure of the observed data. A common approach for computing data-driven similarity is to somehow aggregate the local absolute similarities (which are not data-driven and can be computed in a closed-from) to infer a global data-driven similarity value between any pair of observations. The graph based methods offer a natural framework to do so. Incorporating these methods, many of the Machine Learning algorithms, that are designed to work with absolute distances, can be applied on those problems with data-driven distances. This makes graph-based methods very effective tools for many real-world problems.
In this thesis, the major problem that I want to address is the scalability of the graph-based methods. With the rise of large-scale, high-dimensional datasets in many real-world applications, many Machine Learning algorithms do not scale up well applying to these problems. The graph-based methods are no exception either. Both the large number of observations and the high dimensionality hurt graph-based methods, computationally and statistically. While the large number of observations imposes more of a computational problem, the high dimensionality problem has more of a statistical nature. In this thesis, I address both of these issues in depth and review the common solutions for them proposed in the literature. Moreover, for each of these problems, I propose novel solutions with experimental results depicting the merits of the proposed algorithms. Finally, I discuss the contribution of the proposed work from a broader viewpoint and draw some future directions of the current work.
Copyright 2009 | Web site by UMC Web Team