Learning and Generalisation for High-dimensional Data
Abstract
Modern data-driven Artificial Intelligence models are built on large datasets that have only recently become available to practitioners. Significant effort has gone into gathering data and information, and the volume of our data assets grows with time, bringing us into the new era of Big Data. In many relevant problems, however, we face one particular class of Big Data: high-dimensional data with few samples or with limited annotation. Such datasets are characterized by many attributes in a single record, while the number of separate records is often small or lacks annotation. We refer to these as high-dimensional low-sample size data. They arise in many significant fields, such as medical image analysis (for example, asthma detection and treatment), financial data analysis, and bioinformatics. These are just examples of settings where the data have more attributes than observations. Note that the volumes of unlabeled data in these areas may in fact be large; however, for reasons beyond the control of AI practitioners (e.g. privacy, data protection laws, the cost of human assessment, intellectual property), annotated data may not be fully available to them. Data of this kind present many challenges for machine learning algorithms. Over-fitting and high variance are among the major problems, but they are only two of the many facets of the grand challenge of learning and generalisation in high dimensions. Altogether, these issues constitute the challenge of learning and generalisation for high-dimensional systems.