H. Wang, D. Bell, F. Murtagh, "Axiomatic approach to feature subset selection based on relevance", IEEE Trans. Pattern Anal. Mach. Intell. 21 (1999) 271-277.
The feature selection problem is crucial to pattern recognition tasks. Roughly, a classification of a population Z into classes Y aims to provide a method that tells the class y to which a sample z belongs, given a set of pairs (zi, yi) (called the training set).
Feature selection methods are used when the samples belong to a high-dimensional space, in order to reduce the complexity of the classification task, perform it efficiently, and obtain good generalization capability (i.e., prediction on samples outside the training set).
Relevance
Relevance is a common-sense notion: X is relevant to Y if the occurrence of X changes the likelihood of Y. There is an agreed formal definition. Given a prior evidence z and a hypothesis y, if the likelihood of y changes when the additional evidence x is considered, then x is relevant to y on the evidence z.
To formalize this notion we use information-theoretic concepts. A variable (feature) X is relevant to Y, based on Z, if the conditional probability of Y given X and Z differs from that of Y given Z. More precisely we define the relevance of X to Y given Z as

  r(X;Y|Z) = I(X;Y|Z) / H(Y|Z)

where H(Y|Z) is the entropy of Y conditional to Z,

  H(Y|Z) = - sum_{y,z} p(y,z) log p(y|z)

and I(X;Y|Z) is the mutual information of X and Y given Z,

  I(X;Y|Z) = H(Y|Z) - H(Y|X,Z).

The relevance is defined as 0 if H(Y|Z) = 0.
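These quantities can be estimated from samples with plug-in (empirical) distributions. The following is a minimal sketch; the function names (H, H_cond, I_cond, relevance) are illustrative choices, not from the paper:

```python
from collections import Counter
from math import log2

def H(samples):
    """Empirical Shannon entropy (bits) of a list of hashable outcomes."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def H_cond(ys, zs):
    """H(Y|Z) = H(Y,Z) - H(Z)."""
    return H(list(zip(ys, zs))) - H(zs)

def I_cond(xs, ys, zs):
    """I(X;Y|Z) = H(Y|Z) - H(Y|X,Z)."""
    return H_cond(ys, zs) - H_cond(ys, list(zip(xs, zs)))

def relevance(xs, ys, zs):
    """r(X;Y|Z) = I(X;Y|Z) / H(Y|Z), defined as 0 when H(Y|Z) = 0."""
    hyz = H_cond(ys, zs)
    return I_cond(xs, ys, zs) / hyz if hyz > 0 else 0.0
```

With a trivial (constant) Z, a feature that copies Y has relevance 1, while an independent feature has relevance 0.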
Properties of the relevance
Y<X means that Y implies X, i.e., given a value y the value x is uniquely determined (X is a function of Y). In this case H(X|Y)=0, so I(X;Y)=H(X) and H(Y|X) = H(Y) - H(X); since H(Y|X) >= 0, it follows that H(X) <= H(Y).
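This inequality can be checked numerically on a toy example; the map f below is a hypothetical choice for illustration:

```python
from collections import Counter
from math import log2

def H(samples):
    """Empirical Shannon entropy (bits)."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

# Y implies X: each value y fixes x = f(y) (f is an assumed example map).
f = {0: 'a', 1: 'a', 2: 'b', 3: 'b'}
ys = [0, 1, 2, 3]        # uniform over four values: H(Y) = 2 bits
xs = [f[y] for y in ys]  # collapses to two values:  H(X) = 1 bit

assert H(xs) <= H(ys)
# H(Y|X) = H(Y,X) - H(X) equals H(Y) - H(X): here 2 - 1 = 1 bit
Hyx = H(list(zip(ys, xs))) - H(xs)
assert abs(Hyx - (H(ys) - H(xs))) < 1e-9
```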
Feature Selection
To formalize the learning task we assume we have a training set of pairs (xi, yi), i=1..k, where the xi are N-dimensional vectors, i.e., are made of N features. A measure of the information contained in the training set is the mutual information I(X;Y) (called the learning information). A subset of the features is said to be sufficient if it preserves the mutual information, I(X';Y)=I(X;Y), where X' is an N'-dimensional vector consisting of a subset of the components of X.
By the chain rule for mutual information, letting X"=X-X' be the complementary features, I(X";Y|X') = I(X;Y) - I(X';Y) = 0, thus Y is conditionally independent of the complementary features given X'. Furthermore preservation of the mutual information implies r(X';Y)=r(X;Y).
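A sufficiency check can be sketched with empirical mutual information; the helper name is_sufficient and the toy data are illustrative assumptions, not from the paper:

```python
from collections import Counter
from math import log2

def H(samples):
    """Empirical Shannon entropy (bits)."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def I(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return H(xs) + H(ys) - H(list(zip(xs, ys)))

def is_sufficient(X, y, subset, tol=1e-9):
    """True if the subset of feature indices preserves I(X;Y)."""
    rows_all = [tuple(row) for row in X]
    rows_sub = [tuple(row[j] for j in subset) for row in X]
    return I(rows_sub, y) >= I(rows_all, y) - tol

# Toy data: feature 0 duplicates the label, feature 1 is noise,
# so {0} is sufficient while {1} is not.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 1, 1]
```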
Prediction
Given a set of features there may be a number of different sufficient feature subsets, and they may not be equally good for prediction. There are empirical methods to select a sufficient feature subset.
If X' and X" are two sufficient subsets such that H(X',Y)<H(X",Y), then r(Y;X')>r(Y;X") (N.B. r(X';Y)=r(X";Y)). Therefore the selection of the minimal sufficient feature subset is equivalent to maximizing the relevance of Y to a sufficient feature subset.
The exhaustive examination of the feature subset space is an NP-hard problem. Therefore only a suboptimal solution can be found in polynomial time. An example is the incremental construction of the sufficient feature subset, adding one feature at a time, usually the one most relevant to Y given the features already selected.
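The incremental construction can be sketched as a greedy forward selection; this is an assumed plug-in implementation (the name greedy_select and the stopping rule are illustrative), not the paper's exact algorithm:

```python
from collections import Counter
from math import log2

def H(samples):
    """Empirical Shannon entropy (bits)."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def I_cond(xs, ys, zs):
    """I(X;Y|Z) = H(Y,Z) - H(Z) - H(Y,X,Z) + H(X,Z)."""
    return (H(list(zip(ys, zs))) - H(zs)
            - H(list(zip(ys, xs, zs))) + H(list(zip(xs, zs))))

def greedy_select(X, y, tol=1e-9):
    """Forward selection: repeatedly add the feature carrying the most
    information about Y given the features chosen so far; stop when no
    remaining feature adds information (the subset is then sufficient)."""
    n_feat = len(X[0])
    chosen = []
    while True:
        zs = [tuple(row[j] for j in chosen) for row in X]
        gains = {j: I_cond([row[j] for row in X], y, zs)
                 for j in range(n_feat) if j not in chosen}
        if not gains:
            break
        best = max(gains, key=gains.get)
        if gains[best] <= tol:
            break
        chosen.append(best)
    return chosen

# Toy data: feature 1 duplicates the label, feature 0 is noise.
X = [(0, 0), (1, 0), (0, 1), (1, 1)]
y = [0, 0, 1, 1]
```

On this toy data the greedy pass picks feature 1 and stops, since the noise feature adds no further information about y.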
Marco Corvi