FEATURE SELECTION

H. Wang, D. Bell, F. Murtagh, "Axiomatic approach to feature subset selection based on relevance", IEEE Trans. Pattern Anal. Mach. Intell. 21 (1999) 271-277.

The feature selection problem is crucial to pattern recognition tasks. Roughly speaking, classifying a population Z into classes Y means providing a method to tell the class y to which a sample z belongs, given a set of pairs (zi, yi), called the training set.

Feature selection methods are used when the samples belong to a high-dimensional feature space, in order to reduce the complexity of the classification task, perform it efficiently, and obtain good generalization capability (i.e., prediction on samples outside the training set).


Relevance

Relevance is a common-sense notion: X is relevant to Y if the occurrence of X changes the likelihood of Y. There is an agreed formal definition: given a prior evidence z and a hypothesis y, if the likelihood of y changes when the additional evidence x is considered, then x is relevant to y on the evidence z.

To formalize this notion we use information-theoretic concepts. A variable (feature) X is relevant to Y, given Z, if knowing X in addition to Z reduces the uncertainty about Y, i.e., the conditional entropy H(Y|X,Z) is smaller than H(Y|Z). More precisely, we define the relevance of X to Y given Z as

r(X;Y|Z) = I(X;Y|Z) / H(Y|Z) = 1 - H(Y|X,Z) / H(Y|Z)

where H(Y|Z) is the conditional entropy of Y given Z,

H(Y|Z) = - ∑z,y P(z) P(y|z) log P(y|z)

and I(X;Y|Z) is the mutual information of X and Y given Z,

I(X;Y|Z) = H(Y|Z) + H(X|Z) - H(X,Y|Z)

The relevance is defined as 0 if H(Y|Z)=0.
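
The definitions above translate directly into code. The following is a minimal Python sketch, assuming the joint distribution P(x, y, z) is available as a table of probabilities; the function names and the toy distribution are illustrative, not from the paper.

from math import log2
from collections import defaultdict

def marginal(p, keep):
    """Marginalize the joint table p over the axes not listed in keep."""
    m = defaultdict(float)
    for xyz, pr in p.items():
        m[tuple(xyz[i] for i in keep)] += pr
    return m

def cond_entropy(p, target, given):
    """H(target | given); target and given are index tuples into (x, y, z)."""
    joint = marginal(p, given + target)   # P(given, target)
    cond = marginal(p, given)             # P(given)
    h = 0.0
    for key, pr in joint.items():
        if pr > 0:
            h -= pr * log2(pr / cond[key[:len(given)]])
    return h

def relevance(p):
    """r(X;Y|Z) = 1 - H(Y|X,Z) / H(Y|Z); defined as 0 when H(Y|Z) = 0."""
    h_y_z = cond_entropy(p, target=(1,), given=(2,))
    h_y_xz = cond_entropy(p, target=(1,), given=(0, 2))
    return 0.0 if h_y_z == 0 else 1.0 - h_y_xz / h_y_z

# Toy joint distribution over (x, y, z) with y = x XOR z: knowing z alone
# leaves y completely uncertain, while x together with z determines it.
p = {(x, x ^ z, z): 0.25 for x in (0, 1) for z in (0, 1)}
print(relevance(p))   # -> 1.0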


Properties of the relevance

Y<X means that Y implies X, i.e., given a value y the value x is uniquely determined. In this case H(X|Y)=0, hence H(Y|X) = H(X,Y) - H(X) = H(Y) - H(X); since H(Y|X) >= 0, it follows that H(X) <= H(Y).
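
As a quick numerical check of this property (the example is ours, not the paper's): take Y uniform on {0, 1, 2, 3} and X = Y div 2, so every value of y determines x.

from math import log2

# Y uniform on {0, 1, 2, 3}; X = Y // 2, so Y determines X (Y < X).
p_y = {y: 0.25 for y in range(4)}        # H(Y) = 2 bits
p_x = {0: 0.5, 1: 0.5}                   # H(X) = 1 bit

H = lambda p: -sum(pr * log2(pr) for pr in p.values() if pr > 0)

# H(Y|X) = H(Y) - H(X) = 1 bit, and indeed H(X) <= H(Y).
print(H(p_y), H(p_x), H(p_y) - H(p_x))   # -> 2.0 1.0 1.0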


Feature Selection

To formalize the learning task we assume we have a training set of pairs (xi, yi), i = 1..k, where the xi are N-dimensional vectors, i.e., made of N features. A measure of the information encompassed in the training set is the mutual information I(X;Y), called the learning information. A subset of the features is said to be sufficient if it preserves the mutual information, I(X';Y) = I(X;Y), where X' is an N'-dimensional vector consisting of a subset of the components of X.

By the chain rule of mutual information, writing X" for the complementary features (the components of X not in X'), I(X;Y) = I(X';Y) + I(X";Y|X'). Sufficiency therefore gives I(X";Y|X') = 0, i.e., Y is conditionally independent of the complementary features X" given X'. Furthermore, preservation of the mutual information implies r(X';Y) = r(X;Y).
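
A way to see this in practice: the sketch below (our illustration, not the paper's procedure) estimates the mutual information empirically from the training pairs and tests whether a candidate subset of feature indices is sufficient.

from math import log2
from collections import Counter

def entropy(samples):
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def info(xs, ys):
    """Empirical mutual information I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def is_sufficient(X, y, subset, tol=1e-12):
    """True if the projected features preserve the learning information."""
    full = [tuple(row) for row in X]
    proj = [tuple(row[j] for j in subset) for row in X]
    return info(full, y) - info(proj, y) <= tol

# Toy training set: the third feature merely duplicates the second,
# so the subset {0, 1} already carries all the learning information.
X = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1)]
y = [a ^ b for a, b, _ in X]
print(is_sufficient(X, y, subset=[0, 1]))   # -> True
print(is_sufficient(X, y, subset=[0]))      # -> False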


Prediction

Given a set of features there may be several different sufficient feature subsets, and they need not be equally good for prediction. There are empirical criteria to select a sufficient feature subset:

  1. Occam's razor. Choose the "simplest" subset, where simplicity can be simplicity of implementation, such as encoding length (Shannon entropy H(X',Y));
  2. Maximum entropy principle;
  3. Minimum description length;
  4. Least general generalization principle.

If X' and X" are two sufficient subsets such that H(X',Y)<H(X",Y), then r(Y;X')>r(Y;X") (N.B. r(X';Y)=r(X";Y) ). Therefore the selection of the minimal sufficient feature subset is equivalent to maximing the relevance of Y to a sufficient feature subset.

Exhaustive examination of the space of feature subsets is intractable (the feature selection problem is NP-hard), so in practice only a suboptimal solution is sought, in polynomial time. An example is the incremental construction of a sufficient feature subset, adding one feature at a time, usually the most relevant one (the one yielding the largest increase in information about Y).
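
One such incremental strategy is greedy forward selection. The sketch below is an illustration (the stopping rule is one natural choice, not necessarily the paper's): it repeatedly adds the feature that most increases the information about Y, stopping once the learning information is fully preserved.

from math import log2
from collections import Counter

def entropy(samples):
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def info(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def greedy_sufficient_subset(X, y, tol=1e-12):
    n_features = len(X[0])
    target = info([tuple(r) for r in X], y)   # I(X;Y) to be preserved
    chosen = []
    while True:
        proj = [tuple(r[j] for j in chosen) for r in X]
        if target - info(proj, y) <= tol:
            return chosen
        # add the feature whose inclusion gives the largest information gain
        gains = {j: info([tuple(r[k] for k in chosen + [j]) for r in X], y)
                 for j in range(n_features) if j not in chosen}
        chosen.append(max(gains, key=gains.get))

X = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1)]
y = [a ^ b for a, b, _ in X]
print(greedy_sufficient_subset(X, y))   # e.g. [0, 1] (order may vary on ties)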



