Main Page | Report this Page
Science Forum Index  »  Statistics - Math Forum  »  Q: Classification of data with uneven multimodel...
Page 1 of 1    

Q: Classification of data with uneven multimodel...

Author Message
Gradient...
Posted: Tue Nov 03, 2009 8:22 pm
Guest
Hello everyone,

* Problem Statement:

I am trying to solve this classification problem. I am given a set of
x_1,...x_n observation vectors, each x_i being in R^d. The goal is to
assign each x_i to either class A or class B.

I am given a training set y_1,...y_t where for any y_i, I know its
corresponding label (either A or B).

Although there are only two labels, the underlying distribution of
each class, e.g. P(A|z), is multi modal. In addition, the height of
the modes are different, making the distirbuition uneven (some regions
are sparse in training samples and some are dense). We can assume P(A|
z) has smooth density function, i.e. if z_1 and z_2 are close, then it
is likely that the belong to the same label.

The goal is to use these assumptions and training data to label the
unlabeled data.

* My Thought:

The naive idea is use to compute K nearest neighbor of any x_i among
training data, and then perform a majority vote for labeling it.
However, I think it will be troubled by uneven mode assumption, denser
regions require bigger K. Although there might exist some hacks to
choose K adaptively, I was wondering if there is a principled and
clean procedure (maybe beyond KNN framework) to handle this problem?
Do you think bootstrapping or any other tehcnique provides a better
performance guarantee for this problem, say if we want to minimize the
number of mislcassifications?

Your guidance would be "highly" apprecaited!

Golabi
 
Paul...
Posted: Wed Nov 04, 2009 12:43 pm
Guest
Gradient wrote:
[quote]Hello everyone,

* Problem Statement:

I am trying to solve this classification problem. I am given a set of
x_1,...x_n observation vectors, each x_i being in R^d. The goal is to
assign each x_i to either class A or class B.

I am given a training set y_1,...y_t where for any y_i, I know its
corresponding label (either A or B).

Although there are only two labels, the underlying distribution of
each class, e.g. P(A|z), is multi modal. In addition, the height of
the modes are different, making the distirbuition uneven (some regions
are sparse in training samples and some are dense). We can assume P(A|
z) has smooth density function, i.e. if z_1 and z_2 are close, then it
is likely that the belong to the same label.

The goal is to use these assumptions and training data to label the
unlabeled data.

* My Thought:

The naive idea is use to compute K nearest neighbor of any x_i among
training data, and then perform a majority vote for labeling it.
However, I think it will be troubled by uneven mode assumption, denser
regions require bigger K. Although there might exist some hacks to
choose K adaptively, I was wondering if there is a principled and
clean procedure (maybe beyond KNN framework) to handle this problem?
Do you think bootstrapping or any other tehcnique provides a better
performance guarantee for this problem, say if we want to minimize the
number of mislcassifications?

Your guidance would be "highly" apprecaited!

Golabi
[/quote]
I wonder if a support vector machine would be effective? (It seems
"clean" and I'd be inclined to consider it "principled".)

Given a class of discriminant functions that is linear in its
coefficients (linear, polynomial, sum of basis functions, ...), it's
possible to directly minimize the number of misclassifications over
the training sample using a mixed integer programming model. IIRC,
there's an argument based on V-C dimension that this asymptotically
produces the optimal classifier. The problem is that the MIP models
get intractable long before the sample size reaches "asymptotic" (plus
I think there's some question among statisticians as to whether this
line is "principled").

As a sidebar, I would not necessarily minimize the number of
misclassifications (whether by SVM, MIP model, or other approach).
Unless the prior probabilities of the two groups and the
misclassification costs are equal, it would make sense to weight the
criterion function using both of those.

/Paul
 
 
Page 1 of 1    
All times are GMT - 5 Hours
The time now is Sat Dec 05, 2009 7:20 am