Outlier detection
Guest
Posted: Thu Jan 04, 2007 7:55 pm
Hi all,
I'm writing a program and I need to find the best regression line in a
scatter plot. In the example below I need to identify the path made by
the * and identify the outliers, represented here by +.
It's easy to do by eye, but I need to do it programmatically.
In my case, each axis represents a DNA sequence, and the similarities
between them are plotted.
Here, 4 linear regression lines can be fitted to the points marked with
stars *, and they all have the same slope. The background noise can be
worse, making the identification of outliers harder.
My question is: how can I identify the most probable path along the
scatter plot and discard the background noise? Is there some kind of
density-based algorithm which I could apply to this?
Can anyone help me?

Cheers

Mathieu

| *
| *
|
| *
| + *
| *
| + *
|
| *
| *
| *
| * +
| * +
| *
|__________________________________
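
One possible density-based reading of the question, offered only as a rough sketch: since the segments are said to share a slope, each point can be binned by its offset y - slope*x, and points that fall on sparsely populated offsets can be treated as background noise. The function name and the `slope`, `bin_width` and `min_count` parameters are illustrative assumptions, not anything specified in the thread, and slope = 1 is assumed here because matching runs in a sequence dot plot usually lie on diagonals.

```python
from collections import Counter


def split_by_diagonal_density(points, slope=1.0, bin_width=1.0, min_count=3):
    """Split (x, y) points into (path, noise) by how crowded their diagonal is.

    Hypothetical helper: slope, bin_width and min_count are assumed tuning
    parameters that would need adjusting for real data.
    """
    def bin_of(x, y):
        # Points on the same line y = slope*x + c share (roughly) the same c.
        return round((y - slope * x) / bin_width)

    # Count how many points fall into each offset bin.
    counts = Counter(bin_of(x, y) for x, y in points)

    # Keep points whose diagonal is well populated; the rest is noise.
    path = [(x, y) for x, y in points if counts[bin_of(x, y)] >= min_count]
    noise = [(x, y) for x, y in points if counts[bin_of(x, y)] < min_count]
    return path, noise


if __name__ == "__main__":
    pts = [(0, 0), (1, 1), (2, 2), (3, 3),   # diagonal with offset 0
           (4, 9), (5, 10), (6, 11),         # diagonal with offset 5
           (1, 8), (6, 2)]                   # isolated points
    path, noise = split_by_diagonal_density(pts)
    print("path:", path)
    print("noise:", noise)
```

This only separates dense diagonals from sparse ones; it does not order the segments into a single path, and heavier noise would need a larger `min_count` or a windowed count along each diagonal.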
David Jones
Posted: Fri Jan 05, 2007 6:10 am
Guest
m.fourment@gmail.com wrote:
Quote:
Hi all,
I'm writing a program and I need to find the best regression line in
a scatter plot. In the example below I need to identify the path made
by the * and identify the outliers, represented here by +.

See the following book for some possible approaches:
A.C. Atkinson and M. Riani, Robust Diagnostic Regression Analysis,
Springer, 2000.
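
To give a flavour of what a robust fit looks like in code, here is a minimal sketch. It is not the forward search method from the book: it uses the Theil-Sen estimator (median of pairwise slopes) as a simple stand-in, and then flags points whose residual is large relative to the median absolute deviation. The function names and the `cutoff` value are assumptions chosen for illustration.

```python
from statistics import median


def theil_sen(points):
    """Return (slope, intercept) of the Theil-Sen line through (x, y) points."""
    # Slope: median of the slopes between all pairs with distinct x.
    slopes = [
        (y2 - y1) / (x2 - x1)
        for i, (x1, y1) in enumerate(points)
        for (x2, y2) in points[i + 1:]
        if x2 != x1
    ]
    slope = median(slopes)
    # Intercept: median of the per-point offsets for that slope.
    intercept = median(y - slope * x for x, y in points)
    return slope, intercept


def flag_outliers(points, cutoff=3.0):
    """Return a list of booleans, True where a point looks like an outlier.

    cutoff is an assumed threshold in robust standard deviations.
    """
    slope, intercept = theil_sen(points)
    residuals = [y - (slope * x + intercept) for x, y in points]
    centre = median(residuals)
    mad = median(abs(r - centre) for r in residuals)
    # 1.4826 scales the MAD to a standard deviation under normality;
    # fall back to a tiny scale if more than half the residuals are identical.
    scale = 1.4826 * mad or 1e-9
    return [abs(r - centre) / scale > cutoff for r in residuals]


if __name__ == "__main__":
    pts = [(x, 2 * x + 1) for x in range(10)] + [(3, 20), (7, 0)]
    print(theil_sen(pts))
    print(flag_outliers(pts))
```

A single robust line like this handles one segment at a time; several parallel segments with different intercepts would each need their own fit.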

David Jones
Graham Jones
Posted: Fri Jan 05, 2007 7:12 am
Guest
It sounds (and looks) like you are trying to align two sequences so that an
`edit cost' between them is minimised. This has very little to do with
linear regression or outlier detection. It is a well-known problem with a
well-known solution. See, for example, Introduction to Bioinformatics, Lesk,
2005. Or try searching with terms like
pairwise sequence alignment
dynamic programming
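
As a rough illustration of what those search terms lead to, here is a minimal sketch of global pairwise alignment by dynamic programming (the Needleman-Wunsch scheme). The scoring values for `match`, `mismatch` and `gap` are arbitrary assumptions; real tools use substitution matrices and more refined gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

The path traced back through the score matrix plays the role of the "most probable path" through the dot plot: diagonal steps are matched positions, and horizontal or vertical steps are gaps.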

Graham
David Jones
Posted: Fri Jan 05, 2007 8:21 am
Guest
Following up on my earlier post: for an extended approach not restricted
to regression (i.e. treating the variables more symmetrically), see:

A.C. Atkinson, M. Riani and A. Cerioli, Exploring Multivariate Data with
the Forward Search, Springer, 2004.

David Jones
 
All times are GMT - 5 Hours