Surrogate Variables...
Phil Sherrod...
Posted: Wed Sep 16, 2009 2:26 pm
On 16-Sep-2009, Ray Koopman <koopman at (no spam) sfu.ca> wrote:
Quote: Sounds a lot like a scheme I used back in the mainframe days. It was
designed for replacing missing interval-level data, and used linear
regression; for each missing observation, it chose from among the
mean, the best single surrogate for which data was available, and the
best pair of surrogates for which data were available, where 'best'
meant minimum expected squared error for that particular prediction.
Thus it was sensitive to the number of cases on which each regression
equation was based, the general goodness of each regression, and
the 1- or 2-dimensional Mahalanobis distance from their centroids of
the surrogate values on which the predictions were based. It worked
reasonably well, but I was never totally at peace with the
discontinuities that sometimes occurred when small changes in the
data led to different surrogates being used, with corresponding
large changes in the estimates of the missing values.
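For illustration, the "minimum expected squared error" rule described in the quote can be sketched for the single-surrogate case roughly as follows. This is only a sketch under stated assumptions, not the original mainframe code: it assumes ordinary least squares and the textbook prediction-error formula s^2 * (1 + 1/n + (x0 - xbar)^2 / Sxx), the function names are hypothetical, and the pair-of-surrogates case is left out. The last term is the squared 1-dimensional Mahalanobis distance of x0 from the centroid, scaled by 1/(n - 1).

```python
import numpy as np

def mean_pred_error(y):
    """Expected squared error when a new y is predicted by the sample mean."""
    n = len(y)
    return np.var(y, ddof=1) * (1.0 + 1.0 / n)

def surrogate_pred_error(x, y, x0):
    """Expected squared error and predicted value of a simple linear
    regression of y on x, evaluated at the observed surrogate value x0."""
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    sxx = np.sum((x - xbar) ** 2)
    slope = np.sum((x - xbar) * (y - ybar)) / sxx
    intercept = ybar - slope * xbar
    resid = y - (intercept + slope * x)
    s2 = np.sum(resid ** 2) / (n - 2)          # residual variance
    err = s2 * (1.0 + 1.0 / n + (x0 - xbar) ** 2 / sxx)
    return err, intercept + slope * x0

def choose_estimate(y_all, candidates):
    """y_all: observed values of the target variable.
    candidates: dict name -> (x, y, x0), the complete pairs for one
    surrogate plus its value x0 in the case being filled in.
    Returns (name, estimate, expected_error); name is "mean" if no
    surrogate has a smaller expected error than the mean."""
    best = ("mean", float(np.mean(y_all)), mean_pred_error(y_all))
    for name, (x, y, x0) in candidates.items():
        err, val = surrogate_pred_error(x, y, x0)
        if err < best[2]:
            best = (name, val, err)
    return best
```

Because the choice is a hard minimum over candidates, a small change in the data can flip which candidate wins, which is the discontinuity mentioned in the quote.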
Yes, that's the idea of surrogate variables.
A lot of my clients have "uncontrolled" data collected from surveys where
missing values are scattered throughout it. Others have medical data where
some tests or observations are performed on some of the patients but not
all. So I was trying to come up with something more accurate than just
using the median for the missing value. After I fit a surrogate function, I
compare its predicted values with just using the median/mode value and only
use the surrogate if it is an improvement over the median/mode.
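A rough sketch of that comparison, for one target variable and one candidate predictor, might look like the following. The details here are assumptions for illustration, not how DTREG does it: the function name is hypothetical, missing values are taken to be NaN, the surrogate is a polynomial fitted with numpy, and the comparison uses squared error on held-out rows (an in-sample least-squares fit would almost never lose to the median).

```python
import numpy as np

def surrogate_or_median(x, y, degree=1):
    """Return ("surrogate", coeffs) if a polynomial in x predicts held-out
    values of y better than the training median, else ("median", median)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    ok = ~np.isnan(x) & ~np.isnan(y)        # rows where both values are present
    x, y = x[ok], y[ok]
    train = np.arange(len(x)) % 2 == 0      # assumed split: even rows train
    test = ~train                           # odd rows are held out
    coeffs = np.polyfit(x[train], y[train], degree)
    median = float(np.median(y[train]))
    mse_fit = np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2)
    mse_med = np.mean((median - y[test]) ** 2)
    if mse_fit < mse_med:
        return "surrogate", coeffs
    return "median", median
```

For a categorical target the baseline would be the mode rather than the median, and the error measure would be misclassification rate instead of squared error.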
In principle, you could use more sophisticated surrogate functions than
linear, quadratic, and cubic: You could have neural nets, SVM, etc. But
since it has to generate and test n*(n-1) functions for n variables, I
decided to stick with simple surrogate functions feeding into a more complex
main function.
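To make the n*(n-1) count concrete: every variable is a possible target and every other variable a possible predictor for it, so the search runs over ordered pairs. A minimal sketch, assuming NaN marks a missing value and using hypothetical names (`best_surrogates`, `data`), could be:

```python
import itertools
import numpy as np

def best_surrogates(data, degrees=(1, 2, 3)):
    """data: dict name -> 1-D float array, NaN where the value is missing.
    Returns dict target -> (predictor, degree, residual_variance) for the
    best linear, quadratic, or cubic surrogate found for each target."""
    best = {}
    # permutations(..., 2) yields all n*(n-1) ordered (target, predictor) pairs
    for target, predictor in itertools.permutations(data, 2):
        x, y = data[predictor], data[target]
        ok = ~np.isnan(x) & ~np.isnan(y)    # rows where both are present
        n = int(ok.sum())
        if n < max(degrees) + 2:            # too few complete rows to fit
            continue
        for d in degrees:
            coeffs = np.polyfit(x[ok], y[ok], d)
            sse = np.sum((np.polyval(coeffs, x[ok]) - y[ok]) ** 2)
            score = sse / (n - d - 1)       # penalize extra coefficients
            if target not in best or score < best[target][2]:
                best[target] = (predictor, d, score)
    return best
```

Each fit here is a small least-squares problem, which is why the pairwise search stays cheap; swapping in a neural net or SVM for every pair would multiply the cost by the training time of that model.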
--
Phil Sherrod
http://www.dtreg.com -- Neural networks, SVM, Decision trees
ddd...
Posted: Wed Sep 16, 2009 4:21 pm
["Followup-To:" header set to comp.ai.genetic.] On Wed, 16 Sep 2009 10:26:06
GMT, Phil Sherrod <PhilSherrod at (no spam) NOSPAMcomcast.net> wrote:
Quote:
On 16-Sep-2009, Ray Koopman <koopman at (no spam) sfu.ca> wrote:
Sounds a lot like a scheme I used back in the mainframe days. It was
designed for replacing missing interval-level data, and used linear
regression; for each missing observation, it chose from among the
mean, the best single surrogate for which data was available, and the
best pair of surrogates for which data were available, where 'best'
meant minimum expected squared error for that particular prediction.
Thus it was sensitive to the number of cases on which each regression
equation was based, the general goodness of each regression, and
the 1- or 2-dimensional Mahalanobis distance from their centroids of
the surrogate values on which the predictions were based. It worked
reasonably well, but I was never totally at peace with the
discontinuities that sometimes occurred when small changes in the
data led to different surrogates being used, with corresponding
large changes in the estimates of the missing values.
Yes, that's the idea of surrogate variables.
A lot of my clients have "uncontrolled" data collected from surveys where
missing values are scattered throughout it. Others have medical data where
some tests or observations are performed on some of the patients but not
all. So I was trying to come up with something more accurate than just
using the median for the missing value. After I fit a surrogate function, I
compare its predicted values with just using the median/mode value and only
use the surrogate if it is an improvement over the median/mode.
In principle, you could use more sophisticated surrogate functions than
linear, quadratic, and cubic: You could have neural nets, SVM, etc. But
since it has to generate and test n*(n-1) functions for n variables, I
decided to stick with simple surrogate functions feeding into a more complex
main function.
Another option might be to look at the work of Marco Ramoni and Paola
Sebastiani: "The Use of Exogenous Knowledge to Learn Bayesian Networks
from Incomplete Databases."
It introduces a method called Bound and Collapse to estimate missing values.
Clif Davis...
Posted: Thu Sep 17, 2009 5:05 pm
All times are GMT