 |
|
| Computers Forum Index » Computer Artificial Intelligence - Neural Nets » Duplicate Columns, Initial Weights, and Combinations... |
| Author |
Message |
| TomH488... |
Posted: Fri Sep 11, 2009 5:39 pm |
|
|
|
Guest
|
First, apologies for my long-winded reply.
On Sep 11, 11:07 am, Greg <he... at (no spam) alumni.brown.edu> wrote:
Quote:
Don't put too much faith in the value of the training set error.
See the FAQ and archives re validation sets and overfitting
(overtraining).
I wouldn't put too much faith in the value of training set error if my
inequality is true:
(variance of predicted error) > (variance of training error)
Regarding training set error in general:
What I believe I have discovered is that this error is a function of
the output magnitude.
For example, if you train until you reach an error of .01, there is no
way to know what the error associated with a particular output
magnitude interval is. In other words, this error is a kind of
aggregate error. It gives you no indication as to how much better
high magnitude outputs are fit and no indication of how poorly low
output magnitudes are fit.
About the only thing you can prove is that there exists a "middle"
interval which has an error of .01
This is fairly useless and even more useless when combined with my
inequality hypothesis.
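One way to check the point about aggregate error is to split the squared errors by output magnitude. The following is only a sketch: the function name `mse_by_magnitude`, the equal-width binning, and the toy numbers are assumptions made for illustration, not anything taken from the network discussed here.

```python
import numpy as np

def mse_by_magnitude(targets, predictions, n_bins=4):
    """Return (bin_edges, per_bin_mse), binning cases by |target|."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    magnitude = np.abs(targets)
    # Equal-width bins over the observed magnitude range (an assumption).
    edges = np.linspace(magnitude.min(), magnitude.max(), n_bins + 1)
    # digitize gives 1..n_bins+1; shift to 0-based and clip the maximum
    # value into the last bin.
    idx = np.clip(np.digitize(magnitude, edges) - 1, 0, n_bins - 1)
    sq_err = (targets - predictions) ** 2
    per_bin = np.array([
        sq_err[idx == b].mean() if np.any(idx == b) else np.nan
        for b in range(n_bins)
    ])
    return edges, per_bin

# Toy data: small targets are fit badly, large targets are fit well.
targets = np.array([0.1, -0.2, 0.3, 2.0, -2.5, 3.0])
predictions = np.array([0.5, 0.3, -0.2, 2.1, -2.4, 3.1])

edges, per_bin = mse_by_magnitude(targets, predictions, n_bins=2)
overall = np.mean((targets - predictions) ** 2)
print("overall MSE:", overall)
print("per-bin MSE (low magnitude, high magnitude):", per_bin)
```

In this toy case the single aggregate number sits between a poorly fit low-magnitude bin and a well fit high-magnitude bin, which is the situation described above.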
________________________________
But here is the interesting part. Can it be that training error is so
useless because it is a non-uniform property and does not represent a
great majority of cases?
Keep in mind the contention that Memorization results when you "learn
the noise" of the input cases.
If that is true, then "learning the noise" would mean that the input
cases would have to be nailed with little error by the network. (A)
And the inverse, NOT "learning the noise" would mean that the
input cases would NOT have to be nailed with little error by the
network. (B)
My inequality holds for (A) and I believe it should hold for (B).
NOTE: When simplistic plots of generalization & training error -v-
training are rendered, the training error curve is usually
below the generalization error curve. This is exactly the inequality I
believe is true.
_______________________________
But So What?
I believe it means the following:
a) if you can't begin to fit the training data, you have no hope
whatsoever of any meaningful generalization.
b) if your training algorithm favors output magnitude intervals, it is
difficult to say anything about the results of your training.
c) further, a training method which would treat output magnitudes
without bias, should produce better generalization since you don't
overtrain big magnitudes and undertrain small magnitudes.
Concluding:
I think something needs to be "adjusted", "fudged", or weighted so
that uniform training occurs.
I also believe that this phenomenon is currently dealt with "down
wind" where weight values are adjusted by things like regularization
etc.
Perhaps a little effort "up wind" might be worth investing. |
| Tomasso... |
Posted: Sat Sep 12, 2009 5:15 am |
|
|
|
Guest
|
What is the transfer function on your output layer? Linear, logistic, exponential, etc.?
What kind of function are you trying to predict (in terms of the above)?
Are there sampling biases across the range of output values? E.g., are large values also
rare?
Your "inequality" isn't necessarily true.
Tom (the other one). |
| Greg... |
Posted: Sun Sep 13, 2009 12:04 am |
|
|
|
Guest
|
On Sep 11, 1:39 pm, TomH488 <tom... at (no spam) gmail.com> wrote:
Quote: First, apologies for my long-winded reply.
On Sep 11, 11:07 am, Greg <he... at (no spam) alumni.brown.edu> wrote:
Don't put too much faith in the value of the training set error.
See the FAQ and archives re validation sets and overfitting
(overtraining).
I wouldn't put too much faith in the value of training set
error if my inequality is true:
(variance of predicted error) > (variance of training error)
1. The measure to be used is MSE, not VAR.
2. Except for very rare or pathological cases, the result
MSE(nontraining data) >= MSE(training data)
is obvious and well known. (There may be a proof in Fukunaga's
pattern recognition text.)
Quote: Regarding training set error in general:
What I believe I have discovered is that this error is a function of
the output magnitude.
This shouldn't surprise anyone:
1. You didn't normalize your targets to have the same variance
(or range).
2. Your objective function is the sum-of-squared-errors, not
the sum-of-error-magnitudes.
3. Since these options are available, you might want to try them
and compare.
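The two options mentioned above can be sketched as follows. This is an illustration only: the helper names `standardize`, `sse`, and `sae` and the toy error vector are assumptions, and the comparison just shows how much one large error dominates each objective.

```python
import numpy as np

def standardize(y):
    """Rescale targets to zero mean and unit variance."""
    y = np.asarray(y, dtype=float)
    return (y - y.mean()) / y.std()

def sse(errors):
    """Sum of squared errors."""
    return float(np.sum(np.square(errors)))

def sae(errors):
    """Sum of error magnitudes (absolute errors)."""
    return float(np.sum(np.abs(errors)))

# Three small errors and one large one (toy values).
errors = np.array([0.1, 0.1, 0.1, 2.0])

# Fraction of each objective contributed by the single large error.
share_sse = errors[-1] ** 2 / sse(errors)
share_sae = abs(errors[-1]) / sae(errors)
print("share of SSE from the large error:", share_sse)
print("share of SAE from the large error:", share_sae)

z = standardize([1.0, 2.0, 3.0, 10.0])
print("standardized targets:", z)
```

The large error takes a bigger share of the squared objective than of the absolute one, so a squared-error fit concentrates on cases where large errors are possible.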
Quote: For example, if you train until you reach an error of .01,
The value 0.01 means absolutely nothing unless it is compared
to a reasonable standard. The most reasonable standards are
the errors obtained when the model is either constant or linear.
For example, MSE/MSE00 is, approximately, the fraction of
total output variance that is not represented by the model.
Moreover, the well known statistical measure of fit, R^2 is
given by
R^2 = 1-(SSE/SSE00) = 1-(MSE/MSE00)
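The ratio above can be computed directly. In this sketch MSE00 is assumed to be the MSE of the constant model that always outputs the mean of the targets; the function names and the toy data are illustrative.

```python
import numpy as np

def normalized_mse(targets, predictions):
    """MSE / MSE00, where MSE00 is the error of the constant (mean) model."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    mse = np.mean((targets - predictions) ** 2)
    # Assumed baseline: always predict the mean of the targets.
    mse00 = np.mean((targets - targets.mean()) ** 2)
    return float(mse / mse00)

def r_squared(targets, predictions):
    """R^2 = 1 - MSE/MSE00."""
    return 1.0 - normalized_mse(targets, predictions)

targets = [1.0, 2.0, 3.0, 4.0]
predictions = [1.1, 1.9, 3.2, 3.8]
nmse = normalized_mse(targets, predictions)
r2 = r_squared(targets, predictions)
print("MSE/MSE00:", nmse)
print("R^2:", r2)
```

Read this way, a raw error of 0.01 is good or bad only relative to how much variance the targets had to begin with.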
Quote: there is no
way to know what the error associated with a particular output
magnitude interval is. In other words, this error is a kind of
aggregate error. It gives you no indication as to how much better
high magnitude outputs are fit and no indication of how poorly low
output magnitudes are fit.
About the only thing you can prove is that there exists a "middle"
interval which has an error of .01
No, you can use an objective function that is a better
fit to what you think is important.
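As a hedged illustration of choosing a different objective, a per-case weighted squared error lets you state which cases matter. The weighting scheme below (inverse magnitude with a small `eps`) is purely hypothetical and is not something proposed in this thread.

```python
import numpy as np

def weighted_mse(targets, predictions, weights):
    """Weighted mean squared error; weights express per-case importance."""
    t = np.asarray(targets, dtype=float)
    p = np.asarray(predictions, dtype=float)
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w * (t - p) ** 2) / np.sum(w))

def inverse_magnitude_weights(targets, eps=1e-3):
    """Hypothetical scheme: emphasize small-magnitude targets."""
    t = np.asarray(targets, dtype=float)
    return 1.0 / (np.abs(t) + eps)

targets = np.array([0.1, -0.2, 2.0, -3.0])
predictions = np.array([0.3, 0.0, 2.1, -3.1])

plain = weighted_mse(targets, predictions, np.ones_like(targets))
emphasized = weighted_mse(targets, predictions,
                          inverse_magnitude_weights(targets))
print("uniform weights:", plain)
print("small targets emphasized:", emphasized)
```

With uniform weights this reduces to the ordinary MSE; any other weighting changes what "well trained" means, so validation should use the same objective.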
Quote: This is fairly useless and even more useless when combined with my
inequality hypothesis.
________________________________
But here is the interesting part. Can it be that training error is so
useless because it is a non-uniform property and does not represent a
great majority of cases?
Training error is not useless, provided you use common sense
in
1. Selecting design (training + validation) data.
2. Choosing an appropriate training function.
3. Using the training function wisely.
Quote: Keep in mind the contention that Memorization results when you "learn
the noise" of the input cases.
Keep in mind that you are obsessing over training data
when the performance on validation data is much more important.
I am assuming that both are random draws from the population.
Quote: If that is true, then "learning the noise" would mean that the input
cases would have to be nailed with little error by the network. (A)
And the inverse, NOT "learning the noise" would mean that the
input cases would NOT have to be nailed with little error by the
network. (B)
My inequality holds for (A) and I believe it should hold for (B).
NOTE: When simplistic plots of generalization & training error -v-
training are rendered, the training error curve is usually
below the generalization error curve. This is exactly the inequality I
believe is true.
It tends to be true most of the time.
_______________________________
Quote:
But So What?
I believe it means the following:
a) if you can't begin to fit the training data, you have no hope
whatsoever of any meaningful generalization.
b) if your training algorithm favors output magnitude intervals, it is
difficult to say anything about the results of your training.
Don't expect to minimize one thing and automatically
get a free ride for another thing...
Make sure that minimizing your objective function will
satisfy whatever properties you think are important.
Quote: c) further, a training method which would treat output magnitudes
without bias, should produce better generalization since you don't
overtrain big magnitudes and undertrain small magnitudes.
No. Generalization is with respect to a given objective function.
Quote: Concluding:
I think something needs to be "adjusted", "fudged", or weighted so
that uniform training occurs.
That is well known. See the FAQ and elementary texts in statistical
modeling.
Quote:
I also believe that this phenomenon is currently dealt with "down
wind" where weight values are adjusted by things like regularization
etc.
Perhaps a little effort "up wind" might be worth investing.
All of your fears are well known and have been sufficiently
investigated. You need to perform a more extensive search of
texts (statistical modelling, pattern recognition & NNs) and
previous research.
Hope this helps.
Greg |
| TomH488... |
Posted: Sun Sep 13, 2009 8:11 pm |
|
|
|
Guest
|
I'll attempt to answer "The Other Tom's" questions first (!) (Greg,
you will be next)
On Sep 11, 11:21 pm, "Tomasso" <toma... at (no spam) a.a> wrote:
Quote: What is the transfer function on your output layer? Linear, logistic, exponential, etc?
Single output with linear activation or transfer function.
Quote:
What kind of function are you trying to predict (in terms of the above).
The output is "5 day change in price of the Russell 2000"
The inputs are "images" of stock charts. These "images" are large
samplings of the stock price time history. Many sampling models have
been tried. Currently, a very simple "last 25 days" sampling is being used
as part of an ensemble model that includes other types of
inputs/models.
Quote:
Are there sampling biases across the range of output values? Eg, are large values also
rare?
The distribution of outputs is not terribly unlike a Normal
distribution centered at Zero.
Usually, for even output magnitude intervals, the variance of each
interval, plotted on semi-log appears as a straight line.
The larger the output, the greater the importance as far as investing
goes.
In fact, a traditional MSE is not a good metric since getting the
"correct sign" is the most important - I also plan to investigate
models where the output will be +1 or -1. This may be another member
of the ensemble.
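If getting the sign right matters more than squared error, a sign hit rate can be tracked alongside MSE. This is a minimal sketch; the names `sign_hit_rate` and `to_sign_labels` and the toy values are assumptions, and treating a zero change as +1 is an arbitrary choice.

```python
import numpy as np

def sign_hit_rate(targets, predictions):
    """Fraction of cases where prediction and target have the same sign."""
    t = np.asarray(targets, dtype=float)
    p = np.asarray(predictions, dtype=float)
    return float(np.mean(np.sign(t) == np.sign(p)))

def to_sign_labels(targets):
    """Map price changes to +1 / -1 targets (zero is mapped to +1 here)."""
    t = np.asarray(targets, dtype=float)
    return np.where(t >= 0, 1, -1)

targets = np.array([1.0, -2.0, 0.5, -0.1])
predictions = np.array([0.2, -0.1, -0.3, -5.0])

hit_rate = sign_hit_rate(targets, predictions)
labels = to_sign_labels(targets)
mse = np.mean((targets - predictions) ** 2)
print("sign hit rate:", hit_rate)
print("+1/-1 labels:", labels)
print("MSE:", mse)
```

In the toy data the last case has a large squared error but the correct sign, which is the kind of disagreement between the two metrics described above.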
| Greg... |
Posted: Mon Sep 14, 2009 9:08 pm |
|
|
|
Guest
|
In the few stock market prediction problems that I have tried, the
important outputs were either change in price or percent change
in price. If more than one price was being forecast, the data
was always normalized so that neither the value nor the range of values
of one output had a deleterious effect on the prediction of other
outputs.
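A rough sketch of these target choices, assuming a 1-D array of closing prices. The helper names and the z-score normalization are illustrative assumptions; the normalization actually used is not stated here.

```python
import numpy as np

def price_change(prices, horizon=1):
    """Change in price over `horizon` steps."""
    p = np.asarray(prices, dtype=float)
    return p[horizon:] - p[:-horizon]

def percent_change(prices, horizon=1):
    """Percent change in price over `horizon` steps."""
    p = np.asarray(prices, dtype=float)
    return 100.0 * (p[horizon:] - p[:-horizon]) / p[:-horizon]

def zscore_columns(outputs):
    """Give each output column zero mean and unit variance."""
    y = np.asarray(outputs, dtype=float)
    return (y - y.mean(axis=0)) / y.std(axis=0)

prices = np.array([100.0, 102.0, 101.0, 105.0])
change = price_change(prices)
pct = percent_change(prices)

# Two outputs on very different scales, normalized column by column.
outputs = np.column_stack([change, 1000.0 * change + 5.0])
normalized = zscore_columns(outputs)
print("change:", change)
print("percent change:", pct)
print("normalized outputs:\n", normalized)
```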
Hope this helps.
Greg |
| Tomasso... |
Posted: Thu Sep 17, 2009 2:21 am |
|
|
|
Guest
|
Greg wrote:
Quote: On Sep 13, 4:11 pm, TomH488 <tom... at (no spam) gmail.com> wrote:
...
In the few stock market prediction problems that I have tried, the
important outputs were either change in price or percent change
in price. If more than one price was being forecast, the data
was always normalized so that neither the value nor the range of values
of one output had a deleterious effect on the prediction of other
outputs.
Greg
Market time series usually contain a mixture of noise. There is Gaussian
noise, price movements that happen extraneously (e.g., dividend payments),
and impulsive noise (spikes), which is rarer but still happens a lot.
NNs handle Gaussian noise well. Extraneous movements should be adjusted
for. BUT spikes cause learning problems for NNs. They shock the convergence
process, so learning does not get very far, and sometimes you merely learn
to fit the spikes. They are like very nasty outliers. And that might explain
TomH's question.
I am aware of trading systems which filter the spikes out and learn from
the rest of the data. Then they use the spikes for triggering trades. This
can work for tick data and 15 minute data.
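A minimal sketch of separating spikes from the rest of the data. The rule used here (distance from the median, scaled by the median absolute deviation, with a threshold of 5) is an assumption chosen for illustration; it is not the filter those trading systems use.

```python
import numpy as np

def split_spikes(changes, threshold=5.0):
    """Split price changes into (ordinary, spikes, is_spike_mask)."""
    c = np.asarray(changes, dtype=float)
    median = np.median(c)
    mad = np.median(np.abs(c - median))
    # 1.4826 * MAD estimates the standard deviation for Gaussian data.
    scale = 1.4826 * mad
    if scale == 0:
        is_spike = np.zeros(c.shape, dtype=bool)
    else:
        is_spike = np.abs(c - median) > threshold * scale
    return c[~is_spike], c[is_spike], is_spike

changes = np.array([0.1, -0.2, 0.05, 4.0, -0.1, 0.15])
ordinary, spikes, mask = split_spikes(changes)
print("train on:", ordinary)
print("held out as spikes:", spikes)
```

The ordinary changes would go to training, while the flagged positions stay available as separate trigger signals.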
Tomasso. |
| TomH488... |
Posted: Thu Sep 17, 2009 2:05 pm |
|
|
|
Guest
|
On Sep 14, 5:08 pm, Greg <he... at (no spam) alumni.brown.edu> wrote:
In the few stock market prediction problems that I have tried, the
important outputs were either change in price or percent change
in price. If more than one price was being forecast, the data
was always normalized so that neither the value nor the range of values
of one output had a deleterious effect on the prediction of other
outputs.
Hope this helps.
Greg
I'm pretty much in agreement with you, except that we decided to
abandon multiple outputs.
In the very beginning we predicted the price +1 and +2 days into the future,
but then decided to build and train 2 separate nets for that desired
result. The intent was to not have one output error interact with the
other output error. Whether that was a wise decision is unknown at
this time, since that interaction may be more of a smoothing effect,
which would be welcome.
Either change or %change is what we settled on to get rid of the "one
day prediction desired, one day lag predicted" problem. It turns out
that with Price, for example when predicting +1 day into the future, merely
shifting the known data one day into the future and calling it a result
yields excellent error and correlation values. In fact, a correlation
of .92 was typical of merely shifting data +1 day.
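The shifted-data effect is easy to reproduce on synthetic data. The sketch below uses a simulated random walk rather than real Russell 2000 prices, so the numbers it prints are not comparable to the .92 figure; it only shows why price levels reward the "predict yesterday" solution while deltas do not.

```python
import numpy as np

def lag1_correlation(series):
    """Correlation between a series and itself shifted by one step."""
    s = np.asarray(series, dtype=float)
    return float(np.corrcoef(s[:-1], s[1:])[0, 1])

# Synthetic random-walk "prices" (an assumption, not market data).
rng = np.random.default_rng(0)
steps = rng.normal(0.0, 1.0, 1000)
prices = 100.0 + np.cumsum(steps)

price_corr = lag1_correlation(prices)           # persistence forecast on price
delta_corr = lag1_correlation(np.diff(prices))  # same forecast on deltas
print("lag-1 correlation, price:", price_corr)
print("lag-1 correlation, delta:", delta_corr)
```

On price levels the shifted series correlates strongly with the target even though it carries no forecasting skill; on deltas that shortcut disappears.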
The other interesting thing is that with Price, you have the range of
price in the training cases being standardized and mapped into,
let's call it, the Input Interval. What then happens is that portions
of the input interval are near the max-curvature sigmoid areas and others
are not, so that regions of the price history are treated
differently.
When delta is used, +1 day correlation is poor and hence the net
seldom gravitates to that simple solution. Further, I believe the
deltas which are near max curvature are randomly distributed in
history, which should be a good thing.
Although we have been using delta almost exclusively, I think that
%change (or log), which we have surveyed a little, would be even better
since it is a further "non-dimensionalization."
______________________________
Our technical input is a stock chart image (sampled price) and I am
beginning to think that row normalization may be in order. For
instance, if you are trying to identify shapes (and not sizes), you
probably want to preprocess the shape images into some standard size.
It is so common when plotting scientific/engineering parameters to
non-dimensionalize them to either [0,1] or [-1,1] and that may be the
thing to do with a "stock chart."
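Row normalization of this kind could look like the sketch below, where each row is one input window (for example the last 25 days). The function name `scale_rows` and the handling of flat rows are assumptions.

```python
import numpy as np

def scale_rows(windows, low=-1.0, high=1.0):
    """Min-max scale each row (one chart window) to [low, high]."""
    w = np.asarray(windows, dtype=float)
    row_min = w.min(axis=1, keepdims=True)
    row_max = w.max(axis=1, keepdims=True)
    span = row_max - row_min
    # A flat window has zero span; avoid dividing by zero (it maps to `low`).
    span = np.where(span == 0, 1.0, span)
    unit = (w - row_min) / span
    return low + unit * (high - low)

# Two windows with the same shape at different price levels.
windows = np.array([[10.0, 11.0, 12.0],
                    [100.0, 110.0, 120.0]])
scaled = scale_rows(windows)
print(scaled)
```

Both rows map to the same scaled pattern, so the net sees the shape and not the price level. The size information is discarded, which may or may not be desirable.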
Thanks,
Tom
PS. Greg, I will still respond to your previous reply. |
| TomH488... |
Posted: Thu Sep 17, 2009 2:19 pm |
|
|
|
Guest
|
On Sep 16, 6:21 pm, "Tomasso" <toma... at (no spam) a.a> wrote:
Quote: Greg wrote:
On Sep 13, 4:11 pm, TomH488 <tom... at (no spam) gmail.com> wrote:
...
In the few stock market prediction problems that I have tried, the
important outputs were either change in price or percent change
in price. If more than one price was being forecast, the data
was always normalized so that neither the value nor the range of values
of one output had a deleterious effect on the prediction of other
outputs.
Greg
Market time series usually contain a mixture of noise. There is Gaussian
noise, price movements that happen extraneously (e.g., dividend payments),
and impulsive noise (spikes), which is rarer but still happens a lot.
When dealing with dP (change in price), a gap turns into a spike.
Usually gaps occur when there is news, both related and unrelated to
the stock.
So those points are impossible to predict or learn from - unless maybe
you have some kind of news input.
Quote:
NNs handle Gaussian noise well. Extraneous movements should be adjusted
for. BUT spikes cause learning problems for NNs. They shock the convergence
process, so learning does not get very far, and sometimes you merely learn
to fit the spikes. They are like very nasty outliers. And that might explain
TomH's question.
It sure does to me. I found that the net was fitting the large delta
outputs, which are a small fraction of the input. While I wouldn't call
them spikes, I'd call them concentrated regions of extreme price movement.
If I removed the upper 50% of the output magnitude, I think I would
only remove about 10% of the data.
I did make a run where I took my lowest interval of price delta,
duplicated it 3 times and added it to the input. The convergence was
more difficult but the forecasting on some preliminary data was
dramatically different. I will be taking a broad look at its
performance to see if any improved generalization occurred.
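The duplication experiment can be sketched like this. The function name, the `cutoff` that defines the "lowest interval", and the toy arrays are assumptions; the actual interval boundaries used in the run are not given.

```python
import numpy as np

def oversample_low_magnitude(inputs, outputs, cutoff, copies=3):
    """Append `copies` extra copies of the cases with |output| <= cutoff."""
    x = np.asarray(inputs, dtype=float)
    y = np.asarray(outputs, dtype=float)
    low = np.abs(y) <= cutoff
    x_aug = np.concatenate([x] + [x[low]] * copies, axis=0)
    y_aug = np.concatenate([y] + [y[low]] * copies, axis=0)
    return x_aug, y_aug

inputs = np.arange(8.0).reshape(4, 2)       # 4 cases, 2 inputs each
outputs = np.array([0.1, -0.05, 2.0, -3.0])  # 2 small, 2 large deltas

x_aug, y_aug = oversample_low_magnitude(inputs, outputs, cutoff=0.5)
print("cases before:", len(outputs), "after:", len(y_aug))
```

Duplicating a case n times has the same effect on a sum-of-squares objective as giving that case a weight of n + 1, so per-case weights are an alternative that avoids enlarging the training set.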
Quote:
I am aware of trading systems which filter the spikes out and learn from
the rest of the data. Then they use the spikes for triggering trades. This
can work for tick data and 15 minute data.
That sounds interesting. Learning the rest of the data is like
characterizing the state of the stock, such as overbought or oversold,
and then using the spike as an external force acting on our
"spring-mass stock model." For example, if the stock is on the verge of
breaking down and we get a spike (news) that reinforces this direction,
it is probably a good time to short.
|
| Greg... |
Posted: Fri Sep 18, 2009 1:40 am |
|
|
|
Guest
|
On Sep 17, 10:05 am, TomH488 <tom... at (no spam) gmail.com> wrote:
Quote: On Sep 14, 5:08 pm, Greg <he... at (no spam) alumni.brown.edu> wrote:
On Sep 13, 4:11 pm, TomH488 <tom... at (no spam) gmail.com> wrote:
I'll attempt to answer "The Other Tom's" questions first (!) (Greg,
you will be next)
On Sep 11, 11:21 pm, "Tomasso" <toma... at (no spam) a.a> wrote:
What is the transfer function on your output layer? Linear, logistic, exponential, etc?
Single output with linear activation or transfer function.
What kind of function are you trying to predict (in terms of the above).
The output is "5 day change in price of the Russell 2000"
The inputs are "images" of stock charts. These "images" are large
samplings of the stock price time history. Many sampling models have
been tried. Currently, a very simple "last 25 days" sampling is being
used as part of an ensemble model that includes other types of inputs/
models.
Are there sampling biases across the range of output values? Eg, are large values also
rare?
The distribution of outputs is not terribly unlike a Normal
distribution centered at Zero.
Usually, for even output magnitude intervals, the variance of each
interval, plotted on semi-log appears as a straight line.
The larger the output, the greater the importance as far as investing
goes.
In fact, a traditional MSE is not a good metric since getting the
"correct sign" is the most important - I also plan to investigate
models where the output will be +1 or -1. This may be another member
of the ensemble.
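
As a rough illustration of why "correct sign" and MSE can disagree, here
is a small sketch that scores the same forecasts both ways. The function
names and the toy numbers are made up for the example; treating a zero
actual change as a miss is also an assumption.

```python
def sign(x):
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def mse(predicted, actual):
    """Mean squared error over parallel lists."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def directional_accuracy(predicted, actual):
    """Fraction of cases where the predicted sign matches a nonzero actual sign."""
    hits = sum(1 for p, a in zip(predicted, actual)
               if sign(a) != 0 and sign(p) == sign(a))
    return hits / len(actual)


# Hypothetical 5-day changes: the last forecast has the right sign but
# a large magnitude error, so it dominates the MSE.
predicted = [0.5, -0.1, 2.0, -3.0]
actual = [1.0, 0.2, 1.5, -0.5]
print(mse(predicted, actual), directional_accuracy(predicted, actual))
```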
Your "inequality" isn't necessarily true.
Tom (the other one).
"TomH488" <tom... at (no spam) gmail.com> wrote in message
news:0f66a3e4-07d5-4f58-81a5-69a8bc919961 at (no spam) s39g2000yqj.googlegroups.com...
First, apologies for my long-winded reply.
On Sep 11, 11:07 am, Greg <he... at (no spam) alumni.brown.edu> wrote:
Don't put too much faith in the value of the training set error.
See the FAQ and archives re validation sets and overfitting
(overtraining).
I wouldn't put too much faith in the value of training set error if my
inequality is true:
(variance of predicted error) > (variance of training error)
Regarding training set error in general:
What I believe I have discovered is that this error is a function of
the output magnitude.
For example, if you train until you reach an error of .01, there is no
way to know what the error associated with a particular output
magnitude interval is. In other words, this error is a kind of
aggregate error. It gives you no indication as to how much better
high magnitude outputs are fit and no indication of how poorly low
output magnitudes are fit.
About the only thing you can prove is that there exists a "middle"
interval which has an error of .01
This is fairly useless and even more useless when combined with my
inequality hypothesis.
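
The point about aggregate error hiding per-interval error can be checked
directly by binning cases on output magnitude and computing the error in
each bin. This is only a sketch: the function name, the bin edges, and
the toy data are assumptions.

```python
from bisect import bisect_right


def mse_by_magnitude(predicted, actual, edges):
    """Return one MSE per |actual| bin; edges are ascending bin boundaries.

    Bin 0 holds |actual| < edges[0], the last bin holds |actual| >= edges[-1].
    Empty bins are reported as None.
    """
    sums = [0.0] * (len(edges) + 1)
    counts = [0] * (len(edges) + 1)
    for p, a in zip(predicted, actual):
        b = bisect_right(edges, abs(a))
        sums[b] += (p - a) ** 2
        counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Hypothetical data: small outputs fit closely, large outputs fit poorly.
predicted = [0.0, 0.0, 2.5, 3.5]
actual = [0.1, -0.1, 3.0, 3.0]
per_bin = mse_by_magnitude(predicted, actual, edges=[1.0])
print(per_bin)
```

A single aggregate number for this toy set would sit between the two
per-bin values and describe neither of them.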
________________________________
But here is the interesting part. Can it be that training error is so
useless because it is a non-uniform property and does not represent a
great majority of cases?
Keep in mind the contention that Memorization results when you "learn
the noise" of the input cases.
If that is true, then "learning the noise" would mean that the input
cases would have to be nailed with little error by the network. (A)
And the inverse, NOT "learning the noise" would mean that the
input cases would NOT have to be nailed with little error by the
network. (B)
My inequality holds for (A) and I believe it should hold for (B).
NOTE: When simplistic plots of generalization & training error -v-
training are rendered, the training error curve is usually
below the generalized error curve. This is exactly the inequality I
believe is true.
_______________________________
But So What?
I believe it means the following:
a) if you can't begin to fit the training data, you have no hope
whatsoever of any meaningful generalization.
b) if your training algorithm favors output magnitude intervals, it is
difficult to say anything about the results of your training.
c) further, a training method which would treat output magnitudes
without bias, should produce better generalization since you don't
overtrain big magnitudes and undertrain small magnitudes.
Concluding:
I think something needs to be "adjusted", "fudged", or weighted so
that uniform training occurs.
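
One possible form of such a weighting, offered only as a sketch: give
each training case a weight inversely proportional to how many cases
share its output magnitude bin, so every bin contributes equally to the
total error. The function name and the bin edges are assumptions, and
this is a standard inverse-frequency scheme rather than anything
proposed in the thread.

```python
from bisect import bisect_right
from collections import Counter


def balanced_weights(targets, edges):
    """Per-case weights so each occupied |target| bin has equal total weight.

    The weights sum to len(targets), so the overall error scale is kept.
    """
    bins = [bisect_right(edges, abs(y)) for y in targets]
    counts = Counter(bins)
    n_bins = len(counts)
    total = len(targets)
    return [total / (n_bins * counts[b]) for b in bins]


# Hypothetical targets: three small moves and one large move.
targets = [0.1, 0.2, -0.1, 3.0]
weights = balanced_weights(targets, edges=[1.0])
print(weights)
```

These weights would multiply each case's squared error during training,
which is the "up wind" counterpart to adjusting weights afterwards.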
I also believe that this phenomenon is currently dealt with "down
wind" where weight values are adjusted by things like regularization
etc.
Perhaps a little effort "up wind" might be worth investing
In the few stock market prediction problems that I have tried, the
important outputs were either change in price or per cent change
in price. If more than one price was being forecasted, the data
was always normalized so that neither value nor range of values
of one output had a deleterious effect on the prediction of other
outputs.
Hope this helps.
Greg
I'm pretty much in agreement with you except that we decided to
abandon multiple outputs.
In the very beginning we predicted a +1 and +2 days into future price
but then decided to build and train, 2 separate nets for that desired
result. The intent was to not have one output error interact with the
other output error. Whether that was a wise decision is unknown at
this time since that interaction may be more of a smoothing effect
which would be welcome.
I was not precise in terminology. When I said multiple outputs
I meant multiple output quantities, not the same output quantity at
multiple lag values.
It would be very interesting to compare a model that predicts
the present and k future values of one output quantity with
k+1 separate models.
Hope this helps.
Greg |
| Tomasso... |
Posted: Fri Sep 18, 2009 2:08 am |
|
|
|
Guest
|
TomH488 wrote:
Quote: On Sep 16, 6:21 pm, "Tomasso" <toma... at (no spam) a.a> wrote:
Greg wrote:
On Sep 13, 4:11 pm, TomH488 <tom... at (no spam) gmail.com> wrote:
...
In the few stock market prediction problems that I have tried, the
important outputs were either change in price or per cent change
in price. If more than one price was being forecasted, the data
was always normalized so that neither value nor range of values
of one output had a deleterious effect on the prediction of other
outputs.
Greg
Market time series usually contain a mixture of noise. There is gaussian
noise, price movements that happen extraneously (eg, dividend payments),
and impulsive noise (spikes) which is rarer, but still happens a lot.
When dealing with dP (change in price), a gap turns into a spike.
Usually gaps occur when there is news, both related and unrelated to
the stock.
So those points are impossible to predict or learn from - unless maybe
you have some kind of news input.
NNs handle gaussian noise well. Extraneous movements should be adjusted
for. BUT spikes cause learning problems for NNs. They shock the convergence
process, so learning does not get very far, and sometimes you merely learn
to fit the spikes. They are like very nasty outliers. And that might explain
TomH's Q.
It sure does to me. I found that the net was fitting the large delta
outputs which are a small fraction of the input. While I wouldn't call
them spikes, I'd call them concentrated regions of extreme price
movement.
In the literature it's called impulsive noise. It happens in signal processing
where events like short circuits and lightning strikes add large noise, usually
unrelated to the channel and the channel's Gaussian noise.
Other things you should experiment with:
1. Whether you suppress the spikes or mask them without losing them
completely. There are several kinds of filter that can mask the spikes,
the simplest being a median filter...
2. Whether you suppress them for the output of the NN data, or both
the output and input. The spikes in output screw up the error feedback
and hence convergence. The spikes in input may have a predictive role
for price time series.
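
For reference, a minimal sliding-window median filter of the kind
mentioned in point 1. The window size and the choice to shrink the
window at the ends of the series are assumptions made for this sketch.

```python
from statistics import median


def median_filter(series, window=5):
    """Replace each value with the median of a centered window.

    The window is truncated at both ends of the series, so the output
    has the same length as the input.
    """
    half = window // 2
    return [median(series[max(0, i - half): i + half + 1])
            for i in range(len(series))]


# Hypothetical price-change series with one isolated spike.
dp = [1, 1, 1, 10, 1, 1, 1]
print(median_filter(dp, window=3))
```

An isolated spike narrower than half the window is removed entirely,
which is why the filter masks impulsive noise while leaving slower
movements largely alone.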
Quote: If I removed the upper 50% of the output magnitude, I think I would
only remove about 10% of the data.
I did make a run where I took my lowest interval of price delta,
duplicated it 3 times and added it to the input. The convergence was
more difficult but the forecasting on some preliminary data was
dramatically different. I will be taking a broad look at its
performance to see if any improved generalization occurred.
I am aware of trading systems which filter the spikes out and learn from
the rest of the data. Then they use the spikes for triggering trades. This
can work for tick data and 15 minute data.
That sounds interesting, learning the rest of the data is like
characterizing the state of the stock such as overbought or oversold,
and then using the spike as an external force acting on our
"spring-mass stock model." For example, if the stock is on the verge
of breaking down and we get a spike (news) that reinforces this
direction, it is probably a good time to short.
What ratio of signal to noise are you getting in your models?
Tomasso.
| TomH488... |
Posted: Fri Sep 18, 2009 5:38 pm |
|
|
|
Guest
|
Me talking:
Quote:
I did make a run where I took my lowest interval of price delta,
duplicated it 3 times and added it to the input. The convergence was
more difficult but the forecasting on some preliminary data was
dramatically different. I will be taking a broad look at its
performance to see if any improved generalization occurred.
I had a chance to review that training method.
As an acid test, I would use the net to predict from 1 to 150 days
into the future.
I used 30 hidden nodes based on the performance of the first 20 days
of the 150 day prediction plot.
_______________________
The following is a footnote regarding the trading model:
The trading model is simple: if the sign of the predicted change
equals the actual sign, then a profit is shown equal to the abs value
of that delta. Since it is a delta 5 day prediction, it is easiest to
visualize 5 separate accounts: a Monday, Tuesday, Wednesday, Thursday,
and Friday account.
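
A sketch of this scoring rule, under two assumptions that the footnote
does not state: a wrong-sign prediction books a loss equal to the abs
value of the delta, and a prediction of exactly zero takes no position.
The function name and the sample numbers are hypothetical.

```python
def score_trades(predicted, actual, n_accounts=5):
    """Accumulate profit and loss into one account per weekday slot.

    predicted and actual are parallel lists of 5 day price changes,
    one entry per trading day, so entry i belongs to account i % 5.
    """
    accounts = [0.0] * n_accounts
    for day, (p, a) in enumerate(zip(predicted, actual)):
        if p == 0:
            continue  # assumption: no forecast direction, no position
        hit = (p > 0) == (a > 0)
        # Assumption: a miss loses the same magnitude a hit would gain.
        accounts[day % n_accounts] += abs(a) if hit else -abs(a)
    return accounts


# Hypothetical forecasts and realized 5 day changes for six trading days.
predicted = [1, -1, 1, 1, -1, 1]
actual = [2, -3, -1, 4, 5, 1]
accounts = score_trades(predicted, actual)
print(accounts, sum(accounts))
```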
This primitive method could be tweaked as follows to produce more
success:
1) during the 5 day hold of a position, if the prediction is realized
interday, close position and capture profit.
2) only enter a position when you can get in .5 to 1% lower than the
previous close which the model uses to start the interval.
end of footnote
_________________________________________
The result was that I had to change the scale of my 150 day plot to
$2000 max from $1000 max to accommodate the $1750 max profit I was
generating.
While the first 20 days were actually a little worse performing than
the 1x input set, all other aspects of the 150 day curve were improved:
1) first time a 150 day curve seen to be monotonically increasing
(some dither obviously)
2) highest peak account balance of $1750
3) much larger "better performance in near future" or initial slope:
increased from about 20 to about 55 days |
| TomH488... |
Posted: Fri Sep 18, 2009 6:24 pm |
|
|
|
Guest
|
Quote:
Market time series usually contain a mixture of noise. There is gaussian
noise, price movements that happen extraneously (eg, dividend payments),
and impulsive noise (spikes) which is rarer, but still happens a lot.
When dealing with dP (change in price), a gap turns into a spike.
Usually gaps occur when there is news, both related and unrelated to
the stock.
So those points are impossible to predict or learn from - unless maybe
you have some kind of news input.
NNs handle gaussian noise well. Extraneous movements should be adjusted
for. BUT spikes cause learning problems for NNs. They shock the convergence
process, so learning does not get very far, and sometimes you merely learn
to fit the spikes. They are like very nasty outliers. And that might explain
TomH's Q.
It sure does to me. I found that the net was fitting the large delta
outputs which are a small fraction of the input. While I wouldn't call
them spikes, I'd call them concentrated regions of extreme price
movement.
In the literature it's called impulsive noise. It happens in signal processing
where events like short circuits and lightning strikes add large noise, usually
unrelated to the channel and the channel's Gaussian noise.
Other things you should experiment with:
1. Whether you suppress the spikes or mask them without losing them
completely. There are several kinds of filter that can mask the spikes,
the simplest being a median filter...
We are planning on addressing the spikes in the output in the training
set a few different ways:
1) omit the input rows that are spikes or large outputs,
2) clip the spikes/large outputs to some lesser value,
3) process the Outputs into a binary set: Go Long or Go Short
more on 3):
there probably is a third set: Do Nothing which would depend on a
profit threshold of say, 1% over 5 days.
So you would have -1, 0, and +1 as the possible output values.
Whether it would be better to have a pair of binary outputs... I don't
know.
and maybe 4) a "median" filter where new value is average of
neighbors.
This could be tweaked to "temper" the spike by making the new value
the average of the median AND the spike itself - "half the distance
to the... median"
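
The options above could be sketched roughly as follows. All names, the
clip limit, and the window size are assumptions, and the tempering
function uses a true windowed median rather than an average of
neighbors, which is one reading of option 4.

```python
from statistics import median


def clip(y, limit):
    """Option 2: clip a large output to +/- limit."""
    return max(-limit, min(limit, y))


def ternary_label(pct_change, threshold=1.0):
    """Option 3 with Do Nothing: +1 long, -1 short, 0 inside the threshold."""
    if pct_change >= threshold:
        return 1
    if pct_change <= -threshold:
        return -1
    return 0


def temper_spikes(series, window=5):
    """Option 4, tempered: average each value with its local median."""
    half = window // 2
    out = []
    for i, v in enumerate(series):
        local = median(series[max(0, i - half): i + half + 1])
        out.append((local + v) / 2)
    return out


# Hypothetical 5 day percent changes with one spike in the middle.
changes = [1, 1, 9, 1, 1]
print([clip(c, 3) for c in changes])
print([ternary_label(c) for c in [1.5, -2.0, 0.4]])
print(temper_spikes(changes, window=3))
```

Option 1, omitting the rows, is just a filter on abs(output) and is left
out here.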
Quote:
2. Whether you suppress them for the output of the NN data, or both
the output and input. The spikes in output screw up the error feedback
and hence convergence. The spikes in input may have a predictive role
for price time series.
Since my input is a sequence of 25 daily closing prices, output spikes
would be pretty much in the input data since if you look at the dP5
calculation as a moving interval, you pick up a new dP and cast off an
old dP so if you see a spike, it was due to these 2 dP's.
A "merely" large P could involve multiple inputs with "large" P.
I have 2 ways to look at this: I could preprocess the dP5 price data
so that both Input and Output would have modified spikes, or I could
just modify the Outputs. Both ways probably should be looked at. My
guess is that the Inputs should be un-modified.
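
The moving-interval remark can be checked numerically. Writing dP[t] for
the one day change P[t+1] - P[t] and dP5[t] for P[t+5] - P[t], the
change in dP5 from one day to the next is the new dP minus the old one.
The helper names and the toy prices are assumptions for this sketch.

```python
def one_day_changes(prices):
    """dP[t] = P[t+1] - P[t]."""
    return [b - a for a, b in zip(prices, prices[1:])]


def k_day_changes(prices, k=5):
    """dPk[t] = P[t+k] - P[t]."""
    return [prices[t + k] - prices[t] for t in range(len(prices) - k)]


# Hypothetical closing prices.
prices = [10, 11, 13, 12, 15, 20, 18, 19]
dp = one_day_changes(prices)
dp5 = k_day_changes(prices, k=5)

# dP5[t+1] - dP5[t] should equal dP[t+5] - dP[t] for every t.
steps = [dp5[t + 1] - dp5[t] for t in range(len(dp5) - 1)]
picked_up_minus_cast_off = [dp[t + 5] - dp[t] for t in range(len(dp5) - 1)]
print(steps, picked_up_minus_cast_off)
```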
Quote:
If I removed the upper 50% of the output magnitude, I think I would
only remove about 10% of the data.
This is confirmed to be true.
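
A check like this takes only a few lines. The sketch below assumes
"upper 50% of the output magnitude" means everything above half of the
largest abs(output); the function name and the sample values are made
up.

```python
def fraction_above_half_max(targets):
    """Fraction of cases whose magnitude exceeds half the maximum magnitude."""
    cutoff = max(abs(y) for y in targets) / 2
    return sum(1 for y in targets if abs(y) > cutoff) / len(targets)


# Hypothetical outputs: nine small moves and one large one.
targets = [0.1, -0.2, 0.3, -0.4, 0.5, 0.1, -0.2, 0.3, 0.2, -4.0]
print(fraction_above_half_max(targets))
```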
Quote: I did make a run where I took my lowest interval of price delta,
duplicated it 3 times and added it to the input. The convergence was
more difficult but the forecasting on some preliminary data was
dramatically different. I will be taking a broad look at its
performance to see if any improved generalization occurred.
I am aware of trading systems which filter the spikes out and learn from
the rest of the data. Then they use the spikes for triggering trades. This
can work for tick data and 15 minute data.
That sounds interesting, learning the rest of the data is like
characterizing the state of the stock such as overbought or oversold,
and then using the spike as an external force acting on our
"spring-mass stock model." For example, if the stock is on the verge
of breaking down and we get a spike (news) that reinforces this
direction, it is probably a good time to short.
What ratio of signal to noise are you getting in your models?
I never tried to calculate that even though it has been a question
since day 1.
SNR = mu/stdev might have some meaning when you assume some kind of
polynomial fit and proceed.
AutoCorrelation sounds much better but Excel doesn't have it - and I
don't have the will today to try the formula that is all over the
internet. Besides, you need to do a whole set of those calculations
to get the plot which is what really shows what might be lurking.
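
For anyone who wants the numbers without a spreadsheet, here is a small
sketch of both quantities. It uses the usual sample autocorrelation
(lagged covariance divided by the lag 0 sum of squares) and the
population standard deviation for the mu/stdev ratio; both choices, and
the function names, are assumptions.

```python
from statistics import fmean, pstdev


def snr(series):
    """mu / stdev, using the population standard deviation."""
    return fmean(series) / pstdev(series)


def autocorrelation(series, lag):
    """Sample autocorrelation of the series at a single lag."""
    mu = fmean(series)
    denom = sum((x - mu) ** 2 for x in series)
    num = sum((series[t] - mu) * (series[t + lag] - mu)
              for t in range(len(series) - lag))
    return num / denom


def autocorrelation_plot_values(series, max_lag):
    """One value per lag from 0 to max_lag, ready to plot."""
    return [autocorrelation(series, k) for k in range(max_lag + 1)]


# Hypothetical alternating series: strong negative lag 1 correlation.
series = [1, -1, 1, -1, 1, -1]
print(autocorrelation_plot_values(series, max_lag=2))
print(snr([1, 2, 3, 4, 5]))
```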
....too sleepy in the middle of the afternoon...
Tom |
All times are GMT