Logistic or Softmax Outputs for Multi-class Problems...

David...
Posted: Thu Oct 01, 2009 7:16 pm
Guest
Hi,

I have read the book "Neural Networks for Pattern Recognition" by
Bishop (1995). It explains that a logistic output unit should be used
for a 2-class classification problem and softmax output units should
be used for a multi-class classification problem - each softmax output
corresponds to a class label.

I am just wondering. Is it appropriate to use logistic units for multi-
class classification problems as well as 2-class problems?

Thanks,
David
 
GavinCawley at (no spam) googlemail.com...
Posted: Fri Oct 02, 2009 10:10 am
Guest
On 1 Oct, 20:16, David <dtian.ty... at (no spam) googlemail.com> wrote:
Quote:
Hi,

I have read the book "Neural Networks for Pattern Recognition" by
Bishop (1995). It explains that a logistic output unit should be used
for a 2-class classification problem and softmax output units should
be used for a multi-class classification problem - each softmax output
corresponds to a class label.

I am just wondering. Is it appropriate to use logistic units for multi-
class classification problems as well as 2-class problems?

Thanks,
David

If the classes are mutually exclusive then you wouldn't be exploiting
the prior knowledge that the outputs should sum to one. In the limit
of an infinite dataset and a sufficiently large network I expect
there would be no difference, but with a finite dataset this
constraint on the outputs probably wouldn't hold. One very practical
reason for using softmax units is that the network does not
need to devote hidden layer units to making the outputs sum
approximately to one, so you may need a larger network for logistic
output units.

In short, I wouldn't go as far as to say logistic units would be
inappropriate, but they are certainly sub-optimal (assuming mutually
exclusive classes).
 
David...
Posted: Fri Oct 02, 2009 11:19 am
Guest
Quote:
If the classes are mutually exclusive then you wouldn't be exploiting
the prior knowledge that the outputs should sum to one.  In the limit
of an infinite dataset and a sufficiently large network then I expect
there would be no difference, but in the case of a finite dataset this
constraint on the outputs probably wouldn't hold.  One very practical
reason for using the softmax units would be that the network would not
need to devote hidden layer units to try and make the outputs sum
approximately to one, so you may need a larger network for logistic
output units.

In short, I wouldn't go as far as to say logistic units would be
inappropriate, but they are certainly sub-optimal (assuming mutually
exclusive classes).

By 'mutually exclusive' do you mean that each instance belongs
to one class only? On page 238 of Bishop's book, it says "If the output
values are to be interpreted as probabilities they must lie in the
range (0,1), and they must sum to unity. This can be achieved by using
a generalization of the logistic sigmoid activation function which
takes the form etc. which is known as the softmax activation function."
It seems to me that both logistic and softmax activation functions
allow the outputs to sum to one.

I have tried using both logistic and softmax outputs for a dataset.
The dataset contains 79 attributes, 5 classes and 208 instances. I
partitioned the dataset into 2/3 and 1/3 ratios. Then, I used 2/3
(139 instances) of the dataset as the training set and the 1/3 (69
instances) as testing dataset. I tried different network structures
for each type of outputs.

1. Logistic Outputs

Network structure    Test-set classification accuracy (%)
79 x 1 x 5           82.7
79 x 5 x 5           95.7
79 x 8 x 5           98.6
79 x 10 x 5          97.1
79 x 11 x 5          97.1
79 x 12 x 5          97.1
79 x 15 x 5          97.1
79 x 42 x 5          97.1

2. Softmax Outputs

Network structure    Test-set classification accuracy (%)
79 x 1 x 5           82.7
79 x 4 x 5           98.6
79 x 6 x 5           97.1
79 x 8 x 5           97.1
79 x 10 x 5          97.1
79 x 15 x 5          97.1
79 x 20 x 5          97.1

3. Single layer networks

I also tried 2 single layer networks: one network with 5 logistic
outputs, another with 5 softmax outputs.

a single layer network with 5 logistic outputs: 97.1

a single layer network with 5 softmax outputs: 97.1

The confusion matrices for both networks are identical:
predicted class
actual 1
class 2
 
David...
Posted: Fri Oct 02, 2009 11:46 am
Guest
Sorry. I did not finish my message and I pressed the 'send' button
accidentally. Here is the rest of my message:

The confusion matrices for both networks are identical:

                          predicted class
                 class1  class2  class3  class4  class5
actual  class1      2       0       0       0       0
class   class2      1       6       1       0       0
        class3      0       0      42       0       0
        class4      0       0       0      13       0
        class5      0       0       0       0       4

4. Confusion Matrices for MLPs

The confusion matrices for all the above MLPs that I tried are
identical:

                          predicted class
                 class1  class2  class3  class4  class5
actual  class1      2       0       0       0       0
class   class2      2       6       1       0       0
        class3      0       0      42       0       0
        class4      0       0       0      13       0
        class5      0       0       0       0       4

The MLPs misclassify a class2 instance as a class1 instance (2nd row,
1st column). It seems both the training and testing sets are linearly
separable. I am thinking about trying SVMs, which may give 100%
accuracy.

I used the 'mlp' and the 'glm' functions of Netlab toolbox.
 
Greg...
Posted: Sat Oct 03, 2009 12:04 am
Guest
On Oct 1, 3:16 pm, David <dtian.ty... at (no spam) googlemail.com> wrote:
Quote:
Hi,

I have read the book "Neural Networks for Pattern Recognition" by
Bishop (1995). It explains that a logistic output unit should be used
for a 2-class classification problem and softmax output units should
be used for a multi-class classification problem - each softmax output
corresponds to a class label.

I am just wondering. Is it appropriate to use logistic units for multi-
class classification problems as well as 2-class problems?

It depends on what you mean by "appropriate". If the outputs of a
universal approximator are configured as a c-class classifier with
unity-sum {0,1} targets, the outputs are consistent estimates of the
class posterior probabilities conditional on the input.

1. With a linear output layer the outputs
a. obey the unity sum constraint
b. are NOT constrained to the [0,1] interval.

2. With a logsig (logistic sigmoid) output layer the outputs
a. are NOT constrained to obey the unity sum constraint.
b. are constrained to the (0,1) (NOT [0,1]!) interval.

3. With a softmax output layer the outputs
a. are constrained to obey the unity sum constraint.
b. are constrained to the (0,1) (NOT [0,1]!) interval.
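
The logistic and softmax cases can be checked numerically. The following is only a minimal Python sketch: the pre-activation values are made up for illustration, and the function names are not taken from any toolbox mentioned in this thread.

```python
import math

def logistic(activations):
    """Apply the logistic sigmoid to each output independently."""
    return [1.0 / (1.0 + math.exp(-a)) for a in activations]

def softmax(activations):
    """Exponentiate and normalise so that the outputs sum to one."""
    m = max(activations)  # subtracting the maximum avoids overflow
    exps = [math.exp(a - m) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical pre-activations of a 5-class output layer for one input.
a = [2.0, -1.0, 0.5, 0.0, -3.0]

for name, fn in (("logistic", logistic), ("softmax", softmax)):
    y = fn(a)
    print(f"{name:8s} sum={sum(y):.4f} min={min(y):.4f} max={max(y):.4f}")
```

Both sets of outputs stay strictly inside (0,1), but only the softmax outputs are forced to sum to one.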

There are generally two concerns

1. Classifier accuracy (error rate)
2. Classifier precision (confidence levels, class odds ratios)

Class assignment of an input is typically determined by
the maximum output, regardless of its value. If classifier accuracy
or output ranking is the only concern, any of the three choices
would probably suffice.

However, if the values of the posterior estimates are of concern for
confidence or Risk (sum of products of Priors, Misclassification
Costs and Posteriors) estimates, the linear layer is probably
inadequate.

Theoretically, softmax would be the appropriate choice. However,
it is common to use logistic outputs and just divide each output
by the sum if the unity sum constraint is required.

Since the MATLAB NN Toolbox does not support the softmax output,
I'm afraid this occurs too often (NOTE I have posted a fix for
this in comp.soft-sys.matlab).
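
The workaround of dividing each logistic output by the sum can be sketched in a few lines of Python. The output values below are hypothetical, not results from any network in this thread.

```python
def normalise(outputs):
    """Divide each output by the sum so that the results sum to one."""
    total = sum(outputs)
    return [y / total for y in outputs]

# Hypothetical logistic outputs of a trained 5-class network for one input.
y = [0.88, 0.27, 0.62, 0.50, 0.05]
p = normalise(y)

print([round(v, 3) for v in p])
# Dividing by a positive constant keeps the ranking, so the predicted
# class (index of the largest output) does not change.
print(y.index(max(y)) == p.index(max(p)))
```

The normalisation changes the values that are read as probabilities, not the class assignment.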

Finally, note that the canonical objective function for each of the
three output choices is different (See Bishop):

Linear: MSE
Logistic: Nonexclusive Cross Entropy
Softmax: Mutually Exclusive Cross Entropy

However, MSE is often used for Logistic and Softmax.
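
For reference, the three objective functions can be written out for a single training case as below. This is a hedged Python sketch: the scaling constants (for example, mean versus half-sum for the squared error) vary between texts, and the target and output vectors are made up.

```python
import math

def mse(y, t):
    """Mean squared error over the outputs."""
    return sum((yi - ti) ** 2 for yi, ti in zip(y, t)) / len(y)

def nonexclusive_cross_entropy(y, t):
    """Sum of independent binary cross-entropies, one per logistic output."""
    return -sum(ti * math.log(yi) + (1 - ti) * math.log(1 - yi)
                for yi, ti in zip(y, t))

def exclusive_cross_entropy(y, t):
    """Cross-entropy for 1-of-c targets, as paired with softmax outputs."""
    return -sum(ti * math.log(yi) for yi, ti in zip(y, t) if ti > 0)

t = [0, 0, 1, 0, 0]                  # 1-of-c target: class 3
y = [0.05, 0.05, 0.80, 0.05, 0.05]   # hypothetical network outputs

print("MSE                      ", mse(y, t))
print("nonexclusive cross-entropy", nonexclusive_cross_entropy(y, t))
print("exclusive cross-entropy   ", exclusive_cross_entropy(y, t))
```

The nonexclusive form also penalises the outputs for the wrong classes individually, whereas the exclusive form only looks at the output for the correct class and relies on the unity sum to constrain the rest.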

I'm sure there have been many studies comparing the use
of the three activation functions using one or more of the
three objective functions. The only one I recall is an example
in a post by Warren Sarle where he demonstrated that outliers
could cause a linear classifier to have an inferior correct
classification rate.

If, for some strange reason, you cannot train with softmax, use
logsig and, if needed, add a normalizer after training. I would
expect significant differences only if the training set is
insufficiently large or pathological.

Hope this helps.

Greg
 
Greg...
Posted: Sat Oct 03, 2009 1:19 am
Guest
On Oct 2, 7:19 am, David <dtian.ty... at (no spam) googlemail.com> wrote:
Quote:
If the classes are mutually exclusive then you wouldn't be exploiting
the prior knowledge that the outputs should sum to one. In the limit
of an infinite dataset and a sufficiently large network then I expect
there would be no difference, but in the case of a finite dataset this
constraint on the outputs probably wouldn't hold. One very practical
reason for using the softmax units would be that the network would not
need to devote hidden layer units to try and make the outputs sum
approximately to one, so you may need a larger network for logistic
output units.

In short, I wouldn't go as far as to say logistic units would be
inappropriate, but they are certainly sub-optimal (assuming mutually
exclusive classes).

Agreed.

Quote:
By 'mutually exclusive' do you mean that each of the instances belongs
to 1 class only?

Yes.

Quote:
On page 238 of Bishop's book, it says "If the output
values are to be interpreted as probabilities they must lie in the
range (0,1), and they must sum to unity.

Must sum to unity only if the classes are mutually exclusive.

Quote:
This can be achieved by using
a generalization of the logistic sigmoid activation function which
takes the form etc. which is known as the softmax activation function."
It seems that both logistic and softmax activation functions enable
that the sum of the outputs is one.

Logistic enables but does not enforce unity sums. In general,
logistic enables nonexclusive classes (e.g., tall, dark and handsome).

Quote:
I have tried using both logistic and softmax outputs for a dataset.
The dataset contains 79 attributes, 5 classes and 208 instances. I
partitioned the dataset into 2/3 and 1/3 ratios. Then, I used 2/3
(139 instances) of the dataset as the training set and the 1/3 (69
instances) as testing dataset. I tried different network structures
for each type of outputs.

Number of training equations for a classifier with O outputs

Neq = Ntrn*O = 139*5 = 695

Number of unknown weights

(I-H-O) MLP: Nw(H>0) = (I+1)*H+(H+1)*O = O+(I+O+1)*H = 5 + 85*H
(I-O) Linear: Nw(H=0) = (I+1)*O = 400

A good generalization rule of thumb (ROT) is

Ratio = Neq / Nw >> 1

A more tolerant ROT is Neq >~ 2*Nw

However, you have

H     Ratio
----  ----------------
1     7.7222
2     3.9714
3     2.6731
4     2.0145

0     1.7375 (Linear)
5     1.6163
6     1.3495
7     1.1583
8     1.0146
9     0.9026
10    0.8129

So, unless you are training with weight decay, your MLP
examples below are not very convincing.
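
The weight count and the equations/unknowns ratio are easy to recompute for any architecture. Here is a small Python sketch; the function names are mine, and the sizes are the ones David reported (79 inputs, 5 classes, 139 training instances).

```python
def n_weights(n_in, n_hidden, n_out):
    """Number of weights including biases; n_hidden = 0 means no hidden layer."""
    if n_hidden == 0:
        return (n_in + 1) * n_out
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out

def equation_ratio(n_train, n_in, n_hidden, n_out):
    """Ratio of training equations (Ntrn * O) to unknown weights."""
    return n_train * n_out / n_weights(n_in, n_hidden, n_out)

# Sizes from David's post: 79 inputs, 5 classes, 139 training instances.
for h in (0, 1, 2, 4, 8, 10):
    print(h, n_weights(79, h, 5), round(equation_ratio(139, 79, h, 5), 4))
```

Changing the first argument of `equation_ratio` shows how quickly the ratio falls as the training set shrinks or the hidden layer grows.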

Make it a practice to make multiple runs with random initial
weights for each case.


Quote:
1. Logistic Outputs

Network structure    Test-set classification accuracy (%)
79 x 1 x 5           82.7
79 x 5 x 5           95.7
79 x 8 x 5           98.6
79 x 10 x 5          97.1
79 x 11 x 5          97.1
79 x 12 x 5          97.1
79 x 15 x 5          97.1
79 x 42 x 5          97.1

2. Softmax Outputs

Network structure    Test-set classification accuracy (%)
79 x 1 x 5           82.7
79 x 4 x 5           98.6
79 x 6 x 5           97.1
79 x 8 x 5           97.1
79 x 10 x 5          97.1
79 x 15 x 5          97.1
79 x 20 x 5          97.1

3. Single layer networks

I also tried 2 single layer networks: one network with 5 logistic
outputs, another with 5 softmax outputs.

a single layer network with 5 logistic outputs: 97.1

a single layer network with 5 softmax outputs: 97.1

This reinforces my last comment w.r.t. your MLP training.

Typically, NN comparison studies are more convincing
when the problems are more difficult, with larger error
rates.

Hope this helps.

Greg
 
David...
Posted: Sun Oct 04, 2009 8:07 pm
Guest
Thank you very much for your detailed suggestions. These are very
valuable to me.

Quote:
A good generalization rule of thumb is

Ratio = Neq / Nw >> 1

A more tolerant ROT is Neq >~ 2*Nw


Could you please point me to the references explaining these
performance measures?

I have added a regularization term to the error function for a network
with 5 softmax outputs and the architecture 79 x 10 x 5. Let the error
function be
E + lambda*Omega
where Omega = 0.5*(sum of the squares of the weights) (as defined in
Bishop).
This is achieved by calling mlp(79,10,5,lambda) of Netlab. I used
Scaled Conjugate Gradient (SCG) to train the network. The maximum number
of SCG iterations was set to 1000. I used different values for lambda.
For each lambda value, I ran SCG multiple times. All runs gave almost
identical results in terms of error per epoch and accuracy.
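
The regularized error function described above, E + lambda*Omega with Omega equal to half the sum of squared weights, can be sketched in Python as follows. This is only an illustration and not Netlab's implementation; the weight vector and the data error E are made-up numbers.

```python
def weight_decay_penalty(weights):
    """Omega = 0.5 * (sum of the squares of the weights)."""
    return 0.5 * sum(w * w for w in weights)

def regularised_error(data_error, weights, lam):
    """Total error E + lambda * Omega."""
    return data_error + lam * weight_decay_penalty(weights)

weights = [0.5, -1.0, 2.0]   # hypothetical flattened weight vector
data_error = 0.3             # hypothetical unregularised error E

for lam in (0.1, 0.01, 0.001):
    print(lam, regularised_error(data_error, weights, lam))
```

Because lambda scales the penalty, the total error at convergence is not directly comparable across different lambda values.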

The results are the following:

Iterations  lambda  Error per iteration  Accuracy (%)
1000        0.1     2.35                 97.1
230         0.01    0.35                 97.1
215         0.001   0.047                97.1
1000        0.002   0.086                97.1
350         0.003   0.123                97.1

All the different lambda values give the same accuracy, 97.1%. The MLP
without regularization also gives 97.1%. It seems regularization does
not improve the accuracy of the MLP for this dataset.

David
 
Greg...
Posted: Sun Oct 04, 2009 10:55 pm
Guest
On Oct 4, 4:07 pm, David <dtian.ty... at (no spam) googlemail.com> wrote:
Quote:
Thank you very much for your detailed suggestions. These are very
valuable to me.

A good generalization rule of thumb is

Ratio = Neq / Nw >> 1

A more tolerant ROT is Neq >~ 2*Nw

Could you please point me to the references explaining these
performance measures?

Any good reference on nonregularized linear regression should
mention practical values of the equations/unknowns ratio for precise
coefficient estimation. I have seen 3, 5, 8, 15, 20 and 30 used. If
linear regression errors are Normally distributed, then regression
coefficient estimates will be Normally distributed and the standard
deviations of the coefficient error estimates will depend on the
ratio via a Chi-Squared distribution.

Also, try a Google search using

greg-heath Neq Nw

For most real-world problems, distributions are not Gaussian and there
is no universal practical critical value for the ratio. The smallest
value that yields useful results is problem dependent, and the best bet
for confidence is trial and error.

The extension of the ROT to nonregularized nonlinear regression
has less theoretical support. Nevertheless it has proved extremely
useful when estimating practical lower bounds by trial and error.
Quote:
I have added a regularization term to the error function for a network
with 5 softmax outputs and the architecture 79 x 10 x 5. Let the error
function be
E + lambda*Omega
where Omega = 0.5*(sum of the squares of the weights) (as defined in
Bishop)
This is achieved by calling mlp(79,10,5,lambda) of Netlab. I used the
Scaled Conjugate Gradient to train the network. The maximum no. of
iterations of SCG is set to 1000. I used different values for lambda.
For each lambda value, I ran SCG multiple times. All runs give almost
identical results in terms of error per epoch and accuracies.

The results are the following:

Iterations  lambda  Error per iteration  Accuracy (%)
1000        0.1     2.35                 97.1
230         0.01    0.35                 97.1
215         0.001   0.047                97.1
1000        0.002   0.086                97.1
350         0.003   0.123                97.1

All the different lambda values give the same accuracy, 97.1%. The MLP
without regularization also gives 97.1%. It seems regularization does
not improve the accuracy of the MLP for this dataset.

That is why I said

"Typically, NN comparison studies are more convincing
when the problems are more difficult with larger error
rates"

.... say > 5%.

For example, this problem could have been made harder
by adding computer generated noise to the input data and
using the smallest possible values of H.

It is also helpful to

1. Quote confidence levels. For example, assume that sums of
regression squared errors are CHISQ distributed and that
classification counts are binomially distributed. With Ntst
= 69, one classification error corresponds to an error rate of
1.45%, with a confidence interval of similar size.

2. For small data sets use multiple trials of f-fold cross-validation
to use all of the data (possibly multiple times) for error
estimation to reduce the relative size of confidence interval
to error estimate.
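
As a rough illustration of point 1, a binomial error-rate interval for a test set of 69 cases might be computed as below. This Python sketch uses the simple normal approximation with z = 1.96 (my choice), which is crude for such small error counts; an exact or Wilson interval would be a reasonable alternative.

```python
import math

def error_rate_interval(n_errors, n_test, z=1.96):
    """Normal-approximation interval for a binomial error rate."""
    p = n_errors / n_test
    se = math.sqrt(p * (1.0 - p) / n_test)   # binomial standard error
    lower = max(0.0, p - z * se)             # clip to the valid range
    upper = min(1.0, p + z * se)
    return p, lower, upper

for errors in (1, 2):
    p, lo, hi = error_rate_interval(errors, 69)
    print(f"{errors} error(s): rate={100*p:.2f}%  interval=[{100*lo:.2f}%, {100*hi:.2f}%]")
```

With so few test cases, the interval is wider than the differences between the networks being compared.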

Hope this helps.

Greg
 
Tomasso...
Posted: Mon Oct 05, 2009 12:03 pm
Guest
I'd like to know more about what kind of data David is building
models from. Or what kind of models.

Greg's comments are valid and helpful for data which is sampled
in proportion to how it would arise in "the real world".

However if David is building a choice model or something similar
where the data sampling has an underlying experimental design he
needs to be careful.

Classical choice models use MNLs where experimental designs
are based on optimising the estimation of all important parameters.

More recent Bayesian approaches also bias their sampling to
answer questions 'in simulation' with optimal precision, but for
questions/scenarios which are considered worthy of simulating.

These approaches bias the data collection to maximise information
for some kind of pre-supposed model form. MNLs have a bunch
of exp(sumproduct) predictions (normalised) so they yield normalised
(sums to 1.0000) classification probabilities. Bayesian optimality also
presumes an underlying Bayesian network (or at least drops some
conditional dependences).

So... if David is modelling this kind of data, a NN model may well
go off the rails. The NN models have more degrees of freedom and
can do a better job for natural sampling, but if the sampling has a
deliberate bias to answer presumed questions, the NNs can go wrong.

Tom.
 
Greg...
Posted: Mon Oct 05, 2009 3:03 pm
Guest
On Oct 5, 4:03 am, "Tomasso" <toma... at (no spam) a.a> wrote:
-----SNIP
Quote:
I'd like to know more about what kind of data David is building
models from. Or what kind of models.

Greg's comments are valid and helpful for data which is sampled
in proportion to how it would arise in "the real world".

However if David is building a choice model or something similar
where the data sampling has an underlying experimental design he
needs to be careful.

Classical choice models use MNLs where experimental designs
are based on optimising the estimation of all important parameters.

Duh. Although it's late morning, I still can't deduce the meaning of MNL.

Quote:
More recent Bayesian approaches also bias their sampling to
answer questions 'in simulation' with optimal precision, but for
questions/scenarios which are considered worthy of simulating.

These approaches bias the data collection to maximise information
for some kind of pre-supposed model form. MNLs have a bunch
of exp(sumproduct) predictions (normalised) so they yield normalised
(sums to 1.0000) classification probabilities. Bayesian optimality also
presumes an underlying Bayesian network (or at least drops some
conditional dependences).

So... ...if David is modelling this kind of data, a NN model may well
go off the rails. The NN models have more degrees of freedom and
can do a better job for natural sampling, but if the sampling has a
deliberate bias to answer presumed questions, the NNs can go wrong.

I feel comfortable when the following assumptions are satisfied:

1. There is an underlying deterministic nonpathological I/O
relationship.
2. Data are random samples from the I/O probability distributions.
3. The model is sufficiently flexible for approximating the I/O
relationship.
4. The learning algorithm is sufficient for obtaining accurate
estimates of the model parameters.
5. The amount of data is sufficient for accurate estimation of model
parameters and performance criteria.

Your interesting comments reinforce my concern when assumption 2
is not satisfied.

Hope this helps.

Greg
 
Tomasso...
Posted: Tue Oct 06, 2009 1:33 am
Guest
Greg wrote:
Quote:
On Oct 5, 4:03 am, "Tomasso" <toma... at (no spam) a.a> wrote:
-----SNIP
I'd like to know more about what kind of data David is building
models from. Or what kind of models.

Greg's comments are valid and helpful for data which is sampled
in proportion to how it would arise in "the real world".

However if David is building a choice model or something similar
where the data sampling has an underlying experimental design he
needs to be careful.

Classical choice models use MNLs where experimental designs
are based on optimising the estimation of all important parameters.

Duh. Although it's late morning, I still can't deduce the meaning of
MNL

Multinomial logit. Same as softmax with no hidden layer (and ideally
maximum likelihood).
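
To make the equivalence concrete, a multinomial logit is just a softmax applied to one linear score per class. A minimal Python sketch follows; the weights, biases and input are invented for illustration.

```python
import math

def mnl_probabilities(x, weights, biases):
    """Multinomial logit: softmax of one linear score per class."""
    scores = [b + sum(w * xi for w, xi in zip(row, x))
              for row, b in zip(weights, biases)]
    m = max(scores)                      # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2-input, 3-class model.
x = [1.0, 2.0]
weights = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.0]]
biases = [0.0, 0.1, -0.1]

p = mnl_probabilities(x, weights, biases)
print([round(v, 4) for v in p], "sum =", round(sum(p), 4))
```

Fitting the weights by maximum likelihood corresponds to minimising the mutually exclusive cross-entropy over the training data.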

Quote:
More recent Bayesian approaches also bias their sampling to
answer questions 'in simulation' with optimal precision, but for
questions/scenarios which are considered worthy of simulating.

These approaches bias the data collection to maximise information
for some kind of pre-supposed model form. MNLs have a bunch
of exp(sumproduct) predictions (normalised) so they yield normalised
(sums to 1.0000) classification probabilities. Bayesian optimality also
presumes an underlying Bayesian network (or at least drops some
conditional dependences).

So... ...if David is modelling this kind of data, a NN model may well
go off the rails. The NN models have more degrees of freedom and
can do a better job for natural sampling, but if the sampling has a
deliberate bias to answer presumed questions, the NNs can go wrong.

I feel comfortable when the following assumptions are satisfied:

1. There is an underlying deterministic nonpathological I/O
relationship.
2. Data are random samples from the I/O probability distributions.
3. The model is sufficiently flexible for approximating the I/O
relationship.
4. The learning algorithm is sufficient for obtaining accurate
estimates of the model parameters.
5. The amount of data is sufficient for accurate estimation of model
parameters and performance criteria.

Experimental designs for choice models organise the data sampling into
cells which optimise contrast. This maximises the information used for
estimating the coefficients of a pre-determined model. I.e., it works
best for the model form which is assumed. This means that other model
forms may be affected by bias.

This is done to keep data costs down. Each record is expensive, and the
number of records which a given respondent will provide is limited (because
they are human and get irritated if the task sequence is too long).

Tom.



 
Greg...
Posted: Tue Oct 06, 2009 1:16 pm
Guest
On Oct 5, 6:34 pm, David <dtian.ty... at (no spam) googlemail.com> wrote:
Quote:
The training and testing data that I used are presented below.

-----SNIP

Well, since you spent all of that BW posting the data, you might
as well use a little more to explain the inputs and classes.

Greg
 
Phil Sherrod...
Posted: Wed Oct 07, 2009 1:05 am
Guest
On 5-Oct-2009, Greg <heath at (no spam) alumni.brown.edu> wrote:

Quote:
I feel comfortable when the following assumptions are satisfied:

1. There is an underlying deterministic nonpathological I/O
relationship.
2. Data are random samples from the I/O probability distributions.
3. The model is sufficiently flexible for approximating the I/O
relationship.
4. The learning algorithm is sufficient for obtaining accurate
estimates of the model parameters.
5. The amount of data is sufficient for accurate estimation of model
parameters and performance criteria.

I agree with all of that. But I would add that you need to have a roughly
equal balance of the categories you are trying to predict. Building a
useful model where one category has 100 times as many training cases as
another is very difficult.

--
Phil Sherrod
(PhilSherrod 'at' comcast.net)
http://www.dtreg.com (Decision trees, Neural networks, SVM and Genetic
modeling)
http://www.nlreg.com (Nonlinear Regression)
 
Greg...
Posted: Wed Oct 07, 2009 2:02 am
Guest
On Oct 6, 5:05 pm, "Phil Sherrod" <PhilSher... at (no spam) REMOVETHIScomcast.net>
wrote:
Quote:
On 5-Oct-2009, Greg <he... at (no spam) alumni.brown.edu> wrote:

I feel comfortable when the following assumptions are satisfied:

1. There is an underlying deterministic nonpathological I/O
relationship.
2. Data are random samples from the I/O probability distributions.
3. The model is sufficiently flexible for approximating the I/O
relationship.
4. The learning algorithm is sufficient for obtaining accurate
estimates of the model parameters.
5. The amount of data is sufficient for accurate estimation of model
parameters and performance criteria.

I agree with all of that. But I would add that you need to have a roughly
equal balance of the categories you are trying to predict.

No, that assumption is not necessary. As long as you have enough data,
you only need a learning algorithm that accommodates unbalanced priors.

Since that topic has been discussed several times in the past (e.g.,
see the Google Group archives), unbalanced priors don't make me feel
uncomfortable.

Quote:
Building a useful model where one category has 100 times as many
training cases as another is very difficult.

Difficult, but not impossible. Data duplication, data weighting,
jittering and prior-weighted objective functions tend to be the most
commonly used techniques.
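
One way to read "prior-weighted objective function" is a cross-entropy in which each case is weighted inversely to the frequency of its class. The Python sketch below shows that interpretation only; the weighting scheme, function names and class counts are assumptions for illustration, not a method described in this thread.

```python
import math

def inverse_frequency_weights(class_counts):
    """Weight each class inversely to its frequency.

    The weights are scaled so that they average to 1 over the training cases.
    """
    n = sum(class_counts)
    c = len(class_counts)
    return [n / (c * k) for k in class_counts]

def weighted_cross_entropy(y, target_class, class_weights):
    """Cross-entropy for one case, scaled by the weight of its true class."""
    return -class_weights[target_class] * math.log(y[target_class])

# Hypothetical 2-class problem with a 100:1 imbalance.
counts = [1000, 10]
w = inverse_frequency_weights(counts)
print(w)

# The same predicted probability costs far more for the rare class.
print(weighted_cross_entropy([0.7, 0.3], 0, w))
print(weighted_cross_entropy([0.3, 0.7], 1, w))
```

Duplicating the rare-class cases 100 times would have a similar effect on the objective, at the cost of a larger training set.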

Hope this helps.

Greg
 
 