Science Forum Index  »  Statistics - Math Forum  »  Statistical test to show that two frequency distributions ar
Guest
Posted: Sat Jan 06, 2007 6:33 pm
Hi,

I'm no statistics expert, so please bear with me.

I'm studying a certain type of protein, and I've gathered instances of
this protein from dozens of different bacteria (the protein, while
performing the same function, has a different sequence in each
bacterial species). These proteins contain what are called "glycine
repeats": parts of their sequences are repeats of the form (with
X's representing any amino acid)
G-X1-X2-X3-G-X1-X2-X3-G-X1-X2-X3-G-X1-X2-X3-G-X1-X2-X3-G-X1-X2-X3... I
have calculated the frequency with which each of the 20 amino acids is
found in the X1, X2, and X3 positions in the G-X1-X2-X3-G repeats.
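
For readers who want to reproduce this kind of tally, here is a minimal sketch in Python. The function name position_counts and the regular expression are illustrative assumptions, not the poster's actual method. It keeps raw counts rather than percentages, since a chi-squared test needs counts. Note that a glycine sitting in an X position can produce out-of-frame matches, so real data may need the repeat region located first.

```python
import re
from collections import Counter

# Match a G followed by three residues and another G. The three residues
# sit in a lookahead so the closing G can also open the next repeat unit.
REPEAT = re.compile(r"G(?=([A-Z]{3})G)")

def position_counts(sequences):
    """Return three Counters of raw residue counts for X1, X2 and X3."""
    counts = [Counter(), Counter(), Counter()]
    for seq in sequences:
        for match in REPEAT.finditer(seq):
            for i, residue in enumerate(match.group(1)):
                counts[i][residue] += 1
    return counts

# Made-up sequence with two repeat units: G-AVL-G-STP-G
x1, x2, x3 = position_counts(["GAVLGSTPG"])
```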

Basically, I wish to show that the distribution of the amino acids in
each variable position (X1, X2, or X3) is significantly different from
the "normal" distribution of amino acids found in all proteins. For
instance, this site:

http://www.expasy.ch/sprot/relnotes/relstat.html

gives the frequencies of the 20 amino acids computed from all the amino
acids in all the proteins that are present in their (large) database.

So, my question is, what kind of statistical test can I perform to show
that the distribution of amino acids that I have calculated in these
repeats does not match the distribution given in the above URL?

Thanks in advance for any help.
Bob O'Hara
Posted: Sun Jan 07, 2007 4:26 am
Guest
brt381@mail.usask.ca wrote:
Quote:
snip

You could do a simple chi-squared test.
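
As a rough sketch of what that test involves (the function name and the toy numbers below are assumptions for illustration, not data from this thread): the observed values must be raw counts at one position, and the expected counts are the reference proportions scaled to the same total.

```python
def chi_squared_gof(observed_counts, expected_freqs):
    """Return (statistic, degrees_of_freedom) for a goodness-of-fit test.

    observed_counts: dict mapping category -> raw count (not a percentage)
    expected_freqs:  dict mapping category -> reference proportion
    """
    unknown = set(observed_counts) - set(expected_freqs)
    if unknown:
        raise ValueError(f"no reference frequency for: {sorted(unknown)}")
    total = sum(observed_counts.values())
    norm = sum(expected_freqs.values())  # rescale if proportions do not sum to 1
    stat = 0.0
    for category, freq in expected_freqs.items():
        expected = total * freq / norm
        observed = observed_counts.get(category, 0)
        stat += (observed - expected) ** 2 / expected
    return stat, len(expected_freqs) - 1

# Toy two-category example with made-up numbers.
stat, df = chi_squared_gof({"A": 30, "B": 70}, {"A": 0.5, "B": 0.5})
```

With all 20 amino acids there are 19 degrees of freedom, and the 5% critical value is about 30.14, so a larger statistic would reject the reference distribution at that level. If SciPy is available, scipy.stats.chisquare(f_obs, f_exp) computes the same statistic and also returns a p-value; f_exp must then be given as expected counts with the same total as f_obs. The approximation is usually considered shaky when expected counts are very small, which can happen for rare amino acids.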


I'd be a bit cautious about what you make the comparison to. Firstly,
does it make sense to compare to a database dominated by eukaryotes?
Secondly, the only argument for the proteins being the same would be
neutrality, but a lot of the amino acid positions in the database will
be under selection. I would be more surprised if you showed that the
distribution of amino acids was the same, so it's not clear what you
will learn by showing that they are different. You might therefore have
to focus your (biological) question a bit more, and try and find protein
sequences that, under your null hypothesis, would be similar to yours.

HTH

Bob
Guest
Posted: Sun Jan 07, 2007 11:52 am
Bob O'Hara wrote:
Quote:
brt381@mail.usask.ca wrote:
snip
I'd be a bit cautious about what you make the comparison to. Firstly,
does it make sense to compare to a database dominated by eukaryotes?
Secondly, the only argument for the proteins being the same would be
neutrality, but a lot of the amino acid positions in the database will
be under selection. I would be more surprised if you showed that the
distribution of amino acids was the same, so it's not clear what you
will learn by showing that they are different. You might therefore have
to focus your (biological) question a bit more, and try and find protein
sequences that, under your null hypothesis, would be similar to yours.

Very good point. Perhaps it would be more meaningful to compare the
frequencies of X1, X2, and X3 in the GXXXG's with the overall frequency
of each amino acid in the entirety of these proteins?
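
A minimal sketch of how such a within-protein background could be computed (the function name is an illustrative assumption). The resulting proportions could then stand in for the database frequencies as the expected distribution in a chi-squared test.

```python
from collections import Counter

def background_freqs(sequences):
    """Proportion of each residue over the full length of all sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}

# Made-up sequences, for illustration only.
freqs = background_freqs(["GGAV", "GA"])
```

One caveat with this choice: the background includes the repeat regions themselves, so glycine in particular will be over-represented relative to what the X positions could show.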

Bob O'Hara
Posted: Sun Jan 07, 2007 12:54 pm
Guest
brt381@mail.usask.ca wrote:
Quote:
Bob O'Hara wrote:
brt381@mail.usask.ca wrote:
snip
I'd be a bit cautious about what you make the comparison to. Firstly,
does it make sense to compare to a database dominated by eukaryotes?
Secondly, the only argument for the proteins being the same would be
neutrality, but a lot of the amino acid positions in the database will
be under selection. I would be more surprised if you showed that the
distribution of amino acids was the same, so it's not clear what you
will learn by showing that they are different. You might therefore have
to focus your (biological) question a bit more, and try and find protein
sequences that, under your null hypothesis, would be similar to yours.

Very good point. Perhaps it would be more meaningful to compare the
frequencies of X1, X2, and X3 in the GXXXG's with the overall frequency
of each amino acid in the entirety of these proteins?

Possibly, but these AAs will also be constrained by selection. I assume
you're interested in whether these AAs are evolving neutrally (i.e.
they're just filling in between the Gs), in which case you want to
compare them with other "neutral" AAs. Do you have data from other
proteins, with similar sequences of "neutral" AAs? Or would it make
sense to compare across species?

I suspect someone has already tackled these problems: you could check
the protein evolution literature.

Bob
 
All times are GMT - 5 Hours