brt381@mail.usask.ca wrote:
Hi,
I'm no statistics expert, so please bear with me.
I'm studying a certain type of protein, and I've gathered instances of
this protein from dozens of different bacteria (the protein, while
performing the same function, has a different sequence in each
bacterial species). These proteins contain what are called "glycine
repeats", in that parts of their sequences are repeats of the form (with
X's representing any amino acid)
G-X1-X2-X3-G-X1-X2-X3-G-X1-X2-X3-G-X1-X2-X3-G-X1-X2-X3-G-X1-X2-X3...
I have calculated the frequency with which each of the 20 amino acids is
found in the X1, X2, and X3 positions in the G-X1-X2-X3-G repeats.
Basically, I wish to show that the distribution of the amino acids in
each variable position (X1, X2, or X3) is significantly different from
the "normal" distribution of amino acids found in all proteins. For
instance, this site:
http://www.expasy.ch/sprot/relnotes/relstat.html
gives the frequencies of the 20 amino acids computed from all the amino
acids in all the proteins that are present in their (large) database.
So, my question is, what kind of statistical test can I perform to show
that the distribution of amino acids that I have calculated in these
repeats does not match the distribution given in the above URL?
Thanks in advance for any help.
You could do a simple chi-squared test.
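As a rough sketch of what that looks like (Python, standard library only): the test needs the raw counts at a position, not the frequencies, and it compares them with the counts you would expect if the reference frequencies applied. The function name chi_squared_gof and the four-letter toy data below are made up for illustration; they are not your numbers or the database's.

```python
def chi_squared_gof(observed_counts, reference_freqs):
    """Chi-squared goodness-of-fit: return (statistic, degrees of freedom).

    observed_counts: dict mapping amino acid -> count seen at one position
    reference_freqs: dict mapping amino acid -> reference frequency
    """
    categories = sorted(reference_freqs)
    # Renormalise in case the reference frequencies do not sum exactly to 1.
    total_ref = sum(reference_freqs[c] for c in categories)
    n = sum(observed_counts.get(c, 0) for c in categories)

    statistic = 0.0
    for c in categories:
        expected = n * reference_freqs[c] / total_ref
        observed = observed_counts.get(c, 0)
        statistic += (observed - expected) ** 2 / expected

    # One degree of freedom is lost because the counts must sum to n.
    return statistic, len(categories) - 1


# Hypothetical toy data: a four-letter alphabet with made-up counts.
observed = {"A": 30, "G": 10, "L": 40, "S": 20}
reference = {"A": 0.25, "G": 0.25, "L": 0.25, "S": 0.25}

stat, df = chi_squared_gof(observed, reference)
print(f"chi-squared = {stat:.2f} on {df} degrees of freedom")
```

With all 20 amino acids you would have 19 degrees of freedom, and you would run this once for each of X1, X2 and X3. The p-value comes from a chi-squared table, or, if you have SciPy installed, scipy.stats.chisquare(f_obs, f_exp) returns the statistic and p-value directly. The approximation gets unreliable when expected counts are small (a common rule of thumb is below about 5), which could matter for the rarer amino acids.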
I'd be a bit cautious about what you make the comparison to. Firstly,
does it make sense to compare to a database dominated by eukaryotes?
Secondly, the only argument for the proteins being the same would be
neutrality, but a lot of the amino acid positions in the database will
be under selection. I would be more surprised if you showed that the
distribution of amino acids was the same, so it's not clear what you
will learn by showing that they are different. You might therefore have
to focus your (biological) question a bit more, and try to find protein
sequences that, under your null hypothesis, would be similar to yours.