Box & whisker plots with skewed distributions?
Jeff Miller
Posted: Wed Apr 02, 2008 1:38 pm
Guest
I'm just starting to use box & whisker plots
and have a (possibly very naive) question
about using them with skewed distributions.

The box represents skew in the middle of the
distribution nicely by showing the quartile
locations q1, q2, and q3. But the whisker
lengths are defined in terms of IQR=q3-q1, so
the whiskers are the same length for both tails.
This means that the skew in the tails (i.e.
below q1 or above q3) is not represented,
as far as I can see.

I wonder why the same value of IQR is used
to calculate the whiskers at both ends of the distribution.
Instead, I'd think the whisker should be longer on the side
with the longer tail. It would be easy enough to
define the whisker lengths for the two tails
separately, for example in terms of 2*(q2-q1)
and 2*(q3-q2). I'm just wondering why that
isn't routinely done. Have I overlooked
something obvious?

Thanks for your comments,

(By the way, with the distributions I am examining,
it would be very unhelpful to transform the scores
to eliminate the skew, since the skew itself is part
of what's interesting about the dataset.)
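
A minimal sketch of the two rules being compared above, using only the Python standard library. The function names, the sample values, and the choice of the "inclusive" quartile method are assumptions made for illustration; other quartile conventions give slightly different numbers.

```python
from statistics import quantiles

def symmetric_fences(data, k=1.5):
    """Conventional rule: both fences sit k*IQR beyond the box."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def asymmetric_fences(data, k=2.0):
    """Proposed rule: each fence is scaled by its own half of the box,
    i.e. k*(q2-q1) below q1 and k*(q3-q2) above q3."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return q1 - k * (q2 - q1), q3 + k * (q3 - q2)

# Hypothetical right-skewed sample (q1=3, q2=5, q3=10 under this method).
sample = [1, 2, 3, 4, 5, 7, 10, 15, 25]

print("symmetric :", symmetric_fences(sample))
print("asymmetric:", asymmetric_fences(sample))
```

With this sample the conventional fences are equally far from the box on both sides, while the proposed fences allow 4 units below q1 and 10 units above q3.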
Richard Ulrich
Posted: Wed Apr 02, 2008 8:04 pm
Guest
On Wed, 2 Apr 2008 16:38:55 -0700 (PDT), Jeff Miller
<milleratotago@yahoo.com> wrote:

Quote:

I'm just starting to use box & whisker plots
and have a (possibly very naive) question
about using them with skewed distributions.

The box represents skew in the middle of the
distribution nicely by showing the quartile
locations q1, q2, and q3. But the whisker
lengths are defined in terms of IQR=q3-q1, so
the whiskers are the same length for both tails.
This means that the skew in the tails (i.e.
below q1 or above q3) is not represented,
as far as I can see.

The proper drawing will truncate each whisker at
a point where there is still data, and will show
points that may exist beyond the range.

So if there is much skewness, there will be
whiskers of different length, and some points
indicated beyond the longer one.

Quote:

I wonder why the same value of IQR is used
to calculate the whiskers at both ends of the distribution.
Instead, I'd think the whisker should be longer on the side
with the longer tail. It would be easy enough to
define the whisker lengths for the two tails
separately, for example in terms of 2*(q2-q1)
and 2*(q3-q2). I'm just wondering why that
isn't routinely done. Have I overlooked
something obvious?

Thanks for your comments,

(By the way, with the distributions I am examining,
it would be very unhelpful to transform the scores
to eliminate the skew, since the skew itself is part
of what's interesting about the dataset.)

Achieving "equal intervals" is an important reason to
perform a transformation, which is usually done by some
logically implied transformation, such as logs for chemical
concentrations, etc. These incidentally may achieve
homogeneity of variance or normality. If you don't know
much about the data, sometimes the latter (bad distribution)
can be a cue about the former (equal intervals). But that is
a bad thing to assume without thinking about it.

When outliers are the component of interest, it certainly
does not make sense to suppress them.

--
Rich Ulrich

http://www.pitt.edu/~wpilib/index.html
Bruce Weaver
Posted: Thu Apr 03, 2008 5:55 am
Guest
Jeff Miller wrote:
Quote:
I'm just starting to use box & whisker plots
and have a (possibly very naive) question
about using them with skewed distributions.

The box represents skew in the middle of the
distribution nicely by showing the quartile
locations q1, q2, and q3. But the whisker
lengths are defined in terms of IQR=q3-q1, so
the whiskers are the same length for both tails.
This means that the skew in the tails (i.e.
below q1 or above q3) is not represented,
as far as I can see.

I wonder why the same value of IQR is used
to calculate the whiskers at both ends of the distribution.
Instead, I'd think the whisker should be longer on the side
with the longer tail. It would be easy enough to
define the whisker lengths for the two tails
separately, for example in terms of 2*(q2-q1)
and 2*(q3-q2). I'm just wondering why that
isn't routinely done. Have I overlooked
something obvious?

Thanks for your comments,

(By the way, with the distributions I am examining,
it would be very unhelpful to transform the scores
to eliminate the skew, since the skew itself is part
of what's interesting about the dataset.)

Hi Jeff. The following (from the Wikipedia page on box-plots) expands a
bit on what Rich said.

* Any data observation which lies more than 1.5*(IQR) lower than the
first quartile or 1.5*(IQR) higher than the third quartile is considered
an outlier. Indicate where the smallest value that is not an outlier is
by connecting it to the box with a horizontal line or "whisker".
Optionally, also mark the position of this value more clearly using a
small vertical line. Likewise, connect the largest value that is not an
outlier to the box by a "whisker" (and optionally mark it with another
small vertical line).

* Indicate outliers by open and closed dots. "Extreme" outliers, or
those which lie more than three times the IQR to the left and right from
the first and third quartiles respectively, are indicated by the
presence of an open dot. "Mild" outliers - that is, those observations
which lie more than 1.5 times the IQR from the first and third quartile
but are not also extreme outliers are indicated by the presence of a
closed dot. (Sometimes no distinction is made between "mild" and
"extreme" outliers.)

So the maximum length of the whiskers is 1.5*IQR, but they can be
shorter than that. In your case, the end with the long tail will likely
have a whisker that is 1.5*IQR long PLUS some outliers and possibly
extreme outliers.
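
The rule quoted above can be sketched in a few lines of Python. This is only an illustration: the function name, the sample data, and the "inclusive" quartile method are assumptions, and plotting packages differ in how they compute quartiles.

```python
from statistics import quantiles

def tukey_boxplot_stats(data):
    """Whisker ends and suspected outliers under the 1.5*IQR / 3*IQR rule."""
    xs = sorted(data)
    q1, q2, q3 = quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    inner_lo, inner_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # mild-outlier fences
    outer_lo, outer_hi = q1 - 3.0 * iqr, q3 + 3.0 * iqr  # extreme-outlier fences
    # Each whisker stops at the most extreme observation inside the inner fences.
    inside = [x for x in xs if inner_lo <= x <= inner_hi]
    return {
        "q1": q1, "median": q2, "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "mild": [x for x in xs
                 if outer_lo <= x < inner_lo or inner_hi < x <= outer_hi],
        "extreme": [x for x in xs if x < outer_lo or x > outer_hi],
    }

# Hypothetical right-skewed sample.
sample = [1, 2, 3, 4, 5, 7, 10, 15, 25, 40]
stats = tukey_boxplot_stats(sample)
print(stats)
```

For this sample the lower whisker ends at 1 and the upper at 25, so the drawn whiskers have different lengths, and 40 is flagged as a mild outlier.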

--
Bruce Weaver
bweaver@lakeheadu.ca
www.angelfire.com/wv/bwhomedir
"When all else fails, RTFM."
Jeff Miller
Posted: Thu Apr 03, 2008 1:53 pm
Guest
On Apr 3, 2:04 pm, Richard Ulrich <Rich.Ulr...@comcast.net> wrote:

Thanks for your comments, Rich.

Quote:
The proper drawing will truncate each whisker at
a point where there is still data, and will show
points that may exist beyond the range.

So if there is much skewness, there will be
whiskers of different lengths

But that's not _necessarily_ true, is it?
The pre-truncation whisker length depends
only on the data between q1 & q3, whereas
whisker truncation itself depends only on the
data outside these values. So, depending
on the ranges of values below q1 and
above q3, you might not truncate either
whisker, even with quite large skewness.

I can see how your comments would be true
with lots of distributions, though.
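
A constructed example of the case described above, as a rough sketch. The data are invented so that both tails reach almost to the fences, and the function name and "inclusive" quartile method are assumptions.

```python
from statistics import mean, median, quantiles

def whisker_lengths(data, k=1.5):
    """Return (lower, upper) drawn whisker lengths under the k*IQR rule."""
    xs = sorted(data)
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - k * iqr, q3 + k * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [x for x in xs if lo_fence <= x <= hi_fence]
    return q1 - inside[0], inside[-1] - q3

# Invented sample: q1=0, q3=10, fences at -15 and 25, long upper tail.
skewed = [-14, -9, -4, 0, 1, 2, 3, 6, 8, 10, 24, 60, 150]

lengths = whisker_lengths(skewed)
print("whisker lengths:", lengths)
print("mean:", mean(skewed), "median:", median(skewed))
```

Here both whiskers come out 14 units long even though the mean is far above the median; the long upper tail shows up only as the two points plotted beyond the upper whisker.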
Jeff Miller
Posted: Thu Apr 03, 2008 2:09 pm
Guest
Hi Bruce,

Quote:
The following (from the Wikipedia page on box-plots) expands a
bit on what Rich said.

I had seen that, actually, but thanks for your reply.


Quote:
* Indicate outliers by open and closed dots. "Extreme" outliers, or
those which lie more than three times the IQR to the left and right from
the first and third quartiles respectively, are indicated by the
presence of an open dot. "Mild" outliers - that is, those observations
which lie more than 1.5 times the IQR from the first and third quartile
but are not also extreme outliers are indicated by the presence of a
closed dot.

What I don't like about this is that I don't think these
extreme values are outliers in my data set, at least not
in the sense of an outlier as a data point generated (primarily)
by some process other than the one I am trying to study.

In general, I am very uncomfortable with any such hard
and fast definitions of outliers. In my area, people often
exclude outliers from their data sets (e.g., before
computation of means, etc). These sorts of statistical
guidelines seem to evolve quickly into statistical
dogma, leading people to exclude a data point
at q3+3.001*IQR but include one at q3+2.999*IQR.
It seems to me that would probably be a mistake.
Bruce Weaver
Posted: Thu Apr 03, 2008 9:06 pm
Guest
Jeff Miller wrote:
Quote:
Hi Bruce,

The following (from the Wikipedia page on box-plots) expands a
bit on what Rich said.
I had seen that, actually, but thanks for your reply.

* Indicate outliers by open and closed dots. "Extreme" outliers, or
those which lie more than three times the IQR to the left and right from
the first and third quartiles respectively, are indicated by the
presence of an open dot. "Mild" outliers - that is, those observations
which lie more than 1.5 times the IQR from the first and third quartile
but are not also extreme outliers are indicated by the presence of a
closed dot.
What I don't like about this is that I don't think these
extreme values are outliers in my data set, at least not
in the sense of an outlier as a data point generated (primarily)
by some process other than the one I am trying to study.

In general, I am very uncomfortable with any such hard
and fast definitions of outliers. In my area, people often
exclude outliers from their data sets (e.g., before
computation of means, etc). These sorts of statistical
guidelines seem to evolve quickly into statistical
dogma, leading people to exclude a data point
at q3+3.001*IQR but include one at q3+2.999*IQR.
It seems to me that would probably be a mistake.

Jeff, I agree with your comments about outliers, and doubt that
Tukey ever intended the suspected outliers in box plots to be
treated that way.

Here are some examples of box plots for various kinds of
distributions (including left & right skewed) that might be useful:

http://www.basic.northwestern.edu/statguidefiles/boxplots.html

Cheers,
Bruce
--
Bruce Weaver
bweaver@lakeheadu.ca
www.angelfire.com/wv/bwhomedir
"When all else fails, RTFM."
Richard Ulrich
Posted: Thu Apr 03, 2008 11:57 pm
Guest
On Thu, 3 Apr 2008 17:09:39 -0700 (PDT), Jeff Miller
<milleratotago@yahoo.com> wrote:

Quote:
Hi Bruce,

The following (from the Wikipedia page on box-plots) expands a
bit on what Rich said.
I had seen that, actually, but thanks for your reply.

* Indicate outliers by open and closed dots. "Extreme" outliers, or
those which lie more than three times the IQR to the left and right from
the first and third quartiles respectively, are indicated by the
presence of an open dot. "Mild" outliers - that is, those observations
which lie more than 1.5 times the IQR from the first and third quartile
but are not also extreme outliers are indicated by the presence of a
closed dot.
What I don't like about this is that I don't think these
extreme values are outliers in my data set, at least not
in the sense of an outlier as a data point generated (primarily)
by some process other than the one I am trying to study.

In general, I am very uncomfortable with any such hard
and fast definitions of outliers. In my area, people often
exclude outliers from their data sets (e.g., before
computation of means, etc). These sorts of statistical
guidelines seem to evolve quickly into statistical
dogma, leading people to exclude a data point
at q3+3.001*IQR but include one at q3+2.999*IQR.
It seems to me that would probably be a mistake.

Yes. I guess that is the sort of mistake that tends to
evolve when "data analysis" is in the hands of people
who match their lack of statistical training with a lack
of common sense. "Hard and fast definitions"?

On the other hand, good experience with one source of
data can lead to general guidelines, including treatment
of outliers, which can lead (even) statisticians astray,
for a while. (I'm thinking of the ozone-hole example.)


--
Rich Ulrich
http://www.pitt.edu/~wpilib/index.html

All times are GMT - 5 Hours