Basics of Statistical Notation
You will be introduced to a large number of formulas in this section on statistical concepts.
These formulas use a relatively standardized notation to simplify the description of how a
statistic should be computed. This section introduces the logic and basic concepts behind that
notation. With each new formula, we will remind you what the notation means, but this section
provides a heads-up before we get to those formulas and serves as a helpful summary in case you
forget a notational concept.
Designating a Variable
Statistical formulas use algebraic notation, which relies on letters to designate variables.
By convention, if there is just one variable in a formula, the letter X is used to designate
the variable. If there is a second variable in the formula, traditionally the letter Y is used
to indicate the variable. If there is a third variable, the letter Z is traditionally used. After
that, there are no universal traditions, but it is rare to have statistical formulas that involve
more than three variables.
The capital letter N traditionally refers to the total number of participants in a study.
The single letter in statistical formulas refers to the variable. An individual's score on
that variable can be indicated by a subscript, which is a number written below the letter to
refer to a specific score. For example, X1 refers to the score for the first person on the X
variable, and X27 refers to the score for the 27th person on the X variable. Y11 refers to the
score on the Y variable for the 11th person.
If there are several groups of participants, the number of participants in each group is
indicated by a lower-case n with a subscript to indicate the group number. For example,
n1 refers to the number of participants in the first group.
Traditionally, the number of groups in a study is referred to by the lower-case letter k,
although in complex designs, this tradition is modified. Therefore, nk refers to the number
of participants in the kth group, which is the last group.
Algebraic Rules
Algebra specifies the order in which operations are to be carried out. The order is:
o The highest priority action should be to raise any variables to a power. For
example, to compute 2X², you would first square the value of X and then multiply
by 2.
o The next highest priority action is multiplication or division. For example, to
compute 2X + 1, you would multiply the value of X by 2 and then add 1.
o The lowest priority action is addition or subtraction.
You can override any of these priorities by using parentheses. Anything in parentheses
should be done before other actions. For example, X + Y² is computed by squaring Y and
adding it to X. In contrast, (X + Y)² is computed by adding X and Y first and then squaring
the sum. In other words, the parentheses in the second expression override the normal
priority order (raise to a power before adding).
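These same priority rules are built into most programming languages, so a quick Python sketch may help (the values are chosen purely for illustration):

X = 3
Y = 4
print(2 * X**2)      # exponent first: 2 * (3**2) = 18
print(2 * X + 1)     # multiplication before addition: 6 + 1 = 7
print(X + Y**2)      # square Y, then add X: 3 + 16 = 19
print((X + Y)**2)    # parentheses first: (3 + 4)**2 = 49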
Summation Notation
Many statistical formulas involve adding a series of numbers. The notation for adding a
series of numbers is the capital Greek letter sigma (Σ), which stands for "add up
everything that follows." Therefore, if the sigma is followed by the letter X (ΣX), it means
that you should add up all of the X scores.
Parentheses indicate that you should perform the operation in parentheses before you do
the summation. For example, the notation below indicates that you should subtract Y
from X before you sum the difference.
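Σ(X − Y)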
Standard Notation for Statistics
A distinction is made between a statistic that is computed on everyone in a population
and the same statistic that is computed on everyone in a sample drawn from the
population.
o A statistic computed on everyone in the population is called a population
parameter.
o A statistic computed on everyone in a sample is called a sample statistic.
The population mean is designated by the Greek letter mu, whereas the sample mean is
designated by an X with a bar over the top (read X bar). Both are illustrated below.
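Population mean: μ        Sample mean: X̄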
A similar distinction is made for the standard deviation, which is a measure of variability.
The population standard deviation is indicated by the lower-case Greek letter sigma (σ),
whereas the sample standard deviation is indicated by the lower-case letter s, as shown
below.
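Population standard deviation: σ        Sample standard deviation: s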
The lower case letter r is used to designate a correlation. If there is any doubt about
which two variables were used to compute the correlation, the two variables are listed as
subscripts. For example, rXY indicates the correlation of X and Y.
Descriptive Statistics
Descriptive statistics describe aspects of a data set in a single number. Many descriptive statistics
are also used in the computation of inferential statistics. In this section, we will be covering four
classes of descriptive statistics. Those classes and their definitions are listed below.
Measures of Central Tendency - measures that indicate the typical or average score.
Measures of Variability - measures that indicate the spread of scores about the measure
of central tendency.
Relative Scores - a score for an individual that tells how that individual performed
relative to the other individuals on the measure.
Measures of Relationships - measures that indicate the strength and direction of a
relationship between two or more variables.
In addition to these widely used descriptive statistics, we will also introduce some less
frequently used statistics that you will occasionally run into, especially if you use
computers to do your statistical analyses.
Measures of Central Tendency
Measures of central tendency indicate the typical or average score in a distribution of scores.
This section covers three measures of central tendency: the mean, median, and mode.
The Mean
The mean is the arithmetic average of the scores. It is computed by adding all the scores and
dividing by the total number of scores. Remember from our section on notation that we use
summation notation to indicate that we should add all the scores, and we use the uppercase
letter N to indicate the total number of scores. The notation for the mean of the X scores is an
uppercase X with a bar across the top. Therefore, the formula for computing the mean is written
as follows:
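X̄ = ΣX / N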
If you have several groups in your research study, it is traditional to compute the mean for each
group. In such a situation, you would use a subscript notation to indicate the groups. In formulas,
the groups are numbered from 1 to k. Remember that k is the letter that we use to indicate the
number of groups. So using this notation, the mean for Group 1 would use the following
formula.
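X̄1 = ΣX1 / n1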
Note that we use a subscript 1 to indicate that we are computing the mean for Group 1. We are
adding all the scores in Group 1 (the X1s) and dividing by the number of scores in Group 1 (n1).
We use a lowercase n here, because we are NOT talking about the total number of scores in the
study, but rather the number of scores in just one group of the study.
These notation rules can be a pain to learn initially, but once you get them down, you can quickly
translate almost any formula into the computational steps that are required.
Although it is convenient to number groups in formulas, and we will generally number
groups in all of the formulas that we use, in your own computations it is easier to use
subscripts that are not a code. For example, if you are studying gender differences on a variable, you might
compute the mean of that variable for men and women separately. Instead of using the subscripts
1 and 2 and remembering which refers to males and which refers to females, you might as well
use a descriptive subscript. For example, the mean of the females might be written as follows:
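X̄females = ΣXfemales / nfemales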
The mean is the most widely used measure of central tendency, because it is the measure of
central tendency that is most often used in inferential statistics. However, as you will see at the
end of this section, the mean does not always provide the best indication of the typical score in a
distribution.
The Median
The median is the middle score in a distribution. It is also the score at the 50th percentile, which
means that 50% of the scores are lower and 50% are higher. In the textbook, we showed how to
compute the median with a small number of scores. In such a case, you:
1. Order the scores from lowest to highest and count the number of scores (N).
2. If the number of scores is odd, you add 1 to N and divide by 2 to get the middle score
[(N+1)/2]. For example, if you have 15 scores, the middle score is the 8th score
[(15+1)/2=8]. There will be seven scores above the 8th score and seven below it.
3. If the number of scores is even, there will be no middle score. Instead there will be two
scores that straddle the middle. For example, if there are 14 scores, the 7th and 8th scores
in your ordered list will straddle the middle. You can figure out which scores to focus on
by dividing N by 2 and taking that score from the bottom and the one above it [e.g.,
14/2=7, so you take the 7th and 8th scores from the bottom]. The median is between these
two scores, so you average them. If the 7th score is 36 and the 8th score is 39, you sum
the two and divide by two to get the average [e.g., (36+39)/2=37.5].
This procedure works fine when there is a small number of scores and little duplication of
scores, but it is not considered accurate enough when there are a large number of scores and
many scores are duplicated. In such a situation, a more complicated formula is used. This
formula is listed below. Don't panic; it is easier to use than it looks.
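Median = LRL + [(nmedian − nLRL) / fi] × i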
To compute a median using this formula, you must first create a frequency or grouped frequency
distribution. We will use the frequency distribution that we created in the section on organizing
data. That distribution is listed below.
Score    Frequency    Cumulative Frequency
  17          8             394
  16         20             386
  15         33             366
  14         48             333
  13         71             285
  12         85             214
  11         58             129
  10         39              71
   9         21              32
   8         11              11
Now we need to define each of the terms in the formula for the median. We must do this in steps.
We start by finding the middle score.
1. The middle score (nmedian) is the total number of scores divided by 2. In a frequency
distribution that also includes a cumulative frequency column, you can read the total
number of scores (N) as the number at the top of the cumulative frequency column. In
this case, N is 394, so nmedian is 394/2=197.
2. Next you find the interval that includes the 197th score from the bottom. To do this, start
at the bottom of the cumulative frequency column and move up until you find the first
number that is equal to or greater than 197. In this case, it is the interval for the
score of 12. You may be surprised that we call that an interval, because it contains
only one score, but for the purposes of this computation it is an interval from 11.5 to
12.5.
3. Now we can identify all of the numbers that will go into the formula for the median. LRL
stands for Lower Real Limit of the interval that contains the median. In this case, it is
11.5. The interval width (i in the formula) is 1. We compute it by subtracting the lower
real limit from the upper real limit [e.g., 12.5-11.5=1.0]. We had previously computed
nmedian as 197 [394/2]. The term nLRL refers to the number of people with scores below the
lower real limit of the interval. You can read this off of the frequency distribution by
noting the number in the cumulative frequency column for the interval below the one that
contains the median. In this case, it is 129. In other words, 129 people in our example
score below a 12. Finally, fi is the frequency of scores within the interval that contains the
median. We can read that number from the frequency column of our distribution. In this
case, it is 85.
4. Now we plug all of those items into the formula to get the following.
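Median = 11.5 + [(197 − 129) / 85] × 1 = 11.5 + 0.8 = 12.3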
This is the most complicated formula that you have had to deal with so far, but the logic behind it
is not as complicated as the formula makes it appear. We have determined that the middle score
(197th) appears in the interval of 12, which has real limits from 11.5 to 12.5. Furthermore, we
have determined that 85 people are in that interval and 129 score below that interval. This
formula makes the assumption that the 85 people scoring in the interval that contains the median
are evenly distributed. That is, the first person takes the bottom 1/85th of that interval, the next
person takes the next 1/85th of that interval, up to the last person, who takes the top 1/85th of
that space. The value "nmedian - nLRL" computes how far we have to count up from the bottom of the
interval. In this case, we must count up 68 people [197-129], which is 80% of the way from the
bottom of the interval [68/85]. The formula builds exactly this process into the computation.
Although you can create a frequency distribution and do the computations that we just walked
you through, with large data sets it is much more likely that you will use a computer package
such as SPSS for Windows, which uses this same formula to compute the median.
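If you would rather see the logic in code, here is a minimal Python sketch of the grouped-median formula; the variable names mirror the terms of the formula and are our own:

# Frequency distribution: score -> frequency.
freqs = {8: 11, 9: 21, 10: 39, 11: 58, 12: 85, 13: 71, 14: 48, 15: 33, 16: 20, 17: 8}

N = sum(freqs.values())      # 394 scores in all
n_median = N / 2             # 197: the middle score

# Walk up the cumulative frequencies to find the interval containing the median.
cumulative = 0
for score, f in sorted(freqs.items()):
    if cumulative + f >= n_median:
        lrl = score - 0.5    # lower real limit of the interval
        i = 1.0              # interval width
        n_lrl = cumulative   # number of scores below the lower real limit
        f_i = f              # frequency within the interval
        break
    cumulative += f

median = lrl + ((n_median - n_lrl) / f_i) * i
print(median)                # 12.3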
The Mode
The mode is the most frequently occurring score. In a frequency distribution, you compute the
mode by looking for the largest number in the frequency column. The score associated with that
number is the mode. Using the frequency distribution previously used for computing the median,
the largest frequency is 85, and it is associated with a score of 12. Therefore, 12 is the mode.
If you have a grouped frequency distribution, the mode is the midpoint of the interval that
contains the largest number of scores. That can create a bit of instability, which we can illustrate
with an example.
Suppose that we use the data from the frequency distribution above to create a grouped
frequency distribution with an interval width of 2 scores. If we start by grouping 8 and 9
together, we will produce the following grouped frequency distribution.
Interval    Frequency    Cumulative Frequency
 16-17          28             394
 14-15          81             366
 12-13         156             285
 10-11          97             129
  8-9           32              32
The interval with the largest frequency is 12-13, and the midpoint of that interval is 12.5.
Therefore, 12.5 is the mode.
However, suppose that we create a similar grouped frequency distribution, again with an interval
of 2, but this time we start with an interval of 7-8. If we do, we will get the following grouped
frequency distribution.
Interval    Frequency    Cumulative Frequency
 17-18           8             394
 15-16          53             386
 13-14         119             333
 11-12         143             214
  9-10          60              71
  7-8           11              11
Now the interval with the largest frequency is 11-12, with a midpoint of 11.5. Therefore the
mode is 11.5. So the mode shifts depending on how we set up the intervals. This effect is rather
small in this example, because the sample sizes are rather large and the distribution is close to
symmetric.
With small sample sizes and less symmetric distributions, you can get huge shifts in the mode.
This is why the mode is considered to be unstable. Another reason the mode is unstable is that a
shift of just a few scores can make a different score the mode entirely.
Comparing the Measures
In the textbook, we provide an example in one of the Cost of Neglect boxes of how the mean can
misrepresent the typical score when there are a few deviant scores. In our example, there were
five employees with the company, four of whom made $40,000 and one making $340,000. The
mean was $100,000, which clearly does not reflect the typical salary. The median ($40,000) was
a much better estimate of the typical salary.
This is an extreme example of a general principle. When a distribution is symmetric, like in the
top panel of the figure below, the mean, median, and mode will all be the same. However, as the
curve becomes more skewed, these three measures of central tendency diverge. The mode will
always be at the peak of the curve, because the highest point indicates the most frequent score.
The mean will be pulled the most toward the tail of the skew, with the median in between.
These graphs may help you to understand what each of these measures of central tendency
measures.
The mode is always the score at which the curve reaches its highest point (i.e., the most
frequent score).
The median is the score that cuts the curve into two equal areas. In other words, the area
above the median line is equal to the area below the median line. The area under a
frequency curve is proportional to the number of people or objects represented by that
curve. Remember, the median is the 50th percentile, so there should be an equal number
of scores above and below the median, and the area of the curve above and below the
median reflects this.
The mean is the balance point for the curve. What that means is if we cut a block of wood
in the exact shape of the curve, the mean would be the point at which that block of wood
could be perfectly balanced on your finger. It is the point where the average distance
from the mean is exactly the same for the scores above the mean and the scores below the
mean.
Measures of Variability
Measures of variability indicate the degree to which the scores in a distribution are spread out.
Larger numbers indicate greater variability of scores. Sometimes the word dispersion is
substituted for variability, and you will find that term used in some statistics texts.
We will divide our discussion of measures of variability into four categories: range measures, the
average deviation, the variance, and the standard deviation.
Range Measures
In Chapter 5, we introduced only one range measure, which was called the range. The range is
the distance from the lowest score to the highest score. We noted that the range is very unstable,
because it depends on only two scores. If one of those scores moves further from the distribution,
the range will increase even though the typical variability among the scores has changed little.
This instability of the range has led to the development of two other range measures, neither of
which relies on only the lowest and highest scores. The interquartile range is the distance from
the 25th percentile to the 75th percentile. The 25th percentile is also called the first quartile,
which means that it divides the first quarter of the distribution from the rest of the distribution.
The 75th percentile is also called the third quartile because it divides the lowest three quarters
of the distribution from the rest of the distribution. Typically, the quartiles are indicated by
uppercase Qs, with the subscript indicating which quartile we are talking about (Q1 is the first
quartile and Q3 is the third quartile). So the interquartile range can be computed by subtracting
Q1 from Q3 [i.e., Q3 − Q1].
There is a variation on the interquartile range, called the semi-interquartile range or quartile
deviation. This value is equal to half of the interquartile range.
Using this notation, the median is the second quartile (the 50th percentile). That means that we
can use a variation of the formula for the median to compute both the first and third quartiles.
Recalling the formula for the median given earlier, we would make the following changes to
compute these quartiles.
To compute Q1, nmedian becomes nQ1, which is equal to .25*N. We then identify the interval
that contains the nQ1 score. All of the other values are obtained in the same way as for the
median.
To compute Q3, nmedian becomes nQ3, which is equal to .75*N. We then identify the interval
that contains the nQ3 score. All of the other values are obtained in the same way as for the
median.
To compute the interquartile range, subtract Q1 from Q3.
To compute the quartile deviation, divide the interquartile range by 2.
It is common to report the range, and many computer programs routinely provide the minimum
score, maximum score, and the range as part of their descriptive statistics package. Nevertheless,
these are not widely used measures of variability. The same computer programs that give a
range will also provide both a standard deviation and variance. We will be discussing these
measures of variability shortly, after we have introduced the concept of the average deviation.
The Average Deviation
The average deviation is not a measure of variability that anyone uses, but it provides an
understandable introduction to the variance. The variance is not an intuitive statistic, but it is
very useful in other statistical procedures. In contrast, the average deviation is intuitive, although
generally worthless for other statistical procedures. So we will use the average deviation to
introduce the concept of the variance.
The average deviation is, as the name implies, the average deviation (distance) from the mean.
To compute it, you start by computing the mean, then you subtract the mean from each score,
ignoring the sign of the difference, and sum those differences. You then divide by the number of
scores (N). The formula is shown below. The vertical lines on either side of the numerator
indicate that you should take the absolute value, which converts all the differences to positive
quantities. Therefore, you are computing deviations (distances) from the mean.
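AD = Σ|X − X̄| / N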
Chapter 5 in the textbook walked you through the computation of the average deviation. The
reason we take the absolute value of these distances from the mean is that the sum of the
differences from the mean, some positive and some negative, will always equal zero. We can
prove that fact with a little algebra, but you can take our word for it.
As we mentioned earlier, the average deviation is easy to understand, but it has little value for
inferential statistics. In contrast, the next two measures (variance and standard deviation) are
useful in other statistical procedures. So we now turn our attention to them.
The Variance
The variance takes a different approach to making all of the distances from the mean positive so
that they will not sum to zero. Instead of taking the absolute value of the difference from the
mean, the variance squares all of those differences.
The notation that is used for the variance is a lowercase s². The formula for the variance is shown
below. If you compare it with the formula for the average deviation, you will see two differences
between these formulas. The first is that the differences are squared instead of converted to
absolute values. The numerator of this formula is called the sum of squares, which is
short for the sum of squared differences from the mean. See if you can spot the second difference.
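s² = Σ(X − X̄)² / (N − 1)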
Did you recognize that the variance formula does not divide by N, but instead divides by N − 1?
The denominator (N − 1) in this equation is called the degrees of freedom. It is a concept that you
will hear about again and again in statistics.
The reason that the variance formula divides the sum of squared differences from the mean by
N − 1 is that dividing by N would produce a biased estimate of the population variance, and that
bias is removed by dividing by N − 1.
The Standard Deviation
The variance has some excellent statistical properties, but it is hard for most students to
conceptualize. To start with, the unit of measurement for the mean is the same as the unit of
measurement for the scores. For example, if we compute the mean age of our sample and find that
it is 28.7 years, that mean is on the same scale as the individual ages of our participants. But the
variance is in squared units. For example, we might find that the variance is 100 years².
Can you even imagine what the unit of years squared represents? Most people can't. But there is
a measure of variability that is in the same units as the mean. It is called the standard deviation,
and it is the square root of the variance (see the formula below). So if the variance were 100
years², the standard deviation would be 10 years. Since we used the symbol s² to indicate the
variance, you might not be surprised that we use the lowercase letter s to indicate the standard
deviation. You will see in our discussion of relative scores how valuable the standard deviation
can be.
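s = √s² = √[Σ(X − X̄)² / (N − 1)]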
At this point, many students assume that the variance is just a step in computing the standard
deviation, because the standard deviation seems like it is much more useful and understandable.
In fact, you will use the standard deviation for descriptive purposes only and will use the
variance for all your other statistical tasks.
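A minimal Python sketch pulls these measures of variability together (the data are invented for illustration):

import math

scores = [4, 8, 6, 5, 7]
N = len(scores)

mean = sum(scores) / N                                      # X-bar = ΣX / N
avg_dev = sum(abs(x - mean) for x in scores) / N            # AD = Σ|X − X-bar| / N
variance = sum((x - mean) ** 2 for x in scores) / (N - 1)   # s² divides by N − 1
std_dev = math.sqrt(variance)                               # s = √s²

print(mean, avg_dev, variance, std_dev)   # 6.0 1.2 2.5 1.5811...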
Relative Scores
In this section, we will use some of the statistical information that you have already learned to
solve the practical problem of how to indicate the relative standing of a person on a measure.
As part of this section, you will learn about the standard normal distribution, which is a
distribution that is defined by a mathematical equation. Such mathematical distributions are a
critical part of the inferential statistical process that we will be covering later.
Percentile Ranks
Relative scores indicate where a person stands within a specified normative sample. In general,
scores have little meaning unless you know how other people scored. For example, on most
of the exams that you have taken in school, you need to be correct 70% or more of the time
just to pass the course, yet the standardized tests that you took as part of the college admission
process are designed so that less than half of the people taking them get 70% or more of the
questions correct.
To know how good a score is, you need to know what other people got. For example, a
professional baseball player who got a hit only half the times he went up to bat would be by far
the greatest hitter of all time. Most professional baseball players only get a hit about every fourth
time at bat. In contrast, a driver who only reached his or her destination without an accident half
the time would be considered so bad that no insurance company would cover the person. Most
people arrive safely at their destination 99+% of the time.
We are constantly seeking information about relative scores, sometimes even before the scores
have been computed. How many times have you walked out of a test and asked other students
whether the test seemed hard or easy? If you thought it was hard, and therefore are worried that
you did not do well, you are likely to feel a little better after other students tell you that it seemed
very hard to them as well.
The most basic relative score is the percentile rank, which specifies the percentage of people in
the normative group who score lower on the measure than you did. So if you scored at the 25th
percentile, it means that 25% of the people score lower than you and 75% score higher than you.
Percentile ranks can range from 0 (for the person with the lowest score) to 100 (for the person
with the highest score).
Most often percentile ranks are computed from a frequency distribution. Let's again use the
frequency distribution that we have used before for examples. That distribution is below.
Suppose that we want to compute the percentile rank for a score of 15. From the table, we can
see that there are 333 people with a score below 15, but what do we do with the 33 people who
have exactly 15? Do we count them as scoring above or below our person with a score of 15? The
tradition is to assume that half of the people with the same score are below and half are above.
That means that we have 33/2 = 16.5 people with a score of 15 whom we consider to be lower
than us, and the same number whom we consider to be higher than us. We add the 16.5
people to the 333 people with scores of 14 or lower to get the number of people with
scores lower than ours [333 + 16.5 = 349.5].
Score    Frequency    Cumulative Frequency
  17          8             394
  16         20             386
  15         33             366
  14         48             333
  13         71             285
  12         85             214
  11         58             129
  10         39              71
   9         21              32
   8         11              11
There are a total of 394 people. To get the percentile rank, we divide the number of people below
our score by the total number of people and multiply by 100 (to convert the proportion to a
percent). In this case, the percentile rank is 89 [(349.5/394)*100].
We traditionally round percentile ranks to two significant digits. So we rounded 88.705584% to
89%.
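Here is a minimal Python sketch of this percentile-rank computation (the helper function is our own, not from the text):

freqs = {8: 11, 9: 21, 10: 39, 11: 58, 12: 85, 13: 71, 14: 48, 15: 33, 16: 20, 17: 8}

def percentile_rank(score, freqs):
    N = sum(freqs.values())
    below = sum(f for s, f in freqs.items() if s < score)
    below += freqs.get(score, 0) / 2   # half of the tied scores count as below
    return below / N * 100

print(round(percentile_rank(15, freqs)))   # 89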
Standard Normal Distribution
Many variables in psychology tend to show a distinctive shape when graphed using a histogram
or frequency polygon. The shape resembles a bell shaped curve like the one shown below. This
classic bell shaped curve is called a normal curve or normal distribution.
The normal curve is perfectly symmetric. The right half and left half are mirror images of one
another. The curve also never quite reaches zero, although it gets very close. The shape of the
normal curve is actually determined by a complex equation, which dictates the height of the curve
at every point. You need not know the details of this equation, but you should know that the
equation includes two variables. They are the mean and the standard deviation. The mean
dictates where the middle of the distribution is, which is the highest point of the curve and the
point that separates the area under the curve into two equal segments. The standard deviation
determines how spread out the curve is.
Because the normal curve is based on an equation, it is possible to know exactly how high the
curve is at every point and how much area is under the curve between any two scores on the X-
axis. The figure below marks off 1 and 2 standard deviations both above and below the mean.
The area under the curve between the mean and one standard deviation below the mean is
approximately 34%, as shown in the figure. More precisely, it is 34.13%. We will show you
where that number comes from shortly.
Because the curve is symmetric, the area between the mean and one standard deviation above the
mean is also 34%. Similarly, the area between 1 and 2 standard deviations, either above or below
the mean, is approximately 14%, and the area beyond 2 standard deviations is 2% on either side
of the distribution. All of these areas are determined by the equation for the normal curve, but
you do not need to use this equation, because the values are computed for you and available in a
table called the Area under the Standard Normal Curve Table.
To use the Standard Normal Table, you need to know a little more about the normal curve and
you need to learn about the standard score, also known as the Z-score.
If you look at the two normal curves above, you might recognize that there are no scores listed
on the X-axis. Remember that the location on the X-axis is determined by the mean of the
distribution and the spread of the curve is determined by the standard deviation of the
distribution. But you can convert a normal curve with any mean and variance into a standard
normal distribution, which is a normal curve with a mean of zero and a standard deviation of
1.
If the curve above were a standard normal distribution, the labels on the X-axis at the lines that
divide the curve into sections would be -2, -1, 0, +1, +2, read from left to right. The table we
showed you earlier gives the areas under the curve of such a standard normal distribution.
Shown below is the equation that converts any score to a standard score using the values of the
score and the mean and standard deviation of the distribution. A standard score shows where the
person scores in a standard normal distribution. It tells you instantly whether the score is above
or below the mean by the sign of the Z-score. If the Z-score is positive, the person scored above
the mean; if it is negative, the person scored below the mean. The size of the Z-score indicates
how far away from the mean the person scored.
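Z = (X − X̄) / s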
Other Relative Scores
The score on any measure could be converted to a Z-score, which would tell you at a glance how
a person scored relative to the reference group. For example, if someone tells you that her Z-
score on the exam was +1.55, you know immediately that she scored above the mean and enough
above the mean that she is near the top of the class. Remember, most of the normal distribution is
contained between the boundaries of -2.0 and +2.0 standard deviations. There is only about 2%
of the area under the curve in each of the tails. If another student tells you that his exam score
was a Z of −.36, you know that he scored a bit below the mean.
With the standard normal table, you could compute the percentile rank for each of these students
in a few minutes. Of course, this procedure is only legitimate if the shape of the distribution of
scores is normal or very close to normal. If the shape is not normal, the Standard Normal Table
will not give accurate information about the proportion of people who score above and
below a given score.
Although Z-scores are very useful and allow people to judge the relative performance of an
individual quickly, many people get easily confused by the negative numbers that are possible
with Z-scores. Consequently, many tests compute Z-scores, but then translate them
mathematically to avoid negative numbers.
For example, the IQ test produces a distribution of scores that is very close to normal. However,
the IQ test does not give a person's score as a Z-score; instead, the IQ test reports the score as an
IQ score. The IQ score is simply the Z-score multiplied by 15 and then added to 100, as shown in
the equation below. The values of 15 and 100 are arbitrary, but the effect of this transformation
is to produce a normal distribution with a mean of 100 and a standard deviation of 15. So the IQ
distribution looks like the figure below. Note that this figure is identical in shape to all the other
figures in this section. The only difference is that the scores on the X-axis are IQ scores. So just
over 95% of people have IQ scores between 70 and 130, and no one has a negative IQ.
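IQ = 100 + 15 × Z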
Standardized tests often perform a similar transformation to avoid negative scores. For example,
the Scholastic Aptitude Test (SAT) used for college admission and the Graduate Record Exam
(GRE) used for admission to graduate school are both standardized so that the mean of the
subtests is 500, with a standard deviation of 100. So if you score 450 on the verbal section of the
SAT, you are scoring .5 standard deviations below the mean, which puts you at the 31st
percentile. (See if you can do the computations and use the standard normal table to verify this
percentile rank.)
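You can also verify the percentile with the normal cumulative distribution function in Python (scipy is our choice of tool here, not the text's):

from scipy.stats import norm

z = (450 - 500) / 100    # Z = -0.5
print(norm.cdf(z))       # 0.3085..., i.e., about the 31st percentile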
The Normative Sample
Z-scores and transformed Z-scores, such as SAT scores, are very handy and are used extensively
in reporting test scores. But it is critical to understand that the score is meaningful ONLY if you
take into account the normative sample.
A quick example will illustrate this point. Let's assume that Dan took the SAT as a high school
senior and scored 650 on each subtest. That is 150 points above the mean (1.5 standard
deviations) and would place him at approximately the 93rd percentile.
Four years later, after doing well in college, he decides to go to graduate school, and so he takes
the GRE. This time he only obtained a 550 on each of the subtests. What happened? Why did his
performance decrease despite the fact that he worked hard in college and did very well?
You may already have guessed the answer to that question. In effect, we are trying to compare
apples and oranges. The scores on the SAT and the GRE mean entirely different things, because
they are based on entirely different normative samples.
The SAT is taken by people who expect to complete high school and are considering going on to
college. In contrast, the GRE is taken by people who expect to graduate from college and plan to
go onto graduate school. Anyone who drops out of college or does poorly in college is unlikely
to take the GRE. In other words, the normative sample for the GRE is much more exclusive than
for the SAT. Dan's GRE score would place him in the 70th percentile of the people applying for
graduate school, who are a pretty elite group academically. Most of the people who took the SAT
did not take the GRE, and most of the people who took the GRE did very well on the SAT. The
competition (i.e., the normative group) was tougher for the GRE than for the SAT.
Whenever you are given a normative score, such as a Z-score, percentile rank, or score on a
standardized test, you should always consider the nature of the normative sample. A person
making $500,000 per year may be one of the best paid people in the country (the normative
sample including all workers), but one of the lowest paid CEOs for a Fortune 500 company (a
different normative sample).
Measures of Relationship
Chapter 5 of the textbook introduced you to the three most widely used measures of relationship:
the Pearson product-moment correlation, the Spearman rank-order correlation, and the Phi
correlation. We will be covering these statistics in this section, as well as other measures of
relationship among variables.
What is a Relationship?
Correlation coefficients are measures of the degree of relationship between two or more
variables. When we talk about a relationship, we are talking about the manner in which the
variables tend to vary together. For example, if one variable tends to increase at the same time
that another variable increases, we would say there is a positive relationship between the two
variables. If one variable tends to decrease as another variable increases, we would say that there
is a negative relationship between the two variables. It is also possible that the variables might be
unrelated to one another, so that you cannot predict one variable by knowing the level of the
other variable.
As a child grows from an infant into a toddler into a young child, both the child's height and
weight tend to change. Those changes are not always tightly locked to one another, but they do
tend to occur together. So if we took a sample of children from a few weeks old to 3 years old
and measured the height and weight of each child, we would likely see a positive relationship
between the two.
A relationship between two variables does not necessarily mean that one variable causes the
other. When we see a relationship, there are three possible causal interpretations. If we label the
variables A and B, A could cause B, B could cause A, or some third variable (we will call it C)
could cause both A and B.
With the relationship between height and weight in children, it is likely that the general growth
of children, which increases both height and weight, accounts for the observed correlation. It is
very foolish to assume that the presence of a correlation implies a causal relationship between
the two variables. There is an extended discussion of this issue in Chapter 7 of the text.
Scatter Plots and Linear Relationships
A helpful way to visualize a relationship between two variables is to construct a scatter plot,
which you were briefly introduced to in our discussion of graphical techniques. A scatter plot
represents each set of paired scores on a two-dimensional graph, in which the dimensions are
defined by the variables.
For example, if we wanted to create a scatter plot of our sample of 100 children for the variables
of height and weight, we would start by drawing the X and Y axes, labeling one height and the
other weight, and marking off the scales so that the range on these axes is sufficient to handle the
range of scores in our sample. Let's suppose that our first child is 27 inches tall and 21 pounds.
We would find the point on the weight axis that represents 21 pounds and the point on the height
axis that represents 27 inches. Where these two points cross, we would put a dot that represents
the combination of height and weight for that child, as shown in the figure below.
We then continue the process for all of the other children in our sample, which might produce the
scatter plot illustrated below.
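A minimal matplotlib sketch of this plotting process (the data points are invented for illustration):

import matplotlib.pyplot as plt

weights = [21, 18, 25, 30, 14, 27]   # pounds
heights = [27, 25, 30, 33, 22, 31]   # inches

plt.scatter(weights, heights)        # one dot per (weight, height) pair
plt.xlabel("Weight (pounds)")
plt.ylabel("Height (inches)")
plt.show()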
It is always a good idea to produce scatter plots for the correlations that you compute as part of
your research. Most will look like the scatter plot above, suggesting a linear relationship. Others
will show a distribution that is less organized and more scattered, suggesting a weak relationship
between the variables. But on rare occasions, a scatter plot will indicate a relationship that is not
a simple linear relationship, but rather shows a complex relationship that changes at different
points in the scatter plot.
The scatter plot below illustrates a nonlinear relationship, in which Y increases as X increases,
but only up to a point; after that point, the relationship reverses direction. Using a simple
correlation coefficient for such a situation would be a mistake, because the correlation cannot
capture accurately the nature of a nonlinear relationship.
Pearson Product-Moment Correlation
The Pearson product-moment correlation was devised by Karl Pearson in 1895, and it is still
the most widely used correlation coefficient. The history behind the mathematical development
of this index is fascinating, but you need not know that history to understand how the Pearson
correlation works.
The Pearson product-moment correlation is an index of the degree of linear relationship between
two variables that are both measured on at least an ordinal scale of measurement. The index is
structured so that a correlation of 0.00 means that there is no linear relationship, a correlation of
+1.00 means that there is a perfect positive relationship, and a correlation of −1.00 means that
there is a perfect negative relationship.
As you move from zero to either end of this scale, the strength of the relationship increases. You
can think of the strength of a linear relationship as how tightly the data points in a scatter plot
cluster around a straight line. In a perfect relationship, either negative or positive, the points all
fall on a single straight line. We will see examples of that later.
The symbol for the Pearson correlation is a lowercase r, which is often subscripted with the two
variables. For example, rxy would stand for the correlation between the variables X and Y.
The Pearson product-moment correlation was originally defined in terms of Z-scores. In fact, you
can compute the product-moment correlation as the average cross-product of the Z-scores, as
shown in the first equation below. But that is an equation that is difficult to use for computations.
The more commonly used equation now is the second equation below.
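rXY = Σ(ZX ZY) / N

rXY = [N ΣXY − (ΣX)(ΣY)] / √{[N ΣX² − (ΣX)²][N ΣY² − (ΣY)²]}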
Although this equation looks much more complicated and looks like it would be much more
difficult to compute, in fact, this second equation is by far the easier of the two to use if you are
doing the computations with nothing but a calculator.
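In practice you will usually let software do the arithmetic. A minimal Python sketch using numpy (our choice of tool):

import numpy as np

weights = [21, 18, 25, 30, 14, 27]
heights = [27, 25, 30, 33, 22, 31]

r = np.corrcoef(weights, heights)[0, 1]   # Pearson product-moment correlation
print(r)                                  # about 0.999 for these invented data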
Spearman Rank-Order Correlation
The Spearman rank-order correlation provides an index of the degree of linear relationship
between two variables that are both measured on at least an ordinal scale of measurement. If one
of the variables is on an ordinal scale and the other is on an interval or ratio scale, it is always
possible to convert the interval or ratio scale to an ordinal scale. That process is discussed in the
section showing you how to compute this correlation by hand.
The Spearman correlation has the same range as the Pearson correlation, and the numbers mean
the same thing. A zero correlation means that there is no relationship, whereas correlations of
+1.00 and -1.00 mean that there are perfect positive and negative relationships, respectively.
The formula for computing this correlation is shown below. Traditionally, the lowercase r with a
subscript s is used to designate the Spearman correlation (i.e., rs). The one term in the formula
that is not familiar to you is d, which is equal to the difference in the ranks for the two variables.
This is explained in more detail in the section that covers the manual computation of the
Spearman rank-order correlation.
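rs = 1 − (6 Σd²) / [N(N² − 1)]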
The Phi Coefficient
The Phi coefficient is an index of the degree of relationship between two variables that are
measured on a nominal scale. Because variables measured on a nominal scale are simply
classified by type, rather than measured in the more general sense, there is no such thing as a
linear relationship. Nevertheless, it is possible to see if there is a relationship.
For example, suppose you want to study the relationship between religious background and
occupation. You have a classification system for religion that includes Catholic, Protestant,
Muslim, Other, and Agnostic/Atheist. You have also developed a classification for occupations
that includes Unskilled Laborer, Skilled Laborer, Clerical, Middle Manager, Small Business
Owner, and Professional/Upper Management. You want to see if the distribution of religious
preferences differs by occupation, which is just another way of saying that there is a relationship
between these two variables.
The Phi Coefficient is not used nearly as often as the Pearson and Spearman
correlations. Therefore, we will not be devoting space here to the computational procedures.
Advanced Correlational Techniques
Correlational techniques are immensely flexible and can be extended dramatically to solve
various kinds of statistical problems. Covering the details of these advanced correlational
techniques is beyond the scope of this text and website. However, we have included brief
discussions of several advanced correlational techniques on this Student Resource Website,
including multidimensional scaling, path analysis, taxonomic search techniques, and statistical
analysis of neuroimages.
Nonlinear Correlational Procedures
The vast majority of correlational techniques used in psychology are linear correlations.
However, there are times when we can expect to find nonlinear relationships and we would like
to apply statistical procedures to capture such complex relationships. This topic is far too
complex to cover here. The interested student will want to consult advanced statistical textbooks
that specialize in regression analyses.
There are two words of caution that we want to state about using such nonlinear correlational
procedures. Although it is relatively easy to do the computations using modern statistical
software, you should not use these procedures unless you actually understand them and their
pitfalls. It is easy to misuse the techniques and to be fooled into believing things that are not true
from a naive analysis of the output of computer programs.
The second word of caution is that there should be a strong theoretical reason to expect a
nonlinear relationship if you are going to use nonlinear correlational procedures. Many
psychophysiological processes are by their nature nonlinear, so using nonlinear correlations in
studying those processes makes complete sense. But for most psychological processes, there is
no good theoretical reason to expect a nonlinear relationship.
Linear Regression
As you learned in Chapters 5 and 7 of the text, the value of correlations is that they can be used
to predict one variable from another variable. This process is called linear regression or simply
regression. It involves mathematically fitting a straight line to the data in a scatter plot.
Below is a scatter plot from our discussion of correlations. We have added a regression line to
that scatter plot to illustrate how regression works. We compute the regression line with formulas
that we will present to you shortly. The regression line is based on our data. Once we have the
regression line, we can then use it to predict Y from knowing X.
The scatter plot below shows the relationship of height and weight in young children (birth to
three years old). The line that runs through the data points is called the regression line. It is
determined by an equation, which we will discuss shortly. If we know the value of X (in this
case, weight) and we want to predict Y from X, we draw a line straight up from our value of X
until it intersects the regression line, and then we draw a line that is parallel to the X-axis over to
the Y-axis. We then read from the Y-axis our predicted value for Y (in this case, height).
In order to fit a line mathematically, there must be some stated mathematical criterion for what
constitutes a good fit. In the case of linear regression, that criterion is called the least squares
criterion, which is shorthand for positioning the line so that the sum of the squared
distances from the scores to the predicted scores is as small as it can be.
If you are predicting Y, you will compute a regression line that minimizes the sum of the (Y − Y')².
Traditionally, a predicted score is referred to by using the letter of the score and adding a prime
symbol after it (Y' is read Y prime or Y predicted).
To illustrate this concept, we removed most of the clutter of data points from the above scatter
plot and showed the distances that are involved in the least squares criteria. Note that it is the
vertical distance from the point to the prediction line--that is, the difference from the predicted Y
(along the regression line) and the actual Y (represented by the data point). A common
misconception is that you measure the shortest distance to the line, which would be measured
along a line that meets the regression line at a right angle.
It may not be immediately obvious, but if you were trying to predict X from Y, you would be
minimizing the sum of the squared distances (X-X'). That means that the regression line for
predicting Y from X may not be the same as the regression line for predicting X from Y. In fact, it
is rare that they are exactly the same.
The first equation below is the basic form of the regression line. It is simply the equation for a
straight line, which you probably learned in high school math. The two new notational items are
byx and ayx, which are the slope and the intercept of the regression line for predicting Y from X. The
slope is how much the Y scores increase per unit of increase in the X scores. The slope in the figure
above is approximately .80; for every 10 units of movement along the line on the X axis, the Y axis
moves about 8 units. The intercept is the point at which the line crosses the Y axis (i.e., the point
at which X is equal to zero).
The equations for computing the slope and intercept of the line are listed as the second and third
equations, respectively. If you want to predict X from Y, simply replace all the Xs with Ys and the
Ys with Xs in the equations below.
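Y' = byxX + ayx

byx = rXY(sY / sX) = [N ΣXY − (ΣX)(ΣY)] / [N ΣX² − (ΣX)²]

ayx = Ȳ − byxX̄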
A careful inspection of these equations will reveal a couple of important ideas. First, if you look
at the first version of the equation for the slope (the one using the correlation and the population
variances), you will see that the slope is equal to the correlation if the population variances are
equal. That would be true either for predicting X from Y or Y from X. What is less clear, but is
also true, is that the two regression lines (for predicting Y and for predicting X) always cross at
the point defined by the two means, and they are identical only when the correlation is perfect.
Otherwise, predicting Y from X and predicting X from Y produce different lines.
Second, if the correlation is zero (i.e., no relationship between X and Y), then the slope will be
zero (look at the first part of the second equation). If you are predicting Y from X, your
regression line will be horizontal, and if you are predicting X from Y, your regression line will be
vertical. Furthermore, if you look at the third equation, you will see that the horizontal line for
predicting Y will be at the mean of Y and the vertical line for predicting X will be at the mean of
X.
Think about that for a minute. If X and Y are uncorrelated and you are trying to predict Y, the best
prediction that you can make is the mean of Y. If you have no useful information about a variable
and are asked to predict the score of a given individual, your best bet is to predict the mean. To
the extent that the variables are correlated, you can make a better prediction by using the
information from the correlated variable and the regression equation.
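A minimal Python sketch of these regression computations, using the invented height/weight data from our earlier scatter plot sketch:

import numpy as np

X = np.array([21, 18, 25, 30, 14, 27])   # weight (pounds)
Y = np.array([27, 25, 30, 33, 22, 31])   # height (inches)

r = np.corrcoef(X, Y)[0, 1]
b_yx = r * (Y.std(ddof=1) / X.std(ddof=1))   # slope: r times the ratio of SDs
a_yx = Y.mean() - b_yx * X.mean()            # intercept: Y-bar minus slope times X-bar

y_pred = b_yx * 21 + a_yx                    # predicted height for a 21-pound child
print(b_yx, a_yx, y_pred)                    # about 0.69, 12.5, 27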
Other Descriptive Statistics
The most commonly used descriptive statistics for distribution of scores are the mean and the
variance (or standard deviation), but there are other descriptive statistics that are available and
are often computed by data analysis programs. We will discuss two of them in this section:
Skewness and Kurtosis.
Skewness and kurtosis both describe an aspect of the shape of a distribution, much as the mean
and the variance describe its center and spread. But these indices of the shape of the
distribution are used far less frequently than the mean and variance.
Skewness was discussed in the section on graphing data with histograms and frequency
polygons. There is an index of the degree and direction of skewness that many statistical
programs produce as part of their entire descriptive statistics package. When the skewness is near
zero, the distribution is close to symmetric. A negative number indicates a negative skew, and a
positive number indicates a positive skew.
Kurtosis indicates the degree of peakedness or flatness of the distribution. Larger values indicate
a more peaked, heavier-tailed distribution; smaller values indicate a flatter one.
Statisticians refer to a concept called moments of a distribution. The mean is based on the first
moment of the distribution, and the variance is based on the second moment of the distribution.
The skewness and kurtosis are based on the third and fourth moments of the distribution. These
concepts make more sense to people who do theoretical work in statistics. But for the purpose of
this course, the only point you need to remember is the definition of these terms in case you
run across them in your reading.
If you want to compute them, most statistical analysis packages allow you to include these
statistics in the descriptive statistics package.
Basics of statistical notation

  • 1. Basics of Statistical Notation You will be introduced to a large number of formulas in this section on statistical concepts. These formulas use a relatively standardized notation to simplify the description of how a statistic should be computed. This section introduces the logic and basic concepts behind that notation. With each new formula, we will remind you what the notation means, but this section provides a head's up before we get to those formulas and provides a helpful summary in case you forget a notational concept. Designating a Variable Statistical formulas use algebraic notation, which rely on letters to designate a variable. By convention, if there is just one variable in a formula, the letter X is used to designate the variable. If there is a second variable in the formula, traditionally the letter Y is used to indicate the variable. If there is a third variable, the letter Z is traditionally used. After that, there are no universal traditions, but it is rare to have statistical formulas that involve more than three variables. The capital letter N traditionally refers to the total number of participants in a study. The single letter in statistical formulas refers to the variable. The individuals scores on that variable can be indicated by subscripts, which are numbers written below the letter to refer to a specific score. For example, X1 refers to the score for the first person on the X variable, and X27refers to the score for the 27th person on the X variable. Y11refers to the score on the Y variable for the 11th person. If there are several groups of participants, the number of participants in each group is indicated by a lower-case n with a subscript to indicate the group number. For example, n1refers to the number of participants in the first group. Traditionally, the number of groups in a study are referred to by the lower-case letter k, although in complex designs, this tradition is modified. Therefore, nk refers to the number of participants in the kth group, which is the last group. Algebraic Rules This is a specified order in which functions are to be carried out. The order is: o The highest priority action should be to raise any variables to a power. For example, to compute 2X2 , you would first square the value of X and then multiply by 2. o The next highest priority action is multiplication or division. For example, to compute 2X +1, you would multiply the value of X by 2 and then add 1. o The lowest priority action is addition or subtraction. You can override any of these priorities by using parentheses. Anything in parentheses should be done before other actions. For example, X + Y2 is computed by squaring Y and adding it to X. In contrast, (X+Y)2 is computed by adding X and Y first and then squaring
  • 2. the sum. In other words, the parentheses in the second equation overrides the normal priority order (raise to a power before adding). Summation Notation Many statistical formulas involve adding a series of numbers. The notation for adding a series of numbers is the capital Greek letter sigma. The sigma stands for "add up everything that follows." Therefore, if the sigma is followed by the letter X, it means that you should add up all of the X scores. Parentheses indicate that you should perform the operation in parentheses before you do the summation. For example, the notation below indicates that you should subtract Y from X before you sum the difference. Standard Notation for Statistics A distinction is made between a statistic that is computed on everyone in a population and a the same statistic that is computed on everyone in a sample drawn from the population. o A statistic computed on everyone in the population is called a population parameter. o A statistic computed on everyone in a sample is called a sample statistic. The population mean is designated by the Greek letter mu, whereas the sample mean is designated by an X with a bar over the top (read X bar). Both are illustrated below. A similar distinction is made for standard deviation, which is a measure of variability. The population standard deviation in indicated by the lower case Greek letter sigma, whereas the sample standard deviation is indicated by the lower case letter s, as shown below.
  • 3. The lower case letter r is used to designate a correlation. If there is any doubt about which two variables were used to compute the correlation, the two variables are listed as subscripts. For example, rXY indicates the correlation of X and Y. Descriptive Statistics Descriptive statistics describe aspects of a data set in a single number. Many descriptive statistics are also used in the computation of inferential statistics. In this section, we will be covering four classes of descriptive statistics. Those classes and their definitions are listed below. Measures of Central Tendency - measures that indicate the typical or average score. Measures of Variability - measures that indicate the spread of scores about the measure of central tendency. Relative Scores - a score for an individual that tells how that individual performed relative to the other individuals on the measure. Measures of Relationships - measures that indicate the strength and direction of a relationship between two or more variables. In addition to these widely used descriptive statistics, we will also introduce some other less frequently used statistics, but statistics that you will occasionally run into, especially if you use computers to do your statistical analyses. You can go directly to any of these sections by clicking the section in the Table of Contents below, or you can simply click on the Next Page button to go to the next page within this section. Measures of Central Tendency Measures of central tendency indicate the typical or average score in a distribution of scores. This section covers three measures of central tendency: the mean, median, and mode. The Mean The mean is the arithmetic average of the scores. It is computed by adding all the scores and dividing by the total number of scores. Remember from our section on notation, we use summation notation to indicated that we should add all the scores, and we use the uppercase letter N to indicate the total number of scores. The notation for the mean of the X scores is an uppercase X with a bar across the top. Therefore the formula for computing the mean is written as follows:
  • 4. If you have several groups in your research study, it is traditional to compute the mean for each group. In such a situation, you would use a subscript notation to indicate the groups. In formulas, the groups are numbered from 1 to k. Remember that k is the letter that we use to indicate the number of groups. So using this notation, the mean for Group 1 would use the following formula. Note that we use a subscript 1 to indicate that we are computing the mean for Group 1. We are adding all the scores in Group 1 (the X1s) and dividing the the number of scores in Group 1 (n1). We use a lowercase n here, because we are NOT talking about the total number of scores in the study, but rather the number of scores in just one group of the study. These notation rules can be a pain to learn initially, but once you get them down, you can quickly translate almost any formula into the computational steps that are required. Although it is convenient to number groups for formulas, and we will generally be numbering groups in all of the formulas that we use, it is easier to use subscripts in your computations that are not a code. For example, if you are studying gender differences on a variable, you might compute the mean of that variable for men and women separately. Instead of using the subscripts 1 and 2 and remembering which refers to males and which refers to females, you might as well use a descriptive subscript. For example, the mean of the females might be written as follows: Elsewhere on this website are instructions on how to compute the mean either by hand or using SPSS for Windows. To see those instructions, you click on one of the buttons below. To return to this page after viewing that material, you click on the back arrow key of the web browser that you are using to view this website. The mean is the most widely used measures of central tendency, because it is the measure of central tendency that is most often used in inferential statistics. However, as you will see at the end of this section, the mean does not always provide the best indication of the typical score in a distribution.
  • 5. The Median The median is the middle score in a distribution. It is also the score at the 50 percentile, which means that 50% of the scores are lower and 50% are higher. In the textbook, we showed how to compute the median with a small number of scores. In such a case, you: 1. Order the scores from lowest to highest and count the number of scores (N). 2. If the number of scores is odd, you add 1 to N and divide by 2 to get the middle score [(N+1)/2]. For example, if you have 15 scores, the middle score is the 8th score [(15+1)/2=8]. There will be seven scores above the 8th score and seven below it. 3. If the number of scores is even, there will be no middle score. Instead there will be two scores that straddle the middle. For example, if there are 14 scores, the 7th and 8th scores in your ordered list will straddle the middle. You can figure out which scores to focus on by dividing N by 2 and taking that score from the bottom and the one above it [e.g., 14/2=7, so you take the 7th and 8th scores from the bottom]. The median is between these two scores, so you average them. If the 7th score is 36 and the 8th score is 39, you sum the two and divide by two to get the average [e.g., (36+39)/2=37.5]. This formula works fine when there is a small number of scores and little duplication of scores, but it is not considered accurate enough when there are a large number of scores and many scores are duplicated. In such a situation, a more complicated formula is used. This formula is listed below. Don't panic; it is easier to use than it looks. To compute a median using this formula, you must first create a frequency or grouped frequency distribution. We will use the frequency distribution that we created in the section on organizing data. That distribution is listed below. Score Frequency Cumulative Frequency 17 8 394 16 20 386 15 33 366 14 48 333 13 71 285
  • 6. 12 85 214 11 58 129 10 39 71 9 21 32 8 11 11 Now we need to define each of the terms in the formula for the median. We must do this in steps. We start by finding the middle score. 1. The middle score (nmedian) is the total number of scores divided by 2. In a frequency distribution that also includes a cumulative frequency column, you can read the total number of scores (N) as the number at the top of the cumulative frequency column. In this case, N is 394, so nmedian is 394/2=197. 2. Next you find the interval that includes the 197th score from the bottom. To do this, start at the bottom of the cumulative frequency column and move up until you find the first number that is either equal to 197 or greater than 197. In this case, it is the interval for the score of 12. You may be surprised that we are calling that an interval, because we have only one score, but for the purposes of this computation it is an interval from 11.5 to 12.5. This is illustrated by the figure below. 3. Now we can identify all of the numbers that will go into the formula for the median. LRL stands for Lower Real Limit of the interval that contains the median. In this case, it is 11.5. The interval width (i in the formula) is 1. We compute it by subtracting the lower real limit from the upper real limit [e.g., 12.5-11.5=1.0]. We had previously computed nmedian as 197 [394/2]. The term nLRL refers to the number of people with scores below the lower real limit of the interval. You can read this off of the frequency distribution by noting the number in the cumulative frequency column for the interval below the one that
  • 7. contains the median. In this case, it is 129. In other words, 129 people in our example score below a 12. Finally, fi is the frequency of scores within the interval that contains the median. We can read that number from the frequency column of our distribution. In this case, it is 85. 4. Now we plug all of those items into the formula to get the following. This is the most complicated formula that you have had to deal with so far, but the logic behind it is not as complicated as the formula makes it appear. We have determined that the middle score (197th) appears in the interval of 12, which has real limits from 11.5 to 12.5. Furthermore, we have determined that 85 people are in that interval and 129 score below that interval. This formula makes the assumption that the 85 people scoring in the interval that contains the median are evenly distributed. That is, the first person takes the bottom 1/85th of that interval, the next person takes the next 1/85th of that interval, up to the last person, who takes the top 1/85th of that space. The value "nmedian - nLRL" computes how far we have to count up from the bottom of the interval. In this case, we must count up 68 people [197-129], which is 80% of the way from the bottom of the interval [68/85]. The figure below illustrates this process, which is built into the formula. Although you can create a frequency distribution and do the computations that we just walked you through, with large data sets, it is much more likely that you will use a computer package like SPSS for Windows to do this computation. Computer packages will use this formula to make the computation. To see how you would request the computation of the median using SPSS for Windows, click on the button below. When you want to return to this page, use the back arrow key on your browser to return. The Mode
  • 8. The mode is the most frequently occurring score. In a frequency distribution, you compute the mode by looking for the largest number in the frequency column. The score associated with that number is the mode. Using the frequency distribution previously used for computing the median, the largest frequency is 85, and it is associated with a score of 12. Therefore, 12 is the mode. If you have a grouped frequency distribution, the mode is the midpoint of the interval that contains the largest number of scores. That can create a bit of instability, which we can illustrate with an example. Suppose that we use the data from the frequency distribution above to create a grouped frequency distribution with an interval width of 2 scores. If we start by grouping 8 and 9 together, we will produce the following grouped frequency distribution. Interval Frequency Cumulative Frequency 16-17 28 394 14-15 81 366 12-13 156 285 10-11 97 129 8-9 32 32 The interval with the largest frequency is 12-13, and the midpoint of that interval is 12.5. Therefore, 12.5 is the mode. However, suppose that we create a similar grouped frequency distribution, again with an interval of 2, but this time we start with an interval of 7-8. If we do, we will get the following grouped frequency distribution. Interval Frequency Cumulative Frequency 17-18 8 394 15-16 53 386 13-14 119 333 11-12 143 214 9-10 60 71 7-8 11 11
  • 9. Now the interval with the largest frequency is 11-12, with a midpoint of 11.5. Therefore the mode is 11.5. So the mode shifts depending on how we set up the intervals. This effect is rather small in this example, because the sample sizes are rather large and the distribution is close to symmetric. With small sample sizes and less symmetric distributions, you can get huge shifts in the mode. This is why the mode is considered to be unstable. Another reason that the mode is unstable is that a shift of just a few scores could change the mode by making a different score the mode. Comparing the Measures In the textbook, we provide an example in one of the Cost of Neglect boxes of how the mean can misrepresent the typical score when there are a few deviant scores. In our example, there were five employees with the company, four of which made $40,000 and one making $340,000. The mean was $100,000, which clearly does not reflect the typical salary. The median ($40,000) was a much better estimate of the typical salary. This is an extreme example of a general principle. When a distribution is symmetric, like in the top panel of the figure below, the mean, median, and mode will all be the same. However, as the curve becomes more skewed, these three measures of central tendency diverge. The mode will always be at the peak of the curve, because the highest point indicates the most frequent score. The mean will be pulled the most toward the tail of the skew, with the median in between.
  • 10. These graphs may help you to understand what each of these measures of central tendency measure. The mode is always the score at which the curve reaches its highest point (i.e., the most frequent score). The median is the score the cuts the curve into two equal areas. In other words, the area above the median line is equal to the area below the median line. The area under a frequency curve is proportional to the number of people or objects represented by that curve. Remember, the median is the 50 percentile, so that should be an equal number above and below the median. So the area of the curve should be equal above and below the median to reflect this.
  • 11. The mean is the balance point for the curve. What that means is if we cut a block of wood in the exact shape of the curve, the mean would be the point at which that block of wood could be perfectly balanced on your finger. It is the point where the average distance from the mean is exactly the same for the scores above the mean and the scores below the mean. Measures of Variability Measures of variability indicate the degree to which the scores in a distribution are spread out. Larger numbers indicate greater variability of scores. Sometimes the word dispersion is substituted for variability, and you will find that term used in some statistics texts. We will divide our discussion of measures of variability into four categories: range measures, the average deviation, the variance, and the standard deviation. Range Measures In Chapter 5, we introduced only one range measure, which was called the range. The range is the distance from the lowest score to the highest score. We noted that the range is very unstable, because it depends on only two scores. If one of those scores moves further from the distribution, the range will increase even though the typical variability among the scores has changed little. This instability of the range has lead to the development of two other range measures, neither of which rely on only the lowest and highest scores. The interquartile range is the distance from the 25th percentile and the 75 percentile. The 25th percentile is also called the first quartile, which means that it divides the first quarter of the distribution from the rest of the distribution. The 75th percentile is also called the third quartile because it divides the lowest three quarters of the distribution from the rest of the distribution. Typically, the quartiles are indicated by uppercase Qs, with the subscript indicating which quartile we are talking about (Q1 is the first quartile and Q3 is the third quartile). So the interquartile range can be computed by subtracting Q1 from Q3 [i.e., Q3 -Q1]. There is a variation on the interquartile range, called the semi-interquartile range or quartile deviation. This value is equal to half of the interquartile range. Using this notation, the median is the second quartile (the 50th percentile). That means that we can use a variation of the formula for the median to compute both the first and third quartiles. Looking at the equation below for the median, we would make the following changes to compute these quartiles. To compute Q1, nmedian becomes nQ1, which is equal to .25*N. We then identify the interval that contains the nQ1 score. All of the other values are obtained in the same way as for the median. To compute Q3, nmedian becomes nQ3, which is equal to .75*N. We then identify the interval that contains the nQ3 score. All of the other values are obtained in the same way as for the median.
  • 12. To compute the interquartile range, subtract Q1 from Q3. To compute the quartile deviation, divide the interquartile range by 2. It is common to report the range, and many computer programs routinely provide the minimum score, maximum score, and the range as part of their descriptive statistics package. Nevertheless, these are not widely used measures of variability. The same computer programs that give a range, will also provide both a standard deviation and variance. We will be discussing these measures of variability shortly, after we have introduced the concept of the average deviation. The Average Deviation The average deviation is not a measure of variability that anyone uses, but it provides an understandable introduction to the variance. The variance is not an intuitive statistic, but it is very useful in other statistical procedures. In contrast, the average deviation is intuitive, although generally worthless for other statistical procedures. So we will use the average deviation to introduce the concept of the variance. The average deviation is, as the name implies, the average deviation (distance) from the mean. To compute it, you start by computing the mean, then you subtract the mean from each score, ignoring the sign of the difference, and sum those differences. You then divide by the number of scores (N). The formula is shown below. The vertical lines on either side of the numerator indicate that you should take the absolute value, which converts all the differences to positive quantities. Therefore, you are computing deviations (distances) from the mean. Chapter 5 in the textbook walked you through the computation of the average deviation. The reason we take the absolute value of these distances from the mean is that the sum of the differences from the mean, some positive and some negative, will always equal zero. We can prove that fact with a little algebra, but you can take our word for it. As we mentioned earlier, the average deviation is easy to understand, but it has little value for inferential statistics. In contrast, the next two measures (variance and standard deviation) are useful in other statistical procedures. So we now turn our attention to them.
  • 13. The Variance The variance takes a different approach to making all of the distances from the mean positive so that they will not sum to zero. Instead of taking the absolute value of the difference from the mean, the variance squares all of those differences. The notation that is used for the variance is a lowercase s2 . The formula for the variance is shown below. If you compare it with the formula for average deviation, you will see two differences instead of one between these formulas. The first is that the differences are squared instead of taking the absolute value. The numerator of this formula is called the sum of squares, which is short for sum of squared differences from the mean. See if you can spot the second difference. Did you recognize that the variance formula does not divide by N, but instead divides by N-1? The denominator (N-1) in this equation is called the degrees of freedom. It is a concept that you will hear about again and again in statistics. If you would like to know more about degrees of freedom, you can click on this link. This link provides a conceptual explanation of this concept. The reason that the variance formula divides the sum of squared differences from the mean by N- 1 is that dividing by N would produce a biased estimate of the population variance, and that bias is removed by dividing by N-1. You can learn more about the concept of biased versus unbiased estimates of population parameters by clicking on this link. The Standard Deviation The variance has some excellent statistical properties, but it is hard for most students to conceptualize. To start with, the unit of measurement for the mean is the same as the unit of measurement for the score. For example, if we compute the mean age of our sample and find that it is 28.7 years, that mean is on the same scale as the individual ages of our participants. But the variance is in squared units. For example, we might find that the variance is 100 years2 . Can you even imagine what the unit of years squared represents? Most people can't. But there is a measure of variability that is in the same units as the mean. It is called the standard deviation, and it is the square root of the variance (see the formula below). So if the variance was 100 years2 , the standard deviation would be 10 years. Since we used the symbol s2 to indicate variance, you might not be surprised that we use the lowercase letter s to indicate the standard deviation. You will see in our discussion of relative scores how valuable the standard deviation can be.
  • 14. At this point, many students assume that the variance is just a step in computing the standard deviation, because the standard deviation seems like it is much more useful and understandable. In fact, you will use the standard deviation for description purposes only and will use the variance for all your other statistical tasks. If you are wondering why that is, click on this link to find out. Relative Scores In this section, we will use some of the statistical information that you have already learned to solve the practical problem of how to indicate the relative standing of a person on a measure. As part of this section, you will learn about the standard normal distribution, which is a distribution that is defined by a mathematical equation. Such mathematical distributions are a critical part of the inferential statistical process that we will be covering later. Percentile Ranks Relative scores indicate where a person stands within a specified normative sample. In general, scores have little meaning unless you know how other people scored. For example, on probably most of the exams that you have taken in school, you need to be correct 70% or more of the time just to pass the course, yet the standardized tests that you took as part of the college admission process are designed so that less than half of the people taking them get 70% or more of the questions correct. To know how good a score is, you need to know what other people got. For example, a professional baseball player who got a hit only half the times he went up to bat would be by far the greatest hitter of all time. Most professional baseball players only get a hit about every fourth time at bat. In contrast, a driver who only reached his or her destination without an accident half the time would be considered so bad that no insurance company would cover the person. Most people arrive safely at their destination 99+% of the time. We are constantly seeking information about relative scores, sometime even before the scores have been computed. How many times have you walked out of a test and asked other students whether the test seemed hard or easy. If you thought it was hard, and therefore are worried that you did not do well, you are likely to feel a little better after other students tell you that it seemed very hard to them as well. The most basic relative score is the percentile rank, which specifies the percentage of people in the normative group who score lower on the measure than yourself. So if you scored at the 25th percentile, it means that 25% of the people score lower than you and 75% score higher than you. Percentile ranks can range from 0 (for the person with the lowest score) to 100 (for the person with the highest score). Most often percentile ranks are computed from a frequency distribution. Let's again use the frequency distribution that we have used before for examples. That distribution is below. Suppose that we want to compute the percentile rank for a score of 15. From the table, we can
  • 15. see that there are 333 people with a score below 15, but what do we do with the 33 people who have exactly 15? Do we count them as scoring above or below our person with a score of 15. The tradition is to assume that half of the people with the same score are below and half are above. That means that we have 33/2=16.5 with a 15 that we consider to be lower than us, and the same number with a score of 15 that we consider to have a score higher than us. We add the 16.5 people to the 333 people with lower scores of 14 or lower to get the number of people with scores lower than ours. Score Frequency Cumulative Frequency 17 8 394 16 20 386 15 33 366 14 48 333 13 71 285 12 85 214 11 58 129 10 39 71 9 21 32 8 11 11 There are a total of 394 people. To get the percentile rank, we divide the number of people below our score by the total number of people and multiply by 100 (to convert the proportion to a percent). In this case, the percentile rank is 89 [(349.5/394)*100]. We traditionally round percentile ranks to two significant digits. So we rounded 88.705584% to 89%. Standard Normal Distribution Many variables in psychology tend to show a distinctive shape when graphed using a histogram or frequency polygon. The shape resembles a bell shaped curve like the one shown below. This classic bell shaped curve is called a normal curve or normal distribution. The normal curve is perfectly symmetric. The right half and left half are mirror images of one another. The curve also does not quite reaches zero, although it gets very close. The shape of the normal curve is actually determined by a complex equation, with dictates the height of the curve at every point. You need not know the details of this equation, but you should know that the equation includes two variables. They are the mean and the standard deviation. The mean
  • 16. dictates where the middle of the distribution is, which is the highest point of the curve and the point that separates the the area under the curve into two equal segments. The standard deviation determines how spread out the curve is. Because the normal curve is based on an equation, it is possible to know exactly how high the curve is at every point and how much area is under the curve between any two scores on the X- axis. The figure below marks off 1 and 2 standard deviations both above and below the mean. The area under the curve between the mean and one standard deviation below the mean is approximately 34%, as shown in the figure. More precisely, it is 34.13%. We will show you where that number comes from shortly. Because the curve is symmetric, the area between the mean and one standard deviation above the mean is also 34%. Similarly, the area between 1 and 2 standard deviations, either above or below the mean, is approximately 14%, and the area beyond 2 standard deviations is 2% on either side of the distribution. All of these areas are determined by the equation for the normal curve, but you do not need to use this equation, because the values are computed for you and available in a table called the Area under the Standard Normal Curve Table. If you click on the link to this table, you can see what it looks like. To use the Standard Normal Table, you need to know a little more about the normal curve and you need to learn about the standard score, also known as the Z-score. If you look at the two normal curves above, you might recognize that there are no scores listed on the X-axis. Remember that the location on the X-axis is determined by the mean of the distribution and the spread of the curve is determined by the standard deviation of the distribution. But you can convert a normal curve with any mean and variance into a standard normal distribution, which is aa normal curve with a mean of zero and a standard deviation of 1.
  • 17. If the curve above were a standard normal distribution, the labels on the X-axis at the lines that divide the curve into sections would be -2, -1, 0, +1, +2, read from left to right. The table we showed you earlier gives the areas under the curve of such a standard normal distribution. Shown below is the equation that converts any score to a standard score using the values of the score and the mean and standard deviation of the distribution. A standard score shows where the person scores in a standard normal distribution. It tells you instantly whether the score is above or below the mean by the sign of the Z-score. If the Z-score is positive, the person scored above the mean; if it is negative, the person scored below the mean. The size of the Z-score indicates how far away from the mean the person scored. If you want to see exactly how the standard normal distribution and the Z-score can be used to compute a percentile rank, you can click on this link. Besides walking you through the process, this link provides exercises to help you master this concept and procedure. Other Relative Scores The score on any measure could be converted to a Z-score, which would tell you at a glance how a person scored relative to the reference group. For example, if someone tells you that her Z- score on the exam was +1.55, you know immediately that she scored above the mean and enough above the mean that she is near the top of the class. Remember, most of the normal distribution is contained between the boundaries of -2.0 and +2.0 standard deviations. There is only about 2% of the area under the curve in each of the tails. If another student tells you the his exam score was a Z of -.36, you know that he scores a bit below the mean. With the standard normal table, you could compute the percentile rank for each of these students in a few minutes. Of course, this procedure is only legitimate if the shape of the distribution of scores is normal or very close to normal. If the shape is not normal, the Standard Normal Table will not give accurate information about how the proportion of people who score above and below a given score. Although Z-scores are very useful and allow people to judge the relative performance of an individual quickly, many people get easily confused by the negative numbers that are possible with Z-scores. Consequently, many tests compute Z-scores, but then translate them mathematically to avoid negative numbers. For example, the IQ test produces a distribution of scores that is very close to normal. However, the IQ test does not give a person's score as a Z-score; instead, the IQ test reports the score as an IQ score. The IQ score is simply the Z-score multiplied by 15 and then added to 100, as shown in the equation below. The values of 15 and 100 are arbitrary, but the effect of this transformation
  • 18. is to produce a normal distribution with a mean of 100 and a standard deviation of 15. So the IQ distribution looks like the figure below. Note that this figure is identical in shape to all the other figures in this section. The only difference is that the scores on the X-axis are IQ scores. So just over 95% of people have IQ scores between 70 and 130, and no one has a negative IQ. Standardized tests often perform a similar transformation to avoid negative scores. For example, the Scholastic Aptitude Test (SAT) used for college admission and the Graduate Record Exam (GRE) used for admission to graduate school are both standardized so the the mean of the subtests is 500, with a standard deviation of 100. So if you score 450 on the verbal section of the SAT, you are scoring .5 standard deviations below the mean, which puts you at the 31st percentile. (See if you can do the computations and use the standard normal table to verify this percentile rank. This link shows you the method to make this computation.) The Normative Sample Z-scores and transformed Z-scores, such as SAT scores, are very handy and are used extensively in reporting test scores. But it is critical to understand that the score is meaningful ONLY if you take into account the normative sample. A quick example will illustrate this point. Let's assume that Dan took the SAT as a High School senior and scored 650 on each subtest. That is 150 points above the mean (1.5 standard deviations) and would place him at approximately the 93rd percentile. Four years later, after doing well in college, he decides to go to graduate school, and so he takes the GRE. This time he only obtained a 550 on each of the subtests. What happened? Why did his performance decrease despite the fact that he worked hard in college and did very well? You may already have guessed the answer to that question. In effect, we are trying to compare apples and oranges. The scores on the SAT and the GRE mean entirely different things, because they are based on entirely different normative samples.
  • 19. The SAT is taken by people who expect to complete high school and are considering going on to college. In contrast, the GRE is taken by people who expect to graduate from college and plan to go onto graduate school. Anyone who drops out of college or does poorly in college is unlikely to take the GRE. In other words, the normative sample for the GRE is much more exclusive than for the SAT. Dan's GRE score would place him in the 70th percentile of the people applying for graduate school, who are a pretty elite group academically. Most of the people who took the SAT did not take the GRE, and most of the people who take the GRE did very well on the SAT. The competition (i.e., the normative group) was tougher for the GRE than the SAT. Whenever you are given a normative score, such as a Z-score, percentile rank, or score on a standardized test, you should always consider the nature of the normative sample. A person making $500,000 per year may be one of the best paid people in the country (the normative sample including all workers), but one of the lowest paid CEOs for a Fortune 500 company (a different normative sample). Measures of Relationship Chapter 5 of the textbook introduced you to the three most widely used measures of relationship: the Pearson product-moment correlation, the Spearman rank-order correlation, and the Phi correlation. We will be covering these statistics in this section, as well as other measures of relationship among variables. What is a Relationship? Correlation coefficients are measures of the degree of relationship between two or more variables. When we talk about a relationship, we are talking about the manner in which the variables tend to vary together. For example, if one variable tends to increase at the same time that another variable increases, we would say there is a positive relationship between the two variables. If one variable tends to decrease as another variable increases, we would say that there is a negative relationship between the two variables. It is also possible that the variables might be unrelated to one another, so that you cannot predict one variable by knowing the level of the other variable. As a child grows from an infant into a toddler into a young child, both the child's height and weight tend to change. Those changes are not always tightly locked to one another, but they do tend to occur together. So if we took a sample of children from a few weeks old to 3 years old and measured the height and weight of each child, we would likely see a positive relationship between the two. A relationship between two variables does not necessarily mean that one variable causes the other. When we see a relationship, there are three possible causal interpretations. If we label the variables A and B, A could cause B, B could cause A, or some third variable (we will call it C) could cause both A and B. With the relationship between height and weight in children, it is likely that the general growth of children, which increases both height and weight, accounts for the observed correlation. It is
  • 20. very foolish to assume that the presence of a correlation implies a causal relationship between the two variables. There is an extended discussion of this issue in Chapter 7 of the text. Scatter Plots and Linear Relationships A helpful way to visualize a relationship between two variables is to construct a scatter plot, which you were briefly introduced to in our discussion of graphical techniques. A scatter plot represents each set of paired scores on a two dimensional graph, in which the dimensions are defined by the variables. For example, if we wanted to create a scatter plot of our sample of 100 children for the variables of height and weight, we would start by drawing the X and Y axes, labeling one height and the other weight, and marking off the scales so that the range on these axes is sufficient to handle the range of scores in our sample. Let's suppose that our first child is 27 inches tall and 21 pounds. We would find the point on the weight axis that represents 21 pounds and the point on the height axis that represents 27 inches. Where these two points cross, we would put a dot that represents the combination of height and weight for that child, as shown in the figure below. We then continue the process for all of the other children in our sample, which might produce the scatter plot illustrated below.
  • 21. It is always a good idea to produce scatter plots for the correlations that you compute as part of your research. Most will look like the scatter plot above, suggesting a linear relationship. Others will show a distribution that is less organized and more scattered, suggesting a weak relationship between the variables. But on rare occasions, a scatter plot will indicate a relationship that is not a simple linear relationship, but rather shows a complex relationship that changes at different points in the scatter plot. The scatter plot below illustrates a nonlinear relationship, in which Y increases as X increases, but only up to a point; after that point, the relationship reverses direction. Using a simple correlation coefficient for such a situation would be a mistake, because the correlation cannot capture accurately the nature of a nonlinear relationship.
  • 22. Pearson Product-Moment Correlation The Pearson product-moment correlation was devised by Karl Pearson in 1895, and it is still the most widely used correlation coefficient. This history behind the mathematical development of this index is fascinating. Those interested in that history can click on the link. But you need not know that history to understand how the Pearson correlation works. The Pearson product-moment correlation is an index of the degree of linear relationship between two variables that are both measured on at least an ordinal scale of measurement. The index is structured so the a correlation of 0.00 means that there is no linear relationship, a correlation of +1.00 means that there is a perfect positive relationship, and a correlation of -1.00 means that there is a perfect negative relationship. As you move from zero to either end of this scale, the strength of the relationship increases. You can think of the strength of a linear relationship as how tightly the data points in a scatter plot cluster around a straight line. In a perfect relationship, either negative or positive, the points all fall on a single straight line. We will see examples of that later. The symbol for the Pearson correlation is a lowercase r, which is often subscripted with the two variables. For example, rxy would stand for the correlation between the variables X and Y.
  • 23. The Pearson product-moment correlation was originally defined in terms of Z-scores. In fact, you can compute the product-moment correlation as the average cross-product Z, as show in the first equation below. But that is an equation that is difficult to use to do computations. The more commonly used equation now is the second equation below. Although this equation looks much more complicated and looks like it would be much more difficult to compute, in fact, this second equation is by far the easier of the two to use if you are doing the computations with nothing but a calculator. You can learn how to compute the Pearson product-moment correlation either by hand or using SPSS for Windows by clicking on one of the buttons below. Use the browser's return arrow key to return to this page. Spearman Rank-Order Correlation The Spearman rank-order correlation provides an index of the degree of linear relationship between two variables that are both measured on at least an ordinal scale of measurement. If one of the variables is on an ordinal scale and the other is on an interval or ratio scale, it is always possible to convert the interval or ratio scale to an ordinal scale. That process is discussed in the section showing you how to compute this correlation by hand. The Spearman correlation has the same range as the Pearson correlation, and the numbers mean the same thing. A zero correlation means that there is no relationship, whereas correlations of +1.00 and -1.00 mean that there are perfect positive and negative relationships, respectively. The formula for computing this correlation is shown below. Traditionally, the lowercase r with a subscript s is used to designate the Spearman correlation (i.e., rs). The one term in the formula that is not familiar to you is d, which is equal to the difference in the ranks for the two variables. This is explained in more detail in the section that covers the manual computation of the Spearman rank-order correlation.
  • 24. The Phi Coefficient The Phi coefficient is an index of the degree of relationship between two variables that are measured on a nominal scale. Because variables measured on a nominal scale are simply classified by type, rather than measured in the more general sense, there is no such thing as a linear relationship. Nevertheless, it is possible to see if there is a relationship. For example, suppose you want to study the relationship between religious background and occupations. You have a classification systems for religion that includes Catholic, Protestant, Muslim, Other, and Agnostic/Atheist. You have also developed a classification for occupations that include Unskilled Laborer, Skilled Laborer, Clerical, Middle Manager, Small Business Owner, and Professional/Upper Management. You want to see if the distribution of religious preferences differ by occupation, which is just another way of saying that there is a relationship between these two variables. The Phi Coefficient is not used nearly as often as the Pearson and Spearman correlations. Therefore, we will not be devoting space here to the computational procedures. Advanced Correlational Techniques Correlational techniques are immensely flexible and can be extended dramatically to solve various kinds of statistical problems. Covering the details of these advanced correlational techniques is beyond the score of this text and website. However, we have included brief discussions of several advanced correlational techniques on this Student Resource Website, including multidimensional scaling, path analysis, taxonomic search techniques, and statistical analysis of neuroimages. Nonlinear Correlational Procedures The vast majority of correlational techniques used in psychology are linear correlations. However, there are times when we can expect to find nonlinear relationships and we would like to apply statistical procedures to capture such complex relationships. This topic is far too complex to cover here. The interested student will want to consult advanced statistical textbooks that specialize in regression analyses. There are two words of caution that we want to state about using such nonlinear correlational procedures. Although it is relatively easy to do the computations using modern statistical software, you should not use these procedures unless you actually understand them and their pitfalls. It is easy to misuse the techniques and to be fooled into believing things that are not true from a naive analysis of the output of computer programs.
  • 25. The second word of caution is that there should be a strong theoretical reason to expect a nonlinear relationship if you are going to use nonlinear correlational procedures. Many psychophysiological processes are by their nature nonlinear, so using nonlinear correlations in studying those processes makes complete sense. But for most psychological processes, there is no good theoretical reason to expect a nonlinear relationship.

Linear Regression

As you learned in Chapters 5 and 7 of the text, the value of correlations is that they can be used to predict one variable from another. This process is called linear regression, or simply regression. It involves mathematically fitting a straight line to the data in a scatter plot. Below is a scatter plot from our discussion of correlations, to which we have added a regression line to illustrate how regression works. The regression line is computed from the data with formulas that we will present shortly. Once we have the regression line, we can use it to predict Y from knowing X. The scatter plot shows the relationship of height and weight in young children (birth to three years old), and the line that runs through the data points is the regression line. If we know the value of X (in this case, weight) and we want to predict Y from X, we draw a line straight up from our value of X until it intersects the regression line, and then we draw a line parallel to the X axis over to the Y axis. We then read from the Y axis our predicted value for Y (in this case, height).
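To make the graphical procedure concrete, here is a small worked example with made-up numbers (they are illustrative only and do not come from the figure). Suppose the regression line were Y' = 0.8X + 20. For a value of X = 50, the predicted value is Y' = 0.8(50) + 20 = 60. Drawing a vertical line up from X = 50 to the regression line and then a horizontal line over to the Y axis lands on that same value of 60; the graphical procedure and the equation are two ways of performing the identical computation.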
  • 26. In order to fit a line mathematically, there must be some stated mathematical criterion for what constitutes a good fit. In the case of linear regression, that criterion is called the least squares criterion, which is shorthand for positioning the line so that the sum of the squared distances from the actual scores to the predicted scores is as small as it can be. If you are predicting Y, you will compute the regression line that minimizes \sum (Y - Y')^2. Traditionally, a predicted score is designated by adding a prime symbol to the letter for the score (Y' is read "Y prime" or "Y predicted"). To illustrate this concept, we removed most of the clutter of data points from the above scatter plot and showed the distances that are involved in the least squares criterion. Note that it is the vertical distance from the point to the prediction line that matters--that is, the difference between the predicted Y (along the regression line) and the actual Y (represented by the data point). A common misconception is that you measure the shortest distance to the line, which would run from the point to the regression line at a right angle; that is not the distance that is minimized.
  • 27. It may not be immediately obvious, but if you were trying to predict X from Y, you would be minimizing the sum of the squared distances, \sum (X - X')^2. That means that the regression line for predicting Y from X may not be the same as the regression line for predicting X from Y. In fact, it is rare that they are exactly the same. The first equation below is the basic form of the regression line. It is simply the equation for a straight line, which you probably learned in high school math. The two new notational items are b_YX and a_YX, which are the slope and the intercept of the regression line for predicting Y from X. The slope is how much the Y scores increase per unit increase in the X scores. The slope in the figure above is approximately .80: for every 10 units of movement along the X axis, the line rises about 8 units on the Y axis. The intercept is the point at which the line crosses the Y axis (i.e., the point at which X is equal to zero). The equations for computing the slope and intercept of the line are listed as the second and third equations, respectively. If you want to predict X from Y, simply replace all the Xs with Ys and the Ys with Xs in the equations below.
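The three equation images referred to above did not survive in this copy of the page. The standard forms, which are consistent with the discussion that follows, are (in LaTeX notation):

    Y' = b_{YX} X + a_{YX}

    b_{YX} = r \left( \frac{\sigma_Y}{\sigma_X} \right) = \frac{N \sum XY - (\sum X)(\sum Y)}{N \sum X^2 - (\sum X)^2}

    a_{YX} = \mu_Y - b_{YX} \mu_X

In the first version of the slope equation, the ratio is of the population standard deviations, that is, the square roots of the population variances.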
  • 28. A careful inspection of these equations reveals a couple of important ideas. First, if you look at the first version of the equation for the slope (the one using the correlation and the population standard deviations), you will see that the slope is equal to the correlation if the population variances are equal. That is true both for predicting X from Y and for predicting Y from X. Note, however, that the two regression lines, one for predicting Y from X and one for predicting X from Y, coincide only when the correlation is perfect (+1.00 or -1.00); short of a perfect correlation, they remain two different lines even when the population variances are equal. Second, if the correlation is zero (i.e., no relationship between X and Y), then the slope will be zero (look at the first part of the second equation). If you are predicting Y from X, your regression line will be horizontal, and if you are predicting X from Y, your regression line will be vertical. Furthermore, if you look at the third equation, you will see that the horizontal line for predicting Y will be at the mean of Y and the vertical line for predicting X will be at the mean of X. Think about that for a minute. If X and Y are uncorrelated and you are trying to predict Y, the best prediction that you can make is the mean of Y. If you have no useful information about a variable and are asked to predict the score of a given individual, your best bet is to predict the mean. To the extent that the variables are correlated, you can make a better prediction by using the information from the correlated variable and the regression equation.
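To make these formulas concrete, below is a minimal Python sketch that computes the slope and intercept for predicting Y from X. It is only an illustration of the standard formulas given above; the function name and data values are our own and are hypothetical.

```python
# Minimal sketch: least-squares regression line for predicting Y from X.
# Implements the computational formulas discussed above:
#   b_YX = (N*sum(XY) - sum(X)*sum(Y)) / (N*sum(X^2) - (sum(X))^2)
#   a_YX = mean(Y) - b_YX * mean(X)

def regression_y_on_x(xs, ys):
    """Return (slope, intercept) of the least-squares line Y' = b*X + a."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)  # the line passes through (mean X, mean Y)
    return b, a

weights = [8, 10, 13, 15, 18, 21, 24, 27]   # X scores (hypothetical)
heights = [46, 55, 60, 66, 72, 79, 83, 91]  # Y scores (hypothetical)

b, a = regression_y_on_x(weights, heights)
print(f"Y' = {b:.3f}X + {a:.3f}")
print(f"Predicted Y at X = 20: {b * 20 + a:.1f}")
```

Notice that the intercept formula guarantees the line passes through the point (mean of X, mean of Y); if the correlation, and therefore the slope, were zero, the prediction would reduce to the mean of Y, exactly as described above.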
Other Descriptive Statistics

The most commonly used descriptive statistics for a distribution of scores are the mean and the variance (or standard deviation), but other descriptive statistics are available and are often computed by data analysis programs. We will discuss two of them in this section: skewness and kurtosis. Both describe aspects of the shape of a distribution, much as the mean and the variance describe its center and spread. But these indices of shape are used far less frequently than the mean and variance.

  • 29. Skewness was discussed in the section on graphing data with histograms and frequency polygons. There is an index of the degree and direction of skewness that many statistical programs produce as part of their descriptive statistics package. When the skewness is near zero, the distribution is close to symmetric. A negative number indicates a negative skew, and a positive number indicates a positive skew. Kurtosis indexes the degree to which a distribution is peaked or flat relative to a normal distribution: larger values indicate a more peaked distribution with heavier tails, and smaller values indicate a flatter distribution. Statisticians refer to a concept called the moments of a distribution. The mean is based on the first moment of the distribution, and the variance is based on the second moment. Skewness and kurtosis are based on the third and fourth moments, respectively. These concepts make more sense to people who do theoretical work in statistics. For the purposes of this course, the only thing you need to remember is the definition of these terms in case you run across them in your reading. If you want to compute them, most statistical analysis packages allow you to include these statistics in their descriptive output.
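For readers who want to see the moment-based definitions in action, here is a small Python sketch. It uses the population (uncorrected) moment formulas; many statistical packages apply bias corrections and report excess kurtosis (the moment-based value minus 3), so their output can differ somewhat. The function name and data values are our own illustrations.

```python
# Sketch: moment-based skewness and kurtosis for a list of scores.
# skewness = m3 / m2**1.5 and kurtosis = m4 / m2**2, where m_k is the
# k-th central moment. A normal distribution has a kurtosis of about 3;
# packages that report "excess kurtosis" subtract 3 from this value.

def shape_statistics(scores):
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n  # 2nd central moment (variance)
    m3 = sum((x - mean) ** 3 for x in scores) / n  # 3rd central moment
    m4 = sum((x - mean) ** 4 for x in scores) / n  # 4th central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return skewness, kurtosis

scores = [2, 4, 4, 5, 5, 5, 6, 6, 7, 9]  # hypothetical data
skew, kurt = shape_statistics(scores)
print(f"skewness = {skew:.3f}, kurtosis = {kurt:.3f} (excess = {kurt - 3:.3f})")
```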