Interpreting Test Scores
This page describes which scores to use to accomplish each of several purposes and
tells what the different types of scores mean.
Three of the fundamental purposes for testing are (1) to describe each student's
developmental level within a test area, (2) to identify a student's areas of relative
strength and weakness in subject areas, and (3) to monitor year-to-year growth in
the basic skills. To accomplish any one of these purposes, it is important to select,
from among the types of scores reported, the one that permits the proper interpretation.
Scores such as percentile ranks, grade equivalents, and standard scores differ from
one another in the purposes they can serve, the precision with which they describe
achievement, and the kind of information they provide. A closer look at these types
of scores will help differentiate the functions they can serve and the meanings
they can convey. Additional detail can be found in the Interpretive Guide for Teachers
and Counselors.
In Iowa, school districts can obtain scores that are reported using national norms
or Iowa norms. On some reports, both kinds of scores are reported. The difference
is simply in the group with which comparisons are made to obtain score meaning.
A student's Iowa percentile rank (IPR) compares the student's score with those of
others in his/her grade in Iowa. The student's national percentile rank (NPR) compares
that same score with those of others in his/her grade in the nation. For other types
of scores described below, there are both Iowa and national scores available to
Iowa schools.
Types of Scores
Raw Score (RS)
The number of questions a student gets right on a test is the student's raw score
(assuming each question is worth one point). By itself, a raw score has little or
no meaning. The meaning depends on how many questions are on the test and how hard
or easy the questions are. For example, if Kati got 10 right on both a math test
and a science test, it would not be reasonable to conclude that her level of achievement
in the two areas is the same. This illustrates why raw scores are usually converted
to other types of scores for interpretation purposes.
Percent Correct (PC)
When the raw score is divided by the total number of questions and the result is
multiplied by 100, the percent-correct score is obtained. Like raw scores, percent-correct
scores have little meaning by themselves. They tell what percent of the questions
a student got right on a test, but unless we know something about the overall difficulty
of the test, this information is not very helpful. Percent-correct scores are sometimes
incorrectly interpreted as percentile ranks, which are described below. The two
are quite different.
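To make the arithmetic concrete, here is a minimal sketch in Python; the function name
and the example numbers are ours for illustration and do not come from any score report.

    def percent_correct(raw_score, total_questions):
        """Convert a raw score to a percent-correct score."""
        return 100.0 * raw_score / total_questions

    # Example: 10 questions answered correctly on a 25-question test
    print(percent_correct(10, 25))  # 40.0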
Grade Equivalent (GE)
The grade equivalent is a number that describes a student's location on an achievement
continuum. The continuum is a number line that describes the lowest level of knowledge
or skill on one end (lowest numbers) and the highest level of development on the
other end (highest numbers). The GE is a decimal number that describes performance
in terms of grade level and months. For example, if a sixth-grade student obtains
a GE of 8.4 on the Vocabulary test, his score is like the one a typical student
finishing the fourth month of eighth grade would likely get on the Vocabulary test.
The GE of a given raw score on any test indicates the grade level at which the typical
student makes this raw score. The digits to the left of the decimal point represent
the grade and those to the right represent the month within that grade.
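The split between grade and month can be illustrated with a short Python sketch; the
helper name is hypothetical, and the GE of 8.4 is the example used above.

    def split_grade_equivalent(ge):
        """Split a GE such as 8.4 into its grade and month parts."""
        grade = int(ge)                   # digits to the left of the decimal point
        month = round((ge - grade) * 10)  # digit to the right of the decimal point
        return grade, month

    grade, month = split_grade_equivalent(8.4)
    print(grade, month)  # 8 4  (eighth grade, fourth month)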
Grade equivalents are particularly useful and convenient for measuring individual
growth from one year to the next and for estimating a student's developmental status
in terms of grade level. But GEs have been criticized because they are sometimes
misused or are thought to be easily misinterpreted. One point of confusion involves
the issue of whether the GE indicates the grade level in which a student should
be placed. For example, if a fourth-grade student earns a GE of 6.2 on a fourth-grade
reading test, should she be moved to the sixth grade? Obviously the student's developmental
level in reading is high relative to her fourth-grade peers, but the test results
supply no information about how she would handle the material normally read by students
in the early months of sixth grade. Thus, the GE only estimates a student's developmental
level; it does not provide a prescription for grade placement. A GE that is much
higher or lower than the student's grade level is mainly a sign of performance that
is exceptionally high or low relative to the student's grade-level peers.
In sum, all test scores, no matter which type they are or which test they are from,
are subject to misinterpretation and misuse. All have limitations or weaknesses
that are exaggerated through improper score use. The key is to choose the type of
score that will most appropriately allow you to accomplish your purposes for testing.
Grade equivalents are particularly suited to estimating a student's developmental
status or year-to-year growth. They are particularly ill-suited to identifying a
student's standing within a group or to diagnosing areas of relative strength and
weakness.
Developmental Standard Score (SS)
Like the grade equivalent (GE), the developmental standard score is also a number
that describes a student's location on an achievement continuum. The scale used
with the ITBS and ITED was established by assigning a score of 200 to the median
performance of students in the spring of grade 4 and 250 to the median performance
of students in the spring of grade 8.
The main drawback to interpreting developmental standard scores is that they have
no built-in meaning. Unlike grade equivalents, for example, which build grade level
into the score, developmental standard scores are unfamiliar to most educators,
parents, and students. To interpret the SS, the values associated with typical performance
in each grade must be used as reference points.
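As a rough illustration of using grade medians as reference points, consider the
following Python sketch. Only the values 200 (spring of grade 4) and 250 (spring of
grade 8) come from the scale description above; the lookup-table approach and the
example score are our own simplifications, not published norms.

    # Spring median SS values from the scale description; other grades omitted.
    spring_medians = {4: 200, 8: 250}

    def nearest_grade(ss):
        """Return the grade whose spring median SS is closest to the given score."""
        return min(spring_medians, key=lambda grade: abs(spring_medians[grade] - ss))

    print(nearest_grade(210))  # 4: closer to the grade 4 median than to the grade 8 median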
The main advantage of the developmental standard score scale is that it mirrors
reality better than the grade-equivalent scale. That is, it shows that year-to-year
growth is usually not as great at the upper grades as it is at the lower grades.
(Recall that the grade-equivalent scale shows equal average annual growth -- 10
months -- between any pair of grades.) Despite this advantage, the developmental
standard scores are much more difficult to interpret than grade equivalents. Consequently,
when teachers and counselors wish to estimate a student's annual growth or current
developmental level, grade equivalents are the scores of choice.
The potential for confusion and misinterpretation described in the previous
subsection for the GE applies to the SS as well. Relative to the GE, the
SS is not as easy to use in describing growth, but it is equally inappropriate for
identifying relative strengths and weaknesses of students or for describing a student's
standing in a group.
Percentile Rank (PR)
A student's percentile rank is a score that tells the percent of students in a particular
group who got lower raw scores on a test than the student did. It shows the student's
relative position or rank in a group of students who are in the same grade and who
were tested at the same time of year (fall, midyear, or spring) as the student.
Thus, for example, if Toni earned a percentile rank of 72 on the Language test,
it means that she scored higher than 72 percent of the students in the group with
which she is being compared. Of course, it also means that 28 percent of the group
scored higher than Toni. Percentile ranks range from 1 to 99.
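The following Python sketch shows the idea behind a percentile rank; the comparison
group of raw scores is made up for illustration and is not an actual norm group.

    def percentile_rank(score, group_scores):
        """Percent of the comparison group with lower raw scores, bounded to 1-99."""
        lower = sum(1 for s in group_scores if s < score)
        pr = round(100 * lower / len(group_scores))
        return min(max(pr, 1), 99)

    # Illustrative raw scores for a comparison group (not actual norms)
    group = [12, 15, 18, 20, 21, 23, 25, 27, 30, 33]
    print(percentile_rank(26, group))  # 70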
A student's percentile rank can vary depending on which group is used to determine
the ranking. A student is simultaneously a member of many different groups: all
students in her classroom, her building, her school district, her state, and the
nation. Different sets of percentile ranks are available with the Iowa Tests of Basic
Skills to permit schools to make the most relevant comparisons involving
their students.
Types of Score Interpretation
An achievement test is built to help determine how much skill or knowledge students
have in a certain area. We use such tests to find out whether students know as much
as we expect they should, or whether they know particular things we regard as important.
By itself, the raw score from an achievement test does not indicate how much a student
knows or how much skill she or he has. More information is needed to decide "how
much." The test score must be compared or referenced to something in order to bring
meaning to it. That "something" typically is (a) the scores other students have
obtained on the test or (b) a series of detailed descriptions that tell what students
at each score point know or which skills they have successfully demonstrated. These
two ways of referencing a score to obtain meaning are commonly called norm-referenced
and criterion-referenced score interpretations.
Norm-Referenced Interpretation
Standardized achievement batteries like the ITBS and ITED are designed mainly to
provide for norm-referenced interpretations of the scores obtained from them. For
this reason they are commonly called norm-referenced tests. However, the scores
also permit criterion-referenced interpretations, as do the scores from most other
tests. Thus, norm-referenced tests are devised to enhance norm-referenced interpretations,
but they also permit criterion-referenced interpretation.
A norm-referenced interpretation involves comparing a student's score with the scores
other students obtained on the same test. How much a student knows is determined
by the student's standing or rank within the reference group. High standing is interpreted
to mean the student knows a lot or is highly skilled, and low standing means the
opposite. Obviously, the overall competence of the norm group affects the interpretation
significantly. Ranking high in an unskilled group may represent lower absolute achievement
than ranking low in an exceptionally high-performing group.
Most of the scores on ITBS and ITED score reports are based on norm-referencing,
i.e., comparing with a norm group. In the case of percentile ranks, stanines, and
normal curve equivalents, the comparison is with a single group of students in a
certain grade who tested at a certain time of year. These are called status scores
because they show a student's position or rank within a specified group. However,
in the case of grade equivalents and developmental standard scores, the comparison
is with a series of reference groups. For example, the performances of students
from third grade, fourth grade, fifth grade, and sixth grade are linked together
to form a developmental continuum. (In reality, the scale is formed with grade groups
from kindergarten up through the end of high school.) These are called developmental
scores because they show the students' positions on a developmental scale. Thus,
status scores depend on a single group for making comparisons and developmental
scores depend on multiple groups that can be linked to form a growth scale.
An achievement battery like the ITBS or ITED is a collection of tests in several
subject areas, all of which have been standardized with the same group of students.
That is, the norms for all tests have been obtained from a single group of students
at each grade level. This unique aspect of the achievement battery makes it possible
to use the scores to determine skill areas of relative strength and weakness for
individual students or class groups, and to estimate year-to-year growth. The use
of a battery of tests having a common norm group enables educators to make statements
such as "Suzette is better in mathematics than in reading" or "Danan has shown less
growth in language skills than the typical student in his grade." If norms were
not available, there would be no basis for statements like these.
Norms also allow students to be compared with other students and schools to be compared
with other schools. If making these comparisons were the sole reason for using a
standardized achievement battery, then the time, effort, and cost associated with
testing would have to be questioned. However, such comparisons do give educators
the opportunity to look at the achievement levels of students in relation to a nationally
representative student group. Thus, teachers and administrators get an "external"
look at the performance of their students, one that is independent of the school's
own assessments of student learning. As long as our population continues to be highly
mobile and students compete nationally rather than locally for educational and economic
opportunities, student and school comparisons with a national norm group should
be of interest to students, parents, and educators.
A common misunderstanding about the use of norms has to do with the effect of testing
at different times of the year. For example, it is widely believed that students
who are tested in the spring of fourth grade will score higher than those who are
tested in the fall of fourth grade with the same test. In terms of grade-equivalent
scores, this is true because students should have moved higher on the developmental
continuum from fall to spring. But in terms of percentile ranks, this belief is
false. If students have made typical progress from fall to spring of grade 4, their
standing among fourth-grade students should be the same at both times of the year.
(The student whose percentile rank in reading is 60 in the fall is likely to have
the same percentile rank when given the same test in the spring.) The reason for
this, of course, is that separate norms for fourth grade are available for the fall
and the spring. Obviously, the percentile ranks would be as different as the grade
equivalents if the norms for fourth grade were for the entire year, regardless of
the time of testing. Those who believe students should be tested only in the spring
because their scores will "look better" are misinformed about the nature of norms
and their role in score interpretation.
Scores from a norm-referenced test do not tell what students know and what they
do not know. They tell only how a given student's knowledge or skill compares with
that of others in the norm group. Only after reviewing a detailed content outline
of the test or inspecting the actual items is it possible to make interpretations
about what a student knows. This caveat is not unique to norm-referenced interpretations,
however. In order to use a test score to determine what a student knows, we must
examine the test tasks presented to the student and then infer or generalize about
what he or she knows.
Criterion-Referenced Interpretation
A criterion-referenced interpretation involves comparing a student's score with
a subjective standard of performance rather than with the performance of a norm
group. Deciding whether a student has mastered a skill or demonstrated minimum acceptable
performance involves a criterion-referenced interpretation. Usually percent-correct
scores are used and the teacher determines the score needed for mastery or for passing.
Even though the tests in the ITBS and ITED batteries were not developed primarily
for criterion-referenced purposes, it is still appropriate to use the scores in
those ways. Before doing so, however, the user must establish some performance standards
(criterion levels) against which comparisons can be made. For example, how many
math estimation questions does a student need to answer correctly before we regard
his/her performance as acceptable or "proficient"? This can be decided by examining
the test questions on estimation and making a judgment about how many the minimally
prepared student should be able to get right. The percent of estimation questions
identified in this way becomes the criterion score to which each student's percent-correct
score should be compared.
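A minimal Python sketch of this comparison follows; the 70 percent criterion and the
question counts are illustrative choices, not values recommended by the test publisher.

    def meets_criterion(raw_score, total_questions, criterion_pct):
        """Compare a percent-correct score with a teacher-set criterion level."""
        return 100.0 * raw_score / total_questions >= criterion_pct

    # Example: 8 of 10 estimation questions right, judged against a 70 percent criterion
    print(meets_criterion(8, 10, 70.0))  # True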
When making a criterion-referenced interpretation, it is critical that the content
area covered by the test -- the domain -- be described in detail. It is also important
that the test questions for that domain cover the important areas of the domain.
In addition, there should be enough questions on the topic to provide the students
ample opportunity to show what they know and to minimize the influence of errors
in their scores.
Most of the tests in batteries like the ITBS or ITED cover such a wide range of
content or skills that good criterion-referenced interpretations are difficult to
make with the test scores. However, in most tests the separate skills are defined
carefully, and there are enough questions measuring them to make good criterion-referenced
interpretations of the skill scores possible. For example, the Reference Materials
test covers too many discrete topics to permit useful criterion-referenced interpretations
with scores from the whole test. But such skills as alphabetizing, using a dictionary,
or using a table of contents are defined thoroughly enough so that criterion-referenced
interpretations of scores from them are quite appropriate. However, in an area like
Mathematics Concepts at Level 12, some of the skill scores may not be suitable for
making good criterion-referenced interpretations. Each of the six skills in that
test is a broad content area which is further defined by two to four subskills.
Furthermore, some skills, such as measurement, have only three questions each to
cover a broad topic. That is generally too few for making sound judgments about
mastery.
The percent-correct score is the type used most widely for making criterion-referenced
interpretations. Criterion scores that define various levels of performance on the
tests are generally percent-correct scores arrived at through teacher analysis and
judgment. Several score reports available from Iowa Testing Programs include percent-correct
skill scores that can be used to make criterion-referenced interpretations: Primary
Reading Profile, Class Item Response Record, Group Item Analysis, Individual Performance
Profile, and Group Performance Profile.
Interpreting Scores from Special Test Administrations
A testing accommodation is a change in the procedures for administering the test
that is intended to neutralize, as much as possible, the effect of the student's
disability on the assessment process. The intent is to remove the effect of the
disability(ies), to the extent possible, so that the student is assessed on equal
footing with all other students. In other words, the score reflects what the student
knows, not merely what the student's disabilities allow him/her to show.
The expectation is that the accommodation will cancel the disadvantage associated
with the student's disability. This is the basis for choosing the type and amount
of accommodation to be given to a student. Sometimes the accommodation won't help
quite enough, sometimes it might help a little too much, and sometimes it will be
just right. We never can be sure, but we operate as though we have made a good judgment
about how extensive a student's disability is and how much it will interfere with
obtaining a good measure of what the student knows. Therefore, the use of an accommodation
should help the student experience the same conditions as those in the norm group.
Thus, the norms still offer a useful comparison; the scores can be interpreted in
the same way as the scores of a student who needs no accommodations.
A test modification involves changing the assessment itself so that the tasks or
questions presented are different from those used in the regular assessment. A Braille
version of a test modifies the questions just like a translation to another language
might. Helping students with word meanings, translating words to a native language,
or eliminating parts of a test from scoring are further examples of modifications.
In such cases, the published test norms are not appropriate to use. These are not
accommodations. With modifications, the percentile ranks or grade equivalents should
not be interpreted in the same way as they would be had no modifications been made.
Certain other kinds of changes in the tests or their presentation may result in
measuring a different trait than was originally intended. For example, when a reading
test is read to the student, we obtain a measure of how well the student listens
rather than how well he/she reads. Or if the student is allowed to use a calculator
on a math estimation test, we obtain a measure of computation ability with a calculator
rather than a measure of the student's ability to do mental arithmetic. Obviously
in these situations, there are no norms available and the scores are quite limited
in value. Consequently, these particular changes should not be made.