Exam Analysis

Behind the Scenes

AP Exams are taken by students in all 50 states and in nearly 100 other countries and scored at the AP Reading. Then, AP statisticians at the Educational Testing Service set to work analyzing data to ensure that the AP Exams are -- and continue to be -- among the finest assessments in the world. Their findings are published in a test analysis report for each subject. This report, which contains information on the current exam as well as comparisons with those of the past five years, provides valuable insights in several areas.

Score Distributions
Correlations
Speededness
Reliability
Item Statistics
Differential Item Functioning

Score Distributions
These tables include the scores on individual free-response questions, Section I scores, Section II scores, composite scores, and the final AP grades.

Table 4.1 shows the distribution of scores on the free-response questions of the 2003 English Literature and Composition Exam. This type of data enables you to see whether large numbers of students omitted the question or wrote off-topic answers, and whether the Readers made full use of the scoring scale in scoring students' responses.

Standard and Non-Standard Groups
In most cases, scores for all the students who took the exam are included in score distribution tables. However, there are exceptions. For the AP foreign language exams (French, German, and Spanish), three separate tables are created, based on students' experience of the language:

  • The Standard Group consists of students who have little or no experience with the language out of school.
  • Non-Standard Group 1 is made up of students who do not speak or hear the language at home, but who have spent a month or more in a country where the language is spoken.
  • Non-Standard Group 2 consists of students who regularly speak and/or hear the language at home.

Correlations
The correlation between the scores of two sections shows the magnitude of the association of the scores for the two exam sections.

  • A high correlation indicates either that the two sections are measuring the same skills, or that the skills they measure tend to be strongly associated with each other (at least among the AP test-takers).
  • A low correlation may indicate one of two things. It may be that the sections are measuring quite different skills. On the other hand, it may show that at least one section is not measuring reliably.

The test analysis report includes estimated true-score correlations, which are estimates of what the correlations would be if both sections were measuring perfectly reliably. If the estimated true-score correlation between the two sections is low, it is reasonable to conclude that the two sections are measuring substantially different skills.
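The report does not say how the true-score correlation is estimated. One standard approach is the correction for attenuation, which divides the observed correlation by the square root of the product of the two section reliabilities. The sketch below assumes that approach; the function name and the sample values are hypothetical and are not taken from any AP report.

```python
import math


def true_score_correlation(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Estimate the correlation two sections would show if both were
    perfectly reliable (correction for attenuation).

    r_xy  -- observed correlation between the two section scores
    rel_x -- reliability estimate for the first section
    rel_y -- reliability estimate for the second section
    """
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must be in the interval (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical values for illustration only.
estimate = true_score_correlation(r_xy=0.60, rel_x=0.90, rel_y=0.80)
print(round(estimate, 3))
```

Because the denominator is at most 1, the estimate is never smaller in magnitude than the observed correlation. With unreliable sections it can even exceed 1, which is one reason to treat it as an estimate.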

Table 4.2 is the correlation table for the 2003 Microeconomics Exam.

Speededness
"Speededness" in testing is the effect of time limits on the test-takers' scores. An exam is "speeded" to the extent that those taking it score lower than they would have if they had had unlimited time. Most of the speededness statistics that are produced for AP Exams are based on the number of items that were "not reached." In each separately timed section or subsection, the unanswered items that follow the last item a student answered are counted as "not reached." For example, if a student leaves items 47, 49, and 50 unanswered on a 50-item exam, it is assumed that the student reached item 47 (and omitted it) but did not reach items 49 and 50.
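As a minimal sketch of this counting rule, the function below splits a student's unanswered items into "omitted" and "not reached." The function name and the use of None to mark an unanswered item are assumptions made for illustration.

```python
def classify_unanswered(responses):
    """Split unanswered items into 'omitted' and 'not reached'.

    responses -- one entry per item, in exam order; None marks an
                 item the student left unanswered.
    Returns (omitted, not_reached) as lists of 1-based item numbers.
    """
    # Find the last item the student actually answered.
    last_answered = 0
    for number, response in enumerate(responses, start=1):
        if response is not None:
            last_answered = number

    # Blanks before the last answered item count as omitted.
    omitted = [n for n in range(1, last_answered + 1)
               if responses[n - 1] is None]
    # Blanks after the last answered item count as not reached.
    not_reached = list(range(last_answered + 1, len(responses) + 1))
    return omitted, not_reached


# The 50-item example from the text: items 47, 49, and 50 left blank.
responses = ["A"] * 50
for item in (47, 49, 50):
    responses[item - 1] = None

omitted, not_reached = classify_unanswered(responses)
print(omitted, not_reached)
```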

How reliable are these data?
Statistics based on the number of students "not reaching" items have some obvious limitations:

  • They cannot be used for free-response questions for which the answer is more than a very brief written response.
  • If a student skips an item, intending to answer it later, and then runs out of time, the item will be classified as "omitted" rather than "not reached," making the exam appear less speeded than it is.
  • A student may complete the exam by hurrying to finish, and answer several incorrectly because of time pressure. Again, the "not reached" count of zero will incorrectly make the exam appear to be unspeeded for that student.

However, despite these limitations, "not reached" statistics can provide a reasonably useful indication of speededness:

  • Although a small number of "not reached" items may not guarantee that the exam is unspeeded, it suggests that most of the students had sufficient time at least to read and respond to the items.
  • A large number of "not reached" items implies that the exam is speeded for many test-takers.

Analyzing the Data
For each exam, statisticians create a table, showing the percentage of students reaching the last item, the next to last, the third from last, and so on, up to the point at which virtually all students reached the item. You can see the data for the 2003 AP Physics B Exam in Table 4.8. This information is important because if the last one or two items are very difficult, the percentage of students finishing the exam may make it appear much more speeded than it is. For example, a statistic showing that only 52 percent of the students finished an exam would suggest that the exam is highly speeded. However, if 98 percent of the students reached the next-to-last item, it would be reasonable to conclude that the exam was not highly speeded, but that the last item was very difficult.
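A table of this kind can be sketched as follows. The data are hypothetical and are chosen to reproduce the 52 percent and 98 percent figures from the example above; the function name and its inputs are assumptions, not the statisticians' actual procedure.

```python
def percent_reaching(last_reached, n_items, depth=3):
    """Percentage of students reaching each of the final `depth` items.

    last_reached -- for each student, the number of the last item
                    that student reached (1-based)
    n_items      -- number of items in the section
    """
    n_students = len(last_reached)
    table = {}
    # Work backward from the last item, as the report table does.
    for item in range(n_items, n_items - depth, -1):
        reached = sum(1 for last in last_reached if last >= item)
        table[item] = 100.0 * reached / n_students
    return table


# Hypothetical 50-item section taken by 100 students.
last_reached = [50] * 52 + [49] * 46 + [48] * 2
table = percent_reaching(last_reached, n_items=50)
for item, pct in table.items():
    print(f"item {item}: {pct:.0f}% reached")
```

Reading down the table shows the drop between item 49 and item 50, the pattern that points to a difficult final item rather than a highly speeded exam.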

Table 4.3 shows the percentage of students completing 100 percent and 75 percent of the multiple-choice section of each AP Exam in 2003. Where the multiple-choice section consists of two or more separately timed subsections, the data for the subsections are shown separately.

Reliability
A useful way to conceptualize the reliability of a set of scores on an exam is to think about what would happen to those scores if we could give the exam a second time to the same group of students under identical conditions. The square of the correlation between the scores on the two occasions is called the "reliability coefficient" and it tells us how well we can predict scores on the second occasion from the scores on the first occasion using a linear equation. If the squared correlation is zero, then there is no linear relationship between the scores on the two occasions. If the squared correlation is one, then there is a perfect linear relationship between the scores on the two occasions.

Since we cannot give the exam a second time under identical conditions to the same group of students, the "internal consistency" of the exam is used as an estimate of the reliability. Internal consistency refers to the intercorrelations among the scores on different items of the exam (e.g., between scores on all pairs of the multiple-choice items, or between scores on all pairs of the free-response questions). In particular, a statistic known as Coefficient Alpha is used to estimate the reliability of scores on AP Exams.
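Coefficient Alpha has a standard formula: with k items, it equals k / (k - 1) times one minus the ratio of the summed item variances to the variance of the total scores. The sketch below applies that formula to a small hypothetical score matrix; it illustrates the statistic and is not the AP Program's production code.

```python
from statistics import pvariance


def coefficient_alpha(scores):
    """Coefficient Alpha for a matrix of item scores.

    scores -- list of rows, one per student; each row holds that
              student's score on every item, in the same order.
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across students.
    item_variances = [pvariance([row[j] for row in scores])
                      for j in range(k)]
    # Variance of the students' total scores.
    total_variance = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical right/wrong (1/0) scores for four students on three items.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = coefficient_alpha(scores)
print(round(alpha, 2))
```

When the items are strongly intercorrelated, the total-score variance is large relative to the summed item variances and Alpha approaches 1.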

Coefficient Alpha is computed separately for Section I and Section II. In addition, for exams in which multiple-choice items or free-response questions form homogeneous groupings, Alpha is computed separately for each such grouping. For example, for AP Calculus, Alpha is computed separately for items or questions based on the use of a graphing calculator and for items or questions not based on the use of a calculator.

The reliability estimates for different sections are combined to obtain an estimate of the composite score reliability. The formulas used for combining reliabilities take the weighting of Sections I and II into account.
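The combining formulas are not given here. One common textbook formula for the reliability of a weighted composite assumes that measurement errors in the two sections are uncorrelated, so the composite's error variance is the weighted sum of the section error variances. The sketch below uses that formula as an assumption; the weights, standard deviations, and reliabilities are hypothetical.

```python
def composite_reliability(weights, sds, reliabilities, r12):
    """Reliability of a weighted sum of two section scores.

    weights       -- (w1, w2) section weights
    sds           -- (sd1, sd2) section score standard deviations
    reliabilities -- (rel1, rel2) section reliability estimates
    r12           -- observed correlation between the two sections
    Assumes measurement errors in the two sections are uncorrelated.
    """
    (w1, w2), (s1, s2), (rel1, rel2) = weights, sds, reliabilities
    # Error variance contributed by each weighted section.
    error_var = (w1 * s1) ** 2 * (1 - rel1) + (w2 * s2) ** 2 * (1 - rel2)
    # Variance of the weighted composite score.
    composite_var = ((w1 * s1) ** 2 + (w2 * s2) ** 2
                     + 2 * w1 * w2 * r12 * s1 * s2)
    return 1 - error_var / composite_var


# Hypothetical values for illustration only.
rel_c = composite_reliability(weights=(1.0, 2.0), sds=(10.0, 5.0),
                              reliabilities=(0.9, 0.8), r12=0.6)
print(round(rel_c, 4))
```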

Note that the type of reliability discussed here is distinct from the "Scoring Reliability Studies" which are discussed in the section "Scoring the Free-Response Section." These studies are used to address the following question: "If each student's free-response papers were rescored by a second set of Readers, how strongly would the second set of scores correlate with the first?" To do this, a special study is conducted in which the same papers are read and scored independently by two sets of Readers.

Table 4.4 shows the estimated reliability of Section I, Section II, and the composite scores for each AP Exam in 2003.

Item Statistics
The AP Program performs an item analysis of the multiple-choice section of each exam every year. With the 2000 administrations, AP statisticians at ETS started to use a new graphically oriented procedure for evaluating the statistical performance of items. This approach is based on the realization that the most useful statistical information that an item analysis could provide to test assemblers would be a series of conditional probability estimates displayed graphically.

For each possible value of the total AP score, these conditional probability estimates would indicate the test-taker's probability of answering correctly -- and of choosing each option. These estimates could be plotted on a graph showing a response curve for the correct answer and for the other options. Such a graph allows the test developer to see at a glance the most important statistical characteristics of the item: its difficulty, and how its difficulty is related to the test-taker's total score; the popularity of the individual options; and the way each option's popularity corresponds to the test-taker's total score.

Figure 4.2 shows the item analysis results for question 2 on the 2000 AP Environmental Science Exam. The item analysis is based on 13,269 students. The horizontal axis on the graph represents the student's score on this particular exam, and the vertical axis represents the probability scale, from .00 to 1.00. There is a curve for the correct answer and a curve for each option. The height of the curve indicates the student's probability of choosing that answer option.

Another feature of the graph is a series of dashed vertical lines, indicating the 10th, 25th, 50th, 75th, and 90th percentiles of the distribution of total scores. These lines allow the test developer to relate the information in the graph to the abilities of the test-takers. A high correct-answer probability in the middle of the score scale may mean one thing to the test developers if that point is near the 50th percentile of the score distribution but may mean something quite different if that point is near the 10th percentile. These lines also help the test developer see where the data were sparse and where the data were plentiful.

The procedure used to produce these curves employs a smoothing algorithm that borrows data from other score levels to provide more consistent conditional probabilities. For the highest total score, all the borrowing comes from lower score levels. As a consequence, the smoothed probability estimates for the perfect total score are often less than 1, sometimes noticeably so for hard questions.
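The smoothing algorithm itself is not described here. As a rough illustration of how borrowing data from neighboring score levels works, the sketch below uses Gaussian kernel weights, which is an assumption and not necessarily the ETS procedure. The function name, the bandwidth, and the data are hypothetical.

```python
import math


def smoothed_option_curves(total_scores, choices, max_score, bandwidth=3.0):
    """Estimate P(option | total score) at every score level.

    total_scores -- each student's total score
    choices      -- each student's chosen option on the item ('O' = omit)
    Each student contributes to every score level, weighted by how
    close that student's total score is to the level.
    """
    options = sorted(set(choices))
    curves = {option: [] for option in options}
    for level in range(max_score + 1):
        weights = [math.exp(-0.5 * ((score - level) / bandwidth) ** 2)
                   for score in total_scores]
        total_weight = sum(weights)
        for option in options:
            option_weight = sum(w for w, c in zip(weights, choices)
                                if c == option)
            curves[option].append(option_weight / total_weight)
    return curves


# Hypothetical item: low scorers choose B, high scorers choose the key, C.
total_scores = list(range(11))
choices = ["B"] * 5 + ["C"] * 6
curves = smoothed_option_curves(total_scores, choices, max_score=10)
print(round(curves["C"][10], 3))
```

Even though every student with a score of 5 or more chose C, the smoothed estimate at the top score is below 1, because all of the borrowed data come from lower score levels.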

Item 2 is a hard question. At the far right, the smoothed curve for the key, C, has a maximum of 75 percent, while option D accounts for most of the remaining 25 percent. The percent choosing C does not reach 50 percent until the fourth dashed vertical line from the left vertical axis; only half of the candidates at this 75th percentile score answer the question correctly. The lack of interest in option A at all score levels suggests that this five-choice question is in effect a four-choice question. Because the omit curve, O, hugs the x-axis like A, we can conclude that few candidates choose to omit this question. At the extreme left, D is about as popular as options B, C, and E; all are around .25, which is the chance value for a four-choice question. Note that the popularity of the key is slightly below .25.

As scores move from 0 at the far left to 98 at the far right, an interesting pattern emerges. Almost immediately the popularity of option B decreases, and by the 50th percentile score very few students select B. At the 25th percentile score (second dashed vertical line), options E and C cross, and E continues its downward slide while C begins a strong climb upward. Option D actually remains more popular than C until just below the 50th percentile score (third dashed vertical line), at which point D declines in popularity while C continues to become more attractive to higher-scoring students.

Removing an Item
ETS content experts closely scrutinize the information provided in each item analysis, and may remove an item that appears to have performed atypically. Effective with the 2000 AP Exams, items were automatically excluded from scoring if performance on them showed little, no, or negative relationship with performance on the entire exam, and an option was more popular than the key in the highest-scoring fifth of the student group. For example, question 81 of the 2000 AP Environmental Science Exam was administered, but not included in the multiple-choice scores. Its item analysis is summarized in Figure 4.3.

Unlike question 2, where the key stood out clearly, the correct answer for question 81 cannot be so readily identified. Is it the relatively popular option A, which behaves like a key between the 25th percentile and the 75th percentile? Or is it the unpopular option D, which seems to emerge as a potential key in the top 25 percent of the student group? Actually, the key is C, which appears to be unrelated to total test score, attracting somewhere between 20 percent and 25 percent of the students regardless of score level. Questions like 81 are not counted toward a student's score.

Delta Equating
Statistical specifications for the AP Exams are expressed in terms of a measure of difficulty called the equated delta. Delta equating provides a measure of difficulty that has the same meaning for items from different editions of the exam. It is based on the "common items" -- items repeated from a previous edition of the exam. Each of these items has two delta values -- an equated delta from the previous edition of the exam, and an observed delta from the current edition.

The equated delta is derived from a more readily understood measure of difficulty, the P+, which is the percentage of students reaching the item who answered it correctly. The derivation is a two-step process.

  1. The observed P+ is converted into an observed delta. This delta is a mathematical function of P+, defined in terms of a normal distribution with a mean of 13 and a standard deviation of 4. If P+ is expressed as the proportion of students who answered the item correctly, then 1 - P+ is the proportion who answered it incorrectly or omitted it. Delta is the value at the 100(1 - P+)th percentile of this normal distribution.
  2. The observed delta is converted to an equated delta via a process called delta equating. The group of students to which the equated delta refers is called the "base group"; the equated delta estimates the difficulty of the question for the base group.
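The first step can be sketched directly from the definition above, using the inverse cumulative distribution function of a normal distribution with mean 13 and standard deviation 4. The function name is hypothetical, and P+ is taken as a proportion between 0 and 1.

```python
from statistics import NormalDist

# The delta scale: a normal distribution with mean 13 and SD 4.
DELTA_SCALE = NormalDist(mu=13, sigma=4)


def observed_delta(p_plus: float) -> float:
    """Convert P+ (proportion correct, strictly between 0 and 1)
    into a delta value on the mean-13, SD-4 scale."""
    if not 0 < p_plus < 1:
        raise ValueError("P+ must be strictly between 0 and 1")
    # Delta is the point below which a proportion (1 - P+) of the
    # distribution falls.
    return DELTA_SCALE.inv_cdf(1 - p_plus)


for p in (0.9, 0.5, 0.3):
    print(p, round(observed_delta(p), 2))
```

An item answered correctly by half the students has a delta of 13. Easier items have lower deltas and harder items have higher ones.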

Delta equating of the current edition to the previous edition consists of:

  1. Finding the relationship between the two sets of delta values for the repeated items. This relationship is found by computing the linear equation that will transform the new observed deltas for the repeated items to the same mean and standard deviation as the old equated deltas for these items.
  2. Applying this relationship to all the items on the new exam to transform their delta values to the "equated delta" scale.
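These two steps can be sketched as a linear transformation that matches the mean and standard deviation of the common items. The function names and all delta values below are hypothetical.

```python
from statistics import mean, pstdev


def delta_equating_line(new_observed, old_equated):
    """Find slope and intercept that give the common items' new
    observed deltas the same mean and SD as their old equated deltas.

    Both lists refer to the same common items, in the same order.
    """
    slope = pstdev(old_equated) / pstdev(new_observed)
    intercept = mean(old_equated) - slope * mean(new_observed)
    return slope, intercept


def equate(observed_deltas, slope, intercept):
    """Apply the linear relationship to any list of observed deltas."""
    return [slope * d + intercept for d in observed_deltas]


# Step 1: fit the line on three hypothetical common items.
slope, intercept = delta_equating_line(
    new_observed=[10.0, 12.0, 14.0],
    old_equated=[11.0, 14.0, 17.0],
)

# Step 2: apply it to every item on the new exam (common and new).
equated = equate([10.0, 12.0, 14.0, 16.0], slope, intercept)
print([round(d, 2) for d in equated])
```

In this example the common items looked easier to the current group than to the base group, so the new item with an observed delta of 16 receives a higher equated delta.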

These equated deltas provide feedback to test development staff about how well this edition of the exam matched the statistical specifications.

Differential Item Functioning
On all AP Exams, the items that are relatively easy for one group of examinees (e.g., males) tend to be relatively easy for other groups of examinees (e.g., females). Similarly, the items that are relatively difficult for one group tend to be relatively difficult for other groups. "Relatively" is important here, because some groups of students find an entire exam more difficult than other groups do. Differential Item Functioning (DIF) occurs when an item is substantially harder for one group than for another group after the overall differences in knowledge of the subject tested are taken into account. For example, DIF would occur when an item that is one of the most difficult items on the exam for one group is one of the easiest items on the exam for another group. DIF does not mean simply that an item is harder for one group than for another; if the AP students in one group tend to know more than the other group about the subject, they will tend to perform better on all exam items.

The DIF analysis is therefore based on the principle of comparing the performance of focal groups (e.g., female, African American, or Hispanic examinees) on an item with that of a reference group (e.g., male or White examinees) in a way that controls for overall knowledge of the subject tested. The measure of overall knowledge of the subject is called the "matching criterion," and is the total score on the multiple-choice section or subsection.

DIF Categories
The DIF analysis produces statistics describing the amount of DIF for each test item. The analysis also produces statistics describing the statistical significance of the DIF effect -- the probability of finding so large an effect in the available samples of examinees if there were no DIF in the population from which they were sampled. The decision rule based on the Mantel-Haenszel procedure sorts the test items into three categories, labeled A (least DIF), B, and C (most DIF).
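The Mantel-Haenszel statistic can be sketched as follows. At each level of the matching criterion, reference-group and focal-group examinees are tallied as right or wrong on the item, and the tallies are pooled into a common odds ratio. The conversion to the delta scale by a factor of -2.35 is the convention usually reported for ETS analyses and is an assumption here, as are the function name and the counts. The cutoffs that separate categories A, B, and C are not given in this document, so the sketch stops at the statistic itself.

```python
import math


def mantel_haenszel_d_dif(strata):
    """Mantel-Haenszel common odds ratio and its delta-scale form.

    strata -- one tuple per level of the matching criterion:
              (ref_right, ref_wrong, focal_right, focal_wrong)
    Returns (odds_ratio, d_dif). A negative d_dif means the item is
    harder for the focal group than for matched reference examinees.
    """
    numerator = denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata:
        n = ref_right + ref_wrong + focal_right + focal_wrong
        if n == 0:
            continue  # no examinees at this score level
        numerator += ref_right * focal_wrong / n
        denominator += ref_wrong * focal_right / n
    odds_ratio = numerator / denominator
    return odds_ratio, -2.35 * math.log(odds_ratio)


# Hypothetical counts at two score levels.
strata = [
    (40, 10, 30, 20),
    (30, 20, 20, 30),
]
odds_ratio, d_dif = mantel_haenszel_d_dif(strata)
print(round(odds_ratio, 3), round(d_dif, 2))
```

An odds ratio of 1 (a value of 0 on the delta scale) means that matched examinees in the two groups perform equally well on the item.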

All items whose statistics place them in the "high B" or "C" categories are reviewed at a special meeting. The participants in this meeting include at least one person from outside ETS and at least one person who is a member of one of the ethnic minorities represented among the ETS focal groups. At the meeting, a group of testing experts examines each "high B" and "C" item and tries to determine whether its DIF can be explained by characteristics of the item that are unrelated to the measurement purpose of the test. If it can, the item is deleted from the scoring of the test.

It is important to realize that DIF by itself is not considered sufficient grounds for removing an item from the exam. The item may test an important piece of knowledge that happens to be more common in one group than another. Only if the DIF is attributable to factors other than the knowledge being tested is it grounds for deleting the item.

Figure 4.4 is a plot illustrating the DIF performance of an item on an AP United States History Exam for Asian-American and White students. The item was removed from scoring.

Figure 4.5 is a plot illustrating the DIF performance of an item on an AP International English Examination for female and male students. The item was removed from scoring.
