Author Gerianne de Klerk
Reviewed by René Butter, June 2008
Reviewed by Cheryl Foxcroft, July 2008
When citing this reading, please reference it as follows:
de Klerk, G. Classical test theory (CTT). In M. Born, C.D. Foxcroft & R. Butter (Eds.), Online Readings in Testing and Assessment, International Test Commission, http://www.intestcom.org/Publications/ORTA.php
Today, most people have seen or even completed psychometric tests themselves. These could be, for instance, ‘just-for-fun’ tests you sometimes find in magazines, tests and exams done at schools and universities, or recruitment and selection tests completed when applying for a job. Psychometric tests are widely available and measure a wide variety of constructs, that is, psychological variables that are not directly observable (e.g., intelligence, personality, motivation). But how can you know whether a test is actually a good test? How do you know if a test measures a construct accurately? Or whether the test actually measures what it claims to measure? And what about influences from outside the testing situation; what kind of effect will they have on the outcome of a test? For instance, suppose one group of candidates is packed into a small room where the noise from construction work next door is overwhelming, while another group completes the same test in a large, noise-free room. Do you think these two sets of candidates will have comparable scores, or could the testing situation have influenced the scores they obtained?
We can also think of several scenarios where a test ends up measuring a construct differently from what it aimed to. For example, is it possible for one question on its own to measure a complex personality construct? Would one question be able to cover the whole construct, and could it measure that construct accurately? Or what would happen if a test is supposed to measure numerical reasoning ability, but the questions are phrased in very complex English? The test might then actually measure a candidate’s ability to understand that form of English, rather than the numerical reasoning ability in question. In short, Classical Test Theory deals with the effect of both unsystematic and systematic influences on the observed test result.
Classical Test Theory, commonly abbreviated as CTT, originates from the beginning of the 20th century. The final ‘Classical Model’ was only published in the late 1960s however (Lord & Novick, 1968). The most important formula that lies at the core of Classical Test Theory is defined as follows:
X = T + E, where
X = the total score/observed score obtained
T = the true score and
E = the error component
Classical Test Theory assumes that each observed score (X) contains a True component (T) and an Error component (E). When measuring a psychological construct, unsystematic errors occur. These unsystematic errors could be anything, for instance distractions from outside the testing situation, physical wellbeing of the candidate or good/bad luck. You can think of many different influences that can affect a candidate at the specific moment of taking the test. Sometimes these influences have a positive effect on the test result; other times they have a negative influence. In other words they cause a band (range) of error around the True score.
The True score can be seen as the systematic component of the raw score obtained. Classical Test Theory assumes that the measurements are evenly dispersed around the average, that is, the deviations occur equally to both sides of the True score. Consequently, the True score is in fact the average score. This means that the average error of measurement is 0, as the positive and negative deviations cancel each other out.
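The cancelling-out of errors can be illustrated with a small simulation. This is a minimal sketch, not part of CTT itself: the true score of 50, the error spread of 5 and the number of replications are all made-up assumptions for illustration.

```python
import random

random.seed(42)  # for reproducibility

TRUE_SCORE = 50           # assumed true score T of one candidate
ERROR_SD = 5              # assumed spread of the unsystematic error E
N_REPLICATIONS = 100_000  # hypothetical repeated administrations

# Each observed score is X = T + E, with E symmetric around 0
observed = [TRUE_SCORE + random.gauss(0, ERROR_SD) for _ in range(N_REPLICATIONS)]

mean_observed = sum(observed) / len(observed)
mean_error = mean_observed - TRUE_SCORE

print(round(mean_observed, 1))  # close to 50: the average recovers T
print(round(mean_error, 2))     # close to 0: errors cancel out
```

Averaging over many hypothetical administrations, the observed scores converge on the True score because the positive and negative errors balance out.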
Systematic errors are different from unsystematic errors. By systematic errors we mean a characteristic of the test or the testing situation that will affect all measurements equally. For example, if there is a mistake in one or more of the test items that is presented to all the candidates completing the test, it will influence all the candidates in the same way. As psychological tests are mainly used to determine individual differences, the influence of systematic errors is unimportant and will thus not be included in the Classical Test Theory concepts that will be introduced and discussed in the remainder of this reading. However, it is important to note that when the performance of candidates who experience some systematic error on a test is compared to the performance of candidates who completed a test free from such error, the comparison will be unfair. This is referred to as Differential Item Functioning, which is covered in the section on Item Response Theory. Later on I will briefly return to the subject of fairness in testing. However, the next part of the discussion will focus on unsystematic errors, their consequences and the way in which Classical Test Theory deals with this.
Hypothetically, if we administered one test repeatedly to one candidate for an indefinite number of times, the range of measurement error is equal to the range of observed scores. In other words, the candidate will have an average score over all these repeated measures and the difference between his/her lowest score and this average score indicates the largest negative error. In this case, the error represents the situation where the candidate had the most negative external influences which caused him/her to perform most poorly compared to the other administrations. This also works the other way around in that the largest positive error represents a situation where the candidate experienced the most positive external influences that impacted positively on the performance on the test.
Unfortunately, in real life it is impossible to have such a situation where repeated measures are possible. First of all, it is of course impractical to carry out numerous repetitions. Furthermore, with most, if not all, psychological constructs, learning and memory processes are involved that will have a systematic but undesirable influence on performance if a test is repeatedly administered. For instance, people could remember their previous test session and answer in a similar way, or they might figure out how to solve certain problems between test sessions and then perform better on the test the next time round (this is especially true for ability tests). In contrast, when physical quantities such as length and time are measured repeatedly, the same problems do not arise: such measurements are relatively constant over repetitions and therefore have a very small error of measurement. It should always be kept in mind that a psychological test aims to measure psychological processes that are much harder to observe and quantify.
However, errors of measurement will also average out across a large group of people, where each individual has completed the test once. This of course implies that the individual errors of measurement do not correlate with individual characteristics in the specific population of people.
From the above formula and the assumptions of measurement errors cancelling out, formulas for reliability and standard error of measurement can be derived. They are central to Classical Test Theory and with these two concepts an estimate of the accuracy of a measurement can be obtained. It is important to realise, however, that a test with a high level of accuracy (or reliability) does not necessarily imply that a test is measuring what it is supposed to measure. Let us go back to the example used in the introduction; the numerical test where very complex English was used. In this case, the test might actually measure a candidate’s proficiency in English rather than his numerical reasoning ability. However, it might measure this English language ability quite well and accurately. Therefore, the reliability of the test could be high, but the construct it is measuring is not the construct it was supposed to measure. This issue will be explored in the Validity section. First we will focus on reliability.
Reliability of test scores deals with the topic of consistency of scores over replications. By this we primarily mean the agreement between a candidate's scores when taking the test several times. It will be clear, for example, that when a person's score for extraversion varies a lot when taking the same test several times in a short period, the test scores are likely to reflect unsystematic influences rather than his/her true degree of extraversion. In such cases test scores are considered unreliable.
The reliability or consistency of test scores, can be estimated in two main ways:
1. Through repeated measures, that is, various administrations at different points in time.
a. Parallel form method (by comparing performance on two tests that are parallel or equivalent alternative forms)
b. Test-retest method (by comparing several administrations of the same test)
2. Through a single measure, that is, one administration at one time point:
a. Split half method (by comparing performance in two halves of the same test)
b. Internal consistency method (by comparing performance item by item within the same test)
The parallel (or alternate) form method assumes complete equivalence of two tests. This means that one could actually exchange the tests and all candidates would score identically on both. The correlation between the total scores of these tests gives an indication of the reliability of the independent test scores. It will be clear that this method is very complex, because the items in the two tests should be equal in terms of what they measure, but they cannot be identical (otherwise you would have exactly the same test). You can demonstrate that tests are parallel by showing that their average observed scores are identical, as well as the variances of their scores. Also, both tests should have equal correlations with other variables.
To estimate test-retest reliability, one test is completed by the same group of people twice. However, there has to be a considerable time lag between the two administrations to prevent memory or learning effects. The correlation between the two total scores gives an estimate of the test’s reliability, but only if the two administrations can be seen as independent of each other. This method can give a good estimate of reliability, provided that there have been no changes in the measured characteristic between the two administrations of the test. For instance, there could be a learning effect: candidates might have looked up the answers to some of the items, or discussed them, leading to improved performance by the time the second administration takes place. This will decrease the correlation, and therefore the reliability estimate of the test. Also, candidates might feel the need to be consistent in their answering pattern, leading to artificial consistency between the administrations and thus inflating the reliability estimate. Another issue with this approach is how the extent of the time lag is determined. Too short a time interval between the two administrations will imply a bigger chance of people remembering the previous administration, and consequently learning or memory effects are enhanced. Too long a time interval might cause the sample group to decrease in size, as some people might not be interested in taking another test or might no longer be traceable.
The split-half method relies on one test and is more efficient than the alternate form method. The test is split in two equal halves that are equal in length and preferably equal in difficulty (parallel halves). For each half a total score is calculated and when both halves are truly parallel, the correlation between the two halves is an estimation of each half test’s reliability. To determine the full test’s reliability, a correction needs to be applied to the reliability estimate for the test halves. This correction is dependent on the full test’s length and how this relates to a reliability estimate. The full test’s reliability estimate can be determined using the Spearman-Brown formula which relates reliability to test length (you can look up the formula in Wikipedia: http://en.wikipedia.org/wiki/Spearman-Brown_prediction_formula). If the test halves are not completely parallel, the estimated reliability coefficient obtained through this method will be an underestimation of the true reliability of the test. To prevent problems due to the two halves not being completely parallel, the reliability can be estimated using the internal consistency method.
The internal consistency method is based on the notion that the individual items in a test can be seen as separate tests in themselves. The complete test is administered once to a representative group of people. Subsequently, covariances are calculated between all pairs of different items. The covariance is a measure of the relationship between two variables, or how they vary together. A positive value will be obtained when both variables deviate from their means in the same direction (if one is high, the other is high too). The variance of the total score is also needed for the computation. The most common internal consistency measure is the alpha coefficient, also called Cronbach’s Alpha (Cronbach, 1951). It is based on the inter-item correlations. Items that are highly correlated can be considered to be measuring a common construct. The internal consistency will increase if the number of similar items increases, as this increases the average inter-item correlation. In other words, the longer a test containing similar items, the higher the internal consistency reliability estimate. For this reason it is always important to ask yourself whether a reliability estimate is high because there are many items that are reasonably highly inter-correlated, or because there is a smaller number of highly inter-correlated items. The second option, a test with fewer, highly inter-correlating items, is preferred in most cases as it provides for more efficient measurement.
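Cronbach’s Alpha can be computed from the item variances and the variance of the total score: alpha = k/(k−1) × (1 − sum of item variances / total-score variance), where k is the number of items. A minimal sketch, using made-up 0/1 item data for illustration:

```python
def variance(xs):
    # population variance of a list of scores
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(item_scores):
    # item_scores: one list of item scores per candidate
    k = len(item_scores[0])  # number of items
    item_vars = sum(variance(list(col)) for col in zip(*item_scores))
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - item_vars / total_var)

# made-up 0/1 item scores: 6 candidates x 4 items
scores = [[1, 1, 1, 0], [1, 0, 1, 0], [0, 0, 1, 1],
          [1, 1, 0, 1], [0, 0, 0, 0], [1, 1, 1, 1]]
print(round(cronbach_alpha(scores), 2))  # 0.52
```

Alpha rises as the items covary more strongly (the total-score variance grows relative to the sum of the item variances), which is exactly the longer-test/similar-items effect described above.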
Given all these forms of reliability estimates and the various ways to obtain them, you might wonder what a good level of reliability actually is. When can you be confident that a test score is an accurate measure? There is no agreement on the level that should be obtained for a good test, but some guidelines can be given. First of all, you have to think about the purpose of the test. Is it going to be used for high-stakes selection decisions (important ‘hire’ or ‘not-hire’ decisions are to be made based on the test scores), for example when intelligence tests are used in selection procedures where a pass/fail decision has to be made? Or will the test results be used more descriptively, such as a personality questionnaire that is mainly used for coaching purposes? A very high reliability figure indicates that the test has very similar items; it might consist of quite a small number of items that all measure one very narrowly defined construct. In this case each item does not add much new information on the construct, as they all measure the same thing. A very low reliability figure indicates that the items are quite varied; the items are different from each other or might even be ambiguous. In general, values above 0.7 are seen as acceptable reliability figures, but for high-stakes testing you would like to see the test’s reliability above 0.8. However, be critical of reliability values above 0.9: ask yourself whether the test items are too narrowly defined. For personality-type questionnaires, where the constructs are more broadly defined, a value below 0.8 (and occasionally even below 0.7) is acceptable. A reliability figure below 0.6 is generally seen as being too low.
Finally, we need to address an issue that has not been covered so far. In some practical applications of psychometric tests, decisions are reached based on test scores, for example pass/fail or hire/reject decisions. In this so-called ‘criterion-referenced’ context the assumptions of CTT may not always be valid, and reliability will have to be reframed in terms of the consistency of decisions based on scores instead of the consistency of the test scores themselves. This topic goes beyond the scope of this text, but needs careful consideration in, for instance, educational settings, where the reliability of pass/fail decisions based on one specific test score is a very important issue.
The validity of test scores can be described as the extent to which a test measures what it is supposed to measure. In other words, is the test measuring the construct that you want it to measure (e.g., verbal reasoning ability) or might it be measuring something else other than this trait?
The first aspect that is important is whether the test appears to be relevant for the type of measurement it claims to yield. For example, we might argue that there is serious reason to doubt the validity of a graphology test that claims to predict emotional stability, without even looking at the data collected through using the test. This so-called face validity is an important aspect of testing, as we are dealing with human beings who expect the tasks presented by tests to be relevant and transparent. This matters not only for psychometric reasons, but also as a matter of professional ethics and for tests to carry weight in the courtroom.
The second important aspect of validity deals with studying the theoretical construct itself. The third aspect of validity is linked to the prediction of behaviour or achievements beyond the test scores. Say for example, that we would like to make a prediction of sales success, of which we do not have direct evidence, based on personality scores that are available to us. Thus, the most common forms of validity are:
Face validity is a very simple form of validity; it tries to answer the question whether the measure, on the face of it, seems to measure what it is supposed to measure. It is a subjective impression of the test’s content by the psychologist or even an outsider. It is a necessary condition when building an argument that the test items are linked to the construct they should be measuring. It is not sufficient on its own, however, as the other forms of validity are required to demonstrate that this is really the case.
Construct validity is a more thorough form of evaluating whether the test is measuring the construct properly. It involves quite a complex process, and all other forms of validity mentioned here can be seen as contributing to construct validity. It is important for the researcher to identify all the concepts that might explain the achievements on the test. Convergent validity means that different indicators of the same construct should correlate highly. Discriminant validity implies that indicators of different constructs should not correlate highly. Using these principles, elaborate hypotheses can be derived from the relevant theory and subsequently tested through empirical research.
With convergent validity an indication of the test's ability to measure the construct is determined by establishing whether the test provides results similar to that of another test measuring the same construct. Obviously, to properly assess this type of validity, the test against which it is compared should have good validity and reliability estimates itself.
Discriminant validity, in a sense, is the opposite of convergent validity. Here you are not evaluating whether the test is measuring what it is supposed to measure, but whether the test is not accidentally measuring what it should not measure. For instance, consider the example used in the introduction about the numerical reasoning test that might measure ability in English rather than numerical reasoning ability. If the test has a low correlation with established verbal reasoning or verbal comprehension tests, you have evidence that it at least is not measuring those unwanted constructs.
Finally, we discuss criterion-related validity, that is, the transferability of the test scores to ‘the real world’. In concurrent validity the measure is compared to other measures of the same construct obtained at the same time. For instance, the test results are compared to supervisor ratings of each candidate on the same construct.
Predictive validity is similar to concurrent validity, but it involves the extent to which predictions can be made, based on the test scores, about observations or data obtained at a later stage. This could for instance involve a question like: how well did this test, completed during the recruitment and selection process, predict success in the job? You can then determine how well people who took the test during recruitment were performing in their job two years later, for example. Thus, the focus is on the predictive power of a test for future performance, such as school success, job success or even marital success.
Validation research very often shows quite disappointing results. Correlations with different criteria are on average not a lot higher than 0.30 to 0.45. Why is this? First of all, it is essential that the measures used, whether an established test or a criterion measure (e.g., job success), have good reliability. When the reliabilities are low, you will not obtain good validity estimates. It can also work the other way around, though: a perfectly reliable test can have very poor validity, as argued earlier in this reading, but in that case you have to search for an explanation for poor validity coefficients other than the unreliability of the test. Regarding criterion measures, especially in concurrent or predictive validity studies, the supervisor ratings or school grades that are commonly used are not normally the most reliable measures.
It is relatively straightforward to illustrate the impact that reliability, or rather the lack thereof, has on the validity coefficient. When, for instance, your test has a reliability coefficient of 0.80 and your criterion measure is perfectly accurate (reliability of 1.00), the maximum validity coefficient you can obtain in your data set, in the case of a perfect relationship, will only be 0.89. Note the word ‘maximum’ here: even when the relationship between test and criterion is perfect, the observed coefficient can never reach 1.00. When your criterion measure has a reliability coefficient of only 0.50, the maximum validity coefficient that can be obtained will be 0.63. The lower the reliability of the test and the criterion, the lower the maximum validity figure will be.
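The ceiling described here follows from the classical correction-for-attenuation bound: the maximum observed validity is the square root of the product of the test and criterion reliabilities. A short sketch reproducing the figures above:

```python
def max_validity(test_reliability, criterion_reliability):
    # correction-for-attenuation ceiling: the observed validity can be
    # at most the square root of the product of the two reliabilities
    return (test_reliability * criterion_reliability) ** 0.5

print(round(max_validity(0.80, 1.00), 2))  # 0.89
print(round(max_validity(0.80, 0.50), 2))  # 0.63
```

Even a perfect underlying relationship cannot produce an observed correlation above this bound, which is why unreliable criteria depress validity coefficients.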
The second explanation why low validity coefficients are often found is that the relation between the test and the criterion might be non-linear. Take for instance the situation in which a low test result corresponds with a low criterion achievement, a higher test result with a higher criterion achievement but a very high test result with a lower criterion achievement. As an example of this situation we can use the relationship between motivation and achievement. Low motivation will result in low achievement, a higher motivation in higher achievement but very high motivation can give the candidate stress and tensions resulting in a poorer test result. Standard correlation coefficients will provide the wrong estimate of validity in this instance.
Third, there might be a bias in the test and/or criterion measure. For instance, the test might correlate well with a specific criterion for males but not for females; therefore, in a heterogeneous population (with both males and females), the validity coefficient will be low. This form of bias is called gender bias. There are many forms of bias. One of the most commonly found forms is cultural bias (see the section in ORTA on cross-cultural testing). The psychological construct might not have the same meaning in other cultures, or the items might not be interpreted in the same way by people from different cultures. Needless to say, this is very likely in situations where there is a mismatch between the native language of a group of test takers and the language of the test. Gender or culture could therefore be a moderator (or suppressor) of the correlation between the test and the criterion.
Fourth, it is important to critically review the criterion measures. Could they possibly be defined too broadly (e.g., when ‘successful performance’ is used as a single criterion measure while in reality specific aspects of performance might be more relevant)? Or, when you compare the performance of a group of candidates in the same position within different companies, have you considered that this position might be fulfilled differently in different organisations?
The validity coefficient found will tell you the extent to which the test is valid for making statements about the criterion. It can be squared and multiplied by 100 to obtain the explained variance in a relationship. The following example may help: when the predictive validity coefficient of an ability test in relation to the criterion of job retention (i.e., the number of years the person will stay in the job) is 0.70, it means that 49% ((0.70)² × 100 = 49%) of the differences in job retention (the criterion) can be explained (or predicted) by differences in the test achievements.
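The arithmetic in this example can be checked in a couple of lines:

```python
validity = 0.70                               # predictive validity coefficient
explained_variance_pct = validity ** 2 * 100  # r squared, as a percentage
print(round(explained_variance_pct))          # 49
```

The remaining 51% of the variance in the criterion is left unexplained by the test.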
You have now developed your test. You have proof that your test scores are reliable (i.e., consistent) and valid (i.e., they measure the construct they should measure), but how can you interpret each candidate’s score? Can you compare the results of one candidate to the results of another candidate completing the same test? And what about candidates completing different tests, can you compare their results?
First of all, to be able to compare the scores of one candidate to those of another, you have to ensure your test administration is standardised and that it is ‘fair’ to make these comparisons between candidates (this is a requirement even in the pilot and developmental stages of the test). You want to minimise, as much as possible, external influences that impact on a person’s performance on a test. In other words, you want to minimise the level of unsystematic error. Every candidate should get the same opportunity to perform at his/her best, so tests need to be administered under identical conditions. It should not matter where or when, by whom or to whom a test is given; it should be administered in the same way. Therefore, the instructions on how to complete the test as well as the practice items should be exactly the same in each administration. It should be noted that the advent of internet-delivered testing in the past decade raises some serious concerns with respect to the standardisation of testing circumstances. All candidates should complete the test in a quiet, undisturbed environment where there are no distractions, and this is not always possible with internet-delivered testing where the test is not taken at a test centre.
Having obtained test scores under standardised conditions, the scores of all the candidates should be translated to the same scale and their relative performance compared to that of a representative group of people completing the same test. This group of people, also called the normative sample, should be representative of the target population, that is, the comparison group should contain the same type of people as the people who are being tested. It is very important to have a large enough comparison group (at least 100 people but preferably a lot more) to make sure the sample includes all the typical ‘varieties’ of people that you would normally find in a population. Be aware that a norm group is not an absolute, unchangeable entity. Populations change over time and regular norm updates should thus be undertaken.
In most cases, a candidate’s total score is compared to the norm (or comparison) group using standardised scores that are based on the comparison group’s average and standard deviation. Using the average and standard deviation has the advantage of giving you an indication of the percentage of people scoring higher as well as that of people scoring lower. The most common example of a standardised score is the z-score. Z-scores indicate how many standard deviations each candidate’s total score differs from the average score. If a candidate scores 1 standard deviation above the average score of the comparison group (which represents about the 84th percentile, that is, 84% score lower), his/her z-score will have a value of 1. If he/she has the same score as the average score of the comparison group, his/her z-score will be 0. For these percentages to hold true, however, the raw scores need to be normally distributed. It should be noted that raw scores that are not normally distributed cannot be normalized by applying a z transformation. In these cases a non-linear transformation should be considered.
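The z-score is simply the raw score minus the norm-group mean, divided by the norm-group standard deviation. A minimal sketch; the norm-group mean of 30 and standard deviation of 6 are illustrative assumptions:

```python
def z_score(raw, norm_mean, norm_sd):
    # standardised score: distance from the norm-group mean in SD units
    return (raw - norm_mean) / norm_sd

# hypothetical norm group with mean 30 and standard deviation 6
print(z_score(36, 30, 6))  # 1.0 (roughly the 84th percentile, if scores are normal)
print(z_score(30, 30, 6))  # 0.0 (exactly average)
```

Note that the percentile interpretation only holds when the raw scores are normally distributed, as stated above.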
One final remark on the issue of score distribution is that no assumptions with respect to the distribution of the scores are made in CTT. Hence, it can be said that CTT is a distribution-free approach. This is different in Item Response Theory, where many models assume for instance normal distribution of person parameters (or simply put, person scores).
By using z-scores it is easier to compare the scores of one candidate on multiple tests and to quantify differences between candidates completing the same test. If the candidate obtains a z-score of 1 on two different verbal reasoning tests, and if the comparison groups for both tests are the same, you can compare the candidate’s scores and interpret the relative performance on each test.
However, comparing raw scores as opposed to standardised scores is a lot less informative. If two candidates scored raw scores of 25 and 28 respectively, what does that tell you? You can say the second candidate performed better on the test, but how much better did he do? If you know the standard deviation of a representative comparison group is 6, then you know the second candidate scored 0.5 z-scores higher than the first candidate. Using the comparison group’s average and standard deviation assumes that the scores are normally distributed, however. This is a reasonable assumption for most tests, but it should be checked before standard scores are applied.
Once it has been shown that the distribution of scores is normal, many standardised scores can be used (over and above the z-scores just explained). The next figure will give an overview of some standardised scores in relation to the normal distribution. In the figure, T-scores, which have a mean of 50 and a standard deviation of 10, are listed above z-scores and standard IQ scores, which have a mean of 100 and a standard deviation of 15.
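The conversions between these standardised scales are linear transformations of the z-score. A small sketch of the two scales mentioned in the figure:

```python
def to_t_score(z):
    return 50 + 10 * z    # T-scale: mean 50, SD 10

def to_iq_score(z):
    return 100 + 15 * z   # standard IQ scale: mean 100, SD 15

print(to_t_score(1.0))    # 60.0
print(to_iq_score(-0.5))  # 92.5
```

Because these are linear transformations, they relocate and rescale the scores but do not change the shape of the distribution.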
Figure 1: normal curve with standard scores.
In Classical Test Theory, the total score of a candidate is dependent on the content of the test used. It does not give an absolute measure of the characteristic (e.g., ability) of the candidate. A candidate’s test score is the sum of the scores obtained on the items in the test.
The difficulty of an item is defined in relation to the comparison group. It is given in the form of a so-called p-value (a number between 0 and 1) which indicates the proportion of the comparison group that answered that item correctly. While candidates have different ability levels, their performance on a test might vary in relation to the difficulty level of the test and not for any other reason. When a test is difficult, a candidate will appear to have a low ability; compared to his/her performance on a test that is easy, as the candidate will score a lot more items correctly and will appear to have a high ability (i.e., higher total score). For this reason it is difficult to compare a candidate’s results from different tests unless you have established that the two tests are completely parallel or equivalent (see reliability section) or you have used the same norm group and standard scores. This dependency on the comparison group used, is removed when using Item Response Theory modelling. This topic is further explored in the Item Response Theory section in ORTA.
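Computing p-values from a matrix of right/wrong responses is straightforward. A small sketch with made-up data:

```python
def item_p_values(item_scores):
    # item_scores: one list of 0/1 item results per candidate;
    # returns, per item, the proportion of the group answering correctly
    n = len(item_scores)
    return [sum(col) / n for col in zip(*item_scores)]

# made-up responses: 4 candidates x 3 items
responses = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [1, 0, 0]]
print(item_p_values(responses))  # [1.0, 0.5, 0.25] -> the third item is hardest
```

Note that a lower p-value means a harder item, and that these values are entirely dependent on the comparison group, which is exactly the dependency that Item Response Theory removes.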
Gerianne de Klerk
Please briefly explain how the ‘Classical Model’ of Classical Test Theory works.
Please describe what the reliability of a test score means and how it differs from the validity of a test score.
Which methods of estimating the reliability of a test score can you think of? For each method, briefly explain how it estimates the reliability (you don’t need to give formulas).
Two people have a discussion about the reliability of test scores on a test measuring extraversion/introversion. The internal consistency coefficient they found for their data set is 0.97. Person A argues this is a very good reliability coefficient and he would not like to change anything in the test. Person B argues that he would like to explore the items and maybe change the test. Can you think of reasons why person B is not completely happy with the internal consistency coefficient?
Which types of construct validity can you think of? Please describe each type and how you establish or determine validity in each case.
Describe how it is possible to compare the test scores of two candidates completing different tests measuring the same construct.
Gerianne de Klerk holds a Master’s degree in occupational psychology from the University of Utrecht (The Netherlands). She currently lives in South Africa and works as an independent consultant in the field of psychometrics, assisting clients with research projects and test development. She worked for over six years at SHL (Saville & Holdsworth Limited), first in The Netherlands in the role of product manager, later at the European office in a similar role, and subsequently at the Head Office in the UK in the Design & Innovation team, where she was responsible for the innovation, design and development of HR products and tools.
To contact Gerianne de Klerk, please email: