
Gerianne de Klerk
Reviewed by René Butter, June 2008
When a psychometric measure, for instance a measure of personality, intelligence or some other construct, is used in different cultural settings to compare test candidates from different cultural backgrounds, we speak of cross-cultural testing. The need for multiple language versions of tests, questionnaires and surveys is continuously increasing. Therefore, many tests are adapted from one language and culture to another. Individual scores based on tests supposedly measuring the same construct in various cultures cannot be interpreted at face value. The influence of culture on the measurement of the specific psychological construct needs to be explored in order to adjust measurements so that they are meaningful to the particular culture and yield equivalent or comparable measures across cultures. In this text, issues and challenges regarding cross-cultural testing are explored. The concepts of test bias and test equivalence are introduced, together with test adaptation procedures and the issue of comparability between test scores of different cultural populations.
Main text
When a psychometric measure, for instance a measure of personality, intelligence or some other construct, is used in different cultural settings to compare test candidates from different cultural backgrounds, we speak of cross-cultural testing. The need for multiple language versions of tests, questionnaires and surveys is continuously increasing. Therefore, many tests are adapted from one language and culture to another. Individual scores based on tests supposedly measuring the same construct in various cultures cannot be interpreted at face value. The influence of culture on the measurement of the specific psychological construct needs to be explored in order to adjust measurements so that they are meaningful to the particular culture and yield equivalent or comparable measures across cultures.
One of the most obvious issues in cross-cultural testing is language. A test that is written in English cannot be expected to yield a sound measure of the same construct in a French population. To give both English and French candidates the same starting position in completing the test, it should be adapted and made available in the native language of each specific group, i.e. French in this case. The different versions of a test should be equivalent (see the text on Classical Test Theory for the meaning of the concept of "equivalence"). There is more to ensuring equivalent measures than just translating tests into another language, however. When psychological constructs are measured across cultures, the influence of culture on the measurement needs to be explored. For each psychological process or construct that you want to measure in a new cultural population, it is necessary to determine the extent to which it is universal across cultures and, if it is not, to specify the exact differences and make the necessary adjustments. Without these adjustments, meaningful and equivalent cross-cultural test versions cannot be obtained.
It cannot be assumed that a test designed in one culture, in one specific language, and based on that specific culture’s conceptions of psychological processes and constructs, can be "copy pasted" to any other culture. Not surprisingly, research in cross-cultural psychology plays an important role in addressing the challenges faced in cross-cultural testing. Cross-cultural psychology studies human behaviour in the broadest sense, with the focus on the relationship between human behaviour and culture. Although the focus in this section will be on cross-cultural testing, which explores the influence of culture on the measurement of psychological constructs, a brief introduction to the field of cross-cultural psychology in general is indispensable as a stepping stone.
Cross-cultural researchers are involved in identifying the relationships between culture and behaviour. They try to pinpoint if, and to what extent, culture and behaviour are two separate things and if, and to what extent, the one leads to the other. They ask which specific kinds of cultural experience may cause differences in behaviour and whether there is uniformity in psychological behaviour across cultures. Are there any contextual variables, such as biological variables (e.g. nutrition, hormonal processes or genetic inheritance) and ecological variables (e.g. economic activity and population density), that can cause differences in behaviour? A general definition of cross-cultural psychology by Berry et al. (2002) puts it very clearly: ‘Cross-cultural psychology is the study of similarities and differences in individual psychological functioning in various cultural and ethno-cultural groups; of the relationships between psychological variables and socio-cultural, ecological and biological variables; and of ongoing changes in these variables.’
A goal of cross-cultural psychology is to understand the relation between human behaviour and the cultural contexts from which it stems. Culture can be understood as “a shared way of life of a group of people” (Berry, Poortinga, Segall & Dasen, 2002). There are many definitions of culture though, some emphasizing concrete, observable activities and artefacts, while others address underlying symbols, values and meanings. Berry et al. think that culture has some objective existence, i.e. it can be measured. This objective and stable quality of a group can both influence and be influenced by individuals and their actions.
In cross-cultural psychology, three main theoretical orientations can be identified: 1) absolutism, 2) relativism and 3) universalism.
These different theoretical orientations from cross-cultural psychology lead to different methodological approaches to test construction. The ‘etic’ approach assumes that universals can be identified in intelligence or other psychological constructs and that one specific, standard test can be applied cross-culturally. This relates to the orientation of absolutism. The ‘emic’ approach opposes this view and claims that for assessment, culture-specific measurements need to be developed, preferably by an indigenous psychologist. These should be based on that culture’s meaning and value systems with respect to the psychological characteristic in question. This approach relates to the theoretical orientation of relativism. The middle way in test construction is called ‘derived emic’, whereby the etic and emic approaches to test construction are combined. The different assessments of a construct can be based on the universal underlying process, but the concrete measures, i.e. the test items, need to be adjusted to get a meaningful instrument for each particular culture. In this approach a test that aims at relatively universal measurement might be constructed in one culture (etic), after which indigenous researchers from the other cultures derive culture-specific versions (emic) from the universal stem. Empirical comparisons between the responses on the different tests will demonstrate whether the tests measure the same construct or not. This approach relates to the third orientation of universalism.
As said, research in cross-cultural psychology plays an important role in cross-cultural testing. The need for multiple language versions of tests, questionnaires and surveys is continuously increasing. Tests are adapted from one language and culture to another not only to obtain a valid measurement in each culture, but also to facilitate comparative studies across cultural and language groups. This is partly a matter of fairness, since comparisons are to be made between people from different backgrounds, for example when a multinational company is developing selection criteria for managers who originate from different cultures. These comparisons have to take place on the same "scale" in order to avoid comparing apples and oranges. Economic reasons also play a role in test adaptation, as it is more expensive to create a new test from scratch than to adapt an existing test. When applying instruments or tests in various linguistic and cultural groups, psychological characteristics, or at least the roots of these characteristics, are assumed to be universal for all groups. This is highlighted by Poortinga and Van der Flier (1988), who say that to be able to use tests in different cultural populations you have to assume that:
To discuss the comparability of test scores from different cultural populations, the terms bias and equivalence are needed. According to the taxonomy given by Van de Vijver & Leung (1997a, 1997b), bias occurs if score differences on the indicators of a particular construct do not correspond to differences in the underlying trait or ability. Equivalence refers to the measurement level at which scores can be compared across cultures (Van de Vijver & Tanzer, 1997). This is also referred to by Poortinga & Van der Flier in point 3 as noted above. Both bias and equivalence are very important concepts in cross-cultural assessment. To be able to make valid comparisons across cultural groups, measures should have optimal equivalence and, thus, minimal bias. Test bias and test equivalence are explored further in the respective sections below.
We speak of test bias if score differences between (cultural) groups are caused by factors other than differences in the groups’ underlying trait or ability. Van de Vijver & Tanzer (1997) identified three forms of test bias: construct bias, method bias and item bias. Bias has become the common term for nuisance factors in cross-cultural score comparisons.
Construct bias is defined as differences in scores caused by how the theoretical construct is defined. In that case the construct measured by the instrument is not identical across cultural groups. This form of bias can also arise because the behaviours associated with the construct do not have the same meaning in all groups.
When group differences arise out of differences in the method, such as the way the test is constructed or administered, there is method bias. Method bias can arise out of incomparable samples used for constructing or applying the two measures. This is especially likely if very different cultures are compared, because the matching of samples can then be very difficult to achieve. Differences in educational background or in previous exposure to tests are good examples of ways in which samples might differ from each other. The instrument itself might cause method bias, for instance when one group is more familiar with the tasks or the type of questions asked than the other group. There might also be administration problems, for instance when the test administrator does not speak the native language of one of the groups.
When the item content causes group differences the term item bias is used. We also speak of differential item functioning (DIF) when referring to this type of bias. The behaviour of individual items is the focal point here, as opposed to the other types of bias, which focus more on total score differences. Biased items have a different psychological meaning across cultural groups. Item bias can be caused, for example, by poor item translations, by item content that is not suitable or appropriate in certain cultures, or by words that have different connotations in different cultures. Item bias can be detected by comparing candidates from both groups who have the same total score, as those candidates are expected to score similarly on each item. For persons from different cultural groups with the same total score, the items should have the same level of difficulty. If this is not the case, then the item is biased.
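The matching idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `conditional_item_gaps`, the group labels "A" and "B" and the 0/1 item scoring are hypothetical, and operational DIF analyses normally rely on established statistical procedures rather than raw proportion differences.

```python
from collections import defaultdict

def conditional_item_gaps(responses, item):
    """Compare item difficulty between groups "A" and "B" at equal total scores.

    responses: list of (group, item_scores) pairs, item_scores being 0/1 values.
    Returns {total_score: proportion correct in A minus proportion correct in B}
    for every total score at which both groups are represented.
    """
    # tally[total_score][group] = [number correct on the item, number of candidates]
    tally = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for group, scores in responses:
        cell = tally[sum(scores)][group]
        cell[0] += scores[item]
        cell[1] += 1

    gaps = {}
    for total, by_group in sorted(tally.items()):
        if "A" in by_group and "B" in by_group:
            p_a = by_group["A"][0] / by_group["A"][1]
            p_b = by_group["B"][0] / by_group["B"][1]
            gaps[total] = p_a - p_b
    return gaps

# Hypothetical responses to a three-item test (illustration only)
responses = [
    ("A", [1, 1, 0]), ("A", [1, 0, 1]), ("A", [1, 0, 0]),
    ("B", [0, 1, 1]), ("B", [0, 1, 1]), ("B", [0, 1, 0]),
]
print(conditional_item_gaps(responses, item=0))
```

Gaps that stay close to zero at every total score are what an unbiased item would show; a gap that consistently favours one group among equally scoring candidates is a reason to review the item.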
The literature describes many strategies to deal with the three types of bias briefly outlined above. The majority of these strategies assume that bias can be avoided and present techniques to reduce or even remove it. The fact that bias might be an indicator of systematic cross-cultural differences is not very often considered, but this point will be explored briefly later on. Van de Vijver & Tanzer (1997) give an overview of the most common and relevant approaches to address bias. This overview is represented in the next table.
| Type of Bias | Strategies |
| Construct bias | |
| Construct bias and/or method bias | |
| Method bias | |
| Item bias | |
Table 1: Strategies for identifying and dealing with bias in cross-cultural assessment (Van de Vijver & Tanzer, 1997).
To be able to make statements about score differences across different cultural groups, the test versions should be equivalent. Van de Vijver & Leung (1997a, 1997b) identified three forms of test equivalence: construct equivalence, measurement unit equivalence and scale equivalence.
As stated above, scale equivalence assumes completely bias-free measurement. However, different forms of bias have different impacts on the level of equivalence. If test versions show construct bias, there will be inequivalence in the psychological concepts on which the test is based. When this occurs it is impossible to make score comparisons between cultural groups. Method and item bias, however, do not affect construct equivalence, which, as said, only implies that the same construct is measured across the cultural groups. These types of bias do impact the scale equivalence of a test. If, for example, a particular item consistently favours the scores of one group over the other, comparisons between the groups are unfair, as such items will distort real score differences of the groups on the construct.
Van de Vijver & Tanzer (1997) argue that the debate on cross-cultural differences in cognitive test performance is to a large extent coloured by the level of equivalence of cross-cultural score comparisons. Some researchers might argue that when appropriate instruments are used (displaying full scale equivalence), cross-cultural differences in test performance will reflect valid differences between groups. Others argue that tests will always have forms of bias, such as method bias (e.g. people might have different levels of familiarity with the test stimuli), making measurement unit equivalence the highest attainable form of equivalence.
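A small numerical sketch may help to show why method bias can leave the measurement unit intact while still ruling out direct comparisons between groups. It assumes the usual reading of measurement unit equivalence, namely that both versions share the same unit but not the same origin; the trait values and the constant bias of 3 points are invented for illustration.

```python
from statistics import mean

# Hypothetical trait levels, identical in both groups (no real group difference)
trait = [10, 12, 15, 18]

group_a = list(trait)             # measured without bias
group_b = [t + 3 for t in trait]  # same unit, but a constant method bias of +3

def within_group_differences(scores):
    """Score differences between consecutive candidates inside one group."""
    return [later - earlier for earlier, later in zip(scores, scores[1:])]

# Differences inside each group are identical, so the unit is shared ...
same_unit = within_group_differences(group_a) == within_group_differences(group_b)

# ... but the group means differ although the underlying trait does not.
mean_gap = mean(group_b) - mean(group_a)

print(same_unit, mean_gap)
```

Under these assumptions, comparisons of candidates within a group remain meaningful, whereas the difference between the group means reflects the bias only and not the construct.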
Regardless of this debate, the concepts of test bias and test equivalence are important to consider in cross-cultural testing. Tests and their adaptations used for different cultural groups should be as free of bias as possible to achieve equivalence and as a result maximum comparability of test scores. The methodology and guidelines for adapting tests will be discussed in the next section.
In cross-cultural testing the main type of group comparison is made between people who speak different native languages and who very often are from different countries. However, the majority of tests are developed for an English-speaking population. People who are not native English speakers should not have the disadvantage of taking the test in a language that is unfamiliar to them, especially if they are subsequently compared with candidates who are fluent in that language. If the tests are going to be used in, for instance, Spanish or French populations, you cannot expect an English test to be a fair way of testing. Hence, the tests need to be translated to make sure they are available in all the home languages of the candidates who will need to complete them.
Translating a test sounds fairly straightforward, but it is not as simple as rewriting the item wordings in another language. A test should not ‘just’ be translated; culture-dependent meanings or connotations of words should also be taken into account. This point refers to the topics of test bias and equivalence that were explained before. A translation might suffer from some form of bias if the choice of words is more complicated than in the original version. Also, words could be used that have multiple meanings while their equivalents in the original language are unambiguous. Therefore we prefer to speak of ‘adapting’ tests rather than ‘translating’ tests when we are creating multilingual, equivalent versions of the same tests. Adapting tests also implies that it is checked whether or not the adapted versions measure the same construct for different language and/or culture groups. It even includes other aspects of the assessment process. For example, it should be checked whether the test administration procedure, the item formats used and the influence of speed on examinee performance are equivalent for the two groups in question (Van de Vijver & Leung, 1997b).
There are two common procedures to develop a good translation. First, there is the translation / back-translation approach, whereby the original text is translated into the target language by a native speaker of the target language. Subsequently a second expert, preferably a native speaker of the source language, translates the translated text back into the source language. By comparing the original text and the back-translated text, the accuracy of the initial translation can be checked. However, despite a seeming overlap in meaning between the translated version and the original, it might turn out that the translated version uses concepts, words or expressions that are culturally, psychologically or linguistically inequivalent to those of the original language. This procedure will not be sensitive to such problems. The committee approach might solve these, however.
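The comparison step of the back-translation approach can be supported by a simple screening script. The sketch below uses `difflib.SequenceMatcher` from the Python standard library to flag items whose back-translation differs strongly from the original wording; the function name, the example items and the threshold of 0.8 are assumptions made for illustration. Surface similarity is a crude measure, so such a script can only select items for human review and, like the procedure itself, it will not detect cultural or psychological inequivalence.

```python
from difflib import SequenceMatcher

def flag_divergent_items(originals, back_translations, threshold=0.8):
    """Return (item number, similarity) for items whose back-translation
    differs markedly from the original source-language wording."""
    flagged = []
    pairs = zip(originals, back_translations)
    for number, (source, back) in enumerate(pairs, start=1):
        # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones
        ratio = SequenceMatcher(None, source.lower(), back.lower()).ratio()
        if ratio < threshold:
            flagged.append((number, round(ratio, 2)))
    return flagged

# Hypothetical items and back-translations (illustration only)
originals = ["I enjoy meeting new people.", "I often feel blue."]
back_translations = ["I enjoy meeting new people.", "I frequently see the colour blue."]

flagged = flag_divergent_items(originals, back_translations)
print(flagged)
```

In this invented example the second item is flagged because the idiom "feel blue" was translated literally, which is the kind of problem a reviewer would then examine by hand.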
In the committee approach a group of people work on a translation together. They have different kinds of expertise (e.g. cultural, linguistic or psychological) to ensure that the translated version is as closely related to the original text as possible. However, even with this approach it is still possible that for some words, concepts or expressions no equivalent is available in the target language. To overcome this particular problem, a test and its translations can be developed at the same time. Such a procedure makes it possible to change a test item in all versions once the problem described above arises in one of the versions. Still, for some instruments it is unrealistic to assume that a translation will cover the construct in the same way as the original language version. In that case the items should be adapted such that they measure the same characteristic as the original, but with suitable words or expressions.
There are a lot of factors that influence the quality of a test adaptation and the equivalence of the translated version to the original. As presented by Hambleton (2006), sources of error or invalidity that arise in test adaptation can be organised into three broad categories: (a) cultural/language differences, (b) technical issues, designs, and methods, and (c) interpretation of results. Ignoring the sources of error in each of these categories can result in an adapted test that is not equivalent in the two language and cultural groups for which it is intended. Non-equivalent tests, when they are assumed to be equivalent, can only lead to errors in interpretation and erroneous conclusions about the groups compared.
A set of guidelines for the translation and adaptation of tests was developed by an international working group supported by the International Test Commission (Hambleton, 1994; Van de Vijver & Hambleton, 1996). The twenty-two guidelines as listed by Hambleton cover four domains:
The next step in creating and adapting different test versions is to evaluate the equivalence of the tests and their items in the different versions. In general, three common approaches to this can be identified, denoted here as I, II and III (see Hambleton, 2005).
I. Bilingual candidates complete both the original and the adapted version of the test. With this approach, individual differences in background between and within candidate groups (e.g. demographic characteristics) can be controlled for. But it also assumes that the candidates are equally capable in both languages, which is quite unlikely for a large number of candidates. At the same time, the group of bilinguals might not represent the target population properly, and scores might not be generalisable to the intended audience.
II. A group of candidates take both the original version and the back-translated version of the test. Item equivalence of the two versions can be assessed by comparing the scores of the candidates on both versions. Preferably, factor analysis and a comparison of the factor structures of the two versions are used as well. However, this approach does not involve any data collection for the actual target-language version of the test.
III. One group of candidates take the original test version and a second group of candidates take the adapted version of the test. It has to be ensured that all candidates in both groups are native speakers of the language of their version of the test. Another assumption that has to be made is that the two groups are similar with respect to their scores on the construct measured. This assumption is a problem in Classical Test Theory, but it can be overcome by using modern test theory or Item Response Theory (see the section on Item Response Theory). The underlying principle here is that item difficulties can be estimated independently of the specific sample of persons, and person scores can be estimated independently of the specific items used. This way, score comparisons can be made even when the groups are not truly equivalent, or even if the items of the different versions are not completely identical. This technique can also be used when (some of) the items are adapted in a test translation. In that case direct score comparisons are not possible, as the scores are no longer based on the same instrument, but by using this technique scale equivalence can still be reached.
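The principle of sample-independent item difficulties can be illustrated with the Rasch model, one of the simplest Item Response Theory models. The choice of this model, the function names and the difficulty values are assumptions made for this sketch; the text above does not prescribe a particular IRT model. Under the Rasch model, the difference in log-odds of a correct answer on two items equals the difference in their difficulties, whatever the ability of the person answering.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def difficulty_gap(ability, easy, hard):
    """Log-odds gap between two items for one person.

    Under the Rasch model this equals hard - easy for every ability level.
    """
    return logit(rasch_probability(ability, easy)) - logit(rasch_probability(ability, hard))

# Hypothetical item difficulties; the gap is the same for low- and high-ability persons
for ability in (-2.0, 0.0, 1.5):
    print(ability, round(difficulty_gap(ability, easy=-0.5, hard=1.0), 6))
```

Because the gap does not depend on ability, two groups with different ability distributions should yield the same relative item difficulties, provided the model holds and the items are unbiased.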
One of the most commonly used approaches to judge the quality of a test adaptation is by looking at the factor structure of a test in two or more language versions. If the factor structure remains the same in the adapted version, it is assumed that the test adaptation was successful. In an article written by Hambleton & de Jong (2003) this approach is criticised. The writers refer to the work of Bruno Zumbo from the University of British Columbia in Canada, who provided evidence that item level bias can still be present even when the factor structures found between the two versions are equivalent. Therefore, item bias review procedures should be used as a method to determine test and item equivalence of different versions of a test as well.
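One common way to quantify the similarity of factor structures is Tucker's congruence coefficient, computed on the loadings of the same items on corresponding factors in the two language versions. The coefficient is not named in the text above and is introduced here only as an illustration; the loadings are invented. As the paragraph above stresses, a high value does not rule out bias at the item level.

```python
import math

def tucker_congruence(loadings_x, loadings_y):
    """Tucker's congruence coefficient between two vectors of factor loadings."""
    numerator = sum(x * y for x, y in zip(loadings_x, loadings_y))
    denominator = math.sqrt(
        sum(x * x for x in loadings_x) * sum(y * y for y in loadings_y)
    )
    return numerator / denominator

# Hypothetical loadings of four items on one factor in two language versions
original_version = [0.70, 0.65, 0.80, 0.55]
adapted_version = [0.68, 0.60, 0.75, 0.20]

print(round(tucker_congruence(original_version, adapted_version), 3))
```

The coefficient summarises the whole loading pattern, so a single deviating item, such as the fourth one in this invented example, can be masked by the agreement of the others. This is one reason to combine it with item bias review procedures.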
Despite the available information on sources of test bias and on strategies and techniques to improve the equivalence of test versions, many tests report group differences, especially at the score level. The area of cognition in particular has a long history of findings of cross-cultural differences and, as a result, harbours a long-standing debate on how these differences can be interpreted. These debates can be traced back to the different theoretical orientations introduced in the cross-cultural psychology paragraph.
Accordingly, some researchers would argue that differences arise out of variations in the innate competencies of different groups. Others say that the cognitive processes that reflect these competencies are in some way a product of a specific cultural context. As an example, it could be argued that for some African tribes intelligence might be reflected in the speed with which a herdsman recognises his own cattle in a massive herd, whereas in a Western culture other forms of pattern recognition might be more indicative of intelligence. Thus, cultural groups will have different ability patterns that are rooted in ecological demands as well as socio-cultural patterns, which implies that differences in intelligence (as defined by Western society) will occur between cultural groups. On the other hand, the main basic cognitive functions and processes, so-called fluid intelligence, appear to be common to all human beings, as universally shared properties of our intellectual life. Cognitive competencies may stem from some common structures, but can result in highly varied crystallized performances that are responsive to ecological contexts, and to cultural norms and social situations encountered both during socialization and at the time of testing (Berry et al., 2003).
A good illustration of cross-cultural differences found in the field of cognition is the study by Baron et al. (2003), who reviewed current findings on ethnic group differences in cognitive ability test scores. They found that there are consistent findings of score differences between groups in countries around the world. It is not always the same groups that underperform but, in general, groups that have lower socio-economic status and poorer educational opportunities show lower test scores on the majority of tests than more privileged groups. These groups typically come from cultural traditions that are very different from Western culture, where cognitive ability testing approaches were developed. Baron et al. also found that the literature is consistent with respect to the validity of cognitive ability tests for predicting job performance, training performance and educational achievements. In general, differential studies found that this holds for all subgroups. This latter finding illustrates the point introduced by Berry et al. (2003) that characteristics of cognitive functions and processes appear to be common to all human beings.
If a more universalistic view on cross-cultural testing is adopted, implying that the core characteristics of cognitive functioning are assumed to be common to all human beings, cross-cultural score comparisons seem possible. But if behaviour or performance is relative to the cultural context, can there be fair comparisons between cultural groups, even when everything possible is done to reduce or remove any form of test bias? The answer to this question is yes, but with caution.
When making cross-cultural comparisons, each variable that potentially has an effect on the construct being measured should be considered in the explanation of score differences. The potential set of variables on which groups might differ is, however, enormous. Education, religious beliefs, practices of socialisation and word knowledge are only a few of the possible variables that can differ systematically between groups. Also, powerful psychological influences have effects on a broad range of measurements. One of the most important influences is Western-style school education. It is part of a complex that also includes literacy, test-taking experience, urbanisation, economic wealth and acculturation. These variables belong, to a large extent, to the socio-political context of a group, and it is very hard to imagine psychological measurements being unaffected by such variables.
The aim of cross-cultural comparisons should be to understand differences and similarities between cultural groups. Given the many factors that may affect cross-cultural measures, it is very important to consider each of them when interpreting the results. We should not take score differences for granted and interpret them straightforwardly as basic trait differences between two cultures. At the same time it should be noted that, despite many findings of cross-cultural score differences, not all cultures differ from each other to the same extent. Some cultures are very similar, and test equivalence might be reached in a relatively straightforward way, while for other cultures this is much more difficult to obtain.
Cross-cultural testing and the field of cross-cultural psychology face many challenges, as highlighted in this text. It is an exciting field of research and highly relevant in the globalised world we live in today. Hopefully this introduction to cross-cultural testing has given you an insight into the theories, methodologies and issues that you might encounter when creating, administering and/or analysing tests across different cultural groups.
1. Baron, H., Martin, T., Proud, A., Weston, K. & Elshaw, C. (2003). Chapter 6: Ethnic group differences and measuring cognitive ability. International Review of Industrial and Organizational Psychology, 18, 191-238.
2. Berry, J.W., Poortinga, Y.H., & Pandey, J. (eds.) (1997). Handbook of cross-cultural psychology: Vol. 1: Theory and Method (2nd ed.). Boston, MA: Allyn and Bacon.
3. Berry, J.W., Poortinga, Y.H., Segall, M.H., & Dasen, P.R. (2002). Cross-cultural psychology: research and applications. New York: Cambridge University Press.
4. Hambleton, R.K. (1994). Guidelines for adapting educational and psychological tests: A progress report. European Journal of Psychological Assessment, 10, 229–244.
5. Hambleton, R.K., & de Jong, J.H.A.L. (2003). Advances in translating and adapting educational and psychological tests. Language Testing, 20, 127-134.
6. Hambleton, R.K. (2006). Chapter 1: Issues, Designs, and Technical Guidelines for Adapting Tests Into Multiple Languages and Cultures. In Hambleton, R.K., Merenda, P.F., & Spielberger, C.D. (2005/6). Adapting educational and psychological tests for cross-cultural assessment. Mahwah, NJ: Lawrence Erlbaum Associates.
7. Poortinga, Y.H., & Van der Flier, H. (1988). The meaning of item bias in ability tests. In S.H. Irvine & J.W. Berry (Eds.), Human abilities in cultural context (pp. 166-183). New York: Cambridge University Press.
8. Van de Vijver, F.J.R., & Leung, K. (1997a). Methods of data analysis and comparative research. In Berry, J.W., Poortinga, Y.H., & Pandey, J. (eds.), Handbook of cross-cultural psychology: Vol. 1: Theory and Method (2nd ed., pp. 257-300). Boston, MA: Allyn and Bacon.
9. Van de Vijver, F.J.R., & Leung, K. (1997b). Methods and data analysis for cross-cultural research. Newbury Park, CA: Sage.
10. Van de Vijver, F.J.R., & Tanzer, N.K. (1997). Bias and equivalence in cross-cultural assessment: an overview. European Review of Applied Psychology, 47, 263-279.
Question 1
In the field of cross-cultural psychology, which main theoretical orientations can you identify? Please explain what each orientation entails.
Question 2
If you would like to create and use a test in different cultural populations, what are the main things you should consider and research?
Question 3
Please explain the difference between method, construct and item bias of a test.
Question 4
Please give some examples of techniques that can address different types of test bias.
Question 5
Which three forms of test equivalence can be identified? Please explain each form of test equivalence.
Question 6
Why do we prefer to use the term ‘test adaptation’ rather than ‘test translation’ when we are referring to creating tests in another language?
Question 7
Can you briefly explain the two procedures, as highlighted in the text, to develop a good test translation?
Gerianne de Klerk holds a Master's degree in occupational psychology from the University of Utrecht (The Netherlands). She is currently living in South Africa and working as an independent consultant in the field of psychometrics, assisting clients in research projects and test development. She worked for over 6 years at SHL (Saville & Holdsworth Limited), first in The Netherlands, where she fulfilled the role of product manager, later at the European office in a similar role and subsequently at Head Office in the UK for the Design & Innovation team, where she was responsible for the innovation, design and development of HR products and tools.
To contact Gerianne de Klerk, please email:
Gerianne.deklerk@gmail.com