
Table of contents

Developing Standardized Tests
  Introduction
  Steps in Developing Standardized Tests
   Table 1
   Steps in Test Development
  Conclusions
  References
  Discussion Point
  Author Information

Developing Standardized Tests

Thomas Oakland

Uploaded June 2009

Note:
When citing this reading, please reference it as follows:
Oakland, T. Developing standardized tests. In M. Born, C.D. Foxcroft & R. Butter (Eds.), Online Readings in Testing and Assessment, International Test Commission, http://www.intestcom.org/orta

Reviewed by Cheryl Foxcroft

Introduction

Test development has had a long and illustrious history. Test development activities occurred in China about 3000 years ago (Wang, 1993). One goal of these early efforts was to assist in selecting persons with suitable qualities to hold high ranking government offices. The next evidence of test development activities occurred during the late 1800s when pioneers in psychology established laboratories for the study of human behavior. They developed tests to collect data from children, youth, and adults in order to describe important psychological qualities as well as to formulate and test theories (Oakland, 1995).

Test use has become a daily occurrence for many people. Test use in schools is common. Students often take one or more teacher made tests each day. In addition, in some countries, students are given standardized tests measuring such qualities as academic aptitude, achievement, vocational interests, and learning styles.

However, test use is not confined to school settings or to children and youth. For example, tests are used in medical and legal settings, when applying for jobs or a driver's license, and when qualifying for professions. Persons of all ages can be expected to take tests.

Various reasons exist for using tests. The principal reason is to accurately describe behaviors or other important qualities. Accurate description is a prerequisite to other reasons for test use. Other reasons may include the following: to assess and evaluate prior levels of attainment, to compare one individual with others of similar age, to diagnose, to estimate future behaviors, to assist in counseling (e.g., overcoming problems and planning for one's future), and to assist in student and employee selection as well as program planning and evaluation.

Test use may be characterized in terms of its importance. Information from some tests (e.g., teacher made tests used routinely in classrooms) generally is less important than that from many other tests (e.g., those used in student and employee selection). Care is needed in constructing all tests. However, greater care is needed in the construction of those that are used for more important decisions that may significantly influence life outcomes. Given their importance, tests used for these purposes often are standardized and more carefully developed.

Psychologists, educators, and other professionals have acquired considerable experience and expertise in developing standardized tests during the last 100 years. Thousands of standardized tests have been developed. Their use can be found in every country.

The primary purpose of this chapter is to outline and describe the steps used to develop standardized tests. A secondary purpose is to encourage those interested in test development activities to become engaged in this process. The belief that test development activities are beyond the ability of all but a few clearly is inaccurate.

 

Steps in Developing Standardized Tests

Methods to develop standardized tests often are very similar and generally follow the steps listed in Table 1.

Table 1

Steps in Test Development

Step 1. Identify a need for a test and define the qualities to be tested.

A. Linguistically define the qualities to be tested.

  • Specify the qualities the test is designed to measure.

  • Define the purposes of the test.

  • State the demographically related qualities to be considered.

  • Decide on the desired length of the test.

  • Specify who will use the test results.

B. Assemble and meet with one or more advisors to assist in activities.

Step 2. Obtain a contract for the test.

A. Prepare and submit a proposal.
B. Do not become discouraged with the first rejection letter.
C. Obtain a legal review of contract offers.
D. Finalize the contract.

Step 3. Initiate item-writing activities.

A. Select the most appropriate format for the test.
B. Operationally define the qualities to be tested.
C. Begin writing items.
D. Review items by other experts.
E. Make needed item changes.
F. Assemble items into a pretest.
G. Design the record forms.
H. Print pretest booklets and record forms.

Step 4. Initiate collection of pretest data.

A. Locate data collection sites.
B. Identify the personal qualities of those to be tested.
C. Identify those administratively responsible for data collection.
D. Prepare detailed directions for site administrators.
E. Collect pretest data.

Step 5. Analyze pretest data.

A. Code and enter pretest data.
B. Analyze pretest data.
C. Review pretest data.
D. Select items.
E. Review final set of items.
F. Assemble the test.

Step 6. Initiate collection of standardization data.

A. Print test and directions for site administrators.
B. Locate data collection sites.
C. Identify those administratively responsible for data collection.
D. Specify personal qualities of those to be tested.
E. Plan reliability and validity studies.
F. Collect standardization data.

Step 7. Analyze standardization data.

A. Code and enter standardization data.
B. Analyze standardization data.
C. Review standardization data.
D. Select final set of items.
E. Analyze data from reliability and validity studies.
F. Prepare norms tables.
G. Finalize test format.
H. Finalize record forms.
I. Finalize scoring methods.
J. Write and edit manual and other technical materials.

Step 8. Prepare test for distribution.

A. Print test.
B. Store test.
C. Market and distribute test.

A Post-script. Consider developing more than one test at the same time.

 

Test development comprises a series of activities in which many people are involved. Rarely does one person independently complete each of the following responsibilities. Although test development requires the efforts of many, there must be a project manager, one person in charge who provides principal leadership. The project manager is not expected to be knowledgeable in all technical areas important to a test's development and thus needs to utilize the expertise of others.

Each of the steps listed in Table 1 will now be discussed in more detail.

Step 1. Identify a need for a test and define the qualities to be tested

The development of a standardized test begins with the recognition that one is needed. Tests generally are developed in response to market needs. Test publishers and developers consult with test consumers to determine their needs for additional tests. A need may exist because the qualities to be tested currently are not assessed accurately, do not reflect existing theory or prescribed practice, currently are measured by tests that are too brief, or are too long or too old. This first step of consulting with test consumers is critical to the success of subsequent test development activities. Tests designed in light of consumer needs will be more appealing and successful than those that do not consider their needs.

Specify the qualities (construct, content domain) the test is designed to measure. Clearly define the qualities to be tested by using precise language. Terms used in the behavioral sciences often are misunderstood. Terms such as intelligence, personality, vocational aptitude, and achievement may be interpreted differently by professionals and test consumers, given their general nature. For example, when developing a measure of reading achievement, one needs to clearly indicate whether it will include an assessment of reading recognition, reading comprehension, reading fluency, reading rate, listening comprehension, vocabulary, and/or language usage. Thus, test developers need to clearly define in writing the nature of the specific qualities the test is designed to measure.

Clearly define the purposes of the test. For example, a test may be intended to describe behaviors or other important qualities, to assess and evaluate prior levels of attainment, to compare one individual with others of similar age, to diagnose, to estimate future behaviors, to assist in counseling (e.g., overcoming problems and planning for one's future), to assist in student and employee selection, or in program evaluation. An accurate description of the test's purpose is critical to future test development activities.

Specify the test taker's personal qualities. Tests typically are designed for people who display specific personal qualities: ages, grades, genders, occupations, pathological conditions, and other personal qualities. For example, one test author may want to develop a measure of intelligence for males and females between 5 and 18 (i.e., generally a school age population). Another test author may want to develop a measure of intelligence for males and females between 6 and 80. The test designed for those between 5 and 18 will be standardized on persons of these ages, while the test for those between 6 and 80 will be standardized on this broader age range. Statements as to the personal qualities of the test taker help define the sample on which the test will be developed.

When standardizing a test, a sample is drawn to accurately reflect the qualities within a well defined population. In developing a measure of intelligence for children between 5 and 18, a test developer cannot test the entire population (i.e., all children within the country between these ages). Instead, the developer acquires data on a sample that represents this population.

State the demographically related qualities that should be considered when drawing this sample. As previously noted, standardized tests typically are designed for persons who exhibit specific personal qualities (e.g., age, grade, gender, occupation). However, other qualities also may need to be considered when drawing this sample so as to reflect the qualities of the population. For example, differences may be found in the region of a country in which one lives; whether one lives in an urban, suburban, or rural area; one's socioeconomic and racial/ethnic status; the number of grades one has completed in school (and whether one is still in school); and other personal qualities. Test developers who standardize their test merely on persons who are conveniently located will draw a sample that is unlikely to reflect the population for which the test is intended.

For example, when developing an intelligence test for children, samples may be selected from children who attend school. However, children who are severely mentally retarded may not be found in schools. Thus, efforts are needed to locate children with severe mental retardation to include in the norming sample. Similar examples could be provided for adults who may not be in the workforce, women who care for their children at home, and persons who exhibit severe forms of mental illness. They too may not be sampled properly when samples of convenience (i.e., those people who are readily available due to their schooling, jobs, or other group-related associations) are used.

Decide on the desired length of the test. This issue addresses practical concerns, one of many that will impact the test's acceptance. The following five issues impact decisions regarding suitable test length: psychometric properties, age, personal tolerance, cost, and institutional issues.

Test developers often strive to include enough items to ensure the test has suitable reliability and validity. In general, longer tests are more reliable and valid than shorter tests. Test developers initially develop more items than will appear in the final version. Two to four times the number of items may be developed initially; most will be discarded following review by consultants and in light of the statistical results found during the item selection process.
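The link between test length and reliability is commonly approximated with the Spearman-Brown formula. The chapter does not name this formula, so the following is only an illustrative sketch. It assumes that any items added or removed are parallel to the existing ones, it addresses reliability only (not validity), and the function name and example values are hypothetical.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predict reliability after changing test length by length_factor.

    Assumes the added (or removed) items are parallel to the existing ones.
    length_factor = 2.0 means the test is doubled; 0.5 means it is halved.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Illustrative values only: a test with reliability 0.70, doubled and halved.
doubled = spearman_brown(0.70, 2.0)
halved = spearman_brown(0.70, 0.5)
print(f"doubled: {doubled:.3f}, halved: {halved:.3f}")
```

Under these assumptions, lengthening a test raises the predicted reliability and shortening it lowers the predicted reliability, which is one reason test length is weighed against the practical concerns discussed in this section.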

Age related qualities of test takers need to be considered. Young children (e.g., below age 4) and older people (e.g., over age 70) may display test taking qualities that limit the length of time they can concentrate and attend to test items. In addition, those with brain injury and other disabling mental and physical qualities may be unable to comply when asked to take lengthy tests. People with personality disorders and those who see little value in taking the test also may be unable or unwilling to comply when asked to take lengthy tests. Test length may need to be shortened in order to address these practical concerns.

Longer tests cost more to develop, administer, and score. These financial issues impact test length. Test producers work to minimize production costs. Test consumers prefer tests that minimize costs associated with administering and scoring tests. Thus, in contrast to longer tests, shorter tests are more likely to be marketable, an important consideration in test development.

Institutions are likely to be important test consumers. Industries and other commercial businesses, government agencies, schools, and other institutions that widely use tests often have time constraints that impact test use. Two examples follow. Job interviews may be less than one hour. School classes may be 30 to 45 minutes in length. Tests designed in light of these time constraints and other practical issues will be more readily accepted and used.

Decide who will use the test results. Test results typically are intended for both primary and secondary consumers. Three examples follow. Teachers and school administrators typically are the primary consumers of achievement test results; students and their parents typically are secondary consumers. Clients typically are the primary consumers of personality test results; their physicians may be secondary consumers. Employers typically are the primary consumers of tests used in occupational settings; employees and those applying for work may be secondary consumers. Test developers need to plan their tests principally in light of the needs of primary consumers while also providing for the needs of secondary consumers.

Assemble a group of advisors and meet with them one-on-one or in a group to assist in test development activities. A test rarely is developed by one person. Consultation with experts in areas important to both test development and test use will help ensure its success. Although the qualities of those asked to consult will differ from test to test, they are likely to include experts in psychometrics (i.e., a specialty devoted to test development and use), in the particular characteristics measured by the test (e.g., achievement, intelligence, personality, vocational aptitude), in the personal qualities of those for whom the test is designed (e.g., age, racial/ethnic qualities), as well as the primary test consumers.

 

Step 2. Obtain a contract

The development of a standardized test generally involves a company that assumes primary responsibility for marketing and distributing the test. The company may also be involved at other steps (e.g., collecting, entering, and analyzing standardization data). A legal contract is needed to specify the nature of the collaborative relationship between the person or persons responsible for developing the test and those responsible for its commercialization.

Although the process for developing this contract often differs, test companies typically want to receive a proposal that details important test features before entering into a contractual relationship with the test developer. Details include those listed in Step 1: the test's purposes and the qualities it will measure, its standardization sample and test length, and those who will use it. In addition, the proposal should indicate the nature and size of the test's market, comparisons between the proposed measure and existing measures, empirical or theoretical support for the measure, and time lines for its development. Thus, these details are similar to those publishers request before issuing a book contract.

After reviewing the author's proposal, a test company may decide to reject, accept, or propose to modify the proposal. If rejected, authors often submit a proposal, possibly in revised form, to another publisher. If the author and test publisher are able to come to a generally favorable agreement, the test publisher typically submits a first draft of a contract for review to the test author. Some authors mistakenly are inclined to accept these initial contractual provisions, given their belief that a successful partnership will emerge with the test company.

However, authors should receive a detailed legal opinion from an attorney, one who specializes in contracts of this type. Attorneys with this expertise may be located by asking other testing companies for the names of attorneys who represent them. Authors also may ask colleagues who have test contracts to assist them in this review. Following these reviews, the test author suggests changes in the initial contract. Negotiations between the test developer and company follow, leading to a compromise that is acceptable to both parties.

 

Step 3. Initiate item writing activities

The steps outlined above are preliminary to writing test items. Those who successfully complete the above processes will accomplish the following activities more successfully and efficiently.

Select the most appropriate format for the test. Various test formats may be used: physically performing an activity, making products, drawing, behavior observations, questionnaires and other report forms used to acquire information directly from the test taker or from others about the test taker (e.g., supervisors, teachers, parents), or tests that use an essay or multiple choice format or Likert scale.

Operationally define the qualities (constructs, content domains) to be tested. This is accomplished largely by writing items consistent with the previously defined qualities the test is intended to measure. For example, returning to the previous example of developing a test of reading recognition, items for this test component may be based on words commonly found in the reading textbooks used in one's country. The items themselves operationally define the qualities to be tested (i.e., reading recognition is the ability to say the following words commonly found in reading textbooks).

Begin writing items. Items must be consistent with the definition of the qualities being assessed. In addition, they must represent the domain being assessed. For example, items that assess the domain of reading recognition should vary in difficulty and content to be consistent with the nature of the curriculum used in schools. To use another example, measures of personality must include sufficient numbers of items that assess each of the subdomains of personality.

As previously noted, test developers often develop two to four times the number of items needed in the final version, knowing most will be discarded after their review by consultants and their statistical results. Test developers often contract with others to assist in item writing. When developing each item, key each response to indicate which response is correct or reflects a particular theory.

Have items reviewed by other experts. Although the items may appear to be suitable to those who developed them, expert consultants and advisors often identify problems in items. For example, when the author was developing a scale of temperament for children, the experts who reviewed items immediately eliminated about 20% of those items the author thought to be desirable. Deleting these items improved the test and made it shorter and less time consuming to administer and score.

Make needed item changes. Some items will be eliminated, others will be revised, and still others may be added. The quality of work provided by the advisory panel will be at a higher level when persons with high levels of expertise and those who are diverse in their experiences (e.g., including those who will be consumers of the test results) are involved in this item review and revision process.

Assemble items into a pretest. The pretest is used to acquire initial data on test items and to receive suggestions from those responsible for administering the pretest as to how the test may be further improved. The pretest should closely resemble the final test in its directions for administering the test, item and response formats, item arrangement, printing quality, and other important test qualities. Time limits, the use of calculators and books, and other features that may influence test performance should be stated clearly in an attempt to standardize the methods in which the test is administered and scored.

Note that the term standardized refers to the uniform ways in which a test is administered and scored. A test can be standardized and not normed.

Make preliminary decisions on record forms. Various methods exist to collect data. Different styles in the layout of record forms (also referred to as answer sheets or answer protocols) are needed for tests that require individuals to personally complete various tasks in contrast to those that require others to record the behaviors of persons being assessed. Test developers need to design record forms to be consistent with the characteristics of the test and those being tested. Record forms also should enable test takers or examiners to record vital data about those taking the test, the date and location of the testing, and other demographic information that may be important to the test's data analysis, interpretations, or later use.

Print pretest booklets and record forms. Use a level of quality that is affordable yet conveys a good impression of the test. People asked to complete the pretest will be more inclined to do so when the printing and paper quality reflect high standards.

 

Step 4. Initiate collection of pretest data

Persons involved in test development often find activities completed under this step to be personally rewarding as the activities provide the first direct and concrete evidence regarding important test qualities.

Locate data collection sites. Some test developers acquire all pretest data on their own while other test developers ask agencies or professionals to assist in the collection of data. The former generally is more labor intensive, less expensive, and slower, while the latter generally is less labor intensive, more expensive, and faster. Agencies or professionals often are willing to assist in these data collection activities if they believe the scale may be of value to them later. Their assistance typically is acknowledged in the test manual. Those who collect data may request and receive compensation for their services. Treat these sites with proper respect because test developers may want to utilize their assistance later.

Decide on personal qualities of those to be tested. As previously noted, the people tested should represent the population for which the test is intended (e.g., age, gender, geographic region, social and racial status). The number of people from whom data are needed varies with the number of items. The number of people from whom you collect pretest data should be approximately 10 times the number of test items (e.g., if you have a 100 item test, collect data on approximately 1000 persons).
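The two planning guidelines given so far (an initial item pool of two to four times the final length, and about 10 pretest respondents per item) can be sketched as simple arithmetic. The function names below are hypothetical, and the ratios are the chapter's rules of thumb rather than statistically derived requirements.

```python
def item_pool_range(final_items: int) -> tuple[int, int]:
    """Initial item pool: two to four times the intended final length."""
    if final_items <= 0:
        raise ValueError("final_items must be positive")
    return 2 * final_items, 4 * final_items


def pretest_sample_size(pretest_items: int, persons_per_item: int = 10) -> int:
    """Approximate number of pretest respondents (about 10 per item)."""
    if pretest_items <= 0:
        raise ValueError("pretest_items must be positive")
    return pretest_items * persons_per_item


print(item_pool_range(50))        # pool size for a hypothetical 50-item final test
print(pretest_sample_size(100))   # 1000, as in the chapter's 100-item example
```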

After selecting the sites, identify those administratively responsible for data collection. Prepare detailed directions for site administrators. Provide sufficient information to these important data collectors so as to enable them to administer the scale in a standardized fashion. Provide information as to when and where the completed protocols should be sent. Site administrators and data collectors represent the author's eyes and ears. Their experiences in administering the test are likely to be important to the test developer. Aggressively solicit their comments and suggestions as to how the test may be improved.

After completing the above activities, collect pretest data. This process may take a few weeks or extend into a few months. One has more control over the length of this activity if one is personally collecting the data. Remain in contact with site administrators to assist and motivate them.

 

Step 5. Analyze pretest data

This period also is an important and potentially exciting midway step in a test's development. Attention to detail and statistical expertise are needed during this phase.

Code and enter pretest data. The test developer prepares a somewhat detailed form that provides a code for the data. The code identifies information on personal qualities of those who completed each pretest (e.g., age, gender), site location, and each item. This code form is completed prior to data entry, an activity that can be performed by someone with proper although often minimum training. The use of desk-top and laptop computers together with readily available software simplifies data entry.

Analyze pretest data. The purpose of these activities is to modify the pretest so as to prepare a version that will be used to acquire test norms and conduct reliability and validity studies. Analysis typically utilizes item analyses (e.g., difficulty, distractibility, item-total test correlations, and differential item functioning) and factor analyses.

Proper analysis requires considerable thought as to critical issues associated with the nature of the data as well as theories underlying the test. For example, the scale used to collect the data may be nominal, ordinal, or interval in nature; the nature of the properties of the scale influences the selection of statistics. In addition, the pretest data may not provide sufficient numbers to analyze the data separately by age, gender, geographic region, or other distinguishing characteristics. Thus, desired analyses may need to be postponed until the next data collection period.
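Two of the item statistics mentioned above, item difficulty and the item-total correlation, can be illustrated with a short sketch. It assumes dichotomously scored items (1 = correct, 0 = incorrect) and uses a corrected item-total correlation, in which each item is correlated with the total of the remaining items. The response data and function names are made up for illustration; differential item functioning and factor analyses are not covered here.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable has no variance."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def item_analysis(responses):
    """responses: one row per person, each a list of 0/1 item scores."""
    n_items = len(responses[0])
    summary = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]  # total without item j
        summary.append({
            "item": j + 1,
            "difficulty": mean(item),            # proportion answering correctly
            "item_rest_r": pearson(item, rest),  # corrected item-total correlation
        })
    return summary


# Made-up responses for 6 people on 4 items.
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
]
results = item_analysis(data)
for entry in results:
    print(entry)
```

In this made-up data set the fourth item correlates negatively with the rest of the test, the kind of result that typically flags an item for revision or removal.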

Review pretest data and select items with the goal of preparing a standardization version that will be used to acquire test norms and to conduct reliability and validity studies. Items typically are eliminated if they display unsuitable characteristics from item analyses or factor analyses. The goal is to have a standardization version in which each of the subdomains being assessed has at least the desired number of items plus 10% to 20% additional items.

On occasion, a review of pretest data reveals significant problems in the data. Test developers then must decide whether to abandon the project or to acquire additional information to further investigate the feasibility of continuing to the test norming phase.

Review the final set of items. The item review process by experts, as described in Step 3, is used here again. Have items reviewed by other experts this time. This may be the last time to make item revisions. Keep in mind test consumers and those reviewing the test are likely to review each item carefully and to be critical of those with which they find fault. Some call this activity a sensitivity review. Tests have been judged to be unacceptable because test consumers strongly objected to one or two items. Thus, a close review of items at this time may prevent significant problems in the future.

Assemble the test. Having previously performed this activity under Step 3, one is familiar with the objective: to prepare booklets and record forms that will closely resemble the final products in their directions for administering the test, item and response formats, item arrangement, printing quality, and other important test qualities. Information provided by site administrators and examiners who collected the pretest data may result in some important changes in this edition.

 

Step 6. Initiate collection of standardization data

Many activities that comprise this step are similar to those conducted during Step 4. Thus, the following information is somewhat redundant with that found in the prior section.

Print test and directions for site administrators. The quality of the paper, art work, and printing should be suitable. The directions for taking the test should be clear and set the proper attitudes for taking the test. Tests designed to assess a person's maximum performance (e.g., tests of cognitive abilities) should encourage respondents to do their very best work. Tests designed to assess a person's typical performance (e.g., tests of personality and temperament) should encourage respondents to indicate their general characteristics, not those that reflect their best or worst.

In addition, directions to site administrators again are needed to enable them to administer the scale in a standardized fashion. Again, provide information as to when and where the completed protocols should be sent and ways to suggest test changes.

Locate data collection sites. Many sites used during pretest development may be willing to assist again. Contact them first. These sites also may suggest others to contact and be willing to encourage them to participate. Testing companies and university personnel often have developed good working relationships with test users. These relationships should be utilized.

After locating data collection sites, identify those who will be administratively responsible for data collection. These important people should be reliable and respect time lines. When working within an institution, those responsible for collecting data generally should hold administrative or professional positions.

Specify personal qualities of those to be tested. This important activity defines the qualities of the norm sample. Persons tested should represent the population for which the test is intended. Qualities include but are not limited to age, grade (or level of education when sampling adults), gender, geographic region, and social and racial status. When developing tests for use with persons with disabilities or dysfunctions, test data on members of these populations are needed in addition to those who do not have the disabilities or dysfunctions. Remember the caution expressed in Step 1 concerning the use of samples of convenience.

The number of people on whom standardization data are needed exceeds the number tested during the pretest development phase. There are few statistically derived estimates routinely used to identify minimum numbers needed. Practices established by the testing industry generally form accepted standards. Two qualities generally impact the number of people needed: whether the test is individually or group administered and whether the qualities reflect age changes.

The numbers of people who comprise the standardization samples for individually-administered tests generally are lower than those for group-administered tests. Norms for individually-administered tests often are based on 200 persons per age group. The sample within each age group should be stratified by gender and other important personal qualities that may influence the nature of the data. Although norms for group administered tests may be as low as 200, they often extend to figures two to three times this number.
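One way to stratify an age group of 200 persons is to allocate places to each stratum in proportion to its share of the population. The sketch below is a minimal illustration of that idea using largest-remainder rounding, which is one of several reasonable rounding choices. The strata and percentage shares are hypothetical and are not real census figures.

```python
def allocate_stratum_counts(group_size, shares):
    """Split group_size across strata in proportion to population shares.

    Uses largest-remainder rounding so the counts sum exactly to group_size.
    """
    total = sum(shares.values())
    exact = {k: group_size * v / total for k, v in shares.items()}
    counts = {k: int(x) for k, x in exact.items()}
    shortfall = group_size - sum(counts.values())
    # Give leftover places to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_remainder[:shortfall]:
        counts[k] += 1
    return counts


# Hypothetical population shares (in percent) for one age group.
shares = {
    ("female", "urban"): 33,
    ("female", "rural"): 18,
    ("male", "urban"): 32,
    ("male", "rural"): 17,
}
plan = allocate_stratum_counts(200, shares)
for stratum, n in plan.items():
    print(stratum, n)
```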

Some qualities assessed by tests reflect strong age changes (e.g., motor development, intelligence, and achievement in children as well as motor coordination and strengths in older adults) while others reflect few if any age changes (e.g., achievement and personality in middle aged adults). Tests that assess qualities that change with age should provide norms for each age group in which change occurs. Thus, the standardization samples of tests that assess qualities that change with age should be acquired in smaller age ranges than those that assess qualities that show little if any change with age. For example, scales of motor development for young children should sample from months 0-3, 4-6, 7-9, 10-12, 13-18, and 19-24.
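The age bands in the motor development example can be expressed as a simple lookup that assigns each child to a norm group. This sketch assumes ages are recorded in whole months; the names are hypothetical.

```python
from bisect import bisect_left

# Upper bounds (in months) of the bands from the example: 0-3, 4-6, ..., 19-24.
BAND_UPPER = [3, 6, 9, 12, 18, 24]
BAND_LABELS = ["0-3", "4-6", "7-9", "10-12", "13-18", "19-24"]


def norm_band(age_months: int) -> str:
    """Return the norm band label for an age given in whole months."""
    if not 0 <= age_months <= BAND_UPPER[-1]:
        raise ValueError("age outside the range covered by the norms")
    return BAND_LABELS[bisect_left(BAND_UPPER, age_months)]


for age in (0, 3, 4, 12, 13, 24):
    print(age, norm_band(age))
```

The bands are narrower at the youngest ages, where change is most rapid, and wider later, which mirrors the point that sampling ranges should follow the rate of developmental change.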

Qualities assessed in children are more likely to display developmental changes than those in most adults. Thus, separate norms for each age typically are needed for children's tests. In addition, inasmuch as persons over the age of 60 may show declining skills and abilities, separate norms and larger samples also may be needed for them.

Plan reliability and validity studies. This is a new and critical activity. Manuals are expected to provide evidence that the test is reliable and its proposed uses are valid. Reliability estimates generally include those of internal consistency and stability. With reference to stability, the period between the first and second tests ideally should reflect a period over which the test developer believes the test data to be stable. For example, if the data are thought to be stable over a period of six months, the interval between the first and second tests should be six months. Practical concerns often limit test-retest intervals to 6 to 10 weeks. However, these short intervals are not ideal. Test developers should be encouraged to subsequently publish in professional journals test-retest reliability estimates acquired over longer intervals.
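The two kinds of reliability estimates mentioned above can be sketched in a few lines. The example assumes that stability is estimated with a Pearson correlation between two testings and internal consistency with coefficient alpha; both are common choices rather than the only ones, and the scores shown are made up for illustration.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cronbach_alpha(responses):
    """Coefficient alpha; `responses` holds one list of item scores per person."""
    k = len(responses[0])
    item_variances = [pvariance([person[i] for person in responses]) for i in range(k)]
    total_variance = pvariance([sum(person) for person in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Made-up total scores for five people tested twice (not real data).
first_testing = [10, 12, 9, 15, 11]
second_testing = [11, 12, 10, 14, 12]
stability = pearson_r(first_testing, second_testing)

# Made-up scores on three items for the same five people at the first testing.
item_responses = [[3, 4, 3], [4, 4, 4], [2, 3, 4], [5, 5, 5], [3, 4, 4]]
internal_consistency = cronbach_alpha(item_responses)
```

In practice these estimates would be computed with a statistics package on far larger samples; the sketch only shows what the two coefficients summarize.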

Careful attention needs to guide the formation of validity studies. Contrary to general belief, a test has no validity. Validity refers to the degree of accuracy in ways test data are interpreted and used. Thus, test developers need to state the ways data from the test being developed may be used and then to acquire data that reflect the degree of accuracy in using tests in these ways.

For example, validity studies may be needed to determine the extent to which the test measures well defined constructs (e.g., intelligence, introversion-extroversion). Factor analytic methods often are used for these purposes.

Validity studies may be needed to determine the extent to which the test data correlate with other sources of information (e.g., the degree to which test data agree with opinions of teachers or other professionals or data from other measures). Various correlational methods often are used for these concurrent validity studies.

Many people on whom the test is being standardized may have existing data (e.g., medical or psychiatric records). An attempt should be made to acquire these and other data to use in validity studies.

Validity studies may be needed to determine the extent to which the test measures qualities that appear in the future (e.g., whether a scale of intelligence predicts future levels of achievement or a scale of personality predicts later psychological pathology). Various correlational analyses often are used for these predictive validity studies.

The above are examples. Each test developer should devote considerable attention to identifying the nature of the studies needed and to initiating them as soon as possible. The main goal is to provide information as to the accuracy in using test data in intended ways. The assistance of advisory board members and other consultants often is very helpful at this step. The involvement of undergraduate and graduate students as well as practicing professionals in the collection of validity data occurs frequently. Rely, too, on well respected professional and academic references on test development, including the Standards for Educational and Psychological Testing (1999).

Collect standardization data. Refer again to previously presented information on this issue. Site administrators responsible for data collection generally work in applied settings (e.g., schools, hospitals, government agencies, as well as private industry and commercial locations) and have access to a number of people on whom test data are needed. Thus, test development costs may be somewhat minimized by utilizing their resources. The processes followed in collecting standardization data need to be more precise than those used in collecting pretest data. The test developer previously specified the nature of the standardization sample and collected data on a smaller sample. He or she now must implement methods that ensure the data acquired during the standardization process will match this plan.

Thus, each data collector should agree to collect data on a specific number of people who display specific qualities (e.g., 50 non-disabled boys and 50 non-disabled girls, aged 8, half of whom come from middle class homes and half of whom come from lower class homes). The test developer should anticipate that 20% of those who agree to collect data will not do so and thus should plan to acquire 20% more data than are minimally needed.
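A short worked calculation may help when setting these targets. The sketch below uses hypothetical function names and works in whole-number percentages. It also shows one reading of the arithmetic that goes slightly beyond the text: if 20% of the planned data truly fail to arrive, a 20% margin leaves the total a little short of the minimum, so a somewhat larger margin would be needed to cover the full shortfall.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def planned_with_margin(minimum_needed, margin_pct=20):
    """Cases to plan for when adding a fixed margin, as the text suggests."""
    return ceil_div(minimum_needed * (100 + margin_pct), 100)


def planned_to_cover_shortfall(minimum_needed, shortfall_pct=20):
    """Cases to plan so the minimum still remains after the expected shortfall."""
    return ceil_div(minimum_needed * 100, 100 - shortfall_pct)


def expected_delivered(planned, shortfall_pct=20):
    """Cases expected to arrive if `shortfall_pct` percent are never collected."""
    return planned * (100 - shortfall_pct) // 100


minimum = 200                                      # e.g., one age group
with_margin = planned_with_margin(minimum)         # 20% more than the minimum
full_cover = planned_to_cover_shortfall(minimum)   # minimum divided by 0.80
```

With a minimum of 200, the 20% margin plans for 240 cases, of which about 192 would be expected to arrive; planning for 250 would leave the full 200. Whether the small gap matters depends on how firm the minimum is.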

The period for collecting standardization data together with that needed for reliability and validity studies may range from a few months to a year or more. Arrange to have the data sent by the site coordinators to the test developer as they are obtained. The test developer should examine the data for accuracy and completeness and inform site supervisors if problems exist.

 

Step 7. Analyze standardization data

This is the last step that includes activities for which the test developer is principally responsible. Activities during Step 8 are likely to be shared by others who have greater expertise in marketing and distributing tests. Moreover, activities completed during Step 7 are likely to impact the quality of the test and its future acceptance and appeal. Thus, careful attention is needed during each of the following activities as their outcomes will impact the test's qualities and be apparent to others.

Code and enter the data. A detailed coding form must be developed and used to facilitate data entry and retrieval. Although the need to analyze the data for test development is most immediate, these data also may be analyzed for various additional purposes by many people over many years. Thus, a detailed coding form that clearly indicates the nature of the data and their placement is critical. Lack of attention to this issue may result in the loss of data and thus the possible destruction of the aims of this costly test-development project. Strive to acquire and code as much data as possible on each person and site that participates in the test's standardization.
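One way to make a coding form concrete is to write it down as a codebook that names each field and its permitted codes, and then to check every incoming record against it. The sketch below is illustrative only: the field names, codes, and ranges are hypothetical and would be replaced by those of the actual project.

```python
# Hypothetical codebook: each field, its type, and the codes or range allowed.
CODEBOOK = {
    "case_id": {"type": str},
    "site_id": {"type": str},
    "age_months": {"type": int, "min": 0, "max": 1200},
    "gender": {"type": str, "allowed": {"F", "M"}},
    "raw_score": {"type": int, "min": 0},
}


def check_record(record, codebook=CODEBOOK):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field, rule in codebook.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
            continue
        if not isinstance(value, rule["type"]):
            problems.append(f"{field}: expected {rule['type'].__name__}")
            continue
        if "allowed" in rule and value not in rule["allowed"]:
            problems.append(f"{field}: unexpected code {value!r}")
        if "min" in rule and value < rule["min"]:
            problems.append(f"{field}: below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            problems.append(f"{field}: above maximum {rule['max']}")
    return problems


complete = {"case_id": "A001", "site_id": "S01", "age_months": 96,
            "gender": "F", "raw_score": 41}
faulty = {"case_id": "A002", "age_months": -5, "gender": "X", "raw_score": 38}
faulty_problems = check_record(faulty)
```

Running such a check as data arrive from each site makes it easier to report problems back while they can still be corrected.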

Analyze the standardization data. As previously noted, decisions as to which statistics are appropriate to use will depend on the nature of the data, the principal aims of the test, and the nature of the reliability and validity studies. Test developers often consult with specialists in statistics to confirm proposed data analysis methods and when difficulties arise. Some test developers employ specialists in statistics or rely on personnel from the testing company to perform these tasks. Those who develop tests should not be expected to know the latest methods for analyzing data.

Review the standardization data to ensure they reflect desired qualities. Begin by analyzing data from all items and from all persons on whom data have been acquired; later reviews may consider possible sub-group differences. Initial reviews examine expected developmental trends by age (or the expected similarities over ages), the expected factor structure, and other hypothesized test characteristics. Reviews of data that reflect reliability and validity also are initiated during this period.

These initial analyses may indicate the expected and desired standards have not been met. Thus, test developers typically make adjustments by removing various items thought to be problematic (remember that the test has 10 to 20% more items than may be needed) and excluding subjects so as to provide more suitable norms. This process is data based and involves patience, logic, and intuition; the successful resolution of difficulties during this critical activity may take some months. Consultation with others may help overcome complex problems.
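One common way to identify problematic items is to correlate each item with the total of the remaining items and to look closely at items whose correlation is low or negative. The sketch below illustrates the idea with made-up responses; the cut-off of 0.20 is an assumption for illustration, not a standard taken from this chapter.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def corrected_item_total(responses):
    """Correlate each item with the sum of all the other items."""
    n_items = len(responses[0])
    results = []
    for i in range(n_items):
        item = [person[i] for person in responses]
        rest = [sum(person) - person[i] for person in responses]
        results.append(pearson_r(item, rest))
    return results


# Made-up right/wrong (1/0) responses of six people to four items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
item_rest = corrected_item_total(responses)

CUTOFF = 0.20  # assumed threshold, for illustration only
flagged_items = [i + 1 for i, r in enumerate(item_rest) if r < CUTOFF]
```

In these made-up data the fourth item runs against the other three and is the only one flagged. A flag is a prompt for review, not an automatic reason to drop the item; the decision still rests on the judgment described above.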

Select the final set of items. Issues associated with this selection process have been discussed previously. One strives for the smallest number of quality items that help ensure adequate reliability and validity.

After selecting the final set of items, analyze the data from the reliability and validity studies. Item analysis and factor analytic studies typically are critical both to the review of the standardization data and to the analysis of reliability and validity studies. Thus, these activities are likely to occur during the earlier data review process and again after the final set of items is identified.

Prepare norms tables. Test consumers may need various norms tables. Thus, consult with them before preparing norms tables. Tables for age, grade, gender, occupation, and other identifying personal qualities commonly are provided. Information in tables typically includes the sample sizes by age and gender, standard and percentile scores, and reliability estimates. However, the nature of information provided by norms tables varies considerably. Their construction should reflect the nature of the data and the needs of the test consumers.
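The standard and percentile scores that appear in a norms table can be illustrated with a short sketch. It assumes a linear standard score on a scale with a mean of 100 and a standard deviation of 15, a common convention that is not prescribed by this chapter, and it uses made-up raw scores for a single age group. Published norms often involve smoothing or normalizing steps that are not shown here.

```python
from statistics import mean, pstdev


def percentile_rank(raw, norm_scores):
    """Percent of the norm group scoring below `raw`, counting ties as half."""
    below = sum(score < raw for score in norm_scores)
    tied = sum(score == raw for score in norm_scores)
    return 100 * (below + 0.5 * tied) / len(norm_scores)


def standard_score(raw, norm_scores, scale_mean=100, scale_sd=15):
    """Linear standard score relative to the norm group's mean and SD."""
    z = (raw - mean(norm_scores)) / pstdev(norm_scores)
    return scale_mean + scale_sd * z


def norms_table(norm_scores):
    """Rows of (raw score, standard score, percentile rank), rounded."""
    return [
        (raw,
         round(standard_score(raw, norm_scores)),
         round(percentile_rank(raw, norm_scores)))
        for raw in sorted(set(norm_scores))
    ]


# Made-up raw scores for one age group (not real data).
age_8_raw_scores = [12, 15, 15, 18, 20, 20, 20, 23, 25, 27]
table = norms_table(age_8_raw_scores)
```

A separate table of this kind would be built for each age group, and for other groupings consumers need, from that group's own standardization data.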

Finalize the test format after selecting the items. The format refers to the placement of the directions, arrangement of test items, print size, use of color, and other features important to a standard and efficient test administration.

Finalize the record form. As previously noted, record forms (answer sheets or protocols) should be designed to be consistent with the characteristics of the test and those being tested. Record forms also should enable test takers or examiners to record vital data about those taking the test, the date and location of the testing, and other information that may be important to the test's interpretation (e.g., the examinee's test-taking characteristics) or later use.

Finalize scoring methods. The test developer and staff members are likely to have scored tests during previous steps. Methods now are needed that enable test consumers to score tests efficiently and easily yet ensure the data are scored reliably.

Write and edit the manual and other technical materials. The process of writing these important documents can begin as soon as the theory and rationale for the test are known and guidelines for interpreting the test are developed. Additional information as to the standardization sample, norms tables, reliability, and validity can be added as it becomes known.

 

Step 8. Prepare test for marketing and distribution

This step is critical to a test's success. Someone or some company needs to assume responsibility for the printing of the test materials and for their promotion and sale. Most tests are marketed by test publishers. Thus, the following comments are offered with this in mind. However, the activities also could be assumed by an individual and thus bypass a company that specializes in these activities.

Print the test, record forms, manuals, and other materials. The printing, art work, quality of paper, and other features should be of sufficient quality so as to promote the test's acceptance. Tests that are poorly printed generally are not well accepted.

Market the test well. Good tests may do poorly commercially because they have not been properly marketed and distributed. Success during this step depends on the test author forming a strong relationship with a publisher, one committed to publicizing, marketing, and distributing the test to its intended audience. Test authors should avoid contracting with a test publisher who relies primarily on sales derived from information put in the company's test catalog. Test publishers should use more direct and active methods to promote test purchase.

The author also needs to actively promote the test. This can be done by presenting workshops at regional and national meetings of professional associations, agencies, and other organizations. Promotion also occurs through publication of test related research in newsletters and refereed journals. Encourage professors who teach courses in areas related to the test to become familiar with the new instrument. The broad acceptance of a test will be enhanced when the test author and publisher are strongly committed to and work actively to promote it.

Arrange for test storage. Persons ordering the test expect to contact test publishers easily by letter, fax, and telephone and to have the test materials shipped immediately so that they arrive within a few days. This requires printing, packaging, and storing tests in sufficient numbers for immediate shipment.

A Postscript: Consider developing more than one scale simultaneously

 

Test development involves a long, costly, and complex process. Although a three to five year development process often occurs, the period of time from test design to test sales may take ten or more years. Tests that require more developmental time generally are more costly and complex.

Consider developing more than one test with the same standardization sample. The concurrent development of two or more tests can significantly reduce the costs and time associated with their separate development. In addition, the use of tests co-normed on the same population often is appealing to consumers and may assist in their interpretation.

Conclusions

Many able professionals who have an interest in developing standardized tests believe they lack the skills and abilities to do so. The tasks often seem overwhelming and too complex. Hopefully professionals will find information in this chapter helpful in clarifying the steps when developing standardized tests and will be encouraged to venture forth in their effort to develop needed measures.

No one was born with the skills and abilities needed to develop tests. Professionals acquire them, in part, by becoming involved in test development activities and either learning the skills needed at each step or finding assistance from others who have the needed skills. You also are likely to find the personal and professional satisfaction that others who have developed tests have experienced.

 

References

American Educational Research Association, American Psychological Association, and the National Council on Measurement in Education. (1999). Standards for Educational and Psychological Testing. Washington DC: American Psychological Association.

Oakland, T. (1995). Test use with children and youth internationally: Current status and future directions. In T. Oakland & R. Hambleton (Eds.), International perspectives on academic assessment. Boston: Kluwer.

Wang, Z. M. (1993). Psychology in China: A review. Annual Review of Psychology, 44, 87-116.

 

Discussion Point

What knowledge and skills should psychologists and educationalists acquire during their university studies to make it possible for them to develop tests?

 

Author Information

Thomas Oakland
Professor, UF Research Foundation
Professor, Department of Educational Psychology
College of Education
University of Florida
1410 Norman Hall
PO Box 117047
Gainesville FL 32611
352-273-4283
Fax: 352-392-5929
oakland@coe.ufl.edu

Degrees earned:
Ph.D., Educational Psychology, Indiana University, 1967
M.S., Educational Psychology, Indiana University, 1965
B.A., History, Lawrence College, 1962

Biographical sketch
Professor Thomas Oakland has demonstrated excellence in his profession through his national and international scholarship and service. His research has contributed in the areas of assessment, especially that of minority children, school-related disorders, children’s adaptive behavior, test development, and legal and ethical issues. Oakland’s scholarly productivity is remarkable for its quality, consistency, and longevity, and many of his publications are found in the leading peer-reviewed journals and most prestigious books. He has also served on 26 editorial boards (12 nationally and 14 internationally). In recognition of his scholarship, Oakland has received various national and international awards, including the American Psychological Association (APA) Award for Distinguished Contributions to the Advancement of Psychology, distinguished service awards from the APA, International School Psychology Association, and National Association of School Psychologists, the Willard Nelson Lifetime Achievement Award In Recognition of Lifetime Achievement in the Practice of School Psychology from the Florida Association of School Psychologists, and the UF International Educator of the Year Award. Oakland has served as the APA president of the Division of School Psychology, Policy and Planning Board, International School Psychology Association, and member of the Ethics Code Task Force. He has also served as President of the International Test Commission and is a long-standing member of the ITC Council.