Professor Wiliam’s essay below represents the end of the AfL In Science Symposium. We are extremely grateful to him for offering to be involved as well as to all those online who joined the discussion. We hope that the posts have been interesting and helpful, and look forward to the debate continuing.

Adam, Rosalind, Niki, Deep, Ben and Matt

Assessment for learning in science symposium: Some reflections

Dylan Wiliam, UCL

One of the most frustrating debates that I have witnessed over the past twenty years or so is one in which people argue about whether formative assessment is generic or subject-specific. The answer, to my mind at least, is that it is obviously, and trivially, both. For me, the important issues are the trade-offs that occur when we treat formative assessment as more generic or more subject specific.

For example, in formulating a school policy on assessment, I would hope that schools would seek to play up the commonalities across subjects. The ideas that it is helpful for students to know what they are meant to be learning, for teachers to find out what their students are learning and give feedback, and for students to support each other’s learning and have ownership of their own learning apply to the learning of any subject. And when teachers of different subjects use common language, students are more likely to see their learning experiences in different subjects as coherent.

But when individual teachers seek to develop their own practice of formative assessment, it is essential that the way they do this takes into account the specific subject, and in particular there are at least four levels that need attention.

The first is curriculum philosophy: what do we want students to know? In some countries, such as the United States, science curricula include earth sciences whereas in other countries these are regarded as part of geography. In some countries, science curricula focus primarily on the knowledge that science has produced, whereas in others, students are also expected to learn about how science produces knowledge—a distinction captured by Harlen (2010) as the difference between “big ideas of science” and “big ideas about science.”

In her blog in this series, Rosalind Walker argues persuasively that school science knowledge consists of propositions, relations between those propositions, and more or less formal procedures for answering scientific questions. This is a view with which I have much sympathy, but I am acutely aware that I cannot prove to someone who disagrees with me that my view of what it means to know science is the correct, or even the best one. Some things are included in our science curricula—as Rosalind Walker says, “You can’t not do the alkali metals”—and some are not, and there can be no rational basis for inclusion or exclusion. Curricula are, in Denis Lawton’s memorable phrase, “a selection from the culture of a society” (Lawton, 1975, p. 7).

The second is epistemology: what does it mean to know? For example, some might be happy that students can state certain scientific principles, such as Archimedes’ principle. Others might argue that just being able to state Archimedes’ principle is not enough. For example, Gobert, O’Dwyer, Horwitz, Buckley, Levy, and Wilensky (2011) suggest that knowing science requires not only being able to state scientific principles, but also being able to use those principles to reason scientifically. Even here, quite what we mean by being able to use a principle varies. We might say that a student knows Archimedes’ principle if she can use it to explain why ice floats, or we could be more demanding and say that she only really knows it if she can use it to answer questions such as the following:

Someone sits in a boat in a swimming pool holding a 10kg mass. What happens to the level of the water in the pool if the mass is dropped into the swimming pool?
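
One way of reasoning through this question can be sketched as follows. The sketch assumes that the mass is denser than water, so that it sinks, and the symbols are introduced here purely for illustration: m is the mass, ρ_w the density of water, ρ_m the density of the mass, and V the volume of water displaced in each situation.

```latex
\begin{align*}
V_{\text{in boat}} &= \frac{m}{\rho_w}
  && \text{(floating: the weight of water displaced equals the weight of the mass)} \\
V_{\text{on bottom}} &= \frac{m}{\rho_m}
  && \text{(submerged: the volume displaced equals the volume of the mass)} \\
\rho_m > \rho_w &\implies V_{\text{on bottom}} < V_{\text{in boat}}
\end{align*}
```

On this reasoning, the water level falls slightly. With m = 10 kg and ρ_w of roughly 1000 kg/m³, the mass accounts for about 0.010 m³ of displaced water while it is in the boat; if it were made of iron (roughly 7900 kg/m³, an illustrative choice), it would displace only about 0.0013 m³ once on the bottom of the pool.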

The third level is psychology: what happens when learning takes place? For some, learning science is making links between stimuli and responses, as when students learn that there is a single electron in the outer shell of a sodium atom, which is why it bonds with a single atom of chlorine to make salt. There is no point in asking why an atom of sodium has a single electron in its outer shell. Students just need to know that the atoms of the element that we call sodium happen to have a single electron in the outer shell. For others, learning is a more constructive process, whereby students construct models of the world they experience, as evidenced by the fact that many young children believe that the wind is caused by the movement of trees. This is not the result of poor-quality science instruction, nor is it the result of insufficient reinforcement of the correct links between stimuli and responses. Rather it is the result of students creating schemas to make sense of their experience. While some people adopt entrenched positions on such issues, it probably makes more sense to acknowledge that some aspects of learning science are best described by the former approach, which we might call an associationist approach, while some are more like the latter, which we might call a constructivist approach.

The fourth level is pedagogy: how do we get students to know? Over the last thirty years, there has been considerable confusion in science education caused by assuming that a particular view of what happens when learning takes place (psychology) necessarily entails a particular approach to getting students to learn (pedagogy). In particular, people have talked about constructivist approaches to teaching, as if a constructivist approach to psychology necessarily entails a particular way of teaching, whereas this does not necessarily follow (Kirschner, 2009).

The important point, for this series of blogs about science education, is that a commitment to formative assessment entails absolutely no commitment of any kind about any of the four levels discussed above. For example:

First, formative assessment can be used to improve the teaching of creationism or intelligent design if that’s what you think students should be learning in science.

Second, formative assessment can be used to find out whether students have mastered the scientific vocabulary they need, or whether they can use the models they have learned to reason about novel phenomena.

Third, formative assessment is needed by associationists to find out whether sufficient reinforcement has taken place and by constructivists to discover whether the facets (Minstrell, 1992) of students’ thinking that have been constructed are in fact those intended.

Finally, formative assessment says nothing about what kinds of learning activities should take place if the assessment reveals that what students have learned is not what was intended. Formative assessment reveals what learning has taken place, not what should happen next.

I think it is also worth pointing out that while some people use the terms “formative assessment” and “assessment for learning” interchangeably, for Paul Black and myself at least, there are important differences between the two. While definitions differ, most people seem to define assessment for learning in a similar way to the way that Paul Black and I defined it in “Working inside the black box”:

Assessment for learning is any assessment for which the first priority in its design and practice is to serve the purpose of promoting pupils’ learning. It thus differs from assessment designed primarily to serve the purposes of accountability, or of ranking, or of certifying competence. (Black, Harrison, Lee, Marshall, & Wiliam, 2002, p. 2)

This definition includes the use of assessment for motivation (Black, Broadfoot, et al., 2002) and assessment for retrieval practice (Dunlosky, Rawson, Marsh, Nathan, & Willingham, 2013). For Paul Black and myself, formative assessment is a much narrower idea, focusing on assessment processes that support inferences about instructional decisions:

Practice in a classroom is formative to the extent that evidence about student achievement is elicited, interpreted, and used by teachers, learners, or their peers, to make decisions about the next steps in instruction that are likely to be better, or better founded, than the decisions they would have taken in the absence of the evidence that was elicited. (Black & Wiliam, 2009, p. 9)

In his opening contribution to this symposium, Adam Boxer asks, if formative assessment is so good, why did it not make a bigger impact on student achievement? My answer here is simple—it was never implemented. Instead, the work that Paul Black and I had done with maths and science teachers on making their regular teaching more responsive to student needs was used as a pretext for increased monitoring of student achievement, initially as part of the national Key Stage 3 Strategy, and then as part of the Making Good Progress initiative.

In this context, it is perhaps worth noting that in our initial project with mathematics and science teachers in Oxfordshire and Kent, Paul Black, Christine Harrison and I always thought of formative assessment as both subject specific and as generic. In our regular one-day workshops with the participating teachers, each workshop included some general sessions, and some where the mathematics and science teachers met separately, to focus on the subject-specific aspects of formative assessment. And because we were acutely aware of the need for subject-specific accounts of formative assessment, we produced subject-specific booklets, including “Science inside the black box” (Black & Harrison, 2002) and “Mathematics inside the black box” (Hodgen & Wiliam, 2006).

Niki Kaiser, in her blog, asks “Do they really get it, or are they just giving me the correct answer?” and this highlights the fundamental, inherent limitations of any kind of evidence elicitation procedure as a way of finding out what our students think. As Ernst von Glasersfeld puts it, the teacher is always trying to construct a model of what is going on in the learner’s head, and inevitably, that model will be constructed, not out of the child’s conceptual elements, but out of the conceptual elements that are the interviewer’s own. It is in this context that the epistemological principle of fit, rather than match is of crucial importance. Just as cognitive organisms can never compare their conceptual organisations of experience with the structure of an independent objective reality, so the interviewer, experimenter, or teacher can never compare the model he or she has constructed of a child’s conceptualisations with what actually goes on in the child’s head. In the one case as in the other, the best that can be achieved is a model that remains viable within the range of available experience. (von Glasersfeld, 1987, p. 13)

Constructing good questions therefore requires more than just subject knowledge. It requires a knowledge also of the kinds of misconceptions that students have about the material they need to learn—what in Japan is called kyozaikenkyu (literally “knowledge of instructional materials” but in practice knowing how to teach particular ideas, and what students find difficult). Those who know science in great depth but have never taught school students are generally not able to construct good questions, because while they know the content, they do not know what kinds of ideas students bring with them, and what they struggle with.

In his contribution, Deep Ghataura traces the development of thinking about validity. Originally, validity was conceived of as a property of a test: a test was regarded as valid to the extent that it assessed what it purported to assess. The problem with that definition, of course, is that a test does not purport anything. The purporting is done by those who claim that a particular test result has a particular meaning. As a result, the focus shifted from validity as a property of a test to validity as a property of inferences made on the basis of assessment outcomes. Initially, as Deep Ghataura points out, tests were validated in terms of their content (does the test representatively sample the domain of interest?) or their predictive power (does the test predict performance on other measures, either now, or at some point in the future?). However, particularly in personality psychology, there were some assessments where there was no adequate definition of the domain of interest, nor any clear idea what the assessment should predict. For such cases, to be valid, a test had to be shown to be a good measure of some hypothetical construct, such as “conscientiousness” or “test anxiety”. For many years this “holy trinity” of content, criterion, and construct considerations dominated work in the theory of validity (Guion, 1980), but as early as 1957, Jane Loevinger had pointed out that “since predictive, concurrent, and content validities are all essentially ad hoc, construct validity is the whole of validity from a scientific point of view” (Loevinger, 1957, p. 636).

As Deep Ghataura points out, this unified approach culminated in the definition of validity provided by Samuel Messick:

Validity is an integrative evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment. (Messick, 1989, p. 13)

The problem with that, in turn, is that it seemed to absolve test makers of any responsibility for their tests: the burden was on those who used test results. This was unfair for two reasons. The first is that test producers tend to have more expertise in this area than test users. The second is that it is rather disingenuous for test producers to produce tests but then claim that any problems with test use are the fault of the user. As a result, work on validity over the last forty years or so has sought to provide a definition of validity with appropriate divisions of responsibility between test producers and test users.

One particular issue that has been the subject of much debate over the last 50 years is whether the social consequences of test use should be considered part of validity. James Popham has argued that while a concern for the social consequences of test use is entirely appropriate, it is not helpful to use the concept of validity for this purpose. As he argued in a critique of the idea: “right concern, wrong concept” (Popham, 1997).

However, while the idea of “consequential validity” is often attributed to Messick, he does not appear ever to have used this phrase. The clearest statement of his beliefs about when, and under what circumstances, social consequences should be regarded as part of validity is provided in his “magnum opus”, the 100,000-word chapter on validity in the third edition of Educational Measurement:

As has been stressed several times already, it is not that adverse social consequences of test use render the use invalid, but, rather, that adverse social consequences should not be attributable to any source of test invalidity such as construct-irrelevant variance [see below for an explanation of this term]. If the adverse social consequences are empirically traceable to sources of test invalidity, then the validity of the test use is jeopardized. If the social consequences cannot be so traced—or if the validation process can discount sources of test invalidity as the likely determinants, or at least render them less plausible—then the validity of the test use is not overturned. Adverse social consequences associated with valid test interpretation and use may implicate the attributes validly assessed, to be sure, as they function under the existing social conditions of the applied setting, but they are not in themselves indicative of invalidity. (Messick, 1989, pp. 88-89)

In other words, adverse social consequences should be regarded as part of validity only when they arise due to deficiencies in the assessment.

In his blog, Deep Ghataura wonders why I invoked Messick’s four-facet model of validity in a paper I wrote in 2000. The answer is simple: my views on how we might validate formative assessment have been evolving fairly constantly over the last quarter-century.

As early as 1994, I had become convinced that it made no sense to apply the formative-summative distinction to assessments, since the same assessment might function both formatively and summatively (Wiliam & Black, 1996), and in a series of papers over the next few years, Paul Black and I explored the idea that the formative-summative distinction might in fact be a continuum. In recent years, however, I have become more and more convinced that the simplest approach to this issue is, following Cronbach (1971), to regard assessments as procedures for making inferences, and that the terms “formative” and “summative” describe the kinds of inferences being made. Where the inferences are about the status or potential of a student, the assessment is functioning summatively; where the inferences are about what kinds of actions are likely to improve learning, the assessment is functioning formatively.

Of course, as Deep Ghataura points out, this clarification goes only a little way to understanding how assessment may help teachers on Friday period 5, but thinking of formative assessment as assessment—as a process of evidentiary reasoning (Mislevy, 2004)—draws attention to the quality of the evidence that teachers have for their instructional decisions. In particular, as Deep Ghataura suggests, well-designed multiple-choice questions can be especially powerful since they allow teachers to get information on the whole class quite quickly. Moreover, where the incorrect options relate to specific misconceptions—what Sadler (1998) calls “distractor-driven” multiple-choice questions—then teachers are likely to be able to identify productive courses of action quickly.
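
As a rough illustration of how responses to a distractor-driven question might be summarised for a whole class, the following Python sketch tallies answers according to the idea each option is assumed to signal. The option labels, the descriptions attached to them, and the function name `summarise_responses` are all hypothetical; they are not taken from Sadler's instruments or from any of the blogs in this series.

```python
from collections import Counter

# Hypothetical mapping from each answer option to the idea it is assumed
# to signal (here, for a question about what happens to the water level).
OPTION_MEANINGS = {
    "A": "intended answer: the level falls",
    "B": "thinks the level rises",
    "C": "thinks the level stays the same",
}


def summarise_responses(responses, option_meanings=OPTION_MEANINGS):
    """Count how many students chose an option signalling each idea."""
    counts = Counter()
    for choice in responses:
        # Anything not in the mapping is grouped together rather than dropped.
        counts[option_meanings.get(choice, "unrecognised response")] += 1
    return counts


if __name__ == "__main__":
    # Made-up responses from a class of six students.
    class_responses = ["A", "B", "B", "C", "A", "B"]
    for idea, number in summarise_responses(class_responses).most_common():
        print(f"{idea}: {number}")
```

A tally like this only shows which distractors were chosen and how often. Deciding what to do next remains a matter of teacher judgment.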

The challenging thing about writing such questions is that they force us to get off the fence. While, in conversations with colleagues, we might think we agree about what we mean by being able to apply the gas laws, we often find that when we write questions to test such understanding, we disagree about whether such questions are appropriate. Such disagreements appear to be about whether the questions under discussion are good questions for the science being assessed, but generally, they signal differences in curriculum philosophy or epistemology. In other words, we disagree about the construct being assessed. For example, asking students to explain in writing their understanding of science concepts is regarded by some as unfair, because we are assessing both scientific knowledge and writing ability. The results of such assessments are difficult to interpret because poor performance may be caused by poor writing ability, poor science knowledge, or both. From such a perspective, the scores from assessments that require students to write explanations suffer from what is called, in the assessment jargon, construct-irrelevant variance, because some of the variation in student scores is caused by differences in scientific knowledge (the construct of interest) and some of the variance in the scores is caused by differences in writing ability (an irrelevant construct). For others, however, being able to explain one’s scientific understanding in writing is part of being good at science. Such people would regard a science assessment that did not include some writing as under-representing the construct of science (in the jargon, the assessment suffers from construct underrepresentation).

The important point here is that when people argue about whether an assessment that requires students to write extended explanations of their thinking is good or not, it appears to be an argument about the technical adequacy of the assessment, whereas the argument is really about what we should be assessing (curriculum philosophy and epistemology). Assessments are the subject of much debate because “assessments operationalize constructs” (Wiliam, 2010, p. 259).

This issue is brought to the fore by the post from Ben Rogers, which highlights the fact that what constitutes effective writing varies from discipline to discipline—the construct of writing differs. However, this post also illustrates a less obvious, but no less important, point about the choice of assessment tasks in different disciplines and that is that tasks vary substantially in what might be termed their disclosure—the extent to which assessment tasks yield relevant evidence of students’ capabilities, or, to put it crudely, if they know it, do they show it?

In English, a very general writing prompt can yield rich evidence about the extent to which a student has some specific knowledge or capability. However, in science, as in mathematics, such general prompts are much less likely to yield the evidence we need. To borrow a metaphor from competitive diving, in English it is quite common to give the same writing prompt to all students in a class, with the assessment focusing on “marks for style”. In science, however, general tasks often misfire. We give students a task to complete, and they complete it, but do so in a way that does not give us any useful information about the specific thing we were interested in. In science, we tend to value students’ work in terms of their “degree of difficulty”. These differences between disciplines mean that effectively implementing formative assessment will always require a strong disciplinary perspective.

In the final contribution to this online symposium, Matt Perks draws together a number of threads from the earlier conversations and shows how, although some teachers may be able to assess formatively “on the fly” there is no substitute for careful planning. He points out that often it is impossible to give good feedback to the student because the task that the student was given was not carefully enough planned to provide insights into students’ thinking. The starting point for good feedback is asking the right questions in the first place. And what constitutes a good question requires clarity in what it means to know science, which brings us back to the four levels I discussed at the beginning of this already too long post. If knowing science means knowing definitions of scientific terms, being able to state scientific principles, and using these principles to solve routine, familiar problems, then formative assessment degenerates into just checking that students know the terms, can state the principles, and can solve routine problems.

At a more advanced level, we might ask students to use their scientific ideas to solve novel problems, such as the example given by Matt Perks about evolution and deer antlers. We might expect students to argue that deer with larger antlers are more likely to have offspring, and since the size of antlers is at least partly heritable, then the male offspring of such deer are also likely to have larger antlers.

However, at a more advanced level still, we might want to check whether, as many students (and indeed adults) do, they believe in evolution as some kind of design process culminating in some perfect form of a species, as opposed to a more or less random process that results in forms of life that are, at least for the moment, better adapted to their current environments. In the same way, when we ask students to “explain digestion” we have the key stage 1 answer, the key stage 2 answer, the key stage 3 answer, and so on all the way up to the PhD answer and beyond. Ultimately, the clarity of vision about where we want to get our students to drives everything we do in classrooms: when we push for more evidence from our students, as in the extended probing of the forces acting on the bus in the “Inferences” section of Matt Perks’ blog, and when we judge that students have “got it” or that further probing is unlikely to be helpful. Indeed, clarity about what is important, and what is less important, in the learning of science may be the most important aspect of teacher expertise. After all, there is already too much content in our science curricula for any but the fastest learners, and embracing formative assessment necessarily entails prioritizing content because we often find out our students haven’t learned what they needed to. As a result, we have to make smart choices about which aspects of content are essential for students to understand, and which aspects are less important. Any teacher who thinks that the phases of the moon is as important a topic as the particulate nature of matter, simply because both are in the curriculum, is unlikely to make smart choices about what to do in the light of uneven levels of understanding in a classroom.

In this sense, teachers need a deep understanding of what curriculum content is important for students to know, because it lays the foundation for future learning, and what content consists of one-off topics that, while desirable for students to understand, will not jeopardize their future progress if they are not mastered.

Perhaps the most useful way of thinking about the extent to which formative assessment is domain-specific or domain-general is with an analogy from science. Almost 100 years ago, physicists learned that understanding the behaviour of electrons required understanding that sometimes treating them as particles provided the greatest insights, whereas in other situations, it was useful to treat them as waves. In the same way, whether we treat formative assessment as domain-specific or domain-general depends on whatever is most appropriate for the situation at hand. As George E. P. Box, the British statistician, said, “All models are wrong but some are useful” (Box, 1979, p. 2).


Black, P., Broadfoot, P. M., Daugherty, R., Gardner, J., Harlen, W., James, M., Stobart, G., & Wiliam, D. (2002). Testing, motivation and learning. Cambridge, UK: University of Cambridge School of Education.

Black, P., & Harrison, C. (2002). Science inside the black box: Assessment for learning in the science classroom. London, UK: King’s College London Department of Education and Professional Studies.

Black, P., Harrison, C., Lee, C., Marshall, B., & Wiliam, D. (2002). Working inside the black box: Assessment for learning in the classroom. London, UK: GL Assessment.

Black, P., & Wiliam, D. (2009). Developing the theory of formative assessment. Educational Assessment, Evaluation and Accountability, 21(1), 5-31.

Box, G. E. P. (1979). Robustness in the strategy of scientific model building. Madison, WI: University of Wisconsin-Madison.

Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443-507). Washington, DC: American Council on Education.

Dunlosky, J., Rawson, K. A., Marsh, E. J., Nathan, M. J., & Willingham, D. T. (2013). Improving students’ learning with effective learning techniques: Promising directions from cognitive and educational psychology. Psychological Science in the Public Interest, 14(1), 4-58. doi: 10.1177/1529100612453266

Gobert, J. D., O’Dwyer, L., Horwitz, P., Buckley, B. C., Levy, S. T., & Wilensky, U. (2011). Examining the relationship between students’ understanding of the nature of models and conceptual learning in biology, physics, and chemistry. International Journal of Science Education, 33(5), 653-684. doi: 10.1080/09500691003720671

Guion, R. M. (1980). On trinitarian doctrines of validity. Professional Psychology, 11, 385-398.

Harlen, W. (Ed.). (2010). Principles and big ideas of science education. Hatfield, UK: Association for Science Education.

Hodgen, J., & Wiliam, D. (2006). Mathematics inside the black box: Assessment for learning in the mathematics classroom. London, UK: NFER-Nelson.

Kirschner, P. A. (2009). Epistemology or pedagogy? That is the question. In S. Tobias & T. M. Duffy (Eds.), Constructivist instruction: Success or failure? (pp. 144-157). New York, NY: Routledge.

Lawton, D. L. (1975). Class, culture and the curriculum. London, UK: Routledge and Kegan Paul.

Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3(Monograph Supplement 9), 635-694.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). Washington, DC: American Council on Education/Macmillan.

Minstrell, J. (1992). Facets of students’ knowledge and relevant instruction. In R. Duit, F. M. Goldberg & H. Niedderer (Eds.), Research in physics learning: Theoretical issues and empirical studies (Proceedings of an international workshop held at the University of Bremen, March 4-8, 1991) (pp. 110-128). Kiel, Germany: Institut für die Pädagogik der Naturwissenschaften an der Universität Kiel.

Mislevy, R. J. (2004, October 21). Cognitive diagnosis as evidentiary argument (Spearman Conference 2004). Paper presented at the Fourth Spearman Conference, Philadelphia, PA.

Popham, W. J. (1997). Consequential validity: right concern—wrong concept. Educational Measurement: Issues and Practice, 16(2), 9-13.

Sadler, P. M. (1998). Psychometric models of student conceptions in science: Reconciling qualitative studies and distractor-driven assessment instruments. Journal of Research in Science Teaching, 35(3), 265-296. doi: 10.1002/(SICI)1098-2736(199803)35:3<265::AID-TEA3>3.0.CO;2-P

von Glasersfeld, E. (1987). Learning as a constructive activity. In C. Janvier (Ed.), Problems of representation in the teaching and learning of mathematics (pp. 3-17). Hillsdale, NJ: Lawrence Erlbaum Associates.

Wiliam, D. (2010). What counts as evidence of educational achievement? The role of constructs in the pursuit of equity in assessment. In A. Luke, J. Green & G. Kelly (Eds.), What counts as evidence in educational settings? Rethinking equity, diversity and reform in the 21st century (Vol. 34, pp. 254-284). Washington, DC: American Educational Research Association.

Wiliam, D., & Black, P. J. (1996). Meanings and consequences: A basis for distinguishing formative and summative functions of assessment? British Educational Research Journal, 22(5), 537-548.