May 22

Outcome measures: external standards vs. method-internal scores

Brendan Parsons, Ph.D., BCN
Neuroscience, Practical guide

*Brendan's Perspective*

Key points:

• Outcome measures are not decoration. They are the endpoint the entire trial is built on, the only place the result actually lands. A trial whose primary outcome was designed by the same team that designed the intervention has done less of the evidentiary work than it appears to.

• External standardised instruments are validated objects in their own right. Their reliability, their responsiveness, and the conditions under which they detect change are all the subject of an entire methodological literature. Method-internal scores rarely traverse that literature, and the comparability of any trial that uses one is constrained as a result.

• When the same intervention can be made to look strong on one outcome measure and unimpressive on another, the choice of measure is doing part of the work of the conclusion. Recognising that choice (naming it, noticing whose hand was on it) is one of the more durable habits of evidence reading.

In the previous piece I argued that independent replication is the step at which a finding starts to look reliable, and that PubMed metadata gives the reader the raw material to check whether replication has actually happened. The closing question was about who had tried to make a given method fail.

This piece picks up the thread on the other side of the trial. Suppose the replication has happened. Suppose the team is independent. The trial has been run cleanly, the comparator is reasonable, the analysis is published. There is still one question a careful reader has to ask before reading the conclusion at face value: what was measured, and on whose instrument?

It sounds like a small question. It is not. The outcome measure is the place where the entire trial (the years of design, the recruitment, the participants' time, the careful comparator selection) is reduced to a number that gets either better or worse. Everything else feeds into that single point. If the instrument was poorly chosen, or designed in-house specifically to show what the intervention was meant to show, the trial can be impeccable in every other respect and still produce a conclusion that says less than it appears to. More than once in my reading career I have finished a clinical paper whose methods were sharper than its measurement and walked away unsure whether the result had earned the language used to describe it.

What "external" actually means

The word external, in this context, has a specific meaning. An external outcome measure is one whose measurement properties have been examined and reported by a community of researchers other than the team that developed the intervention being tested. That community has run validation studies looking at reliability across raters and across time, at construct validity against related measures, at responsiveness to clinically meaningful change, at whether the floor and the ceiling of the instrument actually catch the kinds of patients the trial is recruiting.

This validation work is its own scientific literature, with its own methodological standards. The COSMIN initiative — COnsensus-based Standards for the selection of health Measurement INstruments — has, since the late 2000s, produced a taxonomy of measurement properties, a checklist for evaluating studies of those properties, and a recommended workflow for choosing instruments in clinical trials (Mokkink et al., 2010). The handbook of which it is a part runs hundreds of pages. Choosing a measurement instrument is not, in serious methodology, a Tuesday-morning decision. It is a research choice with as much epistemic weight as the choice of comparator.

When a trial reports its outcome on a validated instrument, the word is doing real work. It implies that someone, somewhere, has shown (in peer-reviewed publications) that the instrument measures what it claims to measure, in patients who resemble the ones being studied, with sensitivity to changes of the kind the trial is hoping to detect. The instrument has, in other words, traversed its own evidentiary journey. (Yes, instruments have evidentiary journeys too. Recursion all the way down.)

What method-internal scoring signals

A method-internal outcome measure is one designed by the same group that designed the intervention. The scoring system is not necessarily wrong; it may, in fact, capture something the validated instruments miss. But it carries three structural features the reader needs to register.

The first is comparability. A trial scored on the Brief Pain Inventory can be set side by side with hundreds of other trials that used the same instrument. A trial scored on a five-item questionnaire developed for the study, with no published validation, cannot. Each method-internal score is, in effect, an island. Meta-analysis becomes difficult or impossible, and cross-trial inference collapses to qualitative comparison — a polite way of saying it stops being inference.
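To make the "island" point concrete, here is a minimal sketch of why a shared instrument matters for pooling. It assumes a simple inverse-variance fixed-effect model, and every trial, effect size, and standard error in it is invented for illustration; the instrument labels are only grouping keys.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical trial results: (instrument, mean difference, standard error).
# All numbers are invented for illustration.
trials = [
    ("BPI", -1.2, 0.40),
    ("BPI", -0.8, 0.30),
    ("BPI", -1.0, 0.50),
    ("in-house 5-item", -2.5, 0.60),
]


def pool_fixed_effect(results):
    """Inverse-variance fixed-effect pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for _, se in results]
    pooled = sum(w * d for w, (d, _) in zip(weights, results)) / sum(weights)
    return pooled, sqrt(1.0 / sum(weights))


# Group trials by the instrument they reported on.
by_instrument = defaultdict(list)
for instrument, diff, se in trials:
    by_instrument[instrument].append((diff, se))

summary = {}
for instrument, results in by_instrument.items():
    if len(results) < 2:
        summary[instrument] = None  # an island: nothing to pool it with
    else:
        summary[instrument] = pool_fixed_effect(results)

for instrument, result in summary.items():
    if result is None:
        print(f"{instrument}: single trial, cannot be pooled")
    else:
        pooled, se = result
        print(f"{instrument}: pooled difference {pooled:.2f} (SE {se:.2f})")
```

The in-house score shows the largest single effect in this invented data, yet it is the one result that cannot be combined with anything else.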

The second is responsiveness asymmetry. A scale developed by people who know exactly what their intervention is supposed to do tends to be responsive to those exact effects. That sounds desirable; it is not, quite. A responsive-to-our-effect instrument will detect the changes we expected and may quietly fail to detect changes the intervention also produced, including, occasionally, changes the patient cared about more than the developers did. This is rarely deliberate; it is almost always sincere, which is part of why it is hard to see from the inside.

The third is a structural overlap with the independence question from the previous piece. A literature in which the intervention, the training, and the outcome measure were all designed by the same network is a literature whose internal evidence is harder to weigh from outside. None of the three pieces, taken alone, is a problem. The combination concentrates the evidentiary risk in a way the reader can name. That naming, more than any single methodological judgement, is most of what evidence literacy actually buys you.

When measurement choice changes the conclusion

To see how much weight an outcome measure can carry, the chronic pain literature is a useful case, partly because the field eventually noticed and tried to fix the problem, which gives us both the disease and the treatment in the same body of work.

Through the 1990s and into the early 2000s, trials of analgesic and behavioural interventions for chronic pain used a wide range of outcome measures: the McGill Pain Questionnaire, the Brief Pain Inventory, ad-hoc visual analogue scales, single-item patient global impression items, condition-specific function scales, and various method-internal responder criteria defined trial by trial. The same intervention could appear modestly effective in one trial and unimpressive in another, with much of the divergence traceable to which instrument the investigators had chosen and how a responder had been defined. Comparison across trials was difficult. Pooling them in meta-analyses was harder still. The field knew this; clinicians felt it; trainees encountered the heterogeneity on their first systematic review and assumed it was simply how clinical research worked.

The IMMPACT consensus process was the response. In 2005, an international group of researchers, clinicians, and methodologists published recommended measures for the core outcome domains that should be considered in any chronic pain clinical trial: pain intensity, physical functioning, emotional functioning, patient global impression of improvement, symptoms and adverse events, and participant disposition (Dworkin et al., 2005). Fifteen years on, the literature looks structurally different. Trials report on a recognisable common core, meta-analyses pool more cleanly, comparisons begin from a shared measurement language rather than a Tower of Babel.

The relevant point for the reader of any clinical claim is the size of the shift. The interventions did not change between 2002 and 2015. The patients did not change. What changed was the instrument they were measured on, and the apparent effects (when read in the older versus the newer literature) look different in ways that matter clinically.

This is a positive story. Such stories are unfortunately not the norm. Most areas of clinical research do not have a consensus like IMMPACT, and the burden of evaluating outcome measures falls back on the reader, which is the situation most readers of most fields are actually in.

Reading study reports for the outcome-measure question

Where, in a published trial, does a reader actually look?

The methods section of a randomised trial is required, under the CONSORT 2010 reporting guidelines, to specify completely defined pre-specified primary and secondary outcome measures, including how and when they were assessed (Schulz, Altman, & Moher, 2010). That is the first place to look. Three questions are usually enough.

Was the primary outcome named, and named alone? If the trial's analysis treats two or three measures as co-equal endpoints without a pre-specified hierarchy, the reader is being asked to accept a result that may have benefited from analytic flexibility. Pre-specification is a structural protection against showing the most favourable outcome and calling it the trial's conclusion.
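The cost of unranked co-equal endpoints can be put in rough numbers. The sketch below assumes each endpoint is tested at the same significance level, that the endpoints are statistically independent, and that the intervention has no true effect on any of them. Real endpoints are usually correlated, which makes the inflation smaller, so treat this as an upper-bound illustration rather than a description of any particular trial.

```python
def chance_of_any_false_positive(n_endpoints, alpha=0.05):
    """Probability that at least one of n independent null endpoints
    crosses the significance threshold by chance alone."""
    return 1 - (1 - alpha) ** n_endpoints


for n in (1, 2, 3):
    print(f"{n} co-equal endpoint(s): {chance_of_any_false_positive(n):.1%}")
```

Under these assumptions, three co-equal endpoints raise the chance of at least one spurious "positive" from 5% to roughly 14%.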

Was the primary outcome an externally validated instrument? If yes, the reader can — in a few minutes on PubMed — locate the original validation studies, check whether the instrument has been used in comparable populations, and read whether other groups have flagged its limitations. If no, the reader is being asked to take the developers' word on what the score means. Both situations exist; only one of them lets the reader check.

Was the responder criterion pre-specified, and justified against published precedent? Many trials report the proportion of participants achieving a clinically meaningful improvement — usually defined as a threshold change, expressed in points or in percentage terms. Whether that threshold matches an independently established minimal clinically important difference is one of the most informative single questions a reader can ask. A threshold chosen by the trial team, with no reference to a prior published value, is a place where the conclusion can sit on a shifted floor.
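A short sketch shows how far the floor can shift. The ten improvement values are invented, and the two thresholds are assumptions chosen for illustration: 30% stands in for an independently published value, 20% for a threshold set by the trial team without precedent.

```python
# Hypothetical percentage reductions in pain score for ten participants.
pct_improvement = [5, 12, 18, 22, 25, 28, 31, 35, 40, 55]


def responder_rate(improvements, threshold_pct):
    """Share of participants whose improvement meets or exceeds the threshold."""
    responders = sum(1 for x in improvements if x >= threshold_pct)
    return responders / len(improvements)


# Assumed thresholds (see the paragraph above the code).
published_rate = responder_rate(pct_improvement, 30)
in_house_rate = responder_rate(pct_improvement, 20)

print(f"Responders at the 30% threshold: {published_rate:.0%}")
print(f"Responders at the 20% threshold: {in_house_rate:.0%}")
```

The participants and their scores are identical in both lines; only the definition of "responder" moved, and the headline proportion moved with it.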

A trial that survives all three questions has done its outcome-measure homework. A trial that does not is not necessarily wrong; it is simply doing less of the evidentiary work than the conclusion implies, and the reader's confidence should adjust accordingly.

A drill, for next time

Here is a small exercise to run on the next clinical paper you read. (You don't have to do it well. You only have to do it.)

Look only at the methods and results — not the discussion. Identify every outcome measure named. For each, mark whether it is:

(a) an externally validated instrument with published psychometric properties — a familiar name, citable references in the methods, used by other groups in comparable populations;

(b) an externally validated instrument adapted for the trial — responder criterion redefined, threshold shifted, sub-scale extracted;

(c) a method-internal score, designed by the trial team, with no independent published validation.

Then read the discussion. Notice which outcome the authors lean on hardest in the conclusion. Notice whether it is the one the trial registered as its primary endpoint. Notice whether the secondary measures — particularly the externally validated ones — tell the same story or a different one.

You will not always be able to answer cleanly. Some abstracts compress the methods to the point of opacity, and the full paper is the only place the measurement choices are visible. (That, too, is a finding.) The habit of looking — first at what was measured, and only then at what the result was — is one of those small disciplines that quietly changes how a literature reads.
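For readers who like to keep notes in a structured form, the drill can be sketched as a small tally. Everything here is hypothetical: the category labels ("validated", "adapted", "internal") are shorthand invented for this sketch, and the example paper and its outcomes are made up.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    name: str
    category: str  # "validated", "adapted", or "internal" (hypothetical labels)
    registered_primary: bool
    emphasised_in_discussion: bool


def drill_flags(outcomes):
    """Return plain-language notes on where the reader should slow down."""
    flags = []
    for o in outcomes:
        if o.emphasised_in_discussion and not o.registered_primary:
            flags.append(
                f"{o.name}: emphasised in the discussion "
                "but not the registered primary endpoint"
            )
        if o.registered_primary and o.category != "validated":
            flags.append(
                f"{o.name}: primary endpoint is '{o.category}', "
                "not an unmodified validated instrument"
            )
    return flags


# A made-up paper: validated primary endpoint, in-house score in the spotlight.
paper = [
    Outcome("Brief Pain Inventory", "validated", True, False),
    Outcome("Study-specific wellbeing score", "internal", False, True),
]

flags = drill_flags(paper)
for line in flags:
    print(line)
```

A flag is not a verdict. It only marks the places where the conclusion leans on a measurement choice the reader has not yet been able to check.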

Conclusion

An outcome measure is the trial's coupling to the world. It is the place where physiology, behaviour, suffering, function, or wellbeing is reduced to a number that either moves or does not. The choice of that coupling is itself a research decision; sometimes contested, sometimes settled, rarely neutral.

A trial that chose its outcome measure from a published, externally validated catalogue, used it as pre-specified, reported it whether the result was positive or null, and compared its result to other trials using the same instrument has done the kind of measurement work that lets a reader take the conclusion seriously. A trial whose outcome instrument was built in-house, whose responder criterion was set without reference to an established threshold, and whose conclusion rests on a measure no other research group has reason to use is doing different evidentiary work — and the reader is entitled to weigh it differently.

The point is not that method-internal scores are unacceptable. They sometimes capture exactly what nothing else captures, and the field would be poorer without them. The point is that they belong inside a frame the reader can see. The frame is not punitive; it is interpretive. It says here is what the trial measured, here is who decided what counts as improvement, here is what those decisions buy and what they cost.

The most useful question to keep close, when the next clinical claim arrives, is the one that follows from this piece and the last:

What was measured, on whose instrument — and could that instrument ever have returned nothing?

References

Boers, M., Kirwan, J. R., Wells, G., Beaton, D., Gossec, L., d'Agostino, M. A., Conaghan, P. G., Bingham, C. O., Brooks, P., Landewé, R., March, L., Simon, L. S., Singh, J. A., Strand, V., & Tugwell, P. (2014). Developing core outcome measurement sets for clinical trials: OMERACT filter 2.0. Journal of Clinical Epidemiology, 67(7), 745–753. https://doi.org/10.1016/j.jclinepi.2013.11.013
Dworkin, R. H., Turk, D. C., Farrar, J. T., Haythornthwaite, J. A., Jensen, M. P., Katz, N. P., Kerns, R. D., Stucki, G., Allen, R. R., Bellamy, N., Carr, D. B., Chandler, J., Cowan, P., Dionne, R., Galer, B. S., Hertz, S., Jadad, A. R., Kramer, L. D., Manning, D. C., Martin, S., … Witter, J. (2005). Core outcome measures for chronic pain clinical trials: IMMPACT recommendations. Pain, 113(1–2), 9–19. https://doi.org/10.1016/j.pain.2004.09.012
Mokkink, L. B., Terwee, C. B., Patrick, D. L., Alonso, J., Stratford, P. W., Knol, D. L., Bouter, L. M., & de Vet, H. C. W. (2010). The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. Journal of Clinical Epidemiology, 63(7), 737–745. https://doi.org/10.1016/j.jclinepi.2010.02.006
Schulz, K. F., Altman, D. G., & Moher, D., for the CONSORT Group. (2010). CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials. BMJ, 340, c332. https://doi.org/10.1136/bmj.c332
