Thursday, 3 January 2013

Evidence and Sentencing

   Ben Goldacre’s Radio 4 programme on January 2nd 2013 rightly argued that the outcomes of public policy should be much more rigorously tested in advance of implementation and more thoroughly evaluated afterwards. Ben sees a much greater role for randomised controlled trials (RCTs) in identifying the most effective measures for producing a given outcome, and even for checking that certain practices do not in fact aggravate the problem they are designed to alleviate: RCTs revealed that, despite their appeal, Scared Straight programmes, which take young people into adult prisons, increased rather than reduced future delinquency.

There is no doubt that RCTs can show to a high standard of proof whether something works, and they should be encouraged as a way of evaluating interventions which have clear objectives: curing disease or relieving symptoms in medicine, or teaching children to read and write. In criminal justice, RCTs have played an important role in showing the value of psychological treatment programmes, Restorative Justice and strict enforcement of probation.

Where I consider RCTs to be more problematic and less useful is in relation to sentencing. By this I mean the decisions made by judges and magistrates to dispose of individuals convicted of criminal offences. The underlying question is what would RCTs be testing?

For Ben the answer is simple. According to a piece he wrote in the Guardian in May 2010, he thinks that a judge is in “the exact same position as a doctor making a decision on a patient’s treatment ... choosing an intervention for an individual in front of them with the intention of producing a particular set of positive outcomes”, which Ben sees as reduced crime and reduced drug use.

Were this true, RCTs might help to identify the best treatment approach for particular types of individual offender, although the record of so-called “what works” programmes has been somewhat less than hoped for. Unfortunately, treatment is only one aspect of sentencing. In our legal system, the sentence imposed on an offender must reflect the crime committed and be proportionate to the seriousness of the offence. In reaching their decision, courts must have regard not to one but to five purposes of sentencing. Giving effect to one of these purposes – the reform and rehabilitation of offenders – could well be informed by the findings of RCTs. Ways to achieve another – the reduction of crime – could also be identified through research, although this would be complicated by the inclusion of reduction by deterrence: the impact is not simply being sought on the individual culprit but on potential culprits of the future.

As for the other three purposes – the punishment of offenders, the protection of the public and the making of reparation – it might be possible to construct experiments to assess how well particular sentences served them. But we suddenly seem to need an awful lot of RCTs; more importantly, how useful would their findings be in assisting judges and magistrates to sentence?

The law does not indicate that any one purpose of sentencing should be more important than any other, and in practice they may all be relevant to a greater or lesser degree in any individual case. As the Sentencing Council, which issues guidelines to courts, says, “the sentencer has the task of determining the manner in which they apply”. The key decision is thus a normative one – which purpose or purposes are relevant? So yes, RCTs could tell us something important about the relative effects of components of the execution of sentencing. But the big sentencing questions – should I send this defendant to prison or not, should he get three years or five years – would remain relatively unaffected by this kind of evidence.

Even if the results might be useful, there is a more specific problem relating to the ethics of mounting RCTs in relation to sentencing. For the sake of argument, let’s agree that a judge, setting aside the other factors that he or she must consider, wants to impose the most effective rehabilitative sentence on a heroin-dependent offender. Surely an RCT might indicate whether this might best be achieved by sending the offender to a community-based drug treatment programme or to prison. But how would one mount such a study? According to Ben’s paper for the Cabinet Office, the methodology would “assign each unit to one of the policy interventions, using a robust randomisation method”, before introducing the policy interventions to the assigned groups. I doubt judges would or could agree to different sentences being assigned to defendants in this way.
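To make the quoted methodology concrete: the allocation step it describes can be sketched in a few lines. This is a minimal illustration only, not the procedure from the Cabinet Office paper; the function name and arm labels are my own.

```python
import random

def randomise(units, arms, seed=1):
    """Shuffle the units, then deal them out to the arms in turn,
    so group sizes differ by at most one. A fixed seed makes the
    allocation reproducible for audit."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {unit: arms[i % len(arms)] for i, unit in enumerate(shuffled)}

# Illustrative use: 100 offenders split between two (hypothetical) arms.
assignment = randomise(range(100), ["community drug treatment", "custody"])
```

The point of the sketch is how little discretion it leaves: each defendant’s arm is fixed by the draw, which is precisely what a sentencing judge could not ethically delegate.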

Peter Neyroud has suggested that there are several studies in the US and Australia which have used RCTs to study sentencing. The one he referenced – the study of HOPE Probation in Hawaii – did involve random allocation of probationers to a special programme with strict enforcement on the one hand and probation as usual on the other. This was not a decision made by a court, however: all of the offenders had been sentenced and were assigned to different modes of probation afterwards. It is hard to see how a judge could ethically impose sentences in the random way required by an RCT. The execution of the sentence can, as in the HOPE example, be varied up to a point for experimental reasons, but not in my view where there is a question about deprivation of liberty. The European Prison Rules make it clear that “Prisoners shall not be subjected to any experiments without their consent.” The objection is clearer if one thinks about an RCT comparing the effectiveness of short, medium and long prison sentences.

A tetchy exchange with Ben on Twitter did not allow us to explore the fact that the disagreement may in one sense be over a narrow point. I agree there should be more research, including RCTs, on different approaches to addressing offending behaviour. I also think there needs to be much more research about sentencing, and in particular the cost-effectiveness of imprisonment. This is a deeply contested question, reflecting profound disagreements about the impact which prison has on the future behaviour of individual offenders and on the wider community. A recent review of evidence by the Ministry of Justice found that “To date there has been no clear consensus from criminologists and commentators about whether there is an incapacitation effect at all, and if so, its scale.” However, I am not sure how easy, useful or morally acceptable RCTs would be either in settling this question or in relation to the specific decisions made by judges. I quite see the case for trials of programmes and interventions which can be applied in the execution – as opposed to the imposition – of sentences.


  1. Good observations Rob. It's worth reading the BMJ piece he wrote on this with Sheila Bird, and the Cabinet Office paper on the same issue.

    In terms of the multiple outcomes - I don't think this is necessarily that different to medical RCTs - where there will generally also be multiple outcomes - nor is it a limiting obstacle. There are obvious complexities to be considered in devising useful and methodologically sound trials - as there will be challenges in interpreting the data they produce. But having data on multiple outcomes from two (or more) randomly allocated sentencing options doesn't mean the data can't be very useful - even if the outcomes show conflicting levels of effectiveness on different outcomes. Increasing levels of complexity can also be built into RCTs by subdividing sentencing options.

    I agree with you that the ethics of prison as one of the sentencing options needs to be addressed (I think a similar issue exists with getting a criminal record or not) - but I'm not sure alternatives to prison trialled alongside conventional prison sentencing policy are necessarily unethical (i.e. trialling less harsh sentencing options could be OK even if more harsh sentencing isn't). I'm sure Alex Stevens would have a view on this.

    One important issue I think Ben has raised is that the context of such ethical questions is that (often) judges are making sentencing decisions in the absence of evidence in the first place. The implication is that experimenting to find which of a range of available options delivers the best outcomes (whatever our priority KPIs may be) is an essentially ethical undertaking in broad terms.

    On the issue of deterrence - this is very hard to measure anyway, especially given the wide variation in sentencing that already exists, and I would suggest almost impossible in terms of a given sentence to an individual. Searching the literature on deterrence in drug sentencing in particular produces almost nothing useful anyway - so almost any data would be better than what we have now. As an aside, I think highlighting the paucity of support for any deterrent effect in drug sentencing would be a useful way of challenging the 'tough on drugs' political narrative - given the centrality of the deterrence myth within it.

    I think Ben is essentially pushing the point you make in your opening sentence and highlighting RCTs as one potentially useful, and currently under-utilised, tool. The debate about specifically where and how RCTs could be used is the next step after the acknowledgement that most sentencing is not evidence-based. It's a debate Ben's keen to engage in.

  2. Thanks for taking up my suggestion of writing a blog; I don’t think the views of you and your colleagues came across well on Twitter at all.

    You seem concerned that RCTs would lead to sentences that fail to reflect the seriousness of the crime, or deliver adequate punishment (“In our legal system, the sentence imposed on an offender must reflect the crime committed and be proportionate to the seriousness of the offence”, etc.).

    But there is already variation in sentencing. Drug Testing and Treatment Orders were introduced as an alternative to custodial sentences. The decision to introduce such change, such variation, had nothing to do with anyone running a randomised controlled trial on DTTOs (indeed, no such trial has ever been run). It is happening anyway. Nobody is suggesting that designing an RCT should involve inventing some wild new form of sentencing. A randomised trial simply introduces a proposed change in sentencing – a change that has already been agreed by society in principle – in a structured fashion to see if it achieves its stated objectives.

    As Steve Rolles above says, there is no problem with measuring several outcomes (as long as the statistics account for the fact that you’re measuring multiple outcomes, which is simple). It’s perfectly normal and healthy to measure multiple outcomes, and discuss both before and after a trial which outcomes are more important. This happens all the time in medicine, as Steve says.
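The adjustment for multiple outcomes that Steve and Ben refer to really is simple; as one illustration (Bonferroni is only one of several standard corrections, chosen here because it fits in a few lines), a trial measuring reconviction, drug use and employment might apply it like this. The outcome names and p-values are invented for the example.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag an outcome as significant only if its p-value beats
    alpha divided by the number of outcomes tested, so the chance
    of any false positive across all outcomes stays below alpha."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Hypothetical p-values for reconviction, drug use, employment:
results = bonferroni([0.004, 0.03, 0.20])  # threshold = 0.05 / 3
# → [True, False, False]: only reconviction survives the correction
```

Nothing about having three outcomes rather than one stops the trial being run; it only changes the bar each result must clear.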

    You say that you doubt judges would agree to participate in an RCT. I don’t doubt that. This is one thing that needs to change: judges need to reflect on the lack of evidence for whether these sentences really do achieve the stated objectives of judges, politicians, and society. In my view it is profoundly unethical of judges (and doctors, when faced with similar situations) to fail to reduce this uncertainty, when the opportunity to do so is present. When we practise in ignorance, we can end up doing harm to individuals and society.

    I agree with everything Steve has said, above.

    Ben Goldacre

  3. For our research on the DTTO (in the QCT Europe project), we would have loved to do a randomised study. But this was never a serious possibility. The policy discussions in advance of the policy presented it as THE answer to drug-related crime, drawing on (non-randomised) studies of US drug courts. Its effectiveness was not treated as an open question.

    There are several examples of criminal justice interventions that could and should have been piloted through cluster randomised trials, including the Drug Intervention Programme, the prison Integrated Drug Treatment System and, most recently, Drug Recovery Wings. In each case, money has been/is being spent on designs that were/are doomed not to answer the principal research question by their lack of randomisation.

    The DTTO was a slightly different case, as it was presented as an alternative to imprisonment (the examples above are in the implementation - not the imposition - of sentencing). Randomisation is ethically preferable in situations of equipoise; ‘unless you know the effect of what you’re doing, how do you know that you’re not doing harm’. But studies randomising to imprisonment are not in equipoise. We know that imprisonment is harmful to the recipient; it is meant to be. The alternative, experimental condition (e.g. the DTTO) may have unknown harms and benefits, but we know that it has the benefit to the recipients of not depriving them of their liberty and private life. So we could only ethically justify randomisation to the experimental condition if we are sure that the participants would definitely have otherwise gone to prison. The problem was, with the DTTO and many other purported alternatives to prison, that it acted as an addition, not an alternative. The number of people sent to prison did not decrease as DTTOs increased. So the existence of a pilot (even randomised) would be likely to increase the harm of imprisonment, unless there were very firm guarantees in place.

    Even if randomisation could be done on CJ initiatives, there are well known problems of politics and ideology that would hamper the implementation of the resulting knowledge. One that should be better known is the control that the Home Office keeps of the intellectual property that results from the evaluations it funds. When results have not been in the desired direction, some researchers have found it very difficult to get permission to publish their work.

    So I agree with Ben and Steve that there should be more randomised studies in criminal justice, but also with Rob that there are a host of complexities to consider.

  4. Agree with the broad point that there is a problem concerning the weak evidence base within criminal justice. But I don't think RCTs are a solution to that problem. First, I think the Pawson-Tilley critique remains unanswered - that in evaluating complex social interventions, we need to grasp the (messy) ways in which different contexts shape implementation and patterns of outcomes. This is simply not how RCTs conceptualise 'interventions'. Second, as Rob pointed out, in relation to sentencing, decisions are not made solely on instrumental grounds. Sentencing also has non-instrumental, expressive purposes. So even if randomised studies could deliver on their promise to provide gold-standard evidence of effectiveness (which they can't in this area), they would still only be addressing half of the matter for sentencers.

  5. CJ interventions may well take place in messy contexts, but I don't think the Pawson/Tilley critique rules out the use of RCTs in this field. We do need to make sure we gather and publish information on context, process and 'generative mechanisms' of the evaluated interventions. But without some form of rigorous comparison group, questions on the causal effect of the intervention - even in the context of delivery in the evaluation - will remain unanswered. There is a current fashion to use propensity score matching for this, but I am concerned that this often gives an unrealistic impression of the quality of the comparison, given the inevitable presence of unobserved, unmatched confounders.

  6. I don't rule out RCTs in CJ either - I just think they're not terribly fit for purpose in that field and unlikely to be that significant in terms of improving the evidence base.

    I can't say either way whether PSM is a useful technique or not but my (limited) understanding of it is that actually no cases are unmatched. Others may be able to clarify this.

  7. "There is a current fashion to use propensity score matching for this, but I am concerned that this often gives an unrealistic impression of the quality of the comparison, given the inevitable presence of unobserved, unmatched confounders."
    PSM is only as good as the data being used for the match, but if good data are used (e.g. relevant to the chances of receiving a given 'treatment') then good estimates of realised effects can be obtained. There are precedents in CJS of using PSM for things which were not randomised, e.g. ETS.

    @Toby - matching can be done either on common support or not, meaning that one can choose whether to match everyone, or only those with similar predicted chances of receiving the intervention. Common support is regarded as the better approach. (See the Sadlier report for MoJ.)
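The common-support point above can be shown in miniature. This sketch assumes propensity scores have already been estimated by some model (in practice usually a logistic regression on pre-treatment covariates) and does the simplest possible nearest-neighbour match; the unit labels and scores are invented, and real PSM implementations add calipers, matching with or without replacement, and balance diagnostics.

```python
def match_on_common_support(treated, controls):
    """Nearest-neighbour matching on pre-estimated propensity scores,
    restricted to the region of common support: treated units whose
    score falls outside the range of control scores are left unmatched
    rather than matched to a dissimilar control.
    `treated` and `controls` map unit id -> propensity score."""
    lo, hi = min(controls.values()), max(controls.values())
    matches, unmatched = {}, []
    for t_id, t_score in treated.items():
        if lo <= t_score <= hi:  # inside common support
            matches[t_id] = min(controls, key=lambda c: abs(controls[c] - t_score))
        else:                    # no comparable control exists
            unmatched.append(t_id)
    return matches, unmatched

# Invented scores: C has no comparable control, so it is dropped.
treated = {"A": 0.82, "B": 0.40, "C": 0.95}
controls = {"x": 0.35, "y": 0.70, "z": 0.90}
matches, unmatched = match_on_common_support(treated, controls)
# → matches {"A": "z", "B": "x"}, unmatched ["C"]
```

The dropped unit "C" is exactly Toby's worry in one line: with common support you get a fairer comparison at the price of estimating the effect only for the units you could match.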