DSM-5 Inter-Rater Reliability is Low


There’s an article by Jack Carney, DSW, on this topic on Mad in America.  Jack refers to the DSM-5 field trials published earlier this year in the American Journal of Psychiatry.

Inter-rater reliability is measured by a statistic called a kappa score.  A score of 1 means perfect inter-rater agreement; a score of 0 means agreement no better than would be expected by chance.  In psychosocial research a kappa score of 0.7 or above is generally considered good.
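For readers who want to see how the statistic works, here is a minimal sketch of one common version, Cohen's kappa for two raters.  The function name and the ratings are invented for illustration; they are not field-trial data, and the field trials may well have used a different variant of kappa.  The point is only that kappa discounts the agreement two raters would reach by chance.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases (illustrative helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of cases")
    n = len(rater_a)

    # Observed agreement: fraction of cases given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability the raters would match if each handed out
    # labels independently at his or her own observed rates.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_chance == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_chance) / (1 - p_chance)


# Invented ratings for ten people (NOT field-trial data).
rater_1 = ["MDD", "MDD", "MDD", "MDD", "MDD", "none", "none", "none", "none", "none"]
rater_2 = ["MDD", "MDD", "MDD", "none", "none", "MDD", "none", "none", "none", "none"]

kappa = cohen_kappa(rater_1, rater_2)
print(round(kappa, 2))  # prints 0.4
```

In this made-up example the two raters agree on 7 of the 10 people, but half of that agreement would be expected by chance alone, so the kappa score comes out at only 0.4.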

Only one DSM-5 “diagnosis” was higher than 0.7 in the field trials.  This was major neurocognitive disorder (essentially dementia).  Major depressive disorder was 0.32; antisocial personality disorder was 0.22; obsessive compulsive disorder was 0.31; and so on.  Even schizophrenia, the flagship “diagnosis,” scored only 0.46.  You can see other values in Jack’s article.

What this means is that in the field trials, if one psychiatrist “diagnosed” a person with major depression, for instance, another psychiatrist was quite likely to come up with another “diagnosis.”  They weren’t consistent.  And remember, people participating in field trials are on their best behavior.  They probably studied the new criteria, and were very conscious of the fact that their findings were being checked and scrutinized.

Most psychiatrists in their offices, I venture to predict, will buy DSM-5, glance at the changes, and put it on the shelf.  Their inter-rater agreements will likely be lower.


This is important, because the APA continues to push the notion that the manual is based on solid science.  In fact, it isn’t, and never has been.  Its purpose is to create the appearance of science, and to provide an umbrella under which psychiatrists can do pretty much whatever they like.

Here’s a little known quote from DSM-IV.

“The specific diagnostic criteria included in DSM-IV are meant to serve as guidelines to be informed by clinical judgment and are not meant to be used in a cookbook fashion.  For example, the exercise of clinical judgment may justify giving a certain diagnosis to an individual even though the clinical presentation falls just short of meeting the full criteria for the diagnosis as long as the symptoms that are present are persistent and severe.” (p xxiii)

In lay circles this is known as having your cake and eating it too.  Or perhaps it could be called “fuzzy science.”


The poor inter-rater agreement is a serious problem, but as an issue it needs to be kept in perspective.  One could have 100% agreement in this area and still be talking utter nonsense.  For instance, suppose I were to form a society for the detection and prosecution of witches.  We have a meeting and decide that we need to have hard and fast criteria for identifying these wicked ladies.  So we get a panel of experts (which is easily achieved by shaking a big box of money).  The experts draw up a list of identifying signs, each of which is sharp and unambiguous.  Personally, I’m no expert on witchcraft, but I can imagine that they might produce items like:  extra digit on left hand; red birthmark on thigh; owns a black cat, etc.  Then, provided that each criterion is clear and precise, and that each rater sticks to the criteria, we will have 100% rater agreement.

But we’re still talking nonsense, because there’s no such thing as a witch.  And DSM is nonsense because there’s no such thing as a mental illness.

Actually, I’m surprised that the DSM-5 figures weren’t better, because it’s not very difficult to get good reliability.  Psychosocial researchers do it all the time.  In fact, you can’t really do good research without good reliability.  Suppose, for instance, you want to study violence in schoolyards.  You must first make sure that all your raters are on the same sheet of music when it comes to recording an incident of violence.  If one rater is recording pushing as an act of violence, but another is not, then clearly the research will be fundamentally flawed.

Which means that any research based on DSM-5 will, of course, be fundamentally flawed, but we knew that anyway, because the concept of mental illness is fundamentally flawed.

DSM-5 vs. DSM-IV

The agreement figures for DSM-5 are noticeably poorer than the figures for DSM-IV.  The likely reason for this is APA’s persistent desire to widen the net.  One way to do this is to make the criteria less precise, which inevitably means that different raters will apply them differently.

So what can the APA do now?  Will they have to scrap DSM-5 and start again?  No.  As I said earlier, it’s never been about science.  It’s about marketing.  My prediction is that they will either ignore the poor reliability matter, or spin it somehow into a positive feature.  For instance, they might try to promote the notion that psychiatrists are less concerned about excessive fastidiousness than with providing real help to real people.  If there’s one thing the APA is good at (and it may well be the only thing), it’s spin!


The last job I had before retirement was in a prison.  One of my major responsibilities was meeting with groups of prisoners, and facilitating discussions on subjects like anger, critical self-scrutiny, coping with conflict, etc.

One morning I was on my way to one of these meetings when I overheard a confrontation between a prisoner and an officer.  Apparently the prisoner had stolen a loaf of raisin bread from the kitchen, and the officer was giving him a hard time.  To which the prisoner replied, “If you people would give us enough food, we wouldn’t have to steal!”

I thought this was a beautiful piece of spin, but also that it was a mode of thinking that keeps people coming back into prison.

During the group session that morning, I mentioned the incident.  All the guys started to laugh, and one man at the back said in a loud voice: “flip the script.”  I asked them to explain, and the notion goes like this.  If you’re ever accused of wrongdoing, your first priority is to neutralize or deflect the accusation.  With the loaf of raisin bread, for instance, one could point out that it was stale, and that you were just saving the kitchen staff the trouble of throwing it away.  So an act of theft becomes an act of civic responsibility.  Or you can shift the “real” responsibility to someone else, which is what had been attempted in the incident I had witnessed earlier.  A third variation that the men mentioned was deflecting attention, and they gave as an example something like:  “Hey, a loaf of raisin bread is nothing.  I saw one of the senior officers backing his truck up here yesterday, and he took out a crateful of beef!”

Politicians are good at this sort of thing too.  One of the first rules of campaigning is that if you’re asked a difficult question, ignore it and answer a different question.  This is a variation of flip the script.

And there’s a beautiful flip the script in DSM-IV, which the APA published in 1994.  By that time there were rumblings of dissent in various circles with regard to the general concept of mental disorders/mental illnesses.  And the various DSM-IV committees had to be aware of this.  In their introduction to the revision, they might have addressed this matter, but they didn’t.  Instead, they talked about reliability (i.e., inter-rater agreement) and in a notable display of self-congratulation, they proclaimed:  “more than any other nomenclature of mental disorders, DSM-IV is grounded in empirical evidence,” and the reader is referred to a five-volume sourcebook of research findings.

But the thornier question about the ontological status of these disorders was deflected with a single sentence.  “The need for a classification of mental disorders has been clear throughout the history of medicine, but there has been little agreement on which disorders should be included and the optimal method for their organization.”  This is called preemptive strike flip the script, and my guys back at the prison would have been proud of the APA!


We’ve heard a great deal in the news lately about the Higgs boson.  I’m no expert on quantum physics, but I understand that this elusive particle is very important to physicists, who had expressed the belief that it exists way back in 1964.  If it didn’t exist, they could think of no other way to explain the existence of mass.  So they were very attached to the idea, but like true scientists, they refused to just take it for granted.  They insisted that its existence had to be verified experimentally.

Well, they built this enormous underground circular tunnel on the Swiss-French border (1998-2008), and for four years drove sub-atomic particles round this at close to the speed of light.  They arranged for them to crash into each other and all sorts of other stuff.  Until finally – a few weeks ago – they found the Higgs boson!  Well – tentatively.  They still have some minor reservations, and work continues, but it looks very promising.

What I can’t figure out is:  why didn’t they just get together and take a vote, the way the APA does?  It would have saved a lot of time and money.

  • no one important

    Science vs $cience…

    Thank you for the explanation of what is going so horribly wrong. You are like a beacon of relevance in a sea of adverts.

  • Phil_Hickey

    Thanks for the encouragement

  • Falco

    The DSM-5 is pushing to use scales of severity for every disorder instead of a present/absent checklist of symptoms. That approach is called dimensional assessment.

    When Dr. Darrel Regier (Vice-Chair of the DSM-5 Task Force) presented an overview of DSM-5 updates at the 26th Annual “U.S. Psychiatric and Mental Health Congress,” he said:

    “The DSM-5 is not a cookbook—it’s not meant to be just a checklist.” (Previous DSMs were cookbook and cookie-cutter approaches.) “Clinicians should familiarize themselves with dimensional assessment, defined as assessment of factors not necessarily included in the diagnostic criteria but of high relevance to prognosis and treatment planning for most patients,” said Dr. Regier.


    Dr. Darrel Regier said: “We’re recommending as a change that you don’t just have the number of criteria and duration, but actually have some kind of severity measure for every disorder.”

    To some degree, doctors are already assessing patients this way. “It’s not new in practice,” says Dr. David Shaffer (Columbia University Medical Center). Doctors treat patients who need help even if their symptoms don’t add up to a clear-cut disorder. And, he says, researchers and people testing treatments already use scales of severity to describe the patients in their studies. “So the goal is to somehow bring that into the diagnosis.”

    Shifting to dimensional assessment vs. a checklist approach seems like a move to mirror — and quantify — what’s happening in practice. But it’s not that simple, says Dr. Michael First (an expert on psychiatric diagnosis and assessment issues).

    Dr. First agrees a checklist of categories doesn’t perfectly reflect mental disorders, but he argues that adding dimensions to the DSM isn’t a good idea…. “There’s been a history of trying out dimensions and then getting rid of them for lack of use,” and “there is no evidence that systems using measures of severity get better outcomes,” he says.


  • Phil_Hickey


    Thanks for coming in.

    There was actually a suggestion to adopt a dimensional system in DSM-IV (p. xxii). The suggestion was rejected because “…dimensional descriptions are much less familiar and vivid than are the categorical names for mental disorders.”

    In my view, they rejected the dimensional approach because “generalized anxiety disorder” sounds like an illness, but “anxiety, multi-stimulated, level 2.8” does not. The primary purpose of the DSM is to promote the fiction that “mental disorders” are real illnesses, and that psychiatrists are real doctors.

    Best wishes.

  • A. C.

    Thank you for writing this valuable piece of truth. It is like a dandelion seed that will be spread by many in order for truth to grow.

  • Phil_Hickey


    Thanks for writing, and for your encouragement.

    Best wishes.