Many of y’all know that I have been using TherapyNotes as our practice EHR for over 10 years now. I’ve looked at others, and I just keep coming back to TherapyNotes because they do it all. If you’re interested in an EHR for your practice, you can get two free months of TherapyNotes by going to thetestingpsychologist.com/therapynotes and entering the code “testing.”
This podcast is brought to you by PAR.
Use the Feifer diagnostic achievement tests to home in on specific reading, writing, and math learning disabilities and figure out why academic issues are occurring. Learn more at parinc.com/feifer.
Hey, everyone. Welcome to The Testing Psychologist [00:01:00] podcast. I’m here with a clinical episode today, a clinical topic anyway. We’re talking about overinterpreting our data, which is a problem that a lot of us might be aware of. Some of us certainly practice accordingly based on best practices, but a lot of us forget, and it’s easy to fall into the temptation to overinterpret data when we don’t necessarily have the statistical grounding to do so.
So my guest today, Dr. Ulrich Mayr, is going to talk with me all about that. He is the Robert and Beverly Lewis Professor of Neuroscience at the University of Oregon, where he was department head for nearly 10 years and does NIH- and NSF-funded research on cognitive functioning and decision-making across the adult lifespan. He has also been editor-in-chief of the scientific journal Psychology and Aging.
While his research is on the basic science of cognitive functioning, his partner runs a psychological testing [00:02:00] practice, which often leads to fantastic conversations where the theory and the pragmatics of assessment clash in interesting and often productive ways.
So this is a good example of that, today’s conversation, where we tried to marry the neuroscience, the mathematics, the statistics behind test development and measurement with clinical practice, and bring it home and give some suggestions for what we as clinicians can do given that, as you’ll hear, a lot of the measurement and scores from our batteries cannot be interpreted or generalized the way that we think they can.
So we talk about a lot of different things. We do some basics on measurement and test development. We talk about what we can pull from the data reliably. We do get into some math and a little bit of statistics around reliability and so forth. So there’s a little something for [00:03:00] everyone.
And then we conclude with a discussion of, given the situation that we have right now and the measures that we have, what can we do to adhere to best practices with interpretation and gathering the data that we want to gather? Fascinating conversation. So stay tuned and I hope that you can take some things away from this discussion with Dr. Ulrich Mayr.
Ulrich, hello. Welcome to the podcast.
Dr. Ulrich: Hello. I’m very happy to be here.
Dr. Sharp: Thank you for being here and willing to dive into what seems like a relatively complicated but important topic for those of us who are doing testing. We haven’t visited this topic in a long time, so I’m grateful to [00:04:00] have you.
Dr. Ulrich: I believe it’s important. Nothing I’m saying is completely new to most people, but it deserves repeating every once in a while.
Dr. Sharp: I totally agree. I think it’s one of those things that we probably learn to some degree in graduate school, revisit periodically, but ultimately forget about, honestly, in the day-to-day work that we do because there’s a lot of cognitive dissonance if we were to fully confront it.
Dr. Ulrich: There is, and the testing manuals invite you to go down these routes that are not always completely kosher.
Dr. Sharp: Right. Oh, yes. We got a lot to get into, but I’ll start with the question that I always start with, which is, of all the things that you could spend your time and energy on in your life, why care about this topic so much?
Dr. Ulrich: It’s a little bit of a hobby of mine. From my actual profession, I’m a cognitive neuroscientist. I do [00:05:00] care a lot. That’s my actual interest, the building blocks of the mind. I’m interested in how to measure executive control functions. That’s basically what I spent my career on.
And so I come at this from a basic science perspective, from a measurement perspective, that is what I know, that is what I do for a living.
I should also make that clear, except for a very short stint as a student intern in a psychiatric hospital in Munich, I never actually tested patients. So, why am I here? Well, it’s mostly thanks to my wife and life partner, who is a testing psychologist. She has her own practice and specializes in testing and diagnosing ADHD and related syndromes.
We very frequently have these dinner conversations that I really enjoy, where she presents a little bit of what they call a case [00:06:00] round: she describes somebody who showed up in her practice with a unique profile and asks what to do with it, how to interpret it.
In these conversations, it often becomes clear that there is a bit of a tension between what appears to be regular practice among testing psychologists in how to interpret these profiles and test results, including what the handbook suggests you should do, and what, from a more basic science side, where you recognize the method’s larger constraints, would seem allowable as safe and sound inference.
These discussions have been interesting for both of us. I’ve been able to keep her from going down some rabbit holes every once in a while, but it also got me to think more seriously about how to use what I know [00:07:00] most productively: not just saying, no, you cannot do this, but, maybe this is how far you can go given what we know. Combining this relatively restrictive, pessimistic view with getting to a yes every once in a while is what I’m trying to do.
Dr. Sharp: People ask me sometimes, because my wife is also a therapist, and people will say, oh my gosh, your conversations must be so fascinating. I’m like, well, they’re pretty boring. But I get the sense y’all have some of the same thing going on with these conversations that …
Dr. Ulrich: We both started on the science side and then she at some point transitioned. So we have this common interest in the basic issues that …
Dr. Sharp: Yes. That’s always nice.
Dr. Ulrich: … people would think we are complete nerds.
Dr. Sharp: That’s totally okay. You’re doing it [00:08:00] together, and that’s the important thing, being nerds together.
Let’s start, this is super important. I want to lay a little bit of groundwork just for folks who maybe haven’t tapped into this in a while or have forgotten or whatever it may be. Maybe we start talking about some of the limits of our current testing methods to provide some context. We could start with this question of, what are some of the inherent limitations of the cognitive tests or neuropsychological tests that we’re using?
Dr. Ulrich: It comes down to one fundamental issue. I’ll start with the top-line conclusion. Let’s take the Wechsler, since that’s the one that gets discussed a lot in our household, so I’m going to work with that.
These test batteries provide two categorically different types of information. The first is [00:09:00] the general level, which is best captured in the Full Scale IQ. That is highly reliable, very meaningful, and can be used pretty much as advertised. So we have no beef with that. That’s the good news.
Then the battery offers the tap dancing of scores around that mean level: the strengths and weaknesses, the differences between the indices. I’ll just subsume those under the label of profile-based scores. Those are almost always misused and should be treated with the greatest caution. I can back that up empirically.
I want to highlight that none of what I’m saying here is new. There are other people who have researched this. I particularly went [00:10:00] back and read a book by a professor from Baylor called Marley Watkins, who spent much of his career addressing these issues.
In one of the studies that he reports, 400 participants are tested on the Wechsler. He then uses the handbook-based rationale to pick out, for each individual, the strengths and weaknesses and the critical differences in the index scores that you would reasonably interpret if this were a patient in your practice. Then the same group of participants is tested again, I think 2.5 years later.
And so now you can ask: if you identified this particular weakness and this particular strength for a participant, will they show up again 2.5 years later? You might be interested in that. That is [00:11:00] something that you want to see, because you’re not just making an inference about this individual for right now; you hope you capture something more general about that individual.
The bad news here is that the reliability of these inferences was essentially zero. That’s something to grapple with. If you take this one result seriously, and there are others, it means you went through this whole process of identifying profile-based scores and generated, in the end, meaningless information.
People may make recommendations and placement decisions on the basis of this information. And so that is something that needs to be taken seriously.
[00:12:00] Dr. Sharp: I completely agree. Yes. I would guess that at this point, about 5 minutes into our interview, the entirety of the audience is completely freaked out and wondering what we are doing with our careers. I’m kidding.

Dr. Ulrich: I do want to go back. There is still the mean level score, which is a completely reasonable piece of information. In my understanding, that is what most people start with. That said, I do think it’s important to understand why these profile scores are so problematic.
Dr. Sharp: I think that’d be a good place to go. Just establishing, we know that the Full Scale IQ is largely stable. We can rely on that. We can make some inferences from that. But you’re [00:13:00] saying anything else within that, the index scores, the strengths and weaknesses, those are not going to be reliable over time from what we know.
Dr. Ulrich: Yes. The paradoxical aspect in all of this is that it’s exactly the strength of the overall score that harms the degree to which you can interpret the individual, profile-based scores.
Dr. Sharp: Ooh, say more about that. What do you mean when you say it’s a strength of the overall score that harms the others?
Dr. Ulrich: There are different ways in which you can develop this. Let me try coming at it from different corners because, especially without a whiteboard where I can draw some patterns, getting this across is not a trivial issue. So I hope that listeners stay with me here.
[00:14:00] One thing we already established is that the overall general ability factor is very strongly expressed in these test batteries. All test batteries have this in common. So this is a piece of information that is inherently independent of the profile-based wiggling around the mean. You can take that information out. For example, if you have a bunch of profiles in front of you and you subtract the mean level out of each one of them, they collapse onto each other. The wiggling is still there, but you have taken out the mean level. That demonstrates that you can treat these as completely independent pieces of information.
The problem is that when you look at the reliability of a particular index, let’s say the Working Memory Index, [00:15:00] it’s highly reliable in itself, but because so much of that Working Memory Index score is driven by general ability, most of the reliability in that working memory score also depends on general ability. Once you’ve taken that out, there’s much less reliability left for the specific score that you might then use to detect weaknesses and strengths.
You can put numbers on that. The reliability of the Full Scale IQ is very high, I think something like 0.93 or 0.94. Once you go down to an index score and take out the reliability of the general ability factor, which you can do [00:16:00] mathematically, the remaining reliability is somewhere between 0.2 and, if you’re lucky, 0.6. That’s not a good range to be in for drawing useful information about individual patients.
We usually say that we want to have at least 0.8 reliability to draw inferences about individual people. And so that’s basically what you have to work with when you are dealing with profile scores. That’s the crux of it.
It’s an annoying problem. I feel for the test designers because a test designer wants to have reliability. The best way to get reliability is to saturate all the different tests with general [00:17:00] ability. That’s how you drive up overall reliability, yet that same process gets in your way when you want to interpret the individual scores.
To put it differently, assume the Wechsler battery were one where there was no relationship between the individual index scores, where they were completely unrelated. Not that we necessarily want that, because then you don’t have a general ability score anymore, but that would be a case where nobody would ever have a problem with interpreting profiles and differences, because now all of the information is actually in the individual index scores, and you don’t have to worry about that problem anymore.
So the more you have the individual index scores related to general ability, the less you have to work with in [00:18:00] terms of interpreting the wiggling of the profile.
Dr. Sharp: The way that you described it to me during our pre-interview chat really resonated. Maybe we can dive into that a little bit. You framed it like a pie chart where, like you said, about 50% of the pie is occupied by g or the general ability.
Dr. Ulrich: That’s one of the other ways to get at it. Every index score, every subtest, contains a bucket of information. You can think of a bucket; you can think of a pie chart. The pie chart is the whole amount of information. Once you remove the information that is specific to the general factor, that typically removes about 50% to [00:19:00] 80% of the overall pie chart. 80% might be exaggerating a little; say up to 60% or 70%.
To give you one concrete example, I just looked this up: the correlation between the reasoning index score and Full Scale IQ is 0.8. You square that to get the common variance, so it’s 0.64. So 64% of the pie chart that belongs to the reasoning score is taken up by the Full Scale IQ.
Once you take that out, there’s only a small sliver of the pie chart left. That is potentially what you can work with in terms of identifying individual strengths or weaknesses. However, not all of that is pure, meaningful information. At least half of it is likely to be measurement error.
And you don’t know which part of that [00:20:00] it is. You have a relatively small and unknown quantity of meaningful information left to work with to establish profile-based scores. Does that help?
Dr. Sharp: Yeah, it does. I’m a super concrete person, and I think the visual does help. And in the absence of a whiteboard, I’m just going to belabor this a little bit to try to cement it for folks.
So thinking about this pie chart, like we said, let’s just call it 60% of the pie chart that is eaten up by g. We take that away, and of the remaining 40%, you said half of that is measurement error or noise, give or take. So that leaves us with roughly 20%, a little more or a little less, that’s the ability we think we’re measuring.
Dr. Ulrich: Yes. And you don’t know which part of that pie chart is the [00:21:00] meaningful part and which is not. There’s no way to determine that anymore, because you’ve already taken out the reliably measurable information by taking out g, which we know we can measure with high reliability. So it leaves you with not much to work with.
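To make the pie-chart arithmetic concrete, here is a minimal sketch. The numbers are illustrative assumptions in the spirit of the discussion, not values from any Wechsler manual; the rigorous version of this decomposition is what the omega hierarchical coefficient (discussed later) formalizes.

```python
# Illustrative decomposition of one index score's variance (assumed numbers,
# not from any test manual). Reliability = share of variance that is not error;
# squaring the correlation with g gives the share that is general, not specific.

total_reliability = 0.92     # assumed published reliability of the index
r_with_g = 0.80              # assumed correlation of the index with g / FSIQ

shared_with_g = r_with_g ** 2                                # 0.64 of the pie
error = 1.0 - total_reliability                              # 0.08 of the pie
specific_and_reliable = total_reliability - shared_with_g    # 0.28 of the pie

print(f"shared with g:          {shared_with_g:.0%}")
print(f"measurement error:      {error:.0%}")
print(f"reliable AND specific:  {specific_and_reliable:.0%}")
# Only that last sliver is available for profile interpretation, and with
# less optimistic inputs it shrinks toward the size of the error slice.
```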
It sometimes gets hard to understand why this is a problem, particularly given that an optimistic estimate of the reliability of this remaining piece that we would like to work with is maybe around 0.5. That is not nothing. There is meaningful information there, and it can be used in certain settings, completely legitimately.
Let’s say you do a scientific study where you have a group of people, let’s say with ADHD, and a group of controls, and now you want to compare profiles. The 0.5 reliability of each of the [00:22:00] individual index scores is sufficient to detect differences between groups.
So potentially meaningful information about cognitive functioning in this type of setting where you compare groups with each other can be derived from profiles. So there’s enough information there to do that. It’s unfortunately just not enough in most cases, except for some exceptions that we might talk about later, to draw inferences about individuals.
That’s the problem: the jump from a group comparison to the case in clinical practice, where you have just one patient in front of you at a time and need to draw inferences about that patient, not about a group of people. That’s where you start running into problems.
Dr. Sharp: I think that’s where people probably get tripped up. It’s one of those things that’s easy to cognitively know and understand but then hard to implement when we’re sitting [00:23:00] in front of patients and have that pressure to come up with something meaningful in the evaluation.
And so that to me leads to two areas of discussion. Maybe the first is just to dive into that a little bit more, if you can, and explain why those group-level differences don’t translate to individuals, just to make that super clear. And then it leaves the question of, are we overinterpreting with these individuals?
Dr. Ulrich: It’s just a question of measurement error. You can measure a group of individuals because you aggregate across individuals, and the measurement error shrinks. Whereas for one individual, it remains relatively large. And so with large measurement errors, you need very high [00:24:00] reliability to be able to draw inferences.
So that’s the crux of it. If something is imprecise, you need higher reliability; otherwise you just can’t draw inferences. Now, the second question is, are we overinterpreting? Probably we often are, if we use these profile-based scores.
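As a rough illustration of why the same reliability behaves so differently at the group and individual level, here is a small simulation sketch. All of the numbers and names are illustrative assumptions, chosen so the score has reliability 0.5 on a standard-score metric (mean 100, SD 15).

```python
# Group comparison vs. individual inference at reliability 0.5 (assumed setup).
import random
import statistics

random.seed(1)
SD_TRUE = SD_ERROR = 15 / 2 ** 0.5   # equal true and error SDs -> reliability 0.5

def one_measurement(true_mean):
    true_score = random.gauss(true_mean, SD_TRUE)
    return true_score + random.gauss(0, SD_ERROR)

controls = [one_measurement(100) for _ in range(300)]
patients = [one_measurement(92) for _ in range(300)]   # true 8-point deficit

# Averaging shrinks error by sqrt(n), so the group difference comes through.
print(f"group difference: {statistics.mean(controls) - statistics.mean(patients):.1f}")

# For a single person, the 95% band from measurement error alone is about
# 1.96 * SD_ERROR, i.e. roughly +/- 21 points: wider than the deficit itself.
print(f"individual 95% error band: +/- {1.96 * SD_ERROR:.0f}")
```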
I would like to add that there may be a way to get to a better place by being highly disciplined and understanding which of the scores you might be interested in interpreting can be assessed reliably. [00:25:00] There’s a second problem that comes in when you try to interpret profiles. That second problem is that you’re looking at all of these wiggling ups and downs and saying, oh, here’s something interesting. This looks high, or this looks low.
And so essentially what you’re doing is looking at all possible combinations of ups and downs at the same time. That ignores the fact that the confidence interval generated from a certain reliability is always meant for a single comparison.
Essentially, a confidence interval means that I am willing to accept a 5% error [00:26:00] of accepting something as a true difference that actually is not. That only works once. If you do it twice, the confidence interval has to increase so that you protect yourself, because every time you look, you add to the potential of making this error. So you have to adjust your confidence interval accordingly.
If you want to do that for a whole battery of 10 different tests and all possible configurations of differences, your confidence interval would have to increase so much that you basically leave no opportunity for finding any difference that is reliable and robust.
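The arithmetic behind this “every look costs you” point is easy to show. A minimal sketch, assuming independent comparisons each run at the conventional 5% level:

```python
# Chance of at least one spurious "significant" difference across k looks,
# each at alpha = .05 (independence assumed for simplicity).
alpha = 0.05
for k in (1, 2, 5, 10):
    p_any_false_alarm = 1 - (1 - alpha) ** k
    print(f"{k:2d} comparisons -> {p_any_false_alarm:.0%} familywise false-alarm rate")

# The rate climbs from 5% to about 40% at k = 10. A Bonferroni-style correction
# tests each comparison at alpha / k instead, which is exactly why the
# per-comparison confidence interval must widen as you inspect more of the profile.
```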
What that means is that the way to look at profiles is in a highly disciplined, priority-based manner. In a particular case, you might already have some [00:27:00] inkling, because of the history and the background information, that the patient may have a particular weakness in X; let’s say, just making this up, processing speed. Maybe you believe, based on the literature, that one diagnostic sign in people with ADHD may be slow processing speed. I’ve heard it said, but I don’t know whether it is true.
So you say, I want to confirm that this patient potentially has this diagnostic sign associated with ADHD, which is low processing speed. Then I constrain myself to a single inspection of the profile and say, I’m going to accept a potential drop in processing speed, if it’s big enough, as diagnostically meaningful information. [00:28:00] But I’m only going to do that once. I’m not going to sift through the whole profile looking for differences, because there’s so much opportunity for differences popping up randomly if you give them that much opportunity to show themselves.
Dr. Sharp: Yes.
Dr. Ulrich: So that’s one way to use the information about statistical limitations and how much we can learn potentially from such a profile and getting to some minimum allowable inferences from this type of information.
Dr. Sharp: Sure. This is important to talk about, for sure, and to make it super clear that this [00:29:00] is backwards reasoning, if you want to call it that: you come in with a hypothesis, and then you look at the data to test that hypothesis, versus just, hey, let’s see what’s showing up.
Dr. Ulrich: That’s the key. You have one hypothesis, and that’s what you test. You don’t let yourself be overwhelmed, bottom-up, by differences popping up in the profile.
Dr. Sharp: Right. This might be a good time to mention the likelihood that there are going to be some outliers in a profile. There are going to be some pretty significant differences in index scores, subtest scores, or whatever it may be. Can you speak to that at all? Just the likelihood.
Let’s take a break to hear from a featured partner.
Y’all know that I love TherapyNotes, but I am not the only one. They have a 4.9 out of 5-star rating on trustpilot.com and Google, which makes them the number one-rated Electronic Health Record system [00:30:00] available for mental health folks today. They make billing, scheduling, note-taking, and telehealth all incredibly easy. They also offer custom forms that you can send through the portal. For all the prescribers out there, TherapyNotes is proudly offering ePrescribe as well. And maybe the most important thing for me is that they have live telephone support seven days a week, so you can actually talk to a real person in a timely manner.
If you’re trying to switch from another EHR, the transition is incredibly easy. They’ll import your demographic data free of charge so you can get going right away. So if you’re curious or you want to switch or you need a new EHR, try TherapyNotes for two months absolutely free. You can go to thetestingpsychologist.com/therapynotes and enter the code “testing”. Again, totally free. No strings attached. Check it out and see why everyone is switching to TherapyNotes.
[00:31:00] The Feifer diagnostic achievement tests are comprehensive tools that help you help struggling students. Use the FAR, FAM, and FAW to home in on specific reading, writing, and math learning disabilities and figure out why academic issues are occurring. Instant online scoring is available via PARiConnect, and in-person e-stimulus books allow for more convenient and hygienic administration via tablet. Learn more at parinc.com/feifer.

All right, let’s get back to the podcast.
Dr. Ulrich: I can’t give you exact numbers, but it’s simply the case that if you look at all the possible strengths and weaknesses and difference scores, the scatter, these are all opportunities for things that look interesting to pop up.
Dr. Sharp: That’s a good way to put it.
Dr. Ulrich: So [00:32:00] the confidence intervals that the handbook gives you are geared towards a 5% error probability. So if you have 10 different opportunities for something like that to pop up, 10 x 5 = 50, and now you have roughly a 50% chance that something will show up.
I’m sure there are about 10 different opportunities for differences in everything the handbook lists about what you could potentially do with these types of profiles. So if you stay with just one comparison, then you can accept the 5% threshold and not move it around. That’s then what you work with, and that’s probably more acceptable.
Dr. Sharp: That makes sense. This comes up in supervision a lot. We have interns and postdocs. I think a lot of us, even as licensed clinicians, get [00:33:00] tempted by these major differences. Like, oh my gosh, how could this subtest be so much different than this other subtest within the same index, for example. And that’s pretty typical.
Dr. Ulrich: It is. And even for that, if you really believe this is a diagnostically important question, this one difference that pops up that you didn’t expect, there is a way to deal with that. The way would be to do further testing.
Let’s say a verbal comprehension deficit pops up. You add additional tests that get at verbal comprehension and see whether that hypothesis is confirmed. So that would be an adaptive approach: you use the [00:34:00] information you get, but you don’t just run with it; you design further tests to confirm this potential hypothesis.
Dr. Sharp: I want to dig into that a little bit more in a bit. I think that’s the optimism here or the solution, which people, I’m sure … One of them, certainly.
I did want to go back to something that you talked about in the beginning and the difficulty of comparing results over time. I think a lot of us do that. A lot of us test kids multiple times, maybe 2 years apart, 3 years apart, 4 years apart, or we get an evaluation from a previous practitioner from maybe 6 months ago, and we get different results within that. And then we get stuck with this job of, oh, how do we [00:35:00] explain that? We’re trying to reverse engineer what those differences are about. Do you think that’s a worthwhile pursuit? And if so, how do we do it? If not, how do we ignore it?
Dr. Ulrich: My own background, among other things, is in lifespan and aging research. This problem comes up all the time. From the diagnostic perspective, for example, diagnosing something like the beginning of Alzheimer’s should ideally depend on seeing trends, where you have to make a decision: is this downhill trend more than what you would expect by chance when interpreting differences across test occasions?
This is an extremely hard problem for which there’s, in the current testing literature, no good solution, but in essence, it’s exactly the same problem, because [00:36:00] now the profile that we’re looking at is not the profile across different tests at one test occasion, it’s a profile across the same tests at different points in time.
And so, why is this the same problem? Because, as we want them to be, these tests are highly reliable, so the general factor dominates the correlation from one measurement occasion to the next. Once you take out that common factor, the remaining information, which encapsulates the change over time, is very unreliable and very difficult to interpret.
So that’s why I would generally be very cautious about interpreting changes at all. And if you go beyond interpreting a change in the [00:37:00] overall score, like the Full Scale IQ, to a change in the profile, given what we talked about at the beginning, namely that profiles just don’t replicate, I would be very, very careful, because you’re layering difference scores on top of difference scores: the profile is already made of difference scores, and then those might change over time. It’s an explosion of difference-score uncertainty.
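A classical-test-theory formula makes this precise. What follows is the standard textbook reliability of a difference (change) score, sketched with illustrative numbers rather than figures from any specific battery, and assuming equal score variances at both occasions:

```python
# Reliability of a change (difference) score under classical test theory,
# assuming equal score variances at time 1 and time 2.
def change_score_reliability(r_t1, r_t2, r_retest):
    """r_t1, r_t2: reliabilities at each occasion; r_retest: test-retest correlation."""
    return ((r_t1 + r_t2) / 2 - r_retest) / (1 - r_retest)

# Two highly reliable, highly stable measurements -- exactly what a
# g-saturated battery delivers:
print(change_score_reliability(0.95, 0.95, 0.90))   # -> 0.50
# The more stable the test is over time, the LESS reliable the change score:
print(change_score_reliability(0.95, 0.95, 0.94))   # -> ~0.17
```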
Dr. Sharp: Right. Are there any circumstances you can think of where it is advisable or doable to interpret change over time in our evaluations?
Dr. Ulrich: To some degree, it’s always a matter of degree. If I see something that deviates from an expected pattern, then at that point, I would add additional testing to confirm whether this is a [00:38:00] one-shot occurrence that then reverts back to the mean or a true effect.
If you think in terms of long-term, real-world potential solutions, the whole problem could be fixed in principle if we had relatively frequent assessments of individuals over time. Let’s imagine an ideal testing world where everybody gets a short but highly reliable cognitive assessment once a year. Now you have an individual’s timeline, and each individual is captured with their specific timeline. You don’t have to compare to norms anymore. You just compare the individual to themselves. And if that individual, at, say, measurement point 35, all of a sudden shows a drop, that is [00:39:00] potentially really meaningful, because you compare it to the standard error that this individual has generated for themselves through their testing history.
Of course, I would probably do additional testing to see whether there’s something real, but that is a real signal that I would take seriously, because it’s based on information generated within that individual. This is a somewhat separate problem, but people who deal with diagnosing deficits in older age often see a patient for the first time in their practice. A university professor like me might have an above-average score, but potentially that individual was way above average in his early years. And so you would not necessarily interpret that individual [00:40:00] as having a deficit, even though, relative to his own standard, he actually had a drop.
And so having an individual testing history for each person gets around that problem. This is, of course, dreamland right now, but it’s doable in principle.
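In that ideal world, the decision rule could be as simple as the following sketch. The scores and the cutoff are hypothetical; a real rule would need validation, and, as Dr. Mayr says, a flag would trigger confirmatory testing, not a diagnosis.

```python
# Flag a new score against a person's OWN testing history (hypothetical data).
from statistics import mean, stdev

history = [112, 109, 114, 111, 110, 113]   # e.g., six annual assessments
new_score = 98

baseline, spread = mean(history), stdev(history)
z = (new_score - baseline) / spread
print(f"z = {z:.1f} relative to this person's own baseline")

if z < -2:   # assumed cutoff, for illustration only
    print("unusual drop for THIS person -> schedule confirmatory testing")
```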
Dr. Sharp: Sure. I did an interview with some folks from a company called Boston Cognitive Assessment maybe six months ago or something. I don’t know if you know them or the test. It’s very brief. It’s a 10-minute assessment that you can repeat as frequently as you would like.
They’re not the only ones in that space by any means. There are some options coming on the market that can tackle this whole issue.
Dr. Ulrich: I would [00:41:00] very much recommend doing something like that, especially in a situation where there’s some likelihood that you see a patient repeatedly.
Ideally, it would be something that could happen in a family practice while people are in the waiting room, just to get that type of information. That would be much more useful than the large norming studies that we’re basing our information on right now.
Dr. Sharp: I was just going to say that. This opens the whole can of worms of mental health and keeping it on the same level as physical health. But yeah, if we were doing an annual or semi-annual cognitive assessment with our primary care doctor.
Dr. Ulrich: We spend time and money on so many things, why not on that?
Dr. Sharp: That’s true. I’m with you. Before we transition, we’ve taken little dips into strategies that can help with our interpretation, but [00:42:00] just on a broad level or big picture, given the state of things now and how most people are doing assessment, what is the most sound way to interpret our test data at this point?
Dr. Ulrich: It comes down to two things. The first one is stay as much as you can with the overall level score, the Full Scale IQ or whatever that is in the battery that you are using, and try to extract as much meaningful information relative to the other things you know about that patient from that score.
I know that in the ADHD diagnostic practice, the questionnaire-based scores are very informative, very important, highly valid, and highly reliable. And so comparing that to the Full Scale IQ can be very meaningful.
[00:43:00] And particularly, and here I’m talking a little bit beyond what I should actually know, parroting what I learned from my wife, in those cases, something like the Full Scale IQ can really be very informative about people’s potential for compensating for the deficits they have. But I would stay almost completely away from the zigzags in the profiles, with the exception I mentioned before: trying to be well-informed about the reliability of the specific indices that you’re really interested in. I could talk a little bit more about that, but it gets very mushy.

There are ways to get at that information. Unfortunately, I checked the Wechsler handbook yesterday to see whether I could find it, and I was not able to. [00:44:00] The handbook gives you the overall reliability of the Full Scale IQ and of the indices, which is all great, but it doesn’t help you with this particular problem. You need what is called a reliability coefficient, omega hierarchical.
That reliability coefficient tells you the specific reliability of, let’s say, verbal comprehension after extracting out the Full Scale IQ reliability. When you know that, you can construct a confidence interval for the minimum-sized difference between, let’s say, the Full Scale IQ, the general level, and verbal comprehension that you would need to accept as real.
Let’s say that is 15 points, which is somewhat realistic if you assume a reliability of [00:45:00] 0.5. But that’s only for the first time you look. So that gets back to: don’t use that criterion of 15 points for every single comparison you can make. Use it once, and then stop. From my world, that would still be allowable, maybe already somewhat shaky, but if you apply it carefully, I think you’re still on somewhat safe ground. I wouldn’t go beyond that.
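For listeners who want the mechanics, here is a hedged sketch of that one-shot critical difference, using the classical standard-error-of-difference formula. The reliabilities are assumptions; the specific, g-stripped reliability would come from omega-hierarchical results like those Watkins publishes, and the exact cutoff will differ from the 15 points mentioned above depending on the inputs you choose.

```python
# One planned comparison between two standard scores (mean 100, SD 15).
import math

SD = 15
r_general = 0.94     # assumed reliability of the general-level score
r_specific = 0.50    # assumed g-stripped reliability of the index of interest

se_difference = SD * math.sqrt(2 - r_general - r_specific)
critical_95 = 1.96 * se_difference
print(f"95% critical difference: about {critical_95:.0f} points")   # ~22 here

# Whatever number comes out, it licenses exactly ONE pre-planned look at the
# profile; reusing it across every possible pair reintroduces the
# multiple-comparison problem discussed earlier.
```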
Dr. Sharp: You mentioned the behavioral questionnaires and behavior checklist, just briefly. I know we’ve been talking primarily about cognitive measures, and we’ve used the Wechsler measures as the example. But do you have a sense of how this all applies to the behavioral questionnaires that we administer?
Dr. Ulrich: It’s important to understand that what I just discussed is a [00:46:00] general methodological issue; it has nothing to do with whether the measure is cognitive or questionnaire-based. If you want to go in and interpret specific facets of your questionnaire, and I know very little about these, I’m talking very abstractly now, you would have to be very mindful of how reliable those facets are relative to the general factor that I’m sure is also expressed in these questionnaire-based measures.
So, the problem stays the same. You would have to look very carefully at the relationship between what the equivalent of the Full Scale IQ in something like the BRIEF might be and the individual scores. So the problem doesn’t go away. We were using the BRIEF as an example of an additional piece of information outside the cognitive [00:47:00] assessment that can be brought to bear.
Dr. Sharp: That’s fair. Thank you. And then one other component I just wanted to touch on again, to make sure for anybody who missed it, what is the term you used for the statistic or the measurement that we’re looking for that would capture the reliability specific to an index?
Dr. Ulrich: It’s called omega hierarchical. I don’t know whether you have something like show notes.
Dr. Sharp: Yes.
Dr. Ulrich: I can send some references. There’s one paper by the person I mentioned before, Marley Watkins, who presents that type of information for one version of the Wechsler.
Dr. Sharp: Great. That sounds good.
Dr. Ulrich: There’s also a software package that you can use to extract that information from published information about the test.
Dr. Sharp: Fantastic. Great.
Dr. Ulrich: He’s somebody you should have on your show sometime.
Dr. Sharp: Yeah. I’m bookmarking that, certainly. I’m going to look him up. I think we’ve kept people in [00:48:00] suspense for long enough. I would love to dive into how we can do better, essentially. Given everything we’ve just talked about, you’ve mentioned additional measures validating the results. Let’s dive into that for a bit.
Dr. Ulrich: This is now a lot more speculative. It’s also, in some ways, political, and about markets, because the testing industry is a big market. The technology being used is pretty much the same as 50 years ago. We are basically riding a bicycle even though we could be driving a Porsche.
It seems like there has been very little pressure from the psychological associations and so forth on the testing industry to do better. I don’t know why that is; that’s not my field, but there is work to [00:49:00] be done there to put more pressure on doing things better.
And that can go in different directions. Maybe the most difficult one: as I said before, the main problem is that our cognitive tests are saturated with g. It is possible that there isn’t anything beyond g, and it’s very difficult to go beyond it.
That’s the field I’m in with my basic science. It is true, it’s really hard to find specific, meaningful individual-difference variance beyond the g factor. So that’s hard work, but it’s worth trying to get to measurement instruments that measure individual aspects reliably and reduce their relationship to the g factor.
So that would be one way to design instruments that give you meaningful [00:50:00] profiles. And so, ideally, then you would have a much shorter battery to get at the general g factor. And then you have a bunch of satellite measures that assess the things that are still interesting but not captured already by g. So you broaden your perspective that way.
Here we get into some methodological details that I probably don’t want to bore anybody with, but there are now statistical methods that could be used to design, much more meaningfully and adaptively, how you select tests for a given individual: you test somebody, and the information you gain from that individual is immediately used to suggest the most meaningful next test you should be doing to address or test [00:51:00] certain hypotheses.
That is something for which the technology absolutely exists. It uses Bayesian modeling. I don’t know how much people know about Bayesian statistics, but Bayesian modeling means essentially that you use the information you already have to make the best search for the next relevant piece of information. That can be done adaptively.
It’s a little bit like the idea I suggested before. Don’t go, ideally, with a full 12-test battery. Pick a few tests that really get at general cognitive ability, then test specific hypotheses about what might be going on, and oversample those tests where you think something interesting might be happening.
That would [00:52:00] be a tailored, adaptive way to do that, but of course, our instruments right now are not geared towards doing that. So this is something you can’t ask a current practitioner just to go around and do. You would have to have different testing technology.
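As a toy illustration of what such adaptive technology could look like, here is a minimal sketch of a Bayesian item selector using a grid-approximation Rasch (one-parameter) model. Everything here is hypothetical: the item bank, the responses, and the model are stand-ins for whatever a real instrument would use.

```python
# Toy Bayesian-adaptive testing loop: update a posterior over ability after
# each response, then pick the most informative next item.
import math

THETAS = [t / 10 for t in range(-30, 31)]        # ability grid from -3 to +3

def p_correct(theta, difficulty):                # 1PL (Rasch) response model
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def update(posterior, difficulty, correct):
    likelihood = [p_correct(t, difficulty) if correct else 1 - p_correct(t, difficulty)
                  for t in THETAS]
    unnorm = [p * l for p, l in zip(posterior, likelihood)]
    total = sum(unnorm)
    return [p / total for p in unnorm]

def estimate(posterior):                         # posterior mean ability
    return sum(t * p for t, p in zip(THETAS, posterior))

def next_item(posterior, bank):
    # simplest rule: pick the item whose difficulty is closest to the estimate
    return min(bank, key=lambda b: abs(b - estimate(posterior)))

posterior = [1.0 / len(THETAS)] * len(THETAS)    # flat prior
bank = [-2.0, -1.0, 0.0, 1.0, 2.0]               # hypothetical item difficulties

for response in (True, True, False):             # hypothetical responses
    item = next_item(posterior, bank)
    posterior = update(posterior, item, response)
    print(f"gave item b={item:+.1f}, answer={response} -> ability ~ {estimate(posterior):+.2f}")
```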
Dr. Sharp: I think that’s where things get frustrating. I don’t know a lot, honestly, about the testing industry, but what seems to be on the surface is the fact that a lot of tests are locked behind different publishing houses, which makes this difficult. So it’s hard to sample from each of these different measures and put together a truly comprehensive or meaningful battery, because you have to switch between different platforms and the data isn’t housed in the same place, and then you’re doing your [00:53:00] own calculations on the results, and that seems hard. That seems to be a component of this.
Dr. Ulrich: God, I was losing my thread here.
Dr. Sharp: There’s a lot of threads.
Dr. Ulrich: I think some hope here comes from the big data technology side, because this is a big data problem. In order to get these types of Bayesian estimates, you need lots of data. You don’t need one sample of 1,000 participants tested for a norming study; you just need a lot of people who do different types of tests, where you can collect information and then tune these procedures based on those data. It’s a problem that is [00:54:00] solved in principle; it just needs somebody who wants to make the R&D investment in this.
Dr. Sharp: Yeah. So we’re talking about computerized adaptive testing here. And just to make it super concrete, the theory is, you give someone a relatively brief set of subtests or something, and then if they do poorly on a verbal subtest, then it triggers, hey, we’re going to administer these additional 10 to 20 items looking at verbal comprehension to go deeper into whatever.
Dr. Ulrich: Exactly. Yes.
Dr. Sharp: That’s essentially what they use for the GRE and the SAT.
Dr. Ulrich: Yes. They have to come up with new versions every year. So it is doable. Somebody has to lobby Pearson or whoever.
Dr. Sharp: Someone has to do it. There’s a lot to consider [00:55:00] there, and there are some downsides. Capitalism is important. Making money and selling different tests is important. Sometimes, that comes up against best practices. Goodness.
Are there other strategies that we can use? Anything else that can be helpful with what we’ve got right now in terms of interpreting and using our data in a meaningful way?
Dr. Ulrich: The few things that I’ve said are the ones that I feel comfortable with right now. I think that, more generally, as scientists and practitioners, we have to be aware of the confirmation bias that haunts everything we do and think about. Psychological practice is not free of that confirmation bias.
The testing manuals that present you ready-to-go information about strengths and weaknesses and so forth [00:56:00] are designed to work with that confirmation bias and give it something to work with. I think that’s, if nothing else, a take-home message to get out of this, don’t fall for that.
Dr. Sharp: I like that. We’ve talked about bias on the podcast a few times in the past. I’m currently trying to schedule another guest to talk about bias and diagnostic impressions. So it’s important. I’m glad you highlighted that.
Well, it’s been a great discussion. I know in some ways we could see this as a little bit of a bleak discussion, but there are some ways that we can combat the problems here. I appreciate that you highlighted those.
It’s important to keep it front and center. It’s easy, like I said at the beginning, to fall into the temptation to overinterpret our data and [00:57:00] succumb to the pressure of making meaning out of things to “help” our clients.
Dr. Ulrich: Pleasure.
Dr. Sharp: Thanks for being here.
Dr. Ulrich: Thank you. Bye-bye.
Dr. Sharp: All right, y’all. Thank you so much for tuning into this episode. Always grateful to have you here. I hope that you take away some information that you can implement in your practice and your life. Any resources that we mentioned during the episode will be listed in the show notes, so make sure to check those out.
If you like what you hear on the podcast, I would be so grateful if you left a review on iTunes or Spotify or wherever you listen to your podcasts.
And if you’re a practice owner or an aspiring practice owner, I’d invite you to check out The Testing Psychologist mastermind groups. I have mastermind groups at every stage of practice development: beginner, intermediate, and advanced. We have homework, we have accountability, we have support, we have resources. These groups are amazing. [00:58:00] We do a lot of work and a lot of connecting. If that sounds interesting to you, you can check out the details at thetestingpsychologist.com/consulting. You can sign up for a pre-group phone call, and we will chat and figure out if a group could be a good fit for you. Thanks so much.
The information contained in this podcast and on The Testing Psychologist website is intended for informational and educational purposes only. Nothing in this podcast or on the website is intended to be a substitute for professional, psychological, psychiatric, or medical advice, diagnosis, or treatment.
Please note that no doctor-patient relationship is formed here [00:59:00], and similarly, no supervisory or consultative relationship is formed between the host or guests of this podcast and listeners of this podcast. If you need the qualified advice of any mental health practitioner or medical provider, please seek one in your area. Similarly, if you need supervision on clinical matters, please find a supervisor with expertise that fits your needs.