Think Aloud User Testing – Challenging Established Status Quo

At Loop11, our team has been discussing the potential influence and impact of the think aloud method of user testing on the natural usage behaviour of participants. Does the method distract or change the way in which users naturally interact? Are we introducing bias into the testing process by making respondents verbalise their inner monologue?

Previous research studies do cite an impact on behaviour. However, with the growth of rapid online testing and the availability of quantitative metrics, the relative merits and drawbacks of this approach have become even more pertinent.

Our team formed its own hypothesis, that the think aloud method would extend task completion times and influence the depth to which participants explore navigation and content. We were also concerned that the approach would lead to a skew towards more vocal (possibly extroverted) attitudinal behavioural types and under-representation of others.

To put the approach to the test, our team designed a comparative testing scenario to pit think aloud testing against testing carried out without the need to verbalise thoughts.

When running the same set of tasks and questions, on the same website, but using two different usability testing tools, one prioritising think aloud, we recorded significant differences in the testing results – though not exactly in the ways we had expected!

Surprisingly, for the most part we found no real discernible difference in time on task. However, an interesting result from the three studies was that the greater the level of anonymity, the lower the NPS but the higher the SUS. Does this mean that if a user knows they are being recorded, and their responses will be rated, then they are more likely to award a higher NPS score? Within this limited sample, it would appear so.

How test participants are sourced also impacts results. Providing incentives for performing tasks in a certain manner (very articulate and deliberate) leads to further skewing of results. Rating of participants perpetuates this phenomenon. Participants should be incentivised for completing the study in a natural manner, not for performing in a way that pleases whoever is watching.

Creating products that resonate requires:

  • deep customer empathy (why)
  • target acceptance (what)

If you don’t get it right out of the gate, you’re done! – Aarron Walter, True North Podcast

Anecdote of a runner leading so fast in a marathon he outpaced the markers and ended up off course. He had a 55km run in the end and didn’t win despite extreme performance. You need to be sure you are on the right track.

Think Aloud method

The objective of the Think Aloud method is to get people to verbalise in a monologue as they use the product. What are they thinking? What is their process?

It’s a good process when it works, as it helps you get into their mind. It gives powerful insights into customer behaviour, motivations and preferences. Jakob Nielsen advocated doing 5 sessions – it would be enough to get usable data.

Still, it’s many years later and Loop11 opted to leave Think Aloud out of their platform, because there’s some research that suggests verbalising your process alters the way you are interacting with the product. It becomes a form of bias.

Influence on natural behaviour:

  • fixation – what are people looking at and focusing on
  • fewer tasks – it takes up some time to think aloud
  • takes longer to complete tasks
  • increased cognitive load – think aloud requires multitasking

Case study: Amazon Prime Video

Set up three tests:

  • standard Loop11 usability test, no think aloud (50 people)
  • standard Loop11 usability test, with think aloud (50 people)
  • video recording study with a third party tool, with think aloud (10 people)

Each study had identical tasks:

  • find the cost of Prime Video
  • find the cost of a series
  • find out if Prime Video will work on my TV

Findings – broadly, pretty low success rates. They found the site wasn’t catering to natural mental models of information discovery, e.g. catching people in an onboarding process that didn’t include pricing.

The tests had an inverse relationship between NPS/advocacy and SUS/usability. This seems odd – why is that happening?
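To put a number on that inverse relationship, here is a minimal Python sketch using the six scores reported in the talk (NPS of +8, +22 and +50, and SUS of 59, 50 and 46, for tests one to three). The helper name `pearson` is purely illustrative, and with only three data points the result is a description of this one study, not evidence of a general rule.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each series
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Scores as reported in the talk, ordered test one, two, three
nps = [8, 22, 50]
sus = [59, 50, 46]

r = pearson(nps, sus)
print(f"Pearson r between NPS and SUS across the three tests: {r:.2f}")
```

The coefficient comes out strongly negative (roughly -0.9), which matches the pattern described: as NPS climbs from test one to test three, SUS falls.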

Contributing factors

  • cognition and natural usage (does Think Aloud change this? it seemed to)
  • participant sampling (ruled out)
  • incentive (people tend to put more effort in if they’re paid more)
  • participant ratings – this is the one they felt had contributed the most in this case. In test 3, the testers get rated (high ratings lead to getting selected for more tests). This leads to a bias in the tester audience.
  • branding – Amazon is a big brand, so most testers were already users. That can impact results.

Where to from here?

So will Loop11 include Think Aloud in future?

  • be mindful of data and sentiment
  • augment with natural testing behaviour – combine data from sessions that do and don’t include Think Aloud
  • encourage further research – what have other people run into? can we find more tests and get more data? The findings of this test were a surprise.

Whenever you find yourself on the side of the majority, it is time to pause and reflect. – Mark Twain

Think Aloud is a widely-used technique, but do we stop and challenge it?

(electronic synthesiser music) – What I’d like to spend a bit of time on today is sort of continuing this discussion on research, following on from Atkin.

What I really wanna do is unpick a little bit of the way in which bias impacts what we do, and how we interpret research, in itself, and something that I discovered really only recently, and was quite a surprise.

Just a little bit of background to follow on from John.

So I’m the co-founder of Loop 11, but most recently I’ve spent time on this podcast. So, been working on the Masters Series where I’ve had the fortunate ability to talk to some of the great design masters of our industry, and some interesting themes came out of that process, where I was able to talk to them in some detail about unpacking, about what makes, and creates great design, and how do we get products to resonate.

So in terms of what makes products resonate, there’s two kind of key aspects from my perspective. So, you’ve got the qualitative aspect, so the why, so why do people do what they do, and often if you’ve sat behind a one-way mirror watching some user testing as a designer you’ll say what are these guys doing I can’t understand why they are behaving this way. I’ve designed it so perfectly, and they’re not behaving. So, that’s that aspect of understanding why, and that’s also really about developing customer empathy. So, this came up a lot in the podcast series about how we really engage customer empathy, and that could be done through a couple of means. So one is generative research, it might be ethnography, a whole range of other research techniques, but there’s also less formal techniques that you can incorporate, and use.

So, are you sitting in on the call centre, on the help desk? Are you going and talking to customers on a regular basis? So, it doesn’t always necessarily need to be a formal approach.

The ability to really unearth, and unpack the whys is critical in the process.

The second aspect, which I think is equally important, is the what.

So we know why people are doing things, and why the people we are talking to are doing things, but talking about the broader target acceptance of what’s going on.

So, in our role in the research process what we’re trying to do is avoid bias, and to be honest, there’s bias that’s inherent in every research process, and what really is our goal, and our purpose is trying to minimise that as much as possible.

So we get a really accurate reflection of what is reality, in other words, and I’d be interested has anyone come across the speech that Charlie Munger gave to Harvard Uni, in the 90s, about the psychology of misjudgment. Has anyone come across that before? One person.

Highly recommend you listen to it.

It’s just an audio recording, but I listened to that, probably about three or four years ago, and it blew my mind.

Just to see how open I am to interpretation, and bias in everything that we do.

So, I’ll start to unpick some elements that came up within that talk throughout the session today as well. One of the interviews that we did for the Master Series was with Aarron Walter.

Now he’s the VP of Design Education at InVision, he’s a really insightful guy, and we started to talk about our industry, and the fact that our industry has a need for speed. So we’re really under a lot of pressure to ship, and ship fast, and the challenge with that is that you can ship fast, but you might be heading off on the wrong path. Really from a very early stage, and once you go down a wrong path it’s quite difficult in terms of wasting time, and resources getting back on-track.

So, one of the things Aarron talked about is if you don’t get it right out of the gate, you’re done, and what was interesting about the discussion with Aarron, and I’m not sure if any of you have heard this particular episode, but, he provided an analogy, a running analogy, and after hearing that I heard a story, actually about a relative of mine, it’s my cousin Shane, and Shane is a bit of a machine, and he travels the world running marathons, and doing triathlons, and last year he went, and competed, and actually won the marathon on the Great Wall of China, and he’s also won a marathon circling Uluru as well. So, some incredible locations, and interestingly Shane’s a Perth local, and someone approached him, and said look Shane I’ve got this great idea I’m going to put on a local event.

It’s a marathon, and we’re going to call it the Wild Bull Run, and what we’re gonna do, it’s gonna be a fusion of Pamplona, and an incredible run through wineries, and all sorts of great events, and we’d like you to star in it.

You’re going to be our draw-card for this particular event, and so what happened, is Shane said yeah I’ll do it, I like to support local people, so let’s go.

So anyway, the gun goes off, and Shane takes off, and as is often the case with Shane, he’s running up to five or ten kms, there’s no one around him, he’s up the front of the course, and he keeps running.

He’s got a CamelBak, which is strapped to his back, which has got his hydration coming through because there weren’t a lot of drink stops through the winery although he probably could have stopped for a glass of wine, at some places, but anyway, as he was getting to about the 30km mark he got a bit concerned because he’d run out of water, and there were still no drink stations, and no one in sight.

Another 5 kms on, a car drives up behind him, and says Shane! Shane! I’m sorry you were running so fast I didn’t have time to put the signs out to direct you which way to go.

So, what had actually happened is Shane had gone terribly off-course, ended up running a 55 kilometre marathon, instead of a 42, and of course didn’t win the marathon.

So, what was interesting about this story is I saw a lot of parallels to our industry, and about trying to be fast in shipping.

Unfortunately for Shane, he didn’t have research, and people as markers to guide him on-track in terms of where he should get to.

So, I thought that was a little bit of a back-up for the role and importance of research in the process. Can I just get a show of hands who’s been involved in user testing? Either facilitating or viewing user testing, over the years? Quite a few of you.

Now I guess this is a technique that’s pretty ubiquitous in our industry these days, but when I started back in the 90s it wasn’t, fortunately it’s changed since then.

I guess the most commonly applied technique is a qualitative technique, and it’s often the usage of a think aloud method so what the think aloud method is, is trying to get the participant to reveal their inner monologue. Their thoughts as they’re going, and interacting with a website, and why that’s important is that we can start to understand their thoughts and processes. Are they struggling? Do they intuitively understand what those labels mean? Are they following a path, that we as designers would expect them to take, as they are following through the interface? And as John said before, back in the day, when Jakob Nielsen was shining in the late 90s early 2000s, look he’s still shining.

He would suggest that look you really only need to do five user tests, and you will unearth about 80% of the problems within your interface, and that’s all great. What we can do as part of that process is, and the qualitative, why, aspect of this research technique really gets us to unpack their behaviour their thoughts, their motivations, and their preferences in the process.

It’s a highly valuable technique.
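The five-user figure mentioned above is usually traced to the problem-discovery model published by Nielsen and Landauer, where the share of usability problems found by n test users is 1 - (1 - L)^n and L is the chance that a single user hits a given problem. The sketch below assumes L = 0.31, the average Nielsen reports across his projects; the function name is illustrative, and your own L may well differ.

```python
def share_of_problems_found(n_users, discovery_rate=0.31):
    """Expected share of usability problems found by n_users test users.

    discovery_rate is the probability that one user exposes a given
    problem. 0.31 is the average Nielsen cites; treat it as an assumption.
    """
    return 1 - (1 - discovery_rate) ** n_users

# Show how quickly the curve flattens as more users are added
for n in (1, 3, 5, 10, 15):
    print(f"{n:>2} users: {share_of_problems_found(n):.0%} of problems found")
```

Under that assumption five users uncover a little over 80% of problems, in line with the figure quoted in the talk. Note that this model speaks to how many problems are discovered, not to how reliable any scores collected from those five users are.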

What we were thinking about at Loop 11, and also, I guess, to step back, we hadn’t included the think aloud technique within our tool because there has been some previous research that really does highlight the fact that the technique itself does influence natural behaviour, and there’s some great articles on this, if you want to check it out, by Hertzum et al., and Jeff Sauro, and even the Nielsen group themselves.

These particular studies highlighted a range of influences on the natural behaviour, and the first one is what’s called fixation. So if you’re verbalising your thoughts as you’re interacting with an interface you’re multitasking for a start.

Some people are better at that than others, I’m not a great multitasker I’d have to admit, but if you’re verbalising what’s going on what’s been found in these studies is that your fixation, so where you look on screen, tends to alter. So, what happens is you get to the page you’re looking at, and you start to scan around, and see if I’m making the right choices, and decisions. What’s presented in front of me, and it alters your natural fixation within the site itself. There is a cognitive load that comes with think aloud. You’re verbalising your thoughts, and that particular aspect means that we can’t always delve through as many tasks as we would, otherwise, like to.

So we need to think about how much we tucker out the participants.

In the consultancy we really try, and keep them for a maximum of sort of 45 minutes on task, and really no longer than that, and even 45 minutes can be a little bit draining depending on a set of tasks that are quite draining in terms of that cognitive load.

It also potentially takes longer to complete tasks, and this is one of the elements that we are probably most concerned about.

So, if we’re recording time on task, and people are verbalising their thinking, and they are taking longer, are we blowing out the amount of time they would naturally spend undertaking that task? And also, as we’ve been talking about a little bit already, that increased mental work-load.

So, just the fact of talking aloud, and also doing it in a way that people can understand. Now, anyone that’s been involved in the think aloud method facilitating user testing, knows that not every participant’s great at it, you get some people in where you have to keep jogging them, remind them to verbalise what they are thinking. What are you thinking now Barry? How did you feel about that? And for some that’s quite a task.

So, there’s a series of elements that we were concerned might affect the natural behaviour. So, what we thought we’d better do is practise what we preach, and put it to the test, and we decided to carry out a study where we would use Amazon Prime Video as a vehicle to carry out the test.

We carefully chose Prime Video because, a, it is actually a car crash.

Has anyone actually experienced Prime Video? You may have experienced Prime, but Prime Video? okay, one, yeah, two, three, okay.

Was it a car crash for you? I guess at least I’m talking about the on-boarding process. Anyway.

So, we thought this would be a great case study to use, and we also chose this because Amazon as a brand is really supposed to be a bastion of great UX, and we’d found an example where perhaps it wasn’t. In terms of the approach we wanted to use we had to think about how do we understand, and how do we set up a methodology that lets us really highlight, and articulate what some of those variances are? If they exist at all? We might get to the end of this exercise, and say look, think aloud’s not a problem.

The previous research seems to suggest it is so, um, we’ll see.

What we did is we ended up setting up three different studies with three different audience groups.

So, the first one was using Loop 11.

We had a standard test study where people didn’t think aloud they weren’t required to verbalise their inner monologue. The second one we used the same technology. We went out to the same panel of participants, although we used different participants, and we got them, and we instructed them to think aloud.

Even though we weren’t recording what they were saying, at various points during the test we would jog them, and remind them to verbalise their thinking. 50 participants for each of those, and the third we went out to a 3rd party tool, which was essentially a panel with videos, and we, once again this particular panel, they’re instructed in the think aloud process. So, they are trained on how to verbalise their thoughts. They tend to be very effective at it, and we only got 10 participants from that panel. We would have liked to have got more, but they’re pretty expensive.

So we said, let’s use this as a first test case, and compare. What we did, was we got them to undertake three key tasks. So, the first is just, find what the cost of Prime Video is. The second, was the cost of purchasing a season of Mr Robot, and the third would be trying to determine whether Prime Video would work on my own TV at home. So some of the results, in terms of the task breakdowns you can see. Loop 11, no think aloud. Loop 11 think aloud, and the 3rd party tool.

There were some variations, and what you will probably notice is more variations in test three, to test one, and two.

So a higher success rate there.

The other thing you’ll also notice is the mean time, in terms of time on task for task three was significantly higher for test three, but the others probably weren’t that different, and that was the first shock of the exercise. I thought we would see some significant differences in terms of how long people were spending on the task with the think aloud technique. So why, I guess coming back, obviously success rates aren’t great.

If you’re returning success rates like that you’d be a bit concerned.

So why was that the case, and I’m not going to spend too much time on this, you’ll have to take it as a given, I’ll show you a clip in a moment, but one of the key aspects was when people were coming to carry out a basic search, in this case searching for Mr Robot, it would search, all, but not actually include any Prime Video results, unfortunately.

So people were toggling through pages of content that related to very tenuous graphs of robots, and not finding what they were looking for. Also in terms of discovering what it costs, people would often try Prime or getting started, and click on those links, and were taken through this merry-go-round of content, none of which provided them with any pricing detail. So, again a really frustrating process in the onboarding stage.

What I want to do now is just come back, and show you a clip.

This was a clip that came out of the test three, so that was the think aloud from the panel, and what I noticed was there were some sessions in there that had some really long lengths of time.

So, this was for one of the tasks, and this person had spent over 20 minutes on a task, and I immediately said to one of my colleagues, clearly someone has gone, and made a coffee or something. I started watching it, and this is what I found.

– [Female Tester] Amazon Prime Video I’ll just go to Amazon Prime page then.

So the menu, About Video Subscriptions.

– [Shefik] So minute five I’m time-lapsing this a bit. – [Female Tester] Oh, I’m getting very frustrated here. I’m concerned, whoo, my goodness, so I’m looking for Prime Video it’s not apparent, it’s not obvious, it’s not clear, it’s not easy to see here.

I don’t want to try Prime Free.

Okay so I’m sorry I can’t find it.

I give up, I’m going.

– Poor lady, I felt terrible after that 23 minutes of struggle.

So, what was interesting about that too was that she really did spend 23 odd minutes on trying to find that task.

So, that in itself was a big problem.

Often with these testing studies, people, and regardless of whether it’s qualitative with a small number of participants or you’re looking at large sample sizes, people love to generate metrics from user testing. Common metrics that are often generated are net promoter score, and system usability scale responses.

So you’ve probably heard of the net promoter score. It’s a measure of consumer advocacy for a product. So, how likely are they to advocate, so tell a friend or family member about what the product is, and whether or not they should use it, and the system usability scale is an overall metric that looks at both usability, and user experience. So, with this particular study we wanted to have a look at the feedback from the three tests. So, with test number one we got a positive NPS score of 8. Now that’s the no think aloud.

The think aloud on Loop 11 we got a plus 22, and then we went to the 3rd party tool, we got a plus 50. So, significant variation there.

Now, from an advocacy perspective any score between 30 to 50 is considered a really good score.

So, it ranged from erm to excellent, and certainly significant variation, and a graduation upwards as you’ll notice as we went down.

So, obviously that’s the NPS score.
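For readers who have not computed one by hand, the standard net promoter score calculation is sketched below in Python: respondents rate their likelihood to recommend from 0 to 10, those giving 9 or 10 count as promoters, those giving 0 to 6 count as detractors, and the score is the percentage of promoters minus the percentage of detractors. The function name and the sample ratings are hypothetical and are not data from this study.

```python
def net_promoter_score(ratings):
    """Return NPS (-100 to +100) from 0-10 likelihood-to-recommend ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    promoters = sum(1 for r in ratings if r >= 9)    # ratings of 9 or 10
    detractors = sum(1 for r in ratings if r <= 6)   # ratings of 0 to 6
    # Passives (7 or 8) only count towards the total number of respondents
    return 100 * (promoters - detractors) / len(ratings)

# Hypothetical panel of 10 respondents (illustrative only)
sample = [10, 10, 9, 9, 9, 8, 8, 7, 6, 3]
print(net_promoter_score(sample))  # 5 promoters, 2 detractors -> 30.0
```

One thing the arithmetic makes visible: with 10 respondents, a single person moving from passive to promoter shifts the score by 10 points, and one moving from detractor to promoter shifts it by 20. That is worth keeping in mind when comparing a 10-person panel against two 50-person studies.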

Looking at the SUS we see the reverse happening. So with the first test we see a SUS of 59.

Drops to 50 with test number two, and down to 46 for test number three.

Now, the SUS score, across all the testing we run within the consultancy, averages out at about 68.

So we’re seeing substandard SUS scores for all tests, but we’re seeing again a graduation down through the think aloud from what’s the non think aloud method in test number one. So, why the inverse relationship between the two, for a start, and why are we seeing some of those differences in those figures.
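As a reminder of where a SUS figure comes from, the sketch below follows the standard scoring of the ten-item questionnaire: each item is answered on a 1 to 5 scale, odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the total is multiplied by 2.5 to give a score from 0 to 100. The function name and example responses are hypothetical; a study-level SUS such as the ones quoted here is then the average of the per-participant scores.

```python
def sus_score(responses):
    """Score one participant's SUS questionnaire (ten answers, each 1-5)."""
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 responses")
    total = 0
    for item, answer in enumerate(responses, start=1):
        if not 1 <= answer <= 5:
            raise ValueError("each response must be between 1 and 5")
        if item % 2 == 1:
            total += answer - 1      # odd items are positively worded
        else:
            total += 5 - answer      # even items are negatively worded
    return total * 2.5

# Hypothetical participants (illustrative only)
participants = [
    [4, 2, 4, 2, 3, 3, 4, 2, 3, 3],
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
]
scores = [sus_score(p) for p in participants]
print(scores, sum(scores) / len(scores))
```

Because each participant's answers are rescaled individually, a small panel that skews towards confident, articulate testers can move the average noticeably.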

Anyone else looked at the relationship between SUS, and NPS with testing before? Not so many? Um, fascinating, I spend my nights looking at that. So, interesting stuff.

So, what we ended up doing is trying to work out what were the contributing factors to some of those differences and variations? So the first one is essentially cognition and natural usage.

So, if people again, as we talked about earlier, are thinking-aloud, and this is supported by other research that’s been conducted before. If people are thinking aloud there’s a greater cognitive load, and we are influencing their natural usage. You know, we’re altering where they are fixating on the website. There’s a lot more going on, and certainly that could have a big bearing. Another aspect is participant sampling.

So with the Charlie Munger Harvard speech, he talks, his number one bias is about reward, and punishment. So, if we’re rewarding or punishing somebody. Now, reward in this case might be an incentive that we’re paying.

It might be punishing someone by paying them too little incentive for example or maybe the way in which we’re actually getting those people along, and involving them in the research technique might have an influence.

So, participant sampling could have a bearing. In this case we went to a consistent panel for tests one, and two, and that was the Cint panel, and what we ended up doing was we paid those participants around about $4 for 10 minutes.

So, you can make your own judgement.

Is your 10 minutes worth $4 or not, and that’s something only you can answer.

With the 3rd party tool those participants were getting around $10 for 10 minutes, and I’ll talk a little bit more about that in a moment. So yes, participant sampling could have a bit of a bearing on some of the differences. The incentive, as we’ve started to talk about, could also have a bearing.

So, if you’re getting paid $10, and you’re spending 10 minutes, and you’re doing 10 tests a night, you’re getting $100 a night.

For some people that’s a lot of money, and would they be more inclined to finish a study, and spend 23 minutes on a task if they are getting $10 versus getting $4, and again, this is something you can have a look through your own web stats, and say, well how long are people spending onsite? Are people spending 23 minutes or half an hour or are they spending 30 seconds, and what we need to do is keep benchmarking back to a natural experience to make that judgement call.

So yes, in this case incentive may have had a bearing on the results. What we actually thought had one of the biggest bearings is this final point, and that’s participant ratings. So, with that 3rd party tool, all panellists are rated on a five point rating scale in terms of their performance.

So, put yourself in the shoes of the client that’s watching these videos. You see somebody articulating beautifully their experience. Telling you, perhaps, what you want to hear, highlighting a couple issues, and problems. You’re more likely to give them a higher rating than someone that’s not all that articulate. Maybe wasn’t able to explain their thinking as clearly, and you’ll give them a lower rating.

So, what ends up happening is that those participants with a higher rating get more work.

They’re put to the top of the tree, and they get more projects again, and again because we want to satisfy our clients.

So, in that sense we’re actually skewing the results, and perhaps skewing them to a certain attitudinal behavioural type.

So, someone that’s more expressive.

Someone that might be able to work out solutions a lot more easily than the average person.

So, again in this case, we think that’s had a significant bearing on results. The last one is branding, and I guess tied up to that last particular point we noticed in the videos that a lot of the participants assumed that we were Amazon carrying out our own testing, and they would say, aw I know you guys like to do this, and do it a certain way.

The other thing we noticed was just about every participant we watched had an Amazon account.

So, when they went to access the webpage we would see their login details although they were greyed out, but they were actually Amazon customers.

So, they’d had previous experience working with that website.

So, that’s also had a little bit of a bearing in the process.

So, I guess where do we go from here.

We know what we’ve seen, this technique has had some influence on natural behaviour. I guess what I’d like to leave you guys with, is to say, I’m not for one second saying that qualitative user testing, and the think aloud method should be dropped. I actually think it’s a really valuable technique, but what I’d like you to be, is really mindful about the data, and sentiment that you’re generating from that analysis.

So, often again, that’s the Nielsen five users is sufficient aspect. If we’re using the think aloud technique we’re going to be encouraging advocacy with the products just by the natural aspect of the way we set up the testing. So, and also, don’t be relying on the data you’re generating from that small sample, using the think aloud method. I’ve worked with a lot of clients that use some of these scores, such as SUS scores, and NPS to determine whether they launch products.

Whether they’ll re-launch, and add new features to products, and if they’re generating a positive score within the lab on a small number of users, and then launching the product, and it fails in market, you know, woe is us for not looking a bit deeper into that process. What I’d also suggest is we try, wherever possible, to augment this with natural behaviours.

So, whether you’re tying that back to your web analytics, you’re generating some more quantitative results with larger sample sizes, and not using the think aloud method.

That’s something that I’d also suggest you consider doing. Now time is precious.

So, you’re not going to do this every time, but certainly when you’re looking at generating those metrics it’s really critical that they’re accurate. I guess the other thing too, is that I would like to encourage you guys to do some further research.

I’d love to hear your thoughts, and feedback as you’ve tackled problems such as this because what I’d ideally like to do, is look at this type of study across a whole raft of sites.

Let’s choose sites that, unlike Amazon, don’t have such strong branding, and see if we’re seeing the same type of results. I assume we would, but I think that there’s a culmination of factors involved there, and Munger calls that the Lollapalooza effect. So, that’s a whole raft of different biases that are totalling up to a result, and sometimes it’s hard to break down, and say is that the result of this bias or that or this? So, you know the more information we can generate as an industry the better, and I guess, why is this important? And there was a quote that was provided to me recently from Mark Twain which is, “Whenever you find yourself on the side of the majority, it’s time to pause, and reflect.” and from Munger’s words that’s actually a social proof tendency.

So, what we tend to see is that everyone else is doing this we should do it too, but I think at times we need to challenge those social norms, and say are they delivering exactly what we hoped to deliver because ultimately, we want to deliver those products that resonate. And that’s all from me, thanks guys.

(audience applause) (electronic synthesiser music)