Think Aloud User Testing – Challenging the Established Status Quo

At Loop11, our team has been discussing the potential impact of the think aloud method of user testing on participants' natural usage behaviour. Does the method distract participants or change the way they naturally interact? Are we introducing bias into the testing process by asking respondents to verbalise their inner monologue?

Previous research studies do cite an impact on behaviour; however, with the growth of rapid online testing and the availability of quantitative metrics, the relative merits and drawbacks of the approach have become even more pertinent.

Our team formed its own hypothesis: that the think aloud method would extend task completion times and influence the depth to which participants explore navigation and content. We were also concerned that the approach would skew samples towards more vocal (possibly extroverted) behavioural types and under-represent others.

To put the approach to the test, our team designed a comparative scenario pitting think aloud testing against testing carried out without any requirement to verbalise thoughts.

Running the same set of tasks and questions, on the same website, but using two different usability testing tools (one prioritising think aloud), we recorded significant differences in the results – though not exactly in the ways we had expected!

Surprisingly, for the most part we found no discernible difference in time on task. However, an interesting result across the three studies was that the greater the level of anonymity, the lower the NPS but the higher the SUS. Does this mean that if users know they are being recorded, and their responses will be rated, they are more likely to award a higher NPS score? Within this limited sample, it would appear so.
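
For context, here is a minimal sketch of how the two metrics are standardly scored (illustrative Python, not Loop11's code). NPS asks a single 0-10 likelihood-to-recommend question and reports promoters minus detractors; SUS turns ten 1-5 Likert items into a 0-100 score. Because they measure different things (advocacy vs. perceived usability), the two can move independently:

    # Standard NPS and SUS scoring (illustrative only, not Loop11's implementation).

    def nps(scores):
        """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
        promoters = sum(1 for s in scores if s >= 9)
        detractors = sum(1 for s in scores if s <= 6)
        return 100 * (promoters - detractors) / len(scores)

    def sus(responses):
        """System Usability Scale: 10 items answered on a 1-5 scale.
        Odd items score (answer - 1), even items (5 - answer); the sum
        is multiplied by 2.5 to give a 0-100 score."""
        total = sum((r - 1) if i % 2 == 0 else (5 - r)
                    for i, r in enumerate(responses))
        return total * 2.5

    print(nps([10, 9, 9, 7, 3]))                # 40.0
    print(sus([4, 2, 5, 1, 4, 2, 5, 1, 4, 2]))  # 85.0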

How test participants are sourced also impacts results. Providing incentives for performing tasks in a certain manner (very articulate and deliberate) skews results further, and rating participants perpetuates the phenomenon. Participants should be incentivised to complete the study in a natural manner, not to perform in a way that pleases the viewer.

Creating products that resonate requires:

  • deep customer empathy (why)
  • target acceptance (what)

If you don’t get it right out of the gate, you’re done! – Aarron Walter, True North Podcast

Anecdote: a marathon runner led so fast that he outpaced the course markers and ended up off course. He ran 55km in the end and didn't win despite an extreme performance. You need to be sure you are on the right track.

Think Aloud method

The objective of the Think Aloud method is to get people to verbalise a running monologue as they use the product: what are they thinking, and what is their process?

It's a good process when it works, as it helps you get inside participants' minds. It gives powerful insights into customer behaviour, motivations and preferences. Jakob Nielsen advocated running five sessions, arguing that this would be enough to get usable data.
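
The five-user claim rests on a simple problem-discovery model (from Nielsen and Landauer's research): with n users, the proportion of usability problems found is 1 - (1 - L)^n, where L is the probability that a single user uncovers a given problem (roughly 31% in their data). A quick sketch:

    # Problem-discovery curve behind Nielsen's "5 users" heuristic.
    # Assumes each user independently reveals a given problem with
    # probability L (~0.31 in Nielsen & Landauer's data).
    L = 0.31
    for n in range(1, 9):
        print(f"{n} users: {1 - (1 - L) ** n:.0%} of problems found")
    # With 5 users: 1 - 0.69**5 ≈ 84% of problems found.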

Still, many years later, Loop11 opted to leave Think Aloud out of its platform, because some research suggests that verbalising your process alters the way you interact with the product. It becomes a form of bias.

Influence on natural behaviour:

  • fixation – it changes what people look at and focus on
  • fewer tasks completed – thinking aloud takes up time
  • longer task completion times
  • increased cognitive load – thinking aloud requires multitasking

Case study: Amazon Prime Video

Set up three tests (see the comparison sketch after this list):

  • standard Loop11 usability test, no think aloud (50 people)
  • standard Loop11 usability test, with think aloud (50 people)
  • video recording study with a third-party tool, with think aloud (10 people)
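
The write-up doesn't say how "no discernible difference in time on task" was established, but here is a minimal sketch of how the two 50-person arms could be compared (all timings made up; time-on-task data is typically right-skewed, so a non-parametric test is a common choice):

    # Hypothetical comparison of time-on-task (seconds) between the
    # silent and think-aloud arms; all numbers here are made up.
    from scipy.stats import mannwhitneyu

    silent = [42, 55, 61, 48, 73, 39, 66, 52]
    think_aloud = [45, 58, 64, 50, 70, 44, 69, 57]

    stat, p = mannwhitneyu(silent, think_aloud, alternative="two-sided")
    print(f"U = {stat}, p = {p:.3f}")  # large p -> no discernible difference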

Each study had identical tasks:

  • find the cost of Prime Video
  • find the cost of a series
  • find out if Prime will work on my TV

Findings – broadly, pretty low success rates. The team found the site wasn't catering to natural mental models of information discovery, e.g. catching people in an onboarding process that didn't include pricing.

The tests showed an inverse relationship between NPS (advocacy) and SUS (usability). This seems odd – why is that happening?

Contributing factors

  • cognition and natural usage (does Think Aloud change this? it seemed to)
  • participant sampling (ruled out)
  • incentive (people tend to put more effort in if they’re paid more)
  • participant ratings – this is the factor the team felt had contributed most in this case. In test 3, participants are rated, and high ratings lead to being selected for more tests. This biases the tester audience.
  • branding – Amazon is a big brand, so most testers were already users, which can impact results.

Where to from here?

So will Loop11 include Think Aloud in future?

  • be mindful of data and sentiment
  • augment with natural testing behaviour – combine data from sessions that do and don’t include Think Aloud
  • encourage further research – what have other people run into? Can we find more tests and get more data? The findings of this test were a surprise.

Whenever you find yourself on the side of the majority, it is time to pause and reflect. – Mark Twain

Think Aloud is a widely used technique, but do we ever stop and challenge it?