Fair game: the ethics of telco fraud

(bright electronic music) - I'm so excited to be speaking at this conference. So let's get started.
This is Fair Game: the ethics of telco fraud. Just before I kick off into the meat of this talk, I wanted to share a new project I chucked online in the past couple of days. You may have heard of this project I've been working on called Ethics Litmus Tests.
I've actually recently thrown a digital version of it online, where you have a little card flipper and you can pick a card.
And I actually borrowed the code for this project from Ben Buchanan, who put up a version of the card picker for a similar card deck called Oblique Strategies, which was an inspiration for my project.
So this is a lovely moment of networks colliding. And if you don't know Ben Buchanan, he's the guy who always writes these wonderful blog posts summarising all of the good stuff coming out of Web Directions, called The Big Stonking Post. So yeah, he's rad.
He's closely tied into the Web Directions networks. And thank you again for letting me borrow your code. And if you wanna have a play with the online version of Ethics Litmus Tests, please go for it.
If you want any of the references or any of the things I'm talking to in this talk, please just hop onto my GitHub and grab the slides, the links are all in there, and I'll share this again at the end, in case you need them. So let's get stuck into it.
I'm going to start by highlighting the big topics I'm gonna cover in this talk.
So I've got four kind of primary findings I'd like you to take away from this presentation. The first is developing a healthy cynicism about data sources.
The second is mapping our knowledge states. The third is designing for failure first.
And the fourth is upstream prevention versus downstream mitigation.
Now, I'd just like to suggest that if you're worried this is gonna be too focused on working in telco, or on machine-learning-specific modelling, I'll tell you it's really quite generic.
Like these findings and these techniques I offer, they're the foundation pieces to doing the complex modelling correctly, for sure. But they apply to any kind of automation, any kind of decision system.
Honestly, they apply to you.
So at the beginning of 2020, which is either a year ago or a lifetime ago, honestly, who knows at this stage, I began consulting for a telco for my company Debias AI. I'm primarily embedded in their fraud detection team. Their approach to fraud detection is pretty standard. It's a mix of monitoring, internal business rules, and some third-party systems with proprietary machine-learning fraud detection algorithms. So that's pretty standard; if you're not familiar with the fraud space, that's pretty much what you would expect to see.
If you haven't heard this term EthicsOps before, think like DevOps, but for ethics.
I introduce a harm identification and harm mitigation lens. Pragmatically, that work looks like monitoring, notifications and alerts, as well as product flows and feedback loops. Fraud is actually a really interesting space in which to discuss ethics.
Not only are you the company capable of inflicting harm on your customers, you're at the same time under attack by a subset of customers who actually mean you harm.
You have to balance the real, often urgent concerns of identifying bad behaviour, while also acknowledging the possibility of false positives and considering how you might identify and support any innocent parties who accidentally got caught in the cross-fire.
It's a war of attrition.
If you need a metaphor, it's like email spam. So this leads me into my first finding, which is developing a healthy cynicism about data sources. When we start to feel comfortable, or too comfortable, about what we think we know, or how we're interpreting our data, that's usually when we discover something is completely different than we expected, and that can oftentimes come around and bite us when we least expect it.
So I have this sort of concept that data interpretation pride comes before a fall. And I've got a few examples to offer that try and ground this and make it real.
So one example, in the team I was working in, is that we were looking at a particular person who was sending a whole bunch of SMSes, just tonnes and tonnes of texts, and they were sending them to overseas numbers. And normally when we look at volume like that, in the back of our heads we think: oh, this is marketing, or this is a spammer.
So we think they're doing something like sending texts saying "buy my Ray-Bans", or selling some kind of cheap knock-off. That's usually the expectation we have when we look at that kind of volume.
But actually, we gave this person a call, and we found out that's just not what was going on. They were sending a whole bunch of information messages. They were essentially like, sending activist messages, but instead of an email, which you might normally expect, they were sending them over text.
Unusual, yes; malicious, not so much.
Other examples of where we thought we knew what was going on, but found out after the fact it was quite different, would be things like customers automating tasks for themselves.
You might not be surprised to learn that we're not the only ones obsessed with automating our way to a better life, right? Lots of people want to find ways to make their lives better, to remind themselves to do things at certain times of day. There's also things like health monitors, where people set up auto calls so that someone who they need to keep an eye on who may be vulnerable has a sort of regular check-in service.
And the dangerous thing with these kinds of automations is that they look very much like the problematic robotic behaviours, the overly consistent, not-human-like behaviours we might see in our data. So keeping in mind that that kind of stuff is happening alongside the more suspicious or more problematic behaviour is really, really core to getting this work right. So, how do we help and support ourselves in thinking about this stuff? Well, I argue a big first step is deconstructing your proxy.
If you're not familiar with that term, a proxy is just a sort of placeholder, a stand-in for the thing we actually want to measure. All of our analytics, all of our product work, rely on proxies. When I look at the number of clicks on a button, I assume that's a proxy for "I want to buy that thing on that webpage" or "I want to go to the next view", but really it could be rage clicks.
And I just don't know the difference.
Or when I look at something like time on a page, I assume that's a proxy for how engaged that person is in my content, but it really could be that my pop-up has come up and prevented them from going further, and they abandoned the page and just didn't close it off. So we have to be really, really suspicious of these proxies, and take some time to think them through, so that we can be more pragmatic about what we've actually learned.
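To ground the rage-clicks example, here's a minimal sketch in Python of what deconstructing that proxy might look like in code. The window and threshold numbers are invented for illustration, not real product values:

```python
def looks_like_rage_clicks(click_times, window_s=2.0, threshold=5):
    """Flag a click burst that may be rage clicks rather than intent.

    click_times: sorted click timestamps in seconds.
    window_s and threshold are illustrative cut-offs: five or more
    clicks inside two seconds is suspicious, not conclusive -- the
    proxy "click = purchase intent" stays a proxy either way.
    """
    for i in range(len(click_times)):
        j = i
        # count clicks falling inside the window that starts at click i
        while j < len(click_times) and click_times[j] - click_times[i] <= window_s:
            j += 1
        if j - i >= threshold:
            return True
    return False
```

The point isn't the heuristic itself; it's that writing the cut-offs down forces you to admit how much interpretation sits between the signal and the story you tell about it.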
So I'd like to pitch a concept framework to help us do that work, to do that deconstruction work. So we start with this idea of a signal.
The signal is the incoming data.
It's the zeros and ones.
Sometimes they come from third-party services, sometimes they come from physical infrastructure, sometimes they come from data sources that we are plugged into otherwise, but this is a thing we're measuring.
And from my signal, I infer some kind of activity, but I want us to observe that this first inference is a leap in logic. It's telling me: I think this is probably the sort of thing, but actually it could be something else. So sticking with the example of the texting, I could be thinking someone's on their phone writing a text. That's the obvious and intuitive interpretation: I see SMSes being sent from a phone, so I think someone was sending a text. But actually, we know that sometimes there's malware on a phone or a device, and that malware could be sending that text without the person physically doing it themselves. So immediately I've introduced some uncertainty into the equation.
From my activity, I then infer a persona.
So again, I have another leap in logic.
In this case, when I'm thinking about someone on their phone texting, again, the persona I normally jump to is that I've got two people and they're talking to each other and having some kind of conversation. But there could be another persona, where I'm writing back to my dentist saying, yes, I'm coming to my appointment. And you can imagine just absolutely oodles of ways that texting could be done which are not necessarily a traditional conversation. I could be testing an API.
I could be painting something. There are lots of things I could be doing with a text that aren't necessarily just having a chat in the traditional sense.
And if I move down to the lower persona, this could be the person who wrote the malware that makes the phone send the SMSes in the background. We may not know much about them, but we're going to assume that they're making money in some way: they're being paid for a service, or they're sending those texts off to some expensive destination.
That allows them to take a cut off the top. But what this shows us, this chain from signal to activity to persona, is that for every leap of logic there's some uncertainty, and the more we do the work of mapping this out and really thinking it through, the more we can be pragmatic and grounded in understanding what we're learning when we look at these signals. And if you think I'm just being mean about the company I was working with, or other companies I've seen recently, I wanted to share this next example because I think it's a good illustration of the way small, subtle misunderstandings can blow out and have really big impacts.
So this article I've linked here is talking about an NBN piece of news that came out just a couple of days ago. They were basically saying that they discovered about 300,000 premises they didn't know existed, that they still had to hook up, at a cost of about another $600 million to their budget.
So essentially, they're finding out that they actually have significantly more to do than they thought. And the crucial data distinction is that they had a list of addresses; what they did not know is that a list of addresses is not a list of every single premises at those addresses.
So you can understand how that would be an incredibly easy mistake to make, and I have a lot of empathy for the engineer who found this problem, but I think it's a good Aesop's fable for the rest of us, who really can't afford to make $600 million mistakes.
If we don't take the time to think deeply about the data we're ingesting, pulling from other sources or buying off other people, we may find ourselves in a not very happy situation. So yeah, going past deconstructing your signals and deconstructing your proxies, I really, really recommend that you look at your raw data. Really look at it, not just at dashboards.
For data scientists: please use the tools available to you to interrogate the validity of that raw data. Does it actually meet your expectations? Do the types match? Are there a lot of formatting issues? Is the data quality overall good or not? And it's not just for data scientists; I think everybody has a part to play when it comes to looking at the data. Subject matter experts often have quite a lot of intuition about the human stories underneath the data, and that can be incredibly valuable in supporting the data scientists and the other analysts in knowing what they're looking at and what to make of it. Additionally, data misinterpretation is easy and likely under pressure.
One of the most interesting things about working with fraud is the pressure it puts on data interpretation in periods of attack. It's like putting your data science hygiene through a crucible of intense experience, in unexpected and lumpy ways.
So if you're not prepared to put your data science, your labelling, your hygiene through that kind of high-pressure experience, you may not be at a state of maturity yet. If you're not sure you're ready for that kind of crucible, maybe go back to basics and have a think about what exactly you need to do to build up that confidence and that load-bearing robustness. For an example from this work I've been doing in fraud: we had one instance where we thought we had taken some actions against specific services, and we could see that the bot that was in production thought it had done the thing.
And we were all very confused trying to understand what had happened.
And it turned out, after a couple of days of investigation and unpicking, that the service running the data batches had actually just been out for a period of hours. There was a planned maintenance outage, and this is a very common thing to happen.
And, the reality is, we were so stressed we didn't even go back through our history of emails and updates from the business to see that this had been the case.
And I say this not because I think it was bad behaviour on the part of the team, but because I think it's so easy to do.
It's so easy to get yourself in a bit of a tizzy and not have the time or the energy to step back and ask: well, if it wasn't this part of my automation, was it this part? And if it wasn't this part, was it the part before? So that kind of difficulty of interpretation when you're under pressure, I think, is a very common issue. Another quick example: we had some features which we knew, in the team, take some fixed period of time to calculate.
So the time from the data being captured by activity in the real world, to the features being calculated, to the alert coming out on Slack, was a fixed chunk of time.
And it wasn't like one minute, it was a period of time that we knew we had to account for.
But when we saw that alert coming out at like 2:00 in the morning, and we're stressed and we're under attack and we're trying to work out what's happening, we kept forgetting about that fixed time and thinking that something really suspicious was going on. We thought we'd just taken an action on a service, when in reality that service had already had the action taken, but it wasn't reflected in the data yet because of that fixed delay. So those kinds of interpretation issues, I think, are very common and very human.
And honestly, it takes cool heads and a deep breath to be able to move on from it.
So, how do we help ourselves? We design for our cognitive load, and I really wanna emphasise that simple doesn't imply stupid. Just because you have a team filled with really brilliant developers and data scientists and machine learning engineers doesn't mean you want them to have to be brilliant detectives under time and cost pressure.
So if you can add labels, be explicit, make it impossible to misinterpret.
A litmus test I'd like to offer for this: on a quick glance, is this easy to misunderstand? Literally look away from your computer, look back at the thing for a second or two, and ask yourself: oh, do I get that? Or: oh, I can't quite grok it.
And if it's easy to misunderstand, that's a good signal that you could do a bit more work to make it more interpretable. A few more pointers for making data interpretation scalable across the team, and easy for most types of people. A big one is: are we using domain-specific language? Are we using jargon? Are we using acronyms? And if we are, maybe could we not? I'm sure you've seen people talk about this at length.
So I won't spend too much time justifying why I think it's a bad idea. But be explicit, especially for what I think of as load-bearing data alerts or notifications. When it's gonna be important, try your best to be as explicit as possible and don't assume prior knowledge; that almost always pays dividends.
Another obvious one is, add units.
If we're looking at time, is it minutes or hours or days? If we're looking at counts of a thing, is it one person, or 100 services, or 1,000? Understanding this really matters, and again, especially under pressure.
Another pointer for data interpretation: is this an absolute count of the thing you're measuring, or relative? And again, related to units: if I'm looking at a relative percentage and I see something like 90% accuracy, then with a sample size of 10, that's cool, it's like one person having a bad time. But if my sample size is a million, that's terrible. It's a terrible outcome.
So that relative percentage really only is useful information when it's grounded by that absolute count, that total count of what's happening.
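As a hedged sketch of that pointer (the function name and wording are my own invention, not anything from the team), a reporting helper can simply refuse to emit a percentage without the absolute counts that ground it:

```python
def describe_error_rate(errors, total):
    """Report a failure rate grounded by its absolute count.

    A bare "10% of services affected" reads very differently when
    total is 10 versus a million, so always emit both numbers.
    """
    return f"{errors} of {total} services affected ({errors / total:.1%})"
```

So `describe_error_rate(1, 10)` and `describe_error_rate(100_000, 1_000_000)` both report "10.0%", but the absolute counts tell two very different stories.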
Another important one is the data recency.
So this calls back to that fixed delay I was referring to before.
If there's going to be a delay from the time you captured the data to the time you see the thing, make that as explicit as you can.
Say that the data was captured at timestamp X and the alert happens at timestamp Y.
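A minimal sketch of that kind of explicit alert, assuming Python and timezone-aware timestamps (the field names and layout are mine, not the team's):

```python
from datetime import datetime, timezone

def format_alert(event, captured_at, raised_at=None):
    """Render an alert that is explicit about data recency.

    Spelling out both timestamps, and the gap between them, stops a
    stressed responder from mistaking a known pipeline delay for
    something suspicious.
    """
    raised_at = raised_at or datetime.now(timezone.utc)
    lag_min = (raised_at - captured_at).total_seconds() / 60
    return (
        f"ALERT: {event}\n"
        f"data captured at: {captured_at.isoformat()}\n"
        f"alert raised at: {raised_at.isoformat()}\n"
        f"known pipeline delay: {lag_min:.0f} minutes"
    )
```

That "known pipeline delay" line is exactly the fixed gap from the 2am story earlier, written down where the 2am responder can see it.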
And it may seem overly explicit at the time, but I promise, future you will thank past you for doing it. Okay, so you probably knew this was coming. This is the section where I tell you the best way to improve your data science practice as a whole is to do better at the boring science bits: the bits where we capture knowledge and store it so you can see it in the future.
So, this thing, the swamp of lazy assumptions: I like to think of it as the status quo for most companies, for most domains.
And sometimes it doesn't matter, but sometimes it does. So when I first joined this team and I was trying to understand the space of fraud and the space of different behaviours that were happening, I was really having a hard time getting clear understandable answers out of the team.
And I was finding, there was quite a lot of mismatches of assumptions and there was quite a lot of namespace clashing. And there was just a lot of murkiness.
People were sort of like, "Oh, well, this is like this, but maybe also like that." There was a lot of hand-waving going on, and that always waves a red flag for me. And again, I'm not picking on this team, because I think it's so common and so easy to get into this state of affairs.
But I think doing the job of asking the questions revealed to everybody else that I wasn't the only one who was confused.
And in fact there were a number of confusions and mismatches in mental models amongst the team. So let's talk about implicit versus explicit knowledge. I'd like to pitch that just because a thing is knowable doesn't mean it's known.
And even if it is known, it doesn't mean it's knowledge that the team or the company owns or is in possession of. So there's a distance between what is knowable, what is known, and what everybody collectively knows. And this is the thing I want us to get better at understanding and having a sniff test for. So I'd encourage you to become a knowledge excavator. The first step to establishing a shared understanding is to sniff out the existing implicit knowledge and try to formalise it.
So what does that look like? We're gonna be working on naming: often naming things, naming models, naming reference points.
Building a shared vocabulary is really, really important. Even if you're working in a domain where there's a lot of existing language or jargon or vocabulary, taking the time to actually write out what you mean when you say the thing, and often to capture all the other words that mean the same thing as well, is really, really useful for getting everyone on the same page.
This is the starting point for building those shared mental models I was talking about. And I'm just gonna say it again, 'cause it's so important to me.
If you do have namespace clashing, find another name for one of the things, so that you don't have this push and pull of meaning where I say one word and it actually means two or three or four things. So in the space of fraud, here's an example. I've got two types of bad behaviour, you might say. One is called international revenue sharing fraud, and the other is called toll fraud.
Now, I throw this up here as a bit of a gotcha. And the reason I do that is because these are actually two names for the exact same kind of behaviour, the exact same kind of fraud. But looking at them, you would have no sense that that was the case, right? They just don't seem like they mean the same thing at all.
And I think it's really important to remember that when you have this kind of cognitive dissonance, you're just adding a bunch of work for everyone to interpret and understand and follow along with you. Another job I've been doing with the team is establishing the personas: the different sorts of behaviours, the drivers of those behaviours, and what we think is going on with the people who are in this space of misuse, abuse and fraud. These may not be personas in the deeply researched UX sense, because we can't always go up to people and say, "Well, tell us, why are you trying to rip us off?" It's not an easy conversation to have, and we often don't have access to those people. So we're doing what we can with the data we have available, but it can still be incredibly useful to have names for these personas and to be able to talk about them amongst ourselves.
So for example, we have a persona for someone who overuses promotions and they do what we call this deal seeking behaviour.
So they're looking for the best, the cheapest, the fastest way to get the thing.
You might think of it as a digital version of coupon clippers.
And that type of person should not be lumped in, for instance, with a person who's trying to pose as a telco and on-sell the services of the company. So that's a totally different kind of fraud. It's large-scale fraud, and it's really, really costly. And it's important to remember that you have these two very different kinds of people who might be getting your messages or your warnings, and that you wanna talk to the person who's a little cheeky in just as nice a way as you talk to the person who's really problematic and sus. If you don't have the shared understanding, it's hard to move forwards.
We need it for all of our experiments, all of our hypotheses, all of our explorations into the future.
Also, I would argue that the eliciting and discovering of the small discrepancies, the small places where my understanding doesn't quite map to your understanding are actually really insightful moments.
They're moments which reveal a question to investigate or potentially a product opportunity.
And if you take the time to do that work, even if it feels a little bit painful or a little bit verbose, it often reveals really good, insightful, useful seeds for starting new projects.
I'd like to pitch that while we do a good job of capturing data at the product side of things, seeing what people do with our tools, we tend not to be so good at capturing information about our state of the world: our understanding, and what changes in it over time.
And I think that doing the work of designing these data schemas is really, really helpful, both for establishing our baseline, as I've said, and for giving us a starting point, if we do wanna go into statistical modelling world or machine learning modelling world.
So what I've got here as an example, on the left column, we have the sort of space for what did I think at the time.
So, what did I think was happening? What was my intuition, my inference based on the data available? The middle column shows the new information: what came to light, what did I discover? And the final column, the delta column, is what changed: what did I think after I saw this new information? Did it reinforce my existing idea, did it completely change it, or was it somewhere in the middle? This may feel like a new, kind of annoying data capture task that you don't wanna have to do, but I actually think it's the sort of thing that has a bit of friction at first, and then as you get into the habit of doing it, it will really flesh out and formalise all of these experiments that we're running all the time in product. One other thing about this space of thinking about language and naming things: I really believe that changing our language changes our minds.
So I just like to share this little anecdote of one of my colleagues in this team who was presenting some data analysis he'd been doing across the space of fraud.
And he was talking about possibly implementing a new feature.
And he was like, "We'll wait until we're 100% certain... no, make that 99% certain." And I had this proud mama moment when I heard him say it, because I'd been trying really hard to get everybody on the team to onboard this idea that when we're doing science, we're moving towards certainty, but we're never there. We always hold space in our minds for a little grain of doubt, a little grain of possibility that something might change.
And I think changing our language, just moving from 100% to 99%, may seem subtle or unimportant, but I genuinely believe it improves our hygiene overall. So in the space of fraud, we have added uncertainty on top of the normal uncertainty of product interpretation or usability interpretation. And that's because you really can't trust what people say. You're gonna be asking them questions, and they may answer completely honestly, or they may be telling you a bald-faced lie they want you to believe, and you're trying to work out which is which. Dealing with this kind of uncertainty day in, day out can make these discussions feel suspect, and you start to see fraud everywhere.
So what's the antidote to this kind of suspicion that's like hardening of our hearts against our customers? I think the best way to avoid falling into this trap is to be explicitly deliberately kind and respectful in all of our communications.
But what if it's a bad actor, you might be asking? I think it's a fair question, and most people do worry: well, what if I'm talking to someone who's actually trying to do something bad or problematic? But I'd argue: well, so what? What's the actual cost of that? What is the problem? This is the thing we have to take the time to think through: what are the possible outcomes I could foresee from this scenario? If I speak in a respectful and kind way to a bad actor, is there any cost or harm to them or to the business? If I speak in a disrespectful or dismissive or unkind way to someone who's actually a good actor, who's confused, is there a cost to the business? Is there a cost to that person? What if it's someone who's differently abled, or has a hard time being in a phone conversation or a chat conversation? There are plenty of people who struggle on certain channels.
So we wanna really think through those cases. To me, the cost of talking nicely to a bad actor is nothing compared to the cost of talking badly to a good actor. We also wanna think about reputational harms: what kinds of ill will might we generate from this kind of outcome? And instead of thinking about this cat and mouse game and the desire to be right, what if instead we just focus on our goal, which is to prevent bad behaviour, which is costing us money, and at the same time keep respect and kindness at the heart of our communications? So yes, I'd like to pitch that we keep the moralising out of it.
Being right isn't helpful and you don't even know that person or their story. So even if they are exactly what you suspect they are, does it really matter? Another tip I'd offer for trying to set this intention and keep it there is to focus on describing behaviours, not people. So if we can describe service misuse or promotions overuse, that's one thing and it describes the behaviour and it doesn't tell me anything about the person. If I say they're an abuser or a fraudster, immediately I have a much more negative connotation in my mind and I have a bunch of ideas.
And again, my heart is hardened to them and that's really not the space I want to be in. That brings me to my third finding, designing for failure first.
During my time at this company, we've been experimenting with designing communication flows to provide clearer feedback to people whose service came under suspicion, and allow them to request support if they think there was a mistake.
In some cases, they got a warning first and others, it was after the fact.
Now, what I learned from these relatively small batches of cases we experimented with, is that this is very challenging, very difficult work.
One important thing to observe is that if you don't ask, customers will not tell. What I mean by that is: if you make an incorrect assumption about someone and take an action against them, it's so much easier for them to churn than to come back and tell you, "Oh, that was wrong, and I'm unhappy with you." In fact, the strong likelihood is you won't get any feedback at all, unless they're so, so angry with you that they go to a regulator or some kind of body and complain about you, and that's obviously everybody's worst outcome, so we don't want that to happen. So how do we bake in feedback loops, and what do they look like? Well, I'd suggest they should be intuitive, as in they make sense at the time. They need to be contextual: we want them to be specific to the action or the experience this person just had. And most importantly, they need to be timely, both for the customer and for you. You wanna find things out as quickly as possible, and you wanna ask someone about an experience while it's fresh in their mind. So don't ask someone on Friday about something you did on Monday; ask them two hours afterwards, or even just afterwards.
And once you've got these feedback loops running, you still need to do more work, I'm sorry to have to tell you.
So we wanna plan time for customer support, for when someone does come back and say: yeah, that was wrong, fix it.
What does that look like? What are your customer support scripts going to be? You may discover that you have some strange parts of your automation or your model that aren't performing very well for people, so there may be more features to develop or model improvements to make.
And you may also need to find other ways to integrate these learnings into your product or into your workflow.
So don't assume that the work ends at the feedback loop; the work ends once the feedback loop is integrated into your product cadence.
Another thing I like to say is we need to be explicit about the potential harms and consequences of these outcomes.
And if you can't imagine consequences, I really think you're not thinking hard enough. So let me make this real: there's a paper I found from a whole bunch of Facebook engineers, back in 2011.
So roll back to 2011 and imagine what Facebook was like then.
So in the abstract they say, we believe this system has contributed to making Facebook the safest place on the internet for people and their information.
The paper is talking about their attempts to classify and prevent malicious behaviour, bot behaviour and other kinds of attacks on Facebook.
So they're basically trying to work out who's doing something sus and shut them down. And here's the kicker: "The goal is to protect the graph." By this they mean the social graph, the overall state of their network.
"...against all attacks, rather than maximise the accuracy of any one specific classifier.
The opportunity cost of refining a model for one attack may be increasing the detection and response on other attacks." Now, in my speaker notes here I literally have the "screams forever" emoji, because this frustrated me so, so much.
So, to try and interpret this for you: what they're saying is, we don't care how good our classifier is, and we don't care about the consequences of false positives. Now, if we zoom forward to 2020, we know that the impact of being deplatformed or otherwise silenced on a platform like Facebook can be really significant.
It can be things like impacting your economic security. If you're a small business owner and you post things through Facebook and suddenly one day you have no audience, that can completely change your income.
Another example of a bad outcome: people who book sex work through digital platforms often need to use these platforms as a way of vetting a potential customer before they go to meet them.
This digital intermediary, which lets them assess whether a potential client has friends, whether they look like an axe murderer, is usually the thing between that person having a successful commercial exchange with someone and that person experiencing sexual violence. So someone could literally die because they were deplatformed on Facebook. To say, in a very academic way, that we don't care about the accuracy of any specific classifier is just so naive, and deeply problematic. So yes, if you ever need to remind yourself of how a couple of years can completely change our view of things, just remember this paper from 2011, and think about those programmers and how they just did not understand how people would be using the platform, or the impacts of this thing going wrong.
So I've just been giving some examples of how we think about and map out harm, but it's a whole topic.
It's a whole subject in and of itself.
And as I don't have time to go into too much detail here, I'd suggest that if you want to explore some tools, workshops and activities that will support your learning in this space, I've put a whole bunch of useful links up on my GitHub repo, mapping-fair-ml. Check it out, and please come back to me if you have questions about any of them, because I'm exploring them all too, and I'm developing intuitions about which ones are useful in which kinds of circumstances.
So failure first design.
If you're thinking, "What does that even mean, Laura?" or "How am I gonna do it?", here's a couple of suggestions to get your juices flowing.
It could be things like UI interactions, where you have a little link or a button that says something like "Get help" or "This isn't right". I know that lots of people have said things like, I really want to be able to tell Netflix: forget this movie forever.
Like, I watched it for two minutes and I'm ashamed of myself, please never show it to me again. So that's an example of a failure (it now thinks I want to watch that movie), and I want to be able to correct that failure through some kind of UI interaction.
Another example of failure first design is when I'm building out some email comms, or some SMS comms.
So as I was saying before, with the work I was doing with the fraud team, sometimes you need to tell people what was happening and why it was happening, and give them the opportunity, the right of reply. So you don't just wanna tell them, this is the state of the world, or this is what we think about you.
We wanna tell them: hey, this is what we think, but if we're wrong, let us know, and here's how you do it. Another example, which I think often gets overlooked, is writing some scripts for the conversations that we might need to have with these customers if we're exploring a case more deeply.
So what do you need to know to identify whether your classification was correct or not? What does that look like? And what are your sniff tests for whether they're trying to bullshit you or not? Those can be quite hard conversations to design, so I would suggest you think about them and put a bit of time into them.
And as I showed you before, coming up with some data schemas to capture your internal learnings, and to map the state of your understanding over time, is a really useful way to support this work. So here's a litmus test for thinking about this failure first design: what if this (and "this" can be anything in your product) happened to my most vulnerable customer? I think this is useful because we all know, we all have cases of someone who's vulnerable in one way or in many ways.
And we often think to ourselves, oh, I'm glad that thing didn't happen to them. But what if you took the worst thing that could happen in your product and applied it to your most vulnerable customer? I think thinking that through forces us to grapple with our worst case scenario, really the worst thing that can happen. And if you think it through and map it out, like actually map out the user journey, that will force you to work out how important it is to mitigate it, to prevent it from happening, or to find out sooner that it's happening so you can give them support.
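One way to picture those data schemas for capturing internal learnings is a per-case review record; this is a hypothetical sketch, and all field names are mine rather than anything from the actual telco system:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ClassificationReview:
    """One record per action we took, so we can compare what we believed with what was true."""
    case_id: str
    signal: str                    # raw observation, e.g. "high-volume international SMS"
    inferred_activity: str         # our interpretation, e.g. "spam campaign"
    action_taken: str              # e.g. "outbound SMS suspended"
    customer_feedback: Optional[str] = None
    verdict: str = "unknown"       # "true_positive" | "false_positive" | "unknown"
    notes: List[str] = field(default_factory=list)

def false_positive_rate(reviews: List[ClassificationReview]) -> float:
    """Of the cases with a verdict, how many turned out to be wrong?"""
    resolved = [r for r in reviews if r.verdict != "unknown"]
    if not resolved:
        return 0.0
    return sum(r.verdict == "false_positive" for r in resolved) / len(resolved)

reviews = [
    ClassificationReview("c1", "bulk SMS", "spam", "suspended", verdict="false_positive",
                         notes=["activism mail-out, not marketing"]),
    ClassificationReview("c2", "bulk SMS", "spam", "suspended", verdict="true_positive"),
]
print(false_positive_rate(reviews))  # 0.5
```

The point of a schema like this is less the code and more the discipline: every action gets a slot for feedback and a verdict, so the state of your understanding is mapped over time rather than lost in Slack threads.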
Also, I'd suggest that when you're doing this work, don't expect people to be grateful.
Prepare for pushback.
People don't like it when you make bad assumptions about them.
People don't even like it sometimes when you make good assumptions about them; that can feel creepy. Additionally, making the implicit explicit is often uncomfortable.
So just prepare yourself mentally for the fact that people won't necessarily come to you and say, "Thanks so much for doing this." And the outcome may just be that you've helped them, but they're still not very happy with you.
So I think of this work, this designing for failure first, as designing the escape hatch. I think this is a useful metaphor because you don't want to have to use it, but if you need it, you sure want it to be there. And you wanna make sure that those hinges aren't rusted shut, right? No one needs an escape hatch that can't open. So here's another litmus test.
Can any customer go through my escape hatch, recover, and remain a happy customer? I think that's a pretty high bar, to be honest. You could even simplify it to: could any customer go through my escape hatch, recover, and remain a customer? And that's still a really good bar of quality. This next quote is from a paper by Donella Meadows called "Leverage points: Places to intervene in a system". If you're interested in the sort of systems thinking theory I've been touching on in this talk, I highly recommend the paper, and I really liked this piece.
She says, "Delays in feedback loops are common causes of oscillations.
If you're trying to adjust a system state to your goal, but you only receive delayed information about what the system state is, you will overshoot and undershoot". So what I want us to remember is that these feedback loops we're baking in as part of our failure first design, they're not only to protect and support our customers, they're also for our own understanding of the accuracy of our system.
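Meadows' point about delays is easy to demonstrate in a toy simulation: the same corrective loop converges when its feedback is fresh, and overshoots when it can only see a stale reading of the system state. The function and parameters here are purely illustrative:

```python
def adjust_toward(goal, gain, delay, steps):
    """Correct toward `goal` each step, but observe a state that is `delay` steps old."""
    history = [0.0]
    for _ in range(steps):
        observed = history[max(0, len(history) - 1 - delay)]
        history.append(history[-1] + gain * (goal - observed))
    return history

fresh = adjust_toward(goal=100, gain=0.5, delay=0, steps=20)
stale = adjust_toward(goal=100, gain=0.5, delay=3, steps=20)

print(max(fresh) <= 100)  # True: fresh feedback converges without overshooting
print(max(stale) > 100)   # True: delayed feedback shoots well past the goal
```

With `delay=3` the loop keeps correcting against old readings, blows past 100, then swings back under it, which is exactly the overshoot-and-undershoot oscillation Meadows describes.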
So the feedback timeframe, the speed of learning what's happening, is incredibly important for our understanding of the world, as well as for preventing the harms we're talking about. So let's talk now about my last finding, which is upstream prevention versus downstream mitigation. Or, as I like to say: is this a product problem or a behaviour detection problem? This next quote is from Eva PenzeyMoog, who just started a newsletter focused on her work on the use of technology in domestic violence and abuse. It's a really great newsletter, I think it's fascinating, and as I've been reading her work, I see a really strong resemblance between the space of fraud and technology and the space of abuse and technology.
These are people with a different agenda to yours, and they're looking for vulnerabilities they can exploit for their own ends.
So Eva says, "These sorts of things will happen. I'm very intentional about not saying abuse might happen. If it can, it will." And I think that's a nice introduction to this concept I have of the product use cascade, or hierarchy of what's possible to do in your product. So let me explain this.
I think the top, the highest level of the hierarchy, is what is actually possible to do in the product. That, to me, trumps everything else.
If I can do it, and I want to do it, I probably will. The second thing underneath that is your intentions, your framing, your design. So that could be things like microcopy that helps people know what they should be putting into their form, or how they should be using the tool.
It could be your onboarding copy.
Then there's things like the culture of the community. So if you have a product where there's lots of people, and they have an established way of interacting with each other, that culture will also have a normative impact on how people use your tool.
Then there's anything else.
And then at the very bottom, there's terms of use. So I see terms of use as being the least important and the least impactful when it comes to actually changing people's behaviour.
It's really only there because legally we need it, but I really don't want people to think of it as a useful tool for changing how people will behave in your product.
So to put that strongly: I think possible equals permissible. If you don't care to make it impossible for someone to do something in your product, that's tacit permission for the behaviour to occur. To extend on that: setting boundaries is design. I think we often miss out on the opportunities to do this well.
So think about your marketing copy, your onboarding, your continued engagement comms.
These are all opportunities to establish the norms of your tool and to give people hints about what should be okay.
So this Tweet is from Cory Doctorow, and I love it because it's so snarky.
He says, "The rise and rise of terms of service is a genuinely astonishing cultural dysfunction. Think of what a bizarre pretence we all engage in that anyone ever has read these sprawling garbage novellas of impenetrable legalese." Isn't that a beautiful quote? I just think it's so fun.
It's beautifully expressed, because he's a writer, but I do truly believe this is the case.
Like, we have to get real that this pretence we all engage in, that terms and conditions are somehow meaningful or that we engage with them in a useful way; it's just not the case.
We all scroll, click and move on with our lives. And to summarise this idea I've been talking about, of upstream changing the product versus downstream observing the behaviour and doing something about it: I actually think that in most cases, downstream work is gonna be much more costly and difficult for you than fixing it upstream in the product.
So think about something like promotions abuse. If you have an invite-a-friend offer or some other kind of promotion that you're running, and you can add some kind of guard rails into your product design that prevent people from using that QR code or that unique URL more than the number of times you want them to, that's always going to be better than trying to work out post hoc how people have abused it and then removing some credit.
That's always gonna be an awkward conversation and you don't wanna have to do it.
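The upstream guard rail in the promotions example can be as simple as enforcing the redemption cap at claim time. This is a hypothetical sketch, with an in-memory counter standing in for whatever store a real system would use:

```python
# Reject over-cap redemptions at claim time (upstream),
# instead of clawing credit back after the fact (downstream).
MAX_REDEMPTIONS = 5
_redemptions = {}  # promo_code -> times redeemed

def try_redeem(promo_code):
    count = _redemptions.get(promo_code, 0)
    if count >= MAX_REDEMPTIONS:
        return False  # blocked at the door: no awkward conversation later
    _redemptions[promo_code] = count + 1
    return True

results = [try_redeem("FRIEND-2020") for _ in range(7)]
print(results)  # [True, True, True, True, True, False, False]
```

The sixth and seventh attempts never create credit, so there is nothing to detect, investigate, or reverse downstream.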
So we're nearly there, but I have a few more things to talk through. I like to talk about top down strategy versus bottom up strategy, and you've probably got this sense by now, but if you were hoping I would give you pointers on how to write a values strategy or a principles document, I'm sorry to disappoint.
That's not really my jam.
I think there's no magic bullets in this space. We live in a world of complex entangled systems, and as a result, our problems in the space of fairness and ethics are complex and entangled.
I would add an addendum here that not only are there no magic bullets, but anyone who tells you otherwise is selling you something. In fact, I believe the desire for a magic bullet is part of the problem.
Remember, we're in the Cambrian explosion phase, not just of AI, but also of AI ethics.
It's on us to interrogate the power structures and agendas that lurk behind the glossy marketing copy. So, talking of Cambrian explosions: this graphic is from a Harvard research document, showing a whole map of the artificial intelligence principles documents and strategy documents that exist in the world at the moment.
These cover government, the private sector, and joint projects, and the themes you can see in those concentric circles cover things like the promotion of human values, and security and privacy. So they did this meta-study where they looked at all of these values documents, and they said: hey, look, a whole bunch of concepts seem to be coming up over and over again, isn't that great? Aren't we getting good at doing AI ethics? But I have a kind of cynical interpretation of this to offer.
I'm not super convinced that the repetition of these themes across these documents is an indication of some kind of global agreement about the right way to go forward.
I honestly just think it's a bit of copy and paste. So let me be a little more precise about my critique of what the strategy documents do and what they don't do.
This is a quote from a paper exploring the available AI ethics tools, and one of the important things it found was that developers are becoming frustrated by how little help highly abstract principles offer when it comes to their day-to-day job.
Let's take a look at some of these actual principles and see if that makes this a bit more real. So here's one from the Mozilla Trustworthy AI paper. It says, "Goal: Consumers choose trustworthy products when they're available and demand them when they aren't." Now, I think there are two really problematic assumptions at the heart of this.
One is that customers will know whether it's trustworthy or not. I'd argue that in a space where there are abstractions on top of abstractions, and often even the technologists themselves don't know what's happening or how exactly that classification was arrived at, how can we expect consumers to meaningfully interpret or interrogate the technology from the consumer side, and then say, "Yes, cool" or "No, not okay"? And the second assumption I don't like at the heart of this is that we are actually in a place to employ our consumer power.
I'd argue that we're not.
We're not actually buying these machine learning models; the people buying them are advertisers.
They're police departments, they're governments; we aren't actually getting our wallets out and voting. And in that land, we don't have very much power at all. So we have both an economic asymmetry and an information asymmetry.
Making a statement like this feels at best naive and, at worst, ideologically captured.
I rant about this at length on Twitter.
So if you wanted to read my longer version rant about this, please hop in.
The next example of a goal, or a value, or principle, I took from the IBM Everyday Ethics.
And they say, "AI must be designed to minimise bias and promote inclusive representation." Again, I don't disagree with this in theory, but I just want to know: how much? What's my test for this? How do I know that bias has been minimised? And if I care about inclusive or diverse representation, how much is enough, will you be reporting on that, and how can I interrogate it? So I think the problem is not so much the statement as the follow-through.
What does it actually look like? And just to round it out with another example, here's one from Smart Dubai, where they say, "AI systems should be safe and secure, and should serve and protect humanity." Again, this is hard to argue with, right? It sounds great, but I have no sense of when this is done, or whether it's true or not. It feels a bit hand-wavy and a bit divorced from reality, in my cynical opinion. So yeah, to sum this up: I do worry that a lot of the ethics work in this space is bike-shedding, without measurable principles and without real impacts or penalties for not getting this right.
This feels toothless.
And worse, they're resource thieves; they're sucking the oxygen from the room.
And they're lulling us into thinking that we're already doing the work when we haven't really started.
I really like this quote from my partner, his name is Andy and he's lovely.
And we were talking this through and he said, "Yes, like no more strings of empty profundities." And I thought that was a really nice way of encapsulating this problem.
I don't wanna hear things that sound profound but that I can't test for and can't measure. So let me pitch you a metaphor.
When I think of us building technology systems, especially big enterprise-level technology systems, I think of us as a team sailing on a big cruise liner. And when you go below to the passenger deck to eat dinner or to watch a movie, it's sort of safe; it even feels boring.
Like, you can't feel the ship moving. It's so big and so stable that you have no sense of what's happening outside your little world.
And you don't really know where you're going and you don't have any sense of like the big context. But if you leave the passenger decks and you go up, up to the outside and you gaze out over the water, and you feel the wind in your hair and you look down, and you see the amount of water that's rushing past you, you see the speed of this big boat, and you have a sense of the context that you're in; that's when you actually can meaningfully grapple with the dangers and the risks that we're introducing in our world.
So my pitch is that we want to feel the wind rushing past. I think insulating ourselves from that sensation is itself unhelpful.
So yes, how not to do principles.
I just don't want you to write a document, throw it over the wall and say, "Go use it." If you do want a principles document, you need to champion the approach from inside the teams, get your hands dirty and show by doing.
How do we make this real? If you want this strategy document committed to action: assign a strategy champion, or champions, who will actually support people in making it real; set measurable targets; provide examples; and most importantly, allow for uncertainty, because so much of this work is contextual, and a principle which makes sense in one domain may seem absolutely meaningless in another. So we're gonna have to give ourselves time to work out what makes sense, and perhaps adjust our principles accordingly. Okay, that was a lot.
Just a few more final thoughts, I promise.
So, let's remember these findings we've worked our way through.
We have developing healthy cynicism about our data sources, mapping our knowledge states, designing for failure first, and upstream prevention versus downstream mitigation. So, all of these lead nicely into the final topic I wanna touch on which is automation design itself, how we think about it, and more importantly how we might change the way we think about it. So, I want us to stop thinking about automation design as an all-or-nothing feature.
It's never going to be: pick off every single task and automate all the way end to end, or don't do it at all. Let's think about ways we can break down the monolith, test our assumptions, and explore what's working in lightweight, easy, iterative steps.
So, for example: are there any ways I can build a no-code prototype? Can I put a human in where a script would be? Can I get them to do the thing the script would do? And does that give me any useful information about how successful or how good my coverage of that script is? A really important note there: if you put a human in where a script would be, or where a chatbot or some other kind of automation would be, and they see something they're supposed to do and it feels bad, that's really useful information. That tells me that something's gone wrong and we haven't worked out how to handle all the edge cases we might be seeing.
Another way of breaking down the monolith is recognising that automations don't need to be on forever. They can be temporary, they can be timeboxed, or you could just pick off a small chunk of your user base and send it just to those people. So think about ways you can scope your experiment down, just so that you can learn enough to do the next thing. Don't think about it as an all-or-nothing proposal, because that will make it much scarier and much harder to build your confidence in.
Another one: not every automation needs to go end to end, from the input data to the action at the end. You could think of lots of other outputs. Is it something that adds to a dashboard? Is there a notification, or an escalation through your PagerDuty? Is there something happening in Slack, or is there an action happening? If you consider all of these possibilities, again, you'll see there's a world of nuance in what you can do before you get to the fully automated version of the product.
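As one hedged illustration of breaking down the monolith, the sketch below scopes an automation to a small cohort and timeboxes it, routing everything else to a human queue; the names, percentages, and dates are all invented for illustration:

```python
import hashlib
from datetime import date

ROLLOUT_PERCENT = 10              # only ~10% of users get the automation
END_DATE = date(2020, 12, 31)     # the experiment is timeboxed

def route(user_id, today):
    """Decide whether a case goes to the automation or to a human."""
    if today > END_DATE:
        return "human_review"     # automation has switched itself off
    # A stable hash puts each user in a fixed bucket from 0-99,
    # so the same user always lands in the same cohort.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "automated_action" if bucket < ROLLOUT_PERCENT else "human_review"

print(route("user-42", date(2021, 1, 15)))  # human_review: past the timebox
```

Everything that falls through to `human_review` is itself a learning channel: the humans handling those cases tell you which edge cases the script isn't ready for yet.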
And finally, and I hope this isn't controversial, but please do usability testing on anything that's user facing.
If you're doing something and people will be seeing it, make sure it's interpretable, make sure you've tested it for tone of voice, and make sure people don't feel confronted, unhappy, or otherwise boxed into a corner by the experience. So to conclude, I love this quote from Richard Feynman: "The first principle is that you must not fool yourself and you are the easiest person to fool." And I think that's so true. Whether we're doing product, running science experiments, or doing physics, we have to remember that our own overconfidence is as much the enemy as anyone outside ourselves. So remember that the real game, whether you're in fraud or not, is trying to catch the fraud while not defrauding ourselves.
Our minds are fair game.
Our interpretations are fair game, and we have to improve our way of understanding what's happening so that we don't get overconfident and stuck in our own biases. We are driven by the same incentives as the fraudsters; it's just the light and dark versions of markets and capital.
So remember: we honestly have as much in common with them as we do with any other user in the world.
Thank you so much for your attention.
I hope you found that useful.
Here's a few links that will guide you through to finding the resources I shared in this deck, and please reach out if you have questions or you want to chat with me about any of these things that I raised.
(smooth music)
How do we connect high-level principles with day-to-day product decision making? How do we move past the AI Ethics hype and start trying, testing and implementing practical approaches?
These questions are at the heart of Laura’s work, and in this talk she shares stories, discoveries, and decisions from her time as an ‘ethics ops’ consultant embedded with a small team in a big telco.
From improving the science bit of data science, to developing the collective sensitivity of the team, to designing recourse for false positives, tune in for pragmatic pointers and actionable take-aways that you can try with your team right away.
Fair game: the ethics of telco fraud
Laura Summers: Multi-disciplinary designer – Debias.ai
Keywords: ethics, ethics ops, risk mitigation, harm mapping, explicit vs implicit knowledge, data interpretation, fraud and security, cognitive design, ethics litmus tests
TL;DR: Laura parses some of the challenges involved in designing and developing ‘Ethics Ops’. Whilst she doesn’t proffer a magic bullet, she shares some key learnings and insights gleaned through her work as an Ethics Ops consultant, specifically learnings from within the fraud space, presenting a range of use cases to demonstrate how thinking through product design via a harm identification and mitigation lens can be helpful not just for our users but also ourselves, our teams, and our companies. By developing a healthy cynicism about data sources, mapping knowledge states, designing for failure first, and paying attention to upstream prevention as opposed to downstream mitigation, Laura introduces some practices and questions we can experiment with to work towards meaningfully grappling with the inherent risks, problems, and complexities surrounding the space of ethics and fairness.
Before we kick off, Laura would like to share a new project she has just posted online: a digital version of Ethics Litmus Tests. The online version features a card-flipping tool you can use to pick a card. Laura borrowed the code for this project from Ben Buchanan, who created a card-picker tool for a similar card deck called Oblique Strategies, which was an inspiration for Laura's project. This is a lovely moment of networks colliding! If you don't know Ben, he's the fella behind the Big Stonkin' Post blog summaries of all the fabulousness coming out of Web Directions. Shout out and thanks to Ben, and if you want to try the online version of Ethics Litmus Tests please do so!
If you want any of the references for anything in this talk, you can find the slides here.
Section 1: Friend or fraud? Laura will cover four primary findings she'd like us to take away today:
- Developing a healthy cynicism about data sources.
- Mapping knowledge states.
- Designing for failure first.
- Upstream prevention vs downstream mitigation.
Don’t be worried that this will be too specific to Telco or machine learning. It’s actually quite generic. These findings and techniques she’s offering are certainly the foundation pieces to doing complex modeling correctly, but they apply to any kind of automation; any kind of decision system. Ergo: They apply to you.
At the beginning of 2020 Laura began consulting for a telco through her company Debias.ai. She's primarily embedded in their fraud detection team. Their approach to fraud detection is fairly standard in the fraud space – a mix of monitoring, internal business rules, and some third party systems that have proprietary fraud ML detection algorithms.
If you’re unfamiliar with the term ‘ethics ops’, think: dev ops, but for ethics. Laura introduces a harm identification and harm mitigation lens. Pragmatically, that work looks like monitoring notifications and alerts, as well as product flows and feedback loops.
Fraud is a really interesting space in which to discuss ethics. Not only are you the company capable of inflicting harm on your customers, you’re simultaneously under attack by a subset of customers who mean you harm. You have to balance the real (and often urgent) concerns of identifying bad behaviour whilst also acknowledging the possibility of false positives and considering how you might identify and support any innocent parties who accidentally got caught in the crossfire. It’s a war of attrition. Think: email spam.
Finding #1: Developing a healthy cynicism about data sources. When we start to feel too comfortable about how we’re thinking about or interpreting our data, that’s when we often get bitten by something unique or unexpected when we least expect it.
Laura believes that Data (interpretation) pride comes before a fall. Some illustrative examples: One person sending a high volume of SMS messages, much of it to overseas numbers. Typically, they would assign this type of behaviour to a marketing strategy or spam (Buy these Raybans! etc). Upon investigation, this was not the case. They were actually sending information-based messages containing activism literature. Typically you would expect this to be transmitted via email, but here it was via text. Unusual? Yes. Malicious? No.
Another ex: Customers self-automating across various arenas like daily reminders, or health monitoring auto-calls to check in on vulnerable populations. The danger here is that these types of automations look very much like more problematic bot behaviours or overly consistent ‘not human’ like behaviours that the data flags. A core element of doing this work well is keeping in mind that these innocent instances are happening simultaneously with more suspicious behaviours within the data set. How do we help ourselves to think through these issues?
Deconstructing your proxy: Thinking of proxy as a stand-in/placeholder for the thing you want to measure, all analytics and all product work rely on proxies. Ex: We might assume that the number of clicks on a button is proxy for: I want to buy that thing/take the next step, But it could equally be due to rage clicks. Or if looking at how much time a visitor spends on a page, we might assume that is proxy for how engaging the content is. But it could equally be due to page abandonment due to a pop up confusing them and turning them off. Being suspicious of your proxies and taking time to think them through is a much more pragmatic way to understand what you might be learning.
A concept framework. Start with the idea of a signal – the incoming data, the zeros and ones. Sometimes it comes from a third-party server, sometimes from infrastructure, sometimes from data sources you're otherwise plugged into. It is a thing you are measuring. From the signal, you infer some type of activity. But note, this first inference is a leap in logic. It's telling you: I think this is probably what this is, but actually it could be something else. Back to the texting ex: The intuitive and obvious interpretation of seeing SMSs coming from a phone is: somebody is on their phone sending a text. Actually, it might be malware sending the texts without the person's knowledge. Acknowledging the second possibility injects uncertainty into the equation.
From the activity, you might further infer a persona – if somebody is texting, you assume two people are having a conversation. But it could also be somebody simply confirming an appointment. There are many iterations of texting that may not reflect a traditional conversation – testing an API, pinging etc. The lower persona could be the person who wrote the malware to make the phone send SMSs in the background. We don't know much about them, but we are going to assume they are making money in some way. What this framework of signal-infers-activity-infers-persona demonstrates is that for every leap of logic, there is some uncertainty. The more we do the work to map this out and think it through, the more pragmatic and grounded our understanding of what we're learning will be.
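The signal-infers-activity-infers-persona framework can be written down as a chain where every leap of logic discounts your overall confidence. The numbers and claims below are illustrative assumptions, not figures from the talk:

```python
# Each inference leap carries its own confidence; our compound confidence
# in the final persona is the product of every leap along the way.
inference_chain = [
    ("signal",   "outbound SMSs observed from this handset",                 1.00),
    ("activity", "a person is typing texts (could be malware)",              0.80),
    ("persona",  "two people having a conversation (could be an API test)",  0.60),
]

def chain_confidence(chain):
    total = 1.0
    for _level, _claim, confidence in chain:
        total *= confidence
    return total

print(round(chain_confidence(inference_chain), 2))  # 0.48
```

Even with fairly generous per-leap confidence, the compound figure drops below half, which is the point of the framework: the persona at the bottom of the chain is far less certain than the signal at the top feels.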
A $600m mistake? Lest we think Laura is throwing shade at prior employers, think about this example in terms of how small and subtle misunderstandings can spin out to massive impact: a recent NBN news story where they didn't know they were missing 300k premises which still needed to be hooked up. Going back to re-integrate these added $600m to the project budget. The data distinction is that they had a list of addresses, but they did not know that the list did not cover every single premises at those addresses. This is an easy mistake to make and Laura feels empathy for the engineer who identified the problem. But it's also a great Aesop's Fable for the rest of us who cannot afford to make $600m mistakes!
Look at your raw data. Beyond deconstructing your signals and proxies, look at your raw data, not just at dashboards. Data scientists, please use the tools available to you to interrogate the validity of that raw data. Does it actually meet your expectations? Do the types match? Are there a lot of formatting issues? How is the overall data quality? This is a group responsibility, not just the data scientist’s. Subject matter experts often have strong intuitions about the human stories underneath the data. This can be invaluable to support the data scientists and analysts in knowing what they’re looking at and what to make of it.
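"Look at your raw data" can start as simply as a few programmatic checks. This is a minimal sketch; the column names and the expected phone-number format are assumptions for illustration, not the telco's actual schema:

```python
import re

# Toy raw rows: do the types match, and is the formatting what we expect?
rows = [
    {"msisdn": "+61400000001", "sms_count": 42},
    {"msisdn": "0400 000 002", "sms_count": "17"},  # wrong type, odd format
    {"msisdn": None, "sms_count": 3},               # missing value
]

def validate(row):
    issues = []
    if not isinstance(row["sms_count"], int):
        issues.append("sms_count is not an int")
    if not isinstance(row["msisdn"], str) or not re.fullmatch(r"\+\d{11}", row["msisdn"]):
        issues.append("msisdn is not in the expected +61xxxxxxxxx format")
    return issues

bad = [r for r in rows if validate(r)]
print(len(bad), "of", len(rows), "rows have quality issues")  # 2 of 3
```

Checks like these are also a natural place for subject matter experts to contribute: they often know which formats and value ranges the human story behind the data should actually produce.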
Data misinterpretation is easy, and likely, under pressure. One of the most interesting aspects of working in fraud is the pressure put on data interpretation during periods of attack. It’s akin to putting your data science hygiene through a crucible of intense experience in unexpected ways. If you’re not prepared to put your data science, labelling, and hygiene through that high-pressure experience, you may not yet be at the right stage of maturity. If so, go back and think through what you’d need to build up the confidence, and the load-bearing quality, that can withstand being in the crucible.
An example from Laura’s experience in fraud: in one instance the team thought they had taken action against specific services; they could see the bot in production believed it had performed its necessary function, and they were very confused trying to understand what had happened. On investigation, it turned out the service running the data batch had simply been down for a few hours during a planned maintenance outage. This is a common occurrence, but the team felt so stressed about the problem that they missed the obvious explanation. She shares this not as a perceived failure of the team, but to highlight how easy it is to get wound up rather than take the time to step back and logically assess each part of the automation process where the issue may have occurred. A second example: the team had features they knew took a fixed period of time to calculate. The lag from activity being captured in the real world, to the features being calculated, to the alert coming out on Slack was a fixed period they knew they had to account for. But when alerts came out at 2am while they were stressed and under attack, they forgot about the time lag and assumed something more suspicious was going on, because they had taken action on a service and weren’t seeing the result immediately. These kinds of misinterpretations are common and human, but with cool heads and deep breaths, they’re easier to move past.
Design for cognitive load. Simple doesn’t imply you’re stupid. Simply because you have a team of brilliant and skilled data scientists and developers and machine learning engineers doesn’t mean they need to function as detectives under time and cost pressures. If you can, add labels. Be explicit. Make it impossible to misinterpret.
Litmus test. A great litmus test here is: On a quick glance, is this easy to misunderstand? If you literally look away from the screen and look back and don’t see clarity immediately, this is a signal that you could do more on making it interpretable. More pointers for how to make data interpretation scalable and easier across the team:
- Are we using a DSL (domain-specific language)? Are we using jargon? Are we using acronyms? If so, can we change it? Being explicit, particularly around load-bearing data or notifications, and not assuming prior knowledge, will almost always pay dividends.
- Add units: When looking at time, is it minutes, hours, or days? When looking at counts, is it one person, or one hundred services, or one thousand?
- Absolute (total count) or relative (%)? 90% accuracy over a sample of ten means one mistake; 90% over a million means a hundred thousand mistakes. A relative percentage only yields useful information when grounded in the total or absolute count.
- Is the data recency clear? Calling back to the aforementioned fixed delay: if there is going to be a delay between data capture and seeing the thing, make that as explicit as you can, e.g. “the data was captured at [timestamp x] and the alert happened at [timestamp y]”. This may feel over-explicit at the time, but future you will thank past you for doing this!
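The pointers above can be rolled into one habit: make the alert string itself impossible to misread. Here is a small sketch of an alert formatter that bakes in units, absolute plus relative counts, and data recency; the function name and message wording are assumptions for illustration, not an existing API:

```python
from datetime import datetime, timezone

def format_alert(flagged: int, total: int, captured_at: datetime) -> str:
    """Build an alert message with explicit units, absolute AND relative
    counts, and data recency, so a stressed 2am reader can't misread it."""
    now = datetime.now(timezone.utc)
    lag_minutes = (now - captured_at).total_seconds() / 60
    pct = 100 * flagged / total if total else 0.0
    return (
        f"{flagged} of {total} services flagged ({pct:.1f}%). "
        f"Data captured at {captured_at:%Y-%m-%d %H:%M} UTC, "
        f"alert sent at {now:%Y-%m-%d %H:%M} UTC "
        f"(~{lag_minutes:.0f} minutes behind real-world activity)."
    )
```

Usage: `format_alert(9, 10, captured_at)` yields something like “9 of 10 services flagged (90.0%). Data captured at … (~30 minutes behind real-world activity).” Over-explicit, and deliberately so.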
Finding #2: Mapping knowledge states. This is where Laura tells you that the best way to improve your data science practice as a whole is to do better at the boring science bits: the bits where we capture knowledge and store it so we can find it in the future.
Beware the swamp of lazy assumptions. The swamp of lazy assumptions is the status quo for most companies, in most domains. There are times when this matters a lot. Ex: the fraud space. When first joining her team, Laura had trouble getting clear, understandable answers about the types of behaviours that were happening: many mismatched assumptions, a lot of namespace clashing, and much murkiness. Again, this is common, but doing the job of asking the questions highlighted that many others had similar issues and mismatched mental models.
Implicit vs Explicit knowledge. Just because a thing is knowable doesn’t mean it’s known. Even if it is known, it doesn’t mean it’s knowledge that the team or the company owns or possesses. There’s a distance between what is knowable, what is known, and what everybody knows collectively. Become a knowledge excavator: The first step to creating shared understanding is to ‘sniff out’ implicit knowledge and try to formalize it.
Naming matters. Naming things, models, reference points;
- Building a shared vocabulary – this is key. Even in a domain with a lot of existing language/jargon/vocab, taking the time to write out what you mean when you say the thing – often capturing all the other words that mean the same thing – is super useful in getting everyone onto the same page.
- Build shared mental models.
- Avoid namespace clashing – and if you do have a namespace clash, find a different name for one of the things so you don’t have a push and pull of meaning. Ex: international revenue sharing fraud vs toll fraud. These are two names for exactly the same kind of fraud, but the names don’t indicate it. That kind of cognitive dissonance creates more interpretation work for everybody.
Establish personas. Laura’s team uses personas to understand the different sorts of misuse behaviours and the drivers behind them. (NB: these are not the deeply researched personas typical in UX work – there are obvious barriers to talking to fraudsters – but they do what they can with the data available. It’s still useful to have names for the personas to help the team communicate internally.) Ex: they have a persona for somebody who over-uses promotions and engages in deal-seeking behaviour (think: a digital coupon clipper). This person should not be confused with someone posing as a telco and on-selling the company’s services, which is a whole different kind of costly, large-scale fraud. You’re sending messages or warnings to two very different subsets of people with very different intentions, so you need to talk to them accordingly.
Defining your baseline. Without shared understanding you can’t define your baseline. Eliciting and discovering the small discrepancies where team members’ understandings don’t match yields great insight, whether in the form of revealing questions to investigate or in pointing to product opportunities. This work may feel a little painful or verbose, but if you take the time to do it, it can be insightful for sowing new seeds.
Data schemas. While we do well at capturing data on the product side – seeing what people do with our tools – we aren’t so good at capturing information about our own state of the world: our understanding, and how it changes over time. Doing the work of designing these data schemas matters both for establishing our baseline and for establishing a starting point if we do want to go into statistical or ML modelling. Ex: first, what did I think at the time? What was my intuition or inference based on the data? Then: what new information came to light; what did I discover? Then: what changed? What did I think after I saw the new information? Did it reinforce or change my original idea? As you get into the habit of designing data schemas, it will really flesh out and formalize the experiments we’re running all the time in product.
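One way to picture such a schema is a simple record type following exactly the prior-belief / new-evidence / updated-belief sequence above. This is an illustrative sketch only – the field names and example content are assumptions, not a schema from Laura’s team:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class KnowledgeState:
    """One captured step in how our understanding changed over time."""
    subject: str            # what the belief is about
    prior_belief: str       # what did I think at the time, and why?
    new_evidence: str       # what new information came to light?
    posterior_belief: str   # what do I think now?
    reinforced: bool        # did the evidence reinforce the prior, or change it?
    recorded_at: datetime

# Hypothetical example entry, loosely echoing the SMS persona discussion.
entry = KnowledgeState(
    subject="overnight SMS spikes",
    prior_belief="2am SMS spikes are malware-driven background sending",
    new_evidence="Several spiking services belong to appointment-reminder APIs",
    posterior_belief="2am spikes are mixed: some malware, some legitimate batch jobs",
    reinforced=False,
    recorded_at=datetime(2020, 6, 1, 9, 30),
)
print(entry.subject, "->", "changed" if not entry.reinforced else "reinforced")
```

Even a flat table of rows like this gives you a baseline to look back on, and labelled examples to start from if you later move into statistical or ML modelling.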
Changing our language changes our minds. Ex: Laura’s colleague, presenting his data analysis and the possibility of adding a new feature: “We’ll wait until we’re 100% certain… no, make that 99% certain.”
Hashtag proud mama! Laura had succeeded in getting everybody on the team to take on board the idea that yes, we’re doing science and moving toward certainty, but we’re never there. We always hold space in our minds for the possibility of doubt and change. Small hygiene changes, like moving from 100% to 99%, are big steps.
Added uncertainty in the space of fraud. Fraud adds another layer of uncertainty on top of the normal uncertainty of product or usability interpretation. Here, you really can’t trust what people say: people may answer your questions honestly, or they may lie. Dealing with this uncertainty day in, day out can make these discussions feel suspect, and you start to see fraud everywhere.
The antidote to suspicion. How do we avoid hardening our hearts to our customers? Set an explicit, deliberate intention to be kind and respectful in all of our communications.
bUt wHAt iF iT’s A BAd AcTor? This is a fair question, but: so what? What’s the actual cost or problem? What possible outcomes could you foresee from this scenario? If you speak kindly and respectfully to a bad actor, is there any cost or harm to them or the business? If you speak dismissively or unkindly to a good actor who’s merely confused, is there a cost to the person or the business? What if they’re differently abled, or have difficulty with certain channels? The cost of speaking nicely to a bad actor is nothing compared with the cost of speaking badly to a good actor – and be mindful of the reputational cost too. Focus on your goal: preventing bad behaviour that is costing you money. Keep respect and kindness at the heart of your communications.
Keep the moralizing out of it. You don’t know the person or their story. Even if they are exactly who you expect they are, it doesn’t really matter. Describe behaviours, not people. Describe: ‘service misuse’ or ‘promotions overuse’ rather than ‘an abuser’ or ‘a fraudster’ which triggers negative connotations and hardens your heart to them.
Finding #3: Designing for failure first. At her telco, Laura has been involved in designing communication flows that give clearer feedback to people whose service came under suspicion, and allow them to request support if they think there was a mistake. In some cases people received a warning first; in others it was after the fact. These were only small-batch cases, and proved very challenging. If you don’t ask, customers won’t tell. If you make an incorrect assumption about a person and take action against them, it’s so much easier for them to churn than to come back and say “that was wrong and I’m unhappy with you.” You likely won’t get any feedback at all unless they are really angry, which is obviously the worst-case outcome.
Feedback loops must be: Intuitive (make sense at the time), Contextual (specific to the action or experience this person just had), and Timely (both for the customer and for you – ask when it’s fresh in their mind). Get your feedback loops up and running and also incorporate…
Plan time for…
- Customer support – if somebody comes back and says ‘fix it’ what will your support script be?
- Product/model improvements – you may have spaces that aren’t performing well. You may need more features or improvements.
- Integrate your learnings – think about other ways the feedback may be integrated into your product cadence.
Be explicit. You need to think about the potential harms and outcomes. If you can’t imagine consequences, you’re not thinking hard enough.
Shot.
We believe this system has contributed to making Facebook the safest place on the internet for people and their information.
See this 2011 paper by a group of Facebook engineers which discusses their attempts to classify and prevent malicious behaviour (e.g. bots and other attacks) on the platform – trying to figure out who’s behaving suspiciously and shut them down.
Chaser.
The goal is to protect the [social] graph against all attacks rather than to maximize the accuracy of any one specific classifier. The opportunity cost of refining a model for one attack may be increasing the detection and response on other attacks.
Insert a string of scream emojis here. TL;DR, what they’re saying is: we don’t care how good our classifier is, nor do we care about the consequences of false positives. Zoom to 2020: the impact of being de-platformed or otherwise silenced on Facebook can be really significant. It can affect income (ex: a small business) or safety (ex: vetting for sex workers), to the point of potentially leading to death. So for Facebook to say, in an academic way, that they don’t care about the accuracy of any specific classifier is not just naive but deeply problematic and downright dangerous. Keep this paper top of mind as a reminder of how unintended consequences can change over time.
Harm mapping. This topic is critical, and has a breadth beyond the time we have available here. Laura’s recommended starting point, if you want tools, workshops and activities to support your learning in this space, is her GitHub page. Please also come back to her with questions around any of these – she’s exploring too, and developing intuitions about which ones are useful in which kinds of circumstances.
Failure-first design. If you’re thinking “what does that even mean?”, here are a couple of kick-starter suggestions. It could be things like:
- UI interactions – links or buttons with phrases like “Get help” or “This isn’t right”. Ex: Laura would love Netflix to add a button that gets rid of a recommended movie forever. They failed by recommending a film she hates; she wants to be able to correct that failure through some kind of UI interaction.
- Email/SMS templates – Ex: Offer the right to reply.
- Support scripts for conversations – what are your bullshit sniff-tests? These conversations are tricky; put some thought into them.
- Data schema design – capture your internal learnings and map the state of your understandings over time.
Litmus test: what if this (anything in your product) happened to my most vulnerable customer? We all have customers who are vulnerable in one or many ways. What’s the worst thing that could happen to them? Being forced to grapple with the worst-case scenario and map out the associated user journey will force you to work out how important it is to mitigate it, identify it earlier and offer support, or prevent it outright.
Prepare for pushback. Don’t expect gratitude! People don’t like it when you make bad assumptions about them! (Or sometimes even good ones!) Making the implicit explicit is often uncomfortable. Prepare yourself mentally.
Designing an escape hatch. This is a useful metaphor for failure-first design: you don’t want to have to use it, but if you need it, you sure want it to be there – and you want to be sure the hinges aren’t rusted shut.
Litmus test: could any customer go through my escape hatch, recover, and remain a happy customer? This is a high bar; you could even drop “happy” and still retain a bar of good quality.
Donella Meadows’ paper Leverage Points: Places to Intervene in a System is great for some of the systems theory content we’ve been discussing here. She writes:
Delays in feedback loops are common causes of oscillations. If you’re trying to adjust a system state to your goal, but you only receive delayed information about what the system state is, you will overshoot and undershoot. Donella Meadows
The feedback loops we’re baking in through failure-first design are not only there to protect and support our customers, but also for our own understanding of the accuracy of our system. The feedback timeframe is important.
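Meadows’ overshoot-and-undershoot point can be seen in a toy simulation. This sketch is an illustration of the general systems idea, not anything from the talk: a controller keeps pushing a system state toward a goal of zero, but acts on stale readings:

```python
def simulate(delay: int, steps: int = 40, gain: float = 0.5) -> list:
    """Toy control loop: each step applies a correction proportional to the
    state as observed `delay` steps ago. Purely illustrative numbers."""
    history = [10.0]  # start well away from the goal of 0
    for t in range(steps):
        observed = history[max(0, t - delay)]  # stale reading of the state
        history.append(history[-1] - gain * observed)
    return history

fresh = simulate(delay=0)   # settles smoothly toward 0, never overshoots
stale = simulate(delay=6)   # keeps correcting on old data: shoots past 0
print(f"no delay, min state: {min(fresh):.3f}")
print(f"6-step delay, min state: {min(stale):.1f}")
```

With no delay the state decays toward the goal; with a six-step delay the controller keeps applying corrections based on a state that has already changed, so it plunges past zero and oscillates. The same dynamic applies to a fraud team acting on alerts that lag real-world activity.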
Finding #4: Upstream prevention vs downstream mitigation. Or: is this a product problem or a behaviour-detection problem? This quote is from Eva PenzeyMoog’s newsletter, which focuses on her work on the technology of domestic violence and abuse – a space with a lot of overlap with fraud and technology: i.e., people with a different agenda to yours are looking for vulnerabilities they can exploit for their own ends.
These sorts of things will happen. I’m very intentional about not saying abuse ‘might’ happen – if it can, it will. – Eva PenzeyMoog
Product use cascade: or a hierarchy of what’s possible for you to do in your product.
- What’s possible to do in the product – this trumps all else
- Intentions, framing, design. Ex: microcopy helping people know what they should put into the form or how they should use the tool, or onboarding copy.
- Culture of your community – if you have a product with a lot of people with an established culture of how they interact, that culture will also impact how people use your tool.
- …[anything else]
- Terms of use – this is least important and least impactful when it comes to changing people’s behaviour.
Possible = permissible. If you don’t care to make it impossible for someone to do something in your product, it’s tacit permission for the behaviour to occur.
Setting boundaries is design. Think about your marketing copy, onboarding, and continued engagement comms – these are all opportunities to establish the norms of your tool and give people hints about what will be OK. A snarky tweet from Cory Doctorow:
The rise and rise of terms of service is a genuinely astonishing cultural dysfunction. Think of what a bizarre pretense we all engage in, that anyone, ever, has read these sprawling garbage novellas of impenetrable legalese.
We have to get beyond the pretense that terms and conditions are somehow meaningful or useful.
Downstream is usually more costly than upstream. Upstream changes the product; downstream observes the behaviour and does something about it afterwards. Ex: promotions abuse. If you build guardrails into your design that prevent people from fraudulently using a QR code or URL more times than you want them to, that will always beat trying to work out post-hoc how people did it and clawing back credit.
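A minimal sketch of the promotions example, under obvious assumptions (an in-memory counter standing in for real persistence, invented names throughout): the upstream guardrail simply refuses the redemption at request time, so there is no post-hoc detection or credit clawback to do:

```python
# Hypothetical upstream guardrail: cap redemptions at request time,
# instead of detecting abuse downstream and removing credit afterwards.
MAX_REDEMPTIONS = 1

redemptions = {}  # promo_code -> times redeemed (stand-in for a real store)

def redeem(promo_code: str) -> bool:
    """Grant credit only while the code is under its redemption cap."""
    used = redemptions.get(promo_code, 0)
    if used >= MAX_REDEMPTIONS:
        return False  # blocked upstream - nothing to claw back later
    redemptions[promo_code] = used + 1
    return True

print(redeem("WELCOME10"), redeem("WELCOME10"))  # first succeeds, second doesn't
```

The design point is where the check lives: at the moment of use, inside the product, rather than in a detection model watching usage logs after the money is already gone.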
Section 2: Top-down vs. bottom-up.
Responsible tech. This presentation is not about how to write a values strategy or principles document. There are no magic bullets in this space: we live in a world of complex, entangled systems, and as a result our problems in the space of fairness and ethics are complex too. Addendum: not only are there no magic bullets, but anyone who tells you otherwise is selling something. The desire for a magic bullet is part of the problem. We are in the Cambrian-explosion phase not just of AI, but of AI ethics. It’s on us to interrogate the power structures and agendas behind glossy marketing copy.
The Principled Artificial Intelligence graphic demonstrates the enormity of the AI ethics space. It was created through a meta-study of values documents which found a great deal of conceptual overlap, and presented this as progress on AI ethics. Laura is unconvinced that the repetition of themes indicates global agreement on how to move forward – it’s more likely just a function of copy and paste.
Quote from this study exploring AI ethics tools:
As a result, developers are becoming frustrated by how little help is offered by highly abstract principles when it comes to the ‘day job.’
Let’s look at some of these abstract principles to ground this: Here’s one from the Mozilla Trustworthy AI Paper
Goal: Consumers choose trustworthy products when available and demand them when they aren’t.
There are two really problematic assumptions here. 1) That consumers will know whether something is trustworthy. With abstractions on top of abstractions, to the point where technologists themselves don’t know what a classification means, how can we expect consumers to meaningfully interrogate or interpret that technology to determine trustworthiness? 2) That we are in a position to employ our consumer power. We aren’t. These models are being purchased by advertisers, police departments, and governments; consumers are not the paying customers at this stage, so they have minimal power. There is both an economic asymmetry and an information asymmetry.
Quote from Laura’s recent Twitter rant on this:
I’ve been meaning to write about the @mozilla white paper on Trustworthy AI for ages. I started to formulate a response with friends but then…life events happened and I didn’t have the bandwidth. Here’s the paper.
The next example of a goal or value or principle is taken from IBM’s AI ethics:
AI must be designed to minimize bias and promote inclusive representation
Laura doesn’t disagree in theory, but: how much? What’s my code test for this? How do I know that bias has been minimized? If I care about inclusive representation, how much is enough? (And: will you be reporting on that? How can I interrogate it?) The problem isn’t the statement but the follow-through, or lack thereof.
One more example, from Smart Dubai:
AI systems should be safe and secure, and should serve and protect humanity.
Again: sounds great, but hand-wavy and divorced from reality.
Laura worries that much of the ethics work in this space is “bike-shedding”. Without measurable principles, or impacts, or penalties for not getting this right, it feels toothless. Worse, these documents are resource thieves, lulling us into thinking we’re already doing the work when we haven’t really started.
Quote from Laura’s partner Andy encapsulates this problem: A string of empty profundities
– Andy Kitchen
Laura doesn’t want to hear profundities she can’t test and measure. When thinking about building technology systems, particularly big enterprise-level systems, think of us as a team sailing on a big cruise liner. Down on the lower decks it feels safe, even boring: more stable, with the sense of motion minimized. But go up to the top deck, stand outside in the wind, look down and comprehend the speed of the boat and the context of where you are – that’s when you can meaningfully grapple with dangers and risks.
Laura posits that we want to feel the wind rushing past. Insulating ourselves from that sensation is itself unhelpful.
How not to do principles: don’t write a document, throw it over the wall, and say “go use this.” If you want a principles document, champion the approach from inside your teams, get your hands dirty, and show by doing.
Commit to action: assign strategy champion(s) who will actually support people in making it real; set measurable targets; provide examples; and allow for uncertainty. So much of this work is contextual – a principle that makes sense in one domain may not in another.
Section 3: Final Thoughts.
- Developing a healthy cynicism about data sources.
- Mapping knowledge states.
- Designing for failure first.
- Upstream prevention vs downstream mitigation.
All of these feed into automation design itself: how we think about it, and how we might change the way we think about it. Stop thinking about automation design as an all-or-nothing feature. How do we break down the monolith, test our assumptions, and explore what’s working in lightweight, iterative, easy steps?
- Concierge prototypes (human-as-bot): put a human where a script or bot would be; if doing what the bot is supposed to do feels bad to them, that’s useful information.
- Automations can be temporary or time-boxed, they don’t need to be forever.
- Not every automation needs to go end to end from input data to action at the end. Think about where it should be added – Dashboard? Notification? Escalation? Action? There is a world of nuance here.
- Do usability testing (duh). If people are going to see it, make sure it’s interpretable and testable.
The first principle is that you must not fool yourself – and you are the easiest person to fool. – Richard Feynman
Whether we’re building product, running science experiments, or doing physics, we have to remember that our own overconfidence is as much the enemy as anyone outside ourselves.
Remember that the real game is trying to catch the fraud without defrauding ourselves. Our minds and interpretations are fair game. We need to improve our ways of understanding what’s happening, to avoid getting overconfident and stuck in our own biases. We are driven by the same incentives as the fraudsters – just the light and dark versions of markets and capital. We have as much in common with them as we do with any other user.
Thank you! Check out Laura’s links below: