For those who keep up with tech news, content moderation has been a hot topic recently, with Meta losing its case against content moderators in Kenya who say they were unlawfully terminated, and Reddit’s API changes causing subreddit moderators to protest. With all the fighting going on between Big Tech and human moderators, it is tempting to try to automate away the problem. However, this problem is much trickier than it looks. This article walks through the process an engineer might take to create a content moderation tool that automatically bans every post with the word “ass” in it.
When you look at this page, you see words on a screen. However, the computer deals only in numbers, so in order to program a content-moderation tool, we need to think of the words, called “strings” in programming parlance, as numbers. What number each letter should be mapped to is defined by the ASCII standard, in which A = 65, B = 66, and so on. (Fun fact: emojis have their own character codes too, in the broader Unicode standard.) This is done under the hood - when I write a program, I still use letters, but the computer converts them to these numbers when it is doing its calculations. So, to write a line of code that answers the question “Is the word ‘ass’ in this Facebook post?”, the computer will convert “ass” to ASCII, which is “97, 115, 115”. Then it will convert all the words in the Facebook post to ASCII and loop through them to see if any of them equal “97, 115, 115”. The coding framework for that might look something like this:
for word in post.split():      # split the post into individual words
    if word == "ass":
        block(post)            # block() is a placeholder for whatever action removes the post
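As a quick aside, Python’s built-in ord() function lets us peek at the numbers the computer is actually comparing under the hood; this little sketch just prints them out:

print([ord(c) for c in "ass"])   # [97, 115, 115]
print([ord(c) for c in "Ass"])   # [65, 115, 115] - capitalizing one letter changes the numbers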
Easy enough! But wait, people find out about the tool and decide to be smart, so now whenever they want to say “ass”, they say “ASSociate”. We now update our tool to try to catch that by blocking every post where one of the words contains “ass” inside it, which means looking at every three-letter grouping within each word. The code for this would look like:
for word in post.split():
    if "ass" in word:          # "in" checks every grouping of letters inside the word
        block(post)            # block() is a placeholder for whatever removes the post
But wait, there’s someone named Cassandra in your group, and now you can’t post her name, since it contains the forbidden word. Also, some of the members of the group speak Portuguese, and now every time they try to talk about securing things (assegurar), their posts get blocked.
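A quick sketch shows how innocent words trip the same check (these example words are mine, chosen just for illustration):

# innocent words that happen to contain the forbidden substring
for word in ["Cassandra", "assegurar", "passport", "class"]:
    if "ass" in word:
        print(word, "would be blocked")   # all four match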
So, we decide to use the tool to make our lives easier rather than completely automating our job away: instead of blocking every post with the forbidden word in it, we are just going to send those posts to ourselves. This way, we only have to review suspect posts instead of all the posts in the group. Our code now looks like this:
for word in post.split():
    if "ass" in word:
        tell_moderator(post)   # placeholder: flag the post for human review instead of blocking it
This lightens our workload for a bit, but now we start getting complaints from users that people are still cursing, but instead of typing out “ass” they are using Internet-speak euphemisms, like “a$$”, “aSs”, and “ASS”. Remember, the computer is not looking at letters and seeing that they visually look similar to the word “ass”; it is just comparing ASCII numbers, and the “97 115 115” of “ass” looks nothing like the “97 36 36” of “a$$” or the “97 83 115” of “aSs”, even though humans can easily decipher the true meaning. This is a common theme in programming - humans are great at parsing out high-level patterns, but computers can only look at individual pieces.
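Spelled out with the same ord() trick as before, the comparison the computer makes comes back false even though a human reads both as the same word:

print([ord(c) for c in "a$$"])                              # [97, 36, 36]
print([ord(c) for c in "a$$"] == [ord(c) for c in "ass"])   # False - no match, so the post slips through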
So, we decide to go through all the posts and make a list of all the words being used in place of the forbidden word. We will then check whether each word in a post contains any of the words on that list.
for word in post.split():
    for forbidden_word in forbidden_word_list:
        if forbidden_word in word:
            tell_moderator(post)   # flag for review if any listed variant appears inside any word
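Here is a sketch of how that growing list might be used in practice; the is_suspect() helper and the list contents are just illustrative, and the list has to be extended by hand every time a new variant appears:

# every newly spotted variant has to be added to this list by hand
forbidden_word_list = ["ass", "a$$", "aSs", "ASS"]

def is_suspect(post):
    return any(fw in word for word in post.split() for fw in forbidden_word_list)

print(is_suspect("what an a$$"))                  # True - caught
print(is_suspect("Cassandra got her passport"))   # True - still a false positive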
This now catches all the undesirable posts, but now in addition to going through a large number of posts looking for the forbidden word, we also have to check all new posts to see if a new iteration of the word has popped up. We’re right back where we started, looking at all the posts, except now we are looking for two things instead of one.
Now seems like a good time to give up on our brute-force approach, so we switch to machine learning. Our goal is to train a neural network that will take in a Facebook post and tell us the likelihood that one of the words in it is a euphemism for our forbidden word. My first article talks about all the pitfalls that can befall us at this point, but let’s pretend for a moment that we can sidestep all of them and make a system that really, truly works, and can accurately detect everything we want it to. Can we have it automatically block posts for us? Consider these two scenarios:
Post 1:
Husband: Had a great time mowing the lawn at 7am this Sunday!
Wife: You're an ass :)
Post 2:
Husband: Had a great time mowing the lawn at 7am this Sunday!
Wife: F*** you you ass I am going to fu**ing kill you if you don't stop
Clearly, these two posts are very different in tone. Post 1 does have “ass” in it, but it is not hate speech; it’s a loving couple teasing each other. Post 2 is much more concerning, and should arguably be blocked. The problem with our machine learning model is that we only trained it to find the words, not to assess the entire tone and underlying meaning of the post.

This is really hard to do - people struggle with it all the time, and most of us have at least a high-school level English/(insert your native language here) education that covers those topics. It gets even more difficult when you consider different cultures, languages, and age groups. In addition, reading any political article will tell you that there is no clear consensus between all people on the tone and intent of any given speech or law. Since we can’t all agree on what should and should not be banned, how could we write a program that could do it for us? We could try, but every time a controversial decision is made, the programmers would need to justify what happened, and as we learned in my previous article, this is almost impossible to do with large machine learning models.

So, even if we are somehow able to create a machine learning model that could magically find our forbidden words, a human being would still need to be in the loop to make the final call, and it is vital that this person understands the language, slang, and values of the group they are moderating. Additionally, if our model works well (and most Big Tech models do), these people aren’t seeing the millions of happy wedding posts and friend hangouts and memes being posted on Big Tech websites; they are seeing vile hate speech and traumatic abuse that they can do little to curb in the real world. One can imagine that this is an emotionally draining task.

This means that if we want to have social networks free from abuse and misinformation, humans must be in the loop and they must be cared for in order to do their jobs. In addition, we need to recognize that not everything has a clear-cut solution, and it takes careful thought to decide what to do.
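To make the human-in-the-loop idea concrete, here is a minimal sketch of how the final pipeline might be wired up. Everything in it is hypothetical: toxicity_score() stands in for whatever model we managed to train, send_to_human_moderator() stands in for a review queue, and the threshold is made up purely for illustration.

def toxicity_score(post):
    # hypothetical stand-in for the trained model; a real system would run the classifier here
    return 0.0

def send_to_human_moderator(post, score):
    # hypothetical stand-in for a review queue staffed by people who know the
    # group's language, slang, and values
    print(f"needs human review (score={score:.2f}): {post}")

def moderate(post):
    score = toxicity_score(post)
    if score > 0.5:                       # threshold chosen purely for illustration
        send_to_human_moderator(post, score)
    # posts below the threshold stay up; the model never blocks anything on its own

The key design choice is that the model only routes posts for review; the final call always belongs to a person.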