Spam fighting is a big part of local SEO these days, so I set out to see how spam fighters could be more efficient with some automation added to the mix.  In this post I'd like to not only share how I fight spam, but how the technology behind that process is implemented.

If you've spent any time working in local SEO, you're bound to have seen GMB listings that range from a bit of questionable keyword stuffing to completely blackhat lead-gen spam.  Some categories, like the legal field, are particularly prone to all kinds of spammy tactics.

Unfortunately, as Gyi Tsakalakis of AttorneySync has mentioned before, the one thing that can beat a solid backlink profile is spam.  Just to take one example, you can see that the listings in first and second place here are stuffed to the brim with keywords.

The Google guidelines clearly state that the listing name should match what clients would see on your website or on signage at your physical location.  

What would we see if we checked out the top result here?

The keywords after the hyphen are clearly not part of the legal business name, so according to the guidelines they really should not be in the listing name.  Google, however, appears to reward this behavior.

The primary remedy for this issue has always been the "Suggest an edit" feature, where spam fighters can strip the keywords from a business name in their edit.  This is a never-ending battle because the listing owner will inevitably restore the keywords after each edit takes effect.

Adding a new tool to the spam-fighting arsenal

My goal was to create a more efficient process so that fighting spam is hopefully less time-consuming.  To do that, I wanted to narrow the problem down to listings that outrank my client and that also have a mismatch between their GMB name and the name shown on their website.

Before diving into how exactly that process works, let's look at how it's presented.

The map displays all of your keyword competitors, with each location having a details tooltip like above.  When a new location is added, we scan the website and look for a matching name using a variety of techniques.  If no name match is found, the location's "Name On Website" attribute is highlighted in red.
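Under the hood, the check is a short-circuiting pipeline: the cheapest technique runs first, and more expensive ones only run when the earlier ones fail.  Here's a minimal sketch of that flow, with helper names invented purely for illustration (each corresponds to one of the techniques covered below).

def name_matches(location):
    # Hypothetical outline of the matching pipeline; each helper
    # stands in for one of the techniques described in this post.
    return (
        exact_match(location)      # plain substring search
        or fuzzy_match(location)   # "close enough" text comparison
        or image_match(location)   # OCR on images from the site
    )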

Checking for an exact match

Name matching will never be a perfect process, but we can hopefully achieve a reasonable degree of accuracy through a series of techniques.  

The first technique attempted involves simply fetching the content of the web page and looking for an exact match.  This will sometimes work, and it's fast, so we try that first and stop if a match is successfully found.

import requests
from bs4 import BeautifulSoup

# Fetch the competitor's page and parse it so we can search the
# visible text for the listing name exactly as Google displays it.
response = requests.get(location.website, timeout=10)
soup = BeautifulSoup(response.content, "html.parser")

print("Exact match:", location.name in soup.body.text)

The only complexity in this first approach comes from the need to get the raw text of the page, absent any tags in the HTML.  This is easy enough to do with the BeautifulSoup library in Python.
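If the text nodes run together when using the text attribute, BeautifulSoup's get_text method accepts a separator argument that keeps content from adjacent tags apart, which matters later when we split the text into lines.

# get_text() with a separator prevents text from adjacent tags
# from being concatenated into one unsplittable line.
text = soup.body.get_text(separator="\n")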

Checking for a fuzzy match

The exact match approach simply won't cut it in many cases, however, because the name on the website is often essentially the same as the listing name, differing only in minor punctuation or spelling.  Failing the direct approach, we can turn to fuzzy matching to get a "close enough" answer.

The fuzzywuzzy Python package makes it super easy to compare text and determine similarity ratios.  

from fuzzywuzzy import process, fuzz

# Compare the listing name against every line of visible page text.
# extractOne returns the best (match, score) pair, or None when no
# candidate reaches the configured score cutoff.
result = process.extractOne(
    location.name,
    soup.body.text.split("\n"),
    scorer=fuzz.QRatio,
    score_cutoff=app.config["NAME_MATCH_SCORE_CUTOFF"],
)

Here we use the extractOne function to find the best possible match among all of the text on the web page.  If no match exceeds the score cutoff, it's likely that the name isn't on the page in any reasonable form.  QRatio stands for "quick ratio", although there are several other scoring functions available.
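To get a feel for the scores, here's a quick example with made-up business names; the exact numbers will vary, but near-identical names score close to 100 while unrelated text scores low.

from fuzzywuzzy import fuzz

# Made-up names for illustration: punctuation and small word
# differences still yield a high similarity score (out of 100).
print(fuzz.QRatio("Smith & Jones, LLC", "Smith and Jones LLC"))  # high
print(fuzz.QRatio("Smith & Jones, LLC", "Contact us today"))     # low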

This pass will catch a lot of almost equivalent names, but there is a whole category of names that we haven't discussed yet – names embedded in images!

Using computer vision to find name matches

A surprising number of businesses seem to only include their name inside of an image. This obviously poses a problem if we want to know what a user sees when they navigate to the website.  

Fortunately, Python has a number of libraries that make this seemingly daunting problem easy to overcome.  With a combination of OpenCV and Pytesseract, we can do a decent job of extracting these names and checking for a match.  OpenCV is an enormous computer vision library, but we'll need only a tiny subset of all that functionality. Pytesseract is an OCR library that we'll use to do the text extraction.

We need OpenCV to perform some preprocessing on the images that will yield better results when using Pytesseract.

If you're using Ubuntu, follow the steps below to get set up with Tesseract.  The Pytesseract package is a wrapper that makes it easier to use the Tesseract project.

sudo apt-get update
sudo apt-get install -y software-properties-common
sudo add-apt-repository "deb http://archive.ubuntu.com/ubuntu $(lsb_release -s -c) universe"
sudo apt-get update
sudo apt-get install -y tesseract-ocr

To install the Python packages, execute the following.

pip install opencv-python
pip install pytesseract

With all of that in place, we're ready to read some images.  Pytesseract performs best when reading black text on a white background, so we'll use OpenCV to transform our input images into that format.  We also want to get rid of any artifacts in the image that might mistakenly be interpreted as text.

Our input will start off looking like the following.

The output from OpenCV will end up more like this after the preprocessing step.

Converting an image to pure black and white, so that any artifacts beneath a certain intensity threshold are eliminated, is known as binarization, and OpenCV offers several methods for it.  We'll use adaptive thresholding, which computes a separate threshold for each region of the image rather than applying a single global value.

import cv2
from pytesseract import image_to_string

# Read the downloaded image (f is the image file fetched from the
# site) in grayscale, then binarize it with a Gaussian adaptive
# threshold: each pixel is compared against a threshold computed
# from its local neighborhood.
th = cv2.adaptiveThreshold(
    cv2.imread(f.name, cv2.IMREAD_GRAYSCALE),
    255,                             # value assigned to "white" pixels
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY,
    31,                              # neighborhood (block) size
    2,                               # constant subtracted from the mean
)

print(image_to_string(th).split("\n"))


Once the image has been converted to grayscale and binarized, we feed it into the image_to_string function from Pytesseract, and receive the OCR output.  Just like in the previous fuzzy matching attempt, we use the fuzzywuzzy library to add a little leeway to the matching.
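That final matching step looks just like the earlier one; here's a minimal sketch, assuming the same score cutoff from the configuration is reused.

from fuzzywuzzy import process, fuzz

# Run the OCR output lines through the same fuzzy matcher used for
# the page text (assumes the earlier cutoff value is appropriate here).
ocr_lines = image_to_string(th).split("\n")
result = process.extractOne(
    location.name,
    ocr_lines,
    scorer=fuzz.QRatio,
    score_cutoff=app.config["NAME_MATCH_SCORE_CUTOFF"],
)
print("Name found in image:", result is not None)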

In conclusion

There are still many ways to improve upon this approach!  Simply harvesting the images from a site is a challenge on its own, given the many different image formats.  Images are also hosted in a variety of ways, sometimes making it tricky to retrieve them correctly. I've even seen the business name embedded inside of a background image.
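As a starting point, and only a starting point, harvesting might look something like the sketch below.  It only handles plain img tags with a src attribute, and deliberately ignores the harder cases like srcset, lazy-loading attributes, and CSS background images.

from urllib.parse import urljoin

# Collect image URLs from the parsed page, resolving relative paths
# against the page address.
image_urls = [
    urljoin(location.website, img["src"])
    for img in soup.find_all("img")
    if img.get("src")
]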