Keyword stuffing might sound like something from a decade ago, but the practice is alive and well in local search.  Businesses are stuffing keywords into their names and ranking higher as a result.  

Search for "NYC lawyers" and you'll come across plenty of examples like the ones below.

According to Google, business names should "reflect your business’ real-world name." Despite that guidance, many businesses add keywords that they hope to rank for in their industry.  

For lawyers, that means keywords like "divorce", "DUI", "employment", or "immigration."  In some cases, these keywords might be part of the real business name, but often they appear only in the Google listing and nowhere in the actual name.  Does keyword stuffing actually help with search ranking?  According to Joy Hawkins, owner of the Local Search Forum, keyword stuffing can bump a listing up 1 to 3 positions.

Given that this is against the Google TOS, is anything happening to businesses that do keyword stuffing?  An interesting case study by Sterling Sky seems to indicate that most businesses get away with this.  Only 20% of those reported to Google received a hard suspension, and that was after the owners persisted in adding keywords back into their names.

Reporting keyword stuffing

If you aren't willing to violate the Google TOS, but others are, what can you do about keyword stuffing?

The main avenue for GMB corrections has always been the "suggest an edit" feature.

This is a very manual, one-by-one process: edits can take days to become active, and keyword stuffers will usually revert the changes fairly quickly.

The GMB support forum used to be a way to escalate issues like keyword stuffing, but Google has since published the business redressal complaint form as the official channel for reporting them.

The new form allows file uploads, though I'm not sure whether that means multiple listings can be reported at a time.  At minimum, the form seems to require a Google Maps URL corresponding to each listing.

Building a keyword stuffing detector

Discovering and proving keyword stuffing is a time-consuming process.  Seeing a keyword in a business title isn't enough – from there you'd need to visit their website to determine whether the keyword is legitimately part of the business name.

Having a programming background, I asked myself if at least part of this process could be automated.  Can I find keyword stuffing and also gather evidence without spending hours manually going through listings?

Be forewarned, the rest of this article is going to be pretty technical.

Getting started

The jumping-off point for our analysis is a local search query, like the "NYC lawyers" query at the beginning of this article.

Before diving into details, I'd like to show you what the end product is going to look like. This is a small sample from the spreadsheet that we're going to build:

Each row contains the business name as found on Google, a Google Maps URL, a screenshot of their listing, and a screenshot of their website (which is presumably missing the keywords they used for their listing).

This is the process at a high level for building the spreadsheet:

  • Perform the search query and load the "More places" results
  • Analyze each listing and detect any keywords in the title
  • Store the URL for each business listing with possible keyword stuffing
  • Visit the URL and look for the GMB name on the website
  • Take a screenshot of the web page as evidence for later
  • Build a spreadsheet to store our results

Both Node.js and Python are used for this process.  Puppeteer will perform the web crawling, BeautifulSoup will parse HTML in Python, and the FuzzySet library will give us some flexibility on the name matching.

A proxy is also an important part of this process.  I won't recommend a specific proxy, but you should read some reviews to get a feel for what's available and what makes sense for your situation.  You may want to ask your proxy provider what's possible before subscribing, because some proxies block access to certain domains, but don't really advertise that upfront.
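
One practical note: if your proxy requires a username and password, the --proxy-server launch argument alone isn't enough – Puppeteer's page.authenticate method can supply the credentials.  The values below are placeholders:

// Hypothetical credentials – substitute whatever your proxy provider issues.
// Call this right after creating the page, before the first goto().
await page.authenticate({
    username: "YOUR_PROXY_USERNAME",
    password: "YOUR_PROXY_PASSWORD"
});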

To begin with, you'll want the following NPM packages to be installed:

npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth 

The puppeteer-extra-plugin-stealth package suppresses dozens, if not hundreds, of little tell-tale signs that websites can use to detect automated browsing.  Headless Chromium is trivial to detect without this plugin – for example, the navigator.webdriver property is set to true and the default user agent advertises "HeadlessChrome", both of which are easy to test for.
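
For a sense of what those tell-tale signs look like, here's a minimal, purely illustrative check of the kind a page might run (not taken from any particular detection library) – exactly the sort of test the stealth plugin is designed to defeat:

// Illustrative only: two of the simplest signals a page can test for.
const looksAutomated =
    navigator.webdriver === true ||
    /HeadlessChrome/.test(navigator.userAgent);

if (looksAutomated) {
    // A real site might block the request or serve a CAPTCHA at this point.
    console.log("Automated browser suspected");
}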

Now that the packages are installed, let's take a look at the skeleton of the crawler code.

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

puppeteer.use(StealthPlugin());

const searchterm = process.argv[2];
const proxyUrl = "YOUR_PROXY_URL_HERE";
const keywords = [
    "affordable",
    "dwi",
    "disability",
    ... // Several more keywords here, cut for brevity
];

async function screenshotDOMElement(page, opts = {}) {
    ...
}

async function crawl(browser) {
    ...
}

puppeteer
    .launch({
        headless: true,
        ignoreHTTPSErrors: true,
        args: [
            `--proxy-server=${proxyUrl}`,
            "--start-fullscreen",
            "--no-sandbox",
            "--disable-setuid-sandbox" 
        ]
    })
    .then(crawl)
    .catch(err => {
        console.error(err);
        process.exit(1);
    });

While writing and debugging the script, you will probably want to change the headless argument to false.  In headless mode the Chromium GUI never launches – this saves resources, but it also makes the script incredibly difficult to debug.
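
For example, a debugging-friendly variant of the launch call might look like this (slowMo is an optional Puppeteer launch setting that slows each action down so you can watch what's happening):

puppeteer
    .launch({
        headless: false, // show the Chromium window
        slowMo: 50,      // slow each action by 50ms for easier debugging
        ignoreHTTPSErrors: true,
        args: [`--proxy-server=${proxyUrl}`]
    })
    .then(crawl);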

The keywords array should contain the keywords that you're interested in flagging.  We'll visit the website of any listing containing one or more of those keywords later.  The screenshotDOMElement function captures an image of a single GMB listing, to be embedded in our spreadsheet.  The crawl function is where most of the action takes place.

Let's take a look at the screenshot function first.

async function screenshotDOMElement(page, opts = {}) {
    const padding = "padding" in opts ? opts.padding : 0;
    const element = opts.element;

    if (!element) throw Error("Please provide an element.");

    // Measure the element's position and size within the page.
    const rect = await page.evaluate(_element => {
        if (!_element) return null;
        const { x, y, width, height } = _element.getBoundingClientRect();
        return { left: x, top: y, width, height, id: _element.id };
    }, element);

    if (!rect) throw Error("Could not determine the element's bounding box.");

    // Crop the screenshot to the element and return it as a base64 string
    // instead of writing an image file to disk.
    return await page.screenshot({
        path: null,
        clip: {
            x: rect.left - padding,
            y: rect.top - padding,
            width: rect.width + padding * 2,
            height: rect.height + padding * 2
        },
        encoding: "base64"
    });
}

This snippet is adapted from an existing example – credit to its original author.  The function determines the position and dimensions of an element, and passes those coordinates on to the Puppeteer screenshot function to crop the page appropriately.

I'm asking Puppeteer to encode the screenshot data in base64 so I can embed the results into JSON instead of having files saved all over the place.  The script output is just JSON sent directly to the console, which I redirect to a file (for example, node crawl.js "NYC lawyers" > keywords.json, if you save the crawler as crawl.js) for further processing in Python.

Now let's break down the crawl function piece by piece to understand what's happening.

    const page = await browser.newPage();
    await page.setViewport({ width: 1920, height: 1080 });

    await page.goto("https://www.google.com");
    await page.waitFor("input[name='q']");
    await page.type("input[name='q']", searchterm, { delay: 80 });

    await Promise.all([page.waitForNavigation(), page.keyboard.press("\n")]);

    const morePlaces = await page.waitForXPath(
        "//span[contains(text(), 'More places')]"
    );

    await Promise.all([
        morePlaces.click({ delay: 50 }),
        page.waitForNavigation()
    ]);

In this first section, Puppeteer opens a browser window and navigates to Google, then enters our search query and proceeds to the results page.  Because I'm interested in the local pack results, the script clicks on the "More places" link to open Maps.  

I'll go over a few of the important methods I'm using here.  

The waitFor method can accept a number, a CSS selector, or a function.  If you give waitFor a selector string, it will pause the script until that selector matches something on the page (or until the default timeout causes an exception).  A number will simply cause the script to pause for the given number of milliseconds.
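
To illustrate the two forms used in this project:

// Pause until the search box exists on the page...
await page.waitFor("input[name='q']");
// ...or simply pause for two seconds.
await page.waitFor(2000);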

The type method is pretty straightforward, and takes a selector, a string to type, and optional settings like the delay to use between key-presses.

The waitForNavigation function pauses the script until the browser has finished loading the page.  This is an important one, because otherwise the script would keep running while the page loads and inevitably fail to find anything.  Note that the click and waitForNavigation are wrapped in Promise.all: the navigation listener has to be set up before the click triggers it, otherwise the script risks missing the navigation event.  Once the results page loads, the script clicks the "More places" link and the Maps section should load.

    // Walk through up to 14 pages of local results.
    let i = 0;
    while (i < 14) {
        const nextspan = await page.waitForXPath(
            "//span[contains(text(), 'Next')]",
            { visible: true }
        );

        const listings = await page.$$(".VkpGBb");
        for (const listing of listings) {
            const title = await page.evaluate(element => {
                return element.querySelector(".dbg0pd").innerText;
            }, listing);

            const content = await page.evaluate(
                element => element.innerHTML,
                listing
            );

            // Only keep listings whose title contains one of our keywords.
            let found = false;
            for (const keyword of keywords) {
                if (title.toLowerCase().includes(keyword)) {
                    found = true;
                    break;
                }
            }
            if (!found) continue;

            await page.evaluate(element => element.scrollIntoView(), listing);

            const screenshot = await screenshotDOMElement(page, {
                element: listing
            });

            // One JSON object per line, to be parsed later by the Python script.
            console.log(
                JSON.stringify({
                    title: title,
                    screenshot: screenshot,
                    content: content
                })
            );
        }

        // Move on to the next page of results, then pause briefly.
        await Promise.all([
            page.evaluate(element => element.click(), nextspan),
            page.waitForNavigation()
        ]);

        await page.waitFor(4500);
        i++;
    }
    await browser.close();
}


The code above iterates through the local results, clicking the Next button once it has scanned each item on the current page.  For every listing whose title contains one of the target keywords, it takes a cropped screenshot, base64 encodes the image, and prints a line of JSON to the console.

I'd like to point out the evaluate function, as it's incredibly useful and one of the best parts of Puppeteer.  Using evaluate, you can run code directly in the browser and pass back any result that can be serialized to JSON.  This is useful, for instance, when you need to use JavaScript to click a button or scroll an element into view.
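
As a tiny illustration (the selector here is arbitrary, not part of the crawler):

// Runs inside the browser; the count comes back to Node as a plain number.
const linkCount = await page.evaluate(
    () => document.querySelectorAll("a").length
);
console.log(`Found ${linkCount} links on the page`);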

Once I have all of the candidate listings dumped into a JSON file, I can begin visiting each to check and see whether they are using the same name across their GMB listing and website.  If they aren't, and keywords exist in the GMB name, then we can consider that listing as probably being "keyword stuffed."

Below is the Python script that reads the JSON and invokes a separate process to visit each website one by one, while writing the results to a spreadsheet.  On the Python side you'll need the openpyxl, Pillow, and beautifulsoup4 packages installed.

import base64
import json
import subprocess
import openpyxl

from urllib.parse import quote_plus
from io import BytesIO

from PIL import Image
from bs4 import BeautifulSoup


def get_image(data, resize=None):
    # Decode the base64 screenshot and wrap it in an image openpyxl can embed.
    im = Image.open(BytesIO(base64.b64decode(data)))
    if resize:
        im = im.resize(resize)
    return openpyxl.drawing.image.Image(im)


wb = openpyxl.Workbook()
ws = wb.active

with open("keywords.json", "r") as f:  
    for i, line in enumerate(f, 1):
        line = json.loads(line)
        soup = BeautifulSoup(line["content"], "html.parser")
        # The anchor with this class is the listing's website link
        # (Google's class names change over time, so adjust as needed).
        a = soup.find("a", "L48Cpd")

        title = line["title"]

        if a and a.attrs["href"].startswith("http"):
            url = a.attrs["href"]

            ws.cell(row=i, column=2).value = title

            ws.cell(
                row=i, column=3
            ).value = "https://www.google.com/maps/search/?api=1&query={}".format(
                quote_plus(title)
            )

            ws.row_dimensions[i].height = 380
            ws.column_dimensions["D"].width = 80
            
            im = get_image(line["screenshot"])
            im.anchor = ws.cell(row=i, column=4).coordinate
            ws.add_image(im)

            # Run the second Node script to load the website and compare names;
            # skip this listing if it takes longer than 15 seconds.
            try:
                p = subprocess.run(
                    'node visit.js "{}" "{}"'.format(title, url),
                    shell=True,
                    capture_output=True,
                    timeout=15,
                )
            except subprocess.TimeoutExpired:
                continue

            try:
                data = json.loads(p.stdout.decode("utf-8"))
            except json.decoder.JSONDecodeError:
                continue

            # Flag the listing when the best fuzzy match for the GMB name on
            # the business website scores below 0.90.
            ws.cell(row=i, column=1).value = (
                "STUFFED" if data["topmatch"] < 0.90 else "OK"
            )

            im = get_image(data["imgdata"], resize=(380, 380))
            im.anchor = ws.cell(row=i, column=5).coordinate

            ws.add_image(im)

wb.save("keyword-stuffing.xlsx")

The separate Node script returns a screenshot of the website, which I then resize and insert into the spreadsheet.  The script also returns a "match" score between 0 and 1, indicating how closely the GMB name matches anything found on the business website.  This allows for a little wiggle room, because demanding an exact match causes a lot of false positives.
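
To give a feel for the scoring, here's a small, purely illustrative snippet (the strings are made up) – FuzzySet returns an array of [score, match] pairs, with 1.0 meaning an exact match:

const FuzzySet = require("fuzzyset.js/lib/fuzzyset.js");

// Hypothetical lines of text pulled from a business website.
const fset = FuzzySet(["Smith & Associates", "Contact Us", "Practice Areas"]);

// An exact name scores 1.0; a listing name padded with extra keywords
// scores noticeably lower, which is what the 0.90 threshold catches.
console.log(fset.get("Smith & Associates"));
console.log(fset.get("Smith & Associates Divorce Lawyer NYC"));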

Because the Node script is executed as a subprocess, it would be fairly easy to scale this up and run multiple headless browsers, if you had the resources.

The Node script that visits each business website is pretty straightforward.  Calculating the "match" ratio is done using the FuzzySet library.

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
const FuzzySet = require("fuzzyset.js/lib/fuzzyset.js");

puppeteer.use(StealthPlugin());

async function crawl(browser) {
    const page = await browser.newPage();
    await page.setViewport({ width: 1920, height: 1080 });

    await page.goto(process.argv[3]);
    await page.waitFor(2000);

    const data = await page.screenshot({
        path: null,
        type: "jpeg",
        encoding: "base64"
    });

    // Grab all of the visible text on the page and fuzzy-match each line
    // against the business name passed in on the command line.
    const text = await page.evaluate(() => {
        return document.body.innerText;
    });

    const fset = FuzzySet(text.split("\n"));
    const matches = fset.get(process.argv[2]);
    const result = { topmatch: 0, imgdata: "" };

    if (matches && matches.length) {
        result.topmatch = matches[0][0];
    }

    result.imgdata = data;
    console.log(JSON.stringify(result));

    await browser.close();
}
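
One thing to note: as printed here, visit.js defines crawl but never actually launches the browser.  It needs the same launch boilerplate as the first script – something along these lines (re-add the proxy argument if your setup needs it):

puppeteer
    .launch({
        headless: true,
        ignoreHTTPSErrors: true
    })
    .then(crawl)
    .catch(err => {
        console.error(err);
        process.exit(1);
    });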

This process seems to work reliably, but there is of course room for improvement.  Some business websites only include their name as part of a logo image, so those are currently flagged as keyword stuffing when that's not necessarily true.  I think this could be solved if you fed each image on the page through Pytesseract and extracted the text using OCR, but that's something for another post.

I hope you enjoyed this walkthrough!  I know I enjoyed learning more about Puppeteer and seeing what it can do.