Tackling toxic profile spam on WordPress.org
3rd October, 2019
I’ve discovered something rotten at the core of the WordPress.org website, which is poisoning the whole web.
The site contains uncountable numbers of spam profiles. Thousands more are created each day. They’re being created by (or rented out to) SEO practitioners, in an attempt to inflate how Google perceives the value, relevance and authority of the (often low quality) websites they link to.
Failing to block and remove these profiles not only harms the wordpress.org website and its users, but it also makes WordPress complicit in the pollution of the broader web.
Don’t worry – I think that I have some potential solutions!
I’ve written up my research, and I’ll reference this post when submitting patches/tickets to the WordPress.org #meta team.
Why is this important?
Spam at this scale is a serious and urgent problem, as the profiles:
- Harm the SEO of the whole wordpress.org ecosystem.
- Provide a poor user experience when users discover/visit/arrive on them.
- Pollute/inflate the site’s analytics and data.
- Bloat the whole ecosystem (where each spam profile results in multiple pages being created; including a ‘support’ profile and its associated sub-pages).
- Increase the burden on, and the costs of, site infrastructure.
- Compromise the trustworthiness of the wordpress.org website/brand.
Current (manual) processes to identify and remove spam profiles are inefficient and insufficient; it’s a losing battle. Furthermore, no effective processes exist to prevent their creation in the first place.
This post explores possible solutions, and provides the following recommended actions:
- Change how links are output on/in profile pages.
- Improve (and extend) our integration of Google’s Recaptcha software.
To ‘clean up’ the site, the following three challenges must be addressed:
- Disincentivise the creation of spam profiles.
- Improve spam detection/prevention at the point of content creation/update.
- Identify and remove existing spam profiles.
Previous discussions around these challenges have presented either partial or extreme solutions; such as constraining the purpose/usage of the profiles.wordpress.org site, or, hiding profile content from logged-out users.
Aside from not adequately addressing all three of our challenges, many of the options explored risk creating more harm than good, in various ways.
As such, any solution must adhere to the following three constraints:
- It cannot cause or introduce additional friction for real users; introducing extra steps, requirements or constraints to real users risks harming adoption and growth as much as the current spam problem harms our SEO. It’s poor practice to transpose business/technical problems into UX problems.
- It cannot show different content/experiences to logged-in vs logged-out users. Google, Facebook and other systems act like a logged-out user but have special needs. More importantly, the experience of other logged-out users matters as much as the experience of logged-in users. Hiding content from them risks harming SEO, growth and brand strategies as much as the current spam problem.
- It should not enforce what a profile should or shouldn’t be. The nature of the WordPress community requires that profiles be flexible enough to adapt to what the users need them to be. These needs will vary, and it’d be wrong to be prescriptive in defining that. Constraining how profiles can be used (e.g., “profiles aren’t for self-promotion”, or, “profiles aren’t for showcasing portfolios”) risks harming the brand and (future) SEO, growth and brand strategies as much as the current spam problem.
Motivations & objectives
To reduce the incentive to create spam profiles, we must understand the motivations behind their creation.
The structure, content and patterns I see within these profiles lead me to believe that the vast majority of spam profiles are created by automated processes, with a combination of the following objectives.
NOTE: I’ve included links to example profiles throughout; it’s likely that these may have been removed by the point of publishing, so I’ve included screenshots where relevant.
- Sometimes, convincing a user to click on a link (with an intent to monetize, directly or indirectly). E.g: profiles.wordpress.org/lanicheprofitcoursebychrisguth/, profiles.wordpress.org/amazontopicalungating/, profiles.wordpress.org/bestdumpsvendor/
- Rarely, convincing a user of the value of a product/service within the content of the page (i.e., “You should buy my awesome product!”, or “Hire me as a consultant [for irrelevant services]”).
- Rarely, profiles may be created to probe for vulnerabilities (either manually, or at scale); e.g: https://profiles.wordpress.org/mehmet8118/.
NOTE: Manually created profiles are usually harder to identify and to disincentivise, as they take effort to look and behave like more conventional users (or, have genuine engagement). E.g: this profile, which appears to have manually populated various profile fields, or this one, which despite looking like spam, has created / engaged in genuine support threads.
We can also see that there are only a few distinct types of spam profile; there are recurring patterns in the usernames, content, and link targets. It would appear that the majority of our spam profiles are being created by a tiny number of people, processes and/or systems.
If we can better understand the characteristics of these types of profiles, we may be able to more easily remove their motivations or access.
Analytics & behaviour
The Google Analytics implementation on profiles.wordpress.org (and across the wider wordpress.org ecosystem) records information about visits to these pages.
The data reveals an interesting phenomenon – many of the spam profiles show a surprising volume of traffic, like the following representative example (from this profile).
Even a cursory investigation into the properties and behaviours of this traffic reveals that it is not from real visitors.
These are automatically generated requests, running on a schedule from an external service.
We can see evidence of this below, with examples of the consistency of hourly hits, and the surprisingly precise/unusual split between device categories/types.
Further exploration reveals much more evidence that these are ‘fake’ hits from an automated system. There is no doubt that this is not normal user behaviour.
We can also see that these systems are clicking on their links – they’re checking that the link functions correctly, as part of a broader network of links. The following image shows ‘clicks’ to external sites, from their respective spam profiles:
The motivation for these hits is clear; the creators of the spam profiles are checking that their links are still present on the page, as part of an SEO strategy (or service).
Further investigation shows that some links cease to be checked after a period of time; potentially implying that a payment period has expired.
It’s obvious that the majority of these profiles originate from a paid SEO service / software package, designed to provide links to its users.
Given this, the only thing which matters to the creators is the authority of the wordpress.org domain and the (often temporary, paid for) validation that the link exists on the page.
We can use this insight to focus our disincentivization and cleanup efforts.
Given the characteristics I’ve identified above, conventional approaches to disincentivization may prove difficult.
It should be noted that profile links already utilise the nofollow attribute, which removes (some of) the ‘SEO value’ of those links – but that this has proved inadequate as a lone disincentivization measure.
It may, in fact, be that these links are being intentionally sought after, in order for systems to build a ‘balanced’ link profile for their clients/networks, which contains nofollow’d links from authoritative sites.
Note that this is a relatively common tactic in low-value, high-volume markets (e.g., gambling/crypto affiliate sites and networks), and that wordpress.org is just one target amongst many.
Brief research identified several websites selling such services, and referencing wordpress.org as a good source of ‘authoritative nofollow links’; e.g., this site, which cites this profile as a ‘sample’.
Given that the originators of these profiles are spending money to run the systems and processes behind them, removing the value of the link should remove their motivation to create/maintain them. To achieve this, we should:
- Hide profiles which don’t have any Activity from search engines. Specifically, we should add a meta robots tag with a ‘noindex, follow’ value to the page (and any associated profile pages; e.g., in the https://wordpress.org/support/users/ section), which will instruct search engines not to index them (and also exclude them from wordpress.org search results, which are powered by a custom Google search integration).
- Route all external profile links through a system which hides the link, and which Google can’t follow. Specifically, we should route all external profile links (including those inside profile content) via a hashed version of the URL (e.g., wordpress.org/out/?loc=abc123), and then add a disallow rule to the robots.txt file for patterns matching /out/.
- Remove all non-clickable links (text-only URLs) from profiles, to disincentivise spammers seeking to create ‘unlinked citations’ instead of / in addition to links. These should be replaced with “[LINK REMOVED]”.
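As a rough illustration of the first two actions above, here's a minimal Python sketch. It's not how WordPress.org is built (the site runs on PHP), and every name in it (`robots_meta`, `out_url`, `rewrite_external_links`, the in-memory `OUT_LINKS` store, and the 12-character hash length) is a hypothetical choice of mine, purely to show the shape of the idea.

```python
import hashlib
import re
from urllib.parse import urlparse

# Hypothetical in-memory store mapping short hashes back to target URLs;
# a real system would persist this server-side so /out/ can redirect.
OUT_LINKS: dict[str, str] = {}

# Matches double-quoted absolute http(s) href attributes only (an assumption
# about how profile HTML is rendered).
HREF_RE = re.compile(r'href="(https?://[^"]+)"')


def robots_meta(activity_count: int) -> str:
    """Return a noindex tag for profiles with no Activity, else an empty string."""
    if activity_count == 0:
        return '<meta name="robots" content="noindex, follow">'
    return ""


def out_url(target: str) -> str:
    """Map an external URL to an opaque /out/ redirect path."""
    digest = hashlib.sha256(target.encode("utf-8")).hexdigest()[:12]
    OUT_LINKS[digest] = target
    return f"/out/?loc={digest}"


def rewrite_external_links(html: str, own_domain: str = "wordpress.org") -> str:
    """Rewrite href attributes that point away from own_domain."""
    def swap(match: re.Match) -> str:
        url = match.group(1)
        host = urlparse(url).hostname or ""
        if host == own_domain or host.endswith("." + own_domain):
            return match.group(0)  # internal link: leave untouched
        return f'href="{out_url(url)}"'
    return HREF_RE.sub(swap, html)


# The matching robots.txt rule, so that crawlers don't follow the redirects.
ROBOTS_TXT_RULE = "User-agent: *\nDisallow: /out/\n"
```

The redirect handler behind /out/ would look the hash up and send the visitor onwards; real users still reach the link, but the destination never appears in the page source.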
NOTE: Hiding ’empty’ profiles may also result in the unfortunate side-effect of ‘hiding’ some valid-but-empty profiles, such as this one and this one. Note that, in all the cases I’ve observed, such profiles lack objective value. Regardless, consideration should be given to adding signposting during the editing process to help people build rich, complete, indexable profiles.
Assuming that the system is actively monitored and sophisticated enough to understand that the links (and often, the pages they’re on) have been ‘nulled’, this should completely remove the motivation to create these profiles – though it may take time for the impact of this to seep through to the bloggers and vendors who promote the existence/value of such links.
On (not) hiding content
Some suggested approaches to disincentivizing profile creation have proposed that we conditionally hide content (and/or links) from users who are logged out of the site. This is currently the behaviour on https://wordpress.org/support/users/[userid]/ profile pages and elsewhere on the forum, and it’s perceived to ‘work’ in these scenarios.
Aside from violating our constraints, there’s no evidence that this approach does indeed ‘work’ (and, the current proliferation of spam profiles might suggest that that’s not the case); and more broadly, we shouldn’t base our approach on replicating arbitrary decisions/methodologies from elsewhere in the dot-org ecosystem.
In fact, following the (hopefully) successful application of the actions in this document, we should consider reversing the current ‘content hiding’ mechanics throughout the rest of the site(s) in order to better adhere to our constraints.
Improving spam detection/prevention
Google’s Recaptcha currently runs on the first step of registration, but not on secondary/subsequent steps where the user confirms their demographic information (the script is present, but not bound to anything). It’s also missing entirely from the login page.
The script is also entirely missing from the ‘edit profile’ page(s); which is where 90% of the problem resides.
Extending and completing the implementation of this is our easiest win for preventing the submission of obviously ‘spammy’ content in profile fields.
The following actions should be taken immediately:
- Enable Recaptcha on secondary registration steps, by binding the script to the submission button (as with the first step).
- Add Recaptcha to profile editing screens and login screens, and bind it to the submission button(s).
- Update the site’s privacy policies accordingly. This will need input from an expert.
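Binding the script to a button is only half the job; the form handler also has to check the token server-side before accepting the submission. Here's a minimal Python sketch of that check against Google's `siteverify` endpoint. The function names and the injectable `transport` parameter are my own illustrative choices, and I'm not describing how the wordpress.org handlers actually work.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable

SITEVERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"


def post_form(url: str, fields: dict) -> dict:
    """POST form-encoded fields and decode the JSON reply."""
    data = urllib.parse.urlencode(fields).encode("utf-8")
    with urllib.request.urlopen(url, data=data, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def recaptcha_passed(token: str, secret: str,
                     transport: Callable[[str, dict], dict] = post_form) -> bool:
    """Return True only if the verification service reports success."""
    if not token:
        # No token was submitted, so the widget never ran (or was stripped out).
        return False
    reply = transport(SITEVERIFY_URL, {"secret": secret, "response": token})
    return bool(reply.get("success"))
```

The same check would sit in front of the registration steps, the login form and the profile editing handlers; a submission without a valid token simply never reaches the database.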
NOTE: since publication, it appears that ReCaptcha V3 has been implemented on the registration page, but is still missing from deeper registration states and the entire profile editing area.
Various discussions have considered either adding Akismet to these processes, or defining algorithmic rules for spam identification. Recaptcha is a more sophisticated and ‘hands-off’ solution than both of these – though there’s no reason why we can’t combine all three approaches.
Given the nature of these spam profiles, manually defined rules won’t provide an adequate solution. The vendors selling packages of links on the site, or those who’re manually creating accounts, will simply adapt their behaviour to avoid any rules we put in place.
Maintaining a ‘block list’ of words, phrases and behaviours would require levels of resourcing which exceed even the current policing and removal of spam profiles. It would become an unmanageable nightmare of avoiding false-positives, undergoing manual reviews, and tweaking rules – especially as the perpetrators continue to change their behaviours in response to the blocks.
That said, given our understanding of the nature of the attacks (and the broader WordPress ecosystem), we could consider factoring an element of heuristic or algorithmic checking into a future implementation of Recaptcha V3.
With regards to Akismet, it’s unclear if it’s currently running on profile updates; if this isn’t the case, it should be enabled. However, it’s not well-suited to solving our particular problem(s) (especially for non-English content), and so we should proceed with running Recaptcha alongside it.
Removing existing spam profiles
Having disincentivized the creation of new spam profiles and made them harder to create, we should turn our attention to gradually removing existing spam profiles.
Unfortunately, whilst I’ve identified several patterns common to profiles, it’ll be challenging to define a set of logic which comprehensively identifies (and allows us to remove) them. It’ll be necessary to catch as much as possible algorithmically, and then gradually/manually identify and remove remaining profiles over time.
I recommend the use of a scoring process based (initially) on the following criteria:
|No gravatar image||2|
|No social profiles||2|
|Username contains numbers (other than year patterns)||5|
|Username is more than 25 unseparated characters||5|
|Name is more than three words||5|
|No profile sections||5|
|Content shorter than 200 chars||5|
|Content shorter than 150 chars||10|
|Content longer than 500 words||10|
|Content longer than 1000 words||20|
|Content longer than 2000 words||40|
|Content doesn’t end in standard punctuation||5|
|Content begins with a link||15|
|Content contains only a link||40|
|Content contains currency symbols||5 * $symbol|
|Content contains the following words, phrases or patterns:||3 * $words|
|Content contains the following words, phrases or patterns: “click here”, “learn more”, (http|www)(.*), download, profit, niche, buy, wife, movies, online||5 * $words|
|Content contains links (or “[LINK REMOVED]” replacements)||3 * $links|
|Content contains multiple <br /> tags||2 * $tags|
|Content contains <strong>/<b> or <em>/<i> tags||1 * $tags|
|Website link path contains multiple hyphens and/or slashes||3 * ($el – 1)|
|Website link is > 100 chars in length||10|
|Website link is a known shortener (e.g., bit.ly)||5|
|User has only ever logged in once||15|
|User registered with a known temporary email domain||20|
Based on these scoring criteria, we should take the following action:
|> 60||Trash. This is almost certainly spam.|
|30 to 60||Pending review. This is likely spam.|
|< 30||Leave for now.|
NOTE: The scoring and threshold numbers may require refinement, and should definitely gain consensus before being utilized. We should also ‘dry run’ this process over all (or random samples of) existing profiles before committing to any deletion actions. This process should also omit accounts which have plugins/themes, in which case, they should be escalated to the relevant teams.
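To make the model concrete (and to support a dry run), here's a minimal Python sketch covering a handful of the criteria above. The `Profile` fields, the regular expressions, and my reading of “unseparated characters” and of the boundary scores (exactly 30 or 60) are all assumptions; the remaining criteria would slot in the same way.

```python
import re
from dataclasses import dataclass

YEAR_RE = re.compile(r"(19|20)\d{2}")
LINK_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


@dataclass
class Profile:
    # Hypothetical fields; the real user tables will differ.
    username: str
    content: str = ""
    has_gravatar: bool = False
    has_social_profiles: bool = False
    login_count: int = 0
    has_plugins_or_themes: bool = False


def spam_score(p: Profile) -> int:
    """Score a profile against a subset of the criteria in the table."""
    score = 0
    if not p.has_gravatar:
        score += 2
    if not p.has_social_profiles:
        score += 2
    # Digits only count once year-like patterns have been stripped out.
    if re.search(r"\d", YEAR_RE.sub("", p.username)):
        score += 5
    # 'Unseparated' is read here as the longest run between -, _, . or spaces.
    if max(len(part) for part in re.split(r"[-_.\s]+", p.username)) > 25:
        score += 5
    text = p.content.strip()
    links = LINK_RE.findall(text)
    score += 3 * len(links)
    if links and text.startswith(links[0]):
        score += 15
    if len(links) == 1 and text == links[0]:
        score += 40
    if p.login_count == 1:
        score += 15
    return score


def action_for(p: Profile) -> str:
    """Apply the thresholds; accounts with plugins/themes are escalated instead."""
    if p.has_plugins_or_themes:
        return "escalate"
    score = spam_score(p)
    if score > 60:
        return "trash"
    if score >= 30:
        return "pending review"
    return "leave"
```

Running `action_for` over a random sample of profiles and only logging the result, without changing anything, would give us the dry run needed to tune the numbers before any deletion takes place.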
Profiles flagged as ‘trashed’ should:
- Have a user note added to the account (e.g., “[WordPress.org] Suspected spam account; trashed”).
- Have their profile content rendered invisible.
- Be unable to post new content in the wp.org ecosystem.
- Have their comments / forum threads updated (using similar mechanisms to current GDPR request controls; where the user/account is anonymized, but the content remains).
- Be able to be ‘un-trashed’ by admins.
- Be auto-deleted after a period of 30 days.
Profiles flagged as ‘pending review’ should:
- Have a user note added to the account (e.g., “[WordPress.org] Suspected spam account; pending review”).
- Have a ‘spam / not spam’ flag:
  - Marking as ‘spam’ should update the status to ‘trashed’.
  - Marking as ‘not spam’ should exclude the account from future automated review processes.
This process will need to be developed and maintained. In the short-term, we may be able to short-cut some of the easier or more obvious flags through third-party crawling software (e.g., Screaming Frog).
Following an initial pass, we should reconsider the scoring and thresholds for subsequent ‘rounds’ of review and deletion, until we’re comfortable that it’s feasible to tackle profiles one-at-a-time as they’re discovered or reported.
A honeypot system should be added to all registration/profile forms, to make it harder for automated systems to complete them.
I recommend an approach like this one, where fields are duplicated in order to capture automated submissions. However, the approach should be extended to also:
- Randomise the field IDs and naming characteristics of both the real fields and the honeypots (either on each request, or regenerated on a schedule), and tie them back together on the back end via a nonce.
- Randomise the ordering of the real vs honeypot fields in the source code.
This requires some scoping and consideration.
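As a starting point for that scoping, here's a minimal Python sketch of the randomised honeypot idea. It assumes field names are derived from a per-form nonce using an HMAC, so that the back end can recompute them without storing anything; the secret, the field lists and the function names are all hypothetical.

```python
import hashlib
import hmac
import secrets

SECRET_KEY = b"replace-me"  # hypothetical server-side secret
REAL_FIELDS = ("username", "email")
HONEYPOT_FIELDS = ("username_hp", "email_hp")


def field_name(nonce: str, logical_name: str) -> str:
    """Derive an opaque per-form field name from the nonce."""
    mac = hmac.new(SECRET_KEY, f"{nonce}:{logical_name}".encode(), hashlib.sha256)
    return "f_" + mac.hexdigest()[:16]


def build_form() -> dict:
    """Return a fresh nonce and the shuffled field names to render."""
    nonce = secrets.token_hex(16)
    names = [field_name(nonce, n) for n in REAL_FIELDS + HONEYPOT_FIELDS]
    # Shuffle so that source order doesn't reveal which fields are real.
    secrets.SystemRandom().shuffle(names)
    return {"nonce": nonce, "fields": names}


def read_submission(nonce: str, posted: dict):
    """Map posted values back to logical names; None if a honeypot was filled."""
    for hp in HONEYPOT_FIELDS:
        if posted.get(field_name(nonce, hp), "").strip():
            return None
    return {n: posted.get(field_name(nonce, n), "") for n in REAL_FIELDS}
```

Real users never see the honeypot inputs (they'd be hidden with CSS), so they leave them empty; a bot that fills in every field gets rejected. Expiring or single-use nonces are left out of this sketch, and would need to be part of the scoping.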
Recaptcha V3 & scoring
In the future, I’d like to consider a roadmap for a more sophisticated integration of Recaptcha V3, and to use the scoring mechanism. This may be used to invoke further challenges, or to provide user feedback.
Note that the mechanism for V3 is different; rather than showing a captcha challenge, the process invisibly scores users based on their browsing behaviour. It’s up to the integrator to determine at what scores/thresholds certain barriers are put in place. E.g., with a low score (V3 scores run from 0.0, a likely bot, to 1.0, a likely human), a form submission might trigger additional validation mechanisms (potentially including conventional recaptcha interruptions).
Given our understanding of the characteristics of most of the spam profiles on the site, we could add our own layer on top of Recaptcha’s spam score, to ‘top up’ the value, based on the spam scoring model outlined above.
On scores exceeding a certain threshold, we should show error messages on profile content update submission (and some hints; e.g., “you’ve used too many links”).
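Here's a minimal Python sketch of how that layering might work. Because V3 reports 1.0 for a likely human and 0.0 for a likely bot, the sketch inverts the score so that higher always means riskier before adding our own points. The 0 to 100 scaling, the thresholds of 50 and 90, the link limit and the function names are placeholder assumptions rather than recommendations.

```python
def combined_risk(recaptcha_score: float, heuristic_points: int) -> float:
    """Blend a V3 score (1.0 = likely human) with heuristic spam points."""
    if not 0.0 <= recaptcha_score <= 1.0:
        raise ValueError("recaptcha_score must be between 0.0 and 1.0")
    # Invert so that higher always means riskier, then scale to 0-100.
    return (1.0 - recaptcha_score) * 100 + heuristic_points


def decide(recaptcha_score: float, heuristic_points: int, link_count: int = 0) -> dict:
    """Pick an outcome for a profile update, with hints for the user."""
    risk = combined_risk(recaptcha_score, heuristic_points)
    hints = []
    if link_count > 3:
        hints.append("You've used too many links.")
    if risk > 90:
        return {"outcome": "reject", "hints": hints}
    if risk > 50:
        return {"outcome": "challenge", "hints": hints}
    return {"outcome": "accept", "hints": []}
```

A ‘challenge’ outcome is where a conventional captcha interruption could be shown, while ‘reject’ would return the error message and hints to the user.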
It should be noted that none of the immediate actions I’ve outlined address the problem at the source – that money is changing hands between companies and individuals to take advantage of the wordpress.org website.
The following begins to address the underlying problem through legal action, but requires further consideration in terms of resourcing and ownership.
In a nutshell, the following actions should be prioritised:
- Harden the website’s terms of service against automated/simulated visits, and the creation of profiles with the intent to gain links.
- Research and identify ‘network’ level offenders (such as https://trafficcrow.com/dofollow-backlink-sites-list/), and take legal action against their products/services.
- Research and identify the users of those network-level offenders, and notify them that they’re breaching the website’s terms.
I’ll be raising the following tickets in trac:
- Improving the integration of Google’s Recaptcha software (V2 and V3) into registration and profile updating.
- Developing an MVP scoring system for assessing the ‘spammyness’ of profiles.
- Beginning discussions around longer term actions:
  - Exploring legal actions against offending networks/vendors/buyers.
  - Honeypot development.
  - Building more sophisticated logic on top of Recaptcha V3.
Let me know if I’ve missed anything!