Tackling toxic profile spam on WordPress.org

I’ve discovered something rotten at the core of the WordPress.org website, which is poisoning the whole web.

The site contains countless spam profiles, with thousands more created each day. They’re being created by (or rented out to) SEO practitioners, in an attempt to inflate how Google perceives the value, relevance and authority of the (often low quality) websites they link to.

Failing to block and remove these profiles not only harms the wordpress.org website and its users, but it also makes WordPress complicit in the pollution of the broader web.

Don’t worry – I think I have some potential solutions!

I’ve written up my research, and I’ll reference this post when submitting patches/tickets to the WordPress.org #meta team.

Why is this important?

Spam at this scale is a serious and urgent problem, as the profiles:

  • Harm the SEO of the whole wordpress.org ecosystem.
  • Provide a poor user experience when users discover/visit/arrive on them.
  • Pollute/inflate the site’s analytics and data.
  • Bloat the whole ecosystem (where each spam profile results in multiple pages being created, including a ‘support’ profile and its associated sub-pages).
  • Increase the burden on, and the costs of, site infrastructure.
  • Compromise the trustworthiness of the wordpress.org website/brand.

Current (manual) processes to identify and remove spam profiles are inefficient and insufficient; it’s a losing battle. Furthermore, no effective processes exist to prevent their creation in the first place.

TL;DR

This post explores possible solutions, and provides the following recommended actions:

  1. Change how links are output on/in profile pages.
  2. Improve (and extend) our integration of Google’s Recaptcha software.
  3. Harden the website terms of use, and clamp down on known offenders.

Challenges

To ‘clean up’ the site, the following three challenges must be addressed:

  1. Disincentivise the creation of spam profiles.
  2. Improve spam detection/prevention at the point of content creation/update.
  3. Identify and remove existing spam profiles.

Constraints

Previous discussions around these challenges have presented either partial or extreme solutions, such as constraining the purpose/usage of the profiles.wordpress.org site, or hiding profile content from logged-out users.

Aside from not adequately addressing all three of our challenges, many of the options explored risk creating more harm than good, in various ways.

As such, any solution must adhere to the following three constraints:

  1. It cannot cause or introduce additional friction for real users; introducing extra steps, requirements or constraints to real users risks harming adoption and growth as much as the current spam problem harms our SEO. It’s poor practice to transpose business/technical problems into UX problems.
  2. It cannot show different content/experiences to logged-in vs logged-out users. Google, Facebook and other systems act like a logged-out user, but have special needs. More importantly, the experience of other logged-out users matters as much as the experience of logged-in users. Hiding content from them risks harming SEO, growth and brand strategies as much as the current spam problem.
  3. It should not enforce what a profile should or shouldn’t be. The nature of the WordPress community requires that profiles be flexible enough to adapt to what users need them to be. These needs will vary, and it’d be wrong to be prescriptive in defining them. Constraining how profiles can be used (e.g., “profiles aren’t for self-promotion”, or “profiles aren’t for showcasing portfolios”) risks harming the brand and (future) SEO and growth strategies as much as the current spam problem.

Understanding

Motivations & objectives

To reduce the incentive to create spam profiles, we must understand the motivations behind their creation.

The structure, content and patterns I see within these profiles lead me to believe that the vast majority of spam profiles are created by automated processes, with a combination of the following objectives.

NOTE: I’ve included links to example profiles throughout; it’s likely that these may have been removed by the point of publishing, so I’ve included screenshots where relevant.

  • Primarily, placing links to external websites in order to manipulate how search engines perceive those sites (explored in depth below).
  • Rarely, convincing a user of the value of a product/service within the content of the page (i.e., “You should buy my awesome product!”, or “Hire me as a consultant [for irrelevant services]”).

NOTE: Manually created profiles are usually harder to identify and to disincentivise, as effort has been taken to make them look and behave like more conventional users (or they have genuine engagement). E.g., this profile, which appears to have manually populated various profile fields, or this one, which, despite looking like spam, has created/engaged in genuine support threads.

We can also see that there are only a few distinct types of spam profile; there are recurring patterns in the usernames, content, and link targets. It would appear that the majority of our spam profiles are being created by a tiny number of people, processes and/or systems.

If we can better understand the characteristics of these types of profiles, we may be able to more easily remove their motivations or access.

Analytics & behaviour

The Google Analytics implementation on profiles.wordpress.org (and across the wider wordpress.org ecosystem) records information about visits to these pages.

The data reveals an interesting phenomenon – many of the spam profiles show a surprising volume of traffic, like the following representative example (from this profile).

Even a cursory investigation into the properties and behaviours of this traffic reveals that it is not from real visitors.

These are automatically generated requests, running on a schedule from an external service.

We can see evidence of this below, with examples of the consistency of hourly hits, and the surprisingly precise/unusual split between device categories/types.

Further exploration reveals much more evidence that these are ‘fake’ hits from an automated system. There is no doubt that this is not normal user behaviour.

We should also not overlook the fact that these hits trigger our analytics system at all – that requires these ‘checker’ bots to execute JavaScript and render the page, so that they can be certain their links are loaded, visible and clickable.

We can also see that these systems are clicking on their links – they’re checking that the link functions correctly, as part of a broader network of links. The following image shows ‘clicks’ to external sites, from their respective spam profiles:

The motivation for these hits is clear; the creators of the spam profiles are checking that their links are still present on the page, as part of an SEO strategy (or service).

Further investigation shows that, in some cases, links cease to be checked after a period of time; potentially implying that a payment period has expired.

It’s obvious that the majority of these profiles originate from a paid SEO service/software package, designed to provide links to its users.

Given this, the only thing which matters to the creators is the authority of the wordpress.org domain and the (often temporary, paid-for) validation that the link exists on the page.

We can use this insight to focus our disincentivization and cleanup efforts.

Solutions

Disincentivizing spam

Given the characteristics I’ve identified above, conventional approaches to disincentivization may prove difficult.

It should be noted that profile links already utilise the nofollow attribute, which removes (some of) the ‘SEO value’ of those links – but this has proved inadequate as a lone disincentivization measure.

The presence of the nofollow attribute also hasn’t dissuaded (or been noticed by) those manually pursuing links from wordpress.org, as these various recent examples show.

It may, in fact, be that these links are being intentionally sought after, in order for systems to build a ‘balanced’ link profile for their clients/networks, which contains nofollow’d links from authoritative sites.

Note that this is a relatively common tactic in low-value, high-volume markets (e.g., gambling/crypto affiliate sites and networks), and that wordpress.org is just one target amongst many.

Brief research identified several websites selling such services, and referencing wordpress.org as a good source of ‘authoritative nofollow links’; e.g., this site, which cites this profile as a ‘sample’.

Given that the originators of these profiles are spending money to run the systems and processes behind them, removing the value of the link should remove their motivation to create/maintain them. To achieve this, we should take the following steps (a rough sketch follows the list):

  1. Hide profiles which don’t have any Activity from search engines. Specifically, we should add a meta robots tag with a ‘noindex, follow’ value to the page (and any associated profile pages; e.g., in the https://wordpress.org/support/users/ section), which will instruct search engines not to index them (and also exclude them from wordpress.org search results, which are powered by a custom Google search integration).
  2. Route all external profile links through a system which hides the link, and which Google can’t follow. Specifically, we should alter all external profile links (including those inside profile content) to use a hashed version of the URL (e.g., wordpress.org/out/?loc=abc123), and then add a disallow rule to the robots.txt file for patterns matching /out/.
  3. Remove all non-clickable links (text-only URLs) from profiles, to disincentivise spammers seeking to create ‘unlinked citations’ instead of/in addition to links. These should be replaced with “[LINK REMOVED]”.

NOTE: Hiding ‘empty’ profiles may also have the unfortunate side-effect of ‘hiding’ some valid-but-empty profiles, such as this one and this one. Note that, in all the cases I’ve observed, such profiles lack objective value. Regardless, consideration should be given to adding signposting during the editing process to help people build rich, complete, indexable profiles.

Assuming that the system is actively monitored and sophisticated enough to understand that the links (and often, the pages they’re on) have been ‘nulled’, this should completely remove the motivation to create these profiles – though it may take time for the impact of this to seep through to the bloggers and vendors who promote the existence/value of such links.
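As a rough sketch of how the first two steps might be wired up in WordPress: the profile_user_has_activity() helper, the hook points, and the option-based hash store below are all assumptions for illustration, not the real profiles.wordpress.org codebase.

```php
<?php
// Sketch only: hook points and helpers are assumptions, not the real
// profiles.wordpress.org implementation.

// 1. Add a 'noindex, follow' meta robots tag to profiles with no Activity.
add_action( 'wp_head', function () {
	if ( ! is_author() ) { // assuming profiles render via an author-style template
		return;
	}
	if ( ! profile_user_has_activity( get_queried_object_id() ) ) { // hypothetical helper
		echo '<meta name="robots" content="noindex, follow" />' . "\n";
	}
} );

// 2. Rewrite external links in profile content to /out/?loc={hash}.
add_filter( 'the_content', function ( $content ) {
	return preg_replace_callback(
		'#href="(https?://[^"]+)"#i',
		function ( $matches ) {
			$url  = $matches[1];
			$host = (string) wp_parse_url( $url, PHP_URL_HOST );

			// Leave internal links untouched.
			if ( false !== strpos( $host, 'wordpress.org' ) ) {
				return $matches[0];
			}

			// Store the hash => URL mapping so the /out/ endpoint can resolve it.
			$hash = substr( md5( $url ), 0, 12 );
			update_option( 'out_link_' . $hash, esc_url_raw( $url ), false );

			return 'href="' . esc_url( 'https://wordpress.org/out/?loc=' . $hash ) . '"';
		},
		$content
	);
} );

// 3. robots.txt then blocks the redirector from being crawled:
//
//    User-agent: *
//    Disallow: /out/
```

The /out/ endpoint would then resolve loc back to the stored URL and issue a redirect, while robots.txt keeps crawlers away from it entirely.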

On (not) hiding content

Some suggested approaches to disincentivizing profile creation have proposed that we conditionally hide content (and/or links) from users who are logged out of the site. This is currently the behaviour on https://wordpress.org/support/users/[userid]/ profile pages and elsewhere on the forum, and it’s perceived to ‘work’ in these scenarios.

Aside from violating our constraints, there’s no evidence that this approach does indeed ‘work’ (the current proliferation of spam profiles suggests that it doesn’t); and more broadly, we shouldn’t base our approach on replicating arbitrary decisions/methodologies from elsewhere in the dot-org ecosystem.

In fact, following the (hopefully) successful application of the actions in this document, we should consider reversing the current ‘content hiding’ mechanics throughout the rest of the site(s) in order to better adhere to our constraints.

Improving spam detection/prevention

Google’s Recaptcha currently runs on the first step of registration, but not on secondary/subsequent steps where the user confirms their demographic information (the script is present, but not bound to anything). It’s also missing entirely from the login page.

The script is also entirely missing from the ‘edit profile’ page(s), which is where 90% of the problem resides.

Extending and completing this implementation is our easiest win for preventing the submission of obviously ‘spammy’ content in profile fields.

The following actions should be taken immediately:

  • Enable Recaptcha on secondary registration steps, by binding the script to the submission button (as with the first step).
  • Add Recaptcha to profile editing screens and login screens, and bind it to the submission button(s).
  • Update the site’s privacy policies accordingly. This will need input from an expert.

NOTE: Since publication, it appears that Recaptcha V3 has been implemented on the registration page, but it is still missing from deeper registration states and the entire profile editing area.
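As a sketch of the server-side half of such an integration: the siteverify endpoint, the g-recaptcha-response field, and the response shape are part of Google’s documented Recaptcha API, while the hook point and the option holding the secret key are assumptions for illustration.

```php
<?php
// Sketch only: the siteverify endpoint, POST fields and response shape are
// Google's documented Recaptcha API; the hook point and the option holding
// the secret key are assumptions.
function wporg_verify_recaptcha( $token ) {
	$response = wp_remote_post(
		'https://www.google.com/recaptcha/api/siteverify',
		array(
			'body' => array(
				'secret'   => get_option( 'wporg_recaptcha_secret' ), // assumed option
				'response' => $token,
				'remoteip' => $_SERVER['REMOTE_ADDR'] ?? '',
			),
		)
	);

	if ( is_wp_error( $response ) ) {
		return false; // failing closed here is a policy decision
	}

	$body = json_decode( wp_remote_retrieve_body( $response ), true );
	return ! empty( $body['success'] );
}

// Example: gate profile updates on a valid token. The hook choice is
// illustrative; 'g-recaptcha-response' is the field Recaptcha submits by default.
add_action( 'personal_options_update', function ( $user_id ) {
	if ( ! wporg_verify_recaptcha( $_POST['g-recaptcha-response'] ?? '' ) ) {
		wp_die( 'Spam check failed. Please go back and try again.' );
	}
}, 1 );
```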

Why Recaptcha?

Various discussions have considered either adding Akismet to these processes, or defining algorithmic rules for spam identification. Recaptcha is a more sophisticated and ‘hands-off’ solution than both of these – though there’s no reason why we can’t combine all three approaches.

Given the nature of these spam profiles, manually defined rules won’t provide an adequate solution. The vendors selling packages of links on the site, and those who’re manually creating accounts, will simply adapt their behaviour to avoid any rules we put in place.

Maintaining a ‘block list’ of words, phrases and behaviours would require levels of resourcing which exceed even the current policing and removal of spam profiles. It would become an unmanageable nightmare of avoiding false-positives, undergoing manual reviews, and tweaking rules – especially as the perpetrators continue to change their behaviours in response to the blocks.

That said, given our understanding of the nature of the attacks (and the broader WordPress ecosystem), we could consider factoring an element of heuristic or algorithmic checking into a future implementation of Recaptcha V3.

With regards to Akismet, it’s unclear if it’s currently running on profile updates; if it isn’t, it should be enabled. However, it’s not well-suited to solving our particular problem(s) (especially for non-English content), and so we should run Recaptcha alongside it.

Removing existing spam profiles

Having disincentivized and reduced the ease of creating new spam profiles, we should turn our eyes to gradually removing existing spam profiles.

Unfortunately, whilst I’ve identified several patterns common to these profiles, it’ll be challenging to define a set of logic which comprehensively identifies (and allows us to remove) them. It’ll be necessary to catch as much as possible algorithmically, and then gradually/manually identify and remove remaining profiles over time.

I recommend the use of a scoring process based (initially) on the following criteria and points values:

  • No Activity – 5
  • No gravatar image – 2
  • No social profiles – 2
  • Username contains numbers (other than year patterns) – 5
  • Username is more than 25 unseparated characters – 5
  • Name is more than three words – 5
  • No profile sections – 5
  • Content shorter than 200 chars – 5
  • Content shorter than 150 chars – 10
  • Content longer than 500 words – 10
  • Content longer than 1000 words – 20
  • Content longer than 2000 words – 40
  • Content doesn’t end in standard punctuation – 5
  • Content begins with a link – 15
  • Content or profile contains a link with the string ‘javascript’ in the href attribute – 10 * $links
  • Content contains only a link – 40
  • Content contains currency symbols – 5 * $symbol
  • Content contains the following words, phrases or patterns: free, enjoy – 3 * $words
  • Content contains the following words, phrases or patterns: “click here”, “learn more”, (http|www)(.*), download, profit, niche, buy, wife, movies, online – 5 * $words
  • Content contains links (or “[LINK REMOVED]” replacements) – 3 * $links
  • Content contains multiple <br /> tags – 2 * $tags
  • Content contains <strong>/<b> or <em>/<i> tags – 1 * $tags
  • Website link path contains multiple hyphens and/or slashes – 3 * ($el - 1)
  • Website link is > 100 chars in length – 10
  • Website link is a known shortener (e.g., bit.ly) – 5
  • User has only ever logged in once – 15
  • User registered with a known temporary email domain – 20

Based on these scoring criteria, we should take the following action (a minimal sketch of the scoring model follows below):

  • Score > 60 – Trash. This is almost certainly spam.
  • Score between 30 and 60 – Pending review. This is likely spam.
  • Score < 30 – Leave for now.

NOTE: The scoring and threshold numbers may require refinement, and should definitely gain consensus before being utilized. We should also ‘dry run’ this process over all (or random samples of) existing profiles before committing to any deletion actions. The process should also omit accounts which have plugins/themes; those should instead be escalated to the relevant teams.
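A minimal sketch of what this scoring model could look like, implementing a handful of the criteria above. The $profile shape and the function name are assumptions; the weights mirror the table.

```php
<?php
// Sketch only: the $profile shape and helper names are illustrative;
// the weights mirror the scoring table above.
function wporg_profile_spam_score( array $profile ) {
	$score   = 0;
	$content = $profile['content'] ?? '';
	$text    = wp_strip_all_tags( $content );
	$lower   = strtolower( $text );
	$words   = str_word_count( $text );

	if ( empty( $profile['activity'] ) ) {
		$score += 5;
	}
	if ( empty( $profile['gravatar'] ) ) {
		$score += 2;
	}
	if ( preg_match( '/\d/', $profile['username'] ?? '' ) ) {
		$score += 5; // crude: doesn't yet exempt year patterns
	}
	if ( strlen( $text ) < 200 ) {
		$score += 5;
	}
	if ( strlen( $text ) < 150 ) {
		$score += 10; // stacks with the < 200 rule, per the table
	}
	if ( $words > 500 ) {
		$score += 10;
	}

	// 3 points per link (or "[LINK REMOVED]" replacement).
	$links  = preg_match_all( '/<a\s/i', $content );
	$links += substr_count( $content, '[LINK REMOVED]' );
	$score += 3 * $links;

	// 5 points per high-risk word or phrase.
	foreach ( array( 'click here', 'learn more', 'download', 'profit', 'niche', 'buy' ) as $phrase ) {
		$score += 5 * substr_count( $lower, $phrase );
	}

	return $score;
}

// Triage, per the action list above:
// > 60: trash; 30-60: pending review; < 30: leave for now.
```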

Profiles flagged as ‘trashed’ should:

  • Have a user note added to the account (e.g., “[WordPress.org] Suspected spam account; trashed”).
  • Have their profile content rendered invisible.
  • Be unable to post new content in the wp.org ecosystem.
  • Have their comments/forum threads updated (using similar mechanisms to current GDPR request controls, where the user/account is anonymized but the content remains).
  • Be able to be ‘un-trashed’ by admins.
  • Auto-delete after a period of 30 days.

Profiles flagged as ‘pending review’ should:

  • Have a user note added to the account (e.g., “[WordPress.org] Suspected spam account; pending review”).
  • Have a ‘spam/not spam’ flag:
    • Marking as ‘spam’ should update the status to ‘trashed’.
    • Marking as ‘not spam’ should exclude the account from future automated review processes.

This process will need to be developed and maintained. In the short-term, we may be able to short-cut some of the easier or more obvious flags through third-party crawling software (e.g., Screaming Frog).

Following an initial pass, we should reconsider the scoring and thresholds for subsequent ‘rounds’ of review and deletion, until we’re comfortable that it’s feasible to tackle profiles one-at-a-time as they’re discovered or reported.

Future considerations

Honeypots

A honeypot system should be added to all registration/profile forms, to make it harder for automated systems to complete them.

I recommend an approach like this one, where fields are duplicated in order to capture automated submissions. However, the approach should be extended to also (see the sketch after this list):

  • Randomise the field IDs and naming characteristics of both the real fields and the honeypots (either on each request, or regenerated on a schedule), and tie them back together on the back end via a nonce.
  • Randomise the ordering of the real vs honeypot fields in the source code.

This requires some scoping and consideration.
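A rough sketch of that extended honeypot follows, assuming a transient-backed token (rather than a literal nonce) to tie the randomised names back together; all function and field names are illustrative.

```php
<?php
// Sketch only: a transient-backed token stands in for the "nonce" that ties
// the randomised field names back together; all names here are illustrative.

// Render one real field and one honeypot, with randomised names, in random order.
function wporg_honeypot_fields() {
	$token = wp_generate_uuid4();
	$real  = 'f_' . strtolower( wp_generate_password( 8, false ) );
	$trap  = 'f_' . strtolower( wp_generate_password( 8, false ) );

	// Remember which randomised name is the real one, for one hour.
	set_transient( 'hp_' . $token, $real, HOUR_IN_SECONDS );

	$fields = array(
		sprintf( '<input type="text" name="%s" />', esc_attr( $real ) ),
		// Hidden via CSS (not type="hidden"), so naive bots still fill it in.
		sprintf( '<input type="text" name="%s" class="visually-hidden" tabindex="-1" autocomplete="off" />', esc_attr( $trap ) ),
	);
	shuffle( $fields ); // randomise source order, per the second bullet

	printf( '<input type="hidden" name="hp_token" value="%s" />', esc_attr( $token ) );
	echo implode( "\n", $fields );
}

// On submission: the real field must be filled (assuming it's a required
// field), and every other randomised field must be empty.
function wporg_honeypot_passed() {
	$token = sanitize_text_field( $_POST['hp_token'] ?? '' );
	$real  = get_transient( 'hp_' . $token );
	delete_transient( 'hp_' . $token ); // single use

	if ( ! $real || empty( $_POST[ $real ] ) ) {
		return false;
	}
	foreach ( $_POST as $name => $value ) {
		if ( 0 === strpos( $name, 'f_' ) && $name !== $real && '' !== $value ) {
			return false; // a honeypot was filled: almost certainly automated
		}
	}
	return true;
}
```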

Recaptcha V3 & scoring

In the future, I’d like to consider a roadmap for a more sophisticated integration of Recaptcha V3, making use of its scoring mechanism. This could be used to invoke further challenges, or to provide user feedback.

Note that the mechanism for V3 is different; rather than showing a captcha challenge, the process invisibly scores users based on their browsing behaviour (where a high score indicates a likely human, and a low score a likely bot). It’s up to the integrator to determine at which scores/thresholds certain barriers are put in place. E.g., with a low score, a form submission might trigger additional validation mechanisms (potentially including conventional Recaptcha interruptions).

Given our understanding of the characteristics of most of the spam profiles on the site, we could build our own layer on top of Recaptcha’s spam score, ‘topping up’ the value based on the spam scoring model outlined above.

Where scores exceed a certain threshold, we should show error messages on profile content update submission (and some hints; e.g., “you’ve used too many links”).
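For example, here’s a sketch of how Recaptcha’s score and our own spam score might combine; the thresholds are placeholders which need the consensus mentioned earlier.

```php
<?php
// Sketch only: combining Recaptcha V3's score (1.0 = likely human,
// 0.0 = likely bot) with the profile spam score from the table above.
// The thresholds are placeholders requiring consensus.
function wporg_profile_update_action( float $recaptcha_score, int $profile_spam_score ): string {
	// 'Top up' the risk: spammy-looking content lowers the effective score.
	$effective = $recaptcha_score - ( $profile_spam_score / 100 );

	if ( $effective < 0.3 ) {
		return 'block';     // show an error and hints, e.g. "you've used too many links"
	}
	if ( $effective < 0.5 ) {
		return 'challenge'; // escalate to a conventional Recaptcha interruption
	}
	return 'allow';
}
```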

Terms of use & legal action

It should be noted that none of the immediate actions I’ve outlined address the problem at the source – that money is changing hands between companies and individuals to take advantage of the wordpress.org website.

The following begins to address the underlying problem through legal action, but requires further consideration in terms of resourcing and ownership.

In a nutshell, the following actions should be prioritised: hardening the website terms of use, and exploring legal action against offending networks, vendors and buyers.

Next steps

I’ll be raising the following tickets in trac:

  • Improving the integration of Google’s Recaptcha software (V2 and V3) into registration and profile updating.
  • Developing an MVP scoring system for assessing the ‘spamminess’ of profiles.
  • Beginning discussions around longer-term actions:
    • Hardening our website terms of use.
    • Exploring legal actions against offending networks/vendors/buyers.
    • Honeypot development.
    • Building more sophisticated logic on top of Recaptcha V3.

Let me know if I’ve missed anything!
