Silence is golden; security vs SEO?
27th February, 2021
One of WordPress’ most basic security features creates a tricky SEO challenge. Every WordPress site contains dozens of ‘dead’ URLs, which exist to hide sensitive folders, files and information. Each of those URLs comes with an SEO cost, and there’s no easy fix.
To solve this problem, we need to make some ambitious changes to the very core of how WordPress works. The empty files which WordPress creates to prevent directory traversal – which contain nothing but the phrase “Silence is golden” – need a radical overhaul.
What is directory traversal?
With certain server configurations, it’s possible for users to browse and discover the ‘raw’ filesystem of a website; the files behind the content management system. This can be a security risk. These ‘indexes’ might reveal sensitive documents, information, or clues as to how the website might be structured or configured.
This type of browsing is known as “directory traversal”, and taking steps to prevent it is a fairly basic, common security consideration.
The good news is that, on most Apache servers, it’s trivially easy to disable directory traversal. You can just add the following line to a `.htaccess` file:

```apache
Options -Indexes
```
This prevents anybody from seeing what files are in the folder, and any folder ‘below’ it (unless those folders re-enable indexes).
Now, requests to that URL return a bare ‘403 Forbidden’ error page.
Whilst this isn’t pretty, users and external systems get clear signposting that they’re not allowed to access the resource (via a 403 HTTP status code). Whilst not perfect, for the most part, this is fine for SEO.
WordPress environments vary wildly
Unfortunately, this solution won’t work reliably on all WordPress websites.
WordPress gets installed and runs in lots of different environments. Those environments work in different ways and have unpredictable configurations.
For example, one server running on Apache software might be configured in a radically different way to another Apache server. And, Apache servers behave very differently to NGINX or Litespeed servers.
It’s hard for WordPress to reliably detect and react to many of these environmental differences. The relevant settings are often ‘below’, ‘behind’, or otherwise inaccessible to WordPress.
With some configurations, assuming a type of support or capability that isn’t present might crash WordPress in a way which it can’t recover from. Assuming anything about server capabilities is risky.
So if we can’t rely on disabling indexes via our `.htaccess` file, we need an alternative approach.
Preventing directory traversal with an empty index file
The good news is that there’s an alternative, more universal way to ‘disable’ indexes.
The secret is simply to ensure that there’s a file in the directory (usually an `index.php` file, or similar). With such a file in place, requests to the URL return the file, instead of the folder.
So one of the things that WordPress does is to add an empty `index.php` file into a number of its ‘core’ folders.
You can see that the `wp-content` folder, for example, contains an `index.php` file that, upon investigation, doesn’t do anything.
Other than the phrase ‘Silence is golden’, these files are empty. They exist solely to prevent directory traversal.
With these in place, requesting the URLs they represent returns a blank page. This happens whether you request the folder (`/wp-content/`) or the file itself (`/wp-content/index.php`).
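For reference, here’s what one of these files looks like. This sketch mirrors the stock file that ships in folders like `wp-content` (the real file in WordPress core is essentially identical):

```php
<?php
// Silence is golden.
```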
The problems with an empty file
Unfortunately, this approach causes three serious side effects when it comes to SEO and user experience. I’ll explore these, and then consider what we might do to solve them; either in plugins or in WordPress’ core software.
1. They deliver a poor user experience when visited
On other types of ‘non-existent’ URLs, WordPress serves you a ‘user-friendly’ 404 error. On these ‘Silence is golden’ files, you just get an empty white page.
That’s a poor user experience, which might cause frustration or confusion. The end-user shouldn’t be exposed to the fact that this particular URL represents a ‘real’ file. As far as they’re concerned, they’ve requested a URL that isn’t a “page” on the website, so they should get a 404 error.
How many people are hitting these URLs, though? Well, that’s hard to answer. In most cases, any tracking or analytics systems are integrated into a site’s template logic – none of which is loaded on these types of pages. We’re blind to when and where we’re failing our audiences.
2. They don’t completely solve the underlying security risk
A malicious user can still try to map the filesystem structure of my site by looking for (or brute-forcing) these types of ‘empty file’ responses.
Even though they can’t see the contents of or move between directories (i.e., conventional directory traversal), they can still find them, which might be a risk.
Worst of all, on server setups that would have natively prevented directory indexes, we’ve introduced a new set of discoverable URLs where none would have existed.
3. They (incorrectly) return a 200 HTTP status header
When these pages are loaded, they tell search engines and other consumers (like Facebook and other social media platforms) that the response from the requested page is ‘OK’ via a 200 HTTP status code. But this isn’t ‘OK’; there’s no content there to show.
Returning an unrepresentative or incorrect HTTP header is bad practice. It can cause SEO and social media issues, as external systems misunderstand and misrepresent the nature of these ‘pages’.
In some cases, it might mean that more traffic ends up on these types of URLs. In others, it might mean that a large or complex site experiences problems with crawl budget. SEO considerations aside, that could have a real impact on the carbon footprint of those sites, and at scale, of WordPress as a whole.
Instead of a 200 status, these URLs should return a 404 HTTP status; a clear statement that there’s no content found, and, no different in principle from requesting any other ‘invalid’ URL on a site.
Perhaps you could make an argument that these requests should technically return a 200 (or even a 403) status, as the file exists and is returned correctly; it’s just empty/forbidden. But the presence of a ‘real’ file isn’t relevant to us. Our websites are heavily ‘abstracted’ away from the underlying filesystem, to the point that there isn’t a direct relationship between the URLs we use to access our content, and the actual files that ‘power’ these. By this same logic, we should return a 404 Not Found header for all URLs on a WordPress site which don’t represent a ‘real’ file; which is almost all of them. That’s obviously not a good idea.
You could also make an argument for setting an `X-Robots-Tag` HTTP header with a value of `noindex` on these files, or for blocking them via a series of `robots.txt` rules – but both of these approaches have unpleasant side-effects, and don’t solve our other underlying problems.
Is this really a problem?
We should be clear that none of these issues are ‘serious’, either from an SEO or security perspective. Site owners and managers don’t need to worry about these URLs; there are almost certainly more valuable things that they can spend that energy on.
It may even be that Google and other external systems are already smart enough to ‘ignore’ that these pages respond in a non-standard, incorrect way.
But neither of these factors mean that we shouldn’t aspire to fix this incorrect behaviour. Improving your SEO is an incremental challenge, and every small change and improvement can still be important.
We should also consider that these empty pages exist in large numbers across every WordPress website. Conservatively, that means that WordPress is responsible for hundreds of millions of minor SEO errors across the web. For a platform that markets itself as “SEO friendly”, that’s an uncomfortable thought.
This isn’t a new problem
This problem has been around for almost as long as WordPress has. In fact, this ticket on WordPress’ issue tracking system was opened nine years ago. Since then, some steps have been taken to add index files to more key folders (primarily to avoid SEO and crawl issues), but coverage is still far from comprehensive. The underlying problem remains.
What’s the solution?
Requests to URLs that represent filesystem folders – like `/wp-content/` – should return a ‘nice’ 404 error. That’s good for users, good for search engines, and good for security.
Unfortunately, we still need to rely on an `index.php` file to do that (as we can’t rely on indexes being disabled), but the file needs to contain more than just ‘Silence is golden’.
To solve the three problems we explored above, we need our `index.php` files to do three things:
- Return a 404 HTTP status header, so that search engines and other external systems know that there’s no valid ‘page’ at the URL.
- Load the site’s 404 template, so that users have a good experience (and so that connected systems like tracking/analytics run).
- Be in every folder in a WordPress filesystem.
Let’s explore each of these challenges, and propose some solutions.
Returning a 404 HTTP status
Returning the correct HTTP status is an easy fix – it can be achieved via a minor edit to the existing `index.php` file; e.g., by adding `header("HTTP/1.0 404 Not Found");` (or `http_response_code(404);` in modern PHP).
This code can simply be added to the creation routines and processes for these files.
As a singular, simple change, this is worth pursuing even if we can’t fix the bigger/underlying problems.
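As a sketch, the amended file might look like this (the ‘Silence is golden’ comment is the existing contents; the `header()` call is the addition):

```php
<?php
// Silence is golden.

// Tell search engines and other consumers that there's no content at this URL.
header( "HTTP/1.0 404 Not Found" );
```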
Loading the site’s 404 template
When you request a URL on a WordPress site that is not a ‘real’ file, that request is routed to WordPress’ main `index.php` file. That then loads WordPress, which determines how the request should be handled. It’s at that point that we can load templates, and interact with stored values in the database.
When you load a ‘real’ file, then none of that happens. WordPress doesn’t load at all, which means that we can’t use its templating engine, or load any saved values.
To solve this, our index files need to load WordPress.
If we can achieve this, then we don’t even need to set a 404 HTTP status; WordPress will do it for us.
But that’s easier said than done because our index files don’t necessarily know where the files we need to load ‘live’. WordPress can be installed in multiple ways and places (e.g., in a subfolder, or on a subdomain), and we have no reliable way to locate the parts we need.
That means that we need our index files to be ‘smarter’:
- When the site is first installed, all of our `index.php` files should be created and/or updated with a PHP function which attempts to load WordPress from a specific location.
- That location value should be an absolute reference to WordPress, not a relative path based on the file location. E.g., it should ‘build up’ from `$_SERVER["DOCUMENT_ROOT"]`, rather than constructing a relative path like `../../../../index.php`.
- If/when attempting to load WordPress from the defined location fails (i.e., the file isn’t found), the file should fall back to simply setting a 404 HTTP status.
Whilst this may seem complicated, it’s not radically different from the way in which WordPress makes alterations to a site’s `.htaccess` file when site settings are changed. And building up from the document root means that we only need to determine and construct the location value once.
The big difference in this situation, however, is the number of index files, and that they’re located all over the website’s file system. But because we’re running this in the installation process, any overhead on editing these files shouldn’t be as ‘painful’ as it might be if we needed to change large numbers of files on a live site.
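As a sketch, a ‘smart’ index file might look something like the following. The use of `wp-load.php` at the document root is an assumption for illustration (the real path would be written in at install time); `set_404()`, `status_header()`, `nocache_headers()` and `get_query_template()` are real WordPress functions, available only once WordPress has loaded:

```php
<?php
// Hypothetical 'smart' index file. The absolute path below is an assumed
// example; in practice it would be generated during installation.
$wordpress = $_SERVER['DOCUMENT_ROOT'] . '/wp-load.php';

if ( is_readable( $wordpress ) ) {
    // Boot WordPress, then hand the request over as a 404, so that the
    // site's own 404 template (and any tracking/analytics code) runs.
    require $wordpress;

    global $wp_query;
    $wp_query->set_404();
    status_header( 404 );
    nocache_headers();

    if ( $template = get_query_template( '404' ) ) {
        include $template;
    }
} else {
    // Fallback: we can't find WordPress, so just set the correct status.
    header( 'HTTP/1.0 404 Not Found' );
}
```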
What about change?
When a site structure is changed, these processes would have to be re-run. Otherwise, all of the references to WordPress’ location in our files will fail.
That wouldn’t be the end of the world (they’d still set a 404 HTTP header), but, it does mean that our solution is more fragile than we’d like.
To enable files to be updated when site settings are updated, we could also log where all of our index files are when they’re created (either at the point of installation, or later as new folders are added). We could do this in a new database table, or in WordPress’ options table.
Then, when an update to the file structure or WordPress location occurs, a process could be triggered to update all of the files in the system.
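A sketch of that bookkeeping, using the options table (the option name and function names here are hypothetical, not core APIs; `get_option()` and `update_option()` are real WordPress functions):

```php
<?php
// Hypothetical: record each 'smart' index file as it's created.
function record_smart_index_file( $path ) {
    $files   = get_option( 'smart_index_files', array() );
    $files[] = $path;
    update_option( 'smart_index_files', array_values( array_unique( $files ) ) );
}

// Hypothetical: rewrite every recorded file when WordPress' location changes.
function rewrite_smart_index_files( $new_contents ) {
    foreach ( get_option( 'smart_index_files', array() ) as $path ) {
        if ( file_exists( $path ) ) {
            file_put_contents( $path, $new_contents );
        }
    }
}
```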
Alternatively, we could accept that these kinds of alterations are rare, and rely on users to run a simple find-and-replace process on the filesystem to ‘fix’ all of the incorrect location values (potentially with some kind of supported method for doing this).
Being in every folder
WordPress comes with, and in some scenarios, creates large numbers of folders over time. Even knowing which folders exist, and where they are, is a considerable challenge.
For our solution to be reliable, we need to ensure that there’s a ‘smart’ `index.php` file in every folder in the filesystem. That’s every part of WordPress’ core systems, every part of every theme, and every folder in every plugin.
We also need to consider every folder which WordPress (or plugins) creates; not just those which exist at the point of install. That includes a number of ‘infinite spaces’, such as date-organized media folders.
Given the number of folders, systems, people and moving parts involved, this feels like a potentially insurmountable challenge.
There is some hope, though. Folders don’t magically create themselves. They only come into existence as the result of a process. Processes can be altered.
Conveniently, WordPress already has a handy function for creating folders: `wp_mkdir_p()`, which is documented here. This could be adapted, extended or ‘wrapped’ in ways that mean that, when it’s used, it also populates created folders with our ‘smart’ index files.
We could feasibly push for changes in WordPress core and in the coding standards for plugins, to require that the creation of folders uses this approach, rather than the ‘raw’ PHP `mkdir()` equivalent.
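A sketch of such a wrapper (the function name is hypothetical; `wp_mkdir_p()` and `trailingslashit()` are real WordPress functions):

```php
<?php
// Hypothetical wrapper: create a folder, then drop a protective index file
// into it, so that no new folder is ever left exposed.
function wp_mkdir_p_with_index( $target ) {
    if ( ! wp_mkdir_p( $target ) ) {
        return false;
    }

    $index = trailingslashit( $target ) . 'index.php';

    if ( ! file_exists( $index ) ) {
        // Minimal placeholder; ideally this would be the 'smart' file instead.
        file_put_contents(
            $index,
            "<?php\n// Silence is golden.\nheader( 'HTTP/1.0 404 Not Found' );\n"
        );
    }

    return true;
}
```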
In summary
For a relatively minor bug, there’s still a lot of thinking and work that needs doing to achieve a graceful, worthwhile fix. But, as I’ve covered, that doesn’t mean that this isn’t worth solving.
As WordPress powers more and more of the web, we need to be responsible for setting standards around quality. We can’t allow it to be ‘okay’ that our content management system creates hundreds of millions of ‘dead ends’ on the web, just because it’s an unfortunate side-effect of the way in which the system works.
These kinds of issues matter, because, working towards a better web matters. Less altruistically, they matter, because they hinder the ability of search engines to effectively crawl and understand our websites.
So. Solving our ‘empty page’ problem involves three key steps:
- Updating existing processes to return a 404 HTTP status.
- Updating creation processes to store a reference to (and subsequently load) WordPress.
- Updating folder-creation processes, to ensure that every new folder gets an index file which implements #1 and #2.
Each of these steps adds incremental value and closes the quality gap. Even if getting to #3 is challenging (and undoubtedly a long-term, stretch goal), we should start working on improving the current behaviour.
I hope that this article provides a starting point for discussion, and sparks follow-up proposals for the fixes above, based on community feedback and further exploration.