Website Security: Sensitive docs, robot blocks, and file-system locks

There’s a potential security flaw buried deep in millions of websites. If you’re affected, it could expose your private content, customer details, and sensitive documents to the world. And there’s nothing that you can do about it… Or is there?

Website managers, owners and users upload files and assets all the time. Usually, these are images, videos, or documents which are displayed on (or linked to from) their pages, posts and themes.

Sometimes, the files which they upload contain or represent sensitive content – things like invoices, customer details, passwords, and business information.

Depending on how your website, CMS and server are configured, these files might be discoverable by search engines like Google. And once they've been found by search engines, they can easily be found by people.

Right now, I can search Google and find private content, documents, information and passwords from thousands of websites. Even metadata on assets, like EXIF information on your images, can contain private, personal or sensitive information. If I were malicious, I could actively seek out private or damaging materials, through nothing more than some targeted Google searches.

Important!

Before I dive in, I should be clear about the nature of this issue: it’s not a security flaw with, or limited to, WordPress. Rather, it’s primarily a business and asset management issue and a server configuration issue. There’s little that WordPress (or any other CMS) can do to easily ‘fix’ this issue (with an out-of-the-box approach, for most users). This post will explore why that’s the case, and what you can do about it.

That said, with its broad adoption, WordPress is where this issue presents itself most obviously. It’s also where we have the tools and opportunities to try and tackle some of the symptoms – or at least to educate people about the risks.

I’m going to explore the problem in-depth, and outline some of our options for addressing it.

Understanding the problem

Your files are public by default

As a user, I can easily upload files or folders to my server (either directly, or through the WordPress Media Library). On most server configurations, these files are, by default, publicly accessible.

In most use cases, that’s fine, and it’s the behaviour I want. The images I use on my website, for example, are meant to be accessed and shared.

But if I’m uploading sensitive assets, business documents, or customer details, it’s a different story.

WordPress’ versatility is one of its great strengths – but that means that, sometimes, it’s used in ways that create unexpected problems.

So here’s the scary bit. By default, if these URLs are publicly accessible, there’s nothing stopping Google, Facebook, or competitors from finding, accessing, and sharing them. If search engines can access these resources – unless they’re explicitly asked not to [1] – they’re likely to crawl them and make them discoverable through their search results.

And once these files have been discovered, it’s very easy for anybody to search Google with a specific intent to find them. It’s a simple matter to search for specific file types (such as PDFs) on a specific domain and to browse through what Google’s found. And when you’ve found a file or two, you can often spot and manipulate patterns in URL structures, and explore beyond what Google’s found.

I should point out that attempting to take advantage of ‘bugs’ like this is, in many territories, considered illegal. Actively manipulating URLs and systems to ‘probe’ for hidden files and information has, on several occasions, led to jail time for the individuals in question.

Even if you’re making an effort to hide your sensitive files, the software, themes, plugins and server configurations you’re relying on might be quietly exposing them.

In particular, if you’re using lots of third-party plugins and theme code, it’s challenging to understand and monitor the location and access permissions for all of your files, documents and correspondence [2]. Often, you might not even know if these systems are creating or leaking sensitive information.

And if that’s leaking personally identifiable or sensitive information about individuals, that might mean that you’re breaking the law – especially if you’re handling information about EU visitors, and contravening elements of the GDPR.

How request handling for ‘non-pages’ works

To understand the nature of the problem, we need to talk about how WordPress handles URLs.

WordPress provides developers with a great deal of control over how it handles requests. When users type in the URL of a post, page, or category, developers can manage and influence how that process works.

Specifically, developers can use theme code or plugins to modify how a request is processed. We can do anything from changing or setting HTTP headers, to showing errors, to redirecting visitors, or even altering what content is displayed and how it’s formatted.

But requests to the URLs of ‘non-page’ resources – like images, PDFs, DOC files, audio and video – completely bypass WordPress (and most other content management systems)!

If you type in the URL for an image that resides in a folder on my website and view that image directly in the browser, WordPress doesn’t load. Chances are, WordPress doesn’t even know that a request occurred when you request ‘static assets’ like these. You’re requesting and getting the file directly from the server (usually Apache, or NGINX), and the content management system is never needed or loaded.

Here’s an example image file that I uploaded to my website, through the WordPress Media Library. It has a public URL, and anybody can access it directly. Even though I uploaded the file through WordPress’ interface, it’s not involved in serving that file when it’s requested. WordPress doesn’t manage the file in any active sense; it just put it in a folder on my server.

Speaking of which, there are many cases where requests to folders in WordPress (and other platforms) will also bypass the CMS. Without specific configuration, their URLs provide direct access for users to browse your files. If I have, can find, or guess the URL, I can view and download your assets.

On most Apache systems, simply browsing to a filesystem folder lets you freely explore my theoretically sensitive business documents. Often, you can even click on the ‘Parent Directory’ link, or edit the URL to traverse ‘upwards’. Now you’re freely exploring the server’s public folder structure and rummaging through (theoretically) private files.

The challenge is that it’s really hard to change how requests to these types of URLs are handled from within WordPress when WordPress isn’t being loaded.

There are some technical solutions (such as routing all URL requests through WordPress), but these often come with scalability/performance issues, and the implementation tends to vary a lot based on each individual site and setup.

What can we do?

Ideally, we’d make alterations to how WordPress works, in order to protect everybody’s private files. But whilst there are some quick wins that you can implement on your individual site, there are significant challenges around creating a large-scale, “out of the box” solution for everybody.

By making a small tweak to your server configuration, you can prevent users from being able to browse through folders. Depending on your setup, that can be as simple as adding ‘Options -Indexes’ to your .htaccess file.
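As a rough sketch (assuming an Apache server where .htaccess overrides are allowed), that’s a one-line directive:

    # Disable directory listings for this folder and everything beneath it
    Options -Indexes

With that in place, a request for a folder with no index file should return a ‘403 Forbidden’ error, rather than a browsable list of its contents.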

Frustratingly, however, many hosting solutions don’t enable this setting by default. That creates a barrier of awareness, education, and time/cost.

Unfortunately, this also doesn’t do anything to prevent the discovery of your folder structure or files – but it does at least prevent active and easy traversing of your folder structures to identify sensitive assets.

And just to make things more complex, there’s no simple equivalent setting for NGINX setups [3] (or for many non-standard WordPress environments) that we can easily modify or configure through plugin and theme code. These settings often need configuring at the server level, way outside of the scope of the CMS, and on a case-by-case basis.

That’s why WordPress relies on placing empty index.php files inside all of its core folder structures – if you can’t rely on disabling indexes on folders, you can at least prevent them from being displayed by ensuring that a folder contains an index file.
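Those placeholder files don’t need to contain anything meaningful – a stub along the lines of WordPress’ own ‘silence is golden’ files is enough to stop a folder’s contents from being listed:

    <?php
    // Silence is golden.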

However, WordPress doesn’t (and can’t, without a great feat of engineering) automatically create these empty index files in newly created folders, or in folders it doesn’t know about. That means that any custom folders you create (or which are created by other processes) expose you to the risk of directory traversal attacks [4].

We’ve also got no easy way of knowing what, or where, all of your custom folders are – especially when plugins may create many folders in different locations, which frequently don’t contain empty index files to prevent exploration.

Because we can’t account for all of the folders on the server, this is far from a comprehensive or effective technique.

To really address the problem, we need to look at some bigger challenges.

Challenge #1: Changing how WordPress works

It’s hard to change how ‘non-page’ requests are handled by WordPress without making significant changes to how the core processes of file handling and rewriting work.

By default, WordPress deliberately doesn’t ‘listen’ for requests to files and folders which exist in the filesystem – on most setups, that behaviour is configured in the .htaccess file. The server is explicitly told not to pass requests for static files and assets to WordPress, and only to route requests which look like they’re for posts or pages through it.
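On a typical Apache install, that logic lives in the standard WordPress rewrite block. The two ‘RewriteCond’ lines tell the server to skip WordPress entirely whenever the requested path matches a real file (!-f) or a real folder (!-d):

    # BEGIN WordPress
    <IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteBase /
    RewriteRule ^index\.php$ - [L]
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule . /index.php [L]
    </IfModule>
    # END WordPress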

What if we changed this, and made WordPress intercept all requests?

If we can get WordPress to intercept the request, we can make modifications to HTTP headers and indexation directives. We can try to prevent search engines from crawling and/or indexing sensitive resources.
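Purely to illustrate the idea (and not as something you’d want to deploy as-is), that would mean stripping the file and folder checks out of the rewrite block shown earlier, so that even requests for real files get handed to index.php:

    # Hypothetical: force *every* request through WordPress, even for real files
    <IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteBase /
    RewriteRule ^index\.php$ - [L]
    RewriteRule . /index.php [L]
    </IfModule>

WordPress would then need extra code to locate, authorise and stream each requested asset – which is exactly where the cost and complexity creep in.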

But in this scenario, your website and all of your assets are going to be much slower. WordPress is a large, complex piece of software. If it has to load on every request before it serves you the file you requested, you’re likely to be adding significant load time to every request [5]. You’ll be creating additional strain on your server’s hardware, too.

Such a fundamental shift in how WordPress works is also likely to make a mess. All of today’s existing themes, plugins and configurations weren’t written to anticipate this functionality, and are likely to require modification in order to avoid breakage.

Even if we could overcome some of those technical challenges, WordPress still can’t easily listen to requests for files outside of where it’s been installed – and many websites store sensitive files in folders outside of places where WordPress can listen for them.

Lastly, to be a broad and general solution, this approach would need to work consistently across many websites with varying setups, with minimal manual setup and configuration – something we can’t rely on out in the wild, where server software, settings and file permissions vary wildly.

Challenge #2: Using plugins to hide private files

Whilst it’d generally be safer not to store private files on publicly accessible websites at all, that’s not always practical.

The good news is that conscientious site managers can already take action to hide their private files, by using plugins to move and/or hide them.

Typically, these plugins work by moving private files ‘upwards’ in the server, to a folder ‘above’ the publicly accessible website, so that they can only be accessed by logged in or otherwise validated requests.

But whilst these solve the problem on a case-by-case, site-by-site basis, they aren’t a great general solution. They require that you know about them, that you understand the requirement for them on your site, that you can go through the (sometimes complex) configuration processes, and that you actively micromanage all of your private assets.

And in many cases, site owners who don’t know better want to be able to share their private files via a URL (e.g., sending a client a link to an invoice PDF). However, storing files outside of the public webroot either prevents this (so people keep them public), or still results in sensitive assets having accessible URLs, which allow them to be discovered by search engines.

In an ideal world, WordPress’ core system would have features to move and hide private files automatically. However, placing and managing files outside of WordPress’ folder structure is a tricky business when websites, server setups and configurations vary wildly. An automated solution of this nature simply isn’t feasible.

That leaves millions of websites and businesses vulnerable.

Challenge #3: Flagging private files to search engines

There’s nothing stopping themes or plugins (or even a WordPress core feature) from adding a ‘private’, or ‘hide from search engines’ checkbox in the media manager, where users upload or manage their files.

But, remember, WordPress can’t intercept requests to these kinds of files, so we can’t do anything directly with or from that setting. What we can do, however, is change the filename or attributes of those assets when they’re created or edited.

If we, for example, appended ‘__private’ to filenames (so that invoice-123.pdf becomes invoice-123__private.pdf), then we could easily create rules in the robots.txt file to prevent search engines from crawling those files, or set HTTP headers via .htaccess (or similar tools) to add X-Robots-Tag directives – which would instruct search engines not to index them.
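As a rough sketch of how that might look (assuming Apache with mod_headers enabled, and the hypothetical ‘__private’ naming convention above):

    # robots.txt – ask compliant crawlers not to fetch the flagged files
    User-agent: *
    Disallow: /*__private.

    # .htaccess – tell search engines not to index anything matching the pattern
    <FilesMatch "__private\.">
        Header set X-Robots-Tag "noindex, nofollow"
    </FilesMatch>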

And of course, this doesn’t stop people searching for and/or finding these files through brute force – it’d be a trivial matter to probe for files that matched this pattern, and harvest the files.

Alternatively, we could move private files – say, into a /private/ folder. Then we can configure access rules and logic to prevent people sniffing around. We could get WordPress to follow either of these approaches, with minimal effort.
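For the folder-based option, the ‘access rules’ part could be as blunt as a per-folder .htaccess file which refuses all direct web requests (a sketch, assuming Apache 2.4 and a hypothetical uploads/private/ folder):

    # uploads/private/.htaccess – refuse all direct web requests to this folder
    Require all denied

Anything stored there would then have to be served indirectly – e.g., by a script which checks that the visitor is logged in before reading the file and streaming it back.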

On the surface, these look like potentially elegant solutions. However, further consideration reveals that neither is viable.

Using robots.txt won’t solve our problem – whilst it prevents compliant crawlers from accessing URLs, it doesn’t stop search engines from indexing them or including them in their results. It also won’t do anything to stop malicious users or crawlers from seeking out sensitive content.

Similarly, setting HTTP headers requires changes to .htaccess or equivalent files, which, as we’ve covered, is often challenging based on varying server setups and environments. These are also likely to be ignored by malicious users or crawlers.

More significantly, changing filenames is bad practice. Best practice states that, in all cases and scenarios on the web, URLs should never change. Changing URLs risks dead ends and ‘link rot’, 404 errors, and all sorts of mess [6].

And whilst it’s relatively safe to set an asset as private at the point of upload, if you ever need to change that setting in the future, you break all links to the asset.

You also break any media embedded in your content, unless you have a solution that intelligently updates all of the links and references (although, we’re hard-pressed to come up with a use-case for embedding a private asset in a public page).

There are possible solutions to this, however. If the solution for managing private assets is intelligent enough to redirect users to the public/private version of a file depending on its settings, then you could avoid those dead ends and broken URLs.

But that approach comes with a performance overhead (WordPress needs to listen and check for the existence of resources and public/private versions of files on every 404’ing request) and risks leaving a trail that leads directly to your private files.

Lastly, if you’re managing any of this functionality through a plugin, disabling or removing that plugin risks your asset management and media URLs breaking, too.

Challenge #n?

None of these options is suitable. They get us some of the way towards our goals, but we inevitably find ourselves stuck: either we can’t rely on the server environment behaving the way we expect (at least, not consistently enough, or at sufficient scale, to build a stable solution on), or we end up creating new problems as we try to manage exposure.

The only person who can solve this is you

The further we dig, the more apparent it becomes that this is a business challenge, as much as a technical challenge. Individual organisations will need to define and implement their own solutions, based on their own setups and needs. Stakeholders will need to invest in understanding how their files, privacy, and legal obligations are managed. 

That means that, when it comes to protecting your files, you’re in the driving seat. And, you may be legally liable if you’re leaking sensitive information – ignorance isn’t a defence.

You’ll need to think about whether you store any sensitive assets on your site(s) and make plans to review what you’re doing to prevent them from being discovered, indexed, and accessed.

There are a few things you can do to protect yourself, your business and your customers, but it all boils down to making sure that you keep your private files private.

I’d recommend that everybody does the following, on a regular basis:

  • Ensure that your folder structure doesn’t allow index browsing (either through setting server options, or including index files)
  • Remove any sensitive documents stored in your media library, unless you’re absolutely sure that they have to be there.
  • Actively protect any remaining private files through a plugin or server solution which ensures that they’re not publicly accessible.
  • Avoid using date-based upload folders, and avoid storing media/assets in newly or manually created folders (e.g., avoid logging into your server via FTP, and placing sensitive assets in folders you’ve manually created).
  • If you’re using plugins for managing customer information, invoices, or other sensitive data, pay attention to any publicly accessible URLs which they create. Remember, if you can get to the URLs without being logged into the site, so can search engines and malicious users.
  • Make sure that your site is registered through Google Search Console, and pay attention to information about which of your pages Google is indexing.
  • Watch out for plugins like contact forms, membership systems and forums, which might store sensitive data without you realising. Periodically search Google for ‘site:mysite.com’ [7], and keep an eye out for anything which shouldn’t be there.
  • Check your server logs; search engines and malicious users may be finding and accessing sensitive data and not leaving an obvious trace – by monitoring your server logs, you can see exactly which URLs are being requested. If you’re not familiar with log analysis, this article by Dominic Woodman from Distilled is a great starting place.
  • If you think that a particular plugin, theme or system might be leaking sensitive data, let the site owner and/or plugin developers know. Please don’t share or post URLs which lead to or contain sensitive business/personal data in the comments below, or anywhere else. If you can’t get hold of somebody, you can report issues through platforms like Open Bug Bounty or HackerOne.

1. A robots.txt disallow rule prevents crawling, but not necessarily indexation – and malicious bots might ignore the file entirely. That means that we have to rely on X-Robots-Tag HTTP headers, which we can’t send via WordPress when the file is accessed directly.

2. I’ve already found several well-known contact form plugins which store uploads and contact records in easily discoverable and traversable folders – and these are just the tip of the iceberg. I’m in the process of getting in touch with the authors.

3. Whilst it’s possible to add ‘location’ rules to NGINX which would allow you to override the normal behaviour, this can be tricky to configure – you need to map out every folder (or regex pattern) where you want to override the settings. That’s a lot of maintenance, and not a particularly scalable solution for most sites.
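For illustration only, a rule of that kind might look something like this (the path is hypothetical, and every folder or pattern you care about needs its own rule):

    # nginx – per-path overrides for a hypothetical private uploads folder
    location /wp-content/uploads/private/ {
        autoindex off;
        add_header X-Robots-Tag "noindex, nofollow";
    }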

4. This includes date-based upload folders, which WordPress creates by default for uploaded media. In most cases, we’d recommend disabling this option in your site settings.

5. It’s also worth considering that a page which requests multiple assets (e.g., a web page with lots of images) will now be loading WordPress multiple times, adding huge performance overheads – and applying any kind of conditional logic to prevent this would be hugely complex.

6. Even if we don’t change the filename, and we just add a clever rewrite to set HTTP headers on assets requested with /private/ in the URL, that doesn’t prevent users from directly accessing or sharing the original URL, without the /private/ string in it. And preventing direct access in that scenario is functionally the same as having changed the URL.

7. You can refine your searches to look for specific patterns by adding extra components. E.g., “site:mysite.com filetype:pdf”, or “site:mysite.com intitle:invoice” are good starting points for discovering sensitive information which you might accidentally be exposing. Ryan Siddle of Merj.com has some excellent research into the kinds of things you might want to search for to identify holes on your site. Remember, using this kind of technique on other people’s sites may be considered illegal, especially if you’re actively probing for private information.
