Website Security: Sensitive docs, robot blocks, and file-system locks

There’s a poten­tial secur­ity flaw buried deep in millions of websites. If you’re affected, it could expose your private content, customer details, and sens­it­ive docu­ments to the world. And there’s noth­ing that you can do about it… Or is there?

Website managers, owners and users upload files and assets all of the time. Most of the time, these are images, videos, or docu­ments which are displayed on (or linked to from) their pages, posts and themes.

Some­times, the files which they upload contain or repres­ent sens­it­ive content – things like invoices, customer details, pass­words, and busi­ness inform­a­tion.

Depending on how your website, CMS and server are configured, these files might be discoverable by search engines like Google. And when they’ve been found by search engines, they can easily be found by people.

Right now, I can search Google and find private content, docu­ments, inform­a­tion and pass­words from thou­sands of websites. Even metadata on assets, like EXIF inform­a­tion on your images, can contain private, personal or sens­it­ive inform­a­tion. If I were mali­cious, I could actively seek out private or damaging mater­i­als, through noth­ing more than some targeted Google searches.


Import­ant!

Before I dive in, I should be clear about the nature of this issue: it’s not a secur­ity flaw with, or limited to, Word­Press. Rather, it’s primar­ily a busi­ness and asset manage­ment issue and a server config­ur­a­tion issue. There’s little that Word­Press (or any other CMS) can do to easily ‘fix’ this issue (with an out-of-the-box approach, for most users). This post will explore why that’s the case, and what you can do about it.

That said, with its broad adop­tion, Word­Press is where this issue presents itself most obvi­ously. It’s also where we have the tools and oppor­tun­it­ies to try and tackle some of the symp­toms – or at least to educate people about the risks.

I’m going to explore the prob­lem in depth, and outline some of our options for address­ing it.

Under­stand­ing the prob­lem

Your files are public by default

As a user, I can easily upload files or folders on my web server (either directly, or through the Word­Press Media Library). On most server config­ur­a­tions, these files are, by default, publicly access­ible.

In most use-cases, that’s fine, and it’s the beha­viour I want. The images I use on my website, for example, are meant to be accessed and shared.

But if I’m upload­ing sens­it­ive assets, busi­ness docu­ments, or customer details, it’s a differ­ent story.

Word­Press’ versat­il­ity is one of its great strengths – but that means that, some­times, it’s used in ways which create unex­pec­ted prob­lems.

So here’s the scary bit. By default, if these URLs are publicly accessible, there’s nothing stopping Google, Facebook, or competitors finding, accessing, and sharing them. If search engines can access these resources – unless they’re explicitly asked not to [1] – they’re likely to crawl them and make them discoverable through their search results.

And once these files have been discovered, it’s very easy for anybody to search Google with a specific intent to find them. It’s a simple matter to search for specific file types (such as PDFs) on a specific domain and to browse through what Google’s found. And when you’ve found a file or two, you can often spot and manip­u­late patterns in URL struc­tures, and explore beyond what Google’s found.

I should point out that attempting to take advantage of ‘bugs’ like this is considered illegal in many territories. Actively manipulating URLs and systems to ‘probe’ for hidden files and information has, on several occasions, led to jail time for the individuals in question.

Even if you’re making an effort to hide your sens­it­ive files, the soft­ware, themes, plugins and server config­ur­a­tions you’re rely­ing on might be quietly expos­ing them.

In particular, if you’re using lots of third-party plugins and theme code, it’s challenging to understand and monitor the location and access permissions for all of your files, documents and correspondence [2]. Often, you might not even know if these systems are creating or leaking sensitive information.

And if that’s leak­ing person­ally iden­ti­fi­able or sens­it­ive inform­a­tion about indi­vidu­als, that might mean that you’re break­ing the law – espe­cially if you’re hand­ling inform­a­tion about EU visit­ors, and contra­ven­ing elements of the GDPR.

How request hand­ling for ‘non-pages’ works

To under­stand the nature of the prob­lem, we need to talk about how Word­Press handles URLs.

Word­Press provides developers with a great deal of control over how it handles requests. When users type in the URL of a post, page, or category, developers can manage and influ­ence how that process works.

Specific­ally, developers can use theme code or plugins to modify how a request is processed. We can do anything from chan­ging or setting HTTP head­ers, to show­ing errors, to redir­ect­ing visits, or even alter­ing what content is displayed and how it’s format­ted.

But requests to the URLs of ‘non-page’ resources, like images, JPG files, PDFs, DOC files, audio and video completely bypass Word­Press (and most other content manage­ment systems)!

If you type in the URL for an image which resides in a folder on my website and view that image directly in the browser, Word­Press doesn’t load. Chances are, Word­Press doesn’t even know that a request occurred when you request a ‘static asset’ like these. You’re request­ing and getting the file directly from the server (usually, Apache, or NGINX), and the content manage­ment system is never needed or loaded.

Here’s an example image file which I uploaded to my website, through the Word­Press Media Library. It has a public URL, and anybody can access it directly. Even though I uploaded the file through Word­Press’ inter­face, it’s not involved in serving that file when it’s reques­ted. Word­Press doesn’t manage the file in any active sense; it just put it in a folder on my server.

Speak­ing of which, there are many cases where requests to folders in Word­Press (and other plat­forms) will also bypass the CMS. Without specific config­ur­a­tion, their URLs provide direct access for users to browse your files. If I have, can find, or guess the URL, I can view and down­load your assets.

On most Apache systems, simply browsing to a filesystem folder lets you freely explore my theoretically sensitive business documents. Often, you can even click on the ‘Parent Directory’ link, or edit the URL to traverse ‘upwards’. Now you’re freely exploring the server’s public folder structure, and rummaging through (theoretically) private files.

The chal­lenge is that it’s really hard to change how requests to these types of URLs are handled from within Word­Press when Word­Press isn’t being loaded.

There are some tech­nical solu­tions (such as rout­ing all URL requests through Word­Press), but these often come with scalability/​performance issues, and the imple­ment­a­tion tends to vary a lot based on each indi­vidual site and setup.

What can we do?

Ideally, it’d be great to make alterations to how WordPress works, in order to protect everybody’s private files. But whilst there are some quick wins which you can implement today, there are significant challenges around creating a large-scale, “out of the box” solution for everybody.

By making a small tweak to your server config­ur­a­tion, you can prevent users from being able to browse through folders (as we explored in my example above). Depend­ing on your setup, that can be as simple as adding ‘Options -Indexes’ to your .htac­cess file.

Frus­trat­ingly, however, many host­ing solu­tions don’t enable this setting by default. That creates a barrier of aware­ness, educa­tion, and time/​cost.

Unfor­tu­nately, this also doesn’t do anything to prevent the discov­ery of your folder struc­ture or files – but it does at least prevent active and easy travers­ing of your folder struc­tures to identify sens­it­ive assets.
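To check whether a folder URL on your own site is currently browsable, you can look for the page title which generated listings tend to carry. The following is a rough sketch rather than a definitive test: it assumes the server uses a default listing template (Apache and NGINX both title these pages ‘Index of …’), and the function names are my own.

```python
import re
import urllib.request

# Default auto-generated listings are titled "Index of /some/path".
# A customised listing template would not match this heuristic.
_LISTING_TITLE = re.compile(r"<title>\s*Index of\s", re.IGNORECASE)


def looks_like_directory_listing(html: str) -> bool:
    """Return True if the HTML looks like an auto-generated folder index."""
    return bool(_LISTING_TITLE.search(html))


def folder_is_browsable(url: str, timeout: float = 10.0) -> bool:
    """Fetch a folder URL on a site you own and report whether it lists files."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read(65536).decode("utf-8", errors="replace")
    except OSError:
        # Covers HTTP errors (403, 404), connection failures and timeouts.
        return False
    return looks_like_directory_listing(body)
```

Only run folder_is_browsable() against sites you own or are authorised to test.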

And just to make things more complex, there’s no simple equivalent setting for NGINX setups [3] (or for many non-standard WordPress environments), which we can easily modify or configure through plugin and theme code. These settings often need configuring at server-level, way outside of the scope of the CMS, and on a case-by-case basis.

That’s why Word­Press relies on placing empty index.php files inside all of its core folder struc­tures – if you can’t rely on disabling indexes on folders, you can at least prevent them from being displayed by ensur­ing that a folder contains an index file.

However, WordPress doesn’t (and can’t, without a great feat of engineering) automatically create these empty index files in newly created folders, or folders it doesn’t know about. That means that any custom folders you create (or which are created by other processes) expose you to risk of directory traversal attacks [4].

We’ve also got no easy way of know­ing what, or where, all of your custom folders are – espe­cially when plugins may create many folders in differ­ent loca­tions, which frequently don’t contain empty index files to prevent explor­a­tion.

Because we can’t account for all of the folders on the server, this is far from a compre­hens­ive or effect­ive tech­nique.
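You can at least audit your own server for folders which lack an index file. The sketch below walks a folder tree and reports every folder without one. The list of index file names is an assumption; your server’s directory index setting may recognise different names.

```python
import os
from typing import Iterator

# Assumed index file names; adjust to match your server configuration.
INDEX_NAMES = {"index.php", "index.html", "index.htm"}


def folders_without_index(root: str) -> Iterator[str]:
    """Yield every folder under root (including root) with no index file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        present = {name.lower() for name in filenames}
        if not INDEX_NAMES & present:
            yield dirpath


if __name__ == "__main__":
    # Hypothetical path; point this at your own uploads or plugin folders.
    for folder in folders_without_index("wp-content/uploads"):
        print(folder)
```

Any folder this reports may be browsable if directory indexes are enabled on the server.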

To really address the prob­lem, we need to look at some bigger chal­lenges.

Chal­lenge #1: Chan­ging how Word­Press works

It’s hard to change how ‘non-page’ requests are handled by WordPress, without making significant changes to how the core processes of file handling and rewriting work.

By default, Word­Press delib­er­ately doesn’t ‘listen’ for requests to files and folders which exist in the filesys­tem – on most setups this is the default beha­viour, as configured in the .htac­cess file. It’s expli­citly told to ignore requests for static files and assets, and only to process requests which look like they’re for posts or pages.
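To make that routing decision concrete, here is a simplified model of it. It’s an illustration of the logic only, not how Apache or WordPress implement it, and the function name is my own.

```python
from pathlib import Path

# The standard WordPress .htaccess rules amount to:
#   RewriteCond %{REQUEST_FILENAME} !-f   (not an existing file)
#   RewriteCond %{REQUEST_FILENAME} !-d   (not an existing folder)
#   RewriteRule . /index.php [L]          (otherwise, hand over to WordPress)


def handled_by(document_root: Path, request_path: str) -> str:
    """Say which layer answers a request, ignoring query strings."""
    target = document_root / request_path.lstrip("/")
    if target.is_file() or target.is_dir():
        # The file or folder exists on disk, so WordPress never loads.
        return "web server"
    return "wordpress"
```

In this model, a request for an uploaded PDF returns ‘web server’, while a request for a post’s permalink (which doesn’t exist on disk) returns ‘wordpress’.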

What if we changed this, and made Word­Press inter­cept all requests?

If we can get Word­Press to inter­cept the request, we can make modi­fic­a­tions to HTTP head­ers and index­a­tion direct­ives. We can try to prevent search engines from crawl­ing and/​or index­ing sens­it­ive resources.

But in this scenario, your website, and all of your assets, are going to be much slower. WordPress is a large, complex piece of software. If it has to load on every request before it serves you the file you requested, you’re likely to be adding significant load time to every request [5]. You’ll be creating additional strain on your server’s hardware, too.

Such a funda­mental shift in how Word­Press works is also likely to make a mess. All of today’s exist­ing themes, plugins and config­ur­a­tions weren’t writ­ten to anti­cip­ate this func­tion­al­ity, and are likely to require modi­fic­a­tion in order to avoid break­age.

Even if we could over­come some of those tech­nical chal­lenges, Word­Press still can’t easily listen to requests for files outside of where it’s been installed – and many websites store sens­it­ive files in folders outside of places where Word­Press can listen for them.

Lastly, to be a broad and general solu­tion, this approach would need to work consist­ently across many websites with vary­ing setups, with minimal manual setup and config­ur­a­tion – some­thing we can’t rely on out in the wild, where server soft­ware, settings and file permis­sions vary wildly.

Chal­lenge #2: Using plugins to hide private files

Whilst it’d be safer to gener­ally avoid stor­ing private files in publicly access­ible websites at all, that’s not always prac­tical.

The good news is that conscien­tious site managers can already take action to hide their private files, by using plugins to move and/​or hide them.

Typic­ally, these plugins work by moving private files ‘upwards’ in the server, to a folder ‘above’ the publicly access­ible website, so that they can only be accessed by logged in, or other­wise valid­ated requests.
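The core of that approach can be sketched as a small gatekeeper function. This is an illustration under assumptions rather than how any particular plugin works: the is_logged_in flag stands in for whatever validation the site performs, and the names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def resolve_private_file(
    private_dir: Path, requested_name: str, is_logged_in: bool
) -> Optional[Path]:
    """Return the file to serve, or None if the request must be refused."""
    if not is_logged_in:
        return None
    base = private_dir.resolve()
    candidate = (base / requested_name).resolve()
    # Refuse anything which escapes the private folder, e.g. "../secret.txt".
    if base not in candidate.parents:
        return None
    return candidate if candidate.is_file() else None
```

Because the private folder sits above the public website, the only route to its files is through code like this, which gets to decide who is allowed in.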

But whilst these solve the prob­lem on a case-by-case, site-by-site basis, they aren’t a great general solu­tion. They require that you know about them, that you under­stand the require­ment for them on your site, that you can go through the (some­times complex) config­ur­a­tion processes, and that you actively micro­man­age all of your private assets.

And in many cases, site owners who don’t know better want to be able to share their private files via a URL (e.g., sending a client a link to an invoice PDF). However, storing them outside of the publicly accessible file system either prevents this (so people keep them public), or still results in sensitive assets having accessible URLs, which allow them to be discovered by search engines.

In an ideal world, Word­Press’ core system would have features to move and hide private files auto­mat­ic­ally. However, placing and managing files outside of Word­Press’ folder struc­ture is a tricky busi­ness when websites, server setups and config­ur­a­tions vary wildly. An auto­mated solu­tion of this nature simply isn’t feas­ible.

That leaves millions of websites and busi­nesses vulner­able.

Chal­lenge #3: Flag­ging private files to search engines

There’s noth­ing stop­ping themes or plugins (or even a Word­Press core feature) adding a ‘private’, or ‘hide from search engines’ check­box in the media manager, where users upload or manage their files.

But, remember, we can’t rely on WordPress to intercept requests to these kinds of files, so we can’t do anything directly with or from that setting. What we can do, however, is change the filename or attributes of those assets when they’re created or edited.

If we, for example, appended ‘__private’ to filenames (so that invoice-123.pdf becomes invoice-123__private.pdf), then we could easily create rules in the robots.txt file to prevent search engines from crawling them, or use .htaccess (or similar tools) to set X-Robots-Tag HTTP headers – which would instruct search engines not to index those files.

And of course, this doesn’t stop people search­ing for and/​or find­ing these files through brute force – it’d be a trivial matter to probe for files which matched this pattern, and harvest the files.
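As a rough sketch of the renaming step, the function below adds the ‘__private’ marker before the file extension. The function name and the robots.txt rule are illustrative assumptions, and wildcard rules are only honoured by crawlers which support them.

```python
from pathlib import PurePosixPath

SUFFIX = "__private"

# An example robots.txt rule matching the marker (relies on wildcard support).
ROBOTS_RULE = "User-agent: *\nDisallow: /*__private."


def mark_private(filename: str) -> str:
    """Insert the private marker before the extension, if not already there."""
    path = PurePosixPath(filename)
    if path.stem.endswith(SUFFIX):
        return filename
    return str(path.with_name(f"{path.stem}{SUFFIX}{path.suffix}"))
```

Calling mark_private() on an already-marked filename returns it unchanged, so the marker is never doubled up.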

Altern­at­ively, we could move private files – say, into a /​private/​ folder. Then we can config­ure access rules and logic to prevent people sniff­ing around. We could get Word­Press to follow either of these approaches, with minimal effort.

On the surface, these look like potentially elegant solutions. However, further consideration reveals that neither is viable.

Using robots.txt won’t solve our prob­lem – whilst it prevents compli­ant crawl­ers from access­ing URLs, it doesn’t stop search engines from index­ing them or includ­ing them in their results. It also won’t do anything to stop mali­cious users or crawl­ers seek­ing out sens­it­ive content.
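The advisory nature of robots.txt is easy to see in code. The sketch below uses Python’s standard robots.txt parser to answer the question a compliant crawler asks itself before fetching a URL. The /private/ rule and the example.com URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# An example robots.txt which asks all crawlers to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def compliant_crawler_may_fetch(url: str, user_agent: str = "*") -> bool:
    """Answer the question a well-behaved crawler asks before fetching."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

Nothing on the server enforces that answer. A crawler which never asks the question can still request the file, and a URL which is blocked from crawling can still end up indexed if other pages link to it.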

Simil­arly, setting HTTP head­ers requires changes to .htac­cess or equi­val­ent files, which, as we’ve covered, is often chal­len­ging based on vary­ing server setups and envir­on­ments. These are also likely to be ignored by mali­cious users or crawl­ers.

More significantly, changing filenames is bad practice. Best practice states that, in all cases and scenarios on the web, URLs should never change. Changing URLs risks dead ends and ‘link rot’, 404 errors, and all sorts of mess [6].

And whilst it’s relat­ively safe to set an asset as private at the point of upload, if you ever need to change that setting in the future, you break all links to the asset.

You also break any media embed­ded in your content, unless you have a solu­tion which intel­li­gently updates all of the links and refer­ences (although, we’re hard-pressed to come up with a use-case for embed­ding a private asset in a public page).

There are possible solu­tions to this, however. If the solu­tion for managing private assets is intel­li­gent enough to redir­ect users to the public/​private version of a file depend­ing on its settings, then you could avoid those dead ends and broken URLs.

But that approach comes with a perform­ance over­head (Word­Press needs to listen and check for the exist­ence of resources and public/​private versions of files on every 404’ing request), and risks leav­ing a trail which leads directly to your private files.
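A minimal sketch of that redirect logic, assuming the ‘__private’ filename marker from earlier and a known set of existing file paths (both assumptions for illustration), might look like this:

```python
from pathlib import PurePosixPath
from typing import Optional, Set

SUFFIX = "__private"


def counterpart(url_path: str) -> str:
    """Return the private version of a public path, or vice versa."""
    path = PurePosixPath(url_path)
    stem = path.stem
    if stem.endswith(SUFFIX):
        new_stem = stem[: -len(SUFFIX)]
    else:
        new_stem = stem + SUFFIX
    return str(path.with_name(new_stem + path.suffix))


def redirect_target(requested: str, existing: Set[str]) -> Optional[str]:
    """On a would-be 404, return the other version of the file if it exists."""
    if requested in existing:
        return None  # The file is where the URL says it is; no redirect.
    other = counterpart(requested)
    return other if other in existing else None
```

Note that a request for the old public URL now answers with the private URL, which is exactly the trail described above.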

Lastly, if you’re managing any of this func­tion­al­ity through a plugin, disabling or remov­ing that plugin risks your asset manage­ment and media URLs break­ing, too.

Chal­lenge n?

None of these options are great solutions. They get us some of the way, but we inevitably find ourselves stuck in a position where we can’t rely on the server environment behaving the way we expect (at least, not consistently enough, or at a scale which would support a stable solution), or we end up creating new problems as we solve for managing exposure.

The only person who can solve this is you

The further we dig, the more appar­ent it becomes that this is a busi­ness chal­lenge, as much as a tech­nical chal­lenge. Indi­vidual organ­isa­tions will need to define and imple­ment their own solu­tions, based on their own setups and needs. Stake­hold­ers will need to invest in under­stand­ing how their files, privacy, and legal oblig­a­tions are managed.

That means that, when it comes to protect­ing your files, you’re in the driv­ing seat. And, you may be legally liable if you’re leak­ing sens­it­ive inform­a­tion – ignor­ance isn’t a defence.

You’ll need to think about whether you store any sens­it­ive assets on your site(s), and make plans to review what you’re doing to prevent them from being discovered, indexed, and accessed.

There are a few things you can do to protect your­self, your busi­ness and your custom­ers, but it all boils down to making sure that you keep your private files private.

I’d recom­mend that every­body does the follow­ing, on a regu­lar basis:

  • Ensure that your folder struc­ture doesn’t allow index brows­ing (either through setting server options, or includ­ing index files)
  • Remove any sens­it­ive docu­ments stored in your media library, unless you’re abso­lutely sure that they have to be there.
  • Actively protect any remain­ing private files through a plugin or server solu­tion which ensures that they’re not publicly access­ible.
  • Avoid using date-based upload folders, and avoid stor­ing media/​assets in newly or manu­ally created folders (e.g., avoid logging into your server via FTP, and placing sens­it­ive assets in folders you’ve manu­ally created).
  • If you’re using plugins for managing customer inform­a­tion, invoices, or other sens­it­ive data, pay atten­tion to any publicly access­ible URLs which they create. Remem­ber, if you can get to the URLs without being logged into the site, so can search engines and mali­cious users.
  • Make sure that your site is registered through Google Search Console, and pay atten­tion to inform­a­tion about which of your pages Google is index­ing.
  • Watch out for plugins like contact forms, membership systems and forums, which might store sensitive data without you realising. Periodically search Google for ‘site:mysite.com’ [7], and keep an eye out for anything which shouldn’t be there.
  • Check your server logs; search engines and mali­cious users may be find­ing and access­ing sens­it­ive data and not leav­ing an obvi­ous trace – by monit­or­ing your server logs, you can see exactly which URLs are being reques­ted. If you’re not famil­iar with log analysis, this article by Dominic Wood­man from Distilled is a great start­ing place.
  • If you think that a partic­u­lar plugin, theme or system might be leak­ing sens­it­ive data, let the site owner and/​or plugin developers know. Please don’t share or post URLs which lead to or contain sens­it­ive business/​personal data in the comments below, or anywhere else. If you can’t get hold of some­body to report an issue, you can report issues through plat­forms like Open Bug Bounty or Hacker One.
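If you want a starting point for the server log check, the sketch below scans access log lines and lists successfully served documents. It assumes the common ‘combined’ log format used by Apache and NGINX, and the list of file extensions is only a suggestion.

```python
import re
from typing import Iterable, Iterator

# Matches the request and status portion of a combined-format log line,
# e.g. "GET /path/file.pdf HTTP/1.1" 200
LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')

# Assumed list of document types worth reviewing; extend it for your site.
SENSITIVE_EXTENSIONS = (".pdf", ".doc", ".docx", ".xls", ".xlsx", ".csv", ".sql", ".zip")


def served_documents(lines: Iterable[str]) -> Iterator[str]:
    """Yield the path of every document request which returned a 200."""
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or match.group("status") != "200":
            continue
        path = match.group("path").split("?", 1)[0]
        if path.lower().endswith(SENSITIVE_EXTENSIONS):
            yield path
```

Pass it an open log file (for example, `served_documents(open("access.log"))`, where the file name is hypothetical) and review anything in the output which you didn’t expect to be public.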

1. A robots.txt disallow rule prevents crawling, but not necessarily indexation – and malicious bots might ignore the file entirely. That means that we have to rely on X-Robots-Tag HTTP headers, which we can’t send via WordPress when the file is accessed directly.

2. I’ve already found several well-known contact form plugins which store uploads and contact records in easily discov­er­able and travers­able folders – and these are just the tip of the iceberg. I’m in the process of getting in touch with the authors.

3. Whilst it’s possible to add ‘location’ rules to NGINX which would allow you to override the normal behaviour, this can be tricky to configure, and you’d need to map out every folder (or regex pattern) where you want to override the settings – that’s a lot of maintenance, and not a particularly scalable solution for most sites.

4. This includes date-based upload folders, which are Word­Press’ default setting for uploaded media. In most cases, we’d recom­mend disabling this option in your site settings.

5. It’s also worth consid­er­ing that a page which requests multiple assets (e.g., a web page with lots of images) will now be load­ing Word­Press multiple times, adding huge perform­ance over­heads – and apply­ing any kind of condi­tional logic to prevent this would be hugely complex.

6. Even if we don’t change the filename, and we just add a clever rewrite to set HTTP headers on assets requested with /private/ in the URL, that doesn’t prevent users directly accessing or sharing the original URL, without the /private/ string in the URL. And preventing direct access in that scenario is functionally the same as having changed the URL.

7. You can refine your searches to look for specific patterns by adding extra compon­ents. E.g., “site:mysite.com filetype:pdf”, or “site:mysite.com intitle:invoice” are good start­ing points for discov­er­ing sens­it­ive inform­a­tion which you might acci­dent­ally be expos­ing. Ryan Siddle of Merj​.com has some excel­lent research into the kinds of things you might want to search for to identify holes on your site. Remem­ber, using this kind of tech­nique on other people’s sites may be considered illegal, espe­cially if you’re actively prob­ing for private inform­a­tion.
