Overview of the problem
Unfortunately, this approach is critically flawed – the more effectively your PDFs are promoted, exposed, shared and consumed, the less you know about it. Consider any of the following scenarios:
- Your PDF ranks well in search engines
- It’s linked to from a third party website, or socially shared
- It gets bookmarked, and subsequently visited directly
- A campaign promotes the PDF directly (e.g., as a destination from an email campaign)
I’m going to outline a general approach which will help to fill in some of the gaps in your data. It assumes the use of PHP, and Google Analytics – however the approach should be portable (depending on the capabilities of your solution of choice). It can also be extended to apply to other file types which aren’t typically tracked (e.g., images[!], flash, docs). I’d love to hear about creative uses.
Intercept your PDF requests, and fire server-side tracking
Did you know that, as well as just making things ‘pretty’, URL rewriting via htaccess allows you to do some clever stuff with intercepting and altering requests? The core of the solution is as simple and elegant as adding the following code to your htaccess file:
RewriteRule ^(.+)\.pdf$ /analytics-pdf.php?file=$1.pdf [L,NC,QSA]
This simple rule intercepts any request ending in .pdf, and, invisibly to the user (and to search engines), actually fires analytics-pdf.php instead.
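To make the capture concrete, here is a rough PHP imitation of what a rule of this shape does to an incoming path. It is only an illustration: the function name `rewrittenTarget` is made up, the pattern escapes the dot before `pdf` so that only a literal `.pdf` suffix matches, and the `i` modifier stands in for the `NC` flag. It also assumes the rule lives in an .htaccess file in the site root, where the leading slash is stripped before matching.

```php
<?php
// Illustrative only: mimic how the rewrite rule captures the requested path.
// rewrittenTarget() is a hypothetical name, not part of Apache or PHP-GA.
function rewrittenTarget(string $requestPath): ?string
{
    // In .htaccess context the leading slash is removed before matching.
    $path = ltrim($requestPath, '/');

    // "i" mirrors the NC flag; "\." matches a literal dot only.
    if (preg_match('#^(.+)\.pdf$#i', $path, $matches)) {
        // $matches[1] plays the role of $1 in the rule.
        return '/analytics-pdf.php?file=' . $matches[1] . '.pdf';
    }

    return null; // not a PDF request, so the rule would not apply
}

echo rewrittenTarget('/downloads/whitepaper.pdf'), PHP_EOL;
```

The `file` parameter ends up holding the path relative to the site root, which is what analytics-pdf.php reads from the $_GET array.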
The next part is a little more complex; we need to set up analytics-pdf.php to grab the filename of the PDF, tell Google Analytics to fire a pageview, and then serve the PDF to the user.
The following PHP is a quick hack of the example solution on the PHP-GA site, adapted to grab the PDF filename from the $_GET array and pass it to a pageview as a virtual path, and then display the requested file. Save it as ‘analytics-pdf.php’ in your root folder.
<?php
// Include the PHP-GA script(s)
require_once('php-ga/autoload.php');
use UnitedPrototype\GoogleAnalytics;

// Change these values to your UA code and domain
define('UA_CODE', 'UA-XXXXXXXX-Y');
define('HOSTNAME', 'example.com');

// Grab the PDF path and sanitize it
$filePath = isset($_GET['file']) ? $_GET['file'] : '';
$filePath = filter_var($filePath, FILTER_SANITIZE_URL);

// Extract the bare filename for the page title and download name
$file = basename($filePath);

// Only track and serve if the requested file actually exists
if ($filePath && is_file($filePath)) :
    // Record a pageview against the PDF's virtual path
    $tracker = new GoogleAnalytics\Tracker(UA_CODE, HOSTNAME);
    $visitor = new GoogleAnalytics\Visitor();
    $visitor->setIpAddress($_SERVER['REMOTE_ADDR']);
    $visitor->setUserAgent($_SERVER['HTTP_USER_AGENT']);
    $visitor->setScreenResolution('1024x768');
    $session = new GoogleAnalytics\Session();
    $page = new GoogleAnalytics\Page('/' . $filePath);
    $page->setTitle($file);
    $tracker->trackPageview($page, $session, $visitor);

    // Serve the requested file
    header('Content-type: application/pdf');
    header('Content-Disposition: inline; filename="' . $file . '"');
    header('Content-Transfer-Encoding: binary');
    header('Content-Length: ' . filesize($filePath));
    header('Accept-Ranges: bytes');
    @readfile($filePath);
endif;
Believe it or not, that’s all there is to it. You’re now intercepting all PDF requests, and triggering a pageview before serving up the PDF. Magic!
I’d recommend reading on, however, as there are some implications and unknowns around this approach.
A note on sessions and Universal Analytics
PHP-GA is creating a distinct session when it fires, which won’t tie together with existing visit behaviour. If I arrive on a PDF and browse into the site, there’ll be attribution issues and some inflation of visit counts. Similarly, on-site visitors who click through to a PDF will also create disconnected data.
In the mid- to long-term, this will be easily manageable with the release of Universal Analytics, which will replace the PHP-GA component of the solution, and allow us to use server-side Analytics to carry across a distinct session ID between requests.
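As a rough idea of what that might look like, the sketch below builds the body of a server-side pageview hit using the Measurement Protocol parameters documented for Universal Analytics (`v`, `tid`, `cid`, `t`, `dh`, `dp`, `dt`). The helper names are hypothetical, and it assumes the on-site tag has already set a `_ga` cookie in the usual `GA1.2.<id>.<timestamp>` form, so that the same client ID can be reused for the PDF request. It only builds the payload; sending it (a POST to https://www.google-analytics.com/collect) is left out.

```php
<?php
// Hypothetical helper: pull the client ID out of a "_ga" cookie value such as
// "GA1.2.1234567890.1360000000". Returns null if the format is unexpected.
function clientIdFromGaCookie(?string $cookie): ?string
{
    if ($cookie === null) {
        return null;
    }
    $parts = explode('.', $cookie);
    if (count($parts) < 4) {
        return null;
    }
    // The client ID is the last two dot-separated segments.
    return implode('.', array_slice($parts, -2));
}

// Hypothetical helper: build the form-encoded body of a pageview hit.
function buildPageviewPayload(
    string $trackingId,
    string $clientId,
    string $host,
    string $path,
    string $title
): string {
    return http_build_query([
        'v'   => 1,            // protocol version
        'tid' => $trackingId,  // property ID, e.g. UA-XXXXXXXX-Y
        'cid' => $clientId,    // client ID shared with the on-site tag
        't'   => 'pageview',   // hit type
        'dh'  => $host,        // document hostname
        'dp'  => $path,        // document path (the PDF's virtual path)
        'dt'  => $title,       // document title
    ], '', '&');
}
```

If the cookie is missing (for example, a first visit that lands straight on the PDF), a new client ID would have to be generated and handed back to the browser somehow, which this sketch does not attempt.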
In the meantime, I’d appreciate anybody’s thoughts on how we might overcome this – it’s potentially feasible to do some clever things with including conditional logic in the tags, and/or artificially constructing/maintaining channel data by, e.g., manipulating UTM tags. I started considering solutions, but was keen to get something out of the door, at least, and to revisit and refine as usage demands.
Early thoughts include:
- Only firing if we detect that the visitor originated from an external location (e.g., not an internal link), as it’s assumed that existing event tracking will account for internal links
- Modifying the code to construct events, rather than pageviews, so as to create a consolidated view count when added to the existing internal link click counts
- Considering using passive (non-interaction) events so as to avoid visit count inflation, but at the expense of recording that a pageview-like action has occurred (which may skew interpretation in the opposite direction to the present problem).
- Attempting to collect, and then carry channel data through to the PDF – e.g., appending UTM parameters to the destination URL, which can then be carried on into subsequent pageviews on the site
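Two of these ideas, the external-origin check and carrying UTM parameters forward, can be sketched in plain PHP. Both function names are hypothetical, and the behaviour is an assumption rather than a tested recipe: a missing or unparseable referrer is treated as external, and a leading "www." is ignored when comparing hosts.

```php
<?php
// Hypothetical helper: true when the referrer is absent or points at another host.
function isExternalReferrer(?string $referrer, string $siteHost): bool
{
    if ($referrer === null || $referrer === '') {
        return true; // direct visits, bookmarks, many email clients
    }
    $host = parse_url($referrer, PHP_URL_HOST);
    if (!is_string($host)) {
        return true; // unparseable referrer: treat as external
    }
    // Compare hosts case-insensitively, ignoring a leading "www."
    $normalise = function (string $h): string {
        $h = strtolower($h);
        return strpos($h, 'www.') === 0 ? substr($h, 4) : $h;
    };
    return $normalise($host) !== $normalise($siteHost);
}

// Hypothetical helper: copy utm_* parameters from the request onto a URL.
function appendUtmParams(string $url, array $query): string
{
    $utm = [];
    foreach ($query as $key => $value) {
        if (is_string($key) && strpos($key, 'utm_') === 0 && is_string($value)) {
            $utm[$key] = $value;
        }
    }
    if ($utm === []) {
        return $url; // nothing to carry forward
    }
    $separator = strpos($url, '?') === false ? '?' : '&';
    return $url . $separator . http_build_query($utm, '', '&');
}
```

In analytics-pdf.php, the first function might be called with `$_SERVER['HTTP_REFERER']` (when set) and the site's hostname to decide whether to fire the pageview at all, and the second with `$_GET` to decorate any links back into the site.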
Further considerations and challenges
Please bear the following in mind when using:
- This isn’t rigorously tested. I’m proving the framework of a conceptual solution here, which you’ll need to adapt to your own requirements. Don’t expect it to work 100% correctly out of the box.
- I’ve not paid much attention to sanitisation, filtering, and validation. It’s possible, likely even, that there are security holes in the recording and usage of the PDF filename/filepath which could do with closing.
- I’ve not played much with PHP-GA, as the solution is only intended to bridge the gap between now and the arrival of Universal Analytics. There may be more things which it could do, or better ways to do what it’s doing.
- It’s hard to completely predict how this will affect your data, given the caveats around sessions.
- Does this open up opportunities for commercially accountable re-targeting campaigns which use white papers, etc. as destinations? What are the implications?
- How useful is this for consolidating data on disparate/separate instances of a single resource? E.g., a video in multiple formats, or in multiple locations?
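On the sanitisation point above, one possible way to tighten up the filename handling is to resolve the requested path and refuse anything that is not an existing .pdf inside a known directory. This is a minimal sketch rather than a complete fix: `resolvePdfPath` is a made-up name, and the choice of base directory (for example, the document root) is an assumption you would need to adapt.

```php
<?php
// Hypothetical helper: resolve a requested path and return it only if it is an
// existing .pdf file inside $baseDir; otherwise return null.
function resolvePdfPath(string $requested, string $baseDir): ?string
{
    $base = realpath($baseDir);
    if ($base === false) {
        return null; // base directory does not exist
    }

    // realpath() collapses "../" segments and returns false for missing files.
    $real = realpath($base . DIRECTORY_SEPARATOR . ltrim($requested, '/\\'));
    if ($real === false || !is_file($real)) {
        return null;
    }

    // The resolved path must still sit underneath the base directory.
    if (strpos($real, $base . DIRECTORY_SEPARATOR) !== 0) {
        return null;
    }

    // Only serve files with a .pdf extension (case-insensitive).
    if (strtolower(pathinfo($real, PATHINFO_EXTENSION)) !== 'pdf') {
        return null;
    }

    return $real;
}
```

The script could then call something like `resolvePdfPath($_GET['file'], __DIR__)` and respond with a 404 when it returns null, instead of passing the raw parameter to filesize() and readfile().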
Though this solution won me “Best Tip” at MeasureCamp, I really have to give proper credit to Monica and the team at twentysix, whose discussion around the problem got me thinking about possible solutions.
17/02/2013 20:46 – From a conversation with @danbarker, we’ve explored the idea of, rather than sending the user to the ‘vanilla’ PDF, presenting the PDF in a ‘wrap’ on request (e.g., an iframe, embed, or similar).
The PDF would be framed in a viewing portal, but enhanced with share/print options, internal links, etc. See Dan’s hastily mocked up screenshot. This would have the added advantage of allowing for the native embedding of normal GA code (which, given that it’d just fire and record the URL, would give you a completely integrated session/data-set) but comes with some disadvantages. It breaks some native functionality, such as right-clicking a link to a PDF to save it, and it interferes with any other native behaviour which expects a link to a PDF to act like a PDF, rather than a web page. It’s worth pointing out that there are obviously solutions out there which do things like this, but our expectation is that the analytics angle probably doesn’t get the love that it deserves.
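A very rough sketch of such a wrap, in PHP, might look like the following. Everything here is an assumption for illustration: `renderPdfWrapper` is a hypothetical name, the markup is deliberately bare, and `$pdfUrl` is assumed to be a URL that returns the raw file without being routed back to the wrapper (otherwise the iframe would just load the wrap again).

```php
<?php
// Hypothetical helper: build a simple HTML "wrap" page around a PDF URL.
function renderPdfWrapper(string $pdfUrl, string $title): string
{
    // Escape both values before placing them in attributes and text.
    $safeUrl   = htmlspecialchars($pdfUrl, ENT_QUOTES, 'UTF-8');
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');

    return <<<HTML
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>{$safeTitle}</title>
  <!-- The site's normal GA snippet would go here -->
</head>
<body>
  <p><a href="{$safeUrl}">Download {$safeTitle}</a></p>
  <iframe src="{$safeUrl}" title="{$safeTitle}" style="width:100%;height:90vh;border:0"></iframe>
</body>
</html>
HTML;
}
```

Because the page is ordinary HTML, the standard on-page tag records the view within the visitor's existing session, and the download link restores a way to save the file directly.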
I think that this has some definite potential as, e.g., a WordPress plugin as a fork of the solution, but may not be appropriate for everybody.