Advanced PDF tracking

Many websites, especially in B2B or long purchase cycle scenarios, rely heavily on PDFs as part of their conversion process, reassurance/validation content, and SEO strategy (generally containing white papers, case studies, etc.). Typically, the effectiveness of these resources is measured in two ways: from an SEO/rankings perspective, and by measuring how frequently links to the PDFs are clicked (by, e.g., firing some on-click JavaScript).

Unfortunately, this approach is critically flawed – the more effectively your PDFs are promoted, exposed, shared and consumed, the less you know about it. Consider any of the following scenarios:

  • Your PDF ranks well in search engines
  • It’s linked to from a third party website, or socially shared
  • It gets bookmarked, and subsequently visited directly
  • A campaign promotes the PDF directly (e.g., as a destination from an email campaign)

In each of these scenarios, if you’re using a client-side, JavaScript tracking solution like Google Analytics, your tracking solution is completely bypassed, as your PDFs don’t (and generally can’t, without security headaches) contain your tracking code. The better your SEO campaign promotes your white papers, the more you’re going to under-report their success, and risk making poor decisions or having your work misrepresented… Eek!

I’m going to outline a general approach that will help to fill in some of the gaps in your data. It assumes the use of PHP and Google Analytics; however, the approach should be portable (depending on the capabilities of your solution of choice). It can also be extended to apply to other file types which aren’t typically tracked (e.g., images[!], Flash, docs). I’d love to hear about creative uses.

Intercept your PDF requests, and fire server-side tracking

Did you know that, as well as just making things ‘pretty’, URL rewriting via .htaccess allows you to do some clever stuff with intercepting and altering requests? The core of the solution is as simple and elegant as adding the following code to your .htaccess file (assuming that mod_rewrite is available and the rewrite engine is already switched on):

RewriteRule ^(.+)\.pdf$  /analytics-pdf.php?file=$1.pdf [L,NC,QSA]

This simple rule intercepts any request ending in .pdf, and, invisibly to the user (and to search engines), actually fires analytics-pdf.php instead.

The next part is a little more complex; we need to set up analytics-pdf.php to grab the filename of the PDF, tell Google Analytics to fire a pageview, and then serve the PDF to the user.

PHP-GA

You’ll need to grab and upload a copy of PHP-GA. This is a PHP framework that makes it easy to construct calls to GA in much the same way as the normal JavaScript approach, but using server-side logic. We’ll use this framework to construct the request to Google Analytics.

The following PHP is a quick hack of the example solution on the PHP-GA site, adapted to grab the PDF filename from the $_GET array and pass it to a pageview as a virtual path, and then display the requested file. Save it as ‘analytics-pdf.php’ in your root folder.

<?php
// Include the PHP-GA script(s)
require_once('php-ga/autoload.php');
use UnitedPrototype\GoogleAnalytics;

// Change these values to your UA code and domain
define('UA_CODE', 'UA-XXXXXXXX-Y');
define('HOSTNAME', 'example.com');

// Grab the requested path and sanitize it.
// Note: this does not stop path traversal (e.g. "../"); validate the path
// against your document root before relying on this in production.
$filePath = isset($_GET['file']) ? $_GET['file'] : '';
$filePath = filter_var($filePath, FILTER_SANITIZE_URL);
// basename() replaces end(explode()), which triggers a pass-by-reference notice
$file = basename($filePath);

// Only track and serve files which actually exist
if ($filePath && is_file($filePath)) :

 // Record a pageview, using the PDF's path as a virtual page
 $tracker = new GoogleAnalytics\Tracker(UA_CODE, HOSTNAME);
 $visitor = new GoogleAnalytics\Visitor();
 $visitor->setIpAddress($_SERVER['REMOTE_ADDR']);
 $visitor->setUserAgent($_SERVER['HTTP_USER_AGENT']);
 $visitor->setScreenResolution('1024x768');
 $session = new GoogleAnalytics\Session();
 $page = new GoogleAnalytics\Page('/' . $filePath);
 $page->setTitle($file);
 $tracker->trackPageview($page, $session, $visitor);

 // Stream the PDF to the browser
 header('Content-Type: application/pdf');
 header('Content-Disposition: inline; filename="' . $file . '"');
 header('Content-Transfer-Encoding: binary');
 header('Content-Length: ' . filesize($filePath));
 header('Accept-Ranges: bytes');
 readfile($filePath);

else :

 // Nothing to serve, so don't record a pageview
 http_response_code(404);

endif;

You’re now intercepting all PDF requests, and triggering a pageview before serving up the PDF. Magic!

I’d recommend reading on, however, as there are some implications and unknowns around this approach.

A note on sessions and Universal Analytics

PHP-GA is creating a distinct session when it fires, which won’t tie together with existing visit behaviour. If I arrive on a PDF and browse into the site, there’ll be attribution issues and some inflation of visit counts. Similarly, on-site visitors who click through to a PDF will also create disconnected data.

In the mid- to long-term, this will be easily manageable with the release of Universal Analytics, which will replace the PHP-GA component of the solution, and allow us to use server-side Analytics to carry across a distinct session ID between requests.
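As a rough sketch of what that might look like, the snippet below builds a pageview hit for the Universal Analytics Measurement Protocol, reusing the visitor's existing client ID from the _ga cookie so that the server-side hit can be tied to their on-site behaviour. The function names are hypothetical, and the cookie format (GA1.2.<random>.<timestamp>) is an assumption, so treat this as a starting point rather than a finished implementation.

```php
<?php
// Sketch only: both function names are hypothetical helpers.

/**
 * Extract the client ID from a _ga cookie value such as
 * "GA1.2.123456789.1360000000". Returns null if the format is unexpected.
 */
function clientIdFromGaCookie(?string $cookie): ?string
{
    if ($cookie === null) {
        return null;
    }
    $parts = explode('.', $cookie);
    if (count($parts) < 4) {
        return null;
    }
    // The client ID is assumed to be the last two dot-separated fields.
    return $parts[count($parts) - 2] . '.' . $parts[count($parts) - 1];
}

/**
 * Build the URL-encoded body of a Measurement Protocol pageview hit.
 */
function buildPageviewHit(
    string $trackingId,
    string $clientId,
    string $host,
    string $path,
    string $title
): string {
    return http_build_query([
        'v'   => 1,            // protocol version
        'tid' => $trackingId,  // UA-XXXXXXXX-Y
        'cid' => $clientId,    // ties the hit to an existing visitor
        't'   => 'pageview',   // hit type
        'dh'  => $host,        // document hostname
        'dp'  => $path,        // document path (the PDF's virtual path)
        'dt'  => $title,       // document title
    ]);
}
```

The resulting string would then be POSTed to the Measurement Protocol's /collect endpoint in place of the PHP-GA call. What to do when the visitor has no _ga cookie (e.g., they arrived directly on the PDF) is left open here.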

In the meantime, I’d appreciate anybody’s thoughts on how we might overcome this – it’s potentially feasible to do some clever things with including conditional logic in the tags, and/or artificially constructing/maintaining channel data by, e.g., manipulating UTM tags. I started considering solutions but was keen to get something out of the door, at least, and to revisit and refine as usage demands.

Early thoughts include:

  • Only firing if we detect that the visitor originated from an external location (e.g., not an internal link), as it’s assumed that existing event tracking will account for internal links
  • Modifying the code to construct events, rather than pageviews, so as to create a consolidated view count when added to the existing internal link click counts
  • Considering using passive (non-interaction) events so as to avoid visit count inflation, but at the expense of recording that a pageview-like action has occurred (which may skew interpretation in the opposite direction to the present problem).
  • Attempting to collect, and then carry channel data through to the PDF – e.g., appending UTM parameters to the destination URL, which can then be carried on into subsequent pageviews on the site
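To make the first and last of these ideas a little more concrete, here’s a minimal sketch of two helpers: one which decides whether a request looks like it came from outside the site, and one which picks out any UTM parameters that arrived with the request (the QSA flag in the rewrite rule means these should turn up in $_GET alongside ‘file’). Both function names are hypothetical, and the referrer check is deliberately naive; it treats www.example.com and example.com as different hosts, for example.

```php
<?php
// Sketch only: both function names are hypothetical helpers.

/**
 * True when the referrer is missing or points at a different host.
 */
function isExternalReferrer(?string $referrer, string $ownHost): bool
{
    if ($referrer === null || $referrer === '') {
        return true; // direct visit, bookmark, or stripped referrer
    }
    $host = parse_url($referrer, PHP_URL_HOST);
    if (!is_string($host)) {
        return true; // unparseable referrer: treat as external
    }
    // Exact, case-insensitive host match only; subdomains count as external.
    return strcasecmp($host, $ownHost) !== 0;
}

/**
 * Pick out only the utm_* parameters from a query-string array.
 */
function extractUtmParams(array $query): array
{
    $utm = [];
    foreach ($query as $key => $value) {
        if (is_string($key) && strpos($key, 'utm_') === 0 && is_string($value)) {
            $utm[$key] = $value;
        }
    }
    return $utm;
}
```

In analytics-pdf.php, the tracking call could then be wrapped in a check along the lines of isExternalReferrer($_SERVER['HTTP_REFERER'] ?? null, HOSTNAME), with extractUtmParams($_GET) giving you the campaign values to carry forward. How best to feed those values back into the tracking is still an open question.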

Further considerations and challenges

Please bear the following in mind when using:

  • This isn’t rigorously tested. I’m proving the framework of a conceptual solution here, which you’ll need to adapt to your own requirements. Don’t expect it to work 100% correctly out of the box.
  • I’ve not paid much attention to sanitisation, filtering, and validation. It’s possible, likely even, that there are security holes in the recording and usage of the PDF filename/filepath which could do with closing.
  • I’ve not played much with PHP-GA, as the solution is only intended to bridge the gap between now and the arrival of Universal Analytics. There may be more things which it could do, or better ways to do what it’s doing.
  • It’s hard to completely predict how this will affect your data, given the caveats around sessions.
  • Does this open up opportunities for commercially accountable re-targeting campaigns which use white papers, etc. as destinations? What are the implications?
  • How useful is this for consolidating data on disparate/separate instances of a single resource? E.g., a video in multiple formats, or in multiple locations?
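On the security point, one way to start closing the most obvious hole (a request for something like ‘../../secret.pdf’) is to resolve the requested path and refuse anything which ends up outside of a known folder. The following is a minimal sketch only; resolvePdfPath() and its $baseDir argument are hypothetical, and it isn’t a substitute for a proper review.

```php
<?php
// Sketch only: resolvePdfPath() and its $baseDir argument are hypothetical.

/**
 * Resolve a requested path to a real PDF inside $baseDir.
 * Returns the absolute path, or null if the request should be refused.
 */
function resolvePdfPath(string $requested, string $baseDir): ?string
{
    $base = realpath($baseDir);
    if ($base === false) {
        return null;
    }
    // realpath() collapses "../" segments and symlinks, and fails for missing files.
    $real = realpath($base . DIRECTORY_SEPARATOR . $requested);
    if ($real === false || !is_file($real)) {
        return null;
    }
    // Refuse anything which resolved to a location outside the base directory.
    if (strpos($real, $base . DIRECTORY_SEPARATOR) !== 0) {
        return null;
    }
    // Only serve files which actually end in .pdf.
    if (strtolower(pathinfo($real, PATHINFO_EXTENSION)) !== 'pdf') {
        return null;
    }
    return $real;
}
```

In analytics-pdf.php, you’d call something like resolvePdfPath($_GET['file'], __DIR__) and only fire the pageview and serve the file when the result isn’t null.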

Updates

17/02/2013 20:46 – From a conversation with @danbarker, we’ve explored the idea of, rather than sending the user to the ‘vanilla’ PDF, presenting the PDF in a ‘wrap’ on request (e.g., an iframe, embed, or similar).

The PDF would be framed in a viewing portal, but enhanced with share/print options, internal links, etc. See Dan’s hastily mocked up screenshot. This would have the added advantage of allowing for the native embedding of normal GA code (which, given that it’d just fire and record the URL, would give you a completely integrated session/data-set), but it comes with some disadvantages. It breaks some native functionality, such as right-clicking a link to a PDF to save it, and interferes with any other native behaviour which expects a link to a PDF to act like a PDF, rather than a web page. It’s worth pointing out that there are obviously solutions out there that do things like this, but our expectation is that the analytics angle probably doesn’t get the love that it deserves.

I think that this has some definite potential as, e.g., a WordPress plugin as a fork of the solution, but may not be appropriate for everybody.
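For illustration, a very rough sketch of the ‘wrap’ might look like the following. renderPdfWrapper() is a hypothetical helper, and $rawPdfUrl is assumed to be a URL which returns the PDF itself rather than the wrapper page (how you’d route that around the rewrite rule is left open).

```php
<?php
// Sketch only: renderPdfWrapper() is hypothetical, and $rawPdfUrl is assumed
// to be a URL which returns the PDF itself rather than this wrapper page.

function renderPdfWrapper(string $rawPdfUrl, string $title): string
{
    // Escape both values so that a crafted filename can't inject markup.
    $safeUrl   = htmlspecialchars($rawPdfUrl, ENT_QUOTES, 'UTF-8');
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');

    return <<<HTML
<!DOCTYPE html>
<html>
<head>
  <title>{$safeTitle}</title>
  <!-- The site's normal GA JavaScript snippet would go here -->
</head>
<body>
  <p><a href="{$safeUrl}" download>Download the PDF</a></p>
  <iframe src="{$safeUrl}" title="{$safeTitle}" width="100%" height="800"></iframe>
</body>
</html>
HTML;
}
```

The explicit download link is one possible way of softening the loss of ‘right-click and save’, though it doesn’t restore the native behaviour.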

4 responses to “Advanced PDF tracking”

  1. jared says:

    Thank you for the script and concept. I have implemented it on this website to track our PDF downloads.

    http://goodandevilbook.com/languages

  2. Ryan says:

    I work for a company (Docalytics – http://docalytics.com) that takes this a step further. We convert the PDF into an HTML5 viewer, and provide detailed analytics from there. It allows you to get more detailed information, such as pages read, etc.

  3. paolo says:

    Hi Jono,
    I’ve adopted your solution.
    I ask something about what you write on the paragraph “A note on sessions and Universal Analytics”.
    I ask for confirmation that the inflation acts only on the count of visits, not on the user nor on page views.

    Thank you
