Advanced PDF tracking

Many websites, especially in B2B or long purchase cycle scenarios, rely heavily on PDFs (generally white papers, case studies, etc.) as part of their conversion process, reassurance/validation content, and SEO strategy. Typically, the effectiveness of these resources is measured in two ways: from an SEO/rankings perspective, and by measuring how frequently links to the PDFs are clicked (e.g., by firing some on-click JavaScript).

Unfor­tu­nately, this approach is crit­ic­ally flawed – the more effect­ively your PDFs are promoted, exposed, shared and consumed, the less you know about it. Consider any of the follow­ing scen­arios:

  • Your PDF ranks well in search engines
  • It’s linked to from a third party website, or socially shared
  • It gets book­marked, and subsequently visited directly
  • A campaign promotes the PDF directly (e.g., as a destin­a­tion from an email campaign)

In each of these scen­arios, if you’re using a client-side, JavaS­cript track­ing solu­tion like Google Analyt­ics, your track­ing solu­tion is completely bypassed, as your PDFs don’t (and gener­ally can’t, without secur­ity head­aches) contain your track­ing code. The better your SEO campaign promotes your white papers, the more you’re going to under-report their success, and risk making poor decisions or having your work misrep­res­en­ted… Eek!

I’m going to outline a general approach which will help to fill in some of the gaps in your data. It assumes the use of PHP, and Google Analyt­ics – however the approach should be port­able (depend­ing on the capab­il­it­ies of your solu­tion of choice). It can also be exten­ded to apply to other file types which aren’t typic­ally tracked (e.g., images[!], flash, docs). I’d love to hear about creat­ive uses.

Inter­cept your PDF requests, and fire server-side track­ing

Did you know that, as well as just making things ‘pretty’, URL rewriting via .htaccess allows you to do some clever stuff with intercepting and altering requests? The core of the solution is as simple and elegant as adding the following code to your .htaccess file:

RewriteRule ^(.+)\.pdf$ /analytics-pdf.php?file=$1.pdf [L,NC,QSA]

This simple rule inter­cepts any request ending in .pdf, and, invis­ibly to the user (and to search engines), actu­ally fires analytics-pdf.php instead.
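To make the mapping concrete, here's a small PHP sketch that mimics what the rule does to a request path. It's an illustration only (Apache does the real work), the function name simulate_pdf_rewrite is made up for this example, and it assumes the rule lives in an .htaccess file in your root folder, where the leading slash is stripped from the path before matching. It doesn't model the QSA flag, which appends any existing query string.

```php
<?php
// Illustration only: mimics the rewrite rule in PHP to show what
// analytics-pdf.php receives. simulate_pdf_rewrite() is a made-up name.
function simulate_pdf_rewrite($path)
{
    // Same pattern as the rule, with the dot escaped so that it only matches
    // a literal "."; the "i" modifier stands in for the [NC] flag
    if (preg_match('/^(.+)\.pdf$/i', $path, $matches)) {
        return '/analytics-pdf.php?file=' . $matches[1] . '.pdf';
    }
    return null; // not a PDF request, so the rule doesn't apply
}

echo simulate_pdf_rewrite('papers/report.pdf'), "\n";
echo simulate_pdf_rewrite('papers/REPORT.PDF'), "\n";
```

Note the second call: because the match is case-insensitive but the substitution hard-codes a lowercase '.pdf', a request for REPORT.PDF is passed on as REPORT.pdf. On a case-sensitive filesystem that may not be the same file.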

The next part is a little more complex; we need to set up analytics-pdf.php to grab the filename of the PDF, tell Google Analytics to fire a pageview, and then serve the PDF to the user.

PHP-GA

You’ll need to grab and upload a copy of PHP-GA. This is a PHP framework which makes it easy to construct calls to GA in much the same way as the normal JavaScript approach, but using server-side logic. We’ll use this framework to construct the request to Google Analytics.

The follow­ing PHP is a quick hack of the example solu­tion on the PHP-GA site, adap­ted to grab the PDF file­name from the $_​GET array and pass it to a pageview as a virtual path, and then display the reques­ted file. Save it as ‘analytics-pdf.php’ in your root folder.

<?php
// Include the PHP-GA script(s)
require_once('php-ga/autoload.php');
use UnitedPrototype\GoogleAnalytics;

// Change these values to your UA code and domain
define('UA_CODE', 'UA-XXXXXXXX-Y');
define('HOSTNAME', 'example.com');

// Resolve the requested path against the web root (this script's folder),
// so that requests like ../../etc/passwd can't escape it
$root = realpath(__DIR__);
$requested = isset($_GET['file']) ? $_GET['file'] : '';
$filePath = realpath($root . '/' . $requested);

// Only continue for real .pdf files inside the web root
if ($filePath === false
    || strpos($filePath, $root . DIRECTORY_SEPARATOR) !== 0
    || strtolower(pathinfo($filePath, PATHINFO_EXTENSION)) !== 'pdf') {
    header('HTTP/1.1 404 Not Found');
    exit;
}

$file = basename($filePath);
// Virtual path for the pageview, relative to the web root (e.g. /papers/report.pdf)
$virtualPath = str_replace(DIRECTORY_SEPARATOR, '/', substr($filePath, strlen($root)));

// Record the pageview in Google Analytics
$tracker = new GoogleAnalytics\Tracker(UA_CODE, HOSTNAME);
$visitor = new GoogleAnalytics\Visitor();
$visitor->setIpAddress($_SERVER['REMOTE_ADDR']);
if (isset($_SERVER['HTTP_USER_AGENT'])) {
    $visitor->setUserAgent($_SERVER['HTTP_USER_AGENT']);
}
$visitor->setScreenResolution('1024x768');
$session = new GoogleAnalytics\Session();
$page = new GoogleAnalytics\Page($virtualPath);
$page->setTitle($file);
$tracker->trackPageview($page, $session, $visitor);

// Serve the PDF
header('Content-Type: application/pdf');
header('Content-Disposition: inline; filename="' . $file . '"');
header('Content-Length: ' . filesize($filePath));
readfile($filePath);

Believe it or not, that’s all there is to it. You’re now inter­cept­ing all PDF requests, and trig­ger­ing a pageview before serving up the PDF. Magic!

I’d recommend reading on, however, as there are some implications and unknowns around this approach.

A note on sessions and Univer­sal Analyt­ics

PHP-GA is creat­ing a distinct session when it fires, which won’t tie together with exist­ing visit beha­viour. If I arrive on a PDF and browse into the site, there’ll be attri­bu­tion issues and some infla­tion of visit counts. Simil­arly, on-site visit­ors who click through to a PDF will also create discon­nec­ted data.

In the mid- to long-term, this will be easily manage­able with the release of Univer­sal Analyt­ics, which will replace the PHP-GA compon­ent of the solu­tion, and allow us to use server-side Analyt­ics to carry across a distinct session ID between requests.
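As a rough idea of what that might look like, here's a sketch which builds (but doesn't send) a server-side pageview hit that reuses the visitor's existing client ID. It assumes Universal Analytics' Measurement Protocol parameter names (v, tid, cid, t, dh, dp, dt) and the usual layout of the _ga cookie; check both against Google's documentation before relying on them. The function names are made up for this example.

```php
<?php
// Sketch only: builds (but doesn't send) a server-side pageview hit that reuses
// the visitor's existing client ID. Function names are made up for this example.

// Pull the client ID out of a _ga cookie value such as "GA1.2.123456789.1360000000"
function client_id_from_ga_cookie($cookieValue)
{
    $parts = explode('.', $cookieValue);
    if (count($parts) < 4) {
        return null; // not in the expected format
    }
    // The client ID is the last two dot-separated fields
    return $parts[count($parts) - 2] . '.' . $parts[count($parts) - 1];
}

// Build the query string for a pageview hit
function build_pageview_hit($trackingId, $clientId, $hostname, $path, $title)
{
    return http_build_query(array(
        'v'   => 1,           // protocol version
        'tid' => $trackingId, // your UA code
        'cid' => $clientId,   // ties the hit to the existing visitor
        't'   => 'pageview',  // hit type
        'dh'  => $hostname,
        'dp'  => $path,
        'dt'  => $title,
    ));
}

$clientId = client_id_from_ga_cookie('GA1.2.123456789.1360000000');
echo build_pageview_hit('UA-XXXXXXXX-Y', $clientId, 'example.com', '/papers/report.pdf', 'report.pdf'), "\n";
```

The important part is cid: if the PDF hit carries the same client ID as the visitor's on-site pageviews, the two should be stitched into one visit rather than counted separately. A visitor who lands on the PDF first won't have a cookie yet, so you'd need to generate an ID and set the cookie yourself.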

In the mean­time, I’d appre­ci­ate anybody’s thoughts on how we might over­come this – it’s poten­tially feas­ible to do some clever things with includ­ing condi­tional logic in the tags, and/​or arti­fi­cially constructing/​maintaining chan­nel data by, e.g., manip­u­lat­ing UTM tags. I star­ted consid­er­ing solu­tions, but was keen to get some­thing out of the door, at least, and to revisit and refine as usage demands.

Early thoughts include:

  • Only firing if we detect that the visitor origin­ated from an external loca­tion (e.g., not an internal link), as it’s assumed that exist­ing event track­ing will account for internal links
  • Modi­fy­ing the code to construct events, rather than pageviews, so as to create a consol­id­ated view count when added to the exist­ing internal link click counts
  • Considering using passive events so as to avoid visit count inflation, but at the expense of recording that a pageview-like action has occurred (which may skew interpretation in the opposite direction to the present problem)
  • Attempt­ing to collect, and then carry chan­nel data through to the PDF – e.g., append­ing UTM para­met­ers to the destin­a­tion URL, which can then be carried on into subsequent pageviews on the site
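The first of these ideas could be sketched along the following lines. The helper is_external_referrer is a made-up name, and the approach leans on the Referer header, which is optional and easily spoofed, so treat it as a heuristic rather than anything reliable.

```php
<?php
// Sketch of the first idea: only track when the request didn't come from our own site.
// is_external_referrer() is a made-up helper; HTTP_REFERER is optional and spoofable.
function is_external_referrer($referrer, $ownHostname)
{
    if ($referrer === null || $referrer === '') {
        return true; // direct visit, bookmark, email client, etc.
    }
    $host = parse_url($referrer, PHP_URL_HOST);
    if (!is_string($host)) {
        return true; // unparseable referrer: treat as external
    }
    // Ignore case and a leading "www." when comparing hostnames
    $normalise = function ($h) {
        return preg_replace('/^www\./', '', strtolower($h));
    };
    return $normalise($host) !== $normalise($ownHostname);
}

$referrer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : null;
if (is_external_referrer($referrer, 'example.com')) {
    // ...fire the server-side pageview here...
}
```

Requests with no referrer at all are treated as external, since bookmarks, email clients and direct visits are exactly the cases which the on-click tracking misses.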

Further consid­er­a­tions and chal­lenges

Please bear the follow­ing in mind when using:

  • This isn’t rigorously tested. I’m providing the framework of a conceptual solution here, which you’ll need to adapt to your own requirements. Don’t expect it to work 100% correctly out of the box.
  • I’ve not paid much atten­tion to sanit­isa­tion, filter­ing, and valid­a­tion. It’s possible, likely even, that there are secur­ity holes in the record­ing and usage of the PDF filename/​filepath which could do with clos­ing.
  • I’ve not played much with PHP-GA, as the solution is only intended to bridge the gap between now and the arrival of Universal Analytics. There may be more things which it could do, or better ways to do what it’s doing.
  • It’s hard to completely predict how this will affect your data, given the caveats around sessions.
  • Does this open up oppor­tun­it­ies for commer­cially account­able re-target­ing campaigns which use white papers, etc. as destin­a­tions? What are the implic­a­tions?
  • How useful is this for consolidating data on disparate/separate instances of a single resource? E.g., a video in multiple formats, or in multiple locations?

Updates

17/​02/​2013 20:46 – From a conver­sa­tion with @danbarker, we’ve explored the idea of, rather than send­ing the user to the ‘vanilla’ PDF, present­ing the PDF in a ‘wrap’ on request (e.g., an iframe, embed, or similar).

The PDF would be framed in a viewing portal, but enhanced with share/print options, internal links, etc. See Dan’s hastily mocked up screenshot. This would have the added advantage of allowing for the native embedding of normal GA code (which, given that it’d just fire and record the URL, would give you a completely integrated session/data-set), but it comes with some disadvantages. It breaks some native functionality, such as right-clicking a link to a PDF to save it, and interferes with any other native behaviour which expects a link to a PDF to act like a PDF rather than a web page. It’s worth pointing out that there are obviously solutions out there which do things like this, but our expectation is that the analytics angle probably doesn’t get the love that it deserves.

I think that this has some defin­ite poten­tial as, e.g., a Word­Press plugin as a fork of the solu­tion, but may not be appro­pri­ate for every­body.
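For the curious, a bare-bones version of the ‘wrap’ might look something like the sketch below. Both render_pdf_wrapper and the raw query parameter are hypothetical: the intercepting script would need to serve the bare PDF when raw=1 is present and this page otherwise, or the iframe would just load the wrapper again.

```php
<?php
// Sketch of a PDF 'wrap' page. render_pdf_wrapper() and the "raw" query
// parameter are hypothetical: the intercepting script would need to serve the
// bare PDF when raw=1 is present, and this page otherwise.
function render_pdf_wrapper($pdfUrl, $title)
{
    // Escape anything which ends up in the HTML
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
    $rawUrl = htmlspecialchars($pdfUrl . '?raw=1', ENT_QUOTES, 'UTF-8');

    return '<!DOCTYPE html>' . "\n"
        . '<html><head><title>' . $safeTitle . '</title>' . "\n"
        . '<!-- your normal GA snippet goes here -->' . "\n"
        . '</head><body>' . "\n"
        . '<p><a href="' . $rawUrl . '">Download ' . $safeTitle . '</a></p>' . "\n"
        . '<iframe src="' . $rawUrl . '" width="100%" height="800"></iframe>' . "\n"
        . '</body></html>';
}

echo render_pdf_wrapper('/papers/report.pdf', 'report.pdf');
```

The explicit download link is there to soften the ‘right click, save as’ problem, although it doesn’t help with links to the PDF from elsewhere.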

jared
Guest

Thank you for the script and concept. I have implemented it on this website to track our PDF downloads:

http://goodandevilbook.com/languages

Ryan
Guest

I work for a company (Docalytics – http://docalytics.com) that takes this a step further. We convert the PDF into an HTML5 viewer, and provide detailed analytics from there. It allows you to get more detailed information, such as pages read, etc.

paolo
Guest

Hi Jono,
I’ve adop­ted your solu­tion.
I have a question about the paragraph “A note on sessions and Universal Analytics”.
Can you confirm that the inflation only affects the count of visits, and not users or pageviews?

Thank you