Advanced PDF tracking

Many websites, especially in B2B or long purchase cycle scenarios, rely heavily on PDFs as part of their conversion process, reassurance/validation content, and SEO strategy (generally containing white papers, case studies, etc.). Typically, the effectiveness of these resources is measured in two ways – from an SEO/rankings perspective, and by measuring how frequently links to the PDFs are clicked (by, e.g., firing some on-click JavaScript).

Unfortunately, this approach is critically flawed – the more effectively your PDFs are promoted, exposed, shared and consumed, the less you know about it. Consider any of the following scenarios:

  • Your PDF ranks well in search engines
  • It’s linked to from a third-party website, or socially shared
  • It gets bookmarked, and subsequently visited directly
  • A campaign promotes the PDF directly (e.g., as a destination from an email campaign)

In each of these scenarios, if you’re using a client-side JavaScript tracking solution like Google Analytics, your tracking is completely bypassed, as your PDFs don’t (and generally can’t, without security headaches) contain your tracking code. The better your SEO campaign promotes your white papers, the more you’re going to under-report their success, and risk making poor decisions or having your work misrepresented… Eek!

I’m going to outline a general approach which will help to fill in some of the gaps in your data. It assumes the use of PHP and Google Analytics – however, the approach should be portable (depending on the capabilities of your solution of choice). It can also be extended to apply to other file types which aren’t typically tracked (e.g., images[!], Flash, docs). I’d love to hear about creative uses.

Intercept your PDF requests, and fire server-side tracking

Did you know that, as well as just making things ‘pretty’, URL rewriting via .htaccess allows you to do some clever stuff with intercepting and altering requests? Assuming mod_rewrite is already switched on (RewriteEngine On), the core of the solution is as simple and elegant as adding the following code to your .htaccess file:

RewriteRule ^(.+)\.pdf$ /analytics-pdf.php?file=$1.pdf [L,NC,QSA]

This simple rule intercepts any request ending in .pdf, and, invisibly to the user (and to search engines), actually runs analytics-pdf.php instead.
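
To get a feel for which requests the rule will catch, the same pattern can be tried out in PHP before touching the live .htaccess file. This is only a rough sketch: it assumes the dot before "pdf" is escaped (so that it matches a literal ".pdf"), it uses PHP's preg_match rather than Apache itself, and the function name and sample paths are made up for illustration.

```php
<?php
// Hypothetical helper: mimics the rewrite rule's pattern, with the [NC]
// flag approximated by the case-insensitive "i" modifier.
function matchPdfRequest($requestPath)
{
    if (preg_match('/^(.+)\.pdf$/i', $requestPath, $matches)) {
        // Mirrors the rule's substitution: file=$1.pdf
        return $matches[1] . '.pdf';
    }
    return null; // Not a PDF request, so the rule would not apply
}

// Example paths (assumptions, relative to the site root as in .htaccess)
var_dump(matchPdfRequest('downloads/white-paper.pdf'));
var_dump(matchPdfRequest('downloads/White-Paper.PDF'));
var_dump(matchPdfRequest('about/index.html'));
```

Note that, because of [NC], a request for White-Paper.PDF is passed on with a lowercase .pdf extension, which may matter on a case-sensitive filesystem.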

The next part is a little more complex; we need to set up analytics-pdf.php to grab the filename of the PDF, tell Google Analytics to record a pageview, and then serve the PDF to the user.

PHP-GA

You’ll need to grab and upload a copy of PHP-GA. This is a PHP framework which makes it easy to construct calls to GA in much the same way as the normal JavaScript approach, but using server-side logic. We’ll use this framework to construct the request to Google Analytics.

The following PHP is a quick hack of the example solution on the PHP-GA site, adapted to grab the PDF filename from the $_GET array and pass it to a pageview as a virtual path, and then display the requested file. Save it as ‘analytics-pdf.php’ in your root folder.

<?php
// Include the PHP-GA script(s)
require_once('php-ga/autoload.php');
use UnitedPrototype\GoogleAnalytics;

// Change these values to your UA code and domain
define('UA_CODE', 'UA-XXXXXXXX-Y');
define('HOSTNAME', 'example.com');

// Grab the requested PDF path and sanitize it
$filePath = isset($_GET['file']) ? $_GET['file'] : '';
$filePath = filter_var($filePath, FILTER_SANITIZE_URL);
// basename() avoids passing explode()'s result to end() by reference
$file = basename($filePath);

// Only track and serve the file if it actually exists
if ($filePath && is_file($filePath)) :

    // Build the tracker, visitor and session, then record a pageview
    $tracker = new GoogleAnalytics\Tracker(UA_CODE, HOSTNAME);
    $visitor = new GoogleAnalytics\Visitor();
    $visitor->setIpAddress($_SERVER['REMOTE_ADDR']);
    $visitor->setUserAgent($_SERVER['HTTP_USER_AGENT']);
    $visitor->setScreenResolution('1024x768');
    $session = new GoogleAnalytics\Session();
    $page = new GoogleAnalytics\Page('/' . $filePath);
    $page->setTitle($file);
    $tracker->trackPageview($page, $session, $visitor);

    // Send the PDF to the browser
    header('Content-Type: application/pdf');
    header('Content-Disposition: inline; filename="' . $file . '"');
    header('Content-Transfer-Encoding: binary');
    header('Content-Length: ' . filesize($filePath));
    header('Accept-Ranges: bytes');
    readfile($filePath);

else :

    // Nothing to serve, so say so rather than returning a blank page
    http_response_code(404);

endif;

Believe it or not, that’s all there is to it. You’re now intercepting all PDF requests, and triggering a pageview before serving up the PDF. Magic!

I’d recommend reading on, however, as there are some implications and unknowns around this approach.

A note on sessions and Universal Analytics

PHP-GA creates a distinct session when it fires, which won’t tie together with existing visit behaviour. If I arrive on a PDF and browse into the site, there’ll be attribution issues and some inflation of visit counts. Similarly, on-site visitors who click through to a PDF will also create disconnected data.

In the mid- to long-term, this will be easily manageable with the release of Universal Analytics, which will replace the PHP-GA component of the solution, and allow us to use server-side Analytics to carry across a distinct session ID between requests.
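
As a rough idea of what this might look like, the sketch below builds a Universal Analytics Measurement Protocol pageview hit in PHP, reusing the client ID from the visitor's existing _ga cookie (the closest thing to the 'session ID' described above) so that the server-side hit can be tied to the same visitor. It's an assumption-laden sketch rather than a tested implementation: the function names are hypothetical, the cookie is assumed to be in the usual GA1.x.<random>.<timestamp> format, and the actual sending of the hit is left out.

```php
<?php
// Hypothetical helper: extract the client ID from a _ga cookie value,
// assumed to look like "GA1.2.1234567890.1361133960".
function clientIdFromGaCookie($cookieValue)
{
    $parts = explode('.', (string) $cookieValue);
    if (count($parts) < 4) {
        return null; // Unexpected format (or no cookie at all)
    }
    // The client ID is the last two dot-separated fields
    return $parts[count($parts) - 2] . '.' . $parts[count($parts) - 1];
}

// Hypothetical helper: build the payload for a pageview hit.
// v, tid, cid, t, dh, dp and dt are Measurement Protocol parameters.
function buildPageviewPayload($uaCode, $clientId, $hostname, $path, $title)
{
    return http_build_query(array(
        'v'   => 1,
        'tid' => $uaCode,
        'cid' => $clientId,
        't'   => 'pageview',
        'dh'  => $hostname,
        'dp'  => $path,
        'dt'  => $title,
    ));
}

// Fall back to a made-up cookie value so the example runs on its own;
// a real implementation would need to generate a client ID instead.
$cookie = isset($_COOKIE['_ga']) ? $_COOKIE['_ga'] : 'GA1.2.1234567890.1361133960';
$payload = buildPageviewPayload(
    'UA-XXXXXXXX-Y',
    clientIdFromGaCookie($cookie),
    'example.com',
    '/downloads/white-paper.pdf',
    'white-paper.pdf'
);
echo $payload . "\n";
```

The resulting string would be sent as the body of a POST request to the Measurement Protocol's /collect endpoint, in place of the PHP-GA trackPageview call.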

In the meantime, I’d appreciate anybody’s thoughts on how we might overcome this – it’s potentially feasible to do some clever things by including conditional logic in the tags, and/or artificially constructing/maintaining channel data by, e.g., manipulating UTM tags. I started considering solutions, but was keen to get something out of the door, at least, and to revisit and refine as usage demands.

Early thoughts include:

  • Only firing if we detect that the visitor originated from an external location (e.g., not an internal link), as it’s assumed that existing event tracking will account for internal links
  • Modifying the code to construct events, rather than pageviews, so as to create a consolidated view count when added to the existing internal link click counts
  • Considering using passive events so as to avoid visit count inflation, but at the expense of understanding that a pageview-like action has occurred (which may artificially skew interpretation in the opposite direction to the present problem)
  • Attempting to collect, and then carry, channel data through to the PDF – e.g., appending UTM parameters to the destination URL, which can then be carried on into subsequent pageviews on the site
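
The first of these ideas could be sketched roughly as follows: compare the host of the HTTP referrer against your own hostname, and only fire the server-side pageview when they differ. The function name is hypothetical, and it assumes that a missing referrer (bookmarks, email clients, direct visits) should be treated as external. Bear in mind that the referrer header is supplied by the browser, and can be absent or spoofed.

```php
<?php
// Hypothetical helper: returns true when the request did not come from
// a page on our own site.
function isExternalVisit($referrer, $ownHostname)
{
    if (empty($referrer)) {
        return true; // No referrer: bookmark, email, typed URL, etc.
    }
    $host = parse_url($referrer, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return true; // Unparseable referrer, so treat it as external
    }
    // Ignore case and a leading "www." when comparing hostnames
    $normalise = function ($name) {
        return preg_replace('/^www\./', '', strtolower($name));
    };
    return $normalise($host) !== $normalise($ownHostname);
}

$referrer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
if (isExternalVisit($referrer, 'example.com')) {
    // ...fire the server-side pageview here...
}
```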

Further considerations and challenges

Please bear the following in mind when using this:

  • This isn’t rigorously tested. I’m providing the framework of a conceptual solution here, which you’ll need to adapt to your own requirements. Don’t expect it to work 100% correctly out of the box.
  • I’ve not paid much attention to sanitisation, filtering, and validation. It’s possible, likely even, that there are security holes in the recording and usage of the PDF filename/filepath which could do with closing.
  • I’ve not played much with PHP-GA, as the solution is only intended to bridge the gap between now and the arrival of Universal Analytics. There may be more things which it could do, or better ways to do what it’s doing.
  • It’s hard to completely predict how this will affect your data, given the caveats around sessions.
  • Does this open up opportunities for commercially accountable re-targeting campaigns which use white papers, etc. as destinations? What are the implications?
  • How useful is this for consolidating data on disparate/separate instances of a single resource? E.g., a video in multiple formats, or in multiple locations?
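
On the sanitisation point above, one possible starting point is to resolve the requested path and refuse anything which isn't a PDF inside a known folder. This is a sketch only, not a complete security review: the function name and the idea of a single base directory are assumptions, and you'd want to adapt it to wherever your PDFs actually live.

```php
<?php
// Hypothetical helper: returns the real path of the requested PDF, or
// null if it is missing, is not a .pdf, or sits outside $baseDir.
function resolvePdfPath($requested, $baseDir)
{
    $base = realpath($baseDir);
    if ($base === false) {
        return null;
    }
    // realpath() collapses "../" segments and returns false if the
    // target doesn't exist
    $real = realpath($base . DIRECTORY_SEPARATOR . $requested);
    if ($real === false || !is_file($real)) {
        return null;
    }
    // The resolved path must still be inside the base directory
    if (strpos($real, $base . DIRECTORY_SEPARATOR) !== 0) {
        return null;
    }
    // Only serve files with a .pdf extension
    if (strtolower(pathinfo($real, PATHINFO_EXTENSION)) !== 'pdf') {
        return null;
    }
    return $real;
}

// Example usage, assuming PDFs are served from the script's own folder
$requested = isset($_GET['file']) ? $_GET['file'] : '';
$path = resolvePdfPath($requested, __DIR__);
echo $path === null ? "Rejected\n" : "OK: $path\n";
```

In analytics-pdf.php, the returned path (rather than the raw $_GET value) would then be the one passed to filesize() and readfile().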

Updates

17/02/2013 20:46 – From a conversation with @danbarker, we’ve explored the idea of, rather than sending the user to the ‘vanilla’ PDF, presenting the PDF in a ‘wrap’ on request (e.g., an iframe, embed, or similar).

The PDF would be framed in a viewing portal, but enhanced with share/print options, internal links, etc. See Dan’s hastily mocked up screenshot. This would have the added advantage of allowing for the native embedding of normal GA code (which, given that it’d just fire and record the URL, would give you a completely integrated session/data-set) but comes with some disadvantages. Some native functionality breaks, such as right-clicking a link to a PDF to save it, and the wrap interferes with any other native behaviour which expects a link to a PDF to act like a PDF, rather than a web page. It’s worth pointing out that there are obviously solutions out there which do things like this, but our expectation is that the analytics angle probably doesn’t get the love that it deserves.

I think that this has some definite potential as, e.g., a WordPress plugin as a fork of the solution, but may not be appropriate for everybody.

Comments

jared
Guest

Thank you for the script and concept. I have implemented it on this website to track our PDF downloads.

http://goodandevilbook.com/languages

Ryan
Guest

I work for a company (Docalytics – http://docalytics.com) that takes this a step further. We convert the PDF into an HTML5 viewer, and provide detailed analytics from there. It allows you to get more detailed information, such as pages read, etc.

paolo
Guest

Hi Jono,
I’ve adopted your solution.
I have a question about the paragraph “A note on sessions and Universal Analytics”.
Can you confirm that the inflation only affects the visit count, and not the user count or pageviews?

Thank you