Google have to improve their structured data reporting
30th December, 2019
I’ve talked previously about how the adoption of structured data – specifically schema.org markup – is strategically important to Google. That’s why they’re investing so much in supporting new formats, and why they’re continuing to add new reports to their Google Search Console product.
But the way in which those reports summarise and visualise the use of structured data on your website is problematic. Their reports are increasingly confusing and misleading, and will get worse over time as Google extends its reporting to cover new features.
Their reports are confusing
Many webmasters will have issues like the one outlined in the image below. Here, the “Sitelinks searchbox” report in Google Search Console (which identifies the presence of structured data describing an internal site search feature) says that the yoast.com website features over two thousand internal search elements.
Yoast.com obviously doesn’t have thousands of internal search tools, but we do describe the internal search feature of our site in the structured data of every page on our site. That’s because each WebPage element we’re describing on each page is part of a WebSite, and that WebSite has a SearchAction property. We’re not saying “This page has an internal search feature”, we’re saying “This WebPage is part of a WebSite, which has a search feature”.
That looks something like the image below, which is taken from the source code of one of our articles about meta descriptions.
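As a rough sketch of that structure (not the exact markup from our site; the example.com URLs, the @id values and the build_page_graph helper are placeholders for illustration), the relationship can be expressed by attaching the SearchAction to the WebSite node and having the WebPage node point to it by reference:

```python
import json

SITE_URL = "https://example.com/"  # placeholder domain, not a real site


def build_page_graph(page_url: str) -> dict:
    """Build JSON-LD for one page: a WebSite node plus a WebPage that is part of it."""
    website_id = SITE_URL + "#website"
    website = {
        "@type": "WebSite",
        "@id": website_id,
        "url": SITE_URL,
        # The search feature belongs to the WebSite, not to any single page.
        "potentialAction": {
            "@type": "SearchAction",
            "target": SITE_URL + "?s={search_term_string}",
            "query-input": "required name=search_term_string",
        },
    }
    webpage = {
        "@type": "WebPage",
        "@id": page_url + "#webpage",
        "url": page_url,
        # The page only references the site; it does not claim the search feature.
        "isPartOf": {"@id": website_id},
    }
    return {"@context": "https://schema.org", "@graph": [website, webpage]}


if __name__ == "__main__":
    print(json.dumps(build_page_graph(SITE_URL + "meta-descriptions/"), indent=2))
```

Every page emits the same WebSite node, which is why the SearchAction shows up on every URL even though there is only one search feature.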
Despite looking like a ‘good’ report, these green bar charts are completely meaningless. In fact, this is just one example of many, where the reports simply don’t make sense. The SearchAction markup is a property of the WebSite, not the WebPage, so reporting on it based on the number of pages it’s found on simply doesn’t make sense. Having more pages with this markup is neither ‘good’ nor ‘bad’, and arguably isn’t relevant or useful information.
This confusion in reporting between “things on this page”, and “things we found references to on this page” is consistent throughout Google’s reporting, and renders most of those reports largely useless. At best, they’re simply a count of pages on a website. At worst, they’re confusing to users who don’t understand what they’re looking at.
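To make the distinction concrete, here is a small sketch comparing the two ways of counting. The data and function names are hypothetical, and this says nothing about how Google actually computes its reports; it only illustrates why a page-level count of a site-level property ends up being a count of pages.

```python
from typing import Dict, List

SITE_ID = "https://example.com/#website"  # placeholder @id

# The same WebSite node is repeated, in context, on every page.
website_node = {
    "@type": "WebSite",
    "@id": SITE_ID,
    "potentialAction": {"@type": "SearchAction"},
}

# Hypothetical crawl result: URL -> list of nodes found in that page's markup.
pages: Dict[str, List[dict]] = {
    "https://example.com/": [website_node, {"@type": "WebPage"}],
    "https://example.com/about/": [website_node, {"@type": "WebPage"}],
    "https://example.com/blog/": [website_node, {"@type": "WebPage"}],
}


def pages_mentioning_search(crawl: Dict[str, List[dict]]) -> int:
    """Page-centric view: count URLs whose markup mentions a search action."""
    return sum(
        1
        for nodes in crawl.values()
        if any("potentialAction" in node for node in nodes)
    )


def sites_with_search(crawl: Dict[str, List[dict]]) -> int:
    """Entity-centric view: count distinct WebSite entities (by @id) with one."""
    ids = {
        node["@id"]
        for nodes in crawl.values()
        for node in nodes
        if node.get("@type") == "WebSite" and "potentialAction" in node
    }
    return len(ids)


print(pages_mentioning_search(pages))  # 3: one per page crawled
print(sites_with_search(pages))        # 1: there is only one search feature
```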
As Google adds support for more types of structured data, the problem is going to get worse. But addressing this isn’t as simple as just improving the charts.
There’s a deeper problem
The poor reporting formats and visualisation are the tip of the iceberg for a bigger problem. Fixing the Sitelinks searchbox graph would be a trivial issue, but it’s unlikely to happen due to three deeper challenges with Google’s internal operations and communications. Specifically:
- Google’s ‘mental model’ for structured data is entirely page-centric*. That means that their reporting is also page-centric; they’re all based on visualising the structured data discovered on a specific URL. For properties which don’t belong to the WebPage which a URL represents, page-centric reporting won’t make sense.
- Their reports are element-centric. They’re based on visualising the (numbers of) URLs which feature a specific type of structured data within a site. Rich descriptions of pages are about the connections between elements (e.g., the Author of an Article on a WebPage). This is more useful to understand than, say, how many Article elements exist across a site.
- Their teams are feature-centric. Different groups of people within Google work on different areas of structured data, directly related to the features they want to develop/expose in their search results. This leads to conflicting opinions, implementations, and fragmentation.
*Specifically, they have no way to connect or understand the relationships between different entities (represented in structured data) when they’re spread across different pages. They rely on the rest of Google’s systems to determine relationships between pages/things. This aligns with their narrative around the deprecation of rel prev tags, the insistence that there’s no such thing as Domain Authority, and similar.
The problem with a page-centric model
Because Google won’t “join the dots” between entities which are spread across multiple pages, the only way in which we can describe such relationships (e.g., describing the properties of a WebSite within which a given WebPage exists, or the properties of the Organization which published that page) is to repeat all of those properties, in context, on every page on the site.
To describe an Article, for example, you must also describe the WebPage that it’s on, the Organization which published it, and the Author who wrote it. These entities might naturally live on different URLs, and have deep relationships and contexts of their own. Critically, describing the properties of those entities in situ is not the same as declaring that they’re ‘on’ the page upon which they’re described.
That’s why, in Yoast SEO, we construct a ‘graph’ of all the objects on (or referenced on) a given page and describe all of their properties. You can see a great example of what that looks like here (from this article). You can read more about the ‘graph approach’ which we used here.
This isn’t a problem of the graph approach defining things which “aren’t on the page”, it’s a problem of Google providing the wrong tools for understanding the presence and role of structured data on a site. They’re trying to visualise a graph in page-centric, feature-centric reporting. Their internal structures, teams and politics are forcing a square peg into a round hole.
Remember, we have to describe all of the properties of each related element, because Google won’t build those associations across pages/URLs.
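A minimal sketch of what that means in practice is below. It assumes placeholder example.com identifiers and a hypothetical follow helper, and it is not the actual Yoast SEO implementation; it only shows how nodes in a single page’s graph can reference one another by @id, so that a consumer can walk from an Article to the Organization behind it without fetching another URL.

```python
BASE = "https://example.com/"  # placeholder domain


def ref(fragment: str) -> dict:
    """Return a JSON-LD reference ({"@id": ...}) to another node in the graph."""
    return {"@id": BASE + fragment}


# Every related entity is described on the page, and linked by reference.
graph = [
    {"@type": "Organization", "@id": BASE + "#organization", "name": "Example Co"},
    {"@type": "WebSite", "@id": BASE + "#website", "publisher": ref("#organization")},
    {"@type": "WebPage", "@id": BASE + "post/#webpage", "isPartOf": ref("#website")},
    {"@type": "Person", "@id": BASE + "#/person/jane", "name": "Jane Doe"},
    {
        "@type": "Article",
        "@id": BASE + "post/#article",
        "isPartOf": ref("post/#webpage"),
        "author": ref("#/person/jane"),
        "publisher": ref("#organization"),
    },
]


def follow(nodes: list, start_id: str, *props: str) -> dict:
    """Start at one node and follow a chain of @id references through the graph."""
    index = {node["@id"]: node for node in nodes}
    node = index[start_id]
    for prop in props:
        node = index[node[prop]["@id"]]
    return node


# Article -> WebPage -> WebSite -> Organization
owner = follow(graph, BASE + "post/#article", "isPartOf", "isPartOf", "publisher")
print(owner["name"])  # Example Co
```

Remove the Organization or WebSite node from the list and the lookup fails, which is the sense in which leaving entities out of a page’s markup breaks the relationships between the ones that remain.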
…And their advice is consistently bad
To make things worse, Google’s own documentation provides bad and conflicting advice on the topic.
This page, for example, says that you should only add internal search markup on the site’s homepage. But the internal search markup is a property of the WebSite, not the homepage’s WebPage (never mind that there’s no schema.org definition for a ‘homepage’). And if we have to describe that WebSite on every page – because it’s the natural parent of each WebPage, and has other important properties (like describing the Organization who publish the WebSite, and therefore the WebPage) – then it doesn’t make sense to alter the properties of the WebSite entity based on which WebPage is being described. That’s a terrible approach.
Similarly, advice from Google’s John Mu suggests that Organization markup should only be placed on the homepage (or perhaps, depending on the site, the contact or about page), and “not on every page”. Without a graph-based approach, that advice would make sense – every page shouldn’t represent/contain the Organization’s markup. But in a connected graph, we need to understand that, for example, the Author of an Article on a WebPage works for the Organization which publishes the WebSite which the Article is on. Excluding the Organization from the graph makes the relationships between those entities fall apart.
They’re internally inconsistent, too
On top of this, their approaches are inconsistent between features. The way in which Logo markup (usually used in the context of describing an Organization) is reported on is completely different to the Sitelinks searchbox scenario.
Specifically, they’ll only report on Logo markup if it’s on a ‘leading edge’ in the graph markup (i.e., if it’s attached to a ‘top-level’ node). This is never the case in our approach, as the Organization is always at least a few ‘layers’ deep when we’re primarily describing a Product or similar. That means that despite our Organization being referenced on every page (and having a Logo), our Logo reporting in Google Search Console is empty.
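The effect of that decision can be illustrated with a short sketch. Both functions and the sample markup are hypothetical, and this is not Google’s detection logic; it simply contrasts a detector which only inspects the outermost node with one which walks the whole structure.

```python
def top_level_logos(doc: dict) -> list:
    """Only inspect the outermost node, as a 'top-level only' report would."""
    return [doc["logo"]] if "logo" in doc else []


def all_logos(node) -> list:
    """Walk every nested object and list, collecting any logo property found."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "logo":
                found.append(value)
            else:
                found.extend(all_logos(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(all_logos(item))
    return found


# A Product is the primary entity; the Organization (and its logo) is nested.
markup = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example plugin",
    "manufacturer": {
        "@type": "Organization",
        "name": "Example Co",
        "logo": "https://example.com/logo.png",
    },
}

print(top_level_logos(markup))  # []
print(all_logos(markup))        # ['https://example.com/logo.png']
```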
This difference in approach is a deliberate, and seemingly arbitrary, design decision by the team responsible for this report. In fact, each feature that Google supports is detected, extracted, and reported on in different ways, according to the whims of the teams involved. Their problems with silos and internal inconsistency extend well beyond their reporting, into the heart of how they extract and process different types of schema.org entities – but that’s a whole other topic.
There’s consistency in their bad advice, at least. It all assumes that they’re talking to users who’re simply copying/pasting bits of isolated structured data into their webpages. It caters to users who don’t have connected graphs, and who are just trying to add snippets of code to specific pages (to chase specific features/rewards from Google’s structured data features).
It also assumes that it’s fine to rely on Google to extract all of those isolated ‘bits’ of markup, and for them to construct their own context about how they’re related.
Their approach is bad for the open web
I’m deeply uncomfortable with both of those assumptions because I think they could be harmful to the open web. Google isn’t the only consumer who theoretically stands to benefit from richer, deeper structured data on the web, if it’s implemented properly. With their current feature-centric approach, they’re only encouraging and rewarding ‘snippets’ of structured markup, which, critically, don’t describe the relationships between things. That worries me because:
- In graph data, it’s the relationships between things which matter. The value of structured data is in describing those relationships (the “linked” bit of JSON-LD is the whole point). If users are just copying/pasting fragmented bits of markup onto their pages in order to chase rewards from Google, they do so at the expense of the future wider web, in a way which Google is endorsing/rewarding. Even if Google (with all of their processing and understanding of the web) can reliably infer the relationships between things on pages across a site, other consumers may not be able to.
- It’s close to impossible to construct a nuanced graph by copying and pasting code. To produce (and maintain) markup like the example on this page, each node needs to know about the properties of each other node, and make complex decisions about how they’re related. I’ve described many of the challenges in achieving that here. Catering for copy/paste approaches, both in their guidelines and their reporting, limits the adoption of structured data solely to a few bits-and-pieces which Google want.
- Google’s preferences frequently diverge from schema.org’s definitions, in ways which are designed to simplify and enable this ‘copy/paste’ approach. That conflicting information causes confusion. Forums, blogs, and even reputable sites are full of conflicting, and often incorrect, advice. Google’s approach to structured data documentation and reporting is mystifying and confusing the topic, rather than clarifying it.
All of this is bad for the open web. Tying our structured data implementations specifically to whichever features Google is currently rewarding (never mind that they frequently deprecate them), rather than richly describing our pages and our content (in a way which Google can still consume!) stifles innovation, and relegates schema.org to little more than a reference sheet for copy-and-paste Google hacks.
In an ideal world, every webpage would fully describe itself and all of its relationships, and other platforms/tools/companies would be able to consume and utilise that information. In today’s world, none of that is possible, because all of Google’s tools and reporting force webmasters to look ‘inwards’, not outwards.
What’s the answer?
The structured data reports in Google Search Console urgently need to be overhauled. They need to do a better job of visualising the relationships between entities, and the distinctions between properties on the page and properties of the page. That’s a complex challenge, but one which they should rise to if they want to encourage broader and deeper adoption.
To do that, they’ll need to reassess their internal team structures and operating processes. They’ll need consistent internal positioning on how their processes for data extraction, evaluation and reporting work across different schema ‘features’.
They’ll also need to change the narrative in their support documentation, to shift the focus away from copyable snippets, and towards constructing graphs of connected data.
In many cases, that’ll mean shifting the processes away from facilitating blunt copying-and-pasting by individual website managers, to making stronger demands of technology vendors and underlying platforms to include rich structured data in their outputs. Individual webmasters and content editors shouldn’t – can’t reasonably – be responsible for maintaining complex structured data about their pages (never mind the markup for all of the other pieces on their website). And the issues with Google’s reporting suite only serve to widen the gap between their understanding of what ‘good’ looks like.
Whilst Yoast has taken some great strides in standardising an approach for this kind of connected data, Google should be doing more to solve the problems of adoption and utilisation further upstream. We shouldn’t be leading this charge; Google should be. But at the moment, there’s no sign that they understand the scope of the opportunities at hand for getting this right, or the risk we face if they (continue to) get it wrong.