Google have to improve their structured data reporting

I’ve talked previously about how the adoption of structured data – specifically schema.org markup – is strategically important to Google. That’s why they’re investing so much in supporting new formats, and why they’re continuing to add new reports to their Google Search Console product.

But the way in which those reports summarise and visualise the use of structured data on your website is problematic. They’re increasingly confusing and misleading, and they’re only going to get worse as Google extend their reporting to cover new features.

Their reports are confusing

Many webmasters will have issues like the one outlined in the image below. Here, the “Sitelinks searchbox” report in Google Search Console (which identifies the presence of structured data describing an internal site search feature) says that the yoast.com website features over two thousand internal search elements.

Yoast.com obviously doesn’t have thousands of internal search tools, but we do describe the internal search feature of our site in the structured data of every page on our site. That’s because each WebPage element we’re describing on each page is part of a WebSite, and that WebSite has a SearchAction property. We’re not saying “This page has an internal search feature”, we’re saying “This WebPage is part of a WebSite, which has a search feature”.

That looks something like the image below, which is taken from the source code of one of our articles about meta descriptions.
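In simplified form, that markup looks something like the following (the domain, URLs and @id values here are illustrative placeholders rather than our actual output):

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@graph": [
        {
          "@type": "WebSite",
          "@id": "https://www.example.com/#website",
          "url": "https://www.example.com/",
          "potentialAction": {
            "@type": "SearchAction",
            "target": "https://www.example.com/?s={search_term_string}",
            "query-input": "required name=search_term_string"
          }
        },
        {
          "@type": "WebPage",
          "@id": "https://www.example.com/blog/meta-descriptions/#webpage",
          "url": "https://www.example.com/blog/meta-descriptions/",
          "isPartOf": { "@id": "https://www.example.com/#website" }
        }
      ]
    }
    </script>

Note that the SearchAction lives on the WebSite node; the WebPage merely points at that WebSite via its isPartOf property. Because every page carries this markup, every page gets counted by the report.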

Despite looking like a ‘good’ report, these green bar charts are completely meaningless. And this is just one example of many where the reports simply don’t make sense. The SearchAction markup is a property of the WebSite, not of any individual WebPage, so reporting on it based on the number of pages it’s found on tells you nothing. Having more pages with this markup is neither ‘good’ nor ‘bad’, and arguably isn’t relevant or useful information at all.

This confusion in reporting between “things on this page”, and “things we found references to on this page” is consistent throughout Google’s reporting, and renders most of those reports largely useless. At best, they’re simply a count of pages on a website. At worst, they’re confusing to users who don’t understand what they’re looking at.

As Google adds support for more types of structured data, the problem is going to get worse. But addressing this isn’t as simple as just improving the charts.

There’s a deeper problem

The poor reporting formats and visualisation are the tip of the iceberg of a bigger problem. Fixing the Sitelinks searchbox chart would be trivial, but it’s unlikely to happen, due to three deeper challenges with Google’s internal operations and communications. Specifically:

  1. Google’s ‘mental model’ for structured data is entirely page-centric*. That means that their reporting is also page-centric; it’s all based on visualising the structured data discovered on a specific URL. For properties which don’t belong to the WebPage which a URL represents, page-centric reporting won’t make sense.
  2. Their reports are element-centric. They’re based on visualising the (numbers of) URLs which feature a specific type of structured data within a site. But the connections and relationships between those elements (e.g., the Author of an Article on a WebPage) are just as important as – and sometimes more important than – simply identifying their presence.
  3. Their teams are feature-centric. Different groups of people within Google work on different areas of structured data, directly related to the features they want to develop/expose in their search results. This leads to conflicting opinions, implementations, and fragmentation.

*Specifically, they have no way to connect or understand the relationships between different entities (represented in structured data) when they’re spread across different pages. They rely on the rest of Google’s systems to determine relationships between pages/things. This aligns with their narrative around the deprecation of rel next/prev tags, the insistence that there’s no such thing as Domain Authority, and similar.

The problem with a page-centric model

Because Google won’t “join the dots” between entities which are spread across multiple pages, the only way in which we can describe such relationships (e.g., describing the properties of the WebSite within which a given WebPage exists, or the properties of the Organization which published that page) is to repeat all of those properties, in context, on every page of the site.

To describe an Article, for example, you must also describe the WebPage that it’s on, the Organization which published it, and the Author who wrote it. These entities might naturally live on different URLs, and have deep relationships and contexts of their own. Critically, describing the properties of those entities in situ is not the same as declaring that they’re ‘on’ the page upon which they’re described.

That’s why, in Yoast SEO, we construct a ‘graph’ of all the objects on (or referenced on) a given page and describe all of their properties. You can see a great example of what that looks like here (from this article). You can read more about the ‘graph approach’ which we used here.
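As a condensed illustration of that approach (the names, URLs and @id values below are placeholders, and real graphs contain many more nodes and properties), each entity is described once, and connected to the others by @id references:

    {
      "@context": "https://schema.org",
      "@graph": [
        {
          "@type": "Organization",
          "@id": "https://www.example.com/#organization",
          "name": "Example Co",
          "logo": {
            "@type": "ImageObject",
            "url": "https://www.example.com/logo.png"
          }
        },
        {
          "@type": "WebSite",
          "@id": "https://www.example.com/#website",
          "url": "https://www.example.com/",
          "publisher": { "@id": "https://www.example.com/#organization" }
        },
        {
          "@type": "WebPage",
          "@id": "https://www.example.com/example-post/#webpage",
          "url": "https://www.example.com/example-post/",
          "isPartOf": { "@id": "https://www.example.com/#website" }
        },
        {
          "@type": "Article",
          "@id": "https://www.example.com/example-post/#article",
          "headline": "An example post",
          "isPartOf": { "@id": "https://www.example.com/example-post/#webpage" },
          "author": { "@id": "https://www.example.com/#/schema/person/jane-doe" }
        },
        {
          "@type": "Person",
          "@id": "https://www.example.com/#/schema/person/jane-doe",
          "name": "Jane Doe"
        }
      ]
    }

Describing the Organization and the Person here isn’t a claim that they’re ‘on’ the page; it’s a description of how they relate to it.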

This isn’t a problem of the graph approach defining things which “aren’t on the page”, it’s a problem of Google providing the wrong tools for understanding the presence and role of structured data on a site. They’re trying to visualise a graph in page-centric, feature-centric reporting. Their internal structures, teams and politics are forcing a square peg into a round hole.

Remember, we have to describe all of the properties of each related element, because Google won’t build those associations across pages/URLs.

…And their advice is consistently bad

To make things worse, Google’s own documentation provides bad and conflicting advice on the topic.

This page, for example, says that you should only add internal search markup on the site’s homepage. But the internal search markup is a property of the WebSite, not of the homepage’s WebPage (never mind that there’s no schema.org definition of a ‘homepage’). And if we have to describe that WebSite on every page – because it’s the natural ‘parent’ of each WebPage, and has other important properties (like describing the Organization which publishes the WebSite, and therefore the WebPage) – then it doesn’t make sense to alter the properties of the WebSite entity based on which WebPage is being described. That’s a terrible approach; the website doesn’t stop having a search facility just because I’m viewing an inner page.

Similarly, advice from Google’s John Mu suggests that Organization markup should only be placed on the homepage (or perhaps, depending on the site, the contact or about page), and “not on every page”. Without a graph-based approach, that advice would make sense – not every page should represent or contain the Organization’s markup. But in a connected graph, we need to understand that, for example, the Author of an Article on a WebPage works for the Organization which publishes the WebSite which the Article is on. Excluding the Organization from the graph makes the relationships between those entities fall apart.
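To illustrate (again with placeholder @id values), it’s only by keeping the Organization node in every page’s graph that a relationship like this one can resolve:

    {
      "@type": "Person",
      "@id": "https://www.example.com/#/schema/person/jane-doe",
      "name": "Jane Doe",
      "worksFor": { "@id": "https://www.example.com/#organization" }
    }

If the Organization only exists on the homepage, then on every other page that worksFor reference points at nothing, and the relationship between the author and the publisher is lost.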

They’re internally inconsistent, too

On top of this, their approaches are inconsistent between features. The way in which Logo markup (usually used in the context of describing an Organization) is reported is completely different to the Sitelinks searchbox scenario.

Specifically, they’ll only report on Logo markup if it’s on a ‘leading edge’ in the graph markup (i.e., if it’s attached to a ‘top-level’ node). This is never the case in our approach, as the Organization is always at least a few ‘layers’ deep when we’re primarily describing a WebPage, Article, Product or similar. That means that despite our Organization being referenced on every page (and having a Logo), our Logo reporting in Google Search Console is empty.
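A simplified, hypothetical comparison (with placeholder values throughout) illustrates the difference. A standalone, top-level Organization node, which the Logo report appears to pick up:

    {
      "@context": "https://schema.org",
      "@type": "Organization",
      "url": "https://www.example.com/",
      "logo": "https://www.example.com/logo.png"
    }

And the same Organization and logo, nested a few layers deep in a graph (as in our approach), which the report seems to ignore:

    {
      "@context": "https://schema.org",
      "@type": "WebPage",
      "url": "https://www.example.com/example-post/",
      "isPartOf": {
        "@type": "WebSite",
        "publisher": {
          "@type": "Organization",
          "url": "https://www.example.com/",
          "logo": "https://www.example.com/logo.png"
        }
      }
    }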

This difference in approach is a deliberate, and seemingly arbitrary, design decision by the team responsible for this report. In fact, each feature that Google supports is detected, extracted, and reported on in different ways, according to the whims of the teams involved. There are many similar examples. Their problems with silos and internal inconsistency extend well beyond their reporting, into the heart of how they extract and process different types of schema.org entities – but that’s a whole other topic.

There’s consistency in their bad advice, at least. It all assumes that they’re talking to users who’re simply copying/pasting bits of isolated structured data into their webpages. It caters to users who don’t have connected graphs, and who are just trying to add snippets of code to specific pages (to chase specific features/rewards from Google’s structured data features).

It also assumes that it’s fine to rely on Google to extract all of those isolated ‘bits’ of markup, and for them to construct their own context about how they’re related.

Their approach is bad for the open web

I’m deeply uncomfortable with both of those assumptions, because I think they could be harmful to the open web. Google isn’t the only consumer who theoretically stands to benefit from richer, deeper structured data on the web, if it’s implemented properly. With their current feature-centric approach, they’re only encouraging and rewarding ‘snippets’ of structured markup, which, critically, don’t describe the relationships between things. That worries me because:

  1. In graph data, it’s the relationships between things which matter. The value of structured data is in describing those relationships (the “linked” bit of JSON-LD is the whole point). If users are just copying/pasting fragmented bits of markup onto their pages in order to chase rewards from Google, they do so at the expense of the future, wider web – in a way which Google is endorsing and rewarding. Even if Google (with all of their processing and understanding of the web) can reliably infer the relationships between things on pages across a site, other consumers may not be able to magically do so.
  2. It’s close to impossible to construct a nuanced graph by copying and pasting code. To produce (and maintain) markup like the example on this page, each node needs to know about the properties of each other node, and to make complex decisions about how they’re related. I’ve described many of the challenges in achieving that here. Catering for copy/paste approaches, both in their guidelines and their reporting, limits the adoption of structured data to just the few bits and pieces which Google want.
  3. Google’s preferences frequently diverge from schema.org’s definitions, in ways which are designed to simplify and enable this ‘copy/paste’ approach. That conflicting information causes confusion. Forums, blogs, and even reputable sites are full of conflicting, and often incorrect advice. Google’s approach to structured data documentation and reporting is mystifying and confusing the topic, rather than clarifying it.

All of this is bad for the open web. Tying our structured data implementations specifically to whichever features Google is currently rewarding (never mind that they frequently deprecate them), rather than richly describing our pages and our content (in a way which Google can still consume!), stifles innovation, and relegates schema.org to little more than a reference sheet for copy-and-paste Google hacks.

In an ideal world, every webpage would fully describe itself and all of its relationships, and other platforms/tools/companies would be able to consume and utilise that information. In today’s world, none of that is possible, because all of Google’s tools and reporting force webmasters to look ‘inwards’, not outwards.

What’s the answer?

The structured data reports in Google Search Console urgently need to be overhauled. They need to do a better job of visualising the relationships between entities, and the distinctions between properties on the page and properties of the page. That’s a complex challenge, but one which they should rise to if they want to encourage broader and deeper adoption.

To do that, they’ll need to reassess their internal team structures and operating processes. They’ll need consistent internal positioning on how their processes for data extraction, evaluation and reporting work across different schema ‘features’.

They’ll also need to change the narrative in their support documentation, to shift the focus away from copyable snippets, and towards constructing graphs of connected data. 

In many cases, that’ll mean shifting the processes away from facilitating blunt copying-and-pasting by individual website managers, and towards making stronger demands of technology vendors and underlying platforms to include rich structured data in their outputs. Individual webmasters and content editors shouldn’t – and can’t reasonably – be responsible for maintaining complex structured data about their pages (never mind the markup for all of the other pieces on their website). And the issues with Google’s reporting suite only serve to muddy webmasters’ understanding of what ‘good’ looks like.

Whilst Yoast has taken some great strides in standardising an approach for this kind of connected data, Google should be doing more to solve the problems of adoption and utilisation further upstream. We shouldn’t be leading this charge, Google should be. But at the moment, there’s no sign that they understand the scope of the opportunities at hand for getting this right, or the risk we face if they (continue to) get it wrong.

6 responses to “Google have to improve their structured data reporting”

  1. Jono, I feel the exact same way, mate. And as an owner of one of the big eCommerce platforms out there, I too want to set the standard in entity generation and management through Schema. So let’s just keep our fingers crossed that Google picks this up and runs with it, as opposed to what they’ve been doing so far.

  2. Mack says:

    Hi Jono – I’ve been digging into schema and structured data for the last week and a bit, trying to learn as much as possible about the subject. It’s been difficult getting a comprehensive understanding of it and how to use it across different use cases, but little by little I’m piecing it together. This article might be a bit ahead of me in terms of my knowledge, but I appreciate what you had to say and will circle back to it at some point (it also introduced me to the Yoast documentation on the subject, which was a lot of help). This might not be the best place to ask this question, but it honestly feels like there isn’t a conclusive answer to structured data, so you tend to get a lot of different answers (or no answers) depending on where you ask.

    Anyway, here’s my question: I’m setting up schema for what I consider a local business (it’s an online service marketplace that sells its services locally, nationwide). This business is a local service business and doesn’t get visitation from the public in any way. Would I use the @type LocalBusiness to define this business, or would I simply use Organization, WebSite, and WebPage to try and describe everything and link everything together in some fashion? Hope to hear back from you, and appreciate it again. Thanks.

    • Hey Mack, thanks for your comment! I’m glad that the Yoast schema documentation was helpful.

      This is a really great question. If the business doesn’t have a public-facing property (i.e., a store), then I’d not use LocalBusiness markup. The schema.org definitions are relatively unopinionated in this regard, but Google expects a local business to have a publicly visitable address, opening hours, and so on.

      This is definitely a point of contention, though – there are lots of scenarios like this which Google does a poor job of supporting. E.g., there’s no way for businesses to support multiple/split opening hours across the course of a day (e.g., a siesta or period of closure in the afternoon, as is common in some European countries). Hopefully, as more people ask these sorts of questions, they’ll improve their support!

  3. Excellent article! I’ve been learning more and more about structured data and SEO for the past several months, and Google’s documentation has been driving me bonkers. Conflicting information, lack of information, arbitrary requirements, apparent disregard for the entire point of knowledge graph and structured data in the first place… good grief! 

    It sucks that webmasters are faced with such a choice: 

    • Embed structured data the right way: clear and connected.
    • Embed structured data the “Google” way: a nightmare (but it performs better in Google search).

    Do you have any thoughts on how other search engines compare in this regard?

    • Jono Alderson says:

      I share your pain!
      I know that Yandex have been doing some interesting stuff (with ‘islands’) for a long time, but I don’t think that’s matured in line with schema.org or modern standards (i.e., JSON-LD). Bing have some pretty rudimentary support for the basics, but nothing nearly as comprehensive as Google.
      The challenge we have is that, whilst Google remains the largest / only significant consumer of schema.org markup, they can essentially make their own rules – and they’re influential enough that their interpretation and preferences end up becoming synonymous with the standards themselves, eventually subsuming them.
      I should be clear that not all of their approaches are bad, but, there are definitely places where they’re inconsistent, conflicting, and/or create the wrong kinds of incentives and behaviours!
