What is Schema Markup and How Structured Data Can Benefit Your Business
You may or may not have noticed, but over the past decade there has been a revolution in search technology that has deep implications for any business with a web presence. That revolution centres on semantics, the meaning behind words, which the SEO world knows as Schema Markup and Structured Data.
In this article we’re going to look at some of the background of search and how semantics have become important; the tools at our disposal to help index a site’s content semantically; a brief introduction to the practical side of this with supporting links; and an exploration of some of the fundamental concepts underpinning this – entities and trust. Finally, we’ll have a look at what is on offer as far as search results go and what the end user might see.
- A brief history of search and search technology
- The problem with Keywords without a context and the importance of disambiguation
- Machine readable semantic languages
- Schema.org – providing a universal semantic language for search engines
- Example of Schema Markup in JSON-LD
- Nodes, Entities, and the knowledge graph
- Reflections in SERPs – Rich Cards, Rich Snippets, and other human accessibility layers
The nature of online search technology has changed considerably since its birth in 1990. Once they had evolved a bit, from 1995 onwards, search engines worked by providing a list of possible answers to search queries based on the words they contained, and it was then left to the searcher to select whichever looked like the best match for where they wanted to go. This ambiguity, together with the transparency of the search engine algorithms, meant that there were many tricks one could use to gain high positions in the search results without necessarily containing the information that the search query was trying to find. In other words, search results were often not very useful to the searcher and there was a lot of “spam”, so to speak.
Early search engines such as Lycos, Yahoo, WebCrawler, Excite, AltaVista and Infoseek all made contributions to how end users could navigate and find things on the WWW, but they struggled to stay ahead of those seeking to game the systems. Natural language queries, methods of indexing, the crawling of meta-data, back-links – these were a few of the innovations that gave search engines far greater scope, but they also opened them up to being “gamed” and subjected to “Black Hat” techniques, which perhaps ironically gave rise to the vast majority of the SEO industry 15 years ago.
Google have made a lot of effort to try and put this sort of thing firmly in the past. Restrictions were put in place to stop spam with the advent of “nofollow” links in 2005, and the Panda update in 2011 started putting the squeeze on things like content farms and scraper sites, which back then were popular underhand ways to try and get those precious top page rankings. It was at this time that the search engines that had become dominant – Google, Yahoo! and Microsoft’s Bing – collaborated to come up with Schema.org, a key development for realising a semantic web, which we’ll look at in more detail later. Schema was designed to provide a fundamental lexicon of terms that would be useful for describing websites and the activities that go on around them.
The Google Hummingbird algorithm in 2013 was designed to make semantics central and it was at this point that Google started to seriously reward websites incorporating semantic markup and Schema. Context and meaning started to become a reality in internet search. The aim: to make search return accurate and relevant information for the end user. Since 2013 AI and semantics have become the heart of search engine technology and user experience has become absolutely central. As recently as 2017, a Google search algorithm update unofficially dubbed “Fred” punished sites with low-quality backlinks and sites that prioritise monetisation over user experience. These days websites not only have to do what they say on the tin and mean what they say, but speed, ease of use and clarity are also now crucial factors in getting listed in search results.
Even those who only have a passing interest in internet technologies are probably aware that the words associated with a web page are important. As with any index, keywords are used to match the most relevant results when trying to find something in a mass of information. However, searches based purely on keywords – strings of text – don’t take account of ambiguity, synonyms etc., and so searches on Google prior to Hummingbird were not particularly accurate and might have required quite a bit of digging around to find exactly what you were looking for. Inevitably, competition for certain keywords became very high, especially since the “real estate” of the Search Engine Results Pages has been shown to be effectively just the first 3 or 4 positions, and a whole industry was born to take advantage of searches based entirely on keywords.
The problem was both ambiguity of meaning and also ambiguity of identity. To illustrate the issue here’s an example from the Wikipedia page on “Word-sense Disambiguation”:
“To give a hint of how all this works, consider three examples of the distinct senses that exist for the (written) word “bass”:
- a type of fish
- tones of low frequency
- a type of instrument
and the sentences:
- I went fishing for some sea bass.
- The bass line of the song is too weak.
To a human, it is obvious that the first sentence is using the word “bass (fish)”, as in the first sense above, and that in the second sentence the word “bass (instrument)” is being used as in the latter sense. Developing algorithms to replicate this human ability can often be a difficult task, as is further exemplified by the implicit equivocation between “bass (sound)” and “bass (instrument)”.”
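The gap between matching strings and matching meanings can be sketched in a few lines of Python. This is an illustration only, not how any search engine is implemented: the two toy documents, the `entity` labels and the choice of Wikipedia URLs as identifiers are all assumptions made for the example.

```python
# Two toy documents. The "entity" field is a hypothetical label saying which
# sense of "bass" each text is about, identified here by a Wikipedia URL.
DOCUMENTS = [
    {
        "text": "I went fishing for some sea bass.",
        "entity": "https://en.wikipedia.org/wiki/Bass_(fish)",
    },
    {
        "text": "The bass line of the song is too weak.",
        "entity": "https://en.wikipedia.org/wiki/Bassline",
    },
]


def keyword_search(keyword, documents):
    """Return the texts that contain the keyword as a plain string."""
    return [d["text"] for d in documents if keyword.lower() in d["text"].lower()]


def entity_search(entity_uri, documents):
    """Return the texts that are tagged with exactly this entity."""
    return [d["text"] for d in documents if d["entity"] == entity_uri]


# A string match cannot tell the two senses apart, so both texts come back.
print(keyword_search("bass", DOCUMENTS))

# Matching on the identifier returns only the text about the fish.
print(entity_search("https://en.wikipedia.org/wiki/Bass_(fish)", DOCUMENTS))
```

The point of the sketch is that the identifier, not the word, carries the meaning. That is the job structured data does for a web page.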
Whilst machines can manipulate pure data at incredible speeds, the measure of context is a recent development. Semantics – the differences and/or similarities in perceived meanings of words, sentences, and/or phrases, relative to context – is now a factor in the machine algorithms running search engines. Google has therefore put a lot of effort into making it clear to those submitting websites and pages that their content must be understandable to these machine algorithms, and into making it easy for them to specify the meaning of their web pages with a high degree of accuracy.
The vocabulary used by web pages needs to be genuine. It used to be enough simply to include certain words in the text of pages in order to rank for them, but no more! Google’s artificial intelligence algorithm (RankBrain) detects patterns of search queries, their context and consequent user behaviour. This has resulted in the focus shifting very strongly towards simply providing a good user experience and useful information, because the real advantage is now in content with a clear context solidified around trust. Ambiguity is no longer useful, and as a result, confusing and un-engaging content is irrelevant. Google is now giving major boosts to quality content. Why? The end user, the querent. From their behaviour Google can now tell if the search results gave a meaningful answer to what they were looking for.
In the early days of the search engines a large percentage of the vetting of site submissions used to be done “by hand” in order to give search results some accuracy. However, the scale of growth in the WWW and the number of pages and sites to be tracked simply became too much for that technique (though it is still used to calibrate search accuracy, apparently). A machine algorithm became essential and, by default, a language for it. And thus, we have…
In 2011 the major search engines collaborated to lay down a vocabulary and language so that content made available on the internet could be read by machines. The result was the Schema ontological framework – a vocabulary to be used in conjunction with a certain type of markup that makes the content of a web page explicit in the metadata. Since then JSON-LD has steadily been gaining ground as the preferred markup method, over RDFa or Microdata. A quick look at the entire Schema hierarchy should give you some idea of its scope.
The vocabulary is made up of URIs – stable nodes in the web that are used as atomic or basic statements in the language.
“A Uniform Resource Identifier (URI) is a string of characters designed for unambiguous identification of resources and extensibility via the URI scheme.” – Wikipedia https://en.wikipedia.org/wiki/URI
So, for example, to identify unambiguously the nature of your business you might use the URI http://schema.org/HealthAndBeautyBusiness or the more general http://schema.org/LocalBusiness. Whilst Schema.org sets a fundamental set of terms, these can be added to by bringing in other ontologies. For example, auto.schema.org adds specific terms for cars, motorcycles, etc. Similarly, Wikipedia can be used as a stable reference possessing unique addresses for concepts and things. Because Wikipedia is such a well-established and trusted entity, its pages can also be used as part of the ontological language.
Schema.org is supposed to provide a universal semantic language for those wanting to mark up their identity and content with machine readable data, and since it is based on RDF it can then be linked to many other ontologies that have their own URIs for particular terms. The data that it holds can then be interpreted and linked in to other semantic statements.
“Uniform Resource Identifiers (URI) are a single global identification system used on the World Wide Web, similar to telephone numbers in a public switched telephone network. URIs are a key technology to support Linked Data by offering a generic mechanism to identify entities (‘Things’) or concepts in the world.” – Australian Government Linked Data Working Group
So, a semantic statement would be formed as a “triple”, made up of stable URIs. For example, we could state in semantic terms:
That is: This Local Business provides the Service Digital Marketing.
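A triple has just three parts: a subject, a predicate and an object. As a minimal sketch in Python, the statement could be held as below. The `example.com` identifiers for the business and the service are made up, and using Schema.org’s `provider` property to link the service to the business is one possible way of modelling it, not the only one.

```python
from typing import NamedTuple

SCHEMA = "http://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical identifiers for the business and the service it offers.
BUSINESS = "https://example.com/#business"
SERVICE = "https://example.com/#digital-marketing"


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


TRIPLES = [
    Triple(BUSINESS, RDF_TYPE, SCHEMA + "LocalBusiness"),
    Triple(SERVICE, RDF_TYPE, SCHEMA + "Service"),
    # One way to say "this business provides this service" with Schema.org terms.
    Triple(SERVICE, SCHEMA + "provider", BUSINESS),
]


def to_ntriples_line(triple):
    """Format a triple of URIs roughly in N-Triples style: <s> <p> <o> ."""
    return "<{}> <{}> <{}> .".format(*triple)


for t in TRIPLES:
    print(to_ntriples_line(t))
```

Every part of every statement is a URI, which is what lets a machine connect it to statements made elsewhere.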
Using the Schema Ontology and JSON-LD markup this would look something like this:
"name": "Digital Marketing",
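To put that line in context, a complete block might look something like the sketch below. It is built as a Python dictionary and serialised with the standard `json` module. The business name and URL are placeholders, and describing a `Service` with a `provider` is an assumption about how the statement could be modelled rather than a definitive recipe.

```python
import json

# Placeholder business details: replace with real values before use.
markup = {
    "@context": "http://schema.org",
    "@type": "Service",
    "name": "Digital Marketing",
    "provider": {
        "@type": "LocalBusiness",
        "name": "Example Agency",
        "url": "https://example.com/",
    },
}

# Serialise to JSON-LD text and wrap it in the script tag used to embed it in a page.
json_ld = json.dumps(markup, indent=2)
snippet = '<script type="application/ld+json">\n{}\n</script>'.format(json_ld)
print(snippet)
```

The printed snippet is what would sit in the HTML of the page. It is worth running it through a structured data testing tool before publishing.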
This is a very brief and rudimentary example of implementing Schema with JSON-LD. There is so much detail in the Schema vocabulary that it quickly becomes overwhelming and, be warned, some of the documentation and examples given on the site are incomplete and/or confusing. It seems that the best way of using it is still being hashed out. However, the main pages on the site are definitely worth digesting.
Another thing to bear in mind when trying to tackle Schema to mark up your website is that the search engines have only implemented so many things from it. So, to begin with, it’s probably worth sticking to the examples given in Google’s Search Gallery. For further examples and information on writing JSON-LD markup check out some of the following sites:
“Things, not strings”
Illustration 1: The fundamental requirements for Semantic Search, courtesy of David Amerland
Entities in the semantic web are trusted points around which other data revolves. An entity’s address – how to reference and link to it – is its URI. http://schema.org/ is the base URI used for all the vocabulary of Schema. Companies and businesses can establish their entity by specifying a URI with a website and including semantic statements to that effect.
By embracing semantics, the nature of search on the internet has changed from being a system that could be easily fooled to a larger conversation where authority, trust, reputation and influence are integral to the stability of the node and where it sits in the overall landscape. Thus, websites representing businesses and companies need to be consistent, useful, reliable, and easy to access.
Illustration 2: The Semantic Web Tower, first proposed by Tim Berners-Lee, with annotation.
An entity could have many, many semantic statements attached to its node or URI. These describe its identity in terms that are machine readable. Relationships with the Entity can be stated, determined, and found. The language used to describe a business becomes central. Straplines, key concepts, themes, and identity could and should all point to URIs to strengthen their place in the WWW and help form a clear picture of the presence of the business online. The sum of these forms the Digital Signature and this is, to some extent, what is visualised in the Knowledge Graph on Google. Associated images, facts, data and so on can be drawn from all over the Web to populate this.
Back in 2006 Tim Berners-Lee described linked data as follows:
“The Semantic Web isn’t just about putting data on the web. It is about making links, so that a person or machine can explore the web of data. With linked data, when you have some of it, you can find other, related, data.”
And that is largely what Google’s Knowledge Graph is intended to consolidate so that answers to user questions and queries can be displayed in one place directly on Google.
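The idea of starting from one piece of data and finding related data can be illustrated with a toy graph. The sketch below is a deliberately tiny illustration, not a picture of how the Knowledge Graph is built: the nodes, the `example.com` URIs and the `related` helper are all invented for the example.

```python
from collections import deque

# A tiny hand-made graph: each (subject, predicate, object) links two nodes.
# All URIs under example.com are hypothetical.
TRIPLES = [
    ("https://example.com/#business", "http://schema.org/location", "https://example.com/#office"),
    ("https://example.com/#office", "http://schema.org/containedInPlace", "https://en.wikipedia.org/wiki/London"),
    ("https://example.com/#business", "http://schema.org/founder", "https://example.com/#jane"),
]


def related(start, triples):
    """Breadth-first walk: return every node reachable from start by following links."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        for subject, _predicate, obj in triples:
            if subject == node and obj not in seen:
                seen.add(obj)
                order.append(obj)
                queue.append(obj)
    return order


print(related("https://example.com/#business", TRIPLES))
```

Starting from the business alone, the walk reaches its office, its founder and, one step further on, London. None of them had to be known in advance.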
Here is an excellent example of some of these concepts being applied by Connecting Data: London as a Graph – make sure to check out the Schema in Google’s Structured Data Testing Tool. More recently Google Search Console (formerly Webmaster Tools) has introduced a new reporting section for unparsable structured data. Should you implement any Schema on your website, any errors or opportunities for improvement will be flagged to you in this report.
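Before relying on those reports, a quick local check can catch the most basic problem: markup that is not valid JSON. The sketch below uses only Python’s standard library. The sample HTML is invented, and the check covers JSON syntax only, not the Schema.org vocabulary or Google’s own requirements.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._buffer))
            self._in_json_ld = False


def check_json_ld(html):
    """Return a list of (is_valid, parsed_data_or_error_message), one per block."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    extractor.close()
    results = []
    for block in extractor.blocks:
        try:
            results.append((True, json.loads(block)))
        except json.JSONDecodeError as error:
            results.append((False, str(error)))
    return results


# Invented sample page: the first block is valid, the second has a trailing comma.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@context": "http://schema.org", "@type": "LocalBusiness"}</script>
<script type="application/ld+json">{"@type": "Service", "name": "Digital Marketing",}</script>
</head><body></body></html>
"""

for is_valid, detail in check_json_ld(SAMPLE_HTML):
    print("valid" if is_valid else "unparsable", detail)
```

A block that fails here will certainly fail in the search engine’s parser too, but a block that passes still needs checking with the proper tools.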
Semantic statements are underpinned by the quality and authority of their references (i.e. the URIs). The entity described by these statements takes its place in a wider conversation of context and meaning. For some idea of the extent of what this means have a look at http://lod-cloud.net/clouds/lod-cloud.svg – the LOD cloud is a visualisation of the extent of the Semantic Web and its many data silos, nodes and vocabularies.
Illustration 3: Can you spot your business entity? Find a place in the conversation. (Source: bordalierinstitute.com)
Comparison of Google’s SERPs between 2011 and 2018: Rich Cards, Rich Snippets and Other Human Accessibility Layers
These two screenshots should give you some idea of the changes that have been taking place in Google’s Search Result Pages since the introduction of the Knowledge Graph. The main thing to note is that the “10 blue links” have been slowly breaking up and various “Cards” and “Rich Snippets” now sit amongst the search results. In terms of SERPs, this is the payoff for implementing Schema markup on your site. In terms of the Semantic Web Tower illustrated earlier this is the top level – the human access – and it is a sign of things to come.
Illustration 4: Google SERPs c.2011
Illustration 5: Google SERPs c.2018
It should be understood that semantic markup doesn’t necessarily help with improving ranking (though structured data markup is becoming increasingly essential), but what it will do is improve visibility and make it possible to provide specific answers to specific questions. When a search query elicits a particular entity, its appearance in search results will have a lot more detail. In other words, it will be content rich, with associated images, information, data and links all appearing, hooked in to that entity’s web presence. Here’s a good article highlighting the benefits of engaging in Structured Data markup.
Click-through rate (CTR) is important, as is dwell time on a page. Good quality content and an excellent user experience are essential, and this is what will get your content and data appearing in those Rich Cards and Snippets. When users click on your listing and spend time on the page, or share it and link to it, all that will be read by Google and solidify that page’s internet presence.
So, it’s really important for businesses themselves to have a clear picture of who they are, what they do and why. Marketing experts are well acquainted with this sort of idea as it has very strong ties to branding. The difference here is that every term can be linked to other entities on the WWW in order to disambiguate and make the message strong and clear.
Furthermore, the massive increase in voice searches and other non-PC queries makes it even more important that search engines can access the essential data associated with an entity. Once Google has a verified and solid graph it can use that to provide direct answers to questions.
Whilst semantic markup is not yet obligatory it is fast becoming essential as the internet pitches towards “Things” and their presence as “Entities” with a place in the WWW. Remember that all this is aimed at helping the end user by increasing accuracy, reliability, trust, and authority. It is essentially a return to the initial premise of creating content that is valuable and easy to find for the user. Yet it is also a fundamental shift in the basis of connectivity and search on the internet and the effects are only beginning to be felt as the data, its context and meaning are sorted and analysed by the AI behind the big search engines. The transformation of the medium into one of entities connected by their meaning rather than arbitrary links brings the internet closer to the ideal as conceived by Tim Berners-Lee. Ultimately, looking at your digital presence with this in mind will help in your own appreciation of what you do, what you represent and where you are as a business. And your visibility will be a signifier of how profound that appreciation and your ability to take part in the online conversation really are.