Microformats: is that thing still around?

Let's see how we got here in the first place. No, not conception; I'm talking about the communication between humans and machines. Here it goes:

When I write "I love books!", I mean something like: I enjoy reading, buying and talking about books and the material within.

When I write "I heart books!", I mean pretty much the same thing.

Observe that the syntax - how I said it - has changed, yet the semantics - the meaning behind what I said - stayed the same.

Machines see the syntax of data the same way humans do, yet they have no semantic understanding of it. They cannot build information from the data because there is no standard way to make sense of it at the machine level. The semantics are lost, and they [machines] don't really know what just happened.

What we need to solve is the communication problem. We need to get the machines to understand what is really in our documents to help us better with what we need.

So then, why can't we use HTML or XML to solve this on Web documents and get on with it? Well, yes, hang on. HTML falls short of capturing this extra information simply because we have too many rich languages, and their dialects, to handle at this scale. XML doesn't cut it either, because it hasn't worked as well in practice as we had hoped (i.e., there is no standard DTD that we all agree on). What's a girl to do? Read on!

What about common patterns? This is where it gets interesting because we understand that we have similar data on the Web. We just have to find a way to get to it efficiently. We even have existing and widely adopted standards for these common patterns. For instance, we have ISO 8601 to represent date and time. We have vCard (electronic business card format), RFC 2445 (calendar data exchange), Atom (like RSS, a syndication format), WGS84 (reference frame for the earth) and so on.

It turns out that we use a lot of this information in our (mostly HTML) documents, yet since most documents are marked up in their own way, we have no efficient way to grab this data. We could of course write a unique script to grab the contact information from site A and then another script for site B, but we soon see that it is a futile exercise given the vast number of Web pages out there.

Let me quickly mention that there are several - not necessarily competing - approaches to solving this general problem and I would like to think of them in different layers where each will succeed on its own. This article leans more toward microformats.

Here is the turning point: We could leave the documents as they are - doing their own thing - but pass along simple bits of information that tell machines what each piece of content really is. Here is one way to mark up a person:

<span class="person">Sarven Capadisli</span>

And here is one way to mark it up so that it uses the existing vCard standard:

    <span class="vcard">
        <span class="fn n">
            <span class="given-name">Sarven</span>
            <span class="family-name">Capadisli</span>
        </span>
    </span>

This would be an HTML representation of vCard. We call it hCard.

The additional syntax potentially allows the machines (when I say machines, I mean scripts written in JavaScript or PHP, for instance) to pick up on this contact information, provided that we tell them what (vCard) to look for. This is the key! This is how microformats goes at it: locate these common formats in our documents and mark them up based on the existing standards. We can now collect this information much more easily than before. And it goes on.
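
To make that concrete, here is a minimal sketch of such a script, written in Python with only the standard library. The class and function names (`HCardNameParser`, `extract_name`) are made up for this illustration, and it only looks for the two name properties used above; a real hCard parser handles far more.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are kept off the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class HCardNameParser(HTMLParser):
    """Collects the text of elements whose class list names a wanted property."""

    WANTED = ("given-name", "family-name")

    def __init__(self):
        super().__init__()
        self.stack = []  # one entry per open element: its property, or None
        self.found = {}  # property name -> text content

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        prop = next((c for c in classes if c in self.WANTED), None)
        self.stack.append(prop)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Credit the text to the nearest open element that carries a property.
        for prop in reversed(self.stack):
            if prop is not None:
                self.found[prop] = self.found.get(prop, "") + data.strip()
                break


def extract_name(html):
    """Return a dict of the name properties found in the given HTML."""
    parser = HCardNameParser()
    parser.feed(html)
    return parser.found
```

Feeding it the hCard snippet above should give back the given name and the family name as separate fields, no matter how the rest of the page is laid out. That is the whole point: one script, any site.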

Think of the possibilities where we can mark all contact information, events, locations, reviews, listings, resumes, tags, human relationships (and many more) all over the Web! We not only achieve a standard way to represent this data but we can also write clever tools and services to work with it. For instance, we can have a script that can map all the locations of the jobs an individual had directly from the HTML document. We can export calendar events into our desktop or Web calendars. Or even into our handheld devices. We can even subscribe to article entries on a Web page (like Atom/RSS). Who still wants to copy/paste?

We simply need to rethink our data.

We no longer have to say "Here is a photograph of me flying a kite last Saturday" and stop. We can certainly dissect this information further into smaller components where:

  • the event is located somewhere that we can map -- adr/geo
  • the occurrence of the event has a date/timestamp -- hCalendar
  • entities (e.g., humans, organizations, locations/objects) in the image can be identified -- hCard
  • description of the event can be provided -- hAtom
  • contextual tags can be used -- rel-tag

Each of these, of course, corresponds to a microformat: a (partial) HTML representation of a standard.
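
Here is a rough sketch of how a script could pull three of those components (hCalendar, geo and rel-tag) out of the kite caption. The markup in `SAMPLE`, including the date and the coordinates, is invented for this example, and `EventParser` and `parse_event` are hypothetical names. The parser is deliberately flat: it does not deal with nested properties the way a full microformats parser would.

```python
from html.parser import HTMLParser

# Hypothetical markup for the kite photo caption (date and coordinates invented).
SAMPLE = """
<div class="vevent">
  <span class="summary">Flying a kite</span> on
  <abbr class="dtstart" title="2007-06-02">last Saturday</abbr>
  <span class="geo">
    <span class="latitude">45.5</span>, <span class="longitude">-73.6</span>
  </span>
  <a rel="tag" href="/tags/kite">kite</a>
</div>
"""


class EventParser(HTMLParser):
    TEXT_PROPS = ("summary", "latitude", "longitude")

    def __init__(self):
        super().__init__()
        self.current = None       # property whose text is being read
        self.in_tag_link = False  # True while inside <a rel="tag">
        self.event = {"tags": []}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        rels = (attrs.get("rel") or "").split()
        if tag == "abbr" and "dtstart" in classes:
            # The machine-readable ISO 8601 date lives in the title attribute.
            self.event["dtstart"] = attrs.get("title")
        elif tag == "a" and "tag" in rels:
            self.in_tag_link = True
        else:
            self.current = next((c for c in classes if c in self.TEXT_PROPS), None)

    def handle_endtag(self, tag):
        # Flat model: any closing tag ends the property being read.
        self.current = None
        self.in_tag_link = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_tag_link:
            self.event["tags"].append(text)
        elif self.current:
            self.event[self.current] = text


def parse_event(html):
    parser = EventParser()
    parser.feed(html)
    return parser.event
```

Calling `parse_event(SAMPLE)` gives a dictionary holding the summary, the start date, the latitude and longitude, and the list of tags: the same sentence a human reads, now split into pieces a machine can map, schedule and file.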

It turns out that a single piece of a document can serve multiple purposes (reuse). For example, the name of a company can be marked up with hCard and still be represented as the name of an event (like a job) that a person attended. Nice!

So, where is all this heading?

If we are at Web 2.0 (take it easy, it's just a term) and if we eventually want to end up at Web 3.0 (capitalized Semantic Web where all the extra-cool stuff happens), then microformats is Web 2.5ish (lowercase semantic web). It bridges the gap by allowing us to carry some of the most widely used information forward.

We can transform a document that contains microformats into an RDF form using GRDDL (pronounced "griddle"). This is super-cool because microformats can help us bootstrap a lot of the data we already have on the Web to the point where we can query this information (as we would with SQL) using SPARQL (pronounced "sparkle") on RDF. Yes, I know, that is super-super-cool. We just went from Web of documents to Web of databases.
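
To give a feel for what "query like SQL" means here, the following is a toy sketch in plain Python. It is neither GRDDL nor SPARQL: the triples, the identifiers such as `#sarven`, and the helpers `match` and `attendee_family_names` are all made up for illustration. It only shows the idea that once data sits in subject-predicate-object form, questions become pattern matches.

```python
# Triples that a transformation of marked-up HTML might yield.
# All identifiers and predicate names here are hypothetical.
TRIPLES = [
    ("#sarven", "given-name", "Sarven"),
    ("#sarven", "family-name", "Capadisli"),
    ("#kite-event", "summary", "Flying a kite"),
    ("#kite-event", "attendee", "#sarven"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def attendee_family_names(triples, event):
    """Join two patterns: who attended the event, and what is their family name?"""
    names = []
    for _, _, person in match(triples, subject=event, predicate="attendee"):
        for _, _, name in match(triples, subject=person, predicate="family-name"):
            names.append(name)
    return names
```

Asking `attendee_family_names(TRIPLES, "#kite-event")` walks from the event to the person to the name, which is roughly the kind of join a SPARQL query expresses over real RDF.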

We have Web services (e.g. Technorati Microformats Search) and browser tools like Operator (and many more) that can pick up on these formats and do things with them. The next generation of browsers will certainly incorporate some of this functionality natively to provide a better user experience.

As far as the user experience goes, visible data (think of tags) is a lot more useful for humans than invisible metadata (think of meta keywords), which was mainly directed at the machines. If the information is relevant for the user, then it should certainly be visible. This is one of the design approaches of microformats.

Concerning search, the way we dig up information gets a wicked face lift. Search engines can now deliver far more accurate results than ever before. They no longer need to keep juggling many dynamic, complex algorithms across the Web just so that site owners can't easily reverse engineer them. Yes, I'm looking in your general direction, Google, Yahoo, MSN, Ask et al. The current way of finding relevant information is a total hack job, since the algorithms essentially brute-force their way through the documents and cross-reference them with one another, only to spit out some keyword matches and a measure of whether enough useful documents link to each page, give or take a few common SEO guidelines. Since we are constantly evolving the way machines find useful results for us, using microformats would be the next logical step. As far as SEO goes, machines need to catch up to the way humans look at documents. This means honest, relevant information for the user. No borderline tricks!

We will instead search like (totally made up) "Montreal pizza under_5_dollars open_until_11pm". Since we can identify locations, items, money (currencies etc.) and events in documents using microformats, search engines now simply need to parse what is indexed and output useful information back to the user in the SERPs. At this point, I'm not sure if we need to get into the implications of advertising here. It may be annoying to see relevant, accurate ads all the time, but at least it is potentially a lot more useful than what we have now. A good upgrade if you ask me.
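
As a back-of-the-napkin sketch of that made-up query, assume the search engine has already parsed the marked-up pages into structured records. The `Listing` fields, the sample data and the `search` function are all hypothetical, and the closing-time check naively assumes every place closes before midnight.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class Listing:
    name: str
    locality: str   # e.g. taken from an adr locality
    item: str
    price: float    # in dollars
    closes: time    # closing time, assumed to be before midnight


# Invented records standing in for what an index might hold.
LISTINGS = [
    Listing("Pizza A", "Montreal", "pizza", 4.50, time(23, 30)),
    Listing("Pizza B", "Montreal", "pizza", 6.00, time(23, 30)),
    Listing("Pizza C", "Montreal", "pizza", 4.00, time(21, 0)),
    Listing("Pizza D", "Toronto", "pizza", 3.00, time(23, 59)),
]


def search(listings, locality, item, max_price, open_until):
    """Return names of listings matching every structured criterion."""
    return [
        entry.name
        for entry in listings
        if entry.locality == locality
        and entry.item == item
        and entry.price < max_price
        and entry.closes >= open_until
    ]
```

With this data, `search(LISTINGS, "Montreal", "pizza", 5, time(23, 0))` keeps only the one place that is in Montreal, under five dollars and still open at 11pm. No keyword guessing involved, just a filter over fields the documents already declared.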

Keep in mind that microformats is not trying to solve everything but rather only simple, common things with what we already have. It is not trying to change or create a new approach (or a language) to what already works today.

Microformats are simple to implement; it is certainly not rocket science.

We need to write more tools and services, and come up with far more clever ways to mash up documents than what we have today. The possibility is there. Our ideas and the dedication to get them done will pave our way to improved communication not only for humans but also for machines (yes, even the cute, little ones).

I believe using microformats also helps the development team to have a common understanding of the content. On different layers, the communication in the team is kept accurate and consistent because any given piece of information, at its core, comes with a standard way to use it across all documents. This essentially speeds up processes and potentially delivers sound work. Not to mention a happy end-user!

I will wrap up with this: new formats are on the way and they are developed by an open community. The approach is to first investigate what is out there already; if there is no existing standard, then we identify the common patterns that are used on the Web. Fairly scientific, yes?

You can get involved and contribute your share for a common good:

Note: This article is loosely based on the microformats-01 presentation.