Sunday, May 13, 2007

User powered content and SemWeb

As usually happens with emergent, long-term concepts, people tend to go overboard, attempt to boil the ocean, and lose touch with reality. The semantic web in its current direction appears to be laden with fantastic descriptions but offers businesses little incentive to agree on one standard.

Obviously, all new technologies take time to mature and be adopted by businesses, but for every successful adoption there are a hundred other candidates that died a natural death in the lab. Let's see what issues might prevent web applications from using other web applications to create a fantastic, meaningful web.

Inherent problem with user powered content

How user-contributed content is described in its metadata plays a very significant role in SemWeb. Metadata is what structured search engines would use to locate relevant information. Unfortunately, there are inherent problems [Metacrap].

People are lazy
Let's face it, most of us are lazy. An effective SemWeb is only possible if a complicated set of metadata, in terms of taxonomy or the conceptual relationships between items, is described correctly. Creating and maintaining these schemas, or adapting a pre-existing one, is no trivial task. The solution is not to open this up to people, because there is no incentive for a user to take the pain of going beyond a couple of clicks.

People lie
It's no wonder that your inbox is inundated with spam whose subject line reads "Here is the information you requested". It is equally no wonder that your web search for "enterprise application integration" might return hundreds of press releases from companies that have nothing to do with EAI.

Lying to compete for business exists in the real world as much as in the online world. How do we ensure that metadata is reliable? This is one place where democracy does not really work.

People don't agree on terminology
You can't force people to describe something your way. I might like to call porn an erotic thriller, and I can't stop you from calling a nude beach a naturist shorefront. Your wish!
However, this creates a big problem for SemWeb, because computers cannot reliably figure out that these terms refer to the same thing.
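
To make the terminology problem concrete, here is a minimal Python sketch of what an application would need before it could treat two user-supplied labels as the same thing. The mapping table and the canonical label "clothing-optional beach" are purely hypothetical; somebody still has to build and agree on such a table, which is exactly the hard part.

```python
# Hypothetical vocabulary: free-form user tags mapped to one agreed concept.
# Nothing here comes from a real standard; it only illustrates the idea.
CANONICAL = {
    "nude beach": "clothing-optional beach",
    "naturist shorefront": "clothing-optional beach",
}


def normalize(tag: str) -> str:
    """Lower-case the tag, collapse whitespace, and map it to its canonical concept."""
    key = " ".join(tag.lower().split())
    return CANONICAL.get(key, key)


def same_concept(a: str, b: str) -> bool:
    """Two tags match only if they normalize to the same concept."""
    return normalize(a) == normalize(b)
```

Any tag missing from the table falls through unchanged, so a term nobody anticipated is simply treated as a different concept.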


Sunday, May 6, 2007

Beyond the structured web

Imagine your city library without computerized catalogs, without any specific aisle assigned to a specific category of books, and with just a computer that has all the text of all the books indexed. Assuming you found something, you could put it back anywhere you wanted.

The web is not a whole lot different today. How do you find a book by "Nora Roberts" in fiction in this library? If you were looking for information about marriage, would it pull up books on "monogamy", "polygamy", and "civil union"? This is where Google's search engine becomes totally ineffective (besides its other problems).

As a first step, SemWeb is an attempt to enable web applications to categorize and describe the information they hold in a parlance that computers can chew on. As a second step, it goes further, inside the information, something even the latest library cataloging system cannot do. For example, you could not search a library for cookbooks that contain a recipe with chicken and eggplant as ingredients, could you? With the information structured on the web, that should be pretty trivial.
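
As a rough illustration of why the cookbook question becomes trivial once the information is structured, here is a small Python sketch. The Recipe record and its fields are my own assumptions for the example, not any published schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    # Hypothetical structured record; the field names are illustrative only.
    title: str
    book: str
    ingredients: frozenset


def books_with_ingredients(recipes, wanted):
    """Return the sorted titles of books containing a recipe that uses every wanted ingredient."""
    wanted = {w.lower() for w in wanted}
    return sorted(
        {r.book for r in recipes if wanted <= {i.lower() for i in r.ingredients}}
    )
```

Calling books_with_ingredients(recipes, ["chicken", "eggplant"]) is then a plain set comparison, with no guessing from free text.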

All of this is still about arranging the available information in a syntax, but how do you make sense of it? Isn't the "semantic" part of the semantic web supposed to imply that? Drawing meaning out of information? When does syntax become semantics?

Tim Berners-Lee's vision of the semantic web:

I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A ‘Semantic Web’, which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The ‘intelligent agents’ people have touted for ages will finally materialize.

Intelligence, unfortunately, has varying connotations. If you refer to my earlier post, the intelligence of finding an available golf course, ensuring its rating is above a certain level, and getting busy/free data from the invitees' shared calendars is just a set of operations pieced together by the human who authored such an agent. Real intelligence must possess the ability to reason deductively, the power of inference, and the ability to match such reasoning with an objective.

Meaning comes from the relationship of one piece of data with another, resulting in a pattern that "intelligence" can piece together. That is something we humans are good at. Computers suck at it. Beyond the structure of information, which can be used to build useful applications such as the golf tee time scheduler or my personal shopping agent, this set of information still remains rather meaningless.

This is where the first word of the phrase "semantic web" gets confusing. There are two stages to SemWeb. The first is concerned with giving structure to the world wide web and making it appear more like a world wide database to computers. The second is to develop technologies to comb through the rather chaotic stream of structured information and find patterns in it. The second stage overlaps with areas of artificial intelligence and is still largely in research labs.

Regardless of where the "semantic" in semantic web lands, one thing is very clear. We need a "structured" web to build the next generation of applications, something that is gaining ground in new startup companies. Freebase, launched by the parallel computing pioneer Danny Hillis, is attempting to collaboratively categorize and connect the existing web in a structured format. Cylive, co-founded by me, is a social publishing platform that allows an author to categorize content, describe its metadata, and store every piece of content with multiple attributes, enabling structured publishing, something that can be easily chewed up by other semantic web agents.

In my subsequent posts, I will share my thoughts on the challenges of SemWeb from an implementation as well as a business point of view.

Tuesday, May 1, 2007

Data interchange

Since time immemorial, the computer representation of data for information interchange has gone through countless formats and standards. Almost every company, group, and consortium came up with its own data interchange format, obviously touting it as the best of the lot. Now that the semantic web is the talk of the town (at least in the web 2.0 town), thankfully we are well positioned with ubiquitous XML, resurrected RDF, and esoteric OWL.

A gentle primer on semantic web

Much of today's web evolved around the Mosaic and Netscape browsers. It was an obvious choice for web-based information providers to build everything to cater to these browsers. The result was astounding. The explosive growth is there for all of us to see.

Yet there was one problem. Computers in this ocean screamed "Water, water everywhere, not a drop to drink". The web got littered with information everywhere. Computers became faster, bandwidth availability grew exponentially, storage became cheaper. Yet all the tall promises of the early web, such as the refrigerator that would order groceries when the bread ran out, remained promises.

Sooner or later, it was imperative that the web get structured in a way that lets a computer discover information, explore capabilities, and create relationships between all the possible information on the web. The semantic web is the direction the web has to evolve in.

How does this direction affect the applications of tomorrow? Let's talk about an example. Suppose I wanted to create a web application that lets me schedule a tee time with my friends at a top-rated golf course within driving distance. This will require the application to check calendar free/busy data from all the different calendaring applications my friends use, query a sports directory for a list of golf courses within 50 miles of my home, and consult multiple sites to filter this list down to courses with an overall user rating of 4 or more. Finally, it has to check each golf course's site for availability over the next 7 days.
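
If all of that data could be fetched in structured form, the logic itself would be short. The Python sketch below assumes it already has been: the Course record, its fields, and the shape of the busy-day data are hypothetical stand-ins for whatever the calendars, directories, and rating sites would actually publish.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Course:
    # Hypothetical record merged from a sports directory, rating sites,
    # and the course's own availability feed.
    name: str
    miles_from_home: float
    rating: float
    open_days: set  # dates on which a tee time is still available


def candidate_courses(courses, max_miles=50, min_rating=4.0):
    """Keep courses within driving distance that are rated well enough."""
    return [
        c for c in courses
        if c.miles_from_home <= max_miles and c.rating >= min_rating
    ]


def tee_time_options(courses, busy_by_friend, start, days=7):
    """Pair each day in the window on which nobody is busy with every suitable open course."""
    window = [start + timedelta(days=i) for i in range(days)]
    free_days = [
        d for d in window
        if all(d not in busy for busy in busy_by_friend.values())
    ]
    return [
        (d, c.name)
        for d in free_days
        for c in candidate_courses(courses)
        if d in c.open_days
    ]
```

Here busy_by_friend maps each friend's name to the set of dates on which they are busy. The filtering is a handful of lines; getting the inputs is the whole problem.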

It is not only a nightmare to build an application like this today, it is practically impossible. Why? Because the information it assumes, such as calendars and user ratings, is not available in a form my application could interrogate. To accomplish such a task today, one of the friends has to spend hours (if not days) arranging it using the information available on the web, or start a flurry of emails to gather it.

What if you wanted to buy a used camera? Even today, you will spend countless hours surfing to find a deal. What if you had a software agent that you could simply give the make and model of the camera you were interested in, and it could find a couple on eBay, one on craigslist, and three different ecommerce dealers who were ready to ship it to you at the lowest price available anywhere? The agent would know your zip code, estimate shipping and tax, and perform the comparison all by itself. Is there an incentive for eBay to publish its information in a structured fashion? Absolutely. Everybody wants a bigger audience.
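
The comparison step of such an agent might look something like the Python sketch below. The Offer record is hypothetical, and the cost formula (tax charged on the item price only, shipping added afterwards) is a simplifying assumption of mine; real tax and shipping rules vary by seller and location.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    # Hypothetical structured listing as a seller might publish it.
    seller: str
    price: float
    shipping: float   # estimated for the buyer's zip code
    tax_rate: float   # e.g. 0.08 for 8%, 0.0 if no tax applies


def total_cost(offer: Offer) -> float:
    """Landed cost, assuming tax applies to the item price but not to shipping."""
    return round(offer.price * (1 + offer.tax_rate) + offer.shipping, 2)


def best_offer(offers):
    """Pick the offer with the lowest landed cost."""
    return min(offers, key=total_cost)
```

Notice that the cheapest sticker price does not necessarily win once shipping and tax are included, which is the kind of comparison a person does by hand today.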

The semantic web promises to make creating such applications much easier, perhaps even letting users design them themselves.