The semantic web and linked open data open up the vision of scientific data published in machine-readable form. But their adoption faces some challenges, many self-inflicted.

Last week I taught a short course on programming context-aware systems at a Swiss doctoral winter school. The idea was to follow the development process from the ideas of context, through modelling and representation, to reasoning and maintenance.

Context has some properties that make it challenging for software development. The data available tends to be heterogeneous, dynamic and richly linked. All these properties can impede the use of normal approaches like object-oriented design, which tend to favour systems that can be specified in a static object model up-front. Rather than use an approach that’s highly structured from its inception, an alternative approach is to use an open, unstructured representation and then add structure later using auxiliary data. This leads naturally into the areas of linked open data and the semantic web.

The semantic web is a term coined by Tim Berners-Lee as a natural follow-on from his original web design. Web pages are great for browsing but don’t typically lend themselves to automated processing. You may be able to extract my phone number from my contact details page, for example, but that’s because you understand typography and abbreviation: there’s nothing that explicitly marks the number out from the other characters on the page. Just as the web makes information readily accessible to people, the semantic web aims to make information equally accessible to machines. It does this by allowing web pages to be marked up using a format that’s more semantically rich than the usual HTML. This uses three additional technologies: the Resource Description Framework (RDF) to assert facts about objects, SPARQL to access the model using queries, and the Web Ontology Language (OWL) to describe the structure of a particular domain of discourse. Using the example above, RDF would mark up the phone number, email address etc. explicitly, using terminology described in OWL to let a computer understand the relationships between, for example, a name, an email address, an employing institution and so on. Effectively the page, as well as conveying content for human consumption, can carry content marked up semantically for machines to use autonomously. And of course you can also create pages that are purely for machine-to-machine interaction, essentially treating the web as a storage and transfer mechanism with RDF and OWL as semantically enriched transport formats.

So far so good. But RDF, SPARQL and OWL are far from universally accepted “in the trade”, for a number of quite good reasons.

The first is verbosity. RDF uses XML as an encoding, which is quite a verbose, textual format. Second is complexity: RDF makes extensive use of XML namespaces, which add structure and prevent misinterpretation but make pages harder to create and parse. Third is the exchange overhead, whereby data has to be converted from the in-memory form that programs work with into RDF for exchange and then back again at the other end, each step adding more complexity and risk of error. Fourth is the unfamiliarity of many of the concepts, such as the dynamic non-orthogonal classification used in OWL rather than the static class hierarchies of common object-oriented approaches. Fifth is the disconnect between program data and model, with SPARQL sitting off to one side like SQL. Finally there is the need for all these technologies en masse (in addition to understanding HTTP, XML and XML Schemata) to perform even quite simple tasks, leading to a steep learning curve and a high degree of commitment in a project ahead of any obvious returns.
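
To make the verbosity and exchange overhead concrete, here is a minimal sketch using only Python's standard library. It encodes one small in-memory record as RDF/XML and decodes it again. The `http://example.org/...` URIs, the `ex` vocabulary and the helper names `to_rdf_xml` and `from_rdf_xml` are all assumptions invented for illustration, and the sketch handles only plain string properties; a real project would use an RDF library rather than hand-rolled XML.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Hypothetical vocabulary namespace, used only for this illustration.
EX_NS = "http://example.org/contact#"


def to_rdf_xml(subject_uri, properties):
    """Encode one in-memory record (string values only) as RDF/XML text."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("ex", EX_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": subject_uri}
    )
    for name, value in properties.items():
        ET.SubElement(desc, f"{{{EX_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


def from_rdf_xml(text):
    """Decode the text back into (subject_uri, properties)."""
    root = ET.fromstring(text)
    desc = root.find(f"{{{RDF_NS}}}Description")
    subject_uri = desc.get(f"{{{RDF_NS}}}about")
    # Strip the "{namespace}" part of each tag to recover the local name.
    properties = {child.tag.split("}", 1)[1]: child.text for child in desc}
    return subject_uri, properties


record = {"name": "Alice Example", "phone": "+00 0000 000000"}
encoded = to_rdf_xml("http://example.org/people/alice", record)
decoded_uri, decoded = from_rdf_xml(encoded)

print(encoded)
print(len(encoded), "characters of XML for", len(str(record)), "characters of data")
```

Even for two fields, the namespaces and wrapper elements dominate the encoded form, and the program needs a matching pair of conversion functions that must be kept in step with each other.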

So the decision to use the semantic web isn’t without pain, and one needs to place sufficient value on its advantages (open, standards-based representation, easy exchange and integration) to make it worthwhile. It’s undoubtedly attractive to be able to define a structure for knowledge that exactly matches a chosen sub-domain, to describe the richness of this structure, and to have it compose more or less cleanly with other such descriptions of complementary sub-domains defined independently, and to be able to exchange all this knowledge with anyone on the web. But this flexibility comes with a cost and (often) no obvious immediate, high-value benefits.

Having taught this stuff, I think the essential problem is one of tooling and integration, not core behaviour. The semantic web does include some really valuable concepts, but their realisation is currently poor and this poses a hazard to their adoption.

In many ways the use of XML is a red herring: no sane person holds data to be used programmatically as XML. It is, and was always intended to be, an exchange format, not a data structure. So the focus needs to be on the data model underlying RDF (subject-predicate-object triples with subjects and predicates represented using URIs) rather than on the use of XML.
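
The triple model is simple enough to sketch in a few lines of plain Python, with no XML in sight. The following is an illustration only: the URIs are hypothetical, and `match` is a made-up helper that mimics the simplest kind of pattern query, not an API from any RDF library.

```python
# Hypothetical URIs, invented for this illustration.
EX = "http://example.org/"

# A graph is just a set of (subject, predicate, object) triples.
triples = {
    (EX + "people/alice", EX + "vocab/name", "Alice Example"),
    (EX + "people/alice", EX + "vocab/email", "alice@example.org"),
    (EX + "people/alice", EX + "vocab/worksFor", EX + "orgs/university"),
    (EX + "orgs/university", EX + "vocab/name", "Example University"),
}


def match(graph, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in graph
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    )


# Everything known about one subject.
about_alice = match(triples, subject=EX + "people/alice")

# Follow a link: find the name of whatever alice works for.
employer = match(triples, EX + "people/alice", EX + "vocab/worksFor")[0][2]
employer_name = match(triples, employer, EX + "vocab/name")[0][2]
print(employer_name)
```

Because an object can itself be the subject of other triples, the set of triples forms a graph that can be traversed link by link, which is what makes the data "richly linked".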

While there are standard libraries and tools for use with the semantic web (in Java these include Jena for representing models, Pellet and other reasoners providing ontological reasoning, and Protégé for ontology development), their level of abstraction and integration with the rest of the language remain quite shallow. It is hard to ensure the validity of an RDF graph against an ontology, for example, and even harder to validate updates. The type systems also don’t match, either statically or dynamically: OWL performs classification based on attributes rather than by defining hard classes, and the classification may change unpredictably as attributes are changed. (This isn’t just a problem for statically-typed programming languages, incidentally: having the objects you’re working with re-classified can invalidate the operations you’re performing at a semantic level, regardless of whether the type system complains.) The separation of querying and reasoning from representation is awkward, rather like the use of SQL embedded into programs: the query doesn’t fit naturally into the host language, which typically has no syntactic support for constructing queries.
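
The re-classification problem can be illustrated with a deliberately crude sketch. It assumes that a class is nothing more than a condition over an individual's attributes, which is a large simplification of OWL (real reasoners work under an open-world assumption and support far richer class expressions). The class names and conditions below are hypothetical.

```python
# Hypothetical class definitions: each class is a predicate over attributes,
# loosely mimicking classes defined by conditions rather than by declaration.
CLASS_DEFINITIONS = {
    "Person": lambda a: "name" in a,
    "Employee": lambda a: "name" in a and "employer" in a,
    "Student": lambda a: "name" in a and "enrolled_at" in a,
}


def classify(attributes):
    """Return the set of classes whose conditions the attributes satisfy."""
    return {name for name, test in CLASS_DEFINITIONS.items() if test(attributes)}


individual = {"name": "Alice Example", "enrolled_at": "Example University"}
before = classify(individual)  # Person and Student

# Changing attributes silently changes the classification.
individual["employer"] = "Example University"
del individual["enrolled_at"]
after = classify(individual)  # Person and Employee

print(sorted(before), sorted(after))
```

Any code that was written on the assumption that `individual` is a `Student` is now operating on something that no longer is one, and nothing in the host language's type system notices the change.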

Perhaps the solution is to step back and ask: what problem does the semantic web solve? In essence it addresses the open and scalable mark-up of data across the web according to semantically meaningful schemata. But programming languages don’t do this: they’re about nailing down data structures, performing local operations efficiently, and letting developers share code and functionality. So there’s a mismatch between the goals of the two system components, and their strengths don’t complement each other in the way one might like.

This suggests that we re-visit the integration of RDF, OWL and SPARQL into programming languages; or, alternatively, that we look at what features would provide the best computational capabilities alongside these technologies. A few characteristics spring to mind:

  • Use classification throughout, giving a more dynamic type structure than classical type systems
  • Access RDF data “natively”, using binding alongside querying
  • Adopt the XML Schemata types “natively” as well
  • Make code polymorphic in the manner of OWL ontologies, so that code can be exchanged and re-used. This implies basing typing on reasoning rather than being purely structural
  • Hide namespaces, URIs and the other elements of RDF and OWL behind more familiar (and less intrusive) syntax (while keeping the semantics)
  • Allow programmatic data structures, suited to local use in a program, to be layered onto the data graph without forcing the graph itself into convoluted structures
  • Think about the global, non-local data structuring issues
  • Make access to web data intrinsic, not something that’s done outside the normal flow of control and syntax
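
As a rough sketch of what “native” binding and hidden URIs might feel like, the following plain-Python example layers ordinary attribute access over a tiny triple store. It is only one possible reading of the ideas in the list: `Graph` and `Node` are hypothetical classes written for this illustration (a stand-in for a real RDF library), and the vocabulary prefix is invented.

```python
EX = "http://example.org/vocab/"  # hypothetical vocabulary prefix


class Graph:
    """A tiny in-memory triple store (a stand-in for a real RDF library)."""

    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def objects(self, s, p):
        return sorted(o for (s2, p2, o) in self.triples if s2 == s and p2 == p)


class Node:
    """Binds attribute access to predicates on one subject in the graph."""

    def __init__(self, graph, uri, prefix=EX):
        # Bypass __setattr__ so these stay ordinary instance attributes.
        object.__setattr__(self, "_graph", graph)
        object.__setattr__(self, "_uri", uri)
        object.__setattr__(self, "_prefix", prefix)

    def __getattr__(self, name):
        # Called only for names not found normally: read from the graph.
        values = self._graph.objects(self._uri, self._prefix + name)
        if not values:
            raise AttributeError(name)
        return values[0] if len(values) == 1 else values

    def __setattr__(self, name, value):
        # Writes become new triples; the graph remains the source of truth.
        self._graph.add(self._uri, self._prefix + name, value)


graph = Graph()
alice = Node(graph, "http://example.org/people/alice")
alice.name = "Alice Example"
alice.email = "alice@example.org"
alice.email = "a.example@example.org"  # adds a second value, graph-style

print(alice.name, alice.email)
```

The program sees `alice.name` rather than a URI-laden query, while the underlying graph stays unchanged in shape. Note that assigning twice adds a second triple instead of overwriting the first, which shows how graph semantics leak through even a familiar syntax.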

The challenges here are quite profound, not least from relatively pedestrian matters like concurrency control, but at least we would then be able to leverage the investment in data mark-up and exchange to obtain some of the benefits the semantic web clearly offers.