Saturday, January 28, 2012

2012: The Year of the Semantic Web

In 1996, Tim Berners-Lee, director of the World Wide Web Consortium (W3C), defined modern Semantic Web technology with this vision:

If the interaction between person and hypertext could be so intuitive that the machine-readable information space gave an accurate representation of the state of people's thoughts, interactions, and work patterns, then machine analysis could become a very powerful management tool, seeing patterns in our work and facilitating our working together through the typical problems which beset the management of large organizations.

Fifteen years later, the Semantic Web is used in a variety of fields from art museum informatics to breast cancer research. Although the global implementation of the Semantic Web vision may be years from becoming a reality, many sophisticated IT departments are increasingly adopting semantic standards and migrating to semantic technology-based products to achieve the same benefits in their enterprise that the Semantic Web delivers to the Web. This technological trend will continue to penetrate industries as diverse as finance, medical devices, telecommunications, life sciences, and the intelligence community. In fact, I believe that 2012 will be the year of the Semantic Web.

Here are three use cases from 2011 that illustrate the growing impact of semantic technology in commerce and culture today, and why society is migrating toward a data-driven world.

1. Telecommunications -- The Siri Use Case

Even if you don't know anyone whose Christmas list included an iPhone 4S, it's still true that Apple sold 35 million iPhones in its first quarter, which ended in December. And it's estimated that Apple will sell 125 million more in 2012. That's a lot of people talking to themselves, I mean, to their voice assistant Siri. She's the virtual servant or concierge who can help you find a place to eat or stay, suggest activities, and provide you with directions (hopefully better than those dictated by my GPS). All you do is speak, click, or type, and your little helper collects information from a slew of websites, assisting you in your decision-making process. It can even secure a restaurant reservation for you, or an airplane ticket. All of this is why Siri and a few other features use TWICE as much data as the last iPhone model. The iPhone 4S even uses more data than the iPad.

Co-founder, CTO, and VP Design of Siri, Tom Gruber, is a pioneer in the world of the Semantic Web. A forerunner in using the Web to collect and share information, he is credited with defining "ontology" in a technical sense for computer science -- the first one to call "ontologies" a technology for enabling knowledge sharing. Gruber established the DARPA Knowledge Sharing Library and was among the founding innovative thinkers who laid the groundwork for what we now call the Semantic Web.

2. Enterprises -- The Best Buy Use Case

In December of 2009, Jay Myers, the Lead Development Engineer at Best Buy, published a strategic formula for business data and semantics. It consisted of three circles, the first two added together to produce the third: Externally facing linked open data + Internal linked data = Insights. He explains:

The external data sphere represents human and machine readable data that you'd want everyone to access. One of the primary vehicles gaining popularity on the web is RDFa, a way of utilizing richly annotated HTML to deliver data to machines while retaining the rich visual web human users have become accustomed to... The great thing about "front-end" semantic markup techniques is with a little additional knowledge and tools, it allows countless numbers of HTML devs to create a very rich web of data by simply adding data annotations to their HTML, essentially making the entire web an open and queryable database or API for us to extract knowledge from.

Was the strategy successful? According to an interview of Jay by Doc Sheldon of SearchNewsCentral.com ("RDFa: The Inside Story From Best Buy"), I would say yes. The Best Buy Lead Development Engineer said this:

Within just a couple of months, we began to see an increase in our organic search results. Before long, it had increased by 30 percent over historical rates. We also saw an increase in our click-through rate. Yahoo did a study a while back and found that people that had rich snippets on the results pages were seeing around a 15 percent increase in CTR, which has proven to be the case for us. And of course, it makes our web site "smarter" and more open to machines, which ultimately benefits customers.
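
To make the idea of annotated HTML concrete, here is a minimal sketch in Python. The markup is hypothetical (it is not Best Buy's actual page), and the `http://example.com/vocab#` vocabulary, the property names, and the `extract_properties` helper are all assumptions made for illustration. A real RDFa processor does considerably more than this; the sketch only shows how a machine might read data out of the same HTML a person sees.

```python
from html.parser import HTMLParser

# Hypothetical product markup. The vocabulary URL and the property names are
# illustrative assumptions, not taken from any real retailer's pages.
PAGE = """
<div vocab="http://example.com/vocab#" typeof="Product">
  <h1 property="name">Example Laptop</h1>
  <span property="price">499.99</span>
  <p>Free shipping on this item.</p>
</div>
"""


class PropertyExtractor(HTMLParser):
    """Collect (property, text) pairs from elements that carry a `property` attribute."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._current = None  # property name of the element being read, if any

    def handle_starttag(self, tag, attrs):
        prop = dict(attrs).get("property")
        if prop is not None:
            self._current = prop

    def handle_data(self, data):
        # Only text that follows an annotated start tag is recorded.
        if self._current is not None and data.strip():
            self.pairs.append((self._current, data.strip()))
            self._current = None

    def handle_endtag(self, tag):
        self._current = None


def extract_properties(html):
    """Return the annotated (property, value) pairs found in `html`."""
    parser = PropertyExtractor()
    parser.feed(html)
    return parser.pairs


if __name__ == "__main__":
    for prop, value in extract_properties(PAGE):
        print(prop, "=", value)
```

The unannotated paragraph about shipping is still shown to human visitors, but the extractor skips it because nothing marks it as data.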

3. Museum Informatics -- The Annapolis Historic Foundation Use Case

Finally, consider the recent collaboration between a museum collection in Annapolis, Maryland, and my technology firm, Orbis Technologies, Inc. The bulk of our business concentrates on delivering semantic applications to the Department of Defense and commercial clients with near-Internet data challenges. However, we were also able to use our technological capabilities to enhance the world of art exhibits. In a display that ran for seven months, we worked with the Annapolis Historic Foundation to showcase the work of a variety of craftsmen in Annapolis between the years 1700 and 1810, with special focus on portrait artists, silversmiths, and cabinetmakers.

Orbis essentially created an interactive knowledge application for the exhibit that facilitated cross-referencing of information on an artist or image. For example, by clicking on the name of silversmith William Faris, or cabinetmaker John Shaw, a person is able to access all other kinds of information related to the craftsman. As in millions of other use cases, semantic technology was used to create connections between different kinds of data available on points of interest -- in this instance, artisans and objects.

What Semantic Technology Can Do

These broad project applications of semantic technology share common components, of course. Successful implementations often have well-understood process workflows that support the generation of a defined product. Unique domain or industry vocabularies are often required for structured, semi-structured, and unstructured data. These project characteristics, combined with the correct products, can create successful semantic technology-driven projects that demonstrate the value and subsequent return on investment.

In other words, in the best of situations -- where optimal project characteristics are in place -- semantic technology can address common infrastructure problems associated with massive database integration efforts and data overload (i.e., too much data and not enough actionable information or real knowledge). The core semantic technology standards (e.g. RDF) provide machine-readable formats for explicitly describing relationships in a format that models human cognition, thereby creating information that facilitates the human decision process.

The Semantic Web allows us to invest our brain power on responsibilities and tasks that require alert human cognition -- and gives the tedious line checking and data grabbing to a machine who doesn't talk back, get grumpy or demand coffee.

That's why 2012 will be the year of the Semantic Web.

This article was originally posted at http://www.huffingtonpost.com/steve-hamby/semantic-web-technology_b_1228883.html

Give Me a Sign: What Do Things Mean on the Semantic Web?

Coca-Cola, Toucans and Charles Sanders Peirce

The crowning achievement of the semantic Web is the simple use of URIs to identify data. Further, if the URI identifier can resolve to a representation of that data, it now becomes an integral part of the HTTP access protocol of the Web while providing a unique identifier for the data. These innovations provide the basis for distributed data at global scale, all accessible via Web devices such as browsers and smartphones that are now a ubiquitous part of our daily lives.

Yet, despite these profound and simple innovations, the semantic Web’s designers and early practitioners and advocates have been mired in a muddled, metaphysical argument of at least a decade over what these URIs mean, what they reference, and what their actual true identity is. These muddles about naming and identity, it might be argued, are due to computer scientists and programmers trying to grapple with issues more properly the domain of philosophers and linguists. But that would be unfair. For philosophers and linguists themselves have for centuries also grappled with these same conundrums [1].

As I argue in this piece, part of the muddle results from attempting to do too much with URIs while another part results from not doing enough. I am also not trying to directly enter the fray of current standards deliberations. (Despite a decade of controversy, I optimistically believe that the messy process of argument and consensus building will work itself out [2].) What I am trying to do in this piece, however, is to look to one of America’s pre-eminent philosophers and logicians, Charles Sanders Peirce (pronounced “purse”), to inform how these controversies of naming, identity and meaning may be dissected and resolved.

‘Identity Crisis’, httpRange-14, and Issue 57

The Web began as a way to hyperlink between documents, generally Web pages expressed in the HTML markup language. These initial links were called URLs (uniform resource locators), and each pointed to various kinds of electronic resources (documents) that could be accessed and retrieved on the Web. These resources could be documents written in HTML or other encodings (PDFs, other electronic formats), images, streaming media like audio or videos, and the like [3].

All was well and good until the idea of the semantic Web, which postulated that information about the real world (concepts, people and things) could also be referenced and made available for reasoning and discussion on the Web. With this idea, the scope of the Web was massively expanded from electronic resources that could be downloaded and accessed via the Web to now include virtually any topic of human discourse. The rub, of course, was that ideas such as abstract concepts or people or things could be neither “dereferenced” nor downloaded from the Web.

One of the first things that needed to change was to define a broader concept of a URI “identifier” above the more limited concept of a URL “locator”, since many of these new things that could be referenced on the Web went beyond electronic resources that could be accessed and viewed [3]. But, since what the referent of the URI now actually might be became uncertain — was it a concept or a Web page that could be viewed or something else? — a number of commentators began to note this uncertainty as the “identity crisis” of the Web [4]. The topic took on much fervor and metaphysical argument, such that by 2003, Sandro Hawke, a staffer of the standards-setting W3C (World Wide Web Consortium), was able to say, “This is an old issue, and people are tired of it” [5].

Yet, for many of the reasons described more fully below, the issue refused to go away. The Technical Architecture Group (TAG) of the W3C took up the issue, under a rubric that came to be known as httpRange-14 [6]. The issue was first raised in March 2002 by Tim Berners-Lee, accepted for TAG deliberations in February 2003, with then a resolution offered in June 2005 [7]. (Refer to the original resolution and other information [6] to understand the nuances of this resolution, since particular commentary on that approach is not the focus of this article.) Suffice it to say here, however, that this resolution posited an entirely new distinction of Web content into “information resources” and “non-information resources”, and also recommended the use of the HTTP 303 redirect code for when agents requesting a URI should be directed to concepts versus viewable documents.
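
As a rough illustration of the 303 mechanics, here is a minimal sketch in Python. It is not the TAG's wording or any server's actual implementation; the `respond` function, the two lookup tables, and the `http://example.com/toucan.html` document URI are hypothetical names introduced only for this example. A URI that stands for a thing is answered with a 303 redirect to a document about that thing, while a URI for a viewable document is answered directly.

```python
# Hypothetical registry: URIs that stand for real-world things, each mapped
# to a document that describes the thing.
DESCRIBED_BY = {
    "http://example.com/toucan": "http://example.com/toucan.html",
}

# Hypothetical store of documents that can actually be returned.
DOCUMENTS = {
    "http://example.com/toucan.html": "<html>A page about toucans</html>",
}


def respond(uri):
    """Return (status, headers, body) for a GET request on `uri`."""
    if uri in DOCUMENTS:
        # A viewable document: serve it directly.
        return 200, {"Content-Type": "text/html"}, DOCUMENTS[uri]
    if uri in DESCRIBED_BY:
        # Not a document: point the agent at a description instead.
        return 303, {"Location": DESCRIBED_BY[uri]}, ""
    return 404, {}, ""


if __name__ == "__main__":
    print(respond("http://example.com/toucan"))
    print(respond("http://example.com/toucan.html"))
```

Note that the agent needs a second request to reach anything readable, which is one of the extra steps this guidance imposes on publishers and consumers alike.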

This “resolution” has been anything but. Not only can no one clearly distinguish these de novo classes of “information resources” [19], but the whole approach felt arbitrary and kludgy.

Meanwhile, the confusions caused by the “identity crisis” and httpRange-14 continued to perpetuate themselves. In 2006, a major workshop on “Identity, Reference and the Web” (IRW 2006) was held in conjunction with the Web’s major WWW2006 conference in Edinburgh, Scotland, on May 23, 2006 [8]. The various presentations and its summary (by Harry Halpin) are very useful to understand these issues. What was starting to jell at this time was the understanding that the basis of identity and meaning on the Web posed new questions, and ones that philosophers, logicians and linguists needed to be consulted to help inform.

The fiat of the TAG’s 2005 resolution has failed to take hold. Over the ensuing years, various eruptions have occurred on mailing lists and within the TAG itself (now expressed as Issue 57) to revisit these questions and bring the steps moving forward into some coherent new understanding. Though linked data has been premised on best-practice implementation of these resolutions [9], and has been a qualified success, many (myself included) would claim that the extra steps and inefficiencies required from the TAG’s httpRange-14 guidance have been hindrances, not facilitators, of the uptake of linked data (or the semantic Web).

Today, despite the efforts of some to claim the issue closed, it is not. Issue 57 and the periodic bursts from notable semantic Web advocates such as Ian Davis [10], Pat Hayes and Harry Halpin [11], Ed Summers [12], Xiaoshu Wang [13], David Booth [14] and TAG members themselves, such as Larry Masinter [15] and Jonathan Rees [16], point to continued irresolution and discontent within the advocate community. Issue 57 currently remains open. Meanwhile, I think, all of us interested in such matters can express concern that linked data, the semantic Web and interoperable structured data have seen less uptake than any of us had hoped or wanted over the past decade. As I have stated elsewhere, unclear semantics and muddled guidelines help to undercut potential use.

As each of the eruptions over these identity issues has occurred, the competing camps have often been characterized as “talking past one another”; that is, not communicating in such a way as to help resolve to consensus. While it is hardly my position to do so, I try to encapsulate below the various positions and prejudices as I see them in this decade-long debate. I also try to share my own learning that may help inform some common ground. Forgive me if I overly simplify these vexing issues by returning to what I see as some first principles . . . .

What’s in a Name?

Original Coca-Cola bottle

One legacy of the initial document Web is the perception that Web addresses have meaning. We have all heard of the multi-million dollar purchasing of domains [17] and the adjudication that may occur when domains are hijacked from their known brands or trademark owners. This legacy has tended to imbue URIs with a perceived value. It is not by accident, I believe, that many within the semantic Web and linked data communities still refer to “minting” URIs. Some believe that ownership and control over URIs may be equivalent to grabbing up valuable real estate. It is also the case that many believe the “name” given to a URI acts to name the referent to which it refers.

This perception is partially true, partially false, but moreover incomplete in all cases. We can illustrate these points with the global icon, “Coca-Cola”.

As for the naming aspects, let’s dissect what we mean when we use the label “Coca-Cola” (in a URI or otherwise). Perhaps the first thing that comes to mind is “Coca-Cola,” the beverage (which has a description on Wikipedia, among other references). Because of its ubiquity, we may also recognize the image of the Coca-Cola bottle to the left as a symbol for this same beverage. (Though, in the hilarious movie, The Gods Must Be Crazy, Kalahari Bushmen, who had no prior experience of Coca-Cola, took the bottle to be magical with evil powers [18].) Yet even as reference to the beverage, the naming aspects are a bit cloudy since we could also use the fully qualified synonyms of “Coke”, “Coca-cola” (small C), “Classic Coke” and the hundreds of language variants worldwide.

On the other hand, the label “Coca-Cola” could just as easily conjure The Coca-Cola Company itself. Indeed, the company web site is the location pointed to by the URI of http://www.thecoca-colacompany.com/. But, even that URI, which points to the home Web page of the company, does not do justice to conveying an understanding or description of the company. For that, additional URIs may need to be invoked, such as the description at Wikipedia, the company’s own company description page, plus perhaps the company’s similar heritage page.

Of course, even these links and references only begin to scratch the surface of what the company Coca-Cola actually is: headquarters, manufacturing facilities, 140,000 employees, shareholders, management, legal entities, patents and Coke recipe, and the like. Whether in human languages or URIs, in any attempt to signify something via symbols or words (themselves another form of symbol), we risk ambiguity and incompleteness.

URI shorteners also undercut the idea that a URI necessarily “names” something. Using the service bitly, we can shorten the link to the Wikipedia description of the Coke beverage to http://bit.ly/xnbA6 and we can shorten the link to The Coca-Cola Company Web site to http://bit.ly/9ojUpL. I think we can fairly say that neither of these shortened links “names” its referent. The most we can say about a URI is that it points to something. With the vagaries of meaning in human languages, we might also say that URIs refer to something, denote something or identify (but not in the sense of completely define) something.

From this discussion, we can assert with respect to the use of URIs as “names” that:
  1. In all cases, URIs are pointers to a particular referent
  2. In some cases, URIs do act to “name” some things
  3. Yet, even when used as “names,” there can be ambiguity as to what exactly the referent is that is denoted by the name
  4. Resolving what such “names” mean is a matter of context and reference to further information or links, and
  5. Because URIs may act as “names”, it is appropriate to consider social conventions and contracts (e.g., trademarks, brands, legal status) in adjudicating who can own the URI.

In summary, I think we can say that URIs may act as names, but not in all or most cases, and when used as such are often ambiguous. Absolutely associating URIs as names is way too heavy a burden, and incorrect in most cases.

What is a Resource?

The “name” discussion above masks that in some cases we are talking about a readable Web document or image (such as the Wikipedia description of the Coke beverage or its image) versus the “actual” thing in the real world (the Coke beverage itself or even the company). This distinction is what led to the so-called “identity crisis”, for which Ian Davis has used a toucan as his illustrative thing [10].

Keel-billed Toucan

As I note in the conclusion, I like Davis’ approach to the identity conundrum insofar as Web architecture and linked data guidance are concerned. But here my purpose is more subtle: I want to tease apart still further the apparent distinction between an electronic description of something on the Web and the “actual” something. Like Davis, let’s use the toucan.

In our strawman case, we too use a description of the toucan (on Wikipedia) to represent our “information resource” (the accessible, downloadable electronic document). We contrast to that a URI that we mean to convey the actual physical bird (a “non-information resource” in the jumbled jargon of httpRange-14), which we will designate via the URI of http://example.com/toucan.

Despite the tortured (and newly conjured) distinction between “information resource” and “non-information resource”, the first blush reaction is that, sure, there is a difference between an electronic representation that can be accessed and viewed on the Web and its true, “actual” thing. Of course people can not actually be rendered and downloaded on the Web, but their bios and descriptions and portrait images may. While in the abstract such distinctions appear true and obvious, in the specifics that get presented to experts, there is surprising disagreement as to what is actually an “information resource” v. a “non-information resource” [19]. Moreover, as we inspect the real toucan further, even that distinction is quite ambiguous.

When we inspect what might be a definitive description of “toucan” on Wikipedia, we see that the term more broadly represents the family of Ramphastidae, which contains five genera and forty different species. The picture we are showing to the right is but of one of those forty species, that of the keel-billed toucan (Ramphastos sulfuratus). Viewing the images of the full list of toucan species shows just how divergent these various “physical birds” are from one another. Across all species, average sizes vary by more than a factor of three with great variation in bill sizes, coloration and range. Further, if I assert that the picture to the right is actually that of my pet keel-billed toucan, Pretty Bird, then we can also understand that this representation is for a specific individual bird, and not the physical keel-billed toucan species as a whole.

The point of this diversion is not a lecture on toucans, but an affirmation that distinctions between “resources” occur at multiple levels and dimensions. Just as there are no self-evident criteria as to what constitutes an “information resource”, there is also not a self-evident and fully defining set of criteria as to what is the physical “toucan” bird. The meaning of what we call a “toucan” bird is not embodied in its label or even its name, but in the context and accompanying referential information that place the given referent into a context that can be communicated and understood. A URI points to (“refers to”) something that causes us to conjure up an understanding of that thing, be it a general description of a toucan, a picture of a toucan, an understanding of a species of toucan, or a specific toucan bird. Our understanding or interpretation results from the context and surrounding information accompanying the reference.

In other words, a “resource” may be anything, which is just the way the W3C has defined it. There is not a single dimension which, magically, like “information” and “non-information,” can cleanly and definitely place a referent into some state of absolute understanding. To assert that such magic distinctions exist is a flaw of Cartesian logic, which can only be reconciled by looking to more defensible bases in logic [20].

Peirce and the Logic of Signs

The logic behind these distinctions and nuances leads us to Charles Sanders Peirce (1839 – 1914). Peirce (pronounced “purse”) was an American logician, philosopher and polymath of the first rank. Along with Frege, he is acknowledged as the father of predicate calculus and the notation system that formed the basis of first-order logic. His symbology and approach arguably provide the logical basis for description logics and other aspects underlying the semantic Web building blocks of the RDF data model and, eventually, the OWL language. Peirce is the acknowledged founder of pragmatism, the philosophy of linking practice and theory in a process akin to the scientific method. He was also the first formulator of existential graphs, an essential basis to the whole field now known as model theory. Though often overlooked in the 20th century, Peirce has lately been enjoying a renaissance with his voluminous writings still being deciphered and published.

The core of Peirce’s world view is based in semiotics, the study and logic of signs. In his seminal writing on this, “What is in a Sign?” [21], he wrote that “every intellectual operation involves a triad of symbols” and “all reasoning is an interpretation of signs of some kind”. Peirce had a predilection for expressing his ideas in “threes” throughout his writings.

Semiotics is often split into three branches: 1) syntactics – relations among signs in formal structures; 2) semantics – relations between signs and the things to which they refer; and 3) pragmatics – relations between signs and the effects they have on the people or agents who use them.

Peirce’s logic of signs in fact is a taxonomy of sign relations, in which signs get reified and expanded via still further signs, ultimately leading to communication, understanding and an approximation of “canonical” truth. Peirce saw the scientific method as itself an example of this process.

A given sign is a representation amongst the triad of the sign itself (which Peirce called a representamen, the actual signifying item that stands in a well-defined kind of relation to the two other things), its object and its interpretant. The object is the actual thing itself. The interpretant is how the agent or the perceiver of the sign understands and interprets the sign. Depending on the context and use, a sign (or representamen) may be either an icon (a likeness), an indicator or index (a pointer or physical linkage to the object) or a symbol (understood convention that represents the object, such as a word or other meaningful signifier).

An interpretant in its barest form is a sign’s meaning, implication, or ramification. For a sign to be effective, it must represent an object in such a way that it is understood and used again. This makes the assignment and use of signs a community process of understanding and acceptance [20], as well as a truth-verifying exercise of testing and confirming accepted associations.
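
For readers who think in data structures, the triad can be sketched in Python. This is only my own illustrative modeling, not a formalization Peirce or anyone cited here has offered; the `Sign` and `SignKind` names and the example values are assumptions. The sketch reuses the Coca-Cola bottle from earlier to show that one representamen, standing for one object, can yield different interpretants for different perceivers.

```python
from dataclasses import dataclass, replace
from enum import Enum


class SignKind(Enum):
    """The three roles a representamen may play with respect to its object."""
    ICON = "likeness"
    INDEX = "pointer"
    SYMBOL = "convention"


@dataclass(frozen=True)
class Sign:
    representamen: str  # the signifying item itself
    obj: str            # the thing the sign stands for
    interpretant: str   # how a given perceiver understands the sign
    kind: SignKind


# Illustrative values only, drawn from the Coca-Cola discussion above.
bottle = Sign(
    representamen="image of the Coca-Cola bottle",
    obj="Coca-Cola, the beverage",
    interpretant="a familiar soft drink",
    kind=SignKind.ICON,
)

# Same representamen and object, but a perceiver with no prior experience
# of the beverage arrives at a different interpretant.
bottle_unfamiliar = replace(bottle, interpretant="a magical object with evil powers")

if __name__ == "__main__":
    print(bottle)
    print(bottle_unfamiliar)
```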

John Sowa has done much to help make some of Peirce’s obscure language and terminology more accessible to lay readers [22]. He has expressed Peirce’s basic triad of sign relations as follows, based around the Yojo animist cat figure used by the character Queequeg in Herman Melville’s Moby-Dick:

The Triangle of Meaning

In this figure, object and symbol are the same as the Peirce triad; concept is the interpretant in this case. The use of the word ‘Yojo’ conjures the concept of cat.

This basic triad representation has been used in many contexts, with various replacements or terms at the nodes. Its basic form is known as the Meaning Triangle, as was popularized by Ogden and Richards in 1923 [23].

The key aspect of signs for Peirce, though, is the ongoing process of interpretation and reference to further signs, a process he called semiosis. A sign of an object leads to interpretants, which, as signs, then lead to further interpretants. In the Sowa example below, we show how meaning triangles can be linked to one another, in this case by abstracting that the triangles themselves are concepts of representation; we can abstract the ideas of both concept and symbol:

Representing an Object by a Concept

We can apply this same cascade of interpretation to the idea of the sign (or representamen), which in this case shows that a name can be related to a word symbol, which in itself is a combination of characters in a string called ‘Yojo’:

Representing Signs of Signs of Signs

According to Sowa [22]:

“What is revolutionary about Peirce’s logic is the explicit recognition of multiple universes of discourse, contexts for enclosing statements about them, and metalanguage for talking about the contexts, how they relate to one another, and how they relate to the world and all its events, states, and inhabitants.

“The advantage of Peircean semiotics is that it firmly situates language and logic within the broader study of signs of all types. The highly disciplined patterns of mathematics and logic, important as they may be for science, lie on a continuum with the looser patterns of everyday speech and with the perceptual and motor patterns, which are organized on geometrical principles that are very different from the syntactic patterns of language or logic.”

Catherine Legg [20] notes that the semiotic process is really one of community involvement and consensus. Each understanding of a sign and each subsequent interpretation helps come to a consensus of what a sign means. It is a way of building a shared understanding that aids communication and effective interpretation. In Peirce’s own writings, the process of interpretation can lead to validation and an eventual “canonical” or normative interpretation. The scientific method itself is an extreme form of the semiotic process, leading ultimately to what might be called accepted “truths”.

Peircean Semiotics of URIs

So, how do Peircean semiotics help inform us about the role and use of URIs? Does this logic help provide guidance on the “identity crisis”?

The Peircean taxonomy of signs has three levels with three possible sign roles at each level, leading to a possible 27 combinations of sign representations. However, because not all sign roles are applicable at all levels, Peirce actually postulated only ten distinct sign representations.
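
The arithmetic from 27 down to ten can be checked with a short Python sketch. It rests on one assumption not spelled out above: that the restriction takes the form usually given for Peirce's classification, in which each level is ranked 1 to 3 and a later level may not outrank an earlier one. The level names in the three dictionaries are the standard Peircean terms, and `label` is just a helper for this example.

```python
from itertools import product

# Standard Peircean names for the three roles at each level, ranked 1..3.
SIGN_ITSELF = {1: "qualisign", 2: "sinsign", 3: "legisign"}
RELATION_TO_OBJECT = {1: "icon", 2: "index", 3: "symbol"}
RELATION_TO_INTERPRETANT = {1: "rheme", 2: "dicent", 3: "argument"}

# Every unconstrained combination: 3 x 3 x 3.
ALL_COMBINATIONS = list(product((1, 2, 3), repeat=3))

# Assumed restriction: the rank at each later level cannot exceed the
# rank at the level before it.
VALID = [(s, o, i) for (s, o, i) in ALL_COMBINATIONS if s >= o >= i]


def label(combo):
    """Translate a (sign, object, interpretant) rank triple into role names."""
    s, o, i = combo
    return (SIGN_ITSELF[s], RELATION_TO_OBJECT[o], RELATION_TO_INTERPRETANT[i])


if __name__ == "__main__":
    print(len(ALL_COMBINATIONS), "combinations,", len(VALID), "valid")
    for combo in VALID:
        print(combo, label(combo))
```

Under that assumption, the filter leaves exactly ten of the 27 triples.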

Common to all roles, the URI “sign” is best seen as an index: the URI is a pointer to a representation of some form, be it electronic or otherwise. This representation bears a relation to the actual thing that this referent represents, as is true for all triadic sign relationships. However, in some contexts, again in keeping with additional signs interpreting signs in other roles, the URI “sign” may also play the role of a symbolic “name” or even as a signal that the resource can be downloaded or accessed in electronic form. In other words, by virtue of the conventions that we choose to assign to our signs, we can supply additional information that augments our understanding of what the URI is, what it means, and how it is accessed.

Of course, in these regards, a URI is no different than any other sign in the Peircean world view: it must reside in a triadic relationship to its actual object and an interpretation of that object, with further understanding only coming about by the addition of further signs and interpretations.

In shortened form, this means that a URI, acting alone, can at most play the role of a pointer between an object and its referent. A URI alone, without further signs (information), can not inform us well about names or even what type of resource may be at hand. For these interpretations to be reliable, more information must be layered on, either by accepted convention of the current signs or the addition of still further signs and their interpretations. Since the attempts to deal with the nature of a URI resource by fiat as stipulated by httpRange-14 neither meet the standards of consensus nor empirical validity, the attempt can not by definition become “canonical”. This does not mean that httpRange-14 and its recommended practices can not help in providing more information and aiding interpretation for what the nature of a resource may be. But it does mean that httpRange-14 acting alone is insufficient to resolve ambiguity.

Moreover, what we see in the general nature of Peirce’s logic of signs is the usefulness of adding more “triads” of representation as the process to increase understanding through further interpretation. Kind of sounds like adding on more RDF triples, does it not?
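
A small Python sketch may make the parallel concrete. Triples are modeled here as plain tuples in a set rather than with any RDF library, and the `http://example.com/ns#` vocabulary, its predicate names, and the `describe` helper are hypothetical. The toucan URI alone tells us nothing; each added statement narrows what it is taken to mean.

```python
TOUCAN = "http://example.com/toucan"
EX = "http://example.com/ns#"  # hypothetical vocabulary, for illustration only


def describe(uri, graph):
    """Return every (predicate, object) pair the graph asserts about `uri`."""
    return sorted((p, o) for (s, p, o) in graph if s == uri)


graph = set()

# With the bare URI and no statements, nothing can be said about the referent.
before = describe(TOUCAN, graph)

# Each added triple is a further sign that narrows the interpretation:
# an individual bird rather than a family or a species.
graph.add((TOUCAN, EX + "type", EX + "IndividualBird"))
graph.add((TOUCAN, EX + "name", "Pretty Bird"))
graph.add((TOUCAN, EX + "species", "Ramphastos sulfuratus"))

after = describe(TOUCAN, graph)

if __name__ == "__main__":
    print("before:", before)
    for predicate, obj in after:
        print(predicate, "->", obj)
```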

Global is Neither Indiscriminate Nor Unambiguous

Names, references, identity and meaning are not absolutes. They are not so in philosophy, and they are not so in human language. To expect machine communications to hold to different standards and laws than human communications is naive. To effect machine communications, our challenge is not to devise new rules, but to observe and apply the best rules and practices that human communications instruct.

There has been an unstated hope at the heart of the semantic Web enterprise that simply expressing statements in the right way (syntax) and in the right form (RDF) is sufficient to facilitate machine communications. But this hope, too, is naive and silly. Just as we do not accept all human utterances as truth, neither will we accept all machine transmissions as reliable. Some of the information will be posted in error; some will be wrong or ill-fitting to our world view; some will be malicious or intended to deceive. Spam and occasionally lousy search results on the Web tell us that Web documents are subject to these sources of unsuitability. Why would the same not be true of data?

Thus, global data access via the semantic Web is not — and can never be — indiscriminate nor unambiguous. We need to understand and come to trust sources and provenance; we need interpretation and context to decide appropriateness and validity; and we need testing and validation to ensure messages as received are indeed correct. Humans need to do these things in their normal courses of interaction and communication; our machine systems will need to do the same.

These confirmations and decisions as to whether the information we receive is actionable or not will come about via still more information. Some of this information may come about via shared conventions. But most will come about because we choose to provide more context and interpretation for the core messages we hope to communicate.

A Go-Forward Approach

Nearly five years ago Hayes and Halpin put forth a proposal to add ex:refersTo and ex:describedBy to the standard RDF vocabulary as a way for authors to provide context and explanation for what constituted a specific RDF resource [11]. In various ways, many of the other individuals cited in this article have come to similar conclusions. The simple redirect suggestions of both Ian Davis [10] and Ed Summers [12] appear particularly helpful.

Over time, we will likely need further representations about resources regarding such things as source, provenance, context and other interpretations that would help remove ambiguities as to how the information provided by that resource should be consumed or used. These additional interpretations can mechanically be provided via referenced ontologies or embedded RDFa (or similar). These additional interpretations can also be aided by judicious, limited additions of new predicates to basic language specifications for RDF (such as the Hayes and Halpin suggestions).
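As a rough illustration of layering on such interpretations, the following Python sketch stores RDF-style triples as plain tuples and checks which kinds of context a resource still lacks. Everything here is an assumption made for illustration: the `http://example.org/` URIs, the `ex:source` and `ex:provenance` predicates, the helper names, and the direction in which `ex:refersTo` and `ex:describedBy` are used (the Hayes and Halpin proposal should be consulted for their actual semantics).

```python
EX = "http://example.org/vocab#"  # hypothetical namespace standing in for "ex:"

THING = "http://example.org/id/eiffel-tower"   # hypothetical URI for the thing itself
PAGE = "http://example.org/doc/eiffel-tower"   # hypothetical URI for a page about it

# A graph is just a set of (subject, predicate, object) triples.
graph = {
    (THING, EX + "describedBy", PAGE),
    (PAGE, EX + "refersTo", THING),
}


def context_for(graph, subject):
    """Collect every statement about `subject`, grouped by predicate."""
    found = {}
    for s, p, o in graph:
        if s == subject:
            found.setdefault(p, set()).add(o)
    return found


def missing_context(graph, subject, required):
    """Return the required predicates that nobody has asserted for `subject`."""
    present = context_for(graph, subject)
    return sorted(p for p in required if p not in present)


# Adding one more triple adds one more interpretation (hypothetical predicate).
graph.add((THING, EX + "source", "http://example.org/org/publisher"))

if __name__ == "__main__":
    wanted = [EX + "describedBy", EX + "source", EX + "provenance"]
    print(missing_context(graph, THING, wanted))
```

A consumer could run a check like `missing_context` before deciding whether a resource carries enough context to be used, which keeps the test for missing annotations mechanical.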

In the end, of course, any frameworks that achieve consensus and become widely adopted will be simple to use, easy to understand, and straightforward to deploy. The beauty of best practices in predicates and annotations is that failures to provide them are easy to test for. Parties that wish to have their data consumed have incentive to provide sufficient information so as to enable interpretation.

There is absolutely no reason that these additions can not co-exist with the current httpRange-14 approach. By adding a few other options and making clear the optional use of httpRange-14, we would be very Peirce-like in our go-forward approach: we would be pragmatic while adding more means to improve our interpretations for what a Web resource is and is meant to be.


[1] Throughout intellectual history, a number of prominent philosophers and logicians have attempted to describe naming, identity and reference of objects and entities. Here are a few that you may likely encounter in various discussions of these topics in reference to the semantic Web; many are noted philosophers of language:
  • Aristotle (384 BC – 322 BC) – founder of formal logic; formulator and proponent of categorization; believed in the innate “universals” of various things in the natural world
  • Rudolf Carnap (1891 – 1970) – proposed a logical syntax that provided a system of concepts, a language, to enable logical analysis via exact formulas; a basis for natural language processing; rejected the idea and use of metaphysics
  • René Descartes (1596 – 1650) – posited a boundary between mind and the world; the meaning of a sign is the intension of its producer, and is private and incorrigible
  • Friedrich Ludwig Gottlob Frege (1848 – 1925) – one of the formulators of first-order logic, though his syntax was not adopted; advocated shared senses, which can be objective and sharable
  • Kurt Gödel (1906 – 1978) – his two incompleteness theorems are some of the most important logic contributions of all time; they establish inherent limitations of all but the most trivial axiomatic systems capable of doing arithmetic, as well as for computer programs
  • David Hume (1711 – 1776) – embraced natural empiricism, but kept the Descartes concept of an “idea”
  • Immanuel Kant (1724 – 1804) – one of the major philosophers in history, argued that experience is purely subjective without first being processed by pure reason; a major influence on Peirce
  • Saul Kripke (1940 – ) – proposed the causal theory of reference and what proper names mean via a “baptism” by the namer
  • Gottfried Wilhelm Leibniz (1646 – 1716) – the classic definition of identity is Leibniz’s Law, which states that if two objects have all of their properties in common, they are identical and so only one object
  • Richard Montague (1930 – 1971) – wrote much on logic and set theory; student of Tarski; pioneered a logical approach to natural language semantics; associated with model theory, model-theoretic semantics
  • Charles Sanders Peirce (1839 – 1914) – see main text
  • Willard Van Orman Quine (1908 – 2000) – noted analytical philosopher, advocated the “radical indeterminacy of translation” (can never really know)
  • Bertrand Russell (1872 – 1970) – proposed the direct theory of reference and what it means to “ground in references”; adopted many Peirce arguments without attribution
  • Ferdinand de Saussure (1857 – 1913) – proposed a view of semiotics alternative to Peirce’s, one grounded in sociology and linguistics
  • John Rogers Searle (1932 – ) – argues that consciousness is a real physical process in the brain and is subjective; has argued against strong AI (artificial intelligence)
  • Alfred Tarski (1901 – 1983) – analytic philosopher focused on definitions of models and truth; great admirer of Peirce; associated with model theory, model-theoretic semantics
  • Ludwig Josef Johann Wittgenstein (1889 – 1951) – he disavowed his earlier work, arguing that philosophy needed to be grounded in ordinary language, recognizing that the meaning of words is dependent on context, usage, and grammar.
Also, Umberto Eco has been a noted proponent and popularizer of semiotics.
[2] As any practitioner ultimately notes, standards development is a messy, lengthy and trying process. Not all individuals can handle the messiness and polemics involved. Personally, I prefer to try to write cogent articles on specific issues of interest, and then leave it to others to slug it out in the back rooms of standards making. Where the process works well, standards get created that are accepted and adopted. Where the process does not work well, the standards are not embraced as exhibited by real-world use.
[3] Tim Berners-Lee, 2007. What Do HTTP URIs Identify?
This article does not discuss the other sub-category of URIs, URNs (for names). URNs may refer to any standard naming scheme (such as ISBNs for books) and have no direct bearing on any network access protocol, as do URLs and URIs when they are referenceable. Further, URNs are little used in practice.
[4] Kendall Clark was one of the first to question “resource” and other identity ambiguities, noting the tautology between URI and resource as “anything that has identity.” See Kendall Clark, 2002. “Identity Crisis,” in XML.com, Sept 11 2002; see http://www.xml.com/pub/a/2002/09/11/deviant.html. From the topic map community, one notable contribution was from Steve Pepper and Sylvia Schwab, 2003. “Curing the Web’s Identity Crisis,” found at http://www.ontopia.net/topicmaps/materials/identitycrisis.html.
[5] Sandro Hawke, 2003. Disambiguating RDF Identifiers. W3C, January 2003. See http://www.w3.org/2002/12/rdf-identifiers/.
[6] The issue was framed as what is the proper “range” for HTTP referrals and was also the 14th major TAG issue recorded, hence the name. See further the httpRange-14 Webography.
[7] See W3C, “httpRange-14: What is the range of the HTTP dereference function?”; see http://www.w3.org/2001/tag/issues.html#httpRange-14.
[9] Leo Sauermann and Richard Cyganiak, eds., 2008. Cool URIs for the Semantic Web, W3C Interest Group Note, December 3, 2008. See http://www.w3.org/TR/cooluris/.
[10] Ian Davis, 2010. Is 303 Really Necessary? Blog post, November 2010, accessed 20 January 2012. (See http://blog.iandavis.com/2010/11/04/is-303-really-necessary/.) A considerable thread resulted from this post; see http://markmail.org/thread/mkoc5kxll6bbjbxk.
[11] See first Harry Halpin, 2006. “Identity, Reference and Meaning on the Web,” presented at WWW 2006, May 23, 2006. See http://www.ibiblio.org/hhalpin/irw2006/hhalpin.pdf. This was then followed up with greater elaboration by Patrick J. Hayes and Harry Halpin, 2007. “In Defense of Ambiguity,” http://www.ibiblio.org/hhalpin/homepage/publications/indefenseofambiguity.html.
[12] Ed Summers, 2010. Linking Things and Common Sense, blog post of July 7, 2010. See http://inkdroid.org/journal/2010/07/07/linking-things-and-common-sense/.
[13] Xiaoshu Wang, 2007. URI Identity and Web Architecture Revisited, Word document posted on posterous.com, November 2007. (Former Web documents have been removed.)
[14] David Booth, 2006. “URIs and the Myth of Resource Identity,” see http://dbooth.org/2006/identity/.
[15] See Larry Masinter, 2012. “The ‘tdb’ and ‘duri’ URI Schemes, Based on Dated URIs,” 10th version, IETF Network Working Group Internet-Draft, January 12, 2012. See http://tools.ietf.org/html/draft-masinter-dated-uri-10.
[16] Jonathan Rees has been the scribe and author for many of the background documents related to Issue 57. A recent mailing list entry provides pointers to four relevant documents in this entire discussion. See Jonathan A Rees, 2012. Guide to ISSUE-57 (httpRange-14) document suite, January 21, 2012.
[17] At least twenty domain names, led by insure.com, have sold for more than $2 million each; see this Wikipedia listing.
[18] In the wonderful movie, The Gods Must Be Crazy, Bushmen in the Kalahari Desert one day find an unbroken glass Coke bottle that had been thrown out of an airplane. Initially, this strange artifact seems to be another boon from the gods, and the Bushmen find many uses for it. But unlike anything that they have had before, there is only one bottle to go around. This creates jealousy, envy, anger, hatred, even violence. The protagonist, Xi, decides that the bottle is an evil thing and must be thrown off of the edge of the world. The hilarity of the movie comes from that premise and Xi’s encounters with the modern world as he pursues his quest with the magic bottle.
[19] Wang [13] rhetorically asked which of the following things would be categorized as an “information resource”:
  1. A book
  2. A clock
  3. The clock on the wall of my bedroom
  4. A gene
  5. The sequence of a gene
  6. A software
  7. A service
  8. A namespace
  9. An ontology
  10. A language
  11. A number
  12. A concept, such as Dublin Core’s creator.
See the 2007 thread on this issue, mostly by Sean Palmer and Noah Mendelsohn, the latter acknowledging that various experts may only agree on 85% of the items.
[20] See further Catherine Legg, 2010. “Pragmatism on the Semantic Web,” in Bergman, M., Paavola, S., Pietarinen, A.-V., & Rydenfelt, H. eds., Ideas in Action: Proceedings of the Applying Peirce Conference, pp. 173–188. Nordic Studies in Pragmatism 1. Helsinki: Nordic Pragmatism Network. See http://www.nordprag.org/nsp/1/Legg.pdf.
[21] Charles Sanders Peirce, 1894. “What is in a Sign?”, see http://www.iupui.edu/~peirce/ep/ep2/ep2book/ch02/ep2ch2.htm.
[22] The figures in particular are from John F. Sowa, 2000. “Ontology, Metadata, and Semiotics,” presented at ICCS 2000 in Darmstadt, Germany, on August 14, 2000; published in B. Ganter & G. W. Mineau, eds., Conceptual Structures: Logical, Linguistic, and Computational Issues, Lecture Notes in AI #1867, Springer-Verlag, Berlin, 2000, pp. 55-81. May be found at http://www.jfsowa.com/ontology/ontometa.htm. Also see John F. Sowa, 2006. “Peirce’s Contributions to the 21st Century,” presented at International Conference on Conceptual Structures, Aalborg, Denmark, July 17, 2006. See http://www.jfsowa.com/pubs/csp21st.pdf.
[23] C.K. Ogden and I. A. Richards, 1923. The Meaning of Meaning, Harcourt, Brace, and World, New York, 8th edition 1946.

This article was originally posted at  http://www.mkbergman.com/994/give-me-a-sign-what-do-things-mean-on-the-semantic-web/

MIT Research: The Advantage Of Ambiguity In Language

Most people think that language evolved as a way to exchange information; linguists and other students of communication, however, have long debated why language evolved. Famous linguists, among them MIT's Noam Chomsky, have argued that language is actually badly designed for communication and that it is only a byproduct of a system that may have evolved for other reasons, perhaps for structuring our own private thoughts.

As evidence for their theory, these linguists point to the fact that language is ambiguous. They claim that in a system optimized for passing information between a speaker and a listener, each word would have only one meaning, to avoid any risk of confusion or misunderstanding. In a study published in the journal Cognition, a team of MIT cognitive scientists has now challenged the linguists' hypothesis with a new theory, which argues that ambiguity in fact makes language more efficient because it permits the reuse of short, efficient sounds that listeners can easily distinguish depending on the context.



Ted Gibson, senior author of the study and an MIT professor of cognitive science, says:

"Various people have said that ambiguity is a problem for communication. But once we understand that context disambiguates, then ambiguity is not a problem - it's something you can take advantage of, because you can reuse easy [words] in different contexts over and over again."

The word "mean" is a rather ironic example of ambiguity. It can stand for indicating or signifying something, yet it can also refer to an intention or purpose, as in "I meant to go to the store". It can describe something or someone offensive or nasty, and it can refer to the mathematical average. Adding an "s" makes the word even more versatile: "a means to an end" refers to an instrument or method, while "to live within one's means" refers to financial resources.

Despite all these different definitions, virtually no one who has mastered the English language gets confused when hearing the word "mean." The reason is that the different senses of the word occur in very different contexts, which enables listeners to interpret its meaning almost automatically.
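The idea that context does the disambiguating can be sketched in a few lines of Python. This is a toy, overlap-based approach in the spirit of simplified Lesk-style word sense disambiguation, not the method used in the MIT study. The sense labels, the cue words attached to each sense, and the function name `disambiguate_mean` are all invented for illustration.

```python
import re

# Hypothetical cue words for four senses of "mean"; chosen by hand for this sketch.
SENSES = {
    "signify": {"word", "sign", "symbol", "definition", "refer"},
    "intend": {"go", "store", "planned", "wanted", "intended"},
    "unkind": {"nasty", "rude", "cruel", "bully", "offensive"},
    "average": {"median", "sum", "numbers", "sample", "variance"},
}


def disambiguate_mean(sentence, senses=SENSES):
    """Pick the sense whose cue words overlap most with the sentence's words.

    Returns None when the context offers no cue at all.
    """
    tokens = set(re.findall(r"[a-z']+", sentence.lower()))
    scores = {sense: len(tokens & cues) for sense, cues in senses.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    for text in (
        "I meant to go to the store",
        "The mean of the sample numbers is 4",
        "What does this symbol mean",
        "mean",
    ):
        print(repr(text), "->", disambiguate_mean(text))
```

With no surrounding words, the last example cannot be resolved at all, which is the point: the short word is cheap to reuse only because the context carries the extra information.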

Because of this disambiguating power of context, the researchers believe that languages may exploit ambiguity to reuse words, most probably the words that are simplest for language processing systems.

Based on previous studies and on observations they suggest that words with fewer syllables, high frequency and the simplest pronunciations should have the most meanings. 


To examine their theory, the researchers conducted corpus studies in Dutch, English and German. A corpus study is the study of language based on "real life" language examples that are stored in corpora (or corpuses), i.e. computerized databases created for linguistic research.

Their theory was confirmed when they compared certain properties of words to their numbers of meanings: shorter words that occurred more frequently and conformed to the language's typical sound patterns tended to be more ambiguous. The trends were statistically significant in all three languages.

In order to comprehend why ambiguity makes a language more instead of less efficient, one has to examine the competing desires of a speaker and listener. Whereas a speaker wants to put across as much as possible to a listener with as few words as possible, the listener aims to gain a complete and specific understanding of what the speaker is trying to convey. However, as the researchers have already pointed out, it is "cognitively cheaper" if the listener concludes certain things from the context of the conversation, rather than the speaker having to spend more time on longer and more elaborate descriptions.

The result is a system that leans toward ambiguity by reusing the "easiest" words. Steven Piantadosi, the study's lead author, states that once the context is taken into account, it becomes clear that "ambiguity is actually something you would want in the communication system."



Implications for computer science



According to the researchers, the statistical nature of their paper demonstrates a trend in the field of linguistics that is starting to depend more heavily on information theory and quantitative methods. 



Gibson states that, "The influence of computer science in linguistics right now is very high," and adds that natural language processing (NLP) is a major objective of those who operate at the intersection of the two fields. 



Piantadosi highlights that ambiguity in natural language presents enormous challenges for NLP developers, saying:

"Ambiguity is only good for us [as humans] because we have these really sophisticated cognitive mechanisms for disambiguating. It's really difficult to work out the details of what those are, or even some sort of approximation that you could get a computer to use."

However, as Gibson pointed out, computer scientists have long known about this problem. The new study offers a better theoretical and evolutionary explanation of why ambiguity exists, but the practical constraint remains: "Basically, if you have any human language in your input or output, you are stuck with needing context to disambiguate," he says.


Written by Petra Rattue
Copyright: Medical News Today
Not to be reproduced without permission of Medical News Today

This article was originally posted at http://www.medicalnewstoday.com/articles/240700.php



Wednesday, January 25, 2012

Collaborative Intelligence & the EHR

A provider's essential critical knowledge is often so obscured that the EHR becomes more of an obstacle than a useful source of clinical information.

The meaningful use-compliant electronic health record (EHR) has quickly become very adept at capturing and sharing standardized, structured clinical content that can be communicated, stored, and to some extent consumed by other systems. Unfortunately, this strength is also the EHR's greatest limitation. Amid the structured templates and required fields of the EHR, the essential critical knowledge a provider needs to know is often so obscured that the EHR becomes more of an obstacle or annoyance than a truly useful source of clinical information.

No Place for Clinician's Thought-Process?
The critical clinical insights that providers most need from an EHR are simply not available to allow for informed decision-making. The required fields may all be populated, but the patient's story remains frustratingly incomplete.

The reason for this is simple: by its very nature, the EHR paradigm of capturing clinical information by way of mouse-and-keyboard input into structured forms limits the expressiveness of content. Because there is no place for non-standard information or for the clinician's thought process in reaching certain diagnoses in the templates, we not only miss out on the details of a patient's clinical history, but also on the critical information that reflects the way doctors think.

Documentation of the rationale for conclusions, relevant temporal and sequential facts, causal information, etc. is either lost or obscured beyond efficient retrieval. Some EHRs have incorporated options to allow providers to capture unstructured narrative information, but the resulting text usually has limited utility since it remains unstructured data buried inside various notes fields.

This dilemma is significant. It will take more than incremental feature improvements to realize the promise of the EHR: to support everything from disease management to clinical decision support to major operational efficiencies. To deliver on the expectations for eHealth, we need the EHR not only to capture and effectively use structured data, but also to capture the full patient story and support clinical collaboration based on that story.

What is needed is collaborative intelligence, a solution that enables and supplements the kind of complete and focused clinical picture physicians convey via face-to-face collaboration. Providing such intelligence requires an understanding of clinical workflows, and an ecosystem of people, process and technology to provide the clinical insights that permit clinicians to zoom in on the most critical information quickly and effectively.

All of the pieces required for such collaborative intelligence are in place today: recognition and understanding of spoken content, semantic web coding and analysis to drive actions, and learning algorithms that continuously improve the performance of automated systems based on human feedback. Four key technologies provide the backbone:

Speech Understanding: Speech is the most natural way for humans to convey complex information, and it is the preferred mode of clinical documentation for most physicians today. Speech-based documentation is fast and interferes with the provider-patient interaction least. Converting speech into structured clinical notes using computers reduces costs and time lag associated with human transcription.

The availability of next-generation speech understanding technology now provides significantly higher accuracies across medical disciplines and documentation types than what has previously been available through speech recognition systems. Integration with various clinical systems further optimizes the efficiency of the technology.

Natural Language Understanding (NLU): Sophisticated technology to "read" and understand unstructured clinical narrative is a critical ingredient for collaborative intelligence. We can now produce meaningful structured information from narrative content, merging the benefits of dictation and structured documentation.

Irrespective of whether clinical narrative is captured through dictation or directly in textual form, the synergistic combination of speech and natural language processing (NLP) technologies now yields highly accurate, context-aware clinical content that is codified to standardized medical ontologies such as SNOMED-CT. This in turn drives actionable information and together with structured EHR data enables clinical decision support and improves the quality of care.

Semantic Clinical Reasoning: Once meaningfully structured narrative information is available, it must be made accessible in workflow-friendly, flexible modes. Newly available tools allow physicians to gain access and insights into clinical data that were impossible to get a few years ago. Also, these tools make physicians more productive because they are capable of abstracting and summarizing the relevant clinical information for each provider. They can reason across millions of documents or drill down on the relevant information about one patient in a given context.

Information mined from narrative content can be combined with structured data from EHRs to obtain holistic insights into the patient's story. From retrospective analyses to real-time feedback for physicians at the time of documentation that enables more timely clinical documentation improvement (CDI) to the ability to share clinical insights among caregivers in a collaborative system, the fruits of this reasoning are game-changing.

Machine Learning: To realize the full scope of its benefits, a collaborative intelligence system must be both highly scalable and responsive to the incessant changes in medical knowledge. The only way to achieve these objectives is through "machine learning" - intelligent systems that improve their predictions as they process more information.

Many NLP systems lack a robust capability to do this or rely on hand-crafted rules for knowledge updates, an inherently non-scalable approach. Learning from human feedback is crucial as it provides a constant opportunity to adapt to the changing environment as well as to improve the results and insights gained from collaborative intelligence.

Taken together and combined in the right manner, these technologies and workflows offer the best path to fulfill the goals of eHealth. The EHR remains an essential tool for advancing the quality and efficiency of care, but all stakeholders in healthcare have to remember that it is far from a panacea. To reach the goals of complete, accurate and seamlessly interoperable clinical information, we need to take into account that the most complete, accurate and interoperable way of communicating clinical information is via the spoken word. It also happens to be the most efficient way of capturing such information.

Juergen Fritsch is the chief scientist of MedQuist. He was previously chief scientist and co-founder of M*Modal, and before that, he was one of the founders of Interactive Services (ISI), where he served as principal research scientist.

This article was originally posted at http://health-information.advanceweb.com/Features/Articles/Collaborative-Intelligence-the-EHR.aspx

Tuesday, January 17, 2012

Future Enterprise - The Future of The Internet

David Hunter Tow, Director of the Future Planet Research Centre, forecasts that within the next decade the Internet and Web may be at risk of splitting into a number of separate entities, fragmenting under technological, national, business and social pressures.

In its place may emerge a network of networks, continuously morphing, linking and fragmenting, with no central dominant domain backbone; instead a disconnected, random structure of networks with information channeled through uncoordinated switching stations and content hubs, controlled by a range of geopolitical, social and enterprise interests.

For authoritarian states such as China, North Korea, Iran and Syria as well as criminal cartels, this will facilitate the expansion of their operations, allowing them to circumvent exposure of illegal activities in much the same way as the current Darknet network.

Darknet, the alternate network of virtual channels that currently operates beneath the backbone of the Internet, has long been a place for clandestine operations by both criminal and state networks. It is also used as a tool by cyber authorities to provide evidence of DDoS, port scanning, worms and other malware. It also allows dissidents from repressive regimes to remain in touch with the outside world, provides protection to whistle blowers and hosts pirated movie and music sites, out of reach of traditional search engines.

Autocratic governments are also maintaining increasingly tight censorship over politically sensitive sites via controlled points of entry to their cyber fiefdoms, even to the extent of distorting current and historical events. Both China and Iran now have plans to establish their own Internet infrastructure to further strengthen the control and censorship of their populations, and no doubt other authoritarian states will follow. But this power won’t be limited to dictator-run states. The increasing threat of Internet censorship via the proposed SOPA (Stop Online Piracy Act) legislation in the US confirms the threat facing online freedom even within democratic nations, and has already motivated opposition by major Web companies concerned about the arbitrary blocking of any site deemed to be infringing copyright laws.

At the same time white hat hacker groups plan to launch their own communication satellites linked to a grid of tracking stations in order to avoid such Government surveillance and interference, as discussed at the recent Chaos Communication Congress in Berlin.

But Apple, Facebook, Google, Amazon as well as Cable and Internet TV companies have already begun to fragment the web to support their own Walled Garden strategies of quarantining and manipulating membership data, applications, entertainment, search results and identities. Facebook membership data cannot be transferred to other social sites. Adobe’s Flash software as well as a number of developer applications were banned by Apple, which means the iPhone browser cannot display a large portion of the Internet. Likewise Amazon’s Kindle will only display books on sale or for rent by the company. Google Plus fails to adequately attribute search results to original sources and multiple IDs are banned.
Such social sites have become closed silos, similar in many respects to those of authoritarian states such as China.

The more this type of restricted, proprietary architecture gains traction on the Web the more it will become fragmented and the easier it will be for criminal groups to exploit, placing the open and egalitarian charter of the future Internet at risk.

But there are compelling reasons why such closed silo strategies, whether introduced by governments or Web companies, are likely to eventually collapse.

As outlined in previous blogs, physics ordains that information flows cannot be constrained and will eventually spread by pathways of least resistance, driven by consumer demand, competitive pressure and technological advances. In addition, biological ecosystems with limited genetic variation are the most vulnerable to extinction. Companies within the cyber ecosphere are equally vulnerable: they become more susceptible to competition and rapid changes in their technological and social environments if open access to innovative ideas and information flows is restricted.

The emergence of the Semantic Web is also a catalyst for greater openness, facilitating the interpretation, linking and application of knowledge stored in millions of discrete databases across the Web. This is a vital advance in fostering greater transparency, flexibility and autonomy within the Cybersphere.

But the battle for web control and Internet supremacy is only just beginning, not only between the US and China but also involving all other nations in the newly emerging multi-polar world. The US still maintains the controlling votes in ICANN, the domain name management organisation, despite many attempts to democratise its management.

But now the US will be forced to flex up and stop playing the role of alpha male in an increasingly equal and diverse information world.
By its obsession with maintaining technology dominance of critical assets such as the Web, particularly in a time of global warming, with an urgent need to effectively manage global resources for all populations, the US is ironically accelerating the rise of alternate Internets and Webs.

China is charging ahead with alternate communication networks, as in most areas of new technology. After all, its search engine, Baidu, already has 500 million users, almost as many as Google worldwide. Baidu works hand in glove with the Central Communist Party and is the ultimate arbiter of reality for its users, committed to working within the Government's paranoid censorship parameters constrained by a massive firewall of 50,000 Internet police. But with 200 million bloggers producing trillions of words a day as well as subscribers to RenRen and Sina Weibo, the equivalents of Facebook and Twitter, it’s becoming an increasingly tough call, even for a totalitarian government.

So now the momentum is building for a multi-Internet infrastructure as governments of all colours attempt to impose their will and dominate the evolution of the pre-eminent artefact of our civilisation, which may hold the key to the planet’s survival.
In the short term China cannot replicate the mega optic fibre cable, satellite and server networks of the present Internet, but it can deploy a mesh of alternate channels linking its own network assets to other friendly systems, for example in Africa, South America, Iran and Russia; at the same time constructing a topology complete with their own domain servers. In addition, it will develop its own knowledge hubs while leveraging the existing core public assets such as the priceless science, engineering, social and economic databases of the current Web.

The new US Net Neutrality rules recently introduced to prevent balkanisation are already under heavy fire, with broadband providers prevented from engaging in anti-competitive behaviour by blocking content or slowing access to sites and applications, as Comcast attempted to do in 2007 with the BitTorrent "peer-to-peer" protocol.
But as the pressure to bypass the new rules to allow a multi-speed Internet has increased, so too have the tensions been building between the major Social Web, Broadband and Cloud providers- Google, Apple, Facebook, Cisco, Verizon, Amazon, VMware etc. Cloud vendors have been erecting a new set of proprietary firewalls, with VMware the exception, adopting an open architecture to encourage developers to leverage and extend its technology.

The more such closed architectures with differing operational and security standards gain traction, however, the higher the risk that the CloudSphere will eventually become fragmented, less productive and more vulnerable to hacking.

Meanwhile, despite its financial problems, the EU plans to spend billions on boosting broadband speeds to increase productivity and competitiveness. The European Commission will spend 9 billion euros to roll out super-fast broadband infrastructure and services across the European Union, to help create a single market for digital public services for half its population by 2020. These services include e-health, intelligent energy and cyber security applications, assisting utility companies, construction cooperatives, public authorities and rural users.

New Internet Architecture options are also on the horizon, with a number of innovations in train, forecast to improve the Web’s flexibility while avoiding fragmentation. But these could be put in jeopardy by the US’s intransigence over ceding control.

For example, the National Science Foundation has established the Future Internet Architecture program, which includes the Nebula project, to better secure the Internet in areas such as ID verification, data safety, mobile access and cloud computing. Google is also setting up a new Web architecture to improve search effectiveness.

At a recent Internet Conference run by the European Paradiso Group, a number of advanced options were discussed, including: Internet routing algorithms with quantum options to provide more efficient and secure routing paths; flexible spectrum allocation; a smart Internet environment enabled by networked sensors; a content and context aware Web combined with self-organising and self-adaptive capabilities to provide more autonomy and optimisation.

In addition, the proposed Named Data Networking (NDN) architecture shifts the communication emphasis from today's focus on resource addresses, servers, and hosts, to one oriented to content and context. By identifying data objects instead of just locations, NDN transforms data into the primary Internet focus. While the current Internet secures the channel or path between two communication points, adding data encryption as an extra, NDN will secure the content itself, building security and trust into the data.
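The contrast between securing a channel and securing content can be made concrete with a toy sketch in Python. This is not NDN itself: the name `/example/report/v1`, the helper functions and the shared secret are all invented for illustration, and a keyed hash stands in for whatever signing scheme a real deployment would use. The point is only that the consumer asks for data by name and checks the data's own signature, so it does not matter which cache or host supplied the copy.

```python
import hashlib
import hmac

# Assumption: a shared secret stands in for a real publisher signing key.
PUBLISHER_KEY = b"demo-key"


def sign(name, content):
    """Bind the content to its name with a keyed hash (illustrative only)."""
    message = name.encode() + b"\0" + content
    return hmac.new(PUBLISHER_KEY, message, hashlib.sha256).hexdigest()


def publish(store, name, content):
    """Store the content under its name together with its signature."""
    store[name] = (content, sign(name, content))


def fetch(store, name):
    """Look data up by name and verify it, whatever store it came from."""
    content, signature = store[name]
    if not hmac.compare_digest(signature, sign(name, content)):
        raise ValueError("signature check failed for " + name)
    return content


# Any cache along the path can hold the copy; here a plain dict plays that role.
cache = {}
publish(cache, "/example/report/v1", b"hello")
```

If a copy in the cache is altered, `fetch` raises an error instead of returning it, because the check travels with the data and not with the connection.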

These and other advances will result in the emergence of Internet Mark 3.0, following its early incarnation as a simple packet data transfer system and its transformation, over the last decade, into a pervasive information search powerhouse.

But Internet 3.0 will only emerge if fragmentation of its infrastructure and the ensuing chaos is avoided.

Internet Mark 3.0 will offer complex, multidimensional and ultra-efficient processing, along with the dissemination of real-time multi-services and decision-making based on content and context, not just physical objects.
Such capability will drive societal transformation at hyper speed, catalysing urbanisation, mobility, vastly improved health and education services and all forms of virtual reality, as well as the beginning of a truly symbiotic web-human partnership in complex decision-making.

The Future of the Web has been discussed in a number of previous blogs by the author.

In summary:

By 2015, Web 2.0, the Social Web, will have developed into a complex multimedia interweaving of ideas, knowledge and social commentary, connecting over three billion people on the planet.

By 2025, Web 3.0, the Semantic Web, will have made many important contributions to new knowledge through network science, logical inference and artificial intelligence. It will be powered by a seamless, computational mesh, enveloping and connecting human and artificial life, and will encompass all facets of our social and business lives, always on and available to manage every need.

By 2035, Web 4.0, the Intelligent Web, will be ubiquitous, able to interact with the repository of all available knowledge of human civilisation, past and present, digitally coded and archived for automatic retrieval and analysis. Human intelligence will have co-joined with advanced forms of artificial intelligence, creating a higher or meta-level of knowledge processing. This will be essential for supporting the complex decision-making and problem solving capacity required for civilisation's future survival and progress.

Also by 2035 the last of the enterprise walled gardens will break down and leak like stone walls surrounding an ancient town. Techniques and technologies across the spectrum of knowledge will continue to spread, expand and link in new ways as they always have, bypassing temporary impediments, because that is the physical reality of information and knowledge.

The future Internet will inevitably follow these laws, becoming more open and flexible and using common protocols as enterprises and consumers demand greater flexibility. As an increasing number of data providers begin to implement Tim Berners-Lee’s Linked Data principles, it will transform into an open global Infosphere containing billions of links and coordinated by the World Wide Web Consortium.

This will offer a blueprint for connecting information from different sources into a single global data repository, with the Global Commons and Public Domain models playing an increasingly important democratic role.

Most importantly the Web will be equally available to and controlled by all nations, under the auspices of a specially constituted UN body, devolving forever away from US control.

But this can only happen if the underlying structural integrity of the Internet and Web is preserved. If managed as a global cooperative project it will result in enormous benefits for the whole of humanity. But if the Future Internet splits and fragments along geopolitical and competitive lines, as its current evolution suggests, then much of its potential benefit for our civilisation and planet will dissipate.

The future of this pre-eminent human-engineered living organism of the 21st century is now in the balance.

This article was originally posted at http://it.toolbox.com/blogs/future-enterprise-it/future-enterprise-the-future-of-the-internet-50029

A Brief History of Classification

Girginakku at Glencairn. These clay tablets were used for many purposes, including cataloging.
The earliest known means of classifying an object and keeping it in order are girginakku. These are ancient Mesopotamian clay tablets that were attached to scrolls and tablets and used to identify the contents. Examples approximately 5,300 years old can be found in the British Museum.

The famous Library of Alexandria in Egypt housed one of the earliest forms of library catalog in the third century BCE. The library reportedly housed more than 120,000 scrolls, which were stored in bins categorized by subject. Each of these bins was labeled, and the labels were indexed in the Pinakes. The taxonomy of subjects was devised by Callimachus, the second recorded librarian at Alexandria. He created a system with 11 main categories, six for non-fiction and five for fiction. These were rhetoric, law, history, medicine, mathematics, natural science, epic, tragedy, comedy, lyric poetry and miscellaneous. The influences of this system are still seen today in such systems as the Dewey Decimal Classification system.

Beginning in the 8th century CE, the Islamic library at Baghdad, The House Of Wisdom, began collecting books in earnest. The knowledge of papermaking had been acquired from Chinese prisoners and books proliferated. This is akin to the explosion of digital information we see today. These books were organized into genres, categories and sub-categories to make them easier to manage until the library was destroyed by a Mongol invasion in the mid-13th century.

The Leiden University Library, The Netherlands, created the first printed institutional library catalog shortly after it opened in the late 16th Century. The book was titled Nomenclator, and was a list of all authors whose books - in manuscript or print - were available in the library. The Library continued on the leading edge until the 20th century: it was among the first to use cards for its catalog and in 1969 began work on an automated system which was bought by OCLC in 2000. OCLC maintains WorldCat, the Worldwide Catalog, a machine system for libraries large and small, private and public, worldwide.

In 1735 Carolus Linnaeus published his Systema Naturæ, more commonly known as the Linnaean or Animal Kingdom taxonomy. Most of us are familiar with this system from grade school biology - there are three kingdoms (animals, plants, minerals) which are divided into classes, orders, genera and species. This is purely hierarchical in nature, and while it is capable of greater things, is used as an information placement tool mostly by non-biologists - akin to navigation taxonomies today. When you speak to people about taxonomy, this is often what they think of, and it is very useful to have some examples of similarity and differentiation at the ready to explain how your own taxonomy relates.

In 1876 Melvil Dewey created the Dewey Decimal system, which organizes artifacts by subject into 10 main categories. This system took hold quickly in the public and school libraries in the United States. The Library of Congress created its first dictionary catalog, the Library of Congress Subject Headings, a couple of decades later in 1898. This is the basis for cataloging and classifying all of the works that are in or are sent to the Library of Record in the USA. These catalog entries are the basis for a fee-based service which generates income for the LoC. It charges other libraries for copies of its catalog cards so that the subscribing library doesn’t have to do the cataloging work itself.

In the middle of the 20th Century an Indian mathematician and scholar by the name of Ranganathan created Colon Classification, a system still in use in Indian Libraries today. He posited that everything could be organized under 5 key facets, combined appropriately for the resource: Personality, Matter, Energy, Space, and Time. Each of these facets has a controlled value entered which is obtained from a taxonomy or thesaurus. The delimiter between the facets is a colon, and they are always entered in the PMEST order. This type of faceted taxonomy is a more practical solution for cataloging items in a digital world. Rather than having to have a single list of 10,000 items, one can have 4 lists of 10 items, which is much easier to manage and still yields 10,000 possible combinations. This is NOT a rule - it is an example. Each application has its own business requirements.
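The arithmetic behind the faceted approach can be checked with a short Python sketch. The four facet names and their values are invented placeholders, not Ranganathan's own facets; the only things taken from the text are the sizes (four lists of ten) and the colon as the delimiter.

```python
import math

# Hypothetical facets: four controlled lists with ten values each.
facets = {
    "subject": [f"subject-{i}" for i in range(10)],
    "material": [f"material-{i}" for i in range(10)],
    "place": [f"place-{i}" for i in range(10)],
    "period": [f"period-{i}" for i in range(10)],
}

# Values a taxonomist has to maintain: 10 + 10 + 10 + 10.
values_to_maintain = sum(len(values) for values in facets.values())

# Distinct labels those values can express: 10 * 10 * 10 * 10.
combinations = math.prod(len(values) for values in facets.values())


def colon_label(*values):
    """Join one value per facet, in a fixed facet order, with colons."""
    return ":".join(values)


example = colon_label("subject-3", "material-0", "place-7", "period-2")
```

Here `values_to_maintain` comes to 40 while `combinations` comes to 10,000, which is the trade the paragraph above describes: a small vocabulary per facet, a large space of labels overall.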

Taxonomies in the enterprise reach back further than one thinks, but became known to researchers in 1858 when the NY Times began its index to the newspaper. It became such a valuable tool that publishers began indexing books and periodicals and publishing the indexes; H. W. Wilson is a great publisher of indexes. The Readers' Guide to Periodical Literature is one that most school students are introduced to. Database providers and large academic/scholarly/professional publishers added this capability early on as well. Proquest/Gale/Cengage, Dialog, Factiva, Reuters, IEEE, ACM all have indexes. Large government organizations also have indexes organized by subject taxonomies or thesauri: NASA, DTIC, NIH, BLS, CIA, NAICS, SEC.

Taxonomies for the enterprise and the web as we know them today began as experiments in search improvements in the 1990s. Yahoo’s first release and Open Directory were clearly a librarian-like effort to organize the then small web. Those categorization structures were re-created within the realm of Natural Language Processing - math with letters. Pattern matching is the basis for much of what occurs in these systems for rules based categorization. In simplest terms, a rule which tags a piece of content with a term from the taxonomy is an if-then statement.
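As a rough illustration of such an if-then rule, here is a minimal Python sketch. The two taxonomy terms and their patterns are made up for this example and are not taken from any real categorization product; each rule simply says "if this pattern matches the text, then apply this term".

```python
import re

# Hypothetical rules: each taxonomy term is paired with a pattern to look for.
RULES = {
    "Semantic Web": re.compile(
        r"\b(ontolog(y|ies)|linked data|rdf)\b", re.IGNORECASE
    ),
    "Libraries": re.compile(
        r"\b(catalog(ue)?s?|librar(y|ies)|librarians?)\b", re.IGNORECASE
    ),
}


def tag(text):
    """Return, in alphabetical order, every term whose pattern matches."""
    return sorted(
        term for term, pattern in RULES.items() if pattern.search(text)
    )
```

Calling `tag("The library published a new catalog.")` applies the "Libraries" term only, while a sentence that mentions both an ontology and librarians picks up both terms. Real systems layer weighting, proximity and exclusions on top, but the core of each rule is still this kind of match.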

Efforts are underway to transform semantic systems from known-item or NLP-derived labeling into systems capable of contextual understanding. Ontologies are the means by which much of this effort will be accomplished in the short term. An ontology is more advanced than a taxonomy as it can contain self-defined relationships beyond that of parent-child. It can also be used to infer data and reason over information. The World Wide Web Consortium is one of the key leaders in efforts for standards in this space, as a semantic space is what Tim Berners-Lee had in mind for the web from the beginning.
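To give a feel for what "inferring data" means, here is a small Python sketch. The facts, the relation names `is_a` and `made_by`, and the single transitivity rule are all assumptions made for illustration; real ontology languages and reasoners are far richer. The sketch stores facts as subject-relation-object triples, includes one relationship that is not parent-child, and derives a fact nobody entered.

```python
# Hypothetical facts as (subject, relation, object) triples.
TRIPLES = {
    ("Siri", "is_a", "VoiceAssistant"),
    ("VoiceAssistant", "is_a", "Software"),
    ("Siri", "made_by", "Apple"),  # a relationship beyond parent-child
}


def infer_is_a(triples):
    """Add is_a facts implied by transitivity until nothing new appears."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        is_a = [(s, o) for s, r, o in facts if r == "is_a"]
        for a, b in is_a:
            for c, d in is_a:
                new_fact = (a, "is_a", d)
                if b == c and new_fact not in facts:
                    facts.add(new_fact)
                    changed = True
    return facts
```

After running `infer_is_a(TRIPLES)`, the result contains the triple saying Siri is a kind of Software, even though that statement was never entered directly.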

This article was originally posted at http://triviumrlg.com/content/brief-history-classification

SemTechBiz Conference – Call for Presentations Open until January 16

SemTechBiz is returning to San Francisco on June 3-7, and once again we plan to make it the biggest and most comprehensive educational conference on the business of semantic technologies. And that’s exactly where we are asking you to contribute please – by sharing the practical experience you have gained in your own semantic projects.
We’re looking for case studies big and small – whether you’re building the semantic infrastructure of the future, like the DoD Enterprise Web, or you’ve done semantic annotation on a local business web site, like Plush Beauty Bar. They’re all relevant, because the curiosity of the audience is so rich and diverse.
The Call for Presentations ends on January 16, 2012, so get your abstract together ASAP. All the information you need, and the links to submit your presentation proposal, are HERE.
Conference registration is also open. Register by February 17 and save with substantial early bird discounts.
If you have any questions, feel free to email me at Tony@SemanticWeb.com
Thanks,
Tony Shaw