Semantic Web for the Working Ontologist. Dean Allemang
Social data
A special case of the desire to share data is social networking. Billions of people share data about their personal and professional lives on a number of social web sites. It is worth their while to share this data, as it provides ways for them to find new friends, keep in touch with old friends, find business connections, and gain many other advantages.
Social and professional networking is done in a non-distributed way. Someone who wants to share their professional or personal information signs up for a web service (common ones today include Facebook, LinkedIn, Instagram, and WeChat; others have come and gone, and more will probably appear as time goes on), creates an account that they control, and provides data in the form of daily updates, photos, tags of places they’ve been and people they’ve been with, projects they have started or completed, jobs they have done, and so on. This data is published for their friends and colleagues, and indeed in some cases for perfect strangers, to search and view.
In these cases, the service they signed up for owns the data, and can use it for various purposes. Most people have experienced the eerie effect of having mentioned something in a social network, only to find a related advertisement appear on their page the following day.
Advertising is a lucrative but mostly harmless use of this data. In 2018, it was discovered that data from Facebook for millions of users had been used to influence a number of high-profile elections around the world, including the US presidential election of 2016 and the so-called “Brexit” referendum in the UK [Meredith 2018]. Many users were surprised that this could happen; they shared their data in a centralized repository over which they had no control.
This example shows the need for a balance of control: yes, I want to share my data, as in the examples of Section 1.2, but I want to share it with certain people and not with others (as is the case in this section). How can we manage both of these desires? This is a problem of distributed data; I need to keep data to myself if I want to control it, but it has to connect to data around the world to satisfy the reasons why I publish it in the first place.
Learning from data
Data science has become one of the most productive ways to make business predictions. It is used across many industries to make predictions for marketing, demand, and evaluation of risk, and in many other settings in which it is productive to predict how some person will behave or how well some product will perform.
Banking provides some simple examples. A bank is in the business of making loans: mortgages for homeowners, automobile loans, small-business loans, and so on. As part of the loan application process, the bank learns a good deal about the borrower. Most banks have been making loans for many decades, and have plenty of data about the eventual disposition of these loans (for example: Did the borrower default? Was the loan paid off early? Was it refinanced?). Given large amounts of this data, machine learning techniques can predict the eventual disposition of a loan based on information gathered at the outset. This, in turn, allows the bank to be more selective in the loans it makes, and so more effective in its market.
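The idea of predicting a loan's disposition from its history can be sketched in a few lines. This is only an illustration: the loan records and the single "income bracket" feature are invented, and a per-bracket default rate stands in for the much richer machine learning techniques a bank would actually use.

```python
from collections import defaultdict

# Invented historical loans: (income_bracket, defaulted).
# Real loan data would carry many more attributes.
history = [
    ("low", True), ("low", False), ("low", True), ("low", False),
    ("high", False), ("high", False), ("high", False), ("high", True),
]

def default_rates(loans):
    """Fraction of past loans in each bracket that ended in default."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for bracket, defaulted in loans:
        totals[bracket] += 1
        if defaulted:
            defaults[bracket] += 1
    return {bracket: defaults[bracket] / totals[bracket] for bracket in totals}

rates = default_rates(history)
print(rates["low"], rates["high"])  # 0.5 0.25
```

A new applicant in the "low" bracket would be scored with the historical rate for that bracket. The estimate is only as good as the history behind it, which is why the availability of meaningful data matters so much.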
This basic approach has been applied to marketing (identifying more likely sales leads), product development (identifying which features will sell best), customer retention (identifying problems before they become too severe to deal with), medicine (identifying diseases based on patterns in images and blood tests), route planning (finding best routes for airplanes), sports (deciding which players to use at what time), and many other high-profile applications.
In all of these cases, success relied on the availability of meaningful data. In the case of marketing, sales, and manufacturing applications, the data comes from a single source, that is, the sales behavior of the customers of a single company. In the case of sports, the statistical data for the sport has been normalized by sports fans for generations. The data is already aligned into a single representation. This is an important step that allows machine learning algorithms to generalize from the data.
The only example in this list where the data is distributed is medicine, where diagnoses come from hospitals and clinics from around the world. This is not an accident; in the case of medicine, disease and treatment codes have been in place for decades to align data from multiple sources.
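A minimal sketch of why shared codes matter, assuming two sources that describe the same diagnosis with different local labels. The record layout and the codes `D001` and `D002` are invented for illustration and do not come from any real coding system.

```python
from collections import Counter

# Hypothetical records from two sources. The local labels differ,
# but both sources attach the same shared diagnosis code.
hospital_a = [
    {"local_label": "flu", "code": "D001"},
    {"local_label": "broken arm", "code": "D002"},
]
clinic_b = [
    {"local_label": "influenza", "code": "D001"},
    {"local_label": "grippe", "code": "D001"},
]

def count_by_code(*sources):
    """Pool records from any number of sources, keyed by the shared code."""
    counts = Counter()
    for source in sources:
        for record in source:
            counts[record["code"]] += 1
    return counts

combined = count_by_code(hospital_a, clinic_b)
print(combined["D001"])  # 3
```

Counting by `local_label` instead would split the same diagnosis three ways. The shared code is what lets data from separate sources be treated as one data set.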
How can we extend the successful example of machine learning in medicine to other industries, taking our machine learning successes from the enterprise level to the industry level? We need a way to link together data that is distributed throughout an industry.
1.3 Distributed Data
In the restaurant example, we had data (opening hours, daily special, holiday closings) published so that they can be read by the human eye, but our automated assistant couldn’t read them. One solution would be to develop sophisticated algorithms that can read web pages and figure out the opening hours based on what they see there. But the restaurant owner knows the hours, and wants prospective patrons to find them, and for them to be accurate. Why should a restaurant owner rely on some third party to facilitate communication with their customers?
A scientific paper that reports on an experimental finding has a very specific audience: other researchers who need to know about that compound and how it reacts in certain circumstances. It behooves both the author and the reader to match these up. Once again, the author does not want to rely on someone else to communicate their value.
This story repeats at every level; a bank has more control over its own instruments if it can communicate their terms in a clear and unambiguous way (to partners, clients, or regulators). The IAU’s charter is to keep the astronomical community informed about developments in observations and classifications. Dentists want their patients to be able to find their clinics.
The unifying theme in all of these examples is a move from a presentation of information for a specific audience, requiring interpretation from a human being, to an exchange of data between machines. Instead of relying just on human intuition in the interpretation of the data, we meet half-way: have data providers make it easier to consume the data. We take advantage of the desire to share data, to make it easier to consume.
Instead of thinking of each data source as a single point that is communicating one thing to one person, we think of it as a multi-use part of an interconnected network of data. Human users and machine applications that want to make use of this data collaborate with data providers, taking advantage of the fact that it is profitable to share your data.
A distributed web of data
The Semantic Web takes this idea one step further, applying it to the Web as a whole. The Web architecture we are familiar with supports a distributed network of hypertext pages that can refer to one another with global links called Uniform Resource Locators (URLs). The Web architecture generalizes this notion to a Uniform Resource Identifier (URI), allowing it to be used in contexts beyond the hypertext Web.
The main idea of the Semantic Web is to support a distributed Web at the level of the data rather than at the level of the presentation. Instead of just having one web page point to another, one data item can point to another, using the same global reference mechanism that the Web uses: URIs. When Mongotel publishes information about its hotels and their locations, or when Copious publishes its opening hours, they don’t just publish a human-readable presentation of this information but instead a distributable, machine-readable description of the data.
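As a rough illustration of one data item pointing to another by URI, the sketch below models two independently published data sets as plain Python dictionaries. Everything in it is an assumption made for the example: the `example.org` URIs, the property names (`nearbyRestaurant`, `opens`, `closes`), the hours, and the `lookup` helper. It is not how Semantic Web data is actually encoded, only a picture of the linking idea.

```python
# Two independently published data sets, each a dict keyed by URI.
# All URIs, property names, and values are invented for illustration.
mongotel_data = {
    "http://mongotel.example.org/hotel/1": {
        "name": "Mongotel Downtown",
        # A data item pointing at another data item by its URI.
        "nearbyRestaurant": "http://copious.example.org/restaurant",
    },
}
copious_data = {
    "http://copious.example.org/restaurant": {
        "name": "Copious",
        "opens": "11:00",
        "closes": "22:00",
    },
}

def lookup(uri, *datasets):
    """Return the description of `uri` from whichever data set publishes it."""
    for dataset in datasets:
        if uri in dataset:
            return dataset[uri]
    return None

hotel = lookup("http://mongotel.example.org/hotel/1", mongotel_data, copious_data)
restaurant = lookup(hotel["nearbyRestaurant"], mongotel_data, copious_data)
print(restaurant["opens"])  # 11:00
```

Neither publisher needs to hold the other's data. The hotel record carries only a global reference, and the restaurant's own data set remains the source for its hours.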
The