Declarative Logic Programming. Michael Kifer
Чтение книги онлайн.
Читать онлайн книгу Declarative Logic Programming - Michael Kifer страница 27
Another large area of contribution for Datalog was as a vehicle for understanding the complexity and expressiveness of database querying. Complexity refers to the amount of time or space to evaluate a query over a database, as a function of input size. The relevant input can be the data (for data complexity), the query (for query or expression complexity), or both (for combined complexity). The complexity is generally characterized by broad classes, such as logarithmic space (LOGSPACE) or polynomial time (PTIME). A classical result is that domain relational calculus (essentially non-recursive Datalog with negation) has LOGSPACE data complexity and PSPACE query complexity. (Domain relational calculus is PSPACE-complete, meaning—informally—that domain relational calculus evaluation is as hard as any problem in PSPACE.) Abiteboul et al. [1995] provide an introduction to such complexity concepts and results. We note that Datalog itself is PTIME-complete for data complexity and EXPTIME-complete for query and combined complexity [Immerman 1982, Vardi 1982].
Datalog has helped the study of database query complexity (and decidability), particularly in terms of the complexity “consequences” of different language features and semantics. Compared to a syntactically complicated language such as SQL, language features can be more cleanly removed from or added to Datalog, such as recursion, negation, second-order variables, inequality literals, ordered domains, fixpoint operators, iteration, disjunction (in rule heads) and complex objects. It also readily admits alternative semantics, such as Well-Founded vs. Stable-Model semantics for negation. A survey by Dantsin et al. [2001] covers complexity of these variants and more. There is also work investigating restrictions of Datlog, such as linear recursion (at most one IDB goal per rule body) [Ullman and Van Gelder 1988] and limitation to a single rule [Gottlob and Papadimitriou 1999]. We also note that Datalog has also proved a useful tool in studying the complexity of non-database problems, such as constraint satisfaction [Feder and Vardi 1999].
Work on expressiveness considers questions such as which queries cannot be expressed in a particular language, how two query languages are related in terms of the queries they can express, and how query languages relate to complexity classes. For example, it is not possible to express a transitive-closure query in non-recursive Datalog [Aho and Ullman 1979], linear Datalog is strictly less expressive than regular Datalog [Afrati and Cosmadakis 1989], Datalog with equality and inequality predicates is equivalent to fixpoints over positive existential queries [Chandra and Harel 1985], and Datalog with ordered structures (that is, with a built-in ordering predicate on the domain of values) exactly captures polynomial time [Papadimitriou 1985]. Such expressiveness results are not purely of theoretical interest—they can show that a language with better “programmability” can be substituted for another without diminishing the queries that can be expressed. For example, both Monadic Datalog (all head predicates have a single variable) and a natural subset of Elog (a language for web wrappers) are as expressive as Monadic Second-Order logic (logic with variables ranging over sets) on trees [Gottlob and Koch 2004]. Thus, for tasks such as information extraction from the Web, the former languages provide an easier-to-use alternative to the latter. The survey by Dantsin et al. [2001] also covers a wide range of expressiveness results.
1.7.2 The Resurgence
Work on Datalog always kept going at some level, even after the initial flurry of activity subsided. Beginning around 1998 there was renewed interest in Datalog, driven by its use to solve real problems rather than by intrinsic properties or theoretical developments (although there have been new extensions and evaluation techniques motivated by these applications). Some of this interest was due to the new (at the time) field of the Semantic Web and its focus on inference [Berners-Lee et al. 2001, Boley and Kifer 2013a, Decker et al. 1998, Guha et al. 1998]. In other cases, Datalog was being applied in settings where database technology was not commonly used, such as program analysis and networking. Thus, its adoption was more a matter of introducing a declarative approach into a new domain or a domain where solutions were typically expressed imperatively, rather than of replacing another database language.
Semantic Web. The idea of the Semantic Web emerged in late 1990s as the next step in the evolution of the World Wide Web. Instead of HTML pages, the Semantic Web was to rely on machine-readable logical statements (most of which would be just database facts written in an open standard format called RDF [Lassila and Swick 1999]), making search results more relevant, giving rise to better recommendation systems, making Semantic Web services possible (for example, automatic arrangement of trips), and even enabling automated contract negotiation.
The two main research directions in this area are reasoning and machine learning; we focus here on the former. In reasoning, two approaches dominated: OWL [Grau et al. 2008], which is based on a particular strain of knowledge representation, called Description Logic [Baader et al. 2003], and rule-based reasoning, which is based on logic programming and includes Datalog. Both approaches have led to W3C recommendations (W3C calls its standards “recommendations”)—the already mentioned OWL and the Rule Interchange Format (RIF) [Boley and Kifer 2013a, Boley and Kifer 2013b], which is based on logic programming and, more specifically, on F-logic and HiLog (see Section 1.4.6). The expressive powers of OWL and of rules are quite different, which is why both approaches exist, each having its adherents. This bifurcation of efforts on the reasoning side has also led to a number of efforts to reconcile OWL with rules. First, it resulted in the development of an OWL subset, called OWL RL,22 which lies in the intersection of OWL and Datalog. Second, several approaches attempted to combine OWL and logic programming at a much more general level so as to take advantage of both paradigms [Eiter et al. 2004a, Eiter et al. 2004b, Motik et al. 2006].
Overall, building the Semantic Web turned out to be harder than originally envisioned, but the effort is proceeding apace. It resulted in Google’s Knowledge Graph, DBpedia, Wikidata, Semantic MediaWiki, and other resources. In addition, it led to a number of commercial Datalog-based systems for RDF stores such as Ontotext’s GraphDB,23 Franz’s AllegroGraph,24 and Apache’s Jena,25 to name a few. Many leading logic programming systems (for example, SWI Prolog, Ciao Prolog, and XSB) also have extensive web-related capabilities.
Datalog for Program Analysis. Program analysis is an example of an application of Datalog to an area that had largely used imperative approaches before. Figuring out program properties, such as call chains or pointer aliasing, often involves graph traversal or mutual recursion, which are directly expressed in Datalog. Lam et al. [2005] relate their experience with implementing program analyses to detect security holes in web applications. They turned to Datalog after coding their analysis algorithms in Java proved hard to get correct and efficient. They also found their Datalog implementation easier to maintain and extend. Their evaluation approach to Datalog was to translate it to relational algebra and thence to Boolean functions. Relations with program-structure and execution-context information are represented as characteristic functions (mappings of tuples to true
or false
). Each function is encoded into a compact structure, called a binary decision diagram, upon which the Boolean functions can be efficiently evaluated. The CodeQuest system [Hajiyev et al. 2006] used Datalog to query program information, aimed in particular at enforcing style rules and programming conventions that the compiler does not enforce, such as requiring all subclasses of an abstract class to reimplement a particular method. The developers report that Datalog hits a “sweet spot” between expressiveness and efficiency. While not working with enormous datasets, the space requirements can exceed main memory, with code bases containing more than one million lines of code and greater than 10,000 classes. Their approach to evaluation is to send Datalog through an optimizing compiler that emits SQL, using iteration in stored procedures or SQL’s Common Table Expressions to handle recursion. The Semmle product takes a similar approach for