derivations, for example, by remembering previously derived facts, and avoiding re-deriving them (in top-down approaches) or excluding them when trying to derive new facts (in bottom-up approaches).
Recursion. A key advantage of Datalog over other relational languages, like SQL, was the natural expression of recursion in queries. The top-down, left-to-right evaluation style of Prolog can behave differently depending on how rules are written, and some rules can cause infinite regress in trying to prove a goal. For example, if the first rule for related placed its recursive related subgoal first in its body, Prolog evaluation would keep solving that first goal with a recursive application of the same rule, and never actually encounter the rules or facts for the advised predicate (see the sketch below). Even when rule sets are better constructed, there can be an unbounded number of search paths even when there are only finitely many answers. In fact, if the SLD tree is infinite and all answers are desired, no complete search strategy will terminate. Here the bottom-up approaches have a decided advantage, as termination is guaranteed if the set of answers is finite (though bottom-up evaluation can re-derive facts).
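To make the problem concrete, here is a hypothetical left-recursive ordering of a rule for related, assuming for illustration that related is defined from an EDB relation advised(A, P) (read "A advised P"); the constants am and pb, the facts, and the exact rule shape are invented for this sketch and are not the chapter's actual rules.

    % Hypothetical facts and rules.
    advised(am, ks).
    advised(am, pb).

    related(P1, P2) :- related(A, P2), advised(A, P1).    % recursive subgoal first
    related(P1, P2) :- advised(A, P1), advised(A, P2).

    % Under Prolog's top-down, left-to-right strategy, a query such as
    % ?- related(X, ks). resolves against the first rule, producing the subgoal
    % related(A, ks), which again resolves against the first rule, and so on
    % forever; the advised facts are never reached. Reordering the body of the
    % first rule (or the order of the rules) avoids this particular loop.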
Large fact bases. Prolog implementations largely assumed that rules and facts all fit in memory. However, this assumption was questionable for applications of Datalog to large datasets. If virtual memory is required for the fact base, then accessing facts turns into disk accesses. Different binding patterns for goals lead to different orders of access, hence not all disk accesses will be clustered, leading to random reads on disk. If there is alternative storage for facts (in the file system or a database), then there can be costs associated with moving facts into internal memory structures, such as the parsing associated with the Prolog assert command. For efficient access to large fact bases on disk, data needs to be read in chunks and manipulated in its format on disk.
A great deal of the early work on Datalog (and deductive databases more generally) centered on alternative evaluation mechanisms. These approaches generally started either from basic bottom-up (BU) or top-down (TD) evaluation. The bottom-up methods tried to avoid re-deriving established facts and to provide more direction as to which facts to derive. Top-down methods tried to access facts in groups (rather than one at a time), rearrange the order of sub-goals, generate database queries to solve goal lists, and memoize derived facts to break recursive cycles. Theoretical proposals for alternative evaluation methods proliferated, with further and further refinements. However, side-by-side comparisons of actual implementations of these approaches were not common (although there was some comparative analysis of methods [Bancilhon and Ramakrishnan 1986]). The potential for scaling and performance of logic languages was also promoted, for example via parallel evaluation in the Japanese Fifth-Generation Project [Moto-oka and Stone 1984] or through database techniques such as indexing and query optimization.
1.5.1 Bottom-Up Methods
In Section 1.1, we described the bottom-up evaluation of a Datalog program P at the level of single instantiations of rules to generate new facts. This derivation process is commonly organized into “rounds,” where in one round we consider all satisfied instances of all rules. A round can be viewed as a transformation TP, called the immediate consequence operator, that takes a set of established facts F and produces a set G of newly established facts [Van Emden and Kowalski 1976]. If we let F0 represent the initial EDB facts, then bottom-up evaluation can be seen as producing a sequence of fact-sets F0, F1, F2, … where Fi+1 = Fi ∪ TP(Fi).
For a Datalog program, we eventually reach a fixed point Fj where Fj = Fj−1 (that is, a point at which no new facts can be established using rules in program P). Call this fixpoint F*. At this point, a query over the virtual database embodied by P can be answered. This iterative application of TP is called the naïve approach to bottom-up evaluation. There are two big problems with the naïve approach: repeated derivation and irrelevant results, which give rise to more sophisticated evaluation strategies.
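As a small worked illustration of naïve evaluation, consider the following hypothetical transitive-closure program (the predicates edge and path and the constants a through d are invented for this sketch; this is not the chapter's running example):

    % Hypothetical program P: EDB relation edge, IDB relation path.
    edge(a, b).
    edge(b, c).
    edge(c, d).

    path(X, Y) :- edge(X, Y).
    path(X, Y) :- edge(X, Z), path(Z, Y).

    % Naive bottom-up rounds:
    %   F0 = {edge(a,b), edge(b,c), edge(c,d)}
    %   F1 = F0 ∪ TP(F0) = F0 ∪ {path(a,b), path(b,c), path(c,d)}
    %   F2 = F1 ∪ TP(F1) = F1 ∪ {path(a,c), path(b,d)}   (round-1 path facts re-derived)
    %   F3 = F2 ∪ TP(F2) = F2 ∪ {path(a,d)}              (earlier path facts re-derived again)
    %   F4 = F3, so F* = F3 and evaluation stops.

The re-derivations noted in the comments are exactly the wasted work taken up next.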
Repeated Derivation. Note that Fi ⊇ Fi−1 by construction. For a standard Datalog program P, TP is monotone in the sense that adding more facts to the input cannot remove facts from the output; thus TP(Fi) ⊇ TP(Fi−1). So each round of the naïve approach re-establishes all the facts from the previous round, which is wasted work. The semi-naïve approach addresses this problem by only using rule instances in a round that have not been tried in previous rounds. It does so by determining which facts are newly generated in a round, then making sure that any rule instance used in the next round contains at least one of those new facts. Thus, semi-naïve determines the newly established facts at round i as Di = Fi − Fi−1 and restricts TP to only generate facts from rule instances that use at least one fact from Di. Rather than examining rule instances on a one-by-one basis, semi-naïve constructs an alternative rule set Q using two versions, rN and rO, of each predicate r, where rN contains the “new” r-facts from Di and rO the “old” r-facts from Fi−1. So, for example, a rule for r in program P whose body contains two occurrences of r would be replaced by three rules in program Q, in which rN and rO are recomputed by the system at each iteration (see the sketch below). Note that the semi-naïve approach does not completely avoid repeated derivation of facts, since distinct rule instances can establish the same fact independently; for example, two different rule instances can both establish r(a, b).
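As a sketch of this rewriting, assume a hypothetical rule for r with two occurrences of r in its body, and write r_new and r_old for the rN and rO versions (these predicate names, the rule itself, and the constants c and d are illustrative assumptions):

    % Hypothetical rule in program P:
    r(X, Y) :- r(X, Z), r(Z, Y).

    % Its replacement by three rules in program Q; r_new holds the facts in Di,
    % r_old the facts in Fi-1, and both are refreshed at every iteration:
    r(X, Y) :- r_new(X, Z), r_old(Z, Y).
    r(X, Y) :- r_old(X, Z), r_new(Z, Y).
    r(X, Y) :- r_new(X, Z), r_new(Z, Y).

    % The r_old/r_old combination is omitted: any fact it could derive was already
    % derivable in an earlier round. Repeated derivation remains possible, though;
    % for instance, the two rule instances
    %   r(a, b) :- r(a, c), r(c, b).
    %   r(a, b) :- r(a, d), r(d, b).
    % both establish r(a, b).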
A systematic method based on incrementalization [Liu and Stoller 2009] has been developed to ensure that each combination of facts that makes two subgoals of a rule true simultaneously is considered at most once, and furthermore to provide precise time and space complexity guarantees. The method compiles a set of Datalog rules into a stand-alone imperative program. The generated program makes minimal updates, adding one fact at a time; it incrementally maintains the results of expensive computations after each addition, and uses a combination of indexed and linked data structures to ensure that each firing of a rule takes worst-case constant time in the size of the data.
Irrelevant Derivation. The variants of bottom-up evaluation we have described so far generate the entire virtual database before turning to determining which results match the goal. Some facts that are not goal results are nevertheless needed in order to establish facts that are. But there can also be facts in the virtual database that neither match the goal nor are required to derive the goal facts. For example, suppose the related predicate had only its first two rules and the goal is related(P, ks). Then any derived fact related(a, b) where b ≠ ks will not be in the result, nor will it be useful in deriving goal facts. Aho and Ullman [1979] point out that in certain cases optimizations used for relational databases are applicable. For example, we can use “selection push-down” to move the restriction that P2 = ks into the rule bodies, as sketched below.
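For instance, assuming (hypothetically; the chapter's actual rules may differ) that the first two rules define related from advised as shown here, pushing the selection P2 = ks into their bodies specializes them to the goal:

    % Hypothetical first two rules for related:
    related(P1, P2) :- advised(A, P1), advised(A, P2).
    related(P1, P2) :- advised(A, P1), related(A, P2).

    % After pushing the selection P2 = ks into the rule bodies:
    related(P1, ks) :- advised(A, P1), advised(A, ks).
    related(P1, ks) :- advised(A, P1), related(A, ks).

    % Every derived fact now has ks as its second argument, so facts related(a, b)
    % with b ≠ ks are never produced. In the recursive rule, the bound constant ks
    % stays in the second argument of the related subgoal, so the specialized
    % rules suffice on their own.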
Note that this strategy works because the bound value stays in the same position. However, if our goal were related(ks, P), the binding in the goal does not directly restrict the binding in the rule body. Notice, however, that the