Database 2, année 2023

Charles Paperman and Mikaël Monet

Foreword

Prerequisites to this course: The syntax and semantics of first-order logic, the basis of complexity theory (a little bit of algorithms, the notion of reduction between computational problems, the definition of Turing machines, the class PTIME, the class NP, being able to prove by yourself a simple NP-completeness result, having already seen and understood an undecidability proof, the class PSPACE).

Supplementary material: This course is inspired from the book “Database Theory” (work in progress, available here), and from Dan Suciu’s lecture notes (here). Do not hesitate to consult these.

$\newcommand{\calQ}{\mathcal{Q}}$ $\newcommand{\Consts}{\mathsf{Consts}}$ $\newcommand{\vars}{\mathrm{vars}}$

Part I: Relational databases, queries and first-order logic, complexity of model checking and evaluation

Relational databases: definitions

The unnamed perspective

This is the notation that we will use most often in this course.

Let us fix a countably infinite set $\mathsf{Consts}$, called the domain, of constants (also called data values).

A tuple $\bar{t}$ is an element of $\mathsf{Consts}^k$ for a certain $k$, that is called the arity of $\bar{t}$.

If $\mathsf{Consts}$ contains (among other things) the values $a$, $b$, and $\text{toto}$, then $(a,\text{toto},a)$ is a tuple of arity 3.

A relation is a finite set of tuples of same arity (that is then called the arity of the relation)

$\{(a,a,a), (a,b,a), (a,\text{toto},c)\}$, or $\{(a), (42)\}$, or ${()}, or $\emptyset$ (the empty set) are relations. $\{(a),(a,b)\}$ is not a relation.

A (relational) schema $\Sigma$ is a set of relational symbols $R_1,...,R_k$, each with its arity (a natural number), written $\mathrm{ar}(R_i)$.

A relational database (DB for short) over $\Sigma$ and $\mathsf{Consts}$ associates to each relational symbol $R$ of $\Sigma$ a relation, that we will write $R^D$ (or simply $R$ when there is no risk of confusion).

$\Sigma = \{R,S,T\}$, of arity, respectively, 2,3 and 1. Then a DB could consist of the following relations: $R^D = \{(a,a), (a,42)\}$, $S^D = \{(b,b,a), (\text{toto},b,c), (c,c,c)\}$, and $T^D = \{(a), (\text{toto})\}$. To see clearer, we often simply draw $D$ as a set of tables in the usual way.

We call a fact of $D$ something of the form $R(a_1,...,a_{\mathrm{ar}(R)})$. Hence we can equivalently see a DB as a set of facts.

The previous example but where we see $D$ as a set of facts, i.e., $D = \{R(a,a), R(a,42), S(b,b,a), S(\text{toto},b,c), S(c,c,c), T(a), T(\text{toto})\}$.

We write $|D|$ the number of facts of a database $D$, and call this its size.

The active domain of a database $D$, written $\mathrm{ADom}(D)$, is the set of constants that occur in the database.

For the database of the previous example, $\mathrm{ADom}(D) = \{a,b,c,\text{toto},42\}$. It might be the case that $\mathrm{ADom}(D)$ is a strict subset of $\mathsf{Consts}$.

The named perspective

This is the one you are used to when you manipulate relational databases management systems (RDBMSs, such as MySQL, PostgreSQL, etc.), where columns in tables have attribute names. This contrasts with the unnamed perspective we just saw, where we do not care about the names of attributes and only the position inside tuples is important.

In the named perspective, we could have for instance a relation Employee(int id, varchar(50) name, float salary). In the unnamed, we simply see this as a relation of arity 3. The domain $\mathsf{Consts}$ can contain integers, strings, floats, whatever we want: for the type of problems we will study in this course, this will be of no importance.

Why do we use the unnamed perspective? Because this is closer to the syntax of first-order logic, and also because it is less notationally heavy and hence less annoying when we want to write queries or proofs (see examples below).

Differences between this course and “real” databases

In real databases data values have a type (int, string, etc). As already mentioned, here we do not care and put everything in $\mathsf{Consts}$.

The other main difference is that in this course we consider what is called the set semantics (a relation is a set of tuples, so we cannot have repeated tupes), whereas typical RDBMSs allow for duplicates, which is refered to as the bag semantics. This subtle distinction can make the complexity of some problems differ. The set semantics is less painful to work with, and it is already interesting enough to make it worth studying.

Queries: definitions

A query $q$ (on $\Sigma$) is a function that associates to each DB $D$ a relation, called the result of q on D and written $q(D)$. Note that when the arity of that relation is zero, only two results are possible: the emptyset $\emptyset$, and $\{()\}$, the set containing only the empty tuple. We traditionally associate the first to FALSE and the second to TRUE, and we then call the query Boolean. A database can then either satisfy the query, written $D \models q$, or not satisfy it, written $D \not\models q$.

Queries are usually written in specific query languages. In this course we will see first-order queries and restrictions thereof such as conjunctive queries (CQs) or unions of conjunctive queries (UCQs). More powerfull query languages exist, such as second-order logical queries, or Datalog queries, that will probably not be covered in this course. As always, more expressivity often comes with a cost, in the sense that typical problems have higher complexity.

First-order logic

You should already know the syntax and semantics of first-order logic (cf. prerequisites, see wikipedia if you need a refresher (the difference being that we do not use “function symbols” in this class, and all “structures” correspond to our databases and are thus finite)). Just a few notational reminders here so that we are on the same page.

Let $V$ be a countably infinite set of variables, disjoint from $\mathsf{Consts}$.

An atom is either of the form $t_1 = t_2$ (called an equality atom), or of the form $R(t_1,...,t_{\mathrm{ar}(R)})$ where each $t_i$ is either a constant or a variable.

Formulas can be built from atoms by combining them using logical connectives ($\land$, $\lor$, $\lnot$, $\implies$) and existential ($\exists$) and universal ($\forall$) quantifiers.

A variable can be free or bound in a formula.

A sentence is a formula with no free variables.

(Taken from Dan Suciu’s lecture notes). In English, what do these queries say?

$\exists x \exists y \exists y (x \neq y) \land (x\neq z) \land (y \neq z)$
$\exists x \exists y \forall z (z = x) \lor (z = y)$
If the schema contains a single binary relation $E$ (i.e., of arity two), one can see a DB on this schema as a directed graph. What does the following formula say about such graphs? $\forall x \exists y E(x,y) \lor E(y,x)$
$\forall x \forall y \exists z E(x,z) \land E(z,y)$
$\exists x y z E(x,y) \land E(y,z) \land E(z,x)$

A first-order query $q$ (or simply FO query) is the given of a first-order formula and of a tuple that contains all its free variables. The result $q(D)$ over a DB is defined as we expect (if you are in doubt, see “First-Order Queries” around page 29 of the book).

$q(x,x,y) = \exists z E(x,z) \land E(z,y)$. The formula is $\phi(x,y) = \exists z E(x,z) \land E(z,y)$, with $x$ and $y$ the two free variables, and the tuple is $(x,x,y)$. Example of a database and $q(D)$ in class.

Proposition: In terms of expressivity, first-order queries and relational algebra (Wikipedia reminder) and “vanilla SQL” are equivalent. (This should have been covered in the M1 database course.)

Conjunctive queries (CQs)

A conjunctive query (CQ) is an FO query in which we only use conjunction of atoms and existential quantifiers. In all generality, they are thus of the form $q(\bar{x}) = \exists \bar{y} \bigwedge_{i=1}^n R_i(\bar{z_i})$ (we write $\bar{x}$ for a tuple, of the form $\bar{x} = (x_1,...,x_n)$, where the $x_i$s are not necessarily distinct) where the $R_i$s are not necessarily distinct, $\bar{x}$ and $\bar{y}$ are tuples of variables that do not intersect, each $\bar{z_i}$ is a tuple of variables and/or constants, and the union of the variables occuring in the $z_i$s is equal to $\bar{x} \cup \bar{y}$.

$q(x,y,y,z) = \exists t,u [ A(z,y,t) \land A(t,u,u) \land B(u,x) ]$

(we write $\exists t,u$ instead of $\exists t \exists u$).

If the database contains facts $A(a,b,a)$, $A(a,c,c)$ and $B(c,a)$ for example, then the tuple $(a,b,b,a)$ will be in the result $q(D)$.

CQs correspond to the SELECT FROM WHERE fragment of SQL, where we only use equality in the WHERE clause. So they are quite an important fragment of FO queries.

Find the name of the customers that have bought chocolate, with the following schema (named perspective): $\Sigma =$ Customer(int id, varchar(50) name)
Orders(int clientId, int productID)
Products(int id, varchar(50) name)

In SQL, in the named perspective:

  SELECT Customers.name FROM Customers, Orders, Products WHERE Customer.id =
  Order.clientId AND Order.productId = Products.id AND Products.name = 'Chocolate'

With a CQ, in the unnamed perspective: $q(x) = \exists y \exists z \mathrm{Customer}(y,x) \land \mathrm{Order}(y,z) \land \mathrm{Products}(z,\text{'chodolate'})$

This is much easier to read: we see directly the “structure” of the query, we don’t care about the attribute names, and we do not need to chase down the equalities as they are directly visible.

== To move further, maybe for a TD? ### Unions of conjunctive queries (UCQs)

UCQ = a disjunction of CQs that all have the same free variables

$q(x,y) = (\exists z R(x,z) \land S(z,y)) \lor (\exists z1 \exists z2 S(z1, x) \land S(z2,y))$

Model checking and query evaluation, notions of combined and data complexity

Query evaluation problem: given a query $q$ and a DB $D$, compute $q(D)$.

Model checking: just like query evaluation, but for Boolean queries (the result is simply TRUE or FALSE).

If our query language is Turing-complete (e.g., Python), this task is going to be undecidable. But what about FO queries for instance?

When faced with a new problem, we usually:

First, try to understand in which complexity class it lies (is it PTIME, NP-complete, PSPACE-complete, EXPSPACE-complete, or is it maybe undecidable?)
If we are lucky and it is in PTIME, then we might want to obtain “efficient” algorithms (linear-time, or quadratic, or O(n n), etc.)

In this first part of the course we will mostly study classical problems and try to determine their complexity class. If they are in PTIME, we will not try to optimize the degree of the polynomial.

As usual, to talk about complexity, we have to decide how the input is represented (e.g., on a Turing machine input tape). For DBs and queries, we will consider any reasonable encoding (see, e.g., Appendix C of the book).

Then, there are two measures of complexity that are commonly used in database theory: the combined complexity and the data complexity. Intuitively, in the combined complexity the query and the DB are considered to be part of the input, whereas for the data complexity the query is fixed and only the DB is part of the input. Data complexity is motivated by the fact that in real life queries are much smaller than databases. Let’s see some examples of what we mean exactly by combined and data complexity.

ModelChecking(FO) is the following decision problem:

INPUT: A Boolean FO query q (notice that this is just a sentence) and a DB $D$.
OUTPUT: Accept if $D$ satisfies $q$, reject otherwise

(We define QueryEval(FO) similarly for arbitrary FO queries, where the output is $q(D)$.)

Here, we measure the combined complexity, that is, the complexity as a function $f(|\phi|,|D|)$. We will soon see what is the complexity of this specific problem, but for now let’s see an example of a problem where we measure the data complexity.

In data compexity, we typically have one computational problem per query. For instance:

ModelChecking($\exists x y z R(x,y) \land R(y,z)$) is the following decision problem:

INPUT: A database $D$
OUTPUT: Accept if $D$ satisfies that query, reject otherwise

Here, we measure the data complexity. It might be the case that ModelChecking($q_1$) and ModelChecking($q_2$) have a different complexity if $q_1$ and $q_2$ are different.

Theorem 1: For every FO Boolean query $q$, the problem ModelChecking($q$) is in PTIME. Informally, we can say that “the data complexity of FO is PTIME”

(informal proof). We first prove this theorem for the specific query (borrowed from Dan Suciu’s lecture notes): $q = \exists x [A(x) \land \forall y ( B(y) \implies C(x,y) ) ]$. We propose the following algorithm for solving ModelChecking($q$) in time $O(|D|^2)$, in pseudocode:

def model_checking(q, D):
    compute_adom(D)  # this can be done in linear time
    foundX = False
    for x in ADom(D):
      if not A(x):   # shorthand for "if A(x) \in A^D"
        continue     # recall that this breaks one iteration of the loop
      allY = True
      for y in ADom(D):
        if B(y) and not C(x,y):
          allY = False
      if allY:
        foundX = True
    return foundX

This algorithm is clearly correct. What is its complexity, as a function of $|D|$? Let $n$ be $|\mathrm{ADom}(D)|$. Clearly, we have $n \leq 2 |D|$, because the relations have arity at most two, so there can be at most $2|D|$ distinct values in $\mathrm{ADom}(D)$. Let us be conservative and consider that checking whether $A(c)$ is in $A^D$ for a specific value c is in linear time in $|D|$ (we can read the whole database in linear time, and if we encounter the fact $A(c)$ we return yes). Then the algorithm has complexity $O(n^3)$, which is $O(|D|^3)$ by what precedes. This is indeed PTIME.

-> This can be generalized to any FO Boolean query, with complexity roughly $O(|D|^k)$, with $k$ the number of variables in the query. Indeed, we can always rewrite an FO formula as a bunch of quantifiers, plus at the end a propositional formula (with no quantifiers) (see This wikipedia page). Then we can have an algorithm with the same structure as the one above, where we will have roughly $k$ nested for-loops, plus a bit of code to test the instantiated propositional formla.

Exercice: For any FO query (not necessarily Boolean), what is the complexity of QueryEval($q$)?

We now introduce an important notion for Boolean CQs that we will use a lot: homomorphisms.

Set $q$ be a Boolean CQ and $D$ a DB. We write $\mathrm{vars}(q)$ the set of variables of $q$. A homomorphism from $q$ to $D$ is a function

$h: \mathrm{vars}(x) \cup \mathsf{Consts} \to \mathsf{Consts}$ such that:

Constants are preserved: for every $c \in \mathsf{Consts}$, we have $h(c) = c$
Atoms of $q$ are mapped to facts of $D$: for every atom $R(\bar{x})$ of $q$, $R(h(\bar{x})) \in R^D$

(here we write $h(\bar{x})$ as a shorthand for the tuple $(h(x_1),...,h(x_n))$.)

$q = \exists x y z R(x,x,y) \land R(y,x,z) \land S(z,y)$, D = … TODO in class

Important observation: $D\models q$ iff there exists a homomorphism from $q$ to $D$.

Theorem 2: the problem ModelChecking(CQs) is NP-complete. Informally, we say that “CQs have NP-complete combined complexity”.

Before doing the proof, notice how this contrasts with Theorem 1 above: the data complexity of any FO query is PTIME, but when we look at the combined complexity, even restricted to the fragment of FO with only conjunctions and existential quantifiers (CQs), it becomes NP-complete!

This also shows that it is important to be precise when one defines a problem, in particular to specify exactly what is part of the input and what is not.

This problem is in NP: why? To show hardness, let’s reduce from 3SAT: Let $\phi$ be a 3CNF. We build (in PTIME in $|\phi|$) a CQ $q_\phi$ and a DB $D$ such that $\phi$ is satisfiable if and only if $D$ satisfies $q_\phi$.

The schema contains 4 relational symbols of arity 3, say $A0$, $A1$, $A2$, and $A3$. The query $q_\phi$ contains one atom per clause of $\phi$, in the following way:

if $C$ is a clause with no negated variables, so of the form $x_i \lor x_j \lor x_k$, then $q_\phi$ contains the atom $A0(x_i,x_j,x_k)$
if $C$ is a clause with exactly one negated variable, so of the form $x_i \lor x_j \lor \lnot x_k$, then $q_\phi$ contains the atom $A1(x_i,x_j,x_k)$
… exactly two negated variables, … $x_i \lor \lnot x_j \lor \lnot x_k$, then … $A2(x_i,x_j,x_k)$
… all 3 variables negated, then … $A4(x_i,x_j,x_k)$

($q_\phi$ is the conjunction of all those atoms, with all variables existentially quantified)

The database consists of the following 4 relations $A0^D$, $A1^D$… over the domain $\{0,1\}$. In $A0$ there are all possible facts except $A0(0,0,0)$, in A1 there are all possible facts except…, etc. (whiteboard drawing)

We can construct $q_\phi$ and $D$ in PTIME? -> yes

$\phi$ is satisfiable iff $D\models q_\phi$? -> yes (check in class, using homomorphisms)

Question (at home): what is the complexity of QueryEval(CQs)?

Theorem 3: The problem ModelChecking(FO) is PSPACE-complete.

Same proof as for Theorem 2, but we reduce from the PSPACE-hard problem TQBF, and we add to $q_\phi$ the appropriate quantifiers.

This shows that if we restrict problem (from FO to CQs), its complexity can decrease (here from PSPACE to NP)!

We also present an alternative proof of Theorem 2, simpler, by reduction from the 3 colorability problem (of non-directed graphs).

Alternative proof of Theorem 2. We reduce from 3 coloring, which we recall is the following problem:

3 colorability:

INPUT: A non directed graph $G = (V,E)$

OUTPUT: Accept if we can color the nodes of the graph with 3 colors such that no two adjacent nodes have the same color.

This problem is NP-complete. The reduction is then as follows.

The schema consists of a single binary relation $E$.

The database is $\{E(1,2), E(1,3), E(2,1), E(2,3), E(3,1), E(3,2)\}$, i.e., all facts $E(a,b)$ for $a,b \in \{0,1\}$ with $a\neq b$.

The CQ $q$ contains, for every edge $\{u,v\}$ of $G$, the two atoms $E(u,v)$ and $E(v,u)$. So the variables of $q$ are the nodes of $G$, and they are all existentially quantified.

This construction can be done in PTIME. We only need to check that $G$ is 3-colorable iff $D \models q$: in class, using homomorphisms

Lab

Fill the TODO of the following python code
Fill the TODO of the following python code, with first corrections
Fill the TODO of the following python code, with second corrections

Compiled the: ven. 26 avril 2024 14:56:41 CEST