How a DB Runs Your Query in the Back End
If we run a query like
SELECT * FROM Student WHERE age > 15;
how does the database actually access its files, B+ trees, indexes, and so on? A given query can be processed in multiple ways, and those ways have different costs: the time taken to run the query, to access the data, to process it, and so on.
Basic steps in query processing
- Query : the input SQL query
- Parser and Translator : translates the SQL into a relational-algebra expression
- Relational-algebra expression : there can be one or many equivalent relational-algebra expressions for the same query
- Optimizer : decides how the relational-algebra expression can be made efficient; to optimise it, it uses statistics about the data, such as historical information on which attributes are used most, how many tuples actually exist, and so on
- Execution plan : based on the optimizer's result we get the execution plan with the least cost; it spells out what actions to take in terms of accessing the different indexes and different B+ tree nodes
- Evaluation engine : fetches the data from the B+ trees and performs the joining, merging and sorting operations specified in the execution plan
- Output : the results of the SQL query
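To make the Parser and Translator step concrete: for the example query at the top, the translated relational-algebra expression would be something like

σage>15(Student)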
Relational Algebra vs. Equivalent Expressions
Select Instructor.name
From Instructor
Where salary < 75000
Plan 1 : use the salary index to find and fetch only the records with salary < 75000
Plan 2 : scan the salary index (or the whole file) and discard every instructor with salary ≥ 75000
So, to find the best execution plan among such equivalents, we need to use optimization.
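One example of the equivalences the optimizer exploits: two selections in sequence can always be combined into a single conjunctive selection,

σθ1(σθ2(E)) ≡ σθ1∧θ2(E)

which lets it consider evaluating both conditions in one pass over the data.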
Query optimization
Out of all equivalent evaluation plans, we try to find the one with the minimum cost. Cost is estimated from statistical information in the database catalog: the number of tuples in each relation, the size of tuples, and the attributes on which conditions are frequently tested.
Measures of Query Cost
Cost is generally measured as the total elapsed time for answering a query.
- Disk access
- CPU usage
- Network Communication
Typically, disk access is the predominant cost, and it is also relatively easy to estimate. Looking at the storage structure, we typically have a magnetic disk where the head has to move to the correct cylinder to find the block where the record is located. This movement is called a seek.
Disk access measures :
- Number of seeks * average seek cost
- Number of blocks read * average block read cost
- Number of blocks written * average block write cost
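Putting these together, with tT for the average time to transfer one block and tS for the average seek time (the same notation used in the join-cost formulas later), an operation that performs S seeks and transfers b blocks costs roughly

b * tT + S * tS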
Usually the cost to write a block is greater than the cost to read a block, because data is normally read back after being written to ensure that the write was successful.
Warning : Here we ignore CPU costs for simplicity, but real systems do take CPU cost into account. These estimates also do not include the cost of writing the final output to disk.
Hints : Several algorithms can reduce disk I/O by using extra buffer space. For example, if a block has been read in and the next operation needs records that already exist in that block, and the block is still held in the buffer, we do not need to go back to disk and read it again.
The more buffer space we can provide, the better the performance becomes; but the memory required for the buffer grows, and it is often difficult to predict whether a particular block will still be in the buffer or not.
The amount of real memory available for the buffer depends on other concurrent queries and OS processes, and is known only during execution, so it is better to use a worst-case estimate.
Selection Algorithms
Linear search
Scan each file block and test all records to see whether they satisfy the selection condition.
If the selection is on a key attribute, the scan can stop as soon as the record is found. A minimal sketch follows the list below.
Linear search can be applied regardless of
- Selection condition
- Ordering of records in the file
- Availability of indices
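A minimal sketch of linear search in Python, assuming the file is handed to us as an iterable of blocks (each block a list of records); the record layout and block size here are illustrative only:

```python
def linear_search(blocks, predicate, key_attribute=False):
    """Scan each block in turn and test every record against the
    selection condition; needs no ordering and no index."""
    results = []
    for block in blocks:              # one block transfer per block
        for record in block:
            if predicate(record):
                results.append(record)
                if key_attribute:     # equality on a key: at most one match
                    return results
    return results

# Example: two blocks of Student records, selecting age > 15.
blocks = [[{"name": "Ann", "age": 14}, {"name": "Bo", "age": 17}],
          [{"name": "Cy", "age": 16}]]
print(linear_search(blocks, lambda r: r["age"] > 15))
```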
Binary Search
Binary search generally does not make sense, since the data is not stored consecutively (except when there is an index available), and binary search requires more seeks than an index search.
Index Scan (Search algorithms that use index)
Here we assume there is a B+ tree index that we can use to perform the selection.
If we use a primary index with an equality condition on the key, we retrieve the single record that satisfies the corresponding equality condition.
Retrieve Single record (Primary index, equality on key)
The height of the B+ tree is the number of index blocks that must be read, and each of those reads costs a transfer plus a seek, because the blocks are not consecutively located. One additional access is needed to read the data record itself, so the cost is (hi + 1) * (tT + tS).
Retrieve Multiple records (Primary index, equality on non key)
Because the file is ordered on the search key, the matching records sit in consecutive blocks, so after the index lookup we simply transfer those consecutive blocks.
Retrieve records on Secondary index (Secondary index, equality on non-key)
Retrieve a single record if the search key is a candidate key
Retrieve multiple records if the search key is not a candidate key; in that case each matching record may live in a different block, so each one can cost a separate I/O
Retrieve records on Comparison
Retrieve Records (Primary index, comparison)
For a condition like A ≥ v, it uses the B+ tree to find the first qualifying element, and after that it just transfers the records sequentially.
For a condition like A ≤ v, we can instead scan sequentially from the beginning of the file, following the primary-index order, until the condition fails; the index is not needed at all.
Retrieve Records (Secondary index, comparison)
In either case (for A ≤ v, scanning the index's leaf level from the beginning up to v; for A ≥ v, finding the first entry ≥ v and scanning forward), retrieve the records that are pointed to
- requires an I/O for each record
- Linear file scan may be cheaper
Select With Conjunctions
σθ1∧θ2∧ … ∧θn(Relation)
Retrieve records (conjunctive selection using one index)
- Select one condition θi, together with one of the access algorithms discussed so far, as the cheapest combination.
- Test the other conditions on each tuple after fetching it into the memory buffer.
Retrieve records (conjunctive selection using composite index)
- Use appropriate composite (multiple-key) index if available.
Retrieve records (conjunctive selection by intersection of identifiers)
- Requires indices with record pointers
- Use corresponding index for each condition and take intersection of all obtained sets of record pointers
- Then fetch the records from the file (a combined sketch for this and the disjunctive case follows the next section)
- If some conditions do not have appropriate indices, apply test in memory.
Select With Disjunctions
σθ1∨θ2∨ … ∨θn(Relation)
Retrieve records (disjunctive selection by union of identifiers)
- Applicable only if all the conditions have available indices; otherwise we need to use a linear scan
- Use corresponding index for each condition, and take union of all the obtained sets of record pointers
- After that we can fetch records from file.
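Both identifier-based strategies reduce to set operations on record pointers before any data blocks are fetched. A minimal sketch, assuming hypothetical rid (record-pointer) sets that one index lookup per condition would return:

```python
# Hypothetical rid (record-pointer) sets, one per condition's index lookup.
rids_theta1 = {3, 7, 9, 12}      # pointers to records satisfying θ1
rids_theta2 = {7, 9, 21}         # pointers to records satisfying θ2

conjunction = rids_theta1 & rids_theta2   # θ1 AND θ2 -> {7, 9}
disjunction = rids_theta1 | rids_theta2   # θ1 OR θ2  -> {3, 7, 9, 12, 21}

def fetch(rids, heap_file):
    """Fetch the pointed-to records; sorting the rids first keeps the
    block accesses in file order and reduces seeks."""
    return [heap_file[rid] for rid in sorted(rids)]
```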
Negation
- Use linear scan on file
- If very few records satisfy ¬θ, and an index is applicable to ¬θ, find the satisfying records using the index and fetch them from the file.
Sorting
If we build an index on the relation, we can use the index to read the relation in sorted order; this is just the in-order traversal behaviour of a B+ tree. However, it may lead to one disk-block access per tuple. If all the records fit into memory, we can use an in-memory algorithm like quicksort, but often that will not be the case, because relations are much bigger than memory. So we have to take recourse to the external sort-merge strategy. It is an old method but a very effective one.
External sort-merge
We certainly cannot read the whole relation into memory at once, so we take it in parts; say, in groups of 3 blocks, for example.
Merge pass 1
We sort the groups in memory: take as many records as fit into memory and sort them, producing sorted sublists of the original set. Then we merge them according to the merge strategy and write the result back to disk.
Merge pass 2
Now we have bigger sorted lists; these are called runs. We keep merging them, and depending on the actual size of the file relative to the memory available, we might have to do multiple passes until we get a single sorted output.
External sort-merge estimations (M = memory size in pages, i = run counter; a Python sketch of both phases follows the steps)
1. Create sorted runs. Repeatedly:
- Read M blocks of the relation into memory
- Sort the in-memory blocks
- Write the sorted data to run i; i++
2. Merge the runs (assuming the number of runs N < M):
- Use N blocks of memory to buffer the input runs, and 1 block to buffer the output. Read the first block of each run into its buffer page.
- Repeat : select the smallest record (first in sort order) among all buffer pages; write that record to the output buffer (if the output buffer is full, write it to disk); delete the record from its input buffer page; if the buffer page becomes empty, read the next block of that run into it.
- Until all input buffer pages are empty.
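A compact Python simulation of the two phases, assuming the whole relation arrives as a flat list and M is the number of records that fit in memory at once (heapq.merge plays the role of the N-way merge, so this sketch also assumes N < M):

```python
import heapq

def external_sort_merge(records, M):
    """Phase 1: split the relation into memory-sized chunks and sort
    each chunk into a run (on a real system each run is written back
    to disk).
    Phase 2: N-way merge of all runs, reading them lazily one element
    at a time, mirroring the one-buffer-page-per-run scheme."""
    runs = [sorted(records[i:i + M]) for i in range(0, len(records), M)]
    return list(heapq.merge(*runs))

print(external_sort_merge([9, 1, 7, 3, 8, 2, 6, 4, 5], M=3))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
```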
Join Operations
Join operation algorithms
- Nested loop join
- Block nested loop join
- Indexed nested loop join
- Merge join
- Hash Join
Nested-Loop Join
Here we have two relations, r and s, and a condition θ, and we perform the θ-join.
The basic Cartesian product matches every record of r with every record of s. This can be done using nested for loops: for each tuple tr in r we try each tuple ts in s, take the (tr, ts) pair, and if it satisfies the condition θ we send it to the output; otherwise we discard it.
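As a sketch in Python (tuples represented however the caller likes; theta is any predicate over a pair):

```python
def nested_loop_join(r, s, theta):
    """θ-join by brute force: pair every tuple tr of r with every
    tuple ts of s and keep the pairs that satisfy θ."""
    result = []
    for tr in r:          # outer relation
        for ts in s:      # inner relation, rescanned for every tr
            if theta(tr, ts):
                result.append((tr, ts))
    return result
```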
Cost estimation
- nr is the number of records (tuples) in relation r
- br is the number of blocks holding relation r
For every tuple of relation r we have to access all the blocks of relation s. So in the worst case, where memory holds only one block of each relation, the cost is nr * bs + br block transfers plus nr + br seeks.
Optimization 1 :
If the smaller relation fits entirely into memory, use it as the inner relation; we then avoid the repeated reads, and each relation is read only once, for br + bs block transfers.
example:
If the smaller relation (student) fits entirely in memory, the cost estimate will be 500 block transfers.
Block Nested-Loop Join
A variant of the nested-loop join in which every block of the inner relation is paired with every block of the outer relation, so the inner relation is scanned once per outer block rather than once per outer tuple.
Cost : worst case br * bs + br block transfers (and 2 * br seeks); best case, when the smaller relation fits in memory, br + bs block transfers (and 2 seeks).
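A sketch assuming both relations are supplied as lists of blocks; the point is that the two outer loops now run over blocks, so the inner relation is re-read once per block of r rather than once per tuple of r:

```python
def block_nested_loop_join(r_blocks, s_blocks, theta):
    """Pair every block of the outer relation with every block of the
    inner relation, then join the tuples inside the buffered pair."""
    result = []
    for Br in r_blocks:            # read each outer block once
        for Bs in s_blocks:        # inner read br times, not nr times
            for tr in Br:
                for ts in Bs:
                    if theta(tr, ts):
                        result.append((tr, ts))
    return result
```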
Indexed nested loop join
Indexed lookups can replace file scans if the join is an equi-join or natural join and
an index is available on the inner relation's join attribute. We can even construct an index just to compute a join.
For each tuple tr in the outer relation r, use the index to look up tuples in s that satisfy the join condition with tuple tr.
- Worst case: the buffer has space for only one page of r, and, for each tuple in r, we perform an index lookup on s.
- Cost of the join: br * (tT + tS) + nr * c
• where c is the cost of traversing the index and fetching all matching s tuples for one tuple of r
• c can be estimated as the cost of a single selection on s using the join condition.
If indices are available on join attributes of both r and s, use the relation with fewer tuples as the outer relation.
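A sketch of an equi-join where a plain Python dict stands in for the B+ tree index on s's join attribute (built here on the fly, matching the note that an index can be constructed just to compute a join):

```python
from collections import defaultdict

def indexed_nested_loop_join(r, s, r_key, s_key):
    """Probe an index on s once per tuple of r instead of scanning s."""
    index = defaultdict(list)          # stand-in for a B+ tree on s_key
    for ts in s:
        index[ts[s_key]].append(ts)
    result = []
    for tr in r:                        # one index lookup per r tuple
        for ts in index.get(tr[r_key], []):
            result.append((tr, ts))
    return result
```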
Duplicate Elimination
Duplicates can be eliminated by sorting or hashing: after sorting, identical tuples end up adjacent and can be dropped as they are found.
Projection
- Perform projection on each tuple
- Followed by duplicate elimination, as sketched below
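A sort-based sketch of the two steps, assuming tuples are dicts keyed by attribute name:

```python
def project_distinct(relation, attributes):
    """Project each tuple onto the given attributes, then eliminate
    duplicates by sorting so that equal tuples become adjacent."""
    projected = sorted(tuple(t[a] for a in attributes) for t in relation)
    result = []
    for row in projected:
        if not result or result[-1] != row:   # drop adjacent duplicates
            result.append(row)
    return result

rows = [{"name": "Ann", "dept": "CS"}, {"name": "Ann", "dept": "CS"},
        {"name": "Bo", "dept": "EE"}]
print(project_distinct(rows, ["name"]))   # [('Ann',), ('Bo',)]
```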
Aggregation (Group by)
This can be implemented in a manner similar to duplicate elimination: sort or hash the tuples on the grouping attributes so that tuples of the same group come together, then apply the aggregate function to each group. A hash-based sketch follows.
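Instead of removing matching tuples (as in duplicate elimination), tuples that hash to the same group are folded into a running aggregate:

```python
from collections import defaultdict

def group_by_sum(relation, group_attr, agg_attr):
    """GROUP BY group_attr with SUM(agg_attr): one hash bucket per group."""
    groups = defaultdict(int)
    for t in relation:
        groups[t[group_attr]] += t[agg_attr]
    return dict(groups)

rows = [{"dept": "CS", "salary": 100}, {"dept": "EE", "salary": 90},
        {"dept": "CS", "salary": 120}]
print(group_by_sum(rows, "dept", "salary"))   # {'CS': 220, 'EE': 90}
```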