Distributed Database Concepts (chapter 4)

Chandima Jayamina

6 min readJun 3, 2024

Query Processing and Optimization in Distributed Databases

Query Processing in distributed database :

In distributed system, relational algebra is not enough to express execution strategies.
The query processor must also select best sites to process data and possible ways data should be transformed

Query decomposition :

Input is calculus query(sql) maps to algebra operations(select, join, project, rename). In here it checks global directory to retrieve info about global conceptual schema. Both input and output queries refer to global relations, without knowledge of the distribution of data. In this phase it do Normalization, Analysis, Elimination of redundancy, rewriting.

Normalization: Where clause is format for conjunctive normal form (A∨B)∧(B∨C) or Disjunctive normal form (A∧B)∨(B∧C).
Analysis: Identify and reject type incorrect(check attributes & relation names defined in the global schema, check operations on attributes do not conflict with the types of the attributes) or semantically incorrect queries (if query components do not contribute in any way to the generation of the result, by using connection graph and join graph we can identify).
Elimination of redundancy: simplify the query by eliminate redundancies
Rewriting : Several algebraic queries can be derived from the same calculus query, that some algebraic queries are better than others.

Data localization :

Inputs is algebraic query on global conceptual schema. The main role of the second layer is to localize the query’s data using data distribution information in the fragment schema. In DDB, relations are fragmented & stored in disjoint subsets, called fragments, each being stored at different site.

This layer determines which fragments are involved in the query & transforms the distributed query into a query on fragments.

Translating of global query :

In DDBMS single global relation is some times fragmented. Fragments are deployed on various distinct sites, moreover to ensure processing locally sometimes a relation or a fragment gets replicated own.

Global query for its successful execution must get decomposed into fragment queries. Select * from student -> { Select * from student1, Select * from student2,….}

Global query optimization :

Inputs is algebraic query on fragments. Main objective is to determine an execution strategy close to the optimal solution. Permutation on queries leads to similar queries and can be order based on cost function.

This depends on fragment allocation and fragment statistics.The output of this phase is optimized algebraic query with communication operators including fragments

Local optimization :

last layer performed by all participating sites that have the fragments which one involved in query. Each sub query executing at one site, called local query on that site. This is optimized using local schema and then executed

At this time the algorithms to perform the relational operators may be chosen. The output is then returned to the site where the query is generated.

Distributed Query Processing Using Semijoin

Semi join based algorithm for query optimization :

Semi join matches the rows of two relations and then show the matching rows of the relation whose name is mentioned to the left side of ⋉ semi join operator.

Join ordering is important aspect in centralized DBMS, and it is even more important in DDBMS since join fragments that are stores at different sites may increase the communication time.

Approaches to achieve this :

Optimize the ordering of joins directly

Ingres and Distributed Ingres
System R and System R*

Replace join by combination of semi joins in order to minimize the communication costs

Hill climbing and SDD-1

The semi join approach is better if the semi join acts as a sufficient reducer(few tuples of R participate in the join

Data Transfer Costs of Distributed Query Processing

Distributed Cost Model :

The cost of the distributed execution strategy can be expressed with respect to either the total time or the response time.
Total time(cost) is sum of all time components
Response time is the elapsed time from the initiation to the completion of the query

Total time is equal to time of cpu instructions(a) and time of disk I/O(b) and communication time depicted by fixed time of initiating and receiving message(c) and time it takes to transmit a data unit from one site to another(d, #byte is the sum of all messages size),

communicating time can be reflect as -> Tmsg(Fixed time of initiating and receiving message) + Ttr(time it takes to transmit data from one site to another) * #bytes(total number of bytes)

Transactions in Distributed database management system

As in centralized database ACID properties for any transaction is mandatory in Distributed database.

Transaction based on location of access data

Local transactions involve read, write, update of data in only one local database
Global transaction involved read, write or update of data in many such local databases

Distributed Database Transaction system

Transaction manager : In DDB we have one for every site, whose main job is to verify ACID properties of those transactions that execute at that site.

Responsible for maintaining a log for recovery purpose,
Responsible for participating in an appropriate concurrency-control schema to coordinate the concurrent execution of the transactions executing at that site.

Transaction Coordinator : available for every site in DDB to manage and coordinate various transactions(both local and global) initiated at that site.

Responsible for execution of every transaction at that site.
Responsible for breaking the transaction into number of sub transactions and distributing these sub-transactions to the appropriate sites for execution

Transaction Models

General model :Read and write actions without any specific ordering.
Two-step transaction:If the transactions are restricted so that all the read actions are performed before any write action
Restricted model : If the transaction is restricted so that a data item has to be read before it can be updated
Restricted two-step transaction : If a transaction is both two-step and restricted
Action model : Consists of the restricted class with the further restriction that each read, write pair be executed automatically.

Distributed transaction

A client transaction becomes distributed if it invokes operations in several different servers.

Ways that distributed transaction can be structured

Flat transactions :

Client makes request to more than one server
Flat transaction completes each of its requests before going to the next one
Sequential access to objects

Nested transactions

The top level transaction can open subtransactions and each sub transaction can open further subtransactions
Sub transactions at the same level can run concurrently
Arrange in levels
Objects in different servers can be invoked in parallel

Thats the end of this chapter, In the next chapter lets discuss about the Types of Distributed database Systems and Distributed Database Architectures 😊

Distributed Database Concepts (chapter 5)

Types of Distributed database Systems and Distributed Database Architectures

chandimajayamina.medium.com

Chapter 1 : Overview of Distributed database concepts

Distributed Database Concepts (Overview)

This chapter is going to come as series of blogs 😊