Distributed Database Concepts (chapter 2)

7 min readJun 3, 2024

Data Fragmentation, Replication and Allocation Techniques for Distributed Database Design

Data Fragmentation

The process of dividing the Database into smaller multiple parts is called fragmentation. These fragments may be stored at different locations/sites.

This process is usually done by the key of the key value record in two ways

Grouping all keys lexically : all keys that start with A or between A and C can be found on a specific server
Grouping records by key : all records that have the same key and placing them on the same server.

Fragmentation and Correctness Rule in DDBS

Data fragmentation process should be carried out in such a way that the reconstruction of original database from the fragments is possible.

Fragmentation methods:

Horizontal Data fragmentation

Primary Horizontal data fragmentation
Derived Horizontal data fragmentation

2. Vertical fragmentation

Grouping (Bottom Up approach)
Splitting (Top Down approach)

3. Hybrid/Mixed Fragmentation

Lets discuss above fragmentation methods in more details.

Horizontal Data fragmentation :

Horizontal fragmentation divides a relation(table) horizontally into the group of rows to create subset of tables.

We can reconstruct the fragments by union. We should not lose data or we should not get redundant data,

Primary Horizontal data fragmentation :

In Fragmentation we create fragments based on attribute of the given table. Whole raw of the original db is in fragments.

In recunstruct of the fragments to original relation we have correctness rules and allocation rules.

Correctness rules :

Completeness : There is no loss of data due to fragmentation
Reconstruction : Ability to reconstruct the original table from fragments, check whether functional dependencies are preserved. (R = R1 U R2 ……..U Rn)
Disjointness : No record will become a part of two or more different fragments (R1 ∩ R2 ∩…………∩ Rn = Null)

Allocation :

Replication : Availability of same copy of a database in multiple locations/sites is referred as full replication
Non replication : Only one copy of the main copy is available in another site

Derived Horizontal data fragmentation :

Derived horizontal fragmentation a relation R1(Owner relation) may get fragmented by attribute of some other relation R2(Member relation)

Step 1 : Look at the tables we have

Step 2 : Create fragments on owner relation

Step 3 : Now let’s perform one semi join from above Project(member table) ⋉ HF1

Advantages :

allows parallel processing on fragments of a relation
allows a relation to be split so that tuples are located where they are most frequently accessed.

Vertical Data fragmentation :

Vertical fragmentation is more complex than Horizontal fragmentation. In order to maintain re-constructiveness each fragment should contain the primary key field of the table. In here we keep the primary key on each fragment for reconstruct

Approaches in Vertical Fragmentation :

Grouping (Bottom — Up approach)

Starts by assigning each attribute to one fragments and at each step join some of the fragments until some criteria is satisfied.
Attributes are progressively aggregated to constitute fragments

Splitting (Top — Down approach)

Starts with a relation and decides on beneficial partitioning based on the access behaviour of the application to the attributes.
Global relations are progressively split into fragments

Reconstruct of vertical fragmentation is performed by using Full outer join operation on fragments

Advantages :

Allows tuples to be split so that each part of the tuple is stored where it is most frequently accessed
Tuple-id attribute allows efficient joining of vertical fragments allows parallel processing on relation

Hybrid/Mixed Data Fragmentation :

This is combination of horizontal fragmentation and vertical fragmentation. Reconstruction of original table is often an expensive task. Original relation is obtained by combination of join and union operations. In here we first generate Horizontal fragments and we split them to vertical fragments

Advantages

Fragments may be successively fragmented to an arbitrary depth

Data Replication

This means storing a copy or replica of a relation or fragment of relation redundantly in two or more sites. This improves availability of data, Speed up query evaluation with fast response, better performance, reduce network traffic.

Data Replication methods :

Synchronised replication : Replica will be modified immediately after some changes are made in the relational table. So nio difference between original data and replica.
Asynchronous replication : Replica will be modified after commit is fired on the database

Challenges on Data Replication

Increase storage requirement : Maintaining multiple copies
Increased cost and complexity of data updation : When data updates update need to reflect in all copies. This requires complex synchronization techniques and protocols.

Replication Schemes

Full Replication

Stores multiple copies of each database fragment at multiple sites. In this case all database fragments are replicated. In here Concurrency control is difficult to achieve and update operation is slower.

2. Partial Replication

Stores multiple copies some database fragments at multiple sites. Still it has advantages and disadvantages as full replication

3. No replication : Stores each database fragment exactly at one location

Distributed Catalog Management

If a relation is fragmented and replicated, we must be able to uniquely identify each replica of each fragment. We need to track how data is distributed across sites.

Site catalog : Describe all objects (Fragments, replicas) at a site and keeps track of replicas of relational created at the site. (<local_name, birth_site>)

Catalog Management Schemas :

Centralized management schema

In this approach global catalog is stored in a single site and all other sites access catalog information from this central site. In here availability and reliability is very low. The major drawback of this approach is that it decreases site autonomy because other sites are dependent on centralized site.

Fully replicated management schema

In this approach catalog information are replicated at each site of the distributed system. The availability and reliability are very high. But this approach have information update overhead.

Fragmented and Distributed management schema

This approach is adopted in distributed system R* to overcome the drawbacks of the above two approach. In here there is a local catalog at each site that contains the data stored at the site. Catalog of the birth site for a relation is responsible for keeping track of where replicas of the relation are stored. when fragment or replica is moved to a different location, the local catalog at the corresponding site must be updated.

Data Allocation

Allocation of data is one of the key design issues of distributed database. The main objective of data allocation in distributed database is to place data fragments at different sites in such a way, so that total data transfer cost can be minimised while executing a set of queries.

Centralized : Consists of a single DB and DBMS stored at one site with users distributed across the network
Partitioned : Database is partitioned into disjoint fragments, each fragment assigned to one site.
Replicated : Replicated data allocation, copies of one or more database fragments are stored at different sites.

Factors need to consider

Performance and data availability (If the data available in local site query can perform much faster, the transaction cost is reduced.)
Size, number of rows and number of relations that an entity maintaining with other entities
Types of transactions to be applied to the database and attributes accessed by each of those transactions.

Information Requirements

Database Information : selectivity of fragments, Size of fragment
Application information : number of read accesses of a query to a fragment, number of update accesses of a query to a fragment, matrix indicating which queries updates which fragments, matrix for retrievals, originating site of each query.
Site Information : cost of storing data at a site, cost of processing at a site
Network information : communication cost/frame(units of data in the link layer eg-wifi,bluetooth) between two sites

How to find the cost 🤨

Processing Component (Access cost)