Distributed Database Concepts (chapter 2)
Data Fragmentation, Replication and Allocation Techniques for Distributed Database Design
Data Fragmentation
The process of dividing the Database into smaller multiple parts is called fragmentation. These fragments may be stored at different locations/sites.
This process is usually done by the key of the key value record in two ways
- Grouping all keys lexically : all keys that start with A or between A and C can be found on a specific server
- Grouping records by key : all records that have the same key and placing them on the same server.
Fragmentation and Correctness Rule in DDBS
Data fragmentation process should be carried out in such a way that the reconstruction of original database from the fragments is possible.
Fragmentation methods:
- Horizontal Data fragmentation
- Primary Horizontal data fragmentation
- Derived Horizontal data fragmentation
2. Vertical fragmentation
- Grouping (Bottom Up approach)
- Splitting (Top Down approach)
3. Hybrid/Mixed Fragmentation
Lets discuss above fragmentation methods in more details.
Horizontal Data fragmentation :
Horizontal fragmentation divides a relation(table) horizontally into the group of rows to create subset of tables.
We can reconstruct the fragments by union. We should not lose data or we should not get redundant data,
Primary Horizontal data fragmentation :
In Fragmentation we create fragments based on attribute of the given table. Whole raw of the original db is in fragments.
In recunstruct of the fragments to original relation we have correctness rules and allocation rules.
Correctness rules :
- Completeness : There is no loss of data due to fragmentation
- Reconstruction : Ability to reconstruct the original table from fragments, check whether functional dependencies are preserved. (R = R1 U R2 ……..U Rn)
- Disjointness : No record will become a part of two or more different fragments (R1 ∩ R2 ∩…………∩ Rn = Null)
Allocation :
- Replication : Availability of same copy of a database in multiple locations/sites is referred as full replication
- Non replication : Only one copy of the main copy is available in another site
Derived Horizontal data fragmentation :
Derived horizontal fragmentation a relation R1(Owner relation) may get fragmented by attribute of some other relation R2(Member relation)
Step 1 : Look at the tables we have
Step 2 : Create fragments on owner relation
Step 3 : Now let’s perform one semi join from above Project(member table) ⋉ HF1
Advantages :
- allows parallel processing on fragments of a relation
- allows a relation to be split so that tuples are located where they are most frequently accessed.
Vertical Data fragmentation :
Vertical fragmentation is more complex than Horizontal fragmentation. In order to maintain re-constructiveness each fragment should contain the primary key field of the table. In here we keep the primary key on each fragment for reconstruct
Approaches in Vertical Fragmentation :
Grouping (Bottom — Up approach)
- Starts by assigning each attribute to one fragments and at each step join some of the fragments until some criteria is satisfied.
- Attributes are progressively aggregated to constitute fragments
Splitting (Top — Down approach)
- Starts with a relation and decides on beneficial partitioning based on the access behaviour of the application to the attributes.
- Global relations are progressively split into fragments
Reconstruct of vertical fragmentation is performed by using Full outer join operation on fragments
Advantages :
- Allows tuples to be split so that each part of the tuple is stored where it is most frequently accessed
- Tuple-id attribute allows efficient joining of vertical fragments allows parallel processing on relation
Hybrid/Mixed Data Fragmentation :
This is combination of horizontal fragmentation and vertical fragmentation. Reconstruction of original table is often an expensive task. Original relation is obtained by combination of join and union operations. In here we first generate Horizontal fragments and we split them to vertical fragments
Advantages
- Fragments may be successively fragmented to an arbitrary depth
Data Replication
This means storing a copy or replica of a relation or fragment of relation redundantly in two or more sites. This improves availability of data, Speed up query evaluation with fast response, better performance, reduce network traffic.
Data Replication methods :
- Synchronised replication : Replica will be modified immediately after some changes are made in the relational table. So nio difference between original data and replica.
- Asynchronous replication : Replica will be modified after commit is fired on the database
Challenges on Data Replication
- Increase storage requirement : Maintaining multiple copies
- Increased cost and complexity of data updation : When data updates update need to reflect in all copies. This requires complex synchronization techniques and protocols.
Replication Schemes
- Full Replication
Stores multiple copies of each database fragment at multiple sites. In this case all database fragments are replicated. In here Concurrency control is difficult to achieve and update operation is slower.
2. Partial Replication
Stores multiple copies some database fragments at multiple sites. Still it has advantages and disadvantages as full replication
3. No replication : Stores each database fragment exactly at one location
Distributed Catalog Management
If a relation is fragmented and replicated, we must be able to uniquely identify each replica of each fragment. We need to track how data is distributed across sites.
Site catalog : Describe all objects (Fragments, replicas) at a site and keeps track of replicas of relational created at the site. (<local_name, birth_site>)
Catalog Management Schemas :
- Centralized management schema
In this approach global catalog is stored in a single site and all other sites access catalog information from this central site. In here availability and reliability is very low. The major drawback of this approach is that it decreases site autonomy because other sites are dependent on centralized site.
- Fully replicated management schema
In this approach catalog information are replicated at each site of the distributed system. The availability and reliability are very high. But this approach have information update overhead.
- Fragmented and Distributed management schema
This approach is adopted in distributed system R* to overcome the drawbacks of the above two approach. In here there is a local catalog at each site that contains the data stored at the site. Catalog of the birth site for a relation is responsible for keeping track of where replicas of the relation are stored. when fragment or replica is moved to a different location, the local catalog at the corresponding site must be updated.
Data Allocation
Allocation of data is one of the key design issues of distributed database. The main objective of data allocation in distributed database is to place data fragments at different sites in such a way, so that total data transfer cost can be minimised while executing a set of queries.
- Centralized : Consists of a single DB and DBMS stored at one site with users distributed across the network
- Partitioned : Database is partitioned into disjoint fragments, each fragment assigned to one site.
- Replicated : Replicated data allocation, copies of one or more database fragments are stored at different sites.
Factors need to consider
- Performance and data availability (If the data available in local site query can perform much faster, the transaction cost is reduced.)
- Size, number of rows and number of relations that an entity maintaining with other entities
- Types of transactions to be applied to the database and attributes accessed by each of those transactions.
Information Requirements
- Database Information : selectivity of fragments, Size of fragment
- Application information : number of read accesses of a query to a fragment, number of update accesses of a query to a fragment, matrix indicating which queries updates which fragments, matrix for retrievals, originating site of each query.
- Site Information : cost of storing data at a site, cost of processing at a site
- Network information : communication cost/frame(units of data in the link layer eg-wifi,bluetooth) between two sites
How to find the cost 🤨
Processing Component (Access cost)
Processing update cost
Processing retrieval cost
Thats the end of this chapter, In the next chapter lets discuss about the concurrency control and recovery in Distributed Database 😊
Chapter 1 : Overview of Distributed database concepts