2.1 Clustering

This chapter discusses the issues related to how the data is physically stored on the disk and some of the access mechanisms commonly used for retrieving this data.

The Internal Level is the level which deals with the physical storage of data. While designing this layer, the main objective is to optimize performance by minimizing the number of disk accesses during the various database operations.




The figure shows the process of database access in general. The DBMS views the database as a collection of records. The File Manager of the underlying Operating System views it as a set of pages and the Disk Manager views it as a collection of physical locations on the disk.

When the DBMS makes a request for a specific record to the File Manager, the latter maps the record to a page containing it and requests the Disk Manager for the specific page. The Disk Manager determines the physical location on the disk and retrieves the required page.
2.1 Clustering
In the above process, if the page containing the requested record is already in the memory, retrieval from the disk is not necessary. In such a situation, time taken for the whole operation will be less. Thus, if records which are frequently used together are placed physically together, more records will be in the same page. Hence the number of pages to be retrieved will be less and this reduces the number of disk accesses which in turn gives a better performance.
This method of storing logically related records, physically together is called clustering.
Eg: Consider CUSTOMER table as shown below.
Cust ID
Cust Name
Cust City
...
10001
Raj
Delhi
...
10002
...
...
...
10003
...
...
...
10004
...
...
...
...
...
...
...
...
...
...
...
If queries retrieving Customers with consecutive Cust_IDs frequently occur in the application, clustering based on Cust_ID will help improving the performance of these queries. This can be explained as follows.

Assume that the Customer record size is 128 bytes and the typical size of a page retrieved by the File Manager is 1 Kb (1024 bytes).

If there is no clustering, it can be assumed that the Customer records are stored at random physical locations. In the worst-case scenario, each record may be placed in a different page. Hence a query to retrieve 100 records with consecutive Cust_Ids (say, 10001 to 10100), will require 100 pages to be accessed which in turn translates to 100 disk accesses.

But, if the records are clustered, a page can contain 8 records. Hence the number of pages to be accessed for retrieving the 100 consecutive records will be ceil(100/8) = 13. i.e., only 13 disk accesses will be required to obtain the query results. Thus, in the given example, clustering improves the speed by a factor of 7.7
Q: For what record size will clustering be of no benefit to improve performance ?
A: When the record size and page size are such that a page can contain only one record.
Q: Can a table have clustering on multiple fields simultaneously ?
A: No
•Intra-file Clustering – Clustered records belong to the same file (table) as in the above example.
•Inter-file Clustering – Clustered records belong to different files (tables). This type of clustering may be required to enhance the speed of queries retrieving related records from more than one tables. Here interleaving of records is used.\

2. The Internal Level
Powered by Blogger.