2.3 Hashingd

Hashing is yet another method used for making retrievals faster. This method provides direct access to record on the basis of the value of a specific field called the hash_field. Here, when a new record is inserted, it is physically stored at an address which is computed by applying a mathematical function (hash function) to the value of the hash field. Thus for every new record,

hash_address = f (hash_field), where f is the hash function.

Later, when a record is to be retrieved, the same hash function is used to compute the address where the record is stored. Retrievals are faster since a direct access is provided and there is no search involved in the process.

An example of a typical hash function is given by a numeric hash field, say an id, modulus a very large prime number.

Q: Can there be more than one hash fields on a file ?

A: No

As hashing relates the field value to the address of the record, multiple hash fields will map a record to multiple addresses at the same time. Hence there can be only one hash field per file.

Collisions : Consider the example of the CUSTOMER table given earlier while discussing clustering. Let CUST_ID be the hash field and the hash function be defined as ((CUST_ID mod 10000)*64 + 1025). The records with CUST_ID 10001, 10002, 10003 etc. will be stored at addresses 1089, 1153, 1217 etc. respectively.

It is possible that two records hash to the same address leading to a collision. In the above example, records with CUST_ID values 20001, 20002, 20003 etc. will also map on to the addresses 1089, 1153, 1217 etc. respectively. And same is the case with CUST_ID values 30001, 30002, 30003 etc.

The methods to resolve a collision are by using :

1. Linear Search:

While inserting a new record, if it is found that the location at the hash address is already occupied by a previously inserted record, search for the next free location available in the disk and store the new record at this location. A pointer from the first record at the original hash address to the new record will also be stored. During retrieval, the hash address is computed to locate the record. When it is seen that the record is not available at the hash address, the pointer from the record at that address is followed to locate the required record.

In this method, the over head incurred is the time taken for the linear search to locate the next free location while inserting a record.

2. Collision Chain:

Here, the hash address location contains the head of a list of pointers linking together all records which hash to that address.

In this method, an overflow area needs to be used if the number of records mapping on to the same hash address exceeds the number of locations linked to it.

Powered by Blogger.