当前位置:文档之家› 浅谈merge join 与hash join的区别

浅谈merge join 与hash join的区别

Merge Joins
Sort merge joins can be used to join rows from two independent sources. Hash joins generally perform. better than sort merge joins. On the other hand, sort merge joins can perform. better than hash joins if both of the following conditions exist:

The row sources are sorted already.
A sort operation does not have to be done.
However, if a sort merge join involves choosing a slower access method (an index scan as opposed to a full table scan), then the benefit of using a sort merge might be lost.

Sort merge joins are useful when the join condition between two tables is an inequality condition (but not a nonequality) like <, <=, >, or >=. Sort merge joins perform. better than nested loop joins for large data sets. You cannot use hash joins unless there is an equality condition.

In a merge join, there is no concept of a driving table. The join consists of two steps:

Sort join operation: Both the inputs are sorted on the join key.
Merge join operation: The sorted lists are merged together.
If the input is already sorted by the join column, then a sort join operation is not performed for that row source.


Theoptimizercan choose a sort merge join over a hash join for joining large amounts of data if any of the following conditions are true:

The join condition between two tables is not an equi-join.
OPTIMIZER_MODE is set to RULE.
HASH_JOIN_ENABLED is false.
Because of sorts already required by other operations, the optimizer finds it is cheaper to use a sort merge than a hash join.
The optimizer thinks that the cost of a hash join is higher, based on the settings of HASH_AREA_SIZE and SORT_AREA_SIZE.


To advise the optimizer to use a sort merge join, apply the USE_MERGE hint. You might also need to give hints to force an access path.

There are situations where it is better to override the optimize with the USE_MERGE hint. For example, the optimizer can choose a full scan on a table and avoid a sort operation in aquery. However, there is an increased cost because a large table is accessed through an index and single block reads, as opposed to faster access through a full table scan.



Hash Joins
Hash joins are used for joining large data sets. The optimizer uses the smaller of two tables or data sources to build a hash table on the join key in memory. It then scans the larger table, probing the hash table to find the joined rows.

This method is best used when the smaller table fits in available memory. The cost is then limited to a single read pass over the data for the two tables.

However, if the hash table grows too big to fit into the memory, then the optimizer breaks it up into different partitions. As the partitions exceed allocated memory, parts are written to temporary segments on disk. Larger temporary extent sizes lead to improved I/O when writing the partitions to disk; the recommended temporary extent is about 1 MB. Temporary extent size is specified by INITIAL and NEXT for permanent ta

blespaces and by UNIFORM. SIZE for temporary tablespaces.

After the hash table is complete, the following processes occur:

The second, larger table is scanned.
It is broken up into partitions like the smaller table.
The partitions are written to disk.
When the hash table build is complete, it is possible that an entire hash table partition is resident in memory. Then, you do not need to build the corresponding partition for the second (larger) table. When that table is scanned, rows that hash to the resident hash table partition can be joined and returned immediately.

Each hash table partition is then read into memory, and the following processes occur:

The corresponding partition for the second table is scanned.
The hash table is probed to return the joined rows.
This process is repeated for the rest of the partitions. The cost can increase to two read passes over the data and one write pass over the data.

If the hash table does not fit in the memory, it is possible that parts of it may need to be swapped in and out, depending on the rowsretrievedfrom the second table. Performance for this scenario can be extremely poor.



The optimizer uses a hash join to join two tables if they are joined using an equijoin and if either of the following conditions are true:

A large amount of data needs to be joined.
A large fraction of the table needs to be joined.

SELECT o.customer_id, l.unit_price * l.quantity
FROM orders o ,order_items l
WHERE l.order_id = o.order_id;



Apply the USE_HASH hint to advise the optimizer to use a hash join when joining two tables together. If you are having trouble getting the optimizer to use hash joins, investigate the values for the HASH_AREA_SIZE and HASH_JOIN_ENABLED parameters.


相关主题
相关文档 最新文档