分布式执行计划示例

最近更新时间: 2024-06-12 15:06:00

下面是分布式架构下, 不同类型的关联(join)的执行计划的示例。

Local Join

--关联的两张表        
create table t1 (                                         
    c1 integer,                                           
    c2 integer,                                                 
    c3 integer,                                                  
    c4 integer,                                                     
    c5 integer                                                       
) distribute by shard(c1);

create table t2 (                                         
    c1 integer,                                           
    c2 integer,                                                 
    c3 integer,                                                  
    c4 integer,                                                     
    c5 integer                                                       
) distribute by shard(c1);


tbase=# explain select t1.c3, t2.c3 from t1, t2 where t1.c1=t2.c1;

结果为

                                   QUERY PLAN                                    
--------------------------------------------------------------------------------
 Remote Subquery Scan on all (datanode_1, datanode_2, datanode_3)  (cost=17.71..85.28 rows=3192 width=8)
   ->  Hash Join  (cost=17.71..85.28 rows=3192 width=8)
         Hash Cond: (t1.c1 = t2.c1)
         ->  Seq Scan on t1  (cost=0.00..10.65 rows=565 width=8)
         ->  Hash  (cost=10.65..10.65 rows=565 width=8)
               ->  Seq Scan on t2  (cost=0.00..10.65 rows=565 width=8)

EXPLAIN对应的执行计划流程

Distributed Join(1)

下面这个例子中只有Join的一端(t2表)需要重新分布

tbase=# explain select t1.c3, t2.c3 from t1, t2 where t1.c1=t2.c2;

结果为

                                      QUERY PLAN                                             
--------------------------------------------------------------------------------
 Remote Subquery Scan on all (datanode_1, datanode_2, datanode_3)  (cost=3285.00..7395.00 rows=100000 width=8)
   ->  Hash Join  (cost=3285.00..7395.00 rows=100000 width=8)
         Hash Cond: (t2.c2 = t1.c1)
         ->  Remote Subquery Scan on all (datanode_1, datanode_2, datanode_3)  (cost=100.00..3335.00 rows=100000 width=8)
               Distribute results by S: c2
               ->  Seq Scan on t2  (cost=0.00..1935.00 rows=100000 width=8)
         ->  Hash  (cost=1935.00..1935.00 rows=100000 width=8)
               ->  Seq Scan on t1  (cost=0.00..1935.00 rows=100000 width=8)

EXPLAIN对应的执行流程: 其中t2表的扫描结果会按照c2列值重分布。

Distributed Join(2)

下面这个例子中Join的两个表(t1和t2)都需要重新分布, 因为关联条件都不是分布列。

tbase =# explain select t1.c3, t2.c3 from t1, t2 where t1.c2=t2.c2;

结果为

                                QUERY PLAN                                                
--------------------------------------------------------------------------------
 Remote Subquery Scan on all (datanode_1, datanode_2, datanode_3)  (cost=4685.00..8795.00 rows=100000 width=8)
   ->  Hash Join  (cost=4685.00..8795.00 rows=100000 width=8)
         Hash Cond: (t1.c2 = t2.c2)
         ->  Remote Subquery Scan on all (datanode_1, datanode_2, datanode_3)  (cost=100.00..3335.00 rows=100000 width=8)
               Distribute results by S: c2
               ->  Seq Scan on t1  (cost=0.00..1935.00 rows=100000 width=8)
         ->  Hash  (cost=3335.00..3335.00 rows=100000 width=8)
               ->  Remote Subquery Scan on all (datanode_1, datanode_2, datanode_3)  (cost=100.00..3335.00 rows=100000 width=8)
                     Distribute results by S: c2
                     ->  Seq Scan on t2  (cost=0.00..1935.00 rows=100000 width=8)

EXPLAIN对应的执行流程: 其中t1表和t2表的扫描结果会按照各自c2列值重分布。

Broadcast

下面这个例子中,当表t1是大表, t2是小表的时候, 虽然两张表的关联条件都不是分布列, 优化器可能选择broadcast join, 把小表的扫描结果发送到各个数据节点。

tbase =# explain select t1.c3, t2.c3 from t1, t2 where t1.c2=t2.c2;

结果为

                                QUERY PLAN                                                
--------------------------------------------------------------------------------
 Remote Subquery Scan on all (datanode_1, datanode_2, datanode_3)  (cost=6635.00..10070.00 rows=100000 width=8)
   ->  Hash Join  (cost=6635.00..10070.00 rows=100000 width=8)
         Hash Cond: (t1.c2 = t2.c2)
         ->  Seq Scan on t1  (cost=0.00..1935.00 rows=100000 width=8)
         ->  Hash  (cost=4135.00..4135.00 rows=200000 width=8)
               ->  Remote Subquery Scan on all (datanode_1, datanode_2, datanode_3)  (cost=100.00..4135.00 rows=200000 width=8)
                     ->  Seq Scan on t2  (cost=0.00..1935.00 rows=100000 width=8)

EXPLAIN对应的执行流程: