分布式执行计划示例
最近更新时间: 2024-10-17 17:10:00
下面是分布式架构下, 不同类型的关联(join)的执行计划的示例。
Local Join
--关联的两张表
create table t1 (
c1 integer,
c2 integer,
c3 integer,
c4 integer,
c5 integer
) distribute by shard(c1);
create table t2 (
c1 integer,
c2 integer,
c3 integer,
c4 integer,
c5 integer
) distribute by shard(c1);
tbase=# explain select t1.c3, t2.c3 from t1, t2 where t1.c1=t2.c1;
结果为
QUERY PLAN
--------------------------------------------------------------------------------
Remote Subquery Scan on all (datanode_1, datanode_2, datanode_3) (cost=17.71..85.28 rows=3192 width=8)
-> Hash Join (cost=17.71..85.28 rows=3192 width=8)
Hash Cond: (t1.c1 = t2.c1)
-> Seq Scan on t1 (cost=0.00..10.65 rows=565 width=8)
-> Hash (cost=10.65..10.65 rows=565 width=8)
-> Seq Scan on t2 (cost=0.00..10.65 rows=565 width=8)
EXPLAIN对应的执行计划流程
Distributed Join(1)
下面这个例子中只有Join的一端(t2表)需要重新分布
tbase=# explain select t1.c3, t2.c3 from t1, t2 where t1.c1=t2.c2;
结果为
QUERY PLAN
--------------------------------------------------------------------------------
Remote Subquery Scan on all (datanode_1, datanode_2, datanode_3) (cost=3285.00..7395.00 rows=100000 width=8)
-> Hash Join (cost=3285.00..7395.00 rows=100000 width=8)
Hash Cond: (t2.c2 = t1.c1)
-> Remote Subquery Scan on all (datanode_1, datanode_2, datanode_3) (cost=100.00..3335.00 rows=100000 width=8)
Distribute results by S: c2
-> Seq Scan on t2 (cost=0.00..1935.00 rows=100000 width=8)
-> Hash (cost=1935.00..1935.00 rows=100000 width=8)
-> Seq Scan on t1 (cost=0.00..1935.00 rows=100000 width=8)
EXPLAIN对应的执行流程: 其中t2表的扫描结果会按照c2列值重分布。
Distributed Join(2)
下面这个例子中Join的两个表(t1和t2)都需要重新分布, 因为关联条件都不是分布列。
tbase =# explain select t1.c3, t2.c3 from t1, t2 where t1.c2=t2.c2;
结果为
QUERY PLAN
--------------------------------------------------------------------------------
Remote Subquery Scan on all (datanode_1, datanode_2, datanode_3) (cost=4685.00..8795.00 rows=100000 width=8)
-> Hash Join (cost=4685.00..8795.00 rows=100000 width=8)
Hash Cond: (t1.c2 = t2.c2)
-> Remote Subquery Scan on all (datanode_1, datanode_2, datanode_3) (cost=100.00..3335.00 rows=100000 width=8)
Distribute results by S: c2
-> Seq Scan on t1 (cost=0.00..1935.00 rows=100000 width=8)
-> Hash (cost=3335.00..3335.00 rows=100000 width=8)
-> Remote Subquery Scan on all (datanode_1, datanode_2, datanode_3) (cost=100.00..3335.00 rows=100000 width=8)
Distribute results by S: c2
-> Seq Scan on t2 (cost=0.00..1935.00 rows=100000 width=8)
EXPLAIN对应的执行流程: 其中t1表和t2表的扫描结果会按照各自c2列值重分布。
Broadcast
下面这个例子中,当表t1是大表, t2是小表的时候, 虽然两张表的关联条件都不是分布列, 优化器可能选择broadcast join, 把小表的扫描结果发送到各个数据节点。
tbase =# explain select t1.c3, t2.c3 from t1, t2 where t1.c2=t2.c2;
结果为
QUERY PLAN
--------------------------------------------------------------------------------
Remote Subquery Scan on all (datanode_1, datanode_2, datanode_3) (cost=6635.00..10070.00 rows=100000 width=8)
-> Hash Join (cost=6635.00..10070.00 rows=100000 width=8)
Hash Cond: (t1.c2 = t2.c2)
-> Seq Scan on t1 (cost=0.00..1935.00 rows=100000 width=8)
-> Hash (cost=4135.00..4135.00 rows=200000 width=8)
-> Remote Subquery Scan on all (datanode_1, datanode_2, datanode_3) (cost=100.00..4135.00 rows=200000 width=8)
-> Seq Scan on t2 (cost=0.00..1935.00 rows=100000 width=8)
EXPLAIN对应的执行流程: