故障处理

最近更新时间: 2024-10-17 17:10:00

1.节点服务启动或者停止失败排查方法

从监控平台上或者脚本操作节点服务失败的排查方法。

1.1 检查磁盘空间是否足够

上面的Avail表示还有多少空间可用,如果空间不够的话会影响tbase服务的运行

1.2 查看故障节点目录中pg_log目录最后一个日志文件日志内容

输入 ll -rt | tail -n 10,最示在最后一个文件就是最新的日志文件,使用vim打开,按shift + g移到文件内容最后面,按日志提示出错解决问题即可。

2.取消掉某些查询方法

有时应用写错了sql语句,或者程序bug引发大量不需要的查询并发执行,占用了大量系统资源,这时需要dba使用工具取消掉这些正在执行的查询。

2.1 查询要取消进程pid

postgres=# select * from pg_stat_activity where datid is not null and
pid!=pg_backend_pid() limit 1;
-[ RECORD 1 ]----+------------------------------
datid | 13325
datname | postgres
pid | 3981
usesysid | 10
usename | tbase
application_name | psql
client_addr | 127.0.0.1
client_hostname |
client_port | 56254
backend_start | 2018-08-14 19:33:12.235117+08
xact_start |
query_start | 2018-08-14 19:33:26.458043+08
state_change | 2018-08-14 19:33:26.60002+08
wait_event_type | Client
wait_event | ClientRead
state | idle
backend_xid |
backend_xmin |
query | select count(1) from t;
backend_type | client backend

2.2 取消某个节点上面的查询

postgres=# select pg_cancel_backend(3981)

2.3 取消所有cn节点上面的查询

[tbase@VM_0_37_centos shell]$./tbase_run_sql_cn.sh "select

pg_cancel_backend(pid) from pg_stat_activity where query like '%update t set

mc%' and pid!=pg_backend_pid()"

你可以先执行

./tbase_run_sql_cn.sh "select pid,query from pg_stat_activity where query like

'%update t set mc%' and pid!=pg_backend_pid()"

确认要取消进程是否正确

2.4 取消所有dn节点上面的查询

[tbase@VM_0_37_centos shell]$./tbase_run_sql_dn_master.sh "select

pg_cancel_backend(pid) from pg_stat_activity where query like '%update t set

mc%' and pid!=pg_backend_pid()"

你可以先执行

./tbase_run_sql_dn_master.sh "select pid,query from pg_stat_activity where query

like '%update t set mc%' and pid!=pg_backend_pid()"

确认要取消进程是否正确

使用执行pg_cancel_backend()有时进程不会响应,这时需要使用pg_terminate_backend()来杀死某个进程

3 kill掉某些进程方法

3.1 查到要kill掉进程pid

postgres=# select * from pg_stat_activity where datid is not null and
pid!=pg_backend_pid() limit 1;
-[ RECORD 1 ]----+------------------------------
datid | 13325
datname | postgres
pid | 3981
usesysid | 10
usename | tbase
application_name | psql
client_addr | 127.0.0.1
client_hostname |
client_port | 56254
backend_start | 2018-08-14 19:33:12.235117+08
xact_start |
query_start | 2018-08-14 19:33:26.458043+08
state_change | 2018-08-14 19:33:26.60002+08
wait_event_type | Client
wait_event | ClientRead
state | idle
backend_xid |
backend_xmin |
query | select count(1) from t;
backend_type | client backend

3.2 kill某个节点上面的查询

postgres=# select pg_terminate_backend(3981)

3.3 kill所有cn节点上面的查询

[tbase@VM_0_37_centos shell]$./tbase_run_sql_cn.sh "select

pg_terminate_backend(pid) from pg_stat_activity where query like '%update t set

mc%' and pid!=pg_backend_pid()"

你可以先执行

[tbase@VM_0_37_centos shell]$./tbase_run_sql_cn.sh "select pid,query from

pg_stat_activity where query like '%update t set mc%' and pid!=pg_backend_pid()"

确认要kill进程是否正确

3.4 kill所有dn节点上面的查询

[tbase@VM_0_37_centos shell]$./tbase_run_sql_dn_master.sh "select

pg_terminate_backend(pid) from pg_stat_activity where query like '%update t set

mc%' and pid!=pg_backend_pid()"

你可以先执行

[tbase@VM_0_37_centos shell]$./tbase_run_sql_dn_master.sh "select pid,query

from pg_stat_activity where query like '%update t set mc%' and

pid!=pg_backend_pid()"

确认要kill进程是否正确

4.执行ddl或者更新数据卡住排查

4.1 检查dn主备节点的复制同步级别

使用下面语句检查主备同步模式

[tbase@VM_0_37_centos shell]$./tbase_run_sql_dn_master.sh "select
current_setting('synchronous_commit') as
synchronous_commit,current_setting('synchronous_standby_names') as
synchronous_standby_names,array((select sync_state from pg_stat_replication ))
as sync_state"

synchronous_commit | synchronous_standby_names | sync_state
--------------------+---------------------------+------------
local | | {async}
(1 row)

如果返回值为synchronous_commit = on、synchronous_standby_names

不为空则表示主备为同步复制模式,这时sync_state值必需为sync,否则无法执行ddl或者更新的dml操作

解决问题方法

如果 sync_state值为空则表示备机服务没拉起业,把备机服务拉起来即可,或者

  1. 按8.1.2配置主备复制为某智能模式
  2. 把synchronous_commit 值配置为local
  3. roload dn主服务

4.2 检查cn或者dn节点是否有2pc残留

使用下面语句检查cn上是否有2pc存在

[tbase@VM_0_37_centos shell]$./tbase_run_sql_cn.sh "select gid from
pg_catalog.pg_prepared_xacts"
psql -h 172.16.0.37 -p 15432 -d postgres -U pgxz -c "select gid from
pg_catalog.pg_prepared_xacts"
gid
-----------
tbase_2pc
(1 row)
psql -h 172.16.0.42 -p 15432 -d postgres -U pgxz -c "select gid from
pg_catalog.pg_prepared_xacts"
gid
-----
(0 rows)

使用下面语句检查dn上是否有2pc存在

[tbase@VM_0_37_centos shell]$./tbase_run_sql_dn_master.sh "select gid from
pg_catalog.pg_prepared_xacts"
psql -h 172.16.0.37 -p 23001 -d postgres -U pgxz -c "select gid from
pg_catalog.pg_prepared_xacts"
gid
-----
(0 rows)
psql -h 172.16.0.42 -p 23002 -d postgres -U pgxz -c "select gid from
pg_catalog.pg_prepared_xacts"
gid
-----------
tbase_2pc
(1 row)

解决问题方法

commit或者rollback 影响的2pc残留,清理cn 2pc残留方法

rollback cn 2pc :./tbase_cn_2pc_rollback.sh
commit cn 2pc :./tbase_cn_2pc_commit.sh
rollback dn 2pc :./tbase_dn_2pc_rollback.sh
commit dn 2pc :./tbase_dn_2pc_commit.sh

4.3 排它锁影响

使用下面语句查询各个cn或者dn节点上面是否运行相关锁语句

检查cn的语句

[tbase@VM_0_37_centos shell]$./tbase_run_sql_cn.sh "select * from

pg_stat_activity where query ilike '%t_time_range%' and pid!=pg_backend_pid()"

检查dn的语句

[tbase@VM_0_37_centos shell]$ ./tbase_run_sql_dn_master.sh "select * from

pg_stat_activity where query ilike '%t_time_range%' and pid!=pg_backend_pid()"

上面的t_time_range为资源关键字,可以是表名,索引名。。。

解决问题方法

  • 使用pg_cancel_backend函数把查询取消掉

    在cn上执行

    [tbase@VM_0_37_centos shell]$./tbase_run_sql_cn.sh "select
    
    pg_cancel_backend(pid) from pg_stat_activity where query like '%t_time_range%'
    
    and pid!=pg_backend_pid()"

    在dn上面执行

    [tbase@VM_0_37_centos shell]$./tbase_run_sql_dn_master.sh "select
    
    pg_cancel_backend(pid) from pg_stat_activity where query like '%t_time_range%'
    
    and pid!=pg_backend_pid()"
  • 使用pg_terminate_backend函数把进程kill掉

    在cn上执行

    [tbase@VM_0_37_centos shell]$./tbase_run_sql_cn.sh "select
    
    pg_terminate_backend(pid) from pg_stat_activity where query like
    
    '%t_time_range%' and pid!=pg_backend_pid()"

    在dn上面执行

    [tbase@VM_0_37_centos shell]$./tbase_run_sql_dn_master.sh "select
    
    pg_terminate_backend(pid) from pg_stat_activity where query like
    
    '%t_time_range%' and pid!=pg_backend_pid()"