故障处理
最近更新时间: 2024-10-17 17:10:00
1.节点服务启动或者停止失败排查方法
从监控平台上或者脚本操作节点服务失败的排查方法。
1.1 检查磁盘空间是否足够
上面的Avail表示还有多少空间可用,如果空间不够的话会影响tbase服务的运行
1.2 查看故障节点目录中pg_log目录最后一个日志文件日志内容
输入 ll -rt | tail -n 10,最示在最后一个文件就是最新的日志文件,使用vim打开,按shift + g移到文件内容最后面,按日志提示出错解决问题即可。
2.取消掉某些查询方法
有时应用写错了sql语句,或者程序bug引发大量不需要的查询并发执行,占用了大量系统资源,这时需要dba使用工具取消掉这些正在执行的查询。
2.1 查询要取消进程pid
postgres=# select * from pg_stat_activity where datid is not null and
pid!=pg_backend_pid() limit 1;
-[ RECORD 1 ]----+------------------------------
datid | 13325
datname | postgres
pid | 3981
usesysid | 10
usename | tbase
application_name | psql
client_addr | 127.0.0.1
client_hostname |
client_port | 56254
backend_start | 2018-08-14 19:33:12.235117+08
xact_start |
query_start | 2018-08-14 19:33:26.458043+08
state_change | 2018-08-14 19:33:26.60002+08
wait_event_type | Client
wait_event | ClientRead
state | idle
backend_xid |
backend_xmin |
query | select count(1) from t;
backend_type | client backend
2.2 取消某个节点上面的查询
postgres=# select pg_cancel_backend(3981)
2.3 取消所有cn节点上面的查询
[tbase@VM_0_37_centos shell]$./tbase_run_sql_cn.sh "select
pg_cancel_backend(pid) from pg_stat_activity where query like '%update t set
mc%' and pid!=pg_backend_pid()"
你可以先执行
./tbase_run_sql_cn.sh "select pid,query from pg_stat_activity where query like
'%update t set mc%' and pid!=pg_backend_pid()"
确认要取消进程是否正确
2.4 取消所有dn节点上面的查询
[tbase@VM_0_37_centos shell]$./tbase_run_sql_dn_master.sh "select
pg_cancel_backend(pid) from pg_stat_activity where query like '%update t set
mc%' and pid!=pg_backend_pid()"
你可以先执行
./tbase_run_sql_dn_master.sh "select pid,query from pg_stat_activity where query
like '%update t set mc%' and pid!=pg_backend_pid()"
确认要取消进程是否正确
使用执行pg_cancel_backend()有时进程不会响应,这时需要使用pg_terminate_backend()来杀死某个进程
3 kill掉某些进程方法
3.1 查到要kill掉进程pid
postgres=# select * from pg_stat_activity where datid is not null and
pid!=pg_backend_pid() limit 1;
-[ RECORD 1 ]----+------------------------------
datid | 13325
datname | postgres
pid | 3981
usesysid | 10
usename | tbase
application_name | psql
client_addr | 127.0.0.1
client_hostname |
client_port | 56254
backend_start | 2018-08-14 19:33:12.235117+08
xact_start |
query_start | 2018-08-14 19:33:26.458043+08
state_change | 2018-08-14 19:33:26.60002+08
wait_event_type | Client
wait_event | ClientRead
state | idle
backend_xid |
backend_xmin |
query | select count(1) from t;
backend_type | client backend
3.2 kill某个节点上面的查询
postgres=# select pg_terminate_backend(3981)
3.3 kill所有cn节点上面的查询
[tbase@VM_0_37_centos shell]$./tbase_run_sql_cn.sh "select
pg_terminate_backend(pid) from pg_stat_activity where query like '%update t set
mc%' and pid!=pg_backend_pid()"
你可以先执行
[tbase@VM_0_37_centos shell]$./tbase_run_sql_cn.sh "select pid,query from
pg_stat_activity where query like '%update t set mc%' and pid!=pg_backend_pid()"
确认要kill进程是否正确
3.4 kill所有dn节点上面的查询
[tbase@VM_0_37_centos shell]$./tbase_run_sql_dn_master.sh "select
pg_terminate_backend(pid) from pg_stat_activity where query like '%update t set
mc%' and pid!=pg_backend_pid()"
你可以先执行
[tbase@VM_0_37_centos shell]$./tbase_run_sql_dn_master.sh "select pid,query
from pg_stat_activity where query like '%update t set mc%' and
pid!=pg_backend_pid()"
确认要kill进程是否正确
4.执行ddl或者更新数据卡住排查
4.1 检查dn主备节点的复制同步级别
使用下面语句检查主备同步模式
[tbase@VM_0_37_centos shell]$./tbase_run_sql_dn_master.sh "select
current_setting('synchronous_commit') as
synchronous_commit,current_setting('synchronous_standby_names') as
synchronous_standby_names,array((select sync_state from pg_stat_replication ))
as sync_state"
synchronous_commit | synchronous_standby_names | sync_state
--------------------+---------------------------+------------
local | | {async}
(1 row)
如果返回值为synchronous_commit = on、synchronous_standby_names
不为空则表示主备为同步复制模式,这时sync_state值必需为sync,否则无法执行ddl或者更新的dml操作
解决问题方法
如果 sync_state值为空则表示备机服务没拉起业,把备机服务拉起来即可,或者
- 按8.1.2配置主备复制为某智能模式
- 把synchronous_commit 值配置为local
- roload dn主服务
4.2 检查cn或者dn节点是否有2pc残留
使用下面语句检查cn上是否有2pc存在
[tbase@VM_0_37_centos shell]$./tbase_run_sql_cn.sh "select gid from
pg_catalog.pg_prepared_xacts"
psql -h 172.16.0.37 -p 15432 -d postgres -U pgxz -c "select gid from
pg_catalog.pg_prepared_xacts"
gid
-----------
tbase_2pc
(1 row)
psql -h 172.16.0.42 -p 15432 -d postgres -U pgxz -c "select gid from
pg_catalog.pg_prepared_xacts"
gid
-----
(0 rows)
使用下面语句检查dn上是否有2pc存在
[tbase@VM_0_37_centos shell]$./tbase_run_sql_dn_master.sh "select gid from
pg_catalog.pg_prepared_xacts"
psql -h 172.16.0.37 -p 23001 -d postgres -U pgxz -c "select gid from
pg_catalog.pg_prepared_xacts"
gid
-----
(0 rows)
psql -h 172.16.0.42 -p 23002 -d postgres -U pgxz -c "select gid from
pg_catalog.pg_prepared_xacts"
gid
-----------
tbase_2pc
(1 row)
解决问题方法
commit或者rollback 影响的2pc残留,清理cn 2pc残留方法
rollback cn 2pc :./tbase_cn_2pc_rollback.sh
commit cn 2pc :./tbase_cn_2pc_commit.sh
rollback dn 2pc :./tbase_dn_2pc_rollback.sh
commit dn 2pc :./tbase_dn_2pc_commit.sh
4.3 排它锁影响
使用下面语句查询各个cn或者dn节点上面是否运行相关锁语句
检查cn的语句
[tbase@VM_0_37_centos shell]$./tbase_run_sql_cn.sh "select * from
pg_stat_activity where query ilike '%t_time_range%' and pid!=pg_backend_pid()"
检查dn的语句
[tbase@VM_0_37_centos shell]$ ./tbase_run_sql_dn_master.sh "select * from
pg_stat_activity where query ilike '%t_time_range%' and pid!=pg_backend_pid()"
上面的t_time_range为资源关键字,可以是表名,索引名。。。
解决问题方法
使用pg_cancel_backend函数把查询取消掉
在cn上执行
[tbase@VM_0_37_centos shell]$./tbase_run_sql_cn.sh "select pg_cancel_backend(pid) from pg_stat_activity where query like '%t_time_range%' and pid!=pg_backend_pid()"
在dn上面执行
[tbase@VM_0_37_centos shell]$./tbase_run_sql_dn_master.sh "select pg_cancel_backend(pid) from pg_stat_activity where query like '%t_time_range%' and pid!=pg_backend_pid()"
使用pg_terminate_backend函数把进程kill掉
在cn上执行
[tbase@VM_0_37_centos shell]$./tbase_run_sql_cn.sh "select pg_terminate_backend(pid) from pg_stat_activity where query like '%t_time_range%' and pid!=pg_backend_pid()"
在dn上面执行
[tbase@VM_0_37_centos shell]$./tbase_run_sql_dn_master.sh "select pg_terminate_backend(pid) from pg_stat_activity where query like '%t_time_range%' and pid!=pg_backend_pid()"