Monitoring of the KSSH application showed that HBase writes had become inexplicably slow, causing batches to pile up.
22/07/27 01:01:37 INFO Metrics: Initializing metrics system: phoenix
22/07/27 01:01:37 INFO MetricsConfig: loaded properties from hadoop-metrics2.properties
22/07/27 01:01:37 INFO MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
22/07/27 01:01:37 INFO MetricsSystemImpl: phoenix metrics system started
22/07/27 01:01:37 INFO deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
22/07/27 01:01:38 INFO CachedKafkaConsumer: Initial fetch for spark-executor-kssh_v1 kssh 1 284382547
22/07/27 01:01:38 INFO AbstractCoordinator: Discovered coordinator hadoop38:9092 (id: 2147483459 rack: null) for group spark-executor-kssh_v1.
22/07/27 01:01:40 INFO deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
22/07/27 01:01:40 INFO deprecation: dfs.socket.timeout is deprecated. Instead, use dfs.client.socket-timeout
22/07/27 01:01:40 INFO deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
22/07/27 01:01:40 INFO deprecation: dfs.socket.timeout is deprecated. Instead, use dfs.client.socket-timeout
22/07/27 01:01:40 INFO DefaultMetricsCollector: Configured metrics report to emit every 60 seconds
22/07/27 01:02:18 INFO AsyncProcess: #1, waiting for 2 actions to finish on table: JYDW:OMS_ORDERINFOITEM
22/07/27 01:02:18 INFO AsyncProcess: Left over 2 task(s) are processed on server(s): [hadoop49,60020,1550503850620]
22/07/27 01:02:18 INFO AsyncProcess: Regions against which left over task(s) are processed: [JYDW:OMS_ORDERINFOITEM,,1509703183405.f4a8235d947d1fa0f358dd1c789f2d97.]
2022-07-27 13:33:29,964 INFO SecurityLogger.org.apache.hadoop.hbase.Server: Auth successful for zhonggang (auth:SIMPLE)
2022-07-27 13:33:29,964 INFO SecurityLogger.org.apache.hadoop.hbase.Server: Connection from 192.168.6.75 port: 50808 with version info: version: "1.2.0-cdh5.12.0" url: "file:///data/jenkins/workspace/generic-binary-tarball-and-maven-deploy/CDH5.12.0-Packaging-HBase-2017-06-29_04-13-35/hbase-1.2.0-cdh5.12.0" revision: "Unknown" user: "jenkins" date: "Thu Jun 29 04:37:42 PDT 2017" src_checksum: "6834049453a9459ccaf4cadbf9a54b2c"
2022-07-27 13:35:46,847 INFO org.apache.hadoop.hbase.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3167ms
GC pool 'G1 Young Generation' had collection(s): count=2 time=3197ms
GC pool 'G1 Old Generation' had collection(s): count=1 time=8012ms
2022-07-27 13:36:12,361 INFO org.apache.hadoop.hbase.io.hfile.LruBlockCache: totalSize=6.01 MB, freeSize=5.59 GB, max=5.60 GB, blockCount=1, accesses=130, hits=130, hitRatio=100.00%, , cachingAccesses=129, cachingHits=129, cachingHitsRatio=100.00%, evictions=176, evicted=2, evictedPerRun=0.011363636702299118
2022-07-27 13:36:13,312 INFO org.apache.hadoop.hbase.io.hfile.bucket.BucketCache: failedBlockAdditions=0, totalSize=2.00 GB, freeSize=2.00 GB, usedSize=55 KB, cacheSize=38.78 KB, accesses=5168, hits=0, IOhitsPerSecond=0, IOTimePerHit=NaN, hitRatio=0,cachingAccesses=10, cachingHits=0, cachingHitsRatio=0,evictions=0, evicted=0, evictedPerRun=NaN
2022-07-27 13:36:42,457 INFO SecurityLogger.org.apache.hadoop.hbase.Server: Auth successful fo
We stopped the KSSH application from writing to HBase and asked other departments and colleagues to hold off on using HBase for the time being, so the cluster had no read or write traffic at all.
$ hbase hbck --help
Usage: fsck [opts] {only tables}
where [opts] are:
-help Display help options (this)
-details Display full report of all regions.
-timelag <timeInSeconds> Process only regions that have not experienced any metadata updates in the last <timeInSeconds> seconds.
-sleepBeforeRerun <timeInSeconds> Sleep this many seconds before checking if the fix worked if run with -fix
-summary Print only summary of the tables and status.
-metaonly Only check the state of the hbase:meta table.
-sidelineDir <hdfs://> HDFS path to backup existing meta.
-boundaries Verify that regions boundaries are the same between META and store files.
-exclusive Abort if another hbck is exclusive or fixing.
Metadata Repair options: (expert features, use with caution!)
-fix Try to fix region assignments. This is for backwards compatiblity
-fixAssignments Try to fix region assignments. Replaces the old -fix
-fixMeta Try to fix meta problems. This assumes HDFS region info is good.
-noHdfsChecking Don't load/check region info from HDFS. Assumes hbase:meta region info is good. Won't check/fix any HDFS issue, e.g. hole, orphan, or overlap
-fixHdfsHoles Try to fix region holes in hdfs.
-fixHdfsOrphans Try to fix region dirs with no .regioninfo file in hdfs
-fixTableOrphans Try to fix table dirs with no .tableinfo file in hdfs (online mode only)
-fixHdfsOverlaps Try to fix region overlaps in hdfs.
-fixVersionFile Try to fix missing hbase.version file in hdfs.
-maxMerge <n> When fixing region overlaps, allow at most <n> regions to merge. (n=5 by default)
-sidelineBigOverlaps When fixing region overlaps, allow to sideline big overlaps
-maxOverlapsToSideline <n> When fixing region overlaps, allow at most <n> regions to sideline per group. (n=2 by default)
-fixSplitParents Try to force offline split parents to be online.
-removeParents Try to offline and sideline lingering parents and keep daughter regions.
-ignorePreCheckPermission ignore filesystem permission pre-check
-fixReferenceFiles Try to offline lingering reference store files
-fixEmptyMetaCells Try to fix hbase:meta entries not referencing any region (empty REGIONINFO_QUALIFIER rows)
Datafile Repair options: (expert features, use with caution!)
-checkCorruptHFiles Check all Hfiles by opening them to make sure they are valid
-sidelineCorruptHFiles Quarantine corrupted HFiles. implies -checkCorruptHFiles
Metadata Repair shortcuts
-repair Shortcut for -fixAssignments -fixMeta -fixHdfsHoles -fixHdfsOrphans -fixHdfsOverlaps -fixVersionFile -sidelineBigOverlaps -fixReferenceFiles -fixTableLocks -fixOrphanedTableZnodes
-repairHoles Shortcut for -fixAssignments -fixMeta -fixHdfsHoles
Table lock options
-fixTableLocks Deletes table locks held for a long time (hbase.table.lock.expire.ms, 10min by default)
Table Znode options
-fixOrphanedTableZnodes Set table state in ZNode to disabled if table does not exists
Replication options
-fixReplication Deletes replication queues for removed peers
$ hbase hbck
ERROR: Region { meta => TESTDW:PUB_DELIVERREGIONRULE_IDX,,1548815156590.31cb7b9e86574a9ec05db1b2fe5916f6., hdfs => hdfs://nameservice1/hbase/data/TESTDW/PUB_DELIVERREGIONRULE_IDX/31cb7b9e86574a9ec05db1b2fe5916f6, deployed => , replicaId => 0 } not deployed on any region server.
.........
.........
22/07/27 16:39:14 INFO util.HBaseFsck: Handling overlap merges in parallel. set hbasefsck.overlap.merge.parallel to false to run serially.
ERROR: There is a hole in the region chain between and . You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table TESTDW:PUB_DELIVERREGIONRULE_IDX
$ hbase hbck -repair
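If the repair succeeds, a re-run of hbck should come back clean; the final summary should end with lines like these:
$ hbase hbck 2>&1 | grep -E 'inconsistencies detected|Status:'
0 inconsistencies detected.
Status: OK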
For reference, the RegionServer heap is 22 GB. HBase triggers MemStore flushes in these cases:
When any single MemStore in a Region reaches its limit (hbase.hregion.memstore.flush.size, 256 MB on this cluster), a MemStore flush is triggered.
When the combined size of all MemStores in a Region reaches its limit (hbase.hregion.memstore.block.multiplier * hbase.hregion.memstore.flush.size, by default 3 * 256 MB = 768 MB), a MemStore flush is triggered.
When the combined size of all MemStores on a RegionServer reaches its limit (hbase.regionserver.global.memstore.upperLimit * heap size = 0.45 * 22 GB = 9.9 GB, i.e. 45% of the JVM heap), some MemStores are flushed. Flushing goes from largest to smallest: the Region with the biggest MemStore footprint is flushed first, then the next biggest, until total MemStore usage drops below the lower threshold. Before Apache HBase / CDH 5.8.0 that threshold was hbase.regionserver.global.memstore.lowerLimit * heap size (38% of the JVM heap by default); from CDH 5.8.0 onward, hbase.regionserver.global.memstore.lowerLimit = 0.92, i.e. flushing starts once 92% of the global MemStore memory is in use.
A periodic flush also runs, by default every hour, so MemStores are never left unpersisted for too long. To keep all MemStores from flushing at the same moment, the periodic flush adds a random delay of roughly 20,000 ms.
http://hbasefly.com/2016/03/23/hbase-memstore-flush/
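As a minimal sketch, the thresholds above map to these hbase-site.xml overrides (the property names are the real HBase 1.x keys; the values mirror this cluster's settings, and the target file path is illustrative):
cat <<'EOF' >> /tmp/memstore-overrides.xml
<!-- merge these inside the <configuration> element of hbase-site.xml -->
<property><name>hbase.hregion.memstore.flush.size</name><value>268435456</value></property>
<property><name>hbase.hregion.memstore.block.multiplier</name><value>3</value></property>
<property><name>hbase.regionserver.global.memstore.upperLimit</name><value>0.45</value></property>
<property><name>hbase.regionserver.global.memstore.lowerLimit</name><value>0.92</value></property>
<property><name>hbase.regionserver.optionalcacheflushinterval</name><value>3600000</value></property>
EOF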
[root@hadoop38 bin]# ./zkCli.sh
[zk: localhost:2181(CONNECTED) 0] ls /
[zookeeper, yarn-leader-election, hadoop-ha, rmstore, kafka, hbase]
[zk: localhost:2181(CONNECTED) 1] rmr /hbase
$ hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair
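For the record, the drastic recovery sequence above, in order (a sketch; the stop/start steps here assume Cloudera Manager or the service init scripts):
# 1. Stop the entire HBase service (no Master or RegionServer running).
# 2. Wipe HBase's ZooKeeper state:
./zkCli.sh -server localhost:2181 rmr /hbase
# 3. Rebuild hbase:meta offline from the region directories on HDFS:
hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair
# 4. Start HBase again and verify consistency with: hbase hbck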
In my head I kept pushing an upgrade to the very last resort; truthfully, I wasn't even sure an upgrade would solve the problem.
CDH5.12.1-5.16.1-changes.log
# Create a security policy file granting jstatd the permissions it needs
vi /usr/java/jdk1.8.0_45/jstatd.all.policy
grant codebase "file:${java.home}/../lib/tools.jar" {
    permission java.security.AllPermission;
};
# Run jstatd in the background with that policy
nohup jstatd -J-Djava.security.policy=/usr/java/jdk1.8.0_45/jstatd.all.policy &
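With jstatd up, GC behavior can be watched remotely; for example (the PID and hostname below are placeholders, and 8998 is the JMX port set in the flags that follow):
jstat -gcutil 12345@hadoop56 5s
jvisualvm --openjmx hadoop56:8998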
-Dcom.sun.management.jmxremote.port=8998
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-XX:+UseG1GC
-XX:InitiatingHeapOccupancyPercent=60
-XX:-ResizePLAB
-XX:MaxGCPauseMillis=200
-XX:+UnlockDiagnosticVMOptions
-XX:+G1SummarizeConcMark
-XX:+ParallelRefProcEnabled
-XX:G1HeapRegionSize=32m
-XX:G1HeapWastePercent=20
-XX:ConcGCThreads=8
-XX:ParallelGCThreads=16
-XX:MaxTenuringThreshold=15
-XX:G1MixedGCCountTarget=64
-XX:+UnlockExperimentalVMOptions
-XX:G1NewSizePercent=3
-XX:G1OldCSetRegionThresholdPercent=5
-Dcom.sun.management.jmxremote.port=8999
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-XX:MetaspaceSize=200M
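A sketch of wiring these options into hbase-env.sh, assuming the first flag block (JMX port 8998) targets the RegionServer and the second (port 8999, Metaspace) the Master:
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -Dcom.sun.management.jmxremote.port=8998 \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -XX:+UseG1GC \
  -XX:InitiatingHeapOccupancyPercent=60 \
  -XX:-ResizePLAB \
  -XX:MaxGCPauseMillis=200 \
  -XX:+UnlockDiagnosticVMOptions \
  -XX:+G1SummarizeConcMark \
  -XX:+ParallelRefProcEnabled \
  -XX:G1HeapRegionSize=32m \
  -XX:G1HeapWastePercent=20 \
  -XX:ConcGCThreads=8 \
  -XX:ParallelGCThreads=16 \
  -XX:MaxTenuringThreshold=15 \
  -XX:G1MixedGCCountTarget=64 \
  -XX:+UnlockExperimentalVMOptions \
  -XX:G1NewSizePercent=3 \
  -XX:G1OldCSetRegionThresholdPercent=5"
export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS \
  -Dcom.sun.management.jmxremote.port=8999 \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -XX:MetaspaceSize=200M"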
"IPC Client (1179689991) connection to hadoop36/192.168.17.36:60000 from hbase" - Thread t@37484
java.lang.Thread.State: TIMED_WAITING
at java.lang.Object.wait(Native Method)
- waiting on <3ea53ce0> (a org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.waitForWork(RpcClientImpl.java:551)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.run(RpcClientImpl.java:566)
Locked ownable synchronizers:
- None
"RS_OPEN_REGION-hadoop56:60020-87" - Thread t@1201
java.lang.Thread.State: WAITING
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <7c36101b> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- None
Analyzing the RegionServer dump file (rs dump.log) showed a large number of threads stuck in WAITING.
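The dump can be captured and the WAITING threads counted like this (a sketch using the JDK's jps and jstack):
RS_PID=$(jps | awk '/HRegionServer/{print $1}')
jstack "$RS_PID" > rs_dump.log
grep -c 'java.lang.Thread.State: WAITING' rs_dump.log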
I wondered whether the sheer number of tables (index tables) was causing the startup to deadlock, leaving threads WAITING.
To verify this, I manually dropped the tables created by MCP test writes, roughly 200 tables, and the GC graphs and machine load showed a clear improvement!
Emboldened by that, I went on to drop every table not related to the business, shrinking the total table count.
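A sketch of the bulk cleanup, assuming the MCP test tables share an "MCP" prefix (adjust the regex to the real naming convention; disable_all and drop_all prompt for confirmation, hence the 'y' answers):
hbase shell <<'EOF'
disable_all 'MCP.*'
y
drop_all 'MCP.*'
y
EOF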