ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
At PostHog we use it to store metadata for ClickHouse and Kafka.
Failure modes
Disk space usage increases rapidly
It has been observed that ZooKeeper can suddenly increase its disk usage after being in a stable state for some time. This can sometimes be resolved by clearing out old ZooKeeper snapshots. If you experience this issue, you can validate this solution by running zkCleanup.sh:
kubectl exec -it -n posthog posthog-posthog-zookeeper -- df -h /bitnami/zookeeper
kubectl exec -it -n posthog posthog-posthog-zookeeper -- /opt/bitnami/zookeeper/bin/zkCleanup.sh -n 3
kubectl exec -it -n posthog posthog-posthog-zookeeper -- df -h /bitnami/zookeeper
This will remove all snapshots aside from the last three, printing out the disk usage before and after.
In newer versions of our Helm chart we run snapshot cleanups periodically every hour. If you experience ZooKeeper disk space issues and are on chart version 18.2.0 or below, you can update to a later version to enable this. Alternatively, you can set the Helm value zookeeper.autopurge.purgeInterval=1, which will cause the cleanup job to run every hour.
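As a minimal sketch, you could set this value during a Helm upgrade along the following lines, assuming your release is named posthog in the posthog namespace and was installed from the posthog/posthog chart (adjust the release name, namespace and chart reference to match your installation):

helm upgrade posthog posthog/posthog -n posthog --reuse-values --set zookeeper.autopurge.purgeInterval=1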
If you wish to further debug what is being added to your cluster, you can inspect a snapshot diff by running zkSnapshotComparer.sh, e.g.:
kubectl exec -it -n posthog posthog-posthog-zookeeper -- /opt/bitnami/zookeeper/bin/zkSnapshotComparer.sh -l /bitnami/zookeeper/data/version-2/snapshot.fe376 -r /bitnami/zookeeper/data/version-2/snapshot.ff8c0 -b 2 -n 1
This will give you a breakdown of the number of nodes in each snapshot, as well as the exact node difference between the two. For example:
Deserialized snapshot in snapshot.fe376 in 0.045252 seconds
Processed data tree in 0.038782 seconds
Deserialized snapshot in snapshot.ff8c0 in 0.018605 seconds
Processed data tree in 0.006101 seconds
Node count: 1312
Total size: 115110
Max depth: 10
Count of nodes at depth 0: 1
Count of nodes at depth 1: 2
Count of nodes at depth 2: 5
Count of nodes at depth 3: 4
Count of nodes at depth 4: 12
Count of nodes at depth 5: 262
Count of nodes at depth 6: 546
Count of nodes at depth 7: 317
Count of nodes at depth 8: 162
Count of nodes at depth 9: 1
Node count: 1312
Total size: 115112
Max depth: 10
Count of nodes at depth 0: 1
Count of nodes at depth 1: 2
Count of nodes at depth 2: 5
Count of nodes at depth 3: 4
Count of nodes at depth 4: 12
Count of nodes at depth 5: 262
Count of nodes at depth 6: 546
Count of nodes at depth 7: 317
Count of nodes at depth 8: 162
Count of nodes at depth 9: 1
Printing analysis for nodes difference larger than 2 bytes or node count difference larger than 1.
Analysis for depth 0
Analysis for depth 1
Analysis for depth 2
Analysis for depth 3
Analysis for depth 4
Analysis for depth 5
Analysis for depth 6
Node /clickhouse/tables/0/posthog.events/blocks/202203_10072597193275699042_5516746108958885708 found only in right tree. Descendant size: 20. Descendant count: 0
...
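To choose which snapshot files to compare, you can list ZooKeeper's data directory (the same pod and path used in the commands above); with ls -lt the most recently modified snapshots appear first:

kubectl exec -it -n posthog posthog-posthog-zookeeper -- ls -lt /bitnami/zookeeper/data/version-2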