Operations

Last updated: Nov 14, 2022

On this page

  • Settings
  • Query settings
  • Server settings
  • MergeTree table settings
  • Profiles and users
  • Mutations
  • GDPR
  • Merges
  • OPTIMIZE TABLE
  • SYSTEM STOP MERGES
  • Important settings
  • Simple schema changes
  • TTLs
  • Tricky schema changes
  • Async migrations
  • Pausing ingestion
  • Changing table engines
  • Changing ORDER BY or PARTITION BY
  • Resharding
  • Denormalizing columns via dictionaries
  • Learn more

This document gives an overview of the kitchen side of ClickHouse: how various operations work, which tricky migrations we have experience with, and various settings and tips.

Settings

ClickHouse provides a daunting amount of configuration at every level. This section covers the different kinds of settings and how to configure them.

Query settings

Query settings let you manipulate the behavior of queries, for example by setting limits on query execution time and resource usage, or by toggling specific behaviors on and off.

Documentation:

  • Settings
  • Restrictions on Query Complexity

Query settings can be applied:

  • at query time via ClickHouse client library arguments (preferred)
  • at query time via an explicit SETTINGS clause in queries
  • via the users.xml file, to apply to all queries
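As a minimal sketch of the second option (the table, filter, and limit values here are illustrative), a SETTINGS clause at the end of a query caps its execution time and memory usage:

```sql
-- Hypothetical query: cap execution time and memory for this query only
SELECT count()
FROM events
WHERE timestamp > now() - INTERVAL 7 DAY
SETTINGS max_execution_time = 60,        -- seconds
         max_memory_usage = 10000000000  -- bytes (~10 GB)
```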

Server settings

Server settings allow tuning things like global thread or pool sizes, networking and other clickhouse-server-level configuration.

Documentation:

  • Server Settings

You can change server settings via the config.xml file. Note that some settings require a server restart to take effect.

MergeTree table settings

MergeTree settings allow configuring everything from primary index granularity to merge behavior to per-table usage limits.

Documentation:

  • MergeTree tables settings
  • Undocumented settings can be found in the source code

MergeTree table settings are set either:

  • at table creation time, via a SETTINGS clause
  • or via an ALTER TABLE ... MODIFY SETTING statement
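For illustration (the table name and values are hypothetical, not recommendations):

```sql
-- Set a MergeTree setting when creating the table
CREATE TABLE example_events (
    uuid UUID,
    timestamp DateTime
)
ENGINE = MergeTree()
ORDER BY (timestamp, uuid)
SETTINGS index_granularity = 8192;

-- Or change a setting on an existing table
ALTER TABLE example_events MODIFY SETTING parts_to_throw_insert = 600;
```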

Profiles and users

ClickHouse allows creating different profiles and users with their own set of settings. This can be useful for granting read-only access to some users or otherwise limiting resource use.

Read more in documentation:

  • Settings Profiles
  • Settings
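As a hedged sketch, assuming SQL-driven access management is enabled (the user name, password, and limits below are illustrative), a read-only profile and user could look like this:

```sql
-- A read-only settings profile with a memory cap (values illustrative)
CREATE SETTINGS PROFILE readonly_analyst
SETTINGS readonly = 1, max_memory_usage = 10000000000;

-- A user bound to that profile, limited to SELECT on one database
CREATE USER analyst IDENTIFIED BY 'change-me' SETTINGS PROFILE 'readonly_analyst';
GRANT SELECT ON default.* TO analyst;
```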

Mutations

ALTER TABLE ... UPDATE and ALTER TABLE ... DELETE operations, which mutate data, require ClickHouse to rewrite whole data parts via special merge operations. These are frequently expensive operations and should be monitored.

You can monitor the progress of mutations via the following system tables:

  • system.mutations
  • system.merges - see the is_mutation column
  • system.replication_queue

When creating mutations, it's often wise to adjust the mutations_sync setting, which controls whether the statement returns immediately or waits for the mutation to finish on the current replica or on all replicas.

Running mutations can be stopped by issuing a KILL MUTATION WHERE mutation_id = '...' statement.

Note that this may not stop any currently running merges. To do so, check out the section on SYSTEM STOP MERGES below.
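For example, a minimal sketch of checking mutation progress in system.mutations and cancelling a stuck mutation (the mutation ID is hypothetical):

```sql
-- List mutations that are still running, including any failure reason
SELECT mutation_id, command, parts_to_do, is_done, latest_fail_reason
FROM system.mutations
WHERE is_done = 0;

-- Cancel a specific mutation (the ID here is illustrative)
KILL MUTATION WHERE mutation_id = 'mutation_1234.txt';
```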

GDPR

When you need to delete user data, for GDPR compliance or otherwise, it's wise to do so in batches and asynchronously.

At PostHog, when deleting user data, we schedule for all deletions to occur once per week to minimize the cost of rewriting data.
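As a hedged sketch of such a batched deletion (the table and column names are illustrative, not our actual schema), run asynchronously so the rewrites happen in background merges:

```sql
-- Delete all events for the users queued for deletion this week (names illustrative)
ALTER TABLE events
DELETE WHERE person_id IN (SELECT person_id FROM pending_person_deletions)
SETTINGS mutations_sync = 0;  -- return immediately; the mutation runs in the background
```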

In the future, lightweight deletes might simplify this process.

Merges

As explained previously, merges are the lifeblood of ClickHouse, responsible for optimizing how data is laid out on disk as well as for deduplicating data.

Merges can be monitored via the following tables:

  • system.merges
  • system.replication_queue
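For example, a quick way to see what is currently merging (the columns shown are a subset of system.merges):

```sql
-- Currently running merges and mutation-merges, biggest first
SELECT database, table, elapsed, progress, is_mutation, total_size_bytes_compressed
FROM system.merges
ORDER BY total_size_bytes_compressed DESC;
```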

OPTIMIZE TABLE

The OPTIMIZE TABLE statement schedules merges for a table, optimizing the on-disk layout, speeding up queries, or forcing some schema changes into effect.

Note that not all parts are guaranteed to be merged if the size of parts exceeds maximum limits or if the data is already in a single part. In that case, adding a FINAL modifier forces the merge regardless.
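For example (the table and cluster names are illustrative):

```sql
-- Schedule merges for a table; FINAL forces merging even if data is already in one part
OPTIMIZE TABLE events FINAL;

-- On a sharded setup, ON CLUSTER runs it on every shard (cluster name illustrative)
OPTIMIZE TABLE events ON CLUSTER posthog FINAL;
```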

SYSTEM STOP MERGES

The SYSTEM STOP MERGES statement temporarily stops background merges for a table or for the whole server. This can be useful during trickier schema migrations when copying data.

Note that unless ingestion is also paused during this time, this can easily lead to 'too many parts' errors.

Merges can be resumed via the SYSTEM START MERGES statement.
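A minimal sketch of pausing merges around a tricky data copy (the table name is illustrative):

```sql
-- Pause background merges for a single table
SYSTEM STOP MERGES events;

-- ... run the data copy or schema migration here ...

-- Resume merges once the migration is done
SYSTEM START MERGES events;
```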

Important settings

Merges have many relevant settings to be aware of:

  • parts_to_throw_insert controls when ClickHouse starts rejecting inserts with 'too many parts' errors as the part count grows
  • max_bytes_to_merge_at_max_space_in_pool controls the maximum size of parts that merges will produce
  • background_pool_size (and related) server settings control how many merges are executed in parallel
  • The undocumented max_replicated_mutations_in_queue and max_replicated_merges_in_queue settings control how many merges and mutations are processed at once
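To see how close a table is to the parts_to_throw_insert limit, one illustrative check is to count active parts per partition:

```sql
-- Active part counts per partition; high numbers mean inserts may soon be rejected
SELECT table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY table, partition
ORDER BY active_parts DESC
LIMIT 10;
```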

Simple schema changes

As in any other database, schema changes are done via ALTER TABLE statements.

One area where ClickHouse differs from other databases is that schema changes are generally lazy, applying only to new data or newly merged parts. This applies to:

  • Adding or removing columns, changing default values
  • Changing compression of columns
  • Updating table settings

You can generally force these changes onto old data by forcing it to be merged via an OPTIMIZE TABLE ... FINAL statement, but this can be expensive.
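For example (the column name and type are hypothetical), adding a column is cheap because old parts simply serve the default value until they are rewritten:

```sql
-- Lazy change: only new and newly merged parts materialize the column on disk
ALTER TABLE events ADD COLUMN IF NOT EXISTS app_version LowCardinality(String) DEFAULT '';

-- Optionally force the change onto old data by rewriting it (expensive)
OPTIMIZE TABLE events FINAL;
```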

TTLs

ClickHouse TTLs allow dropping old rows or columns after expiry.

It's suggested to also partition your table by timestamp, so that old parts can be dropped completely instead of being rewritten when the TTL expires.
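A hedged sketch of a table combining monthly partitioning with a row TTL (the names and the 90-day interval are illustrative):

```sql
CREATE TABLE example_recordings (
    session_id String,
    timestamp DateTime,
    payload String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)   -- whole partitions can be dropped once expired
ORDER BY (session_id, timestamp)
TTL timestamp + INTERVAL 90 DAY;   -- rows expire 90 days after their timestamp
```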

Tricky schema changes

Some schema changes are deceptively hard and frequently require rewriting the whole table or re-creating tables.

Make sure never to re-use ZooKeeper paths when re-creating a replicated table!

What makes a change hard often comes down to how data is stored on disk and the implications of that layout.

Async migrations

At PostHog, we've developed Async Migrations for executing these long-running operations in the background without affecting availability.

You can learn more about Async Migrations in our blog, handbook and runbook.

Pausing ingestion

This is frequently a prerequisite for any large-scale schema change, as new data may otherwise be lost while you are copying data from one place to another.

If you're using Kafka engine tables for ingestion, you can pause ingestion by dropping materialized view(s) attached to Kafka engine tables.

To restart ingestion, recreate the dropped materialized view(s).
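As a rough illustration (the object names are hypothetical), assuming a materialized view moves rows from a Kafka engine table into the target table:

```sql
-- Pause ingestion: drop the materialized view that reads from the Kafka engine table
DROP TABLE IF EXISTS events_mv;

-- Resume ingestion later by recreating it; consumption continues from the committed Kafka offsets
CREATE MATERIALIZED VIEW events_mv TO events AS
SELECT * FROM kafka_events;
```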

Changing table engines

When changing table engines, you can leverage ALTER PARTITION commands to move data between tables.

Note: ALTER PARTITION commands only work if the two tables have an identical structure: the same columns, ORDER BY, and PARTITION BY. They work by creating hard links to the parts being moved, so the operation does not require any extra disk space until merges happen.

Thus it's important to stop both ingestion and merges during this operation.
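As a rough sketch (the table names and partition value are hypothetical), moving one partition from an old table into a new table with a different engine but identical structure:

```sql
-- Hard-links the parts of partition 202211 from events_old into events_new;
-- the source table keeps its data until you drop it
ALTER TABLE events_new ATTACH PARTITION 202211 FROM events_old;
```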

PostHog needed to implement this kind of operation to move to a sharded schema: 0004_replicated_schema.py.

Changing ORDER BY or PARTITION BY

Changing ORDER BY and PARTITION BY affects how data is stored on disk and requires rewriting this data.

Suggested procedure if using ReplacingMergeTree:

  1. Create a new table with the correct ORDER BY
  2. Create a new materialized view writing new data to the new table
  3. Copy data over from the old table via INSERT INTO ... SELECT
  4. Deduplicate via OPTIMIZE TABLE ... FINAL if feasible

Note that INSERT-ing data this way may be slow or time out. Consider:

  • Dropping any materialized columns temporarily
  • Increasing the max_execution_time, send_timeout and receive_timeout query settings so they are large enough
  • Finding the right values for the max_block_size, max_insert_block_size, max_threads and max_insert_threads query settings
  • Setting the optimize_on_insert setting to 0
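A hedged sketch of the copy step with some of these settings applied (table names and values are illustrative):

```sql
INSERT INTO events_new
SELECT *
FROM events_old
SETTINGS max_execution_time = 0,   -- disable the time limit for this long-running copy
         max_insert_threads = 8,
         optimize_on_insert = 0;   -- skip ReplacingMergeTree deduplication work at insert time
```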

Note that this operation temporarily doubles the amount of disk space you need.

An example (from PostHog) of an async migration: 0005_person_replacing_by_version.py

Resharding

At PostHog, we haven't had to reshard data (yet), but the process would look similar to changing ORDER BY or PARTITION BY, requiring us either to pause ingestion or to deduplicate at the end.

Storing/restoring parts of data from backups might also simplify this process.

Denormalizing columns via dictionaries

A powerful tool in the performance arsenal is denormalizing data.

At PostHog, we eliminated some JOINs for person data by storing information on person identities and properties directly on events.

Backfilling this data was implemented via ALTER TABLE ... UPDATE statements populating the new columns. The column data was pulled in using dictionaries, which made it possible to hold data from other tables in memory and query it during the update.

An alternative approach might have been to create a new table and populate it similar to changing ORDER BY, but this would have required expensive deduplication, a lot of extra space and even more memory usage.
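As a hedged sketch of the pattern, not the actual PostHog migration (dictionary, table, and column names are illustrative):

```sql
-- In-memory dictionary over the person table, keyed by person id
CREATE DICTIONARY person_dict (
    id UUID,
    properties String DEFAULT ''
)
PRIMARY KEY id
SOURCE(CLICKHOUSE(TABLE 'person'))
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(0);

-- Backfill a denormalized column on events by looking values up in the dictionary
ALTER TABLE events
UPDATE person_properties = dictGet('person_dict', 'properties', tuple(person_id))
WHERE person_properties = '';
```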

Learn more on this:

  • 0007_persons_and_groups_on_events_backfill.py
  • Altinity knowledge base: Column backfilling with alter/update using a dictionary

Learn more

More information on ClickHouse operations can be found in:

  • Altinity knowledgebase
  • Tinybird knowledgebase

Next in the ClickHouse manual: Schema case studies

Authors

  • Karl-Aksel Puulmann
  • Ian Vanagas
