Data ingestion

Last updated: Nov 14, 2022

On this page

  • Using INSERTs for ingestion
  • Why we ingest via Kafka tables
  • How Kafka tables work
  • Materialized views
  • Example schema - reading and writing ingestion events
  • Example schema visualized
  • Further reading

This document covers:

  • The different options for ingesting data into MergeTree tables and the trade-offs involved
  • How the Kafka table engine works
  • What materialized views are
  • An example of a full schema setup

Using INSERTs for ingestion

Like any database system, ClickHouse allows loading data via INSERT statements.

Each INSERT creates a new part in ClickHouse, which carries significant overhead and, in a busy system, quickly leads to errors from exceeding the parts_to_throw_insert MergeTree table setting (default: 300).

ClickHouse provides several options to make INSERTs work at scale. For example:

  • Batch inserts
  • async_insert setting
  • Buffer table engine
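
A rough sketch of the first two options, assuming a hypothetical events table (the table and its columns are not part of PostHog's schema; they're for illustration only):

SQL
-- Hypothetical table, for illustration only.
-- Option 1: batch many rows into a single INSERT, creating one part.
INSERT INTO events (team_id, event, timestamp) VALUES
    (1, 'pageview', now()),
    (1, 'click', now()),
    (2, 'pageview', now());

-- Option 2: async inserts - the server buffers small INSERTs
-- and flushes them to a part together.
INSERT INTO events
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES (3, 'pageview', now());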

All of these approaches come with their own trade-offs and consistency problems, and they require the ClickHouse cluster to always be accessible. As such, PostHog instead leverages Kafka engine tables for ingestion.

Why we ingest via Kafka tables

Instead of direct INSERTs, we rely on the Kafka table engine to handle ingestion into ClickHouse.

The benefits are:

  • Resiliency: Kafka handles sudden spikes in traffic and ClickHouse cluster unavailability gracefully
  • PostHog already uses Kafka throughout the app, making it a safe technical choice

The Kafka engine also has minimal memory overhead and allows us to temporarily stop ingestion at any time by removing the tables in question.
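
Because Kafka tracks the consumer group's offsets server-side, stopping and resuming ingestion is straightforward. A minimal sketch, using the example table defined in the next section:

SQL
-- Pause ingestion: ClickHouse stops consuming; Kafka retains
-- the messages and the consumer group's offsets.
DETACH TABLE kafka_ingestion_warnings;

-- Resume ingestion: consumption picks up from the stored offsets.
ATTACH TABLE kafka_ingestion_warnings;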

How Kafka tables work

Kafka engine tables act as Kafka consumers in a given consumer group. Selecting from that table advances the consumer offsets.

A Kafka table on its own does nothing beyond letting you SELECT from Kafka - it needs to be paired with other tables for ingestion to work.

Example Kafka engine table:

SQL
CREATE TABLE kafka_ingestion_warnings
(
    team_id Int64,
    source LowCardinality(VARCHAR),
    type VARCHAR,
    details VARCHAR CODEC(ZSTD(3)),
    timestamp DateTime64(6, 'UTC')
)
ENGINE = Kafka('kafka:9092', 'clickhouse_ingestion_warnings_test', 'group1', 'JSONEachRow')

It is important to send correctly formatted messages to the topic you're reading from: when selecting from a Kafka table, ClickHouse assumes the messages in the topic are formatted correctly. If they aren't, the consumer may stall and break ingestion, depending on the value of kafka_skip_broken_messages.
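
If you want the consumer to tolerate a bounded number of malformed messages rather than stall, the engine accepts a kafka_skip_broken_messages setting. A sketch of the table above with that setting (the threshold of 100 is an arbitrary choice for illustration):

SQL
-- Variant of the table above that skips up to 100 unparseable
-- messages per block instead of stalling the consumer.
CREATE TABLE kafka_ingestion_warnings_tolerant
(
    team_id Int64,
    source LowCardinality(VARCHAR),
    type VARCHAR,
    details VARCHAR CODEC(ZSTD(3)),
    timestamp DateTime64(6, 'UTC')
)
ENGINE = Kafka('kafka:9092', 'clickhouse_ingestion_warnings_test', 'group1', 'JSONEachRow')
SETTINGS kafka_skip_broken_messages = 100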

Materialized views

Materialized views in ClickHouse can be thought of as triggers - they react to new blocks being INSERTed into source tables and let you transform and pipe that data into other tables.

Materialized views come with a lot of gotchas. A great resource for learning more about them is this presentation.
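
A materialized view can also transform data in flight rather than just forward it. A minimal sketch, with hypothetical events and daily_counts tables:

SQL
-- Hypothetical source (`events`) and target (`daily_counts`) tables.
-- The view fires per INSERTed block, so the aggregation is partial;
-- the target table must merge partial results (e.g. a SummingMergeTree).
CREATE MATERIALIZED VIEW daily_counts_mv
TO daily_counts
AS SELECT
    team_id,
    toDate(timestamp) AS day,
    count() AS events_count
FROM events
GROUP BY team_id, day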

Example schema - reading and writing ingestion events

Consider the following sharded table schema together with kafka_ingestion_warnings:

SQL
CREATE TABLE sharded_ingestion_warnings
(
    team_id Int64,
    source LowCardinality(VARCHAR),
    type VARCHAR,
    details VARCHAR CODEC(ZSTD(3)),
    timestamp DateTime64(6, 'UTC'),
    _timestamp DateTime,
    _offset UInt64,
    _partition UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/posthog.sharded_ingestion_warnings', '{replica}')
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (team_id, toHour(timestamp), type, source, timestamp);

CREATE TABLE ingestion_warnings ON CLUSTER 'posthog'
(
    team_id Int64,
    source LowCardinality(VARCHAR),
    type VARCHAR,
    details VARCHAR CODEC(ZSTD(3)),
    timestamp DateTime64(6, 'UTC'),
    _timestamp DateTime,
    _offset UInt64,
    _partition UInt64
)
ENGINE = Distributed('posthog', 'posthog', 'sharded_ingestion_warnings', rand());

CREATE MATERIALIZED VIEW ingestion_warnings_mv
TO posthog.ingestion_warnings
AS SELECT
    team_id,
    source,
    type,
    details,
    timestamp,
    _timestamp,
    _offset,
    _partition
FROM posthog.kafka_ingestion_warnings

In this schema:

  • The sharded_ingestion_warnings MergeTree table is responsible for storing the ingested data
  • The ingestion_warnings Distributed table fields queries and distributes writes to the sharded_ingestion_warnings tables across shards
  • ingestion_warnings_mv regularly polls kafka_ingestion_warnings and pushes the data to the ingestion_warnings Distributed table
    • Note: it also forwards the _timestamp, _offset, and _partition virtual columns, which carry Kafka message metadata that can be stored and used for debugging
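
To make the read path concrete, a query like the following (the team_id and time window are arbitrary, for illustration) goes through the ingestion_warnings Distributed table, which fans out to each shard's sharded_ingestion_warnings table and merges the results:

SQL
-- Illustrative query; filters chosen arbitrarily.
SELECT type, source, count() AS occurrences
FROM ingestion_warnings
WHERE team_id = 2
  AND timestamp > now() - INTERVAL 1 DAY
GROUP BY type, source
ORDER BY occurrences DESC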

Example schema visualized

This is the same schema visualized in a ClickHouse cluster with 2 shards and 1 replica each:

[Diagram: on each of the two shards, a kafka_ingestion_warnings table (Kafka table engine) reads from the source Kafka topic; the ingestion_warnings_mv materialized view pushes that data to the ingestion_warnings table (Distributed table engine), which in turn pushes writes to the sharded_ingestion_warnings tables (ReplicatedMergeTree table engine) on both shards.]

Further reading

  • Performance considerations (ClickHouse docs)
  • Using the Kafka table engine
  • Kafka table engine (ClickHouse docs)
  • Confluent concepts (ZooKeeper)
  • ClickHouse Materialized Views Illuminated, Part 1
  • ClickHouse's not so secret weapon... Materialized Views
  • Everything you should know about materialized views

Next in the ClickHouse manual: Working with JSON


Authors

  • Ian Vanagas
  • Karl-Aksel Puulmann
