Ingest historical data

Last updated: Jul 21, 2022

On this page

  • Data ingestion process
  • Importing events
  • User identification

Historical data ingestion (or importing data), as opposed to live data ingestion, is the process of transporting data from external sources into PostHog so you can benefit from PostHog product analytics on historical data. You may have historical data that you want to analyze alongside new live data, or a requirement to periodically import data from third-party sources to augment your live data.

Whatever your reason for ingesting historical data, this guide covers what to consider during the process.

The three main factors to consider are:

  • Data ingestion process: how to get the event data from the third-party source into PostHog
  • Importing events: how to send the events captured in the third-party data source into PostHog as custom events
  • User identification: how to identify users within PostHog and tie them back to the corresponding users in the original data source

Historical data is sent to PostHog using either a server library or the PostHog API. For more information see the importing events section.

Data ingestion process

Since the third-party data source will offer an API, you can automate importing the data from one or more sources into PostHog using the PostHog API.

The following factors are important in the export-and-import process:

  • The volume of data
  • API rate limits of both the data source and PostHog
  • Exporting only the data required for events
  • Handling error scenarios so that the process can resume from the last successful point

With the above factors in mind, it's recommended that you break the process up into steps such as the following (a Python sketch of the full loop follows the list):

  1. Sequentially export selective data representing key events from the data source, keeping track of where you are in the sequence so that you can restart the process from the last successful point if any problems occur

  2. Store the selected exported data in a new data store for faster access in future steps

  3. Transform the data to the format you will use with PostHog and again save it to a storage mechanism for faster access in later steps.

    The data format should be:

    JSON
    {
        "event": "[event name]",
        "distinct_id": "[your users' distinct id]",
        "properties": {
            "key1": "value1",
            "key2": "value2"
        },
        "timestamp": "[optional timestamp in ISO 8601 format]"
    }

    At this stage, it's also important to consider the following:

    1. Use the same event name that you're going to use with your live data ingestion so the historical and live events are seen as the same type within PostHog
    2. Use the same unique identifier within the distinct_id field as you do within your live data ingestion so historical events and live events are associated with the same user
    3. Convert the old event property names to the new event property names you are using within the events in your live data ingestion
    4. Ensure that the timestamp is a converted version of the original timestamp in ISO 8601 format so that PostHog correctly identifies when the original event occurred
    5. You may want to set an additional property within properties that identifies the original event within the data source
  4. Sequentially import the events into PostHog, keeping track of the last successfully imported event so that you can restart the process from the last successful point if any problems occur
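
To make the steps concrete, here is a minimal sketch of such a loop in Python. It is an illustration under stated assumptions rather than a definitive implementation: fetch_events is a hypothetical stand-in for your source system's export API, the field names used in transform (name, user_id, id, data, created_at) are invented for the example, and a local checkpoint file is the simplest possible resume mechanism.

Python
import datetime
import json

import posthog  # official PostHog Python library

posthog.project_api_key = "<ph_project_api_key>"

CHECKPOINT_FILE = "import_checkpoint.json"

def fetch_events(after):
    # Hypothetical placeholder: replace with a call to your source
    # system's export API. Returns (raw_events, next_cursor).
    return [], None

def load_checkpoint():
    # Resume from the last successful point if a previous run failed
    try:
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["cursor"]
    except FileNotFoundError:
        return None

def save_checkpoint(cursor):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"cursor": cursor}, f)

def transform(raw):
    # Map a source event onto the PostHog format shown above: the same
    # event names and distinct IDs as your live ingestion, the original
    # timestamp converted for the timestamp field, and a property that
    # points back to the original event in the data source.
    return {
        "event": raw["name"],
        "distinct_id": raw["user_id"],
        "properties": {"source_event_id": raw["id"], **raw["data"]},
        "timestamp": datetime.datetime.fromtimestamp(
            raw["created_at"], tz=datetime.timezone.utc
        ),
    }

cursor = load_checkpoint()
while True:
    page, cursor = fetch_events(after=cursor)
    if not page:
        break
    for raw in page:
        event = transform(raw)
        posthog.capture(
            event["distinct_id"],
            event["event"],
            event["properties"],
            timestamp=event["timestamp"],
        )
    save_checkpoint(cursor)  # only advance after a successful batch

For brevity the sketch collapses steps 2 and 3 into the loop; in a real pipeline you would likely persist the exported and transformed data between steps, as described above.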

Importing events

Once you are ready to import the data into PostHog, you can use one of the following:

  • one of the PostHog server libraries
  • the PostHog API

As mentioned above, the data should be in the following format:

JSON
{
    "event": "[event name]",
    "distinct_id": "[your users' distinct id]",
    "properties": {
        "key1": "value1",
        "key2": "value2"
    },
    "timestamp": "[optional timestamp in ISO 8601 format]"
}

The server libraries handle batching capture requests for you. If you decide to use the API directly, you will need to manage batching yourself (see the sketch after the API example below).

JavaScript
client.capture({
    distinctId: 'distinct id',
    event: 'movie played',
    properties: {
        movieId: '123',
        category: 'romcom'
    }
})

For more information see the Node.js docs.

Python
posthog.capture("distinct id", "movie played", {
    "movie_id": "123",
    "category": "romcom",
})

For more information see the Python docs.

PHP
PostHog::capture(array(
    'distinctId' => 'user:123',
    'event' => 'movie played',
    'properties' => array(
        'movieId' => '123',
        'category' => 'romcom'
    )
));

For more information see the PHP docs.

Go
client.Enqueue(posthog.Capture{
    DistinctId: "test-user",
    Event:      "test-snippet",
    Properties: posthog.NewProperties().
        Set("plan", "Enterprise").
        Set("friends", 42),
})

For more information see the Go docs.

Ruby
posthog.capture({
    distinct_id: 'distinct id',
    event: 'movie played',
    properties: {
        movie_id: '123',
        category: 'romcom'
    }
})

For more information see the Ruby docs.

Terminal
curl -v -L --header "Content-Type: application/json" -d '{
    "api_key": "<ph_project_api_key>",
    "properties": {},
    "timestamp": "2020-08-16 09:03:11.913767",
    "context": {},
    "distinct_id": "1234",
    "type": "capture",
    "event": "$event",
    "messageId": "1234"
}' https://app.posthog.com/batch/

For more information see the API docs.
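
As noted above, the raw API does not batch events for you. Below is a minimal sketch of manual batching against the /batch endpoint, assuming the documented payload shape of {"api_key": ..., "batch": [...]}; the host, batch size, and helper names are illustrative choices for the example, not fixed values.

Python
import json
import urllib.request

POSTHOG_HOST = "https://app.posthog.com"  # or your self-hosted instance
API_KEY = "<ph_project_api_key>"
BATCH_SIZE = 500  # illustrative; tune for payload size and rate limits

def send_batch(events):
    # POST one chunk of pre-formatted events to the /batch endpoint
    payload = json.dumps({"api_key": API_KEY, "batch": events}).encode()
    request = urllib.request.Request(
        POSTHOG_HOST + "/batch/",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)  # raises HTTPError on a non-2xx response

def import_events(all_events):
    # Send the transformed events one chunk at a time, logging progress
    # so that a failed run can be resumed from the last successful chunk
    for start in range(0, len(all_events), BATCH_SIZE):
        send_batch(all_events[start:start + BATCH_SIZE])
        print("imported events up to index", start + BATCH_SIZE)

Sending fewer, larger requests helps you stay within rate limits, but keep each request comfortably below any payload-size limit your instance enforces.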

User identification

As discussed in the data ingestion process section, a unique user identifier, distinct_id, should be set for each event. In addition to setting the user with each event, you can enrich information about that user by adding more properties:

JavaScript
client.identify({
    distinctId: 'user:123',
    properties: {
        email: 'john@doe.com',
        proUser: false
    }
})

For more information see the Node.js docs.

Python
posthog.identify('distinct id', {
    'email': 'john@doe.com',
    'name': 'John Doe'
})

For more information see the Python docs.

PHP
PostHog::identify(array(
    'distinctId' => 'user:123',
    'properties' => array(
        'email' => 'john@doe.com',
        'proUser' => false
    )
));

For more information see the PHP docs.

Go
client.Enqueue(posthog.Identify{
    DistinctId: "user:123",
    Properties: posthog.NewProperties().
        Set("email", "john@doe.com").
        Set("proUser", false),
})

For more information see the Go docs.

Ruby
posthog.identify({
    distinct_id: 'user:123',
    properties: {
        email: 'john@doe.com',
        pro_user: false
    }
})

For more information see the Ruby docs.

Terminal
curl -v -L --header "Content-Type: application/json" -d '{
    "api_key": "<ph_project_api_key>",
    "timestamp": "2020-08-16 09:03:11.913767",
    "context": {},
    "type": "screen",
    "distinct_id": "1234",
    "$set": {},
    "event": "$identify",
    "messageId": "123"
}' https://app.posthog.com/batch/

For more information see the API docs.
