Orchestration for incrementally and reliably loading a large dataset with trillions of rows over a long period of time.
The data loading orchestration mechanism used by this script is described in detail in a blog post.
- Reliably imports data from files hosted in an object storage bucket into ClickHouse
- Supports any partitioning key, projections, and materialized views
- Maintains a job queue for the files to be imported and scales linearly across parallel workers
- Supports continuous data loading
Requirements:
- Python 3.10+
- clickhouse-client
- A ClickHouse instance with support for KeeperMap and keeper_map_strict_mode (see the example check after this list)
- ~1 GB of RAM per Keeper node per 1 million scheduled files in the KeeperMap-backed job task table
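Whether an instance meets the KeeperMap requirement can be verified with queries against the system tables, as sketched below (on self-managed servers, the KeeperMap engine typically also needs a keeper_map_path_prefix entry in the server configuration):

-- Check that the KeeperMap table engine is available on this instance
SELECT name FROM system.table_engines WHERE name = 'KeeperMap';

-- Check that the keeper_map_strict_mode setting exists and inspect its current value
SELECT name, value FROM system.settings WHERE name = 'keeper_map_strict_mode';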
Install the Python dependencies:
pip install -r requirements.txt
Pre-create the tables in ClickHouse. The job task table is backed by the KeeperMap table engine:
CREATE TABLE tasks
(
file_path String,
file_paths Array(String),
worker_id String DEFAULT '',
started_time DateTime DEFAULT 0,
scheduled DateTime MATERIALIZED now()
)
ENGINE = KeeperMap('tasks')
PRIMARY KEY file_path;
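The target table (plus any projections and materialized views on it) must also exist before the workers start. As a minimal sketch, assuming the files contain simple event rows, a hypothetical target table could look like this (table and column names are placeholders):

CREATE TABLE mydata
(
    id UInt64,
    timestamp DateTime,
    data String
)
ENGINE = MergeTree
ORDER BY (id, timestamp);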
Schedule the files to be loaded with queue_files.py:
usage: queue_files.py [-h]
# ① ClickHouse connection settings for the instance hosting the job task table
--host HOST
--port PORT
--username USERNAME
--password PASSWORD
# ② files scheduling settings
--file FILE # The file containing the set of object storage URLs for the files to be loaded
--task_database DATABASE # Name of the ClickHouse database for the task table
--task_table TABLE # Name of the task table.
[--files_chunk_size_min SIZE] # How many files are atomically processed together at a minimum
[--files_chunk_size_max SIZE] # How many files are atomically processed together at a maximum
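For example, scheduling the URLs listed in a local file for the task table created above could look like this (host, port, credentials, and file name are placeholders; adjust them to your deployment):

# Enqueue all object storage URLs listed in file_urls.txt into the tasks table
python3 queue_files.py \
  --host my-clickhouse-host.example.com \
  --port 8443 \
  --username default \
  --password '<password>' \
  --file file_urls.txt \
  --task_database default \
  --task_table tasks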
Start one or more workers with worker.py to load the scheduled files into the target table:
usage: worker.py [-h]
# ① ClickHouse connection settings for the target instance
--host HOST
--port PORT
--username USERNAME
--password PASSWORD
# ② data loading - main settings
--database DATABASE # Name of the target ClickHouse database
--table TABLE # Name of the target table.
--task_database DATABASE # Name of the ClickHouse database for the task table
--task_table TABLE # Name of the task table.
[--worker_id ID] # Unique id for this worker
[--files_chunk_size_max SIZE] # How many files are atomically processed together at a maximum
# ③ Data loading - optional settings
[--cfg.function CFG.FUNCTION] # Name of the table function for accessing the to-be-loaded files
[--cfg.bucket_access_key CFG.ACCESS_KEY] # Access key for the object storage bucket hosting the files to be loaded
[--cfg.bucket_access_secret CFG.ACCESS_SECRET] # Access secret for the object storage bucket hosting the files to be loaded
[--cfg.format CFG.FORMAT] # Name of the file format used
[--cfg.structure CFG.STRUCTURE] # Structure of the file data
[--cfg.select CFG.SELECT] # Custom SELECT clause for retrieving the file data
[--cfg.where CFG.WHERE] # Custom WHERE clause for retrieving the file data
[--cfg.query_settings CFG.QUERY_SETTINGS [CFG.QUERY_SETTINGS ...]] # Custom query-level settings

We provide an example of how to use the script with a large example dataset here.
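Independently of that example, a minimal worker invocation might look as follows (all values are placeholders; mydata refers to the hypothetical target table sketched above, and the cfg.* options are only needed when the defaults don't match your files):

# Run a single worker that claims file chunks from the tasks table and loads them into mydata
python3 worker.py \
  --host my-clickhouse-host.example.com \
  --port 8443 \
  --username default \
  --password '<password>' \
  --database default \
  --table mydata \
  --task_database default \
  --task_table tasks \
  --worker_id worker_1 \
  --cfg.format Parquet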