Overview
The Agnostic (AGT) project is a SQL-native ETL framework that orchestrates data pipelines using ClickHouse and Apache Iceberg. It leverages a pipeline.yaml configuration, dynamic variables (vars), and templated SQL queries for flexible, fault-tolerant data processing. This document outlines AGT’s core features, focusing on pipeline configuration, stage types, and query templating. Examples are drawn from the HackerNews pipeline.
Pipeline Configuration with pipeline.yaml
AGT pipelines are defined in a pipeline.yaml file, which declaratively specifies the execution engine, initialization, source, and processing stages. This centralized configuration ensures reproducible workflows and simplifies pipeline management.
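As a rough sketch, a pipeline.yaml combines the four sections described below. The query and field names here are taken from the HackerNews snippets later in this document; the exact top-level layout is an assumption:

```yaml
# Minimal pipeline.yaml sketch, assembled from the HackerNews
# snippets shown later in this document (layout is assumed).
Engine:
  # ClickHouse connection and settings (see the Engine section)
Init:
  Queries:
    - init_start
Source:
  Query: source
  PollInterval: 30s
Stages:
  - Execute:
      Query: fetch_range
  - Buffer:
      Enter: create_buffer
      Queries:
        - insert_into_buffer
      MaxDuration: 5s
```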
Dynamic Variables (vars)
AGT uses dynamic key-value pairs called vars, set via the CLI at startup or generated during execution. These enable parameterized pipeline logic and SQL queries.
- CLI Input: Set variables like ICEBERG_DESTINATION_TABLE_LOCATION or ORDER_BY when running the pipeline (e.g., agt run --config pipeline.yaml --var ORDER_BY=id).
- Runtime Generation: Queries can produce vars (e.g., RANGE_START, OUTPUT_FILE) for use in later stages.
- Usage: Variables are accessed in queries using {{.VAR_NAME}} syntax, supporting dynamic behavior.
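The {{.VAR_NAME}} syntax matches Go's text/template. As a self-contained sketch of how such substitution behaves, the renderQuery helper and the default function below are assumptions (the latter modeled on the Sprig helper of the same name), not AGT's actual implementation:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderQuery substitutes vars into a templated SQL string, mimicking
// AGT's {{.VAR_NAME}} syntax. The "default" helper is an assumption:
// it returns the fallback when the var is missing or empty.
func renderQuery(text string, vars map[string]string) string {
	funcs := template.FuncMap{
		"default": func(fallback string, value interface{}) string {
			if s, ok := value.(string); ok && s != "" {
				return s
			}
			return fallback
		},
	}
	tmpl := template.Must(template.New("q").Funcs(funcs).Parse(text))
	var out bytes.Buffer
	if err := tmpl.Execute(&out, vars); err != nil {
		panic(err)
	}
	return out.String()
}

func main() {
	q := `SELECT * FROM posts WHERE id >= {{.RANGE_START}} LIMIT {{.MAX_BUFFER_SIZE | default "1000"}}`
	// RANGE_START is set; MAX_BUFFER_SIZE is absent, so the default applies.
	fmt.Println(renderQuery(q, map[string]string{"RANGE_START": "42"}))
}
```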
Pipeline Components
The pipeline.yaml file structures the pipeline into four components: Engine, Init, Source, and Stages.
Engine
The Engine section configures the execution environment, typically a local or remote ClickHouse instance. It supports custom settings (e.g., JSON parsing, S3 connections) and UDF bundles like icepq for Iceberg operations. The HackerNews pipeline uses a local ClickHouse engine with optimized settings.
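The document does not show the Engine section itself; a hypothetical sketch, with every key name below assumed rather than taken from AGT's schema, might look like:

```yaml
# Hypothetical Engine sketch: all key names are assumptions.
Engine:
  Clickhouse:
    Host: localhost
    Settings:
      input_format_import_nested_json: 1   # example JSON-parsing tweak
    UDFBundles:
      - icepq   # Iceberg helpers mentioned above
```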
Init
The Init section runs a single query at startup to initialize state, such as retrieving the latest processed data for fault tolerance. For example, the HackerNews pipeline’s init_start query fetches the maximum post ID from Iceberg metadata, outputting INIT_START as a var. It supports error handling, like ignoring specific error codes for non-existent tables.
Init:
  Queries:
    - init_start#ignore-error-codes=8888

Source
The Source section generates tasks by executing a SQL query at a configurable interval (e.g., every 30 seconds). Each query row creates a task with vars. In the HackerNews pipeline, the source query produces post ID ranges (RANGE_START, RANGE_END), with options to stop on empty results or after a set number of runs.
Source:
  Query: source
  PollInterval: 30s

Stages
The Stages section defines a sequence of processing steps, with tasks flowing through each stage in order. AGT supports five stage types: Execute, Debug, Sleep, Buffer, and Metrics. The HackerNews pipeline uses Execute and Buffer.
Execute Stage
The Execute stage runs SQL queries for each task, optionally in parallel.
- Features: Supports PoolSize for concurrency, Ordered to maintain task sequence, and ClickhouseSettings for query optimization.
- Example: The HackerNews pipeline’s first stage runs fetch_range to fetch posts into a Memory table, processing four tasks concurrently with order preserved.
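Putting the named fields together, the HackerNews Execute stage might be declared roughly as follows. PoolSize, Ordered, ClickhouseSettings, and fetch_range come from the text; their exact placement, and the sample setting, are assumptions:

```yaml
Stages:
  - Execute:
      Query: fetch_range
      PoolSize: 4          # process four tasks concurrently
      Ordered: true        # preserve task sequence
      ClickhouseSettings:
        max_threads: 4     # hypothetical setting, for illustration
```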
Debug Stage
The Debug stage logs task vars or query results for troubleshooting, aiding development without modifying data. It’s useful for inspecting intermediate states.
Sleep Stage
The Sleep stage introduces a configurable delay per task, enabling rate-limiting or synchronization with external systems.
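The text names no fields for this stage; a hypothetical sketch with an assumed Duration key:

```yaml
- Sleep:
    Duration: 1s   # assumed field name; per-task delay for rate limiting
```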
Buffer Stage
The Buffer stage consolidates tasks into a single operation to optimize processing.
- Features: Uses Enter to initialize (e.g., create a table), Queries to process each task, Condition to check completion, Leave to finalize, and MaxDuration to limit buffering time.
- Example: The HackerNews pipeline’s second stage buffers posts to reduce Iceberg write frequency, using queries like create_buffer, insert_into_buffer, and merge_vars.
Stages:
  - Buffer:
      Enter: create_buffer
      Leave: rename_buffer
      Condition: condition
      Queries:
        - insert_into_buffer
        - drop_range
        - merge_vars
      MaxDuration: 5s

Metrics Stage
The Metrics stage collects performance data, such as task processing times or error rates, for monitoring and optimization.
Templatized SQL Queries
AGT’s SQL queries use templating to inject vars dynamically with {{.VAR_NAME}} syntax.
- Variable Substitution: Accesses vars like RANGE_START or ORDER_BY.
- Default Values: Supports fallbacks (e.g., {{.MAX_BUFFER_SIZE | default "1000"}}).
- Contextual Templating: In Buffer stages, LEFT and RIGHT contexts merge vars (e.g., {{.LEFT.RANGE_START}}).
- Example: The HackerNews pipeline’s fetch_range query constructs API URLs using RANGE_START and RANGE_END, while merge_vars generates unique file names.
This templating enhances query reusability and adaptability across pipeline stages.
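Pulling these features together, a templated query in the style described above might read as follows. The table and column names are illustrative, not taken from the HackerNews pipeline:

```sql
-- Illustrative templated query; identifiers are hypothetical.
SELECT *
FROM posts
WHERE id BETWEEN {{.RANGE_START}} AND {{.RANGE_END}}
ORDER BY {{.ORDER_BY | default "id"}}
LIMIT {{.MAX_BUFFER_SIZE | default "1000"}}
-- In a Buffer stage, merged vars are reached via contexts,
-- e.g. {{.LEFT.RANGE_START}} for the accumulated task's value.
```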