The Agnostic (AGT) project is a SQL-native ETL framework that orchestrates data pipelines using ClickHouse and Apache Iceberg. It leverages a pipeline.yaml configuration, dynamic variables (vars), and templated SQL queries for flexible, fault-tolerant data processing. This document outlines AGT’s core features, focusing on pipeline configuration, stage types, and query templating. Examples are drawn from the HackerNews pipeline .

AGT pipelines are defined in a pipeline.yaml file, which declaratively specifies the execution engine, initialization, source, and processing stages. This centralized configuration ensures reproducible workflows and simplifies pipeline management.

AGT uses dynamic key-value pairs called vars, set via the CLI at startup or generated during execution. These enable parameterized pipeline logic and SQL queries.

  • CLI Input: Set variables like ICEBERG_DESTINATION_TABLE_LOCATION or ORDER_BY when running the pipeline (e.g., agt run --config pipeline.yaml --var ORDER_BY=id).
  • Runtime Generation: Queries can produce vars (e.g., RANGE_START, OUTPUT_FILE) for use in later stages.
  • Usage: Variables are accessed in queries using {{.VAR_NAME}} syntax, supporting dynamic behavior.

The pipeline.yaml file structures the pipeline into four components: Engine, Init, Source, and Stages.

The Engine section configures the execution environment, typically a local or remote ClickHouse instance. It supports custom settings (e.g., JSON parsing, S3 connections) and UDF bundles like icepq for Iceberg operations. The HackerNews pipeline uses a local ClickHouse engine with optimized settings.

The Init section runs a single query at startup to initialize state, such as retrieving the latest processed data for fault tolerance. For example, the HackerNews pipeline’s init_start query fetches the maximum post ID from Iceberg metadata, outputting INIT_START as a var. It supports error handling, like ignoring specific error codes for non-existent tables.

pipeline.yaml — yaml
Init:
  Queries:
    - init_start#ignore-error-codes=8888

The Source section generates tasks by executing a SQL query at a configurable interval (e.g., every 30 seconds). Each query row creates a task with vars. In the HackerNews pipeline, the source query produces post ID ranges (RANGE_START, RANGE_END), with options to stop on empty results or after a set number of runs.

pipeline.yaml — yaml
Source:
  Query: source
  PollInterval: 30s

The Stages section defines a sequence of processing steps, with tasks flowing through each stage in order. AGT supports five stage types: Execute, Debug, Sleep, Buffer, and Metrics. The HackerNews pipeline uses Execute and Buffer.

The Execute stage runs SQL queries for each task, optionally in parallel.

  • Features: Supports PoolSize for concurrency, Ordered to maintain task sequence, and ClickhouseSettings for query optimization.
  • Example: The HackerNews pipeline’s first stage runs fetch_range to fetch posts into a Memory table, processing four tasks concurrently with order preserved.

The Debug stage logs task vars or query results for troubleshooting, aiding development without modifying data. It’s useful for inspecting intermediate states.

The Sleep stage introduces a configurable delay per task, enabling rate-limiting or synchronization with external systems.

The Buffer stage consolidates tasks into a single operation to optimize processing.

  • Features: Uses Enter to initialize (e.g., create a table), Queries to process each task, Condition to check completion, Leave to finalize, and MaxDuration to limit buffering time.
  • Example: The HackerNews pipeline’s second stage buffers posts to reduce Iceberg write frequency, using queries like create_buffer, insert_into_buffer, and merge_vars.
pipeline.yaml — yaml
Stages:
  - Buffer:
      Enter: create_buffer
      Leave: rename_buffer
      Condition: condition
      Queries:
        - insert_into_buffer
        - drop_range
        - merge_vars
      MaxDuration: 5s

The Metrics stage collects performance data, such as task processing times or error rates, for monitoring and optimization.

AGT’s SQL queries use templating to inject vars dynamically with {{.VAR_NAME}} syntax.

  • Variable Substitution: Accesses vars like RANGE_START or ORDER_BY.
  • Default Values: Supports fallbacks (e.g., {{.MAX_BUFFER_SIZE | default "1000"}}).
  • Contextual Templating: In Buffer stages, LEFT and RIGHT contexts merge vars (e.g., {{.LEFT.RANGE_START}}).
  • Example: The HackerNews pipeline’s fetch_range query constructs API URLs using RANGE_START and RANGE_END, while merge_vars generates unique file names.

This templating enhances query reusability and adaptability across pipeline stages.