YAMLRegExpTree dictionary source

Not supported in ClickHouse Cloud

The YAMLRegExpTree source loads a regular expression tree from a YAML file on the local filesystem. It is designed exclusively for use with the regexp_tree dictionary layout and provides hierarchical regex-to-attribute mappings for pattern-based lookups such as user agent parsing.

Note

The YAMLRegExpTree source is only available in ClickHouse Open Source. For ClickHouse Cloud, export the dictionary to CSV and load it via a ClickHouse table source instead. See Using regexp_tree dictionaries in ClickHouse Cloud for details.
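The table-based alternative mentioned in the note could look roughly like the sketch below. It assumes the tree has been flattened into rows with a node id, a parent id, the regular expression, and parallel arrays of attribute names and values; the table name regexp_dictionary_source_table and the Memory engine are illustrative choices, not requirements stated on this page.

```sql
-- Hypothetical source table holding the flattened regexp tree.
-- parent_id = 0 is assumed to mark a top-level node.
CREATE TABLE regexp_dictionary_source_table
(
    id        UInt64,
    parent_id UInt64,
    regexp    String,
    keys      Array(String),  -- attribute names, e.g. ['name', 'version']
    values    Array(String)   -- attribute values, in the same order as keys
) ENGINE = Memory;

-- Same dictionary shape as the YAML-backed one, but reading from the table.
CREATE DICTIONARY regexp_dict
(
    regexp  String,
    name    String,
    version String
)
PRIMARY KEY(regexp)
SOURCE(CLICKHOUSE(TABLE 'regexp_dictionary_source_table'))
LAYOUT(regexp_tree)
LIFETIME(0);
```

The linked Cloud page remains the authoritative description of the export and load steps.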

Configuration

CREATE DICTIONARY regexp_dict
(
    regexp String,
    name String,
    version String
)
PRIMARY KEY(regexp)
SOURCE(YAMLRegExpTree(PATH '/var/lib/clickhouse/user_files/regexp_tree.yaml'))
LAYOUT(regexp_tree)
LIFETIME(0);

Setting fields:

Setting | Description
PATH | The absolute path to the YAML file containing the regular expression tree. When created via DDL, the file must be in the user_files directory.

YAML file structure

The YAML file contains a list of regular expression tree nodes. Each node can have attributes and child nodes, forming a hierarchy:

- regexp: 'Linux/(\d+[\.\d]*).+tlinux'
  name: 'TencentOS'
  version: '\1'

- regexp: '\d+/tclwebkit(?:\d+[\.\d]*)'
  name: 'Android'
  versions:
    - regexp: '33/tclwebkit'
      version: '13'
    - regexp: '3[12]/tclwebkit'
      version: '12'
    - regexp: '30/tclwebkit'
      version: '11'
    - regexp: '29/tclwebkit'
      version: '10'

Each node has the following structure:

  • regexp: The regular expression for this node.
  • attributes: User-defined dictionary attributes (e.g. name, version). Attribute values may contain back references to capture groups in the regular expression, written as \1 or $1 (numbers 1-9). These are replaced with the matched capture group at query time.
  • child nodes: A list of children, each with its own attributes and optionally more children. The name of the child list is arbitrary (e.g. versions above). String matching proceeds depth-first: if a string matches a node, its children are also checked. Attributes of the deepest matching node take precedence, overriding equally named parent attributes.
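As a brief illustration of the matching rules above, the following query sketches a lookup against the regexp_dict dictionary from the Configuration section, assuming it was loaded from the sample YAML file shown earlier. The input string '31/tclwebkit1024' is a made-up key chosen to match the second top-level node and one of its children.

```sql
-- The key matches the parent node '\d+/tclwebkit(?:\d+[\.\d]*)', which supplies name,
-- and the child node '3[12]/tclwebkit', which supplies version.
SELECT dictGet('regexp_dict', ('name', 'version'), '31/tclwebkit1024');
```

With the sample file, this should return ('Android', '12'): name comes from the parent node, while version comes from the deepest matching child.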