yaml: Stata module for YAML file processing

Description

yaml is a Stata command for reading, writing, and manipulating YAML configuration files. It provides a unified interface with nine subcommands that enable Stata users to integrate YAML-based workflows into their data pipelines.

The command implements the JSON Schema subset of YAML 1.2 (Revision 1.2.2, 2021), the current YAML specification. This JSON-compatible subset covers the features most commonly used in configuration files and metadata management. The command is implemented in pure Stata with no external dependencies.

Latest: v1.9.0 with indicators preset for wbopendata/unicefdata, colfields() and maxlevel() filtering, and Mata bulk-load parser.

Key Features

Read YAML files into Stata's data structure or frames
Write YAML files from Stata datasets or scalars
Query values using hierarchical key paths
Validate configurations with required keys and type checking
Multiple frame support (Stata 16+) for managing multiple configurations
Fast-scan mode for large metadata catalogs (opt-in)
Field-selective extraction with fields()
List block extraction with listkeys() (fast-read)
Frame caching with cache() (Stata 16+)
Mata bulk parser for high-performance parsing (Phase 2)
Collapse option for wide-format indicator output (Phase 2)
Collapse filters with colfields() and maxlevel() (v1.8.0)
Indicators preset for wbopendata/unicefdata metadata (v1.9.0)
strL support for values exceeding 2045 characters

Installation

From SSC (when available)

ssc install yaml

Manual Installation

Copy yaml.ado and yaml.sthlp from src/y/ to your personal ado directory:

adopath
* Copy files to the PERSONAL directory shown

Quick Start

* Read a YAML configuration file
yaml read using config.yaml, replace

* View the structure
yaml describe

* Get a specific value
yaml get database:host
return list

* Validate required keys
yaml validate, required(name version database)

* Write modified configuration
yaml write using output.yaml, replace

* Speed-first metadata read (fastread)
yaml read using indicators.yaml, fastread fields(name description source_id topic_ids) ///
    listkeys(topic_ids topic_names) cache(ind_cache)

* Parse wbopendata/unicefdata indicator metadata (v1.9.0)
yaml read using indicators.yaml, indicators replace
list key code name in 1/5

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                              yaml.ado                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌─────────┐    ┌─────────┐    ┌──────────┐                                │
│   │  read   │    │  write  │    │ describe │                                │
│   └────┬────┘    └────┬────┘    └────┬─────┘                                │
│        │              │              │                                      │
│   ┌────┴────┐    ┌────┴────┐    ┌────┴─────┐                                │
│   │  list   │    │   get   │    │ validate │                                │
│   └────┬────┘    └────┬────┘    └────┬─────┘                                │
│        │              │              │                                      │
│   ┌────┴────┐    ┌────┴────┐    ┌────┴─────┐                                │
│   │   dir   │    │  frames │    │  clear   │                                │
│   └─────────┘    └─────────┘    └──────────┘                                │
│                                                                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                         Internal Storage                                    │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │  Dataset/Frame Structure:                                          │     │
│  │  ┌──────────┬────────────┬───────┬────────────┬──────────┐         │     │
│  │  │   key    │   value    │ level │   parent   │   type   │         │     │
│  │  ├──────────┼────────────┼───────┼────────────┼──────────┤         │     │
│  │  │ str244   │ str2000    │ int   │ str244     │ str32    │         │     │
│  │  └──────────┴────────────┴───────┴────────────┴──────────┘         │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Subcommands

Subcommand	Description
`yaml read`	Parse YAML file into Stata dataset or frame
`yaml write`	Export Stata data to YAML format
`yaml describe`	Display structure summary of loaded YAML
`yaml list`	List keys, values, or children
`yaml get`	Retrieve specific key values
`yaml validate`	Check required keys and value types
`yaml dir`	List all YAML data in memory (dataset and frames)
`yaml frames`	List only YAML frames in memory (Stata 16+)
`yaml clear`	Clear YAML data from memory or frames

What's New

See src/y/yaml_whatsnew.sthlp for version history and release notes.

Syntax

yaml read

yaml read using filename.yaml [, options]

Options:

replace - Replace existing data
frame(name) - Load into named frame (Stata 16+)
locals - Store values as return locals
scalars - Store numeric values as scalars
prefix(string) - Prefix for local/scalar names (default: yaml_)
verbose - Display parsing details
fastread - Speed-first parsing for large, regular YAML
fields(string) - Restrict extraction to specific keys
listkeys(string) - Extract list blocks for specified keys (fastread only)
blockscalars - Capture block scalars in fast-read mode
targets(string) - Early-exit targets for canonical parse (exact keys)
earlyexit - Stop parsing once all targets are found (canonical)
stream - Use streaming tokenization for canonical parse
index(string) - Materialize an index frame for repeated queries (Stata 16+)
cache(string) - Cache parsed results in a frame (Stata 16+)
bulk - Use Mata bulk-load parser for high-performance parsing (Phase 2)
collapse - Produce wide-format output with _yaml_collapse (Phase 2)
colfields(string) - Filter collapsed output to specific field names (semicolon-separated)
maxlevel(#) - Limit collapsed columns by nesting depth
indicators - Preset for wbopendata/unicefdata indicator metadata (implies bulk collapse)
strl - Use strL storage for values exceeding 2045 characters

yaml write

yaml write using filename.yaml [, options]

Options:

replace - Overwrite existing file
frame(name) - Write from named frame
scalars(list) - Write specified scalars
indent(#) - Indentation spaces (default: 2)
header(string) - Custom header comment
verbose - Display write progress

yaml get

yaml get keyname [, options]
yaml get parent:child [, options]

Options:

frame(name) - Read from named frame
attributes(list) - Specific attributes to retrieve
quiet - Suppress output

Returns:

r(found) - 1 if key found
r(n_attrs) - Number of attributes
r(key) - Key name
r(parent) - Parent name (if colon syntax used)
r(attr_name) - Value for each attribute

yaml list

yaml list [keyname] [, options]

Options:

keys - Show key names
values - Show values
children - List child keys only
level(#) - Filter by nesting level
frame(name) - Read from named frame

yaml validate

yaml validate [, options]

Options:

required(keylist) - Check that keys exist
types(key:type ...) - Validate key types
frame(name) - Validate named frame
quiet - Suppress output, only set return values

Returns:

r(valid) - 1 if validation passed
r(n_errors) - Number of errors
r(n_warnings) - Number of warnings
r(missing_keys) - List of missing required keys
r(type_errors) - List of type validation failures

yaml describe

yaml describe [, level(#) frame(name)]

yaml dir

yaml dir [, detail]

Lists all YAML data currently in memory:

Current dataset - if it contains YAML structure (key, value, level, parent, type variables)
YAML frames - all frames with yaml_ prefix (Stata 16+)

Options:

detail - Show number of entries and source file for each

Detection:

YAML data is identified by the _dta[yaml_source] characteristic set by yaml read
Datasets with YAML structure but unknown source are also reported

yaml frames

yaml frames [, detail]

Lists only YAML frames in memory. Requires Stata 16+.

Options:

detail - Show number of entries and source file for each frame

Use case: When you only need to see frames, not the current dataset.

yaml clear

yaml clear [framename] [, all]

Data Model

Storage Structure

YAML data is stored in a flat dataset with hierarchical references:

Column	Type	Description
`key`	str244	Full hierarchical key name (e.g., `indicators_CME_MRY0T4_label`)
`value`	str2000	The value associated with the key
`level`	int	Nesting depth (1 = root level)
`parent`	str244	Parent key for hierarchical lookups
`type`	str32	Value type: `string`, `numeric`, `boolean`, `parent`, `list_item`, `null`
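
Because `value` is always stored as a string, numeric and boolean entries need converting before use. The following is a minimal sketch of that conversion, assuming the canonical layout described above. The three rows are hypothetical and typed in by hand with `input`, so the sketch runs without the yaml command or any file.

```stata
* Hypothetical rows in the canonical layout (key names are illustrative only)
clear
input str20 key str20 value int level str20 parent str10 type
"port"  "5432" 1 "" "numeric"
"debug" "true" 1 "" "boolean"
"name"  "demo" 1 "" "string"
end

* value is a string column; convert on demand, guided by the type column
generate double num_value = real(value) if type == "numeric"
generate byte bool_value = (value == "true") if type == "boolean"

list key value type num_value bool_value, noobs
```

Rows of any other type keep missing values in the converted columns, so the original strings remain the single source of truth.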

Fast-Read Output Schema

In fastread mode, the output is row-wise and minimal:

Column	Type	Description
`key`	str244	Top-level key (e.g., indicator code)
`field`	str244	Field name under the key
`value`	str2000	Field value
`list`	byte	1 if list item, 0 otherwise
`line`	long	Line number in the YAML file
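
The row-wise layout can be turned into one row per key with standard Stata commands. This is a sketch under stated assumptions, not the package's own collapse logic: the indicator codes and values below are hypothetical, and the rows are entered by hand so that the example does not depend on the yaml command.

```stata
* Hypothetical fast-read rows: one row per (key, field) pair
clear
input str12 key str12 field str30 value byte list long line
"IND_A" "name"      "First indicator"  0 2
"IND_A" "source_id" "SRC1"             0 3
"IND_B" "name"      "Second indicator" 0 5
"IND_B" "source_id" "SRC2"             0 6
end

* Keep scalar fields only; list and line vary within key and would block reshape
keep if list == 0
drop list line

* One row per key, one column per field (valuename, valuesource_id)
reshape wide value, i(key) j(field) string
list, noobs
```

List items (`list == 1`) are dropped here because several rows can share the same key and field, which `reshape wide` does not allow.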

Key Naming Convention

Keys are flattened using underscores to represent hierarchy:

# YAML input:
indicators:
  CME_MRY0T4:
    label: Under-five mortality rate
    unit: Deaths per 1000 live births

# Stored as:
key                              value                         parent                  type
─────────────────────────────────────────────────────────────────────────────────────────────
indicators                       (empty)                       (empty)                 parent
indicators_CME_MRY0T4            (empty)                       indicators              parent
indicators_CME_MRY0T4_label      Under-five mortality rate     indicators_CME_MRY0T4   string
indicators_CME_MRY0T4_unit       Deaths per 1000 live births   indicators_CME_MRY0T4   string
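
Since underscores also occur inside key names (`CME_MRY0T4`), splitting a flattened key on `_` is ambiguous. A safer route is to use the `parent` column. The sketch below rebuilds the four rows shown above by hand and recovers the field names that way; the `level` values are an assumption that follows the "1 = root level" convention.

```stata
* Hand-typed copy of the rows shown above (level values assumed, 1 = root)
clear
input str30 key str30 value int level str30 parent str10 type
"indicators"                  ""                            1 ""                      "parent"
"indicators_CME_MRY0T4"       ""                            2 "indicators"            "parent"
"indicators_CME_MRY0T4_label" "Under-five mortality rate"   3 "indicators_CME_MRY0T4" "string"
"indicators_CME_MRY0T4_unit"  "Deaths per 1000 live births" 3 "indicators_CME_MRY0T4" "string"
end

* The field name is whatever follows "<parent>_" in the key
generate field = substr(key, strlen(parent) + 2, .) if parent == "indicators_CME_MRY0T4"
list field value if parent == "indicators_CME_MRY0T4", noobs
```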

List Item Storage

Each item of a YAML list is stored in its own row, with a 1-based index appended to the parent key:

# YAML input:
countries:
  - BRA
  - ARG
  - CHL

# Stored as:
key             value   parent      type
────────────────────────────────────────────
countries       (empty) (empty)     parent
countries_1     BRA     countries   list_item
countries_2     ARG     countries   list_item
countries_3     CHL     countries   list_item
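
Sorting list rows by `key` as a string would place `countries_10` before `countries_2`. The sketch below extracts the numeric index from the key suffix instead and collects the items into a local macro in file order. It assumes the layout shown above; the rows are typed in by hand, so the yaml command is not required.

```stata
* Hand-typed copy of the rows shown above
clear
input str12 key str5 value str12 parent str10 type
"countries"   ""    ""          "parent"
"countries_1" "BRA" "countries" "list_item"
"countries_2" "ARG" "countries" "list_item"
"countries_3" "CHL" "countries" "list_item"
end

* Recover the numeric position from the key suffix, then sort on it
generate int pos = real(substr(key, strlen(parent) + 2, .)) if type == "list_item"
sort pos

* Collect the items into a local macro in list order
local items
forvalues i = 1/`=_N' {
    if type[`i'] == "list_item" & parent[`i'] == "countries" {
        local items `items' `=value[`i']'
    }
}
display "`items'"
```

A loop is used rather than `levelsof` because `levelsof` returns values in sorted order, which would lose the original sequence.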

YAML 1.2 Compliance

This command implements the JSON Schema subset of YAML 1.2 as defined in Chapter 10.2 of the YAML 1.2 Specification. This is the recommended schema for "interoperability and consistency" according to the specification.

✅ Supported (YAML 1.2 JSON Schema)

Feature	YAML 1.2 Reference	Example
Mappings	Chapter 8.2.2	`key: value`
Nested mappings	Chapter 8.2	Indentation-based hierarchy
Block sequences	Chapter 8.2.1	`- item1`, `- item2`
Comments	Chapter 6.6	`# This is a comment`
Strings	Chapter 10.1.1.3	`name: "quoted"` or `name: unquoted`
Integers	Chapter 10.2.1.3	`count: 100`
Floats	Chapter 10.2.1.4	`rate: 3.14`
Booleans	Chapter 10.2.1.2	`debug: true`, `verbose: false`
Null	Chapter 10.2.1.1	`empty:` or `empty: null`

❌ Not Supported (Advanced YAML 1.2)

These features are part of the full YAML 1.2 specification but are intentionally excluded to maintain simplicity and robustness:

Feature	YAML 1.2 Reference	Reason
Anchors & Aliases	Chapter 7.1	`&anchor`, `*alias` - Complex reference handling
Block scalars	Chapter 8.1	`|`, `>` - Multi-line literal/folded styles (fast-read mode can capture them with the blockscalars option)
Flow collections	Chapter 7.4	`{a: 1}`, `[1, 2]` - JSON-like inline syntax
Tags	Chapter 6.9	`!!map`, `!!seq` - Type annotations
Multiple documents	Chapter 9.2	`---` document separators

Version Requirements

Feature	Minimum Version
Basic functionality	Stata 14.0
Frame support	Stata 16.0

Examples

Reading and Querying

* Load configuration
yaml read using pipeline_config.yaml, replace

* Get nested value using colon syntax
yaml get database:connection_string
local conn = r(connection_string)

* List all keys at root level
yaml list, keys level(1)

Validation

* Check required configuration keys
yaml validate, required(name version api_key)

* Validate with type checking
yaml validate, types(port:numeric debug:boolean)

if (r(valid) == 0) {
    di as error "Invalid configuration"
    exit 198
}

Working with Frames (Stata 16+)

* Load multiple configurations
yaml read using dev.yaml, frame(dev)
yaml read using prod.yaml, frame(prod)

* Query from specific frame
yaml get host, frame(prod)

* List all YAML data in memory
yaml dir, detail

* Clear specific frame
yaml clear dev

Round-trip: Read and Write

* Read configuration
yaml read using original.yaml, replace

* Modify values
replace value = "new_value" if key == "settings_timeout"

* Write back
yaml write using modified.yaml, replace

Working with Lists

* Read YAML with lists
yaml read using countries.yaml, replace

* List items in a list
yaml list countries, keys children

* Access individual list items
yaml get countries
* Returns: r(1)="BRA" r(2)="ARG" r(3)="CHL"

Performance Optimization for Large Catalogs

For metadata catalogs with 700+ entries, vectorized frame-based queries dramatically outperform iterative yaml get calls:

Approach	Time	Relative
Naive: 733 iterative `yaml get` calls	15+ seconds	50×
Optimized: Direct frame dataset query	0.3 seconds	1×

Key Pattern:

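yaml read using indicators_catalog.yaml, frame(meta)
frame yaml_meta {
    * Flag dataflow rows whose value is NUTRITION
    gen byte is_nutrition = (value == "NUTRITION") & ///
        regexm(key, "^indicators_[A-Za-z0-9_]+_dataflow$")
    * Recover the indicator code from the flattened key
    gen indicator_code = regexs(1) if ///
        regexm(key, "^indicators_([A-Za-z0-9_]+)_dataflow$")
    levelsof indicator_code if is_nutrition == 1, local(nutrition_codes)
}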

Vectorized operations (gen, regexm, levelsof) process all rows at once rather than looping through function calls. Frame isolation provides data protection and instant cleanup. See production examples in src/y/README.md.

Use Cases

Pipeline Configuration: Database connections, API endpoints, timeouts
Metadata Management: Indicator definitions, variable labels, units (optimized for 700+ catalogs)
Cross-language Workflows: Share configurations with R, Python, GitHub Actions
Reproducible Research: Version-controlled configuration files
Multi-environment Support: Dev/staging/prod configurations in separate frames
LLM Workflows: YAML-based tool interfaces and pipeline orchestration

Design Principles

YAML 1.2 Compliance: Implements the JSON Schema (Chapter 10.2) of the YAML 1.2 Specification, which covers the features that most configuration files rely on.
JSON Compatibility: Per YAML 1.2's design goal, the supported subset ensures that valid JSON is also valid YAML (Chapter 1.2 of the specification).
Stata-Native: Pure Stata implementation using file read/write - no external dependencies (Python, LibYAML, etc.).
Hierarchical Storage: Flat storage with parent references enables both simple key-value access and hierarchical queries, following the YAML representation model (Chapter 3.2.1).
Frame Support: Optional frame storage keeps YAML data separate from working datasets (Stata 16+).
Validation First: Built-in validation ensures configuration correctness before pipeline execution.

Repository Structure

yaml/
├── README.md              # This file
├── .gitignore
├── src/y/
│   ├── yaml.ado           # Main command (v1.9.0)
│   ├── yaml.sthlp         # Stata help file
│   └── README.md          # Command documentation with production examples
├── src/_/
│   ├── _yaml_mataread.ado # Mata bulk-load parser (Phase 2)
│   └── _yaml_collapse.ado # Wide-format collapse helper (Phase 2)
├── qa/
│   ├── run_tests.do       # QA runner (26 tests)
│   ├── README.md          # QA framework documentation
│   ├── scripts/           # Test scripts (20 files)
│   └── fixtures/          # Test fixtures
├── examples/              # Examples and test files
│   ├── README.md
│   ├── yaml_basic_examples.do        # Basic usage examples
│   ├── data/              # Sample YAML files
│   └── logs/              # Output logs from examples
└── paper/                 # Documentation and article source

Suggested Citation

For the Stata command:

Azevedo, João Pedro. 2025. "yaml: Stata module for YAML file processing." Statistical Software Components, Boston College Department of Economics.

Author

João Pedro Azevedo
jpazevedo@unicef.org
UNICEF

References

YAML 1.2 Specification: Ben-Kiki, O., Evans, C., & döt Net, I. (2021). YAML Ain't Markup Language (YAML™) Version 1.2 (Revision 1.2.2). https://yaml.org/spec/1.2.2/
JSON Schema: YAML 1.2 Specification, Chapter 10.2. https://yaml.org/spec/1.2.2/#json-schema
YAML Official Site: https://yaml.org/

License

MIT License
