jpazvd/yaml

yaml: Stata module for YAML file processing

Description

yaml is a Stata command for reading, writing, and manipulating YAML configuration files. It provides a unified interface with nine subcommands that enable Stata users to integrate YAML-based workflows into their data pipelines.

The command implements the JSON Schema subset of YAML 1.2 (revision 1.2.2, 2021), the current authoritative YAML standard. This JSON-compatible subset covers the most commonly used features for configuration files and metadata management. It is implemented in pure Stata with no external dependencies.

Latest: v1.9.0 with indicators preset for wbopendata/unicefdata, colfields() and maxlevel() filtering, and Mata bulk-load parser.

Key Features

  • Read YAML files into Stata's data structure or frames
  • Write YAML files from Stata datasets or scalars
  • Query values using hierarchical key paths
  • Validate configurations with required keys and type checking
  • Multiple frame support (Stata 16+) for managing multiple configurations
  • Fast-scan mode for large metadata catalogs (opt-in)
  • Field-selective extraction with fields()
  • List block extraction with listkeys() (fast-read)
  • Frame caching with cache() (Stata 16+)
  • Mata bulk parser for high-performance parsing (Phase 2)
  • Collapse option for wide-format indicator output (Phase 2)
  • Collapse filters with colfields() and maxlevel() (v1.8.0)
  • Indicators preset for wbopendata/unicefdata metadata (v1.9.0)
  • strL support for values exceeding 2045 characters

Installation

From SSC (when available)

ssc install yaml

Manual Installation

Copy yaml.ado and yaml.sthlp from src/y/ to your personal ado directory:

adopath
* Copy files to the PERSONAL directory shown
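
The manual copy can also be scripted from within Stata. The following is a minimal sketch, assuming Stata's working directory is the root of a local clone of this repository; it uses only built-in Stata commands (copy, confirm file, which) and the c(sysdir_personal) system value. The helper files under src/_/ are not handled here.

```stata
* Locate the PERSONAL ado directory (c(sysdir_personal) ends with a path separator)
local personal "`c(sysdir_personal)'"
display as text "PERSONAL directory: `personal'"

* Create the directory if it does not exist yet (an error here is ignored)
capture mkdir "`personal'"

* Copy from a local clone of the repository; run this from the repository root
foreach f in yaml.ado yaml.sthlp {
    capture confirm file "src/y/`f'"
    if _rc == 0 {
        copy "src/y/`f'" "`personal'`f'", replace
    }
    else {
        display as error "src/y/`f' not found; check the working directory"
    }
}

* Check whether Stata can now find the command
capture noisily which yaml
```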

Quick Start

* Read a YAML configuration file
yaml read using config.yaml, replace

* View the structure
yaml describe

* Get a specific value
yaml get database:host
return list

* Validate required keys
yaml validate, required(name version database)

* Write modified configuration
yaml write using output.yaml, replace

* Speed-first metadata read (fastread)
yaml read using indicators.yaml, fastread fields(name description source_id topic_ids) ///
    listkeys(topic_ids topic_names) cache(ind_cache)

* Parse wbopendata/unicefdata indicator metadata (v1.9.0)
yaml read using indicators.yaml, indicators replace
list key code name in 1/5

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                              yaml.ado                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌─────────┐    ┌─────────┐    ┌──────────┐                                │
│   │  read   │    │  write  │    │ describe │                                │
│   └────┬────┘    └────┬────┘    └────┬─────┘                                │
│        │              │              │                                      │
│   ┌────┴────┐    ┌────┴────┐    ┌────┴─────┐                                │
│   │  list   │    │   get   │    │ validate │                                │
│   └────┬────┘    └────┬────┘    └────┬─────┘                                │
│        │              │              │                                      │
│   ┌────┴────┐    ┌────┴────┐    ┌────┴─────┐                                │
│   │   dir   │    │  frames │    │  clear   │                                │
│   └─────────┘    └─────────┘    └──────────┘                                │
│                                                                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                         Internal Storage                                    │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │  Dataset/Frame Structure:                                          │     │
│  │  ┌──────────┬────────────┬───────┬────────────┬──────────┐         │     │
│  │  │   key    │   value    │ level │   parent   │   type   │         │     │
│  │  ├──────────┼────────────┼───────┼────────────┼──────────┤         │     │
│  │  │ str244   │ str2000    │ int   │ str244     │ str32    │         │     │
│  │  └──────────┴────────────┴───────┴────────────┴──────────┘         │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Subcommands

Subcommand      Description
────────────────────────────────────────────────────────────────────
yaml read       Parse YAML file into Stata dataset or frame
yaml write      Export Stata data to YAML format
yaml describe   Display structure summary of loaded YAML
yaml list       List keys, values, or children
yaml get        Retrieve specific key values
yaml validate   Check required keys and value types
yaml dir        List all YAML data in memory (dataset and frames)
yaml frames     List only YAML frames in memory (Stata 16+)
yaml clear      Clear YAML data from memory or frames

What's New

See src/y/yaml_whatsnew.sthlp for version history and release notes.

Syntax

yaml read

yaml read using filename.yaml [, options]

Options:

  • replace - Replace existing data
  • frame(name) - Load into named frame (Stata 16+)
  • locals - Store values as return locals
  • scalars - Store numeric values as scalars
  • prefix(string) - Prefix for local/scalar names (default: yaml_)
  • verbose - Display parsing details
  • fastread - Speed-first parsing for large, regular YAML
  • fields(string) - Restrict extraction to specific keys
  • listkeys(string) - Extract list blocks for specified keys (fastread only)
  • blockscalars - Capture block scalars in fast-read mode
  • targets(string) - Early-exit targets for canonical parse (exact keys)
  • earlyexit - Stop parsing once all targets are found (canonical)
  • stream - Use streaming tokenization for canonical parse
  • index(string) - Materialize an index frame for repeated queries (Stata 16+)
  • cache(string) - Cache parsed results in a frame (Stata 16+)
  • bulk - Use Mata bulk-load parser for high-performance parsing (Phase 2)
  • collapse - Produce wide-format output with _yaml_collapse (Phase 2)
  • colfields(string) - Filter collapsed output to specific field names (semicolon-separated)
  • maxlevel(#) - Limit collapsed columns by nesting depth
  • indicators - Preset for wbopendata/unicefdata indicator metadata (implies bulk collapse)
  • strl - Use strL storage for values exceeding 2045 characters

yaml write

yaml write using filename.yaml [, options]

Options:

  • replace - Overwrite existing file
  • frame(name) - Write from named frame
  • scalars(list) - Write specified scalars
  • indent(#) - Indentation spaces (default: 2)
  • header(string) - Custom header comment
  • verbose - Display write progress

yaml get

yaml get keyname [, options]
yaml get parent:child [, options]

Options:

  • frame(name) - Read from named frame
  • attributes(list) - Specific attributes to retrieve
  • quiet - Suppress output

Returns:

  • r(found) - 1 if key found
  • r(n_attrs) - Number of attributes
  • r(key) - Key name
  • r(parent) - Parent name (if colon syntax used)
  • r(attr_name) - Value for each attribute

yaml list

yaml list [keyname] [, options]

Options:

  • keys - Show key names
  • values - Show values
  • children - List child keys only
  • level(#) - Filter by nesting level
  • frame(name) - Read from named frame

yaml validate

yaml validate [, options]

Options:

  • required(keylist) - Check that keys exist
  • types(key:type ...) - Validate key types
  • frame(name) - Validate named frame
  • quiet - Suppress output, only set return values

Returns:

  • r(valid) - 1 if validation passed
  • r(n_errors) - Number of errors
  • r(n_warnings) - Number of warnings
  • r(missing_keys) - List of missing required keys
  • r(type_errors) - List of type validation failures

yaml describe

yaml describe [, level(#) frame(name)]

yaml dir

yaml dir [, detail]

Lists all YAML data currently in memory:

  • Current dataset - if it contains YAML structure (key, value, level, parent, type variables)
  • YAML frames - all frames with yaml_ prefix (Stata 16+)

Options:

  • detail - Show number of entries and source file for each

Detection:

  • YAML data is identified by the _dta[yaml_source] characteristic set by yaml read
  • Datasets with YAML structure but unknown source are also reported
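
The detection rule above can be reproduced with built-in Stata commands. The sketch below builds a one-row stand-in dataset by hand and sets the characteristic manually, which is an assumption made only so that the example runs without the yaml command; it is not the package's own detection code.

```stata
* Stand-in for a dataset created by yaml read (built by hand for illustration)
clear
input str244 key str2000 value int level str244 parent str32 type
"name" "demo" 1 "" "string"
end
char _dta[yaml_source] "config.yaml"   // yaml read is documented to set this

* Read the characteristic and check for the five structure variables
local src : char _dta[yaml_source]
capture confirm variable key value level parent type
local has_structure = (_rc == 0)

if "`src'" != "" {
    display as text "YAML data loaded from: `src'"
}
else if `has_structure' {
    display as text "YAML structure found, source unknown"
}
else {
    display as text "Current dataset does not look like YAML data"
}
```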

yaml frames

yaml frames [, detail]

Lists only YAML frames in memory. Requires Stata 16+.

Options:

  • detail - Show number of entries and source file for each frame

Use case: When you only need to see frames, not the current dataset.

yaml clear

yaml clear [framename] [, all]

Data Model

Storage Structure

YAML data is stored in a flat dataset with hierarchical references:

Column   Type      Description
──────────────────────────────────────────────────────────────────────────────────
key      str244    Full hierarchical key name (e.g., indicators_CME_MRY0T4_label)
value    str2000   The value associated with the key
level    int       Nesting depth (1 = root level)
parent   str244    Parent key for hierarchical lookups
type     str32     Value type: string, numeric, boolean, parent, list_item, null

Fast-Read Output Schema

In fastread mode, the output is row-wise and minimal:

Column   Type      Description
──────────────────────────────────────────────────────
key      str244    Top-level key (e.g., indicator code)
field    str244    Field name under the key
value    str2000   Field value
list     byte      1 if list item, 0 otherwise
line     long      Line number in the YAML file
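
Because the fast-read schema is long (one row per key and field), scalar fields can be pivoted to one row per key with reshape. The sketch below types in a few rows by hand; the second key (DEMO_CODE) and all source_id values are hypothetical. It illustrates the schema only and is not how the collapse option is implemented.

```stata
* Hand-built rows in the fast-read schema (DEMO_CODE and source_id values are made up)
clear
input str244 key str244 field str2000 value byte list long line
"CME_MRY0T4" "name"      "Under-five mortality rate" 0 2
"CME_MRY0T4" "source_id" "2"                         0 3
"CME_MRY0T4" "topic_ids" "8"                         1 5
"DEMO_CODE"  "name"      "Demo indicator"            0 7
"DEMO_CODE"  "source_id" "2"                         0 8
end

* Keep scalar fields only; list items repeat the same (key, field) pair
keep if list == 0
drop list line

* One row per key, one column per field: valuename, valuesource_id
reshape wide value, i(key) j(field) string
list, noobs
```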

Key Naming Convention

Keys are flattened using underscores to represent hierarchy:

# YAML input:
indicators:
  CME_MRY0T4:
    label: Under-five mortality rate
    unit: Deaths per 1000 live births
# Stored as:
key                              value                         parent                  type
─────────────────────────────────────────────────────────────────────────────────────────────
indicators                       (empty)                       (empty)                 parent
indicators_CME_MRY0T4            (empty)                       indicators              parent
indicators_CME_MRY0T4_label      Under-five mortality rate     indicators_CME_MRY0T4   string
indicators_CME_MRY0T4_unit       Deaths per 1000 live births   indicators_CME_MRY0T4   string
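
With this layout, the children of a node are the rows whose parent equals that node's key, and the short field name is the part of key that follows the parent prefix. A minimal sketch using a hand-typed copy of the rows above; the level values are assumed from the rule that 1 is the root level, and the variable leaf is introduced here for illustration.

```stata
* Hand-typed copy of the stored rows shown above (level values assumed, 1 = root)
clear
input str244 key str2000 value int level str244 parent str32 type
"indicators"                  ""                            1 ""                      "parent"
"indicators_CME_MRY0T4"       ""                            2 "indicators"            "parent"
"indicators_CME_MRY0T4_label" "Under-five mortality rate"   3 "indicators_CME_MRY0T4" "string"
"indicators_CME_MRY0T4_unit"  "Deaths per 1000 live births" 3 "indicators_CME_MRY0T4" "string"
end

* Children of one indicator: filter on the parent column
list key value if parent == "indicators_CME_MRY0T4", noobs

* Short field name: strip the parent prefix and the joining underscore
gen str244 leaf = substr(key, strlen(parent) + 2, .) if parent != ""
list leaf value if level == 3, noobs
```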

List Item Storage

Each item of a YAML list is stored as a separate row, with a 1-based index appended to the key:

# YAML input:
countries:
  - BRA
  - ARG
  - CHL
# Stored as:
key             value   parent      type
────────────────────────────────────────────
countries       (empty) (empty)     parent
countries_1     BRA     countries   list_item
countries_2     ARG     countries   list_item
countries_3     CHL     countries   list_item
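
To turn such rows back into a Stata macro list in the original order, the numeric suffix of key can serve as the sort index. The following sketch works on a hand-typed copy of the rows above and uses only built-in commands; the variable idx and the local countries are names chosen for the example.

```stata
* Hand-typed copy of the stored rows shown above
clear
input str244 key str2000 value str244 parent str32 type
"countries"   ""    ""          "parent"
"countries_1" "BRA" "countries" "list_item"
"countries_2" "ARG" "countries" "list_item"
"countries_3" "CHL" "countries" "list_item"
end

* Recover the item index from the key suffix and sort on it
gen long idx = real(substr(key, strlen(parent) + 2, .)) if type == "list_item"
sort idx

* Collect the values in list order (levelsof would sort them alphabetically)
local countries
forvalues i = 1/`=_N' {
    if parent[`i'] == "countries" & type[`i'] == "list_item" {
        local countries `countries' `=value[`i']'
    }
}
display "`countries'"
```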

YAML 1.2 Compliance

This command implements the JSON Schema subset of YAML 1.2 as defined in Chapter 10.2 of the YAML 1.2 Specification. This is the recommended schema for "interoperability and consistency" according to the specification.

✅ Supported (YAML 1.2 JSON Schema)

Feature           YAML 1.2 Reference   Example
──────────────────────────────────────────────────────────────────────────
Mappings          Chapter 8.2.2        key: value
Nested mappings   Chapter 8.2          Indentation-based hierarchy
Block sequences   Chapter 8.2.1        - item1, - item2
Comments          Chapter 6.6          # This is a comment
Strings           Chapter 10.1.1.3     name: "quoted" or name: unquoted
Integers          Chapter 10.2.1.3     count: 100
Floats            Chapter 10.2.1.4     rate: 3.14
Booleans          Chapter 10.2.1.2     debug: true, verbose: false
Null              Chapter 10.2.1.1     empty: or empty: null
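
To make the scalar types concrete, the sketch below tags a few already-extracted value strings with the type labels used in the storage model. It is an illustration written for this README, not the parser's actual logic: it assumes surrounding quotes have been stripped, so it cannot tell a quoted "100" from the integer 100.

```stata
* Example values taken from the table above, stored as plain strings
clear
input str20 key str40 value
"name"  "unquoted"
"count" "100"
"rate"  "3.14"
"debug" "true"
"empty" ""
end

* Default to string, then refine; real() returns missing for non-numeric text
gen str32 type = "string"
replace type = "numeric" if !missing(real(value))
replace type = "boolean" if inlist(value, "true", "false")
replace type = "null"    if inlist(value, "", "null")
list, noobs
```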

❌ Not Supported (Advanced YAML 1.2)

These features are part of the full YAML 1.2 specification but are intentionally excluded to maintain simplicity and robustness:

Feature              YAML 1.2 Reference   Reason
──────────────────────────────────────────────────────────────────────────────────────
Anchors & Aliases    Chapter 7.1          &anchor, *alias - Complex reference handling
Block scalars        Chapter 8.1          |, > - Multi-line literal/folded styles (fast-read mode can capture them with the blockscalars option)
Flow collections     Chapter 7.4          {a: 1}, [1, 2] - JSON-like inline syntax
Tags                 Chapter 6.9          !!map, !!seq - Type annotations
Multiple documents   Chapter 9.2          --- document separators

Version Requirements

Feature               Minimum Version
─────────────────────────────────────
Basic functionality   Stata 14.0
Frame support         Stata 16.0

Examples

Reading and Querying

* Load configuration
yaml read using pipeline_config.yaml, replace

* Get nested value using colon syntax
yaml get database:connection_string
local conn = r(connection_string)

* List all keys at root level
yaml list, keys level(1)

Validation

* Check required configuration keys
yaml validate, required(name version api_key)

* Validate with type checking
yaml validate, types(port:numeric debug:boolean)

if (r(valid) == 0) {
    di as error "Invalid configuration"
    exit 198
}

Working with Frames (Stata 16+)

* Load multiple configurations
yaml read using dev.yaml, frame(dev)
yaml read using prod.yaml, frame(prod)

* Query from specific frame
yaml get host, frame(prod)

* List all YAML data in memory
yaml dir, detail

* Clear specific frame
yaml clear dev

Round-trip: Read and Write

* Read configuration
yaml read using original.yaml, replace

* Modify values
replace value = "new_value" if key == "settings_timeout"

* Write back
yaml write using modified.yaml, replace

Working with Lists

* Read YAML with lists
yaml read using countries.yaml, replace

* List items in a list
yaml list countries, keys children

* Access individual list items
yaml get countries
* Returns: r(1)="BRA" r(2)="ARG" r(3)="CHL"

Performance Optimization for Large Catalogs

For metadata catalogs with 700+ entries, vectorized frame-based queries dramatically outperform iterative yaml get calls:

Approach                                 Time          Relative
───────────────────────────────────────────────────────────────
Naive: 733 iterative yaml get calls      15+ seconds   50×
Optimized: Direct frame dataset query    0.3 seconds   1×

Key Pattern:

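yaml read using indicators_catalog.yaml, frame(meta)
frame yaml_meta {
    * Flag dataflow rows whose value is NUTRITION
    gen byte is_nutrition = (value == "NUTRITION") & ///
        regexm(key, "^indicators_[A-Za-z0-9_]+_dataflow$")
    * Extract the indicator code from keys of the form indicators_<code>_dataflow
    gen indicator_code = regexs(1) if ///
        regexm(key, "^indicators_([A-Za-z0-9_]+)_dataflow$")
    levelsof indicator_code if is_nutrition == 1, local(nutrition_codes)
}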

Vectorized operations (gen, regexm, levelsof) process all rows at once rather than looping through function calls. Frame isolation provides data protection and instant cleanup. See production examples in src/y/README.md.

Use Cases

  • Pipeline Configuration: Database connections, API endpoints, timeouts
  • Metadata Management: Indicator definitions, variable labels, units (optimized for 700+ catalogs)
  • Cross-language Workflows: Share configurations with R, Python, GitHub Actions
  • Reproducible Research: Version-controlled configuration files
  • Multi-environment Support: Dev/staging/prod configurations in separate frames
  • LLM Workflows: YAML-based tool interfaces and pipeline orchestration

Design Principles

  1. YAML 1.2 Compliance: Implements the JSON Schema (Chapter 10.2) of the YAML 1.2 Specification, which covers 95%+ of configuration use cases.

  2. JSON Compatibility: Per YAML 1.2's design goal, the supported subset ensures that valid JSON is also valid YAML (Chapter 1.2 of the specification).

  3. Stata-Native: Pure Stata implementation using file read/write - no external dependencies (Python, LibYAML, etc.).

  4. Hierarchical Storage: Flat storage with parent references enables both simple key-value access and hierarchical queries, following the YAML representation model (Chapter 3.2.1).

  5. Frame Support: Optional frame storage keeps YAML data separate from working datasets (Stata 16+).

  6. Validation First: Built-in validation ensures configuration correctness before pipeline execution.

Repository Structure

yaml/
├── README.md              # This file
├── .gitignore
├── src/y/
│   ├── yaml.ado           # Main command (v1.7.0)
│   ├── yaml.sthlp         # Stata help file
│   └── README.md          # Command documentation with production examples
├── src/_/
│   ├── _yaml_mataread.ado # Mata bulk-load parser (Phase 2)
│   └── _yaml_collapse.ado # Wide-format collapse helper (Phase 2)
├── qa/
│   ├── run_tests.do       # QA runner (26 tests)
│   ├── README.md          # QA framework documentation
│   ├── scripts/           # Test scripts (20 files)
│   └── fixtures/          # Test fixtures
├── examples/              # Examples and test files
│   ├── README.md
│   ├── yaml_basic_examples.do        # Basic usage examples
│   ├── data/              # Sample YAML files
│   └── logs/              # Output logs from examples
└── paper/                 # Documentation and article source

Suggested Citation

For the Stata command:

Azevedo, João Pedro. 2025. "yaml: Stata module for YAML file processing." Statistical Software Components, Boston College Department of Economics.

Author

João Pedro Azevedo
jpazevedo@unicef.org
UNICEF

License

MIT License

About

The yaml command provides a complete YAML 1.2 (subset) parser for Stata, enabling configuration-driven workflows, metadata management, and interoperability with other languages (R, Python, GitHub Actions).
