yaml is a Stata command for reading, writing, and manipulating YAML configuration files. It provides a unified interface with nine subcommands that enable Stata users to integrate YAML-based workflows into their data pipelines.
The command implements the JSON Schema subset of YAML 1.2 (Revision 1.2.2, 2021), the current authoritative YAML standard. This JSON-compatible subset covers the features most commonly used in configuration files and metadata management. It is implemented in pure Stata with no external dependencies.
Latest: v1.9.0 with indicators preset for wbopendata/unicefdata, colfields() and maxlevel() filtering, and Mata bulk-load parser.
- Read YAML files into Stata's data structure or frames
- Write YAML files from Stata datasets or scalars
- Query values using hierarchical key paths
- Validate configurations with required keys and type checking
- Multiple frame support (Stata 16+) for managing multiple configurations
- Fast-scan mode for large metadata catalogs (opt-in)
- Field-selective extraction with `fields()`
- List block extraction with `listkeys()` (fast-read)
- Frame caching with `cache()` (Stata 16+)
- Mata bulk parser for high-performance parsing (Phase 2)
- Collapse option for wide-format indicator output (Phase 2)
- Collapse filters with `colfields()` and `maxlevel()` (v1.8.0)
- Indicators preset for wbopendata/unicefdata metadata (v1.9.0)
- strL support for values exceeding 2045 characters

```stata
ssc install yaml
```

Alternatively, copy yaml.ado and yaml.sthlp from src/y/ to your personal ado directory:

```stata
adopath
* Copy files to the PERSONAL directory shown
```

```stata
* Read a YAML configuration file
yaml read using config.yaml, replace

* View the structure
yaml describe

* Get a specific value
yaml get database:host
return list

* Validate required keys
yaml validate, required(name version database)

* Write modified configuration
yaml write using output.yaml, replace

* Speed-first metadata read (fastread)
yaml read using indicators.yaml, fastread fields(name description source_id topic_ids) ///
    listkeys(topic_ids topic_names) cache(ind_cache)

* Parse wbopendata/unicefdata indicator metadata (v1.9.0)
yaml read using indicators.yaml, indicators replace
list key code name in 1/5
```

```
┌─────────────────────────────────────────────────────────────────────┐
│                              yaml.ado                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│        ┌─────────┐        ┌─────────┐        ┌──────────┐           │
│        │  read   │        │  write  │        │ describe │           │
│        └────┬────┘        └────┬────┘        └────┬─────┘           │
│             │                  │                  │                 │
│        ┌────┴────┐        ┌────┴────┐        ┌────┴─────┐           │
│        │  list   │        │   get   │        │ validate │           │
│        └────┬────┘        └────┬────┘        └────┬─────┘           │
│             │                  │                  │                 │
│        ┌────┴────┐        ┌────┴────┐        ┌────┴─────┐           │
│        │   dir   │        │ frames  │        │  clear   │           │
│        └─────────┘        └─────────┘        └──────────┘           │
│                                                                     │
├─────────────────────────────────────────────────────────────────────┤
│  Internal Storage                                                   │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │  Dataset/Frame Structure:                                     │  │
│  │   ┌──────────┬────────────┬───────┬────────────┬──────────┐   │  │
│  │   │ key      │ value      │ level │ parent     │ type     │   │  │
│  │   ├──────────┼────────────┼───────┼────────────┼──────────┤   │  │
│  │   │ str244   │ str2000    │ int   │ str244     │ str32    │   │  │
│  │   └──────────┴────────────┴───────┴────────────┴──────────┘   │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

| Subcommand | Description |
|---|---|
| `yaml read` | Parse YAML file into Stata dataset or frame |
| `yaml write` | Export Stata data to YAML format |
| `yaml describe` | Display structure summary of loaded YAML |
| `yaml list` | List keys, values, or children |
| `yaml get` | Retrieve specific key values |
| `yaml validate` | Check required keys and value types |
| `yaml dir` | List all YAML data in memory (dataset and frames) |
| `yaml frames` | List only YAML frames in memory (Stata 16+) |
| `yaml clear` | Clear YAML data from memory or frames |

See src/y/yaml_whatsnew.sthlp for version history and release notes.

```stata
yaml read using filename.yaml [, options]
```

Options:
- `replace` - Replace existing data
- `frame(name)` - Load into named frame (Stata 16+)
- `locals` - Store values as return locals
- `scalars` - Store numeric values as scalars
- `prefix(string)` - Prefix for local/scalar names (default: `yaml_`)
- `verbose` - Display parsing details
- `fastread` - Speed-first parsing for large, regular YAML
- `fields(string)` - Restrict extraction to specific keys
- `listkeys(string)` - Extract list blocks for specified keys (fastread only)
- `blockscalars` - Capture block scalars in fast-read mode
- `targets(string)` - Early-exit targets for canonical parse (exact keys)
- `earlyexit` - Stop parsing once all targets are found (canonical)
- `stream` - Use streaming tokenization for canonical parse
- `index(string)` - Materialize an index frame for repeated queries (Stata 16+)
- `cache(string)` - Cache parsed results in a frame (Stata 16+)
- `bulk` - Use Mata bulk-load parser for high-performance parsing (Phase 2)
- `collapse` - Produce wide-format output with `_yaml_collapse` (Phase 2)
- `colfields(string)` - Filter collapsed output to specific field names (semicolon-separated)
- `maxlevel(#)` - Limit collapsed columns by nesting depth
- `indicators` - Preset for wbopendata/unicefdata indicator metadata (implies `bulk collapse`)
- `strl` - Use strL storage for values exceeding 2045 characters


```stata
yaml write using filename.yaml [, options]
```

Options:
- `replace` - Overwrite existing file
- `frame(name)` - Write from named frame
- `scalars(list)` - Write specified scalars
- `indent(#)` - Indentation spaces (default: 2)
- `header(string)` - Custom header comment
- `verbose` - Display write progress


```stata
yaml get keyname [, options]
yaml get parent:child [, options]
```

Options:
- `frame(name)` - Read from named frame
- `attributes(list)` - Specific attributes to retrieve
- `quiet` - Suppress output

Returns:
- `r(found)` - 1 if key found
- `r(n_attrs)` - Number of attributes
- `r(key)` - Key name
- `r(parent)` - Parent name (if colon syntax used)
- `r(attr_name)` - Value for each attribute


```stata
yaml list [keyname] [, options]
```

Options:
- `keys` - Show key names
- `values` - Show values
- `children` - List child keys only
- `level(#)` - Filter by nesting level
- `frame(name)` - Read from named frame


```stata
yaml validate [, options]
```

Options:
- `required(keylist)` - Check that keys exist
- `types(key:type ...)` - Validate key types
- `frame(name)` - Validate named frame
- `quiet` - Suppress output, only set return values

Returns:
- `r(valid)` - 1 if validation passed
- `r(n_errors)` - Number of errors
- `r(n_warnings)` - Number of warnings
- `r(missing_keys)` - List of missing required keys
- `r(type_errors)` - List of type validation failures


```stata
yaml describe [, level(#) frame(name)]
```

```stata
yaml dir [, detail]
```

Lists all YAML data currently in memory:
- Current dataset - if it contains YAML structure (key, value, level, parent, type variables)
- YAML frames - all frames with the `yaml_` prefix (Stata 16+)

Options:
- `detail` - Show number of entries and source file for each

Detection:
- YAML data is identified by the `_dta[yaml_source]` characteristic set by `yaml read`
- Datasets with YAML structure but unknown source are also reported
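
The detection rule can be checked by hand with Stata's own characteristic lookup. This is a minimal sketch, assuming only that `yaml read` sets `_dta[yaml_source]` as described above; the local name `src` is illustrative and not part of the command.

```stata
* Read the characteristic that yaml read is documented to set
local src : char _dta[yaml_source]

if "`src'" != "" {
    display as text "YAML data loaded from: `src'"
}
else {
    display as text "No yaml_source characteristic on the current dataset"
}
```

An empty result does not rule out YAML data: as noted above, datasets with the YAML structure but an unknown source are still reported by `yaml dir`.
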

```stata
yaml frames [, detail]
```

Lists only YAML frames in memory. Requires Stata 16+.

Options:
- `detail` - Show number of entries and source file for each frame

Use case: When you only need to see frames, not the current dataset.

```stata
yaml clear [framename] [, all]
```

YAML data is stored in a flat dataset with hierarchical references:

| Column | Type | Description |
|---|---|---|
| `key` | str244 | Full hierarchical key name (e.g., `indicators_CME_MRY0T4_label`) |
| `value` | str2000 | The value associated with the key |
| `level` | int | Nesting depth (1 = root level) |
| `parent` | str244 | Parent key for hierarchical lookups |
| `type` | str32 | Value type: string, numeric, boolean, parent, list_item, null |

In fastread mode, the output is row-wise and minimal:
| Column | Type | Description |
|---|---|---|
| `key` | str244 | Top-level key (e.g., indicator code) |
| `field` | str244 | Field name under the key |
| `value` | str2000 | Field value |
| `list` | byte | 1 if list item, 0 otherwise |
| `line` | long | Line number in the YAML file |
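
Because the fastread layout is long (one row per key-field pair), a wide table can be obtained with Stata's built-in `reshape`. The sketch below builds a few illustrative rows by hand rather than reading a file; the values and the second key `EXAMPLE_IND` are made up for the illustration, and this is not a description of how the `collapse` option is implemented.

```stata
clear
* Illustrative rows in the fastread layout (values are made up for the sketch)
input str20 key str20 field str40 value byte list long line
"CME_MRY0T4"  "name"      "Under-five mortality rate" 0 2
"CME_MRY0T4"  "source_id" "EXAMPLE_SOURCE"            0 3
"CME_MRY0T4"  "topic_ids" "TOPIC_A"                   1 5
"CME_MRY0T4"  "topic_ids" "TOPIC_B"                   1 6
"EXAMPLE_IND" "name"      "Hypothetical indicator"    0 8
"EXAMPLE_IND" "source_id" "EXAMPLE_SOURCE"            0 9
end

* Keep scalar fields only: list items repeat the same key-field pair,
* which reshape wide cannot place in a single cell
keep if list == 0
keep key field value

* One row per key, one column per field (valuename, valuesource_id)
reshape wide value, i(key) j(field) string
list, noobs
```

List items are dropped here for simplicity; keeping them would need an extra step, such as concatenating them into one string per key before reshaping.
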
Keys are flattened using underscores to represent hierarchy:

```yaml
# YAML input:
indicators:
  CME_MRY0T4:
    label: Under-five mortality rate
    unit: Deaths per 1000 live births
```

```
# Stored as:
key                          value                        parent                 type
───────────────────────────────────────────────────────────────────────────────────────
indicators                   (empty)                      (empty)                parent
indicators_CME_MRY0T4        (empty)                      indicators             parent
indicators_CME_MRY0T4_label  Under-five mortality rate    indicators_CME_MRY0T4  string
indicators_CME_MRY0T4_unit   Deaths per 1000 live births  indicators_CME_MRY0T4  string
```
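
Since the hierarchy lives in the `parent` column, children of a node can be found with an ordinary `if` condition. The following sketch recreates the four rows above with `input`, so it runs without any YAML file; the variable `field` is an illustrative addition, not something the command creates.

```stata
clear
* Recreate the flattened rows shown above
input str40 key str40 value str40 parent str12 type
"indicators"                  ""                            ""                      "parent"
"indicators_CME_MRY0T4"       ""                            "indicators"            "parent"
"indicators_CME_MRY0T4_label" "Under-five mortality rate"   "indicators_CME_MRY0T4" "string"
"indicators_CME_MRY0T4_unit"  "Deaths per 1000 live births" "indicators_CME_MRY0T4" "string"
end

* Children of one node: filter on the parent column
list key value if parent == "indicators_CME_MRY0T4", noobs

* Recover the leaf name by stripping "<parent>_" from the front of the key
gen field = substr(key, strlen(parent) + 2, .) if parent != ""
list field value if type == "string", noobs
```

Stripping the parent prefix is safer than splitting the key on underscores, because key segments such as `CME_MRY0T4` contain underscores themselves.
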
YAML lists are stored as separate, indexed rows:

```yaml
# YAML input:
countries:
  - BRA
  - ARG
  - CHL
```

```
# Stored as:
key          value    parent     type
────────────────────────────────────────────
countries    (empty)  (empty)    parent
countries_1  BRA      countries  list_item
countries_2  ARG      countries  list_item
countries_3  CHL      countries  list_item
```
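
A common next step is to gather list items into a local macro for looping. The sketch below assumes the stored layout shown above and rebuilds it with `input`; it walks the rows in order because `levelsof` would return the items sorted alphabetically instead of in file order. The macro name `countries` is illustrative.

```stata
clear
* Recreate the stored list rows shown above
input str12 key str8 value str12 parent str10 type
"countries"   ""    ""          "parent"
"countries_1" "BRA" "countries" "list_item"
"countries_2" "ARG" "countries" "list_item"
"countries_3" "CHL" "countries" "list_item"
end

* Walk the rows in dataset order and append each list item to a local macro
local countries
forvalues i = 1/`=_N' {
    if parent[`i'] == "countries" & type[`i'] == "list_item" {
        local countries `countries' `=value[`i']'
    }
}

* Use the collected items
foreach c of local countries {
    display "Processing `c'"
}
```

This simple form assumes list values contain no spaces; values with spaces would need compound quotes when appended to the macro.
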
This command implements the JSON Schema subset of YAML 1.2 as defined in Chapter 10.2 of the YAML 1.2 Specification. This is the recommended schema for "interoperability and consistency" according to the specification.
| Feature | YAML 1.2 Reference | Example |
|---|---|---|
| Mappings | Chapter 8.2.2 | key: value |
| Nested mappings | Chapter 8.2 | Indentation-based hierarchy |
| Block sequences | Chapter 8.2.1 | - item1, - item2 |
| Comments | Chapter 6.6 | # This is a comment |
| Strings | Chapter 10.1.1.3 | name: "quoted" or name: unquoted |
| Integers | Chapter 10.2.1.3 | count: 100 |
| Floats | Chapter 10.2.1.4 | rate: 3.14 |
| Booleans | Chapter 10.2.1.2 | debug: true, verbose: false |
| Null | Chapter 10.2.1.1 | empty: or empty: null |
These features are part of the full YAML 1.2 specification but are intentionally excluded to maintain simplicity and robustness:
| Feature | YAML 1.2 Reference | Syntax and reason for exclusion |
|---|---|---|
| Anchors & Aliases | Chapter 7.1 | &anchor, *alias - Complex reference handling |
| Block scalars | Chapter 8.1 | `\|`, `>` - Multi-line literal/folded styles |
| Flow collections | Chapter 7.4 | {a: 1}, [1, 2] - JSON-like inline syntax |
| Tags | Chapter 6.9 | !!map, !!seq - Type annotations |
| Multiple documents | Chapter 9.2 | --- document separators |
| Feature | Minimum Version |
|---|---|
| Basic functionality | Stata 14.0 |
| Frame support | Stata 16.0 |

```stata
* Load configuration
yaml read using pipeline_config.yaml, replace

* Get nested value using colon syntax
yaml get database:connection_string
local conn = r(connection_string)

* List all keys at root level (level 1 is the root)
yaml list, keys level(1)
```

```stata
* Check required configuration keys
yaml validate, required(name version api_key)

* Validate with type checking
yaml validate, types(port:numeric debug:boolean)
if (r(valid) == 0) {
    di as error "Invalid configuration"
    exit 198
}
```

```stata
* Load multiple configurations
yaml read using dev.yaml, frame(dev)
yaml read using prod.yaml, frame(prod)

* Query from specific frame
yaml get host, frame(prod)

* List all YAML data in memory
yaml dir, detail

* Clear specific frame
yaml clear, frame(dev)
```

```stata
* Read configuration
yaml read using original.yaml, replace

* Modify values
replace value = "new_value" if key == "settings_timeout"

* Write back
yaml write using modified.yaml, replace
```

```stata
* Read YAML with lists
yaml read using countries.yaml, replace

* List items in a list
yaml list countries, keys children

* Access individual list items
yaml get countries
* Returns: r(1)="BRA" r(2)="ARG" r(3)="CHL"
```

For metadata catalogs with 700+ entries, vectorized frame-based queries dramatically outperform iterative `yaml get` calls:

| Approach | Time | Relative |
|---|---|---|
| Naive: 733 iterative `yaml get` calls | 15+ seconds | 50× |
| Optimized: Direct frame dataset query | 0.3 seconds | 1× |
Key Pattern:

```stata
yaml read using indicators_catalog.yaml, frame(meta)
frame yaml_meta {
    * Extract the indicator code from keys of the form indicators_<code>_dataflow
    gen indicator_code = regexs(1) ///
        if regexm(key, "^indicators_([A-Za-z0-9_]+)_dataflow$")
    gen is_nutrition = (value == "NUTRITION") & indicator_code != ""
    levelsof indicator_code if is_nutrition == 1, local(nutrition_codes)
}
```

Vectorized operations (`gen`, `regexm`, `levelsof`) process all rows at once rather than looping through function calls. Frame isolation provides data protection and instant cleanup. See production examples in src/y/README.md.

- Pipeline Configuration: Database connections, API endpoints, timeouts
- Metadata Management: Indicator definitions, variable labels, units (optimized for 700+ catalogs)
- Cross-language Workflows: Share configurations with R, Python, GitHub Actions
- Reproducible Research: Version-controlled configuration files
- Multi-environment Support: Dev/staging/prod configurations in separate frames
- LLM Workflows: YAML-based tool interfaces and pipeline orchestration
- YAML 1.2 Compliance: Implements the JSON Schema (Chapter 10.2) of the YAML 1.2 Specification, which covers the features most configuration files rely on.
- JSON Compatibility: Per YAML 1.2's design goal, the supported subset ensures that valid JSON is also valid YAML (Chapter 1.2 of the specification).
- Stata-Native: Pure Stata implementation using `file read/write`, with no external dependencies (Python, LibYAML, etc.).
- Hierarchical Storage: Flat storage with parent references enables both simple key-value access and hierarchical queries, following the YAML representation model (Chapter 3.2.1).
- Frame Support: Optional frame storage keeps YAML data separate from working datasets (Stata 16+).
- Validation First: Built-in validation ensures configuration correctness before pipeline execution.

```
yaml/
├── README.md                     # This file
├── .gitignore
├── src/y/
│   ├── yaml.ado                  # Main command (v1.9.0)
│   ├── yaml.sthlp                # Stata help file
│   └── README.md                 # Command documentation with production examples
├── src/_/
│   ├── _yaml_mataread.ado        # Mata bulk-load parser (Phase 2)
│   └── _yaml_collapse.ado        # Wide-format collapse helper (Phase 2)
├── qa/
│   ├── run_tests.do              # QA runner (26 tests)
│   ├── README.md                 # QA framework documentation
│   ├── scripts/                  # Test scripts (20 files)
│   └── fixtures/                 # Test fixtures
├── examples/                     # Examples and test files
│   ├── README.md
│   ├── yaml_basic_examples.do    # Basic usage examples
│   ├── data/                     # Sample YAML files
│   └── logs/                     # Output logs from examples
└── paper/                        # Documentation and article source
```

For the Stata command:
Azevedo, João Pedro. 2025. "yaml: Stata module for YAML file processing." Statistical Software Components, Boston College Department of Economics.
João Pedro Azevedo
jpazevedo@unicef.org
UNICEF
- YAML 1.2 Specification: Ben-Kiki, O., Evans, C., & döt Net, I. (2021). YAML Ain't Markup Language (YAML™) Version 1.2 (Revision 1.2.2). https://yaml.org/spec/1.2.2/
- JSON Schema: YAML 1.2 Specification, Chapter 10.2. https://yaml.org/spec/1.2.2/#json-schema
- YAML Official Site: https://yaml.org/
MIT License