debianalt/spatiality
The Spatiality of Software: Subnational Economic Complexity from GitHub Data in Argentina

Author: Raimundo Elias Gomez
Affiliations: CONICET / National University of Misiones (Argentina); Faculty of Arts, University of Porto (Portugal)
Contact: elias.gomez@conicet.gov.ar
ORCID: 0000-0002-4468-9618


Overview

This repository contains the data, analysis scripts, and figures for the research "The spatiality of software: subnational economic complexity from GitHub data in Argentina", currently under development.

The study constructs an Economic Complexity Index for software production (ECIsoftware) at the level of 224 Argentine departments using a bipartite network of departments and 87 programming languages derived from 229,270 geocoded GitHub repositories. A three-stage analytical strategy — Multiple Correspondence Analysis (MCA), Hierarchical Agglomerative Clustering (CAH), and type-specific regressions — examines how the determinants of software complexity vary across six territorial types.

Repository structure

github-subir/
├── README.md
├── data/                         # Processed datasets and summary tables
│   ├── departments_full.csv      # All 511 departments: MCA coords, clusters, ECI, census vars
│   ├── bipartite_matrix.csv      # 224 depts x 87 languages (repo counts, filtered)
│   ├── rca_binary_matrix.csv     # 224 x 87 binary RCA matrix (threshold >= 1)
│   ├── eci_ranking_FINAL.csv     # ECI ranking for 224 departments
│   ├── table_01_eci_ranking_full.csv     # ECI ranking with sociodemographic variables
│   ├── table_02_pci_ranking_languages.csv # PCI ranking for 87 programming languages
│   ├── table_03_cluster_profiles.csv      # Mean profiles of 6 departmental types
│   ├── table_04_regression_summary.csv    # Regression coefficients by type
│   ├── table_05_key_numbers.csv           # Summary statistics (key-value)
│   ├── table_06_crossvalidation_geo.csv   # Geospatial cross-validation (511 depts)
│   └── regression_output_FINAL.txt        # Full regression output (text)
├── figures/                      # Article figures (300 DPI)
│   ├── fig_01_pci_ubiquity.png           # Figure 1: PCI vs ubiquity (87 languages)
│   ├── fig_02_mca_biplot.png             # Figure 2: MCA biplot (Axes 1-2, N=511)
│   ├── fig_03_cah_mca_clusters.png       # Figure 3: Six types in MCA space
│   ├── fig_04_cluster_maps.png           # Figure 4: Spatial distribution of types
│   ├── fig_05_eci_vs_devs.png            # Figure 5: ECI vs developer density
│   ├── fig_06_forest_plot.png            # Figure 6: Forest plot of betas by type
│   ├── fig_S1_dendrogram.png             # Figure S1: Ward's dendrogram (k=6)
│   └── fig_S2_diagnostics_panel.png      # Figure S2: MCA scree + cluster quality
├── scripts/                      # Analysis pipeline (Python)
│   ├── 00_build_schema.py        # Stage 0: Integrate 11 data sources into art1 schema
│   ├── 01_compute_eci.py         # Stage 1: Compute ECI via eigenvalue decomposition
│   ├── 02_mca.py                 # Stage 2a: Multiple Correspondence Analysis (8 vars, N=511)
│   ├── 03_cah.py                 # Stage 2b: Ward's CAH on MCA coordinates (k=6)
│   ├── 04_regressions_by_type.py # Stage 3: Pooled + type-specific regressions, Chow test
│   ├── 05_regenerate_figures.py  # Generate all 8 figures (6 article + 2 supplementary)
│   └── 06_cluster_maps.py       # Generate Figure 4 (3x2 small-multiples map)
├── audit/                        # Data quality and geocoding validation
│   ├── audit_01_full_province_department.csv  # Raw vs geo-validated counts (513 depts)
│   ├── audit_02_discrepancies.csv             # 32 departments with discrepancies
│   ├── audit_03_province_summary.csv          # Province-level data integrity summary
│   ├── audit_04_foreign_users.csv             # 76 excluded non-Argentine users
│   ├── audit_05_foreign_repos_by_dept.csv     # Departments affected by foreign repos
│   ├── audit_06_ambiguous_users_sample.csv    # 31 ambiguous location samples
│   └── audit_07_eci_before_after.csv          # ECI ranking before/after corrections
└── supplementary/                # Supplementary material
    ├── supplementary_tables.md              # Supplementary tables and figures
    ├── table_S1_eci_full_ranking.csv        # Full ECI ranking (224 departments)
    ├── table_S2_cluster_region_crosstab.csv # Cluster × region cross-tabulation
    ├── table_S3_small_types_data.csv        # Data for small-N types (Peripheral, Semi-Rural)
    └── table_S4_within_type_correlations.csv # Within-type correlations with ECI

Data description

Core datasets

| File | Rows | Columns | Description |
|---|---|---|---|
| departments_full.csv | 511 | 28 | All Argentine departments with census (2010), MCA coordinates (5 dims), cluster assignment, ECI, GitHub metrics |
| bipartite_matrix.csv | 224 | 88 | Repository counts by department and programming language (dpto5 + 87 languages) |
| rca_binary_matrix.csv | 224 | 88 | Binarised Revealed Comparative Advantage (RCA >= 1) |
| table_02_pci_ranking_languages.csv | 87 | 5 | Product Complexity Index for programming languages |
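
As a rough illustration of how a count matrix such as bipartite_matrix.csv relates to a binarised matrix such as rca_binary_matrix.csv, the sketch below applies the standard Balassa definition of RCA (a department's share of repositories in a language divided by that language's overall share) and the RCA >= 1 threshold stated above. The function names and the toy counts are hypothetical; the repository's own implementation in 01_compute_eci.py may differ in its details.

```python
import numpy as np

def rca_matrix(counts):
    """Balassa RCA for a departments x languages matrix of repo counts.

    Assumes every row and every column has at least one repository,
    as would be expected of a filtered matrix.
    """
    counts = np.asarray(counts, dtype=float)
    # Share of each language within each department (rows sum to 1).
    row_share = counts / counts.sum(axis=1, keepdims=True)
    # Share of each language across all departments.
    col_share = counts.sum(axis=0) / counts.sum()
    return row_share / col_share

def binarise(rca, threshold=1.0):
    """1 where a department is specialised in a language (RCA >= threshold)."""
    return (rca >= threshold).astype(int)

# Toy data (hypothetical): 3 departments x 3 languages.
counts = np.array([
    [80, 15, 5],
    [40, 40, 20],
    [10, 10, 30],
])
rca = rca_matrix(counts)
m = binarise(rca)
```

In this toy case each department ends up specialised in exactly one language, so `m` is the identity matrix.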

Key variables in departments_full.csv

| Variable | Source | Description |
|---|---|---|
| dpto5 | INDEC | Five-digit department code |
| region | Derived | Six regions: CABA, Pampeana, NOA, NEA, Cuyo, Patagonia |
| pob_2010, pob_2022 | Census | Population |
| pct_jefe_sec_2010 | Census 2010 | % household heads with secondary education |
| pct_jefe_uni_2010 | Census 2010 | % household heads with university education |
| pct_pc_2010 | Census 2010 | % households with computer |
| pct_nbi_2010 | Census 2010 | % with unsatisfied basic needs (poverty) |
| pct_hacinam_2010 | Census 2010 | % overcrowding |
| rad_2014 | VIIRS | Mean nighttime radiance (2014) |
| tasa_empleo_2010 | Census 2010 | Employment rate |
| mca_dim1...mca_dim5 | MCA | Factorial coordinates (5 retained axes) |
| mca_cluster | CAH | Cluster number (1-6) |
| mca_cluster_label | CAH | Cluster label |
| eci_software | ECI | Economic Complexity Index (standardised) |
| eci_diversity | ECI | Number of languages with RCA >= 1 |
| eci_avg_ubiquity | ECI | Mean ubiquity of RCA languages |
| gh_total_developers | GitHub | Total geocoded developers |
| gh_total_repos | GitHub | Total repositories |
| gh_devs_per_10k | Derived | Developers per 10,000 inhabitants |
| gh_hill_q1_shannon | GitHub | Language diversity (Shannon entropy) |
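
To make the two derived GitHub metrics concrete, here is a minimal sketch of how quantities like gh_devs_per_10k and gh_hill_q1_shannon could be computed. The function names are hypothetical, and the natural logarithm is an assumption. The variable name suggests a Hill number of order 1, which is the exponential of Shannon entropy, while the description says Shannon entropy; which of the two the column actually stores is not stated here, so both are shown.

```python
import numpy as np

def shannon_entropy(counts):
    """Shannon entropy (natural log) of a vector of per-language repo counts."""
    counts = np.asarray(counts, dtype=float)
    # Drop zero counts: they contribute nothing and log(0) is undefined.
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

def hill_q1(counts):
    """Hill number of order 1: the 'effective number' of languages."""
    return float(np.exp(shannon_entropy(counts)))

def devs_per_10k(developers, population):
    """Developers per 10,000 inhabitants."""
    return 10_000 * developers / population

# Toy example (hypothetical): four equally used languages.
h = shannon_entropy([10, 10, 10, 10])   # equals ln(4)
n_eff = hill_q1([10, 10, 10, 10])       # equals 4 effective languages
density = devs_per_10k(50, 250_000)     # equals 2 developers per 10,000
```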

Analytical pipeline

The scripts are numbered in execution order and depend on a PostgreSQL database (posadas) with the source data. The pipeline proceeds as follows:

  1. 00_build_schema.py — Integrates 11 data sources (Census 2010/2022, VIIRS nighttime lights, NDVI, GitHub, ENACOM) into a single analysis-ready table (art1.departamentos, 511 departments, ~208 columns).

  2. 01_compute_eci.py — Constructs the bipartite network (departments x languages), computes RCA, and extracts ECI and PCI via eigenvalue decomposition of the normalised adjacency matrix. Applies geocoding corrections (Cordoba shift, CABA aggregation, foreign user exclusion).

  3. 02_mca.py — Multiple Correspondence Analysis on 8 pre-treatment variables discretised into terciles (24 modalities, N=511). Retains 5 axes via Benzecri correction. Projects ECI and developer metrics as supplementary variables.

  4. 03_cah.py — Ward's hierarchical clustering on 5 MCA coordinates. Selects k=6 (silhouette=0.330, Calinski-Harabasz=224.5). Profiles clusters with ANOVA and chi-squared tests.

  5. 04_regressions_by_type.py — Pooled and type-specific OLS regressions of ECI on pre-treatment predictors. Chow test for structural heterogeneity. Forest plot of standardised coefficients.

  6. 05_regenerate_figures.py — Generates all 8 figures (6 article + 2 supplementary) with unified formatting (300 DPI).

  7. 06_cluster_maps.py — Generates Figure 4 (3x2 small-multiples map of cluster spatial distribution) using PostGIS geometries.
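
The ECI step (script 01) can be sketched as follows, assuming the standard eigenvector formulation of economic complexity: with a binary matrix M, diversity k_d (row sums) and ubiquity k_l (column sums), the ECI is the standardised eigenvector associated with the second-largest eigenvalue of D^-1 M U^-1 M^T. The function name, the toy matrix, and the sign convention (orienting ECI to correlate positively with diversity) are illustrative assumptions, not a transcription of 01_compute_eci.py.

```python
import numpy as np

def eci_from_binary(m):
    """ECI from a binary departments x languages matrix (eigenvector method).

    Assumes no empty rows or columns in m.
    """
    m = np.asarray(m, dtype=float)
    diversity = m.sum(axis=1)   # number of languages per department
    ubiquity = m.sum(axis=0)    # number of departments per language
    # M~[d, d'] = sum_l M[d, l] * M[d', l] / (diversity[d] * ubiquity[l])
    m_tilde = (m / diversity[:, None]) @ (m / ubiquity[None, :]).T
    eigvals, eigvecs = np.linalg.eig(m_tilde)
    # The leading eigenvector (eigenvalue 1) is constant; take the second.
    order = np.argsort(eigvals.real)[::-1]
    v = eigvecs[:, order[1]].real
    eci = (v - v.mean()) / v.std()
    # Eigenvectors have arbitrary sign; orient ECI along diversity (assumption).
    if np.corrcoef(eci, diversity)[0, 1] < 0:
        eci = -eci
    return eci

# Toy nested matrix (hypothetical): 4 departments x 4 languages.
m = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
])
eci = eci_from_binary(m)
```

The PCI for languages follows by applying the same procedure to the transposed matrix, subject to the same assumptions.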

Key findings

  • ECIsoftware is distinct from developer counts: the two are only moderately correlated (r = 0.47)
  • PCI validates the framework: scientific computing languages (Erlang, Fortran, Julia) rank as most complex; web technologies (JavaScript, HTML, CSS) as least complex
  • Six departmental types explain 30.2% of ECI variance (eta-squared = 0.302)
  • Determinants are structurally heterogeneous: education drives complexity in Metropolitan-Core; computer ownership in Metropolitan-Diversified; population alone in Pampeana-Educated; no predictor significant in Intermediate-Urban
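
For readers unfamiliar with eta-squared, the share of ECI variance attributed to the six types is the ratio of the between-group sum of squares to the total sum of squares in a one-way ANOVA. The sketch below shows that calculation on toy data; the function name and the values are hypothetical and do not reproduce the reported 0.302.

```python
import numpy as np

def eta_squared(values, groups):
    """Eta-squared: between-group sum of squares over total sum of squares."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    grand_mean = values.mean()
    ss_total = ((values - grand_mean) ** 2).sum()
    # Each group contributes n_g * (group mean - grand mean)^2.
    ss_between = sum(
        (groups == g).sum() * (values[groups == g].mean() - grand_mean) ** 2
        for g in np.unique(groups)
    )
    return float(ss_between / ss_total)

# Toy data (hypothetical): an outcome for six units in two types.
values = [1, 2, 3, 7, 8, 9]
types = ["a", "a", "a", "b", "b", "b"]
eta2 = eta_squared(values, types)   # 54 / 58 for this toy data
```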

Data sources

| Source | Period | Coverage | Access |
|---|---|---|---|
| GitHub API | Accumulated through 2025 | 229,270 repos, 23,619 users | Scraped early 2026 |
| Census (INDEC) | 2010, 2022 | 511 departments | datos.gob.ar |
| VIIRS DNB | 2014 | Department-level radiance | Google Earth Engine |
| ENACOM | ~2023 | Internet infrastructure | datosabiertos.enacom.gob.ar |

Requirements

python >= 3.10
numpy
pandas
scipy
scikit-learn
prince
matplotlib
seaborn
geopandas
sqlalchemy
psycopg2

Citation

If you use these data or methods, please cite:

Gomez, R. E. (2026). The spatiality of software: subnational economic complexity from GitHub data in Argentina. Working paper.

Zenodo DOI: DOI

Licence

Data and code are provided under the CC BY 4.0 licence.
