Author: Raimundo Elias Gomez Affiliations: CONICET / National University of Misiones (Argentina); Faculty of Arts, University of Porto (Portugal) Contact: elias.gomez@conicet.gov.ar ORCID: 0000-0002-4468-9618
This repository contains the data, analysis scripts, and figures for the research "The spatiality of software: subnational economic complexity from GitHub data in Argentina", currently under development.
The study constructs an Economic Complexity Index for software production (ECIsoftware) at the level of 224 Argentine departments using a bipartite network of departments and 87 programming languages derived from 229,270 geocoded GitHub repositories. A three-stage analytical strategy — Multiple Correspondence Analysis (MCA), Hierarchical Agglomerative Clustering (CAH), and type-specific regressions — examines how the determinants of software complexity vary across six territorial types.
github-subir/
├── README.md
├── data/ # Processed datasets and summary tables
│ ├── departments_full.csv # All 511 departments: MCA coords, clusters, ECI, census vars
│ ├── bipartite_matrix.csv # 224 depts x 87 languages (repo counts, filtered)
│ ├── rca_binary_matrix.csv # 224 x 87 binary RCA matrix (threshold >= 1)
│ ├── eci_ranking_FINAL.csv # ECI ranking for 224 departments
│ ├── table_01_eci_ranking_full.csv # ECI ranking with sociodemographic variables
│ ├── table_02_pci_ranking_languages.csv # PCI ranking for 87 programming languages
│ ├── table_03_cluster_profiles.csv # Mean profiles of 6 departmental types
│ ├── table_04_regression_summary.csv # Regression coefficients by type
│ ├── table_05_key_numbers.csv # Summary statistics (key-value)
│ ├── table_06_crossvalidation_geo.csv # Geospatial cross-validation (511 depts)
│ └── regression_output_FINAL.txt # Full regression output (text)
├── figures/ # Article figures (300 DPI)
│ ├── fig_01_pci_ubiquity.png # Figure 1: PCI vs ubiquity (87 languages)
│ ├── fig_02_mca_biplot.png # Figure 2: MCA biplot (Axes 1-2, N=511)
│ ├── fig_03_cah_mca_clusters.png # Figure 3: Six types in MCA space
│ ├── fig_04_cluster_maps.png # Figure 4: Spatial distribution of types
│ ├── fig_05_eci_vs_devs.png # Figure 5: ECI vs developer density
│ ├── fig_06_forest_plot.png # Figure 6: Forest plot of betas by type
│ ├── fig_S1_dendrogram.png # Figure S1: Ward's dendrogram (k=6)
│ └── fig_S2_diagnostics_panel.png # Figure S2: MCA scree + cluster quality
├── scripts/ # Analysis pipeline (Python)
│ ├── 00_build_schema.py # Stage 0: Integrate 11 data sources into art1 schema
│ ├── 01_compute_eci.py # Stage 1: Compute ECI via eigenvalue decomposition
│ ├── 02_mca.py # Stage 2a: Multiple Correspondence Analysis (8 vars, N=511)
│ ├── 03_cah.py # Stage 2b: Ward's CAH on MCA coordinates (k=6)
│ ├── 04_regressions_by_type.py # Stage 3: Pooled + type-specific regressions, Chow test
│ ├── 05_regenerate_figures.py # Generate all 8 figures (6 article + 2 supplementary)
│ └── 06_cluster_maps.py # Generate Figure 4 (3x2 small-multiples map)
├── audit/ # Data quality and geocoding validation
│ ├── audit_01_full_province_department.csv # Raw vs geo-validated counts (513 depts)
│ ├── audit_02_discrepancies.csv # 32 departments with discrepancies
│ ├── audit_03_province_summary.csv # Province-level data integrity summary
│ ├── audit_04_foreign_users.csv # 76 excluded non-Argentine users
│ ├── audit_05_foreign_repos_by_dept.csv # Departments affected by foreign repos
│ ├── audit_06_ambiguous_users_sample.csv # 31 ambiguous location samples
│ └── audit_07_eci_before_after.csv # ECI ranking before/after corrections
└── supplementary/ # Supplementary material
├── supplementary_tables.md # Supplementary tables and figures
├── table_S1_eci_full_ranking.csv # Full ECI ranking (224 departments)
├── table_S2_cluster_region_crosstab.csv # Cluster × region cross-tabulation
├── table_S3_small_types_data.csv # Data for small-N types (Peripheral, Semi-Rural)
└── table_S4_within_type_correlations.csv # Within-type correlations with ECI
| File | Rows | Columns | Description |
|---|---|---|---|
departments_full.csv |
511 | 28 | All Argentine departments with census (2010), MCA coordinates (5 dims), cluster assignment, ECI, GitHub metrics |
bipartite_matrix.csv |
224 | 88 | Repository counts by department and programming language (dpto5 + 87 languages) |
rca_binary_matrix.csv |
224 | 88 | Binarised Revealed Comparative Advantage (RCA >= 1) |
table_02_pci_ranking_languages.csv |
87 | 5 | Product Complexity Index for programming languages |
| Variable | Source | Description |
|---|---|---|
dpto5 |
INDEC | Five-digit department code |
region |
Derived | Six regions: CABA, Pampeana, NOA, NEA, Cuyo, Patagonia |
pob_2010, pob_2022 |
Census | Population |
pct_jefe_sec_2010 |
Census 2010 | % household heads with secondary education |
pct_jefe_uni_2010 |
Census 2010 | % household heads with university education |
pct_pc_2010 |
Census 2010 | % households with computer |
pct_nbi_2010 |
Census 2010 | % with unsatisfied basic needs (poverty) |
pct_hacinam_2010 |
Census 2010 | % overcrowding |
rad_2014 |
VIIRS | Mean nighttime radiance (2014) |
tasa_empleo_2010 |
Census 2010 | Employment rate |
mca_dim1...mca_dim5 |
MCA | Factorial coordinates (5 retained axes) |
mca_cluster |
CAH | Cluster number (1-6) |
mca_cluster_label |
CAH | Cluster label |
eci_software |
ECI | Economic Complexity Index (standardised) |
eci_diversity |
ECI | Number of languages with RCA >= 1 |
eci_avg_ubiquity |
ECI | Mean ubiquity of RCA languages |
gh_total_developers |
GitHub | Total geocoded developers |
gh_total_repos |
GitHub | Total repositories |
gh_devs_per_10k |
Derived | Developers per 10,000 inhabitants |
gh_hill_q1_shannon |
GitHub | Language diversity (Shannon entropy) |
The scripts are numbered in execution order and depend on a PostgreSQL database (posadas) with the source data. The pipeline proceeds as follows:
-
00_build_schema.py— Integrates 11 data sources (Census 2010/2022, VIIRS nighttime lights, NDVI, GitHub, ENACOM) into a single analysis-ready table (art1.departamentos, 511 departments, ~208 columns). -
01_compute_eci.py— Constructs the bipartite network (departments x languages), computes RCA, and extracts ECI and PCI via eigenvalue decomposition of the normalised adjacency matrix. Applies geocoding corrections (Cordoba shift, CABA aggregation, foreign user exclusion). -
02_mca.py— Multiple Correspondence Analysis on 8 pre-treatment variables discretised into terciles (24 modalities, N=511). Retains 5 axes via Benzecri correction. Projects ECI and developer metrics as supplementary variables. -
03_cah.py— Ward's hierarchical clustering on 5 MCA coordinates. Selects k=6 (silhouette=0.330, Calinski-Harabasz=224.5). Profiles clusters with ANOVA and chi-squared tests. -
04_regressions_by_type.py— Pooled and type-specific OLS regressions of ECI on pre-treatment predictors. Chow test for structural heterogeneity. Forest plot of standardised coefficients. -
05_regenerate_figures.py— Generates all 8 figures (6 article + 2 supplementary) with unified formatting (300 DPI). -
06_cluster_maps.py— Generates Figure 4 (3x2 small-multiples map of cluster spatial distribution) using PostGIS geometries.
- ECIsoftware is distinct from developer counts: r = 0.47 (moderate correlation)
- PCI validates the framework: scientific computing languages (Erlang, Fortran, Julia) rank as most complex; web technologies (JavaScript, HTML, CSS) as least complex
- Six departmental types explain 30.2% of ECI variance (eta-squared = 0.302)
- Determinants are structurally heterogeneous: education drives complexity in Metropolitan-Core; computer ownership in Metropolitan-Diversified; population alone in Pampeana-Educated; no predictor significant in Intermediate-Urban
| Source | Period | Coverage | Access |
|---|---|---|---|
| GitHub API | Accumulated through 2025 | 229,270 repos, 23,619 users | Scraped early 2026 |
| Census (INDEC) | 2010, 2022 | 511 departments | datos.gob.ar |
| VIIRS DNB | 2014 | Department-level radiance | Google Earth Engine |
| ENACOM | ~2023 | Internet infrastructure | datosabiertos.enacom.gob.ar |
python >= 3.10
numpy
pandas
scipy
scikit-learn
prince
matplotlib
seaborn
geopandas
sqlalchemy
psycopg2
If you use these data or methods, please cite:
Gomez, R. E. (2026). The spatiality of software: subnational economic complexity from GitHub data in Argentina. Working paper.
Data and code are provided under the CC BY 4.0 licence.