A lightweight local Streamlit web app for comprehensive data quality profiling and reporting. Upload CSV or Excel files and receive instant data quality reports including type inference, missing data analysis, cardinality metrics, and automated quality flags.
- ✅ File Upload: Support for CSV and Excel (.xlsx, .xls) files
- ✅ Dataset Overview: Rows, columns, memory usage, and duplicate analysis
- ✅ Type Inference: Automatic detection of numeric, datetime, boolean, categorical, and text columns
- ✅ Comprehensive Statistics:
  - Numeric columns: min, max, mean, median, std, percentiles, skewness, zero/negative counts
  - Datetime columns: min/max date ranges, future date detection
  - All columns: missing %, unique count, top values with frequencies
- ✅ Data Quality Analysis:
  - 6 basic quality flags for common issues
  - 9 advanced quality flags for distribution patterns, string quality, and duplicates
  - Sample values with row numbers for each flagged condition
- ✅ Export Options: Download column summary (CSV), dataset summary (CSV), or full profile (JSON)
The profiler automatically detects and flags potential data quality issues:
Basic Flags:
| Flag | Severity | Description |
|---|---|---|
| HIGH_MISSING | Warning | Configurable threshold (default 10%) of values are null |
| CONSTANT_COLUMN | Info | Column has only one unique value |
| DOMINANT_VALUE | Info | One value represents ≥95% of rows |
| HIGH_CARDINALITY_CATEGORICAL | Warning | Categorical/text column with >1000 unique values |
| POTENTIAL_ID_COLUMN | Info | High uniqueness (>90%) with an ID-like column name |
| MIXED_TYPES | Warning | Object column contains multiple Python types |
Advanced Flags (NEW):
| Flag | Severity | Description |
|---|---|---|
| SKEWED_DISTRIBUTION | Info | Numeric column has a highly skewed distribution (\|skewness\| > 2.0) |
| CONTAINS_ZEROS | Info | Numeric column has >10% zero values |
| CONTAINS_NEGATIVES | Warning | Amount/price column with negative values |
| FUTURE_DATES | Warning | Datetime column contains dates in the future |
| WHITESPACE_ISSUES | Warning | Text values with leading/trailing whitespace (>1%) |
| PLACEHOLDER_VALUES | Info/Warning | Common placeholders detected (N/A, null, unknown, etc.) |
| INCONSISTENT_CASING | Info | Text column has the same values with different casing |
| SPECIAL_CHARACTERS | Info | Values contain non-printable special characters (>0.5%) |
| DUPLICATE_ROWS | Info/Warning | Exact duplicate rows detected (>1% is info, >5% is warning) |
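Two of the advanced checks could be sketched as follows; thresholds match the table above, but the function names and return shape are illustrative, not the app's API:

```python
import pandas as pd

def whitespace_flag(series):
    """Sketch of the WHITESPACE_ISSUES check (>1% per the table above)."""
    text = series.dropna().astype(str)
    if text.empty:
        return None
    pct = (text != text.str.strip()).mean() * 100
    if pct > 1.0:
        return {"code": "WHITESPACE_ISSUES", "severity": "warning",
                "message": f"{pct:.1f}% of values have leading/trailing whitespace"}
    return None

def future_dates_flag(series):
    """Sketch of the FUTURE_DATES check on a datetime column."""
    future = series.dropna() > pd.Timestamp.now()
    if future.any():
        return {"code": "FUTURE_DATES", "severity": "warning",
                "message": f"{int(future.sum())} dates are in the future"}
    return None
```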
- Python 3.8 or higher
- pip package manager
1. Clone or download this repository
   ```bash
   cd data-profiler
   ```
2. Create and activate a virtual environment
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```
3. Install dependencies
   ```bash
   pip install -r requirements.txt
   ```
4. Run the app
   ```bash
   streamlit run app.py
   ```

The app will automatically open in your default web browser at http://localhost:8501.
1. Upload a File: Use the sidebar to upload a CSV or Excel file (up to ~50MB)
2. View Dataset Summary: See high-level statistics about your dataset
   - Number of rows and columns
   - Memory usage
   - Quick overview
3. Review Column Profiles: Examine the comprehensive column summary table
   - Inferred data types
   - Missing value percentages
   - Unique counts
   - Top values
   - Quality flags
4. Explore Details (Optional): Enable "Show detailed column stats" in the sidebar for:
   - Per-column expandable sections
   - Full numeric/datetime statistics
   - Color-coded quality flags
   - Complete top values tables
5. Export Results: Download your profile as:
   - CSV: Tabular format for easy sharing and Excel analysis
   - JSON: Complete structured data for programmatic use
Access customization options in the sidebar:
- Null value threshold (%): Set the threshold for flagging columns with excessive null values (0-100%, default 10%). Columns meeting or exceeding this threshold are highlighted in yellow in the summary table and flagged in detailed views.
- Top N values to display: Choose how many frequent values to show (3-10)
- Show detailed column stats: Toggle per-column expandable detail views
```
data-profiler/
├── app.py # Streamlit UI application
├── profiling.py # Core profiling engine with type inference and 4 new features
├── quality.py # Data quality flag generation (9 new flags)
├── io_utils.py # File loading utilities
├── export_utils.py # Export formatting (CSV/JSON) with new feature exports
├── requirements.txt # Python dependencies
│
├── docs/ # Documentation
│ ├── TEST_PLAN.md # Comprehensive 72-test plan for all features
│ ├── FINAL_TEST_REPORT.md # All 72 tests executed - 100% pass rate
│ ├── TESTING_GUIDE.md # User-friendly testing guide
│ ├── FEATURE_QUICK_REFERENCE.md # Feature overview and usage tips
│ └── IMPLEMENTATION_SUMMARY.md # Technical implementation details
│
├── tests/ # Test scripts and automation
│ ├── execute_all_tests.py # Executes all 72 tests (100% pass rate)
│ ├── create_test_data.py # Generates 5 comprehensive test CSV files
│ ├── run_tests.py # Interactive runtime testing
│ └── test_automation.py # Automated Playwright-based testing
│
├── test_data/ # Auto-generated test datasets
│ ├── test_numeric.csv # 100 rows: skewness, zeros, negatives
│ ├── test_strings.csv # 100 rows: whitespace, placeholders, casing, special chars
│ ├── test_dates.csv # 100 rows: historical and future dates
│ ├── test_duplicates.csv # 200 rows: 40% exact duplicates
│ ├── test_all_features.csv # 1050 rows: all features integrated
│ ├── sample_data.csv # Original sample data
│ ├── sample_products.xlsx # Original sample Excel file
│ ├── quality_flags_test.csv # Original quality flags test file
│ └── bi_dataset.csv # Original BI-style sample dataset
│
├── README.md # This file
├── PRD.md # Product requirements document
├── CLAUDE.md # Development guidance
└── STATUS.md               # Project status and progress
```
The profiler uses a multi-stage approach to infer column types:
- Direct dtype mapping: Numeric, datetime, boolean types detected from pandas dtypes
- Datetime detection: Object columns tested with `pd.to_datetime()` for date strings
- Cardinality heuristic: Low uniqueness ratio (<5%) → categorical; high → text
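The staged logic could be sketched like this (illustrative only; the real `_infer_type()` in `profiling.py` may sample and threshold differently):

```python
import pandas as pd
from pandas.api import types as ptypes

def infer_type(series):
    """Sketch of the three-stage inference described above."""
    # Stage 1: direct dtype mapping
    if ptypes.is_bool_dtype(series):
        return "boolean"
    if ptypes.is_numeric_dtype(series):
        return "numeric"
    if ptypes.is_datetime64_any_dtype(series):
        return "datetime"
    # Stage 2: try parsing object values as dates
    sample = series.dropna().head(100)
    if not sample.empty:
        try:
            pd.to_datetime(sample, errors="raise")
            return "datetime"
        except (ValueError, TypeError):
            pass
    # Stage 3: cardinality heuristic (<5% unique -> categorical)
    non_null = series.dropna()
    if len(non_null) and non_null.nunique() / len(non_null) < 0.05:
        return "categorical"
    return "text"
```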
- Target: Profile datasets up to ~50MB or ~1-2M rows in under 10 seconds
- Memory profiling uses `deep=True` for accuracy (may be slower on very large files)
- Mixed-type detection samples up to 100 values per column for efficiency
- Maximum file size: ~50MB (configurable, limited by available memory)
- Excel support: First sheet only
- Datetime parsing: Uses pandas datetime inference (may produce warnings)
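For reference, the memory figure reported in the overview can be computed with pandas like this (a minimal sketch; the file path is just a sample from `test_data/`):

```python
import pandas as pd

df = pd.read_csv("test_data/sample_data.csv")

# deep=True inspects each object value for its true size: accurate,
# but noticeably slower on very large files, as noted above.
mb = df.memory_usage(deep=True).sum() / 1024 ** 2
print(f"Memory usage: {mb:.2f} MB")
```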
- Initial Data Exploration: Quickly understand a new dataset before analysis
- Data Quality Assessment: Identify issues before loading into BI tools or databases
- Documentation: Generate shareable reports for stakeholders or data teams
- ETL Validation: Verify data structure and quality after transformations
Comprehensive testing infrastructure with 72 test cases (100% pass rate):
Run all tests:
```bash
python tests/execute_all_tests.py
```
Test scripts in the `tests/` directory:
- `execute_all_tests.py` - Executes all 72 test cases with detailed reporting
- `create_test_data.py` - Generates 5 comprehensive test CSV files
- `run_tests.py` - Interactive testing with file upload simulation
- `test_automation.py` - Automated Playwright-based testing
Test documentation in the `docs/` directory:
- `TEST_PLAN.md` - 72 comprehensive test cases (Unit, Runtime, Functionality, UI, Export, Edge Cases)
- `FINAL_TEST_REPORT.md` - Complete test results and sign-off
- `TESTING_GUIDE.md` - User-friendly testing guide with manual checklist
Sample test datasets are included in `test_data/`:
- `test_numeric.csv` - 100 rows testing skewness, zeros, negatives
- `test_strings.csv` - 100 rows testing whitespace, placeholders, casing, special chars
- `test_dates.csv` - 100 rows testing historical and future dates
- `test_duplicates.csv` - 200 rows with 40% exact duplicates
- `test_all_features.csv` - 1050 rows with all features integrated
- `sample_data.csv` - Original sample data with mixed types and missing values
- `sample_products.xlsx` - Excel file with numeric data
- `quality_flags_test.csv` - Dataset designed to trigger quality flags
- `bi_dataset.csv` - Realistic BI-style dataset (500 rows, 12 columns)
To manually test the profiler:
```bash
streamlit run app.py
```
Then upload any test file from the `test_data/` directory.
- `profiling.py`: `profile_dataframe(df)` returns a nested profile structure
- `quality.py`: `generate_quality_flags()` detects data quality issues
- `export_utils.py`: `profile_to_summary_df()` flattens the profile for export
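A programmatic usage sketch, assuming `profile_to_summary_df()` accepts the profile returned by `profile_dataframe()` (implied by the descriptions above but not a verified signature):

```python
import pandas as pd
from profiling import profile_dataframe
from export_utils import profile_to_summary_df

df = pd.read_csv("test_data/sample_data.csv")
profile = profile_dataframe(df)           # nested profile structure
summary = profile_to_summary_df(profile)  # flattened per-column table
summary.to_csv("column_summary.csv", index=False)
```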
Phase 1: Value Distribution Patterns - `profiling.py`
- `_compute_numeric_stats()`: Added skewness, zero_count, zero_pct, negative_count, negative_pct
- `_compute_datetime_stats()`: Added future_count, future_pct, max_future_date
- Quality flags: SKEWED_DISTRIBUTION, CONTAINS_ZEROS, CONTAINS_NEGATIVES, FUTURE_DATES
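These statistics could be computed along these lines; the field names mirror the list above, but this is a sketch, not the code from `_compute_numeric_stats()`:

```python
import pandas as pd

def distribution_stats(series):
    """Sketch of the Phase 1 distribution statistics for a numeric column."""
    non_null = series.dropna()
    n = len(non_null)
    return {
        "skewness": float(non_null.skew()) if n > 2 else None,
        "zero_count": int((non_null == 0).sum()),
        "zero_pct": 100 * float((non_null == 0).mean()) if n else 0.0,
        "negative_count": int((non_null < 0).sum()),
        "negative_pct": 100 * float((non_null < 0).mean()) if n else 0.0,
    }
```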
Phase 2: String Quality Checks - `profiling.py`
- `_analyze_string_quality()`: Comprehensive string analysis (185 lines)
- Analyzes whitespace, placeholders, casing, special characters
- Quality flags: WHITESPACE_ISSUES, PLACEHOLDER_VALUES, INCONSISTENT_CASING, SPECIAL_CHARACTERS
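A compact sketch of these checks; the placeholder set is illustrative, not the exact list used by `_analyze_string_quality()`:

```python
import pandas as pd

PLACEHOLDERS = {"n/a", "na", "null", "none", "unknown", ""}

def string_quality(series):
    """Sketch of the Phase 2 whitespace, placeholder, and casing checks."""
    text = series.dropna().astype(str)
    if text.empty:
        return {}
    stripped = text.str.strip()
    lowered = stripped.str.lower()
    return {
        "whitespace_pct": 100 * float((text != stripped).mean()),
        "placeholder_pct": 100 * float(lowered.isin(PLACEHOLDERS).mean()),
        # e.g. "Yes" and "yes" collapse to one value when lowercased
        "inconsistent_casing": lowered.nunique() < stripped.nunique(),
    }
```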
Phase 3: Sample Data Display - `profiling.py`
- `_collect_examples()`: Extracts 3-5 sample values for flagged conditions
- `_add_examples_to_flags()`: Enriches quality flags with examples and row numbers
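A hypothetical helper in the spirit of `_collect_examples()`, pulling a few flagged values together with their row numbers:

```python
import pandas as pd

def collect_examples(series, mask, max_examples=5):
    """Return a few flagged values with their row numbers for display."""
    return [{"row": int(idx), "value": val}
            for idx, val in series[mask].head(max_examples).items()]

s = pd.Series([" ok", "fine", "bad "])
print(collect_examples(s, s != s.str.strip()))
# [{'row': 0, 'value': ' ok'}, {'row': 2, 'value': 'bad '}]
```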
Phase 4: Duplicate Detection - `profiling.py`
- `_analyze_duplicates()`: Dataset-level exact duplicate detection
- Returns duplicate count, percentage, and duplicate set details
- Quality flag: DUPLICATE_ROWS (info >1%, warning >5%)
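The tiered severity could be implemented as in this sketch, following the thresholds above:

```python
import pandas as pd

def duplicate_flag(df):
    """Sketch of the DUPLICATE_ROWS flag (info >1%, warning >5%)."""
    dup_mask = df.duplicated(keep="first")
    dup_pct = 100 * float(dup_mask.mean())
    if dup_pct > 5:
        severity = "warning"
    elif dup_pct > 1:
        severity = "info"
    else:
        return None
    return {"code": "DUPLICATE_ROWS", "severity": severity,
            "message": f"{int(dup_mask.sum())} duplicate rows ({dup_pct:.1f}%)"}
```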
To add a new quality flag (see the sketch below):
1. Define threshold constants in `quality.py` (e.g., `SKEWNESS_THRESHOLD = 2.0`)
2. Add detection logic in `generate_quality_flags()` for column-level flags or `generate_dataset_quality_flags()` for dataset-level flags
3. Return a flag dict with `code`, `severity`, and `message`
4. Optionally add the flag to `_add_examples_to_flags()` to include sample values
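For example, a hypothetical new column-level flag following these steps might look like this (the constant, flag name, and threshold are examples only, not part of the codebase):

```python
import pandas as pd

# quality.py (sketch): hypothetical new flag and its threshold constant
LONG_TEXT_THRESHOLD = 500  # characters

def check_long_text(series, flags):
    """Append a LONG_TEXT_VALUES flag if any value is unusually long."""
    text = series.dropna().astype(str)
    if not text.empty and (text.str.len() > LONG_TEXT_THRESHOLD).any():
        flags.append({
            "code": "LONG_TEXT_VALUES",
            "severity": "info",
            "message": f"Some values exceed {LONG_TEXT_THRESHOLD} characters",
        })
```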
Modify `_infer_type()` in `profiling.py` to add new type detection logic.
- Core data profiling and type inference
- Basic quality flags (6 flags)
- Advanced quality flags - Phase 1: Value Distribution Patterns (4 flags)
- Advanced quality flags - Phase 2: String Quality Checks (4 flags)
- Advanced quality flags - Phase 3: Sample Data Display with row numbers
- Advanced quality flags - Phase 4: Duplicate Detection
- CSV/JSON export functionality
- Comprehensive testing (72 test cases, 100% pass rate)
- Full test automation infrastructure
- Histogram visualizations for numeric columns
- Correlation analysis
- Data sampling for very large files
- Multiple sheet support for Excel files
- Configurable quality flag thresholds
- Integration with ydata-profiling for advanced analysis
MIT License - feel free to use and modify as needed.
Built with Streamlit and pandas.