This project provides a comprehensive pipeline for static analysis of Android APKs to detect malware. It extracts features from APK files, preprocesses the data, and trains machine-learning models to classify applications as benign or malicious. To perform this analysis, I used 800 malware and 800 benign APK files, which I collected from https://m4lware.org.
android-static-analysis/
├── android_malware_preprocessing.py : Preprocesses cleaned feature data
├── apk_features_updated.csv : Output of feature extraction
├── benignSample/ : Directory for benign APK samples
│ └── [benign APKs]
├── cleaned_features.csv : Output of feature dropping
├── drop_irrelevant_features.py : Removes irrelevant features
├── extract_apk_features.py : Extracts features from APKs
├── malwareSample/ : Directory for malware APK samples
│ └── [malware APKs]
├── model_comparison.py : Trains and evaluates ML models
├── preprocessed_data_[timestamp]/ : Preprocessed data output directory
├── trainModel/ : Trained model output directory
├── requirements.txt : Python dependencies
└── run_pipeline.py : Orchestrates the full pipeline
- Python: Version 3.8 or higher
- Virtual Environment: Recommended (e.g., venv)
- APK Samples: Place benign APKs in benignSample/ and malware APKs in malwareSample/
- Clone the Repository:
  git clone <repository_url>
  cd android-static-analysis
- Set Up a Virtual Environment:
  python3 -m venv venv
  source venv/bin/activate # On Windows: venv\Scripts\activate
- Install Dependencies:
  pip install -r requirements.txt
To execute the entire pipeline from feature extraction to model training:
  python3 run_pipeline.py
Available options:
- --malware-dir: Directory with malware APKs (default: malwareSample)
- --workers: Number of worker processes for feature extraction (default: 5)
- --save-interval: Save interval for feature extraction (default: 50)
- --resume: Resume from the last successful step
- --clean: Clean output directories and files (e.g., python3 run_pipeline.py --clean 1)
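For readers who want to see how such a command line fits together, here is a minimal sketch of the documented options declared with argparse. The option names and defaults come from the list above. Everything else is an assumption: the actual run_pipeline.py may declare them differently, and the meaning of the value passed to --clean is not documented here, so the sketch simply accepts an integer.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options; the wiring itself is hypothetical."""
    parser = argparse.ArgumentParser(prog="run_pipeline.py")
    parser.add_argument("--malware-dir", default="malwareSample",
                        help="Directory with malware APKs")
    parser.add_argument("--workers", type=int, default=5,
                        help="Number of worker processes for feature extraction")
    parser.add_argument("--save-interval", type=int, default=50,
                        help="Save interval for feature extraction")
    parser.add_argument("--resume", action="store_true",
                        help="Resume from the last successful step")
    # Assumption: the documented example passes a value ("--clean 1"),
    # so this sketch takes an integer and defaults to 0 (no cleaning).
    parser.add_argument("--clean", type=int, default=0,
                        help="Clean output directories and files")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is
# reproducible when pasted into an interpreter.
args = build_parser().parse_args(["--workers", "8", "--resume"])
print(args.malware_dir, args.workers, args.save_interval, args.resume, args.clean)
```

Passing an explicit list to parse_args keeps the example independent of how the script is launched; a real entry point would call parse_args() with no arguments.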
If the pipeline is interrupted, resume from the last successful step:
  python3 run_pipeline.py --resume
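One common way to implement this kind of resume behavior is to record each completed step in a small state file and skip recorded steps on the next run. The sketch below illustrates that idea only. The state-file name, its JSON format, the step names, and the run_pipeline function are all hypothetical and are not taken from run_pipeline.py.

```python
import json
import tempfile
from pathlib import Path

# Step names mirror the four pipeline stages; the names are illustrative.
STEPS = ["feature_extraction", "feature_dropping", "preprocessing", "model_training"]


def load_done(state_file: Path) -> list:
    """Return the steps recorded as completed, or an empty list if none."""
    if state_file.exists():
        return json.loads(state_file.read_text())
    return []


def run_pipeline(state_file: Path, resume: bool, runners: dict) -> list:
    """Run steps in order; when resuming, skip steps already recorded as done."""
    done = load_done(state_file) if resume else []
    executed = []
    for step in STEPS:
        if step in done:
            continue
        runners[step]()  # an exception here leaves the step unrecorded
        done.append(step)
        state_file.write_text(json.dumps(done))
        executed.append(step)
    return executed


# Demonstration with stand-in steps: the third step fails once, then succeeds.
calls = []


def ok(name):
    return lambda: calls.append(name)


def fail():
    raise RuntimeError("simulated interruption")


with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "pipeline_state.json"  # hypothetical file name
    runners = {step: ok(step) for step in STEPS}
    runners["preprocessing"] = fail
    try:
        run_pipeline(state, resume=False, runners=runners)
    except RuntimeError:
        pass
    runners["preprocessing"] = ok("preprocessing")
    resumed = run_pipeline(state, resume=True, runners=runners)

print(resumed)
```

Writing the state file after every successful step means an interruption costs at most the step that was in progress.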
- Feature Extraction (extract_apk_features.py):
  - Extracts static features (permissions, API calls, etc.) from APKs using Androguard.
  - Outputs: apk_features_updated.csv
- Feature Dropping (drop_irrelevant_features.py):
  - Removes irrelevant features (e.g., file_name, package_name).
  - Outputs: cleaned_features.csv
- Preprocessing (android_malware_preprocessing.py):
  - Handles missing values and outliers, and creates derived features.
  - Performs feature selection and standardization.
  - Outputs: preprocessed_data_[timestamp]/ with train/test splits and visualizations.
- Model Training (model_comparison.py):
  - Trains and evaluates multiple models (Random Forest, SVM, etc.).
  - Saves trained models and evaluation metrics.
  - Outputs: trainModel/ with models (e.g., best_model_random_forest.pkl) and plots.
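To make the feature-dropping and standardization steps concrete, the following sketch removes the identifier columns named above and applies z-score scaling to a numeric column. It is an illustration under stated assumptions, not the project's code: the scripts rely on pandas and scikit-learn, whereas this sketch uses only the standard library so it runs without extra installs, and the num_permissions and label columns and the sample rows are invented for the example.

```python
import csv
import io
import statistics

# Identifier columns named in the feature-dropping step.
DROP_COLUMNS = {"file_name", "package_name"}


def drop_columns(rows, drop=DROP_COLUMNS):
    """Return copies of the rows without the dropped columns."""
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


def standardize(rows, columns):
    """Z-score each listed column: (x - mean) / population std."""
    out = [dict(row) for row in rows]
    for col in columns:
        values = [float(row[col]) for row in rows]
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        for row, value in zip(out, values):
            # A constant column has std 0; map it to 0.0 instead of dividing.
            row[col] = (value - mean) / std if std else 0.0
    return out


# Invented sample data; real input would be apk_features_updated.csv.
SAMPLE = """file_name,package_name,num_permissions,label
a.apk,com.example.a,2,0
b.apk,com.example.b,4,1
c.apk,com.example.c,6,1
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
cleaned = drop_columns(rows)
scaled = standardize(cleaned, ["num_permissions"])
print(sorted(scaled[0]))
print([round(row["num_permissions"], 3) for row in scaled])
```

The sketch uses the population standard deviation, which matches the convention of scikit-learn's StandardScaler. In a real run the scaling statistics should be computed on the training split only and then applied to the test split.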
See requirements.txt for a full list. Key packages include:
- numpy, pandas: Data processing
- scikit-learn: Machine learning
- matplotlib, seaborn: Visualization
- androguard: APK analysis
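A quick way to confirm that the key packages are importable before starting a long run is sketched below. The missing_modules helper is a hypothetical name, not part of this repository. The module list uses import names, so scikit-learn appears as sklearn.

```python
import importlib.util

# Import names for the key packages (scikit-learn is imported as "sklearn").
KEY_MODULES = ["numpy", "pandas", "sklearn", "matplotlib", "seaborn", "androguard"]


def missing_modules(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = missing_modules(KEY_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
    print("Try: pip install -r requirements.txt")
else:
    print("All key modules are importable.")
```

The check only confirms that each module can be located; it does not verify versions, which requirements.txt remains the reference for.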
- Missing APKs: Ensure benignSample/ and malwareSample/ contain .apk files.
- Dependency Errors: Verify all packages are installed (pip install -r requirements.txt).
- Permission Issues: Run with appropriate permissions if accessing restricted directories.
- Model Not Saved: Check pipeline_run_*.log for errors in the "Model Comparison" step.
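The first and last checks above can be scripted. The sketch below counts .apk files in the two sample directories and prints lines mentioning "Model Comparison" from the most recent pipeline_run_*.log. The helper names are hypothetical, and the sketch assumes the logs are plain text written to the directory the pipeline is run from.

```python
from pathlib import Path


def count_apks(directory: Path) -> int:
    """Count files ending in .apk directly inside the directory (0 if absent)."""
    if not directory.is_dir():
        return 0
    return sum(1 for path in directory.iterdir() if path.suffix.lower() == ".apk")


def latest_log(root: Path):
    """Return the most recently modified pipeline_run_*.log, or None."""
    logs = sorted(root.glob("pipeline_run_*.log"), key=lambda p: p.stat().st_mtime)
    return logs[-1] if logs else None


def lines_mentioning(log: Path, needle: str = "Model Comparison"):
    """Return the log lines that contain the given text."""
    with log.open(errors="replace") as handle:
        return [line.rstrip("\n") for line in handle if needle in line]


root = Path(".")  # assumption: run from the repository root
for name in ("benignSample", "malwareSample"):
    print(f"{name}: {count_apks(root / name)} APK file(s)")

log = latest_log(root)
if log is None:
    print("No pipeline_run_*.log found.")
else:
    for line in lines_mentioning(log):
        print(line)
```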
Feel free to submit issues or pull requests to enhance the pipeline.
This project is unlicensed unless specified otherwise by the repository owner.