To download this repository, clone it using git:
git clone https://github.com/CrystalCo/vulnerability_detection.git
Requires Python 3.8.10
If the files in slicesSource are not compressed, you may skip this step. Otherwise do the following:
- Download git lfs at https://git-lfs.github.com/
- Install by running git lfs install.
- Run git lfs pull to download the large files (this replaces the Git LFS pointer files with their actual contents).
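To decide whether the Git LFS step can be skipped, it may help to check whether any files are still LFS pointer stubs. The sketch below is not part of this repository; the helper names `is_lfs_pointer` and `find_lfs_pointers` are hypothetical. It relies only on the fact that a Git LFS pointer file starts with the line `version https://git-lfs.github.com/spec/v1`.

```python
from pathlib import Path

# First line of a Git LFS pointer file, as defined by the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` looks like an un-fetched Git LFS pointer file."""
    with open(path, "rb") as fh:
        head = fh.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def find_lfs_pointers(folder):
    """Return a sorted list of files under `folder` that are still LFS pointers."""
    return sorted(
        p for p in Path(folder).rglob("*") if p.is_file() and is_lfs_pointer(p)
    )
```

If `find_lfs_pointers` returns an empty list for the slicesSource folder, the files have most likely been fetched already and `git lfs pull` should not be needed.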
Create an environment variable named VUL_PATH that holds the path to this project.
Example for Unix/MacOS:
export VUL_PATH=`pwd`
or manually:
export VUL_PATH=/Users/cryst/Documents/vulnerability_detection
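As a rough illustration of how Python code could pick up this variable, the sketch below reads VUL_PATH and fails early with a clear message if it is missing or wrong. The function name `get_project_root` is hypothetical; the notebooks in this repository may read the variable differently.

```python
import os
from pathlib import Path


def get_project_root(env=None):
    """Return the project root stored in VUL_PATH, or raise a clear error."""
    env = os.environ if env is None else env
    value = env.get("VUL_PATH")
    if not value:
        raise RuntimeError("VUL_PATH is not set; export it before running the notebooks.")
    root = Path(value).expanduser()
    if not root.is_dir():
        raise RuntimeError(f"VUL_PATH does not point to a directory: {root}")
    return root
```

Passing a plain dictionary as `env` makes the check easy to try without changing the real environment.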
Create a virtual environment:
python3 -m virtualenv env
Activate virtual environment:
source env/bin/activate
Install requirements:
pip install -r requirements.txt
Make sure the following folders are inside the root directory:
- w2vModel/metrics/
- w2vModel/metrics/bgru
- w2vModel/metrics/blstm
- w2vModel/model/
- model/
Make sure the following folders are inside the data directory:
- CVE/
- CWE/
- DLinputs/
- DLvectors/
- slicesSource/
- token/
- tokens/
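The two folder checklists above can be verified with a short script. This is a minimal sketch, not something shipped with the repository: the helper names are hypothetical, and it assumes the data directory is a folder named `data` directly under the project root (adjust `data_dir_name` if it is not).

```python
from pathlib import Path

# Folders expected under the project root (w2vModel/metrics/ is implied by its children).
ROOT_DIRS = ["w2vModel/metrics/bgru", "w2vModel/metrics/blstm", "w2vModel/model", "model"]
# Folders expected under the data directory.
DATA_DIRS = ["CVE", "CWE", "DLinputs", "DLvectors", "slicesSource", "token", "tokens"]


def missing_dirs(root, data_dir_name="data"):
    """Return the expected folders that do not exist yet."""
    root = Path(root)
    expected = [root / d for d in ROOT_DIRS]
    expected += [root / data_dir_name / d for d in DATA_DIRS]
    return [p for p in expected if not p.is_dir()]


def create_missing(root, data_dir_name="data"):
    """Create every missing folder and return the list of folders created."""
    created = missing_dirs(root, data_dir_name)
    for p in created:
        p.mkdir(parents=True, exist_ok=True)
    return created
```

Creating an empty folder only satisfies the layout. Folders such as slicesSource still need their actual data files.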
Original code that converts source code to slices.
Contains the original source code for binary vulnerability detection.
2_Application_Codes.ipynb is the main file to run in this folder. It uses BGRU & BLSTM to detect whether a slice of code contains a vulnerability or not.
Contains the follow-up code that attempts multiclass classification of vulnerabilities across 162 Common Weakness Enumeration (CWE) IDs.
CWE_Data_Preprocessing.ipynb is the first step in preprocessing the data. It collects the SARD & CVE test case IDs for all the source slices we have. Then, it scrapes the Internet for the CWE attributes of each SARD & CVE ID. Finally, it outputs 2 files: CWE_DF.csv & CVE_DF.csv.
CWE_DF.csv contains all the unique CWE IDs, their details, and counts*.
CVE_DF.csv contains all the unique CVE IDs, their descriptions & counts*, and the CWE-ID associated with them if applicable.
*number of times they appear in the source code file.
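To make the "counts" columns concrete, here is a small sketch of how ID occurrences could be tallied from lines of text. It is an assumption about the general approach, not the notebook's actual code: the function name `count_ids`, the regular expressions, and the sample input format are all hypothetical.

```python
import re
from collections import Counter

# Assumed ID shapes: "CVE-<year>-<number>" and "CWE-<number>" or "CWE<number>".
CVE_RE = re.compile(r"CVE-\d{4}-\d{4,}")
CWE_RE = re.compile(r"CWE-?(\d+)", re.IGNORECASE)


def count_ids(lines):
    """Count how often each CVE ID and CWE ID appears across `lines`."""
    cve_counts, cwe_counts = Counter(), Counter()
    for line in lines:
        cve_counts.update(CVE_RE.findall(line))
        # Normalize "CWE121" and "CWE-121" to the same key.
        cwe_counts.update(f"CWE-{number}" for number in CWE_RE.findall(line))
    return cve_counts, cwe_counts
```

The resulting counters map each ID to its number of occurrences, which is the kind of value the counts columns in CWE_DF.csv and CVE_DF.csv describe.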
Grouping_By_Abstraction.ipynb then collects all the CWE IDs found in the previous step and creates a tree of relationships between these CWEs. CWEs are grouped by similarity under the pillars defined in the Research Concepts view on the CWE website. A dictionary mapping SARD & CVE IDs to their respective group ID is created and saved to SARD_CVE_to_groups.csv.
Grouping_By_CWE.ipynb groups CWEs by their unique CWE-ID. A dictionary mapping the original SARD & CVE IDs to their respective CWE-ID is then saved to SARD_CVE_to_CWE.csv.
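The idea of grouping by abstraction can be sketched as walking each CWE up a parent tree until a top-level pillar is reached. Everything below is illustrative: the IDs in `PARENT` are made-up placeholders rather than real CWE entries, the function names and CSV column names are hypothetical, and the sketch assumes a single parent per node, whereas real CWE entries can have several.

```python
import csv
import io

# Hypothetical child -> parent links; None marks a pillar (top-level node).
PARENT = {
    "CWE-P1": None,
    "CWE-P2": None,
    "CWE-A": "CWE-P1",
    "CWE-B": "CWE-A",
    "CWE-C": "CWE-P2",
}


def pillar_of(cwe_id, parent=PARENT):
    """Follow parent links upward until a node without a parent is reached."""
    seen = set()
    while parent.get(cwe_id) is not None:
        if cwe_id in seen:
            raise ValueError(f"cycle detected at {cwe_id}")
        seen.add(cwe_id)
        cwe_id = parent[cwe_id]
    return cwe_id


def groups_csv(testcase_to_cwe, parent=PARENT):
    """Render a test-case-ID -> group-ID mapping as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["testcase_id", "group_id"])
    for testcase_id, cwe_id in sorted(testcase_to_cwe.items()):
        writer.writerow([testcase_id, pillar_of(cwe_id, parent)])
    return buf.getvalue()
```

An ID that is absent from the parent map is treated as its own group, which keeps unknown entries visible instead of silently dropping them.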
3A_Vulnerability_Classification_ML.ipynb attempts to classify vulnerability types using ML models.
3A_Vulnerability_Classification_ML_PCA.ipynb attempts to classify vulnerability types using ML models with PCA transformed data.
3B_Vulnerability_Classification_DL.ipynb attempts to classify vulnerability types using DL models.
3A_Vulnerability_Classification_DL_PCA.ipynb attempts to classify vulnerability types using DL models with PCA transformed data.