
Data Engineer Case Study at Albert

Concepts

The solution is based on a RESTful service built with Flask/Python/Celery - text_cat_server.py
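In outline, the Flask app and the Celery instance live in the same module, so the worker started later (celery worker -A text_cat_server.celery) can import the tasks. A minimal sketch of that wiring, assuming a local Redis broker; apart from the module name text_cat_server, the details are illustrative, not the actual contents of the file:

from celery import Celery
from flask import Flask

app = Flask(__name__)
app.config['CELERY_BROKER_URL'] = 'redis://localhost:6379/0'

# the worker is pointed at this object via -A text_cat_server.celery
celery = Celery(app.name, broker=app.config['CELERY_BROKER_URL'])

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)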

spaCy routines for NLP text classification are provisioned but not yet implemented. Mock functions in training_routines.py and prediction_routines.py are in place to simulate the training and inference tasks.
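A mock routine only needs to imitate the latency and output shape of a real spaCy training run. A minimal sketch; the function name, sleep duration, and return value are assumptions, not the actual contents of training_routines.py:

import time

def train_textcat_model(training_data):
    """Simulate a long-running spaCy text classification training job."""
    time.sleep(30)  # stands in for the actual training loop
    return {'status': 'trained', 'examples': len(training_data)}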

Datasets for training are stored in AWS S3 in JSON format. Trained spaCy text classification models are stored in AWS S3 under separate prefixes.
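The exact prefixes are not documented here; a plausible layout and the corresponding boto3 calls might look as follows. The bucket and object names are taken from the sample calls further down, while the models/ prefix is an assumption:

# assumed layout:
#   s3://albert-textcats/trainingSet.json   <- JSON training data
#   s3://albert-textcats/models/tc-01/...   <- trained model artifacts
import boto3

s3 = boto3.client('s3')
s3.download_file('albert-textcats', 'trainingSet.json', '/tmp/trainingSet.json')
s3.upload_file('/tmp/model.bin', 'albert-textcats', 'models/tc-01/model.bin')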

Provided API endpoints

  • /models - GET - retrieve all classification models
  • /models - POST - train a new model or update an existing one
  • /models/<model_id> - DELETE - remove a model and all related objects
  • /prediction - GET - classify text with a trained model

Details about the input parameters are given in the comments of the related functions.
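To illustrate the request/response shape, the prediction endpoint could look roughly like the sketch below. The parameters model_id, text, and n_top come from the sample calls further down; the predict stub and the response format are assumptions:

from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(model_id, text):
    # stand-in for prediction_routines; returns (category, score) pairs
    return [('SAVINGS', 0.91), ('TRAVEL', 0.72)]

@app.route('/prediction', methods=['GET'])
def prediction():
    params = request.get_json()
    scores = predict(params['model_id'], params['text'])
    n_top = int(params.get('n_top', len(scores)))
    return jsonify({'model_id': params['model_id'], 'scores': scores[:n_top]})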

Infrastructure Components

Two deployment options are provided - local and AWS.

Both require the following setup:

  • An AWS account for enabling the required services
  • An AWS DynamoDB table to store references to trained models, S3 bucket names, training set names, and hyperparameters (a possible item layout is sketched below)
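One model record in that table might look like this with boto3; the id, s3bucket, and training_object attributes mirror the POST body in the sample calls, while the table name and the remaining attributes are assumptions:

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('TextCatModels')  # table name is an assumption

table.put_item(Item={
    'id': 'tc-01',                          # model id, as in the sample calls
    's3bucket': 'albert-textcats',          # where datasets and models live
    'training_object': 'trainingSet.json',  # training set name
    'model_prefix': 'models/tc-01/',        # S3 prefix of the trained model
    'hyperparameters': {'n_iter': 20},      # illustrative hyperparameters
})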

Running the sample application locally on an Ubuntu 18.04 box with Internet access

Execute the following commands:

# update the system and install the required packages
sudo apt update
sudo DEBIAN_FRONTEND=noninteractive apt upgrade -y
sudo apt install awscli -y
sudo apt install python3-pip -y
sudo apt install redis-server -y
sudo systemctl restart redis.service
sudo apt install python3-venv -y

# AWSCLI has to be installed and configured to access an AWS account with admin rights
aws configure

# pull the latest version of the app
git clone https://github.com/ttotev/albert.git

# use python virtual environment to install the necessary python modules
cd albert
python3 -m venv .env
source .env/bin/activate
pip install -r requirements.txt

# run celery in the background to prevent the API from blocking on long-running routines
cd app
nohup ../.env/bin/celery worker -A text_cat_server.celery --loglevel=info &

# run the app
python text_cat_server.py

After the server is running successfully, proceed to testing the app.

Running the sample application in AWS

Building the underlying infrastructure and the application components is handled with an AWS CloudFormation template. The console steps are listed below; a scripted alternative follows them.

  1. Create a 'Key Pair'.
  2. Access the CloudFormation console page and click 'Create stack'.
  3. Select 'Upload a template file'.
  4. Click 'Choose file' and pick aws_cloudformation_template.yaml.
  5. Click Next.
  6. Enter a 'Stack name', for example 'AlbertCase'.
  7. In Parameters, for 'KeyName', pick the 'Key Pair' created earlier.
  8. Leave 'SSHLocation' as is, or enter your IP address to restrict access to you only.
  9. Click Next.
  10. Leave the defaults on the 'Configure stack options' page.
  11. Click Next.
  12. Review the parameters and, under Capabilities, check 'I acknowledge that AWS CloudFormation might create IAM resources with custom names.'
  13. Click the 'Create stack' button.
  14. Keep an eye on the 'Outputs' tab and refresh regularly.
  15. When the resources are ready, the PublicIP is displayed.
  16. It may take another 5 minutes for Ubuntu to update/upgrade system packages and deploy the Python app.
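The same stack can also be launched without the console; a sketch with boto3, using the stack name and the 'KeyName' parameter from the steps above (the key pair name is a placeholder):

import boto3

cfn = boto3.client('cloudformation')
with open('aws_cloudformation_template.yaml') as f:
    cfn.create_stack(
        StackName='AlbertCase',
        TemplateBody=f.read(),
        Parameters=[{'ParameterKey': 'KeyName', 'ParameterValue': 'my-key-pair'}],
        Capabilities=['CAPABILITY_NAMED_IAM'],  # matches step 12
    )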

The following services are enabled and integrated in AWS:

  • EC2 Instance
  • Security Group
  • IAM Role
  • DynamoDB Table

After the server is running successfully, proceed to testing the app.

Run the application with sample data

To access the application running locally, use localhost. For the AWS deployment, use the IP address provided in the 'Outputs' section of the CloudFormation stack.

Assuming the API server is running on http://[address]:5000, here are some API calls:

# POST - Train a model
curl -i -H "Content-Type: application/json" -X POST -d '{"id":"tc-01", "s3bucket":"albert-textcats", "training_object":"trainingSet.json"}' http://[address]:5000/models
# POST - Force update/retrain an existing model
curl -i -H "Content-Type: application/json" -X POST -d '{"force_update":"True", "id":"tc-01", "s3bucket":"albert-textcats-best", "training_object":"trainingSet.json"}' http://[address]:5000/models
# DELETE a model
curl -i -H "Content-Type: application/json" -X DELETE http://[address]:5000/models/tc-02
# GET a list of all models
curl -i -H "Content-Type: application/json" -X GET http://[address]:5000/models
# GET a prediction
curl -i -H "Content-Type: application/json" -X GET -d '{"model_id":"tc-01", "text":"I want to save for vacation!"}' http://[address]:5000/prediction
# GET only the top prediction
curl -i -H "Content-Type: application/json" -X GET -d '{"n_top": "1", "model_id":"tc-01", "text":"I want to save for vacation!"}' http://[address]:5000/prediction

Tests

Basic unit tests are included in tests.py.
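A test in that file might follow the usual Flask test-client pattern; a sketch, with the class and test names being illustrative:

import unittest
from text_cat_server import app

class TextCatApiTests(unittest.TestCase):
    def setUp(self):
        self.client = app.test_client()

    def test_list_models(self):
        # GET /models should return the list of trained models
        response = self.client.get('/models')
        self.assertEqual(response.status_code, 200)

if __name__ == '__main__':
    unittest.main()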

Ideas for improvements

  • Separate application components into Docker Images
  • Use AWS ECS or EKS to run the containers
  • Run ECS on Spot Instances to reduce expenses
  • Implement monitoring and health checks
  • Put the execution routines behind AWS API Gateway for setting up a single entry point, security, authentication, authorization and caching
  • Use AWS Lambda for text classification prediction
  • Research options to retrain the models with data immediately after it has been labeled/classified, instead of batch processing later
  • Version control the trained models
