This module provides a simplified way to perform and customize data validation, type checking and custom casting. It can be used stand-alone with plain classes or, as recommended, in conjunction with the dataclasses module. It builds upon Python's data descriptor implementation and has one common descriptor class which handles all cases. The main features include:
- Type checking, with a TypeError raised on mismatch
- Common validation functions such as "is_in_range", "has_length" and "file_exists"; more importantly, custom validation functions can be added
- Implicit and explicit casting to primitive and complex types, such as:
  - str -> pathlib.Path
  - str -> datetime.datetime
  - int -> float
- Support for inheritance and composition of validated classes, see examples
- dict representation of a dataclass and flattened representation of nested classes by prefixing
Installation can be done via pip with the link of the repository:

pip install git+https://github.com/Joshua-96/data_validation.git

or by referencing the folder:

pip install ./data_validation

The module is developed and tested for Python versions:
- 3.7, 3.8, 3.9, 3.10
later versions should be fine but are currently not tested with
- pandas >= 1.0
Consider the following class "Person" with these generic attributes:
- person_id
- first_name
- last_name
- gender
- hobbies
- is_smoker
- email
To be more explicit about gender, we define an Enum class Gender:
from enum import Enum

class Gender(Enum):
    MALE = "male"
    FEMALE = "female"
    OTHER = "other"

All of the basic examples use dataclasses, but these are not required. The implementation looks like this:
from dataclasses import dataclass
from typing import List

@dataclass
class Person():
    person_id: int
    first_name: str
    last_name: str
    gender: Gender
    hobbies: List[str]
    is_smoker: bool
    email: str

First, we want to perform basic type validation:
import pathlib as pl
from dataclasses import dataclass, field
from datetime import date
from typing import List

from data_validation import Validator

@dataclass
class Person():
    person_id: int = Validator()
    first_name: str = Validator()
    last_name: str = Validator()
    date_of_birth: date = field(default=None)
    occupation: str = Validator()
    gender: str = Validator()
    hobbies: List[str] = Validator()
    is_smoker: bool = Validator()
    email: str = Validator()
    image: pl.Path = Validator()

Now the job of the validator is twofold:
- ensure that values for fields annotated with primitive types have the correct type
- apply the default constructor to the value if a complex type is annotated
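The two jobs listed above can be pictured with a minimal sketch. This is not the library's code: the function name check_or_construct and the PRIMITIVE_TYPES tuple are assumptions made for illustration, and the default casting rules of the module are left out on purpose.

```python
from pathlib import Path

# Assumption: which types count as "primitive" is chosen here for illustration only.
PRIMITIVE_TYPES = (int, float, str, bool)


def check_or_construct(value, annotated_type):
    """Type-check primitives; build complex types via their constructor."""
    if isinstance(value, annotated_type):
        # Value already has the annotated type, nothing to do.
        return value
    if annotated_type in PRIMITIVE_TYPES:
        raise TypeError(
            f"expected type {annotated_type}, received value <{value}> "
            f"of type {type(value)}"
        )
    # Complex type: hand the raw value to the type's default constructor.
    return annotated_type(value)


# A str given for a field annotated with pathlib.Path becomes a Path object.
image = check_or_construct("example_folder/Profile.jpg", Path)
```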
Using the class:

from .example import Person

person = Person(
    person_id=23,
    first_name="John",
    last_name="Doe",
    occupation="Teacher",
    gender="male",
    hobbies=["Baseball", "Soccer"],
    is_smoker=True,
    email="john.doe_94@gmail.com",
    image="example_folder/Profile.jpg"
)
# setting person_id to "abc" will result in a TypeError
person.person_id = "abc"

The error message reads:
TypeError: invalid type provided for attribute: 'person_id' expected type <class 'int'>, received value <abc> of type <class 'str'>
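Because the module builds on Python's data descriptor protocol, the interception of an assignment can be sketched as follows. TypeChecked and Account are hypothetical names, and passing the expected type explicitly is an assumption made to keep the example short; the actual Validator is used without such an argument and does considerably more.

```python
class TypeChecked:
    """Hypothetical data descriptor that rejects values of the wrong type."""

    def __init__(self, expected_type):
        self.expected_type = expected_type

    def __set_name__(self, owner, name):
        # Called once at class creation; remember the attribute name.
        self.public_name = name
        self.private_name = "_" + name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return getattr(instance, self.private_name)

    def __set__(self, instance, value):
        # Every assignment, including the one in __init__, passes through here.
        if not isinstance(value, self.expected_type):
            raise TypeError(
                f"invalid type provided for attribute: '{self.public_name}' "
                f"expected type {self.expected_type}, received value <{value}> "
                f"of type {type(value)}"
            )
        setattr(instance, self.private_name, value)


class Account:
    account_id = TypeChecked(int)

    def __init__(self, account_id):
        self.account_id = account_id


account = Account(23)
```

Assigning `account.account_id = "abc"` in this sketch raises a TypeError, because the descriptor's `__set__` runs before the value is stored.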
The following casting functions are applied by default:
- str -> bool, case-insensitive:
  - ["true", "True", "TRUE"] -> True
  - ["False", "false", "FALSE"] -> False
  - other literals -> ValueError
- int -> bool, works only for the values 0 and 1:
  - 1 -> True
  - 0 -> False
  - other int values -> ValueError
- float -> int, casts only floats without a fractional part:
  - 2.0 -> 2
  - 2.5 -> ValueError
- int -> datetime, converts a Unix timestamp int into a datetime:
  - no plausibility check by default
- str -> datetime, parses the string as a formatted time:
  - default format is '%Y-%m-%d'
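The rules above can be restated as plain Python functions. This is only a sketch of the described behavior, not the code of DefaultTypeHandler: all function names are made up, and interpreting the timestamp as UTC is an assumption, since the text does not state a timezone.

```python
from datetime import datetime, timezone


def str_to_bool(value: str) -> bool:
    # Case-insensitive comparison against the two accepted literals.
    lowered = value.lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise ValueError(f"cannot cast <{value}> to bool")


def int_to_bool(value: int) -> bool:
    if value in (0, 1):
        return bool(value)
    raise ValueError(f"cannot cast <{value}> to bool, only 0 and 1 are allowed")


def float_to_int(value: float) -> int:
    # Refuse to silently drop a fractional part.
    if not value.is_integer():
        raise ValueError(f"cannot cast <{value}> to int without losing precision")
    return int(value)


def int_to_datetime(value: int) -> datetime:
    # Assumption: the Unix timestamp is interpreted as UTC.
    return datetime.fromtimestamp(value, tz=timezone.utc)


def str_to_datetime(value: str, dateformat: str = "%Y-%m-%d") -> datetime:
    return datetime.strptime(value, dateformat)
```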
The default casting behavior is implemented in the class DefaultTypeHandler. It serves as a fallback or baseline, so it is recommended to customize a TypeHandler to your needs. This can be achieved in two ways:
- Having one type of casting per TypeHandler, e.g. DateTypeHandler, NumericTypeHandler, PathTypeHandler etc.
- Defining one TypeHandler which includes most or all of the casting functionality
In the first case, pass the source type, destination type and casting function to the init function; in the second case, construct a dict or other mapping type and pass it as type_mapping:
from datetime import datetime, date

from data_validation.validation import DefaultTypeHandler
from data_validation.decorators import apply_casting
# note: ArgFunctionWrapper must be imported from the package as well; its import is not shown here

# define custom function for casting
@apply_casting
def _cast_from_str_to_date(inp: str, dateformat: str) -> date:
    return datetime.strptime(inp, dateformat).date()

# case 1: specific handler
customDateHandler = DefaultTypeHandler(
    source_type=str,
    dest_type=date,
    casting_fct=ArgFunctionWrapper(_cast_from_str_to_date, dateformat="%Y/%m/%d"),
    type_mapping=None
)

# case 2: universal handler
common_casting_mapping = {
    (str, date): ArgFunctionWrapper(_cast_from_str_to_date, dateformat="%Y/%m/%d"),
    ...
}
universalDataHandler = DefaultTypeHandler(
    type_mapping=common_casting_mapping
)

In the second case, the entries of the supplied type_mapping overwrite existing entries of the DEFAULT_TYPE_MAPPING object or are added to it. Here, a conversion from str -> datetime was already defined, but it is overwritten by the new definition, so the different date format is applied.
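The overwrite-or-add behavior of a mapping keyed by (source type, destination type) can be illustrated with ordinary dicts. This is a sketch under assumptions: DEFAULT_MAPPING is a hypothetical stand-in for DEFAULT_TYPE_MAPPING, plain lambdas replace ArgFunctionWrapper, and the helper cast is made up for the example.

```python
from datetime import datetime

# Hypothetical stand-in for the library's DEFAULT_TYPE_MAPPING.
DEFAULT_MAPPING = {
    (str, datetime): lambda s: datetime.strptime(s, "%Y-%m-%d"),
    (int, float): float,
}

# Custom entries: the key (str, datetime) already exists in the defaults.
custom_mapping = {
    (str, datetime): lambda s: datetime.strptime(s, "%Y/%m/%d"),
}

# Later entries win, so the custom date format replaces the default one,
# while the untouched (int, float) entry is kept.
merged = {**DEFAULT_MAPPING, **custom_mapping}


def cast(value, dest_type, mapping=merged):
    """Look up the casting function by (source type, destination type)."""
    caster = mapping[(type(value), dest_type)]
    return caster(value)


parsed = cast("2020/01/31", datetime)
```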
To be usable, a custom validation function must return None on success and raise a ValueError on validation failure. This exception is internally wrapped into a CastException, which in turn can be caught and handled.
Consider the following (incomplete) implementation of an email validation function:
from typing import List

def validate_email_precisely(value: str,
                             first_name: str,
                             last_name: str,
                             allowed_domains: List[str]):
    if "@" not in value:
        raise ValueError("Invalid Email, <@> is missing")
    if first_name.lower() not in value.lower():
        raise ValueError(f"first name <{first_name}> missing from email")
    if last_name.lower() not in value.lower():
        raise ValueError(f"last name <{last_name}> missing from email")
    domain = value.split("@")[1]
    enumerated_domains = ",".join(allowed_domains)
    if domain not in allowed_domains:
        raise ValueError(f"domain <{domain}> is not in domain whitelist: <{enumerated_domains}>")

For this function to be passed as a dataValidator instance, we have to wrap it as follows:
# domain whitelist, matching the error message shown further below
allowed_domains = ["gmail.com", "example_uni.edu", "outlook.com"]

email_Validation = ArgFunctionWrapper(
    func=validate_email_precisely,
    value_kw="value",
    # Pass attribute name to refer to instance field
    first_name="first_name",
    last_name="last_name",
    # Passing static arguments is also possible
    allowed_domains=allowed_domains)

The keyword arguments passed to the eventual function can be statically defined, like the allowed_domains variable, but they can also reference an attribute of the class the validator will be associated with. Finally, for a minimal working example, we re-use our Person class and drop in the custom email validator.
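How attribute references and static arguments might be told apart can be sketched as below. This is not the library's ArgFunctionWrapper: AttributeAwareWrapper, check_prefix and User are hypothetical, and the rule "a string that names an existing attribute is treated as a reference" is an assumption made for the example.

```python
class AttributeAwareWrapper:
    """Hypothetical wrapper, NOT the library's ArgFunctionWrapper."""

    def __init__(self, func, value_kw, **kwargs):
        self.func = func
        self.value_kw = value_kw
        self.kwargs = kwargs

    def __call__(self, instance, value):
        resolved = {}
        for name, arg in self.kwargs.items():
            # Assumption: a string naming an existing attribute is a reference.
            if isinstance(arg, str) and hasattr(instance, arg):
                resolved[name] = getattr(instance, arg)
            else:
                # Anything else is passed through as a static argument.
                resolved[name] = arg
        resolved[self.value_kw] = value
        return self.func(**resolved)


def check_prefix(value, first_name, separator):
    """Return None on success, raise ValueError on failure."""
    expected = first_name.lower() + separator
    if not value.startswith(expected):
        raise ValueError(f"<{value}> does not start with <{expected}>")


class User:
    def __init__(self, first_name):
        self.first_name = first_name


wrapper = AttributeAwareWrapper(
    check_prefix, value_kw="value", first_name="first_name", separator="."
)
result = wrapper(User("John"), "john.doe@gmail.com")  # passes, returns None
```

With that mechanism in mind, the Person example reads as follows.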
@dataclass
class Person():
    first_name: str = defaultValidator()
    last_name: str = defaultValidator()
    email: str = Validator(
        validator_func=email_Validation, default=None, allow_none=True
    )

As in the other cases, when we create an instance and try to set an invalid value, we get:
# Note: because we set "allow_none" to True and provided a default, we can omit the email from the constructor
person: Person = Person(first_name="John", last_name="Doe")
person.email = "john.doe@spammer.io"
# Error: ValueError("Validation Test failed for field 'email': domain <spammer.io> is not in domain whitelist: <gmail.com,example_uni.edu,outlook.com>")

A common use case is a string-concatenated field which represents a collection; generally speaking, a string should be expanded into a list. Using the type_mapping object, we can utilize the List object from the typing library and define:
from typing import List

@apply_casting
def split_str(inp: str, delimiter: str) -> List[str]:
    return inp.split(delimiter)

# type_mapping entry which casts str -> List[str] by splitting on ","
{(str, List[str]): ArgFunctionWrapper(split_str, delimiter=",")}
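That a parameterized type such as List[str] works as part of a dict key can be checked with the standard library alone. In this sketch, functools.partial stands in for ArgFunctionWrapper and the apply_casting decorator is left out; both substitutions are assumptions made so the example runs without the package.

```python
from functools import partial
from typing import List


def split_str(inp: str, delimiter: str) -> List[str]:
    return inp.split(delimiter)


# functools.partial stands in for the library's ArgFunctionWrapper here.
type_mapping = {(str, List[str]): partial(split_str, delimiter=",")}

# typing generics are hashable, so the same key finds the entry again.
caster = type_mapping[(str, List[str])]
hobbies = caster("Baseball,Soccer")
```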