GitHub - Intsights/PyDomainExtractor at extract-from-url

Highly optimized domain name extraction library written in C++

Table of Contents
About The Project
Usage
- Extraction
- Validation
License
Contact

About The Project

PyDomainExtractor is a library for quickly parsing domain names into their parts: subdomain, domain, and suffix. The library is written in C++ for performance.

Built With

Performance

The benchmark was run on a file containing 10 million random domains from various TLDs. Improvement factors are relative to tldextract.

Library	Function	Time	Improvement Factor
tldextract	__call__	67.0s	1.0x
publicsuffix2	publicsuffix2.get_tld	25.8s	2.6x
PyDomainExtractor	pydomainextractor.extract	2.76s	24.3x
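
The improvement factor in the table is the baseline time divided by the library's time. The sketch below shows that arithmetic and a simple way to time any extraction callable over a list of domains. The helper names `time_extractor` and `improvement_factor` are hypothetical and not part of any of the benchmarked libraries; the original benchmark script is not reproduced here.

```python
import time


def time_extractor(extract_fn, domains):
    """Return the seconds taken to call extract_fn once per domain."""
    start = time.perf_counter()
    for domain in domains:
        extract_fn(domain)
    return time.perf_counter() - start


def improvement_factor(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Times taken from the table, with tldextract (67.0s) as the baseline.
print(round(improvement_factor(67.0, 25.8), 1))  # publicsuffix2 -> 2.6
print(round(improvement_factor(67.0, 2.76), 1))  # PyDomainExtractor -> 24.3
```

To time a real library, pass its extraction callable, for example `time_extractor(domain_extractor.extract, domains)`.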

Prerequisites

To compile this package, you need GCC, libidn2, and the Python development package installed.

Fedora

sudo dnf install python3-devel libidn2-devel gcc-c++

Ubuntu 18.04

sudo apt install python3-dev libidn2-dev g++-9

Installation

pip3 install PyDomainExtractor

Usage

Extraction

import pydomainextractor


# Loads the Public Suffix List version supplied with the library. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()

domain_extractor.extract('google.com')
# {
#     'subdomain': '',
#     'domain': 'google',
#     'suffix': 'com'
# }

# Loads custom suffix list data, which should follow the Public Suffix List format.
domain_extractor = pydomainextractor.DomainExtractor(
    'tld\n'
    'custom.tld\n'
)

# 'com' is not in the custom list, so no suffix is matched.
domain_extractor.extract('google.com')
# {
#     'subdomain': 'google',
#     'domain': 'com',
#     'suffix': ''
# }

domain_extractor.extract('google.custom.tld')
# {
#     'subdomain': '',
#     'domain': 'google',
#     'suffix': 'custom.tld'
# }
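
The results of the custom suffix list example can be reproduced with a simple longest-suffix match. The following is a minimal pure-Python sketch for illustration only, not the library's C++ implementation: the function `extract_parts` is hypothetical, takes the suffixes as a plain set, and ignores the wildcard and exception rules that the Public Suffix List format allows.

```python
def extract_parts(domain, suffixes):
    """Split a domain into subdomain, domain and suffix by longest-suffix match."""
    labels = domain.lower().split('.')

    # Scan from the left so the first hit is the longest matching suffix.
    suffix_start = len(labels)
    for index in range(len(labels)):
        if '.'.join(labels[index:]) in suffixes:
            suffix_start = index
            break

    remaining = labels[:suffix_start]
    return {
        'subdomain': '.'.join(remaining[:-1]),
        'domain': remaining[-1] if remaining else '',
        'suffix': '.'.join(labels[suffix_start:]),
    }


custom_suffixes = {'tld', 'custom.tld'}
print(extract_parts('google.com', custom_suffixes))
print(extract_parts('google.custom.tld', custom_suffixes))
```

When no suffix matches, the last label is treated as the domain and everything before it as the subdomain, which mirrors the `google.com` result shown for the custom list.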

Validation

import pydomainextractor


# Loads the Public Suffix List version supplied with the library. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()

domain_extractor.is_valid_domain('google.com')
# True

domain_extractor.is_valid_domain('domain.اتصالات')
# True

domain_extractor.is_valid_domain('xn--mgbaakc7dvf.xn--mgbaakc7dvf')
# True

domain_extractor.is_valid_domain('domain-.com')
# False

domain_extractor.is_valid_domain('-sub.domain.com')
# False

domain_extractor.is_valid_domain('\xF0\x9F\x98\x81nonalphanum.com')
# False
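
The examples above suggest label-level rules: labels may not start or end with a hyphen and may contain only letters, digits, and hyphens. The sketch below applies those rules in pure Python. It is an assumption-based illustration, not the library's validation logic: the function `looks_like_valid_domain` is hypothetical, it measures lengths in characters rather than in IDNA-encoded form, and it does not check the suffix against the Public Suffix List, which the library may do.

```python
def looks_like_valid_domain(domain):
    """Apply basic syntactic checks to each dot-separated label."""
    if not domain or len(domain) > 253:
        return False

    for label in domain.split('.'):
        # Labels must be non-empty and at most 63 characters long.
        if not 1 <= len(label) <= 63:
            return False
        # Hyphens are allowed inside a label, but not at either end.
        if label.startswith('-') or label.endswith('-'):
            return False
        # Only letters and digits (including non-ASCII ones) and hyphens.
        if not all(ch.isalnum() or ch == '-' for ch in label):
            return False

    return True


print(looks_like_valid_domain('google.com'))        # True
print(looks_like_valid_domain('domain-.com'))       # False
print(looks_like_valid_domain('-sub.domain.com'))   # False
```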

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Gal Ben David - gal@intsights.com

Project Link: https://github.com/Intsights/PyDomainExtractor
