PyDomainExtractor is a Python library for quickly parsing domain names into their parts (subdomain, domain, and suffix). The library is written in C++ for performance.
The benchmark was measured on a file containing 10 million random domains from various TLDs.
| Library | Function | Time | Improvement Factor |
|---|---|---|---|
| tldextract | __call__ | 67.0s | 1.0x |
| publicsuffix2 | publicsuffix2.get_tld | 25.8s | 2.6x |
| PyDomainExtractor | pydomainextractor.extract | 2.76s | 24.3x |
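The Improvement Factor column appears to be the baseline time (tldextract) divided by each library's time. A minimal sketch of that arithmetic, plus a generic timing loop, follows. The helper names `time_extractor` and `improvement_factor` are hypothetical and are not part of PyDomainExtractor; the original benchmark script may have been structured differently.

```python
import time


def time_extractor(extract, domains):
    """Return the wall-clock seconds needed to call `extract` on every domain."""
    start = time.perf_counter()
    for domain in domains:
        extract(domain)
    return time.perf_counter() - start


def improvement_factor(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Times taken from the table above, with tldextract as the baseline.
print(round(improvement_factor(67.0, 25.8), 1))  # publicsuffix2
print(round(improvement_factor(67.0, 2.76), 1))  # PyDomainExtractor
```

Any callable that accepts a domain string can be passed as `extract`, so the same loop can be reused for each library being compared.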
To compile this package, you need GCC, libidn2, and the Python development package installed.
- Fedora
  - sudo dnf install python3-devel libidn2-devel gcc-c++
- Ubuntu 18.04
  - sudo apt install python3-dev libidn2-dev g++-9

pip3 install PyDomainExtractor

import pydomainextractor
# Loads the current supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()
domain_extractor.extract('google.com')
>>> {
>>> 'subdomain': '',
>>> 'domain': 'google',
>>> 'suffix': 'com'
>>> }
# Loads custom suffix list data. It should follow PublicSuffixList's format.
domain_extractor = pydomainextractor.DomainExtractor(
'tld\n'
'custom.tld\n'
)
domain_extractor.extract('google.com')
>>> {
>>> 'subdomain': 'google',
>>> 'domain': 'com',
>>> 'suffix': ''
>>> }
domain_extractor.extract('google.custom.tld')
>>> {
>>> 'subdomain': '',
>>> 'domain': 'google',
>>> 'suffix': 'custom.tld'
>>> }

import pydomainextractor
# Loads the current supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()
domain_extractor.is_valid_domain('google.com')
>>> True
domain_extractor.is_valid_domain('domain.اتصالات')
>>> True
domain_extractor.is_valid_domain('xn--mgbaakc7dvf.xn--mgbaakc7dvf')
>>> True
domain_extractor.is_valid_domain('domain-.com')
>>> False
domain_extractor.is_valid_domain('-sub.domain.com')
>>> False
domain_extractor.is_valid_domain('\xF0\x9F\x98\x81nonalphanum.com')
>>> False

Distributed under the MIT License. See LICENSE for more information.
Gal Ben David - gal@intsights.com
Project Link: https://github.com/Intsights/PyDomainExtractor