HTTrack Website Scraper - Apify Actor

Download complete websites using HTTrack and get them as ZIP archives. Perfect for creating offline backups, archiving websites, or downloading entire sites with all assets (HTML, CSS, JavaScript, images, videos).

Features

✅ Complete Website Downloads - Downloads entire websites with all assets
✅ ZIP Archive Output - Automatically creates compressed ZIP files
✅ Configurable Depth - Control how deep to follow links (1-10 levels)
✅ Rate Limiting - Respect servers with configurable download rates
✅ Domain Filtering - Stay on same domain or follow external links
✅ Content Selection - Choose to download images, videos, or just HTML/CSS/JS
✅ Robots.txt Support - Optionally respect website's robots.txt
✅ Progress Tracking - Real-time logging of scraping progress
✅ Statistics - File counts, sizes, and compression ratios

Input Configuration

Required

`url` (Website URL) - The URL to scrape (must include http:// or https://)

Optional

Parameter	Type	Default	Description
`depth`	Integer	2	How many links deep to follow (1-10)
`stayOnDomain`	Boolean	true	Only download from the same domain
`externalDepth`	Integer	0	How deep to follow external links
`connections`	Integer	4	Number of simultaneous downloads
`maxRate`	Integer	0	Max download rate in KB/s (0 = unlimited)
`maxSize`	Integer	0	Max total size in MB (0 = unlimited)
`maxTime`	Integer	0	Max scraping time in seconds (0 = unlimited)
`retries`	Integer	2	Number of retry attempts on error
`timeout`	Integer	30	Connection timeout in seconds
`getImages`	Boolean	true	Download image files
`getVideos`	Boolean	true	Download video files
`followRobots`	Boolean	true	Respect robots.txt
`outputName`	String	null	Custom output name (auto-generated if empty)
`cleanup`	Boolean	true	Remove source files after creating ZIP
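
The defaults and constraints in the table can be expressed as a small normalization step. The following is a minimal sketch, not the Actor's actual code: the function name `normalize_input` and the `DEFAULTS` dictionary are hypothetical, and only the two constraints stated in this README (URL scheme, depth range) are checked.

```python
from urllib.parse import urlparse

# Defaults copied from the parameter table above.
DEFAULTS = {
    "depth": 2, "stayOnDomain": True, "externalDepth": 0, "connections": 4,
    "maxRate": 0, "maxSize": 0, "maxTime": 0, "retries": 2, "timeout": 30,
    "getImages": True, "getVideos": True, "followRobots": True,
    "outputName": None, "cleanup": True,
}


def normalize_input(raw: dict) -> dict:
    """Merge user input with the documented defaults and check basic constraints.

    Hypothetical helper for illustration; the Actor may validate differently.
    """
    parsed = urlparse(raw.get("url", ""))
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("url must include http:// or https://")

    # User-supplied values override the defaults.
    config = {**DEFAULTS, **raw}
    if not 1 <= config["depth"] <= 10:
        raise ValueError("depth must be between 1 and 10")
    return config
```

For example, `normalize_input({"url": "https://example.com", "depth": 3})` keeps `depth` at 3 and fills every other parameter from the table.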

Output

The Actor provides two types of output:

1. Dataset

Statistics and metadata for each scrape:

{
  "url": "https://example.com",
  "outputName": "example.com_20241205_130000",
  "zipFile": "example.com_20241205_130000.zip",
  "fileCount": 156,
  "totalSize": 5242880,
  "zipSize": 2621440,
  "compressionRatio": 50.0,
  "timestamp": "2024-12-05T13:00:00.000Z",
  "config": { ... },
  "status": "success"
}
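
The size fields in this record can be derived while the archive is written. The sketch below is an assumption about how such statistics might be computed, using only the Python standard library; `zip_with_stats` is a hypothetical name, and `compressionRatio` is assumed to mean the percentage of space saved (the sample record above, 50.0, is consistent with that reading but does not confirm it).

```python
import os
import zipfile


def zip_with_stats(source_dir: str, zip_path: str) -> dict:
    """Zip every file under source_dir and return count and size statistics.

    zip_path should live outside source_dir so the archive does not
    include itself.
    """
    file_count = 0
    total_size = 0
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                path = os.path.join(root, name)
                # Store paths relative to the mirror root.
                archive.write(path, os.path.relpath(path, source_dir))
                file_count += 1
                total_size += os.path.getsize(path)

    zip_size = os.path.getsize(zip_path)
    # Assumed definition: percentage of space saved by compression.
    ratio = round((1 - zip_size / total_size) * 100, 1) if total_size else 0.0
    return {
        "fileCount": file_count,
        "totalSize": total_size,
        "zipSize": zip_size,
        "compressionRatio": ratio,
    }
```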

2. Key-Value Store

The complete website as a ZIP archive. Access it via:

Apify Console: Storage → Key-Value Store → [filename].zip
API: https://api.apify.com/v2/key-value-stores/{storeId}/records/{filename}.zip

Usage Examples

Example 1: Basic Website Backup

{
  "url": "https://example.com",
  "depth": 2,
  "stayOnDomain": true
}

Downloads the website up to 2 levels deep, staying on the same domain.

Example 2: Deep Archive with External Links

{
  "url": "https://example.com",
  "depth": 5,
  "externalDepth": 1,
  "stayOnDomain": false
}

Downloads 5 levels deep and follows external links 1 level.

Example 3: Fast Scrape (HTML/CSS/JS Only)

{
  "url": "https://example.com",
  "depth": 3,
  "getImages": false,
  "getVideos": false,
  "connections": 8
}

Fast scraping without images or videos, using 8 parallel connections.

Example 4: Rate-Limited Polite Scrape

{
  "url": "https://example.com",
  "depth": 2,
  "maxRate": 500,
  "connections": 2,
  "followRobots": true
}

Polite scraping with rate limiting and respecting robots.txt.

Example 5: Time-Limited Scrape

{
  "url": "https://example.com",
  "depth": 10,
  "maxTime": 300,
  "maxSize": 100
}

Stops after 5 minutes or 100 MB, whichever comes first.
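
The "whichever comes first" rule, together with the convention that 0 means unlimited, can be written as a single check. This sketch only illustrates the semantics; how HTTrack or the Actor enforces the limits is not described here, `limit_reached` is a hypothetical name, and 1 MB is assumed to be 1024 * 1024 bytes.

```python
def limit_reached(elapsed_s: float, downloaded_bytes: int,
                  max_time: int = 0, max_size_mb: int = 0) -> bool:
    """Return True once either configured limit is hit; 0 disables a limit."""
    time_hit = max_time > 0 and elapsed_s >= max_time
    # Assumption: maxSize is in mebibytes (1024 * 1024 bytes).
    size_hit = max_size_mb > 0 and downloaded_bytes >= max_size_mb * 1024 * 1024
    return time_hit or size_hit
```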

How It Works

1. Input Validation - Validates the URL and configuration
2. HTTrack Execution - Runs HTTrack with configured parameters
3. Progress Monitoring - Logs progress in real-time
4. ZIP Creation - Creates a compressed archive of all files
5. Storage - Saves ZIP to Key-Value Store and stats to Dataset
6. Cleanup - Optionally removes temporary files
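
As an illustration of the HTTrack Execution step, the input parameters could be translated into an `httrack` command line roughly as follows. The mapping is an assumption based on HTTrack's documented command-line options (`-r`, `-c`, `-R`, `-T`, `-%e`, `-s`, `-A`, `-M`, `-E`, `-O`), not a copy of what the Actor runs, and `build_httrack_command` is a hypothetical name. Domain and content-type filtering are left out.

```python
def build_httrack_command(config: dict, output_dir: str) -> list[str]:
    """Translate Actor-style input into an httrack argument list (assumed mapping)."""
    cmd = [
        "httrack", config["url"],
        "-O", output_dir,                           # mirror and log location
        f"-r{config.get('depth', 2)}",              # mirror depth
        f"-c{config.get('connections', 4)}",        # simultaneous connections
        f"-R{config.get('retries', 2)}",            # retries on error
        f"-T{config.get('timeout', 30)}",           # timeout in seconds
        f"-%e{config.get('externalDepth', 0)}",     # external links depth
        "-s2" if config.get("followRobots", True) else "-s0",  # robots.txt
    ]
    # Limits are only passed when set, since 0 means unlimited in this README.
    if config.get("maxRate", 0) > 0:
        cmd.append(f"-A{config['maxRate'] * 1024}")         # KB/s -> bytes/s
    if config.get("maxSize", 0) > 0:
        cmd.append(f"-M{config['maxSize'] * 1024 * 1024}")  # MB -> bytes
    if config.get("maxTime", 0) > 0:
        cmd.append(f"-E{config['maxTime']}")                # seconds
    return cmd
```

Returning a list rather than a single string lets the result be passed to `subprocess.run` without shell quoting.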

Technical Details

Based On

HTTrack 3.49+ - Industry-standard website copier
Python 3.13 - Modern async Python runtime
Apify SDK 2.0 - For Actor integration and storage

Limitations

Some JavaScript-heavy SPAs may not download completely
Websites with aggressive bot protection may block scraping
Dynamic content loaded after page load may be missed
Maximum recommended depth is 5-6 for most websites

Performance

Small websites (< 100 pages): 1-5 minutes
Medium websites (100-1000 pages): 5-30 minutes
Large websites (1000+ pages): 30+ minutes

Performance depends on:

Website size and structure
Number of connections
Network speed
Rate limiting settings

Legal and Ethical Considerations

⚠️ Important: Always ensure you have permission to scrape websites.

✅ Respect robots.txt files (enabled by default)
✅ Don't overload servers (use rate limiting)
✅ Check website Terms of Service
✅ Don't scrape copyrighted content without permission
✅ Use reasonable connection limits (2-8)

Troubleshooting

Scraping Takes Too Long

Reduce depth to 1 or 2
Disable getVideos and getImages
Increase connections (but be respectful)
Set maxTime or maxSize limits

ZIP File Too Large

Reduce depth
Disable getVideos
Set maxSize limit
Use maxTime to stop early

Website Blocks Scraping

Enable followRobots
Reduce connections to 2-4
Add rate limiting with maxRate
Increase timeout if connections are slow

Missing Content

Increase depth
Set externalDepth to 1 or more (with stayOnDomain set to false) if content is on other domains
Check whether the website relies heavily on JavaScript (such content may not download)
Enable getImages and getVideos if needed

Development

Local Testing

# Install dependencies
pip install -r requirements.txt

# Run locally
apify run

Building

# Build Docker image
docker build -t httrack-scraper .

# Run container
docker run httrack-scraper

Support

For issues or questions:

Check Actor logs for detailed error messages
Review HTTrack documentation: https://www.httrack.com/
Contact Apify support through the platform

License

This Actor uses HTTrack, which is licensed under GPL v3.

Version History

1.0.0 - Initial release with full HTTrack integration
