Hello, I was recently using the crawler to crawl some pages and it was taking quite a long time, so I decided to switch to async mode. In async mode I noticed a lot of duplicates in my results; in particular, the number of duplicates matched the number of threads I launched the crawler with.
Here is a quick example, based on one from the official docs: https://github.com/gocolly/colly/blob/master/_examples/rate_limit/rate_limit.go
package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	url := "https://httpbin.org/delay/2"
	// Instantiate default collector
	c := colly.NewCollector(
		// Turn on asynchronous requests
		colly.Async(true),
	)
	// Start scraping in five threads on https://httpbin.org/delay/2
	for i := 0; i < 5; i++ {
		// Print the body of every response received
		c.OnResponse(func(response *colly.Response) {
			fmt.Println(string(response.Body))
		})
		c.Visit(fmt.Sprintf("%s?n=%d", url, i))
	}
	// Wait until threads are finished
	c.Wait()
}

Running this code produces the following output:
A lot of text here with http body response
{
"args": {
"n": "3"
},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip",
"Host": "httpbin.org",
"User-Agent": "colly - https://github.com/gocolly/colly/v2",
"X-Amzn-Trace-Id": "Root=1-659818d1-0ce769125429588340e95d6c"
},
"origin": "83.139.137.160",
"url": "https://httpbin.org/delay/2?n=3"
}
{
"args": {
"n": "3"
},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip",
"Host": "httpbin.org",
"User-Agent": "colly - https://github.com/gocolly/colly/v2",
"X-Amzn-Trace-Id": "Root=1-659818d1-0ce769125429588340e95d6c"
},
"origin": "83.139.137.160",
"url": "https://httpbin.org/delay/2?n=3"
}
{
"args": {
"n": "3"
},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip",
"Host": "httpbin.org",
"User-Agent": "colly - https://github.com/gocolly/colly/v2",
"X-Amzn-Trace-Id": "Root=1-659818d1-0ce769125429588340e95d6c"
},
"origin": "83.139.137.160",
"url": "https://httpbin.org/delay/2?n=3"
}
{
"args": {
"n": "1"
},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip",
"Host": "httpbin.org",
"User-Agent": "colly - https://github.com/gocolly/colly/v2",
"X-Amzn-Trace-Id": "Root=1-659818d1-41c8deb73f2c9a702e3a9fcd"
},
"origin": "83.139.137.160",
"url": "https://httpbin.org/delay/2?n=1"
}
{
"args": {
"n": "1"
},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip",
"Host": "httpbin.org",
"User-Agent": "colly - https://github.com/gocolly/colly/v2",
"X-Amzn-Trace-Id": "Root=1-659818d1-41c8deb73f2c9a702e3a9fcd"
},
"origin": "83.139.137.160",
"url": "https://httpbin.org/delay/2?n=1"
}
{
"args": {
"n": "1"
},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip",
"Host": "httpbin.org",
"User-Agent": "colly - https://github.com/gocolly/colly/v2",
"X-Amzn-Trace-Id": "Root=1-659818d1-41c8deb73f2c9a702e3a9fcd"
},
"origin": "83.139.137.160",
"url": "https://httpbin.org/delay/2?n=1"
}
{
"args": {
"n": "1"
},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip",
"Host": "httpbin.org",
"User-Agent": "colly - https://github.com/gocolly/colly/v2",
"X-Amzn-Trace-Id": "Root=1-659818d1-41c8deb73f2c9a702e3a9fcd"
},
"origin": "83.139.137.160",
"url": "https://httpbin.org/delay/2?n=1"
}
{
"args": {
"n": "1"
},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip",
"Host": "httpbin.org",
"User-Agent": "colly - https://github.com/gocolly/colly/v2",
"X-Amzn-Trace-Id": "Root=1-659818d1-41c8deb73f2c9a702e3a9fcd"
},
"origin": "83.139.137.160",
"url": "https://httpbin.org/delay/2?n=1"
}

As you can see, there are duplicates in the results. Maybe I'm doing something wrong or not setting up the crawler properly, but I highly doubt this is the intended behaviour. Anyway, I would appreciate any help.
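One possible explanation, offered as an assumption rather than a confirmed diagnosis: in the example, `c.OnResponse` is called inside the loop, so if each call adds a callback instead of replacing the previous one, five callbacks are registered by the time the delayed responses arrive, and every response is printed once per callback. The sketch below models only that effect with the standard library. `miniCollector`, `countCalls`, and the `example.invalid` URL are hypothetical names invented for illustration; this is not colly's implementation.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// miniCollector is a hypothetical stand-in for an async crawler.
// It is NOT colly; it only models "OnResponse appends a callback".
type miniCollector struct {
	mu        sync.Mutex
	callbacks []func(body string)
	wg        sync.WaitGroup
	release   chan struct{} // closed once simulated responses may arrive
}

func newMiniCollector() *miniCollector {
	return &miniCollector{release: make(chan struct{})}
}

// OnResponse adds a callback; callbacks registered earlier are kept.
func (c *miniCollector) OnResponse(f func(body string)) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.callbacks = append(c.callbacks, f)
}

// Visit simulates a slow request: the "response" is handled only after
// release is closed, i.e. after all registrations have already happened.
func (c *miniCollector) Visit(url string) {
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		<-c.release
		c.mu.Lock()
		cbs := append([]func(string){}, c.callbacks...)
		c.mu.Unlock()
		for _, f := range cbs {
			f(url)
		}
	}()
}

// Wait lets the simulated responses arrive and blocks until all are handled.
func (c *miniCollector) Wait() {
	close(c.release)
	c.wg.Wait()
}

// countCalls returns how many callback invocations happen for n visits.
func countCalls(registerInLoop bool, n int) int {
	c := newMiniCollector()
	var calls int64
	handler := func(string) { atomic.AddInt64(&calls, 1) }
	if !registerInLoop {
		c.OnResponse(handler) // register exactly once, before any visit
	}
	for i := 0; i < n; i++ {
		if registerInLoop {
			c.OnResponse(handler) // one more callback per iteration
		}
		c.Visit(fmt.Sprintf("https://example.invalid/?n=%d", i))
	}
	c.Wait()
	return int(calls)
}

func main() {
	fmt.Println("registered inside loop:", countCalls(true, 5)) // 25
	fmt.Println("registered once:", countCalls(false, 5))       // 5
}
```

If the same thing is happening in my example, moving the `c.OnResponse(...)` call above the `for` loop should print each response once.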