-
Notifications
You must be signed in to change notification settings - Fork 20
Description
Hi,
I was able to run the following code on my own linux machine without a problem:
from nutch.nutch import Nutch
from nutch.nutch import SeedClient
from nutch.nutch import Server
from nutch.nutch import JobClient
import nutch
sv=Server('http://localhost:8081')
sc=SeedClient(sv)
seed_urls=('http://espn.go.com','http://www.espn.com')
sd= sc.create('espn-seed',seed_urls)
nt = Nutch('default')
jc = JobClient(sv, 'test', 'default')
cc = nt.Crawl(sd, sc, jc)
while True:
job = cc.progress() # gets the current job if no progress, else iterates and makes progress
if job == None:
break
however, when I run the same code on AWS (ubuntu 14.04), it gives a runtime error. here is the runtime log of the code:
nutch.py: Response status: 200
nutch.py: Response JSON: {u'crawlId': u'test', u'args': {u'url_dir': u'/tmp/1456875353316-0'}, u'state': u'IDLE', u'result': None, u'msg': u'idle', u'type': u'GENERATE', u'id': u'test-default-GENERATE-1140031758', u'confId': u'default'}
nutch.py: GET Endpoint: /job/test-default-GENERATE-1140031758
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'Date': 'Tue, 01 Mar 2016 23:36:35 GMT', 'Content-Length': '0', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 204
Traceback (most recent call last):
File "main.py", line 22, in
job = cc.progress() # gets the current job if no progress, else iterates and makes progress
File "/usr/local/lib/python2.7/dist-packages/nutch/nutch.py", line 531, in progress
jobInfo = currentJob.info()
File "/usr/local/lib/python2.7/dist-packages/nutch/nutch.py", line 201, in info
return self.server.call('get', '/job/' + self.id)
File "/usr/local/lib/python2.7/dist-packages/nutch/nutch.py", line 160, in call
raise error
nutch.nutch.NutchException: Unexpected server response: 204
in order to run the python code, I was running nutch as: /bin/nutch startserver, here is the run the
Injector: starting at 2016-03-01 23:35:53
Injector: crawlDb: test/crawldb
Injector: urlDir: /tmp/1456875353316-0
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total number of urls rejected by filters: 0
Injector: Total number of urls after normalization: 2
Injector: Total new urls injected: 2
Injector: finished at 2016-03-01 23:36:34, elapsed: 00:00:40
Generator: starting at 2016-03-01 23:36:35
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: running in local mode, generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: test/segments/20160301233638
Generator: finished at 2016-03-01 23:36:40, elapsed: 00:00:05
I would appreciate if you can help.
Thanks