runtime error at AWS #16

@mbnik

Hi,

I was able to run the following code on my own Linux machine without any problems:


from nutch.nutch import Nutch
from nutch.nutch import SeedClient
from nutch.nutch import Server
from nutch.nutch import JobClient
import nutch

sv = Server('http://localhost:8081')
sc = SeedClient(sv)
seed_urls = ('http://espn.go.com', 'http://www.espn.com')
sd = sc.create('espn-seed', seed_urls)

nt = Nutch('default')
jc = JobClient(sv, 'test', 'default')
cc = nt.Crawl(sd, sc, jc)
while True:
    job = cc.progress()  # gets the current job if no progress, else iterates and makes progress
    if job is None:
        break

However, when I run the same code on AWS (Ubuntu 14.04), it fails with a runtime error. Here is the tail of the client log, ending in the traceback:


nutch.py: Response status: 200
nutch.py: Response JSON: {u'crawlId': u'test', u'args': {u'url_dir': u'/tmp/1456875353316-0'}, u'state': u'IDLE', u'result': None, u'msg': u'idle', u'type': u'GENERATE', u'id': u'test-default-GENERATE-1140031758', u'confId': u'default'}
nutch.py: GET Endpoint: /job/test-default-GENERATE-1140031758
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'Date': 'Tue, 01 Mar 2016 23:36:35 GMT', 'Content-Length': '0', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 204
Traceback (most recent call last):
  File "main.py", line 22, in <module>
    job = cc.progress() # gets the current job if no progress, else iterates and makes progress
  File "/usr/local/lib/python2.7/dist-packages/nutch/nutch.py", line 531, in progress
    jobInfo = currentJob.info()
  File "/usr/local/lib/python2.7/dist-packages/nutch/nutch.py", line 201, in info
    return self.server.call('get', '/job/' + self.id)
  File "/usr/local/lib/python2.7/dist-packages/nutch/nutch.py", line 160, in call
    raise error
nutch.nutch.NutchException: Unexpected server response: 204
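
One way to check whether the 204 is only a transient "no content yet" answer from the server would be to retry the failing call a few times before giving up. The following is only a sketch under that assumption: call_with_retry and UnexpectedResponse are hypothetical names that are not part of the nutch package, and UnexpectedResponse merely stands in for nutch.nutch.NutchException so that the snippet runs on its own.

```python
import time


class UnexpectedResponse(Exception):
    """Hypothetical stand-in for nutch.nutch.NutchException."""


def call_with_retry(func, retries=5, delay=1.0, exc_type=UnexpectedResponse):
    """Call func() and return its result.

    If func() raises exc_type, wait `delay` seconds and try again,
    for at most `retries` attempts in total. The last error is
    re-raised if every attempt fails.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return func()
        except exc_type as error:
            last_error = error
            # Do not sleep after the final failed attempt.
            if attempt < retries - 1:
                time.sleep(delay)
    raise last_error
```

In the loop above this could be used as job = call_with_retry(cc.progress, exc_type=NutchException), with NutchException imported from nutch.nutch. If the error still appears after several retries, the 204 is probably not just a timing issue.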

To run the Python code, I started Nutch with /bin/nutch startserver. Here is the server output from that run:

Injector: starting at 2016-03-01 23:35:53
Injector: crawlDb: test/crawldb
Injector: urlDir: /tmp/1456875353316-0
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total number of urls rejected by filters: 0
Injector: Total number of urls after normalization: 2
Injector: Total new urls injected: 2
Injector: finished at 2016-03-01 23:36:34, elapsed: 00:00:40
Generator: starting at 2016-03-01 23:36:35
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: running in local mode, generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: test/segments/20160301233638
Generator: finished at 2016-03-01 23:36:40, elapsed: 00:00:05


I would appreciate it if you could help.

Thanks
