A web crawler that uses the Scrapy library to build a weighted graph (stored in a Neo4j instance) of a starting search and its referenced links.

Graph Web Indexing

My inspiration for https://www.spydr.dev/

For more information on how Google's current indexing approach powers search, see: https://www.youtube.com/watch?v=knDDGYHnnSI

To re-create:

Prerequisites:

Python 3
Express.js
Node.js

Web Crawler Pipeline:

  • I initialized a Scrapy spider with a connection to a Neo4j driver
  • Using the beautiful gensim library, I created a makeshift TF-IDF model that dynamically computes a similarity score between article nodes in the graph. It was slow at first, so I removed mandatory NLTK tokenization and made it an optional parameter for slightly more accurate results
  • The flow looks like this: model instantiated -> parent and child article text is extracted -> a corpus about the subject matter is generated to train the model -> article text is transformed into a bag-of-words -> term frequency-inverse document frequency (TF-IDF) vectors are computed for both articles -> similarity score calculated (see the sketch after this list)
  • Upon a Scrapy crawl with a set of root nodes, child articles are connected to their parent articles by edges weighted with this similarity score
  • Results are dumped to the Neo4j graph
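
A minimal sketch of the scoring and weighted-edge step, assuming the gensim library and the official neo4j Python driver. The function names, the Article label, the LINKS_TO relationship type, and the connection details are illustrative assumptions, not the repository's actual code.

# Sketch only: trains a TF-IDF model on a subject-matter corpus, scores a
# parent/child article pair, and writes a weighted edge to Neo4j.
from gensim import corpora, models, similarities
from neo4j import GraphDatabase


def build_tfidf(corpus_texts):
    """Train a TF-IDF model on a corpus of texts about the subject matter."""
    # Simple whitespace tokenization; NLTK tokenization is the optional,
    # slower-but-more-accurate path mentioned above.
    tokenized = [text.lower().split() for text in corpus_texts]
    dictionary = corpora.Dictionary(tokenized)                     # vocabulary
    bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
    tfidf = models.TfidfModel(bow_corpus)                          # train TF-IDF
    return dictionary, tfidf


def similarity_score(dictionary, tfidf, parent_text, child_text):
    """Cosine similarity between the TF-IDF vectors of two article texts."""
    parent_vec = tfidf[dictionary.doc2bow(parent_text.lower().split())]
    child_vec = tfidf[dictionary.doc2bow(child_text.lower().split())]
    index = similarities.MatrixSimilarity([parent_vec], num_features=len(dictionary))
    return float(index[child_vec][0])


def link_articles(driver, parent_url, child_url, weight):
    """MERGE parent and child article nodes and a weighted LINKS_TO edge."""
    query = (
        "MERGE (p:Article {url: $parent}) "
        "MERGE (c:Article {url: $child}) "
        "MERGE (p)-[r:LINKS_TO]->(c) "
        "SET r.weight = $weight"
    )
    with driver.session() as session:
        session.run(query, parent=parent_url, child=child_url, weight=weight)


if __name__ == "__main__":
    corpus_texts = [
        "example article text about the crawl subject",
        "another related article gathered during the crawl",
    ]
    dictionary, tfidf = build_tfidf(corpus_texts)
    score = similarity_score(dictionary, tfidf, "parent article text", "child article text")
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    link_articles(driver, "https://example.com/parent", "https://example.com/child", score)
    driver.close()

Training the model on a broader subject-matter corpus (rather than just the pair) keeps shared vocabulary from being zeroed out, so the parent/child cosine score stays meaningful.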

Installation:

First, clone:

git clone git@github.com:Fadeleke57/spyderweb.git
cd spyderweb

Install the requirements and set up the required credentials for MongoDB, Neo4j, and Scrapy; a hypothetical example of such settings follows the install command below.

pip install -r requirements.txt
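
A hypothetical example of the credential setup, written as additions to the Scrapy project's settings.py. Every environment variable and setting name below is an assumption for illustration; the repository's actual configuration keys are not shown here.

# Sketch only: read connection secrets from the environment so they stay
# out of version control. All names below are assumed, not the project's.
import os

# Neo4j connection details (assumed names)
NEO4J_URI = os.getenv("NEO4J_URI", "bolt://localhost:7687")
NEO4J_USER = os.getenv("NEO4J_USER", "neo4j")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD", "")

# MongoDB connection string (assumed name), if crawl results are also stored there
MONGO_URI = os.getenv("MONGO_URI", "mongodb://localhost:27017")

# Standard Scrapy politeness settings
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1.0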

Run Crawler:

scrapy crawl time

Run Crawler with a specified search term:

scrapy crawl time -a search_term="{TERM}"
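
For context, a minimal sketch of how a spider named time could receive that -a argument; the class name, the time.com URLs, and the parsing logic are assumptions for illustration, not the project's actual spider.

# Sketch only: Scrapy passes every -a key=value pair to the spider's __init__.
import scrapy


class TimeSpider(scrapy.Spider):
    name = "time"

    def __init__(self, search_term=None, *args, **kwargs):
        # -a search_term="{TERM}" arrives here as a keyword argument.
        super().__init__(*args, **kwargs)
        self.search_term = search_term

    def start_requests(self):
        # Illustrative: seed the crawl from a search page when a term is given,
        # otherwise fall back to a fixed root URL.
        if self.search_term:
            url = f"https://time.com/search/?q={self.search_term}"
        else:
            url = "https://time.com/"
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Follow in-article links; the real spider would score each child
        # against its parent and write the weighted edge to Neo4j.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)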

Run Client:

npm install
npm run dev

Run Backend:

npm install
node server.js
