Search-Engine-Design

This project is written in Python.

The Aim of the Project

The project consists of two parts: first, crawling the web starting from a seed URL and building a general index file from the downloaded content; second, answering user queries against that index file using the cosine similarity method and displaying the results. The project has a modular structure, so new modules can be added if desired. Whether the words in the index file are stored in stemmed or raw form is the user's choice.

Method

In the first part of the project, the crawler works as follows:

1. Escaped characters are removed from the seed web address, and the seed address is added to the frontier list.
2. Web addresses are taken from the frontier list in first-in, first-out (FIFO) order.
3. Each address is checked to see whether it links to an image, video, or audio file. If so, its content is not downloaded; the address is added to the list of media links instead.
4. The address is then checked against the list of already-crawled links. If it has been crawled before, it is recorded in the list of links with multiple hits and its content is not downloaded again.
5. Otherwise the content is downloaded, header elements (meta, script, style) are stripped, and the remaining text is normalized. The resulting words and the outgoing links found in the content are stored for indexing.
6. The address is added to the crawled list, and the next address is taken from the frontier using the FIFO method.

These steps repeat until the crawled list reaches the stop count specified by the user. Once the stop criterion is met, the IDF values of the indexed words are calculated, summary statistics (number of crawled URLs, broken links, number of indexed words, etc.) are shown to the user, and all of the collected information is written to a text file.

In the second part of the project, the words and IDF values produced in the first part are imported. The queries are read from a text file, split on commas, and each query is normalized in the same way as the crawled content. The frequency and TF values of the query words are computed, and multiplying them by the IDF values from the first part yields the TF-IDF vector of the query. The index gives the web addresses that contain the queried words; the content of each matching address is downloaded again, normalized, and its frequency table and TF values are computed to build a TF-IDF vector for each document. The cosine similarity method is then used to measure the relevance of the web addresses: the TF-IDF vector of the query is multiplied with the TF-IDF vector of each document found for the query. The calculated cosine similarity values are sorted in descending order and presented to the user; the larger the cosine similarity, the more relevant the document is to the query.
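The TF-IDF and cosine similarity computation above can be sketched like this. It is a minimal standalone example over a toy collection of three pre-normalized documents; the helper names are illustrative, not the project's identifiers.

```python
import math

def tf(term, doc_words):
    # Term frequency: occurrences of the term divided by document length.
    return doc_words.count(term) / len(doc_words)

def idf(term, docs):
    # Inverse document frequency over the crawled collection.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def tfidf_vector(terms, doc_words, docs):
    # TF of this document multiplied by IDF of the whole collection.
    return [tf(t, doc_words) * idf(t, docs) for t in terms]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy collection of already-normalized documents.
docs = [["search", "engine", "design"],
        ["cosine", "similarity", "search"],
        ["engine", "engine", "index"]]
query = ["search", "engine"]

query_vec = tfidf_vector(query, query, docs)
scores = sorted(
    ((cosine_similarity(query_vec, tfidf_vector(query, d, docs)), i)
     for i, d in enumerate(docs)),
    reverse=True)                 # most relevant document first
```

Here `scores` is the descending list of (similarity, document index) pairs that would be presented to the user.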

Samples

The Wikipedia home page, "https://en.wikipedia.org/", was used as the seed page of the project. The stop criterion was set to 1000 crawled links. Separate index files were created for both the raw and the stemmed forms of the words in the crawled pages; the "PorterStemmer" library was used for stemming. The list at "http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a11-smart-stop-list/english.stop" was used as the stop-word list.
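The raw-form normalization with stop-word removal can be sketched as below. The stop-word set is a small illustrative subset of the SMART list referenced above, and the stemmed variant would additionally pass each remaining token through PorterStemmer (omitted here to keep the example dependency-free).

```python
import re

# Illustrative subset of the SMART English stop list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def normalize(text, stop_words=STOP_WORDS):
    # Lowercase, keep alphabetic tokens only, then drop stop words.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]

words = normalize("The design of a Search Engine is described in this README.")
```

The returned list contains the raw tokens that would be written to the raw-form index file.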

