
xsergiolpx/Search-Engine-Scorer


Intro

In this project, we configure a search engine on two document collections: 'Cranfield' and 'Time'. Each collection consists of a set of HTML documents, a set of queries, and, for each query in the query set, the set of relevant document IDs (the Ground-Truth).

One of the objectives of this project is to find the best configuration (in terms of stemming method and scorer function) for the search engine, using the available Ground-Truth data. To evaluate the search engine performance, we use Average R-Precision and nMDCG; Fagin's algorithm and the Threshold algorithm are also run on the search engine output.

More details on the project can be found here.

How to run it

The Python files are written for Python 3.x. The steps to run the homework are the following:

To clean all the already generated files and run everything from the beginning, run:

./clean.sh

First, open a terminal inside the folder that contains this file. Then run:

source set-my-classpath-homework.sh

Then, to run the homework with the cran collection, run:

. inverted-index.sh --cran
. scores.sh --cran

To run the homework with the time collection, run instead:

. inverted-index.sh --time
. scores.sh --time

To create the output file of Fagin's algorithm for the cran collection, run:

python Fagins-Algorithm/FaginsAlgorithm.py 5 2 collection-cran/output-stopwords-BM25Scorer-title.tsv collection-cran/output-stopwords-BM25Scorer-text.tsv 2 1 collection-cran/output-fagins.tsv

For the time collection, Fagin's algorithm does not work, since the titles do not provide useful information. The general syntax for FaginsAlgorithm.py is the following:

python Fagins-Algorithm/FaginsAlgorithm.py [k] [number of files/datasets] [path of the i-th file, separated by space] [weight of the i-th dataframe score, separated by space] [output file]
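Judging from the example command above, the positional arguments appear to be laid out as k, the number of input files n, then n file paths, n weights, and finally the output path. The following is a minimal sketch of how such an argument list could be split. It is not the parser used by FaginsAlgorithm.py; the names `AggregationArgs` and `parse_aggregation_args` are hypothetical and exist only for illustration.

```python
from typing import List, NamedTuple


class AggregationArgs(NamedTuple):
    """Hypothetical container for the positional arguments."""
    k: int
    input_files: List[str]
    weights: List[float]
    output_path: str


def parse_aggregation_args(argv: List[str]) -> AggregationArgs:
    """Split arguments laid out as: k n file_1..file_n weight_1..weight_n output.

    The layout is an assumption inferred from the example command.
    """
    if len(argv) < 2:
        raise ValueError("expected at least k and the number of files")
    k = int(argv[0])
    n = int(argv[1])
    expected = 2 + 2 * n + 1  # k, n, n files, n weights, output path
    if len(argv) != expected:
        raise ValueError(f"expected {expected} arguments, got {len(argv)}")
    input_files = argv[2:2 + n]
    weights = [float(w) for w in argv[2 + n:2 + 2 * n]]
    output_path = argv[-1]
    return AggregationArgs(k, input_files, weights, output_path)
```

With the cran example, the list `["5", "2", "title.tsv", "text.tsv", "2", "1", "out.tsv"]` would be read as k = 5, two input files weighted 2 and 1, and one output file.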

To run the Threshold algorithm, run:

python Threshold-Algorithm/ThresholdAlgorithm.py 5 2 collection-cran/output-stopwords-BM25Scorer-title.tsv collection-cran/output-stopwords-BM25Scorer-text.tsv 2 1 collection-cran/output-threshold.tsv
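For readers unfamiliar with the Threshold algorithm, here is a minimal sketch of the textbook version applied to per-field scores (for example title and text) combined by a weighted sum. It is not the code in ThresholdAlgorithm.py. It assumes non-negative scores and weights, and that a document missing from a list has score 0 there; the function name `threshold_top_k` and the in-memory dictionaries are illustrative choices.

```python
import heapq
from typing import Dict, List, Tuple


def threshold_top_k(
    rankings: List[Dict[str, float]],
    weights: List[float],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k documents with the highest weighted-sum score.

    Assumes non-negative scores and weights; a document absent from a
    ranking is treated as having score 0 in that ranking.
    """
    # Sorted access: each ranking ordered by descending score.
    sorted_lists = [
        sorted(r.items(), key=lambda item: item[1], reverse=True)
        for r in rankings
    ]
    seen: Dict[str, float] = {}
    max_depth = max((len(lst) for lst in sorted_lists), default=0)

    for depth in range(max_depth):
        threshold = 0.0
        for lst, weight in zip(sorted_lists, weights):
            if depth >= len(lst):
                continue  # exhausted list: unseen documents score 0 here
            doc_id, score = lst[depth]
            threshold += weight * score
            if doc_id not in seen:
                # Random access: fetch the document's score in every list.
                seen[doc_id] = sum(
                    w * r.get(doc_id, 0.0) for w, r in zip(weights, rankings)
                )
        top = heapq.nlargest(k, seen.items(), key=lambda item: item[1])
        # Stop once the k-th best seen score reaches the threshold:
        # no unseen document can score higher than the threshold.
        if len(top) == k and top[-1][1] >= threshold:
            return top

    return heapq.nlargest(k, seen.items(), key=lambda item: item[1])
```

The early stop is what distinguishes this from scoring every document: the threshold is the best score any not-yet-seen document could still reach, so scanning can end as soon as k seen documents match or beat it.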

The output files are exported to collection-cran and collection-time.

To compute the Average R-Precision, run:

python raverage.py --cran

or

python raverage.py --time

The results are saved in collection-cran/results-cran.tsv and collection-time/results-time.tsv.

To see the Average R-Precision of the 9+1+1 files asked for in the homework, run:

cat collection-cran/results_cran.tsv | grep "text_and_title\|fagins\|threshold" > collection-cran/results_cran_11_files.tsv; cat collection-cran/results_cran_11_files.tsv

cat collection-time/results_time.tsv | grep "text_and_title\|fagins\|threshold" > collection-time/results_time_9_files.tsv; cat collection-time/results_time_9_files.tsv
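As a reminder of what the metric measures, here is a minimal sketch of Average R-Precision under its usual definition: for a query with R relevant documents, R-Precision is the fraction of the top R retrieved documents that are relevant, and the average is taken over all queries in the Ground-Truth. This is not the code in raverage.py; the function names and the dictionary-based inputs are assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def r_precision(ranked_doc_ids: List[str], relevant_ids: Iterable[str]) -> float:
    """Fraction of the top-R results that are relevant, with R = number of relevant docs."""
    relevant = set(relevant_ids)
    r = len(relevant)
    if r == 0:
        return 0.0  # assumption: queries without relevant documents score 0
    hits = sum(1 for doc_id in ranked_doc_ids[:r] if doc_id in relevant)
    return hits / r


def average_r_precision(
    results: Dict[str, List[str]],
    ground_truth: Dict[str, Iterable[str]],
) -> float:
    """Mean R-Precision over every query present in the ground truth."""
    values = [
        r_precision(results.get(query_id, []), relevant)
        for query_id, relevant in ground_truth.items()
    ]
    return sum(values) / len(values) if values else 0.0
```

A query that returns no results counts as 0 in this sketch, so the average is always taken over the full query set.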

To compute the average nMDCG, run:

python average_nMDCG_cran.py 1

python average_nMDCG_time.py 1

python average_nMDCG_cran.py 3

python average_nMDCG_time.py 3

python average_nMDCG_cran.py 5

python average_nMDCG_time.py 5

python average_nMDCG_cran.py 10

python average_nMDCG_time.py 10

The last number is k; change it to try other values of k. The output files are exported to collection-cran and collection-time.
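To give an idea of how k enters a metric of this kind, the sketch below computes a normalized discounted cumulative gain at cutoff k. It assumes binary relevance and the common log2(position + 1) discount; the exact nMDCG definition used by the average_nMDCG scripts is not stated here and may differ. The function names are hypothetical.

```python
import math
from typing import Iterable, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (1-based positions)."""
    return sum(
        gain / math.log2(position + 1)
        for position, gain in enumerate(gains[:k], start=1)
    )


def ndcg_at_k(ranked_doc_ids: List[str], relevant_ids: Iterable[str], k: int) -> float:
    """DCG at k divided by the DCG of an ideal ranking, with binary relevance."""
    relevant = set(relevant_ids)
    gains = [1.0 if doc_id in relevant else 0.0 for doc_id in ranked_doc_ids]
    # The ideal ranking places all relevant documents first.
    ideal_gains = [1.0] * min(k, len(relevant))
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

Averaging `ndcg_at_k` over all queries for k = 1, 3, 5 and 10 would mirror the values of k used in the commands above.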
