Intro

In this project, we configure a search engine on two particular collections of documents: the 'Cranfield' and 'Time' collections. Each collection consists of a set of HTML documents, a set of queries, and, for each query in the query set, a set of relevant document IDs (the Ground-Truth).

One of the objectives of this project is to find the best configuration (in terms of stemming method and scorer function) for the search engine, using the available Ground-Truth data. The search engine's performance is evaluated with two metrics, Average R-Precision and nMDCG. The outputs of Fagin's algorithm and the Threshold algorithm, which combine the title and text rankings, are evaluated as well.
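As a point of reference for the first metric, here is a minimal sketch of Average R-Precision under its usual definition: for each query, R is the number of relevant documents in the Ground-Truth, and the score is the fraction of the top R results that are relevant. The function names and the in-memory dictionaries are hypothetical and chosen only for illustration; they are not taken from raverage.py, which may read and organize its data differently.

```python
from typing import Dict, List, Set


def r_precision(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant_ids)
    if r == 0:
        return 0.0  # assumption: queries without relevant documents score 0
    hits = sum(1 for doc_id in ranked_ids[:r] if doc_id in relevant_ids)
    return hits / r


def average_r_precision(results: Dict[str, List[str]],
                        ground_truth: Dict[str, Set[str]]) -> float:
    """Mean R-Precision over all queries listed in the ground truth."""
    scores = [r_precision(results.get(query_id, []), relevant)
              for query_id, relevant in ground_truth.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (invented for this example, not from either collection).
toy_results = {"q1": ["d1", "d2", "d3"], "q2": ["d9", "d4"]}
toy_ground_truth = {"q1": {"d1", "d3"}, "q2": {"d4"}}
print(average_r_precision(toy_results, toy_ground_truth))  # prints 0.25
```

In the toy data, q1 has R = 2 and one relevant document in its top 2 (0.5), while q2 has R = 1 and no relevant document in its top 1 (0.0), so the average is 0.25.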

More details on the project can be found here.

How to run it

The Python files are written for Python 3.x. The steps to run the homework are the following:

To clean all the previously generated files and run everything from the beginning, run:

./clean.sh

First, open a terminal inside the folder that contains this file. Then run:

source set-my-classpath-homework.sh

Then, to run the homework with the cran collection, run:

. inverted-index.sh --cran
. scores.sh --cran

To run the homework with the time collection, run instead:

. inverted-index.sh --time
. scores.sh --time

To create the output file of Fagin's algorithm for the cran collection, run:

python Fagins-Algorithm/FaginsAlgorithm.py 5 2 collection-cran/output-stopwords-BM25Scorer-title.tsv collection-cran/output-stopwords-BM25Scorer-text.tsv 2 1 collection-cran/output-fagins.tsv

For the time collection, Fagin's algorithm does not work, since the titles do not provide useful information.

The general syntax for FaginsAlgorithm.py is the following:

python Fagins-Algorithm/FaginsAlgorithm.py [k] [number of files/datasets] [input files, separated by spaces] [weight of the i-th dataframe score, separated by spaces] [output file]
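To illustrate what these arguments control, here is a minimal sketch of the textbook Fagin's algorithm with a weighted sum as the aggregation function. It is not the code in FaginsAlgorithm.py: the function name `fagin_top_k`, the in-memory list format, and the choice to treat a document missing from a ranking as having score 0 are all assumptions made for this example. The sketch also assumes non-negative scores and weights, and that each document appears at most once per ranking.

```python
from typing import Dict, List, Tuple

# A ranking is a list of (doc_id, score) pairs sorted by score, highest first.
Ranking = List[Tuple[str, float]]


def fagin_top_k(rankings: List[Ranking], weights: List[float],
                k: int) -> List[Tuple[str, float]]:
    """Top-k documents by weighted sum of scores, found with Fagin's algorithm."""
    lookups = [dict(ranking) for ranking in rankings]  # used for random access
    seen_in: Dict[str, int] = {}  # doc_id -> number of rankings it was seen in
    seen_everywhere = 0
    depth = 0
    max_depth = max((len(ranking) for ranking in rankings), default=0)

    # Phase 1: sorted access in parallel until k documents were seen in every
    # ranking (or all rankings are exhausted).
    while seen_everywhere < k and depth < max_depth:
        for ranking in rankings:
            if depth < len(ranking):
                doc_id = ranking[depth][0]
                seen_in[doc_id] = seen_in.get(doc_id, 0) + 1
                if seen_in[doc_id] == len(rankings):
                    seen_everywhere += 1
        depth += 1

    # Phase 2: random access to fill in the missing scores of every seen
    # document (assumption: a document absent from a ranking scores 0 there).
    totals = {
        doc_id: sum(w * lookup.get(doc_id, 0.0)
                    for w, lookup in zip(weights, lookups))
        for doc_id in seen_in
    }
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))[:k]


# Toy rankings (invented for this example): title scores weighted 2, text 1.
toy_title = [("d1", 3.0), ("d2", 2.0), ("d3", 1.0)]
toy_text = [("d2", 4.0), ("d3", 3.0), ("d1", 1.0)]
print(fagin_top_k([toy_title, toy_text], [2.0, 1.0], k=1))  # [('d2', 8.0)]
```

With k = 1, sorted access stops after two rounds, because d2 has then been seen in both rankings. The three documents seen so far are scored in full, and d2 wins with 2 * 2.0 + 1 * 4.0 = 8.0.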

To run the Threshold algorithm, run:

python Threshold-Algorithm/ThresholdAlgorithm.py 5 2 collection-cran/output-stopwords-BM25Scorer-title.tsv collection-cran/output-stopwords-BM25Scorer-text.tsv 2 1 collection-cran/output-threshold.tsv
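For comparison, here is a minimal sketch of the textbook Threshold algorithm with the same weighted-sum aggregation. It is not the code in ThresholdAlgorithm.py: the function name `threshold_top_k`, the in-memory list format, and the treatment of missing documents as score 0 are assumptions made for this example, and the sketch favors readability over efficiency. It assumes non-negative scores and weights.

```python
import heapq
from typing import Dict, List, Tuple

# A ranking is a list of (doc_id, score) pairs sorted by score, highest first.
Ranking = List[Tuple[str, float]]


def threshold_top_k(rankings: List[Ranking], weights: List[float],
                    k: int) -> List[Tuple[str, float]]:
    """Top-k documents by weighted sum of scores, found with the Threshold algorithm."""
    if k <= 0:
        return []
    lookups = [dict(ranking) for ranking in rankings]  # used for random access
    scored: Dict[str, float] = {}  # doc_id -> full aggregated score
    max_depth = max((len(ranking) for ranking in rankings), default=0)

    for depth in range(max_depth):
        last_seen = []  # score read from each ranking at the current depth
        for ranking in rankings:
            if depth < len(ranking):
                doc_id, score = ranking[depth]
                last_seen.append(score)
                if doc_id not in scored:
                    # Random access: fetch this document's score everywhere
                    # (assumption: absent from a ranking means score 0).
                    scored[doc_id] = sum(
                        w * lookup.get(doc_id, 0.0)
                        for w, lookup in zip(weights, lookups))
            else:
                last_seen.append(0.0)  # exhausted ranking contributes 0

        # No unseen document can score higher than this threshold.
        threshold = sum(w * s for w, s in zip(weights, last_seen))
        best = heapq.nlargest(k, scored.values())
        if len(best) == k and best[-1] >= threshold:
            break  # the current top k cannot be beaten

    return sorted(scored.items(), key=lambda item: (-item[1], item[0]))[:k]


# Toy rankings (invented for this example): title scores weighted 2, text 1.
toy_title = [("d1", 3.0), ("d2", 2.0), ("d3", 1.0)]
toy_text = [("d2", 4.0), ("d3", 3.0), ("d1", 1.0)]
print(threshold_top_k([toy_title, toy_text], [2.0, 1.0], k=1))  # [('d2', 8.0)]
```

In the toy run, the threshold after the first round is 2 * 3.0 + 1 * 4.0 = 10.0, which is above the best score seen so far (8.0). After the second round it drops to 2 * 2.0 + 1 * 3.0 = 7.0, so the algorithm stops with d2 as the top result.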

The output files are exported to collection-cran and collection-time.

To compute the Average R-Precision, run:

python raverage.py --cran

or

python raverage.py --time

The results are saved in collection-cran/results_cran.tsv and collection-time/results_time.tsv.

To see the Average R-Precision of the 9+1+1 files requested in the homework, run:

cat collection-cran/results_cran.tsv | grep "text_and_title\|fagins\|threshold" > collection-cran/results_cran_11_files.tsv; cat collection-cran/results_cran_11_files.tsv

cat collection-time/results_time.tsv | grep "text_and_title\|fagins\|threshold" > collection-time/results_time_9_files.tsv; cat collection-time/results_time_9_files.tsv

To compute the average nMDCG, run:

python average_nMDCG_cran.py 1

python average_nMDCG_time.py 1

python average_nMDCG_cran.py 3

python average_nMDCG_time.py 3

python average_nMDCG_cran.py 5

python average_nMDCG_time.py 5

python average_nMDCG_cran.py 10

python average_nMDCG_time.py 10

The last number is k; change it to try other values of k. The output files are exported to collection-cran and collection-time.
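This README does not define nMDCG, so the following is only a sketch under a stated assumption: a DCG-style gain in which the first position is not discounted and position i >= 2 is discounted by log2(i), with binary relevance, normalized by the best value achievable for the query at cutoff k. The function names are hypothetical and the scripts average_nMDCG_cran.py and average_nMDCG_time.py may use a different formula.

```python
import math
from typing import List, Set


def mdcg(gains: List[float]) -> float:
    """Assumed gain: position 1 undiscounted, position i >= 2 divided by log2(i)."""
    return sum(gain if position == 1 else gain / math.log2(position)
               for position, gain in enumerate(gains, start=1))


def nmdcg_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Gain of the top-k results divided by the gain of an ideal top-k ranking."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids[:k]]
    # An ideal ranking places as many relevant documents as possible first.
    ideal_gains = [1.0] * min(k, len(relevant_ids))
    best = mdcg(ideal_gains)
    return mdcg(gains) / best if best > 0 else 0.0


# Toy data (invented for this example, not from either collection).
toy_ranking = ["d1", "d2", "d3"]
toy_relevant = {"d1", "d3"}
for cutoff in (1, 3):
    print(cutoff, nmdcg_at_k(toy_ranking, toy_relevant, cutoff))
```

Averaging `nmdcg_at_k` over all queries for a fixed k would then give one number per configuration, comparable across the values of k used in the commands above.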
