Naive Search Engine

Frank (Haoyang) Ling

Nov 7, 2023

Designed and implemented a Naive Search Engine with a focus on efficiency and relevance, achieving notable results in indexing and ranking tasks.
Built basic, positional, and on-disk inverted indexes, and incorporated the idea of SPIMI (single-pass in-memory indexing) to speed up indexing by at least 30 times while reducing memory usage.
Successfully processed and tokenized a substantial dataset of 1 million Wikipedia pages (7GB) within 20 minutes, demonstrating efficient and scalable data handling capabilities.
Enhanced search result relevance by implementing a learning-to-rank (L2R) ranker with the LightGBM ranker, Doc2Query, a Bi-Encoder, and a CrossEncoder from Hugging Face.
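To make the SPIMI idea concrete, here is a minimal Python sketch of block-wise positional indexing. It is an illustration under stated assumptions, not the project's code: the function names (`write_block`, `spimi_blocks`, `merge_blocks`, `build_index`), the whitespace tokenizer, the JSON-lines block format, and the `max_postings` budget are all hypothetical choices made for brevity.

```python
import heapq
import json
import os
import tempfile
from collections import defaultdict


def write_block(index, tmp_dir, block_no):
    """Write one in-memory block to disk, one JSON line per term, sorted by term."""
    path = os.path.join(tmp_dir, f"block_{block_no}.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for term in sorted(index):
            f.write(json.dumps([term, index[term]]) + "\n")
    return path


def spimi_blocks(docs, max_postings, tmp_dir):
    """Single pass over (doc_id, text) pairs; flush a block when the budget is hit."""
    index, n_postings, paths = defaultdict(list), 0, []
    for doc_id, text in docs:
        # Assumption: lowercase whitespace tokenization stands in for a real tokenizer.
        for pos, term in enumerate(text.lower().split()):
            postings = index[term]
            if postings and postings[-1][0] == doc_id:
                postings[-1][1].append(pos)      # same document: add a position
            else:
                postings.append([doc_id, [pos]])  # new (doc_id, positions) posting
                n_postings += 1
        # Flush only at document boundaries so a posting is never split across blocks.
        if n_postings >= max_postings:
            paths.append(write_block(index, tmp_dir, len(paths)))
            index, n_postings = defaultdict(list), 0
    if index:
        paths.append(write_block(index, tmp_dir, len(paths)))
    return paths


def read_block(path, block_no):
    """Stream (term, block_no, postings) tuples from one sorted block file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, postings = json.loads(line)
            yield term, block_no, postings


def merge_blocks(paths):
    """K-way merge of sorted blocks; ties on a term are resolved by block order."""
    streams = [read_block(p, i) for i, p in enumerate(paths)]
    merged = {}
    for term, _, postings in heapq.merge(*streams):
        merged.setdefault(term, []).extend(postings)
    return merged


def build_index(docs, max_postings=1_000_000):
    """Index docs block by block in a temporary directory, then merge the blocks."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        return merge_blocks(spimi_blocks(docs, max_postings, tmp_dir))
```

Calling `build_index([(0, "a b a"), (1, "b c")], max_postings=2)` writes two blocks and merges them into a single term-to-postings mapping. The merged result is kept in memory here only to keep the sketch short; an on-disk index would stream the merged postings to a file instead.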

Basic Setup:

Comment: the pretrained models were selected based on their number of downloads on Hugging Face.

	MAP@10	NDCG@10	CONFIG
base	0.043913	0.332071	index + BM25
BM25	0.039091	0.331340	index_aug + BM25
VectorRanker	0.059544	0.314000	index_aug + vec_rank
l2r	0.070226	0.343751	index + BM25 + l2r
pipeline_1	0.052045	0.330412	index_aug + BM25 + l2r (cross_enc)
pipeline_2	0.059102	0.333798	index_aug + vec_rank + l2r (cross_enc)
pipeline_3	0.046063	0.328585	index + vec_rank + l2r
pipeline_4	0.056367	0.331492	index_aug + vec_rank + l2r
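Each CONFIG entry reads as an index, a first-stage ranker, and an optional re-ranker. The Python sketch below shows that retrieve-then-rerank flow in its simplest form. Everything in it is a hypothetical stand-in: `overlap_score` plays the role of a first-stage ranker such as BM25, `density_score` plays the role of a re-ranker such as the L2R model, and the brute-force scan over all documents replaces the inverted-index lookup a real engine would use.

```python
from typing import Callable, Dict, List

# A scorer maps (query, document text) to a relevance score.
Scorer = Callable[[str, str], float]


def overlap_score(query: str, doc: str) -> float:
    """Toy first-stage scorer: number of distinct query terms found in the document."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def density_score(query: str, doc: str) -> float:
    """Toy re-ranker: term overlap normalized by document length."""
    tokens = doc.lower().split()
    return overlap_score(query, doc) / len(tokens) if tokens else 0.0


def retrieve_then_rerank(
    query: str,
    docs: Dict[str, str],
    first_stage: Scorer,
    reranker: Scorer,
    top_k: int = 100,
) -> List[str]:
    """Rank all docs with the cheap scorer, then re-order only the top_k candidates."""
    candidates = sorted(docs, key=lambda d: first_stage(query, docs[d]), reverse=True)
    head, tail = candidates[:top_k], candidates[top_k:]
    # The (usually expensive) re-ranker only sees the head of the list.
    reranked = sorted(head, key=lambda d: reranker(query, docs[d]), reverse=True)
    return reranked + tail
```

Restricting the re-ranker to the top `top_k` candidates is the usual reason for a two-stage design: the expensive model is applied to a short list rather than to the whole collection.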

Discussion

Document augmentation might help re-ranking, but a further experiment shows that pure BM25 with document augmentation gets a lower MAP@10 in the first-stage matching than BM25 without augmentation (0.0391 vs. 0.0439).
VectorRanker outperforms BM25 on MAP@10 (0.0595 vs. 0.0439), although its NDCG@10 is lower (0.314 vs. 0.332), which suggests that the bi-encoder provides important information.
Cross-encoders increase the NDCG score without lowering MAP much (for example, pipeline_2 vs. pipeline_4: NDCG@10 of 0.3338 vs. 0.3315), which shows that using a CrossEncoder in the re-ranking stage is effective.
The HW2 pipeline performs better than the others, which shows that the new features added in HW2 play a critical role in prediction.
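The comparisons above rest on MAP@10 and NDCG@10. The sketch below shows one common way to compute them for a single query in Python. The exact definitions used in this project are not stated, so two choices here are assumptions: average precision is normalized by the number of relevant results found in the top k, and DCG uses linear gain (the raw relevance grade) with a log2 discount.

```python
import math


def average_precision_at_k(ranked_rels, k=10):
    """AP@k for one query; ranked_rels is binary relevance (0/1) in ranked order."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rels[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this rank
    # Assumption: normalize by the relevant results seen in the top k.
    return total / hits if hits else 0.0


def dcg_at_k(rels, k=10):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels[:k], start=1))


def ndcg_at_k(ranked_rels, judged_rels, k=10):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(judged_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal else 0.0


def mean_over_queries(per_query_scores):
    """MAP@k (or mean NDCG@k) is the average of the per-query scores."""
    return sum(per_query_scores) / len(per_query_scores) if per_query_scores else 0.0
```

Because MAP here works on binary relevance while NDCG works on graded relevance, the two metrics can move in different directions for the same ranking, which is consistent with the mixed picture in the table.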