Naive Search Engine

  • Designed and implemented a naive search engine with a focus on efficiency and relevance, achieving strong results on indexing and ranking tasks.
  • Engineered basic inverted, positional, and on-disk inverted indexes, incorporating SPIMI (single-pass in-memory indexing) to speed up indexing by at least 30x while keeping memory usage bounded.
  • Processed and tokenized a substantial dataset of 1 million Wikipedia pages (7 GB) within 20 minutes, demonstrating efficient and scalable data handling.
  • Improved search result relevance by implementing an L2 (learning-to-rank) ranker with the LightGBM Ranker, Doc2Query, a Bi-Encoder, and a CrossEncoder from Hugging Face.
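The positional inverted index mentioned above can be sketched in a few lines. This is a minimal in-memory version for illustration; a real SPIMI indexer additionally flushes each in-memory block to disk when it fills up and merges the sorted blocks afterwards. The tokenizer here (lowercase + whitespace split) is an assumption, not the project's actual tokenizer.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Build an in-memory positional inverted index.

    docs: dict mapping doc_id -> text.
    Returns: term -> {doc_id: [positions]}.

    NOTE: a SPIMI indexer would flush this block to disk once it grows
    past a memory budget and merge blocks at the end; this sketch keeps
    a single block in memory for clarity.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        # Hypothetical tokenizer: lowercase + whitespace split.
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index

docs = {1: "the quick brown fox", 2: "the lazy dog and the fox"}
idx = build_positional_index(docs)
print(dict(idx["fox"]))  # per-document position lists for "fox"
```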

Basic Setup:

  • Number of documents: 200k
  • Indexer: BasicInvertedIndex
  • Doc2Query: doc2query/msmarco-t5-base-v1
  • Learn-to-Rank: lightgbm.LGBMRanker
  • Bi-Encoder: sentence-transformers/msmarco-MiniLM-L12-cos-v5
  • CrossEncoder: cross-encoder/msmarco-MiniLM-L6-en-de-v1

Comment: models were selected based on their number of downloads on Hugging Face.
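BM25 is the first-stage lexical ranker used in the configurations below. A minimal sketch of Okapi BM25 scoring, assuming the standard defaults k1 = 1.2 and b = 0.75 (the report does not state which parameter values the project used):

```python
import math

def bm25_score(query_terms, doc_terms, df, n_docs, avgdl, k1=1.2, b=0.75):
    """Score one document against a query with Okapi BM25.

    df:     term -> document frequency across the corpus
    n_docs: total number of documents
    avgdl:  average document length in tokens
    k1, b:  assumed default parameters (not confirmed by the report)
    """
    score = 0.0
    dl = len(doc_terms)
    for term in set(query_terms):
        tf = doc_terms.count(term)
        if tf == 0 or term not in df:
            continue
        # Smoothed IDF; the +1 keeps the value positive for common terms.
        idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
        # Saturating term-frequency component with length normalization.
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score
```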

| Run          | MAP@10   | NDCG@10  | Config                                |
|--------------|----------|----------|---------------------------------------|
| base         | 0.043913 | 0.332071 | index + BM25                          |
| BM25         | 0.039091 | 0.331340 | index_aug + BM25                      |
| VectorRanker | 0.059544 | 0.314000 | index_aug + vec_rank                  |
| l2r          | 0.070226 | 0.343751 | index + BM25 + l2r                    |
| pipeline_1   | 0.052045 | 0.330412 | index_aug + BM25 + l2r (cross_enc)    |
| pipeline_2   | 0.059102 | 0.333798 | index_aug + vec_rank + l2r (cross_enc)|
| pipeline_3   | 0.046063 | 0.328585 | index + vec_rank + l2r                |
| pipeline_4   | 0.056367 | 0.331492 | index_aug + vec_rank + l2r            |
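The metrics in the table can be computed as follows. This is a minimal sketch; normalization conventions vary slightly across toolkits (e.g., the AP denominator `min(|relevant|, k)` versus `|relevant|`), so treat these as one reasonable choice rather than the project's exact evaluation code.

```python
import math

def average_precision_at_k(ranked, relevant, k=10):
    """AP@k for one query. ranked: list of doc ids; relevant: set of ids."""
    hits, precision_sum = 0, 0.0
    for i, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / i
    # Assumed denominator: min(|relevant|, k); some toolkits use |relevant|.
    return precision_sum / min(len(relevant), k) if relevant else 0.0

def ndcg_at_k(ranked, rels, k=10):
    """NDCG@k. rels: doc id -> graded relevance."""
    dcg = sum(rels.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(rels.values(), reverse=True)[:k]
    idcg = sum(r / math.log2(i + 1) for i, r in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

MAP@10 is then the mean of `average_precision_at_k` over all evaluation queries.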

Discussion

  • Document augmentation might help the re-ranking stage, but a further experiment shows that pure BM25 with document augmentation achieves a lower MAP in the first-stage matching than BM25 without augmentation.
  • VectorRanker outperforms BM25, which suggests the bi-encoder provides useful signal beyond lexical matching.
  • Cross-encoders increase the NDCG score without lowering MAP much, which shows the effectiveness of using a CrossEncoder in the re-ranking stage.
  • The HW2 pipeline performs better than the others, which indicates that the new features added in HW2 play a critical role in prediction.
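The two-stage pattern discussed above (cheap first-stage retrieval, expensive cross-encoder re-ranking of a shortlist) can be sketched generically. The scorers are passed in as callables here; in the actual project they would be BM25 / the bi-encoder and the sentence-transformers cross-encoder listed in the setup.

```python
def rerank(query, candidates, first_stage_score, cross_encoder_score, top_k=100):
    """Two-stage ranking sketch.

    first_stage_score: cheap scorer applied to every candidate
                       (e.g., BM25 or a bi-encoder dot product).
    cross_encoder_score: slower, more accurate scorer applied only to
                         the top_k shortlist.
    """
    shortlist = sorted(candidates,
                       key=lambda d: first_stage_score(query, d),
                       reverse=True)[:top_k]
    return sorted(shortlist,
                  key=lambda d: cross_encoder_score(query, d),
                  reverse=True)
```

Restricting the cross-encoder to a shortlist is what keeps the pipeline tractable: the cross-encoder must run one forward pass per (query, document) pair, so scoring the whole corpus with it would be prohibitively slow.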
Frank (Haoyang) Ling
Master Student @ UMICH

My interests include artificial intelligence, information retrieval, and programmable matter.