Naive Search Engine
- Designed and implemented a naive search engine focused on indexing efficiency and ranking relevance.
- Engineered basic inverted, positional, and on-disk inverted indexes, applying the SPIMI (single-pass in-memory indexing) idea to speed up indexing by at least 30x while reducing memory usage.
- Successfully processed and tokenized a substantial dataset of 1 million Wikipedia pages (7GB) within 20 minutes, demonstrating efficient and scalable data handling capabilities.
- Enhanced search result relevance by implementing L2Ranker with the LightGBM ranker, plus Doc2Query, Bi-Encoder, and CrossEncoder models from Hugging Face.
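The SPIMI idea mentioned above can be illustrated with a minimal sketch. This is not the project's actual implementation: the function names (`spimi_index`, `merge_blocks`, `load_index`), the JSON-lines block format, and the `block_size` parameter are all assumptions made for illustration. The sketch builds positional postings in memory, flushes a sorted block to disk whenever the block grows past a limit, and then k-way merges the blocks into a single on-disk index.

```python
import heapq
import json
import os
import tempfile
from collections import defaultdict
from itertools import groupby


def spimi_index(docs, block_size, out_dir):
    """Index (doc_id, tokens) pairs block by block and merge the blocks on disk.

    block_size (hypothetical parameter) caps the number of postings held in
    memory before the current block is sorted by term and flushed.
    """
    block_paths, block, n_postings = [], defaultdict(list), 0

    def flush():
        nonlocal block, n_postings
        if not block:
            return
        path = os.path.join(out_dir, f"block_{len(block_paths)}.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for term in sorted(block):  # sorted terms allow a linear merge
                f.write(json.dumps([term, block[term]]) + "\n")
        block_paths.append(path)
        block, n_postings = defaultdict(list), 0

    for doc_id, tokens in docs:
        positions = defaultdict(list)
        for pos, tok in enumerate(tokens):
            positions[tok].append(pos)
        for term, pos_list in positions.items():
            block[term].append([doc_id, pos_list])  # positional posting
            n_postings += 1
        if n_postings >= block_size:  # flush only between documents
            flush()
    flush()
    return merge_blocks(block_paths, os.path.join(out_dir, "index.jsonl"))


def read_block(path):
    """Stream [term, postings] records from one sorted block file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


def merge_blocks(block_paths, out_path):
    """K-way merge of the sorted blocks; only one record per block is in memory."""
    merged = heapq.merge(*(read_block(p) for p in block_paths),
                         key=lambda rec: rec[0])
    with open(out_path, "w", encoding="utf-8") as out:
        for term, group in groupby(merged, key=lambda rec: rec[0]):
            postings = [p for _, plist in group for p in plist]
            out.write(json.dumps([term, postings]) + "\n")
    return out_path


def load_index(path):
    """Load the merged index into a dict (for small demos only)."""
    with open(path, encoding="utf-8") as f:
        return dict(json.loads(line) for line in f)


if __name__ == "__main__":
    toy_docs = [(0, ["a", "b", "a"]), (1, ["b", "c"]), (2, ["a", "c"])]
    with tempfile.TemporaryDirectory() as tmp:
        print(load_index(spimi_index(toy_docs, block_size=2, out_dir=tmp)))
```

Because blocks are flushed in document order and `heapq.merge` breaks ties by input order, the postings for each term stay sorted by document ID after the merge.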
Basic Setup:
- Number of documents: 200k
- Indexer: BasicInvertedIndex
- Doc2Query: doc2query/msmarco-t5-base-v1
- Learn-to-Rank: lightgbm.LGBMRanker
- Bi-Encoder: sentence-transformers/msmarco-MiniLM-L12-cos-v5
- CrossEncoder: cross-encoder/msmarco-MiniLM-L6-en-de-v1
Comment: the models were selected based on their download counts on Hugging Face.
RUN | MAP@10 | NDCG@10 | CONFIG |
---|---|---|---|
base | 0.043913 | 0.332071 | index + BM25 |
BM25 | 0.039091 | 0.331340 | index_aug + BM25 |
VectorRanker | 0.059544 | 0.314000 | index_aug + vec_rank |
l2r | 0.070226 | 0.343751 | index + BM25 + l2r |
pipeline_1 | 0.052045 | 0.330412 | index_aug + BM25 + l2r (cross_enc) |
pipeline_2 | 0.059102 | 0.333798 | index_aug + vec_rank + l2r (cross_enc) |
pipeline_3 | 0.046063 | 0.328585 | index + vec_rank + l2r |
pipeline_4 | 0.056367 | 0.331492 | index_aug + vec_rank + l2r |
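For reference, the two metrics in the table can be computed roughly as follows. This is a sketch under stated assumptions, not the evaluation code used for the numbers above: it assumes binary relevance labels for MAP@10, graded labels for NDCG@10, AP normalized by the number of relevant documents found in the top k, and a linear gain with a log2 discount. Other normalizations and gain functions are common and would give different values.

```python
import math


def average_precision_at_k(relevances, k=10):
    """AP@k for one query; `relevances` is the ranked list of 0/1 labels.

    Assumption: normalize by the number of relevant results found in the top k.
    """
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevances[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, ideal_gains, k=10):
    """NDCG@k: DCG of the ranking divided by DCG of the ideal ordering.

    `ideal_gains` holds the graded labels of all judged documents for the query.
    """
    ideal = dcg_at_k(sorted(ideal_gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def mean_over_queries(per_query_scores):
    """MAP@k (or mean NDCG@k) is the mean of the per-query scores."""
    return sum(per_query_scores) / len(per_query_scores) if per_query_scores else 0.0


if __name__ == "__main__":
    print(average_precision_at_k([1, 0, 1, 0]))
    print(ndcg_at_k([1, 3], ideal_gains=[3, 1]))
```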
Discussion
- Document augmentation might help re-ranking, but a further experiment shows that plain BM25 on the augmented index gets a lower MAP@10 in the first-stage matching (0.039091) than BM25 without augmentation (0.043913).
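- VectorRanker outperforms BM25 on MAP@10 (0.059544 vs. 0.039091 on the augmented index), although its NDCG@10 is lower (0.314000 vs. 0.331340). This suggests that the bi-encoder provides useful information.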
- Cross-encoders increase the NDCG score without lowering MAP too much, which shows the effectiveness of using a CrossEncoder in the re-ranking stage.
- The HW2 pipeline performs better than the others, which suggests that the new features added in HW2 play a critical role in prediction.
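The two-stage structure discussed above (a first matching stage followed by re-ranking) can be sketched as below. This is only an illustration: `bm25_scores` uses one common BM25 variant with assumed defaults `k1=1.2` and `b=0.75`, and `rerank_fn` is a hypothetical stand-in for whatever second-stage scorer is plugged in (a learning-to-rank model or a cross-encoder, for example). It is not the project's code, and it assumes a non-empty document collection.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs, k1=1.2, b=0.75):
    """Score every document in `docs` (doc_id -> token list) against the query."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in set(query_tokens):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


def retrieve_then_rerank(query_tokens, docs, rerank_fn, top_k=100):
    """Rank with BM25, then re-order only the top_k candidates with rerank_fn.

    rerank_fn(query_tokens, doc_id) -> float is a hypothetical second-stage
    scorer; documents below the cut-off keep their first-stage order.
    """
    first_stage = bm25_scores(query_tokens, docs)
    ranked = sorted(first_stage, key=first_stage.get, reverse=True)
    head, tail = ranked[:top_k], ranked[top_k:]
    head = sorted(head, key=lambda d: rerank_fn(query_tokens, d), reverse=True)
    return head + tail


if __name__ == "__main__":
    toy_docs = {"d1": ["cat", "sat"], "d2": ["dog", "sat", "mat"], "d3": ["bird"]}
    toy_scores = {"d1": 0.1, "d2": 0.9, "d3": 0.5}  # made-up second-stage scores
    print(retrieve_then_rerank(["cat"], toy_docs,
                               lambda q, d: toy_scores[d], top_k=2))
```

Restricting the second stage to the top `top_k` candidates is the usual design choice here, since second-stage scorers such as cross-encoders are far more expensive per document than BM25.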