Project Writeup

NLP Tweet Search Engine: Comparing TF-IDF and Word Embeddings

An information retrieval project that searches and ranks relevant tweets from a dataset of nearly 300,000 Australian election tweets.

Python, NLTK, spaCy, scikit-learn, Gensim, NumPy, CountVectorizer, TF-IDF, GloVe, Cosine Similarity

Project Overview

This project explored how different natural language processing techniques affect search quality in an information retrieval system. I built a tweet search engine using a dataset of Australian Federal Election tweets and compared traditional keyword-based search methods against semantic embedding-based retrieval.

The goal was to understand how the choice of text representation changes the relevance of search results. Rather than building a single model, I compared several approaches so I could evaluate the tradeoffs between exact keyword matching, weighted term importance, and semantic similarity.

Dataset

The project used a real-world social media dataset containing 298,252 raw tweets from the 2019 Australian Federal Election. After preprocessing, the final dataset contained 282,405 cleaned tweets.

Working with real tweet data introduced common text-processing challenges, including URLs, mentions, hashtags, duplicate posts, punctuation noise, short text, and inconsistent language patterns. This made the project useful practice for handling messy, user-generated text.

What I Built

I built a search system that retrieves tweets based on user queries and ranks the results using cosine similarity. The preprocessing pipeline removed URLs, mentions, hashtags, duplicate tweets, and other noisy text features. I also applied tokenization, stopword removal, stemming, and vocabulary normalization.
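The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the regular expressions, the tiny `STOPWORDS` set, and the function names `clean_tweet` and `clean_corpus` are all assumptions, and stemming is left out to keep the sketch free of external dependencies.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "on", "for"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_OR_HASHTAG_RE = re.compile(r"[@#]\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text):
    """Lowercase a tweet, strip URLs, mentions, hashtags, and punctuation, then drop stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_OR_HASHTAG_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


def clean_corpus(tweets):
    """Clean every tweet and keep only the first occurrence of each non-empty result."""
    seen = set()
    cleaned = []
    for tweet in tweets:
        text = clean_tweet(tweet)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

Deduplicating after cleaning, rather than before, also collapses tweets that differ only in links, mentions, or punctuation. Whether the original project deduplicated at this stage is not stated, so treat the ordering as one reasonable choice.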

The search engine compared three text representation approaches: CountVectorizer, TF-IDF, and GloVe word embeddings. Each method represented tweet text differently, which made it possible to compare exact keyword retrieval against more semantic search behavior.

Search Methods

The first method used CountVectorizer, which represents tweets as word occurrence vectors. This worked well for exact keyword matching because tweets with direct query-term overlap were ranked highly.
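As a rough illustration of count-based retrieval, the sketch below vectorizes a few toy tweets with scikit-learn's `CountVectorizer` and ranks them against a query by cosine similarity. The example tweets and the helper name `rank_by_counts` are made up for this illustration and are not from the project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for cleaned tweets; not taken from the project dataset.
tweets = [
    "climate policy debate tonight",
    "tax cuts announced budget",
    "climate action rally sydney",
]


def rank_by_counts(query, docs):
    """Return (doc_index, score) pairs sorted by cosine similarity on raw term counts."""
    vectorizer = CountVectorizer()
    doc_matrix = vectorizer.fit_transform(docs)
    # The query must be transformed with the same fitted vocabulary as the documents.
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    order = scores.argsort()[::-1]
    return [(int(i), float(scores[i])) for i in order]


results = rank_by_counts("climate policy", tweets)
```

Here the tweet containing both query words ranks first, the tweet sharing only "climate" ranks second, and the tweet with no overlap scores zero, which is the exact-match behavior described above.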

The second method used TF-IDF, which improved search quality by weighting distinctive terms more heavily and reducing the influence of common words. This was especially useful for multi-word political topics and policy-related queries.
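The weighting effect can be seen in a small sketch using scikit-learn's `TfidfVectorizer` with its default settings. The toy tweets are invented for illustration: "election" appears in every tweet, so it receives the lowest inverse document frequency weight, while rarer terms such as "climate" are weighted more heavily.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for cleaned tweets; "election" occurs in all of them.
tweets = [
    "election climate policy",
    "election tax cuts",
    "election debate tonight",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(tweets)

# idf_ holds one weight per vocabulary term; terms found in fewer tweets get larger weights.
idf = {term: float(vectorizer.idf_[idx]) for term, idx in vectorizer.vocabulary_.items()}

# Rank the tweets against a query that mixes a common term and a distinctive one.
query_vec = vectorizer.transform(["election climate"])
scores = cosine_similarity(query_vec, doc_matrix).ravel()
best = int(scores.argmax())
```

All three tweets match "election", but the first tweet wins because it also matches the more distinctive term "climate", whose weight dominates the score.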

The third method used pre-trained GloVe embeddings to create dense vector representations of tweets. I looked up the GloVe vector for each word in a tweet and averaged those vectors into a single tweet-level embedding. This allowed the system to retrieve semantically related tweets even when the exact query words were not present.
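The averaging step can be sketched with NumPy alone. The 3-dimensional `toy_vectors` below are hypothetical stand-ins for real pre-trained GloVe vectors, which have far more dimensions and are loaded from a file; the helper names `embed` and `cosine` are likewise assumptions made for this illustration.

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for pre-trained GloVe embeddings.
toy_vectors = {
    "climate": np.array([0.9, 0.1, 0.0]),
    "environment": np.array([0.8, 0.2, 0.1]),
    "tax": np.array([0.0, 0.9, 0.2]),
    "budget": np.array([0.1, 0.8, 0.3]),
}
DIM = 3


def embed(text, vectors, dim=DIM):
    """Average the vectors of known words; return a zero vector if none are known."""
    known = [vectors[w] for w in text.split() if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


def cosine(a, b):
    """Cosine similarity that returns 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


query = embed("climate", toy_vectors)
sim_related = cosine(query, embed("environment", toy_vectors))
sim_unrelated = cosine(query, embed("tax budget", toy_vectors))
```

With these toy vectors, the query "climate" scores higher against a tweet containing only "environment" than against one about "tax budget", even though neither tweet shares a word with the query. A sparse count or TF-IDF representation would score both as zero.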

Key Findings

Each retrieval method had different strengths. CountVectorizer was simple and effective when the query terms appeared directly in the tweet. TF-IDF generally improved relevance by emphasizing more informative words and reducing the importance of common terms. GloVe embeddings added semantic search ability by capturing conceptual similarity between tweets and queries.

The main takeaway was that search quality depends heavily on how text is represented. Traditional sparse methods are fast, interpretable, and strong for exact matching. Embedding-based methods can capture broader meaning, but they are harder to interpret because the dimensions of a dense vector do not correspond to individual words.

Skills Demonstrated

This project demonstrated natural language processing, information retrieval, text preprocessing, feature engineering, vector embeddings, similarity search, and Python-based data analysis. It also helped me understand the practical differences between keyword search and semantic search systems.