Project Writeup

Airbnb Review Insights with NLP

A natural language processing project analyzing Airbnb guest reviews across 15 U.S. cities to identify review themes, sentiment patterns, and rating-category prediction signals.

Python, pandas, NumPy, scikit-learn, TF-IDF, VADER, LDA, Sentence-BERT, NLTK, Matplotlib, Seaborn, langdetect

Project Overview

Online accommodation platforms like Airbnb rely heavily on user reviews to communicate guest experiences and build trust between hosts and potential guests. These reviews contain detailed feedback about cleanliness, location, communication, amenities, value, and overall satisfaction, but the scale of review data makes it difficult to manually summarize patterns across cities and listings.

The goal of this project was to use natural language processing to turn large-scale Airbnb review text into interpretable customer-experience insights. The project focused on two main questions: what themes do guests most commonly discuss in reviews, and can review text help predict listing quality categories?

To answer these questions, I built an NLP pipeline that combined text preprocessing, sentiment analysis, aspect-based sentiment analysis, topic modeling, feature engineering, and supervised machine learning.

Data Source and Scope

The dataset came from Inside Airbnb, a public data source that provides Airbnb listing and review information for cities around the world. The project used two main files: listings.csv and reviews.csv. The review file contained guest comments, review dates, reviewer identifiers, and listing identifiers. The listing file contained property details, room type, host attributes, location fields, and listing-level rating scores.

Because the review text alone did not include complete listing-level rating information, the review and listing datasets were merged by listing ID. This allowed each guest review to be connected to the corresponding listing's overall rating and related listing features.
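A minimal sketch of this merge in pandas is shown here on toy data. The column names (listing_id, comments, id, review_scores_rating) and the choice of an inner join are assumptions for illustration, not details confirmed by the project files.

```python
import pandas as pd

# Toy stand-ins for reviews.csv and listings.csv (column names are assumed).
reviews = pd.DataFrame({
    "listing_id": [101, 101, 202, 303],
    "comments": [
        "Great host and very clean.",
        "Noisy street at night.",
        "Perfect location.",
        "This listing is missing from the listings file.",
    ],
})
listings = pd.DataFrame({
    "id": [101, 202],
    "city": ["Austin", "Boston"],
    "review_scores_rating": [4.91, 4.40],
})

# Inner join: keep only reviews whose listing has rating information.
# validate="many_to_one" raises if a listing ID is duplicated in listings.
merged = reviews.merge(
    listings,
    left_on="listing_id",
    right_on="id",
    how="inner",
    validate="many_to_one",
).drop(columns="id")
```

Each review row now carries its listing's city and overall rating, and the review for listing 303 is dropped because it has no matching listing.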

The full merged dataset contained 9,272,268 reviews across 23 columns. It included reviews from 15 U.S. cities: Austin, Boston, Cambridge, Chicago, Denver, Hawaii, Los Angeles, Nashville, New Orleans, New York City, Oakland, Portland, San Francisco, Seattle, and Washington, D.C.

Key Variables

The main variables used in the analysis included listing ID, guest review text, city, property type, room type, superhost status, overall rating score, cleanliness score, communication score, location score, value score, and cleaned neighborhood field.

The most important text field was the guest review comment. The prediction label was derived from the listing-level overall review score. This created one of the main limitations of the project: every review inherits the rating category of its listing, so the model predicted the quality category of the listing a review belongs to, not the rating that the individual reviewer gave.

Sampling and Cleaning Strategy

The raw dataset was too large for a clean end-to-end analysis that included topic modeling, sentiment analysis, machine learning, and cross-city validation. I reduced the dataset in a structured way while trying to preserve meaningful variation across cities and rating categories.

First, I removed reviews shorter than 10 words. This removed low-information comments like "Great stay" while still keeping short but meaningful negative reviews. This step reduced the dataset from 9,272,268 reviews to 7,803,721 reviews.

Next, I capped each listing at five reviews. This prevented a small number of heavily reviewed listings from dominating the sample and helped the model learn general review patterns instead of listing-specific language.

Then, I sampled up to 4,000 reviews per city so that larger markets like Los Angeles and Hawaii would not overwhelm smaller ones like Cambridge or Oakland. Before English filtering, this produced a 60,000-review sample across the 15 cities (15 × 4,000).
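The three reduction steps above can be sketched in pandas as follows. This is an illustration under assumptions: the column names, the random seed, and the decision to pick each listing's five reviews at random are hypothetical, and the rating-stratified part of the sampling is left out for brevity.

```python
import pandas as pd

def build_sample(df, min_words=10, per_listing=5, per_city=4000, seed=42):
    """Length filter, per-listing cap, then per-city cap (a sketch)."""
    # Keep reviews with at least `min_words` whitespace-separated tokens.
    word_counts = df["comments"].fillna("").str.split().str.len()
    df = df[word_counts >= min_words]
    # Shuffle once so that head() below takes a random subset of each group.
    df = df.sample(frac=1.0, random_state=seed)
    # Cap each listing at `per_listing` reviews.
    df = df.groupby("listing_id").head(per_listing)
    # Take up to `per_city` reviews from each city.
    df = df.groupby("city").head(per_city)
    return df.reset_index(drop=True)
```

Calling build_sample(merged_df) with the defaults would apply the 10-word, 5-per-listing, and 4,000-per-city limits described above.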

Rating Category Labels

The listing-level overall rating score was converted into three rating categories. Listings below 4.2 were labeled low, listings from 4.2 to below 4.85 were labeled medium, and listings at or above 4.85 were labeled high.
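The thresholds above translate directly into a small labeling function. The function name is hypothetical; the cutoffs are the ones stated in the text.

```python
def rating_category(score):
    """Map a listing-level overall rating to low, medium, or high."""
    if score < 4.2:
        return "low"
    if score < 4.85:
        return "medium"
    return "high"
```

On a pandas column, the same binning could be written with pd.cut using bins [-inf, 4.2, 4.85, inf] and right=False so that each lower bound is inclusive.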

This label design was useful for supervised classification, but it also introduced a limitation. Airbnb ratings are heavily skewed toward positive scores, so low-rated listings were rare. The full dataset had a mean rating of about 4.84 and a median rating of about 4.88. Because of this imbalance, I used rating-stratified sampling when building the sample and balanced class weights when training the models.

English-Language Filtering

After city and rating-based sampling, I applied an English-language filter using a combined check based on token count, ASCII ratio, and language detection. This reduced noise from multilingual text and improved topic consistency.
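One way such a combined check might look is sketched below. The thresholds (10 tokens, 90 percent ASCII) and the function names are assumptions, not the values used in the project. The language detector is passed in as a parameter so the sketch runs without extra dependencies; in practice it could be the detect function from the langdetect package.

```python
def ascii_ratio(text):
    """Fraction of characters in `text` that are ASCII."""
    if not text:
        return 0.0
    return sum(ch.isascii() for ch in text) / len(text)

def looks_english(text, detect=None, min_tokens=10, min_ascii=0.9):
    """Combined check: token count, ASCII ratio, optional language detector."""
    if len(text.split()) < min_tokens:
        return False
    if ascii_ratio(text) < min_ascii:
        return False
    if detect is not None:
        try:
            return detect(text) == "en"
        except Exception:
            # Detectors can fail on text with no usable features.
            return False
    return True
```

With langdetect installed, usage would look like `from langdetect import detect` followed by `looks_english(review_text, detect=detect)`. The two cheap checks run first so the slower detector is only called on plausible candidates.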

The sample contained 60,000 reviews before English filtering. After filtering, the final working dataset contained 57,903 English reviews. The filter removed 2,097 reviews, or about 3.5 percent of the sampled data.

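The final class distribution was 28,692 reviews from medium-rated listings (about 49.6 percent), 20,050 from high-rated listings (about 34.6 percent), and 9,161 from low-rated listings (about 15.8 percent). The low class remained the smallest group, so I used balanced class weighting during supervised model training.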

Text Preprocessing

The preprocessing pipeline standardized the review text before sentiment analysis, topic modeling, and machine learning. The main steps included lowercasing, punctuation removal, tokenization, stopword removal, negation-safe stopword handling, negation encoding, and domain-specific stopword removal.

Negation handling was especially important. Standard stopword lists often remove words like "not," "no," and "never," but these words completely change meaning in reviews. For example, "clean" and "not clean" should not be treated as the same signal. To preserve this information, I used negation-safe stopwords for TF-IDF and encoded negated phrases for LDA topic modeling.
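The two negation strategies can be illustrated with a short sketch. The stopword list here is a tiny stand-in (the project likely started from a fuller list such as NLTK's), and the "not_" prefix format for encoded phrases is an assumption.

```python
import re

NEGATIONS = {"not", "no", "never"}
# Tiny illustrative stopword list; a real list would be much longer.
STOPWORDS = {"the", "a", "an", "was", "is", "and", "it", "to", "of",
             "not", "no", "never"}
# Negation-safe variant: negation words are kept as tokens.
NEGATION_SAFE_STOPWORDS = STOPWORDS - NEGATIONS

def tokenize(text):
    """Lowercase and keep alphabetic tokens (drops punctuation)."""
    return re.findall(r"[a-z']+", text.lower())

def tokens_for_tfidf(text):
    """Remove stopwords but keep negations so bigrams like 'not clean' survive."""
    return [t for t in tokenize(text) if t not in NEGATION_SAFE_STOPWORDS]

def tokens_for_lda(text):
    """Fuse a negation with the next content word, e.g. 'not_clean'."""
    out, negate = [], False
    for tok in tokenize(text):
        if tok in NEGATIONS:
            negate = True
            continue
        if tok in NEGATION_SAFE_STOPWORDS:
            continue
        out.append("not_" + tok if negate else tok)
        negate = False
    return out
```

For "The room was not clean", the TF-IDF tokens keep "not" as its own word, while the LDA tokens contain the single unit "not_clean", which a bag-of-words topic model can treat as distinct from "clean".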

TF-IDF and Logistic Regression

TF-IDF was used to convert review text into numerical features. I used both unigrams and bigrams so the model could capture individual words like "clean" and phrases like "great location," "not clean," "easy check," and "loud noise."

Logistic regression was used as the main supervised learning model. It was a strong choice because it trains efficiently on sparse TF-IDF features, works well for text classification, and is easier to interpret than more complex models. Since the dataset was imbalanced, the logistic regression models used balanced class weighting so the low-rated class would not be ignored.
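A minimal scikit-learn sketch of this setup is shown below on a toy corpus. The example texts and the max_iter value are made up for illustration; only the unigram-plus-bigram range and the balanced class weighting come from the description above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy reviews and labels, invented for illustration only.
texts = [
    "great location and very clean apartment",
    "spotless place with an easy check in",
    "nice stay but the street had loud noise at night",
    "good value although the bed was uncomfortable",
    "the bathroom was not clean and the heater was broken",
    "dirty sheets and the host never responded",
]
labels = ["high", "high", "medium", "medium", "low", "low"]

model = make_pipeline(
    # ngram_range=(1, 2) produces both unigrams and bigrams.
    TfidfVectorizer(ngram_range=(1, 2)),
    # class_weight="balanced" reweights classes inversely to their frequency.
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(texts, labels)

vocab = model.named_steps["tfidfvectorizer"].vocabulary_
classifier = model.named_steps["logisticregression"]
```

After fitting, classifier.coef_ holds one weight per class and n-gram, which is what makes the model straightforward to interpret: the largest weights for the low class point to the terms most associated with low-rated listings.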

Sentiment and Aspect-Based Analysis

VADER sentiment analysis was used to estimate the overall emotional tone of each review. VADER was useful because Airbnb reviews often include short, informal opinion phrases, and it does not require manually labeled training data.

I also used aspect-based sentiment analysis to look at sentiment toward specific parts of the guest experience instead of treating each review as only positive or negative overall. The main aspects were host, cleanliness, location, amenities, and value.

This was useful because a review can be positive overall while still mentioning a specific problem. For example, a guest might praise the location and host but complain that the bathroom was dirty. Overall sentiment could miss that detail, while aspect-based sentiment helps identify the specific source of praise or complaint.
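The idea can be sketched as a keyword-based pass over clauses. The keyword lists and function names below are hypothetical, and the sentiment scorer is passed in as a parameter; with VADER it could be a function returning the compound value from polarity_scores.

```python
import re

# Hypothetical keyword lists, for illustration only.
ASPECT_KEYWORDS = {
    "host": {"host", "hosts", "communication", "responsive"},
    "cleanliness": {"clean", "dirty", "spotless"},
    "location": {"location", "neighborhood", "walkable"},
    "amenities": {"kitchen", "wifi", "parking"},
    "value": {"price", "value", "worth"},
}

def aspect_sentiment(review, score_fn):
    """Average the sentiment score of every clause that mentions each aspect."""
    scores = {}
    # Split on sentence boundaries and on "but", which often flips sentiment.
    clauses = re.split(r"(?<=[.!?])\s+|,?\s+but\s+", review)
    for clause in clauses:
        tokens = set(re.findall(r"[a-z]+", clause.lower()))
        for aspect, keywords in ASPECT_KEYWORDS.items():
            if tokens & keywords:
                scores.setdefault(aspect, []).append(score_fn(clause))
    return {aspect: sum(vals) / len(vals) for aspect, vals in scores.items()}

def toy_score(text):
    """Stand-in scorer so the sketch runs without a sentiment library."""
    text = text.lower()
    positive = float("great" in text or "friendly" in text)
    negative = float("dirty" in text)
    return positive - negative
```

For the review "Great location and a friendly host, but the bathroom was dirty.", the location and host aspects score positive while cleanliness scores negative, which is the distinction a single overall score would blur.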

Topic Modeling with LDA

LDA topic modeling was used to discover common themes in the review text. I tested multiple topic counts and compared quantitative model fit with human interpretability. Although perplexity helped evaluate model fit, I did not rely only on it because topic models also need to produce themes that are understandable and useful.

I selected five topics for the main LDA model because that produced the clearest balance between detail and readability. With fewer topics, ideas like location, host quality, comfort, and complaints were blended together. With more topics, the model started producing narrower and more repetitive themes.
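The quantitative half of that comparison might be sketched with scikit-learn as follows. The function name, the seed, and the use of in-sample perplexity are assumptions; a held-out split would give a fairer estimate, and the interpretability check still has to be done by reading each model's top words.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def perplexity_by_topic_count(docs, topic_counts, seed=0):
    """Fit one LDA model per topic count and record its perplexity."""
    # LDA works on raw term counts rather than TF-IDF weights.
    counts = CountVectorizer().fit_transform(docs)
    scores = {}
    for k in topic_counts:
        lda = LatentDirichletAllocation(n_components=k, random_state=seed)
        lda.fit(counts)
        # Lower perplexity means a better statistical fit.
        scores[k] = lda.perplexity(counts)
    return scores
```

A call such as perplexity_by_topic_count(cleaned_reviews, [3, 5, 7, 10]) returns one score per candidate topic count, which can then be weighed against how readable the topics are.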

The final topics were labeled as Space & Physical Setting, Negative Experience / Complaints, Location & Neighborhood, Host Quality, and Comfort & Value.

Advanced Model Comparison

In addition to the TF-IDF model, I tested a Sentence-BERT-based approach. Sentence-BERT creates dense sentence embeddings that represent the meaning of a review in a continuous vector space. Unlike TF-IDF, which focuses on word and phrase frequency, Sentence-BERT is designed to capture semantic meaning and context.

The purpose of this comparison was to test whether a more advanced representation could better capture subtle or mixed reviews, especially in the medium rating category. Medium reviews were difficult because they often included both praise and criticism.

City Generalization

To test whether the model generalized across cities, I used Leave-One-City-Out validation. In this evaluation, the model was trained on reviews from 14 cities and tested on the remaining city. This process was repeated until each city had been used as the held-out test city.

This mattered because a model could appear accurate by learning city-specific words, neighborhoods, landmarks, or local phrasing rather than general language patterns related to guest satisfaction. Leave-One-City-Out validation tested whether the model learned transferable signals across markets.
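The splitting logic behind Leave-One-City-Out validation is small enough to sketch directly. The function name is hypothetical; scikit-learn's LeaveOneGroupOut provides the same behavior when the city is passed as the group.

```python
def leave_one_city_out(cities):
    """Yield (held_out_city, train_indices, test_indices) for each city."""
    for held_out in sorted(set(cities)):
        train_idx = [i for i, city in enumerate(cities) if city != held_out]
        test_idx = [i for i, city in enumerate(cities) if city == held_out]
        yield held_out, train_idx, test_idx
```

Given the city of every review in the sample, this yields 15 folds. In each fold the model would be refit on train_idx and scored on test_idx, so no review from the held-out city is seen during training.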

Key Takeaways

The project showed that TF-IDF features captured much of the useful predictive signal in Airbnb reviews. Words and phrases related to cleanliness, location, host communication, noise, value, and broken amenities were directly useful for rating-category prediction.

LDA and VADER were especially useful for interpretation, but adding their features to the model gave only a marginal predictive improvement over TF-IDF alone. This made sense because TF-IDF already captured many of the specific terms and phrases that the sentiment and topic features summarized more broadly.

The biggest limitation was the label design. The rating category came from the listing-level score rather than an individual review-level rating. A future version would use review-level ratings if available, fine-tune a transformer model, improve aspect extraction, and turn the project into an interactive dashboard for filtering by city, sentiment, topic, and rating category.