Lab 2 - Data Cleansing & Preparation

Author

Peyton Pope

Published

October 12, 2025

Setup

Libraries & Paths

Python

# Libraries
import pandas as pd
import datetime as dt
from sklearn.preprocessing import StandardScaler # standardization

# Path
path = "~/Downloads/AirBNB/"

R

# Libraries
library(dplyr)      
library(tidyverse) 
library(readxl)
library(stringr)
library(lubridate)
library(fastDummies)

# Path
path <- "~/Downloads/AirBNB/"

1. Load & Inspect

Python

room = pd.read_excel(path + "airbnb_room_type.xlsx", index_col=0)
print(room.head(1))
                      description  ... review_scores_value
listing_id                         ...                    
2595        Skylit Midtown Castle  ...                 9.0

[1 rows x 11 columns]
price = pd.read_csv(path + "airbnb_price.csv", index_col=0)
print(price.head(1))
                  price         nbhood_full
listing_id                                 
2595        225 dollars  Manhattan, Midtown
review = pd.read_csv(path + "airbnb_last_review.csv", index_col=0)
print(review.head(1))
           host_name  last_review
listing_id                       
2595        Jennifer  May 21 2024

R

room <- read_excel(paste0(path, "airbnb_room_type.xlsx"))
price <- read.csv(paste0(path, "airbnb_price.csv"))
review <- read.csv(paste0(path, "airbnb_last_review.csv"))

Reasoning

I chose the names ‘room’, ‘price’, and ‘review’ following the precedent set by the instructions, which read airbnb_room_type into ‘room’. Since these would not be my master data frames, I prioritized short, distinct names to avoid mixing them up. I used two different strategies for reading in the IDs. In Python, I was easily able to assign the ‘listing_id’ column as the index, so it does not count as one of the data columns. The process was not as straightforward in R, so I left ‘listing_id’ as a regular data column. As a result, the Python and R shapes are off by one in the column count throughout this project.
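
To make that off-by-one concrete, here is a minimal Python sketch contrasting the two read strategies (room_indexed and room_flat are illustrative names):

# with index_col=0, listing_id becomes the index and is excluded from the columns
room_indexed = pd.read_excel(path + "airbnb_room_type.xlsx", index_col=0)

# without it, listing_id stays a data column, mirroring the R version
room_flat = pd.read_excel(path + "airbnb_room_type.xlsx")

print(room_indexed.shape[1])  # 11 columns
print(room_flat.shape[1])     # 12 columns -- one more, matching R's count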

2. Merge & Join Logic

Python

listings = pd.merge(room, price, on="listing_id", how="left")
listings = pd.merge(listings, review, on="listing_id", how="left")

listings.shape
(25209, 15)

R

listings <- left_join(room, price, by = "listing_id")
listings <- left_join(listings, review, by = "listing_id")

dim(listings)
[1] 25209    16

Reasoning

Using a left join was the clear choice in this situation. A left join prioritizes the primary table, in this case room, so that none of its rows are lost regardless of what matches on the right: even if a review or price is missing, the room data is not dropped. An alternative such as an inner join would lose significant data; for example, a room with no reviews yet would be omitted entirely. Missing keys can also introduce bias. Newer listings may not have a ‘last_review’ date or ratings yet, so keeping only matched rows would skew the data toward listings that already have activity.
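
A quick way to see the cost of the alternative is to compare row counts under each join type; a hedged sketch (left_merged and inner_merged are illustrative names):

# an inner join silently drops any room whose listing_id has no match in price
left_merged = pd.merge(room, price, on="listing_id", how="left")
inner_merged = pd.merge(room, price, on="listing_id", how="inner")

# rows lost to unmatched keys; a nonzero count means an inner join would
# skew the data toward listings that already have price records
print(len(left_merged) - len(inner_merged))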

3. Deduplication

Python

listings.duplicated().sum()
np.int64(0)
print("Before: ", listings.shape)
Before:  (25209, 15)
listings = listings.drop_duplicates(keep='last')
print("After: ", listings.shape)
After:  (25209, 15)

R

sum(duplicated(listings))
[1] 0

Reasoning

To identify duplicate rows in both listings data frames, I used the duplicated() function in Python and R. I also determined that ‘listing_id’ cannot be duplicated: since it is the primary key, no two rows may share the same value. Choosing last over first for the keep parameter retains the most recent occurrence, which is preferable here because a later-added row is more likely to carry updated data. First is appropriate when the original record should take priority, for instance when later duplicates are assumed to be erroneous additions.
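
Because ‘listing_id’ is the index in Python, the full-row check above does not cover the key itself; a sketch of how a key-level check could look, assuming listing_id remains the index:

# identical rows vs. repeated listing_id values are different questions
print(listings.duplicated().sum())        # full-row duplicates
print(listings.index.duplicated().sum())  # duplicated primary keys

# if keys were ever duplicated, keep='last' would retain the newest row
listings = listings[~listings.index.duplicated(keep='last')]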

4. Missing Value Strategy

Approach

listings.isnull().sum() / len(listings)
description                    0.000397
room_type                      0.000000
minimum_nights                 0.000000
maximum_nights                 0.000000
review_scores_rating           0.217145
review_scores_accuracy         0.218890
review_scores_cleanliness      0.218533
review_scores_checkin          0.219049
review_scores_communication    0.218652
review_scores_location         0.218930
review_scores_value            0.218969
price                          0.000000
nbhood_full                    0.000000
host_name                      0.001388
last_review                    0.001071
dtype: float64

I identified the 10 features that have missing data. For the categorical features, ‘description’ and ‘host_name’, I decided to impute filler values such as “None”. For ‘last_review’, the missing share was negligible, so I dropped those rows. For the ratings, I initially chose the MICE imputer so that the other rating features could help impute missing ratings. However, I realized that in nearly all cases, when one rating was missing, all ratings were missing. So, I chose to drop these rows instead.
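
Before committing to the drop, the all-or-nothing pattern can be confirmed by counting missing ratings per row; a minimal sketch using the same column names as the code below:

rating_cols = ['review_scores_rating', 'review_scores_accuracy',
               'review_scores_cleanliness', 'review_scores_checkin',
               'review_scores_communication', 'review_scores_location',
               'review_scores_value']

# per-row count of missing ratings; values cluster at 0 (none missing)
# or 7 (all missing), with very few rows in between
print(listings[rating_cols].isnull().sum(axis=1).value_counts())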

Python

listings['description'] = listings['description'].fillna("None")
listings['host_name'] = listings['host_name'].fillna("None")
listings.dropna(subset=['last_review'], inplace=True)
# drop all rows with na review scores
listings.dropna(subset=['review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
                        'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
                        'review_scores_value'], inplace=True)


listings.isnull().sum() / len(listings)
description                    0.0
room_type                      0.0
minimum_nights                 0.0
maximum_nights                 0.0
review_scores_rating           0.0
review_scores_accuracy         0.0
review_scores_cleanliness      0.0
review_scores_checkin          0.0
review_scores_communication    0.0
review_scores_location         0.0
review_scores_value            0.0
price                          0.0
nbhood_full                    0.0
host_name                      0.0
last_review                    0.0
dtype: float64
listings.shape
(19648, 15)

R

listings$description[is.na(listings$description)] <- "None"
listings$host_name[is.na(listings$host_name)] <- "None"

listings <- listings |>
  drop_na(last_review)

listings <- listings |>
  drop_na(review_scores_rating, review_scores_accuracy, 
          review_scores_cleanliness, review_scores_checkin,
          review_scores_communication, review_scores_location,
          review_scores_value)

dim(listings)
[1] 19648    16

Reasoning

The first two decisions were simple, since their missing percentages were 0.1% or less. The review score features, however, forced me to weigh integrity, bias, and utility. My main consideration was the target variable: ‘review_scores_rating’. Imputing roughly 20% of the target variable’s values seemed like a threat to integrity, especially since the other six rating features were almost always missing in the same rows and therefore could not inform the imputation. Dropping the rows does potentially create some bias against new, less-used listings; however, imputed values would likely have created a different form of bias for or against those same listings. While 21% is a relatively high percentage, nearly 20,000 rows remain after dropping, which I considered a usable count.
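
For reference, the MICE route I decided against would have looked roughly like this using scikit-learn’s experimental IterativeImputer (a sketch, not part of the final pipeline):

# MICE-style imputation via scikit-learn's experimental IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# rating_cols as defined in the sketch above; this only helps when the other
# rating columns are present to condition on, which was rarely the case here
mice = IterativeImputer(random_state=0)
listings[rating_cols] = mice.fit_transform(listings[rating_cols])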

5. Dates & Recency Feature

Python

listings['last_review'] = pd.to_datetime(listings['last_review'])
listings['days_since_review'] = (dt.datetime.today() - listings['last_review']).dt.days

listings[['last_review', 'days_since_review']].head()
           last_review  days_since_review
listing_id                               
2595        2024-05-21                509
3831        2024-07-05                464
5099        2024-06-22                477
5178        2024-06-24                475
5238        2024-06-09                490

R

listings$last_review <- as.Date(listings$last_review, format = "%B %d %Y")
print(class(listings$last_review))
[1] "Date"
listings$days_since_review <- as.numeric(Sys.Date() - listings$last_review)
print(class(listings$days_since_review))
[1] "numeric"

Insights

Improperly formatted dates in large data sets can create nightmares, because inconsistent formats are difficult to convert efficiently into usable objects. Creating the feature ‘days_since_review’ is a useful decision because its data type, integer, can be freely built upon in later stages of data science; raw date types are much harder to use directly for tasks such as visualization or model training. It also gives a glanceable look at recency. Mixing up formats can lead to issues, and some of them might go unnoticed: when formats are not parsed correctly, for example if days and months are swapped, a model ends up using silently incorrect data, and the error would not be caught easily.
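
One defensive habit, which the R code above already follows, is to pass an explicit format so an ambiguous string fails loudly instead of being parsed with day and month swapped; a sketch of the stricter Python equivalent of the conversion above:

# explicit format mirrors the R call; a string that does not match raises an
# error (errors='raise' is the default) instead of being silently misread
listings['last_review'] = pd.to_datetime(listings['last_review'],
                                         format="%B %d %Y")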

6. Text Cleaning

Python

listings['description'] = listings['description'].str.replace(r"[^A-Za-z0-9\s]+", '', regex=True)
listings['description'] = listings['description'].str.lower()

listings['description'].head()
listing_id
2595                        skylit midtown castle
3831              cozy entire floor of brownstone
5099    large cozy 1 br apartment in midtown east
5178               large furnished room near bway
5238            cute  cozy lower east side 1 bdrm
Name: description, dtype: object

R

listings$description <- str_replace_all(listings$description, "[^A-Za-z0-9\\s]+", "")
listings$description <- str_to_lower(listings$description)

head(listings$description)
[1] "skylit midtown castle"                    
[2] "cozy entire floor of brownstone"          
[3] "large cozy 1 br apartment in midtown east"
[4] "large furnished room near bway"           
[5] "cute  cozy lower east side 1 bdrm"        
[6] "beautiful 1br on upper west side"         

Insight

I chose my pattern based on characters that I wanted to keep. My pattern removes anything that is not a letter, digit, or space. When a feature includes several words, over-cleansing runs the risk of distorting the original meaning. For example, sometimes punctuation is critical. Removing it can alter the author’s intent. This risk brings up an important ethical issue. If an author intentionally included certain aspects of a feature, like punctuation, it could be considered dishonest to use the cleaned, altered version as if that is what they meant.
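
One way to soften that trade-off is to whitelist slightly more; for example, an illustrative variant of my pattern that also keeps apostrophes, so contractions and possessives survive cleaning:

# same approach, but apostrophes are kept so words like "hell's" stay intact
listings['description'] = (listings['description']
                           .str.replace(r"[^A-Za-z0-9\s']+", '', regex=True)
                           .str.lower())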

7. Categorical Encoding

Room Type

Approach

listings.select_dtypes('object').nunique()
description    19299
room_type          9
price            502
nbhood_full      213
host_name       6328
dtype: int64

After reviewing the unique-value counts of the categorical features, it was clear to me that host_name and description were unusable for encoding due to their high cardinality. This left room_type and nbhood_full as candidates.

Python

listings['room_type'].unique()
array(['Entire home/apt', 'private room', 'Private room',
       'entire home/apt', 'PRIVATE ROOM', 'shared room',
       'ENTIRE HOME/APT', 'Shared room', 'SHARED ROOM'], dtype=object)
# Normalize -> 3 Types
listings['room_type'] = listings['room_type'].str.replace(r'[ /]', '_', regex=True).str.lower()

listings = pd.get_dummies(listings, columns=['room_type'], drop_first=False)

listings[['room_type_entire_home_apt', 'room_type_private_room', 'room_type_shared_room']].head()
            room_type_entire_home_apt  ...  room_type_shared_room
listing_id                             ...                       
2595                             True  ...                  False
3831                             True  ...                  False
5099                             True  ...                  False
5178                            False  ...                  False
5238                             True  ...                  False

[5 rows x 3 columns]

R

listings$room_type <- str_replace_all(listings$room_type, "[ /]", "_")
listings$room_type <- str_to_lower(listings$room_type)

listings <- dummy_cols(listings,
                       select_columns = "room_type",
                       remove_first_dummy = FALSE,
                       remove_selected_columns = TRUE)
head(listings[, c("room_type_entire_home_apt", "room_type_private_room", "room_type_shared_room")])
# A tibble: 6 × 3
  room_type_entire_home_apt room_type_private_room room_type_shared_room
                      <int>                  <int>                 <int>
1                         1                      0                     0
2                         1                      0                     0
3                         1                      0                     0
4                         0                      1                     0
5                         1                      0                     0
6                         1                      0                     0

Neighborhood

Python

listings['borough'] = listings['nbhood_full'].str.lower().str.split(', ').str[0]
listings['nbhood'] = listings['nbhood_full'].str.lower().str.split(', ').str[1]

listings = listings.drop('nbhood_full', axis=1)

listings = pd.get_dummies(listings, columns=['borough'])

listings[["borough_bronx",
    "borough_brooklyn",
    "borough_manhattan",
    "borough_queens",
    "borough_staten island",
    "nbhood"]].head()
            borough_bronx  ...          nbhood
listing_id                 ...                
2595                False  ...         midtown
3831                False  ...    clinton hill
5099                False  ...     murray hill
5178                False  ...  hell's kitchen
5238                False  ...       chinatown

[5 rows x 6 columns]

R

nbhood_split <- str_split(listings$nbhood_full, ", ", simplify = TRUE)

listings$borough <- str_to_lower(nbhood_split[, 1])
listings$nbhood <- str_to_lower(nbhood_split[, 2])

listings$nbhood_full <- NULL

listings <- dummy_cols(listings,
                       select_columns = "borough",
                       remove_first_dummy = FALSE,
                       remove_selected_columns = TRUE)

head(listings[, c("borough_bronx",
                  "borough_brooklyn",
                  "borough_manhattan",
                  "borough_queens",
                  "borough_staten island",
                  "nbhood")])
# A tibble: 6 × 6
  borough_bronx borough_brooklyn borough_manhattan borough_queens
          <int>            <int>             <int>          <int>
1             0                0                 1              0
2             0                1                 0              0
3             0                0                 1              0
4             0                0                 1              0
5             0                0                 1              0
6             0                0                 1              0
# ℹ 2 more variables: `borough_staten island` <int>, nbhood <chr>

Reasoning

I originally determined that room_type had nine unique values, but I quickly realized this count was inflated by a lack of normalization. After normalizing case and separators, three unique room types remained, a manageable number for one-hot encoding with dummies.

Originally, I considered nbhood_full to be unusable as well; however, after noticing that every value included one of the five New York boroughs, I realized I could work with it. After splitting this feature into two columns, borough and neighborhood, I determined that borough could be one-hot encoded since it has only five unique values.
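
A quick sanity check on that claim after encoding: the five borough dummies should account for every row exactly once; a minimal sketch:

# every listing should belong to exactly one of the five boroughs
borough_cols = [c for c in listings.columns if c.startswith('borough_')]
print((listings[borough_cols].sum(axis=1) == 1).all())  # expect True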

8. Type Coercion & Scaling

Type Coercion

Python

listings['price'] = listings['price'].replace(r'\D', '', regex=True).astype('int64')
listings.dtypes
description                            object
minimum_nights                          int64
maximum_nights                          int64
review_scores_rating                  float64
review_scores_accuracy                float64
review_scores_cleanliness             float64
review_scores_checkin                 float64
review_scores_communication           float64
review_scores_location                float64
review_scores_value                   float64
price                                   int64
host_name                              object
last_review                    datetime64[ns]
days_since_review                       int64
room_type_entire_home_apt                bool
room_type_private_room                   bool
room_type_shared_room                    bool
nbhood                                 object
borough_bronx                            bool
borough_brooklyn                         bool
borough_manhattan                        bool
borough_queens                           bool
borough_staten island                    bool
dtype: object
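
As a side note on the price coercion above: stripping every non-digit works here because the values are uniform, but a stricter extraction (illustrative, had it been used at the coercion step) would surface unexpected formats such as decimals or ranges as errors instead of silently merged digits:

# extract only the leading run of digits; a value that does not match yields
# NaN, and the astype call then fails loudly rather than silently
listings['price'] = (listings['price']
                     .astype(str)
                     .str.extract(r'^(\d+)')[0]
                     .astype('int64'))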

R

listings$price <- as.double(str_replace_all(listings$price, "\\D", ""))
str(listings)
tibble [19,648 × 24] (S3: tbl_df/tbl/data.frame)
 $ listing_id                 : num [1:19648] 2595 3831 5099 5178 5238 ...
 $ description                : chr [1:19648] "skylit midtown castle" "cozy entire floor of brownstone" "large cozy 1 br apartment in midtown east" "large furnished room near bway" ...
 $ minimum_nights             : num [1:19648] 2 15 3 3 7 15 7 7 7 15 ...
 $ maximum_nights             : num [1:19648] 1125 365 150 180 1110 ...
 $ review_scores_rating       : num [1:19648] 90 82 88 93 95 96 100 100 100 100 ...
 $ review_scores_accuracy     : num [1:19648] 9 9 9 9 10 10 10 10 10 10 ...
 $ review_scores_cleanliness  : num [1:19648] 9 8 9 9 10 10 10 10 9 10 ...
 $ review_scores_checkin      : num [1:19648] 10 10 9 10 10 10 10 10 10 10 ...
 $ review_scores_communication: num [1:19648] 10 10 9 10 10 10 10 10 10 10 ...
 $ review_scores_location     : num [1:19648] 8 9 10 10 10 8 10 10 9 10 ...
 $ review_scores_value        : num [1:19648] 9 8 9 9 9 9 8 10 9 10 ...
 $ price                      : num [1:19648] 225 89 200 79 150 135 85 89 85 140 ...
 $ host_name                  : chr [1:19648] "Jennifer" "LisaRoxanne" "Chris" "Shunichi" ...
 $ last_review                : Date[1:19648], format: "2024-05-21" "2024-07-05" ...
 $ days_since_review          : num [1:19648] 509 464 477 475 490 477 476 475 464 470 ...
 $ room_type_entire_home_apt  : int [1:19648] 1 1 1 0 1 1 0 0 0 1 ...
 $ room_type_private_room     : int [1:19648] 0 0 0 1 0 0 1 1 1 0 ...
 $ room_type_shared_room      : int [1:19648] 0 0 0 0 0 0 0 0 0 0 ...
 $ nbhood                     : chr [1:19648] "midtown" "clinton hill" "murray hill" "hell's kitchen" ...
 $ borough_bronx              : int [1:19648] 0 0 0 0 0 0 0 0 0 0 ...
 $ borough_brooklyn           : int [1:19648] 0 1 0 0 0 0 0 1 0 1 ...
 $ borough_manhattan          : int [1:19648] 1 0 1 1 1 1 1 0 1 0 ...
 $ borough_queens             : int [1:19648] 0 0 0 0 0 0 0 0 0 0 ...
 $ borough_staten island      : int [1:19648] 0 0 0 0 0 0 0 0 0 0 ...

Scaling

Python

scaler = StandardScaler()
scaled = scaler.fit_transform(listings.loc[:, ['minimum_nights', 
                                              'maximum_nights', 
                                              'price', 
                                              'days_since_review']])

listings[['minimum_nights', 'maximum_nights', 'price', 'days_since_review']] = pd.DataFrame(
  data=scaled,
  columns=['minimum_nights', 'maximum_nights', 'price', 'days_since_review'],
  index=listings.index)

listings[["minimum_nights", "maximum_nights", "price", "days_since_review"]].describe()
       minimum_nights  maximum_nights         price  days_since_review
count    1.964800e+04    1.964800e+04  1.964800e+04       1.964800e+04
mean    -2.314471e-17   -1.446545e-18  2.025163e-17      -4.339634e-16
std      1.000025e+00    1.000025e+00  1.000025e+00       1.000025e+00
min     -3.462369e-01   -9.575426e-03 -1.022618e+00      -8.547763e-01
25%     -3.096110e-01   -9.571935e-03 -5.234007e-01      -6.394291e-01
50%     -2.729850e-01   -9.553887e-03 -2.774097e-01      -4.025473e-01
75%     -1.264813e-01   -9.508917e-03  2.435124e-01       1.573553e-01
max      4.078467e+01    1.270615e+02  3.515253e+01       3.236820e+00
listings[["minimum_nights", "maximum_nights", "price", "days_since_review"]].head()
            minimum_nights  maximum_nights     price  days_since_review
listing_id                                                             
2595             -0.309611       -0.009509  0.605264           0.200425
3831              0.166526       -0.009554 -0.378700          -0.768637
5099             -0.272985       -0.009567  0.424388          -0.488686
5178             -0.272985       -0.009565 -0.451050          -0.531756
5238             -0.126481       -0.009510  0.062637          -0.208735

R

listings[, c('minimum_nights', 'maximum_nights', 'price', 'days_since_review')] <- 
  scale(listings[, c('minimum_nights', 'maximum_nights', 'price', 'days_since_review')])

summary(listings[, c('minimum_nights', 'maximum_nights', 'price', 'days_since_review')])
 minimum_nights    maximum_nights           price         days_since_review
 Min.   :-0.3462   Min.   : -0.009575   Min.   :-1.0226   Min.   :-0.8548  
 1st Qu.:-0.3096   1st Qu.: -0.009572   1st Qu.:-0.5234   1st Qu.:-0.6394  
 Median :-0.2730   Median : -0.009554   Median :-0.2774   Median :-0.4025  
 Mean   : 0.0000   Mean   :  0.000000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.:-0.1265   3rd Qu.: -0.009509   3rd Qu.: 0.2435   3rd Qu.: 0.1574  
 Max.   :40.7836   Max.   :127.058291   Max.   :35.1516   Max.   : 3.2367  

Insight

Type mismatches can undermine future analysis by creating small and potentially hidden errors in arithmetic, since combining columns through addition and other operations depends on consistent types. I also faced a debugging issue with my Python DataFrame reassignment after using the scaler. On my first attempt, the values I got were not even close to what I expected, yet nothing had visibly broken, which was confusing. I eventually noticed that some values appeared shifted relative to their rows, a sign of index misalignment, so I added the index argument so that everything matched correctly. I then verified the scaling by observing a mean of 0 and a standard deviation of 1 for all four features. Scaling is crucial for later analysis with model training and visualization. Since I was familiar with the dataset and had observed frequent outliers and unpredictable ranges, I decided against MinMaxScaler and went with StandardScaler.
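
For the record, a minimal reconstruction of the bug described above (bad and listings_bad are hypothetical names): building the DataFrame without index= gives it a fresh 0..n-1 RangeIndex, and because pandas aligns on index during assignment, rows land in the wrong places or become NaN:

cols = ['minimum_nights', 'maximum_nights', 'price', 'days_since_review']
bad = pd.DataFrame(scaled, columns=cols)   # default RangeIndex, not listing_id
listings_bad = listings.copy()
listings_bad[cols] = bad                   # assignment aligns on index
print(listings_bad[cols].isnull().sum())   # NaNs reveal the misalignment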

9. Reorder & Export

Reorder

Approach

My reordering strategy was simply to build a reordered list of column names and index with it. As I found in my research, “Pandas allows us to change the order of columns in a DataFrame by passing in a reordered list of column names” (David Y., 2023).

Python

categorical = list(listings.select_dtypes(include=['object']).columns)
boolean = list(listings.select_dtypes(include=['bool']).columns)
datetime = list(listings.select_dtypes(include=['datetime64']).columns)
numeric = list(listings.select_dtypes(include=['int64', 'float64']).columns)
numeric.remove('review_scores_rating')
target = ['review_scores_rating']

listings = listings[categorical + boolean + datetime + numeric + target]

listings.columns
Index(['description', 'host_name', 'nbhood', 'room_type_entire_home_apt',
       'room_type_private_room', 'room_type_shared_room', 'borough_bronx',
       'borough_brooklyn', 'borough_manhattan', 'borough_queens',
       'borough_staten island', 'last_review', 'minimum_nights',
       'maximum_nights', 'review_scores_accuracy', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value', 'price',
       'days_since_review', 'review_scores_rating'],
      dtype='object')

R

id <- c("listing_id")
categorical <- c("description", "host_name", "nbhood")
boolean <- c(
  "room_type_entire_home_apt",
  "room_type_private_room",
  "room_type_shared_room",
  "borough_bronx",
  "borough_brooklyn",
  "borough_manhattan",
  "borough_queens",
  "borough_staten island"
)
datetime <- c("last_review")
numeric <- c(
  "minimum_nights",
  "maximum_nights",
  # "review_scores_rating",
  "review_scores_accuracy",
  "review_scores_cleanliness",
  "review_scores_checkin",
  "review_scores_communication",
  "review_scores_location",
  "review_scores_value",
  "price",
  "days_since_review"
)
target <- c("review_scores_rating")

listings <- listings[, c(id, categorical, boolean, datetime, numeric, target)]

colnames(listings)
 [1] "listing_id"                  "description"                
 [3] "host_name"                   "nbhood"                     
 [5] "room_type_entire_home_apt"   "room_type_private_room"     
 [7] "room_type_shared_room"       "borough_bronx"              
 [9] "borough_brooklyn"            "borough_manhattan"          
[11] "borough_queens"              "borough_staten island"      
[13] "last_review"                 "minimum_nights"             
[15] "maximum_nights"              "review_scores_accuracy"     
[17] "review_scores_cleanliness"   "review_scores_checkin"      
[19] "review_scores_communication" "review_scores_location"     
[21] "review_scores_value"         "price"                      
[23] "days_since_review"           "review_scores_rating"       

Export

Python

listings.to_csv(path + "airbnb_cleaned_py.csv", index=True)
listings.shape
(19648, 23)

R

write.csv(listings, paste0(path, "airbnb_cleaned_r.csv"), row.names = FALSE)
dim(listings)
[1] 19648    24

As stated previously, R counts listing_id as a data column, while Python excludes it as the index, which accounts for the one-column difference.

Insight

Having a logical column order may not seem important initially, but normalizing the order into a categorized, predictable pattern keeps work reproducible and collaboration efficient by reducing confusion. Categorical data may seem irrelevant to those with purely quantitative goals; however, I believe it remains valuable for anyone interested in natural language analysis.

10. Summary & Reflection

Language Comparison

At a high level, the main difference I experienced between these languages can be described as follows: base Python with pandas is more intuitive than base R data frames, but library integration is smoother in R than in Python. I found that I enjoyed trying things for the first time in Python, and the conversion to R usually came easily. One challenge I experienced with Python was the syntax for the scaling portion: the elements had to be very specific and there were several steps, whereas R accomplished the same thing in a single line with the built-in scale() function.

Biblical Application

The verses in Proverbs 27 can be applied as wisdom in many facets of life, including monitoring data. It is crucial to “know the condition” of each feature. If your attention slips and best practices are not used, integrity can be lost. I would consistently use .head() in Python and View() in R to keep an eye on my data. I also used precautions such as creating copies so that I could experiment while maintaining the data I had cleansed so far. I could also compare the copied data with the original data.

Reflection

If I could start over, I would do a couple of things differently. I would try to read and understand all nine instructions before focusing on just the first one. I completed everything sequentially each week, but I realized that understanding the final objectives is very helpful when trying to understand the early instructions. I would also alternate between Python and R more. Oftentimes, once I figured out something in Python, I would realize an easier solution in R, or vice versa. I think I would leverage the strengths of each language instead of always starting with Python.

References

Y, David. (2023, November 15). Change the order of columns in a Python Pandas DataFrame. Sentry Answers. https://sentry.io/answers/change-the-order-of-columns-in-a-python-pandas-dataframe/