Lab 2 - Data Cleansing & Preparation
Setup
Libraries & Paths
Python
# Libraries
import pandas as pd
import datetime as dt
from sklearn.preprocessing import StandardScaler # standardization
# Path
path = "~/Downloads/AirBNB/"
R
# Libraries
library(dplyr)
library(tidyverse)
library(readxl)
library(stringr)
library(lubridate)
library(fastDummies)
# Path
path <- "~/Downloads/AirBNB/"
1. Load & Inspect
Python
room = pd.read_excel(path + "airbnb_room_type.xlsx", index_col=0)
print(room.head(1))
description ... review_scores_value
listing_id ...
2595 Skylit Midtown Castle ... 9.0
[1 rows x 11 columns]
price = pd.read_csv(path + "airbnb_price.csv", index_col=0)
print(price.head(1))
price nbhood_full
listing_id
2595 225 dollars Manhattan, Midtown
review = pd.read_csv(path + "airbnb_last_review.csv", index_col=0)
print(review.head(1))
host_name last_review
listing_id
2595 Jennifer May 21 2024
R
room <- read_excel(paste0(path, "airbnb_room_type.xlsx"))
price <- read.csv(paste0(path, "airbnb_price.csv"))
review <- read.csv(paste0(path, "airbnb_last_review.csv"))
Reasoning
I chose the names ‘room’, ‘price’, and ‘review’ because of the precedent set by the instructions, which read airbnb_room_type into ‘room’. I also considered that these would not be my master data frames, so I prioritized short, distinct names to avoid mixing them up. I used two different strategies for Python and R when reading in the IDs. In Python, I was easily able to assign the ‘listing_id’ column as the index, effectively the primary key, so it did not count as one of the data columns. The process was not as straightforward in R, so I left ‘listing_id’ as one of the data columns. As a result, the Python and R shapes are off by one in the column count throughout this project.
2. Merge & Join Logic
Python
listings = pd.merge(room, price, on="listing_id", how="left")
listings = pd.merge(listings, review, on="listing_id", how="left")
listings.shape
(25209, 15)
R
listings <- left_join(room, price, by = "listing_id")
listings <- left_join(listings, review, by = "listing_id")
dim(listings)
[1] 25209 16
Reasoning
Using a left join was the clear choice in this situation. A left join prioritizes the primary table, in this case room, so that none of its rows are lost regardless of what matches on the right: even if a review or price is missing, the room data is not dropped. If we were to use an alternative, such as an inner join, we would lose significant data. For example, if a room did not have any reviews yet, it would be completely omitted. Missing keys also have the potential to create bias: newer listings may not have a ‘last_review’ date or ratings yet, which could mean only listings with activity are considered.
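To make the cost of an inner join concrete, here is a minimal toy sketch (illustrative values, not the lab data) comparing row counts under the two strategies:
import pandas as pd
# Toy tables: listing 3 has no price record
room = pd.DataFrame({"listing_id": [1, 2, 3], "room_type": ["a", "b", "c"]})
price = pd.DataFrame({"listing_id": [1, 2], "price": [100, 150]})
left = pd.merge(room, price, on="listing_id", how="left")    # keeps all 3 rooms; NaN price for id 3
inner = pd.merge(room, price, on="listing_id", how="inner")  # drops id 3 entirely
print(left.shape[0], inner.shape[0])  # 3 2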
3. Deduplication
Python
listings.duplicated().sum()
np.int64(0)
print("Before: ", listings.shape)
Before: (25209, 15)
listings = listings.drop_duplicates(keep='last')
print("After: ", listings.shape)
After: (25209, 15)
R
sum(duplicated(listings))
[1] 0
Reasoning
To identify duplicate rows in both listings data frames, I used the duplicated() function in both Python and R. I also determined that ‘listing_id’ cannot be duplicated: since it is the primary key, no two rows can share a value. Choosing last over first for the keep parameter prioritizes the most recent occurrence, which is preferred here because the later a row is added, the more up to date its data is likely to be. First would be used where the original should be prioritized, for example if added duplicates are assumed to be incorrect.
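A minimal toy sketch (not the lab data) of how first versus last changes which duplicate survives:
import pandas as pd
# One duplicated key; the second row is the more recent entry
df = pd.DataFrame({"listing_id": [1, 1, 2], "price": [100, 120, 150]})
print(df.drop_duplicates(subset="listing_id", keep="first"))  # keeps price 100 for id 1
print(df.drop_duplicates(subset="listing_id", keep="last"))   # keeps price 120 for id 1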
4. Missing Value Strategy
Approach
listings.isnull().sum() / len(listings)
description 0.000397
room_type 0.000000
minimum_nights 0.000000
maximum_nights 0.000000
review_scores_rating 0.217145
review_scores_accuracy 0.218890
review_scores_cleanliness 0.218533
review_scores_checkin 0.219049
review_scores_communication 0.218652
review_scores_location 0.218930
review_scores_value 0.218969
price 0.000000
nbhood_full 0.000000
host_name 0.001388
last_review 0.001071
dtype: float64
I identified the ten features that have missing data. For the categorical features, ‘description’ and ‘host_name’, I decided to fill in placeholder values such as “None”. For ‘last_review’, the missing share was negligible, so I dropped those rows. For the ratings, I initially chose the MICE imputer so that the other rating features could help impute missing ratings. However, I realized that in nearly all cases, when one rating was missing, all ratings were missing. So, I chose to drop these rows instead.
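For reference, a minimal sketch (toy values, not the lab data) of the MICE-style approach I considered, using scikit-learn's experimental IterativeImputer, which models each incomplete column from the others:
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # required to expose IterativeImputer
from sklearn.impute import IterativeImputer
# Toy ratings matrix: one missing value the other columns can inform
X = np.array([[9.0, 9.0, 10.0],
              [8.0, 8.0, 9.0],
              [np.nan, 9.0, 10.0]])
imputer = IterativeImputer(random_state=0)
print(imputer.fit_transform(X))  # the NaN is filled using the other two columns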
Python
listings['description'] = listings['description'].fillna("None")
listings['host_name'] = listings['host_name'].fillna("None")
listings.dropna(subset=['last_review'], inplace=True)
# drop all rows with na review scores
listings.dropna(subset=['review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
'review_scores_value'], inplace=True)
listings.isnull().sum() / len(listings)
description 0.0
room_type 0.0
minimum_nights 0.0
maximum_nights 0.0
review_scores_rating 0.0
review_scores_accuracy 0.0
review_scores_cleanliness 0.0
review_scores_checkin 0.0
review_scores_communication 0.0
review_scores_location 0.0
review_scores_value 0.0
price 0.0
nbhood_full 0.0
host_name 0.0
last_review 0.0
dtype: float64
listings.shape
(19648, 15)
R
listings$description[is.na(listings$description)] <- "None"
listings$host_name[is.na(listings$host_name)] <- "None"
listings <- listings |>
drop_na(last_review)
listings <- listings |>
drop_na(review_scores_rating, review_scores_accuracy,
review_scores_cleanliness, review_scores_checkin,
review_scores_communication, review_scores_location,
review_scores_value)
dim(listings)
[1] 19648 16
Reasoning
The first two decisions were simple, since their missing percentages were 0.1% or less. The review score features, however, forced me to consider integrity, bias, and utility. My main consideration was the target variable, ‘review_scores_rating’. Imputing roughly 20% of the target variable's values seemed like it would compromise integrity, especially since those imputations would very likely not be able to draw on the other six rating features, which were missing in the same rows. This decision did potentially create some bias against new, less-used listings; however, imputing values likely would have created a different form of bias for or against those same listings. 21% is a relatively high percentage, but after dropping these rows there are still nearly 20,000 left, which I considered a usable count.
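A quick way to check the "missing together" claim, sketched against the merged listings frame from step 2 (assuming it is inspected before any rows are dropped):
rating_cols = ['review_scores_rating', 'review_scores_accuracy',
               'review_scores_cleanliness', 'review_scores_checkin',
               'review_scores_communication', 'review_scores_location',
               'review_scores_value']
# Count how many of the seven rating columns are missing per row;
# if ratings are missing together, the counts should be almost entirely 0 or 7
missing_per_row = listings[rating_cols].isnull().sum(axis=1)
print(missing_per_row.value_counts().sort_index())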
5. Dates & Recency Feature
Python
listings['last_review'] = pd.to_datetime(listings['last_review'])
listings['days_since_review'] = (dt.datetime.today() - listings['last_review']).dt.days
listings[['last_review', 'days_since_review']].head()
last_review days_since_review
listing_id
2595 2024-05-21 509
3831 2024-07-05 464
5099 2024-06-22 477
5178 2024-06-24 475
5238 2024-06-09 490
R
listings$last_review <- as.Date(listings$last_review, format = "%B %d %Y")
print(class(listings$last_review))
[1] "Date"
listings$days_since_review <- as.numeric(Sys.Date() - listings$last_review)
print(class(listings$days_since_review))
[1] "numeric"
Insights
A lack of consistent date formatting in a large data set can create nightmares, because improperly formatted dates are difficult to convert efficiently into usable date objects. Creating the feature ‘days_since_review’ is a useful decision because its data type, integer, is much easier to build on in later stages than a raw date type, for example in visualization or model training, and it gives a glanceable look at recency. Mixing up formats can lead to issues that go unnoticed: if days and months are swapped during parsing, a model ends up training on incorrect data, and the error is not easily caught.
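One defensive habit, sketched with illustrative values: pass an explicit format to the parser and coerce failures, so mis-formatted dates surface as NaT instead of being silently misread.
import pandas as pd
raw = pd.Series(["May 21 2024", "July 05 2024", "2024-06-22"])  # last entry uses a different format
# With an explicit format, nonconforming rows become NaT rather than wrong dates
parsed = pd.to_datetime(raw, format="%B %d %Y", errors="coerce")
print(parsed)
print("unparsed rows:", parsed.isna().sum())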
6. Text Cleaning
Python
listings['description'] = listings['description'].str.replace(r"[^A-Za-z0-9\s]+", '', regex=True)
listings['description'] = listings['description'].str.lower()
listings['description'].head()
listing_id
2595 skylit midtown castle
3831 cozy entire floor of brownstone
5099 large cozy 1 br apartment in midtown east
5178 large furnished room near bway
5238 cute cozy lower east side 1 bdrm
Name: description, dtype: object
R
listings$description <- str_replace_all(listings$description, "[^A-Za-z0-9\\s]+", "")
listings$description <- str_to_lower(listings$description)
head(listings$description)
[1] "skylit midtown castle"
[2] "cozy entire floor of brownstone"
[3] "large cozy 1 br apartment in midtown east"
[4] "large furnished room near bway"
[5] "cute cozy lower east side 1 bdrm"
[6] "beautiful 1br on upper west side"
Insight
I chose my pattern based on characters that I wanted to keep. My pattern removes anything that is not a letter, digit, or space. When a feature includes several words, over-cleansing runs the risk of distorting the original meaning. For example, sometimes punctuation is critical. Removing it can alter the author’s intent. This risk brings up an important ethical issue. If an author intentionally included certain aspects of a feature, like punctuation, it could be considered dishonest to use the cleaned, altered version as if that is what they meant.
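For example, a slightly gentler pattern (a sketch, not what was used above) keeps apostrophes so contractions and possessives survive cleaning:
import pandas as pd
desc = pd.Series(["Hell's Kitchen gem!", "Don't miss: cozy 1-BR."])
# Keep letters, digits, spaces, and apostrophes; drop everything else
cleaned = desc.str.replace(r"[^A-Za-z0-9\s']+", "", regex=True).str.lower()
print(cleaned.tolist())  # ["hell's kitchen gem", "don't miss cozy 1br"]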
7. Categorical Encoding
Room Type
Approach
listings.select_dtypes('object').nunique()
description 19299
room_type 9
price 502
nbhood_full 213
host_name 6328
dtype: int64
After reviewing the categorical features and their uniqueness counts, it was clear to me that host_name and description were unusable for encoding. This left room_type and nbhood_full as candidates.
Python
listings['room_type'].unique()
array(['Entire home/apt', 'private room', 'Private room',
'entire home/apt', 'PRIVATE ROOM', 'shared room',
'ENTIRE HOME/APT', 'Shared room', 'SHARED ROOM'], dtype=object)
# Normalize -> 3 Types
listings['room_type'] = listings['room_type'].str.replace(r'[ /]', '_', regex=True).str.lower()
listings = pd.get_dummies(listings, columns=['room_type'], drop_first=False)
listings[['room_type_entire_home_apt', 'room_type_private_room', 'room_type_shared_room']].head()
room_type_entire_home_apt ... room_type_shared_room
listing_id ...
2595 True ... False
3831 True ... False
5099 True ... False
5178 False ... False
5238 True ... False
[5 rows x 3 columns]
R
listings$room_type <- str_replace_all(listings$room_type, "[ /]", "_")
listings$room_type <- str_to_lower(listings$room_type)
listings <- dummy_cols(listings,
select_columns = "room_type",
remove_first_dummy = FALSE,
remove_selected_columns = TRUE)
head(listings[, c("room_type_entire_home_apt", "room_type_private_room", "room_type_shared_room")])# A tibble: 6 × 3
room_type_entire_home_apt room_type_private_room room_type_shared_room
<int> <int> <int>
1 1 0 0
2 1 0 0
3 1 0 0
4 0 1 0
5 1 0 0
6 1 0 0
Neighborhood
Python
listings['borough'] = listings['nbhood_full'].str.lower().str.split(', ').str[0]
listings['nbhood'] = listings['nbhood_full'].str.lower().str.split(', ').str[1]
listings = listings.drop('nbhood_full', axis=1)
listings = pd.get_dummies(listings, columns=['borough'])
listings[["borough_bronx",
"borough_brooklyn",
"borough_manhattan",
"borough_queens",
"borough_staten island",
"nbhood"]].head() borough_bronx ... nbhood
listing_id ...
2595 False ... midtown
3831 False ... clinton hill
5099 False ... murray hill
5178 False ... hell's kitchen
5238 False ... chinatown
[5 rows x 6 columns]
R
nbhood_split <- str_split(listings$nbhood_full, ", ", simplify = TRUE)
listings$borough <- str_to_lower(nbhood_split[, 1])
listings$nbhood <- str_to_lower(nbhood_split[, 2])
listings$nbhood_full <- NULL
listings <- dummy_cols(listings,
select_columns = "borough",
remove_first_dummy = FALSE,
remove_selected_columns = TRUE)
head(listings[, c("borough_bronx",
"borough_brooklyn",
"borough_manhattan",
"borough_queens",
"borough_staten island",
"nbhood")])# A tibble: 6 × 6
borough_bronx borough_brooklyn borough_manhattan borough_queens
<int> <int> <int> <int>
1 0 0 1 0
2 0 1 0 0
3 0 0 1 0
4 0 0 1 0
5 0 0 1 0
6 0 0 1 0
# ℹ 2 more variables: `borough_staten island` <int>, nbhood <chr>
Reasoning
I originally determined that room_type had nine unique values, but I quickly realized this was an artifact of missing normalization. After normalizing case and separators, there were three unique room types, a very manageable number for one-hot encoding with dummies.
Originally, I considered nbhood_full unusable; however, after noticing that every value included one of the five New York boroughs, I realized I could work with it. After splitting this feature into two columns, borough and neighborhood, I could one-hot encode borough, since it has only five unique values.
8. Type Coercion & Scaling
Type Coercion
Python
listings['price'] = listings['price'].replace(r'\D', '', regex=True).astype('int64')
listings.dtypes
description object
minimum_nights int64
maximum_nights int64
review_scores_rating float64
review_scores_accuracy float64
review_scores_cleanliness float64
review_scores_checkin float64
review_scores_communication float64
review_scores_location float64
review_scores_value float64
price int64
host_name object
last_review datetime64[ns]
days_since_review int64
room_type_entire_home_apt bool
room_type_private_room bool
room_type_shared_room bool
nbhood object
borough_bronx bool
borough_brooklyn bool
borough_manhattan bool
borough_queens bool
borough_staten island bool
dtype: object
R
listings$price <- as.double(str_replace_all(listings$price, "\\D", ""))
str(listings)
tibble [19,648 × 24] (S3: tbl_df/tbl/data.frame)
$ listing_id : num [1:19648] 2595 3831 5099 5178 5238 ...
$ description : chr [1:19648] "skylit midtown castle" "cozy entire floor of brownstone" "large cozy 1 br apartment in midtown east" "large furnished room near bway" ...
$ minimum_nights : num [1:19648] 2 15 3 3 7 15 7 7 7 15 ...
$ maximum_nights : num [1:19648] 1125 365 150 180 1110 ...
$ review_scores_rating : num [1:19648] 90 82 88 93 95 96 100 100 100 100 ...
$ review_scores_accuracy : num [1:19648] 9 9 9 9 10 10 10 10 10 10 ...
$ review_scores_cleanliness : num [1:19648] 9 8 9 9 10 10 10 10 9 10 ...
$ review_scores_checkin : num [1:19648] 10 10 9 10 10 10 10 10 10 10 ...
$ review_scores_communication: num [1:19648] 10 10 9 10 10 10 10 10 10 10 ...
$ review_scores_location : num [1:19648] 8 9 10 10 10 8 10 10 9 10 ...
$ review_scores_value : num [1:19648] 9 8 9 9 9 9 8 10 9 10 ...
$ price : num [1:19648] 225 89 200 79 150 135 85 89 85 140 ...
$ host_name : chr [1:19648] "Jennifer" "LisaRoxanne" "Chris" "Shunichi" ...
$ last_review : Date[1:19648], format: "2024-05-21" "2024-07-05" ...
$ days_since_review : num [1:19648] 509 464 477 475 490 477 476 475 464 470 ...
$ room_type_entire_home_apt : int [1:19648] 1 1 1 0 1 1 0 0 0 1 ...
$ room_type_private_room : int [1:19648] 0 0 0 1 0 0 1 1 1 0 ...
$ room_type_shared_room : int [1:19648] 0 0 0 0 0 0 0 0 0 0 ...
$ nbhood : chr [1:19648] "midtown" "clinton hill" "murray hill" "hell's kitchen" ...
$ borough_bronx : int [1:19648] 0 0 0 0 0 0 0 0 0 0 ...
$ borough_brooklyn : int [1:19648] 0 1 0 0 0 0 0 1 0 1 ...
$ borough_manhattan : int [1:19648] 1 0 1 1 1 1 1 0 1 0 ...
$ borough_queens : int [1:19648] 0 0 0 0 0 0 0 0 0 0 ...
$ borough_staten island : int [1:19648] 0 0 0 0 0 0 0 0 0 0 ...
Scaling
Python
scaler = StandardScaler()
scaled = scaler.fit_transform(listings.loc[:, ['minimum_nights',
'maximum_nights',
'price',
'days_since_review']])
listings[['minimum_nights', 'maximum_nights', 'price', 'days_since_review']] = pd.DataFrame(
data=scaled,
columns=['minimum_nights', 'maximum_nights', 'price', 'days_since_review'],
index=listings.index)
listings[["minimum_nights", "maximum_nights", "price", "days_since_review"]].describe() minimum_nights maximum_nights price days_since_review
count 1.964800e+04 1.964800e+04 1.964800e+04 1.964800e+04
mean -2.314471e-17 -1.446545e-18 2.025163e-17 -4.339634e-16
std 1.000025e+00 1.000025e+00 1.000025e+00 1.000025e+00
min -3.462369e-01 -9.575426e-03 -1.022618e+00 -8.547763e-01
25% -3.096110e-01 -9.571935e-03 -5.234007e-01 -6.394291e-01
50% -2.729850e-01 -9.553887e-03 -2.774097e-01 -4.025473e-01
75% -1.264813e-01 -9.508917e-03 2.435124e-01 1.573553e-01
max 4.078467e+01 1.270615e+02 3.515253e+01 3.236820e+00
listings[["minimum_nights", "maximum_nights", "price", "days_since_review"]].head() minimum_nights maximum_nights price days_since_review
listing_id
2595 -0.309611 -0.009509 0.605264 0.200425
3831 0.166526 -0.009554 -0.378700 -0.768637
5099 -0.272985 -0.009567 0.424388 -0.488686
5178 -0.272985 -0.009565 -0.451050 -0.531756
5238 -0.126481 -0.009510 0.062637 -0.208735
R
listings[, c('minimum_nights', 'maximum_nights', 'price', 'days_since_review')] <-
scale(listings[, c('minimum_nights', 'maximum_nights', 'price', 'days_since_review')])
summary(listings[, c('minimum_nights', 'maximum_nights', 'price', 'days_since_review')])
minimum_nights maximum_nights price days_since_review
Min. :-0.3462 Min. : -0.009575 Min. :-1.0226 Min. :-0.8548
1st Qu.:-0.3096 1st Qu.: -0.009572 1st Qu.:-0.5234 1st Qu.:-0.6394
Median :-0.2730 Median : -0.009554 Median :-0.2774 Median :-0.4025
Mean : 0.0000 Mean : 0.000000 Mean : 0.0000 Mean : 0.0000
3rd Qu.:-0.1265 3rd Qu.: -0.009509 3rd Qu.: 0.2435 3rd Qu.: 0.1574
Max. :40.7836 Max. :127.058291 Max. :35.1516 Max. : 3.2367
Insight
Type mismatches can undermine future analysis by creating small and potentially hidden errors in arithmetic, since combining columns through addition and other operations depends on compatible types. I faced a debugging issue with my Python DataFrame reassignment after I used my scaler. In my first attempt, the values I got were not close to what I expected, but nothing had visibly broken, so I was confused. I then noticed that some values were shifted by one row from where I expected them. So, I added the index argument to align everything correctly, and I verified the scaling by observing a mean of 0 and a standard deviation of 1 for all four features. Scaling is crucial for later analysis, such as model training and visualization. Since I was familiar with the dataset and had observed outliers and unpredictable ranges, I went with StandardScaler rather than MinMaxScaler.
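A minimal sketch of that alignment pitfall (toy values, not the lab data): after rows are dropped, the frame's index is no longer 0..n-1, so a new DataFrame built without an explicit index does not line up on assignment.
import numpy as np
import pandas as pd
df = pd.DataFrame({"x": [10.0, 20.0, 30.0]}, index=[2595, 3831, 5099])
scaled = np.array([[-1.0], [0.0], [1.0]])  # stand-in for scaler output
# Without index=, assignment aligns on the default 0..2 index and fills NaN
df["x_bad"] = pd.DataFrame(scaled, columns=["x"])["x"]
# With index=, the rows line up as intended
df["x_good"] = pd.DataFrame(scaled, columns=["x"], index=df.index)["x"]
print(df)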
9. Reorder & Export
Reorder
Approach
My reordering strategy was simply to create a reordered list of column names. As I found in my research, "Pandas allows us to change the order of columns in a DataFrame by passing in a reordered list of column names" (David Y., 2023).
Python
categorical = list(listings.select_dtypes(include=['object']).columns)
boolean = list(listings.select_dtypes(include=['bool']).columns)
datetime = list(listings.select_dtypes(include=['datetime64']).columns)
numeric = list(listings.select_dtypes(include=['int64', 'float64']).columns)
numeric.remove('review_scores_rating')
target = ['review_scores_rating']
listings = listings[categorical + boolean + datetime + numeric + target]
listings.columns
Index(['description', 'host_name', 'nbhood', 'room_type_entire_home_apt',
'room_type_private_room', 'room_type_shared_room', 'borough_bronx',
'borough_brooklyn', 'borough_manhattan', 'borough_queens',
'borough_staten island', 'last_review', 'minimum_nights',
'maximum_nights', 'review_scores_accuracy', 'review_scores_cleanliness',
'review_scores_checkin', 'review_scores_communication',
'review_scores_location', 'review_scores_value', 'price',
'days_since_review', 'review_scores_rating'],
dtype='object')
R
id <- c("listing_id")
categorical <- c("description", "host_name", "nbhood")
boolean <- c(
"room_type_entire_home_apt",
"room_type_private_room",
"room_type_shared_room",
"borough_bronx",
"borough_brooklyn",
"borough_manhattan",
"borough_queens",
"borough_staten island"
)
datetime <- c("last_review")
numeric <- c(
"minimum_nights",
"maximum_nights",
# "review_scores_rating",
"review_scores_accuracy",
"review_scores_cleanliness",
"review_scores_checkin",
"review_scores_communication",
"review_scores_location",
"review_scores_value",
"price",
"days_since_review"
)
target <- c("review_scores_rating")
listings <- listings[, c(id, categorical, boolean, datetime, numeric, target)]
colnames(listings)
[1] "listing_id" "description"
[3] "host_name" "nbhood"
[5] "room_type_entire_home_apt" "room_type_private_room"
[7] "room_type_shared_room" "borough_bronx"
[9] "borough_brooklyn" "borough_manhattan"
[11] "borough_queens" "borough_staten island"
[13] "last_review" "minimum_nights"
[15] "maximum_nights" "review_scores_accuracy"
[17] "review_scores_cleanliness" "review_scores_checkin"
[19] "review_scores_communication" "review_scores_location"
[21] "review_scores_value" "price"
[23] "days_since_review" "review_scores_rating"
Export
Python
listings.to_csv(path + "airbnb_cleaned_py.csv", index=True)
listings.shape
(19648, 23)
R
write.csv(listings, paste0(path, "airbnb_cleaned_r.csv"), row.names = FALSE)
dim(listings)
[1] 19648 24
As stated previously, R counts listing_id as a data column, while Python excludes it by using it as the index, which causes the one-column difference.
Insight
Having a logical column order may not seem important initially, but normalizing the order by creating a categorized, predictable pattern helps keep reproducibility and collaboration efficient by decreasing confusion. Categorical data may be irrelevant to some who have more quantitative goals in mind; however, I believe it is still valuable for those interested in natural language analysis.
10. Summary & Reflection
Language Comparison
At a high level, the main difference I experienced with these languages can be described as follows: base Python and pandas are more intuitive than base R data frames, but library integration is smoother in R than in Python. I found that I enjoyed trying things for the first time in Python, and the conversion to R usually came easily. A challenge I experienced with Python was the syntax for the scaling portion: elements had to be very specific, and there were several steps, whereas R accomplished the same thing in a single line with the built-in scale() function.
Biblical Application
The verses in Proverbs 27 can be applied as wisdom in many facets of life, including monitoring data. It is crucial to “know the condition” of each feature. If your attention slips and best practices are not used, integrity can be lost. I would consistently use .head() in Python and View() in R to keep an eye on my data. I also used precautions such as creating copies so that I could experiment while maintaining the data I had cleansed so far. I could also compare the copied data with the original data.
Reflection
If I could start over, I would do a couple of things differently. I would try to read and understand all nine instructions before focusing on just the first one. I completed everything sequentially each week, but I realized that understanding the final objectives is very helpful when trying to understand the early instructions. I would also alternate between Python and R more. Oftentimes, once I figured out something in Python, I would realize an easier solution in R, or vice versa. I think I would leverage the strengths of each language instead of always starting with Python.