Housing Price Prediction in Illinois

Dec 1, 2021

Dataset

The dataset comes from the Cook County Assessor’s Office (CCAO) in Illinois, a government institution that determines property taxes across most of Chicago’s metropolitan area and its nearby suburbs. In the United States, all property owners are required to pay property taxes, which are then used to fund public services including education, road maintenance, and sanitation. These property tax assessments are based on property values estimated using statistical models that consider multiple factors, such as real estate value and construction cost.

The CCAO dataset consists of over 500,000 records describing houses sold in Cook County in recent years (new records are still coming in every week!). The dataset we will be working with has 61 features in total. An explanation of each variable can be found in the included codebook.txt file. Some columns have been filtered out so that data cleaning and formatting do not make this assignment overly long.

The data are split into training and test sets with 204,792 and 68,264 observations, respectively (a 75/25 split).

Part 1: Exploratory Data Analysis (EDA)

Abnormal Values: remove outliers, fill with default values
Feature Engineering: log transformation, one-hot encoding, keyword extraction
Modeling: linear regression
Note: implemented as a pipeline, with visualizations
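
The steps listed for Part 1 can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the column names, the price threshold of 500, and the synthetic data are all assumptions made for the example (the real variable names are in codebook.txt), and the default-value filling and keyword-extraction steps are left out for brevity. The "pipeline" here is taken to mean chaining steps with pandas' DataFrame.pipe, which is also an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical column names; the real ones are listed in codebook.txt.
PRICE, AREA, ROOF = "Sale Price", "Building Square Feet", "Roof Material"


def remove_outliers(df, column, lower=-np.inf, upper=np.inf):
    """Keep rows whose value in `column` lies within [lower, upper]."""
    return df[(df[column] >= lower) & (df[column] <= upper)]


def add_log(df, column):
    """Add a log-transformed copy of a strictly positive column."""
    return df.assign(**{f"Log {column}": np.log(df[column])})


def one_hot(df, column):
    """Replace a categorical column with 0/1 indicator columns."""
    return pd.get_dummies(df, columns=[column], dtype=float)


def process(df):
    """Chain the cleaning and feature steps with DataFrame.pipe."""
    return (
        df.pipe(remove_outliers, PRICE, lower=500)  # assumed threshold
          .pipe(add_log, PRICE)
          .pipe(add_log, AREA)
          .pipe(one_hot, ROOF)
    )


# Tiny synthetic frame standing in for the CCAO training data.
rng = np.random.default_rng(0)
n = 200
area = rng.uniform(500, 4000, n)
raw = pd.DataFrame({
    AREA: area,
    ROOF: rng.choice(["Shingle", "Tile", "Slate"], n),
    PRICE: 100 * area * rng.lognormal(0, 0.1, n),
})
raw.loc[:4, PRICE] = 1.0  # five implausibly low sale prices to be filtered out

clean = process(raw)

# Regress log price on log area plus the roof indicators.
X = clean.drop(columns=[PRICE, f"Log {PRICE}", AREA])
y = clean[f"Log {PRICE}"]
model = LinearRegression().fit(X, y)
rmse = float(np.sqrt(np.mean((model.predict(X) - y) ** 2)))
print(f"training RMSE on log price: {rmse:.3f}")
```

Writing each step as a small function that takes and returns a DataFrame means the same process function can be applied unchanged to the test set.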

Part 2: Advanced Prediction with Machine Learning

Criterion: L2 loss
Baseline: ridge regression
Main Model: XGBoost + random forest
Ablation study: XGBoost, XGBoost + ridge regression
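
The comparison outlined for Part 2 might look something like the sketch below. Several things here are assumptions, not the project's confirmed setup: scikit-learn's GradientBoostingRegressor is swapped in for XGBoost to keep the example dependency-free, the "+" in each combination is interpreted as averaging the two models' predictions (via VotingRegressor), and the data are synthetic. The L2 criterion is computed as mean squared error on a held-out 25% split.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic nonlinear regression data standing in for the housing features.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(600, 5))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2] + rng.normal(0, 0.1, 600)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Stand-in for XGBoost (assumption: any gradient-boosted tree model).
boosted = GradientBoostingRegressor(random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0)

# VotingRegressor clones its estimators and averages their predictions.
models = {
    "ridge (baseline)": Ridge(alpha=1.0),
    "boosting only": boosted,
    "boosting + ridge": VotingRegressor(
        [("gb", boosted), ("ridge", Ridge(alpha=1.0))]
    ),
    "boosting + forest": VotingRegressor([("gb", boosted), ("rf", forest)]),
}

# Fit each variant and record its L2 loss (MSE) on the held-out split.
scores = {}
for name, estimator in models.items():
    estimator.fit(X_train, y_train)
    scores[name] = mean_squared_error(y_test, estimator.predict(X_test))

for name, mse in scores.items():
    print(f"{name:>18}: test MSE = {mse:.4f}")
```

Evaluating every variant on the same split with the same metric is what makes the ablation rows directly comparable; with the real data, xgboost.XGBRegressor could take the place of the stand-in.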

Frank (Haoyang) Ling

Master's Student @ UMICH