Dataset
The dataset comes from the Cook County Assessor’s Office (CCAO) in Illinois, a government institution that determines property taxes across most of Chicago’s metropolitan area and its nearby suburbs. In the United States, all property owners are required to pay property taxes, which are then used to fund public services including education, road maintenance, and sanitation. These property tax assessments are based on property values estimated using statistical models that consider multiple factors, such as real estate value and construction cost.
The CCAO dataset consists of over 500 thousand records describing houses sold in Cook County in recent years (new records are still coming in every week!). The dataset we will be working with has 61 features in total. An explanation of each variable can be found in the included codebook.txt file. Some of the columns have been filtered out so that data cleaning and formatting do not make this assignment overly long.
The data are split into training and test sets with 204,792 and 68,264 observations, respectively (a 3:1 ratio).
Part 1: Exploratory Data Analysis (EDA)
- Abnormal Values: remove outliers, fill with default values
- Feature Engineering: log transformation, one-hot encoding, keyword extraction
- Modeling: linear regression
- Note: implement as a pipeline, with visualizations
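The Part 1 steps can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual pipeline: the column names (sqft, roof, description), the 1%/99% quantile cutoffs, and the "garage" keyword are all assumptions made for the example, and plotting is left out.

```python
import re
import numpy as np

def outlier_mask(values, lower_q=0.01, upper_q=0.99):
    """Boolean mask keeping values inside the chosen quantile range."""
    lo, hi = np.quantile(values, [lower_q, upper_q])
    return (values >= lo) & (values <= hi)

def one_hot(labels):
    """One-hot encode labels, dropping the first level (absorbed by the intercept)."""
    levels = np.unique(labels)
    return (labels[:, None] == levels[None, 1:]).astype(float)

def has_keyword(texts, keyword):
    """1.0 where the keyword appears as a whole word (case-insensitive), else 0.0."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return np.array([1.0 if pattern.search(t) else 0.0 for t in texts])

def fit_ols(X, y):
    """Least-squares coefficients; the first entry is the intercept."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def predict_ols(coef, X):
    """Predictions from coefficients returned by fit_ols."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

# Synthetic stand-in data (NOT the CCAO data; all column names are hypothetical).
rng = np.random.default_rng(0)
n = 500
sqft = rng.uniform(500, 4000, size=n)
roof = rng.choice(np.array(["shingle", "slate", "tile"]), size=n)
description = np.where(rng.random(n) < 0.5,
                       "two-story house with garage", "one-story house")
log_price = (8.0 + 0.6 * np.log(sqft) + 0.2 * (roof == "slate")
             + 0.1 * has_keyword(description, "garage")
             + rng.normal(0, 0.1, size=n))
price = np.exp(log_price)
price[:5] = 1.0  # a few implausible sale prices to filter out

# Pipeline: drop outliers -> engineer features -> fit a linear model on log price.
keep = outlier_mask(price)
X = np.column_stack([
    np.log(sqft[keep]),                        # log transformation
    one_hot(roof[keep]),                       # one-hot encoding
    has_keyword(description[keep], "garage"),  # keyword extraction
])
y = np.log(price[keep])
coef = fit_ols(X, y)
rmse = float(np.sqrt(np.mean((predict_ols(coef, X) - y) ** 2)))
print(f"kept {keep.sum()} of {n} rows, log-scale RMSE = {rmse:.3f}")
```

Fitting on the log of the price keeps a handful of very expensive houses from dominating the squared error, which is one common reason to pair a log transformation with linear regression.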
Part 2: Advanced Prediction with Machine Learning
- Criterion: L2 loss
- Baseline: ridge regression
- Main Model: xgboost + random forest
- Ablation study: xgboost, xgboost + ridge regression
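To make the criterion and the baseline concrete, here is a small sketch of the L2 loss (in its mean-squared-error form) and a closed-form ridge regression, again on synthetic data. The blend helper reflects only one possible reading of "xgboost + random forest", namely averaging the predictions of the two models; that reading is an assumption, and two ridge fits stand in for the tree models so the example needs nothing beyond NumPy.

```python
import numpy as np

def l2_loss(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def predict_ridge(model, X):
    """Predictions from a (weights, intercept) pair returned by fit_ridge."""
    w, b = model
    return X @ w + b

def blend(*predictions):
    """Average several prediction vectors (assumed meaning of 'model A + model B')."""
    return np.mean(np.stack(predictions), axis=0)

# Synthetic regression problem (hypothetical; not the CCAO data), split 3:1.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
w_true = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + rng.normal(0, 0.5, size=400)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Baseline: ridge regression, compared against always predicting the training mean.
ridge = fit_ridge(X_train, y_train, alpha=1.0)
ridge_loss = l2_loss(y_test, predict_ridge(ridge, X_test))
mean_loss = l2_loss(y_test, np.full_like(y_test, y_train.mean()))

# Placeholder ensemble: a second ridge fit stands in for a different model.
other = fit_ridge(X_train, y_train, alpha=100.0)
blend_loss = l2_loss(y_test, blend(predict_ridge(ridge, X_test),
                                   predict_ridge(other, X_test)))
print(f"mean: {mean_loss:.3f}  ridge: {ridge_loss:.3f}  blend: {blend_loss:.3f}")
```

An ablation study in this setting amounts to computing the same loss on the same test split for each model or combination, so the contribution of each component can be compared directly.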