House rental — the Data Science way Part 2.1: train a regression model using PyCaret

Francesco Manghi
6 min read · Nov 21, 2020


In the previous chapter we described how to scrape a famous Italian website that collects house sale and rental listings from agencies all around the country, in order to build a dataset on which we can train a regression model on prices. In this chapter we are going to do exactly that. Let’s start!

How to prepare a good model in minutes

PyCaret’s announcement of its newest release, late October 2020

What is PyCaret? The authors describe it like this:

PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that speeds up the machine learning experiment cycle and makes you more productive.

In practice, PyCaret is a “wrapper” around the most widely used ML models and tools: it takes your dataset as input, runs all the models in sequence with no optimization (“run them all, but take it easy, dude”), and lets you pick the one that best suits your needs after that rapid first shot.

After that first shot you can tune your model by optimizing its hyperparameters with different methods: random search, grid search, Bayesian optimization, Hyperopt, or BOHB.

The website is really well documented if you want to learn more about it.

For our purposes (we want to understand house rental prices, not to write a guide on choosing the best algorithm for the task) we will keep it simple: put all the ingredients together and see what happens.

Dataset Cleaning

Don’t get too euphoric, by the way: it automates many things, but not the most painful task in a data scientist’s life: dataset cleaning.

The best way to spot something wrong in the dataset is a pair of simple pandas functions:

dataset['column_name'].unique()
dataset.dtypes

The pandas.Series.unique function returns the unique values of a Series, which makes it easy to see whether a column stores values that do not match its nature.

In a similar way, dataset.dtypes tells you the type of data present in each of your dataset’s columns. For example, if the “surface” column shows up as an object dtype, it means that some rows contain data that are not numbers but probably strings or something similar.
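As a quick sketch of what that inspection looks like in practice (the values here are made up, standing in for a scraped frame like ours):

```python
import pandas as pd

# Toy stand-in for the scraped dataset: one surface value is polluted text
dataset = pd.DataFrame({
    'surface': ['80', '120', '250, terreno di proprietà 2.000'],
    'price': [750, 500, 900],
})

# Mixed strings force the whole column to the generic "object" dtype
print(dataset.dtypes)

# unique() exposes the offending values at a glance
print(dataset['surface'].unique())
```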

There is no single way to clean them all; it depends on how the data was entered by whoever posted the listing.

Just for you to see some examples:

import pandas as pd

# the dataset we saved in the previous chapter
dataset.replace('5+', 6, inplace=True)
dataset.replace('3+', 4, inplace=True)

# drop rows whose values could not be parsed into clean numbers
dataset = dataset[dataset.price != 'Affitto\n 750/mese\n ']
dataset = dataset[dataset.price != 'Affitto\n 500/mese\n ']
dataset = dataset[dataset.locals != 'C']
dataset = dataset[dataset.price != 'prezzo su richiesta']
dataset = dataset[dataset.surface != '250, terreno di proprietà 2.000']
dataset = dataset[dataset.surface != '100, terreno di proprietà 200']
dataset = dataset[dataset.surface != '160, terreno di proprietà 400']
dataset = dataset[dataset.surface != '165, terreno di proprietà 450']
dataset = dataset[dataset.surface != '105, terreno di proprietà 100']
dataset = dataset[dataset.surface != '130, terreno di proprietà 180']
dataset = dataset[dataset.price != 'Affitto\n 450/mese\n ']
dataset = dataset[dataset.locals != 'c']

# strip non-alphanumeric characters from the floor column
dataset['floor'] = dataset['floor'].str.replace(r'\W', '', regex=True)

# keep at most the first three digits of the surface
metri = []
for elem in dataset['surface']:
    if len(elem) > 3:
        metri.append(elem[0:3])
    else:
        metri.append(elem)
dataset['surface'] = metri

dataset['price'] = dataset['price'].astype(int)
dataset['locals'] = dataset['locals'].astype(int)
dataset['surface'] = dataset['surface'].astype(int)

# drop the expense columns we will not use, then fill the remaining gaps
# (fill on dati, not dataset: drop() returns a copy)
dati = dataset.drop(columns=['spese condominio', 'heating_expences', 'other_expences', 'energy_certificate'])
dati.fillna(value=0, inplace=True)
dati.to_csv('before_regression.csv', sep=';')

Ok, now our dataset is clean. Yeah!

Now, PyCaret needs two things: a dataset with the feature values and the name of the target column, so it can train a model that predicts the target from the data. In our case the target is the rental price, and all the other information in the dataset is our features. Doing so is rather seamless:

exp_reg101 = setup(data = dati.drop(columns=['announce_link', 'Unnamed: 0.1']), target = 'price', categorical_features = ['contract', 'district', 'floor', 'property_type', 'building_year', 'air_conditioning'], numeric_features=['locals','surface'], remove_outliers= True)

Straight and easy. With the setup function you provide the dataset, define the target to train the algorithm on, and declare which columns are categorical_features and which are numeric_features.
What are they?

Categorical Features

A categorical variable (sometimes called a nominal variable) is one that has two or more categories, but there is no intrinsic ordering to the categories. For example, gender is a categorical variable having two categories (male and female) and there is no intrinsic ordering to the categories. Hair color is also a categorical variable having a number of categories (blonde, brown, brunette, red, etc.) and again, there is no agreed way to order these from highest to lowest. A purely categorical variable is one that simply allows you to assign categories but you cannot clearly order the variables. If the variable has a clear ordering, then that variable would be an ordinal variable.

https://stats.idre.ucla.edu/other/mult-pkg/whatstat/what-is-the-difference-between-categorical-ordinal-and-numerical-variables/

In short: a feature like the area where the house is located is not a numerical value and, even more, there is no way to rank one area higher than another. The same goes for the lease contract the landlord wants to offer. These features must be managed as dummy/indicator variables, which can be done with pandas. This greatly increases the number of features in our problem, and therefore the width of the dataframe, but it is the only way to deal with categorical features.
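For instance, pandas’ get_dummies does exactly this encoding; a minimal sketch with made-up district names:

```python
import pandas as pd

# Hypothetical sample: 'district' is categorical, 'surface' is numeric
houses = pd.DataFrame({
    'district': ['Centro', 'Navigli', 'Centro'],
    'surface': [60, 80, 45],
})

# One 0/1 indicator column per category; numeric columns pass through untouched
encoded = pd.get_dummies(houses, columns=['district'])
print(encoded.columns.tolist())
```

Two districts become two indicator columns; with dozens of districts the frame widens accordingly, which is the growth mentioned above.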

Fortunately PyCaret takes care of them automatically, so we just need to tell it where they are.

And here we go, let’s shoot them all.

compare_models()

PyCaret starts a training session for every regression model it has.
When the process finishes, it shows you the results of every ML model, ordered by R².

Ok, let’s say we want to proceed with the LightGBM algorithm, since it offers the best trade-off between training time (TT) and R².

model = create_model('lightgbm')

And here we go! It trains a slightly better model using cross-validation folds and a few other parameters. This is the base from which we can continue.
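Under the hood this is plain k-fold cross-validation. A scikit-learn sketch of the same idea on synthetic rent data (gradient boosting standing in for LightGBM, numbers invented for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(30, 200, size=(100, 1))          # surface in square metres
y = 8 * X[:, 0] + rng.normal(0, 40, size=100)    # rent roughly tied to surface

# Fit on 10 folds and average the held-out R² scores: this is what
# the per-fold table printed by create_model summarizes
scores = cross_val_score(GradientBoostingRegressor(random_state=0), X, y,
                         cv=10, scoring='r2')
print(scores.mean())
```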

Now comes the hyperparameter optimization.

tuned_model = tune_model(model)

and after a little patience we have our tuned model with optimized hyperparameters!

Really simple. Far too simple, to be honest.
But, that’s really it.

Results

PyCaret has a simple function that provides many generic plots and tables describing your model. It shows a widget with selectable outputs describing different parameters, as you can see in the image below.

evaluate_model(tuned_model)

But how does our model perform?
Let’s take a look.

final_model = finalize_model(tuned_model)
predictions = predict_model(final_model, data=dati.drop(columns=['announce_link', 'Unnamed: 0.1']))
predictions.reset_index(inplace=True)
# plot actual prices against the model's predictions ('Label' column)
ax = predictions[['price', 'Label']].plot(figsize=(25, 10))
fig = ax.get_figure()
fig.savefig("risultato_pycaret.png")
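The Pearson value quoted next can be computed directly from the predictions frame with pandas’ corr; a toy sketch (invented numbers — ‘Label’ is the prediction column PyCaret 2.x appends):

```python
import pandas as pd

# Toy stand-in for the real predictions frame
predictions = pd.DataFrame({
    'price': [500, 750, 900, 1200],   # actual rent
    'Label': [520, 700, 950, 1150],   # model's prediction
})

# Pearson correlation between actual prices and predictions
pearson = predictions['price'].corr(predictions['Label'])
print(round(pearson, 2))
```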

Pearson gives us a correlation value of 0.87. That’s good. Let’s check what the error graph looks like:

(predictions['price'] - predictions['Label']).plot(figsize=(25,10))

That’s fine!
It’s a little scattered, but hey, it’s a real-life dataset!
That’s the real world :)
We’ll think about how to improve the performance later.

So, that’s it! The Chef hopes you enjoyed the meal!
See you soon guys! ;D

Written by Francesco Manghi

Energy and Mechatronics Engineer. I have learnt Machine Learning and Data Science for work and passion. I love handcrafting and hiking in my free time.
