Loading the dataset


The dataset contains real-world driving data from two Volkswagen e-Golf cars, manufactured in 2014 and 2016 respectively. The data is available on the Spritmonitor website.

Volkswagen e-Golf, year 2014, 85 kW (116 PS): https://www.spritmonitor.de/en/detail/679341.html?page=8

Volkswagen e-Golf, year 2016, 85 kW (116 PS): https://www.spritmonitor.de/en/detail/786327.html

The data was scraped using a Python crawler (vehicle_crawler.py) available at: https://github.com/armiro/crawlers/tree/master/SpritMonitor-Crawler

The file includes data about 3615 trips with a total travel distance of around 152167 kilometers.

It is important to restrict the dataset to a small number of cars. If data from the same car model, owned by different people around the world, is combined in one dataset, the widely varying driving conditions make model training unstable. The dataset should therefore not contain data from cars driven in overly varied conditions.

Note: Some of the features were trimmed from the original CSV file before being loaded in the notebook.
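
As a rough sketch of the loading step, the snippet below reads a CSV into pandas and computes the trip count and total distance. The three rows are made-up stand-ins for the scraped file, and the real file name is not given here, so an in-memory string is used instead of a path.

```python
import io

import pandas as pd

# Made-up stand-in for the scraped CSV; the real file has 3615 trips.
csv_text = """Manufacturer,Power(kW),Quantity(kWh),Trip_Distance(km)
Volkswagen,85,2.1,12.0
Volkswagen,85,7.9,45.5
Volkswagen,85,5.0,30.2
"""

# With the real data this would be pd.read_csv(<path to the trimmed CSV>).
df = pd.read_csv(io.StringIO(csv_text))

n_trips = len(df)
total_km = df["Trip_Distance(km)"].sum()
print(n_trips, "trips,", round(total_km, 1), "km in total")
```

Run on the full file, the same two figures should come out close to the trip count and total distance quoted above.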

Checking the type of data in each column


This step shows which features in the dataset are categorical and which are non-categorical.
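
A minimal sketch of such a check with pandas is shown below. The frame is a toy example, and the column names `Tire_Type` and `Driving_Style` are assumptions made for illustration; the real column names may differ.

```python
import pandas as pd

# Toy stand-in for the loaded dataset; values are illustrative only.
df = pd.DataFrame({
    "Trip_Distance(km)": [12.0, 45.5, 30.2],
    "Quantity(kWh)": [2.1, 7.9, 5.0],
    "Tire_Type": ["Summer", "Winter", "Summer"],      # assumed column name
    "Driving_Style": ["Normal", "Moderate", "Fast"],  # assumed column name
})

# One dtype per column: text columns are the categorical candidates.
print(df.dtypes)

numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
```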

Removing unnecessary features


Since the dataset contains data for only one car model, the Volkswagen e-Golf, the Manufacturer and Power(kW) features are constant. Hence, they can be safely dropped from the dataset without any loss of valuable information.
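
One way to do this, sketched on a toy frame, is to confirm that the columns really hold a single value before dropping them. The values below are illustrative only.

```python
import pandas as pd

# Toy stand-in for the dataset; values are illustrative only.
df = pd.DataFrame({
    "Manufacturer": ["Volkswagen"] * 3,
    "Power(kW)": [85, 85, 85],
    "Trip_Distance(km)": [12.0, 45.5, 30.2],
})

# A column with exactly one distinct value carries no information for the model.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) == 1]

df = df.drop(columns=constant_cols)
```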

Removing all rows with missing values for Trip_Distance(km)


All rows (records) with a missing value for Trip_Distance(km) should be removed. These values cannot be imputed because Trip_Distance(km) is the target variable.
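
A minimal sketch of this filter with pandas, on made-up values: only rows where the target is missing are dropped, while gaps in other columns are kept so they can be imputed later.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dataset; values are illustrative only.
df = pd.DataFrame({
    "Trip_Distance(km)": [12.0, np.nan, 30.2, np.nan],
    "Quantity(kWh)": [2.1, 7.9, np.nan, 4.4],
})

# Restrict the null check to the target column only.
df = df.dropna(subset=["Trip_Distance(km)"]).reset_index(drop=True)
```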

One-Hot Encoding the categorical variables (Types of tires and driving styles)


The categorical variables cannot be used directly for training the model. They have to be converted into an integer representation of Boolean values, i.e. 1/0.
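
The conversion can be sketched with `pandas.get_dummies`, which may or may not be the encoder used in the notebook. The column names `Tire_Type` and `Driving_Style` and the category labels are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in; column names and category labels are assumed.
df = pd.DataFrame({
    "Tire_Type": ["Summer", "Winter", "Summer"],
    "Driving_Style": ["Normal", "Moderate", "Fast"],
})

# Each category becomes its own 0/1 integer column.
encoded = pd.get_dummies(df, columns=["Tire_Type", "Driving_Style"], dtype=int)
print(encoded)
```

Every row has exactly one 1 among the tire columns and one 1 among the driving-style columns.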

Encoding types of tires

Encoding types of driving styles

Checking the data types of the encoded features

Shuffling and splitting the data into training and test sets

Dividing the dataset into features (X) and label (Y)

Dividing the features into training (80%) and testing (20%) datasets for model training and evaluation respectively
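
The 80/20 split can be sketched with scikit-learn's `train_test_split`. The data below is randomly generated for illustration, and the `random_state` value is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Randomly generated stand-in data; the relation to the target is made up.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Quantity(kWh)": rng.uniform(1, 20, size=100),
    "Average Speed (km/hr)": rng.uniform(20, 90, size=100),
})
df["Trip_Distance(km)"] = df["Quantity(kWh)"] * 6.0

# Features (X) and label (Y).
X = df.drop(columns=["Trip_Distance(km)"])
Y = df["Trip_Distance(km)"]

# Shuffle, then hold out 20% of the rows for testing.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, shuffle=True, random_state=42
)
```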

Looking for columns with missing values in training set


Missing values in numerical (non-categorical) features are imputed using the rest of the data in the same feature.

Imputing the missing values


Here, we use a simple imputer with the 'mean' strategy, i.e. each missing value is replaced by the mean of the non-missing values in that feature.
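
A small sketch of mean imputation with scikit-learn's `SimpleImputer` follows. The values are made up, and fitting the imputer on the training set only before applying it to the test set is an assumption about the workflow, not something stated above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy training and test features with gaps; values are illustrative only.
X_train = pd.DataFrame({
    "Quantity(kWh)": [2.0, np.nan, 6.0],
    "Average Speed (km/hr)": [40.0, 50.0, np.nan],
})
X_test = pd.DataFrame({
    "Quantity(kWh)": [np.nan],
    "Average Speed (km/hr)": [60.0],
})

# Count missing values per column before imputing.
print(X_train.isna().sum())

imputer = SimpleImputer(strategy="mean")

# Learn the column means from the training data, then reuse them on the test data.
X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_imp = pd.DataFrame(
    imputer.transform(X_test), columns=X_test.columns, index=X_test.index
)
```

In this toy case the gap in Quantity(kWh) is filled with 4.0, the mean of 2.0 and 6.0.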

Checking the number of non-null values in each column of the dataset

Outlier Analysis


This analysis is performed on the continuous features, assuming that they follow a Gaussian distribution. Data points with a z-score greater than 3 or less than -3 are considered outliers and are removed from the dataset.
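
The z-score rule can be sketched as follows. The speeds are randomly generated with one planted extreme value, and the column name is taken from the heading below and may not match the real column exactly.

```python
import numpy as np
import pandas as pd

# Randomly generated speeds with one planted extreme value.
rng = np.random.default_rng(0)
speed = rng.normal(50, 10, size=200)
speed[0] = 500.0
df = pd.DataFrame({"Average Speed (km/hr)": speed})

col = "Average Speed (km/hr)"

# z-score: distance from the column mean in units of standard deviation.
z = (df[col] - df[col].mean()) / df[col].std()

# Keep only the rows with |z| <= 3.
df_clean = df[z.abs() <= 3]
```

The same filter would be applied once per continuous feature.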

For Average Speed (km/hr)

For Quantity(kWh)

Correlation between different features

As the heatmap shows, Driving_Style Normal and Driving_Style Moderate have a strong negative correlation, so one of them needs to be dropped. The same goes for the summer and winter tire types.
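
The idea can be sketched on a toy frame: compute the correlation matrix that a heatmap would display, read off the pair in question, and drop one of the two columns. The values are made up, and which column of the pair to drop is an arbitrary choice here.

```python
import pandas as pd

# Toy one-hot columns; the last row stands for a third driving style.
df = pd.DataFrame({
    "Driving_Style Normal":   [1, 0, 1, 0, 1, 0, 0],
    "Driving_Style Moderate": [0, 1, 0, 1, 0, 1, 0],
    "Quantity(kWh)": [2.0, 7.5, 3.1, 6.8, 2.4, 7.0, 9.9],
})

# Pearson correlation matrix, the input to a heatmap.
corr = df.corr()
r = corr.loc["Driving_Style Normal", "Driving_Style Moderate"]
print(round(r, 2))

# The two columns carry largely redundant information, so one is dropped.
df = df.drop(columns=["Driving_Style Moderate"])
```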

TensorFlow Model


Preparing a neural network model in TensorFlow for this regression problem.

Importing the necessary libraries

Normalizing the continuous features before feeding them to the neural net
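
The normalization method is not specified here, so the sketch below assumes plain z-score standardization with NumPy, using statistics computed from the training set only. The notebook may use a different method, such as a normalization layer inside the model.

```python
import numpy as np

# Toy continuous features: one row per trip, one column per feature.
X_train = np.array([[2.0, 40.0], [4.0, 50.0], [6.0, 60.0]])
X_test = np.array([[4.0, 70.0]])

# Per-feature statistics from the training data only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# The same shift and scale are applied to both sets.
X_train_norm = (X_train - mean) / std
X_test_norm = (X_test - mean) / std
```

After this step each training feature has zero mean and unit standard deviation, which keeps features with large ranges from dominating the gradient updates.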

Building a sequential model

Fitting the model to the training data

Making predictions for the test dataset and evaluating the model on the basis of mean absolute error

Calculating the standard deviation of the test labels

According to the research paper (https://sci-hub.se/10.1109/ICSPIS48872.2019.9066042), a regression model with an MAE of less than 10% of the standard deviation of the label, i.e. 4.384 km in our case, is considered an excellent model. The MAE of our model on the test data is approximately 3.18 km. Hence, it can be considered a reliable model.
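
The comparison can be written out as a short check. The helper names below are hypothetical, and the label standard deviation of 43.84 km is derived from the quoted threshold (4.384 km divided by 0.10) rather than recomputed from the data.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def passes_mae_criterion(mae, label_std, fraction=0.10):
    """True if the MAE is below the given fraction of the label's standard deviation."""
    return mae < fraction * label_std


label_std = 43.84  # km, implied by the 4.384 km threshold
test_mae = 3.18    # km, the approximate test MAE quoted in the text

print(passes_mae_criterion(test_mae, label_std))
```

With real predictions, `mean_absolute_error(Y_test, predictions)` and `np.std(Y_test)` would supply the two inputs.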

Saving and exporting JSON version of the model