Data analysis of used car database
Data analysis, sometimes referred to as exploratory data analysis (EDA), is one of the core components of data science. It is also where data scientists, data engineers and data analysts spend most of their time, which makes it extremely important in the field. This repository demonstrates some common exploratory data analysis methods and techniques using Python. For illustration, the used car database dataset from Kaggle is used, since it is well suited to practising EDA and to taking a first step into data science. Good luck with your EDA on the used car database dataset.
- The dataset is taken from Kaggle and contains details of used cars in Germany that are for sale on eBay.
- The dataset is not clean, so a lot of data cleaning is carried out. For example, prices that were too high are replaced by the median, and outliers are removed accordingly.
- Vehicles whose registration year was greater than 2016 or less than 1890 were also removed from the dataset, as this data is inconsistent and would yield incorrect results.
- The cleaned dataset is stored in a CleanData folder, which contains the entire cleaned dataset as cleaned_autos.csv and another folder named DataForAnalysis holding a file structure with subsets of the cleaned dataset, split by vehicle brand and vehicle type.
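The cleaning and splitting steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the repository's actual Datapreparation code: the column names `price`, `yearOfRegistration` and `brand` are assumptions about the raw CSV, and `price_cap` is a hypothetical threshold, since the value above which a price counts as "too high" is not stated here.

```python
import pandas as pd

# Assumed column names; adjust them to match the raw CSV.
PRICE_COL = "price"
YEAR_COL = "yearOfRegistration"


def clean_autos(df, price_cap, min_year=1890, max_year=2016):
    """Return a cleaned copy of df.

    price_cap is a hypothetical threshold chosen by the caller.
    """
    # Keep only rows whose registration year lies in the plausible range.
    out = df[df[YEAR_COL].between(min_year, max_year)].copy()
    # Work with float prices so a fractional median can be stored safely.
    out[PRICE_COL] = out[PRICE_COL].astype(float)
    # Replace implausibly high prices with the median of the other prices.
    too_high = out[PRICE_COL] > price_cap
    out.loc[too_high, PRICE_COL] = out.loc[~too_high, PRICE_COL].median()
    return out


def split_by(df, column):
    """Map each distinct value of `column` to the subset of rows having it."""
    return {key: group for key, group in df.groupby(column)}


# Toy data standing in for the raw dataset.
raw = pd.DataFrame({
    "brand": ["audi", "bmw", "audi", "opel", "bmw"],
    "price": [4000, 6000, 99999999, 1500, 5000],
    "yearOfRegistration": [2005, 1000, 2010, 1999, 9999],
})
cleaned = clean_autos(raw, price_cap=1_000_000)
subsets = split_by(cleaned, "brand")
```

Each subset could then be written out with `DataFrame.to_csv` to build a per-brand file structure; in this repository that step is automated by the shell scripts.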
The main folder contains 9 folders.
- Folders Analysis1 to Analysis5 contain the IPython notebook and Python scripts, along with the plots for that analysis.
- A folder of shell scripts that automate the creation of the file structure and the splitting of the data, as mentioned above.
- The Datapreparation folder contains the Datapreparation IPython script for cleaning the data.
- The CleanData folder contains the clean dataset and the subsets of data as per the file structure.
- The RawData folder contains the raw dataset.
- This analysis gives the distribution of vehicle prices based on vehicle type.
- Output before cleaning the data is shown below to highlight the importance of cleaning this dataset.
Histogram and KDE before performing data cleaning.
- It is clearly visible that the dataset has many outliers and inconsistent data, as the year of registration cannot be more than 2016 or less than 1890.
Boxplot of vehicle prices by vehicle type after cleaning the dataset. The boxplot shows how prices vary with vehicle type. The low, 25th percentile, 50th percentile (median), 75th percentile and high values can be estimated from it.
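The values a boxplot summarises can also be computed directly. The sketch below uses toy data and assumes the columns are named `vehicleType` and `price`. Note that `min` and `max` here are the raw extremes, whereas boxplot whiskers usually stop at 1.5 times the interquartile range.

```python
import pandas as pd

# Toy data; "vehicleType" and "price" are assumed column names.
autos = pd.DataFrame({
    "vehicleType": ["suv", "suv", "suv", "bus", "bus", "bus"],
    "price": [4000, 6000, 8000, 1000, 3000, 5000],
})

# Per-type summary: minimum, quartiles and maximum of the price.
summary = (
    autos.groupby("vehicleType")["price"]
    .describe()[["min", "25%", "50%", "75%", "max"]]
)
print(summary)
```

The same two columns can be passed to a plotting function, for example `seaborn.boxplot(data=autos, x="vehicleType", y="price")`, to draw the figure itself.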
- This analysis gives the number of cars of each brand that are available for sale in the entire dataset.
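Counting listings per brand is a one-liner in pandas. The sketch below uses toy data and assumes the column is named `brand`.

```python
import pandas as pd

# Toy data; "brand" is an assumed column name.
autos = pd.DataFrame({
    "brand": ["volkswagen", "opel", "volkswagen", "bmw", "volkswagen", "opel"],
})

# value_counts returns the counts sorted from most to least frequent.
brand_counts = autos["brand"].value_counts()
print(brand_counts)
```

Calling `brand_counts.plot(kind="bar")` (which requires matplotlib) would turn the counts into a bar chart.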
Barplot of the average price of the vehicles for sale, based on the type of the vehicle as well as on the gearbox of the vehicle.
- This analysis gives the average price of the vehicles based on the fuel type of the vehicle and also on the type of the vehicle.
Barplot of the average power of the vehicle based on the fuel type of the vehicle and also on the type of the vehicle.
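Averages over two categorical columns, as in the barplots above, can be computed with a two-key `groupby`. In this sketch `mean_by` is a hypothetical helper, and the column names (`vehicleType`, `gearbox`, `price`, `powerPS`) and the toy values are assumptions. The same call works with a fuel type column in place of `gearbox`.

```python
import pandas as pd

# Toy data; column names and category values are assumptions.
autos = pd.DataFrame({
    "vehicleType": ["suv", "suv", "suv", "kleinwagen", "kleinwagen"],
    "gearbox": ["automatik", "automatik", "manuell", "manuell", "manuell"],
    "price": [9000, 7000, 5000, 1000, 2000],
    "powerPS": [200, 180, 150, 60, 70],
})


def mean_by(df, value_col, group_cols):
    """Mean of value_col per group, with the last key spread into columns."""
    return df.groupby(group_cols)[value_col].mean().unstack()


avg_price = mean_by(autos, "price", ["vehicleType", "gearbox"])
avg_power = mean_by(autos, "powerPS", ["vehicleType", "gearbox"])
print(avg_price)
```

Combinations that do not occur in the data, such as an automatic Kleinwagen in this toy example, show up as NaN in the result.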
- This analysis gives the average price of vehicles by brand and type, for the brands and types found in the dataset.
- This analysis gives the distribution of the total number of days a particular vehicle was online for sale before it was purchased.
- This is a dynamic analysis and can be applied to any brand by specifying the brand of choice as an argument to the Python script.
- To run this file, type in your terminal: Analysis5.py 'brand'
- where 'brand' is the brand you would like to analyse, taken from the 'brand' column of the dataset.
- Many outliers with a registration year greater than 2016 or less than 1890 were removed to make the dataset ready for analysis.
- Most vehicles for sale have a registration year between 1990 and 2016, with the year 2000 being the most common at 24313 vehicles.
- Vehicles of type SUV and Cabrio are the most expensive, averaging more than $5000. Coupe, Bus etc. are moderately expensive, in the range of $2650 to $5000, whereas Andere and Others are the least expensive, averaging less than $1800.
- Volkswagen, Opel and BMW are the brands with the most vehicles for sale, in decreasing order, with Volkswagen having the most.
- As a general trend, automatic vehicles are more expensive than manual vehicles and those with an unspecified gearbox type.
Hybrid vehicles have the highest average price compared to other fuel types such as Diesel and Gasoline.
SUV vehicles with an automatic gearbox have the most power and Kleinwagen the least.
- Vehicles of brand Audi and type SUV are the most expensive of the available vehicles for sale.
- Vehicles of brand Porsche and type Kleinwagen are the least expensive of the available vehicles for sale.
- Based on the selected brand, it can be determined which vehicle types within that brand tend to sell more quickly online than others.
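The brand-specific "days online" analysis can be sketched as below. This is not the actual Analysis5.py: `days_online_by_type` is a hypothetical helper, and it assumes the dataset has `dateCreated` and `lastSeen` columns whose difference approximates how long a listing stayed online.

```python
import pandas as pd


def days_online_by_type(df, brand):
    """Mean days online per vehicle type for one brand, quickest first.

    Assumes "dateCreated" and "lastSeen" columns; their difference is used
    as a proxy for the time a listing was online before the sale.
    """
    subset = df[df["brand"] == brand].copy()
    created = pd.to_datetime(subset["dateCreated"])
    last_seen = pd.to_datetime(subset["lastSeen"])
    subset["daysOnline"] = (last_seen - created).dt.days
    return subset.groupby("vehicleType")["daysOnline"].mean().sort_values()


# Toy data standing in for the cleaned dataset.
autos = pd.DataFrame({
    "brand": ["audi", "audi", "audi", "bmw"],
    "vehicleType": ["suv", "suv", "limousine", "coupe"],
    "dateCreated": ["2016-03-01", "2016-03-05", "2016-03-01", "2016-03-01"],
    "lastSeen": ["2016-03-11", "2016-03-09", "2016-03-31", "2016-03-02"],
})
result = days_online_by_type(autos, "audi")
print(result)
```

In a command-line script the brand could be read from `sys.argv[1]` and passed to the function, which matches how the brand is supplied as an argument above.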