A starter pack for descriptive statistics and data visualization.
Statistics is a form of mathematical analysis that uses quantified models, representations, and synopses for a given set of experimental data or real-life studies. It studies methodologies to gather, review, analyze, and draw conclusions from data.
Statistics is mainly categorized into descriptive and inferential statistics.
In this article, I will focus mainly on descriptive statistics. Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphical analysis, they form the basis of virtually every quantitative analysis of data. For example, given a data set with the heights of children in a classroom, descriptive statistics would tell you the maximum, minimum, and average heights of the class.
I’ll run a small demo to show how to calculate measures of central tendency and measures of spread, and how to construct tables and graphs, using pandas and seaborn.
Before we get our hands dirty with calculations and plotting, it is important to understand that we have two types of data: numerical (quantitative) data and categorical (qualitative) data.
Categorical data: non-numerical information such as gender, relationship status, etc.
Numerical data: measurements such as weight, price, age, tax income, etc.
Let’s get to work!
So let’s get started with data visualization on the laptop price dataset from Kaggle. You can download the dataset here, then load it using the pandas library and save it as df.
Then we use the head() method to show the first five rows of data. You will see that the data is a table of 12 columns (variables).
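Here is a minimal sketch of the loading step. The article does not state the file name or the exact column names, so "laptop_price.csv", "Company", "TypeName", "OpSys" and "Price_euros" are my assumptions, and the handful of rows below are made-up stand-ins so the snippet runs without the download.

```python
import io

import pandas as pd

# Made-up stand-in rows (NOT from the real dataset); column names are assumed.
SAMPLE_CSV = """Company,TypeName,OpSys,Price_euros
Lenovo,Notebook,Windows 10,499.0
Dell,Notebook,Windows 10,639.0
Apple,Ultrabook,macOS,1339.69
HP,Notebook,No OS,575.0
Dell,Gaming,Windows 10,1499.0
Lenovo,2 in 1 Convertible,Windows 10,899.0
"""

# With the downloaded file you would pass its path instead, for example
# pd.read_csv("laptop_price.csv")  (assumed file name).
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.head())   # first five rows
print(df.shape)    # (number of rows, number of columns)
```

On the real file, df.shape is a quick way to confirm the number of columns that head() displays.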
Moving on, we calculate two measures of central tendency: the mean and the median.
The mean is the average of all the numbers, that is, their sum divided by how many there are. For example, suppose we are given the set of numbers [34, 84, 55, 23, 98, 65] and asked to calculate the mean.
Mean = (34 + 84 + 55 + 23 + 98 + 65) / 6 = 359 / 6 ≈ 59.83
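To double-check the arithmetic, here is a small sketch using Python's standard statistics module (my choice for illustration; the article itself uses pandas later on):

```python
from statistics import mean

numbers = [34, 84, 55, 23, 98, 65]

total = sum(numbers)      # sum of all the values
average = mean(numbers)   # sum divided by the number of values

print(total)              # 359
print(round(average, 2))  # 59.83
```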
The median is the middle number in a sorted list of numbers. It is sometimes used instead of the mean when there are outliers in the sequence. For example, take the list of numbers below: [2, 4, 5, 7, 9, 5, 3, 0, 6]
0, 2, 3, 4, 5, 5, 6, 7, 9
Median = 5
When the count of numbers is even, the median is the average of the two middle numbers, for example
[5, 8, 9, 2, 7, 4]
2, 4, 5, 7, 8, 9
Median = (5 + 7) / 2 = 6
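Both median examples can be checked with a short sketch, again using the standard statistics module as an illustration:

```python
from statistics import median

odd_count = [2, 4, 5, 7, 9, 5, 3, 0, 6]   # nine values
even_count = [5, 8, 9, 2, 7, 4]           # six values

# median() sorts internally; sorted() is printed only to show the middle values.
print(sorted(odd_count), median(odd_count))     # middle value is 5
print(sorted(even_count), median(even_count))   # (5 + 7) / 2 = 6.0
```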
But pandas makes this so easy that we don’t need to calculate the mean and median by hand. Using the describe() method, pandas automatically calculates the count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles of each numerical column. The 50th percentile is the median.
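A rough sketch of what describe() returns for a single numeric column. The prices below are made-up stand-ins and "Price_euros" is an assumed column name, so the numbers will differ from the real dataset.

```python
import pandas as pd

# Made-up prices standing in for the real column (assumed name: "Price_euros").
df = pd.DataFrame({"Price_euros": [499.0, 639.0, 1339.69, 575.0, 1499.0, 899.0]})

summary = df["Price_euros"].describe()
print(summary)                       # count, mean, std, min, 25%, 50%, 75%, max

# The 50% row of describe() is the median.
print(df["Price_euros"].median())
```

Calling df.describe() on the whole DataFrame gives the same rows for every numerical column at once.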
Visualization with matplotlib and seaborn
Now we are going to see how we can explore the data distribution using matplotlib and seaborn. Make sure you have imported the matplotlib and seaborn libraries.
First of all, we will build a histogram to observe the data distribution in our laptop price dataset.
A histogram is a graphical display using rectangular bars to show the frequency distribution of a set of numerical data.
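To build the histogram, first use the dropna() method to drop all missing values, then set the figure size, and finally use the seaborn distplot() method to plot the histogram. Note that distplot() is deprecated in seaborn 0.11 and later, where histplot() and displot() take its place.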
Box plot
The box plot is a really interesting tool, especially if you want to show the range. With a box plot you can easily locate the minimum, maximum, median, 25th percentile, and 75th percentile.
To plot a box plot, first set a seaborn theme, then set the figure size, and use the seaborn boxplot() method to generate the plot with “Price euros” as the input. If you want your box plot to be vertical, set the parameter “orient” to “v”.
To show the relationship between the price and each operating system, I added another parameter, “y”, to the boxplot() call.
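Here is a sketch of both box plots side by side. The rows are made-up stand-ins, and "OpSys" and "Price_euros" are assumed column names. Passing the prices as y draws a vertical box, which is one way to get the effect of orient="v".

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows; "OpSys" and "Price_euros" are assumed column names.
df = pd.DataFrame({
    "OpSys": ["Windows 10"] * 4 + ["macOS"] * 2 + ["No OS"] * 2,
    "Price_euros": [499.0, 639.0, 1499.0, 899.0, 1339.69, 1803.6, 575.0, 400.0],
})

sns.set_theme(style="whitegrid")
fig, (ax_all, ax_by_os) = plt.subplots(1, 2, figsize=(12, 5))

# Left: one vertical box for all prices.
sns.boxplot(y=df["Price_euros"], ax=ax_all)

# Right: one box per operating system.
sns.boxplot(data=df, x="OpSys", y="Price_euros", ax=ax_by_os)

# The box edges and the middle line correspond to these quartiles.
quartiles = df["Price_euros"].quantile([0.25, 0.5, 0.75])
print(quartiles)
```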
Bar chart
A bar chart is a chart that presents categorical data with rectangular bars whose heights or lengths are proportional to the values they represent. The bar chart is one of the most effective ways to display categorical data.
Now we use seaborn to create a bar chart showing the number of laptops for each company.
To do that:
set a seaborn theme (whichever theme you prefer, either “darkgrid” or “whitegrid”),
assign the variable “Company” to the parameter “x” and the plotting dataset df to the parameter “data”, and
use the seaborn method countplot() to plot the graph.
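Those three steps can be sketched as follows. The rows are made-up stand-ins, so the bar heights will not match the real dataset, and "Company" is the column name as given in the text.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows standing in for the real dataset.
df = pd.DataFrame(
    {"Company": ["Lenovo", "Dell", "Apple", "HP", "Dell", "Lenovo", "Asus"]}
)

sns.set_theme(style="darkgrid")          # or "whitegrid"
fig, ax = plt.subplots(figsize=(10, 5))

# countplot() draws one bar per category, with height equal to its row count.
sns.countplot(data=df, x="Company", ax=ax)
ax.set_title("Number of laptops per company")
```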
From the chart we can see that Lenovo and Dell have the highest number of laptops.
We can also find the type of laptop with the highest count, and we can clearly see it’s the notebook.
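If you prefer a number over a chart, value_counts() gives the same answer. This is a sketch on made-up rows, and "TypeName" is an assumed name for the laptop type column.

```python
import pandas as pd

# Made-up rows; "TypeName" is an assumed column name for the laptop type.
df = pd.DataFrame({
    "TypeName": ["Notebook", "Notebook", "Ultrabook",
                 "Notebook", "Gaming", "2 in 1 Convertible"],
})

type_counts = df["TypeName"].value_counts()   # rows per type, largest first
print(type_counts)
print(type_counts.idxmax())                   # the most common type
```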
Conclusions
The main objective of this article is to show some possible ways to summarize a dataset using pandas and seaborn.
I hope this article gets you started with descriptive statistics or refreshes your memory of it. I hope you enjoyed reading.
Feel free to check out the notebook on GitHub here.