How to Summarize Data with Pandas, Python

How to Summarize Data with Pandas, Python. Are you starting to learn how to analyze data using Python Pandas? If yes, this post is for you. We will go over different functions used to summarize data contained in a pandas dataframe.

For demonstration purposes, I used the Supermarket Sales data set from Kaggle. I downloaded the file and saved it in my local drive.

#import library
import pandas as pd
#import file
ss = pd.read_csv('supermarket_sales.csv')
#preview data
ss.head()

Image for post

 

Supermarket Sales dataframe

  1. info(): provides a concise summary of a dataframe.

I use this method every time I am working with pandas especially when doing data cleaning. It shows you all the information you need to know about your dataframe like: record counts, column names, data types, index range , and memory usage.

By looking at the summary provided for ss.info() below we can observe:

  • record count is 1000
  • composed of 17 columns
  • Column names can be updated to eliminate white spaces
  • Data types included are object, float64 and int
  • Date and Time columns are dtype = object, should be updated to corresponding datetime formats.
  • There are no missing values on any of the records

Image for post

 

info() result for the ss dataframe

2. describe(): Generates descriptive statistics that will provide visibility of the dispersion and shape of a dataset’s distribution. It excludes NaN values. It can be used for dataframe or a specific series.

Image for post

 

describe() results for the ss dataframe

Image for post

 

describe() results for the Quantity series

By default, the results are for numeric types, but you can use the parameter include to show the stats for some or all the data types in the dataframe.

Image for post

 

describe() results for all the series in the ss dataframe

Image for post

 

describe() results for only those series with object and int datatypes

You can exclude specific data types from a dataframe as well with the exclude parameter:

Image for post

 

describe() results for the ss dataframe excluding object and int data types. Leaving only the ones with float.

By default, the percentiles returned by this function are the 25th, 50th and 75th. If you wish to change the percentiles in the results, you can add those by using the parameter percentiles. Values must be between 0 and 1. This applies for both dataframe and series:

Image for post

 

describe() results for the ss dataframe with specific percentiles set

Image for post

 

describe() results for the Tax 5% series with specific percentiles set

3. value_counts(): returnscounts of unique values for the specified series. NaN values are excluded by default.

Image for post

 

value_counts() results for ss.City series

By default, the results are listed in descending order, but you can change it by using parameter ascending = True

Image for post

 

value_counts() results in ascending order

You can get the frequency (instead of count) for each value by setting the normalize = True

Image for post

 

value_counts() results in frequency format

4. nunique(): Count distinct observations. Can be used for a dataframe or a series. By default, it exclude the NaN values.

Image for post

 

nunique() results for ss dataframe

Image for post

 

nunique() results for City series

If you need to include the NaN values, use the parameter dropna= False

Since this dataset does not contains any NaN values, let us create a dataframe for this example:

d = {'A': [1, 2, None], 'B': [3, 4, 2], 'C': [3, None, None]}
df = pd.DataFrame(d)
df

Image for post

 

new dataframe for demo

Image for post

 

nunique() results excluding NaN values

Now see how the dropna parameter set to False changes the results:

Image for post

 

nunique() results including NaN values

5. sum(): Return the sum of the values for the requested axis. You can use it for both dataframe and series.

Image for post

 

sum() results for the entire ss dataframe

Image for post

 

sum() results for the Quantity series

You can specify to apply the function only to numeric types by using the parameter: numeric_only = True

Image for post

 

sum() results for those series with numeric types only

Another useful parameter is min_count where you can set the minimum number of records that must exists to apply the function.

Let’s create another dataframe to demo how this parameter works:

d = {'A': [1, 2, None, 5, 8], 'B': [3, 4, 2, 4, 5], 
'C': [3, None, None, 3, 4]}
df = pd.DataFrame(d)
df

Image for post

 

dataframe with some missing values

Image for post

 

sum() function result is provided for those series that has 4 or more non null values

6. count(): Return number of non-NA/null observations.

Can be applied to both dataframe and series:

Image for post

 

count() result for the ss dataframe

Image for post

 

count() result for the Gender series

Use the parameter numeric_only = True to apply functions to only series with numeric data types.

Image for post

 

count() result for ss dataframe with only numeric type series

Need Help with Researchers or Data Analysts, Lets Help you with Data Analysis & Result Interpretation for your Project, Thesis or Dissertation?

We are Experts in SPSS, EVIEWS, AMOS, STATA, R, and Python


7. Min and Max, Mean and Median:

a. min(): Return the minimum value

b. max(): Return the maximum value

c.mean(): Return the mean of the values

d. median(): Return the median of the values

These functions can be applied to both dataframe and series:

Image for post

max() result for the ss dataframe

Image for post

min() result for the ss dataframe

Image for post

mean() result for the ss dataframe

Image for post

median() result for the ss dataframe

Image for post

mean() result for the Rating series

Summary

In this post we went over some functions to get summarized data from a pandas dataframe. We used .info() to get information about the structure and format of our dataframe. We used .describe() to get descriptive statistic about the data. Then we showed how to get the counts per columns with .counts(); values grouped and their counts with .value_counts(), how many unique values a column has with .nunique() and finally functions to obtain the maximum and minimum values with .max() and .min() as well as the median and mean values with .median() and .mean(). Thanks for reading!

Data Analytics Services
Need Our Services?
Econometrics & Statistics Modelling Services
Need Help, Whatsapp Us Now