Pandas and Matplotlib for Quantitative Analysis

By Uday Ahuja on December 31, 2024

Pandas and Matplotlib are Python libraries designed to make working with structured data fast, flexible, and intuitive. Pandas provides tools to clean, process, and analyze datasets efficiently. Matplotlib is useful for creating static, interactive, and publication-quality visualizations, and it helps in presenting data insights visually. Matplotlib works seamlessly with Pandas, so DataFrames can be visualized directly.

Manually analyzing datasets in Excel or SPSS can be time-consuming and error-prone. In contrast, Pandas simplifies cleaning and preprocessing tasks like handling missing values, filtering data, and merging datasets. Furthermore, Pandas helps to automate repetitive data cleaning and aggregation tasks for visualization. Pandas and Matplotlib are free, open-source tools that provide functionality comparable to SPSS or STATA.

NOTE

Before proceeding further, please make sure that Pandas and Matplotlib are installed in your Python environment. If they are not, install them with pip install pandas matplotlib or follow the instructions on their official websites.

Series and DataFrame in Pandas

The one-dimensional labeled array in Pandas is known as a Series. A Series is similar to Python’s built-in dictionary: both store sequential data, hold multiple elements, and are indexable. However, a Series provides a more structured, analysis-ready representation of the data than a dictionary.

Dictionary

fruitPrice = {'apple': 10, 'banana': 20, 'cherry': 30}

print(fruitPrice["cherry"])
30

Pandas Series

import pandas as pd 

fruitPanda = pd.Series([10, 20, 30], index=['apple', 'banana', 'cherry'])

print(fruitPanda['apple'])
10

The key distinctions of a Pandas Series are:

  • It supports more data analysis operations.
  • It maintains type consistency.
  • It has built-in statistical methods.
  • It can perform vectorized operations.
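For example, reusing the fruit-price Series from above, vectorized arithmetic and the built-in statistical methods work element-wise without explicit loops:

```python
import pandas as pd

prices = pd.Series([10, 20, 30], index=['apple', 'banana', 'cherry'])

# Vectorized arithmetic applies to every element at once
doubled = prices * 2
print(doubled['banana'])  # 40

# Built-in statistical methods
print(prices.mean())      # 20.0
print(prices.max())       # 30
```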

A DataFrame is a two-dimensional labeled data structure in Pandas. It is the most commonly used structure and is analogous to a database table or a spreadsheet. A DataFrame is made up of multiple Series and has both row and column labels for organized data representation. The columns in a DataFrame can have different data types.

import pandas as pd

data = { 
  "Name": ["Amar", "Baba", "Ramesh"],
  "Age": [25, 30, 35],
  "Gender": ["Male", "Male", "Male"]
}

df = pd.DataFrame(data)

print(df)
     Name  Age Gender
0    Amar   25   Male
1    Baba   30   Male
2  Ramesh   35   Male

DataFrames support custom indexing and can be accessed by row, by column, or by a selection of columns.

import pandas as pd

data = { 
  "Name": ["Amar", "Baba", "Ramesh"],
  "Age": [25, 30, 35],
  "Gender": ["Male", "Male", "Male"]
}

df = pd.DataFrame(data)

print(df["Name"]) # Access by column

print(df.iloc[1]) # Access row by integer index

print(df[["Name", "Age"]]) # Select specific columns
# By column

0      Amar
1      Baba
2    Ramesh
Name: Name, dtype: object

# By integer index

Name      Baba
Age         30
Gender    Male
Name: 1, dtype: object

# By specific columns

     Name  Age
0    Amar   25
1    Baba   30
2  Ramesh   35

Importing datasets as a DataFrame for analysis

Pandas supports importing a variety of file formats for analysis. Below are the commonly used ones:

  • CSV (.csv): A plain-text format with comma-separated values.
    • df = pd.read_csv("data.csv")
  • Excel files (.xls, .xlsx): Requires the openpyxl or xlrd library for reading.
    • df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
  • JSON (.json): Semi-structured data, often used in web applications.
    • df = pd.read_json("data.json")
  • HTML (.html): Extracts tabular data from web pages (returns a list of DataFrames).
    • tables = pd.read_html("https://example.com/table")
  • Pickle files (.pkl): Serialized Python objects.
    • df = pd.read_pickle("data.pkl")
  • Clipboard: Data copied from another application.
    • df = pd.read_clipboard()
NOTE

In the methods discussed above, df is the variable that holds the imported dataset as a DataFrame.
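As a self-contained illustration, the CSV reader accepts any file-like object, so an inline string wrapped in io.StringIO can stand in for a file on disk:

```python
import io

import pandas as pd

# Inline CSV text stands in for a file such as "data.csv";
# pd.read_csv accepts any file-like object, so io.StringIO
# behaves exactly like a path on disk.
csv_text = "Name,Age\nAmar,25\nBaba,30\nRamesh,35\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)         # (3, 2)
print(df["Age"].sum())  # 90
```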

Essential Pandas methods to learn before working on a dataset

Mastering a set of fundamental Pandas methods is crucial to working confidently on datasets. These methods are useful in data exploration, data cleaning, data manipulation, and analytics.

Data exploration methods that help to understand the structure and content of the dataset are:

  • View the first or last rows of a dataset using df.head(n) for the first and df.tail(n) for the last n number of rows.
  • df.info() displays a summary of the dataset, including column names, data types, and non-null values.
  • df.shape returns the number of rows and columns in the dataset as a tuple.
  • df.columns and df.index list the column names and row labels.
  • To generate summary statistics such as mean, standard deviation, and percentiles for numeric columns use df.describe().
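A quick tour of these exploration methods on a small sample DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amar", "Baba", "Ramesh"],
    "Age": [25, 30, 35],
})

print(df.head(2))        # first two rows
print(df.shape)          # (3, 2): three rows, two columns
print(list(df.columns))  # ['Name', 'Age']
df.info()                # column names, dtypes, and non-null counts
print(df.describe())     # summary statistics for the numeric Age column
```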

Data cleaning methods that help to make the dataset ready for analysis by handling missing values, duplicates, and inconsistencies:

  • Check missing values in the dataset with df.isnull() or df.notnull(). Both methods return Boolean values (True/False) commonly used in conditional statements.
  • To handle missing values in a dataset use df.fillna(n), where ‘n’ is the value to be inserted. Use df.dropna() to drop the rows with missing values.
  • Identify duplicates in a dataset with df.duplicated() and remove duplicated rows with df.drop_duplicates().
  • Rename columns or indexes as df.rename(columns = {"old_name": "new_name"}).
  • To change the type of a column use df["column_name"] = df["column_name"].astype(float).
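To see several of these cleaning methods together, here is a small sketch on a made-up DataFrame (filling the missing age with the column mean is just one illustrative choice):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amar", "Baba", "Baba", "Ramesh"],
    "Age": [25.0, 30.0, 30.0, np.nan],
})

# Count missing values, fill them, then drop the duplicated row
print(df["Age"].isnull().sum())               # 1 missing value
filled = df.fillna({"Age": df["Age"].mean()})  # mean of 25, 30, 30
deduped = filled.drop_duplicates()

# Rename a column and cast Age to int
cleaned = deduped.rename(columns={"Name": "Customer"})
cleaned["Age"] = cleaned["Age"].astype(int)
print(cleaned)
```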

Data manipulation methods that help to filter, sort, and reshape the dataset for analysis:

  • To select rows and columns by label use df.loc[start:stop, "column_name"], e.g. df.loc[0:5, "Name"] for rows 0–5. To select by position use df.iloc[row_start:row_stop, col_start:col_stop], e.g. df.iloc[0:5, 1:3] for rows 0–5 and columns 1–3. Note that .loc slices include the stop label, while .iloc slices exclude the stop position, as in standard Python slicing.
  • To sort data by one or more columns use df.sort_values(by="column_name", ascending=True). Set ascending to False for descending order.
  • Remove specific columns with df.drop("column_name", axis=1) and rows with df.drop(0, axis=0).
  • To apply a custom function containing calculations or manipulation to a column or row use df["column_name"].apply(lambda x: x * 2).
  • To group data by categorical values and apply aggregate functions use df.groupby("category")["value_column"].mean().
  • Combine or merge different datasets as pd.merge(df1, df2, on="common_column")
  • Convert data into a DateTime format as pd.to_datetime(arg, format=None, errors='raise', exact=True, unit=None)
    • arg: The input data to convert (e.g., a column, list, or Series).
    • format: A string representing the format of the input data (e.g., %Y-%m-%d). Optional.
    • errors: Specifies how to handle invalid parsing ('raise', 'coerce'; 'ignore' is deprecated in recent Pandas releases).
    • unit: Specifies the unit of the timestamp (e.g., s for seconds).
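The manipulation methods above can be sketched on a toy DataFrame (the column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["A", "B", "A", "B"],
    "Value": [10, 40, 30, 20],
    "Date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"],
})

# Sort by Value in descending order
by_value = df.sort_values(by="Value", ascending=False)
print(by_value.iloc[0]["Value"])   # 40

# Group by Category and average the values in each group
means = df.groupby("Category")["Value"].mean()
print(means["A"])                  # 20.0

# Parse the Date strings into proper datetimes
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")
print(df["Date"].dt.year.iloc[0])  # 2024
```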

Basic analytics methods that help to perform descriptive and exploratory analysis on a dataset:

  • Count occurrences of unique values in a column with df["column_name"].value_counts()
  • Calculate the mean, median, and standard deviation for numeric data as df["column_name"].mean(), df["column_name"].median(), df["column_name"].std()
  • To calculate the correlation matrix between columns use df.corr() (pass numeric_only=True if the DataFrame also has non-numeric columns). To calculate the correlation between specific columns use df['column1'].corr(df['column2']).
  • To create pivot tables for summarizing data use df.pivot_table(index="column1",values="column2",aggfunc="mean").
  • Perform multiple aggregate functions on grouped data using df.groupby("category").agg({"value_column": ["mean", "sum"]}).
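A short sketch of these analytics methods on made-up sales data:

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["North", "South", "North", "South"],
    "Sales": [100, 200, 300, 400],
})

# Count occurrences of each unique value
print(df["Region"].value_counts()["North"])  # 2

# Descriptive statistics on a numeric column
print(df["Sales"].mean())                    # 250.0
print(df["Sales"].median())                  # 250.0

# Pivot table: mean sales per region
pivot = df.pivot_table(index="Region", values="Sales", aggfunc="mean")
print(pivot.loc["North", "Sales"])           # 200.0

# Several aggregates at once on grouped data
agg = df.groupby("Region").agg({"Sales": ["mean", "sum"]})
print(agg.loc["South", ("Sales", "sum")])    # 600
```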

Using Matplotlib to visualize data

Representing data graphically using charts, graphs, and plots helps to make the data more accessible, interpretable, and actionable, especially for complex datasets. In quantitative studies, datasets involve multivariate relationships, trends, and patterns, where visualization plays a critical role in communication.

While tools like Excel, SPSS, and STATA are powerful, Python’s Matplotlib offers unique advantages such as:

  • Customization and Flexibility: Control every aspect of the chart, from color schemes and markers to labels and legends.
  • Scalability for Large Datasets: Excel and SPSS struggle with very large datasets and require manual interventions for repetitive tasks. Matplotlib can handle large datasets programmatically and can also automate repetitive visualizations.
  • Advanced Visualizations: Matplotlib supports advanced plots like heat maps, 3D plots, and subplots for multi-dimensional data.
  • Cost-Effectiveness and Accessibility: Matplotlib is Open-source and free to use, making it accessible to all.

While traditional tools like Excel, SPSS, and STATA are reliable and familiar, Matplotlib complements them by offering unmatched flexibility, scalability, and integration capabilities. Matplotlib works seamlessly with Pandas, making it ideal for structured dataset visualization. There are several components of Matplotlib:

  • Figure: The overall window or page where your plot appears.
  • Axes: The area where the data is plotted (e.g., within the X and Y axes).
  • Plot Elements: Titles, labels, legends, and gridlines that enhance readability.

To create a plot from a Pandas DataFrame, import Pandas and the Pyplot module. Together they provide the functions for preparing the dataset and plotting the charts.

import matplotlib.pyplot as plt 
import pandas as pd

# Sample data
data = {'Year': [2018, 2019, 2020, 2021],
        'Literacy_Rate': [75.2, 76.3, 77.5, 78.6]}

df = pd.DataFrame(data)

# Create a line plot
plt.plot(df['Year'], df['Literacy_Rate'], marker='o', color='blue', linestyle='--')

# Add title and labels
plt.title("Literacy Rate Over Time")
plt.xlabel("Year")
plt.ylabel("Literacy Rate (%)")

# Show the plot
plt.grid(True)  # Show gridlines
plt.show()
[Figure: Line graph using Pandas and Matplotlib]

Key Matplotlib methods and their arguments

plot(x, y, color, marker, linestyle)

  • x: Data for the x-axis.
  • y: Data for the y-axis.
  • color: Line colors like ‘r’ for red, ‘b’ for blue, etc.
  • marker: Marker style like ‘o’ for circles, ‘s’ for squares.
  • linestyle: Style of the plot line like '-', '--', ':'.
plt.plot([1, 2, 3], [4, 5, 6], color='g', marker='o', linestyle='--')

scatter(x, y, color, alpha, s)

  • x, y: Data points for x and y axes.
  • color: Color of the points like ‘red’ or ‘#ff5733’.
  • alpha: Transparency of the points, a value between 0 and 1.
  • s: Size of the markers as integers.
plt.scatter([1, 2, 3], [4, 5, 6], color='blue', alpha=0.7, s=100)

bar(x, height, color, width)

  • x: Categories for the x-axis.
  • height: Heights of the bars (values).
  • color: Fill color of the bars.
  • width: Width of the bars (default is 0.8).
plt.bar(['A', 'B', 'C'], [10, 20, 15], color='orange', width=0.5)

hist(x, bins, color, edgecolor)

  • x: Data to create a histogram
  • bins: Number of bins (default is 10)
  • color: Fill color of the bars
  • edgecolor: Color of the edges of the bars
plt.hist([1, 2, 2, 3, 3, 3, 4, 4, 5], bins=5, color='purple', edgecolor='black')

boxplot(x, notch, vert)

  • x: List or array of data to be plotted
  • notch: Whether to draw notches in the box (default is False)
  • vert: Orientation of the plot (True for vertical, False for horizontal)
plt.boxplot([7, 8, 9, 10, 12, 15], notch=True, vert=True)

pie(x, labels, autopct, startangle)

  • x: Sizes of each wedge
  • labels: Labels for each wedge
  • autopct: Format for displaying percentages (e.g., ‘%1.1f%%’)
  • startangle: Angle to start the pie chart (default is 0)
plt.pie([20, 30, 50], labels=['A', 'B', 'C'], autopct='%1.1f%%', startangle=90)

title(label, fontsize, loc)

  • label: Text for the title
  • fontsize: Font size of the title
  • loc: Location of the title ('left', 'center', 'right')
plt.title("My Plot Title", fontsize=14, loc='center')

xlabel(label, fontsize) and ylabel(label, fontsize)

  • label: Text for the axis labels.
  • fontsize: Font size of the labels.
plt.xlabel("X-Axis Label", fontsize=12)

plt.ylabel("Y-Axis Label", fontsize=12)

grid(visible, color, linestyle, linewidth)

  • visible: Whether gridlines are visible (True or False)
  • color: Color of the gridlines
  • linestyle: Style of the gridlines ('--', ':')
  • linewidth: Thickness of the gridlines
plt.grid(visible=True, color='gray', linestyle='--', linewidth=0.5)

legend(loc, fontsize)

  • loc: Location of the legend (‘upper right’, ‘lower left’, etc.)
  • fontsize: Font size of the legend text
plt.legend(loc='upper left', fontsize=10)

savefig(fname, dpi, bbox_inches)

  • fname: File name or path (e.g., ‘plot.png’ or ‘folder/plot.pdf’)
  • dpi: Dots per inch for resolution
  • bbox_inches: Bounding box for the plot (‘tight’ for no extra white space)
plt.savefig("my_plot.png", dpi=300, bbox_inches='tight')

show()

  • Displays the current figure on screen (or inline in a notebook).
plt.show()
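Putting Pandas and Matplotlib together, here is a minimal sketch (with made-up transaction data and hypothetical column names) that aggregates first and then plots, in the spirit of the exercise below:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy transaction data; the Customer and Amount columns are
# hypothetical, for illustration only
df = pd.DataFrame({
    "Customer": ["A", "B", "A", "C", "A", "B"],
    "Amount": [120, 80, 200, 50, 90, 60],
})

# Aggregate with Pandas, then visualize with Matplotlib
counts = df["Customer"].value_counts()

plt.bar(counts.index, counts.values, color="steelblue")
plt.title("Transactions per Customer")
plt.xlabel("Customer")
plt.ylabel("Transaction count")
plt.show()
```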

Exercise: Analyze e-commerce transactions & plot charts

Money-laundering analysis often involves identifying suspicious transaction patterns and detecting anomalies. In that context, complete the following tasks:

  • Download the dataset.
  • Use Pandas to load the dataset and inspect its structure.
  • Identify key columns in the dataset.
  • Visualize the distribution of orders across categories using a histogram.
  • Create a bar chart showing the top 10 customers by transaction count. Highlight customers with high transaction frequencies.
  • Plot a line graph to analyze how order amounts change over time. Detect unusual spikes in order amounts on specific dates.
