Pandas and Matplotlib for Quantitative Analysis
Pandas and Matplotlib are Python libraries designed to make working with structured data fast, flexible, and intuitive. Pandas provides tools to clean, process, and analyze datasets efficiently. Matplotlib creates static, interactive, and publication-quality visualizations, helping to present data insights visually. Matplotlib works seamlessly with Pandas, so DataFrames can be visualized directly.
Manually analyzing datasets in Excel or SPSS can be time-consuming and error-prone. In contrast, Pandas simplifies cleaning and preprocessing tasks such as handling missing values, filtering data, and merging datasets, and it can automate repetitive data cleaning and aggregation for visualization tasks. Both Pandas and Matplotlib are free, open-source tools that provide functionality comparable to SPSS or STATA.
Before proceeding further, please make sure that Pandas and Matplotlib are installed in your Python environment. If they are not, follow the installation instructions on their official websites.
Series and DataFrame in Pandas
The one-dimensional labelled array in Pandas is known as a Series. A Series resembles Python's built-in dictionary: both store sequential data, hold multiple elements, and are indexable. However, a Series provides a more structured, analysis-ready data representation than a dictionary.
Dictionary
fruitPrice = {'apple': 10, 'banana': 20, 'cherry': 30}
print(fruitPrice["cherry"])
30
Pandas Series
import pandas as pd
fruitPanda = pd.Series([10, 20, 30], index=['apple', 'banana', 'cherry'])
print(fruitPanda['apple'])
10
The key distinctions of a Pandas Series are:
- It supports more data analysis operations.
- It maintains type consistency.
- It has built-in statistical methods.
- It can perform vectorized operations.
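A short sketch of these distinctions, reusing the fruit-price Series from the example above:

```python
import pandas as pd

# Prices indexed by fruit name, as in the example above
fruitPanda = pd.Series([10, 20, 30], index=['apple', 'banana', 'cherry'])

# Vectorized operation: apply a 10% price increase to every element at once
increased = fruitPanda * 1.1

# Built-in statistical methods
avg_price = fruitPanda.mean()    # average price across all fruits
max_fruit = fruitPanda.idxmax()  # label of the largest value: 'cherry'
```

A plain dictionary would require an explicit loop for the price increase and has no built-in `mean()` or `idxmax()`.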
A DataFrame is a two-dimensional labeled data structure in Pandas. It is the most used structure and is also analogous to a table in a database, Excel sheet, or spreadsheet. A DataFrame is made up of multiple Series. DataFrame has both row and column labels for organized data representation. The columns in a DataFrame can have different data types.
import pandas as pd
data = {
"Name": ["Amar", "Baba", "Ramesh"],
"Age": [25, 30, 35],
"Gender": ["Male", "Male", "Male"]
}
df = pd.DataFrame(data)
print(df)
Name Age Gender
0 Amar 25 Male
1 Baba 30 Male
2 Ramesh 35 Male
DataFrames support custom labels for rows and columns, and data can be accessed by row or by column.
import pandas as pd
data = {
"Name": ["Amar", "Baba", "Ramesh"],
"Age": [25, 30, 35],
"Gender": ["Male", "Male", "Male"]
}
df = pd.DataFrame(data)
print(df["Name"]) # Access by column
print(df.iloc[1]) # Access row by integer index
print(df[["Name", "Age"]]) # Select specific columns
#By column
0 Amar
1 Baba
2 Ramesh
Name: Name, dtype: object
#By integer index
Name Baba
Age 30
Gender Male
Name: 1, dtype: object
#By specific columns
Name Age
0 Amar 25
1 Baba 30
2 Ramesh 35
Importing datasets as a DataFrame for analysis
Pandas supports importing a variety of file formats for analysis. Below are the most commonly used:
- CSV (.csv): Comma-separated values, a plain-text format.
df = pd.read_csv("data.csv")
- Excel files (.xls, .xlsx): Requires the library openpyxl or xlrd for reading.
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
- JSON (.json): Semi-structured data, often used in web applications.
df = pd.read_json("data.json")
- HTML (.html): Extracting tabular data from web pages.
df = pd.read_html("https://example.com/table")[0] # read_html returns a list of tables
- Pickle files (.pkl): Serialized Python objects.
df = pd.read_pickle("data.pkl")
- Clipboard: Data copied to the system clipboard.
df = pd.read_clipboard()
In each of the methods above, df is the variable that holds the dataset as a DataFrame.
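A minimal, self-contained sketch of the most common case, read_csv. Here an in-memory string stands in for a file on disk, so the snippet runs without an external data.csv:

```python
import io
import pandas as pd

# A small CSV held in memory (stands in for a real "data.csv" file)
csv_text = "Name,Age\nAmar,25\nBaba,30\nRamesh,35\n"

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)  # (3, 2): three rows, two columns
```

With a real file, `pd.read_csv("data.csv")` behaves identically.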
Essential Pandas methods to learn before working on a dataset
Mastering a set of fundamental Pandas methods is crucial to working confidently on datasets. These methods are useful in data exploration, data cleaning, data manipulation, and analytics.
Data exploration methods that help to understand the structure and content of the dataset are:
- View the first or last rows of a dataset using df.head(n) for the first and df.tail(n) for the last n rows.
- df.info() displays a summary of the dataset, including column names, data types, and non-null counts.
- df.shape gives the number of rows and columns in the dataset.
- df.columns and df.index list the column names and row labels.
- To generate summary statistics such as mean, standard deviation, and percentiles for numeric columns, use df.describe().
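These exploration methods can be tried on a small sample DataFrame (the data below is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amar", "Baba", "Ramesh"],
    "Age": [25, 30, 35],
})

first_two = df.head(2)               # first two rows
rows, cols = df.shape                # (3, 2)
stats = df.describe()                # summary statistics for numeric columns
mean_age = stats.loc["mean", "Age"]  # mean of the Age column: 30.0
```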
Data cleaning methods that help to make the dataset ready for analysis by handling missing values, duplicates, and inconsistencies:
- Check for missing values with df.isnull() or df.notnull(). Both return Boolean values (True/False) commonly used in conditional statements.
- Handle missing values with df.fillna(n), where n is the value to insert, or drop rows with missing values using df.dropna().
- Identify duplicate rows with df.duplicated() and remove them with df.drop_duplicates().
- Rename columns or indexes with df.rename(columns={"old_name": "new_name"}).
- Change the type of a column with df["column_name"] = df["column_name"].astype(float).
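A brief sketch of these cleaning steps on a toy DataFrame with one missing value and one duplicate row:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Name": ["Amar", "Baba", "Baba", "Ramesh"],
    "Age": [25.0, 30.0, 30.0, np.nan],
})

n_missing = df["Age"].isnull().sum()            # count missing values: 1
filled = df.fillna({"Age": df["Age"].mean()})   # replace NaN with the column mean
deduped = filled.drop_duplicates()              # drop the repeated "Baba" row
renamed = deduped.rename(columns={"Age": "AgeYears"})
```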
Data manipulation methods that help to filter, sort, and reshape the dataset for analysis:
- To select rows and columns by label use df.loc[from:to, "column_name"]. For rows 0–5, replace from:to with 0:5. To select by position use df.iloc[FromRows:ToRows, FromCols:ToCols]; for rows 0–5 use 0:5 and for columns 1–3 use 1:3.
- Sort data by one or more columns with df.sort_values(by="column_name", ascending=True). Set ascending=False for descending order.
- Remove a column with df.drop("column_name", axis=1) and a row with df.drop(0, axis=0).
- Apply a custom function containing calculations or manipulation to a column or row with df["column_name"].apply(lambda x: x * 2).
- Group data by categorical values and apply aggregate functions with df.groupby("category")["value_column"].mean().
- Combine or merge different datasets with pd.merge(df1, df2, on="common_column").
- Convert data into a datetime format with pd.to_datetime(arg, format=None, errors='raise', exact=True, unit=None), where arg is the input to convert (e.g., a column, list, or Series); format is an optional string describing the input format (e.g., "%Y-%m-%d"); errors specifies how to handle invalid parsing ('raise', 'coerce', 'ignore'); and unit specifies the unit of the timestamp (e.g., 's' for seconds).
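The manipulation methods above can be combined on a small illustrative dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "B", "A", "B"],
    "value": [10, 40, 30, 20],
    "date": ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"],
})

top = df.sort_values(by="value", ascending=False)  # highest value first
doubled = df["value"].apply(lambda x: x * 2)       # element-wise transformation
means = df.groupby("category")["value"].mean()     # per-category averages
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")  # strings -> datetime
```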
Basic analytics methods that help to perform descriptive and exploratory analysis on a dataset:
- Count occurrences of unique values in a column with df["column_name"].value_counts().
- Calculate the mean, median, and standard deviation of numeric data with df["column_name"].mean(), df["column_name"].median(), and df["column_name"].std().
- Compute the correlation matrix between columns with df.corr(), or the correlation between two specific columns with df['column1'].corr(df['column2']).
- Create pivot tables for summarizing data with df.pivot_table(index="column1", values="column2", aggfunc="mean").
- Perform multiple aggregate functions on grouped data with df.groupby("category").agg({"value_column": ["mean", "sum"]}).
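A compact sketch of these analytics methods on sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B"],
    "value": [10, 20, 30, 50],
})

counts = df["category"].value_counts()  # occurrences of each category
mean_value = df["value"].mean()         # overall mean
pivot = df.pivot_table(index="category", values="value", aggfunc="mean")
agg = df.groupby("category").agg({"value": ["mean", "sum"]})
```

Note that `agg` has a two-level column index, so a cell is addressed as `agg.loc["B", ("value", "sum")]`.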
Using Matplotlib to visualize data
Representing data graphically using charts, graphs, and plots helps to make the data more accessible, interpretable, and actionable, especially for complex datasets. In quantitative studies, datasets involve multivariate relationships, trends, and patterns, where visualization plays a critical role in communication.
While tools like Excel, SPSS, and STATA are powerful, Python’s Matplotlib offers unique advantages such as:
- Customization and Flexibility: Control every aspect of the chart, from colour schemes and markers to labels and legends.
- Scalability for Large Datasets: Excel and SPSS struggle with very large datasets and require manual interventions for repetitive tasks. Matplotlib can handle large datasets programmatically and can also automate repetitive visualizations.
- Advanced Visualizations: Matplotlib supports advanced plots like heat maps, 3D plots, and subplots for multi-dimensional data.
- Cost-Effectiveness and Accessibility: Matplotlib is open-source and free to use, making it accessible to all.
While traditional tools like Excel, SPSS, and STATA are reliable and familiar, Matplotlib complements them by offering unmatched flexibility, scalability, and integration capabilities. Matplotlib works seamlessly with Pandas, making it ideal for structured dataset visualization. There are several components of Matplotlib:
- Figure: The overall window or page where your plot appears.
- Axes: The area where the data is plotted, bounded by the X and Y axes.
- Plot Elements: Titles, labels, legends, and gridlines that enhance readability.
To create a plot from a Pandas DataFrame, import Pandas and Matplotlib's pyplot module. Together they provide the functions for preparing the dataset and plotting the charts.
import matplotlib.pyplot as plt
import pandas as pd
# Sample data
data = {'Year': [2018, 2019, 2020, 2021],
'Literacy_Rate': [75.2, 76.3, 77.5, 78.6]}
df = pd.DataFrame(data)
# Create a line plot
plt.plot(df['Year'], df['Literacy_Rate'], marker='o', color='blue', linestyle='--')
# Add title and labels
plt.title("Literacy Rate Over Time")
plt.xlabel("Year")
plt.ylabel("Literacy Rate (%)")
# Show the plot
plt.grid(True) # Show gridlines
plt.show()

Key Matplotlib methods and their arguments
plot(x, y, color, marker, linestyle)
- x: Data for the x-axis.
- y: Data for the y-axis.
- color: Line colors like ‘r’ for red, ‘b’ for blue, etc.
- marker: Marker style like ‘o’ for circles, ‘s’ for squares.
- linestyle: Style of the plot line, such as '-', '--', ':'.
plt.plot([1, 2, 3], [4, 5, 6], color='g', marker='o', linestyle='--')
scatter(x, y, color, alpha, s)
- x, y: Data points for x and y axes.
- color: Color of the points like ‘red’ or ‘#ff5733’.
- alpha: Transparency of the points, a value between 0 and 1.
- s: Size of the markers as integers.
plt.scatter([1, 2, 3], [4, 5, 6], color='blue', alpha=0.7, s=100)
bar(x, height, color, width)
- x: Categories for the x-axis.
- height: Heights of the bars (values).
- color: Fill color of the bars.
- width: Width of the bars (default is 0.8).
plt.bar(['A', 'B', 'C'], [10, 20, 15], color='orange', width=0.5)
hist(x, bins, color, edgecolor)
- x: Data to create a histogram
- bins: Number of bins (default is 10)
- color: Fill color of the bars
- edgecolor: Color of the edges of the bars
plt.hist([1, 2, 2, 3, 3, 3, 4, 4, 5], bins=5, color='purple', edgecolor='black')
boxplot(x, notch, vert)
- x: List or array of data to be plotted
- notch: Whether to draw notches in the box (default is False)
- vert: Orientation of the plot (True for vertical, False for horizontal)
plt.boxplot([7, 8, 9, 10, 12, 15], notch=True, vert=True)
pie(x, labels, autopct, startangle)
- x: Sizes of each wedge
- labels: Labels for each wedge
- autopct: Format for displaying percentages (e.g., ‘%1.1f%%’)
- startangle: Angle to start the pie chart (default is 0)
plt.pie([20, 30, 50], labels=['A', 'B', 'C'], autopct='%1.1f%%', startangle=90)
title(label, fontsize, loc)
- label: Text for the title
- fontsize: Font size of the title
- loc: Location of the title ('left', 'center', 'right')
plt.title("My Plot Title", fontsize=14, loc='center')
xlabel(label, fontsize) and ylabel(label, fontsize)
- label: Text for the axis labels.
- fontsize: Font size of the labels.
plt.xlabel("X-Axis Label", fontsize=12)
plt.ylabel("Y-Axis Label", fontsize=12)
grid(visible, color, linestyle, linewidth)
- visible: Whether gridlines are visible (True or False)
- color: Color of the gridlines
- linestyle: Style of the gridlines ('--', ':')
- linewidth: Thickness of the gridlines
plt.grid(visible=True, color='gray', linestyle='--', linewidth=0.5)
legend(loc, fontsize)
- loc: Location of the legend (‘upper right’, ‘lower left’, etc.)
- fontsize: Font size of the legend text
plt.legend(loc='upper left', fontsize=10)
savefig(fname, dpi, bbox_inches)
- fname: File name or path (e.g., ‘plot.png’ or ‘folder/plot.pdf’)
- dpi: Dots per inch for resolution
- bbox_inches: Bounding box for the plot (‘tight’ for no extra white space)
plt.savefig("my_plot.png", dpi=300, bbox_inches='tight')
show(): Render and display the chart.
plt.show()
Exercise: Analyse e-commerce transactions & plot charts
In the context of analyzing money laundering, which often involves identifying suspicious transaction patterns and detecting anomalies, complete the following tasks:
- Download the dataset.
- Use Pandas to load the dataset and inspect its structure.
- Identify key columns in the dataset.
- Visualize the distribution of orders across categories using a histogram.
- Create a bar chart showing the top 10 customers by transaction count. Highlight customers with high transaction frequencies.
- Plot a line graph to analyze how order amounts change over time. Detect unusual spikes in order amounts on specific dates.
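One possible approach, sketched with a small hypothetical stand-in dataset. The column names used here (CustomerID, Category, OrderAmount, OrderDate) are assumptions; adapt them to the columns you identify in the real dataset:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line to view plots interactively
import matplotlib.pyplot as plt

# Hypothetical stand-in data; the real dataset's columns may differ
df = pd.DataFrame({
    "CustomerID": ["C1", "C2", "C1", "C3", "C1", "C2"],
    "Category": ["Books", "Toys", "Books", "Toys", "Games", "Books"],
    "OrderAmount": [120.0, 45.0, 3000.0, 60.0, 80.0, 55.0],
    "OrderDate": pd.to_datetime(
        ["2023-01-01", "2023-01-02", "2023-01-02",
         "2023-01-03", "2023-01-04", "2023-01-05"]),
})

# Inspect the structure (task 2) and identify key columns (task 3)
df.info()

# Distribution of orders across categories (task 4)
plt.figure()
df["Category"].value_counts().plot(kind="bar", title="Orders per Category")

# Top 10 customers by transaction count (task 5)
plt.figure()
top_customers = df["CustomerID"].value_counts().head(10)
top_customers.plot(kind="bar", color="orange", title="Top Customers by Transactions")

# Order amounts over time (task 6); sudden spikes may flag suspicious activity
plt.figure()
daily = df.groupby("OrderDate")["OrderAmount"].sum()
daily.plot(marker="o", linestyle="--", title="Order Amounts Over Time")
plt.savefig("order_amounts.png")
```

In this toy data, the spike on 2023-01-02 (a single 3000.0 order) is the kind of anomaly the line graph should surface.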