Data preprocessing for analysis and visualization in Power BI
Data preprocessing in Power BI involves preparing, cleaning, and shaping raw data into a suitable format for analysis and visualization. This process is typically done using Power Query within Power BI. With data formatting and filtering capabilities, users can work with data from diverse sources, making integrating and analysing information from various sources easier.
Raw data is the unprocessed and unorganized data collected directly from a source. It is the initial form of data before any processing and analysis. Raw data is typically messy, and unstructured, and may contain errors, inconsistencies, or missing values.
Preprocessing or preparing the data in Power BI involves using the Power Query Editor. To start preprocessing a dataset:
- Download this sample data,
- open Power BI and load the data set,
- in the Home tab, click on Transform data to open the Power Query Editor.
Data preprocessing to remove null values from raw data
Datasets often contain null or missing values and duplicates for various reasons. Most of the time records are deliberately saved with empty values and duplicates due to oversight, errors, or limitations. Not all data fields apply to every record in a dataset, so those fields might remain empty. Technical problems during data capture or transmission can also lead to missing values and duplicates.
Removing null values and duplicates before analysis ensures the quality and reliability of the insights derived from the data. If null values are present in key metrics or fields, calculations such as averages, sums, and ratios can be skewed or lead to incorrect results.
Replacing NULL values
- In the Query Editor’s table select the column that contains null values.
- In the Power Query Editor, go to the Transform tab.
- Click on Replace Values.
- In the Replace Values dialogue box, type null in the “Value to Find” field and enter none to replace nulls within the “Replace With”. If all nulls or some nulls are not replaced, then edit the query replace null with “null” and then press enter. This is to find and replace nu ll as text.
- Click OK to proceed.
Formatting the data
Data formatting is primarily required to ensure that data is presented clearly, accurately, and in a manner that enhances the overall effectiveness of the reports and dashboards. Different data types need appropriate formatting to ensure that they are interpreted correctly.
Properly formatted data labels in charts and graphs can emphasize critical metrics, making it easier for viewers to spot important information immediately. Decision-makers rely on clear and concise information. Formatting helps in presenting the data in a way that is easy to digest, reducing the cognitive load on users.
Power BI provides various formatting options to adjust how data appears while dealing with numbers, dates, text, or visual elements.
- In Power Query Editor, select the column to format.
- Right-click on the selected column and go to Change type.
- Choose the appropriate type to apply to the whole column.
Limiting records is an important step in data preprocessing
Data set filtering refers to the process of limiting or refining the number of records or rows that are displayed in reports, visuals, or dashboards based on a specific criterion. This helps to focus on a particular subset of data. Filtering is a fundamental aspect of data analysis to uncover insights, trends and anomalies that might be hidden within a large dataset.
In Power Query, include or exclude rows based on a column value. Advanced filters provide more complex filtering options, such as specifying multiple conditions or ranges. Furthermore, Power BI also enables more complex and custom filter equations and expressions. Below are some basic filters that can be selected from the interface.
- Text filters are built on regressive equations to find characters in a column. They are Equals, Does Not Equal, Begins With, Does Not Begin With, Ends With, Does Not End With, Contains, and Does Not Contain.
- Number filters look for number values or ranges in a column that Equals, Does Not Equal, Greater Than, Greater Than or Equal To, Less Than, Less Than Or Equal To, or Between.
- Date/Time Filters can also be used to look for specific dates and ranges with Equals, Before, After, Between, In the Next, In the Previous, Is Earliest, Is Latest, Is Not Earliest, Is Not Latest, and Custom Filter.
Filter by removing rows
The ‘Remove Rows’ feature in Power BI is used to clean and prepare data by eliminating unnecessary or irrelevant rows from the dataset. This feature is essential in cases where the dataset contains rows where all or most fields are blank or contain null values.
Removing duplicate rows
Furthermore, it can also be used to Remove Duplicates to keep unique records. Often datasets include outliers or data points that are not relevant to an analysis. Use Remove Rows to exclude such records.
- Select the column or hold the ctrl key and select multiple columns to check for duplicates.
- Go to the Home tab in the Power Query Editor.
- Click on Remove Rows.
- Select Remove Duplicates.
Pivoting the data
Both Pivoting and Unpivoting operations are commonly performed in the Power Query Editor as part of the data preparation process. These operations are used to reshape the data, making it easier to analyze and visualize.
Pivoting is the process of transforming rows of data into columns. This technique is often used to summarize data or to rearrange it so that different categories of data appear as separate columns, making it easier to compare across categories.
Unpivoting the data
Unpivoting is the opposite of pivoting. It transforms columns into rows. This operation is used to normalize data, converting it from a wide format with many columns into a long format with fewer columns and more rows. It is helpful in changing categorical data from different columns into a single column, making it easier to apply filters, create visuals, or perform calculations.
To unpivot :
- Select the columns to unpivot.
- Go to the Transform tab and select Unpivot Columns.
- The selected columns will be unpivoted, with the column headers becoming values in a new column, and the corresponding data moved into rows.
Remove a data preprocessing step or apply changes and exit
In Query Editor, it’s often necessary to decide between removing or undoing a mistake or a transformation that did not generate the desired result. Removing the applied step will revert the data to its previous state. On the right side, locate the Applied Steps pane, listing each transformation that has been applied. Hover over the step to remove and click the small “X”. This will remove the step, and the data will revert to the state before that step is applied.
When all necessary transformations and steps have been completed and the dataset is ready to be loaded into Power BI for analysis, go to the home tab and click on Close and Apply.
I work as an editor and writer for Project Guru. I have a keen interest in new and upcoming learning and teaching methods. I have worked on numerous scholarly projects in the fields of management, marketing and humanities in the last 10 years. Currently, I am working in the footsteps of the National Education Policy of India to help and support fellow professors to emphasise interdisciplinary research and curriculum design.
I am a Senior Analyst at Project Guru, a research and analytics firm based in Gurugram since 2012. I hold a master’s degree in economics from Amity University (2019). Over 4 years, I have worked on worked on various research projects using a range of research tools like SPSS, STATA, VOSViewer, Python, EVIEWS, and NVIVO. My core strength lies in data analysis related to Economics, Accounting, and Financial Management fields.
Discuss