# Problems faced during statistical analysis using panel data with STATA

In one of my recent projects, I had to use panel data for analysis. During the data analysis, I faced some problems which may be the most common problems in panel data analysis. So here are some of the problems with their possible solutions that helped me.

## Importing panel data into STATA

**Problem:** The first step in any statistical analysis is importing the data into the statistical software. In my case, I had to import the data from Excel sheets. Unfortunately, older versions of STATA (before version 12, which added the “import excel” command) cannot read an Excel sheet saved as xls or xlsx directly.

**Solution:** I exported the Excel sheet in CSV (MS-DOS) format and then imported it into STATA.
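Here is a minimal sketch of the import step. The file name `panel_data.csv` and the values are hypothetical stand-ins for whatever you exported from Excel. `import delimited` is available from STATA 13; earlier releases use `insheet` for the same job.

```stata
* Stand-in for the CSV exported from Excel (hypothetical name and values)
clear
input str1 country int year double var1
"A" 2001 1.5
"A" 2002 1.7
"B" 2001 2.1
"B" 2002 2.4
end
export delimited using "panel_data.csv", replace

* The actual import step: first row of the file holds the variable names
import delimited using "panel_data.csv", varnames(1) clear

* Check that the variables arrived with the expected names and types
describe
```

If your STATA release has `import excel`, something like `import excel using "panel_data.xlsx", firstrow clear` reads the workbook directly and skips the CSV export.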

## Panel data management

**Problem:** One of the major problems faced during the panel data analysis was data management. If the data is not arranged properly then it is very difficult to get the regression results. Even if the results are obtained, they will not be robust.

**Solution:** For panel data analysis the data should be saved in long format: one row per unit per time period, with a numeric identifier for the unit and a variable for the time period. For example, if we have data for 5 countries (A to E) for 5 years (2001 to 2005), the data should be arranged as follows.

| Country | Id | T (time period) | Variable 1 | Variable 2 | Variable 3 |
|---|---|---|---|---|---|
| A | 1 | 2001 | | | |
| A | 1 | 2002 | | | |
| A | 1 | 2003 | | | |
| A | 1 | 2004 | | | |
| A | 1 | 2005 | | | |
| B | 2 | 2001 | | | |
| B | 2 | 2002 | | | |
| … | … | … | | | |
| E | 5 | 2004 | | | |
| E | 5 | 2005 | | | |
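Once the data is in this layout, STATA needs to be told which variable identifies the unit and which one is the time period. The sketch below builds a small simulated panel with the same shape as the table (the variable names `id`, `year` and `var1` are my own placeholders) and declares it with `xtset`.

```stata
* Build a small long-format panel: 5 countries x 5 years (simulated values)
clear
set seed 1
set obs 25
generate id = ceil(_n / 5)             // 1..5, one block of rows per country
bysort id: generate year = 2000 + _n   // 2001..2005 within each country
generate var1 = runiform()             // placeholder data, not real figures

* Declare the panel structure: unit identifier first, then the time variable
xtset id year

* Show the pattern of observations per unit (useful for spotting gaps)
xtdescribe
```

`xtset` complains if the same unit and time period appear twice, which makes it a cheap first check that the data is arranged properly.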

## String variable

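**Problem:** While conducting the analysis in STATA, one common problem I faced was string variables. If a variable is stored as a string, it cannot be used in most statistical commands.

**Solution:** A string variable can be converted to a numeric one (float or long) using the STATA commands “destring” or “encode”. “destring” is meant for numbers that were read in as text, while “encode” turns text categories (such as country names) into labelled integer codes. With “destring” we can either replace the string variable or create a new variable; “encode” always creates a new variable.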
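A short sketch of both commands on a toy dataset. The variable names (`country`, `gdp_txt`) and the values are made up for illustration.

```stata
* Toy data: a text category and a number that was read in as text
clear
input str10 country str6 gdp_txt
"A" "1,200"
"B" "950"
end

* destring: for numbers stored as text; ignore(",") drops thousands separators
destring gdp_txt, generate(gdp) ignore(",")

* encode: for text categories; creates a labelled numeric variable
encode country, generate(country_id)

describe
```

Using `generate()` keeps the original string variable so you can compare before and after; `destring gdp_txt, replace ignore(",")` would overwrite it instead.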

## Descriptive analysis of panel data

**Problem:** Since panel data combines the time-series and cross-sectional dimensions, the usual descriptive statistics are not very informative, because they do not separate the variation across units from the variation over time.

**Solution:** For the descriptive analysis of panel data, I found the “xtsum” command very useful. It presents the overall, “between” and “within” statistics in one table.
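A minimal sketch on simulated data (the names `id`, `year` and `var1` are placeholders). The data has to be declared as a panel before the panel summary command will run.

```stata
* Simulated panel: 5 units x 5 years
clear
set seed 1
set obs 25
generate id = ceil(_n / 5)
bysort id: generate year = 2000 + _n
generate var1 = id + rnormal()   // unit-specific level plus noise

xtset id year

* Overall, between (across units) and within (over time) statistics
xtsum var1
```

In the output, the “between” row describes the unit averages and the “within” row describes deviations from each unit's own average.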

## Various tests performed in the analysis

While performing regression analysis using panel data, it is important to check the basic assumptions. These assumptions can be tested using the following tests:

#### Normality test

One of the assumptions usually checked in panel data regression is **normality** of the residuals. In STATA, normality can be tested using the following procedure:

- Run the regression
- Predict the residuals

Now normality can be checked either with a histogram of the residuals or with the Jarque-Bera test.

#### Jarque-Bera test

Null hypothesis: Normality

Alternative hypothesis: Non-normality

In the results, if the p-value is greater than 0.05 (not significant at the 5% level) then we cannot reject the null hypothesis, which means the residuals are consistent with normality.
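The procedure above can be sketched as follows, on simulated data with placeholder names (`y`, `x1`). One assumption to flag: I use `sktest` here, which is STATA's built-in skewness and kurtosis test for normality. It tests the same null hypothesis using the same two moments as Jarque-Bera, but it is not numerically identical to it.

```stata
* Simulated data standing in for the real panel
clear
set seed 1
set obs 50
generate id = ceil(_n / 10)
bysort id: generate year = 2000 + _n
generate x1 = rnormal()
generate y  = 1 + 0.5 * x1 + rnormal()

* Step 1: run the regression
regress y x1

* Step 2: predict the residuals into a new variable
predict resid, residuals

* Graphical check: histogram with a normal curve overlaid
histogram resid, normal

* Numerical check: H0 is that the residuals are normally distributed
sktest resid
```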

#### Testing for heteroskedasticity

Heteroskedasticity exists when the variance of the error term is not constant across observations, which violates a basic assumption of the regression model. It is important to test for it in panel data as well. One can test for heteroskedasticity in STATA either graphically using “rvfplot” or numerically through the Breusch-Pagan test (“estat hettest”); both are run after “regress”.

In the Breusch-Pagan test, the null hypothesis is homoskedasticity, i.e.

Null hypothesis: Homoskedasticity

Alternative hypothesis: Heteroskedasticity

In the results, if the p-value is more than 0.05 (5%) then we cannot reject the null hypothesis of homoskedasticity.
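Both checks can be sketched like this on simulated data (variable names are placeholders). `rvfplot` and `estat hettest` are post-estimation commands, so they use whatever `regress` model was run last.

```stata
* Simulated data standing in for the real panel
clear
set seed 1
set obs 50
generate id = ceil(_n / 10)
bysort id: generate year = 2000 + _n
generate x1 = rnormal()
generate y  = 1 + 0.5 * x1 + rnormal()

regress y x1

* Graphical check: residuals against fitted values, with a reference line at 0
rvfplot, yline(0)

* Breusch-Pagan / Cook-Weisberg test; H0: constant variance
estat hettest
```

In the plot, a fan or funnel shape in the residuals is the usual visual hint of heteroskedasticity.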

#### Testing the serial correlation

Serial correlation, including higher-order serial correlation, can be tested using the Breusch-Godfrey test. In STATA this is a post-estimation command for “regress” and the data must first be declared as time series. It can be performed using the following steps:

- Run the regression
- Conduct the BG test using the command “estat bgodfrey, lags(1)”, where lags(1) indicates that one lag is used in the test.

In the results, if the p-value is not significant then we cannot reject the null hypothesis of “No serial correlation”.

However, if the p-value is significant then we reject the null hypothesis, which means that there is serial correlation. One way to reduce the serial correlation is to add the lag of the dependent variable as one of the independent variables.
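A sketch of the steps on simulated data (names are placeholders). An assumption worth stating: `estat bgodfrey` is documented as a time-series diagnostic for `regress`, so this sketch keeps one unit's series and declares it with `tsset`. Whether the command accepts a sample that spans several panels is something to check in your own STATA version.

```stata
* Simulated panel: 5 units x 20 years
clear
set seed 1
set obs 100
generate id = ceil(_n / 20)
bysort id: generate year = 2000 + _n
generate x1 = rnormal()
generate y  = 1 + 0.5 * x1 + rnormal()

* Keep a single unit and declare it as a time series
keep if id == 1
tsset year

* Step 1: run the regression; Step 2: Breusch-Godfrey test with one lag
regress y x1
estat bgodfrey, lags(1)

* Remedy from the text: add the lagged dependent variable and test again
regress y L.y x1
estat bgodfrey, lags(1)
```

`L.y` is STATA's lag operator, so there is no need to create the lagged variable by hand once the data is `tsset` or `xtset`.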

#### Testing unit root

Unit roots in panel data can be tested using either the Levin-Lin-Chu test or the Hadri LM stationarity test. The two tests have opposite null hypotheses. For the Levin-Lin-Chu test:

Null hypothesis: Panels contain unit roots

Alternative hypothesis: Panels are stationary

In the Levin-Lin-Chu results, if the p-value is less than 0.05 then we can reject the null hypothesis in favour of the alternative that the panels are stationary. For the Hadri LM test the hypotheses are reversed: the null is that all panels are stationary and the alternative is that some panels contain unit roots, so a p-value below 0.05 is evidence of a unit root. The first difference can be tested in the same way. The only thing to keep in mind is that before testing the first difference one must create a new variable (calculated by subtracting the value in time period t-1 from the value in time period t).

Now, if the results from the unit root test show that the data is stationary then we can go ahead with further analysis. However, if the results show that our data is non-stationary then we can check stationarity in the first difference. If the first difference is also not stationary, then check the second difference, and so on.
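A sketch of the sequence using `xtunitroot` on a simulated, strongly balanced panel. I build `y` as a random walk so there is something non-stationary to difference; all names are placeholders. Note the two tests read in opposite directions: Levin-Lin-Chu has a unit root under the null, while Hadri LM has stationarity under the null.

```stata
* Simulated balanced panel: 5 units x 20 years, y is a random walk per unit
clear
set seed 1
set obs 100
generate id = ceil(_n / 20)
bysort id: generate year = 2000 + _n
bysort id (year): generate y = sum(rnormal())
xtset id year

* Levin-Lin-Chu; H0: panels contain unit roots
xtunitroot llc y

* Hadri LM; H0: all panels are stationary
xtunitroot hadri y

* First difference: value in t minus value in t-1, using the D. operator
generate d_y = D.y

* Repeat the test on the differenced series
xtunitroot llc d_y
```

The first year of each unit has no previous period, so `d_y` is missing there and the differenced test runs on one fewer period per unit.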

## Choosing between random effect and fixed effect in panel data analysis

Another major problem in panel data analysis is choosing between the fixed effect and random effect models. The Hausman test can be used to decide between them, and it can be performed in STATA as follows:

Null hypothesis: Random effect model is appropriate.

Alternative hypothesis: Fixed effect model is appropriate

To run the test:

- Run the regression (fixed effect).
- Store the estimates.
- Run the regression (random effect).
- Store the estimates.
- Conduct the Hausman test (STATA command: *hausman fixed random*, where “fixed” and “random” are the names under which the two sets of estimates were stored).

After running the Hausman test, if the p-value is significant at 5% then we reject the null hypothesis in favour of the alternative, i.e. we should use the fixed effect model.
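The steps above can be sketched as follows. The data is simulated with a unit effect `u` that is correlated with the regressor, and every name here (`y`, `x1`, `u`, and the stored names `fixed` and `random`) is a placeholder of mine.

```stata
* Simulated panel: 5 units x 20 years with a unit-specific effect u
clear
set seed 1
set obs 100
generate id = ceil(_n / 20)
bysort id: generate year = 2000 + _n
bysort id (year): generate u = rnormal() if _n == 1
bysort id (year): replace u = u[1]          // copy the draw to all years
generate x1 = rnormal() + u                 // regressor correlated with u
generate y  = 1 + 0.5 * x1 + u + rnormal()
xtset id year

* Fixed effect model, then store the estimates
xtreg y x1, fe
estimates store fixed

* Random effect model, then store the estimates
xtreg y x1, re
estimates store random

* Hausman test; H0: random effect model is appropriate
hausman fixed random
```

The order of the names matters: `hausman` expects the always-consistent estimator (fixed effect) first and the efficient one (random effect) second.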
