# Problems faced during statistical analysis using panel data with STATA

In one of my recent projects I had to analyse panel data. Along the way I ran into several problems that are probably among the most common in panel data analysis. Here are those problems, together with the solutions that helped me.

## Importing panel data into STATA

**Problem:** The first step in any statistical analysis is to import the data into the statistical software. In my case the data were stored in Excel sheets. Unfortunately, older versions of **STATA** do not read Excel files saved as xls or xlsx (from version 12 onwards the “import excel” command reads them directly).

**Solution:** I exported the Excel sheet in CSV (MS-DOS) format and then imported it into STATA.
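As a rough sketch of the import step, the commands below read a CSV export and, alternatively, the Excel file itself. The file names and the sheet name are hypothetical placeholders. “import delimited” assumes STATA 13 or later (older versions use “insheet using”), and “import excel” assumes STATA 12 or later.

```stata
* Hypothetical file name; replace with the path to your own CSV export.
* varnames(1) tells STATA that the first row holds the variable names.
import delimited "panel_data.csv", varnames(1) clear

* Alternative for STATA 12 or later: read the Excel file directly.
* The sheet name "Sheet1" is an assumption; firstrow uses row 1 as names.
import excel using "panel_data.xlsx", sheet("Sheet1") firstrow clear
```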

## Panel data management

**Problem:** One of the major problems I faced during the panel data analysis was data management. If the data are not arranged properly, it is very difficult to get the regression to run at all. Even if results are obtained, they may not be reliable.

**Solution:** For panel data analysis the data should be stored in long format, with one row per unit and time period. For example, if we have data for 5 countries over 5 years, the rows for one country (country A in this case) come first, followed by the rows for the other countries, as in the following format.

| Country | Id | T (time period) | Variable 1 | Variable 2 | Variable 3 |
|---|---|---|---|---|---|
| A | 1 | 2001 | | | |
| A | 1 | 2002 | | | |
| A | 1 | 2003 | | | |
| A | 1 | 2004 | | | |
| A | 1 | 2005 | | | |
| B | 2 | 2001 | | | |
| B | 2 | 2002 | | | |
| ... | ... | ... | | | |
| E | 5 | 2004 | | | |
| E | 5 | 2005 | | | |
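Once the data are in this long layout, STATA has to be told which variable identifies the unit and which one identifies time. The sketch below builds an empty 5 country by 5 year skeleton purely for illustration (the variable names “id” and “t” follow the table above) and declares it as a panel with “xtset”.

```stata
* Build an illustrative 5 x 5 skeleton: 5 units, years 2001 to 2005.
clear
set obs 25
generate id = ceil(_n/5)                 // 1,1,1,1,1,2,2,...
bysort id: generate t = 2000 + _n        // 2001..2005 within each id

* Declare the panel structure: unit identifier first, then time.
xtset id t

* Summarise the pattern of the panel (balanced or not, gaps, etc.).
xtdescribe
```

With real data only the “xtset” and “xtdescribe” lines are needed, applied to your own identifier and time variables.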

## String variable

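**Problem:** While conducting the analysis in STATA, one common problem I faced was that of string variables. If a variable is stored as a string, it is not possible to use it in most analyses.

**Solution:** A string variable can be converted to a numeric (float or long) format using the STATA commands “destring” or “encode”. “destring” suits numbers that were stored as text, while “encode” suits text categories such as country names, which it maps to numeric codes with value labels. With “destring” we can either replace the string variable or create a new variable; “encode” always creates a new variable.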
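The difference between the two commands is easiest to see on a small made-up dataset. The values and variable names below are invented for illustration only.

```stata
* Tiny made-up dataset with two string variables.
clear
input str1 country int year str8 gdp_text
"A" 2001 "1,200"
"A" 2002 "1,350"
"B" 2001 "980"
"B" 2002 "1,010"
end

* destring: for numbers stored as text.
* ignore(",") strips the thousands separator before converting.
destring gdp_text, generate(gdp) ignore(",")

* encode: for text categories; creates numeric codes with value labels.
encode country, generate(id)

describe
```

Here “gdp” and “id” are new numeric variables, and the original strings are left untouched. Using “destring gdp_text, replace ignore(",")” would overwrite the string variable instead.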

## Descriptive analysis of panel data

**Problem:** Since panel data combine time-series and cross-sectional dimensions, the usual descriptive statistics are not very informative on their own.

**Solution:** For descriptive analysis of panel data, I found the “xtsum” command very useful. It reports the overall, “between” (across units) and “within” (over time) variation of each variable in one table.
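A minimal sketch of this, assuming an internet connection so that “webuse” can download the Grunfeld investment panel that ships with STATA's documentation (this dataset is my choice for illustration, not the data used in the project):

```stata
* Example panel: 10 companies observed over 20 years.
webuse grunfeld, clear
xtset company year

* Overall, between and within summary statistics in one table.
xtsum invest mvalue kstock
```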

## Various tests performed in the analysis

While performing regression analysis using panel data, it is important to check the basic assumptions. These assumptions can be tested using the following tests:

#### Normality test

One of the basic assumptions to check in a panel data regression is **normality** of the residuals. In STATA normality can be tested using the following procedure:

- Run the regression
- Predict the residuals

Normality can then be checked either visually, through a histogram of the residuals, or formally, using the Jarque-Bera test.

#### Jarque-Bera test

Null hypothesis: Normality

Alternative hypothesis: Non-normality

If the p value is not significant at the 5% level, we cannot reject the null hypothesis, which means the residuals are consistent with normality.
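The steps above can be sketched as follows. The example assumes a pooled OLS regression on the Grunfeld data from “webuse” (an illustrative choice). Since the Jarque-Bera test is not a single built-in command in every installation, the sketch computes the statistic by hand from the skewness and kurtosis of the residuals, using JB = N/6 × (S² + (K − 3)²/4), which is compared with a chi-squared distribution with 2 degrees of freedom.

```stata
webuse grunfeld, clear

* Step 1: run the regression (pooled OLS, for illustration).
regress invest mvalue kstock

* Step 2: predict the residuals.
predict resid, residuals

* Visual check: histogram with a normal density overlaid.
histogram resid, normal

* Jarque-Bera statistic computed from skewness and kurtosis.
quietly summarize resid, detail
scalar jb   = r(N)/6 * (r(skewness)^2 + (r(kurtosis) - 3)^2/4)
scalar jb_p = chi2tail(2, jb)
display "Jarque-Bera statistic = " jb "   p-value = " jb_p
```

STATA's built-in “sktest resid” is a closely related skewness and kurtosis test that can be used as a cross-check.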

#### Testing for heteroskedasticity

If the variance of the error term is not constant across observations, heteroskedasticity exists, which violates a basic assumption of the regression model. It is important to test for heteroskedasticity in panel data as well. In STATA this can be done either graphically, using “rvfplot”, or numerically, through the Breusch-Pagan test.

In the Breusch-Pagan test the hypotheses are:

Null hypothesis: Homoskedasticity

Alternative hypothesis: Heteroskedasticity

If the p value is greater than 0.05 (5%), we cannot reject the null hypothesis of homoskedasticity. If it is smaller than 0.05, we reject the null hypothesis and conclude that heteroskedasticity is present.
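A short sketch of both checks, again assuming a pooled OLS regression on the illustrative Grunfeld data. Both “rvfplot” and “estat hettest” are postestimation commands for “regress”, so they have to be run right after it.

```stata
webuse grunfeld, clear
regress invest mvalue kstock

* Graphical check: residuals against fitted values.
* A funnel shape around the zero line suggests heteroskedasticity.
rvfplot, yline(0)

* Numerical check: Breusch-Pagan test, H0 = constant variance.
estat hettest
```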

#### Testing for serial correlation

Higher order serial correlation can be tested using the Breusch-Godfrey test, which can be performed using the following steps:

- Run the regression
- Conduct the BG test using the command “estat bgodfrey, lags(1)”, where lags(1) indicates that we have taken one lag for the test.

If the p value is not significant, we cannot reject the null hypothesis of “No serial correlation”.

However, if the p value is significant, we reject the null hypothesis, which means that there is serial correlation. One common way to reduce serial correlation is to add the lag of the dependent variable as one of the independent variables.
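The two steps can be sketched as below. As an assumption made for this example, the data are restricted to a single company from the illustrative Grunfeld dataset and declared with “tsset”, because “estat bgodfrey” is documented as a time-series postestimation command for “regress”.

```stata
webuse grunfeld, clear

* Keep one company so the data form a single time series (assumption
* made for this sketch), then declare the time variable.
keep if company == 1
tsset year

* Step 1: run the regression.
regress invest mvalue kstock

* Step 2: Breusch-Godfrey test with one lag, H0 = no serial correlation.
estat bgodfrey, lags(1)

* Possible remedy from the text: add the lagged dependent variable.
regress invest L.invest mvalue kstock
```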

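#### Testing for unit roots

Unit roots in panel data can be tested using either the Levin-Lin-Chu test or the Hadri LM stationarity test. The hypotheses below are those of the Levin-Lin-Chu test; the Hadri LM test reverses them, taking stationarity of all panels as the null hypothesis and the presence of unit roots in some panels as the alternative.

Null hypothesis: Panels contain unit roots

Alternative hypothesis: Panels are stationary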

If the p value is less than 0.05, we can reject the null hypothesis in favour of the alternative hypothesis. The unit root of the first difference can be tested in the same way. The only thing to keep in mind is that before testing the first difference one must create a new variable (calculated by subtracting the value of the variable in time period t-1 from its value in time period t).

If the results of the unit root test show that the data are stationary, we can go ahead with further analysis. However, if the results show that the data are non-stationary, we can check stationarity in the first difference. If the first difference is also not stationary, we check the second difference, and so on.
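A sketch of this procedure with the “xtunitroot” command, assuming the illustrative Grunfeld panel from “webuse” (a balanced panel, which the Levin-Lin-Chu test requires). The variable name “d_invest” is made up for the example.

```stata
webuse grunfeld, clear
xtset company year

* Levin-Lin-Chu test in levels: H0 = panels contain unit roots.
xtunitroot llc invest

* Hadri LM test in levels: H0 = all panels are stationary.
xtunitroot hadri invest

* First difference: value in period t minus value in period t-1.
* D. is STATA's difference operator and respects the panel structure.
generate d_invest = D.invest

* Repeat the test on the differenced variable.
xtunitroot llc d_invest
```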

## Choosing between random effect and fixed effect in panel data analysis

Another major problem in panel data analysis is choosing between the fixed effect and the random effect model. The Hausman test helps with this choice, and it can be performed in STATA as follows:

Null hypothesis: Random effect model is appropriate.

Alternative hypothesis: Fixed effect model is appropriate.

To run the test:

- Run the regression (fixed effect).
- Store the estimates (for example under the name “fixed”).
- Run the regression (random effect).
- Store the estimates (for example under the name “random”).
- Conduct the Hausman test (STATA command: *hausman fixed random*).

After running the Hausman test, if the p value is significant at the 5% level we reject the null hypothesis in favour of the alternative hypothesis, i.e. we should use the fixed effect model. Otherwise the random effect model is appropriate.
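The steps of the test can be sketched as below, assuming the illustrative Grunfeld panel from “webuse”. The names “fixed” and “random” are just labels given to the stored estimates.

```stata
webuse grunfeld, clear
xtset company year

* Fixed effect regression, then store the estimates.
xtreg invest mvalue kstock, fe
estimates store fixed

* Random effect regression, then store the estimates.
xtreg invest mvalue kstock, re
estimates store random

* Hausman test: the consistent (fixed) estimator is listed first.
hausman fixed random
```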
