# R software and its useful tools for handling big data

In recent years, industries such as finance, healthcare, and information and communication technology (ICT) have witnessed dramatic changes as a result of the big data phenomenon. A report by the Consumer News and Business Channel (CNBC) (2015), for instance, highlighted the role of big data in the real estate sector and explained its evolution. This has given analysts opportunities to restructure raw data and offer related solutions (Catella, 2015). The role of R in identifying basic patterns in unstructured big data should therefore not be underestimated: it helps standardize unfamiliar characteristics and estimate important parameters (CoreLogic, 2013).

Furthermore, R offers a large collection of packages, which lets it handle both structured and unstructured data and makes it well suited to big data analysis.

## Simultaneous dataset handling

One advantage of R is its range of packages for managing large datasets simultaneously during interactive data analysis. One such package, bigmemory, implements basic data manipulation and supports different data types such as double, integer, short and char (Emerson & Kane, 2009). Consider an example of a real estate dataset containing several columns:

#load the packages (the summary() method for big.matrix objects comes from biganalytics)

library(bigmemory)

library(biganalytics)

#import the dataset (tab-separated values, stored as integers)

x <- read.big.matrix("data.csv", sep = "\t", type = "integer", shared = TRUE, col.names = c("age", "income", "rent", "builtyear", "areaincome"))

#summarise each column of the dataset

summary(x)
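A self-contained variant of the commands above may be useful for experimenting without a real data file. The sketch below assumes the bigmemory package is installed; the toy values and the temporary file are hypothetical and only reuse the column names from the example.

```r
library(bigmemory)

# Hypothetical toy data reusing the column names from the example above
toy <- data.frame(
  age        = c(34L, 45L, 29L),
  income     = c(52000L, 61000L, 38000L),
  rent       = c(1200L, 1500L, 900L),
  builtyear  = c(1998L, 2005L, 1987L),
  areaincome = c(48000L, 55000L, 41000L)
)

# Write the toy data to a temporary tab-delimited file with a header row
path <- tempfile(fileext = ".tsv")
write.table(toy, path, sep = "\t", row.names = FALSE,
            col.names = TRUE, quote = FALSE)

# Read the file into a big.matrix; header = TRUE takes column names from the file
x <- read.big.matrix(path, sep = "\t", header = TRUE, type = "integer")

# Indexing a big.matrix returns ordinary R vectors, so base functions apply
dim(x)
rent_mean <- mean(x[, "rent"])
rent_mean
```

Extracting a column with `x[, "rent"]` copies only that column into regular R memory, which is the usual way to work on one part of a large matrix at a time.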

Thus, the above commands help in storing large datasets and solving problems related to them: a large dataset is broken into simpler parts for further analysis. The following section explores data reduction packages, which retain only the variables relevant to the study.

## Data reduction packages

Part of what makes R effective is that its packages undergo continuous development, each to solve a particular problem. Additionally, R's advanced data reduction tools help declutter data to provide a clear perspective on the study. Regular use of data cleaning tools, together with estimating and replacing values, prepares the dataset for analysis and modeling. Packages such as dplyr simplify work with complex data through a small set of functions:

#load the package

library(dplyr)

#select a subset of rows in a data frame

filter()

#reorder rows by the values of one or more columns

arrange()

#keep only the columns of interest

select()

#keep only the unique rows in the table

distinct()

#collapse a data frame into a single summary row

summarise()

#take a sample of rows (fixed number)

sample_n()
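The functions listed above can be tried on a small data frame. The sketch below assumes dplyr is installed; the `homes` data frame and its values are hypothetical and reuse the column names from the real estate example.

```r
library(dplyr)

# Hypothetical toy data; rows 2 and 5 are deliberate duplicates
homes <- data.frame(
  age        = c(34, 45, 29, 52, 45),
  income     = c(52000, 61000, 38000, 75000, 61000),
  rent       = c(1200, 1500, 900, 1800, 1500),
  builtyear  = c(1998, 2005, 1987, 2012, 2005),
  areaincome = c(48000, 55000, 41000, 70000, 55000)
)

# Keep only homes built in 1998 or later
recent <- filter(homes, builtyear >= 1998)

# Order rows from highest to lowest rent
by_rent <- arrange(homes, desc(rent))

# Keep only the income and rent columns
slim <- select(homes, income, rent)

# Drop duplicated rows
unique_homes <- distinct(homes)

# Collapse the data frame into one summary row
overview <- summarise(homes, mean_rent = mean(rent), max_income = max(income))

# Draw a fixed number of rows at random (seed set for reproducibility)
set.seed(1)
picked <- sample_n(homes, 2)
```

Each function takes a data frame as its first argument and returns a new data frame, so the steps can be combined in any order without modifying `homes`.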

## Statistical modeling

R has powerful tools for fitting different models such as linear regression, Analysis of Variance (ANOVA), Analysis of Covariance (ANCOVA), logistic regression, log-linear regression, binary logistic regression and others (Buechler, 2007).

Fit a linear regression model:

Fit <- lm(y ~ x) #where y is the dependent variable and x is the independent variable

Fit a one-way ANOVA model:

Fit1 <- aov(y ~ x) #where y is the dependent variable and x is the independent (grouping) variable
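Both calls can be run end to end on a small dataset using only base R. The sketch below is illustrative: the `homes` data frame, its values and the `region` grouping variable are hypothetical.

```r
# Hypothetical toy data: income in thousands, monthly rent, and a grouping factor
homes <- data.frame(
  income = c(30, 40, 50, 60, 70, 80),
  rent   = c(720, 880, 1110, 1290, 1520, 1680),
  region = factor(c("north", "north", "south", "south", "east", "east"))
)

# Linear regression: rent is the dependent variable, income the independent one
fit <- lm(rent ~ income, data = homes)
coef(fit)     # estimated intercept and slope
summary(fit)  # standard errors, t-tests and R-squared

# One-way ANOVA: does mean rent differ across regions?
fit1 <- aov(rent ~ region, data = homes)
summary(fit1) # F-test for the region effect
```

Passing the data frame through the `data` argument keeps the formula readable and avoids relying on free-standing `x` and `y` vectors in the workspace.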

## Interpretation through graphical representation

This section discusses graphical representation. Among statistical software, R is considered effective for understanding data through graphs. Its graphics package produces barplots, histograms, stem-and-leaf plots and others. Furthermore, interactive and more advanced graphics can be developed using packages such as ggplot2, ggvis, dygraphs, DiagrammeR or threejs (Grolemund, 2017).
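The base plots mentioned above need no extra packages. The following sketch uses hypothetical rent and region values, and writes the plots to a temporary PDF file (an assumption made so that it also runs in a non-interactive session).

```r
# Hypothetical monthly rents and regions, invented for illustration
rent   <- c(720, 880, 1110, 1290, 1520, 1680, 950, 1340)
region <- c("north", "south", "east", "north", "south", "east", "north", "north")

# Send plots to a temporary PDF so the sketch also runs non-interactively
plot_file <- file.path(tempdir(), "rent_plots.pdf")
pdf(plot_file)

# Histogram of a numeric variable; hist() also returns the bin counts
h <- hist(rent, main = "Distribution of rent", xlab = "Rent")

# Barplot of counts per category
region_counts <- table(region)
barplot(region_counts, main = "Listings per region", ylab = "Count")

dev.off()

# stem() prints a text-based stem-and-leaf plot to the console
stem(rent)
```

In an interactive session the `pdf()` and `dev.off()` calls can be dropped, in which case the plots open in the default graphics window.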

Note that these packages provide different tools that help identify patterns through simultaneous data handling and data reduction techniques. Although R is a well-established language for statistical analysis and for developing data mining algorithms, it is memory-bound: it holds data in memory, which limits how much it can store and creates the need for extended memory (Rickert, 2011). In conclusion, its limitations relate to how much data its statistical procedures can handle. However, being open source software, R is an effective tool for mapping future trends (Kelley, Lai, & Wu, 2008).

## References

- Buechler, S. (2007). Statistical models in R. Retrieved from https://www3.nd.edu/~steve/Rcourse/Lecture7v1.pdf
- Catella. (2015). *Big data in the real estate sector – a big opportunity or a big threat?* Retrieved from https://www.catella.com/Documents/Germany Property Funds/02_Research/01_Studien/Catella Resarch_Big_Data_ 2015_english.pdf
- CNBC. (2015, August). How big data is transforming real estate. *CNBC*.
- CoreLogic. (2013). *REAL ESTATE*. Retrieved from https://www.corelogic.com/downloadable-docs/real-estate-analytics-commercialtrends.pdf
- Cran R. (2016). Introduction to dplyr. Retrieved March 30, 2017, from https://cran.r-project.org/web/packages/dplyr/vignettes/introduction.html
- Emerson, J., & Kane, M. (2009). The R Package bigmemory: Supporting Efficient Computation and Concurrent Programming with Large Data Sets. *Euler.Stat.Yale.Edu*. Retrieved from http://onlinelibrary.wiley.com/doi/10.1002/wics.10/full and http://euler.stat.yale.edu/~mjk56/temp/bigmemory-vignette.pdf
- Grolemund, G. (2017). Quick list of useful R packages – RStudio Support. Retrieved March 30, 2017, from https://support.rstudio.com/hc/en-us/articles/201057987-Quick-list-of-useful-R-packages
- Kelley, K., Lai, K., & Wu, P.-J. (2008). Using R for data analysis: A best practice for research. *Best Practices in Quantitative Methods*, 535–572. https://doi.org/10.4135/9781412995627.d40
- Rickert, J. (2011). *Big Data Analysis with Revolution R Enterprise*. Revolution Analytics. Retrieved from http://www.revolutionanalytics.com/why-revolution-r/whitepapers/Big-Data-WP.pdf

Priya is the co-founder and Managing Partner of Project Guru, a research and analytics firm based in Gurgaon. She is responsible for the human resource planning and operations functions. Her expertise in analytics has been used in a number of service-based industries like education and financial services.

Her foundational education is from St. Xaviers High School (Mumbai). She also holds an MBA degree in Marketing and Finance from the Indian Institute of Planning and Management, Delhi (2008).

Some of the notable projects she has worked on include:

- Using systems thinking to improve sustainability in operations: A study carried out in Malaysia in partnership with Universiti Kuala Lumpur.
- Assessing customer satisfaction with in-house doctors of Jiva Ayurveda (a project executed for the company)
- Predicting the potential impact of green hydrogen microgrids (a project executed for the Government of South Africa)

She is a key contributor to the in-house research platform Knowledge Tank.

She currently holds over 300 citations from her contributions to the platform.

She has also been a guest speaker at various institutes such as JIMS (Delhi), BPIT (Delhi), and SVU (Tirupati).
