Sunday, April 17, 2016

Simple data manipulation on an R data frame

Given a data frame df with a variable named Var, I can extract the vector containing all the observations for that variable by joining the data frame and the variable name with a dollar sign:
df$Var
Now, for that variable I can get its mean:
mean(df$Var)
Standard deviation:
sd(df$Var)
Summary (which also works on a whole data frame):
summary(df$Var)
Index of the observation with the minimum value of the variable (the first one, if there are ties):
which.min(df$Var)
Index of the observation with the maximum value of the variable (again the first one, if there are ties):
which.max(df$Var)
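
Putting the calls above together, here is a minimal sketch on a tiny data frame. The values are made up for illustration, and the names v, m, s, i_min and i_max are my own, not part of any API:

```r
# Hypothetical data frame with a single numeric variable (values invented).
df <- data.frame(Var = c(12, 7, 30, 7, 18))

v <- df$Var              # plain numeric vector with all the observations
m <- mean(v)             # arithmetic mean
s <- sd(v)               # sample standard deviation
print(summary(v))        # minimum, quartiles, median, mean, maximum

i_min <- which.min(v)    # index of the first minimum
i_max <- which.max(v)    # index of the first maximum

# The indexes can be used to pull the matching rows out of the data frame.
print(df[i_min, , drop = FALSE])
print(df[i_max, , drop = FALSE])
```

Since which.min() and which.max() return a position and not a value, they are handy when the data frame has other columns and I want the whole row for the extreme observation.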

It is easy to generate a scatter plot that shows how two variables in a data frame relate to each other:
plot(df$Var1, df$Var2)
Here Var1 goes on the X axis and Var2 on the Y axis.
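
As a sketch of the same call with readable labels, assuming a small invented data frame (the values and the title are placeholders):

```r
# Hypothetical data frame with two numeric variables (values invented).
df <- data.frame(
  Var1 = c(50, 120, 150, 200, 101),
  Var2 = c(10, 60, 40, 49, 50)
)

# First argument goes on the X axis, second on the Y axis.
# xlab, ylab and main are standard plot() arguments for the labels and title.
plot(df$Var1, df$Var2,
     xlab = "Var1", ylab = "Var2",
     main = "Var2 against Var1")
```

Without xlab and ylab, R labels the axes with the literal expressions df$Var1 and df$Var2, which gets noisy quickly.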

We can extract a subset from a data frame by evaluating conditions on one or more variables:
sub = subset(df, Var1 > 100 & Var2 < 50)
Notice that the logical AND operator is a single ampersand, which compares the two conditions element by element. To see how many observations are in this subset (or in any data frame) we can use the nrow function:
nrow(sub)
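
A minimal sketch of the subsetting step on invented data, with the equivalent logical indexing form next to it for comparison (sub2 is a name I made up):

```r
# Hypothetical data frame (values invented for illustration).
df <- data.frame(
  Var1 = c(50, 120, 150, 200, 101),
  Var2 = c(10, 60, 40, 49, 50)
)

# Keep only the rows where both conditions hold.
sub <- subset(df, Var1 > 100 & Var2 < 50)
print(nrow(sub))   # number of observations that passed the filter

# Same filter written with plain logical indexing.
# Here the columns have to be spelled out as df$Var1 and df$Var2.
sub2 <- df[df$Var1 > 100 & df$Var2 < 50, ]
print(nrow(sub2))
```

One difference to keep in mind: when a condition evaluates to NA, subset() drops that row, while logical indexing returns a row full of NA for it. The sample data has no missing values, so both forms select the same rows here.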

To generate a histogram in R we use the hist() function. The boxplot() function generates box plots, which are quite useful to see the median, the quartiles, and the overall range of a variable.
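
A short sketch of both calls on invented data. The labels are placeholders, and storing the return values in h and b is optional, I do it here only to peek at what was computed:

```r
# Hypothetical data frame (values invented for illustration).
df <- data.frame(Var = c(12, 7, 30, 7, 18, 22, 15, 9))

# Histogram: hist() draws the plot and also returns the bins it used.
h <- hist(df$Var, main = "Histogram of Var", xlab = "Var")
print(h$counts)    # number of observations in each bin

# Box plot: boxplot() draws the plot and returns the plotted statistics.
b <- boxplot(df$Var, main = "Box plot of Var", ylab = "Var")
print(b$stats)     # lower whisker, lower hinge, median, upper hinge, upper whisker
```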

