Monday, April 18, 2016

Saving an R plot to PNG

On Linux, if we want to save a plot to a file, we have to specify before plotting which file we want to write to and in which format.

Say that I want to generate a PNG named myPlot.png in my current directory:
png('myPlot.png')
plot(df$var1, df$var2)
dev.off()
The last line closes the PNG device, which writes the file, and tells R that I want to switch back to the default device, meaning X11 in my case, where subsequent plots are shown on screen.
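
Here is a minimal sketch of the same pattern that runs without any data frame of your own. It uses R's built-in mtcars data set and a temporary file, both chosen by me just for illustration; the width and height arguments of png() are optional.

```r
# Write the plot to a temporary PNG file (location chosen only for this example)
out_file <- file.path(tempdir(), "myPlot.png")

# Open the PNG device; width and height are in pixels by default
png(out_file, width = 800, height = 600)

# Everything drawn from here on goes to the file, not to the screen
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")

# Close the device so that the file is actually written
dev.off()

# Check that the file is there
file.exists(out_file)
```

If the file turns out empty or missing, the most likely cause is a forgotten dev.off().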

Sunday, April 17, 2016

Simple data manipulation on an R data frame

Given a data frame df with a variable named Var, to extract the vector containing all the observations for that variable I use the dollar sign to connect the data frame to its variable:
df$Var
Now, for that variable I can get its mean:
mean(df$Var)
Standard deviation:
sd(df$Var)
Summary (which also works for a complete data frame):
summary(df$Var)
Which observation has the minimum value for the passed variable:
which.min(df$Var)
Which observation has the maximum value for the passed variable:
which.max(df$Var)
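
To see these functions at work, here is a small sketch on a made-up data frame (the values are invented for the example). Note that which.min() and which.max() give back the position of the observation, not its value.

```r
# A tiny data frame with invented values, just for illustration
df <- data.frame(Var = c(12, 7, 25, 7, 18))

mean(df$Var)       # arithmetic mean: 13.8
sd(df$Var)         # sample standard deviation
summary(df$Var)    # min, quartiles, median, mean, max

# which.min() and which.max() return the position (row index), not the value;
# in case of ties, the first position wins
which.min(df$Var)  # 2
which.max(df$Var)  # 3

# To get the value itself, use the index on the vector
df$Var[which.max(df$Var)]  # 25
```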

It is easy to generate a scatter plot that shows how two variables in a data frame relate to each other:
plot(df$Var1, df$Var2)
Here Var1 goes on the X axis and Var2 on the Y axis.

We can extract a subset from a data frame by evaluating conditions on one or more variables:
sub = subset(df, Var1 > 100 & Var2 < 50)
Notice that the AND logical operator is an ampersand. To see how many observations are in this subset (and in any dataframe) we can use the nrow function:
nrow(sub)
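
A quick sketch with invented numbers shows what subset() and nrow() return. The OR variant in the last line is my own addition, to contrast it with the ampersand.

```r
# Invented values, only to have something to filter
df <- data.frame(Var1 = c(50, 120, 150, 200),
                 Var2 = c(10, 60, 40, 20))

# Keep only the rows where both conditions hold
sub <- subset(df, Var1 > 100 & Var2 < 50)
nrow(sub)   # 2

# The same selection written with plain indexing
same <- df[df$Var1 > 100 & df$Var2 < 50, ]

# The OR logical operator is a single vertical bar
nrow(subset(df, Var1 > 100 | Var2 < 50))   # 4
```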

To generate a histogram in R we use the hist() function. The boxplot() function generates box plots, which are quite useful to see the range and the quartiles of a variable.
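
As a sketch of both functions, the following draws a histogram and a box plot side by side into a temporary PNG file. The built-in mtcars data set and the file name are my own choices for the example.

```r
# Temporary output file, chosen only for this example
out_file <- file.path(tempdir(), "hist_and_box.png")

png(out_file, width = 900, height = 450)
par(mfrow = c(1, 2))   # two plots next to each other

# hist() returns the bins and the counts it has drawn
h <- hist(mtcars$mpg, main = "Histogram", xlab = "Miles per gallon")

# boxplot() returns, among other things, the five numbers behind the box
b <- boxplot(mtcars$mpg, main = "Boxplot", ylab = "Miles per gallon")

dev.off()

h$counts   # number of observations in each bin
b$stats    # lower whisker, lower hinge, median, upper hinge, upper whisker
```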


Reading and writing CSV files in R

Before loading a file in R it is often useful to change the working directory of the environment. This is done by:
setwd('pathname')
If you are in doubt about your current working directory, just print it:
getwd()
Reading from a CSV file to a data frame is pretty simple:
df = read.csv('path/to/file.csv')
Now we can get the structure of the dataframe:
str(df)
It gives us information on the number of observations (rows) and variables (columns), the names of the variables, a few of their values and, when they are detected as 'factors', also the number of 'levels' that variable has.
Another useful function is:
summary(df)
It tries to provide a useful summary for each variable, giving the levels in case of a factor, or a few statistical measures otherwise (min, max, mean, median, first and third quartile).
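
Here is a self-contained sketch of the reading part, using a small CSV file with invented content written to a temporary directory. One thing to be aware of: since R 4.0.0 read.csv() no longer converts text columns to factors by default, so I pass stringsAsFactors = TRUE explicitly to get the behavior described above.

```r
# Create a small CSV file in a temporary directory (invented content)
csv_file <- file.path(tempdir(), "example.csv")
writeLines(c("name,group,score",
             "ann,a,10",
             "bob,b,20",
             "cat,a,30"), csv_file)

# stringsAsFactors = TRUE reproduces the pre-4.0 default,
# where text columns are read as factors
df <- read.csv(csv_file, stringsAsFactors = TRUE)

str(df)        # 3 observations of 3 variables; 'group' is a factor with 2 levels
summary(df)    # level counts for the factors, min/median/mean/max for 'score'
```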

We can create a subset from a data frame by selecting a specific value for a variable. Notice that the comparison needs a double equals sign:
sub = subset(df, MyVariable == 'a value')
Then we can save this subset to a CSV file:
write.csv(sub, 'path/to/subFile.csv')
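
A small round trip, on an invented data frame and a temporary file, shows the result. The row.names = FALSE argument is my own addition: without it write.csv() also saves the original row numbers as an extra first column.

```r
# Invented data, only for illustration
df <- data.frame(MyVariable = c("a value", "another", "a value"),
                 Score = c(1, 2, 3))

# Keep the rows where MyVariable is exactly 'a value'
sub <- subset(df, MyVariable == "a value")

# Temporary output file, chosen only for this example
out_file <- file.path(tempdir(), "subFile.csv")

# row.names = FALSE avoids an extra first column with the original row numbers
write.csv(sub, out_file, row.names = FALSE)

# Read the file back to check what has been written
back <- read.csv(out_file)
nrow(back)    # 2
names(back)   # "MyVariable" "Score"
```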

Localization problems in R

Just a hint. If you see your R environment behaving strangely, for instance in how it sorts or prints text, it could be a locale problem.

Look for the Sys.getlocale function in the R documentation, here is a handy link, courtesy of the Zurich polytechnic.

In my case, I solved the problem by calling:
Sys.setlocale("LC_ALL", "C")
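
To get a feeling for what the locale changes, here is a small sketch that touches only the collation (sorting) category and then restores it. The sorting example is my own illustration and not necessarily the problem you are facing.

```r
# Show all the current locale settings
Sys.getlocale()

# Remember the current collation setting so that it can be restored
old <- Sys.getlocale("LC_COLLATE")

# Switch only the sorting rules to the "C" locale
Sys.setlocale("LC_COLLATE", "C")

# In the C locale strings are compared byte by byte,
# so upper case letters come before lower case ones
sorted <- sort(c("b", "A", "a", "B"))
sorted   # "A" "B" "a" "b"

# Put the previous setting back
Sys.setlocale("LC_COLLATE", old)
```

Setting "LC_ALL" as above changes every category at once, which is the quick fix; changing a single category is gentler if you know which one is causing trouble.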