Pythonesque: April 2016

Monday, April 18, 2016

Saving R plot to png

On Linux, if we want to save a plot to a file, we have to specify before plotting which file we want to write to and in which format.

Say that I want to generate a PNG named myPlot.png in my current directory:

png('myPlot.png')
plot(df$var1, df$var2)
dev.off()

The last line closes the PNG device, which finishes writing the file, and tells R that I want to switch back to the default device. On Linux that means X11, so the next plots show up on screen again.
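To try the whole sequence without touching real data, here is a minimal sketch. The data frame and its columns var1 and var2 are made up for illustration, and the file goes to R's temporary directory instead of the current one; the width and height values are just an example.

```r
# Hypothetical data frame; var1 and var2 are placeholder column names.
df <- data.frame(var1 = 1:10, var2 = (1:10)^2)

# Write into the session's temporary directory so nothing is left behind.
out_file <- file.path(tempdir(), "myPlot.png")

png(out_file, width = 800, height = 600)  # open the PNG device (size in pixels)
plot(df$var1, df$var2)                    # drawn into the file, not on screen
dev.off()                                 # close the device and finish the file

file.exists(out_file)                     # check that the file was written
```

Until dev.off() is called the file may be incomplete, so it is worth keeping the three calls together.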

Sunday, April 17, 2016

Simple data manipulation on a R data frame

Given a data frame df with a variable named Var, to extract the vector containing all the observations for that variable I use the dollar sign to connect the data frame to its variable:

df$Var

Now, for that variable I can get its mean:

mean(df$Var)

Standard deviation:

sd(df$Var)

Summary (this also works on a complete data frame):

summary(df$Var)

Which observation has the minimum value for the passed variable:

which.min(df$Var)

Which observation has the maximum value for the passed variable:

which.max(df$Var)
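Putting these calls together on a tiny data frame may help. This is only a sketch: the id column and the values are invented for the example. Note that which.min() and which.max() return the position of the observation, not its value, and in case of ties they return the first one.

```r
# Hypothetical data frame: an id column plus the numeric variable Var.
df <- data.frame(id  = c("a", "b", "c", "d", "e"),
                 Var = c(12, 7, 30, 18, 7))

mean(df$Var)      # arithmetic mean
sd(df$Var)        # sample standard deviation (n - 1 denominator)
summary(df$Var)   # min, quartiles, median, mean, max

i_min <- which.min(df$Var)  # index of the first minimum (7 appears twice)
i_max <- which.max(df$Var)  # index of the maximum

df[i_min, ]       # the whole row holding the minimum
df$Var[i_max]     # the maximum value itself
```

Using the index to pick the row is handy when I want to know which observation it is, not just how large the value is.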

It is easy to generate a scatter plot that shows how two variables in a data frame relate to each other:

plot(df$Var1, df$Var2)

Here Var1 goes on the X axis and Var2 on the Y axis.

We can extract a subset from a dataframe evaluating conditions on one or more variables:

sub = subset(df, Var1 > 100 & Var2 < 50)

Notice that the AND logical operator is an ampersand. To see how many observations are in this subset (and in any dataframe) we can use the nrow function:

nrow(sub)
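A small sketch of the same idea, with made-up values for Var1 and Var2, also showing the OR operator (a vertical bar) next to the AND one. The names both and either are mine, chosen for the example.

```r
# Hypothetical data frame with the two variables used in the conditions.
df <- data.frame(
  Var1 = c(50, 120, 150, 200, 300),
  Var2 = c(10,  40,  60,  20,  49)
)

# & is the element-wise AND: both conditions must hold for a row to be kept.
both <- subset(df, Var1 > 100 & Var2 < 50)

# | is the element-wise OR: at least one condition must hold.
either <- subset(df, Var1 > 100 | Var2 < 50)

nrow(both)     # rows 2, 4 and 5 satisfy both conditions
nrow(either)   # every row satisfies at least one
```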

To generate a histogram in R we use the hist() function. The boxplot() function generates box plots, which are quite useful to see the statistical range of a variable.
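As a rough sketch, here are both plots side by side, saved to a PNG in the temporary directory. The data are random numbers generated just for the example, and the file name is arbitrary.

```r
# Hypothetical variable: 200 random values, seeded so the run is repeatable.
set.seed(1)
df <- data.frame(Var = rnorm(200, mean = 50, sd = 10))

out_file <- file.path(tempdir(), "distribution.png")
png(out_file, width = 900, height = 450)
par(mfrow = c(1, 2))                              # two panels side by side
hist(df$Var, main = "Histogram", xlab = "Var")    # distribution of the values
boxplot(df$Var, main = "Box plot", ylab = "Var")  # median, hinges, whiskers
dev.off()

# Tukey's five numbers (min, lower hinge, median, upper hinge, max),
# which are what the box itself is drawn from.
fivenum(df$Var)
```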

Reading and writing CSV files in R

Before loading a file in R it is often useful to change the working directory of the environment. This is done by:

setwd('pathname')

If you are in doubt about which is your current working directory, just print it:

getwd()
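Since setwd() returns the previous directory (invisibly), it is easy to go somewhere and come back. A minimal sketch, where tempdir() just stands in for a real pathname:

```r
# setwd() returns the directory we were in, so we can keep it for later.
old_dir <- setwd(tempdir())   # tempdir() is a stand-in for 'pathname'

getwd()                       # now somewhere under the temporary directory

setwd(old_dir)                # go back to where we started
getwd()
```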

Reading from a CSV file to a data frame is pretty simple:

df = read.csv('path/to/file.csv')

Now we can get the structure of the dataframe:

str(df)

It gives us information on the number of observations (rows) and variables (columns), the names of the variables, a few of their values and, when they are detected as 'factors', also the number of 'levels' into which that variable is structured.
Another useful function is:

summary(df)

It tries to give us a useful summary for each variable: the levels in case of a factor, or a few statistical measures otherwise (min, max, mean, median, first and third quartile).
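To see str() and summary() at work without a real file, this sketch writes a tiny CSV to a temporary file first. The columns city and temp are invented. One thing to be aware of: since R 4.0.0 read.csv() keeps text columns as plain character by default, so I pass stringsAsFactors = TRUE to get the factor behaviour described above.

```r
# Build a small throwaway CSV; city and temp are hypothetical columns.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("city,temp", "Rome,21.5", "Oslo,9.0", "Rome,23.1"), csv_file)

# stringsAsFactors = TRUE turns the text column into a factor
# (the default is FALSE from R 4.0.0 onwards).
df <- read.csv(csv_file, stringsAsFactors = TRUE)

str(df)            # 3 observations of 2 variables, city shown as a factor
summary(df)        # counts per level for city, numeric measures for temp
nlevels(df$city)   # number of distinct levels in the factor
```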

We can create a subset from a dataframe selecting a specific value for a variable, like this:

sub = subset(df, MyVariable == 'a value')

Then we can save this subset to a CSV file:

write.csv(sub, 'path/to/subFile.csv')
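Here is a minimal round trip, assuming a made-up data frame with a MyVariable column and a Score column. The comparison needs a double equals sign; with a single one subset() gets a named argument instead of a condition and does not filter anything. I also pass row.names = FALSE, which is my own choice here, to avoid an extra column of row numbers in the file.

```r
# Hypothetical data frame; MyVariable and Score are placeholder columns.
df <- data.frame(MyVariable = c("a value", "other", "a value"),
                 Score      = c(1, 2, 3))

# == tests equality row by row; a single = would not filter.
selected <- subset(df, MyVariable == "a value")

out_file <- file.path(tempdir(), "subFile.csv")
write.csv(selected, out_file, row.names = FALSE)  # skip the row-number column

read.csv(out_file)   # read it back to check what was saved
```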

Localization problems in R

Just a hint. If you see your R environment behaving strangely, it could be a locale problem.

Look up the Sys.getlocale function in the R documentation; the Zurich polytechnic hosts a handy copy of it.

In my case, to solve the problem I called:

Sys.setlocale("LC_ALL", "C")
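To get a feeling for what the locale changes, here is a small sketch limited to a single category, LC_COLLATE, which drives how strings are sorted. It is not the exact call I used above; I picked one category because the R documentation warns that the combined string returned for LC_ALL cannot always be passed back to Sys.setlocale, so saving and restoring one category is the safer illustration.

```r
words <- c("b", "A", "a", "B")

old_collate <- Sys.getlocale("LC_COLLATE")  # remember the current setting
Sys.getlocale()                             # show all the categories

Sys.setlocale("LC_COLLATE", "C")            # plain byte-order collation
sorted_c <- sort(words)                     # in "C", uppercase comes before lowercase
sorted_c

Sys.setlocale("LC_COLLATE", old_collate)    # put the previous setting back
```

In most other locales the same sort() call would mix upper and lower case, which is the kind of difference that can make results look strange from one machine to another.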