Getting started with PostgreSQL in R

When dealing with large datasets that potentially exceed the memory of your machine it is nice to have another possibility such as your own server with an SQL/PostgreSQL database on it, where you can query the data in smaller digestible chunks. For example, recently I was facing a financial dataset of 5 GB. Although 5 GB fit into my RAM the data uses a lot of resources. One solution is to use an SQL-based database, where I can query data in smaller chunks, leaving resources for the computation.

While MySQL is the more widely used, PostgreSQL has the advantage of being open source and free for all usages. However, we still need to get a server. One possible way to do it is to rent Amazon server, however, as I don’t have a budget for my projects and because I only need the data on my own machine I wanted to set up a server on my Windows 8.1 machine. This is how I did it.

Installing software, Starting the Server and Setting up new Users

First, we need to install the necessary software. Besides R and RStudio we need to install PostgreSQL, that we find here. When installing we are asked to provide a password, just remember it as we need it later. Say for this example we set the password to: “DataScienceRocks”.

Now we can already access and use the database, for example we can start the interface (pgAdmin III) that was automatically installed with PostgreSQL. To connect to the database double click on the server in pgAdmin III and type in your password. The server seems to run after the installation as well. If this is not the case (i.e. you get the error “Server doesn’t listen” when trying to connect to the server with pgAdmin III), you can start the server with the following command in the command line:

pg_ctl -D "C:\Program Files\PostgreSQL\9.4\data" start

As we can see, we only have one user (“postgres“). It is good practice to use the database with another user that has no createrole (think of it as a non-admin user).

To set up a new user I follow this explanation. Start the command line (go to the start menu and type cmd”) and move to the folder where you installed PostgreSQL (more precisely, the bin-folder). In my case I navigated to the folder by typing:

cd C:/Program Files/PostgreSQL/9.4/bin

Now we need to create a new user (openpg), which we can do by executing the following command:

createuser.exe --createdb --username postgres --no-createrole --pwprompt openpg

We have to enter the password for the new user twice (note that there is no feedback from the commandline), for this example I set it to “new_user_password“, lastly we are asked to give our password for the main user (“postgres“) which is in this case “DataScienceRocks“, as specified during the installation.

We can check if we have two users by going into pgAdmin III, which should look like this:

pgAdmin Users
Users in pgAdmin III

Creating a Table in pgAdmin III

An easy way to create a table (database) is to use pgAdmin III. Right click on the “Tables” and choose “New Table”.

Tables in pgAdmin III
Tables in pgAdmin III

For this example we create a table called cartable that we will later populate with the dataset of mtcars. For the dataset we need to specify the columes and their types as shown in the next picture.

pgAdmin III Table Columns
pgAdmin III Table Columns

Lastly, we need to specificy a primary key in constraints. In this case I use the carname column as key.

Connection with R

Now it is time to connect to the database with R. This approach uses the RPostgreSQL package and this approach.

To connect, we need to enter the following commands in R:

# install.packages("RPostgreSQL")
require("RPostgreSQL")

# create a connection
# save the password that we can "hide" it as best as we can by collapsing it
pw <- {
  "correcthorsebatterystaple"
}

# loads the PostgreSQL driver
drv <- dbDriver("PostgreSQL")
# creates a connection to the postgres database
# note that "con" will be used later in each connection to the database
con <- dbConnect(drv, dbname = "postgres",
                 host = "localhost", port = 5432,
                 user = "openpg", password = pw)
rm(pw) # removes the password

# check for the cartable
dbExistsTable(con, "cartable")
# TRUE

If we don’t get an error, that means we are connected to the database.

Write and Load Data with RPostgreSQL

The following code show how we can write and read data to the database:

# creates df, a data.frame with the necessary columns
data(mtcars)
df <- data.frame(carname = rownames(mtcars), 
                 mtcars, 
                 row.names = NULL)
df$carname <- as.character(df$carname)
rm(mtcars)

# writes df to the PostgreSQL database "postgres", table "cartable" 
dbWriteTable(con, "cartable", 
             value = df, append = TRUE, row.names = FALSE)

# query the data from postgreSQL 
df_postgres <- dbGetQuery(con, "SELECT * from cartable")

# compares the two data.frames
identical(df, df_postgres)
# TRUE

# Basic Graph of the Data
require(ggplot2)
ggplot(df_postgres, aes(x = as.factor(cyl), y = mpg, fill = as.factor(cyl))) + 
  geom_boxplot() + theme_bw()
Visualization of the data
Visualization of the data

Lastly, if we are finished, we have to disconnect from the server:

# close the connection
dbDisconnect(con)
dbUnloadDriver(drv)

Outro

If you have any questions about the code, PostgreSQL or pgAdmin III or if you have remarks or have found a way to do it better/faster feel free to leave a comment or write me an email.

Useful links:

Get the PostgreSQL software here:
http://www.postgresql.org/download/windows/

PostgreSQL commandline commands: http://www.postgresql.org/docs/9.4/static/app-pg-ctl.html

Create a new User: https://doc.odoo.com/install/windows/postgres/

For a short introduction to postgreSQL queries have a look at this: http://www.postgresql.org/docs/8.4/static/tutorial-select.html

Appendix

If you want to create a table in R instead of pgAdmin III you can do that of course. The following creates the same table as the we did earlier in pgAdmin III:

# specifies the details of the table
sql_command <- "CREATE TABLE cartable
(
  carname character varying NOT NULL,
  mpg numeric(3,1),
  cyl numeric(1,0),
  disp numeric(4,1),  
  hp numeric(3,0),
  drat numeric(3,2),
  wt numeric(4,3),
  qsec numeric(4,2),
  vs numeric(1,0),
  am numeric(1,0),
  gear numeric(1,0),
  carb numeric(1,0),
  CONSTRAINT cartable_pkey PRIMARY KEY (carname)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE cartable
  OWNER TO openpg;
COMMENT ON COLUMN cartable.disp IS '
';"
# sends the command and creates the table
dbGetQuery(con, sql_command)

Appendix 2: Improvements

Almost a year after I wrote this blog post, I have made my own fair amount of experience with SQL and wanted to share some experience with you.

First: You may want to write a short function that connects you to the server and makes sure the connection closes again. What I usually use is something like this:

# 1) for quering from SQL
SQLCommand <- function(query){
  on.exit(dbDisconnect(con))
  
  con <- dbConnect(drv = dbDriver("PostgreSQL"), 
                 dbname = "postgres", host = "localhost", 
                 port = 5432, user = "openpg", 
                 password = "correcthorsebatterystaple")

  tmp <- dbGetQuery(con, query)
}

# 2) for writing to SQL
SQLWriteValues <- function(values, table){
  on.exit(dbDisconnect(con))
  
  con <- dbConnect(drv = dbDriver("PostgreSQL"), 
                 dbname = "postgres", host = "localhost", 
                 port = 5432, user = "openpg", 
                 password = "correcthorsebatterystaple")

  dbWriteTable(con, table, value = values, append = T, row.names = F)
  return(NULL)
}

I would then use it like this:

SQLWriteValues(df, "cartable")

df2 <- SQLCommand("SELECT * FROM cartable")

Lastly, if you want to exclude the details of your connection such as username, password, etc. from your source code, because you work with an open github repository or you want to be able to change your passwords easily, I do the following, which outsources the connection to a config.yml (make sure that you include this file in your .gitignore) file and loads the necessary information using this:

config.yml (make sure that this file is in your working directory):

db:
 host : "localhost"
 dbname : "postgres"
 user : "openpg"
 port : 5432
 password : "correcthorsebatterystaple"

in your r-file:

library(yaml)

SQLCommand <- function(query){
  on.exit(dbDisconnect(con))
  
  con <- do.call(dbConnect, 
                 c(drv = dbDriver("PostgreSQL"), 
                   yaml.load_file("config.yml")$db))
  tmp <- dbGetQuery(con, query)
  return(tmp)
}

SQLWriteValues <- function(values, table){
  on.exit(dbDisconnect(con))
  
  con <- do.call(dbConnect, 
                 c(drv = dbDriver("PostgreSQL"), 
                   yaml.load_file("config.yml")$db))
  
  dbWriteTable(con, table, value = values, append = T, row.names = F)
  return(NULL)
}

# same as before
SQLWriteValues(df, "cartable")

df2 <- SQLCommand("SELECT * FROM cartable")
Advertisements

7 thoughts on “Getting started with PostgreSQL in R

  1. Hello David,
    Thank you for the post!
    I have a question, have you tried loading your data with the ff package?
    If yes, which one is the fastest to access the data?
    Many thanks,
    Fabio

    Like

    • Hello Fabio,
      thank you for the idea of using the ff-package, I will have a look into as soon as I find time for it. šŸ™‚
      I am using the data.table package to load the data initially, which is then loaded into the PostgreSQL database. This approach enables me to query chunks of the data and therefore saves resources.
      Can you load only parts of a csv-file with the ff package (without loading the whole file to the RAM first)?
      Best,
      David

      Like

  2. Hi David,
    I have gotten a return of ‘FALSE’ instead of ‘TRUE’ at the >dbExistsTable(con, “tablename”) step. I have a fully developed database under the user postgres, and so the only step I didn’t follow was to use a new user. Do you think this could be why I’m having trouble, or do you have other ideas?
    Thank you!

    Like

    • Hi Amanda,

      the fact that you use another user and not “openpg” shouldn’t matter. Make sure that you changed the credentials in the connection (con), that “tablename” exists (for example in PGAdmin III), and that your user has the right to access the database.

      Does that give you what you want/need?

      Like

  3. Hi Daid. I always wanted to use a DB. I have two questions:
    1- do you think it’s a performant idea if you need to upload thousands of registers?
    2- I use a lot the wide/long method that Hadley uses. I find that you can really standardize data. Do you think is possible to use a scheme? (this will make registers grow exponentially)

    Your site is great. Thanks for sharing your thoughts.
    Cheers, Cord

    Like

    • Hi Chord,

      glad that you like my site! šŸ™‚

      Regarding your questions:
      ad 1: The DBs are reasonably fast. Depending on your performance goals it might be better to look into data.table and fread (and fwrite with the new version). However, normally the convenience of SQL and other DBs outweigh the speed advantage of data.tables. In your case (with a couple of thousands of entries (registers?)), SQL should be doable. You can also look into SQLite 3 (which is easier than PostgreSQL when it comes to the set-up).

      ad 2: SQL needs a long structure (i.e., it is not easy to add a new column but very easy to add a new row!), thus you would load the data to R and then convert (if necessary).

      Does that answer your questions?

      Cheers,
      David

      Like

  4. Hi,
    your very useful examples have got me started.
    A thing I ran into is that dbWriteTable(… overwrite=TRUE) seems to destroy existing table structure, and dbRemoveTable() is equivalent to DROP table, so I’ve used:

    ltvTable <- "the_table_to_use"
    sql_truncate <- paste("TRUNCATE ", ltvTable) ##unconditional DELETE FROM…
    res <- dbSendQuery(conn=con, statement=sql_truncate)
    dbWriteTable(conn=con, name=ltvTable, value=dfLTV, row.names=FALSE, append=TRUE)

    would you know: is that necessary?

    best wishes – Jeremy

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s