Combine and subset EU-SILC data with the r.eusilc package in R

20 Nov 2014

I have worked a lot with EU-SILC data over the years and have advised other researchers on various analytical and data manipulation tasks related to it. The data has a reputation for being at the nasty end of survey data, so I thought other users might find solid free software solutions for combining and subsetting it useful.

GESIS provides useful resources for setting variable names and value labels in proprietary software such as SPSS or Stata (see EU-SILC tools and EU-SILC: Further Information). However, there have been no tools for the “bottleneck” procedures, that is, merging and subsetting the raw datasets, until the r.eusilc package! r.eusilc provides functions for merging the raw .csv files into a single household/personal-level data file.

The package is very experimental, so please let me know of any bugs you find or improvements you come up with!

Basic idea

With a single function you can (1) merge together any of the raw files (household or individual, cross-sectional or longitudinal) and (2) subset the variables and countries you are interested in.
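
To make the merging step concrete, here is a minimal base R sketch of what combining the four raw files amounts to. It is not the package's actual implementation. The toy data frames and their values are made up, and the join keys (country plus personal ID RB030/PB030, and country plus household ID RX030/DB030/HB030) are my assumption based on the usual layout of the cross-sectional files.

```r
# Toy stand-ins for the four raw files (all values are made up)
r_file <- data.frame(RB020 = c("FI", "FI", "SE"),   # personal register
                     RB030 = c(101, 102, 201),      # personal ID
                     RX030 = c(1, 1, 2))            # household ID
p_file <- data.frame(PB020 = c("FI", "FI", "SE"),   # personal data
                     PB030 = c(101, 102, 201),
                     PY010N = c(20000, 15000, 30000))
d_file <- data.frame(DB020 = c("FI", "SE"),         # household register
                     DB030 = c(1, 2))
h_file <- data.frame(HB020 = c("FI", "SE"),         # household data
                     HB030 = c(1, 2),
                     HY020 = c(35000, 30000))

# Assumed keys: IDs are only unique within a country, so country is part of every key
personal  <- merge(r_file, p_file,
                   by.x = c("RB020", "RB030"), by.y = c("PB020", "PB030"),
                   all.x = TRUE)
household <- merge(d_file, h_file,
                   by.x = c("DB020", "DB030"), by.y = c("HB020", "HB030"),
                   all.x = TRUE)

# Attach the household variables to every person in that household
both <- merge(personal, household,
              by.x = c("RB020", "RX030"), by.y = c("DB020", "DB030"),
              all.x = TRUE)
```

The result has one row per person, with the household-level variables repeated for each member of the same household.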

Install and load the package

library(devtools)
# current versions of devtools expect the "user/repo" form
install_github("muuankarski/r.eusilc")
library(r.eusilc)

Combine the individual level cross-sectional data with household variables

library(r.eusilc)
both_cross_2010 <- merge_eusilc(path.personal.register  = "~/demo_data/eusilc_raw/2010/cross_rev4/UDB_c10R_ver 2010-5 from 01-03-14.csv",
                        path.personal.data      = "~/demo_data/eusilc_raw/2010/cross_rev4/UDB_c10P_ver 2010-5 from 01-03-14.csv",
                        path.household.register = "~/demo_data/eusilc_raw/2010/cross_rev4/UDB_c10D_ver 2010-5 from 01-03-14.csv",
                        path.household.data     = "~/demo_data/eusilc_raw/2010/cross_rev4/UDB_c10H_ver 2010-5 from 01-03-14.csv",
                        output.path="~/demo_data/eusilc_merged/2010",
                        level="both",
                        type="cross-sectional",
                        year="2010",
                        format="RData",
                        subset.vars.per.reg="all",
                        subset.vars.per.data="all",
                        subset.vars.hh.reg="all",
                        subset.vars.hh.data="all",
                        subset.countries="all") 
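
The call above keeps everything ("all"). If you have already merged the full data and only later want a smaller file, the same kind of subsetting can be done afterwards in base R. The sketch below uses a made-up toy data frame named merged in place of the real merged data; keep_countries and keep_vars are hypothetical helper names, and this is not how the package's own subset.* arguments work internally.

```r
# Toy stand-in for a merged data frame (all values are made up)
merged <- data.frame(
  RB020  = c("FI", "FI", "SE", "DE"),      # country
  RB030  = c(101, 102, 201, 301),          # personal ID
  HY020  = c(35000, 35000, 30000, 42000),  # household-level income variable
  PY010N = c(20000, 15000, 30000, 25000)   # personal-level income variable
)

# Hypothetical selections: the countries and variables to keep
keep_countries <- c("FI", "SE")
keep_vars      <- c("RB020", "HY020", "PY010N")

# Rows are filtered by country, columns by variable name
small <- merged[merged$RB020 %in% keep_countries, keep_vars]
```

Subsetting inside the merge call is still preferable for the real files, since it avoids holding the full data in memory.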

Plot the combined data

library(ggplot2)
# household income (HY020) against personal income (PY010N), one panel per country (RB020)
ggplot(both_cross_2010, aes(x=HY020, y=PY010N)) +
        geom_point(alpha=.1, shape=1) +
        geom_smooth(method=lm) +
        coord_cartesian(xlim=c(-5000,100000), ylim=c(-5000,50000)) +
        facet_wrap(~RB020)

See the package vignette for a more thorough tutorial with examples: http://muuankarski.github.io/r.eusilc/vignettes/r.eusilc_tutorial.html

Informative papers on the use and limitations of EU-SILC (longitudinal) data
