Extras: Side Projects & Resources
In addition to my research, I have produced and continue to develop a number of side projects - some offering helpful resources for researchers and others that are simply fun weekend activities. Explore some of these below and feel free to suggest improvements or new ideas.
R Statistics Library
The R STATISTICS LIBRARY is a collection of useful R code for data management, exploration, modeling, evaluation, and reporting that I have compiled over the course of my research career. The Library assembles and organizes code drawn from my own research and graduate coursework and includes useful code shared from colleagues in and around the PhD in Health Policy Program at Harvard University. Like my research, this project continues to evolve as I learn new packages and functionalities. The Library is not and does not aim to be an exhaustive summary of the R statistics world; rather, the Library represents a convenience set of commonly used code that is easier to reference when compiled into a single script. The full R Statistics Library can be found here. See an excerpt below:
# =================================== #
# =================================== #
# R STATISTICS LIBRARY #
# #
# Peter Lyu #
# #
# ----------------------------------- #
# Last Updated: 09/02/2021 #
# =================================== #
# =================================== #
#*******************************************************************************************************#
###### Table of Contents ######
# The library consists of the following sections, organized by purpose:
# (I) MANAGEMENT
# (II) EXPLORATION
# (III) MODELING
# (IV) DIAGNOSTICS/EVALUATION
# (V) REPORTING
# Packages with useful example datasets
library(car)
library(faraway)
#*******************************************************************************************************#
#*******************************************************************************************************#
##### SECTION I: MANAGEMENT #####
##### Part A: Reading Data #####
# Read in different data types
# SAS data ("read_sas" imports data as tables)
library(haven)
dat <- as.data.frame(read_sas("dat.sas7bdat"))
# Stata data
# Note: 'foreign' package can't read beyond Stata 12 data
library(foreign)
dat <- read.dta("dat.dta")
library(haven)
dat <- as.data.frame(read_dta("dat.dta"))
# CSV data
dat <- read.csv(file="dat.csv", header=T, sep=",")
# Read in series of data files, all with same prefix, as a list object
lapply(Sys.glob("data*.csv"),read.csv)
# Read in series of data files, separately
for(i in 1:4) {
nam <- paste("data_", i, sep = "")
assign(nam, read.csv(paste("data_", i,".csv", sep="")))
}
# Determine class of all variables of a dataframe
lapply(dat,class)
##### Part B: Transposing Data #####
# Long-to-Wide
# NOTE: Requires "timevar" who's values will be appended to transposed variables
# IMPORTANT: reshape() requires a DATAFRAME argument! Will not work correctly with DATATABLE objects
# Adding time variable (i.e. counter)
data$i <- with(data,ave(id,id,FUN=seq_along))
reshape(data,idvar="id",timevar="i",direction="wide")
# Wide-to-Long
# NOTE: Much easier when dat is limited to only those variables which are being transposed
reshape(dat,idvar="id",varying=c("var_1","var_2","var_3"),v.names="var",sep="_",direction="long")
reshape(dat,idvar="id",varying=c(2:4),v.names="var",sep="_",direction="long")
##### Part C: Loops #####
# For Loop: [index] in vector
for (i in c(1,2,3,4)) {
var = i
}
# SAS "Macro"-like loops via lapply()
# Example: looping over dataframes indexed by year, e.g. data2000, data2001,... (taken from Stack Overflow)
years <- 2000:2002
dataList <- lapply(years, function(x){
#Create name of data set as character object
dsname <- paste0("data",x)
#Call data set from character object using get() (example process here removes obs with non-missing var)
dat <- get(dat)[is.na(var)==0]
#Create year variable
dat$year <- x
#Output data set in resulting list
dat
})
#Append data sets
dat_2000_2002 <- do.call(rbind,dataList)
##### Part D: Parallel Processing #####
# Using dopar
library(foreach)
library(doParallel)
# STEP 1: Data cleaning/setup before processes that require parallelizing
# STEP 2: Set number of cores to use (below code uses all minus one cores)
no_cores <- detectCores() - 1
# STEP 3: Generate n_cores clusters based on local environment
cl <- makeCluster(no_cores)
# STEP 4: Register clusters (to identify which cluster set to be used for parallelized processing)
registerDoParallel(cl)
# STEP 5: Loop processes, which assign different independent loop iterations to different cores
# Example 1: Parallel Fits
# Fits different models with different formulas (contained in list and defined
# prior to makeCluster() and outputs a nested list. E.g. all_fits[[1]][[1]] contains the estimates
# from Model 1 and all_fits[[1]][[2]] contains the robust var-cov matrix of Model 1.
all_fits <- foreach(i=1:n_models,
.packages = c("sandwich","lmtest","foreach","doParallel"),
.combine = list,
.multicombine = T) %dopar% {
model <- lm(formulas[[i]], data=dat)
cluster_vcov <- vcovHC(model, type="HC1", cluster="clustervar")
list(model,cluster_vcov)
}
# Example 2: Parallel Prints
# Prints different sets of models together using stargazer package.
combos_models <- list(list(all_fits[[1]][[1]], all_fits[[2]][[1]], all_fits[[3]][[1]]),
list(all_fits[[4]][[1]]),
list(all_fits[[5]][[1]]))
combos_se <- list(list(sqrt(diag(all_fits[[1]][[2]])), sqrt(diag(all_fits[[2]][[2]])), sqrt(diag(all_fits[[3]][[2]]))),
list(sqrt(diag(all_fits[[4]][[2]]))),
list(sqrt(diag(all_fits[[5]][[2]]))))
foreach(1:3, .packages=c("stargazer"),.export=c("combos_models","combos_se")) %dopar% {
stargazer(combo_models[[i]],
align=T,
omit.stat="f",
type="html",
se=combos_se[[i]],
out=paste("address/Model Results - Set ",i,".html",sep=""))
}
# STEP 6: Stop/Close clusters
stopCluster(cl)
...
Area Neighbors
For one of my research projects, I faced a very straightforward data challenge: How do I find all neighboring Hospital Referral Regions (HRR) for each HRR? Well, I pulled together that data so others never have to and outlined a procedure that should be generalizable to other other area definitions (e.g., ZIPs). You can find these resources here.
Health Policy Crossword Puzzle
When COVID first struck the US, it was a scary few months and I spent a lot more time inside than I have in a long, long time. To make the most out of these circumstances and share some positivity, I practiced my love of crosswords and wrote a health policy-themed puzzle. It's no Saturday or Sunday crossword, but I hope health policy nerds out there enjoy and appreciate a few of the clues. The puzzle works best printed out, and you can download PDF and DOCX versions here.