Cory wanted to do an unassailable number of bootstrap replications (10,000) to get confidence intervals for six different models he ran using lavaan. Using his own computer to do 500 replications took, well, too long. So last week we started to set up a script for doing this all on ACISS, following along with a blog post I wrote some time ago. We discovered a few details left out of that original post, so I’ll document our process below:
Getting started with ACISS
First, make sure you apply for a new account.
Once you’ve logged on, if you need to do any work at all (running a shell script, installing an R package), you need to request an interactive job from the Sun Grid Engine, because the node you log in to has basically no processing capacity at all.
As an example, let’s install the packages we’ll need later on. Below, I request an interactive (-I) job in the generic queue (-q generic), and then load whatever the default version of R is (see module avail for a list of available software and versions).
[flournoy@hn1 ~]$ qsub -q generic -I
qsub: waiting for job 1048662.hn1 to start
qsub: job 1048662.hn1 ready
[flournoy@cn91 ~]$ module load R
Note the hostname change from hn1 to cn91. Now I can run R and install packages. Which brings up the second obstacle we came across: ACISS doesn’t like making secure HTTP requests (e.g., to https://cloud.r-project.org/). So, when you want to run install.packages, you have to make sure to either choose an http mirror when asked or, better yet, just specify a mirror in the command itself. We’ll use the CRAN repository up at OSU.
[flournoy@cn91 ~]$ R
R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
...
> install.packages(c('lavaan', 'BatchJobs'), repo='http://ftp.osuosl.org/pub/cran/', Ncpus=12)
Installing packages into ‘/ibrix/users/home5/flournoy/R/x86_64-unknown-linux-gnu-library/3.1’
(as ‘lib’ is unspecified)
...
In the above install.packages command, I ask for both lavaan and BatchJobs to be installed, specify the mirror to use in the repo option (R matches this partial name to the full argument, repos), and since we’re getting used to parallelization, I also tell it to use up to 12 parallel processes (Ncpus) when it has to install packages from source.
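If you'd rather not retype the mirror for every install, one option is to set it as the session default with base R's options(). This is just a small sketch: the URL is the OSU mirror from above, and whether you put the line in your ~/.Rprofile is up to you.

```r
# Point install.packages() at a plain-http CRAN mirror by default,
# so there is no mirror prompt and no https request.
options(repos = c(CRAN = "http://ftp.osuosl.org/pub/cran/"))

# Confirm which mirror will be used.
getOption("repos")

# With the default set, the mirror argument can be dropped, e.g.:
# install.packages(c("lavaan", "BatchJobs"), Ncpus = 12)
```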
Cory’s Bootstrap Strategy
For simplicity, we decided to distribute all the work for bootstrapping using one node per model. That means one node will do all 10,000 bootstraps for that model. Luckily, bootstrapping in R itself is easy to make parallel, so the job will run on all the node’s 12 processors. So, compared with running every replication one at a time on a single core, we’ll get an ideal speedup factor of up to 6*12=72, which is pretty nice, and we gain the added benefit of running this on a remote server.
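To make "easy to make parallel" concrete, here is a minimal sketch of a parallel bootstrap on a single node using lavaan's bootstrapLavaan() with its parallel and ncpus arguments. It is not the script we're building: the model and data are stand-ins from the lavaan documentation (not Cory's), and n_boot and n_cores are kept tiny so it runs quickly on a laptop.

```r
library(lavaan)

# Stand-in model and data (NOT Cory's): the three-factor CFA from the lavaan docs.
model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
'
fit <- cfa(model, data = HolzingerSwineford1939)

n_boot  <- 50  # small for a quick try; the real run would use 10000
n_cores <- 2   # on an ACISS node this would be 12

# Refit the model on resampled data, spreading replications over forked workers.
# "multicore" relies on forking, so it works on Linux/macOS but not on Windows.
boot_coefs <- bootstrapLavaan(fit, R = n_boot, FUN = "coef",
                              parallel = "multicore", ncpus = n_cores)

# Percentile 95% interval for each parameter (one column per parameter).
ci <- apply(boot_coefs, 2, quantile, probs = c(0.025, 0.975), na.rm = TRUE)
t(ci)
```

Percentile intervals are only one choice here; I use them because they fall straight out of the matrix of bootstrapped coefficients.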
If we wanted even more speedup, we could split the bootstraps for each model across nodes. That is, we could do 5,000 bootstrap iterations for model 1 on node 1, 5,000 more for model 1 on node 2, and then recombine them later. For now, we’ll focus on the simpler version.
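Recombining is simple because bootstrap replications are independent draws: you stack the replications from each chunk and compute intervals on the full set. Here is a rough sketch, again with a stand-in model. The run_chunk() helper is hypothetical, and in practice each chunk would run as its own job on its own node, with its result saved to disk and read back in before stacking.

```r
library(lavaan)

# Stand-in model and data (NOT Cory's), from the lavaan documentation.
model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
'
fit <- cfa(model, data = HolzingerSwineford1939)

# Hypothetical helper: one chunk of replications, seeded separately
# so that the chunks do not repeat the same resamples.
run_chunk <- function(fit, n_reps, seed) {
  set.seed(seed)
  bootstrapLavaan(fit, R = n_reps, FUN = "coef")
}

# Each call stands in for a job on a different node (5000 each in the real run).
chunk_1 <- run_chunk(fit, n_reps = 25, seed = 1)
chunk_2 <- run_chunk(fit, n_reps = 25, seed = 2)

# Recombine: stack the replications, then compute intervals on the full set.
boot_all <- rbind(chunk_1, chunk_2)
ci <- apply(boot_all, 2, quantile, probs = c(0.025, 0.975), na.rm = TRUE)
```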
This week we’re going to finalize the script that uses BatchJobs to schedule these things on ACISS’s Sun Grid Engine, and collect the results. We’ll also play around with using RStudio with a remote R console (fingers crossed).
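As a preview, here is a rough sketch of how the one-node-per-model plan might look with BatchJobs. Treat it as a set of assumptions rather than the finished script. The model syntax is placeholder (two toy models instead of Cory's six), boot_one_model() is a hypothetical helper, and the resource names (nodes, ppn, walltime) are guesses, since those are defined by whatever template file the cluster configuration uses. As far as I know, without a .BatchJobs.R config pointing at the scheduler, BatchJobs just runs the jobs in the current session, which is handy for testing.

```r
library(BatchJobs)

# Placeholder model syntax (NOT Cory's six models); two shown to keep it short.
model_list <- list(
  m1 = 'visual  =~ x1 + x2 + x3',
  m2 = 'textual =~ x4 + x5 + x6'
)

# Hypothetical helper run by each job: fit one model, then bootstrap it
# using several cores on whichever node the job lands on.
boot_one_model <- function(model, n_boot, n_cores) {
  fit <- lavaan::cfa(model, data = lavaan::HolzingerSwineford1939)
  lavaan::bootstrapLavaan(fit, R = n_boot, FUN = "coef",
                          parallel = "multicore", ncpus = n_cores)
}

# The registry keeps job definitions and results on disk (by default in a
# directory under the working directory, which should be on shared storage).
reg <- makeRegistry(id = "lavaanboot", packages = "lavaan")

# One job per model; more.args are shared by every job.
batchMap(reg, boot_one_model, model_list,
         more.args = list(n_boot = 20, n_cores = 2))  # real run: 10000 and 12

# Resource names depend on the cluster's template file; these are guesses.
submitJobs(reg, resources = list(nodes = 1, ppn = 12, walltime = "04:00:00"))

waitForJobs(reg)                        # block until every job has finished
boot_results <- reduceResultsList(reg)  # one bootstrap matrix per model
```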