Running some jobs on ACISS

If you want to catch up, check out this general post on ACISS, and the first post on how we’re getting set up to run 10,000 bootstrap iterations on 6 different SEM models.

Code walkthrough

I’ll walk through each bit of this script and how it helps us run jobs across multiple nodes on ACISS. Note that once we have a piece of code like this, we will run it on ACISS directly from an R console using source('this_R_script.r').
In this specific example, we need to do something (bootstrap confidence intervals) for each of 6 different models. Our strategy will be to write a function that takes a single argument, a character string with a lavaan model specification, and returns the bootstrapped confidence intervals. The BatchJobs package works by taking a function like the one I just described plus a vector of arguments to pass to that function. It then requests a node (which is like a separate computer; see the first post) for each element of the vector, and runs the function with that one element on that node.
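Here is that pattern in miniature, with a toy job standing in for our bootstrap function (the toy_example id and the sqrt call are made up for illustration, and this assumes your BatchJobs cluster configuration is already in place):

library(BatchJobs)
toy_reg <- makeRegistry(id = "toy_example")   # creates a toy_example-files directory
batchMap(toy_reg, function(x) sqrt(x), 1:3)   # one job per element of 1:3
submitJobs(toy_reg)                           # each job is sent off to its own node
reduceResultsList(toy_reg)                    # collect the three results once done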

So to start with, we need to specify each of the six models:

library(BatchJobs) #load batchjobs first...

uneq.struc.model.ext <-"
ex_T1 =~ c(1, 1, 1, 1)*extr_T1_P1 + c(L2, L2, L2, L2)*extr_T1_P2 + c(L3, L3, L3, L3)*extr_T1_P3 
ex_T2 =~ c(1, 1, 1, 1)*extr_T2_P1 + c(L2, L2, L2, L2)*extr_T2_P2 + c(L3, L3, L3, L3)*extr_T2_P3
ex_T3 =~ c(1, 1, 1, 1)*extr_T3_P1 + c(L2, L2, L2, L2)*extr_T3_P2 + c(L3, L3, L3, L3)*extr_T3_P3
ex_T4 =~ c(1, 1, 1, 1)*extr_T4_P1 + c(L2, L2, L2, L2)*extr_T4_P2 + c(L3, L3, L3, L3)*extr_T4_P3

extr_T1_P1 ~~c(v_P1, v_P1, v_P1, v_P1)*extr_T1_P1
extr_T2_P1 ~~c(v_P1, v_P1, v_P1, v_P1)*extr_T2_P1
extr_T3_P1 ~~c(v_P1, v_P1, v_P1, v_P1)*extr_T3_P1
extr_T4_P1 ~~c(v_P1, v_P1, v_P1, v_P1)*extr_T4_P1
... # truncated for readability
"

uneq.struc.model.agree <-"..." 
uneq.struc.model.consc <-"..." 
uneq.struc.model.neur <-"..." 
uneq.struc.model.open <-"..." 
uneq.struc.model.HP <-"..." 

I’ve removed most of the text from the model specification since that’s not the focus here. We now want to put these models into a vector.

starts <- c(
  uneq.struc.model.ext, 
  uneq.struc.model.agree,
  uneq.struc.model.consc,
  uneq.struc.model.neur,
  uneq.struc.model.open,
  uneq.struc.model.HP)

Now we write the function that will take, one at a time, each of these six elements of the vector starts.

CIfunc <- function(x) {
  # Read in the data
  bfibfas_LNT_nat <- read.csv("bfibfas_LNT_nat.csv")
  # Load the SEM package called lavaan
  library(lavaan)

  #Fit the model specified by `x`
  fit.uneq.struc.model <- sem(x, data=bfibfas_LNT_nat, group = "agegroup", missing="ml")
  
  #Use the model fit above to bootstrap parameters 10,000 times.
  out <- bootstrapLavaan(fit.uneq.struc.model, R = 10000, FUN = function(y) {
    standardizedSolution(y)$est }, verbose = TRUE, parallel = "snow", ncpus = 12)
  
  # Transform the bootstrap output into nice 95% CIs for each parameter
  table.ci <- lapply(as.data.frame(out), quantile, probs=c(0.025, 0.975))

  # Return a list of both the fit model, and CIs
  list(model = fit.uneq.struc.model, cis=table.ci)
}

We now have all the pieces we need to create and run jobs using BatchJobs. Note well the comments in the above code, and pay particular attention to the parallel arguments in the call to bootstrapLavaan. Passing parallel = "snow" and ncpus = 12 tells this function to spread the 10,000 iterations across all 12 of the node's processors using the snow backend. Any function you write for BatchJobs should, ideally, be parallelized in some way. Thankfully, this is easy to do with lavaan, as well as with the basic boot package (see the sketch below).
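For instance, a parallel call with the boot package takes the very same arguments. This is just a toy example with made-up data, not part of our analysis:

library(boot)
boot_mean <- function(d, i) mean(d[i])  # statistic computed on each resample
boot(rnorm(100), boot_mean, R = 10000, parallel = "snow", ncpus = 12)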

Note also that we load the data within the function: anything the function needs while it runs has to be made available inside it. So we've saved our data file in the directory where we make our registry, and the function loads it each time it's run.
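Since node time is precious, it's worth sanity-checking the function locally before submitting, with R dropped from 10,000 to something tiny. A rough sketch of that check, run on your own machine with the same data file in the working directory:

library(lavaan)
bfibfas_LNT_nat <- read.csv("bfibfas_LNT_nat.csv")
test_fit <- sem(starts[1], data = bfibfas_LNT_nat, group = "agegroup", missing = "ml")
test_boot <- bootstrapLavaan(test_fit, R = 10,
                             FUN = function(y) standardizedSolution(y)$est)
str(test_boot)  # should be a 10-row matrix with one column per parameter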

To understand what happens next, it is important to know that BatchJobs uses what it calls a registry to keep track of all the info you need to give each job, and all the output. Basically, you ask BatchJobs to create a new directory structure for this project (using id = "..." below), and tell it which packages it needs to make accessible to the function you've written (lavaan, in our case).

After it creates the registry, you ask it to map the elements of the starts vector to the function. This sets up the list of jobs it will submit to nodes on ACISS.

# Create the registry. Note, the id should not contain spaces or weird characters!
reg <- makeRegistry(id = "CIs_Stability_Models", packages = "lavaan")

#In the new registry, map the function `CIfunc` to the elements of `starts`
ids <- batchMap(reg, CIfunc, starts)

#Now that we created them, we can submit all the jobs in `reg` 
finished <- submitJobs(reg, resources = list(nodes=1, ppn=12))

When we call submitJobs, we can specify that each job will get 1 node, and that each node will have 12 processors (resources = list(nodes=1, ppn=12)).

If we want to check the status of the jobs we can use showStatus(reg). And if, as will likely happen, we get kicked out of our session, we can log back in and run the following R commands:

library(BatchJobs)
reg <- loadRegistry("CIs_Stability_Models-files")
showStatus(reg)

If your registry was saved in a different location, you'd replace "CIs_Stability_Models-files" with "Your_Reg_Name-files", or whatever the directory is actually called.
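If showStatus reports errors, BatchJobs also ships helpers for tracking them down. If memory serves, something like the following works (check the package docs for your version):

err_ids <- findErrors(reg)      # ids of jobs that threw an error
getErrorMessages(reg, err_ids)  # the error messages themselves
showLog(reg, err_ids[1])        # full log file for the first failed job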

You can also check on your jobs from the command line using qstat with the -u flag and your user name. If you don’t see any output, it means you’ve entered the wrong user name, or your jobs are done. Below is an example of what Cory’s jobs look like when they’re running:


[ccostell@hn1 ~]$ qstat -u ccostell

hn1: 
                                                                                  Req'd    Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
1058081.hn1             ccostell    generic  CIs_Stability_Mo  22781     1      1    --   24:00:00 R  01:55:34
1058082.hn1             ccostell    generic  CIs_Stability_Mo  22792     1      1    --   24:00:00 R  01:55:34
1058083.hn1             ccostell    generic  CIs_Stability_Mo    301     1      1    --   24:00:00 R  01:55:34
1058084.hn1             ccostell    generic  CIs_Stability_Mo    317     1      1    --   24:00:00 R  01:55:33
1058085.hn1             ccostell    generic  CIs_Stability_Mo  12613     1      1    --   24:00:00 R  01:55:33
1058086.hn1             ccostell    generic  CIs_Stability_Mo  12624     1      1    --   24:00:00 R  01:55:32

Once your jobs have finished (use showStatus(reg) to ensure they finished without error or premature termination), you can collect the results and save them to an RData file that you’ll want to transfer to your own computer and play with there:

reg <- loadRegistry("CIs_Stability_Models-files")
showStatus(reg)

reduced_CI_results <- reduceResultsList(reg, fun = function(job, res) res, progressbar = TRUE)

save(reduced_CI_results, file = 'reduced_CI_results.RData')
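Back on your own machine, the saved object is just a list with one element per model, so inspecting it looks something like this (the names match what we saved above):

library(lavaan)  # needed to print the fitted models
load('reduced_CI_results.RData')
length(reduced_CI_results)              # 6, one element per model
reduced_CI_results[[1]]$cis             # 95% CIs for the first model's parameters
summary(reduced_CI_results[[1]]$model)  # the fitted lavaan model itself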

A minor problem we encountered

When we first ran this code, we came back the next day to find that none of the jobs had finished (but they hadn't thrown errors either). We had accidentally requested nodes of the 'short' type, which only let you compute for 4 hours before they kill whatever you're running. We had to edit our simple.tmpl to include the line #PBS -q generic so that we were requesting generic nodes, which let things run for up to 24 hours.
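For reference, the relevant part of a Torque template looks roughly like this. The <%= %> placeholders are brew syntax that BatchJobs fills in at submission time, and the exact contents of your simple.tmpl may well differ:

#!/bin/bash
#PBS -N <%= job.name %>
#PBS -o <%= log.file %>
#PBS -l nodes=<%= resources$nodes %>:ppn=<%= resources$ppn %>
#PBS -q generic
#PBS -j oe
R CMD BATCH --no-save --no-restore "<%= rscript %>" /dev/stdout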
