An R package for estimating semiparametric PH cure models, optimized for big data.
Existing R packages for estimating PH cure models are inefficient and quite buggy, so I modified the original smcure package to make it faster and (hopefully) more reliable.
As the screenshot shows, when you feed too much data to smcure (here, a 34,008-row data frame), it consumes a significant amount of memory (up to 14 GB!). Besides, smcure uses only one CPU core during the bootstrap steps, even if your PC or server has many cores.
After profiling smcure, I found that this function is the performance bottleneck:
```r
smsurv <- function(Time, Status, X, beta, w, model) {
    death_point <- sort(unique(subset(Time, Status == 1)))
    if (model == "ph") coxexp <- exp((beta) %*% t(X[, -1]))
    lambda <- numeric()
    event <- numeric()
    for (i in 1:length(death_point)) {
        event[i] <- sum(Status * as.numeric(Time == death_point[i]))
        if (model == "ph") temp <- sum(as.numeric(Time >= death_point[i]) * w * drop(coxexp))
        if (model == "aft") temp <- sum(as.numeric(Time >= death_point[i]) * w)
        temp1 <- event[i]
        lambda[i] <- temp1 / temp
    }
    HHazard <- numeric()
    for (i in 1:length(Time)) {
        HHazard[i] <- sum(as.numeric(Time[i] >= death_point) * lambda)
        if (Time[i] > max(death_point)) HHazard[i] <- Inf
        if (Time[i] < min(death_point)) HHazard[i] <- 0
    }
    survival <- exp(-HHazard)
    list(survival = survival)
}
```

There are two main reasons why this function is slow:
- Frequent GC: the expression `lambda <- numeric()` creates a zero-length vector, and the subsequent code grows it element by element, which triggers many reallocations (and hence frequent garbage collection).
- Nested loops: when your data has many rows, the two `for` loops inside this function can be quite slow.
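The growth problem alone is easy to demonstrate in a few lines of R (a toy sketch, not the package's actual code): preallocating the result vector produces the same values with a single allocation up front.

```r
# Sketch (not smcure's code): growing a vector in a loop vs. preallocating it.
# Each assignment past the end of x may force R to reallocate and copy.
slow_fill <- function(n) { x <- numeric();  for (i in 1:n) x[i] <- i^2; x }
# Preallocated version: one allocation, in-place writes.
fast_fill <- function(n) { x <- numeric(n); for (i in 1:n) x[i] <- i^2; x }

identical(slow_fill(5), fast_fill(5))  # TRUE -- same result, far fewer allocations
```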
To fix these issues, I rewrote this function in C++ (haz.cpp and cumhaz.cpp) using the Rcpp library, and used the foreach package to parallelize the bootstrap steps.
The following screenshot shows that after optimization, all CPU cores are used and memory usage drops significantly (as does GC frequency, which is not shown here).
```r
cvars <- all.vars(cureform)
Z <- as.matrix(cbind(rep(1, n), data[, cvars]))
colnames(Z) <- c("(Intercept)", cvars)
```

In this code block, `as.matrix` converts the data frame to a matrix, but if the data frame contains factor variables, `as.matrix` converts the whole data frame to a character matrix (a matrix can hold exactly one type of data). This behaviour causes the subsequent code to crash:
```r
b <- eval(parse(text = paste("glm", "(", "w~Z[,-1]", ",family = quasibinomial(link='", link, "'", ")", ")", sep = "")))$coef
```

So I use `model.matrix` to extract the covariates instead.
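The difference is easy to see on a toy data frame (an illustration, not the package's code): `as.matrix` coerces everything to character as soon as one column is a factor, while `model.matrix` expands factors into numeric dummy columns.

```r
# Illustration only: a data frame mixing a numeric column and a factor.
d <- data.frame(age = c(50, 61), stage = factor(c("II", "III")))

as.matrix(d)
# character matrix: "50" and "II" etc., so arithmetic on it crashes

model.matrix(~ age + stage, data = d)
# numeric matrix with columns (Intercept), age, stageIII -- safe for glm
```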
When you specify only one covariate in a submodel, as in the following call:
```r
smcure(formula = Surv(time, cancer) ~ stage)
```

smcure crashes with the error `non-conformable arguments`. The problem is due to the following expression in the `smsurv` function:

```r
exp(X[, -1] %*% beta)
```

When `X` has only two columns, `X[, -1]` produces an atomic vector instead of a matrix, which causes the `non-conformable arguments` error. So a `dim` attribute should be added to this vector to make it a matrix again.
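The dropping behaviour, and two ways to keep a matrix (a small illustration, not the package's actual patch), look like this:

```r
# Illustration: subsetting a two-column matrix down to one column
# drops the result to an atomic vector by default.
X <- matrix(1:6, nrow = 3,
            dimnames = list(NULL, c("(Intercept)", "stage")))

is.matrix(X[, -1])                # FALSE: dropped to a plain vector
is.matrix(X[, -1, drop = FALSE])  # TRUE: stays a 3 x 1 matrix

# Equivalent manual fix: restore the dim attribute by hand.
v <- X[, -1]
dim(v) <- c(3, 1)
is.matrix(v)                      # TRUE
```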
```r
i <- 1
while (i <= nboot) {
    # some code
    if (bootfit$tau < eps) i <- i + 1
}
```

When the EM algorithm does not converge (which is common when the dataset is relatively large), this loop never finishes.
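One way to make the loop safe is to cap the total number of attempts, so non-converging EM runs cannot hang the bootstrap forever. This is only a sketch of the idea; `max_attempts` and its `10 * nboot` value are assumptions for illustration, not smcure's actual parameters.

```r
# Sketch of a bounded version of the bootstrap loop (illustrative names).
i <- 1
attempts <- 0
max_attempts <- 10 * nboot          # assumed cap, not a real smcure argument
while (i <= nboot && attempts < max_attempts) {
    attempts <- attempts + 1
    # ... fit one bootstrap replicate ...
    if (bootfit$tau < eps) i <- i + 1   # count only converged replicates
}
```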
```r
smcure(formula = lformula,
       cureform = iformula,
       data = d, model = "ph", emmax = 100, eps = 1e-06, nboot = 50,
       mc.cores = 10, Var = TRUE, robust = TRUE, silence = TRUE)
```

- `mc.cores = 10`: create 10 sub-processes to run the bootstrap steps
- `robust = TRUE`: use only the converged bootstrap replicates to calculate the standard errors
- `silence = TRUE`: reduce the number of messages printed to the console
Setting: n=300, nsims = 500
- use a better std.error estimation method
- test the AFT model
- update docs
- update the plot function


