Mastering Rare Event Analysis: Optimal Subsample Size in Logistic and Cox Regressions (2024)

TAL AGASSI, NIR KERET, MALKA GORFINE
Department of Statistics and Operations Research
Tel Aviv University, Tel Aviv 69978, Israel.

gorfinem@tauex.tau.ac.il

Abstract

In the realm of contemporary data analysis, the use of massive datasets has taken on heightened significance, albeit often entailing considerable demands on computational time and memory. While a multitude of existing works offer optimal subsampling methods for conducting analyses on subsamples with minimal efficiency loss, they notably lack tools for judiciously selecting the optimal subsample size. To bridge this gap, our work introduces tools designed for choosing the optimal subsample size. We focus on three settings: the Cox regression model for survival data with rare events, and logistic regression for both balanced and imbalanced datasets. Additionally, we present a novel optimal subsampling procedure tailored for logistic regression with imbalanced data. The efficacy of these tools and procedures is demonstrated through an extensive simulation study and analyses of two sizable datasets.

Keywords: Hypothesis testing; Imbalanced data; Time-to-event analysis; Relative efficiency.

* To whom correspondence should be addressed.

1 Introduction

The escalating demand to analyze massive datasets with millions of observations often leads to considerable computational time and memory requirements, presenting significant challenges in implementing statistical analyses. In response, subsampling has become a widely adopted and effective method for expediting computation across various regression models. These include least-squares regression (Dhillon et al., 2013; Ma et al., 2015), logistic regression (Wang et al., 2018, 2021), generalized linear models (Ai et al., 2021), quantile regression (Wang and Ma, 2021), quasi-likelihood estimators (Yu et al., 2020), time-to-event regression under the additive-hazards model (Zuo et al., 2021), semi-competing risks (Gorfine et al., 2021), the Cox proportional-hazards (PH) model (Keret and Gorfine, 2023), and the accelerated failure time model (Yang et al., 2024).

In this work, we mainly concentrate on two prominent scenarios associated with rare events: (1) addressing the challenge of highly imbalanced data in logistic regression, where one of the classes is rare, and (2) employing Cox proportional-hazards (PH) regression (Cox, 1972) for survival data characterized by a notably high right-censoring rate.

Wang et al. (2018) introduced an innovative subsampling method optimized for logistic regression, demonstrating high effectiveness for balanced data but acknowledging its limited efficacy for highly imbalanced data. In the context of rare-event data, a natural approach involves subsampling exclusively within the majority group (the common class or the censored observations) to prevent the loss of crucial information. Addressing this concern, Wang et al. (2021) focused on logistic regression and proposed an optimal subsampling procedure targeting the rare-event setting, ensuring retention of all events in binary-outcome scenarios. However, the underlying assumption is that the proportion of rare events decreases as the sample size increases, a condition that is often considered undesirable.

For survival analysis involving rare events, Gorfine et al. (2021) advocated subsampling solely the observations that have not experienced the event (i.e., the censored observations) and implemented a uniform subsampling approach. In a similar vein, Keret and Gorfine (2023) presented an optimal subsampling strategy for the Cox PH model within the rare-events framework, in which optimal subsampling exclusively targets censored observations, combining all observed events with the subsampled set of censored observations. These optimal subsampling techniques have been convincingly demonstrated to reduce the computational burden substantially compared to analyzing the entire dataset, with minimal loss of efficiency.

However, a notable aspect left unaddressed in the aforementioned works is the lack of practical guidelines for determining the subsample size. While our primary goal is to reduce computation time, we are equally committed to maintaining the statistical power or efficiency for answering the research questions and avoiding a substantial increase in standard errors. Hence, it is valuable to offer researchers a tool for choosing the subsample size that aligns with their research objectives.

This work offers notable contributions in two key aspects:

  1. We introduce tools designed to optimize the process of selecting subsample sizes in the realm of optimal subsampling. These tools are versatile and are applied here to Cox regression models dealing with rare events and to logistic regression models, regardless of the presence of rare events.

  2. We present optimal subsampling methods specifically tailored for logistic regression models handling rare events. Notably, our approach assumes that the proportion of rare events converges to a positive constant with increasing sample size, a substantial departure from the assumption made by Wang et al. (2021).

This paper is structured as follows: Section 2 begins by summarizing the key findings on optimal subsampling from Keret and Gorfine (2023) to ensure the current paper is self-contained, and then introduces new methodologies for determining the optimal subsample size. Section 3 proposes a two-step subsampling algorithm specifically designed for logistic regression in scenarios involving rare events, including techniques for selecting the subsample size. Section 4 adapts the two-step algorithm of Wang et al. (2018) to nearly balanced datasets and offers strategies for identifying the optimal subsample size. Section 5 summarizes a comprehensive simulation study evaluating the effectiveness of the proposed approaches. Sections 6 and 7 analyze two large-scale datasets: a survival regression model with around 350 million records and a logistic regression model with approximately 28 million observations. The paper concludes with a short discussion in Section 8.

2 Optimal Subsample Size for Cox Regression with Optimal Subsampling

2.1 Notation, Formulation and Reservoir-Sampling (Keret and Gorfine, 2023)

For the sake of clarity, this section presents the model formulation and pertinent findings from Keret and Gorfine (2023). Consider a set of $n$ independent and identically distributed observations. Let $V_i$ represent the failure time of the $i$th observation, $C_i$ the right-censoring time, and $T_i = \min(V_i, C_i)$ the observed time. Define $\Delta_i = I(V_i \leq C_i)$, and let $\mathbf{X}_i$ be a vector of potentially time-dependent covariates of size $r$. The observed dataset is denoted by $\mathcal{D}_n = \{T_i, \Delta_i, \mathbf{X}_i \,;\, i = 1, \dots, n\}$. Among the $n$ observations, there are $n_e$ instances in which the failure time is observed, termed "events". It is assumed that as $n \rightarrow \infty$, the ratio $n_e/n$ converges to a small positive constant. The number of censored observations is $n_c = n - n_e$, and $\tau$ denotes the maximum follow-up time.

In time-to-event data, the predominant source of information comes from events rather than censored observations. This rationale underlies the two-step algorithm introduced by Keret and Gorfine (2023), which utilizes all observed events while sampling a subset of the censored observations. Let $q_n$ be the number of censored observations sampled from the full data, where $q_n$ is typically much smaller than $n$, and it is assumed that $q_n/n$ converges to a small positive constant as $q_n, n \rightarrow \infty$. Define $\mathcal{C}$ as the index set containing all censored observations in the full data, and $\mathcal{Q}$ as the index set encompassing all observations with observed failure times together with the censored observations included in the subsample. Due to computational and theoretical considerations, censored observations are sampled with replacement, implying that a censored observation in the original sample may appear more than once in the subsample.

Let $\boldsymbol{\beta}$ be a vector of $r$ unknown coefficients. Then, under the Cox PH regression model, the instantaneous hazard rate of observation $i$ at time $t$ is given by

$$
\lambda(t \mid \mathbf{X}_i) = \lambda_0(t) e^{\boldsymbol{\beta}^T \mathbf{X}_i}, \qquad i = 1, \dots, n,
$$

where $\lambda_0(\cdot)$ is an unspecified non-negative function and $\Lambda_0(t) = \int_0^t \lambda_0(u) \, du$ is the cumulative baseline hazard function. The goal is to estimate the unknown parameters $\boldsymbol{\beta}$ and $\Lambda_0$. Define $\mathbf{S}^{(k)}(\boldsymbol{\beta}, t) = \sum_{i=1}^{n} e^{\boldsymbol{\beta}^T \mathbf{X}_i} Y_i(t) \mathbf{X}_i^{\otimes k}$, $k = 0, 1, 2$, where $\mathbf{X}^{\otimes 0} = 1$, $\mathbf{X}^{\otimes 1} = \mathbf{X}$, $\mathbf{X}^{\otimes 2} = \mathbf{X}\mathbf{X}^T$, and $Y_i(t) = I(T_i \geq t)$ is the at-risk process of observation $i$ at time $t$. Denote by $\widehat{\boldsymbol{\beta}}_{PL}$ the full-sample partial-likelihood (PL) estimator of $\boldsymbol{\beta}$ that solves

$$
\frac{\partial l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}^T} = \sum_{i=1}^{n} \Delta_i \left\{ \mathbf{X}_i - \frac{\mathbf{S}^{(1)}(\boldsymbol{\beta}, T_i)}{S^{(0)}(\boldsymbol{\beta}, T_i)} \right\} = \mathbf{0} \,.
$$
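As a concrete illustration of the model, rare-event survival data can be generated by inverse-transform sampling. The sketch below is ours, not from the paper: it assumes a constant baseline hazard $\lambda_0(t) = \lambda_0$ and uniform censoring, so that $V_i \mid \mathbf{X}_i$ is exponential with rate $\lambda_0 e^{\boldsymbol{\beta}^T \mathbf{X}_i}$; all numeric values are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 100_000, np.array([0.5])           # illustrative sample size and coefficient
X = rng.normal(size=(n, 1))                  # a single covariate, for simplicity

# With lambda_0(t) = lam0 constant, V | X is exponential with rate lam0 * exp(beta^T X)
lam0 = 0.01                                  # small baseline hazard -> rare events
V = rng.exponential(1.0 / (lam0 * np.exp(X @ beta).ravel()))
C = rng.uniform(0.0, 2.0, size=n)            # short follow-up -> heavy right-censoring
T = np.minimum(V, C)                         # observed time T_i = min(V_i, C_i)
delta = (V <= C).astype(int)                 # event indicator Delta_i

print(delta.mean())                          # event proportion n_e / n is small
```

With these choices only a small percentage of observations are events, matching the rare-event regime in which $n_e/n$ converges to a small positive constant.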

Suppose that the data are organized such that the censored observations precede the failure times, namely $\mathcal{C} = \{1, \dots, n_c\}$ and $\mathcal{E} = \{n_c + 1, \dots, n\}$. Let $\mathbf{p} = (p_1, \dots, p_{n_c})^T$ be a vector of sampling probabilities for the censored observations, where $\sum_{i=1}^{n_c} p_i = 1$, and set

$$
w_i = \begin{cases} (p_i q_n)^{-1} & \text{if } \Delta_i = 0, \; p_i > 0 \\ 1 & \text{if } \Delta_i = 1 \end{cases} \qquad i = 1, \dots, n \,.
$$
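In code, forming these inverse-probability weights is direct. A minimal sketch (the function name `subsample_weights` is ours, not from the paper): events keep weight 1, and each sampled censored observation receives $(p_i q_n)^{-1}$.

```python
import numpy as np

def subsample_weights(delta, p, q_n):
    """Weights w_i for the subsample Q: 1 for events (Delta_i = 1),
    (p_i * q_n)^(-1) for censored observations sampled with probability p_i > 0."""
    w = np.ones(len(delta), dtype=float)
    censored = (delta == 0)
    w[censored] = 1.0 / (p[censored] * q_n)
    return w

# Example: two events and two censored observations, q_n = 2
delta = np.array([1, 0, 0, 1])
p = np.array([0.0, 0.25, 0.75, 0.0])   # probabilities matter only for censored rows
print(subsample_weights(delta, p, q_n=2))   # weights [1, 1/(0.25*2), 1/(0.75*2), 1]
```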

The subsample-based counterpart of $\mathbf{S}^{(k)}(\boldsymbol{\beta}, t)$ is $\mathbf{S}_w^{(k)}(\boldsymbol{\beta}, t) = \sum_{i \in \mathcal{Q}} w_i e^{\boldsymbol{\beta}^T \mathbf{X}_i} Y_i(t) \mathbf{X}_i^{\otimes k}$, $k = 0, 1, 2$. Then, $\widetilde{\boldsymbol{\beta}}$ is defined as the estimator derived from the subsample $\mathcal{Q}$ by solving

$$
\frac{\partial l^{*}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}^T} \equiv \sum_{i \in \mathcal{Q}} \Delta_i \left\{ \mathbf{X}_i - \frac{\mathbf{S}_w^{(1)}(\boldsymbol{\beta}, T_i)}{S_w^{(0)}(\boldsymbol{\beta}, T_i)} \right\} = \mathbf{0} \tag{1}
$$

where $l^{*}$ is the log PL based on the subsample $\mathcal{Q}$. Finally, for a given vector $\boldsymbol{\beta}$, define $\widehat{\Lambda}_0(t, \boldsymbol{\beta}) = \sum_{i=1}^{n} \Delta_i I(T_i \leq t) / S^{(0)}(\boldsymbol{\beta}, T_i)$, and the Breslow estimator (Breslow, 1972) of $\Lambda_0$ is produced by $\widehat{\Lambda}_0(t, \widehat{\boldsymbol{\beta}}_{PL})$.
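The Breslow estimator translates directly into code. Below is a minimal, unoptimized sketch (the function name `breslow` is ours): it evaluates $\widehat{\Lambda}_0(t, \boldsymbol{\beta})$ by summing $\Delta_i I(T_i \leq t) / S^{(0)}(\boldsymbol{\beta}, T_i)$ over the observed events.

```python
import numpy as np

def breslow(t, times, delta, X, beta):
    """Breslow estimator Lambda_hat_0(t, beta) = sum_i Delta_i I(T_i <= t) / S^(0)(beta, T_i),
    where S^(0)(beta, s) = sum_j exp(beta^T X_j) I(T_j >= s)."""
    risk_scores = np.exp(X @ beta)                    # exp(beta^T X_j) for each observation
    total = 0.0
    for i in range(len(times)):
        if delta[i] == 1 and times[i] <= t:
            S0 = risk_scores[times >= times[i]].sum() # sum over the at-risk set at T_i
            total += 1.0 / S0
    return total

# With beta = 0 this reduces to the Nelson-Aalen estimator
times = np.array([1.0, 2.0, 3.0])
delta = np.array([1, 1, 0])
X = np.zeros((3, 1))
print(breslow(2.5, times, delta, X, np.zeros(1)))   # 1/3 + 1/2
```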

Consistency and asymptotic normality of $\widetilde{\boldsymbol{\beta}}$ and $\widehat{\Lambda}_0$ were established by Keret and Gorfine (2023) under some regularity assumptions. Specifically, given the true $\boldsymbol{\beta}^o$,

$$
\sqrt{n} \, \mathbb{V}(\mathbf{p}, \boldsymbol{\beta}^o)^{-1/2} (\widetilde{\boldsymbol{\beta}} - \boldsymbol{\beta}^o) \xrightarrow{D} N(0, \mathbf{I})
$$

as $n, q_n \rightarrow \infty$, where $\mathbf{I}$ is the identity matrix and

$$
\mathbb{V}(\mathbf{p}, \boldsymbol{\beta}) = \boldsymbol{\mathcal{I}}^{-1}(\boldsymbol{\beta}) + \frac{n}{q_n} \boldsymbol{\mathcal{I}}^{-1}(\boldsymbol{\beta}) \, \boldsymbol{\varphi}(\mathbf{p}, \boldsymbol{\beta}) \, \boldsymbol{\mathcal{I}}^{-1}(\boldsymbol{\beta}) \,,
$$
$$
\boldsymbol{\varphi}(\mathbf{p}, \boldsymbol{\beta}) = \frac{1}{n^2} \left\{ \sum_{i \in \mathcal{C}} \frac{\mathbf{a}_i(\boldsymbol{\beta}) \mathbf{a}_i(\boldsymbol{\beta})^T}{p_i} - \sum_{i, j \in \mathcal{C}} \mathbf{a}_i(\boldsymbol{\beta}) \mathbf{a}_j(\boldsymbol{\beta})^T \right\} \,,
$$
$$
\mathbf{a}_i(\boldsymbol{\beta}) = \int_0^{\tau} \left\{ \mathbf{X}_i - \frac{\mathbf{S}^{(1)}(\boldsymbol{\beta}, t)}{S^{(0)}(\boldsymbol{\beta}, t)} \right\} \frac{Y_i(t) e^{\boldsymbol{\beta}^T \mathbf{X}_i}}{S^{(0)}(\boldsymbol{\beta}, t)} \, dN_{\cdot}(t) \,,
$$

where $N_{\cdot}(t) = \sum_{i=1}^{n} \Delta_i I(T_i \leq t)$ and

$$
\boldsymbol{\mathcal{I}}(\boldsymbol{\beta}) = -\frac{1}{n} \frac{\partial^2 l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}^T \partial \boldsymbol{\beta}} = \frac{1}{n} \int_0^{\tau} \left\{ \frac{\mathbf{S}^{(2)}(\boldsymbol{\beta}, t)}{S^{(0)}(\boldsymbol{\beta}, t)} - \left( \frac{\mathbf{S}^{(1)}(\boldsymbol{\beta}, t)}{S^{(0)}(\boldsymbol{\beta}, t)} \right) \left( \frac{\mathbf{S}^{(1)}(\boldsymbol{\beta}, t)}{S^{(0)}(\boldsymbol{\beta}, t)} \right)^T \right\} dN_{\cdot}(t) \,.
$$

As $\boldsymbol{\mathcal{I}}$ and $\boldsymbol{\varphi}$ involve the entire dataset, their subsampling-based counterparts, $\widetilde{\boldsymbol{\mathcal{I}}}$ and $\widetilde{\boldsymbol{\varphi}}$, are used in the variance estimator $\widetilde{\mathbb{V}}(\mathbf{p}, \widetilde{\boldsymbol{\beta}})$. The exact expressions of $\widetilde{\boldsymbol{\mathcal{I}}}$ and $\widetilde{\boldsymbol{\varphi}}$ can be found in Section S1 of the Supplementary Material (SM) file.

The above results laid the foundation for the following optimal subsampling probabilities. The A-optimal sampling-probability vector, denoted $\mathbf{p}^A$ and derived by minimizing the trace of $\mathbb{V}(\mathbf{p}, \boldsymbol{\beta}^o)$, is expressed as

$$
p_m^A = \frac{\| \boldsymbol{\mathcal{I}}^{-1}(\boldsymbol{\beta}^o) \mathbf{a}_m(\boldsymbol{\beta}^o) \|_2}{\sum_{i \in \mathcal{C}} \| \boldsymbol{\mathcal{I}}^{-1}(\boldsymbol{\beta}^o) \mathbf{a}_i(\boldsymbol{\beta}^o) \|_2} \quad \text{for all } m \in \mathcal{C} \tag{2}
$$

where $\| \cdot \|_2$ is the Euclidean ($l_2$) norm. The L-optimal sampling-probability vector, denoted $\mathbf{p}^L$ and obtained by minimizing the trace of $\boldsymbol{\varphi}(\mathbf{p}, \boldsymbol{\beta}^o)$, is given by

$$
p_m^L = \frac{\| \mathbf{a}_m(\boldsymbol{\beta}^o) \|_2}{\sum_{i \in \mathcal{C}} \| \mathbf{a}_i(\boldsymbol{\beta}^o) \|_2} \quad \text{for all } m \in \mathcal{C} \,. \tag{3}
$$

Evidently, $\mathbf{p}^A$ incorporates $\boldsymbol{\mathcal{I}}^{-1}$, enabling more efficient estimation of $\boldsymbol{\beta}$ than the estimator based on the L-optimal probabilities. Nevertheless, for the same reason, $\mathbf{p}^A$ demands greater computational time.
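Given the vectors $\mathbf{a}_i(\boldsymbol{\beta}^o)$ stacked as rows of a matrix and, for the A-optimal case, the inverse information matrix, both sets of probabilities reduce to normalized row norms. The following is a minimal sketch (the function name `optimal_probs` is ours):

```python
import numpy as np

def optimal_probs(A, I_inv=None):
    """Sampling probabilities over the censored set C.
    A: (n_c x r) matrix whose i-th row is a_i(beta)^T.
    I_inv=None gives the L-optimal p^L (Eq. 3); passing the inverse
    information matrix I_inv gives the A-optimal p^A (Eq. 2)."""
    B = A if I_inv is None else A @ I_inv.T   # rows become (I^{-1} a_i)^T
    norms = np.linalg.norm(B, axis=1)         # ||a_i||_2 or ||I^{-1} a_i||_2
    return norms / norms.sum()

A = np.array([[3.0, 4.0], [6.0, 8.0]])        # toy a_i vectors with norms 5 and 10
print(optimal_probs(A))                       # proportional to the norms: [1/3, 2/3]
```

With `I_inv` equal to the identity the two criteria coincide, which makes the extra cost of the A-optimal probabilities (a matrix inverse and one matrix product) easy to see.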

As both $\mathbf{p}^A$ and $\mathbf{p}^L$ rely on the true unknown regression vector $\boldsymbol{\beta}^o$, the following two-step procedure has been proposed. It commences with a quick and straightforward consistent estimator of the regression vector, which is used to estimate the optimal sampling probabilities. The complete implementation of the two-step procedure is outlined below:

Step 1: Select $q_{0}$ observations uniformly from $\mathcal{C}$ and combine them with the observed events to obtain $\mathcal{Q}_{\text{pilot}}$. Conduct a weighted Cox regression on $\mathcal{Q}_{\text{pilot}}$ and obtain $\widetilde{\boldsymbol{\beta}}_{U}$ based on Eq. (1). Compute approximated optimal sampling probabilities by substituting $\widetilde{\boldsymbol{\beta}}_{U}$ for $\boldsymbol{\beta}^{o}$ in Eq. (2) or (3).

Step 2: Select $q_{n}$ observations from $\mathcal{C}$ based on the sampling probabilities of Step 1. Combine these observations with the observed events to obtain $\mathcal{Q}$. Perform a weighted Cox regression on $\mathcal{Q}$, based on Eq. (1), to get the two-step estimator $\widetilde{\boldsymbol{\beta}}_{TS}$.

The original algorithm employs $q_{0}=q_{n}$. Here, however, we suggest using a small value of $q_{0}$ for the initial uniform sampling of Step 1; the methods outlined in the following subsections then focus on determining $q_{n}$ under various criteria. To maintain computational efficiency, we recommend setting $q_{0}=c_{0}n_{e}$, where $c_{0}$ is a small scalar (e.g., $c_{0}<5$). Our simulation study and real-data analyses indicate that this recommendation is generally sufficient. Furthermore, our findings suggest that when none of the covariates exhibits a long-tailed distribution, setting $c_{0}=1$ is often adequate.

The asymptotic properties of $\widetilde{\boldsymbol{\beta}}_{TS}$ and $\widehat{\Lambda}_{0}(t,\widetilde{\boldsymbol{\beta}}_{TS})$ were established by Keret and Gorfine (2023). Specifically, it was shown that under standard assumptions,

\[
\sqrt{n}\,\mathbb{V}(\mathbf{p}^{opt},\boldsymbol{\beta}^{o})^{-1/2}(\widetilde{\boldsymbol{\beta}}_{TS}-\boldsymbol{\beta}^{o})\xrightarrow{D}N(\mathbf{0},\mathbf{I})
\]

as $q_{n},n\rightarrow\infty$, where $\mathbf{p}^{opt}$ is either $\mathbf{p}^{A}$ or $\mathbf{p}^{L}$. Moreover, the asymptotic theory accommodates left truncation, stratified analysis, time-dependent covariates, and time-dependent coefficients. However, a practical methodology for selecting the size of $q_{n}$ was not studied, despite its considerable importance. This motivates us to propose the following frameworks for determining the necessary size of $q_{n}$ based on specific objectives.

Often, datasets are too voluminous to fit within the RAM limitations of standard computers, a challenge highlighted in the real-data analyses of Section 6. Keret and Gorfine (2023) proposed a fast, memory-efficient batch-based reservoir-sampling approach designed to run on conventional computer systems. In brief, the observed-events dataset, $\mathcal{E}$, is kept in RAM throughout, while the censored observations are split into $B$ batches, labeled $\mathcal{B}_{1},\ldots,\mathcal{B}_{B}$, of which only one is in RAM at any given moment. For every batch $b$, $b=1,\ldots,B$, we approximate $\mathbf{p}^{A}$ or $\mathbf{p}^{L}$ using the dataset $\mathcal{E}\cup\mathcal{B}_{b}$, with weight 1 for each event and $n_{c}/|\mathcal{B}_{b}|$ for each censored observation. The reservoir-sampling algorithm selects $q_{n}$ observations with replacement in a single pass. The key idea is to maintain a sample of $q_{n}$ observations (the "reservoir"), in which replacements can occur as new batches are loaded. For an in-depth description and a proof of the algorithm's validity, see Section 2.6 of Keret and Gorfine (2023). This reservoir-sampling algorithm applies to any sampling-with-replacement scenario, independently of the original regression problem; in this work, we employ it for survival regression with approximately 350 million records.
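A minimal sketch of the single-pass, batch-by-batch idea follows. It is not the authors' exact algorithm: each of the $q_{n}$ reservoir slots is filled independently by the standard streaming rule for weighted selection of one item (after total weight $W$ has been seen, a new item of weight $w$ replaces the slot's occupant with probability $w/W$), which yields a with-replacement sample whose marginal probabilities are proportional to the weights.

```python
import numpy as np

def reservoir_sample_with_replacement(batches, qn, seed=0):
    """Single-pass weighted sampling of qn indices *with replacement*.

    `batches` is an iterable of 1-D weight arrays; only one batch needs to
    be in memory at a time, mirroring the batch-based scheme in the text.
    """
    rng = np.random.default_rng(seed)
    reservoir = np.full(qn, -1, dtype=int)  # the qn "slots"
    W = 0.0        # running total weight seen so far
    offset = 0     # global index of the first item in the current batch
    for weights in batches:
        for j, w in enumerate(weights):
            W += w
            # Each slot independently replaces its occupant with prob w/W;
            # the first item (w/W == 1) fills every slot.
            take = rng.random(qn) < (w / W)
            reservoir[take] = offset + j
        offset += len(weights)
    return reservoir

# Illustrative run: three batches of unit-weight censored observations.
sample = reservoir_sample_with_replacement(
    [np.ones(100), np.ones(100), np.ones(100)], qn=50)
```

With non-uniform weights (e.g., the approximated $\mathbf{p}^{A}$ or $\mathbf{p}^{L}$ within each batch), the same loop oversamples high-probability observations, as intended.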

2.2 Subsample Size Based on Relative Efficiency

What is the optimal size of $q_{n}$ that maintains a small efficiency loss? Here, we introduce a tool for evaluating the efficiency loss resulting from the subsampling approach. We begin by defining an estimator of the relative efficiency (RE) of the two-step estimator compared with the full PL estimator:

\[
RE(q_{n})=\frac{\|n^{-1}\boldsymbol{\mathcal{I}}^{-1}(\widetilde{\boldsymbol{\beta}}_{TS})+q_{n}^{-1}\boldsymbol{\mathcal{I}}^{-1}(\widetilde{\boldsymbol{\beta}}_{TS})\boldsymbol{\varphi}(\mathbf{p}^{opt},\widetilde{\boldsymbol{\beta}}_{TS})\boldsymbol{\mathcal{I}}^{-1}(\widetilde{\boldsymbol{\beta}}_{TS})\|_{F}}{\|n^{-1}\boldsymbol{\mathcal{I}}^{-1}(\widehat{\boldsymbol{\beta}}_{PL})\|_{F}}\tag{4}
\]

where $\|A\|_{F}=\sqrt{\sum_{i,j}|a_{i,j}|^{2}}$ is the Frobenius norm. The lower limit of Eq. (4) is close to $1$.

If interest lies in the effect of a particular covariate, say the $p$-th, then we may utilize

\[
RE_{p}(q_{n})=\frac{\Big[n^{-1}\boldsymbol{\mathcal{I}}^{-1}(\widetilde{\boldsymbol{\beta}}_{TS})+q_{n}^{-1}\boldsymbol{\mathcal{I}}^{-1}(\widetilde{\boldsymbol{\beta}}_{TS})\boldsymbol{\varphi}(\mathbf{p}^{opt},\widetilde{\boldsymbol{\beta}}_{TS})\boldsymbol{\mathcal{I}}^{-1}(\widetilde{\boldsymbol{\beta}}_{TS})\Big]_{pp}}{\Big[n^{-1}\boldsymbol{\mathcal{I}}^{-1}(\widehat{\boldsymbol{\beta}}_{PL})\Big]_{pp}}\tag{5}
\]

where $[A]_{pp}$ is the $(p,p)$ element of the matrix $A$. Adjusted optimal sampling probabilities that target a subset of covariates, while retaining the rest in the model to control for confounders, are available in Keret and Gorfine (2023) (see Equations 7 and 8 therein).

The equations above pose practical challenges. First, they include $\widehat{\boldsymbol{\beta}}_{PL}$, whose computation we aim to avoid; second, they involve $\widetilde{\boldsymbol{\beta}}_{TS}$, which can be computed only after the subsample size $q_{n}$ has been determined. Substituting the consistent Step-1 estimator $\widetilde{\boldsymbol{\beta}}_{U}$ for both $\widetilde{\boldsymbol{\beta}}_{TS}$ and $\widehat{\boldsymbol{\beta}}_{PL}$ resolves these issues. A further challenge is that $\boldsymbol{\mathcal{I}}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})$ and $\boldsymbol{\varphi}(\mathbf{p}^{opt},\widetilde{\boldsymbol{\beta}}_{U})$ involve the full data. Their subsampling-based counterparts, $\widetilde{\boldsymbol{\mathcal{I}}}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})$ and $\widetilde{\boldsymbol{\varphi}}(\mathbf{p}^{opt},\widetilde{\boldsymbol{\beta}}_{U})$, can be used instead. However, it is advisable not to compute them from the uniform subsample of Step 1, since uniform sampling may select observations with extremely small optimal probabilities $\mathbf{p}^{opt}$; dividing by these probabilities often renders Eq. (4) or (5) numerically unstable. Our proposed approach therefore samples an additional small subsample of size $q_{0}$, this time using the approximated optimal probabilities obtained by substituting $\widetilde{\boldsymbol{\beta}}_{U}$ for $\boldsymbol{\beta}^{o}$ in Eqs. (2) and (3). Let $\mathcal{Q}_{1.5}$ be the index set containing all observations whose failure time was observed, together with the censored observations included in this new subsample of size $q_{0}$. We denote the counterparts of $\boldsymbol{\mathcal{I}}^{-1}$ and $\boldsymbol{\varphi}$ for this subsample by $\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_{U})$ and $\widetilde{\boldsymbol{\varphi}}_{Q_{1.5}}(\mathbf{p}^{opt},\widetilde{\boldsymbol{\beta}}_{U})$ (see the SM, Section S1, for details).

Hence, the proposed RE estimator is given by

\[
\widehat{RE}(q_{n})=\frac{\|n^{-1}\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_{U})+q_{n}^{-1}\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_{U})\widetilde{\boldsymbol{\varphi}}_{Q_{1.5}}(\widetilde{\mathbf{p}}^{opt},\widetilde{\boldsymbol{\beta}}_{U})\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_{U})\|_{F}}{\|n^{-1}\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_{U})\|_{F}}\tag{6}
\]

where $\widetilde{\mathbf{p}}^{opt}$ is the estimated optimal-probabilities vector calculated in Step 1. To save computational time, we reuse $\widetilde{\mathbf{p}}^{opt}$ and $\widetilde{\boldsymbol{\beta}}_{U}$ from Step 1 instead of re-estimating the optimal probabilities and the coefficient vector based on $\mathcal{Q}_{1.5}$. The computational time for this additional step is very short, since $q_{0}$ is very small compared with $n$. Lastly, because $\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_{U})$ and $\widetilde{\boldsymbol{\varphi}}_{Q_{1.5}}(\widetilde{\mathbf{p}}^{opt},\widetilde{\boldsymbol{\beta}}_{U})$ are calculated only once, a plot of $\widehat{RE}(q_{n})$ as a function of $q_{n}$ can be generated quickly and effortlessly. This additional step can easily be added to the above two-step Algorithm 1, between Steps 1 and 2, as follows:

Step 1.5: Sample $q_{0}$ observations from $\mathcal{C}$ using the optimal sampling probabilities computed in Step 1. Combine these observations with the observed failure times to form $\mathcal{Q}_{1.5}$, and compute $\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_{U})$ and $\widetilde{\boldsymbol{\varphi}}_{Q_{1.5}}(\mathbf{p}^{opt},\widetilde{\boldsymbol{\beta}}_{U})$. Plot $\widehat{RE}(q_{n})$ as a function of $q_{n}$, and choose the minimal $q_{n}$ that provides the required RE.

In practice, with a large $n$, the curve of $\widehat{RE}(q_{n})$ is anticipated to show a rapid decrease followed by a gradual decline, resembling an `elbow' shape. A sensible selection for $q_{n}$ lies in the region where the decline becomes moderate, as the incremental efficiency gain from further increasing $q_{n}$ is likely minimal. Comprehensive examples from simulations and real-data analyses are presented in Sections 5–7.
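The Step-1.5 screening can be sketched as follows: given plug-in estimates of the two matrices in Eq. (6), evaluate $\widehat{RE}(q_{n})$ over a grid and take the smallest $q_{n}$ meeting a target RE. The matrices below are illustrative stand-ins, not quantities from the paper's data.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 10**6

# Stand-ins for the Step-1.5 plug-in quantities (illustrative values):
I_inv = np.eye(d)            # for I~^{-1}_{Q1.5}(beta~_U)
M = rng.normal(size=(d, d))
phi = M @ M.T                # for phi~_{Q1.5}(p~^{opt}, beta~_U), PSD

def re_hat(qn):
    """Estimated relative efficiency of Eq. (6), via Frobenius norms."""
    num = np.linalg.norm(I_inv / n + (I_inv @ phi @ I_inv) / qn, ord="fro")
    den = np.linalg.norm(I_inv / n, ord="fro")
    return num / den

grid = np.arange(1000, 200001, 1000)
curve = np.array([re_hat(q) for q in grid])

# Smallest qn whose estimated RE falls below a chosen threshold, e.g. 1.10;
# None signals the threshold is not reached on this grid.
hit = curve <= 1.10
qn_star = int(grid[np.argmax(hit)]) if hit.any() else None
```

Plotting `curve` against `grid` reproduces the elbow shape described above; the threshold 1.10 corresponds to tolerating a ten percent efficiency loss.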

2.3 Subsample Size Based on Hypothesis Testing

Let $\beta^{o}_{p}$ be the $p$-th element of $\boldsymbol{\beta}^{o}$. Suppose we aim to test the hypothesis $H_{0}:\beta^{o}_{p}=0$ against $H_{1}:\beta^{o}_{p}\neq 0$ at significance level $\alpha$ with power $\gamma$. Our current objective is to determine the necessary subsample size, given that $\beta^{o}_{p}=\beta_{p}^{*}$, where $\beta_{p}^{*}$ is specified by the researcher. Since the minimal $n$ should satisfy

\[
n=\left\lceil(Z_{1-\alpha/2}+Z_{\gamma})^{2}\left\{\boldsymbol{\mathcal{I}}^{-1}(\boldsymbol{\beta}^{o})+\frac{n}{q_{n}}\boldsymbol{\mathcal{I}}^{-1}(\boldsymbol{\beta}^{o})\boldsymbol{\varphi}(\mathbf{p}^{opt},\boldsymbol{\beta}^{o})\boldsymbol{\mathcal{I}}^{-1}(\boldsymbol{\beta}^{o})\right\}_{pp}\beta_{p}^{*-2}\right\rceil\,,
\]

where $\lceil\cdot\rceil$ is the ceiling function, the required $q_{n}$ is obtained by solving for $q_{n}$ and plugging in estimated quantities, namely

\[
\widetilde{q}_{n}=\left\lceil\frac{\left\{\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_{U})\,\widetilde{\boldsymbol{\varphi}}_{Q_{1.5}}(\widetilde{\mathbf{p}}^{opt},\widetilde{\boldsymbol{\beta}}_{U})\,\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_{U})\right\}_{pp}(Z_{1-\alpha/2}+Z_{\gamma})^{2}}{{\beta_{p}^{*}}^{2}-n^{-1}\left\{\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_{U})\right\}_{pp}(Z_{1-\alpha/2}+Z_{\gamma})^{2}}\right\rceil.\tag{7}
\]

This formula is convenient and practically valuable because, upon completing Steps 1 and 1.5, we can straightforwardly plot $\widetilde{q}_{n}$ as a function of $\gamma$. A negative value of $\widetilde{q}_{n}$ indicates that the required power cannot be achieved even with the entire sample of size $n$. Our simulation study demonstrates that, in scenarios where the required power is attainable, typically only a small fraction of the censored observations is necessary.
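A small sketch of the required-subsample-size computation, derived from the power equation above with plug-in estimates: the entries `v_pp` and `ivi_pp` stand for the $(p,p)$ elements of $\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}$ and of $\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}\widetilde{\boldsymbol{\varphi}}_{Q_{1.5}}\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}$, respectively; the numeric values in the example are purely illustrative.

```python
import math
from statistics import NormalDist

def required_qn(v_pp, ivi_pp, beta_star, n, alpha=0.05, gamma=0.8):
    """Subsample size needed to detect beta_star with power gamma.

    Implements the solved power equation: qn such that
    beta_star^2 = z^2 * (v_pp / n + ivi_pp / qn), z = Z_{1-alpha/2} + Z_gamma.
    Returns None when the requested power is unattainable even with the
    full sample (the denominator is non-positive).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(gamma)
    denom = beta_star**2 - (z**2) * v_pp / n
    if denom <= 0:
        return None  # power unattainable even using all n observations
    return math.ceil(ivi_pp * z**2 / denom)

# Illustrative call: detectable effect 0.1, ten million observations.
qn = required_qn(v_pp=2.0, ivi_pp=1.5, beta_star=0.1, n=10**7)
```

Sweeping `gamma` over a grid and plotting the returned sizes gives the power curve described in the text.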

We summarize the additional step for single-covariate hypothesis testing by adding the following mid-step to the original two-step Algorithm 1:

Step 1.5*: Sample $q_{0}$ observations from $\mathcal{C}$ using the optimal sampling probabilities computed in Step 1. Combine these observations with the observed events to create $\mathcal{Q}_{1.5}$. Compute $\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_{U})$, $\widetilde{\boldsymbol{\varphi}}_{Q_{1.5}}(\mathbf{p}^{opt},\widetilde{\boldsymbol{\beta}}_{U})$, and $\widetilde{q}_{n}$, with the desired values of $\alpha$ and $\gamma$. If $\widetilde{q}_{n}<0$, achieving the required power is not feasible even with the entire sample. Otherwise, set $q_{n}=\widetilde{q}_{n}$.
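To illustrate, the plug-in computation of Eq. (7) can be sketched as follows. The function and argument names are ours, not the paper's notation: `V_pp` stands for the $pp$-th entry of the pilot-based sandwich matrix $\widetilde{\boldsymbol{\mathcal{I}}}^{-1}\widetilde{\boldsymbol{\varphi}}\widetilde{\boldsymbol{\mathcal{I}}}^{-1}$, `I_inv_pp` for the $pp$-th entry of $\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}$, and `beta_p` for the effect size $\beta_p^*$.

```python
from math import ceil
from statistics import NormalDist

def required_subsample_size(V_pp, I_inv_pp, beta_p, n, alpha=0.05, gamma=0.8):
    """Sketch of Eq. (7): smallest subsample size attaining power gamma for a
    two-sided level-alpha Wald test of H0: beta_p = 0.

    V_pp     : pp-th entry of I^{-1} phi I^{-1}, estimated from the pilot.
    I_inv_pp : pp-th entry of I^{-1}, estimated from the pilot.
    beta_p   : effect size under the alternative (beta_p^* in the text).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(gamma)
    denom = beta_p ** 2 - I_inv_pp * z ** 2 / n
    if denom <= 0:
        return -1  # required power unattainable even with the full sample
    return ceil(V_pp * z ** 2 / denom)
```

Evaluating this over a grid of `gamma` values reproduces the plot of $\widetilde{q}_{n}$ against $\gamma$ described above; a negative return signals that the full sample is insufficient.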

3 Logistic Regression with Rare Events

3.1 Two-Step Algorithm

While Wang et al. (2018) introduced a two-stage optimal subsampling algorithm for logistic regression, their asymptotic variance approximation may not perform well on highly imbalanced data (i.e., when the rate of cases is below 15%). Section 4 presents a method for choosing a subsample size for their optimal subsampling algorithm, and our simulation results indeed indicate that it is effective primarily when the event is not rare.

In the realm of subsampling for imbalanced binary data, various methods have been explored and developed (Wang, 2020; Wang et al., 2021). However, their results were derived under the assumption that the intercept diverges as the sample size goes to infinity while the other coefficients are fixed, so that the probability of experiencing an event decreases to zero. Our current work, akin to Wang et al. (2021), is based on subsampling only among the non-case observations while retaining all cases. Notably, our approach does not require the undesirable assumption that the event probability approaches zero as the sample size increases. Furthermore, our method yields a simpler formula for the asymptotic variance, enabling evaluation of the required subsample size in a practically efficient manner.

Let $D_{i}\in\{0,1\}$ be the response of individual $i$, $i=1,\dots,n$. In order to include an intercept term, we extend each individual's covariate vector from $r$ to $r+1$ entries by fixing its first element at 1, and for simplicity of presentation we continue using the notation $\mathbf{X}_{i}$. Let $\mathcal{N}=\{i\,;\,D_{i}=0\}$ and $n_{0}=|\mathcal{N}|$. The logistic regression model is of the form

\[
\Pr(D_{i}=1|\mathbf{X}_{i})=\mu_{i}(\boldsymbol{\beta})=e^{\mathbf{X}_{i}^{T}\boldsymbol{\beta}}\left(1+e^{\mathbf{X}_{i}^{T}\boldsymbol{\beta}}\right)^{-1},\quad i=1,\ldots,n\,,
\]

and the maximum likelihood estimator (MLE) is given by

\[
\widehat{\boldsymbol{\beta}}_{MLE}=\arg\max_{\boldsymbol{\beta}}\sum_{i=1}^{n}\big[D_{i}\log\mu_{i}(\boldsymbol{\beta})+(1-D_{i})\log\{1-\mu_{i}(\boldsymbol{\beta})\}\big]\,.
\]

As before, let $q_{n}$ be the size of the subsample from $\mathcal{N}$, $\pi_{i}$ the sampling probability of individual $i$, with $\sum_{i\in\mathcal{N}}\pi_{i}=1$, and let $\mathcal{Q}$ be the index set containing all the observed cases (i.e., $D_{i}=1$) and the subsampled non-cases ($D_{i}=0$). Finally, for $i=1,\ldots,n$, set

\[
w_{i}=\begin{cases}(\pi_{i}q_{n})^{-1},&\text{if }D_{i}=0\\ 1,&\text{if }D_{i}=1\end{cases}
\]

as the sampling weights. Then, the estimator $\widetilde{\boldsymbol{\beta}}$ based on $\mathcal{Q}$ is obtained by maximizing the pseudo log-likelihood function

\[
l^{*}(\boldsymbol{\beta})=\sum_{i\in\mathcal{Q}}w_{i}\big[D_{i}\log\mu_{i}(\boldsymbol{\beta})+(1-D_{i})\log\{1-\mu_{i}(\boldsymbol{\beta})\}\big]\,.\tag{8}
\]
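For concreteness, maximizing Eq. (8) can be sketched with a few Newton-Raphson iterations. This is a minimal illustration under our own function names, not the implementation used in the paper.

```python
import numpy as np

def mu(X, beta):
    # logistic mean mu_i(beta) = exp(X_i^T beta) / (1 + exp(X_i^T beta))
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

def weighted_logistic_mle(X, D, w, n_iter=25):
    """Maximize the weighted pseudo log-likelihood l*(beta) of Eq. (8)
    by Newton-Raphson, starting from beta = 0."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = mu(X, beta)
        grad = X.T @ (w * (D - p))                      # weighted score
        hess = (X * (w * p * (1 - p))[:, None]).T @ X   # weighted information
        beta = beta + np.linalg.solve(hess, grad)
    return beta
```

With all weights equal to one and the full sample, this reduces to the ordinary MLE $\widehat{\boldsymbol{\beta}}_{MLE}$.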

The following theorem provides the asymptotic distribution of a general subsampling-based estimator $\widetilde{\boldsymbol{\beta}}$, for any vector of sampling probabilities, under standard assumptions (see the SM, Section S2). Based on this asymptotic distribution, the optimal sampling probabilities are then derived. Since these optimal probabilities involve the true unknown $\boldsymbol{\beta}^{o}$, we subsequently describe a two-step algorithm for logistic regression that uses an approximation of the optimal probabilities.

Theorem 1

If Assumptions A.1-A.4 hold, then as $q_{n},n\rightarrow\infty$,

\[
\sqrt{n}\,\mathbb{H}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})^{-1/2}(\widetilde{\boldsymbol{\beta}}-\boldsymbol{\beta}^{o})\xrightarrow{D}N(0,\mathbf{I})
\]

where

\[
\mathbb{H}^{R}(\boldsymbol{\pi},\boldsymbol{\beta})=\mathbf{M}_{X}^{-1}(\boldsymbol{\beta})+\frac{n}{q_{n}}\mathbf{M}_{X}^{-1}(\boldsymbol{\beta})\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta})\mathbf{M}_{X}^{-1}(\boldsymbol{\beta})\,,
\]
\[
\mathbf{M}_{X}(\boldsymbol{\beta})=n^{-1}\sum_{i=1}^{n}\mu_{i}(\boldsymbol{\beta})\{1-\mu_{i}(\boldsymbol{\beta})\}\mathbf{X}_{i}\mathbf{X}_{i}^{T}\,,
\]

and

\[
\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta})=\frac{1}{n^{2}}\bigg\{\sum_{i\in\mathcal{N}}\frac{\mu^{2}_{i}(\boldsymbol{\beta})\mathbf{X}_{i}\mathbf{X}_{i}^{T}}{\pi_{i}}-\sum_{i,j\in\mathcal{N}}\mu_{i}(\boldsymbol{\beta})\mu_{j}(\boldsymbol{\beta})\mathbf{X}_{i}\mathbf{X}_{j}^{T}\bigg\}\,.
\]
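The quantities appearing in Theorem 1 are straightforward to compute numerically. The following sketch (our own helper, taking plug-in values for $\boldsymbol{\beta}$ and $\boldsymbol{\pi}$) assembles $\mathbf{M}_X$, $\mathbf{K}^R$, and $\mathbb{H}^R$:

```python
import numpy as np

def variance_components(X, D, pi, beta, q_n):
    """Plug-in M_X, K^R and H^R of Theorem 1.
    pi holds sampling probabilities for the non-cases (summing to 1 there)."""
    n = X.shape[0]
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))           # mu_i(beta)
    M = (X * (p * (1 - p))[:, None]).T @ X / n      # M_X(beta)
    nc = D == 0                                     # the index set N
    mx = p[nc, None] * X[nc]                        # mu_i(beta) X_i, i in N
    K = ((mx / pi[nc, None]).T @ mx
         - np.outer(mx.sum(axis=0), mx.sum(axis=0))) / n ** 2  # K^R(pi, beta)
    M_inv = np.linalg.inv(M)
    H = M_inv + (n / q_n) * M_inv @ K @ M_inv       # H^R(pi, beta)
    return M, K, H
```

By the Cauchy-Schwarz inequality, $\mathbf{K}^R$ is positive semi-definite for any valid $\boldsymbol{\pi}$, so subsampling can only inflate the variance relative to $n^{-1}\mathbf{M}_X^{-1}$.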

We now derive the optimal sampling probabilities under the A-optimality and L-optimality criteria, as before.

Theorem 2

The respective A-optimal and L-optimal sampling probability vectors, denoted by $\boldsymbol{\pi}^{R,A}$ and $\boldsymbol{\pi}^{R,L}$, which minimize the trace of $\mathbb{H}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})$ and of $\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})$, respectively, are given by

\[
\pi_{m}^{R,A}=\frac{\mu_{m}(\boldsymbol{\beta}^{o})\|\mathbf{M}_{X}^{-1}(\boldsymbol{\beta}^{o})\mathbf{X}_{m}\|_{2}}{\sum_{j\in\mathcal{N}}\mu_{j}(\boldsymbol{\beta}^{o})\|\mathbf{M}_{X}^{-1}(\boldsymbol{\beta}^{o})\mathbf{X}_{j}\|_{2}}\quad\text{for all }m\in\mathcal{N}\tag{9}
\]

and

\[
\pi_{m}^{R,L}=\frac{\mu_{m}(\boldsymbol{\beta}^{o})\|\mathbf{X}_{m}\|_{2}}{\sum_{j\in\mathcal{N}}\mu_{j}(\boldsymbol{\beta}^{o})\|\mathbf{X}_{j}\|_{2}}\quad\text{for all }m\in\mathcal{N}.\tag{10}
\]

Notably, the optimal probabilities in (9) and (10) resemble those derived for the subsampling approach of Wang et al. (2018) under a balanced design. The discrepancy between these optimal probabilities and their counterparts in Wang et al. (2018) stems from the fact that, here, the summation in the denominators is restricted to the set $\mathcal{N}$ rather than the entire sample. Since $\boldsymbol{\pi}^{R,A}$ and $\boldsymbol{\pi}^{R,L}$ involve the unknown $\boldsymbol{\beta}^{o}$, we suggest the following two-step algorithm, in the spirit of the previous section and Wang et al. (2018):
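A plug-in computation of (9) and (10), with a pilot estimate standing in for $\boldsymbol{\beta}^{o}$, can be sketched as follows (function and argument names are ours):

```python
import numpy as np

def optimal_probs(X, D, beta, criterion="L"):
    """A-optimal (Eq. 9) or L-optimal (Eq. 10) sampling probabilities over
    the non-cases; `beta` is a pilot estimate standing in for beta^o."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))      # mu_m(beta)
    nc = D == 0
    if criterion == "A":
        M = (X * (p * (1 - p))[:, None]).T @ X / X.shape[0]
        norms = np.linalg.norm(np.linalg.solve(M, X[nc].T), axis=0)  # ||M^{-1} X_m||_2
    else:
        norms = np.linalg.norm(X[nc], axis=1)                        # ||X_m||_2
    scores = p[nc] * norms
    pi = np.zeros(X.shape[0])
    pi[nc] = scores / scores.sum()             # normalize over N only
    return pi
```

Both criteria assign the highest probabilities to non-cases whose event probability $\mu_m$ is large and whose covariate vector has a large (possibly $\mathbf{M}_X^{-1}$-weighted) norm.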

Step 1: Sample $q_{0}$ observations uniformly from $\mathcal{N}$ and combine them with all the observed events to create $\mathcal{Q}_{pilot}$. Perform a weighted logistic regression on $\mathcal{Q}_{pilot}$, based on Eq. (8), and obtain $\widetilde{\boldsymbol{\beta}}_{U}$. Utilize this estimator to derive approximated optimal sampling probabilities by substituting $\widetilde{\boldsymbol{\beta}}_{U}$ for $\boldsymbol{\beta}^{o}$ in Eq. (9) or (10).

Step 2: Sample $q_{n}$ observations from $\mathcal{N}$ using the sampling probabilities computed in Step 1. Combine these observations with the observed events to create $\mathcal{Q}$ and conduct a weighted logistic regression on $\mathcal{Q}$, based on Eq. (8), to obtain the two-step estimator $\widetilde{\boldsymbol{\beta}}_{TS}$.
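Putting the two steps together, an end-to-end sketch using the L-optimal probabilities of Eq. (10) is given below. All function names are ours, sampling is done with replacement for simplicity, and the Newton solver is a minimal stand-in for a proper weighted logistic fit:

```python
import numpy as np

def fit(X, D, w, n_iter=30):
    # weighted logistic regression via Newton-Raphson (pseudo-likelihood, Eq. 8)
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ b, -30, 30)))
        hess = (X * (w * p * (1 - p))[:, None]).T @ X
        b = b + np.linalg.solve(hess, X.T @ (w * (D - p)))
    return b

def two_step(X, D, q0, qn, rng):
    cases = np.flatnonzero(D == 1)
    noncases = np.flatnonzero(D == 0)
    # Step 1: uniform pilot among the non-cases, retaining all cases
    pilot = rng.choice(noncases, size=q0, replace=True)
    w0 = np.concatenate([np.ones(cases.size),
                         np.full(q0, noncases.size / q0)])  # (pi_i q0)^{-1}
    idx0 = np.concatenate([cases, pilot])
    beta_U = fit(X[idx0], D[idx0], w0)
    # approximate the L-optimal probabilities of Eq. (10)
    mu = 1.0 / (1.0 + np.exp(-(X[noncases] @ beta_U)))
    s = mu * np.linalg.norm(X[noncases], axis=1)
    pi = s / s.sum()
    # Step 2: resample the non-cases with pi, refit with weights (pi_i qn)^{-1}
    pick = rng.choice(noncases.size, size=qn, replace=True, p=pi)
    idx1 = np.concatenate([cases, noncases[pick]])
    w1 = np.concatenate([np.ones(cases.size), 1.0 / (pi[pick] * qn)])
    return fit(X[idx1], D[idx1], w1)
```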

Consistency and asymptotic normality of $\widetilde{\boldsymbol{\beta}}_{TS}$ can be shown by following the main steps of Wang et al. (2018) and Keret and Gorfine (2023), as detailed in the SM, Section S2. As in the Cox regression setting, the recommended value of $q_{0}$ is $c_{0}(n-n_{0})$ with a small value of $c_{0}$. Once $\widetilde{\boldsymbol{\beta}}_{TS}$ is calculated, inference can be carried out using the subsample counterparts of $\mathbb{H}^{R}(\boldsymbol{\pi},\widetilde{\boldsymbol{\beta}}_{TS})$ and $\mathbf{K}^{R}(\boldsymbol{\pi},\widetilde{\boldsymbol{\beta}}_{TS})$, namely

\[
\widetilde{\mathbb{H}}^{R}(\boldsymbol{\pi},\widetilde{\boldsymbol{\beta}}_{TS})=\widetilde{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{TS})+\frac{n}{q_{n}}\widetilde{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{TS})\widetilde{\mathbf{K}}^{R}(\boldsymbol{\pi},\widetilde{\boldsymbol{\beta}}_{TS})\widetilde{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{TS})\tag{11}
\]

and

\[
\widetilde{\mathbf{K}}^{R}(\boldsymbol{\pi},\widetilde{\boldsymbol{\beta}}_{TS})=\frac{1}{n^{2}}\Bigg\{\frac{1}{q_{n}}\sum_{i\in\mathcal{Q}\setminus\mathcal{E}}\frac{\mu_{i}^{2}(\widetilde{\boldsymbol{\beta}}_{TS})\mathbf{X}_{i}\mathbf{X}_{i}^{T}}{\pi_{i}^{2}}-\frac{1}{q_{n}^{2}}\sum_{i\in\mathcal{Q}\setminus\mathcal{E}}\frac{\mu_{i}(\widetilde{\boldsymbol{\beta}}_{TS})\mathbf{X}_{i}}{\pi_{i}}\bigg(\sum_{i\in\mathcal{Q}\setminus\mathcal{E}}\frac{\mu_{i}(\widetilde{\boldsymbol{\beta}}_{TS})\mathbf{X}_{i}}{\pi_{i}}\bigg)^{T}\Bigg\}\,,
\]

where $\mathcal{E}=\{i\,:\,D_{i}=1\}$ and

\[
\widetilde{\mathbf{M}}_{X}(\widetilde{\boldsymbol{\beta}}_{TS})=\frac{1}{n}\sum_{i\in\mathcal{Q}}w_{i}\mu_{i}(\widetilde{\boldsymbol{\beta}}_{TS})\left\{1-\mu_{i}(\widetilde{\boldsymbol{\beta}}_{TS})\right\}\mathbf{X}_{i}\mathbf{X}_{i}^{T}\,.
\]

3.2 Choosing Subsample Size by Relative Efficiency or Hypothesis Testing

In the spirit of Section 2.2, we can estimate the RE of the two-step estimator relative to the estimator based on the entire dataset, in order to assess the required subsample size of Step 2. To this end, define

\[
\check{\mathbb{H}}^{R}(\boldsymbol{\pi}^{opt},\widetilde{\boldsymbol{\beta}}_{U})=\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})+\frac{n}{q_{0}}\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})\check{\mathbf{K}}^{R}(\boldsymbol{\pi}^{opt},\widetilde{\boldsymbol{\beta}}_{U})\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})\tag{12}
\]

where

\[
\check{\mathbf{M}}_{X}(\widetilde{\boldsymbol{\beta}}_{U})=\frac{1}{n}\sum_{i\in\mathcal{Q}_{1.5}}\check{w}_{i}\mu_{i}(\widetilde{\boldsymbol{\beta}}_{U})\left\{1-\mu_{i}(\widetilde{\boldsymbol{\beta}}_{U})\right\}\mathbf{X}_{i}\mathbf{X}_{i}^{T}\,,\tag{13}
\]
\[
\check{w}_{i}=\begin{cases}(\pi_{i}^{opt}q_{0})^{-1},&\text{if }D_{i}=0\\ 1,&\text{if }D_{i}=1\end{cases}
\]

and

\[
\check{\mathbf{K}}^{R}(\boldsymbol{\pi},\widetilde{\boldsymbol{\beta}}_{U})=\frac{1}{n^{2}}\left\{\frac{1}{q_{0}}\sum_{i\in\mathcal{Q}_{1.5}\setminus\mathcal{E}}\frac{\mu^{2}_{i}(\widetilde{\boldsymbol{\beta}}_{U})\mathbf{X}_{i}\mathbf{X}_{i}^{T}}{\pi_{i}^{2}}-\frac{1}{q_{0}^{2}}\sum_{i\in\mathcal{Q}_{1.5}\setminus\mathcal{E}}\frac{\mu_{i}(\widetilde{\boldsymbol{\beta}}_{U})\mathbf{X}_{i}}{\pi_{i}}\left(\sum_{i\in\mathcal{Q}_{1.5}\setminus\mathcal{E}}\frac{\mu_{i}(\widetilde{\boldsymbol{\beta}}_{U})\mathbf{X}_{i}}{\pi_{i}}\right)^{T}\right\}\,.
\]

Finally, we define the RE estimators as

\widehat{RE}(q_{n})=\frac{\left\|n^{-1}\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})+q_{n}^{-1}\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})\check{\mathbb{H}}^{R}(\widetilde{\boldsymbol{\pi}}^{opt},\widetilde{\boldsymbol{\beta}}_{U})\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})\right\|_{F}}{\left\|n^{-1}\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})\right\|_{F}} \quad (14)

and

\widehat{RE}_{p}(q_{n})=\frac{\big[n^{-1}\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})+q_{n}^{-1}\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})\check{\mathbb{H}}^{R}(\widetilde{\boldsymbol{\pi}}^{opt},\widetilde{\boldsymbol{\beta}}_{U})\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})\big]_{pp}}{\big[n^{-1}\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})\big]_{pp}}\,. \quad (15)

The procedure can be easily incorporated within the two-step Algorithm 2 by the following additional step:

Step 1.5: Sample $q_{0}$ observations from $\mathcal{N}$ using the optimal sampling probabilities from Step 1. Combine the sampled observations with $\mathcal{E}$ to create $\mathcal{Q}_{1.5}$. Calculate $\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})$ and $\check{\mathbb{H}}^{R}(\widetilde{\boldsymbol{\pi}}^{opt},\widetilde{\boldsymbol{\beta}}_{U})$. Plot $\widehat{RE}(q_{n})$ or $\widehat{RE}_{p}(q_{n})$ as a function of $q_{n}$ and select the minimum $q_{n}$ that satisfies the required relative efficiency.
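As a minimal sketch of Step 1.5's search for the smallest $q_n$ meeting a target relative efficiency via Eq. (14) — using hypothetical toy matrices in place of $\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})$ and $\check{\mathbb{H}}^{R}(\widetilde{\boldsymbol{\pi}}^{opt},\widetilde{\boldsymbol{\beta}}_{U})$, which would in practice be computed from $\mathcal{Q}_{1.5}$:

```python
import numpy as np

def minimal_qn(M_inv, H_check, n, target_re, qn_grid):
    # RE(qn) = ||n^{-1} M^{-1} + qn^{-1} M^{-1} H M^{-1}||_F / ||n^{-1} M^{-1}||_F  (Eq. 14)
    base = M_inv / n                     # full-data variance component
    extra = M_inv @ H_check @ M_inv      # subsampling-noise component
    denom = np.linalg.norm(base, "fro")
    for qn in qn_grid:
        re = np.linalg.norm(base + extra / qn, "fro") / denom
        if re <= target_re:
            return qn, re                # smallest qn meeting the target
    return None, None                    # target unattainable on this grid

# toy plug-in quantities (hypothetical, for illustration only)
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
M_inv = A @ A.T + 3.0 * np.eye(3)        # stand-in for M_X^{-1}(beta_U)
H_check = 0.01 * np.linalg.inv(M_inv)    # stand-in for H^R(pi_opt, beta_U)
qn, re = minimal_qn(M_inv, H_check, n=10**6, target_re=1.15,
                    qn_grid=range(1000, 200001, 1000))
```

With these toy inputs the RE curve reduces to $1 + 10^{4}/q_n$, so the grid search returns the first $q_n$ with RE below 1.15.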

The minimal subsample size for testing $H_{0}:\beta^{o}_{p}=0$ against a two-sided alternative, given $\beta^{o}_{p}=\beta_{p}^{*}$, a significance level $\alpha$ and a power $\gamma$, is given by

\widetilde{q}_{n}=\left\lceil\frac{\left\{\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})\check{\mathbf{K}}(\boldsymbol{\pi}^{opt},\widetilde{\boldsymbol{\beta}}_{U})\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})\right\}_{pp}(Z_{1-\alpha/2}+Z_{\gamma})^{2}}{{\beta_{p}^{*}}^{2}-n^{-1}\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})_{pp}(Z_{1-\alpha/2}+Z_{\gamma})^{2}}\right\rceil\,. \quad (16)
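Formula (16) is straightforward to evaluate numerically. Below is a minimal sketch using the Python standard library; the arguments `v_sub` and `v_full_pp` stand for the two $pp$-entries in (16), and the plug-in values in the example are hypothetical, not taken from the paper:

```python
import math
from statistics import NormalDist

def minimal_qn_for_test(v_sub, v_full_pp, beta_star, alpha, gamma):
    """Eq. (16): minimal subsample size for testing H0: beta_p = 0
    with power gamma at level alpha, given beta_p = beta_star.
    v_sub     -- {M^{-1} K M^{-1}}_{pp}, subsampling variance entry
    v_full_pp -- n^{-1} {M^{-1}}_{pp}, full-data variance entry."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(gamma)
    denom = beta_star**2 - v_full_pp * z**2
    if denom <= 0:
        # required power is unattainable even with the entire dataset
        return None
    return math.ceil(v_sub * z**2 / denom)

# illustrative plug-in values (hypothetical)
qn = minimal_qn_for_test(v_sub=2.5, v_full_pp=1e-6,
                         beta_star=0.05, alpha=0.05, gamma=0.8)
```

A negative denominator corresponds to the case $\widetilde{q}_{n}<0$ discussed below, where the required power cannot be reached even with the full data.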

A plot of $\widetilde{q}_{n}$ as a function of $\gamma$ can be easily generated. The algorithm for single-covariate hypothesis testing is defined by adding the following mid-step to the two-step Algorithm 2:

Step 1.5*: Sample $q_{0}$ observations from $\mathcal{N}$ using the optimal sampling probabilities of Step 1. Combine these sampled observations with $\mathcal{E}$ to form $\mathcal{Q}_{1.5}$. Compute $\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})$ and $\check{\mathbb{H}}^{R}(\widetilde{\boldsymbol{\pi}}^{opt},\widetilde{\boldsymbol{\beta}}_{U})$. Plot $\widetilde{q}_{n}$ against $\gamma$. If $\widetilde{q}_{n}<0$, achieving the required power is unattainable even with the entire dataset of size $n$. Otherwise, set $q_{n}=\widetilde{q}_{n}$.

4 Logistic Regression with Nearly Balanced Data

4.1 The Two-Step Optimal Subsampling Algorithm (Wang et al., 2018)

While Wang et al. (2018) presented an optimal two-step subsampling algorithm for logistic regression with nearly balanced data and laid its asymptotic theoretical foundations, they did not offer a method for selecting the subsample size. This section remedies that gap. For clarity, we begin with a summary of their optimal two-step subsampling algorithm.

In the rare-event setting, sampling is performed exclusively from the majority class, whereas in the nearly balanced binary-outcome scenario, sampling is conducted from the entire sample. Hence, we now redefine $\mathcal{Q}$ as the index set of all observations included in the subsample, with sampling weights $w_{i}=(\pi_{i}q_{n})^{-1}$, $i=1,\dots,n$. The estimator $\widetilde{\boldsymbol{\beta}}$ based on $\mathcal{Q}$ is then obtained by maximizing the pseudo log-likelihood function (8).

Under some regularity assumptions, Wang et al. (2018) showed that given $\mathcal{F}_{n}=\{D_{i},\mathbf{X}_{i}\,,\,i=1,\ldots,n\}$, the asymptotic distribution of $\widetilde{\boldsymbol{\beta}}$ is

\sqrt{n}\,\mathbb{H}^{B}(\boldsymbol{\pi},\widehat{\boldsymbol{\beta}}_{MLE})^{-1/2}(\widetilde{\boldsymbol{\beta}}-\widehat{\boldsymbol{\beta}}_{MLE})\xrightarrow{D}N(0,\mathbf{I}) \quad (17)

as $n,q_{n}\rightarrow\infty$, where

\mathbb{H}^{B}(\boldsymbol{\pi},\boldsymbol{\beta})=\mathbf{M}_{X}^{-1}(\boldsymbol{\beta})\mathbf{K}^{B}(\boldsymbol{\pi},\boldsymbol{\beta})\mathbf{M}_{X}^{-1}(\boldsymbol{\beta})

and

\mathbf{K}^{B}(\boldsymbol{\pi},\boldsymbol{\beta})=\frac{1}{n^{2}}\sum_{i=1}^{n}w_{i}\left\{D_{i}-\mu_{i}(\boldsymbol{\beta})\right\}^{2}\mathbf{X}_{i}\mathbf{X}_{i}^{T}\,.

Then, the A-optimal and L-optimal subsampling probabilities are given by

\pi_{i}^{B,A}=\frac{|D_{i}-\mu_{i}(\widehat{\boldsymbol{\beta}}_{MLE})|\,\|\mathbf{M}_{X}^{-1}\mathbf{X}_{i}\|}{\sum_{j=1}^{n}|D_{j}-\mu_{j}(\widehat{\boldsymbol{\beta}}_{MLE})|\,\|\mathbf{M}_{X}^{-1}\mathbf{X}_{j}\|}\,,\quad i=1,\dots,n\, \quad (18)

and

\pi_{i}^{B,L}=\frac{|D_{i}-\mu_{i}(\widehat{\boldsymbol{\beta}}_{MLE})|\,\|\mathbf{X}_{i}\|}{\sum_{j=1}^{n}|D_{j}-\mu_{j}(\widehat{\boldsymbol{\beta}}_{MLE})|\,\|\mathbf{X}_{j}\|}\,,\quad i=1,\dots,n\,. \quad (19)
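Both sets of probabilities score each observation by its residual magnitude times a covariate norm. A minimal numpy sketch of (18) and (19) follows; the toy data and the assumption that $\mathbf{M}_{X}(\boldsymbol{\beta})$ is the usual logistic information matrix $n^{-1}\sum_{i}\mu_{i}(1-\mu_{i})\mathbf{X}_{i}\mathbf{X}_{i}^{T}$ (defined earlier in the paper, not in this excerpt) are illustrative assumptions:

```python
import numpy as np

def l_optimal_probs(D, X, mu):
    # Eq. (19): pi_i proportional to |D_i - mu_i| * ||X_i||
    scores = np.abs(D - mu) * np.linalg.norm(X, axis=1)
    return scores / scores.sum()

def a_optimal_probs(D, X, mu, M_inv):
    # Eq. (18): pi_i proportional to |D_i - mu_i| * ||M^{-1} X_i||
    scores = np.abs(D - mu) * np.linalg.norm(X @ M_inv.T, axis=1)
    return scores / scores.sum()

# toy data (hypothetical, for illustration only)
rng = np.random.default_rng(1)
n, d = 1000, 3
X = rng.standard_normal((n, d))
beta = np.array([0.5, -0.3, 0.2])
mu = 1.0 / (1.0 + np.exp(-X @ beta))      # mu_i(beta), logistic mean
D = rng.binomial(1, mu)
M_inv = np.linalg.inv((X * (mu * (1 - mu))[:, None]).T @ X / n)
pi_L = l_optimal_probs(D, X, mu)
pi_A = a_optimal_probs(D, X, mu, M_inv)
```

The L-optimal variant avoids forming and inverting $\mathbf{M}_{X}$, which is why it is often preferred computationally.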

Since we wish to avoid evaluating the full-data estimator $\widehat{\boldsymbol{\beta}}_{MLE}$, Wang et al. (2018) proposed the following two-step algorithm:

Step 1: Sample $q_{0}$ observations using the following probabilities

\pi_{i}^{prop}=\begin{cases}(2n_{0})^{-1}&\text{if }D_{i}=0\\(2n_{1})^{-1}&\text{if }D_{i}=1\end{cases}\quad i=1,\dots,n\,, \quad (20)

where $n_{1}=n-n_{0}$ is the number of cases. Conduct a weighted logistic regression on the subsample, based on Eq. (8), to obtain $\widetilde{\boldsymbol{\beta}}_{prop}$. Derive the approximated optimal sampling probabilities by substituting $\widetilde{\boldsymbol{\beta}}_{prop}$ for $\widehat{\boldsymbol{\beta}}_{MLE}$ in (18) or (19).
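Step 1's case-control pilot assigns half the total sampling mass to each class, as in (20). A minimal sketch (toy data; the uniform-within-class draw and weight construction follow the $w_i=(\pi_i q_n)^{-1}$ definition above):

```python
import numpy as np

def pilot_probs(D):
    # Eq. (20): (2 n0)^{-1} for controls (D=0), (2 n1)^{-1} for cases (D=1)
    n1 = int(D.sum())
    n0 = len(D) - n1
    return np.where(D == 1, 0.5 / n1, 0.5 / n0)

# toy outcome vector (hypothetical, for illustration only)
rng = np.random.default_rng(2)
D = rng.binomial(1, 0.3, size=10_000)
pi = pilot_probs(D)
q0 = 500
idx = rng.choice(len(D), size=q0, replace=True, p=pi)  # pilot subsample indices
w = 1.0 / (pi[idx] * q0)                               # weights (pi_i q0)^{-1}
```

Sampling with replacement keeps each draw's inclusion probability exactly $\pi_i$, which is what the inverse-probability weights assume.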

Step 2: Sample $q_{n}$ observations from the entire sample using the probabilities of Step 1 to obtain $\mathcal{Q}$. Conduct a weighted logistic regression on $\mathcal{Q}$, based on Eq. (8), and obtain the two-step estimator $\widetilde{\boldsymbol{\beta}}_{TS}$.

Once $\widetilde{\boldsymbol{\beta}}_{TS}$ is calculated, inference can be carried out using the variance estimator $\widetilde{\mathbb{H}}^{B}(\widetilde{\boldsymbol{\beta}}_{TS})=\widetilde{\mathbf{M}}_{X}(\widetilde{\boldsymbol{\beta}}_{TS})^{-1}\widetilde{\mathbf{K}}^{B}(\widetilde{\boldsymbol{\beta}}_{TS})\widetilde{\mathbf{M}}_{X}(\widetilde{\boldsymbol{\beta}}_{TS})^{-1}$, where

\widetilde{\mathbf{K}}^{B}(\widetilde{\boldsymbol{\beta}})=\frac{1}{n^{2}}\sum_{i\in\mathcal{Q}}w_{i}^{2}\left\{D_{i}-\mu_{i}(\widetilde{\boldsymbol{\beta}})\right\}^{2}\mathbf{X}_{i}\mathbf{X}_{i}^{T}\,.
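The sandwich estimator $\widetilde{\mathbb{H}}^{B}$ can be sketched directly from a weighted subsample. The code below assumes (since the definition is not in this excerpt) that $\widetilde{\mathbf{M}}_{X}$ is the weighted logistic information $n^{-1}\sum_{i\in\mathcal{Q}}w_{i}\mu_{i}(1-\mu_{i})\mathbf{X}_{i}\mathbf{X}_{i}^{T}$; the subsample itself is toy data:

```python
import numpy as np

def sandwich_variance(X, D, mu, w, n):
    # K~^B of the display above: n^{-2} sum_i w_i^2 {D_i - mu_i}^2 X_i X_i^T
    K = (X * ((w**2) * (D - mu)**2)[:, None]).T @ X / n**2
    # M~_X: assumed weighted logistic information (definition not in this excerpt)
    M = (X * (w * mu * (1.0 - mu))[:, None]).T @ X / n
    M_inv = np.linalg.inv(M)
    return M_inv @ K @ M_inv              # H~^B = M~^{-1} K~^B M~^{-1}

# toy subsample (hypothetical, for illustration only)
rng = np.random.default_rng(3)
q, d, n = 200, 3, 50_000
Xq = rng.standard_normal((q, d))
mu_q = 1.0 / (1.0 + np.exp(-Xq @ np.array([0.4, -0.2, 0.1])))
Dq = rng.binomial(1, mu_q)
wq = np.full(q, n / q)                    # weights (pi_i q)^{-1} under uniform pi
H = sandwich_variance(Xq, Dq, mu_q, wq, n)
```

Being a product $M^{-1} K M^{-1}$ with $K$ a sum of nonnegative rank-one terms, the result is symmetric positive semi-definite, as a variance estimate must be.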

The concepts discussed earlier can certainly be used to determine the desired value of $q_{n}$. However, the asymptotic properties outlined in Wang et al. (2018) are confined to the conditional space, conditioning on the entire observed data $\mathcal{F}_{n}$. In contrast, our approach to choosing the optimal $q_{n}$ requires the asymptotic distribution under the unconditional space. The following theorem presents this result; the proof is available in the SM, Section S2.

Theorem 3

Given Assumptions A.1-A.3 (see SM, Section S2) and as $q_{n},n\rightarrow\infty$,

\sqrt{n}\,\mathbb{H}^{B}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})^{-1/2}(\widetilde{\boldsymbol{\beta}}_{TS}-\boldsymbol{\beta}^{o})\xrightarrow{D}N(0,\mathbf{I})\,.

4.2 Choosing Subsample Size by Relative Efficiency or Hypothesis Testing

An estimator of the RE of the two-step estimator relative to the estimator based on the entire dataset is given by

RE(q_{n})=\frac{\|\mathbb{H}^{B}(\widetilde{\boldsymbol{\beta}}_{TS})\|_{F}}{\|n^{-1}\mathbf{M}_{X}^{-1}(\widehat{\boldsymbol{\beta}}_{MLE})\|_{F}}\,,

and the respective estimator that focuses on the $p^{th}$ covariate is given by

RE_{p}(q_{n})=\frac{\left[\mathbb{H}^{B}(\widetilde{\boldsymbol{\beta}}_{TS})\right]_{pp}}{\left[n^{-1}\mathbf{M}_{X}^{-1}(\widehat{\boldsymbol{\beta}}_{MLE})\right]_{pp}}\,.

Once again, we substitute the consistent Step-1 estimator $\widetilde{\boldsymbol{\beta}}_{prop}$ for both $\widetilde{\boldsymbol{\beta}}_{TS}$ and $\widehat{\boldsymbol{\beta}}_{MLE}$. To ensure numerical stability in approximating $\mathbb{H}^{B}(\widetilde{\boldsymbol{\beta}}_{prop})$ and $\mathbf{M}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{prop})$, we recommend using the estimated optimal probabilities of Step 1 to sample an additional set of size $q_{0}$, denoted $\mathcal{Q}_{1.5}$. Let

\check{\mathbb{H}}^{B}(\boldsymbol{\pi},\widetilde{\boldsymbol{\beta}})=\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}})\check{\mathbf{K}}^{B}(\boldsymbol{\pi},\widetilde{\boldsymbol{\beta}})\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}})

where

\check{\mathbf{K}}^{B}(\widetilde{\boldsymbol{\beta}})=\frac{1}{n^{2}}\sum_{i\in\mathcal{Q}_{1.5}}\check{w}_{i}^{2}\left\{D_{i}-\mu_{i}(\widetilde{\boldsymbol{\beta}})\right\}^{2}\mathbf{X}_{i}\mathbf{X}_{i}^{T}\,,

and wˇi=(q0πi)1subscriptˇ𝑤𝑖superscriptsubscript𝑞0subscript𝜋𝑖1\check{w}_{i}=(q_{0}\pi_{i})^{-1}overroman_ˇ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ( italic_q start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT, i=1,,n𝑖1𝑛i=1,\dots,nitalic_i = 1 , … , italic_n. Finally,

RE^(qn)=q0qn1ˇB(𝜷~prop)n1𝐌ˇX1(𝜷~prop).^𝑅𝐸subscript𝑞𝑛normsubscript𝑞0superscriptsubscript𝑞𝑛1superscriptˇ𝐵subscript~𝜷𝑝𝑟𝑜𝑝normsuperscript𝑛1superscriptsubscriptˇ𝐌𝑋1subscript~𝜷𝑝𝑟𝑜𝑝\widehat{RE}(q_{n})=\frac{\|q_{0}q_{n}^{-1}\check{\mathbb{H}}^{B}(\widetilde{%\boldsymbol{\beta}}_{prop})\|}{\|n^{-1}\check{\mathbf{M}}_{X}^{-1}(\widetilde{%\boldsymbol{\beta}}_{prop})\|}\,.over^ start_ARG italic_R italic_E end_ARG ( italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) = divide start_ARG ∥ italic_q start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT overroman_ˇ start_ARG blackboard_H end_ARG start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT ( over~ start_ARG bold_italic_β end_ARG start_POSTSUBSCRIPT italic_p italic_r italic_o italic_p end_POSTSUBSCRIPT ) ∥ end_ARG start_ARG ∥ italic_n start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT overroman_ˇ start_ARG bold_M end_ARG start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( over~ start_ARG bold_italic_β end_ARG start_POSTSUBSCRIPT italic_p italic_r italic_o italic_p end_POSTSUBSCRIPT ) ∥ end_ARG .(21)

Unlike the RE estimator in the rare-event setting, Eq. (21) approaches zero as $q_{n}\rightarrow\infty$ while $q_{0}$ and $n$ are kept fixed. Consequently, only practical sizes of $q_{n}$ should be considered; in particular, values of $q_{n}$ close to $n$ should be excluded from the plot of $\widehat{RE}(q_{n})$ as a function of $q_{n}$. This procedure can be seamlessly integrated into the two-step Algorithm 3 with minimal additional computation time, as outlined below:

Step 1.5: Draw a sample of $q_{0}$ observations from the entire dataset using the optimal sampling probabilities obtained in Step 1 to create $\mathcal{Q}_{1.5}$. Compute $\check{\mathbb{H}}^{B}(\widetilde{\boldsymbol{\beta}}_{prop})$ and $\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{prop})$. Generate a plot of $\widehat{RE}(q_{n})$ against $q_{n}$ and select the smallest $q_{n}$ that meets the desired RE.
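The Step 1.5 selection rule is easy to automate. The sketch below is hypothetical (not the authors' code): the plug-in matrices are passed as plain NumPy arrays, the Frobenius norm stands in for $\|\cdot\|$, and Eq. (21) is evaluated over a grid of candidate subsample sizes.

```python
import numpy as np

def re_curve(H_check, M_inv, n, q0, qn_grid):
    """Evaluate the RE estimator of Eq. (21) over a grid of candidate
    subsample sizes qn, using the Frobenius norm."""
    denom = np.linalg.norm(M_inv / n)
    return np.array([np.linalg.norm((q0 / qn) * H_check) / denom
                     for qn in qn_grid])

def smallest_qn(H_check, M_inv, n, q0, qn_grid, target_re):
    """Smallest qn on the grid whose estimated RE does not exceed the
    target; returns None when no grid value attains it."""
    re_vals = re_curve(H_check, M_inv, n, q0, qn_grid)
    hits = np.flatnonzero(re_vals <= target_re)
    return int(qn_grid[hits[0]]) if hits.size else None
```

Only practical grid values should be supplied since, as noted above, the estimator tends to zero as $q_{n}$ grows.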

Similarly, the minimal subsample size for testing $H_{0}:\beta^{o}_{p}=0$ against a two-sided alternative, given $\beta^{o}_{p}=\beta_{p}^{*}$, a significance level $\alpha$, and power $\gamma$, is given by

$$\widetilde{q}_{n}=\left\lceil\frac{q_{0}(Z_{1-\alpha/2}+Z_{\gamma})^{2}\big[\check{\mathbb{H}}^{B}(\widetilde{\boldsymbol{\beta}}_{prop})\big]_{pp}}{{\beta_{p}^{*}}^{2}}\right\rceil\,. \quad (22)$$

A plot of $\widetilde{q}_{n}$ as a function of $\gamma$ is easily generated. The values derived from Eq. (22) may exceed the sample size $n$. Although subsampling beyond $n$ yields no information beyond what is already captured by the full-data MLE, $\widehat{\boldsymbol{\beta}}_{MLE}$, a value surpassing $n$ remains informative: it indicates that the desired statistical power cannot be attained. In conclusion, the algorithm for single-covariate hypothesis testing is obtained by adding the following mid-step to the two-step Algorithm 3:

Step 1.5*: Draw a sample of $q_{0}$ observations from the entire dataset using the optimal sampling probabilities obtained in Step 1 to create $\mathcal{Q}_{1.5}$. Compute $\check{\mathbb{H}}^{B}(\widetilde{\boldsymbol{\beta}}_{prop})$ and $\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{prop})$. Generate a plot of $\widetilde{q}_{n}$ against $\gamma$ and select the subsample size corresponding to the desired power.
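Eq. (22) is straightforward to evaluate once the $(p,p)$ element of the estimated sandwich matrix is available. A minimal sketch (function name and interface hypothetical):

```python
from math import ceil
from statistics import NormalDist

def minimal_subsample_size(q0, h_pp, beta_star, alpha=0.05, power=0.8):
    """Eq. (22): minimal subsample size for a two-sided test of
    H0: beta_p = 0 at level alpha with power gamma, given h_pp, the
    (p,p) element of the estimated sandwich matrix, and the postulated
    effect size beta_p^*."""
    z = NormalDist().inv_cdf  # standard-normal quantile function
    return ceil(q0 * (z(1 - alpha / 2) + z(power)) ** 2 * h_pp
                / beta_star ** 2)
```

A power curve follows by evaluating the function over a grid of power values; results exceeding $n$ flag that the desired power is unattainable.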

5 Simulation Study

5.1 Cox Regression

5.1.1 Data Generation

The sampling designs are similar to those of Keret and Gorfine (2023). For each of the settings described below, 500 samples were drawn, each of size $n=15{,}000$, with $\boldsymbol{\beta}^{o}=(0.3,-0.5,0.1,-0.1,0.1,-0.3)^{T}$. Censoring times were generated from an exponential distribution with rate 0.2, independently of the failure times. The instantaneous baseline hazard rate was set to $\lambda_{0}(t)=0.001\,I(t<6)+c_{\lambda_{0}}I(t\geq 6)$. The covariate distributions and the parameter $c_{\lambda_{0}}$ of Settings I, II, and III were as follows:

  1. Setting I: $X_{j}\sim \mathrm{Unif}(0,4)$, $j=1,\ldots,6$, and $c_{\lambda_{0}}=0.075$. This is a setting of equal variances and no correlation between the covariates.

  2. Setting II: $X_{j}\sim \mathrm{Unif}(0,\theta_{j})$, $(\theta_{1},\ldots,\theta_{6})=(1,6,2,2,1,6)$, and $c_{\lambda_{0}}=0.15$. This is a setting of unequal variances and no correlation between covariates.

  3. Setting III: $X_{1},X_{2}$, and $X_{3}$ are independently sampled from $\mathrm{Unif}(0,4)$; $X_{4}=0.5X_{1}+0.5X_{2}+\varepsilon_{1}$, $X_{5}=X_{1}+\varepsilon_{2}$, $X_{6}=X_{1}+\varepsilon_{3}$, and $c_{\lambda_{0}}=0.05$, where $\varepsilon_{1}\sim N(0,0.1)$, $\varepsilon_{2}\sim N(0,1)$, $\varepsilon_{3}\sim N(1,1.5)$, and the $\varepsilon$'s are independent. The strongest correlation between two covariates is about 0.75.
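As an illustration, Setting I can be simulated by inverse-transform sampling from the piecewise-constant baseline hazard. The helper below is a hypothetical sketch under the stated design, not the code used in the study:

```python
import numpy as np

def simulate_setting_I(n=15_000, c_lambda0=0.075, censor_rate=0.2, seed=0):
    """Hypothetical sketch of Setting I: six Unif(0,4) covariates,
    baseline hazard lambda0(t) = 0.001*I(t<6) + c_lambda0*I(t>=6),
    and independent Exp(0.2) censoring."""
    rng = np.random.default_rng(seed)
    beta = np.array([0.3, -0.5, 0.1, -0.1, 0.1, -0.3])
    X = rng.uniform(0, 4, size=(n, 6))
    eta = X @ beta
    # Inverse transform: solve Lambda0(T) * exp(eta) = E with E ~ Exp(1),
    # where Lambda0 is the cumulative baseline hazard.
    target = rng.exponential(size=n) * np.exp(-eta)   # required Lambda0(T)
    knot = 0.001 * 6                                  # Lambda0(6) = 0.006
    T = np.where(target < knot,
                 target / 0.001,
                 6 + (target - knot) / c_lambda0)
    C = rng.exponential(scale=1 / censor_rate, size=n)
    time = np.minimum(T, C)
    delta = (T <= C).astype(int)                      # event indicator
    return X, time, delta
```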

5.1.2 Results

Eqs. (6) and (7) become practically valuable only when the approximations made in Step 1.5 and Step 1.5* of Sections 2.2 and 2.3, respectively, closely align with their true values. In Fig. S1 of the SM we compare the Frobenius norms of three covariance matrices: (i) the covariance matrix of the two-step estimator, $\widetilde{\boldsymbol{\beta}}_{TS}$; (ii) the approximated covariance matrix used in Step 1.5; and (iii) the empirical covariance matrix of $\widetilde{\boldsymbol{\beta}}_{TS}$. The results are obtained with $q_{n}=5n_{e}$ and $q_{0}=c_{0}n_{e}$, where $c_{0}$ ranges from 1 to 5. The Frobenius norm of the Step 1.5 approximation is remarkably close to that of the covariance matrix of the two-step estimator, and both agree closely with the empirical variance, even for values as small as $c_{0}=1$.

A comparison between the RE defined by Eq. (4) and its approximation in Eq. (6) is summarized in Fig. 1, where $q_{0}=2n_{e}$, $q_{n}=cn_{e}$, and $c=1,\ldots,9$. The results indicate that the approximated RE of Step 1.5 (i.e., Eq. (6)) closely mirrors Eq. (4). The presence of an 'elbow' around $c=3$, with RE fairly close to 1, suggests that $q_{n}=3n_{e}$ is sufficiently large under these specific settings. Clearly, the two optimal sampling strategies substantially outperform uniform sampling in terms of RE.

To assess the effectiveness of $\widetilde{q}_{n}$ as defined in Eq. (7), we compared the empirical and nominal power of the test of $H_{0}:\beta_{5}=0$ against a two-sided alternative, using $\widetilde{q}_{n}$ and the proposed three-step estimation algorithm comprising Steps 1, 1.5*, and 2. The tests were performed with $\alpha=0.05$. Here, we increased the sample size to $n=150{,}000$, with $c_{\lambda_{0}}=0.005$ for Setting I and $c_{\lambda_{0}}=0.05$ for Settings II and III. The respective event rates were 0.65%, 1.3%, and 3%.

The results are outlined in Table 1 with $q_{0}=2n_{e}$. Clearly, $\widetilde{q}_{n}$ achieves the intended nominal power. In terms of the mean and standard deviation (SD) of $\widetilde{q}_{n}$, the two optimality criteria perform similarly under Settings I and II, while A-optimality surpasses L-optimality in Setting III. These results further support our assertion that in massive datasets with rare events, only a small fraction of the censored data is practically necessary. For instance, in Setting III, subsampling approximately 6,000 of the roughly 145,550 censored observations suffices to achieve a power of 0.95.

5.2 Logistic Regression with Rare Events

5.2.1 Data Generation

The sampling designs are similar to those of Wang et al. (2018), with some modifications to represent rare-event settings. For each setting, 500 samples were drawn, each dataset of size $n=100{,}000$. The following covariate distributions were considered:

  1. mzNormal: $\mathbf{X}$ follows a multivariate normal distribution $N(\mathbf{0},\boldsymbol{\Sigma})$, where $\Sigma_{ij}=0.5^{I(i\neq j)}$.

  2. mixNormal: $\mathbf{X}$ is a mixture of two multivariate normal distributions, $\mathbf{X}\sim 0.5N(\mathbf{1},\boldsymbol{\Sigma})+0.5N(-\mathbf{1},\boldsymbol{\Sigma})$, so the distribution of $\mathbf{X}$ is bimodal.

  3. T3: $\mathbf{X}$ follows a multivariate $t$ distribution with 3 degrees of freedom, $\mathbf{X}\sim t_{3}(\mathbf{0},\boldsymbol{\Sigma})/10$. Hence, the distribution of $\mathbf{X}$ has heavy tails.

  4. EXP: the components of $\mathbf{X}$ are independent, each exponentially distributed with rate parameter 2. The distribution of $\mathbf{X}$ is skewed, with a heavier tail on the right.

We set $q_{0}=1{,}000$ and explored values of $q_{n}$ ranging from 1,000 to 10,000 in increments of 1,000. We set $\beta_{i}=0.5$, $i=1,\ldots,6$, and employed distinct values of the intercept $\beta_{0}$ to regulate the event rate: $\beta_{0}=-6$ for mzNormal (event rate 2%), $\beta_{0}=-5$ for mixNormal (2.1%), $\beta_{0}=-5$ for T3 (1.5%), and $\beta_{0}=-11$ for EXP (1.3%).
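A hypothetical sketch of the mzNormal design under these parameter choices (function name and interface are illustrative only):

```python
import numpy as np

def simulate_mznormal(n=100_000, d=6, beta0=-6.0, slope=0.5, rho=0.5, seed=1):
    """Hypothetical sketch of the mzNormal design: X ~ N(0, Sigma) with
    Sigma_ij = 0.5^{I(i != j)} and y ~ Bernoulli(expit(beta0 + X beta));
    beta0 = -6 targets an event rate of roughly 2%."""
    rng = np.random.default_rng(seed)
    Sigma = np.full((d, d), rho)       # equicorrelated covariance matrix
    np.fill_diagonal(Sigma, 1.0)
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    p = 1.0 / (1.0 + np.exp(-(beta0 + X @ np.full(d, slope))))
    y = rng.binomial(1, p)
    return X, y
```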

5.2.2 Results

The comparison among different estimators involved assessing the empirical root mean squared errors (RMSEs) with respect to $\boldsymbol{\beta}^{o}$ and $\widehat{\boldsymbol{\beta}}_{PL}$, namely $B^{-1}\sum_{j=1}^{B}\sqrt{\sum_{i=1}^{6}(\widehat{\beta}_{i}^{(j)}-\beta_{i}^{o})^{2}}$ and $B^{-1}\sum_{j=1}^{B}\sqrt{\sum_{i=1}^{6}(\widehat{\beta}_{i}^{(j)}-\widehat{\beta}_{PL,i}^{(j)})^{2}}$, where $\widehat{\boldsymbol{\beta}}$ represents the relevant estimator, the superscript $(j)$ denotes the $j$th sample, and $B=500$ is the number of repetitions. Fig. 2 shows the RMSEs of the two-step estimators of Algorithm 2 with $\mathbf{p}^{A}$ and $\mathbf{p}^{L}$, the full-data MLE, and a one-step estimator with uniform subsampling from the non-case data. Clearly, the optimal subsampling methods outperform uniform subsampling in terms of RMSE, with A-optimality yielding slightly better results than L-optimality, as anticipated. Table 2 presents a comparison of running times: the optimal subsampling methods are substantially faster than $\widehat{\boldsymbol{\beta}}_{PL}$ while still maintaining low RMSE.
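The RMSE metric above can be computed compactly; `mean_rmse` is a hypothetical helper written for illustration:

```python
import numpy as np

def mean_rmse(estimates, reference):
    """Empirical RMSE of Section 5.2.2: the average over B repetitions of
    the Euclidean distance between each estimate and the reference
    coefficients. `estimates` has shape (B, d); `reference` has shape (d,)
    for a fixed target, or (B, d) for a per-repetition target such as the
    full-data PL estimates."""
    diffs = np.asarray(estimates) - np.asarray(reference)
    return float(np.mean(np.sqrt(np.sum(diffs ** 2, axis=1))))
```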

Fig. S2 of the SM demonstrates the validity of the variance estimator (11) and the effectiveness of optimal subsampling over uniform subsampling. Figure 3(a) demonstrates that Eq. (14) provides a good approximation of the RE based on the actual two-step estimator, thereby supporting the proposed three-step estimator comprising Steps 1, 1.5, and 2.

To evaluate the utility of $\widetilde{q}_{n}$ derived from Eq. (16), we compared the empirical and nominal power of testing $H_{0}:\beta_{5}=0$ against a two-sided alternative with $\alpha=0.05$, using $\widetilde{q}_{n}$ and the three-step estimation algorithm of Section 3.1 (Steps 1, 1.5*, and 2). Because some higher values of $\gamma$ implied impractical subsample sizes (the required power could not be attained even with the entire sample), the coefficient vector $\boldsymbol{\beta}^{o}$ was modified:

  1. mzNormal: $\beta_{0}=-3.5$ and $\beta_{j}=0.1$, $j=1,\ldots,6$, with an event rate of 3.2%.

  2. mixNormal: $\beta_{0}=-4.5$ and $\beta_{j}=0.2$, $j=1,\ldots,6$, with an event rate of 1.3%.

  3. T3: $\beta_{0}=-3$ and $\beta_{j}=0.15$, $j=1,\ldots,6$, with an event rate of 5%.

  4. EXP: $\beta_{0}=-4$ and $\beta_{j}=0.15$, $j=1,\ldots,6$, with an event rate of 2.8%.

The results are summarized in Table 3, employing $q_{0}=1{,}000$ with 5,000 repetitions for each value of $\gamma$. We conclude that $\widetilde{q}_{n}$ yields power close to the nominal level across all scenarios. Regarding the mean and standard deviation of $\widetilde{q}_{n}$, the A-optimal approach consistently outperforms the L-optimal.

5.3 Logistic Regression with Nearly Balanced Data

The configurations examined correspond to those outlined in Wang et al. (2018) (Section 5.1): mzNormal, nzNormal, mixNormal, T3, and EXP. The setting nzNormal is not balanced, with an event rate of 0.73. For each scenario we generated 500 samples, each consisting of 100,000 observations, and $q_{0}$ was set to 5,000. The results, illustrated in Fig. 3(b), demonstrate strong agreement between the proposed RE estimator (21) and the RE based on the actual two-step Algorithm 3.

Table 4 summarizes the comparison between the empirical and nominal power of testing $H_{0}:\beta_{6}=0$ against a two-sided alternative with $\alpha=0.05$, using $\widetilde{q}_{n}$ and the proposed three-step estimation algorithm of Section 4.2. These results are based on 5,000 repetitions for each configuration, employing a smaller subsample size for Steps 1 and 1.5*, with $q_{0}$ set to 1,000. Evidently, $\widetilde{q}_{n}$ attains the desired nominal power, supporting the use of Eq. (22). In terms of the mean and standard deviation of $\widetilde{q}_{n}$, A-optimality outperforms L-optimality.

The optimal approach of Wang et al. (2018) involves subsampling from both cases and controls and performs strongly on balanced data. The sensitivity of the optimal subsample size, as defined by Eq. (22), to imbalanced data is illustrated in Fig. 4. The mzNormal setting is explored with varying sample sizes $n=a\times 100{,}000$, $a=1,\ldots,5$, and intercepts $\beta_{0}^{o}=-5,-4,\ldots,-1$, corresponding to event rates of 0.8%, 2%, 5%, 13%, and 28%, respectively. Evidently, Eq. (22) fails to deliver the required power as the event rate decreases, regardless of the sample size. These findings underscore the need for a distinct treatment of the imbalanced setting, as addressed in this work.

6 Survival Analysis of UK Biobank Colorectal Cancer

We conducted an analysis complementing the one presented in Keret and Gorfine (2023) and studied the required subsample size based on RE. The event time is defined as the age at colorectal cancer (CRC) diagnosis, while the censoring time is the age at death before CRC diagnosis or the current age without CRC. The analysis encompasses established environmental CRC risk factors, including body mass index (BMI), smoking status (no/yes), family history of CRC (no/yes), physical activity (no/yes), sex (female/male), alcohol consumption (none or occasional/light frequent drinker/very frequent drinker), education (lower than high school/high school/higher vocational education/college or university graduate/prefer not to answer), NSAID use (none/aspirin or ibuprofen), and post-menopausal hormones (no/yes). Additionally, 139 single-nucleotide polymorphisms (SNPs) associated with CRC through GWAS (Jeon et al., 2018) were included, along with six principal components to account for population substructure. The SNPs were standardized to mean zero and unit variance.

Building on the analysis in Keret and Gorfine (2023), a time-dependent effect $\beta(t)$ is essential for sex, owing to violation of the proportional-hazards assumption. In total, 180 regression coefficients were considered for the model, with 5,342 observed events and 479,343 censored observations. However, introducing time-dependent coefficients requires partitioning each observation into several distinct time-fixed "dummy observations", each with an "entrance" and an "exit" time (Therneau et al., 2017). This creates non-overlapping intervals that reconstruct the original time interval, inflating the dataset to approximately 350 million rows with $n_{e}=5{,}342$. Subsampling is then performed from the censored dummy observations using a reservoir-sampling approach.
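Reservoir sampling draws the subsample in a single pass, without holding the roughly 350-million-row dataset in memory. The sketch below implements the classical uniform variant (Algorithm R) to convey the one-pass idea; the study's optimal-probability sampling would instead use a weighted reservoir scheme.

```python
import random

def reservoir_sample(stream, q, seed=0):
    """Algorithm R: a one-pass uniform sample of q items from a stream of
    unknown length. Illustrative only; the optimal-subsampling procedure
    assigns unequal probabilities and requires a weighted variant."""
    rng = random.Random(seed)
    reservoir = []
    for t, item in enumerate(stream):
        if t < q:
            reservoir.append(item)     # fill the reservoir first
        else:
            j = rng.randrange(t + 1)   # keep the new item w.p. q/(t+1)
            if j < q:
                reservoir[j] = item
    return reservoir
```

Because each row is touched once, memory usage is O(q) regardless of the stream length.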

We set $c_{0}=15$ and investigated the RE based on Step 1.5. The results are summarized in Fig. 5(a) and Table 5. Notably, $c=100$ with the L-optimal subsampling approach (i.e., approximately 500K dummy observations instead of nearly 350 million) proves sufficient. However, in subsequent analyses we also applied our proposed algorithms with $c=40$ and 160, for comparison purposes. Table 6 presents the RMSE of the estimators with respect to the full-data PL estimator, the Frobenius norms of the estimators' covariance matrices, and their running times. Clearly, the optimal methods outperform uniform subsampling in both RMSE and Frobenius norm, with the A-optimal method consistently exhibiting somewhat better values than the L-optimal, as expected. While the full dataset requires 14.5 hours of running time, the L-optimal method with $c=100$ requires only 3.287 hours, with minimal loss of efficiency, as demonstrated in Figures 5(b) and (c). In summary, this analysis, incorporating Step 1.5, highlights the effectiveness of selecting the optimal $q_{n}$ according to the RE criterion.

7 Linked Birth and Infant Death Data - Logistic Regression

The birth and infant death data sets, sourced from the National Bureau of Economic Research’s public-use data archives, combine information from death certificates with corresponding birth certificates for infants under one year old who pass away in the United States, Puerto Rico, The Virgin Islands, and Guam. This linkage aims to leverage the additional information available in birth certificates, such as age, parents’ race, birth weight, period of gestation, plurality, prenatal care usage, maternal education, live birth order, marital status, and maternal smoking, to enable more comprehensive analyses of infant mortality patterns.

The data from years 2007 to 2013 were amalgamated into a single extensive dataset comprising n = 28,586,919 rows. From the raw data, a set of features was derived, resulting in a covariate matrix with 103 columns, encompassing 18 interaction terms with sex and 23 interaction terms with birth year. The covariates in the model are summarized in Tables S1–S3 of the SM. The primary outcome of interest is whether an infant died before reaching one year of age. Exactly 176,400 deaths were observed, an event rate of about 0.6%, justifying the use of a subsampling algorithm for rare events. The results are summarized in Fig. 6.
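
To illustrate why subsampling only among non-cases is sensible at such an event rate, the sketch below keeps every case, draws c controls per case uniformly, fits a plain Newton-Raphson logistic regression on the subsample, and then corrects the intercept by the log of the control-sampling fraction (the standard case-control offset). All names and simulated quantities are illustrative; this is not the paper's A-/L-optimal sampling code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, iters=25):
    """Plain Newton-Raphson for logistic regression; column 0 is the intercept."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ beta)
        H = X.T @ (X * (p * (1 - p))[:, None])      # observed information
        beta += np.linalg.solve(H, X.T @ (y - p))   # Newton step
    return beta

# Simulated rare-event population: intercept -6 gives roughly a 0.4% event rate.
n = 200_000
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
y = rng.random(n) < sigmoid(-6.0 + 1.0 * x)

cases = np.flatnonzero(y)
controls = np.flatnonzero(~y)
c = 10                                  # controls sampled per case
q = c * cases.size
keep = np.concatenate([cases, rng.choice(controls, size=q, replace=False)])

beta_sub = fit_logistic(X[keep], y[keep].astype(float))
# Under-sampling controls shifts the log-odds by -log(pi), pi = q / #controls;
# restore the population-scale intercept:
beta_sub[0] += np.log(q / controls.size)
print(beta_sub.round(2))
```

With all cases retained, the subsample contains only (1 + c) rows per event rather than the full n, yet the corrected estimates target the same population coefficients.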

In Fig. 6(a), the RE based on Step 1.5 is displayed. Notably, the RE exhibits a distinct 'elbow' around q_n = 1,500,000, where the RE is also close to 1. We opted for a slightly higher value, q_n = 1,764,000, corresponding to c = 10, meaning that 10 controls were sampled for each event. To offer a more comprehensive assessment of the algorithm's performance, we also include results of analyses with c = 5 and 25. The approximated RE, as a function of c, is presented in Table 7. As anticipated, the A-optimal approach outperforms the L-optimal approach in terms of RE.
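
A selection rule of the kind applied here (scan the approximated RE over a grid of candidate subsample sizes and take the smallest one whose RE falls below a tolerance) can be sketched in a few lines. The RE values below follow the 1 + b/q decay visible in Table 7, with an illustrative constant b rather than the paper's estimates.

```python
def choose_qn(re_curve, tol=1.05):
    """Return the smallest q whose approximated relative efficiency
    (subsample variance over full-data variance) is at most tol."""
    for q, re in sorted(re_curve.items()):
        if re <= tol:
            return q
    return max(re_curve)  # fall back to the largest grid point

# Illustrative RE grid mimicking the 1 + b/q shape of the approximated RE.
grid = {q: 1 + 3000 / q for q in (50_000, 100_000, 250_000, 500_000, 1_000_000)}
print(choose_qn(grid, tol=1.01))  # -> 500000
```

In practice one may also inspect the curve visually for an elbow, as done in Fig. 6(a), and then pick a slightly conservative q_n above it.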

In Fig. 6(b), the running time of the various methods is shown as a function of c. The effectiveness of optimal subsampling becomes apparent when compared to the full-data MLE. With our chosen c = 10, the running times for the A and L criteria are 1,700 and 628 seconds, respectively, whereas the full-data MLE takes 6,484 seconds. Evidently, the additional computational time required for optimal subsampling, as opposed to uniform subsampling, is relatively short, especially for the L method. This outcome reinforces the efficacy of the proposed procedure.

In Fig. 6(c), the RMSE relative to β̂_MLE is depicted. The findings validate the judicious selection of q_n: increasing c from 5 to 10 substantially reduces the RMSE, whereas a further increment to c = 25 incurs longer computational time while yielding a comparatively modest improvement. Additionally, it is evident that optimal subsampling yields results substantially superior to those obtained through uniform subsampling.

The effectiveness of the optimal subsampling methods over uniform subsampling is also evident in Figures 6(d) and 6(e). In Figure 6(d), the estimated coefficients of each subsampling method are compared to their β̂_MLE counterparts; optimal subsampling yields results much closer to the full-data estimator than uniform subsampling. Figure 6(e) displays the standard errors of β̂_MLE versus the standard errors of their subsampling counterparts; uniform subsampling produces notably larger standard errors.

We completed the analysis by conducting hypothesis testing, H_0: β_i = 0 versus H_1: β_i ≠ 0, i = 1, …, 103, with FDR adjustment for multiplicity (Benjamini and Hochberg, 1995). This process was repeated for c = 5, 10, and 25. In Figure 6(f), the total number of rejected hypotheses under each c is presented, contrasted with the number of rejections based on the full-data analysis. Notably, the A-optimal and L-optimal sampling methods outperform uniform sampling. Even with a relatively small subsample size, the optimal sampling estimator yields results highly similar to those of the full data, surpassing the number of rejections achieved with uniform subsampling. For our chosen c = 10, both the A-optimal and L-optimal methods reject 56 hypotheses, almost matching the 57 rejections of the full-data analysis. In contrast, uniform subsampling at c = 10 leads to only 37 rejected hypotheses, underscoring the effectiveness of optimal subsampling.
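
The Benjamini-Hochberg step-up rule used for the FDR adjustment can be reproduced in a few lines. This is a generic sketch (not the authors' code): given the vector of p-values, it returns a boolean mask of the hypotheses rejected at level α.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of hypotheses rejected by the BH step-up rule."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / m      # alpha * i / m
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])          # largest i with p_(i) <= alpha*i/m
        reject[order[: k + 1]] = True             # reject all smaller p-values too
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.24, 0.57]
print(benjamini_hochberg(pvals, alpha=0.05).sum())  # -> 2
```

Note the step-up character of the rule: once the largest qualifying index k is found, all k smallest p-values are rejected, even those individually above their own thresholds.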

This dataset possesses a noteworthy characteristic: many of its features are rare binary variables. Examples include newborns with congenital anomalies such as anencephaly, spina bifida, omphalocele, and Down's syndrome, alongside rare features related to the mother and delivery. Moreover, these features are strongly correlated with the outcome of interest, namely, death within the first year of life. The optimal subsampling procedures offer a notable advantage over uniform subsampling by ensuring that observations with rare features associated with the outcome receive larger sampling probabilities. Consequently, such observations are more likely to be included in the subsample, leading to lower variance. Table 8 presents the 20 rarest features in the data, along with their corresponding proportions in the full dataset and in the subsampling procedures with c = 10. The results affirm that optimal subsamples better capture observations with rare indicators. These insights shed light on the efficiency of our proposed estimators, elucidating their superiority over uniform subsampling.
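
The mechanism described above (rare, outcome-correlated features receiving larger sampling probabilities and hence better representation) can be checked analytically: the expected subsample proportion of a feature is simply the total sampling-probability mass placed on the rows carrying it. The sketch below uses toy gradient-norm-like weights as a stand-in for the A-/L-optimal probabilities, which are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
rare = rng.random(n) < 0.001            # rare binary feature (~0.1% of rows)

# Toy importance scores: rows carrying the rare feature get a much larger
# gradient-norm-like weight, as rare outcome-correlated covariates do.
score = np.where(rare, 50.0, 1.0)
p_opt = score / score.sum()             # "optimal"-style sampling probabilities
p_unif = np.full(n, 1.0 / n)            # uniform sampling probabilities

# Expected proportion of rare-feature rows in a subsample of any fixed size
# equals the probability mass on those rows.
exp_opt = p_opt[rare].sum()
exp_unif = p_unif[rare].sum()
print(exp_opt / exp_unif)               # enrichment factor under optimal weights
```

This mirrors the ratios reported in Table 8, where the rarest indicators are represented tens to hundreds of times more heavily under optimal sampling than in the full data.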

Regarding the findings derived from the analysis, Tables S4–S6 of the SM present the estimated coefficients for each method with c = 10. While the results are organized into three tables for clarity, note that the FDR procedure was executed once, encompassing all coefficients collectively.

Among the significant results, both the mother's age and its square emerged as noteworthy, corroborating established findings on the impact of maternal age on infant mortality (MacDorman et al., 1997; Standfast et al., 1980). This suggests heightened risk associated with motherhood at either a young or an advanced age, compared with intermediate ages.

Other variables demonstrating significance in our analysis, consistent with prior literature, include lower risk as a function of the number of prenatal visits (Carter et al., 2016), gestational weight gain (Naeve, 1979; Thorsdottir et al., 2002), five-minute Apgar score (Li et al., 2013), and plurality (Ahrens et al., 2017). Conversely, factors known in the literature to increase risk, and affirmed in this study, include live birth order (MacDorman et al., 1997; Modin, 2002), eclampsia (Duley, 2009), and certain congenital malformations linked to infant mortality, such as spina bifida (Pace et al., 2019), omphalocele (Marshall et al., 2015), cleft lip (Carlson et al., 2013), and Down's syndrome (Sadetzki et al., 1999).

Birth year, confined to the years 2007–2013, did not yield a significant effect. Similarly, no distinctions were observed among different months of the year. Treating Sunday as the baseline, negative effects were noted for all days of the week except Saturday, indicating a significant difference between workdays and weekends. Concerning parental racial attributes, significant negative effects were identified for Native American ancestry in both parents and for African ancestry on the father's side. Additionally, an unknown father's race exhibited statistical significance with a negative effect. In contrast to some prior studies (Holmes Jr et al., 2020; Xie et al., 2015), our findings indicate a lower risk of infant mortality among Caesarean-section deliveries. Regarding interaction terms, five sex-interaction terms (weight gain, five-minute Apgar score, pre-pregnancy-associated hypertension, induction of labor, and cleft lip) and five birth-year-interaction terms (African ancestry of the mother, induction of labor, tocolysis, anencephaly, and Down's syndrome) were found to be statistically significant.

8 Discussion

This study substantially enhances the efficient two-step algorithms proposed by Wang et al. (2018) and Keret and Gorfine (2023). We introduced practical tools for selecting optimal subsample sizes, illustrating their effectiveness through simulations and real-world data. Additionally, we proposed a new subsampling algorithm designed for logistic regression with rare events. This algorithm, which subsamples exclusively among non-cases, demonstrated speed and efficiency compared to full-data maximum-likelihood estimation. Its superiority over uniform subsampling was established in both simulated and real data, as evidenced by lower RMSE and variance. Furthermore, we demonstrated the algorithm's nearly equivalent performance to the full-data estimator in hypothesis testing, while significantly reducing computational time.

Similar approaches to those proposed in this study can be extended to other two-step subsampling methods, including algorithms for generalized linear models (Ai et al., 2018), quantile regression (Ai et al., 2021; Fan et al., 2021), and quasi-likelihood regression (Yu et al., 2020).

Datasets with rare events often pose challenges for classification algorithms oriented primarily toward prediction rather than inference. The subsampling-based algorithm proposed here for logistic regression with rare events could serve as a practical tool for generating sampling probabilities in computationally intensive methods. Notably, it may be worth exploring its application in methods such as random forests (Breiman, 2001) and gradient boosting (Friedman, 2001), among others.

9 Software

R code for the data analysis and the reported simulation results, along with complete documentation, is available at the GitHub repository https://github.com/tal-agassi/optimal-subsampling.

10 Supplementary Material

Supplementary material is available online at http://biostatistics.oxfordjournals.org.

Acknowledgments

The work was supported by the Israel Science Foundation (ISF) grant number 767/21 and by a grant from the Tel Aviv University Center for AI and Data Science (TAD).

Conflict of Interest: None declared.

References

  • Ahrens, K. A., M. E. Thoma, L. M. Rossen, M. Warner, and A. E. Simon (2017). Plurality of birth and infant mortality due to external causes in the United States, 2000–2010. American Journal of Epidemiology 185(5), 335–344.
  • Ai, M., F. Wang, J. Yu, and H. Zhang (2021). Optimal subsampling for large-scale quantile regression. Journal of Complexity 62, 101512.
  • Ai, M., J. Yu, H. Zhang, and H. Wang (2018). Optimal subsampling algorithms for big data generalized linear models. arXiv preprint arXiv:1806.06761.
  • Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57(1), 289–300.
  • Breiman, L. (2001). Random forests. Machine Learning 45(1), 5–32.
  • Breslow, N. E. (1972). Contribution to the discussion of the paper by Dr Cox. Journal of the Royal Statistical Society, Series B 34, 216–217.
  • Carlson, L., K. W. Hatcher, and R. VanderBurg (2013). Elevated infant mortality rates among oral cleft and isolated oral cleft cases: a meta-analysis of studies from 1943 to 2010. The Cleft Palate-Craniofacial Journal 50(1), 2–12.
  • Carter, E. B., M. G. Tuuli, A. B. Caughey, A. O. Odibo, G. A. Macones, and A. G. Cahill (2016). Number of prenatal visits and pregnancy outcomes in low-risk women. Journal of Perinatology 36(3), 178–181.
  • Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological) 34(2), 187–202.
  • Dhillon, P., Y. Lu, D. P. Foster, and L. Ungar (2013). New subsampling algorithms for fast least squares regression. pp. 360–368.
  • Duley, L. (2009). The global impact of pre-eclampsia and eclampsia. In Seminars in Perinatology, Volume 33, pp. 130–137. Elsevier.
  • Fan, Y., Y. Liu, and L. Zhu (2021). Optimal subsampling for linear quantile regression models. Canadian Journal of Statistics.
  • Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 1189–1232.
  • Gorfine, M., N. Keret, A. Ben Arie, D. Zucker, and L. Hsu (2021). Marginalized frailty-based illness-death model: application to the UK-Biobank survival data. Journal of the American Statistical Association 116(535), 1155–1167.
  • Holmes Jr, L., L. O'Neill, H. Elmi, C. Chinacherem, C. Comeaux, L. Pelaez, K. W. Dabney, O. Akinola, and M. Enwere (2020). Implication of vaginal and cesarean section delivery method in black–white differentials in infant mortality in the United States: linked birth/infant death records, 2007–2016. International Journal of Environmental Research and Public Health 17(9), 3146.
  • Jeon, J., M. Du, R. E. Schoen, M. Hoffmeister, P. A. Newcomb, S. I. Berndt, B. Caan, P. T. Campbell, A. T. Chan, J. Chang-Claude, et al. (2018). Determining risk of colorectal cancer and starting age of screening based on lifestyle, environmental, and genetic factors. Gastroenterology 154(8), 2152–2164.
  • Keret, N. and M. Gorfine (2023). Analyzing big EHR data: optimal Cox regression subsampling procedure with rare events. Journal of the American Statistical Association 118(544), 2262–2275.
  • Li, F., T. Wu, X. Lei, H. Zhang, M. Mao, and J. Zhang (2013). The Apgar score and infant mortality. PloS One 8(7), e69072.
  • Ma, P., M. W. Mahoney, and B. Yu (2015). A statistical perspective on algorithmic leveraging. The Journal of Machine Learning Research 16(1), 861–911.
  • MacDorman, M. F., S. Cnattingius, H. J. Hoffman, M. S. Kramer, and B. Haglund (1997). Sudden infant death syndrome and smoking in the United States and Sweden. American Journal of Epidemiology 146(3), 249–257.
  • Marshall, J., J. L. Salemi, J. P. Tanner, R. Ramakrishnan, M. L. Feldkamp, L. K. Marengo, R. E. Meyer, C. M. Druschel, R. Rickard, R. S. Kirby, et al. (2015). Prevalence, correlates, and outcomes of omphalocele in the United States, 1995–2005. Obstetrics & Gynecology 126(2), 284–293.
  • Modin, B. (2002). Birth order and mortality: a life-long follow-up of 14,200 boys and girls born in early 20th century Sweden. Social Science & Medicine 54(7), 1051–1064.
  • Naeve, R. L. (1979). Weight gain and the outcome of pregnancy. American Journal of Obstetrics and Gynecology 135(1), 3–9.
  • Pace, N. D., A. M. Siega-Riz, A. F. Olshan, N. C. Chescheir, S. R. Cole, T. A. Desrosiers, S. C. Tinker, A. T. Hoyt, M. A. Canfield, S. L. Carmichael, et al. (2019). Survival of infants with spina bifida and the role of maternal prepregnancy body mass index. Birth Defects Research 111(16), 1205–1216.
  • Sadetzki, S., A. Chetrit, E. Akstein, O. Luxenburg, L. Keinan, I. Litvak, and B. Modan (1999). Risk factors for infant mortality in Down's syndrome: a nationwide study. Paediatric and Perinatal Epidemiology 13(4), 442–451.
  • Standfast, S. J., S. Jereb, and D. T. Janerich (1980). The epidemiology of sudden infant death in upstate New York: II: birth characteristics. American Journal of Public Health 70(10), 1061–1067.
  • Therneau, T., C. Crowson, and E. Atkinson (2017). Using time dependent covariates and time dependent coefficients in the Cox model. Survival Vignettes 2(3), 1–25.
  • Thorsdottir, I., J. E. Torfadottir, B. E. Birgisdottir, and R. T. Geirsson (2002). Weight gain in women of normal weight before pregnancy: complications in pregnancy or delivery and birth outcome. Obstetrics & Gynecology 99(5), 799–806.
  • Van der Vaart, A. W. (2000). Asymptotic Statistics, Volume 3. Cambridge University Press.
  • Wang, H. (2020). Logistic regression for massive data with rare events. In International Conference on Machine Learning, pp. 9829–9836. PMLR.
  • Wang, H. and Y. Ma (2021). Optimal subsampling for quantile regression in big data. Biometrika 108(1), 99–112.
  • Wang, H., A. Zhang, and C. Wang (2021). Nonuniform negative sampling and log odds correction with rare events data. In Thirty-Fifth Conference on Neural Information Processing Systems.
  • Wang, H., R. Zhu, and P. Ma (2018). Optimal subsampling for large sample logistic regression. Journal of the American Statistical Association 113(522), 829–844.
  • Xie, R.-h., L. Gaudet, D. Krewski, I. D. Graham, M. C. Walker, and S. W. Wen (2015). Higher cesarean delivery rates are associated with higher infant mortality rates in industrialized countries. Birth 42(1), 62–69.
  • Yang, Z., H. Wang, and J. Yan (2024). Subsampling approach for least squares fitting of semi-parametric accelerated failure time models to massive survival data. Statistics and Computing 34(2), 1–11.
  • Yu, J., H. Wang, M. Ai, and H. Zhang (2020). Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. Journal of the American Statistical Association, 1–12.
  • Zuo, L., H. Zhang, H. Wang, and L. Liu (2021). Sampling-based estimation for massive survival data with additive hazards model. Statistics in Medicine 40(2), 441–450.
Setting | Nominal Power | Empirical Power (A) | Empirical Power (L) | Mean (SD) of q̃_n (A) | Mean (SD) of q̃_n (L)
I | 0.80 | 0.812 | 0.794 | 857 (45) | 843 (37)
  | 0.83 | 0.818 | 0.814 | 1027 (61) | 1007 (55)
  | 0.85 | 0.852 | 0.842 | 1175 (76) | 1154 (69)
  | 0.87 | 0.882 | 0.882 | 1381 (97) | 1342 (95)
  | 0.90 | 0.896 | 0.874 | 1856 (179) | 1814 (164)
  | 0.91 | 0.920 | 0.890 | 2098 (212) | 2064 (199)
  | 0.93 | 0.936 | 0.934 | 2901 (425) | 2866 (379)
  | 0.95 | 0.952 | 0.960 | 5179 (1550) | 4922 (1074)
II | 0.80 | 0.774 | 0.794 | 1112 (36) | 1407 (48)
  | 0.83 | 0.804 | 0.812 | 1301 (50) | 1652 (63)
  | 0.85 | 0.834 | 0.854 | 1470 (57) | 1862 (74)
  | 0.87 | 0.882 | 0.844 | 1681 (74) | 2122 (92)
  | 0.90 | 0.892 | 0.906 | 2155 (112) | 2714 (142)
  | 0.91 | 0.916 | 0.902 | 2369 (122) | 2997 (163)
  | 0.93 | 0.932 | 0.908 | 3046 (208) | 3849 (247)
  | 0.95 | 0.940 | 0.930 | 4361 (418) | 5533 (496)
III | 0.80 | 0.774 | 0.782 | 1640 (49) | 2677 (81)
  | 0.83 | 0.824 | 0.784 | 1911 (62) | 3132 (108)
  | 0.85 | 0.844 | 0.804 | 2148 (74) | 3516 (123)
  | 0.87 | 0.858 | 0.860 | 2449 (94) | 3999 (156)
  | 0.90 | 0.894 | 0.876 | 3105 (135) | 5103 (216)
  | 0.91 | 0.886 | 0.874 | 3420 (168) | 5603 (253)
  | 0.93 | 0.894 | 0.934 | 4289 (234) | 7071 (375)
  | 0.95 | 0.942 | 0.930 | 6078 (412) | 9872 (675)
Setting | MLE | L | A | Uniform
mzNormal | 0.815 (0.243) | 0.112 (0.058) | 0.116 (0.055) | 0.047 (0.030)
mixNormal | 0.779 (0.123) | 0.101 (0.045) | 0.111 (0.052) | 0.041 (0.029)
T3 | 0.740 (0.133) | 0.102 (0.042) | 0.113 (0.047) | 0.040 (0.029)
EXP | 1.084 (0.303) | 0.120 (0.036) | 0.130 (0.052) | 0.048 (0.026)
Setting | Nominal Power | Empirical Power (A) | Empirical Power (L) | Mean (SD) of q̃_n (A) | Mean (SD) of q̃_n (L)
mzNormal | 0.80 | 0.804 | 0.799 | 1495 (91) | 1648 (107)
  | 0.83 | 0.833 | 0.824 | 1715 (113) | 1893 (129)
  | 0.85 | 0.843 | 0.849 | 1900 (127) | 2093 (149)
  | 0.87 | 0.857 | 0.861 | 2122 (148) | 2345 (175)
  | 0.90 | 0.900 | 0.887 | 2591 (196) | 2865 (230)
  | 0.93 | 0.921 | 0.931 | 3369 (296) | 3718 (352)
  | 0.95 | 0.948 | 0.941 | 4317 (442) | 4758 (524)
mixNormal | 0.80 | 0.792 | 0.786 | 821 (58) | 911 (69)
  | 0.83 | 0.827 | 0.815 | 958 (74) | 1061 (85)
  | 0.85 | 0.844 | 0.846 | 1074 (88) | 1191 (102)
  | 0.87 | 0.863 | 0.864 | 1223 (108) | 1353 (128)
  | 0.90 | 0.899 | 0.900 | 1542 (156) | 1711 (183)
  | 0.93 | 0.930 | 0.935 | 2123 (265) | 2364 (322)
  | 0.95 | 0.947 | 0.946 | 2957 (473) | 3289 (585)
T3 | 0.80 | 0.797 | 0.798 | 1175 (168) | 1347 (214)
  | 0.83 | 0.819 | 0.821 | 1334 (193) | 1536 (247)
  | 0.85 | 0.842 | 0.843 | 1467 (221) | 1686 (284)
  | 0.87 | 0.853 | 0.850 | 1625 (250) | 1865 (318)
  | 0.90 | 0.890 | 0.891 | 1947 (319) | 2243 (404)
  | 0.93 | 0.926 | 0.918 | 2449 (421) | 2820 (548)
  | 0.95 | 0.947 | 0.951 | 3021 (558) | 3475 (725)
EXP | 0.80 | 0.794 | 0.791 | 1263 (225) | 1264 (214)
  | 0.83 | 0.828 | 0.831 | 1449 (259) | 1458 (256)
  | 0.85 | 0.843 | 0.841 | 1613 (301) | 1624 (296)
  | 0.87 | 0.863 | 0.857 | 1812 (346) | 1826 (351)
  | 0.90 | 0.892 | 0.891 | 2225 (477) | 2249 (481)
  | 0.93 | 0.925 | 0.920 | 2913 (742) | 2924 (727)
  | 0.95 | 0.943 | 0.955 | 3814 (1162) | 3826 (1182)
Setting | Nominal Power | Empirical Power (A) | Empirical Power (L) | Mean (SD) of q̃_n (A) | Mean (SD) of q̃_n (L)
mzNormal | 0.80 | 0.806 | 0.820 | 4126 (230) | 4402 (241)
  | 0.83 | 0.854 | 0.818 | 4513 (253) | 4811 (269)
  | 0.85 | 0.840 | 0.822 | 4831 (266) | 5111 (285)
  | 0.87 | 0.872 | 0.860 | 5114 (271) | 5461 (292)
  | 0.89 | 0.880 | 0.872 | 5506 (307) | 5883 (330)
  | 0.91 | 0.900 | 0.924 | 5952 (324) | 6339 (352)
  | 0.93 | 0.928 | 0.922 | 6502 (353) | 6945 (376)
  | 0.95 | 0.958 | 0.940 | 7219 (415) | 7690 (404)
nzNormal | 0.80 | 0.784 | 0.752 | 4169 (263) | 4921 (292)
  | 0.83 | 0.820 | 0.856 | 4604 (297) | 5392 (353)
  | 0.85 | 0.866 | 0.852 | 4855 (319) | 5706 (348)
  | 0.87 | 0.874 | 0.880 | 5208 (326) | 6094 (382)
  | 0.89 | 0.904 | 0.880 | 5582 (368) | 6591 (410)
  | 0.91 | 0.910 | 0.890 | 6049 (398) | 7092 (415)
  | 0.93 | 0.920 | 0.920 | 6597 (420) | 7727 (477)
  | 0.95 | 0.952 | 0.942 | 7360 (472) | 8587 (535)
mixNormal | 0.80 | 0.842 | 0.812 | 8682 (467) | 9160 (443)
  | 0.83 | 0.810 | 0.836 | 9514 (481) | 9952 (534)
  | 0.85 | 0.864 | 0.828 | 10120 (548) | 10575 (554)
  | 0.87 | 0.856 | 0.864 | 10866 (608) | 11367 (595)
  | 0.89 | 0.874 | 0.900 | 11623 (603) | 12181 (608)
  | 0.91 | 0.926 | 0.912 | 12565 (624) | 13219 (710)
  | 0.93 | 0.930 | 0.932 | 13777 (665) | 14396 (753)
  | 0.95 | 0.956 | 0.918 | 15286 (787) | 16076 (857)
T3 | 0.80 | 0.816 | 0.780 | 11900 (1534) | 12899 (1668)
  | 0.83 | 0.822 | 0.804 | 12966 (1659) | 14200 (1843)
  | 0.85 | 0.838 | 0.838 | 13940 (1741) | 15171 (2158)
  | 0.87 | 0.854 | 0.850 | 14698 (1917) | 16340 (2112)
  | 0.89 | 0.880 | 0.874 | 16012 (1959) | 17557 (2326)
  | 0.91 | 0.912 | 0.922 | 17265 (2199) | 19050 (2485)
  | 0.93 | 0.928 | 0.916 | 18837 (2371) | 20551 (2805)
  | 0.95 | 0.928 | 0.924 | 20634 (2660) | 23076 (3047)
EXP | 0.80 | 0.804 | 0.808 | 7526 (832) | 7661 (832)
  | 0.83 | 0.834 | 0.800 | 8200 (829) | 8396 (918)
  | 0.85 | 0.850 | 0.856 | 8652 (940) | 8957 (1010)
  | 0.87 | 0.872 | 0.858 | 9281 (997) | 9463 (1089)
  | 0.89 | 0.906 | 0.902 | 9938 (1057) | 10190 (1102)
  | 0.91 | 0.890 | 0.912 | 10725 (1162) | 11027 (1206)
  | 0.93 | 0.932 | 0.910 | 11786 (1217) | 12009 (1230)
  | 0.95 | 0.946 | 0.950 | 13195 (1380) | 13385 (1369)
c𝑐citalic_cqnsubscript𝑞𝑛q_{n}italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPTALUniform
40213,6801.00761.03161.1091
60320,5201.00511.02061.0725
80427,3601.00381.01541.0543
100534,2001.00311.01231.0434
120641,0401.00251.01031.0361
140747,8801.00221.00881.0309
160854,7201.00191.00771.0271
180961,5601.00171.00681.0240
2001,068,4001.00151.00621.0216
c | RMSE (×100) w.r.t. β̂_PL: A / L / Uniform | Frobenius norm (×100) of covariance matrix: A / L / Uniform | Computation time (hours): A / L / Uniform
40 | 6.308 / 7.627 / 11.186 | 2.387 / 2.457 / 2.649 | 5.927 / 2.911 / 0.306
100 | 3.033 / 4.356 / 5.690 | 2.343 / 2.374 / 2.447 | 5.870 / 3.287 / 0.387
160 | 2.528 / 3.225 / 7.201 | 2.338 / 2.354 / 2.398 | 5.847 / 3.254 / 0.428
c𝑐citalic_cqnsubscript𝑞𝑛q_{n}italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPTAL
5882,0001.0231.180
101,764,0001.0131.090
254,410,0001.0051.036

Coefficient | Full-sample Proportion | Subsample Proportion (A) | Subsample Proportion (L) | Ratio (A) | Ratio (L)
Anencephaly = no | 0.00011 | 0.03529 | 0.00416 | 309.41689 | 36.52090
Spina Bifida = no | 0.00016 | 0.03385 | 0.00227 | 206.53149 | 13.85638
Omphalocele = no | 0.00038 | 0.04523 | 0.00497 | 118.48022 | 13.02339
Down's syndrome = no | 0.00048 | 0.03536 | 0.00710 | 73.33574 | 14.73338
Cleft lip = no | 0.00072 | 0.04208 | 0.00908 | 58.72595 | 12.66362
Residence status = 4 | 0.00190 | 0.00927 | 0.00118 | 4.88057 | 0.62083
Eclampsia = no | 0.00253 | 0.04124 | 0.00754 | 16.28528 | 2.97785
Attendant = other midwife | 0.00638 | 0.01121 | 0.00387 | 1.75569 | 0.60659
Attendant = other | 0.00654 | 0.02102 | 0.01575 | 3.21212 | 2.40667
Forceps delivery = no | 0.00663 | 0.03397 | 0.00433 | 5.12472 | 0.65301
Father's race = American Indian | 0.00870 | 0.02241 | 0.01109 | 2.57516 | 1.27423
Mother's race = American Indian | 0.01165 | 0.03025 | 0.01596 | 2.59707 | 1.37015
Birth place = not in hospital | 0.01205 | 0.02719 | 0.01835 | 2.25730 | 1.52346
Tocolysis = no | 0.01211 | 0.05251 | 0.03946 | 4.33686 | 3.25930
Chronic hypertension = no | 0.01325 | 0.04522 | 0.03319 | 3.41423 | 2.50554
Residence status = 3 | 0.02119 | 0.04346 | 0.03647 | 2.05113 | 1.72109
Precipitous labor = no | 0.02471 | 0.04615 | 0.03853 | 1.86780 | 1.55949
Vacuum delivery = no | 0.03009 | 0.03916 | 0.01360 | 1.30158 | 0.45214
Pre-pregnancy associated hypertension = no | 0.04275 | 0.06545 | 0.05763 | 1.53100 | 1.34794
Meconium = no | 0.04736 | 0.05964 | 0.04199 | 1.25928 | 0.88652


Supplementary Material

S1 Additional Technical Details

The following functions are required for $\widetilde{\mathbb{V}}_{\widetilde{\boldsymbol{\beta}}}(\mathbf{p},\widehat{\boldsymbol{\beta}})$:

$$\widetilde{\boldsymbol{\mathcal{I}}}(\boldsymbol{\beta})=\frac{1}{n}\frac{\partial^{2}l^{*}(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}^{T}\partial\boldsymbol{\beta}}=-\frac{1}{n}\int_{0}^{\tau}\left\{\frac{\mathbf{S}_{w}^{(2)}(\boldsymbol{\beta},t)}{S_{w}^{(0)}(\boldsymbol{\beta},t)}-\left(\frac{\mathbf{S}_{w}^{(1)}(\boldsymbol{\beta},t)}{S_{w}^{(0)}(\boldsymbol{\beta},t)}\right)\left(\frac{\mathbf{S}_{w}^{(1)}(\boldsymbol{\beta},t)}{S_{w}^{(0)}(\boldsymbol{\beta},t)}\right)^{T}\right\}dN_{\cdot}(t)$$

and

$$\widetilde{\boldsymbol{\varphi}}(\mathbf{p},\boldsymbol{\beta})=\frac{1}{n^{2}}\left\{\frac{1}{q}\sum_{i\in\mathcal{Q}\setminus\mathcal{E}}\frac{\widetilde{\mathbf{a}}_{i}(\boldsymbol{\beta})\widetilde{\mathbf{a}}_{i}(\boldsymbol{\beta})^{T}}{p_{i}^{2}}-\frac{1}{q^{2}}\sum_{i\in\mathcal{Q}\setminus\mathcal{E}}\frac{\widetilde{\mathbf{a}}_{i}(\boldsymbol{\beta})}{p_{i}}\left(\sum_{i\in\mathcal{Q}\setminus\mathcal{E}}\frac{\widetilde{\mathbf{a}}_{i}(\boldsymbol{\beta})}{p_{i}}\right)^{T}\right\}$$

where

$$\widetilde{\mathbf{a}}_{i}(\boldsymbol{\beta})=\int_{0}^{\tau}\left\{\mathbf{X}_{i}-\frac{\mathbf{S}_{w}^{(1)}(\boldsymbol{\beta},t)}{S_{w}^{(0)}(\boldsymbol{\beta},t)}\right\}\frac{Y_{i}(t)e^{\boldsymbol{\beta}^{T}\mathbf{X}_{i}}}{S_{w}^{(0)}(\boldsymbol{\beta},t)}\,dN_{\cdot}(t)\,.$$

The following functions are required for $\widehat{RE}(q_{n})$:

\[
\widetilde{\boldsymbol{\mathcal{I}}}_{Q_{1.5}}(\boldsymbol{\beta})=-\frac{1}{n}\frac{\partial^{2}l^{*}(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}^{T}\partial\boldsymbol{\beta}}=\frac{1}{n}\int_{0}^{\tau}\left\{\frac{\mathbf{S}_{w,Q_{1.5}}^{(2)}(\boldsymbol{\beta},t)}{S_{w,Q_{1.5}}^{(0)}(\boldsymbol{\beta},t)}-\left(\frac{\mathbf{S}_{w,Q_{1.5}}^{(1)}(\boldsymbol{\beta},t)}{S_{w,Q_{1.5}}^{(0)}(\boldsymbol{\beta},t)}\right)\left(\frac{\mathbf{S}_{w,Q_{1.5}}^{(1)}(\boldsymbol{\beta},t)}{S_{w,Q_{1.5}}^{(0)}(\boldsymbol{\beta},t)}\right)^{T}\right\}dN_{\cdot}(t)
\]

and

\[
\widetilde{\boldsymbol{\varphi}}_{Q_{1.5}}(\mathbf{p},\boldsymbol{\beta})=\frac{1}{n^{2}}\left\{\frac{1}{q}\sum_{i\in\mathcal{Q}_{1.5}\setminus\mathcal{E}}\frac{\widetilde{\mathbf{a}}_{i}(\boldsymbol{\beta})\widetilde{\mathbf{a}}_{i}(\boldsymbol{\beta})^{T}}{p_{i}^{2}}-\frac{1}{q^{2}}\sum_{i\in\mathcal{Q}_{1.5}\setminus\mathcal{E}}\frac{\widetilde{\mathbf{a}}_{i}(\boldsymbol{\beta})}{p_{i}}\left(\sum_{i\in\mathcal{Q}_{1.5}\setminus\mathcal{E}}\frac{\widetilde{\mathbf{a}}_{i}(\boldsymbol{\beta})}{p_{i}}\right)^{T}\right\}
\]

where

\[
\widetilde{\mathbf{a}}_{i}(\boldsymbol{\beta})=\int_{0}^{\tau}\left\{\mathbf{X}_{i}-\frac{\mathbf{S}_{w,Q_{1.5}}^{(1)}(\boldsymbol{\beta},t)}{S_{w,Q_{1.5}}^{(0)}(\boldsymbol{\beta},t)}\right\}\frac{Y_{i}(t)e^{\boldsymbol{\beta}^{T}\mathbf{X}_{i}}}{S_{w,Q_{1.5}}^{(0)}(\boldsymbol{\beta},t)}\,dN_{\cdot}(t)\,,
\]
\[
\mathbf{S}_{w,Q_{1.5}}^{(k)}(\boldsymbol{\beta},t)=\sum_{i\in\mathcal{Q}_{1.5}}w_{i}e^{\boldsymbol{\beta}^{T}\mathbf{X}_{i}}Y_{i}(t)\mathbf{X}_{i}^{\otimes k}\,,\qquad k=0,1,2\,,
\]

and

\[
w_{i}=\begin{cases}(p_{i}q_{0})^{-1}&\text{if }\Delta_{i}=0,\ p_{i}>0\,,\\ 1&\text{if }\Delta_{i}=1\,,\end{cases}\qquad i=1,\dots,n\,.
\]
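As a minimal numerical sketch of the quantities above (toy data; the function names, NumPy usage, and the fixed evaluation time are illustrative assumptions, not the paper's implementation), the subsampling weights $w_i$ and the weighted risk-set sums $\mathbf{S}^{(k)}_{w}(\boldsymbol{\beta},t)$ at a given time $t$ can be computed as:

```python
import numpy as np

def subsample_weights(Delta, p, q0):
    """Weights above: w_i = (p_i q0)^{-1} for censored units with p_i > 0,
    and w_i = 1 for events, which are all retained in the subsample."""
    w = np.zeros_like(p, dtype=float)
    cens = (Delta == 0) & (p > 0)
    w[cens] = 1.0 / (p[cens] * q0)
    w[Delta == 1] = 1.0
    return w

def S_k(beta, X, at_risk, w, k):
    """Weighted risk-set sum S^(k)(beta, t) = sum_i w_i e^{beta'X_i} Y_i(t) X_i^{tensor k},
    with at_risk the vector of at-risk indicators Y_i(t) at a fixed time t."""
    r = w * np.exp(X @ beta) * at_risk
    if k == 0:
        return r.sum()                # scalar S^(0)
    if k == 1:
        return X.T @ r                # vector S^(1)
    return (X * r[:, None]).T @ X     # matrix S^(2)
```

With uniform sampling probabilities over the censored pool and $q_0$ draws, each sampled censored unit receives weight $n_{\text{cens}}/q_0$, so the weighted sums mimic their full-data counterparts in expectation.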

S2 Logistic Regression - Assumptions and Proofs

The $\mathbf{X}_{i}$'s are assumed to be independent and identically distributed, and the following additional assumptions are required for the asymptotic results:

A.1

As $n\rightarrow\infty$, $n^{-1}\sum_{i=1}^{n}\|\mathbf{X}_{i}\|^{3}=O_{P}(1)$ and $\mathbf{M}_{X}(\boldsymbol{\beta}^{o})$ converges in probability to a positive-definite matrix $\boldsymbol{\Sigma}(\boldsymbol{\beta}^{o})$, where

\[
\mathbf{M}_{X}(\boldsymbol{\beta})=n^{-1}\sum_{i=1}^{n}p_{i}(\boldsymbol{\beta})\big(1-p_{i}(\boldsymbol{\beta})\big)\mathbf{X}_{i}\mathbf{X}_{i}^{T}\,.
\]
A.2

$n^{-2}\sum_{i=1}^{n}\pi_{i}^{-1}\|\mathbf{X}_{i}\|^{k}=O_{P}(1)$ for $k=2,4$.

A.3

There exists some $\delta>0$ such that $n^{-2+\delta}\sum_{i=1}^{n}\pi_{i}^{-1-\delta}\|\mathbf{X}_{i}\|^{2+\delta}=O_{P}(1)$.

A.4

$q_{n}/n$ and $(n-n_{0})/n$ converge to small positive constants as $q_{n},n\rightarrow\infty$.

The first three assumptions are essentially general moment conditions (Wang et al., 2018). Assumption A.4 states that the sampled event rate converges to a positive constant as $n$ goes to infinity.
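Assumption A.1 requires $\mathbf{M}_{X}(\boldsymbol{\beta}^{o})$ to converge to a positive-definite matrix. A minimal sketch of this matrix (the helper name and the simulated design are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def M_X(beta, X):
    """M_X(beta) = n^{-1} sum_i p_i(beta)(1 - p_i(beta)) X_i X_i^T,
    where p_i(beta) is the logistic success probability for covariates X_i."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    v = p * (1.0 - p)                        # Bernoulli variances p_i(1 - p_i)
    return (X * v[:, None]).T @ X / X.shape[0]
```

For a design matrix of full column rank, $\mathbf{M}_{X}(\boldsymbol{\beta})$ is symmetric positive definite, in line with the limiting $\boldsymbol{\Sigma}(\boldsymbol{\beta}^{o})$ posited by A.1.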

S2.1 Proof of Theorem 3.1

This proof follows a derivation similar to that of Keret and Gorfine (2023). Wang et al. (2018) have already shown that $\widetilde{\boldsymbol{\beta}}$ is consistent for $\widehat{\boldsymbol{\beta}}_{MLE}$ in the conditional space, given $\mathcal{F}_{n}$. We begin by extending this result, showing that $\widetilde{\boldsymbol{\beta}}$ is consistent for $\boldsymbol{\beta}^{o}$ in the unconditional space. Based on Theorem 1 of Wang et al. (2018), for any $\epsilon>0$,

\[
\lim_{q_{n},n\rightarrow\infty}\Pr\big(\|\widetilde{\boldsymbol{\beta}}-\widehat{\boldsymbol{\beta}}_{MLE}\|_{2}>\epsilon\,\big|\,\mathcal{F}_{n}\big)=0\,.
\]

In the unconditional probability space, $\Pr(\|\widetilde{\boldsymbol{\beta}}-\widehat{\boldsymbol{\beta}}_{MLE}\|_{2}>\epsilon\,|\,\mathcal{F}_{n})$ is itself a random variable. Hence, denote it by $\zeta_{n,q_{n}}$, and it follows that

\[
\Pr\Big(\lim_{q_{n},n\rightarrow\infty}\zeta_{n,q_{n}}=0\Big)=1\,,
\]

in the sense that $\zeta_{n,q_{n}}\xrightarrow{a.s.}0$ as $q_{n},n\rightarrow\infty$. Then, for any $\epsilon>0$,

\begin{equation}
\lim_{q_{n},n\rightarrow\infty}\Pr\big(\|\widetilde{\boldsymbol{\beta}}-\widehat{\boldsymbol{\beta}}_{MLE}\|_{2}>\epsilon\big)=\lim_{q_{n},n\rightarrow\infty}E(\zeta_{n,q_{n}})=E\Big(\lim_{q_{n},n\rightarrow\infty}\zeta_{n,q_{n}}\Big)=0\,,
\tag{S.23}
\end{equation}

where the interchange of expectation and limit is allowed by the dominated convergence theorem, since $\zeta_{n,q_{n}}$ is trivially bounded by $1$. Next, we write

\begin{align*}
\Pr\big(\|\widetilde{\boldsymbol{\beta}}-\boldsymbol{\beta}^{o}\|_{2}>\epsilon\big)&=\Pr\big(\|\widetilde{\boldsymbol{\beta}}+\widehat{\boldsymbol{\beta}}_{MLE}-\widehat{\boldsymbol{\beta}}_{MLE}-\boldsymbol{\beta}^{o}\|_{2}>\epsilon\big)\\
&\leq\Pr\big(\|\widetilde{\boldsymbol{\beta}}-\widehat{\boldsymbol{\beta}}_{MLE}\|_{2}+\|\widehat{\boldsymbol{\beta}}_{MLE}-\boldsymbol{\beta}^{o}\|_{2}>\epsilon\big)\\
&\leq\Pr\big(\{\|\widetilde{\boldsymbol{\beta}}-\widehat{\boldsymbol{\beta}}_{MLE}\|_{2}>\epsilon/2\}\cup\{\|\widehat{\boldsymbol{\beta}}_{MLE}-\boldsymbol{\beta}^{o}\|_{2}>\epsilon/2\}\big)\\
&\leq\Pr\big(\|\widetilde{\boldsymbol{\beta}}-\widehat{\boldsymbol{\beta}}_{MLE}\|_{2}>\epsilon/2\big)+\Pr\big(\|\widehat{\boldsymbol{\beta}}_{MLE}-\boldsymbol{\beta}^{o}\|_{2}>\epsilon/2\big)\,.
\end{align*}

Taking limits on both sides yields

\begin{align*}
\lim_{q_{n},n\rightarrow\infty}\Pr\big(\|\widetilde{\boldsymbol{\beta}}-\boldsymbol{\beta}^{o}\|_{2}>\epsilon\big)
&\leq\lim_{q_{n},n\rightarrow\infty}\Pr\big(\|\widetilde{\boldsymbol{\beta}}-\widehat{\boldsymbol{\beta}}_{MLE}\|_{2}>\epsilon/2\big)+\lim_{q_{n},n\rightarrow\infty}\Pr\big(\|\widehat{\boldsymbol{\beta}}_{MLE}-\boldsymbol{\beta}^{o}\|_{2}>\epsilon/2\big)=0\,,
\end{align*}

where the first addend is $0$ due to Equation (S.23) and the second addend is $0$ by the well-known asymptotic properties of the logistic regression MLE. We conclude that

\[
\lim_{q_{n},n\rightarrow\infty}\Pr\big(\|\widetilde{\boldsymbol{\beta}}-\boldsymbol{\beta}^{o}\|_{2}>\epsilon\big)=0\,.
\]

Similarly to Eq. (S.12) in Wang et al. (2018), a Taylor expansion of the subsample-based pseudo-score function, evaluated at $\widetilde{\boldsymbol{\beta}}$ and expanded around $\boldsymbol{\beta}^{o}$ instead of $\widehat{\boldsymbol{\beta}}_{MLE}$, gives

\begin{equation}
\widetilde{\boldsymbol{\beta}}-\boldsymbol{\beta}^{o}=-\widetilde{\mathbf{M}}_{X}^{-1}(\boldsymbol{\beta}^{o})\left\{\frac{1}{n}\frac{\partial l^{*}(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{T}}+o_{P}(\|\widetilde{\boldsymbol{\beta}}-\boldsymbol{\beta}^{o}\|)\right\}\,,
\tag{S.24}
\end{equation}

where the consistency of $\widetilde{\mathbf{M}}_{\mathbf{X}}(\widetilde{\boldsymbol{\beta}})$ for $\mathbf{M}_{\mathbf{X}}(\boldsymbol{\beta}^{o})$ is derived in a manner similar to the proof of the consistency of $\widetilde{\boldsymbol{\beta}}$ for $\boldsymbol{\beta}^{o}$, based on Eq. (S.1) in Wang et al. (2018).

Denote by $R_{i}$ the number of times observation $i$ appears in the subsample. Then,

l(𝜷)𝜷Tsuperscript𝑙𝜷superscript𝜷𝑇\displaystyle\frac{\partial l^{*}(\boldsymbol{\beta})}{\partial\boldsymbol{%\beta}^{T}}divide start_ARG ∂ italic_l start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_italic_β ) end_ARG start_ARG ∂ bold_italic_β start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_ARG=\displaystyle==i𝒬wi(Diμi(𝜷))𝐗isubscript𝑖𝒬superscriptsubscript𝑤𝑖superscriptsubscript𝐷𝑖superscriptsubscript𝜇𝑖𝜷subscript𝐗𝑖\displaystyle\sum_{i\in\mathcal{Q}}w_{i}^{*}\big{(}D_{i}^{*}-\mu_{i}^{*}(%\boldsymbol{\beta})\big{)}\mathbf{X}_{i}∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_Q end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT - italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_italic_β ) ) bold_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT(S.25)
=\displaystyle==iwi(1μi(𝜷))𝐗ii{𝒬}wiμi(𝜷)𝐗isubscript𝑖subscript𝑤𝑖1subscript𝜇𝑖𝜷subscript𝐗𝑖subscript𝑖𝒬subscript𝑤𝑖subscript𝜇𝑖𝜷subscript𝐗𝑖\displaystyle\sum_{i\in\mathcal{E}}w_{i}\big{(}1-\mu_{i}(\boldsymbol{\beta})%\big{)}\mathbf{X}_{i}-\sum_{i\in\{\mathcal{Q}\setminus\mathcal{E}\}}w_{i}\mu_{%i}(\boldsymbol{\beta})\mathbf{X}_{i}∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_E end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( 1 - italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_β ) ) bold_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - ∑ start_POSTSUBSCRIPT italic_i ∈ { caligraphic_Q ∖ caligraphic_E } end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_β ) bold_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT
=\displaystyle==i(1μi(𝜷))𝐗ii{𝒬}wiμi(𝜷)𝐗isubscript𝑖1subscript𝜇𝑖𝜷subscript𝐗𝑖subscript𝑖𝒬subscript𝑤𝑖subscript𝜇𝑖𝜷subscript𝐗𝑖\displaystyle\sum_{i\in\mathcal{E}}\big{(}1-\mu_{i}(\boldsymbol{\beta})\big{)}%\mathbf{X}_{i}-\sum_{i\in\{\mathcal{Q}\setminus\mathcal{E}\}}w_{i}\mu_{i}(%\boldsymbol{\beta})\mathbf{X}_{i}∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_E end_POSTSUBSCRIPT ( 1 - italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_β ) ) bold_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - ∑ start_POSTSUBSCRIPT italic_i ∈ { caligraphic_Q ∖ caligraphic_E } end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_β ) bold_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT
=\displaystyle==i(1μi(𝜷))𝐗ii{𝒬}wiμi(𝜷)𝐗ii𝒩μi(𝜷)𝐗i+i𝒩μi(𝜷)𝐗isubscript𝑖1subscript𝜇𝑖𝜷subscript𝐗𝑖subscript𝑖𝒬subscript𝑤𝑖subscript𝜇𝑖𝜷subscript𝐗𝑖subscript𝑖𝒩subscript𝜇𝑖𝜷subscript𝐗𝑖subscript𝑖𝒩subscript𝜇𝑖𝜷subscript𝐗𝑖\displaystyle\sum_{i\in\mathcal{E}}\big{(}1-\mu_{i}(\boldsymbol{\beta})\big{)}%\mathbf{X}_{i}-\sum_{i\in\{\mathcal{Q}\setminus\mathcal{E}\}}w_{i}\mu_{i}(%\boldsymbol{\beta})\mathbf{X}_{i}-\sum_{i\in\mathcal{N}}\mu_{i}(\boldsymbol{%\beta})\mathbf{X}_{i}+\sum_{i\in\mathcal{N}}\mu_{i}(\boldsymbol{\beta})\mathbf%{X}_{i}∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_E end_POSTSUBSCRIPT ( 1 - italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_β ) ) bold_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - ∑ start_POSTSUBSCRIPT italic_i ∈ { caligraphic_Q ∖ caligraphic_E } end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_β ) bold_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_N end_POSTSUBSCRIPT italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_β ) bold_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_N end_POSTSUBSCRIPT italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_β ) bold_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT
\begin{align*}
&= \sum_{i=1}^{n}\big(D_{i}-\mu_{i}(\boldsymbol{\beta})\big)\mathbf{X}_{i}-\sum_{i\in\{\mathcal{Q}\setminus\mathcal{E}\}}w_{i}\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}+\sum_{i\in\mathcal{N}}\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}\\
&= \sum_{i=1}^{n}\big(D_{i}-\mu_{i}(\boldsymbol{\beta})\big)\mathbf{X}_{i}-\sum_{i\in\mathcal{N}}R_{i}w_{i}\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}+\sum_{i\in\mathcal{N}}\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}\\
&= \sum_{i=1}^{n}\big(D_{i}-\mu_{i}(\boldsymbol{\beta})\big)\mathbf{X}_{i}+\sum_{i\in\mathcal{N}}(1-w_{i}R_{i})\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}\\
&= \frac{\partial l(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}^{T}}+\sum_{i=1}^{n}(1-w_{i}R_{i})\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}.
\end{align*}
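The rearrangement above is purely algebraic and can be checked numerically. The sketch below is illustrative only: it assumes Poisson subsampling of the controls with hypothetical inclusion probabilities $q_n\pi_i$ and weights $w_i=1/(q_n\pi_i)$, and all sizes and values are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, qn = 200, 3, 40              # illustrative sizes: n observations, r covariates
X = rng.normal(size=(n, r))        # covariates X_i
beta = rng.normal(size=r)
mu = 1.0 / (1.0 + np.exp(-X @ beta))        # mu_i(beta)
D = (rng.random(n) < mu).astype(float)      # binary outcomes

ctrl = D == 0                      # the control set N (non-events)
pi = rng.random(n)
pi[~ctrl] = 0.0
pi /= pi.sum()                     # sampling probabilities over N, summing to 1
w = np.zeros(n)
w[ctrl] = 1.0 / (qn * pi[ctrl])    # weights w_i = 1/(qn * pi_i)
R = np.zeros(n)
R[ctrl] = rng.random(ctrl.sum()) < np.minimum(1.0, qn * pi[ctrl])  # Poisson sampling

score = ((D - mu)[:, None] * X).sum(axis=0)   # full-data score dl/dbeta

# Left-hand side: full score, minus the weighted sampled-control term,
# plus the sum of mu_i X_i over all controls
lhs = (score
       - np.sum(((R * w * mu)[ctrl])[:, None] * X[ctrl], axis=0)
       + np.sum((mu[ctrl])[:, None] * X[ctrl], axis=0))

# Right-hand side: dl/dbeta + sum_{i in N} (1 - w_i R_i) mu_i X_i
rhs = score + np.sum((((1.0 - w * R) * mu)[ctrl])[:, None] * X[ctrl], axis=0)

assert np.allclose(lhs, rhs)
```

Because the identity holds for every realization of the sampling indicators $R_i$, the check passes regardless of the random seed.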

Based on Eqs. (S.24) and (S.25), we conclude that

\begin{equation}
\sqrt{n}(\widetilde{\boldsymbol{\beta}}-\boldsymbol{\beta}^{o})=-\widetilde{\mathbf{M}}_{X}^{-1}(\boldsymbol{\beta}^{o})\frac{1}{\sqrt{n}}\frac{\partial l(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{T}}-\widetilde{\mathbf{M}}_{X}^{-1}(\boldsymbol{\beta}^{o})\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(1-w_{i}R_{i})\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}+o_{P}\big(\sqrt{n}\,\|\widetilde{\boldsymbol{\beta}}-\boldsymbol{\beta}^{o}\|_{2}\big).
\tag{S.26}
\end{equation}

Now it will be shown that $n^{-1/2}\,\partial l(\boldsymbol{\beta}^{o})/\partial\boldsymbol{\beta}^{T}$ and $n^{-1/2}\sum_{i=1}^{n}(1-w_{i}R_{i})\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}$ are asymptotically independent and that each of them is asymptotically normal. From the asymptotic theory of standard logistic regression,

\begin{equation*}
-\mathbf{M}_{X}^{-1/2}(\boldsymbol{\beta}^{o})\frac{1}{\sqrt{n}}\frac{\partial l(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{T}}\xrightarrow{D}N(0,\mathbf{I})\,,
\end{equation*}

and

\begin{equation}
\frac{1}{\sqrt{n}}\frac{\partial l(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{T}}\xrightarrow{D}N\big(0,\boldsymbol{\Sigma}(\boldsymbol{\beta}^{o})\big)\,.
\tag{S.27}
\end{equation}

Also, $(\sqrt{q_{n}}/n)\sum_{i\in\mathcal{N}}w_{i}R_{i}\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}$ can alternatively be expressed as a sum of independent and identically distributed observations in the conditional space, namely

\begin{align*}
\frac{\sqrt{q_{n}}}{n}\sum_{i\in\mathcal{N}}w_{i}R_{i}\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}
&= \frac{\sqrt{q_{n}}}{n}\sum_{i=1}^{q_{n}}w_{i}^{*}\mu_{i}^{*}(\boldsymbol{\beta})\mathbf{X}_{i}^{*}\\
&= \frac{\sqrt{q_{n}}}{n}\sum_{i=1}^{q_{n}}\frac{\mu_{i}^{*}(\boldsymbol{\beta})\mathbf{X}_{i}^{*}}{\pi_{i}^{*}q_{n}}\\
&= \frac{1}{\sqrt{q_{n}}}\sum_{i=1}^{q_{n}}\frac{\mu_{i}^{*}(\boldsymbol{\beta})\mathbf{X}_{i}^{*}}{n\pi_{i}^{*}}\\
&\equiv \frac{1}{\sqrt{q_{n}}}\sum_{i=1}^{q_{n}}\boldsymbol{\omega}_{i}(\boldsymbol{\pi},\boldsymbol{\beta}^{o}).
\end{align*}

Since the distribution of $\boldsymbol{\omega}_{i}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})$ changes as a function of $n$ and $q_{n}$, the Lindeberg--Feller condition (Van der Vaart, 2000, Proposition 2.27) should be established, as it covers the setting of triangular arrays. First, denote $\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta})\equiv\mathrm{Var}\big(\boldsymbol{\omega}_{i}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\,|\,\mathcal{F}_{n}\big)$. It follows that

\begin{align*}
\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta})
&= E\big(\boldsymbol{\omega}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\boldsymbol{\omega}^{T}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\,|\,\mathcal{F}_{n}\big)-E\big(\boldsymbol{\omega}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\,|\,\mathcal{F}_{n}\big)E\big(\boldsymbol{\omega}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\,|\,\mathcal{F}_{n}\big)^{T}\\
&= \frac{1}{n^{2}}\bigg\{\sum_{i\in\mathcal{N}}\frac{\mu^{2}_{i}(\boldsymbol{\beta})\mathbf{X}_{i}\mathbf{X}_{i}^{T}}{\pi_{i}}-\sum_{i,j\in\mathcal{N}}\mu_{i}(\boldsymbol{\beta})\mu_{j}(\boldsymbol{\beta})\mathbf{X}_{i}\mathbf{X}_{j}^{T}\bigg\}=O_{|\mathcal{F}_{n}}(1),
\end{align*}

where the last equality is due to Assumptions A.1 and A.2.
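Because $\boldsymbol{\omega}$ has finite support over the control set, the closed form of $\mathbf{K}^{R}$ above can be confirmed by exact enumeration of its first two moments. The sketch below uses small, hypothetical values purely for illustration; it is not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
m, r = 8, 2                        # illustrative: m controls, r covariates
X = rng.normal(size=(m, r))        # covariates X_i, i in N
beta = np.array([0.5, -1.0])
mu = 1.0 / (1.0 + np.exp(-X @ beta))
pi = rng.random(m)
pi /= pi.sum()                     # sampling probabilities, summing to 1
n = m                              # treat the control set as the whole sample here

# omega_i = mu_i X_i / (n pi_i); index i is drawn with Pr(i) = pi_i
omega = (mu / (n * pi))[:, None] * X

# Exact moments by enumerating the finite support
E_omega = (pi[:, None] * omega).sum(axis=0)
E_outer = np.einsum("i,ij,ik->jk", pi, omega, omega)
K_enum = E_outer - np.outer(E_omega, E_omega)

# Closed form from the text:
# (1/n^2){ sum_i mu_i^2 X_i X_i^T / pi_i - sum_{i,j} mu_i mu_j X_i X_j^T }
S1 = np.einsum("i,ij,ik->jk", mu**2 / pi, X, X)
mvec = (mu[:, None] * X).sum(axis=0)
K_formula = (S1 - np.outer(mvec, mvec)) / n**2

assert np.allclose(K_enum, K_formula)
```

The enumeration replaces the expectation over the random index with a weighted sum, so the two matrices agree up to floating-point error.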

Now, for every $\varepsilon>0$ and some $\delta>0$,

\begin{align*}
\sum_{i=1}^{q_{n}}&\,E\big\{\|q_{n}^{-1/2}\boldsymbol{\omega}_{i}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\|_{2}^{2}\,I\big(\|q_{n}^{-1/2}\boldsymbol{\omega}_{i}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\|_{2}>\varepsilon\big)\,\big|\,\mathcal{F}_{n}\big\}\\
&\leq \frac{1}{q_{n}^{1+\delta/2}\varepsilon^{\delta}}\sum_{i=1}^{q_{n}}E\Bigg\{\bigg\|\frac{\mu_{i}^{*}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}^{*}}{n\pi_{i}^{*}}\bigg\|_{2}^{2+\delta}\,\Bigg|\,\mathcal{F}_{n}\Bigg\}\\
&= \frac{1}{q_{n}^{\delta/2}\varepsilon^{\delta}n^{2+\delta}}\sum_{i\in\mathcal{N}}\frac{\{\mu_{i}(\boldsymbol{\beta})\}^{2}\|\mathbf{X}_{i}\|_{2}^{2+\delta}}{\pi_{i}^{\delta+1}}\\
&\leq \frac{1}{q_{n}^{\delta/2}\varepsilon^{\delta}n^{2+\delta}}\sum_{i\in\mathcal{N}}\frac{\|\mathbf{X}_{i}\|_{2}^{2+\delta}}{\pi_{i}^{\delta+1}}=o_{P|\mathcal{F}_{n}}(1),
\end{align*}

where the first inequality is due to Van der Vaart (2000, p.~21) and the last equality is due to Assumption A.3. Since $E(1-w_{i}R_{i}\,|\,\mathcal{F}_{n})=0$, it holds that $q_{n}^{1/2}n^{-1}\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta})^{-1/2}\sum_{i\in\mathcal{N}}(1-w_{i}R_{i})\mu_{i}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}$ converges, conditionally on $\mathcal{F}_{n}$, to a standard multivariate normal distribution. Put differently, for any $\mathbf{u}\in\mathbb{R}^{r}$,

\begin{equation}
\Pr\bigg\{\frac{\sqrt{q_{n}}}{n}\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta})^{-1/2}\sum_{i\in\mathcal{N}}(1-w_{i}R_{i})\mu_{i}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}\leq\mathbf{u}\,\Big|\,\mathcal{F}_{n}\bigg\}\rightarrow\Phi(\mathbf{u})\,,
\tag{S.28}
\end{equation}

where $\Phi$ is the cumulative distribution function of the standard multivariate normal distribution. The conditional probability in Eq. (S.28) is a random variable in the unconditional space, and it converges almost surely to $\Phi(\mathbf{u})$. Being additionally bounded, the dominated convergence theorem implies that for any $\mathbf{u}\in\mathbb{R}^{r}$,

\begin{equation}
\Pr\bigg\{\frac{\sqrt{q_{n}}}{n}\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta})^{-1/2}\sum_{i\in\mathcal{N}}(1-w_{i}R_{i})\mu_{i}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}\leq\mathbf{u}\bigg\}\rightarrow\Phi(\mathbf{u})\,.
\tag{S.29}
\end{equation}

Suppose that $\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\xrightarrow{P}\boldsymbol{\Psi}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})$, where $\boldsymbol{\Psi}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})$ is a positive-definite matrix. Denote by $\theta$ the limit of $q_{n}/n$, whose existence was assumed earlier. Then, from Eq. (S.29),

\begin{equation}
\frac{1}{\sqrt{n}}\sum_{i\in\mathcal{N}}(1-w_{i}R_{i})\mu_{i}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}\xrightarrow{D}N\big(0,\theta\boldsymbol{\Psi}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\big).
\tag{S.30}
\end{equation}
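The conditional unbiasedness $E(1-w_{i}R_{i}\,|\,\mathcal{F}_{n})=0$ used in this argument follows directly from the subsampling construction. A minimal sketch, assuming Poisson subsampling with hypothetical inclusion probabilities $q_{n}\pi_{i}$ and weights $w_{i}=1/(q_{n}\pi_{i})$ (all values illustrative):

```python
import numpy as np

m, qn = 10, 4                          # illustrative: m controls, expected subsample size qn
pi = np.arange(1, m + 1) / np.arange(1, m + 1).sum()   # sampling probabilities, sum to 1
p_incl = qn * pi                       # Pr(R_i = 1 | F_n) under Poisson subsampling
assert np.all(p_incl < 1.0)            # inclusion probabilities are valid for this choice

w = 1.0 / (qn * pi)                    # weights w_i = 1/(qn pi_i)

# E(1 - w_i R_i | F_n) = 1 - w_i * Pr(R_i = 1 | F_n) = 0 exactly, for every i
cond_mean = 1.0 - w * p_incl
assert np.allclose(cond_mean, 0.0)
```

Since each weight is the reciprocal of its inclusion probability, the correction term is exactly mean-zero conditionally, which is what makes the centered sum amenable to the conditional CLT above.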

In the following, it will be shown that the two addends are asymptotically independent. Write

\begin{align}
\lim_{n,q_{n}\rightarrow\infty}&\Pr\bigg(\frac{1}{\sqrt{n}}\frac{\partial l(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{T}}\leq\mathbf{u}\,,\;\frac{1}{\sqrt{n}}\sum_{i\in\mathcal{N}}(1-w_{i}R_{i})\mu_{i}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}\leq\mathbf{v}\bigg)\tag{S.31}\\
&= \lim_{n,q_{n}\rightarrow\infty}E\bigg(I\bigg\{\frac{1}{\sqrt{n}}\frac{\partial l(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{T}}\leq\mathbf{u}\bigg\}\Pr\bigg\{\frac{1}{\sqrt{n}}\sum_{i\in\mathcal{N}}(1-w_{i}R_{i})\mu_{i}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}\leq\mathbf{v}\,\Big|\,\mathcal{F}_{n}\bigg\}\bigg)\nonumber\\
&= E\bigg(\lim_{n,q_{n}\rightarrow\infty}I\bigg\{\frac{1}{\sqrt{n}}\frac{\partial l(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{T}}\leq\mathbf{u}\bigg\}\lim_{n,q_{n}\rightarrow\infty}\Pr\bigg\{\frac{1}{\sqrt{n}}\sum_{i\in\mathcal{N}}(1-w_{i}R_{i})\mu_{i}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}\leq\mathbf{v}\,\Big|\,\mathcal{F}_{n}\bigg\}\bigg)\nonumber\\
&= E\bigg(\lim_{n,q_{n}\rightarrow\infty}I\bigg\{\frac{1}{\sqrt{n}}\frac{\partial l(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{T}}\leq\mathbf{u}\bigg\}\Phi\big(\theta^{-1/2}\boldsymbol{\Psi}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})^{-1/2}\mathbf{v}\big)\bigg)\nonumber\\
&= \lim_{n,q_{n}\rightarrow\infty}E\bigg(I\bigg\{\frac{1}{\sqrt{n}}\frac{\partial l(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{T}}\leq\mathbf{u}\bigg\}\Phi\big(\theta^{-1/2}\boldsymbol{\Psi}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})^{-1/2}\mathbf{v}\big)\bigg)\nonumber\\
&= \lim_{n,q_{n}\rightarrow\infty}\Pr\bigg(\frac{1}{\sqrt{n}}\frac{\partial l(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{T}}\leq\mathbf{u}\bigg)\Phi\big(\theta^{-1/2}\boldsymbol{\Psi}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})^{-1/2}\mathbf{v}\big)\nonumber
\end{align}
=\displaystyle==Φ(𝚺(𝜷o)1/2𝐮)Φ(θ1/2𝚿(𝝅,𝜷o)1/2𝐯)Φ𝚺superscriptsuperscript𝜷𝑜12𝐮Φsuperscript𝜃12𝚿superscript𝝅superscript𝜷𝑜12𝐯\displaystyle\Phi\left(\boldsymbol{\Sigma}(\boldsymbol{\beta}^{o})^{-1/2}%\mathbf{u}\right)\Phi\left(\theta^{-1/2}\boldsymbol{\Psi}(\boldsymbol{\pi},%\boldsymbol{\beta}^{o})^{-1/2}\mathbf{v}\right)roman_Φ ( bold_Σ ( bold_italic_β start_POSTSUPERSCRIPT italic_o end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 / 2 end_POSTSUPERSCRIPT bold_u ) roman_Φ ( italic_θ start_POSTSUPERSCRIPT - 1 / 2 end_POSTSUPERSCRIPT bold_Ψ ( bold_italic_π , bold_italic_β start_POSTSUPERSCRIPT italic_o end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 / 2 end_POSTSUPERSCRIPT bold_v )

where the exchange of limit and expectation in the third equality is justified by the dominated convergence theorem.

Since $\widetilde{\mathbf{M}}_{\mathbf{X}}(\boldsymbol{\beta}^{o})$ is a consistent estimator of $\mathbf{M}_{\mathbf{X}}(\boldsymbol{\beta}^{o})$, its consistency for $\boldsymbol{\Sigma}(\boldsymbol{\beta}^{o})$ follows easily. Then, by Slutsky's theorem and Eqs. (S.27), (S.30) and (S.31), it follows that Eq. (S.26) converges in distribution to a multivariate normal distribution with mean zero and a covariance matrix asymptotically equivalent to $\mathbb{H}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})$. The two variance components correspond to two orthogonal sources of variability: the variance of the original full-data MLE, and the additional variance induced by the subsampling procedure.

S2.2 Proof of Theorem 3.2

The A-optimality criterion is equivalent to minimizing the asymptotic MSE of $\widetilde{\boldsymbol{\beta}}_{TS}$, namely the trace of $\mathbb{H}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})$. However,

\[
Tr\big(\mathbb{H}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\big)=Tr\bigg(\frac{n}{q_{n}}\mathbf{M}_{X}^{-1}(\boldsymbol{\beta}^{o})\mathbb{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\mathbf{M}_{X}^{-1}(\boldsymbol{\beta}^{o})\bigg)+d,
\]

where $d$ is a constant that does not involve $\boldsymbol{\pi}$, and

\begin{align*}
&Tr\bigg(\frac{n}{q_{n}}\mathbf{M}_{X}^{-1}(\boldsymbol{\beta}^{o})\mathbb{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\mathbf{M}_{X}^{-1}(\boldsymbol{\beta}^{o})\bigg)\\
&\quad= Tr\Bigg(\frac{1}{nq_{n}}\mathbf{M}_{X}^{-1}(\boldsymbol{\beta}^{o})\bigg\{\sum_{i\in\mathcal{N}}\frac{\mu^{2}_{i}(\boldsymbol{\beta}^{o})}{\pi_{i}}\mathbf{X}_{i}\mathbf{X}_{i}^{T}-\sum_{i,j\in\mathcal{N}}\mu_{i}(\boldsymbol{\beta}^{o})\mu_{j}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}\mathbf{X}_{j}^{T}\bigg\}\mathbf{M}_{X}^{-1}(\boldsymbol{\beta}^{o})\Bigg).
\end{align*}

By removing the part that does not involve $\boldsymbol{\pi}$ and the factor $(nq_{n})^{-1}$, which does not alter the optimization, we are left with

\begin{align*}
Tr\bigg(\sum_{i\in\mathcal{N}}\frac{\mu^{2}_{i}(\boldsymbol{\beta}^{o})}{\pi_{i}}\mathbf{M}_{X}^{-1}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}\mathbf{X}_{i}^{T}\mathbf{M}_{X}^{-1}(\boldsymbol{\beta}^{o})\bigg)
&= \sum_{i\in\mathcal{N}}\frac{\mu^{2}_{i}(\boldsymbol{\beta}^{o})}{\pi_{i}}Tr\big(\mathbf{X}_{i}^{T}\mathbf{M}_{X}^{-2}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}\big)\\
&= \sum_{i\in\mathcal{N}}\frac{\mu^{2}_{i}(\boldsymbol{\beta}^{o})}{\pi_{i}}\|\mathbf{M}_{X}^{-1}\mathbf{X}_{i}\|_{2}^{2}\,.
\end{align*}
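The trace-to-norm step above uses $Tr(\mathbf{M}_{X}^{-1}\mathbf{X}_{i}\mathbf{X}_{i}^{T}\mathbf{M}_{X}^{-1})=\mathbf{X}_{i}^{T}\mathbf{M}_{X}^{-2}\mathbf{X}_{i}=\|\mathbf{M}_{X}^{-1}\mathbf{X}_{i}\|_{2}^{2}$, which holds because $\mathbf{M}_{X}$ is symmetric. A quick numerical check (the matrix `M` and vector `x` below are arbitrary stand-ins, not quantities from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.normal(size=(d, d))
M = A @ A.T + d * np.eye(d)   # symmetric positive-definite stand-in for M_X
x = rng.normal(size=(d, 1))

Minv = np.linalg.inv(M)
lhs = np.trace(Minv @ x @ x.T @ Minv)        # Tr(M^{-1} x x^T M^{-1})
rhs = float(np.linalg.norm(Minv @ x) ** 2)   # ||M^{-1} x||_2^2
assert np.isclose(lhs, rhs)
```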

Define the following Lagrangian function, with multiplier $\alpha$,

\[
g(\boldsymbol{\pi})=\sum_{i\in\mathcal{N}}\frac{\mu^{2}_{i}(\boldsymbol{\beta}^{o})}{\pi_{i}}\|\mathbf{M}_{X}^{-1}\mathbf{X}_{i}\|_{2}^{2}+\alpha\bigg(1-\sum_{i\in\mathcal{N}}\pi_{i}\bigg)\,.
\]

Differentiating $g(\boldsymbol{\pi})$ with respect to $\pi_{i}$ for any $i\in\mathcal{N}$ and setting the derivative to zero gives

\[
\frac{\partial g(\boldsymbol{\pi})}{\partial\pi_{i}}=-\frac{\mu^{2}_{i}(\boldsymbol{\beta}^{o})\|\mathbf{M}_{X}^{-1}\mathbf{X}_{i}\|_{2}^{2}}{\pi_{i}^{2}}-\alpha\equiv 0,
\]

and

\[
\pi_{i}=\frac{\mu_{i}(\boldsymbol{\beta}^{o})\|\mathbf{M}_{X}^{-1}\mathbf{X}_{i}\|_{2}}{\sqrt{-\alpha}}\,.
\]

Since $\sum_{i\in\mathcal{N}}\pi_{i}=1$,

\[
\sqrt{-\alpha}=\sum_{i\in\mathcal{N}}\mu_{i}(\boldsymbol{\beta}^{o})\|\mathbf{M}_{X}^{-1}\mathbf{X}_{i}\|_{2},
\]

which yields Eq. (3.9) in the main text. The proof of Eq. (3.10) of the main text follows similarly.
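The closed-form solution can also be checked numerically: by the Cauchy–Schwarz inequality, probabilities proportional to $\mu_{i}(\boldsymbol{\beta}^{o})\|\mathbf{M}_{X}^{-1}\mathbf{X}_{i}\|_{2}$ minimize $\sum_{i}\mu_{i}^{2}\|\mathbf{M}_{X}^{-1}\mathbf{X}_{i}\|_{2}^{2}/\pi_{i}$ over the probability simplex. The sketch below is a minimal illustration under assumed stand-ins — a logistic model with $\mu_{i}=|D_{i}-p_{i}|$ and $\mathbf{M}_{X}$ taken as the scaled information matrix — not the paper's exact definitions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 3
X = rng.normal(size=(n, d))
beta = np.array([0.5, -1.0, 0.25])
p = 1.0 / (1.0 + np.exp(-X @ beta))          # logistic success probabilities
D = rng.binomial(1, p)

# Hypothetical stand-ins: mu_i = |D_i - p_i| and M_X = n^{-1} X^T W X
mu = np.abs(D - p)
W = p * (1 - p)
M = (X.T * W) @ X / n
norms = np.linalg.norm(np.linalg.solve(M, X.T), axis=0)   # ||M_X^{-1} x_i||_2

pi = mu * norms
pi /= pi.sum()               # Lagrangian solution, normalized to sum to one

def obj(q):
    """A-optimality objective: sum_i mu_i^2 ||M_X^{-1} x_i||^2 / q_i."""
    return np.sum(mu ** 2 * norms ** 2 / q)

# No other feasible probability vector achieves a smaller objective value
for _ in range(200):
    q = rng.dirichlet(np.ones(n))
    assert obj(pi) <= obj(q) + 1e-9
```

The minimum value of the objective at the Lagrangian solution is $\big(\sum_{i}\mu_{i}\|\mathbf{M}_{X}^{-1}\mathbf{X}_{i}\|_{2}\big)^{2}$, which is what Cauchy–Schwarz guarantees cannot be beaten.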

S2.3 Proof of Theorem 4.1

Following the main steps of the proof of Theorem 2 in Wang et al. (2018), it is straightforward to show that, given $\mathcal{F}_{n}$,

\[
\frac{1}{n}\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})^{1/2}\frac{\partial l^{*}(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{o}}=\frac{1}{\sqrt{q_{n}}}\{\mathrm{Var}(\boldsymbol{\eta}_{i}|\mathcal{F}_{n})\}^{-1/2}\sum_{i=1}^{q_{n}}\boldsymbol{\eta}_{i}\xrightarrow{D}N(\mathbf{0},\mathbf{I}),
\]

where

\[
\boldsymbol{\eta}_{i}\equiv\frac{\{D_{i}^{*}-\mu_{i}^{*}(\boldsymbol{\beta}^{o})\}\mathbf{X}_{i}^{*}}{n\pi_{i}^{*}}\,,\qquad i=1,\dots,q_{n},
\]

are independent and identically distributed with mean $\mathbf{0}$ and variance $q_{n}\mathbf{K}^{B}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})$. In other words, for all $\mathbf{u}\in\mathbb{R}^{r}$,

\[
\Pr\Big\{n^{-1}\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})^{1/2}\frac{\partial l^{*}(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{o}}\leq\mathbf{u}\,\Big|\,\mathcal{F}_{n}\Big\}\xrightarrow{P}\Phi(\mathbf{u})\,.
\tag{S.32}
\]

The conditional probability in Eq. (S.32) is a bounded random variable, so convergence in probability to a constant implies convergence in mean. Therefore,

\[
\Pr\Big\{n^{-1}\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})^{1/2}\frac{\partial l^{*}(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{o}}\leq\mathbf{u}\Big\}
= E\bigg[\Pr\Big\{n^{-1}\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})^{1/2}\frac{\partial l^{*}(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{o}}\leq\mathbf{u}\,\Big|\,\mathcal{F}_{n}\Big\}\bigg]\longrightarrow\Phi(\mathbf{u})\,,
\]

and therefore

\[
\frac{1}{n}\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})^{1/2}\frac{\partial l^{*}(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{o}}\xrightarrow{D}N(\mathbf{0},\mathbf{I})
\]

in the unconditional space. The rest of the proof follows directly from Wang et al. (2018).
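As a sanity check on the construction of the $\boldsymbol{\eta}_{i}$'s, the sketch below draws subsample indices with replacement and verifies that the empirical mean of $\boldsymbol{\eta}_{i}$ matches its conditional expectation, $n^{-1}\sum_{i\in\mathcal{N}}\{D_{i}-\mu_{i}(\boldsymbol{\beta})\}\mathbf{X}_{i}$ — the full-data score divided by $n$, which is close to zero at the data-generating parameter. The logistic model and the sampling probabilities used here are hypothetical stand-ins, not the paper's definitions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 400, 2
X = rng.normal(size=(n, d))
beta = np.array([1.0, -0.5])
p = 1.0 / (1.0 + np.exp(-X @ beta))
D = rng.binomial(1, p)

# Hypothetical sampling probabilities: score magnitude plus a small floor
pi = np.abs(D - p) + 1e-3
pi /= pi.sum()

# eta_i = (D_i* - mu_i*) X_i* / (n pi_i*), drawn with replacement
idx = rng.choice(n, size=200_000, p=pi)
eta = (D[idx] - p[idx])[:, None] * X[idx] / (n * pi[idx][:, None])

# Conditional mean of a single draw: sum_i pi_i (D_i - p_i) x_i / (n pi_i),
# i.e. the full-data score divided by n
cond_mean = ((D - p)[:, None] * X).sum(axis=0) / n
assert np.allclose(eta.mean(axis=0), cond_mean, atol=5e-3)
```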

S3 Additional Simulation Results

In Fig. S1, we compare the Frobenius norms of three covariance matrices: (i) the covariance matrix of the two-step estimator $\widetilde{\boldsymbol{\beta}}_{TS}$; (ii) the approximated covariance matrix utilized in Step 1.5; and (iii) the empirical covariance matrix of $\widetilde{\boldsymbol{\beta}}_{TS}$. Fig. S2 demonstrates the validity of the variance estimator (3.11) and the effectiveness of optimal subsampling over uniform subsampling.
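The qualitative finding of Fig. S2 — optimal subsampling outperforming uniform subsampling — can be reproduced in a small Monte Carlo. The sketch below is purely illustrative and is not the paper's implementation: it uses oracle A-optimal-style probabilities computed at the true parameter (in practice a pilot estimate would be used) and a simple inverse-probability-weighted Newton solver:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, q = 20_000, 3, 500
X = rng.normal(size=(n, d))
beta = np.array([0.5, -0.5, 1.0])
p = 1.0 / (1.0 + np.exp(-X @ beta))
D = rng.binomial(1, p)

def wmle(idx, w):
    """Inverse-probability-weighted logistic MLE via Newton's method."""
    b = np.zeros(d)
    for _ in range(25):
        mu = 1.0 / (1.0 + np.exp(-np.clip(X[idx] @ b, -30, 30)))
        g = (w * (D[idx] - mu)) @ X[idx]
        H = (X[idx] * (w * mu * (1 - mu))[:, None]).T @ X[idx]
        b = b + np.linalg.solve(H, g)
    return b

# Oracle A-optimal-style probabilities, computed at the true beta for
# illustration (a pilot estimate would be used in practice)
M = (X.T * (p * (1 - p))) @ X / n
norms = np.linalg.norm(np.linalg.solve(M, X.T), axis=0)
pi_opt = np.abs(D - p) * norms + 1e-8
pi_opt /= pi_opt.sum()
pi_uni = np.full(n, 1.0 / n)

def mse(pi, reps=200):
    errs = []
    for _ in range(reps):
        idx = rng.choice(n, q, p=pi)
        b = wmle(idx, 1.0 / pi[idx])
        errs.append(np.sum((b - beta) ** 2))
    return float(np.mean(errs))

assert mse(pi_opt) < mse(pi_uni)   # optimal subsampling wins on MSE
```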

[Fig. S1]
[Fig. S2]

S4 Linked Birth and Infant Death Data - Additional Results

The covariates in the model are summarized in Tables S1–S3. Tables S4–S6 present the estimated coefficients for each method, with $c=10$. While the results are organized into three tables for clarity, note that the FDR procedure was executed once, encompassing all coefficients collectively.

Table S1. Values are given as non-events (N=28,410,519) | events (N=176,400).

Mother's age (limited 12-50)
  Mean (SD): 27.71 (6.09) | 26.9 (6.5)
  Median [Min, Max]: 28 [12, 50] | 26 [12, 50]
Live Birth Order
  Mean (SD): 2.08 (1.24) | 2.19 (1.4)
  Median [Min, Max]: 2 [1, 8] | 2 [1, 8]
Number of Prenatal Visits
  Mean (SD): 11.26 (3.94) | 8.21 (5.11)
  Median [Min, Max]: 12 [0, 49] | 8 [0, 49]
Weight Gain (limited to 99 pounds)
  Mean (SD): 30.51 (14.32) | 22.95 (15.04)
  Median [Min, Max]: 30 [0, 98] | 22 [0, 98]
Five Minute APGAR Score
  Mean (SD): 8.84 (0.71) | 5.19 (3.44)
  Median [Min, Max]: 9 [0, 10] | 6 [0, 10]
Plurality (limited to 5)
  Mean (SD): 1.04 (0.19) | 1.65 (0.42)
  Median [Min, Max]: 1 [1, 5] | 1 [1, 5]
Gestation weeks
  Mean (SD): 38.65 (2.37) | 30.21 (7.68)
  Median [Min, Max]: 39 [17, 47] | 30 [17, 47]
Years after 2007
  Mean (SD): 2.93 (2.01) | 2.85 (2.01)
  Median [Min, Max]: 3 [0, 6] | 3 [0, 6]
Birth month
  January: 8.19% | 8.36%
  February: 7.63% | 7.72%
  March: 8.31% | 8.35%
  April: 7.98% | 8.19%
  May: 8.33% | 8.56%
  June: 8.32% | 8.32%
  July: 8.80% | 8.66%
  August: 8.93% | 8.79%
  September: 8.66% | 8.42%
  October: 8.50% | 8.52%
  November: 8.02% | 7.97%
  December: 8.33% | 8.14%
Birth weekday
  Sunday: 9.31% | 11.36%
  Monday: 15.18% | 14.73%
  Tuesday: 16.59% | 15.51%
  Wednesday: 16.27% | 15.38%
  Thursday: 16.21% | 15.57%
  Friday: 15.84% | 15.25%
  Saturday: 10.60% | 12.20%
Birth place
  In hospital: 98.80% | 98.53%
  Not in hospital: 1.20% | 1.47%
Residence status
  Resident: 72.97% | 65.82%
  Interstate nonresident (type 1): 24.73% | 30.26%
  Interstate nonresident (type 2): 2.11% | 3.83%
  Foreign resident: 0.19% | 0.09%
Table S2. Values are given as non-events (N=28,410,519) | events (N=176,400).

Mother's race
  White: 76.72% | 64.51%
  Black: 15.81% | 29.58%
  American Indian / Alaskan Native: 1.16% | 1.55%
  Asian / Pacific Islander: 6.31% | 4.36%
Mother's marital status
  Married: 59.52% | 45.56%
  Not Married: 40.48% | 54.44%
Father's race
  White: 63.37% | 47.25%
  Black: 11.58% | 17.65%
  American Indian / Alaskan Native: 0.87% | 1.01%
  Asian / Pacific Islander: 5.25% | 3.24%
  Unknown: 18.93% | 30.85%
Diabetes
  Yes: 5.14% | 4.65%
  No: 94.86% | 95.35%
Chronic Hypertension
  Yes: 1.32% | 2.65%
  No: 98.68% | 97.35%
Prepregnancy Associated Hypertension
  Yes: 4.27% | 4.82%
  No: 95.73% | 95.18%
Eclampsia
  Yes: 0.25% | 0.58%
  No: 99.75% | 99.42%
Induction of Labor
  Yes: 23.01% | 13.01%
  No: 76.99% | 86.99%
Tocolysis
  Yes: 1.19% | 4.30%
  No: 98.81% | 95.70%
Meconium
  Yes: 4.74% | 3.71%
  No: 95.26% | 96.29%
Precipitous Labor
  Yes: 2.46% | 4.46%
  No: 97.54% | 95.54%
Breech
  Yes: 5.36% | 20.60%
  No: 94.64% | 79.40%
Forceps delivery
  Yes: 0.66% | 0.38%
  No: 99.34% | 99.62%
Vacuum delivery
  Yes: 3.02% | 1.08%
  No: 96.98% | 98.92%
Delivery method
  Vaginal: 67.54% | 60.98%
  C-Section: 32.46% | 39.02%
Table S3. Values are given as non-events (N=28,410,519) | events (N=176,400).

Attendant
  Doctor of Medicine (MD): 85.40% | 90.00%
  Doctor of Osteopathy (DO): 5.56% | 4.96%
  Certified Nurse Midwife (CNM): 7.75% | 3.12%
  Other Midwife: 0.64% | 0.27%
  Other: 0.65% | 1.65%
Sex
  Female: 48.86% | 44.17%
  Male: 51.14% | 55.83%
Birth Weight
  227-1499 grams: 1.15% | 53.28%
  1500-2499 grams: 6.62% | 14.70%
  2500-8165 grams: 92.23% | 32.02%
Anencephalus
  Yes: 0.01% | 1.01%
  No: 99.99% | 98.99%
Spina Bifida
  Yes: 0.01% | 0.25%
  No: 99.99% | 99.75%
Omphalocele
  Yes: 0.03% | 0.68%
  No: 99.97% | 99.32%
Cleft Lip
  Yes: 0.07% | 1.08%
  No: 99.93% | 98.92%
Downs Syndrome
  Yes: 0.05% | 0.55%
  No: 99.95% | 99.45%

Tables S4–S6. Estimated coefficients, standard deviations, and adjusted p-values. Each cell lists MLE / A / L / uniform.

Covariate | Estimate | Standard Deviation | Adjusted P-value
Intercept | 18.5035 / 18.4739 / 18.4791 / 19.0026 | 0.2762 / 0.2782 / 0.2892 / 0.6643 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Mother age | -0.0938 / -0.0947 / -0.0919 / -0.0944 | 0.0036 / 0.0038 / 0.0038 / 0.0069 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Live birth order | 0.1109 / 0.1120 / 0.1111 / 0.1080 | 0.0023 / 0.0024 / 0.0024 / 0.0046 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Number of prenatal visits | -0.0134 / -0.0132 / -0.0134 / -0.0151 | 0.0007 / 0.0007 / 0.0007 / 0.0013 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Weight gain | -0.0029 / -0.0029 / -0.0030 / -0.0030 | 0.0003 / 0.0003 / 0.0003 / 0.0006 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Five minute APGAR score | -0.5183 / -0.5182 / -0.5180 / -0.5140 | 0.0019 / 0.0019 / 0.0019 / 0.0044 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Plurality | -0.0872 / -0.0846 / -0.0811 / -0.0553 | 0.0122 / 0.0128 / 0.0127 / 0.0250 | 0.0000 / 0.0000 / 0.0000 / 0.0691
Gestation weeks | -0.1307 / -0.1308 / -0.1300 / -0.1282 | 0.0012 / 0.0013 / 0.0013 / 0.0025 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Year | 0.0126 / 0.0260 / 0.0215 / -0.0901 | 0.0657 / 0.0662 / 0.0690 / 0.1523 | 0.8913 / 0.7523 / 0.8268 / 0.7115
Squared mother age | 0.0012 / 0.0012 / 0.0012 / 0.0013 | 0.0001 / 0.0001 / 0.0001 / 0.0001 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Birth place = not in hospital | 0.6151 / 0.6252 / 0.6098 / 0.6420 | 0.0317 / 0.0325 / 0.0331 / 0.0643 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Diabetes = no | -0.0148 / 0.0001 / -0.0172 / -0.0110 | 0.0285 / 0.0293 / 0.0294 / 0.0511 | 0.6896 / 0.9975 / 0.6611 / 0.9378
Chronic hypertension = no | 0.1177 / 0.1152 / 0.0952 / 0.0178 | 0.0402 / 0.0409 / 0.0415 / 0.0768 | 0.0072 / 0.0101 / 0.0418 / 0.9339
Prepregnancy hypertension = no | 0.3343 / 0.3450 / 0.3208 / 0.3400 | 0.0271 / 0.0280 / 0.0282 / 0.0474 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Eclampsia = no | 0.4996 / 0.4983 / 0.5299 / 0.2907 | 0.0770 / 0.0774 / 0.0796 / 0.1371 | 0.0000 / 0.0000 / 0.0000 / 0.0823
Induction of labor = no | 0.0054 / 0.0029 / 0.0020 / -0.0083 | 0.0176 / 0.0184 / 0.0183 / 0.0246 | 0.8081 / 0.9082 / 0.9134 / 0.8705
Tocolysis = no | 0.0992 / 0.1073 / 0.1005 / -0.0120 | 0.0313 / 0.0321 / 0.0327 / 0.0640 | 0.0034 / 0.0019 / 0.0047 / 0.9418
Meconium = no | -0.1216 / -0.1277 / -0.1117 / -0.0889 | 0.0262 / 0.0272 / 0.0275 / 0.0475 | 0.0000 / 0.0000 / 0.0001 / 0.1242
Precipitous labor = no | 0.0361 / 0.0351 / 0.0248 / -0.0213 | 0.0286 / 0.0294 / 0.0298 / 0.0622 | 0.2721 / 0.3129 / 0.5080 / 0.8705
Breech = no | -0.1003 / -0.0980 / -0.1055 / -0.1011 | 0.0147 / 0.0155 / 0.0153 / 0.0330 | 0.0000 / 0.0000 / 0.0000 / 0.0066
Forceps delivery = no | 0.1253 / 0.1161 / 0.1195 / 0.1464 | 0.0898 / 0.0902 / 0.0937 / 0.1282 | 0.2293 / 0.2778 / 0.2840 / 0.4185
Vacuum delivery = no | 0.2982 / 0.3145 / 0.3024 / 0.2598 | 0.0508 / 0.0514 / 0.0529 / 0.0568 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Delivery method = C-Section | -0.0401 / -0.0432 / -0.0377 / -0.0368 | 0.0098 / 0.0103 / 0.0102 / 0.0186 | 0.0001 / 0.0001 / 0.0005 / 0.1008
Sex = male | -0.7590 / -0.7601 / -0.7388 / -0.9213 | 0.2665 / 0.2686 / 0.2796 / 0.6448 | 0.0090 / 0.0099 / 0.0168 / 0.2792
Anencephaly = no | -4.1255 / -4.1271 / -4.1709 / -4.0831 | 0.1118 / 0.1120 / 0.1172 / 0.3471 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Spina Bifida = no | -2.1350 / -2.1323 / -2.1315 / -2.0956 | 0.1488 / 0.1487 / 0.1559 / 0.3737 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Omphalocele = no | -1.7259 / -1.7286 / -1.7383 / -1.8822 | 0.0829 / 0.0831 / 0.0876 / 0.2115 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Cleft lip = no | -2.8745 / -2.8681 / -2.8451 / -2.9524 | 0.0656 / 0.0660 / 0.0685 / 0.1456 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Downs syndrome = no | -2.3438 / -2.3402 / -2.3462 / -2.4419 | 0.0863 / 0.0868 / 0.0886 / 0.1639 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Birth month vs. January
Birth month = February | 0.0103 / 0.0119 / 0.0123 / 0.0592 | 0.0145 / 0.0153 / 0.0152 / 0.0268 | 0.5611 / 0.5133 / 0.5178 / 0.0691
Birth month = March | -0.0305 / -0.0312 / -0.0278 / 0.0051 | 0.0143 / 0.0151 / 0.0149 / 0.0265 | 0.0571 / 0.0703 / 0.1064 / 0.9418
Birth month = April | -0.0294 / -0.0275 / -0.0268 / -0.0121 | 0.0144 / 0.0152 / 0.0150 / 0.0269 | 0.0711 / 0.1193 / 0.1184 / 0.8082
Birth month = May | -0.0434 / -0.0441 / -0.0402 / -0.0014 | 0.0143 / 0.0150 / 0.0149 / 0.0272 | 0.0051 / 0.0073 / 0.0145 / 0.9683
Birth month = June | -0.0344 / -0.0296 / -0.0294 / 0.0006 | 0.0143 / 0.0151 / 0.0149 / 0.0268 | 0.0299 / 0.0883 / 0.0870 / 0.9829
Birth month = July | -0.0301 / -0.0259 / -0.0271 / 0.0163 | 0.0141 / 0.0149 / 0.0147 / 0.0265 | 0.0571 / 0.1339 / 0.1064 / 0.7080
Birth month = August | -0.0213 / -0.0185 / -0.0241 / 0.0174 | 0.0141 / 0.0148 / 0.0147 / 0.0264 | 0.2002 / 0.2945 / 0.1514 / 0.6974
Birth month = September | -0.0180 / -0.0165 / -0.0150 / 0.0418 | 0.0142 / 0.0150 / 0.0148 / 0.0263 | 0.2721 / 0.3511 / 0.4116 / 0.2134
Birth month = October | -0.0287 / -0.0275 / -0.0273 / 0.0314 | 0.0142 / 0.0150 / 0.0148 / 0.0262 | 0.0731 / 0.1136 / 0.1064 / 0.3872
Birth month = November | -0.0311 / -0.0414 / -0.0302 / 0.0029 | 0.0144 / 0.0152 / 0.0150 / 0.0271 | 0.0561 / 0.0129 / 0.0808 / 0.9524
Birth month = December | -0.0409 / -0.0377 / -0.0422 / -0.0248 | 0.0143 / 0.0151 / 0.0149 / 0.0274 | 0.0090 / 0.0240 / 0.0100 / 0.5444
Birth weekday vs. Sunday
Birth weekday = Monday | 0.0907 / 0.0910 / 0.0978 / 0.1239 | 0.0118 / 0.0125 / 0.0123 / 0.0228 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Birth weekday = Tuesday | 0.0981 / 0.1037 / 0.1045 / 0.1035 | 0.0116 / 0.0123 / 0.0122 / 0.0227 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Birth weekday = Wednesday | 0.0941 / 0.0971 / 0.0999 / 0.0781 | 0.0117 / 0.0123 / 0.0122 / 0.0226 | 0.0000 / 0.0000 / 0.0000 / 0.0018
Birth weekday = Thursday | 0.0902 / 0.0903 / 0.0974 / 0.1145 | 0.0117 / 0.0123 / 0.0122 / 0.0224 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Birth weekday = Friday | 0.0755 / 0.0726 / 0.0783 / 0.0711 | 0.0117 / 0.0124 / 0.0122 / 0.0229 | 0.0000 / 0.0000 / 0.0000 / 0.0060
Birth weekday = Saturday | 0.0208 / 0.0199 / 0.0205 / 0.0302 | 0.0124 / 0.0131 / 0.0130 / 0.0245 | 0.1504 / 0.2010 / 0.1687 / 0.3731
Residence status vs. 1
Residence status = 2 | 0.1156 / 0.1123 / 0.1118 / 0.1159 | 0.0066 / 0.0069 / 0.0068 / 0.0124 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Residence status = 3 | 0.2355 / 0.2306 / 0.2443 / 0.2717 | 0.0162 / 0.0169 / 0.0168 / 0.0338 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Residence status = 4 | -0.4333 / -0.4285 / -0.4366 / -0.1390 | 0.0873 / 0.0876 / 0.0897 / 0.0931 | 0.0000 / 0.0000 / 0.0000 / 0.2513
Mother's race vs. white
Mother's race = black | -0.0156 / -0.0196 / -0.0136 / -0.0678 | 0.0165 / 0.0174 / 0.0174 / 0.0338 | 0.4331 / 0.3441 / 0.5295 / 0.0980
Mother's race = american indian | 0.2387 / 0.2444 / 0.2206 / 0.0930 | 0.0460 / 0.0467 / 0.0482 / 0.0990 | 0.0000 / 0.0000 / 0.0000 / 0.5240
Mother's race = asian | 0.0121 / 0.0207 / 0.0099 / 0.0364 | 0.0362 / 0.0368 / 0.0375 / 0.0575 | 0.7989 / 0.6485 / 0.8578 / 0.7075
Paternity acknowledged = no | 0.0991 / 0.1022 / 0.0988 / 0.0945 | 0.0074 / 0.0078 / 0.0076 / 0.0136 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Father's race vs. white
Father's race = black | 0.1357 / 0.1409 / 0.1389 / 0.2239 | 0.0199 / 0.0209 / 0.0209 / 0.0387 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Father's race = american indian | 0.2200 / 0.2230 / 0.2323 / 0.2298 | 0.0562 / 0.0569 / 0.0591 / 0.1122 | 0.0002 / 0.0002 / 0.0002 / 0.0895
Father's race = asian | -0.0198 / -0.0329 / -0.0227 / -0.0100 | 0.0413 / 0.0421 / 0.0426 / 0.0661 | 0.7066 / 0.5132 / 0.6948 / 0.9481
Father's race = unknown | 0.1746 / 0.1724 / 0.1732 / 0.2093 | 0.0141 / 0.0148 / 0.0147 / 0.0265 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Attendant vs. MD
Attendant = DO | -0.0444 / -0.0513 / -0.0462 / -0.0537 | 0.0134 / 0.0140 / 0.0139 / 0.0239 | 0.0020 / 0.0006 / 0.0021 / 0.0656
Attendant = CNM | -0.2782 / -0.2861 / -0.2786 / -0.3119 | 0.0157 / 0.0165 / 0.0164 / 0.0241 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Attendant = other midwife | -0.2361 / -0.2453 / -0.2209 / -0.3163 | 0.0542 / 0.0549 / 0.0563 / 0.0901 | 0.0000 / 0.0000 / 0.0002 / 0.0015
Attendant = other | 0.1822 / 0.1765 / 0.1797 / 0.2198 | 0.0311 / 0.0318 / 0.0325 / 0.0617 | 0.0000 / 0.0000 / 0.0000 / 0.0013
Birth weight recode vs. 1
Birth weight recode = 2 | -0.7391 / -0.7430 / -0.7353 / -0.7328 | 0.0116 / 0.0122 / 0.0121 / 0.0220 | 0.0000 / 0.0000 / 0.0000 / 0.0000
Birth weight recode = 3-1.6916-1.6899-1.6930-1.68080.01410.01490.01470.02650.00000.00000.00000.0000
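A pattern worth noting in the table above is that the uniform-subsampling standard deviations are consistently larger than those of the A- and L-optimal designs. A minimal sketch of how that gap translates into relative efficiency, using the Sex = male row (SD values copied from the table; the squared SD ratio as an efficiency measure is standard, but this particular computation is our illustration, not a quantity reported in the tables):

```python
# Relative efficiency of A-optimal vs. uniform subsampling, illustrated
# with the "Sex = male" row of the table above (SDs copied from that row).
sd_uniform = 0.6448  # SD of the estimate under uniform subsampling
sd_a_opt = 0.2686    # SD of the estimate under A-optimal subsampling

# Efficiency is compared on the variance scale, so the SD ratio is squared:
# uniform subsampling needs roughly this many times more subsampled
# observations to match the optimal design's variance for this coefficient.
rel_eff = (sd_uniform / sd_a_opt) ** 2
print(round(rel_eff, 2))  # ~5.76
```

On the variance scale the optimal design is here worth roughly a five- to six-fold larger uniform subsample, which is exactly the kind of trade-off the subsample-size tools in the paper are meant to quantify.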

| Covariate | Est. MLE | Est. A-opt | Est. L-opt | Est. Unif | SD MLE | SD A-opt | SD L-opt | SD Unif | Adj. P MLE | Adj. P A-opt | Adj. P L-opt | Adj. P Unif |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Weight gain | -0.0012 | -0.0012 | -0.0010 | -0.0013 | 0.0004 | 0.0004 | 0.0004 | 0.0008 | 0.0127 | 0.0155 | 0.0418 | 0.1586 |
| Apgar | 0.0315 | 0.0312 | 0.0302 | 0.0307 | 0.0025 | 0.0026 | 0.0026 | 0.0059 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Plurality | 0.0180 | 0.0165 | 0.0075 | -0.0103 | 0.0163 | 0.0172 | 0.0169 | 0.0340 | 0.3447 | 0.4207 | 0.7435 | 0.8911 |
| Gestation week | -0.0016 | -0.0014 | -0.0022 | -0.0055 | 0.0012 | 0.0013 | 0.0013 | 0.0025 | 0.2518 | 0.3507 | 0.1257 | 0.0701 |
| Diabetes = no | 0.0388 | 0.0253 | 0.0418 | 0.0964 | 0.0266 | 0.0277 | 0.0275 | 0.0470 | 0.2147 | 0.4457 | 0.1877 | 0.0895 |
| Chronic hypertension = no | -0.0586 | -0.0510 | -0.0428 | 0.0197 | 0.0376 | 0.0386 | 0.0388 | 0.0761 | 0.1846 | 0.2645 | 0.3600 | 0.9196 |
| Prepregnancy hypertension = no | -0.0692 | -0.0634 | -0.0620 | -0.0421 | 0.0260 | 0.0272 | 0.0271 | 0.0478 | 0.0152 | 0.0373 | 0.0418 | 0.5539 |
| Eclampsia = no | -0.0270 | -0.0385 | -0.0374 | -0.0806 | 0.0740 | 0.0745 | 0.0771 | 0.1286 | 0.7827 | 0.6763 | 0.7254 | 0.7075 |
| Induction of labor = no | 0.0409 | 0.0411 | 0.0469 | 0.0427 | 0.0173 | 0.0182 | 0.0180 | 0.0246 | 0.0328 | 0.0440 | 0.0183 | 0.1586 |
| Tocolysis = no | -0.0056 | -0.0120 | -0.0076 | 0.0724 | 0.0312 | 0.0323 | 0.0326 | 0.0676 | 0.8921 | 0.7620 | 0.8655 | 0.4548 |
| Forceps delivery = no | 0.0692 | 0.0866 | 0.1151 | 0.0757 | 0.0875 | 0.0879 | 0.0917 | 0.1252 | 0.5132 | 0.4116 | 0.2903 | 0.7088 |
| Vacuum delivery = no | -0.0215 | -0.0308 | -0.0060 | -0.0080 | 0.0495 | 0.0502 | 0.0518 | 0.0557 | 0.7347 | 0.6169 | 0.9134 | 0.9481 |
| Delivery method = C-Section | -0.0217 | -0.0212 | -0.0242 | -0.0183 | 0.0126 | 0.0133 | 0.0131 | 0.0241 | 0.1380 | 0.1824 | 0.1064 | 0.6312 |
| Anencephaly = no | 0.1447 | 0.1466 | 0.1047 | 0.7030 | 0.1095 | 0.1096 | 0.1141 | 0.3429 | 0.2518 | 0.2616 | 0.4612 | 0.0895 |
| Spina Bifida = no | 0.2920 | 0.2926 | 0.2826 | -0.1952 | 0.1514 | 0.1517 | 0.1594 | 0.3568 | 0.0889 | 0.0946 | 0.1202 | 0.7321 |
| Omphalocele = no | -0.0071 | -0.0054 | -0.0096 | -0.1927 | 0.0799 | 0.0804 | 0.0846 | 0.2031 | 0.9473 | 0.9647 | 0.9134 | 0.5240 |
| Cleft lip = no | 0.3197 | 0.3159 | 0.2768 | 0.4703 | 0.0644 | 0.0649 | 0.0672 | 0.1370 | 0.0000 | 0.0000 | 0.0001 | 0.0019 |
| Downs syndrome = no | 0.0677 | 0.0748 | 0.1029 | 0.1770 | 0.0852 | 0.0857 | 0.0880 | 0.1879 | 0.5132 | 0.4630 | 0.3274 | 0.5240 |

| Covariate | Coef. MLE | Coef. A-opt | Coef. L-opt | Coef. Unif | SD MLE | SD A-opt | SD L-opt | SD Unif | P MLE | P A-opt | P L-opt | P Unif |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Father’s race = black | -0.0079 | -0.0084 | -0.0078 | -0.0265 | 0.0057 | 0.0060 | 0.0059 | 0.0112 | 0.2293 | 0.2400 | 0.2751 | 0.0498 |
| Father’s race = american indian | -0.0267 | -0.0250 | -0.0322 | -0.0249 | 0.0161 | 0.0163 | 0.0168 | 0.0305 | 0.1522 | 0.1969 | 0.0986 | 0.5991 |
| Father’s race = asian | -0.0128 | -0.0105 | -0.0114 | -0.0257 | 0.0116 | 0.0119 | 0.0120 | 0.0187 | 0.3447 | 0.4625 | 0.4449 | 0.2938 |
| Father’s race = unknown | -0.0029 | -0.0028 | -0.0019 | -0.0129 | 0.0039 | 0.0041 | 0.0041 | 0.0073 | 0.5362 | 0.5733 | 0.7345 | 0.1541 |
| Mother’s race = black | -0.0174 | -0.0169 | -0.0169 | -0.0012 | 0.0047 | 0.0050 | 0.0050 | 0.0098 | 0.0006 | 0.0016 | 0.0016 | 0.9489 |
| Mother’s race = american indian | -0.0188 | -0.0196 | -0.0103 | 0.0091 | 0.0132 | 0.0134 | 0.0138 | 0.0270 | 0.2228 | 0.2193 | 0.5481 | 0.8705 |
| Mother’s race = asian | -0.0049 | -0.0032 | -0.0034 | -0.0028 | 0.0102 | 0.0104 | 0.0105 | 0.0163 | 0.7066 | 0.8068 | 0.8237 | 0.9448 |
| Diabetes = no | -0.0058 | -0.0082 | -0.0045 | -0.0162 | 0.0066 | 0.0068 | 0.0068 | 0.0118 | 0.4589 | 0.3129 | 0.6081 | 0.2938 |
| Chronic hypertension = no | -0.0028 | -0.0047 | 0.0019 | 0.0066 | 0.0094 | 0.0096 | 0.0096 | 0.0185 | 0.8081 | 0.6930 | 0.8772 | 0.8705 |
| Prepregnancy hypertension = no | 0.0162 | 0.0116 | 0.0198 | 0.0127 | 0.0064 | 0.0066 | 0.0066 | 0.0115 | 0.0208 | 0.1339 | 0.0062 | 0.4381 |
| Eclampsia = no | -0.0232 | -0.0204 | -0.0319 | 0.0234 | 0.0186 | 0.0187 | 0.0192 | 0.0330 | 0.2758 | 0.3535 | 0.1484 | 0.6627 |
| Induction of labor = no | -0.0157 | -0.0150 | -0.0147 | -0.0125 | 0.0042 | 0.0044 | 0.0043 | 0.0060 | 0.0004 | 0.0015 | 0.0016 | 0.0861 |
| Tocolysis = no | 0.0277 | 0.0239 | 0.0259 | 0.0440 | 0.0077 | 0.0079 | 0.0080 | 0.0160 | 0.0007 | 0.0056 | 0.0028 | 0.0182 |
| Meconium = no | -0.0008 | -0.0004 | -0.0033 | -0.0074 | 0.0072 | 0.0075 | 0.0076 | 0.0129 | 0.9445 | 0.9658 | 0.7450 | 0.7177 |
| Precipitous labor = no | -0.0106 | -0.0109 | -0.0071 | -0.0009 | 0.0077 | 0.0080 | 0.0081 | 0.0165 | 0.2367 | 0.2543 | 0.4840 | 0.9683 |
| Breech = no | -0.0002 | -0.0011 | 0.0010 | -0.0012 | 0.0040 | 0.0043 | 0.0042 | 0.0092 | 0.9685 | 0.8443 | 0.8655 | 0.9481 |
| Forceps delivery = no | -0.0199 | -0.0184 | -0.0264 | -0.0241 | 0.0215 | 0.0216 | 0.0225 | 0.0304 | 0.4390 | 0.4727 | 0.3274 | 0.6090 |
| Vacuum delivery = no | -0.0176 | -0.0187 | -0.0234 | -0.0137 | 0.0120 | 0.0121 | 0.0125 | 0.0135 | 0.2109 | 0.1968 | 0.1064 | 0.4905 |
| Anencephaly = no | -0.1061 | -0.1052 | -0.0938 | -0.2437 | 0.0271 | 0.0272 | 0.0284 | 0.0899 | 0.0002 | 0.0003 | 0.0022 | 0.0194 |
| Spina Bifida = no | 0.0532 | 0.0520 | 0.0503 | 0.1032 | 0.0373 | 0.0374 | 0.0393 | 0.0749 | 0.2228 | 0.2449 | 0.2840 | 0.2938 |
| Omphalocele = no | -0.0008 | -0.0014 | 0.0043 | 0.1153 | 0.0200 | 0.0200 | 0.0212 | 0.0496 | 0.9685 | 0.9647 | 0.8772 | 0.0549 |
| Cleft lip = no | 0.0085 | 0.0071 | 0.0022 | -0.0026 | 0.0159 | 0.0160 | 0.0166 | 0.0349 | 0.6826 | 0.7202 | 0.9134 | 0.9675 |
| Downs syndrome = no | 0.0559 | 0.0533 | 0.0492 | 0.0827 | 0.0211 | 0.0212 | 0.0218 | 0.0428 | 0.0152 | 0.0235 | 0.0443 | 0.1105 |
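The p-value columns in these tables are multiplicity-adjusted, so they do not equal the raw Wald p-values implied by each estimate/SD pair. As an illustration of the generic recipe (a sketch of standard practice, not the paper's own code; function names are ours), a two-sided Wald p-value can be formed from each coefficient and its standard deviation, and the collection of p-values can then be corrected with the Benjamini-Hochberg step-up procedure:

```python
from math import erf, sqrt

def wald_p(est, sd):
    """Two-sided Wald p-value for H0: coefficient = 0."""
    z = abs(est / sd)
    # Standard normal upper-tail probability via erf.
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0))))

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adj[i] = running_min
    return adj

# Example: the "Weight gain" MLE entry above has est = -0.0012, sd = 0.0004,
# so the raw Wald p-value is about 0.0027; the table reports 0.0127,
# consistent with an adjustment applied jointly across all coefficients.
raw = wald_p(-0.0012, 0.0004)
```

Because the adjustment is a joint operation over the whole set of tests, reproducing any single adjusted entry requires the full vector of raw p-values, not just one row.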
