TAL AGASSI, NIR KERET, MALKA GORFINE∗ Department of Statistics and Operations Research Tel Aviv University, Tel Aviv 69978, Israel. gorfinem@tauex.tau.ac.il
Abstract

In the realm of contemporary data analysis, the use of massive datasets has taken on heightened significance, albeit often entailing considerable demands on computational time and memory. While a multitude of existing works offer optimal subsampling methods for conducting analyses on subsamples with minimized efficiency loss, they notably lack tools for judiciously selecting the optimal subsample size. To bridge this gap, our work introduces tools designed for choosing the optimal subsample size. We focus on three settings: the Cox regression model for survival data with rare events, and logistic regression for both balanced and imbalanced datasets. Additionally, we present a novel optimal subsampling procedure tailored for logistic regression with imbalanced data. The efficacy of these tools and procedures is demonstrated through an extensive simulation study and meticulous analyses of two sizable datasets.

Keywords: Hypothesis testing; Imbalanced data; Time-to-event analysis; Relative efficiency.
∗To whom correspondence should be addressed.

1 Introduction

The escalating demand to analyze massive datasets with millions of observations often leads to considerable computational time and memory requirements, presenting significant challenges in implementing statistical analyses. In response, subsampling has become a widely adopted and effective method for expediting computation across various regression models. These models encompass least-squares regression (Dhillon et al., 2013; Ma et al., 2015), logistic regression (Wang et al., 2018, 2021), generalized linear models (Ai et al., 2021), quantile regression (Wang and Ma, 2021), quasi-likelihood estimators (Yu et al., 2020), time-to-event regression under the additive-hazards model (Zuo et al., 2021), semi-competing risks (Gorfine et al., 2021), the Cox proportional-hazards (PH) model (Keret and Gorfine, 2023), and the accelerated failure time model (Yang et al., 2024).
In this work, we mainly concentrate on two prominent scenarios associated with rare events: (1) addressing the challenge of highly imbalanced data in logistic regression, where one of the classes is rare, and (2) employing Cox proportional-hazards (PH) regression (Cox, 1972) for survival data characterized by a notably high right-censoring rate.
Wang et al. (2018) introduced an innovative subsampling method optimized for logistic regression, demonstrating high effectiveness for balanced data but acknowledging its limited efficacy for highly imbalanced data. In the context of rare-event data, a natural approach involves subsampling exclusively within the majority group (the common class or censored observations) to prevent the loss of crucial information. Addressing this concern, Wang et al. (2021) focused on logistic regression and proposed an optimal subsampling procedure targeting the rare-event setting, ensuring retention of all events in binary outcome scenarios. However, the underlying assumption is that the proportion of rare events decreases as the sample size increases, a condition that is often considered undesirable.
For survival analysis involving rare events, Gorfine et al. (2021) advocated subsampling solely the observations that have not yet experienced the event (i.e., the censored observations) and implemented a uniform subsampling approach. In a similar vein, Keret and Gorfine (2023) presented an optimal subsampling strategy for the Cox PH model within the rare-events framework. In this approach, optimal subsampling exclusively targets censored observations, combining all observed events with the subsample set of censored observations. These optimal subsampling techniques have been convincingly demonstrated to significantly reduce computational burden compared to analyzing the entire dataset, with minimal loss of efficiency.
However, a notable aspect left unaddressed in the aforementioned works is the lack of practical guidelines for determining the subsample size. While our primary goal is to reduce computation time, we are equally committed to maintaining the statistical power or efficiency for answering the research questions and avoiding a substantial increase in standard errors. Hence, it is valuable to offer researchers a tool for choosing the subsample size that aligns with their research objectives.
This work offers notable contributions in two key aspects:
1. We introduce tools designed to optimize the process of selecting subsample sizes in the realm of optimal subsampling. These tools are versatile; here they are applied to Cox regression models with rare events and to logistic regression models, with or without rare events.
2. We present optimal subsampling methods specifically tailored for logistic regression models handling rare events. Notably, our approach assumes that the proportion of rare events converges to a positive constant with increasing sample size, a substantial departure from the assumption made by Wang et al. (2021).
This paper is structured as follows: Section 2 begins by summarizing the key findings on optimal subsampling from Keret and Gorfine (2023) to ensure the current paper is self-contained, and then introduces new methodologies for determining the optimal subsample size. Section 3 proposes a two-step subsampling algorithm specifically designed for logistic regression in scenarios involving rare events, including techniques for selecting the subsample size. Section 4 considers the two-step algorithm of Wang et al. (2018) for nearly balanced datasets and offers strategies for identifying the optimal subsample size. Section 5 summarizes a comprehensive simulation study evaluating the effectiveness of the proposed approaches. Sections 6 and 7 analyze two large-scale datasets: a survival regression with around 350 million records and a logistic regression with approximately 28 million observations. The paper concludes with a short discussion in Section 8.
2 Optimal Subsample Size for Cox Regression with Optimal Subsampling

2.1 Notation, Formulation and Reservoir-Sampling (Keret and Gorfine, 2023)

For the sake of clarity, this section presents the model formulation and pertinent findings from Keret and Gorfine (2023). Consider a set of $n$ independent and identically distributed observations. Let $V_i$ represent the failure time of the $i$th observation, $C_i$ denote the right-censoring time, and $T_i$ signify the observed time, $T_i=\min(V_i,C_i)$. Define $\Delta_i=I(V_i\leq C_i)$, and let $\mathbf{X}_i$ be a vector of potentially time-dependent covariates of size $r$. The observed dataset is denoted by $\mathcal{D}_n=\{T_i,\Delta_i,\mathbf{X}_i\,;\,i=1,\dots,n\}$. Among the $n$ observations, there are $n_e$ instances where the failure times are observed, termed "events". It is assumed that as $n\rightarrow\infty$, the ratio $n_e/n$ converges to a small positive constant. The count of censored observations is represented by $n_c=n-n_e$, and $\tau$ denotes the maximum follow-up time.
In time-to-event data, the predominant source of information comes from events rather than censored observations. This rationale underlies the two-step algorithm introduced by Keret and Gorfine (2023), which utilizes all observed events while sampling a subset of the censored observations. Let $q_n$ be the number of censored observations sampled from the full data, where $q_n$ is typically much smaller than $n$, and it is assumed that $q_n/n$ converges to a small positive constant as $q_n,n\rightarrow\infty$. Define $\mathcal{C}$ as the index set containing all censored observations in the full data, and $\mathcal{Q}$ as the index set encompassing all observations with observed failure times together with all censored observations included in the subsample. Due to computational and theoretical considerations, censored observations are sampled with replacement, implying that a censored observation in the original sample may appear more than once in the subsample.
Let $\boldsymbol{\beta}$ be a vector of size $r$ of unknown coefficients. Then, under the Cox PH regression model, the instantaneous hazard rate of observation $i$ at time $t$ is given by
\[
\lambda(t|\mathbf{X}_i)=\lambda_0(t)e^{\boldsymbol{\beta}^T\mathbf{X}_i}\,,\qquad i=1,\dots,n\,,
\]
where $\lambda_0(\cdot)$ is an unspecified non-negative function and $\Lambda_0(t)=\int_0^t\lambda_0(u)du$ is the cumulative baseline hazard function. The goal is estimating the unknown parameters $\boldsymbol{\beta}$ and $\Lambda_0$. Define $\mathbf{S}^{(k)}(\boldsymbol{\beta},t)=\sum_{i=1}^n e^{\boldsymbol{\beta}^T\mathbf{X}_i}Y_i(t)\mathbf{X}_i^{\otimes k}$, $k=0,1,2$, where $\mathbf{X}^{\otimes 0}=1$, $\mathbf{X}^{\otimes 1}=\mathbf{X}$, $\mathbf{X}^{\otimes 2}=\mathbf{X}\mathbf{X}^T$, and $Y_i(t)=I(T_i\geq t)$ is the at-risk process of observation $i$ at time $t$. Denote by $\widehat{\boldsymbol{\beta}}_{PL}$ the full-sample partial-likelihood (PL) estimator of $\boldsymbol{\beta}$ that solves
\[
\frac{\partial l(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}^T}=\sum_{i=1}^n\Delta_i\bigg\{\mathbf{X}_i-\frac{\mathbf{S}^{(1)}(\boldsymbol{\beta},T_i)}{S^{(0)}(\boldsymbol{\beta},T_i)}\bigg\}=\mathbf{0}\,.
\]
Suppose that the data are organized such that the censored observations precede the observed failure times, namely $\mathcal{C}=\{1,\dots,n_c\}$ and $\mathcal{E}=\{n_c+1,\dots,n\}$. Let $\mathbf{p}=(p_1,\dots,p_{n_c})^T$ be a vector of sampling probabilities for the censored observations, where $\sum_{i=1}^{n_c}p_i=1$, and set
\[
w_i=\begin{cases}(p_i q_n)^{-1} & \text{if }\Delta_i=0,\ p_i>0\\ 1 & \text{if }\Delta_i=1\end{cases}\qquad i=1,\dots,n\,.
\]
The subsample-based counterpart of $\mathbf{S}^{(k)}(\boldsymbol{\beta},t)$ is $\mathbf{S}_w^{(k)}(\boldsymbol{\beta},t)=\sum_{i\in\mathcal{Q}}w_i e^{\boldsymbol{\beta}^T\mathbf{X}_i}Y_i(t)\mathbf{X}_i^{\otimes k}$, $k=0,1,2$. Then, $\widetilde{\boldsymbol{\beta}}$ is defined as the estimator derived from the subsample $\mathcal{Q}$ by solving
\[
\frac{\partial l^{*}(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}^T}\equiv\sum_{i\in\mathcal{Q}}\Delta_i\bigg\{\mathbf{X}_i-\frac{\mathbf{S}_w^{(1)}(\boldsymbol{\beta},T_i)}{S_w^{(0)}(\boldsymbol{\beta},T_i)}\bigg\}=\mathbf{0}\,, \tag{1}
\]
where $l^{*}$ is the log PL based on the subsample $\mathcal{Q}$. Finally, for a given vector $\boldsymbol{\beta}$, define $\widehat{\Lambda}_0(t,\boldsymbol{\beta})=\sum_{i=1}^n\Delta_i I(T_i\leq t)/S^{(0)}(\boldsymbol{\beta},T_i)$, so that the Breslow estimator (Breslow, 1972) of $\Lambda_0$ is given by $\widehat{\Lambda}_0(t,\widehat{\boldsymbol{\beta}}_{PL})$.
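For illustration only, the following is a minimal Python/NumPy sketch of how the weighted estimating equation (1) can be solved by Newton-Raphson. The function names (`weighted_cox_score_hessian`, `fit_weighted_cox`) are ours, and the sketch assumes no tied event times, fixed covariates, and that the subsample fits in memory; it is not the authors' implementation.

```python
import numpy as np

def weighted_cox_score_hessian(beta, T, delta, X, w):
    """Weighted PL score and observed information for a (sub)sample.

    T: observed times; delta: event indicators; X: (m, r) covariate matrix;
    w: weights (1 for events, (p_i * q_n)^{-1} for sampled censored rows).
    Assumes no tied event times.
    """
    order = np.argsort(-T)                      # largest observed time first
    T, delta, X, w = T[order], delta[order], X[order], w[order]
    wexp = w * np.exp(X @ beta)                 # w_i exp(beta' X_i)
    # cumulative sums over the descending times give S_w^(0), S_w^(1), S_w^(2)
    S0 = np.cumsum(wexp)
    S1 = np.cumsum(wexp[:, None] * X, axis=0)
    S2 = np.cumsum(wexp[:, None, None] * X[:, :, None] * X[:, None, :], axis=0)
    score = np.zeros(X.shape[1])
    info = np.zeros((X.shape[1], X.shape[1]))
    for i in np.flatnonzero(delta == 1):        # sum over observed events
        xbar = S1[i] / S0[i]
        score += X[i] - xbar
        info += S2[i] / S0[i] - np.outer(xbar, xbar)
    return score, info

def fit_weighted_cox(T, delta, X, w, n_iter=25, tol=1e-8):
    """Newton-Raphson solver for the weighted score equation (Eq. 1)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        score, info = weighted_cox_score_hessian(beta, T, delta, X, w)
        step = np.linalg.solve(info, score)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```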
Consistency and asymptotic normality of $\widetilde{\boldsymbol{\beta}}$ and $\widehat{\Lambda}_0$ were established by Keret and Gorfine (2023) under some regularity assumptions. Specifically, given the true $\boldsymbol{\beta}^o$,
\[
\sqrt{n}\,\mathbb{V}(\mathbf{p},\boldsymbol{\beta}^o)^{-1/2}(\widetilde{\boldsymbol{\beta}}-\boldsymbol{\beta}^o)\xrightarrow{D}N(0,\mathbf{I})
\]
as $n,q_n\rightarrow\infty$, where $\mathbf{I}$ is the identity matrix and
\[
\mathbb{V}(\mathbf{p},\boldsymbol{\beta})=\boldsymbol{\mathcal{I}}^{-1}(\boldsymbol{\beta})+\frac{n}{q_n}\boldsymbol{\mathcal{I}}^{-1}(\boldsymbol{\beta})\boldsymbol{\varphi}(\mathbf{p},\boldsymbol{\beta})\boldsymbol{\mathcal{I}}^{-1}(\boldsymbol{\beta})\,,
\]
\[
\boldsymbol{\varphi}(\mathbf{p},\boldsymbol{\beta})=\frac{1}{n^2}\Bigg\{\sum_{i\in\mathcal{C}}\frac{\mathbf{a}_i(\boldsymbol{\beta})\mathbf{a}_i(\boldsymbol{\beta})^T}{p_i}-\sum_{i,j\in\mathcal{C}}\mathbf{a}_i(\boldsymbol{\beta})\mathbf{a}_j(\boldsymbol{\beta})^T\Bigg\}\,,
\]
\[
\mathbf{a}_i(\boldsymbol{\beta})=\int_0^\tau\bigg\{\mathbf{X}_i-\frac{\mathbf{S}^{(1)}(\boldsymbol{\beta},t)}{S^{(0)}(\boldsymbol{\beta},t)}\bigg\}\frac{Y_i(t)e^{\boldsymbol{\beta}^T\mathbf{X}_i}}{S^{(0)}(\boldsymbol{\beta},t)}\,dN_\cdot(t)\,,
\]
where $N_\cdot(t)=\sum_{i=1}^n\Delta_i I(T_i\leq t)$ and
\[
\boldsymbol{\mathcal{I}}(\boldsymbol{\beta})=\frac{1}{n}\frac{\partial^2 l(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}^T\partial\boldsymbol{\beta}}=-\frac{1}{n}\int_0^\tau\Bigg\{\frac{\mathbf{S}^{(2)}(\boldsymbol{\beta},t)}{S^{(0)}(\boldsymbol{\beta},t)}-\Big(\frac{\mathbf{S}^{(1)}(\boldsymbol{\beta},t)}{S^{(0)}(\boldsymbol{\beta},t)}\Big)\Big(\frac{\mathbf{S}^{(1)}(\boldsymbol{\beta},t)}{S^{(0)}(\boldsymbol{\beta},t)}\Big)^T\Bigg\}\,dN_\cdot(t)\,.
\]
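As a small illustration, the following NumPy sketch evaluates $\boldsymbol{\varphi}(\mathbf{p},\boldsymbol{\beta})$ and $\mathbb{V}(\mathbf{p},\boldsymbol{\beta})$, given the vectors $\mathbf{a}_i(\boldsymbol{\beta})$ of the censored observations stacked as rows of an array and a plug-in information matrix. The function names are ours, and the information matrix is used here in its positive-definite (negative-Hessian) form, which leaves the formulas unchanged.

```python
import numpy as np

def phi_matrix(a_cens, p, n):
    """phi(p, beta) = (1/n^2){ sum_i a_i a_i^T / p_i - (sum_i a_i)(sum_i a_i)^T },
    with sums over the censored set C.
    a_cens: (n_c, r) array whose rows are a_i(beta); p: sampling probabilities."""
    weighted = (a_cens / p[:, None]).T @ a_cens     # sum_i a_i a_i^T / p_i
    s = a_cens.sum(axis=0)                          # sum_i a_i
    return (weighted - np.outer(s, s)) / n**2

def asymptotic_variance(info, phi, n, qn):
    """V(p, beta) = I^{-1} + (n/qn) I^{-1} phi I^{-1}; Cov(beta_tilde) ~ V / n."""
    info_inv = np.linalg.inv(info)
    return info_inv + (n / qn) * info_inv @ phi @ info_inv
```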
As $\boldsymbol{\mathcal{I}}$ and $\boldsymbol{\varphi}$ involve the entire dataset, their subsampling-based counterparts, $\widetilde{\boldsymbol{\mathcal{I}}}$ and $\widetilde{\boldsymbol{\varphi}}$, will be utilized in the variance estimator $\widetilde{\mathbb{V}}(\mathbf{p},\widetilde{\boldsymbol{\beta}})$. The exact expressions of $\widetilde{\boldsymbol{\mathcal{I}}}$ and $\widetilde{\boldsymbol{\varphi}}$ can be found in the Supplementary Material (SM) file, Section S1.
The above results laid the foundation for establishing the subsequent optimal subsampling probabilities. The A-optimal sampling-probabilities vector, denoted $\mathbf{p}^A$ and derived by minimizing the trace of $\mathbb{V}(\mathbf{p},\boldsymbol{\beta}^o)$, is expressed as
\[
p_m^A=\frac{\|\boldsymbol{\mathcal{I}}^{-1}(\boldsymbol{\beta}^o)\mathbf{a}_m(\boldsymbol{\beta}^o)\|_2}{\sum_{i\in\mathcal{C}}\|\boldsymbol{\mathcal{I}}^{-1}(\boldsymbol{\beta}^o)\mathbf{a}_i(\boldsymbol{\beta}^o)\|_2}\quad\text{for all }m\in\mathcal{C}\,, \tag{2}
\]
where $\|\cdot\|_2$ is the Euclidean ($l_2$) norm. The L-optimal sampling-probabilities vector, denoted $\mathbf{p}^L$ and obtained by minimizing the trace of $\boldsymbol{\varphi}(\mathbf{p},\boldsymbol{\beta}^o)$, is given by
\[
p_m^L=\frac{\|\mathbf{a}_m(\boldsymbol{\beta}^o)\|_2}{\sum_{i\in\mathcal{C}}\|\mathbf{a}_i(\boldsymbol{\beta}^o)\|_2}\quad\text{for all }m\in\mathcal{C}\,. \tag{3}
\]
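To make Equations (2) and (3) concrete, the sketch below computes the vectors $\mathbf{a}_i(\boldsymbol{\beta})$, an information matrix, and the resulting A- or L-optimal probabilities on whatever dataset it is handed (full data, a pilot subsample, or a batch). The function names and the dense NumPy layout are illustrative assumptions rather than the authors' code, and tied event times are ignored.

```python
import numpy as np

def a_vectors_and_info(beta, T, delta, X):
    """a_i(beta) for every row, plus the (positive-definite) information matrix."""
    n, r = X.shape
    expeta = np.exp(X @ beta)
    t_events = T[delta == 1]
    at_risk = (T[None, :] >= t_events[:, None]).astype(float)    # (n_e, n)
    S0 = at_risk @ expeta                                         # S^(0) at event times
    S1 = (at_risk * expeta) @ X                                   # S^(1) at event times
    xbar = S1 / S0[:, None]
    # a_i = sum over event times of {X_i - xbar(t)} Y_i(t) exp(beta'X_i) / S^(0)(t)
    coef = at_risk * (expeta / S0[:, None])                       # (n_e, n)
    a = coef.sum(axis=0)[:, None] * X - coef.T @ xbar
    S2 = np.einsum('kn,n,ni,nj->kij', at_risk, expeta, X, X)
    info = (S2 / S0[:, None, None]
            - np.einsum('ki,kj->kij', xbar, xbar)).sum(axis=0) / n
    return a, info

def optimal_probs(a, info, delta, criterion="L"):
    """Approximate A-optimal (Eq. 2) or L-optimal (Eq. 3) probabilities over C."""
    a_cens = a[delta == 0]
    if criterion == "A":
        scores = np.linalg.norm(np.linalg.solve(info, a_cens.T).T, axis=1)
    else:
        scores = np.linalg.norm(a_cens, axis=1)
    return scores / scores.sum()
```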
Evidently, $\mathbf{p}^A$ incorporates $\boldsymbol{\mathcal{I}}^{-1}$, enabling a more efficient estimation of $\boldsymbol{\beta}$ than the estimator based on the L-optimal criterion. Nevertheless, for the same reason, $\mathbf{p}^A$ demands more computational time.
As both $\mathbf{p}^A$ and $\mathbf{p}^L$ rely on the true unknown regression vector, $\boldsymbol{\beta}^o$, the following two-step procedure has been proposed. It commences with a quick and straightforward consistent estimator of the regression vector, which is used to estimate the optimal sampling probabilities. The complete implementation of the two-step procedure is outlined below:
Step 1: Select $q_0$ observations uniformly from $\mathcal{C}$ and combine them with the observed events to get $\mathcal{Q}_{pilot}$. Conduct a weighted Cox regression on $\mathcal{Q}_{pilot}$ and obtain $\widetilde{\boldsymbol{\beta}}_U$ based on Eq. (1). Compute approximated optimal sampling probabilities by substituting $\boldsymbol{\beta}^o$ with $\widetilde{\boldsymbol{\beta}}_U$ in Eq. (2) or (3).
Step 2: Select $q_n$ observations from $\mathcal{C}$ based on the sampling probabilities of Step 1. Combine these selected observations with the observed events to get $\mathcal{Q}$. Perform a weighted Cox regression on $\mathcal{Q}$, based on Eq. (1), and get the two-step estimator $\widetilde{\boldsymbol{\beta}}_{TS}$.
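Putting the pieces together, a schematic driver for the two-step procedure might look as follows, reusing the `fit_weighted_cox`, `a_vectors_and_info`, and `optimal_probs` sketches above. For readability it estimates the probabilities on the full data in memory, whereas in practice they would be approximated batch by batch, as discussed below; all names are illustrative.

```python
def two_step_cox(T, delta, X, q0, qn, criterion="L", seed=None):
    """Schematic two-step optimal-subsampling estimator for the Cox model."""
    rng = np.random.default_rng(seed)
    ev = np.flatnonzero(delta == 1)
    cens = np.flatnonzero(delta == 0)
    nc = len(cens)

    # Step 1: uniform pilot subsample of censored observations (prob 1/nc each)
    pilot = rng.choice(cens, size=q0, replace=True)
    idx = np.concatenate([ev, pilot])
    w = np.concatenate([np.ones(len(ev)), np.full(q0, nc / q0)])
    beta_U = fit_weighted_cox(T[idx], delta[idx], X[idx], w)

    # Approximate the optimal probabilities with beta_U in place of beta^o
    a, info = a_vectors_and_info(beta_U, T, delta, X)
    p = optimal_probs(a, info, delta, criterion)          # length nc, sums to 1

    # Step 2: optimal subsample of censored observations and final weighted fit
    pos = rng.choice(nc, size=qn, replace=True, p=p)      # positions within C
    idx2 = np.concatenate([ev, cens[pos]])
    w2 = np.concatenate([np.ones(len(ev)), 1.0 / (p[pos] * qn)])
    return fit_weighted_cox(T[idx2], delta[idx2], X[idx2], w2)
```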
The original algorithm employs $q_0=q_n$. However, here we suggest using a small value of $q_0$ for the initial uniform sampling of Step 1. Subsequently, the methods outlined in the following subsections focus on determining the value of $q_n$ under various criteria. To maintain computational efficiency, we recommend setting $q_0=c_0 n_e$, where $c_0$ is a small scalar (e.g., $c_0<5$). Our simulation study and real-data analysis indicate that this recommendation is generally sufficient. Furthermore, our findings suggest that when none of the covariates exhibit long-tailed distributions, setting $c_0=1$ is often adequate.
The asymptotic properties of $\widetilde{\boldsymbol{\beta}}_{TS}$ and $\widehat{\Lambda}_0(t,\widetilde{\boldsymbol{\beta}}_{TS})$ were established by Keret and Gorfine (2023). Specifically, it was shown that under standard assumptions,
\[
\sqrt{n}\,\mathbb{V}(\mathbf{p}^{opt},\boldsymbol{\beta}^o)^{-1/2}(\widetilde{\boldsymbol{\beta}}_{TS}-\boldsymbol{\beta}^o)\xrightarrow{D}N(\mathbf{0},\mathbf{I})
\]
as $q_n,n\rightarrow\infty$, where $\mathbf{p}^{opt}$ is either $\mathbf{p}^A$ or $\mathbf{p}^L$. Moreover, the asymptotic theory accommodates left truncation, stratified analysis, time-dependent covariates, and time-dependent coefficients. However, a practical methodology for selecting the size of $q_n$ was not studied, despite its considerable importance. This motivates us to propose the following frameworks for determining the necessary size of $q_n$ based on specific objectives.
Often, datasets are too voluminous to fit within the RAM limitations of standard computers, a challenge highlighted in the real-data analyses of Section 6. Keret and Gorfine (2023) proposed a speedy and memory-efficient approach for batch-based reservoir sampling, designed to operate on conventional computer systems. In summary, the observed-events dataset, $\mathcal{E}$, is kept in RAM throughout. Conversely, the censored observations are split into $B$ batches, labeled $\mathcal{B}_1,\ldots,\mathcal{B}_B$. At any given moment, only one batch is in RAM. In every batch $b$, $b=1,\ldots,B$, we approximate $\mathbf{p}^A$ or $\mathbf{p}^L$ by considering the dataset comprised of $\mathcal{E}\cup\mathcal{B}_b$. The approximation employs distinct weights: 1 for an event and $n_c/|\mathcal{B}_b|$ for each censored observation. The reservoir-sampling algorithm selects $q_n$ observations with replacement in a single pass. The key idea involves keeping a sample of $q_n$ observations (referred to as the "reservoir"), where replacements can occur as new batches are loaded. For an in-depth description and a proof of the algorithm's validity, refer to Section 2.6 in Keret and Gorfine (2023). This reservoir-sampling algorithm can be applied to any scenario of sampling with replacement, independently of the original regression problem. In this work, we employ it for survival regression with approximately 350 million records.
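Purely for intuition, the sketch below shows one generic way to draw $q_n$ indices with replacement, with probabilities proportional to given weights, in a single pass over batches: each reservoir slot is replaced by the incoming item with probability equal to its weight divided by the running weight total. This is a standard reservoir-style construction and is not claimed to be the exact batch-based algorithm of Keret and Gorfine (2023, Section 2.6); the function name and interface are illustrative.

```python
import numpy as np

def weighted_reservoir_with_replacement(batches, qn, seed=None):
    """One-pass sampling of qn indices with replacement, proportional to weights.

    batches: iterable yielding (indices, weights) arrays one batch at a time,
    so only the current batch needs to reside in memory.  At the end, each of
    the qn slots independently holds index i with probability w_i / sum_j w_j.
    """
    rng = np.random.default_rng(seed)
    reservoir = np.full(qn, -1, dtype=np.int64)
    total = 0.0
    for idx, wts in batches:
        for i, w in zip(idx, wts):
            total += w
            # each slot independently switches to the new item with prob w/total
            switch = rng.random(qn) < w / total
            reservoir[switch] = i
    return reservoir
```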
2.2 Subsample Size Based on Relative Efficiency

What is the optimal size of $q_n$ that maintains a small efficiency loss? Here, we introduce a tool that enables us to evaluate the efficiency loss resulting from the subsampling approach. We begin by defining an estimator of the relative efficiency (RE) of the two-step estimator compared to the full PL estimator by
\[
RE(q_n)=\frac{\|n^{-1}\boldsymbol{\mathcal{I}}^{-1}(\widetilde{\boldsymbol{\beta}}_{TS})+q_n^{-1}\boldsymbol{\mathcal{I}}^{-1}(\widetilde{\boldsymbol{\beta}}_{TS})\boldsymbol{\varphi}(\mathbf{p}^{opt},\widetilde{\boldsymbol{\beta}}_{TS})\boldsymbol{\mathcal{I}}^{-1}(\widetilde{\boldsymbol{\beta}}_{TS})\|_F}{\|n^{-1}\boldsymbol{\mathcal{I}}^{-1}(\widehat{\boldsymbol{\beta}}_{PL})\|_F}\,, \tag{4}
\]
where $\|A\|_F=\sqrt{\sum_{i,j}|a_{i,j}|^2}$ is the Frobenius norm. The lower limit of Eq. (4) is close to 1.
If interest lies in the effect of a particular covariate, e.g., the $p$-th covariate, then we may utilize
\[
RE_p(q_n)=\frac{\Big[n^{-1}\boldsymbol{\mathcal{I}}^{-1}(\widetilde{\boldsymbol{\beta}}_{TS})+q_n^{-1}\boldsymbol{\mathcal{I}}^{-1}(\widetilde{\boldsymbol{\beta}}_{TS})\boldsymbol{\varphi}(\mathbf{p}^{opt},\widetilde{\boldsymbol{\beta}}_{TS})\boldsymbol{\mathcal{I}}^{-1}(\widetilde{\boldsymbol{\beta}}_{TS})\Big]_{pp}}{\Big[n^{-1}\boldsymbol{\mathcal{I}}^{-1}(\widehat{\boldsymbol{\beta}}_{PL})\Big]_{pp}}\,, \tag{5}
\]
where $[A]_{pp}$ is the $pp$ element of the matrix $A$. Adjusted optimal sampling probabilities that target a subset of covariates, while retaining the rest in the model to control for confounders, are available in Keret and Gorfine (2023) (see Equations 7 and 8 therein).
The equations above pose practical challenges: firstly, they include $\widehat{\boldsymbol{\beta}}_{PL}$, whose calculation we aim to avoid; secondly, they involve $\widetilde{\boldsymbol{\beta}}_{TS}$, which can be computed only after determining the subsample size, $q_n$. However, leveraging the consistent estimator $\widetilde{\boldsymbol{\beta}}_U$ from Step 1 resolves this by substituting both $\widetilde{\boldsymbol{\beta}}_{TS}$ and $\widehat{\boldsymbol{\beta}}_{PL}$ with $\widetilde{\boldsymbol{\beta}}_U$. An additional challenge arises because $\boldsymbol{\mathcal{I}}^{-1}(\widetilde{\boldsymbol{\beta}}_U)$ and $\boldsymbol{\varphi}(\mathbf{p}^{opt},\widetilde{\boldsymbol{\beta}}_U)$ involve the full data. Alternatively, their subsampling-based counterparts, $\widetilde{\boldsymbol{\mathcal{I}}}^{-1}(\widetilde{\boldsymbol{\beta}}_U)$ and $\widetilde{\boldsymbol{\varphi}}(\mathbf{p}^{opt},\widetilde{\boldsymbol{\beta}}_U)$, can be used.
However, it is advisable to refrain from basing $\widetilde{\boldsymbol{\mathcal{I}}}^{-1}(\widetilde{\boldsymbol{\beta}}_U)$ and $\widetilde{\boldsymbol{\varphi}}(\mathbf{p}^{opt},\widetilde{\boldsymbol{\beta}}_U)$ on the uniform subsample of Step 1, since uniform sampling allows the selection of observations with extremely small optimal probabilities $\mathbf{p}^{opt}$. Consequently, dividing by these probabilities often renders Eq. (4) or (5) numerically unstable. Our proposed approach involves sampling an additional small subsample of size $q_0$, but this time using the approximated optimal probabilities obtained by substituting $\boldsymbol{\beta}^o$ with $\widetilde{\boldsymbol{\beta}}_U$ in Eqs. (2) and (3). Let $\mathcal{Q}_{1.5}$ be the index set containing all observations whose failure time was observed, along with the censored observations included in this new subsample of size $q_0$. We denote the counterparts of $\boldsymbol{\mathcal{I}}^{-1}$ and $\boldsymbol{\varphi}$ for this subsample by $\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_U)$ and $\widetilde{\boldsymbol{\varphi}}_{Q_{1.5}}(\mathbf{p}^{opt},\widetilde{\boldsymbol{\beta}}_U)$ (see the SM, Section S1, for details).
Hence, the proposed RE estimator is given by
\[
\widehat{RE}(q_n)=\frac{\|n^{-1}\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_U)+q_n^{-1}\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_U)\widetilde{\boldsymbol{\varphi}}_{Q_{1.5}}(\widetilde{\mathbf{p}}^{opt},\widetilde{\boldsymbol{\beta}}_U)\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_U)\|_F}{\|n^{-1}\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_U)\|_F}\,, \tag{6}
\]
where $\widetilde{\mathbf{p}}^{opt}$ is the estimated optimal-probabilities vector calculated in Step 1. To save computational time, we utilize $\widetilde{\mathbf{p}}^{opt}$ and $\widetilde{\boldsymbol{\beta}}_U$ from Step 1 instead of re-estimating the optimal probabilities and the coefficient vector based on $\mathcal{Q}_{1.5}$. The computational time for this additional step is very short, since $q_0$ is very small compared to $n$. Lastly, by calculating $\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_U)$ and $\widetilde{\boldsymbol{\varphi}}_{Q_{1.5}}(\widetilde{\mathbf{p}}^{opt},\widetilde{\boldsymbol{\beta}}_U)$ only once, a plot of $\widehat{RE}(q_n)$ as a function of $q_n$ can be generated quickly and effortlessly. This additional step can easily be added to the two-step Algorithm 1 above, between Steps 1 and 2, as follows:
Step 1.5: Sample $q_{0}$ observations from $\mathcal{C}$ using the optimal sampling probabilities computed in Step 1. Combine these observations with the observed failure times to form $\mathcal{Q}_{1.5}$, and compute $\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_{U})$ and $\widetilde{\boldsymbol{\varphi}}_{Q_{1.5}}(\widetilde{\mathbf{p}}^{opt},\widetilde{\boldsymbol{\beta}}_{U})$. Plot $\widehat{RE}(q_{n})$ as a function of $q_{n}$, and choose the minimal $q_{n}$ that provides the required RE.
In practice, with a large $n$, the curve of $\widehat{RE}(q_{n})$ is anticipated to show a rapid decrease followed by a gradual decline, resembling an 'elbow' shape. A sensible selection for $q_{n}$ would be in the region where the decline becomes moderate, as the incremental efficiency gain from further increasing $q_{n}$ is likely to be minimal. Comprehensive examples from simulations and real data analysis are presented in Sections 5–7 for further insights.
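To illustrate how (6) can be used in practice, the following Python sketch evaluates the estimated RE over a grid of candidate subsample sizes and returns the smallest one meeting a user-specified threshold. It assumes the Step 1.5 plug-in matrices $\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_{U})$ and $\widetilde{\boldsymbol{\varphi}}_{Q_{1.5}}(\widetilde{\mathbf{p}}^{opt},\widetilde{\boldsymbol{\beta}}_{U})$ have already been computed; all names (`re_curve`, `I_inv`, `varphi`, and so on) are illustrative assumptions, not part of any released implementation.

```python
import numpy as np

def re_curve(I_inv, varphi, n, q_grid):
    """Estimated relative efficiency of Eq. (6) over a grid of candidate q_n values.

    I_inv  : plug-in estimate of the inverse information matrix (r x r).
    varphi : plug-in estimate of the varphi matrix at the optimal probabilities (r x r).
    n      : full-data sample size.
    q_grid : array of candidate subsample sizes q_n.
    """
    denom = np.linalg.norm(I_inv / n, ord="fro")   # denominator of Eq. (6)
    meat = I_inv @ varphi @ I_inv                  # I^{-1} varphi I^{-1}
    return np.array([np.linalg.norm(I_inv / n + meat / q, ord="fro") / denom
                     for q in q_grid])

def minimal_qn(I_inv, varphi, n, q_grid, target_re=1.05):
    """Smallest q_n on the grid whose estimated RE does not exceed target_re (None if none)."""
    re_vals = re_curve(I_inv, varphi, n, q_grid)
    ok = np.flatnonzero(re_vals <= target_re)
    return int(q_grid[ok[0]]) if ok.size else None
```

In practice one would plot `re_curve(...)` against the grid, look for the elbow, and read off the smallest $q_{n}$ whose estimated RE falls below the desired level.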
2.3 Subsample Size Based on Hypothesis Testing

Let $\beta^{o}_{p}$ be the $p$th element of $\boldsymbol{\beta}^{o}$. Suppose we aim to test the hypothesis $H_{0}:\beta^{o}_{p}=0$ against $H_{1}:\beta^{o}_{p}\neq 0$ at a significance level of $\alpha$ with a power of $\gamma$. Our current objective is to determine the necessary subsample size, given that $\beta^{o}_{p}=\beta^{*}_{p}$, where $\beta^{*}_{p}$ is specified by the researcher. Since the minimal $n$ should satisfy
$$n=\left\lceil(Z_{1-\alpha/2}+Z_{\gamma})^{2}\left\{\boldsymbol{\mathcal{I}}^{-1}(\boldsymbol{\beta}^{o})+\frac{n}{q_{n}}\boldsymbol{\mathcal{I}}^{-1}(\boldsymbol{\beta}^{o})\boldsymbol{\varphi}(\mathbf{p}^{opt},\boldsymbol{\beta}^{o})\boldsymbol{\mathcal{I}}^{-1}(\boldsymbol{\beta}^{o})\right\}_{pp}\beta_{p}^{*-2}\right\rceil\,,$$
where $\lceil\cdot\rceil$ is the ceiling function, the required $q_{n}$ is obtained by solving for $q_{n}$ and plugging in estimated quantities, namely
$$\widetilde{q}_{n}=\left\lceil\frac{\left\{\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_{U})\,\widetilde{\boldsymbol{\varphi}}_{Q_{1.5}}(\widetilde{\mathbf{p}}^{opt},\widetilde{\boldsymbol{\beta}}_{U})\,\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_{U})\right\}_{pp}(Z_{1-\alpha/2}+Z_{\gamma})^{2}}{\beta_{p}^{*2}-n^{-1}\left\{\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_{U})\right\}_{pp}(Z_{1-\alpha/2}+Z_{\gamma})^{2}}\right\rceil\,. \qquad (7)$$
This formula is convenient and practically valuable because, upon completing Steps 1 and 1.5, we can straightforwardly plot $\widetilde{q}_{n}$ as a function of $\gamma$. A negative value of $\widetilde{q}_{n}$ indicates that the required power cannot be achieved even with the entire sample of size $n$. Our simulation study demonstrates that in scenarios where the required power is attainable, typically only a small fraction of the censored observations is necessary.
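The following minimal sketch of (7) returns the required subsample size for given $\alpha$, $\gamma$, and $\beta^{*}_{p}$, again assuming the Step 1.5 plug-in matrices are available; it returns `None` when the denominator is non-positive, i.e., when the required power is unattainable even with the full sample. The function name and interface are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def required_qn(I_inv, varphi, n, beta_star, p, alpha=0.05, power=0.8):
    """Required number of sampled censored observations, following Eq. (7).

    I_inv, varphi : Step-1.5 plug-in matrices (as used in Eq. 6).
    beta_star     : assumed true value of the p-th coefficient under H1.
    p             : index of the tested coefficient.
    Returns None when the power is unattainable even with the full sample.
    """
    z2 = (norm.ppf(1 - alpha / 2) + norm.ppf(power)) ** 2
    numer = (I_inv @ varphi @ I_inv)[p, p] * z2
    denom = beta_star ** 2 - I_inv[p, p] * z2 / n
    return int(np.ceil(numer / denom)) if denom > 0 else None
```

Evaluating `required_qn` over a range of `power` values reproduces the curve of $\widetilde{q}_{n}$ versus $\gamma$ described above.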
We summarize the additional step for single-covariate hypothesis testing by adding the following mid-step to the original two-step Algorithm 1:
Step 1.5*: Sample $q_{0}$ observations from $\mathcal{C}$ using the optimal sampling probabilities computed in Step 1. Combine these observations with the observed events to create $\mathcal{Q}_{1.5}$. Compute $\widetilde{\boldsymbol{\mathcal{I}}}^{-1}_{Q_{1.5}}(\widetilde{\boldsymbol{\beta}}_{U})$, $\widetilde{\boldsymbol{\varphi}}_{Q_{1.5}}(\widetilde{\mathbf{p}}^{opt},\widetilde{\boldsymbol{\beta}}_{U})$, and $\widetilde{q}_{n}$ for the desired values of $\alpha$ and $\gamma$. If $\widetilde{q}_{n}<0$, achieving the required power is not feasible even with the entire sample. Otherwise, set $q_{n}=\widetilde{q}_{n}$.
3 Logistic Regression with Rare Events

3.1 Two-Step Algorithm

While Wang et al. (2018) introduced a two-stage optimal subsampling algorithm for logistic regression, their asymptotic variance approximation was observed to perform poorly for highly imbalanced data (i.e., when the rate of cases is below 15%). Section 4 presents a method for choosing a subsample size for their optimal subsampling algorithm, and our simulation results indeed indicate its effectiveness primarily when the event is not rare.
In the realm of subsampling for imbalanced binary data, various methods have been explored and developed (Wang, 2020; Wang et al., 2021). However, their results were derived under an assumption that ties the intercept to the sample size while keeping the other coefficients fixed, so that the probability of experiencing an event decreases to zero as the sample size goes to infinity. Our current work, akin to Wang et al. (2021), is based on subsampling only among the non-case observations while retaining all cases. Notably, our approach does not necessitate the undesirable assumption that the event probability approaches zero as the sample size increases. Furthermore, our method yields a simpler formula for the asymptotic variance, enabling the required subsample size to be evaluated in a practically efficient manner.
Let $D_{i}\in\{0,1\}$ be the response of individual $i$, $i=1,\dots,n$. In order to include an intercept term, we extend the covariate vector of each individual from $r$ to $r+1$ elements by setting its first element to 1, and for simplicity of presentation we continue to use the notation $\mathbf{X}_{i}$. Let $\mathcal{N}=\{i\,;\,D_{i}=0\}$ and $n_{0}=|\mathcal{N}|$. The logistic regression model is of the form
$$\Pr(D_{i}=1\mid\mathbf{X}_{i})=\mu_{i}(\boldsymbol{\beta})=e^{\mathbf{X}_{i}^{T}\boldsymbol{\beta}}\left(1+e^{\mathbf{X}_{i}^{T}\boldsymbol{\beta}}\right)^{-1}\,,\qquad i=1,\ldots,n\,,$$
and the maximum likelihood estimator (MLE) is given by
$$\widehat{\boldsymbol{\beta}}_{MLE}=\arg\max_{\boldsymbol{\beta}}\sum_{i=1}^{n}\big[D_{i}\log\mu_{i}(\boldsymbol{\beta})+(1-D_{i})\log\{1-\mu_{i}(\boldsymbol{\beta})\}\big]\,.$$
As before, let $q_{n}$ be the size of the subsample from $\mathcal{N}$, let $\pi_{i}$ be the sampling probability of individual $i$ with $\sum_{i\in\mathcal{N}}\pi_{i}=1$, and let $\mathcal{Q}$ be the index set containing all the observed cases (i.e., $D_{i}=1$) and the subsampled non-cases ($D_{i}=0$). Finally, for $i=1,\ldots,n$, set
$$w_{i}=\begin{cases}(\pi_{i}q_{n})^{-1}, & \text{if }D_{i}=0,\\ 1, & \text{if }D_{i}=1,\end{cases}$$
as the sampling weights. Then, the estimator $\widetilde{\boldsymbol{\beta}}$ that is based on $\mathcal{Q}$ is obtained by maximizing the pseudo log-likelihood function
$$l^{*}(\boldsymbol{\beta})=\sum_{i\in\mathcal{Q}}w_{i}\big[D_{i}\log\mu_{i}(\boldsymbol{\beta})+(1-D_{i})\log\{1-\mu_{i}(\boldsymbol{\beta})\}\big]\,. \qquad (8)$$
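For concreteness, a minimal Python sketch of the weighted estimator defined by (8) follows: given the covariates, responses, and sampling weights of the observations in $\mathcal{Q}$, it maximizes the pseudo log-likelihood numerically. The function name and interface are assumptions for illustration, not a prescribed implementation.

```python
import numpy as np
from scipy.optimize import minimize

def weighted_logistic_fit(X, D, w, beta0=None):
    """Maximize the weighted pseudo log-likelihood of Eq. (8).

    X : (m, r+1) covariates of the observations in Q (first column equal to 1).
    D : (m,) binary responses on Q.
    w : (m,) sampling weights: 1 for cases, (pi_i * q_n)^{-1} for sampled non-cases.
    """
    def negloglik(beta):
        eta = X @ beta
        # D*log(mu) + (1-D)*log(1-mu) simplifies to D*eta - log(1 + exp(eta))
        return -np.sum(w * (D * eta - np.logaddexp(0.0, eta)))

    beta0 = np.zeros(X.shape[1]) if beta0 is None else beta0
    return minimize(negloglik, beta0, method="BFGS").x
```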
The following theorem provides the asymptotic distribution of a general subsampling-based estimator $\widetilde{\boldsymbol{\beta}}$, for any vector of sampling probabilities, under standard assumptions (see the SM, Section S2). Based on the asymptotic distribution, the optimal sampling probabilities are then derived. Since these optimal sampling probabilities involve the true unknown $\boldsymbol{\beta}^{o}$, we also describe a two-step algorithm for logistic regression that uses an approximation of the optimal probabilities.
Theorem 1. If Assumptions A.1–A.4 hold, then as $q_{n},n\rightarrow\infty$,
$$\sqrt{n}\,\mathbb{H}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})^{-1/2}(\widetilde{\boldsymbol{\beta}}-\boldsymbol{\beta}^{o})\xrightarrow{D}N(0,\mathbf{I})$$
where
$$\mathbb{H}^{R}(\boldsymbol{\pi},\boldsymbol{\beta})=\mathbf{M}_{X}^{-1}(\boldsymbol{\beta})+\frac{n}{q_{n}}\mathbf{M}_{X}^{-1}(\boldsymbol{\beta})\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta})\mathbf{M}_{X}^{-1}(\boldsymbol{\beta})\,,$$
$$\mathbf{M}_{X}(\boldsymbol{\beta})=n^{-1}\sum_{i=1}^{n}\mu_{i}(\boldsymbol{\beta})\{1-\mu_{i}(\boldsymbol{\beta})\}\mathbf{X}_{i}\mathbf{X}_{i}^{T}\,,$$
and
$$\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta})=\frac{1}{n^{2}}\bigg\{\sum_{i\in\mathcal{N}}\frac{\mu^{2}_{i}(\boldsymbol{\beta})\mathbf{X}_{i}\mathbf{X}_{i}^{T}}{\pi_{i}}-\sum_{i,j\in\mathcal{N}}\mu_{i}(\boldsymbol{\beta})\mu_{j}(\boldsymbol{\beta})\mathbf{X}_{i}\mathbf{X}_{j}^{T}\bigg\}\,.$$
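A direct plug-in computation of the quantities in Theorem 1 is straightforward. The sketch below evaluates $\mathbf{M}_{X}$, $\mathbf{K}^{R}$, and $\mathbb{H}^{R}$ at a given coefficient vector and sampling-probability vector using the full data; it is intended only to make the formulas concrete, and all names are illustrative (in practice these matrices are estimated from a subsample, as in (11) below).

```python
import numpy as np

def logistic_mu(X, beta):
    """mu_i(beta) = expit(X_i^T beta)."""
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

def theorem1_matrices(X, D, beta, pi, q_n):
    """Plug-in evaluation of M_X, K^R and H^R from Theorem 1 on the full data.

    pi : sampling probabilities over the non-cases (entries for cases are ignored).
    """
    n = X.shape[0]
    m = logistic_mu(X, beta)
    M_X = (X * (m * (1 - m))[:, None]).T @ X / n

    nc = (D == 0)                                   # the index set N of non-cases
    Xn, mn, pn = X[nc], m[nc], pi[nc]
    term1 = (Xn * (mn ** 2 / pn)[:, None]).T @ Xn   # sum_i mu_i^2 X_i X_i^T / pi_i
    s = (Xn * mn[:, None]).sum(axis=0)              # sum_i mu_i X_i over N
    K_R = (term1 - np.outer(s, s)) / n ** 2

    M_inv = np.linalg.inv(M_X)
    H_R = M_inv + (n / q_n) * M_inv @ K_R @ M_inv
    return M_X, K_R, H_R
```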
We now derive optimal sampling probabilities under the A-optimality and L-optimality criteria, as before.
Theorem 2. The A-optimal and L-optimal sampling probability vectors, denoted by $\boldsymbol{\pi}^{R,A}$ and $\boldsymbol{\pi}^{R,L}$ and minimizing the trace of $\mathbb{H}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})$ and of $\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})$, respectively, are given by
$$\pi_{m}^{R,A}=\frac{\mu_{m}(\boldsymbol{\beta}^{o})\|\mathbf{M}_{X}^{-1}(\boldsymbol{\beta}^{o})\mathbf{X}_{m}\|_{2}}{\sum_{j\in\mathcal{N}}\mu_{j}(\boldsymbol{\beta}^{o})\|\mathbf{M}_{X}^{-1}(\boldsymbol{\beta}^{o})\mathbf{X}_{j}\|_{2}}\quad\text{for all }m\in\mathcal{N} \qquad (9)$$
and
$$\pi_{m}^{R,L}=\frac{\mu_{m}(\boldsymbol{\beta}^{o})\|\mathbf{X}_{m}\|_{2}}{\sum_{j\in\mathcal{N}}\mu_{j}(\boldsymbol{\beta}^{o})\|\mathbf{X}_{j}\|_{2}}\quad\text{for all }m\in\mathcal{N}\,. \qquad (10)$$
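The optimal probabilities of Theorem 2 are simple to compute once an estimate of $\boldsymbol{\beta}^{o}$ is available. The sketch below implements (9) and (10) with $\boldsymbol{\beta}^{o}$ replaced by a pilot estimate; the function name and the `criterion` argument are illustrative assumptions.

```python
import numpy as np

def optimal_probs(X, D, beta, criterion="L"):
    """Approximate A- or L-optimal probabilities (Eqs. 9 and 10) over the non-cases,
    with the unknown beta^o replaced by a pilot estimate."""
    m = 1.0 / (1.0 + np.exp(-(X @ beta)))           # mu_i(beta)
    nc = (D == 0)
    Xn, mn = X[nc], m[nc]
    if criterion == "A":
        M_X = (X * (m * (1 - m))[:, None]).T @ X / X.shape[0]
        scores = np.linalg.norm(np.linalg.solve(M_X, Xn.T), axis=0)  # ||M_X^{-1} X_m||_2
    else:
        scores = np.linalg.norm(Xn, axis=1)                          # ||X_m||_2
    probs = mn * scores
    pi = np.zeros(X.shape[0])
    pi[nc] = probs / probs.sum()                    # probabilities sum to 1 over N
    return pi
```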
Notably, the optimal probabilities expressed in (9) and (10) bear a resemblance to those derived for the subsampling approach of Wang et al. (2018) applied to a balanced design. The discrepancy between these optimal probabilities and their counterparts in Wang et al. (2018) stems from the fact that, here, the summation in the denominators is restricted to the set $\mathcal{N}$ rather than the entire sample. Since $\boldsymbol{\pi}^{R,A}$ and $\boldsymbol{\pi}^{R,L}$ involve the unknown $\boldsymbol{\beta}^{o}$, we suggest the following two-step algorithm, in the spirit of the previous section and Wang et al. (2018):
Step 1: Sample $q_{0}$ observations uniformly from $\mathcal{N}$ and combine them with all the observed events to create $\mathcal{Q}_{pilot}$. Perform a weighted logistic regression on $\mathcal{Q}_{pilot}$, based on Eq. (8), and obtain $\widetilde{\boldsymbol{\beta}}_{U}$. Utilize this estimator to derive approximated optimal sampling probabilities by substituting $\widetilde{\boldsymbol{\beta}}_{U}$ for $\boldsymbol{\beta}^{o}$ in Eq. (9) or (10).
Step 2: Sample $q_{n}$ observations from $\mathcal{N}$ using the sampling probabilities computed in Step 1. Combine these observations with the observed events to create $\mathcal{Q}$ and conduct a weighted logistic regression on $\mathcal{Q}$, based on Eq. (8), to obtain the two-step estimator $\widetilde{\boldsymbol{\beta}}_{TS}$.
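Putting the pieces together, a compact sketch of the two-step algorithm follows. It reuses the `weighted_logistic_fit` and `optimal_probs` helpers sketched above, and it draws the subsamples with replacement for simplicity; these choices, like all names here, are illustrative assumptions rather than the definitive implementation.

```python
import numpy as np

def two_step_rare_events(X, D, q0, q_n, criterion="L", seed=0):
    """Sketch of the two-step algorithm for rare-events logistic regression,
    reusing weighted_logistic_fit and optimal_probs from the sketches above."""
    rng = np.random.default_rng(seed)
    cases = np.flatnonzero(D == 1)
    noncases = np.flatnonzero(D == 0)

    # Step 1: uniform pilot subsample of non-cases plus all cases.
    pilot = rng.choice(noncases, size=q0, replace=True)
    idx1 = np.concatenate([cases, pilot])
    w1 = np.concatenate([np.ones(cases.size),
                         np.full(q0, noncases.size / q0)])  # (pi_i q_0)^{-1} with pi_i = 1/n_0
    beta_U = weighted_logistic_fit(X[idx1], D[idx1], w1)

    # Approximate optimal probabilities by plugging the pilot estimate into Eq. (9) or (10).
    pi = optimal_probs(X, D, beta_U, criterion=criterion)

    # Step 2: optimal subsample of non-cases plus all cases, then the weighted fit.
    stage2 = rng.choice(noncases, size=q_n, replace=True, p=pi[noncases])
    idx2 = np.concatenate([cases, stage2])
    w2 = np.concatenate([np.ones(cases.size), 1.0 / (pi[stage2] * q_n)])
    return weighted_logistic_fit(X[idx2], D[idx2], w2)
```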
Consistency and asymptotic normality of $\widetilde{\boldsymbol{\beta}}_{TS}$ can be shown by following the main steps of Wang et al. (2018) and Keret and Gorfine (2023), as detailed in the SM, Section S2. As in the Cox regression setting, the recommended value of $q_{0}$ is $c_{0}(n-n_{0})$ with a small value of $c_{0}$. Once $\widetilde{\boldsymbol{\beta}}_{TS}$ is calculated, inference can be carried out using the subsample counterparts of $\mathbb{H}^{R}(\boldsymbol{\pi},\widetilde{\boldsymbol{\beta}}_{TS})$ and $\mathbf{K}^{R}(\boldsymbol{\pi},\widetilde{\boldsymbol{\beta}}_{TS})$, namely
$$\widetilde{\mathbb{H}}^{R}(\boldsymbol{\pi},\widetilde{\boldsymbol{\beta}}_{TS})=\widetilde{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{TS})+\frac{n}{q_{n}}\widetilde{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{TS})\widetilde{\mathbf{K}}^{R}(\boldsymbol{\pi},\widetilde{\boldsymbol{\beta}}_{TS})\widetilde{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{TS}) \qquad (11)$$
and
$$\widetilde{\mathbf{K}}^{R}(\boldsymbol{\pi},\widetilde{\boldsymbol{\beta}}_{TS})=\frac{1}{n^{2}}\Bigg\{\frac{1}{q_{n}}\sum_{i\in\mathcal{Q}\setminus\mathcal{E}}\frac{\mu^{2}_{i}(\widetilde{\boldsymbol{\beta}}_{TS})\mathbf{X}_{i}\mathbf{X}_{i}^{T}}{\pi_{i}^{2}}-\frac{1}{q_{n}^{2}}\sum_{i\in\mathcal{Q}\setminus\mathcal{E}}\frac{\mu_{i}(\widetilde{\boldsymbol{\beta}}_{TS})\mathbf{X}_{i}}{\pi_{i}}\bigg(\sum_{i\in\mathcal{Q}\setminus\mathcal{E}}\frac{\mu_{i}(\widetilde{\boldsymbol{\beta}}_{TS})\mathbf{X}_{i}}{\pi_{i}}\bigg)^{T}\Bigg\}\,,$$
where $\mathcal{E}=\{i\,:\,D_{i}=1\}$ and
$$\widetilde{\mathbf{M}}_{X}(\widetilde{\boldsymbol{\beta}}_{TS})=\frac{1}{n}\sum_{i\in\mathcal{Q}}w_{i}\mu_{i}(\widetilde{\boldsymbol{\beta}}_{TS})\{1-\mu_{i}(\widetilde{\boldsymbol{\beta}}_{TS})\}\mathbf{X}_{i}\mathbf{X}_{i}^{T}\,.$$
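As an illustration of this inference step, the sketch below computes the plug-in matrix of Eq. (11) from the subsample alone; Wald-type standard errors for $\widetilde{\boldsymbol{\beta}}_{TS}$ are then the square roots of the diagonal of $\widetilde{\mathbb{H}}^{R}$ divided by $n$. The function name and interface are illustrative assumptions.

```python
import numpy as np

def subsample_variance(X_Q, D_Q, pi_Q, beta_TS, n, q_n):
    """Plug-in matrix of Eq. (11) computed from the subsample Q only.

    X_Q, D_Q : covariates and responses of the observations in Q.
    pi_Q     : their sampling probabilities (entries for cases are ignored).
    beta_TS  : the two-step estimator.
    Wald standard errors are sqrt(diag(H) / n).
    """
    m = 1.0 / (1.0 + np.exp(-(X_Q @ beta_TS)))
    w = np.ones(len(D_Q))
    w[D_Q == 0] = 1.0 / (pi_Q[D_Q == 0] * q_n)            # sampling weights

    M = (X_Q * (w * m * (1 - m))[:, None]).T @ X_Q / n     # M_X-tilde(beta_TS)

    nc = (D_Q == 0)                                        # Q \ E, the sampled non-cases
    Xn, mn, pn = X_Q[nc], m[nc], pi_Q[nc]
    t1 = (Xn * (mn ** 2 / pn ** 2)[:, None]).T @ Xn / q_n
    s = (Xn * (mn / pn)[:, None]).sum(axis=0) / q_n
    K = (t1 - np.outer(s, s)) / n ** 2                     # K-tilde^R

    M_inv = np.linalg.inv(M)
    return M_inv + (n / q_n) * M_inv @ K @ M_inv           # H-tilde^R of Eq. (11)
```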
3.2 Choosing Subsample Size by Relative Efficiency or Hypothesis Testing

In the spirit of Section 2.2, we can estimate the RE of the two-step estimator relative to the estimator based on the entire dataset, in order to assess the required subsample size of Step 2. To this end, define
$$\check{\mathbb{H}}^{R}(\widetilde{\boldsymbol{\pi}}^{opt},\widetilde{\boldsymbol{\beta}}_{U})=\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})+\frac{n}{q_{0}}\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})\check{\mathbf{K}}^{R}(\widetilde{\boldsymbol{\pi}}^{opt},\widetilde{\boldsymbol{\beta}}_{U})\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U}) \qquad (12)$$
where
$$\check{\mathbf{M}}_{X}(\widetilde{\boldsymbol{\beta}}_{U})=\frac{1}{n}\sum_{i\in\mathcal{Q}_{1.5}}\check{w}_{i}\mu_{i}(\widetilde{\boldsymbol{\beta}}_{U})\{1-\mu_{i}(\widetilde{\boldsymbol{\beta}}_{U})\}\mathbf{X}_{i}\mathbf{X}_{i}^{T}\,, \qquad (13)$$
$$\check{w}_{i}=\begin{cases}(\pi_{i}^{opt}q_{0})^{-1}, & \text{if }D_{i}=0,\\ 1, & \text{if }D_{i}=1,\end{cases}$$
and
$$\check{\mathbf{K}}^{R}(\widetilde{\boldsymbol{\pi}}^{opt},\widetilde{\boldsymbol{\beta}}_{U})=\frac{1}{n^{2}}\left\{\frac{1}{q_{0}}\sum_{i\in\mathcal{Q}_{1.5}\setminus\mathcal{E}}\frac{\mu^{2}_{i}(\widetilde{\boldsymbol{\beta}}_{U})\mathbf{X}_{i}\mathbf{X}_{i}^{T}}{\pi_{i}^{2}}-\frac{1}{q_{0}^{2}}\sum_{i\in\mathcal{Q}_{1.5}\setminus\mathcal{E}}\frac{\mu_{i}(\widetilde{\boldsymbol{\beta}}_{U})\mathbf{X}_{i}}{\pi_{i}}\left(\sum_{i\in\mathcal{Q}_{1.5}\setminus\mathcal{E}}\frac{\mu_{i}(\widetilde{\boldsymbol{\beta}}_{U})\mathbf{X}_{i}}{\pi_{i}}\right)^{T}\right\}\,.$$
Finally, we define the RE estimators as
$$\widehat{RE}(q_{n})=\frac{\big\|n^{-1}\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})+q_{n}^{-1}\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})\check{\mathbb{H}}^{R}(\widetilde{\boldsymbol{\pi}}^{opt},\widetilde{\boldsymbol{\beta}}_{U})\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})\big\|_{F}}{\big\|n^{-1}\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})\big\|_{F}} \qquad (14)$$
and
$$\widehat{RE}_{p}(q_{n})=\frac{\big[n^{-1}\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})+q_{n}^{-1}\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})\check{\mathbb{H}}^{R}(\widetilde{\boldsymbol{\pi}}^{opt},\widetilde{\boldsymbol{\beta}}_{U})\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})\big]_{pp}}{\big[n^{-1}\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})\big]_{pp}}\,. \qquad (15)$$
The procedure can be easily incorporated within the two-step Algorithm 2 by the following additional step:
Step 1.5: Sample $q_{0}$ observations from $\mathcal{N}$ using the optimal sampling probabilities from Step 1. Combine the sampled observations with $\mathcal{E}$ to create $\mathcal{Q}_{1.5}$. Calculate $\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})$ and $\check{\mathbb{H}}^{R}(\widetilde{\boldsymbol{\pi}}^{opt},\widetilde{\boldsymbol{\beta}}_{U})$. Plot $\widehat{RE}(q_{n})$ or $\widehat{RE}_{p}(q_{n})$ as a function of $q_{n}$ and select the minimal $q_{n}$ that satisfies the required relative efficiency.
The minimal subsample size for testing $H_{0}:\beta^{o}_{p}=0$ against a two-sided alternative, given $\beta^{o}_{p}=\beta^{*}_{p}$, a significance level $\alpha$, and a power $\gamma$, is given by
$$\widetilde{q}_{n}=\left\lceil\frac{\left\{\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})\check{\mathbf{K}}^{R}(\widetilde{\boldsymbol{\pi}}^{opt},\widetilde{\boldsymbol{\beta}}_{U})\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})\right\}_{pp}(Z_{1-\alpha/2}+Z_{\gamma})^{2}}{\beta_{p}^{*2}-n^{-1}\left\{\check{\mathbf{M}}_{X}^{-1}(\widetilde{\boldsymbol{\beta}}_{U})\right\}_{pp}(Z_{1-\alpha/2}+Z_{\gamma})^{2}}\right\rceil\,. \qquad (16)$$
A plot of $\widetilde{q}_n$ as a function of $\gamma$ can be easily generated. The algorithm for a single-covariate hypothesis test is obtained by adding the following mid-step to the two-step Algorithm 2:
Step 1.5*: Sample $q_0$ observations from $\mathcal{N}$ using the optimal sampling probabilities of Step 1. Combine these sampled observations with $\mathcal{E}$ to form $\mathcal{Q}_{1.5}$. Compute $\check{\mathbf{M}}_X^{-1}(\widetilde{\boldsymbol{\beta}}_U)$ and $\check{\mathbb{H}}^R(\widetilde{\boldsymbol{\pi}}^{opt},\widetilde{\boldsymbol{\beta}}_U)$. Plot $\widetilde{q}_n$ against $\gamma$. If $\widetilde{q}_n<0$, achieving the required power is unattainable even with the entire dataset of size $n$. Otherwise, set $q_n=\widetilde{q}_n$.
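For illustration only, the following sketch (ours, not part of the original algorithm) evaluates Eq. (16) once the Step 1.5* quantities are available; here `H_pp` and `Minv_pp` denote the $(p,p)$ entries of $\check{\mathbf{M}}_X^{-1}\check{\mathbf{K}}\check{\mathbf{M}}_X^{-1}$ and $\check{\mathbf{M}}_X^{-1}$, respectively, and all names are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def minimal_qn_rare(H_pp, Minv_pp, beta_star, n, alpha=0.05, gamma=0.9):
    """Evaluate Eq. (16); a negative value signals that the requested power
    is unattainable even with the entire dataset of size n."""
    z2 = (norm.ppf(1 - alpha / 2) + norm.ppf(gamma)) ** 2
    denom = beta_star ** 2 - Minv_pp * z2 / n
    if denom <= 0:
        return -1
    return int(np.ceil(H_pp * z2 / denom))

# example: q_n tilde as a function of the target power gamma
# gammas = np.linspace(0.5, 0.99, 50)
# qns = [minimal_qn_rare(H_pp, Minv_pp, beta_star=0.1, n=150_000, gamma=g) for g in gammas]
```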
4 Logistic Regression with Nearly Balanced Data

4.1 The Two-Step Optimal Subsampling Algorithm (Wang et al., 2018)

While Wang et al. (2018) presented an optimal two-step subsampling algorithm for logistic regression with nearly balanced data and laid the theoretical asymptotic foundations for this approach, they did not offer a method for selecting the subsample size. This section aims to remedy this gap. For clarity, we begin with a summary of their optimal two-step subsampling algorithm.
In the rare-event setting, sampling is performed exclusively from the majority class, whereas in the nearly balanced binary outcome scenario, sampling is conducted from the entire sample. Hence, we now redefine $\mathcal{Q}$ as the index set of all observations included in the subsample, with sampling weights $w_i=(\pi_i q_n)^{-1}$, $i=1,\dots,n$. The estimator $\widetilde{\boldsymbol{\beta}}$ based on $\mathcal{Q}$ is then obtained by maximizing the pseudo log-likelihood function (8).
Under some regularity assumptions, Wang et al. (2018) showed that, given $\mathcal{F}_n=\{D_i,\mathbf{X}_i\,,\,i=1,\ldots,n\}$, the asymptotic distribution of $\widetilde{\boldsymbol{\beta}}$ is
$$\sqrt{n}\,\mathbb{H}^B(\boldsymbol{\pi},\widehat{\boldsymbol{\beta}}_{MLE})^{-1/2}(\widetilde{\boldsymbol{\beta}}-\widehat{\boldsymbol{\beta}}_{MLE})\xrightarrow{D}N(0,\mathbf{I}) \qquad (17)$$
as $n,q_n\rightarrow\infty$, where
$$\mathbb{H}^B(\boldsymbol{\pi},\boldsymbol{\beta})=\mathbf{M}_X^{-1}(\boldsymbol{\beta})\,\mathbf{K}^B(\boldsymbol{\pi},\boldsymbol{\beta})\,\mathbf{M}_X^{-1}(\boldsymbol{\beta})$$
and
$$\mathbf{K}^B(\boldsymbol{\pi},\boldsymbol{\beta})=\frac{1}{n^2}\sum_{i=1}^{n}w_i\left\{D_i-\mu_i(\boldsymbol{\beta})\right\}^2\mathbf{X}_i\mathbf{X}_i^T\,.$$
Then, the A-optimal and L-optimal subsampling probabilities are given by
$$\pi_i^{B,A}=\frac{|D_i-\mu_i(\widehat{\boldsymbol{\beta}}_{MLE})|\,\|\mathbf{M}_X^{-1}\mathbf{X}_i\|}{\sum_{j=1}^{n}|D_j-\mu_j(\widehat{\boldsymbol{\beta}}_{MLE})|\,\|\mathbf{M}_X^{-1}\mathbf{X}_j\|}\,,\quad i=1,\dots,n\,, \qquad (18)$$
and
$$\pi_i^{B,L}=\frac{|D_i-\mu_i(\widehat{\boldsymbol{\beta}}_{MLE})|\,\|\mathbf{X}_i\|}{\sum_{j=1}^{n}|D_j-\mu_j(\widehat{\boldsymbol{\beta}}_{MLE})|\,\|\mathbf{X}_j\|}\,,\quad i=1,\dots,n\,. \qquad (19)$$
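A minimal sketch of Eqs. (18)-(19) is given below; it is our illustration rather than the authors' code, $\mathbf{M}_X$ is taken here to be the usual logistic-regression information matrix evaluated at the supplied pilot estimate, and all function names are hypothetical.

```python
import numpy as np

def optimal_probs(X, D, beta, criterion="A"):
    """A- or L-optimal subsampling probabilities, in the spirit of Eqs. (18)-(19).

    X    : (n, d) covariate matrix (a leading column of ones may serve as intercept)
    D    : (n,) binary outcomes
    beta : pilot estimate substituted for the full-data MLE
    """
    mu = 1.0 / (1.0 + np.exp(-X @ beta))              # fitted probabilities mu_i(beta)
    resid = np.abs(D - mu)                            # |D_i - mu_i(beta)|
    if criterion == "A":
        W = mu * (1.0 - mu)
        M_X = (X * W[:, None]).T @ X / len(D)         # information matrix / n
        norms = np.linalg.norm(np.linalg.solve(M_X, X.T), axis=0)   # ||M_X^{-1} X_i||
    else:                                             # L-optimal
        norms = np.linalg.norm(X, axis=1)             # ||X_i||
    scores = resid * norms
    return scores / scores.sum()
```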
Since we wish to avoid evaluating the full-data estimator $\widehat{\boldsymbol{\beta}}_{MLE}$, Wang et al. (2018) proposed the following two-step algorithm:
Step 1: Sample $q_0$ observations using the probabilities
$$\pi_i^{prop}=\begin{cases}(2n_0)^{-1}&\text{if }D_i=0\\(2n_1)^{-1}&\text{if }D_i=1\end{cases}\qquad i=1,\dots,n\,, \qquad (20)$$
where $n_1=n-n_0$. Conduct a weighted logistic regression on the subsample, based on Eq. (8), to obtain $\widetilde{\boldsymbol{\beta}}_{prop}$. Derive the approximated optimal sampling probabilities by replacing $\widehat{\boldsymbol{\beta}}_{MLE}$ with $\widetilde{\boldsymbol{\beta}}_{prop}$ in (18) or (19).
Step 2: Sample $q_n$ observations from the entire sample using the probabilities of Step 1 to obtain $\mathcal{Q}$. Conduct a weighted logistic regression on $\mathcal{Q}$, based on Eq. (8), and obtain the two-step estimator $\widetilde{\boldsymbol{\beta}}_{TS}$.
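Putting the two steps together, the following sketch (ours, under the same assumptions) draws the pilot subsample with the probabilities in Eq. (20), fits a weighted logistic regression by Newton-Raphson as a stand-in for maximizing Eq. (8), and then draws the final subsample using the hypothetical `optimal_probs` helper defined above.

```python
import numpy as np

def weighted_logit(X, D, w, n_iter=25):
    """Maximize the weighted pseudo log-likelihood by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (D - mu))
        hess = (X * (w * mu * (1.0 - mu))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta

def two_step(X, D, q0, qn, criterion="A", rng=np.random.default_rng(0)):
    n = len(D)
    n1 = int(D.sum()); n0 = n - n1
    # Step 1: proportional sampling, Eq. (20), and a pilot estimate
    pi_prop = np.where(D == 1, 1.0 / (2 * n1), 1.0 / (2 * n0))
    idx0 = rng.choice(n, size=q0, replace=True, p=pi_prop)
    beta_prop = weighted_logit(X[idx0], D[idx0], 1.0 / (pi_prop[idx0] * q0))
    # Step 2: optimal sampling with the pilot plugged into Eq. (18) or (19)
    pi_opt = optimal_probs(X, D, beta_prop, criterion)
    idx = rng.choice(n, size=qn, replace=True, p=pi_opt)
    return weighted_logit(X[idx], D[idx], 1.0 / (pi_opt[idx] * qn))
```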
Once $\widetilde{\boldsymbol{\beta}}_{TS}$ is calculated, inference can be carried out using the variance estimator $\widetilde{\mathbb{H}}^B(\widetilde{\boldsymbol{\beta}}_{TS})=\widetilde{\mathbf{M}}_X(\widetilde{\boldsymbol{\beta}}_{TS})^{-1}\widetilde{\mathbf{K}}^B(\widetilde{\boldsymbol{\beta}}_{TS})\widetilde{\mathbf{M}}_X(\widetilde{\boldsymbol{\beta}}_{TS})^{-1}$, where
$$\widetilde{\mathbf{K}}^B(\widetilde{\boldsymbol{\beta}})=\frac{1}{n^2}\sum_{i\in\mathcal{Q}}w_i^2\left\{D_i-\mu_i(\widetilde{\boldsymbol{\beta}})\right\}^2\mathbf{X}_i\mathbf{X}_i^T\,.$$
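A sketch of this plug-in sandwich estimator, computed from the subsample $\mathcal{Q}$ (names are ours, and the weighted form of $\widetilde{\mathbf{M}}_X$ is our reconstruction of the definitions above), might look as follows; by the asymptotics in (17), the estimated covariance matrix of $\widetilde{\boldsymbol{\beta}}_{TS}$ is the returned matrix divided by $n$.

```python
import numpy as np

def sandwich_H(X_q, D_q, w, beta_ts, n):
    """H~^B(beta_TS) = M~_X^{-1} K~^B M~_X^{-1}, estimated from the subsample Q.
    X_q, D_q : covariates/outcomes of the subsampled rows; w : weights (pi_i q_n)^{-1}."""
    mu = 1.0 / (1.0 + np.exp(-X_q @ beta_ts))
    M = (X_q * (w * mu * (1.0 - mu))[:, None]).T @ X_q / n          # weighted estimate of M_X
    K = (X_q * (w ** 2 * (D_q - mu) ** 2)[:, None]).T @ X_q / n**2  # K~^B above
    M_inv = np.linalg.inv(M)
    return M_inv @ K @ M_inv
```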
Certainly, the ideas discussed earlier can be used to determine the desired value of $q_n$. However, the asymptotic properties in Wang et al. (2018) are confined to the conditional space, conditioning on the entire observed data $\mathcal{F}_n$, whereas our approach for the optimal $q_n$ requires the asymptotic distribution under the unconditional space. The following theorem presents this result; the proof is available in the SM, Section S2.
Theorem 3. Given Assumptions A.1-A.3 (see SM, Section S2), as $q_n,n\rightarrow\infty$,
$$\sqrt{n}\,\mathbb{H}^B(\boldsymbol{\pi},\boldsymbol{\beta}^o)^{-1/2}(\widetilde{\boldsymbol{\beta}}_{TS}-\boldsymbol{\beta}^o)\xrightarrow{D}N(0,\mathbf{I})\,.$$
4.2 Choosing Subsample Size by Relative Efficiency or Hypothesis Testing

An estimator of the RE of the two-step estimator relative to the estimator based on the entire dataset is given by
$$RE(q_n)=\frac{\|\mathbb{H}^B(\widetilde{\boldsymbol{\beta}}_{TS})\|_F}{\|n^{-1}\mathbf{M}_X^{-1}(\widehat{\boldsymbol{\beta}}_{MLE})\|_F}\,,$$
and the respective estimator that focuses on the $p$th covariate is given by
$$RE_p(q_n)=\frac{\big[\mathbb{H}^B(\widetilde{\boldsymbol{\beta}}_{TS})\big]_{pp}}{\big[n^{-1}\mathbf{M}_X^{-1}(\widehat{\boldsymbol{\beta}}_{MLE})\big]_{pp}}\,.$$
Once again, we replace $\widetilde{\boldsymbol{\beta}}_{TS}$ and $\widehat{\boldsymbol{\beta}}_{MLE}$ with the consistent estimator $\widetilde{\boldsymbol{\beta}}_{prop}$ from Step 1. To ensure numerical stability in approximating $\mathbb{H}^B(\widetilde{\boldsymbol{\beta}}_{prop})$ and $\mathbf{M}_X^{-1}(\widetilde{\boldsymbol{\beta}}_{prop})$, we recommend using the estimated optimal probabilities of Step 1 to sample an additional set of size $q_0$, denoted $\mathcal{Q}_{1.5}$. Let
$$\check{\mathbb{H}}^B(\boldsymbol{\pi},\widetilde{\boldsymbol{\beta}})=\check{\mathbf{M}}_X^{-1}(\widetilde{\boldsymbol{\beta}})\,\check{\mathbf{K}}^B(\boldsymbol{\pi},\widetilde{\boldsymbol{\beta}})\,\check{\mathbf{M}}_X^{-1}(\widetilde{\boldsymbol{\beta}})\,,$$
where
$$\check{\mathbf{K}}^B(\widetilde{\boldsymbol{\beta}})=\frac{1}{n^2}\sum_{i\in\mathcal{Q}_{1.5}}\check{w}_i^2\left\{D_i-\mu_i(\widetilde{\boldsymbol{\beta}})\right\}^2\mathbf{X}_i\mathbf{X}_i^T\,,$$
and $\check{w}_i=(q_0\pi_i)^{-1}$, $i=1,\dots,n$. Finally,
$$\widehat{RE}(q_n)=\frac{\|q_0\,q_n^{-1}\,\check{\mathbb{H}}^B(\widetilde{\boldsymbol{\beta}}_{prop})\|}{\|n^{-1}\check{\mathbf{M}}_X^{-1}(\widetilde{\boldsymbol{\beta}}_{prop})\|}\,. \qquad (21)$$
Unlike the RE estimator in the rare-event setting, Eq. (21) approaches zero as $q_n\rightarrow\infty$ while $q_0$ and $n$ are kept fixed. Consequently, only practical sizes of $q_n$ should be considered; that is, values of $q_n$ close to $n$ should not be included in the plot of $\widehat{RE}(q_n)$ as a function of $q_n$. This procedure can be seamlessly integrated into the two-step Algorithm 3 with minimal additional computation time, as outlined below:
Step 1.5: Draw a sample of $q_0$ observations from the entire dataset using the optimal sampling probabilities obtained in Step 1 to create $\mathcal{Q}_{1.5}$. Compute $\check{\mathbb{H}}^B(\widetilde{\boldsymbol{\beta}}_{prop})$ and $\check{\mathbf{M}}_X^{-1}(\widetilde{\boldsymbol{\beta}}_{prop})$. Generate a plot of $\widehat{RE}(q_n)$ against $q_n$ and select the smallest $q_n$ that meets the desired RE.
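The sketch below (ours) evaluates Eq. (21) over a grid of candidate $q_n$ values using only the $\mathcal{Q}_{1.5}$ subsample; the weighted form of $\check{\mathbf{M}}_X$ is our reconstruction of the definitions above, and the threshold 1.1 at the end is merely an example of a "desired RE".

```python
import numpy as np

def re_curve(X_q, D_q, w_check, beta_prop, n, q0, qn_grid):
    """Approximate RE(q_n), Eq. (21), from the Step-1.5 subsample Q_{1.5}.

    X_q, D_q : covariates and outcomes of the q0 observations in Q_{1.5}
    w_check  : weights (q0 * pi_i)^{-1} for those observations
    """
    mu = 1.0 / (1.0 + np.exp(-X_q @ beta_prop))
    K = (X_q * (w_check ** 2 * (D_q - mu) ** 2)[:, None]).T @ X_q / n**2   # K-check^B
    M = (X_q * (w_check * mu * (1.0 - mu))[:, None]).T @ X_q / n           # M-check_X
    M_inv = np.linalg.inv(M)
    H = M_inv @ K @ M_inv
    num = np.array([np.linalg.norm(q0 / qn * H) for qn in qn_grid])
    den = np.linalg.norm(M_inv / n)
    return num / den

# example: pick the smallest q_n whose estimated RE is below a chosen target, say 1.1
# qn_grid = np.arange(1_000, 50_000, 1_000)
# re = re_curve(X_q, D_q, w_check, beta_prop, n, q0, qn_grid)
# qn_star = qn_grid[np.argmax(re <= 1.1)]
```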
Similarly, the minimal subsample size for testing $H_0:\beta^o_p=0$ against a two-sided alternative, given $\beta^o_p=\beta_p^*$, a significance level $\alpha$, and power $\gamma$, is given by
$$\widetilde{q}_n=\left\lceil\frac{q_0\,(Z_{1-\alpha/2}+Z_{\gamma})^2\,\big[\check{\mathbb{H}}^B(\widetilde{\boldsymbol{\beta}}_{prop})\big]_{pp}}{{\beta_p^*}^2}\right\rceil\,. \qquad (22)$$
A plot of $\widetilde{q}_n$ as a function of $\gamma$ can be easily generated. The values derived from Eq. (22) might exceed the sample size $n$. Although sampling more than $n$ observations cannot yield information beyond what is already captured by the full-data MLE $\widehat{\boldsymbol{\beta}}_{MLE}$, a value surpassing $n$ remains informative: it indicates that the desired statistical power cannot be attained. In conclusion, the algorithm for a single-covariate hypothesis test is obtained by adding the following mid-step to the two-step Algorithm 3:
Step 1.5*: Draw a sample of $q_0$ observations from the entire dataset using the optimal sampling probabilities obtained in Step 1 to create $\mathcal{Q}_{1.5}$. Compute $\check{\mathbb{H}}^B(\widetilde{\boldsymbol{\beta}}_{prop})$ and $\check{\mathbf{M}}_X^{-1}(\widetilde{\boldsymbol{\beta}}_{prop})$. Plot $\widetilde{q}_n$ against $\gamma$ and select the smallest $q_n$ that meets the desired power.
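A sketch (ours) of Eq. (22) over a grid of target powers follows; `H_check_pp` denotes the $(p,p)$ entry of $\check{\mathbb{H}}^B(\widetilde{\boldsymbol{\beta}}_{prop})$ from Step 1.5*, and returned values larger than $n$ indicate that the desired power is unattainable.

```python
import numpy as np
from scipy.stats import norm

def qn_for_power(H_check_pp, beta_star, q0, gammas, alpha=0.05):
    """Minimal subsample size per Eq. (22) for a grid of target powers.

    H_check_pp : (p,p) entry of the Step-1.5* estimate of H^B(beta_prop)
    beta_star  : assumed true value of the tested coefficient
    """
    gammas = np.asarray(gammas, dtype=float)
    z2 = (norm.ppf(1 - alpha / 2) + norm.ppf(gammas)) ** 2
    return np.ceil(q0 * z2 * H_check_pp / beta_star**2).astype(int)

# qn_for_power(H_check_pp, beta_star=0.1, q0=1_000, gammas=[0.7, 0.8, 0.9, 0.95])
```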
5 Simulation Study

5.1 Cox Regression

5.1.1 Data Generation

The sampling designs are similar to those of Keret and Gorfine (2023). For each of the settings described below, 500 samples were drawn, with $n=15{,}000$ and $\boldsymbol{\beta}^o=(0.3,-0.5,0.1,-0.1,0.1,-0.3)^T$. Censoring times were generated from an exponential distribution with rate 0.2, independently of failure times. The instantaneous baseline hazard rate was set to $\lambda_0(t)=0.001\,I(t<6)+c_{\lambda_0}I(t\geq 6)$. The covariate distributions and the parameter $c_{\lambda_0}$ of each setting, I, II, and III, were as follows (a data-generation sketch is provided after the list):
1. Setting I: $X_j\sim Unif(0,4)$, $j=1,\ldots,6$, and $c_{\lambda_0}=0.075$. This is a setting of equal variances and no correlation between the covariates.
2. Setting II: $X_j\sim Unif(0,\theta_j)$, $(\theta_1,\theta_2,\theta_3,\theta_4,\theta_5,\theta_6)=(1,6,2,2,1,6)$, and $c_{\lambda_0}=0.15$. This is a setting of unequal variances and no correlation between the covariates.
3. Setting III: $X_1$, $X_2$, and $X_3$ are independently sampled from $Unif(0,4)$, $X_4=0.5X_1+0.5X_2+\varepsilon_1$, $X_5=X_1+\varepsilon_2$, $X_6=X_1+\varepsilon_3$, and $c_{\lambda_0}=0.05$, where $\varepsilon_1\sim N(0,0.1)$, $\varepsilon_2\sim N(0,1)$, $\varepsilon_3\sim N(1,1.5)$, and the $\varepsilon$'s are independent. The strongest correlation between two covariates is about 0.75.
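As an illustration of the data-generating mechanism, the sketch below (ours, with hypothetical names) simulates Setting I by inverting the piecewise-constant cumulative baseline hazard under the Cox model; Settings II and III differ only in the covariate generation and in $c_{\lambda_0}$.

```python
import numpy as np

def gen_cox_setting1(n=15_000, c_lam=0.075, rng=np.random.default_rng(1)):
    """Setting I: independent Unif(0,4) covariates, piecewise-constant baseline hazard."""
    beta = np.array([0.3, -0.5, 0.1, -0.1, 0.1, -0.3])
    X = rng.uniform(0, 4, size=(n, 6))
    # invert Lambda_0(t) = 0.001*t for t < 6, and 0.006 + c_lam*(t - 6) for t >= 6
    target = -np.log(rng.uniform(size=n)) / np.exp(X @ beta)
    T = np.where(target <= 0.006, target / 0.001, 6 + (target - 0.006) / c_lam)
    C = rng.exponential(scale=1 / 0.2, size=n)        # censoring times, rate 0.2
    time, event = np.minimum(T, C), (T <= C).astype(int)
    return X, time, event
```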
5.1.2 Results

Eqs. (6) and (7) become practically valuable only when the approximations made in Step 1.5 and Step 1.5* of Sections 2.2 and 2.3, respectively, closely align with their true values. In Fig. S1 of the SM, we compare the Frobenius norms of three covariance matrices: (i) the covariance matrix of the two-step estimator $\widetilde{\boldsymbol{\beta}}_{TS}$; (ii) the approximated covariance matrix used in Step 1.5; and (iii) the empirical covariance matrix of $\widetilde{\boldsymbol{\beta}}_{TS}$. The results are obtained with $q_n=5n_e$ and $q_0=c_0 n_e$, where $c_0$ ranges from 1 to 5. Clearly, the Frobenius norm of Step 1.5 is remarkably close to that of the covariance matrix of the two-step estimator, and both are in close agreement with the empirical variance, even for small values of $c_0$ such as $c_0=1$.
A comparison between the RE as defined by Eq. (4) and its approximation in Eq. (6) is summarized in Fig. 1, where $q_0=2n_e$, $q_n=cn_e$, and $c=1,\ldots,9$. The results indicate that the approximated RE of Step 1.5 (i.e., Eq. (6)) closely mirrors Eq. (4). The presence of an 'elbow' around $c=3$, with RE fairly close to 1, suggests that $q_n=3n_e$ is sufficiently large under these specific settings. Clearly, the two optimal sampling strategies substantially outperform uniform sampling in terms of RE.
To assess the effectiveness of $\widetilde{q}_n$ as defined in Eq. (7), we compared the empirical and nominal power of the test of $H_0:\beta_5=0$ against a two-sided alternative, using $\widetilde{q}_n$ based on the proposed three-step estimation algorithm comprising Steps 1, 1.5*, and 2. The tests were performed with $\alpha=0.05$. Here, we increased the sample size to $n=150{,}000$; for Setting I, $c_{\lambda_0}=0.005$, while for Settings II and III, $c_{\lambda_0}=0.05$. Consequently, the respective event rates were 0.65%, 1.3%, and 3%.
The results are presented in Table 1, with $q_0=2n_e$. Clearly, $\widetilde{q}_n$ achieves the intended nominal power. Considering the mean and standard deviation (SD) of $\widetilde{q}_n$, both optimality criteria exhibit similar performance under Settings I and II, while A surpasses L in Setting III. These results further validate our assertion that in extensive datasets with rare events, only a small fraction of the censored data is practically necessary. For instance, in Setting III, subsampling approximately 6,000 of the roughly 145,550 censored observations is sufficient to achieve a power of 0.95.
5.2 Logistic Regression with Rare Events

5.2.1 Data Generation

The sampling designs are similar to those of Wang et al. (2018), with some modifications to represent rare-event settings. For each setting, 500 samples were drawn, each of size $n=100{,}000$. The following covariate distributions were considered:
1. mzNormal. $\mathbf{X}$ follows a multivariate normal distribution $N(\mathbf{0},\mathbf{\Sigma})$, where $\Sigma_{ij}=0.5^{I(i\neq j)}$.
2. mixNormal. $\mathbf{X}$ is a mixture of two multivariate normal distributions, $\mathbf{X}\sim 0.5N(\mathbf{1},\mathbf{\Sigma})+0.5N(-\mathbf{1},\mathbf{\Sigma})$, so the distribution of $\mathbf{X}$ is bimodal.
3. T3. $\mathbf{X}$ follows a multivariate $t$ distribution with 3 degrees of freedom, $\mathbf{X}\sim t_3(\mathbf{0},\mathbf{\Sigma})/10$. Hence, the distribution of $\mathbf{X}$ has heavy tails.
4. EXP. The components of $\mathbf{X}$ are independent, each following an exponential distribution with rate 2. The distribution of $\mathbf{X}$ is skewed, with a heavier tail on the right.
We set $q_0=1{,}000$ and explored values of $q_n$ ranging from 1,000 to 10,000 in increments of 1,000. We set $\beta_i=0.5$, $i=1,\ldots,6$, and employed distinct values of the intercept $\beta_0$ to regulate the event rate: $\beta_0=-6$ for mzNormal (event rate of 2%), $\beta_0=-5$ for mixNormal (2.1%), $\beta_0=-5$ for T3 (1.5%), and $\beta_0=-11$ for EXP (1.3%). A data-generation sketch for the mzNormal configuration is given below.
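The following sketch (ours, with hypothetical names) generates the mzNormal configuration; the other settings differ only in how $\mathbf{X}$ is drawn and in the intercept.

```python
import numpy as np

def gen_mznormal(n=100_000, beta0=-6.0, rng=np.random.default_rng(2)):
    """mzNormal setting: equicorrelated normal covariates and a rare binary outcome."""
    d = 6
    Sigma = np.full((d, d), 0.5) + 0.5 * np.eye(d)     # Sigma_ij = 0.5^{I(i != j)}
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    beta = np.full(d, 0.5)
    p = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))
    D = rng.binomial(1, p)                              # event rate of roughly 2%
    return X, D
```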
5.2.2 Results

The comparison among the different estimators is based on the empirical root mean squared errors (RMSEs) with respect to $\boldsymbol{\beta}^o$ and $\widehat{\boldsymbol{\beta}}_{PL}$, namely
$$B^{-1}\sum_{j=1}^{B}\sqrt{\sum_{i=1}^{6}\big(\widehat{\beta}_i^{(j)}-\beta_i^o\big)^2}\quad\text{and}\quad B^{-1}\sum_{j=1}^{B}\sqrt{\sum_{i=1}^{6}\big(\widehat{\beta}_i^{(j)}-\widehat{\beta}_{PL,i}^{(j)}\big)^2}\,,$$
where $\widehat{\boldsymbol{\beta}}$ represents the relevant estimator, the superscript $(j)$ denotes the $j$th sample, and $B=500$ is the number of repetitions. Fig. 2 shows the RMSEs of the two-step estimators of Algorithm 2 with $\mathbf{p}^A$ and $\mathbf{p}^L$, the full-data MLE, and a one-step estimator with uniform subsampling from the non-case data. Clearly, the optimal subsampling methods outperform uniform subsampling in terms of RMSE, with A-optimal yielding slightly better results than L-optimal, as anticipated. Table 2 presents a comparison of running times. Evidently, the optimal subsampling methods are substantially faster than $\widehat{\boldsymbol{\beta}}_{PL}$ while maintaining low RMSE.
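For concreteness, the RMSE summaries above can be computed as follows (a sketch with hypothetical names):

```python
import numpy as np

def empirical_rmse(beta_hats, target):
    """Mean over replications of the Euclidean distance to the target,
    matching the RMSE definition above.
    beta_hats : (B, 6) array of estimates, one row per repetition
    target    : (6,) vector (beta^o) or (B, 6) array (the full-data PL estimates)
    """
    return np.mean(np.linalg.norm(beta_hats - target, axis=1))
```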
Fig. S2 of the SM demonstrates the validity of the variance estimator (11) and the advantage of optimal subsampling over uniform subsampling. Figure 3(a) demonstrates that Eq. (14) provides a good approximation of the RE based on the actual two-step estimator, thereby supporting the validity of the proposed three-step estimator comprising Steps 1, 1.5, and 2.
To evaluate the utility of $\widetilde{q}_n$ derived from Eq. (16), we compared the empirical and nominal power of testing $H_0:\beta_5=0$ against a two-sided alternative with $\alpha=0.05$, using $\widetilde{q}_n$ and the three-step estimation algorithm of Section 3.1 with Steps 1, 1.5*, and 2. Because some higher values of $\gamma$ led to impractical subsample sizes, meaning that the required power could not be attained even with the entire sample, the coefficient vector $\boldsymbol{\beta}^o$ was modified:
1. mzNormal. $\beta_0=-3.5$ and $\beta_j=0.1$, $j=1,\ldots,6$, with an event rate of 3.2%.
2. mixNormal. $\beta_0=-4.5$ and $\beta_j=0.2$, $j=1,\ldots,6$, with an event rate of 1.3%.
3. T3. $\beta_0=-3$ and $\beta_j=0.15$, $j=1,\ldots,6$, with an event rate of 5%.
4. EXP. $\beta_0=-4$ and $\beta_j=0.15$, $j=1,\ldots,6$, with an event rate of 2.8%.
The results are summarized in Table 3, employing $q_0=1{,}000$ and 5,000 repetitions for each value of $\gamma$. We conclude that $\widetilde{q}_n$ yields power close to the nominal level across all scenarios. In terms of the mean and standard deviation of $\widetilde{q}_n$, the A-optimal approach consistently outperforms the L-optimal.
5.3 Logistic Regression with Nearly Balanced Data

The configurations examined correspond to those outlined in Wang et al. (2018) (Section 5.1): mzNormal, nzNormal, mixNormal, T3, and EXP. The mzNormal setting is not balanced, with an event rate of 0.73. For each scenario, we generated 500 samples, each consisting of 100,000 observations, and $q_0$ was set to 5,000. The results, illustrated in Fig. 3(b), demonstrate strong agreement between the proposed RE estimator (21) and the RE based on the actual two-step Algorithm 3.
Table 4 summarizes the comparison between the empirical and nominal power of testing $H_0:\beta_6=0$ against a two-sided alternative with $\alpha=0.05$, utilizing $\widetilde{q}_n$ and the proposed three-step estimation algorithm of Section 4.2. These results are based on 5,000 repetitions for each configuration, employing a smaller subsample size for Steps 1 and 1.5*, with $q_0$ set to 1,000. Evidently, $\widetilde{q}_n$ provides the desired nominal power, which supports the use of Eq. (22). In terms of the mean and standard deviation of $\widetilde{q}_n$, A-optimal outperforms L-optimal.
The optimal approach of Wang et al. (2018) involves subsampling from both cases and controls and performs well with balanced data. The sensitivity of the optimal subsample size, as defined by Eq. (22), to imbalanced data is illustrated in Fig. 4. The mzNormal setting is explored with varying sample sizes $n=a\times 100{,}000$, $a=1,\ldots,5$, and values of $\beta_0^o=-5,-4,\ldots,-1$, corresponding to event rates of 0.8%, 2%, 5%, 13%, and 28%, respectively. Evidently, Eq. (22) fails to deliver the required power as the event rate decreases, regardless of the sample size. These findings underscore the need for a distinct treatment of the imbalanced setting, as addressed in this work.
6 Survival Analysis of UK Biobank Colorectal Cancer

We conducted an analysis complementing that presented in Keret and Gorfine (2023) and studied the required subsample size based on RE. The event time is defined as the age at colorectal cancer (CRC) diagnosis, while the censoring time is the age at death before CRC diagnosis or the current age without CRC. The analysis encompasses established environmental CRC risk factors, including body mass index (BMI), smoking status (no/yes), family history of CRC (no/yes), physical activity (no/yes), sex (female/male), alcohol consumption (non- or occasional drinker/light frequent drinker/very frequent drinker), education (lower than high school/high school/higher vocational education/college or university graduate/prefer not to answer), NSAID use (none/aspirin or ibuprofen), and post-menopausal hormone use (no/yes). Additionally, 139 single-nucleotide polymorphisms (SNPs) associated with CRC through GWAS (Jeon et al., 2018) were included, along with six principal components to account for population substructure. The SNPs were standardized to have mean zero and unit variance.
Building on the analysis in Keret and Gorfine (2023), a time-dependent effect $\beta(t)$ is essential for sex, due to violation of the proportional hazards assumption. In total, 180 regression coefficients were considered for the model, with 5,342 observed events and 479,343 censored observations. However, introducing time-dependent coefficients partitions each observation into several distinct time-fixed "dummy observations", each with an "entrance" and "exit" time (Therneau et al., 2017). This creates non-overlapping intervals that reconstruct the original time interval, inflating the dataset to approximately 350 million rows with $n_e=5{,}342$. Subsampling is then performed from the censored dummy observations using the reservoir-sampling approach.
We set $c_0=15$ and investigated the RE based on Step 1.5. The results are summarized in Fig. 5(a) and Table 5. Notably, $c=100$ with the L-optimal subsampling approach (i.e., approximately 500K dummy observations instead of nearly 350 million) proves sufficient. However, in subsequent analyses, we also applied our proposed algorithms with $c=40$ and $c=160$ for comparison purposes. Table 6 presents the RMSE of the estimators with respect to the full-data PL estimator, the Frobenius norms of the covariance matrices of the estimators, and their running times. Clearly, the optimal methods outperform uniform subsampling in terms of both RMSE and Frobenius norm, with the A-optimal method consistently exhibiting somewhat better values than the L-optimal method, as expected. While the running time required for the full dataset is 14.5 hours, the time required for the L-optimal method with $c=100$ is reduced to 3.287 hours, with minimal loss of efficiency, as demonstrated in Figures 5(b) and (c). In summary, this analysis, incorporating Step 1.5, highlights the effectiveness of selecting the optimal $q_n$ according to the RE criterion.
7 Linked Birth and Infant Death Data - Logistic Regression
The birth and infant death datasets, sourced from the National Bureau of Economic Research's public-use data archives, combine information from death certificates with the corresponding birth certificates of infants who died before age one in the United States, Puerto Rico, the Virgin Islands, and Guam. This linkage aims to leverage the additional information available in birth certificates, such as age, parents' race, birth weight, period of gestation, plurality, prenatal care usage, maternal education, live birth order, marital status, and maternal smoking, to enable more comprehensive analyses of infant mortality patterns.
The data from years 2007 to 2013 were amalgamated into a single extensive dataset comprising $n=28{,}586{,}919$ rows. From the raw data, a set of features was derived, resulting in a covariate matrix with 103 columns, encompassing 18 interaction terms with sex and 23 interaction terms with birth year. The covariates in the model are summarized in Tables S1–S3 of the SM. The primary outcome of interest is whether an infant passed away before reaching one year of age. Exactly 176,400 deaths were observed, constituting an event rate of about 0.6% and justifying the use of a subsampling algorithm for rare events. The results are summarized in Fig. 6.
In Fig. 6(a), the RE based on Step 1.5 is displayed. Notably, the RE exhibits a distinct 'elbow' around $q_{n}=1{,}500{,}000$, where the RE is also close to 1. We opted for a slightly higher value, $q_{n}=1{,}764{,}000$, with $c=10$, indicating that 10 controls were sampled for each event. To offer a more comprehensive assessment of the algorithm's performance, we include results of analyses with $c=5$ and 25. The approximated RE, varying with $c$, is presented in Table 7. As anticipated, the A-optimal criterion outperforms the L-optimal in terms of RE.
In Fig. 6(b), the running time of the various methods is illustrated as a function of $c$. The effectiveness of optimal subsampling becomes apparent when compared to the full-data MLE. With our chosen $c=10$, the running times for the A and L criteria are 1700 seconds and 628 seconds, respectively, whereas the full-data MLE takes 6484 seconds. It is evident that the additional computational time required for optimal subsampling, as opposed to uniform subsampling, is relatively short, especially for the L method. This outcome reinforces the efficacy of the proposed procedure.
In Fig. 6(c), the RMSE relative to $\widehat{\boldsymbol{\beta}}_{MLE}$ is depicted. The findings validate the judicious selection of $q_{n}$, since increasing $c$ from 5 to 10 substantially reduces the RMSE. However, a further increment to $c=25$ incurs a longer computational time and yields a comparatively modest improvement. Additionally, it is evident that optimal subsampling yields results substantially superior to those obtained through uniform subsampling.
The effectiveness of the optimal subsampling methods over uniform subsampling is also evident in Figs. 6(d) and 6(e). In Figure 6(d), the estimated coefficients of each subsampling method are compared to their $\widehat{\boldsymbol{\beta}}_{MLE}$ counterparts. Optimal subsampling yields results much closer to the full-data estimator than uniform subsampling. Figure 6(e) displays the standard errors of $\widehat{\boldsymbol{\beta}}_{MLE}$ versus the standard errors of their subsampling counterparts. Uniform subsampling produces notably larger standard errors.
We completed the analysis by conducting hypothesis testing, $H_{0}:\beta_{i}=0$ versus $H_{1}:\beta_{i}\neq 0$, $i=1,\dots,103$, with FDR adjustment for multiplicity (Benjamini and Hochberg, 1995). This process was repeated for $c=5,10$ and 25. In Figure 6(f), the total number of rejected hypotheses under each $c$ is presented, contrasted with the number of rejections based on the full-data analysis. Notably, the A-optimal and L-optimal sampling methods outperform uniform sampling. Even with a relatively small subsample size, the optimal sampling estimator yields results highly similar to those of the full data, surpassing the number of rejections achieved with uniform subsampling. For our chosen $c=10$, both A-optimal and L-optimal methods result in rejecting 56 hypotheses, almost matching the 57 rejections of the full-data analysis. In contrast, uniform subsampling at $c=10$ leads to only 37 rejected hypotheses, underscoring the effectiveness of the optimal subsampling.
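For completeness, the FDR adjustment itself is the standard Benjamini–Hochberg step-up procedure applied to the Wald p-values obtained from the subsample estimator and its estimated covariance matrix. A minimal sketch, assuming the p-values have already been computed:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    # Benjamini-Hochberg step-up procedure; returns a boolean rejection mask.
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank satisfying the BH condition
        reject[order[:k + 1]] = True
    return reject
```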
This dataset possesses a noteworthy characteristic: many of its features are rare binary variables. Examples include newborns with congenital anomalies such as anencephaly, spina bifida, omphalocele, and Down's syndrome, alongside rare features related to the mother and the delivery. Additionally, these features exhibit significant correlations with the outcome of interest, namely, death within the first year of life. The optimal subsampling procedures offer a notable advantage over uniform subsampling by assigning larger sampling probabilities to observations with rare features associated with the outcome. Consequently, these observations are more likely to be included in the subsample, leading to lower variance. In Table 8, the 20 rarest features in the data are presented, along with their corresponding proportions in the full dataset and in the subsampling procedures with $c=10$. The results affirm that optimal subsamples better capture observations with rare indicators. These insights shed light on the efficiency of our proposed estimators, elucidating their superiority over uniform subsampling.
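To give a concrete sense of why such rare, outcome-related indicators receive larger probabilities, recall that the L-optimal (mVc) probabilities of Wang et al. (2018) are proportional to $|y_i - p_i(\widehat{\boldsymbol{\beta}}_{pilot})|\,\|\mathbf{x}_i\|$. Restricted to the non-cases ($y_i=0$), as in the rare-events setting, this reduces to $p_i(\widehat{\boldsymbol{\beta}}_{pilot})\,\|\mathbf{x}_i\|$, so controls that look risky under the pilot fit, typically those carrying rare indicators positively associated with death, are sampled more often. The sketch below illustrates this computation; it is a simplified stand-in, not necessarily the exact weights of the proposed procedure.

```python
import numpy as np

def control_sampling_probs(X_controls, beta_pilot):
    # Illustrative L-optimal-style probabilities over the non-cases only:
    # proportional to p_i(beta_pilot) * ||x_i||, since y_i = 0 for controls.
    eta = X_controls @ beta_pilot
    fitted = 1.0 / (1.0 + np.exp(-eta))              # pilot-model probabilities of the event
    scores = fitted * np.linalg.norm(X_controls, axis=1)
    return scores / scores.sum()                      # normalize to sampling probabilities
```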
Regarding the findings derived from the analysis, Tables S4–S6 of the SM present the estimated coefficients of each method with $c=10$. While the results are organized into three tables for clarity, it is essential to note that the FDR procedure was executed once, encompassing all coefficients collectively.
Among the significant results, both the mother's age and the squared mother's age emerged as noteworthy, corroborating established findings on the impact of maternal age on infant mortality (MacDorman et al., 1997; Standfast et al., 1980). This suggests heightened risks associated with motherhood at either a young or an advanced age compared to a medium age.
Other variables demonstrating significance in our analysis, consistent with prior literature, include a lower risk as a function of the number of prenatal visits (Carter et al., 2016), gestational weight gain (Naeve, 1979; Thorsdottir et al., 2002), five-minute Apgar score (Li et al., 2013), and plurality (Ahrens et al., 2017). Conversely, factors known in the literature to increase risk, and affirmed in this study, include live birth order (MacDorman et al., 1997; Modin, 2002), eclampsia (Duley, 2009), and certain congenital malformations linked to infant mortality, such as spina bifida (Pace et al., 2019), omphalocele (Marshall et al., 2015), cleft lip (Carlson et al., 2013), and Down's syndrome (Sadetzki et al., 1999).
Birth year, confined to the years 2007–2013, did not yield a significant effect. Similarly, no distinctions were observed among different months of the year. Treating Sunday as the baseline, negative effects were noted for all days of the week except Saturday, indicating a significant difference between workdays and weekends. Concerning parental racial attributes, significant negative effects were identified for Native American ancestry in both parents and for African ancestry on the father's side. Additionally, an unknown father's race exhibited statistical significance with a negative effect. In contrast to some prior studies (Holmes Jr et al., 2020; Xie et al., 2015), our findings indicate a lower risk of infant mortality following Caesarean section delivery. Regarding interaction terms, five sex-interaction terms (weight gain, five-minute Apgar score, pre-pregnancy-associated hypertension, induction of labor, and cleft lip) and five birth-year-interaction terms (African ancestry of the mother, induction of labor, tocolysis, anencephaly, and Down's syndrome) were found to be statistically significant.
8 Discussion
This study makes significant enhancements to the efficient two-step algorithms proposed by Wang et al. (2018) and Keret and Gorfine (2023). We introduced practical tools for selecting optimal subsample sizes, illustrating their effectiveness through simulations and real-world data. Additionally, we proposed a new subsampling algorithm designed for logistic regression with rare events. This algorithm, which exclusively subsamples among non-cases, demonstrated speed and efficiency compared to full-data maximum-likelihood estimation. Its superiority over uniform subsampling was established in both simulated and real data, as evidenced by lower RMSE and variance. Furthermore, we demonstrated the algorithm's nearly equivalent performance to the full-data estimator in hypothesis testing while significantly reducing computational time.
Similar approaches to those proposed in this study can be extended to other two-step subsampling methods, including algorithms for generalized linear models (Ai et al., 2018), quantile regression (Ai et al., 2021; Fan et al., 2021), and quasi-likelihood regression (Yu et al., 2020).
Datasets with rare events often pose challenges for classification algorithms primarily oriented toward prediction rather than inference. The subsampling-based algorithm proposed for logistic regression with rare events in this study could serve as a practical tool for supplying sampling probabilities in computationally intensive methods. Notably, it may be worth exploring its application in methods like random forests (Breiman, 2001) and gradient boosting (Friedman, 2001), among others.
10 Supplementary Material
Acknowledgments
The work was supported by the Israel Science Foundation (ISF) grant number 767/21 and by a grant from the Tel Aviv University Center for AI and Data Science (TAD).
Conflict of Interest: None declared.
References
Ahrens, K. A., M. E. Thoma, L. M. Rossen, M. Warner, and A. E. Simon (2017). Plurality of birth and infant mortality due to external causes in the United States, 2000–2010. American Journal of Epidemiology 185(5), 335–344.
Ai, M., F. Wang, J. Yu, and H. Zhang (2021). Optimal subsampling for large-scale quantile regression. Journal of Complexity 62, 101512.
Ai, M., J. Yu, H. Zhang, and H. Wang (2018). Optimal subsampling algorithms for big data generalized linear models. arXiv preprint arXiv:1806.06761.
Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57(1), 289–300.
Breiman, L. (2001). Random forests. Machine Learning 45(1), 5–32.
Breslow, N. E. (1972). Contribution to discussion of paper by Dr Cox. Journal of the Royal Statistical Society, Series B 34, 216–217.
Carlson, L., K. W. Hatcher, and R. Vander Burg (2013). Elevated infant mortality rates among oral cleft and isolated oral cleft cases: a meta-analysis of studies from 1943 to 2010. The Cleft Palate-Craniofacial Journal 50(1), 2–12.
Carter, E. B., M. G. Tuuli, A. B. Caughey, A. O. Odibo, G. A. Macones, and A. G. Cahill (2016). Number of prenatal visits and pregnancy outcomes in low-risk women. Journal of Perinatology 36(3), 178–181.
Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological) 34(2), 187–202.
Dhillon, P., Y. Lu, D. P. Foster, and L. Ungar (2013). New subsampling algorithms for fast least squares regression. pp. 360–368.
Duley, L. (2009). The global impact of pre-eclampsia and eclampsia. In Seminars in Perinatology, Volume 33, pp. 130–137. Elsevier.
Fan, Y., Y. Liu, and L. Zhu (2021). Optimal subsampling for linear quantile regression models. Canadian Journal of Statistics.
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 1189–1232.
Gorfine, M., N. Keret, A. Ben Arie, D. Zucker, and L. Hsu (2021). Marginalized frailty-based illness-death model: application to the UK-Biobank survival data. Journal of the American Statistical Association 116(535), 1155–1167.
Holmes Jr, L., L. O'Neill, H. Elmi, C. Chinacherem, C. Comeaux, L. Pelaez, K. W. Dabney, O. Akinola, and M. Enwere (2020). Implication of vaginal and cesarean section delivery method in black–white differentials in infant mortality in the United States: linked birth/infant death records, 2007–2016. International Journal of Environmental Research and Public Health 17(9), 3146.
Jeon, J., M. Du, R. E. Schoen, M. Hoffmeister, P. A. Newcomb, S. I. Berndt, B. Caan, P. T. Campbell, A. T. Chan, J. Chang-Claude, et al. (2018). Determining risk of colorectal cancer and starting age of screening based on lifestyle, environmental, and genetic factors. Gastroenterology 154(8), 2152–2164.
Keret, N. and M. Gorfine (2023). Analyzing big EHR data—optimal Cox regression subsampling procedure with rare events. Journal of the American Statistical Association 118(544), 2262–2275.
Li, F., T. Wu, X. Lei, H. Zhang, M. Mao, and J. Zhang (2013). The Apgar score and infant mortality. PloS One 8(7), e69072.
Ma, P., M. W. Mahoney, and B. Yu (2015). A statistical perspective on algorithmic leveraging. The Journal of Machine Learning Research 16(1), 861–911.
MacDorman, M. F., S. Cnattingius, H. J. Hoffman, M. S. Kramer, and B. Haglund (1997). Sudden infant death syndrome and smoking in the United States and Sweden. American Journal of Epidemiology 146(3), 249–257.
Marshall, J., J. L. Salemi, J. P. Tanner, R. Ramakrishnan, M. L. Feldkamp, L. K. Marengo, R. E. Meyer, C. M. Druschel, R. Rickard, R. S. Kirby, et al. (2015). Prevalence, correlates, and outcomes of omphalocele in the United States, 1995–2005. Obstetrics & Gynecology 126(2), 284–293.
Modin, B. (2002). Birth order and mortality: a life-long follow-up of 14,200 boys and girls born in early 20th century Sweden. Social Science & Medicine 54(7), 1051–1064.
Naeve, R. L. (1979). Weight gain and the outcome of pregnancy. American Journal of Obstetrics and Gynecology 135(1), 3–9.
Pace, N. D., A. M. Siega-Riz, A. F. Olshan, N. C. Chescheir, S. R. Cole, T. A. Desrosiers, S. C. Tinker, A. T. Hoyt, M. A. Canfield, S. L. Carmichael, et al. (2019). Survival of infants with spina bifida and the role of maternal prepregnancy body mass index. Birth Defects Research 111(16), 1205–1216.
Sadetzki, S., A. Chetrit, E. Akstein, O. Luxenburg, L. Keinan, I. Litvak, and B. Modan (1999). Risk factors for infant mortality in Down's syndrome: a nationwide study. Paediatric and Perinatal Epidemiology 13(4), 442–451.
Standfast, S. J., S. Jereb, and D. T. Janerich (1980). The epidemiology of sudden infant death in upstate New York: II: birth characteristics. American Journal of Public Health 70(10), 1061–1067.
Therneau, T., C. Crowson, and E. Atkinson (2017). Using time dependent covariates and time dependent coefficients in the Cox model. Survival Vignettes 2(3), 1–25.
Thorsdottir, I., J. E. Torfadottir, B. E. Birgisdottir, and R. T. Geirsson (2002). Weight gain in women of normal weight before pregnancy: complications in pregnancy or delivery and birth outcome. Obstetrics & Gynecology 99(5), 799–806.
Van der Vaart, A. W. (2000). Asymptotic Statistics, Volume 3. Cambridge University Press.
Wang, H. (2020). Logistic regression for massive data with rare events. In International Conference on Machine Learning, pp. 9829–9836. PMLR.
Wang, H. and Y. Ma (2021). Optimal subsampling for quantile regression in big data. Biometrika 108(1), 99–112.
Wang, H., A. Zhang, and C. Wang (2021). Nonuniform negative sampling and log odds correction with rare events data. In Thirty-Fifth Conference on Neural Information Processing Systems.
Wang, H., R. Zhu, and P. Ma (2018). Optimal subsampling for large sample logistic regression. Journal of the American Statistical Association 113(522), 829–844.
Xie, R.-h., L. Gaudet, D. Krewski, I. D. Graham, M. C. Walker, and S. W. Wen (2015). Higher cesarean delivery rates are associated with higher infant mortality rates in industrialized countries. Birth 42(1), 62–69.
Yang, Z., H. Wang, and J. Yan (2024). Subsampling approach for least squares fitting of semi-parametric accelerated failure time models to massive survival data. Statistics and Computing 34(2), 1–11.
Yu, J., H. Wang, M. Ai, and H. Zhang (2020). Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. Journal of the American Statistical Association, 1–12.
Zuo, L., H. Zhang, H. Wang, and L. Liu (2021). Sampling-based estimation for massive survival data with additive hazards model. Statistics in Medicine 40(2), 441–450.
Setting | Nominal Power | Empirical Power (A) | Empirical Power (L) | Mean (SD) of $\tilde{q}_{n}$ (A) | Mean (SD) of $\tilde{q}_{n}$ (L)
I | 0.80 | 0.812 | 0.794 | 857 (45) | 843 (37)
I | 0.83 | 0.818 | 0.814 | 1027 (61) | 1007 (55)
I | 0.85 | 0.852 | 0.842 | 1175 (76) | 1154 (69)
I | 0.87 | 0.882 | 0.882 | 1381 (97) | 1342 (95)
I | 0.90 | 0.896 | 0.874 | 1856 (179) | 1814 (164)
I | 0.91 | 0.920 | 0.890 | 2098 (212) | 2064 (199)
I | 0.93 | 0.936 | 0.934 | 2901 (425) | 2866 (379)
I | 0.95 | 0.952 | 0.960 | 5179 (1550) | 4922 (1074)
II | 0.80 | 0.774 | 0.794 | 1112 (36) | 1407 (48)
II | 0.83 | 0.804 | 0.812 | 1301 (50) | 1652 (63)
II | 0.85 | 0.834 | 0.854 | 1470 (57) | 1862 (74)
II | 0.87 | 0.882 | 0.844 | 1681 (74) | 2122 (92)
II | 0.90 | 0.892 | 0.906 | 2155 (112) | 2714 (142)
II | 0.91 | 0.916 | 0.902 | 2369 (122) | 2997 (163)
II | 0.93 | 0.932 | 0.908 | 3046 (208) | 3849 (247)
II | 0.95 | 0.940 | 0.930 | 4361 (418) | 5533 (496)
III | 0.80 | 0.774 | 0.782 | 1640 (49) | 2677 (81)
III | 0.83 | 0.824 | 0.784 | 1911 (62) | 3132 (108)
III | 0.85 | 0.844 | 0.804 | 2148 (74) | 3516 (123)
III | 0.87 | 0.858 | 0.860 | 2449 (94) | 3999 (156)
III | 0.90 | 0.894 | 0.876 | 3105 (135) | 5103 (216)
III | 0.91 | 0.886 | 0.874 | 3420 (168) | 5603 (253)
III | 0.93 | 0.894 | 0.934 | 4289 (234) | 7071 (375)
III | 0.95 | 0.942 | 0.930 | 6078 (412) | 9872 (675)
Setting | MLE | L | A | Uniform
mzNormal | 0.815 (0.243) | 0.112 (0.058) | 0.116 (0.055) | 0.047 (0.030)
mixNormal | 0.779 (0.123) | 0.101 (0.045) | 0.111 (0.052) | 0.041 (0.029)
T3 | 0.740 (0.133) | 0.102 (0.042) | 0.113 (0.047) | 0.040 (0.029)
EXP | 1.084 (0.303) | 0.120 (0.036) | 0.130 (0.052) | 0.048 (0.026)
Setting | Nominal Power | Empirical Power (A) | Empirical Power (L) | Mean (SD) of $\tilde{q}_{n}$ (A) | Mean (SD) of $\tilde{q}_{n}$ (L)
mzNormal | 0.80 | 0.804 | 0.799 | 1495 (91) | 1648 (107)
mzNormal | 0.83 | 0.833 | 0.824 | 1715 (113) | 1893 (129)
mzNormal | 0.85 | 0.843 | 0.849 | 1900 (127) | 2093 (149)
mzNormal | 0.87 | 0.857 | 0.861 | 2122 (148) | 2345 (175)
mzNormal | 0.90 | 0.900 | 0.887 | 2591 (196) | 2865 (230)
mzNormal | 0.93 | 0.921 | 0.931 | 3369 (296) | 3718 (352)
mzNormal | 0.95 | 0.948 | 0.941 | 4317 (442) | 4758 (524)
mixNormal | 0.80 | 0.792 | 0.786 | 821 (58) | 911 (69)
mixNormal | 0.83 | 0.827 | 0.815 | 958 (74) | 1061 (85)
mixNormal | 0.85 | 0.844 | 0.846 | 1074 (88) | 1191 (102)
mixNormal | 0.87 | 0.863 | 0.864 | 1223 (108) | 1353 (128)
mixNormal | 0.90 | 0.899 | 0.900 | 1542 (156) | 1711 (183)
mixNormal | 0.93 | 0.930 | 0.935 | 2123 (265) | 2364 (322)
mixNormal | 0.95 | 0.947 | 0.946 | 2957 (473) | 3289 (585)
T3 | 0.80 | 0.797 | 0.798 | 1175 (168) | 1347 (214)
T3 | 0.83 | 0.819 | 0.821 | 1334 (193) | 1536 (247)
T3 | 0.85 | 0.842 | 0.843 | 1467 (221) | 1686 (284)
T3 | 0.87 | 0.853 | 0.850 | 1625 (250) | 1865 (318)
T3 | 0.90 | 0.890 | 0.891 | 1947 (319) | 2243 (404)
T3 | 0.93 | 0.926 | 0.918 | 2449 (421) | 2820 (548)
T3 | 0.95 | 0.947 | 0.951 | 3021 (558) | 3475 (725)
EXP | 0.80 | 0.794 | 0.791 | 1263 (225) | 1264 (214)
EXP | 0.83 | 0.828 | 0.831 | 1449 (259) | 1458 (256)
EXP | 0.85 | 0.843 | 0.841 | 1613 (301) | 1624 (296)
EXP | 0.87 | 0.863 | 0.857 | 1812 (346) | 1826 (351)
EXP | 0.90 | 0.892 | 0.891 | 2225 (477) | 2249 (481)
EXP | 0.93 | 0.925 | 0.920 | 2913 (742) | 2924 (727)
EXP | 0.95 | 0.943 | 0.955 | 3814 (1162) | 3826 (1182)
Setting | Nominal Power | Empirical Power (A) | Empirical Power (L) | Mean (SD) of $\tilde{q}_{n}$ (A) | Mean (SD) of $\tilde{q}_{n}$ (L)
mzNormal | 0.80 | 0.806 | 0.820 | 4126 (230) | 4402 (241)
mzNormal | 0.83 | 0.854 | 0.818 | 4513 (253) | 4811 (269)
mzNormal | 0.85 | 0.840 | 0.822 | 4831 (266) | 5111 (285)
mzNormal | 0.87 | 0.872 | 0.860 | 5114 (271) | 5461 (292)
mzNormal | 0.89 | 0.880 | 0.872 | 5506 (307) | 5883 (330)
mzNormal | 0.91 | 0.900 | 0.924 | 5952 (324) | 6339 (352)
mzNormal | 0.93 | 0.928 | 0.922 | 6502 (353) | 6945 (376)
mzNormal | 0.95 | 0.958 | 0.940 | 7219 (415) | 7690 (404)
nzNormal | 0.80 | 0.784 | 0.752 | 4169 (263) | 4921 (292)
nzNormal | 0.83 | 0.820 | 0.856 | 4604 (297) | 5392 (353)
nzNormal | 0.85 | 0.866 | 0.852 | 4855 (319) | 5706 (348)
nzNormal | 0.87 | 0.874 | 0.880 | 5208 (326) | 6094 (382)
nzNormal | 0.89 | 0.904 | 0.880 | 5582 (368) | 6591 (410)
nzNormal | 0.91 | 0.910 | 0.890 | 6049 (398) | 7092 (415)
nzNormal | 0.93 | 0.920 | 0.920 | 6597 (420) | 7727 (477)
nzNormal | 0.95 | 0.952 | 0.942 | 7360 (472) | 8587 (535)
mixNormal | 0.80 | 0.842 | 0.812 | 8682 (467) | 9160 (443)
mixNormal | 0.83 | 0.810 | 0.836 | 9514 (481) | 9952 (534)
mixNormal | 0.85 | 0.864 | 0.828 | 10120 (548) | 10575 (554)
mixNormal | 0.87 | 0.856 | 0.864 | 10866 (608) | 11367 (595)
mixNormal | 0.89 | 0.874 | 0.900 | 11623 (603) | 12181 (608)
mixNormal | 0.91 | 0.926 | 0.912 | 12565 (624) | 13219 (710)
mixNormal | 0.93 | 0.930 | 0.932 | 13777 (665) | 14396 (753)
mixNormal | 0.95 | 0.956 | 0.918 | 15286 (787) | 16076 (857)
T3 | 0.80 | 0.816 | 0.780 | 11900 (1534) | 12899 (1668)
T3 | 0.83 | 0.822 | 0.804 | 12966 (1659) | 14200 (1843)
T3 | 0.85 | 0.838 | 0.838 | 13940 (1741) | 15171 (2158)
T3 | 0.87 | 0.854 | 0.850 | 14698 (1917) | 16340 (2112)
T3 | 0.89 | 0.880 | 0.874 | 16012 (1959) | 17557 (2326)
T3 | 0.91 | 0.912 | 0.922 | 17265 (2199) | 19050 (2485)
T3 | 0.93 | 0.928 | 0.916 | 18837 (2371) | 20551 (2805)
T3 | 0.95 | 0.928 | 0.924 | 20634 (2660) | 23076 (3047)
exp | 0.80 | 0.804 | 0.808 | 7526 (832) | 7661 (832)
exp | 0.83 | 0.834 | 0.800 | 8200 (829) | 8396 (918)
exp | 0.85 | 0.850 | 0.856 | 8652 (940) | 8957 (1010)
exp | 0.87 | 0.872 | 0.858 | 9281 (997) | 9463 (1089)
exp | 0.89 | 0.906 | 0.902 | 9938 (1057) | 10190 (1102)
exp | 0.91 | 0.890 | 0.912 | 10725 (1162) | 11027 (1206)
exp | 0.93 | 0.932 | 0.910 | 11786 (1217) | 12009 (1230)
exp | 0.95 | 0.946 | 0.950 | 13195 (1380) | 13385 (1369)
c | $q_{n}$ | A | L | Uniform
40 | 213,680 | 1.0076 | 1.0316 | 1.1091
60 | 320,520 | 1.0051 | 1.0206 | 1.0725
80 | 427,360 | 1.0038 | 1.0154 | 1.0543
100 | 534,200 | 1.0031 | 1.0123 | 1.0434
120 | 641,040 | 1.0025 | 1.0103 | 1.0361
140 | 747,880 | 1.0022 | 1.0088 | 1.0309
160 | 854,720 | 1.0019 | 1.0077 | 1.0271
180 | 961,560 | 1.0017 | 1.0068 | 1.0240
200 | 1,068,400 | 1.0015 | 1.0062 | 1.0216
RMSE with respect to $\boldsymbol{\widehat{\beta}}_{PL}$ ($\times 100$), Frobenius norm of the covariance matrix ($\times 100$), and computation time in hours:
c | RMSE (A) | RMSE (L) | RMSE (Uniform) | Frobenius (A) | Frobenius (L) | Frobenius (Uniform) | Time (A) | Time (L) | Time (Uniform)
40 | 6.308 | 7.627 | 11.186 | 2.387 | 2.457 | 2.649 | 5.927 | 2.911 | 0.306
100 | 3.033 | 4.356 | 5.690 | 2.343 | 2.374 | 2.447 | 5.870 | 3.287 | 0.387
160 | 2.528 | 3.225 | 7.201 | 2.338 | 2.354 | 2.398 | 5.847 | 3.254 | 0.428
c | $q_{n}$ | A | L
5 | 882,000 | 1.023 | 1.180
10 | 1,764,000 | 1.013 | 1.090
25 | 4,410,000 | 1.005 | 1.036
Coefficient | Full-sample proportion | Subsample proportion (A) | Subsample proportion (L) | Ratio (A) | Ratio (L)
Anencephaly = no | 0.00011 | 0.03529 | 0.00416 | 309.41689 | 36.52090
Spina Bifida = no | 0.00016 | 0.03385 | 0.00227 | 206.53149 | 13.85638
Omphalocele = no | 0.00038 | 0.04523 | 0.00497 | 118.48022 | 13.02339
Down's syndrome = no | 0.00048 | 0.03536 | 0.00710 | 73.33574 | 14.73338
Cleft lip = no | 0.00072 | 0.04208 | 0.00908 | 58.72595 | 12.66362
Residence status = 4 | 0.00190 | 0.00927 | 0.00118 | 4.88057 | 0.62083
Eclampsia = no | 0.00253 | 0.04124 | 0.00754 | 16.28528 | 2.97785
Attendant = other midwife | 0.00638 | 0.01121 | 0.00387 | 1.75569 | 0.60659
Attendant = other | 0.00654 | 0.02102 | 0.01575 | 3.21212 | 2.40667
Forceps delivery = no | 0.00663 | 0.03397 | 0.00433 | 5.12472 | 0.65301
Father's race = American Indian | 0.00870 | 0.02241 | 0.01109 | 2.57516 | 1.27423
Mother's race = American Indian | 0.01165 | 0.03025 | 0.01596 | 2.59707 | 1.37015
Birth place = not in hospital | 0.01205 | 0.02719 | 0.01835 | 2.25730 | 1.52346
Tocolysis = no | 0.01211 | 0.05251 | 0.03946 | 4.33686 | 3.25930
Chronic hypertension = no | 0.01325 | 0.04522 | 0.03319 | 3.41423 | 2.50554
Residence status = 3 | 0.02119 | 0.04346 | 0.03647 | 2.05113 | 1.72109
Precipitous labor = no | 0.02471 | 0.04615 | 0.03853 | 1.86780 | 1.55949
Vacuum delivery = no | 0.03009 | 0.03916 | 0.01360 | 1.30158 | 0.45214
Prepregnancy-associated hypertension = no | 0.04275 | 0.06545 | 0.05763 | 1.53100 | 1.34794
Meconium = no | 0.04736 | 0.05964 | 0.04199 | 1.25928 | 0.88652
S1 Additional Technical Details
The following functions are required for $\widetilde{\mathbb{V}}_{\widetilde{\boldsymbol{\beta}}}(\textbf{p},\widehat{\boldsymbol{\beta}})$:
\[
\widetilde{\boldsymbol{\mathcal{I}}}(\boldsymbol{\beta})=\frac{1}{n}\frac{\partial^{2}l^{*}(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}^{T}\partial\boldsymbol{\beta}}=-\frac{1}{n}\int_{0}^{\tau}\left\{\frac{\mathbf{S}_{w}^{(2)}(\boldsymbol{\beta},t)}{S_{w}^{(0)}(\boldsymbol{\beta},t)}-\left(\frac{\mathbf{S}_{w}^{(1)}(\boldsymbol{\beta},t)}{S_{w}^{(0)}(\boldsymbol{\beta},t)}\right)\left(\frac{\mathbf{S}_{w}^{(1)}(\boldsymbol{\beta},t)}{S_{w}^{(0)}(\boldsymbol{\beta},t)}\right)^{T}\right\}dN_{\cdot}(t)
\]
and
\[
\widetilde{\boldsymbol{\varphi}}(\mathbf{p},\boldsymbol{\beta})=\frac{1}{n^{2}}\left\{\frac{1}{q}\sum_{i\in\mathcal{Q}\setminus\mathcal{E}}\frac{\widetilde{\mathbf{a}}_{i}(\boldsymbol{\beta})\widetilde{\mathbf{a}}_{i}(\boldsymbol{\beta})^{T}}{p_{i}^{2}}-\frac{1}{q^{2}}\sum_{i\in\mathcal{Q}\setminus\mathcal{E}}\frac{\widetilde{\mathbf{a}}_{i}(\boldsymbol{\beta})}{p_{i}}\left(\sum_{i\in\mathcal{Q}\setminus\mathcal{E}}\frac{\widetilde{\mathbf{a}}_{i}(\boldsymbol{\beta})}{p_{i}}\right)^{T}\right\}
\]
where
\[
\widetilde{\mathbf{a}}_{i}(\boldsymbol{\beta})=\int_{0}^{\tau}\left\{\mathbf{X}_{i}-\frac{\mathbf{S}_{w}^{(1)}(\boldsymbol{\beta},t)}{S_{w}^{(0)}(\boldsymbol{\beta},t)}\right\}\frac{Y_{i}(t)e^{\boldsymbol{\beta}^{T}\mathbf{X}_{i}}}{S_{w}^{(0)}(\boldsymbol{\beta},t)}\,dN_{\cdot}(t)\,.
\]
The following functions are required for $\widehat{RE}(q_{n})$:
\[
\widetilde{\boldsymbol{\mathcal{I}}}_{Q_{1.5}}(\boldsymbol{\beta})=\frac{1}{n}\frac{\partial^{2}l^{*}(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}^{T}\partial\boldsymbol{\beta}}=-\frac{1}{n}\int_{0}^{\tau}\left\{\frac{\mathbf{S}_{w,Q_{1.5}}^{(2)}(\boldsymbol{\beta},t)}{S_{w,Q_{1.5}}^{(0)}(\boldsymbol{\beta},t)}-\left(\frac{\mathbf{S}_{w,Q_{1.5}}^{(1)}(\boldsymbol{\beta},t)}{S_{w,Q_{1.5}}^{(0)}(\boldsymbol{\beta},t)}\right)\left(\frac{\mathbf{S}_{w,Q_{1.5}}^{(1)}(\boldsymbol{\beta},t)}{S_{w,Q_{1.5}}^{(0)}(\boldsymbol{\beta},t)}\right)^{T}\right\}dN_{\cdot}(t)
\]
and
\[
\widetilde{\boldsymbol{\varphi}}_{Q_{1.5}}(\mathbf{p},\boldsymbol{\beta})=\frac{1}{n^{2}}\left\{\frac{1}{q}\sum_{i\in\mathcal{Q}_{1.5}\setminus\mathcal{E}}\frac{\widetilde{\mathbf{a}}_{i}(\boldsymbol{\beta})\widetilde{\mathbf{a}}_{i}(\boldsymbol{\beta})^{T}}{p_{i}^{2}}-\frac{1}{q^{2}}\sum_{i\in\mathcal{Q}_{1.5}\setminus\mathcal{E}}\frac{\widetilde{\mathbf{a}}_{i}(\boldsymbol{\beta})}{p_{i}}\left(\sum_{i\in\mathcal{Q}_{1.5}\setminus\mathcal{E}}\frac{\widetilde{\mathbf{a}}_{i}(\boldsymbol{\beta})}{p_{i}}\right)^{T}\right\}
\]
where
\[
\widetilde{\mathbf{a}}_{i}(\boldsymbol{\beta})=\int_{0}^{\tau}\left\{\mathbf{X}_{i}-\frac{\mathbf{S}_{w,Q_{1.5}}^{(1)}(\boldsymbol{\beta},t)}{S_{w,Q_{1.5}}^{(0)}(\boldsymbol{\beta},t)}\right\}\frac{Y_{i}(t)e^{\boldsymbol{\beta}^{T}\mathbf{X}_{i}}}{S_{w,Q_{1.5}}^{(0)}(\boldsymbol{\beta},t)}\,dN_{\cdot}(t)\,,
\]
\[
\mathbf{S}_{w,Q_{1.5}}^{(k)}(\boldsymbol{\beta},t)=\sum_{i\in\mathcal{Q}_{1.5}}w_{i}e^{\boldsymbol{\beta}^{T}\mathbf{X}_{i}}Y_{i}(t)\mathbf{X}_{i}^{\otimes k}\,,\qquad k=0,1,2\,,
\]
and
\[
w_{i}=\begin{cases}(p_{i}q_{0})^{-1} & \text{if }\Delta_{i}=0,\;p_{i}>0\\ 1 & \text{if }\Delta_{i}=1\end{cases}\qquad i=1,\dots,n\,.
\]
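These weights are simple to construct in practice. Below is a minimal sketch, assuming `delta` is the event indicator, `p` holds the sampling probabilities of the censored rows, `sampled` marks the censored rows drawn into the subsample, and `q0` is the number of censored rows sampled; the names are illustrative and not taken from the paper's software.

```python
import numpy as np

def subsample_weights(delta, p, sampled, q0):
    # w_i = 1 for events; w_i = (p_i * q0)^{-1} for sampled censored rows with p_i > 0.
    w = np.zeros(len(delta))
    w[delta == 1] = 1.0
    cens = (delta == 0) & sampled & (p > 0)
    w[cens] = 1.0 / (p[cens] * q0)
    return w
```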
S2 Logistic Regression - Assumptions and Proofs
The $\mathbf{X}_{i}$'s are assumed to be independent and identically distributed, and the following additional assumptions are required for the asymptotic results:
A.1 As $n\rightarrow\infty$, $n^{-1}\sum_{i=1}^{n}\|\mathbf{X}_{i}\|^{3}=O_{P}(1)$ and $\mathbf{M}_{X}(\boldsymbol{\beta}^{o})$ converges in probability to a positive-definite matrix $\boldsymbol{\Sigma}(\boldsymbol{\beta}^{o})$, where
\[
\mathbf{M}_{X}(\boldsymbol{\beta})=n^{-1}\sum_{i=1}^{n}p_{i}(\boldsymbol{\beta})\big(1-p_{i}(\boldsymbol{\beta})\big)\mathbf{X}_{i}\mathbf{X}_{i}^{T}.
\]
A.2 $n^{-2}\sum_{i=1}^{n}\pi_{i}^{-1}\|\mathbf{X}_{i}\|^{k}=O_{P}(1)$ for $k=2,4$.
A.3 There exists some $\delta>0$ such that $n^{-2+\delta}\sum_{i=1}^{n}\pi_{i}^{-1-\delta}\|\mathbf{X}_{i}\|^{2+\delta}=O_{P}(1)$.
A.4 $q_{n}/n$ and $(n-n_{0})/n$ converge to small positive constants as $q_{n},n\rightarrow\infty$.
The first three assumptions are essentially general moment conditions (Wang et al., 2018). Assumption A.4 implies that the event rate in the subsample converges to a positive constant as $n$ goes to infinity. A small numerical companion to the matrix $\mathbf{M}_{X}$ of Assumption A.1 is given below.
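The matrix $\mathbf{M}_{X}(\boldsymbol{\beta})$ is simply a weighted Gram matrix and can be computed directly; the sketch below assumes `X` is the $n\times d$ covariate matrix and `beta` a coefficient vector (names are illustrative).

```python
import numpy as np

def M_X(X, beta):
    # M_X(beta) = n^{-1} * sum_i p_i(beta) * (1 - p_i(beta)) * x_i x_i^T
    eta = X @ beta
    p = 1.0 / (1.0 + np.exp(-eta))
    w = p * (1.0 - p)
    return (X * w[:, None]).T @ X / X.shape[0]
```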
S2.1 Proof of Theorem 3.1
This proof follows a derivation similar to that of Keret and Gorfine (2023). Wang et al. (2018) have already shown that $\widetilde{\boldsymbol{\beta}}$ is consistent for $\widehat{\boldsymbol{\beta}}_{MLE}$ in the conditional space, given $\mathcal{F}_{n}$. We begin by extending this result and showing that $\widetilde{\boldsymbol{\beta}}$ is consistent for $\boldsymbol{\beta}^{o}$ in the unconditional space. Based on Theorem 1 of Wang et al. (2018), for any $\epsilon>0$,
\[
\lim_{q_{n},n\rightarrow\infty}\Pr\left(\|\widetilde{\boldsymbol{\beta}}-\widehat{\boldsymbol{\beta}}_{MLE}\|_{2}>\epsilon\,\big|\,\mathcal{F}_{n}\right)=0\,.
\]
In the unconditional probability space, $\Pr(\|\widetilde{\boldsymbol{\beta}}-\widehat{\boldsymbol{\beta}}_{MLE}\|_{2}>\epsilon\mid\mathcal{F}_{n})$ is itself a random variable. Hence, denote it by $\zeta_{n,q_{n}}$, and it follows that
\[
\Pr\left(\lim_{q_{n},n\rightarrow\infty}\zeta_{n,q_{n}}=0\right)=1,
\]
in the sense that $\zeta_{n,q_{n}}\xrightarrow{a.s.}0$ as $q_{n},n\rightarrow\infty$. Then, for any $\epsilon>0$,
\begin{equation}
\lim_{q_{n},n\rightarrow\infty}\Pr(\|\widetilde{\boldsymbol{\beta}}-\widehat{\boldsymbol{\beta}}_{MLE}\|_{2}>\epsilon)=\lim_{q_{n},n\rightarrow\infty}E(\zeta_{n,q_{n}})=E\left(\lim_{q_{n},n\rightarrow\infty}\zeta_{n,q_{n}}\right)=0\,,
\tag{S.23}
\end{equation}
where the interchange of expectation and limit is allowed by the dominated convergence theorem, since $\zeta_{n,q_{n}}$ is trivially bounded by 1. Next, we write
\begin{align*}
\Pr(\|\widetilde{\boldsymbol{\beta}}-\boldsymbol{\beta}^{o}\|_{2}>\epsilon)
&=\Pr(\|\widetilde{\boldsymbol{\beta}}+\widehat{\boldsymbol{\beta}}_{MLE}-\widehat{\boldsymbol{\beta}}_{MLE}-\boldsymbol{\beta}^{o}\|_{2}>\epsilon)\\
&\leq\Pr(\|\widetilde{\boldsymbol{\beta}}-\widehat{\boldsymbol{\beta}}_{MLE}\|_{2}+\|\widehat{\boldsymbol{\beta}}_{MLE}-\boldsymbol{\beta}^{o}\|_{2}>\epsilon)\\
&\leq\Pr\big(\{\|\widetilde{\boldsymbol{\beta}}-\widehat{\boldsymbol{\beta}}_{MLE}\|_{2}>\epsilon/2\}\cup\{\|\widehat{\boldsymbol{\beta}}_{MLE}-\boldsymbol{\beta}^{o}\|_{2}>\epsilon/2\}\big)\\
&\leq\Pr(\|\widetilde{\boldsymbol{\beta}}-\widehat{\boldsymbol{\beta}}_{MLE}\|_{2}>\epsilon/2)+\Pr(\|\widehat{\boldsymbol{\beta}}_{MLE}-\boldsymbol{\beta}^{o}\|_{2}>\epsilon/2)\,.
\end{align*}
Taking limits on both sides yields
\begin{align*}
\lim_{q_{n},n\rightarrow\infty}\Pr(\|\widetilde{\boldsymbol{\beta}}-\boldsymbol{\beta}^{o}\|_{2}>\epsilon)
&\leq\lim_{q_{n},n\rightarrow\infty}\Pr(\|\widetilde{\boldsymbol{\beta}}-\widehat{\boldsymbol{\beta}}_{MLE}\|_{2}>\epsilon/2)+\lim_{q_{n},n\rightarrow\infty}\Pr(\|\widehat{\boldsymbol{\beta}}_{MLE}-\boldsymbol{\beta}^{o}\|_{2}>\epsilon/2)=0\,,
\end{align*}
where the first addend is 0 due to Equation (S.23) and the second addend is 0 by the well-known properties of the logistic regression MLE. We therefore conclude that
\[
\lim_{q_{n},n\rightarrow\infty}\Pr\big(\|\widetilde{\boldsymbol{\beta}}-\boldsymbol{\beta}^{o}\|_{2}>\epsilon\big)=0\,.
\]
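For illustration only, the following minimal sketch (not part of the formal argument, and not the authors' code) mimics the setting of this proof: all events are retained, the non-events are subsampled with replacement, and the weighted pseudo-likelihood with weights $w_{i}=1/(q_{n}\pi_{i})$ is maximized. The sampling probabilities (uniform over the non-events), the sample sizes, and the true $\boldsymbol{\beta}^{o}$ are arbitrary illustrative choices; as $n$ and $q_{n}$ grow, the printed estimates should concentrate around $\boldsymbol{\beta}^{o}$, in line with the consistency result just established.

```python
# A minimal sketch (illustrative only): weighted subsample logistic regression
# keeping all events and subsampling the non-events with replacement.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)

def weighted_neg_loglik(beta, X, D, w):
    # Weighted logistic negative log-likelihood l*(beta) and its gradient.
    eta = X @ beta
    nll = -np.sum(w * (D * eta - np.logaddexp(0.0, eta)))
    grad = -X.T @ (w * (D - expit(eta)))
    return nll, grad

def subsample_estimate(X, D, q_n):
    events = np.flatnonzero(D == 1)                     # set E: all events are kept
    nonevents = np.flatnonzero(D == 0)                  # set N: subsampled
    pi = np.full(nonevents.size, 1.0 / nonevents.size)  # uniform pi_i (illustrative choice)
    pos = rng.choice(nonevents.size, size=q_n, p=pi)    # q_n draws with replacement from N
    idx = np.concatenate([events, nonevents[pos]])
    w = np.concatenate([np.ones(events.size),           # w_i = 1 for events
                        1.0 / (q_n * pi[pos])])         # w_i = 1/(q_n pi_i) for sampled non-events
    res = minimize(weighted_neg_loglik, np.zeros(X.shape[1]),
                   args=(X[idx], D[idx], w), jac=True, method="BFGS")
    return res.x

beta_o = np.array([-5.0, 1.0, -0.5])                    # rare-event regime (hypothetical values)
for n, q_n in [(10_000, 500), (100_000, 5_000)]:
    X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
    D = rng.binomial(1, expit(X @ beta_o))
    print(n, q_n, np.round(subsample_estimate(X, D, q_n), 2))
```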
Similarly to Eq. (S.12) in Wang et al. (2018), a Taylor expansion of the subsample-based pseudo-score function evaluated at $\widetilde{\boldsymbol{\beta}}$ around $\boldsymbol{\beta}^{o}$, instead of $\widehat{\boldsymbol{\beta}}_{MLE}$, gives
\begin{equation}
\widetilde{\boldsymbol{\beta}}-\boldsymbol{\beta}^{o}=-\widetilde{\mathbf{M}}_{X}^{-1}(\boldsymbol{\beta}^{o})\bigg\{\frac{1}{n}\frac{\partial l^{*}(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{T}}+o_{P}\big(\|\widetilde{\boldsymbol{\beta}}-\boldsymbol{\beta}^{o}\|\big)\bigg\}\,,
\tag{S.24}
\end{equation}
where the consistency of $\widetilde{\mathbf{M}}_{\mathbf{X}}(\widetilde{\boldsymbol{\beta}})$ for $\mathbf{M}_{\mathbf{X}}(\boldsymbol{\beta}^{o})$ is derived in a manner similar to the proof of the consistency of $\widetilde{\boldsymbol{\beta}}$ for $\boldsymbol{\beta}^{o}$, based on Eq. (S.1) of Wang et al. (2018).
Denote by $R_{i}$ the number of times observation $i$ appears in the subsample. Then,
\begin{align}
\frac{\partial l^{*}(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}^{T}}
&=\sum_{i\in\mathcal{Q}}w_{i}^{*}\big(D_{i}^{*}-\mu_{i}^{*}(\boldsymbol{\beta})\big)\mathbf{X}_{i}\tag{S.25}\\
&=\sum_{i\in\mathcal{E}}w_{i}\big(1-\mu_{i}(\boldsymbol{\beta})\big)\mathbf{X}_{i}-\sum_{i\in\{\mathcal{Q}\setminus\mathcal{E}\}}w_{i}\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}\nonumber\\
&=\sum_{i\in\mathcal{E}}\big(1-\mu_{i}(\boldsymbol{\beta})\big)\mathbf{X}_{i}-\sum_{i\in\{\mathcal{Q}\setminus\mathcal{E}\}}w_{i}\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}\nonumber\\
&=\sum_{i\in\mathcal{E}}\big(1-\mu_{i}(\boldsymbol{\beta})\big)\mathbf{X}_{i}-\sum_{i\in\{\mathcal{Q}\setminus\mathcal{E}\}}w_{i}\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}-\sum_{i\in\mathcal{N}}\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}+\sum_{i\in\mathcal{N}}\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}\nonumber\\
&=\sum_{i=1}^{n}\big(D_{i}-\mu_{i}(\boldsymbol{\beta})\big)\mathbf{X}_{i}-\sum_{i\in\{\mathcal{Q}\setminus\mathcal{E}\}}w_{i}\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}+\sum_{i\in\mathcal{N}}\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}\nonumber\\
&=\sum_{i=1}^{n}\big(D_{i}-\mu_{i}(\boldsymbol{\beta})\big)\mathbf{X}_{i}-\sum_{i\in\mathcal{N}}R_{i}w_{i}\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}+\sum_{i\in\mathcal{N}}\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}\nonumber\\
&=\sum_{i=1}^{n}\big(D_{i}-\mu_{i}(\boldsymbol{\beta})\big)\mathbf{X}_{i}+\sum_{i\in\mathcal{N}}(1-w_{i}R_{i})\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}\nonumber\\
&=\frac{\partial l(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}^{T}}+\sum_{i=1}^{n}(1-w_{i}R_{i})\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}\,.\nonumber
\end{align}
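The key implication of (S.25) is that the subsample pseudo-score equals the full-data score plus a correction term whose conditional mean given the data is zero, since $E(R_{i}\,|\,\mathcal{F}_{n})=q_{n}\pi_{i}$ and $w_{i}=1/(q_{n}\pi_{i})$. The following short Monte Carlo sketch (with arbitrary illustrative values, not from the paper) checks this numerically.

```python
# Sketch: the correction term in (S.25) has conditional mean zero given the data,
# because E(R_i | F_n) = q_n * pi_i and w_i = 1/(q_n * pi_i).  All values illustrative.
import numpy as np

rng = np.random.default_rng(1)
m, q_n, p = 2_000, 200, 3                       # m = |N| non-events
X = rng.standard_normal((m, p))                 # covariates of the non-events
mu = 1.0 / (1.0 + np.exp(-(X @ np.array([0.5, -1.0, 0.2]) - 3.0)))  # illustrative mu_i(beta)
pi = mu / mu.sum()                              # any sampling distribution over N
w = 1.0 / (q_n * pi)

reps, corr = 5_000, np.zeros(p)
for _ in range(reps):
    R = rng.multinomial(q_n, pi)                # R_i: number of times i appears in the subsample
    corr += ((1.0 - w * R) * mu) @ X
print(corr / reps)                              # approximately the zero vector
```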
Based on Eqs. (S.24) and (S.25), we conclude that
\begin{equation}
\sqrt{n}(\widetilde{\boldsymbol{\beta}}-\boldsymbol{\beta}^{o})=-\widetilde{\mathbf{M}}_{X}^{-1}(\boldsymbol{\beta}^{o})\frac{1}{\sqrt{n}}\frac{\partial l(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{T}}-\widetilde{\mathbf{M}}_{X}^{-1}(\boldsymbol{\beta}^{o})\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(1-w_{i}R_{i})\mu_{i}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}+o_{P}\big(\sqrt{n}\|\widetilde{\boldsymbol{\beta}}-\boldsymbol{\beta}^{o}\|_{2}\big)\,.
\tag{S.26}
\end{equation}
Now it will be shown that $n^{-1/2}\partial l(\boldsymbol{\beta}^{o})/\partial\boldsymbol{\beta}^{T}$ and $n^{-1/2}\sum_{i=1}^{n}(1-w_{i}R_{i})\mu_{i}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}$ are asymptotically independent and that each of them is asymptotically normal. From the asymptotic theory of standard logistic regression,
\[
-\mathbf{M}_{X}^{-1/2}(\boldsymbol{\beta}^{o})\frac{1}{\sqrt{n}}\frac{\partial l(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{T}}\xrightarrow{D}N(0,\mathbf{I})\,,
\]
and
\begin{equation}
\frac{1}{\sqrt{n}}\frac{\partial l(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{T}}\xrightarrow{D}N\big(0,\boldsymbol{\Sigma}(\boldsymbol{\beta}^{o})\big)\,.
\tag{S.27}
\end{equation}
Also, $(\sqrt{q_{n}}/n)\sum_{i\in\mathcal{N}}w_{i}R_{i}\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}$ can alternatively be expressed as a sum of independent and identically distributed terms in the conditional space, namely
\begin{align*}
\frac{\sqrt{q_{n}}}{n}\sum_{i\in\mathcal{N}}w_{i}R_{i}\mu_{i}(\boldsymbol{\beta})\mathbf{X}_{i}
&=\frac{\sqrt{q_{n}}}{n}\sum_{i=1}^{q_{n}}w_{i}^{*}\mu_{i}^{*}(\boldsymbol{\beta})\mathbf{X}_{i}^{*}
=\frac{\sqrt{q_{n}}}{n}\sum_{i=1}^{q_{n}}\frac{\mu_{i}^{*}(\boldsymbol{\beta})\mathbf{X}_{i}^{*}}{\pi_{i}^{*}q_{n}}
=\frac{1}{\sqrt{q_{n}}}\sum_{i=1}^{q_{n}}\frac{\mu_{i}^{*}(\boldsymbol{\beta})\mathbf{X}_{i}^{*}}{n\pi_{i}^{*}}
\equiv\frac{1}{\sqrt{q_{n}}}\sum_{i=1}^{q_{n}}\boldsymbol{\omega}_{i}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\,.
\end{align*}
Since the distribution of $\boldsymbol{\omega}_{i}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})$ changes as a function of $n$ and $q_{n}$, the Lindeberg--Feller condition (Van der Vaart, 2000, Proposition 2.27) should be established, as it covers the setting of triangular arrays. First, denote $\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta})\equiv\mathrm{Var}\big(\boldsymbol{\omega}_{i}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\,|\,\mathcal{F}_{n}\big)$. It follows that
\begin{align*}
\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta})
&=E\big(\boldsymbol{\omega}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\boldsymbol{\omega}^{T}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\,|\,\mathcal{F}_{n}\big)-E\big(\boldsymbol{\omega}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\,|\,\mathcal{F}_{n}\big)E\big(\boldsymbol{\omega}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\,|\,\mathcal{F}_{n}\big)^{T}\\
&=\frac{1}{n^{2}}\bigg\{\sum_{i\in\mathcal{N}}\frac{\mu^{2}_{i}(\boldsymbol{\beta})\mathbf{X}_{i}\mathbf{X}_{i}^{T}}{\pi_{i}}-\sum_{i,j\in\mathcal{N}}\mu_{i}(\boldsymbol{\beta})\mu_{j}(\boldsymbol{\beta})\mathbf{X}_{i}\mathbf{X}_{j}^{T}\bigg\}=O_{|\mathcal{F}_{n}}(1)\,,
\end{align*}
where the last equality follows from Assumptions A.1 and A.2.
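As a numerical sanity check of the closed form above, the sketch below (with arbitrary illustrative quantities, not the authors' code) compares $\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta})$ computed from the display with the empirical covariance of $\boldsymbol{\omega}_{i}$ over repeated draws from $\mathcal{N}$.

```python
# Sketch: closed-form K^R(pi, beta) versus the empirical covariance of
# omega_i = mu_i* X_i* / (n pi_i*) over repeated draws from N.  Values illustrative.
import numpy as np

rng = np.random.default_rng(2)
n, m, p = 5_000, 4_000, 3                       # n observations in total, m = |N| non-events
X = rng.standard_normal((m, p))
mu = 1.0 / (1.0 + np.exp(-(X @ np.array([0.3, -0.8, 0.1]) - 3.0)))
pi = mu / mu.sum()

# Closed form: (1/n^2){ sum_i mu_i^2 X_i X_i^T / pi_i - (sum_i mu_i X_i)(sum_i mu_i X_i)^T }
s1 = (X * (mu**2 / pi)[:, None]).T @ X
K_closed = (s1 - np.outer(mu @ X, mu @ X)) / n**2

# Empirical covariance of a single draw omega_i, given the data
idx = rng.choice(m, size=200_000, p=pi)
omega = (mu[idx] / (n * pi[idx]))[:, None] * X[idx]
print(np.max(np.abs(K_closed - np.cov(omega, rowvar=False))))   # should be small
```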
Now, for every $\varepsilon>0$ and some $\delta>0$,
\begin{align*}
\sum_{i=1}^{q_{n}}&E\big\{\|q_{n}^{-1/2}\boldsymbol{\omega}_{i}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\|_{2}^{2}\,I\big(\|q_{n}^{-1/2}\boldsymbol{\omega}_{i}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\|_{2}>\varepsilon\big)\,\big|\,\mathcal{F}_{n}\big\}\\
&\leq\frac{1}{q_{n}^{1+\delta/2}\varepsilon^{\delta}}\sum_{i=1}^{q_{n}}E\Bigg\{\bigg\|\frac{\mu_{i}^{*}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}^{*}}{n\pi_{i}^{*}}\bigg\|_{2}^{2+\delta}\,\bigg|\,\mathcal{F}_{n}\Bigg\}\\
&=\frac{1}{q_{n}^{\delta/2}\varepsilon^{\delta}n^{2+\delta}}\sum_{i\in\mathcal{N}}\frac{\{\mu_{i}(\boldsymbol{\beta}^{o})\}^{2+\delta}\|\mathbf{X}_{i}\|_{2}^{2+\delta}}{\pi_{i}^{\delta+1}}\\
&\leq\frac{1}{q_{n}^{\delta/2}\varepsilon^{\delta}n^{2+\delta}}\sum_{i\in\mathcal{N}}\frac{\|\mathbf{X}_{i}\|_{2}^{2+\delta}}{\pi_{i}^{\delta+1}}=o_{P|\mathcal{F}_{n}}(1)\,,
\end{align*}
where the first inequality is due to Van der Vaart (2000, p. 21) and the last equality is due to Assumption A.3. Since $E(1-w_{i}R_{i}\,|\,\mathcal{F}_{n})=0$, it holds that $q_{n}^{1/2}n^{-1}\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta})^{-1/2}\sum_{i\in\mathcal{N}}(1-w_{i}R_{i})\mu_{i}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}$ converges, conditionally on $\mathcal{F}_{n}$, to a standard multivariate normal distribution. Put differently, for any $\mathbf{u}\in\mathbb{R}^{r}$,
\begin{equation}
\Pr\bigg\{\frac{\sqrt{q_{n}}}{n}\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta})^{-1/2}\sum_{i\in\mathcal{N}}(1-w_{i}R_{i})\mu_{i}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}\leq\mathbf{u}\,\Big|\,\mathcal{F}_{n}\bigg\}\rightarrow\Phi(\mathbf{u})\,,
\tag{S.28}
\end{equation}
where $\Phi$ is the cumulative distribution function of the standard multivariate normal distribution. Since the conditional probability is a random variable in the unconditional space, Eq. (S.28) implies that it converges almost surely to $\Phi(\mathbf{u})$. Being additionally bounded, the dominated convergence theorem yields that, for any $\mathbf{u}\in\mathbb{R}^{r}$,
\begin{equation}
\Pr\bigg\{\frac{\sqrt{q_{n}}}{n}\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta})^{-1/2}\sum_{i\in\mathcal{N}}(1-w_{i}R_{i})\mu_{i}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}\leq\mathbf{u}\bigg\}\rightarrow\Phi(\mathbf{u})\,.
\tag{S.29}
\end{equation}
Suppose that $\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\xrightarrow{P}\boldsymbol{\Psi}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})$, where $\boldsymbol{\Psi}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})$ is a positive-definite matrix. Denote by $\theta$ the limit of $q_{n}/n$, whose existence was assumed earlier. Then, from Eq. (S.29),
\begin{equation}
\frac{1}{\sqrt{n}}\sum_{i\in\mathcal{N}}(1-w_{i}R_{i})\mu_{i}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}\xrightarrow{D}N\big(0,\theta\boldsymbol{\Psi}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\big)\,.
\tag{S.30}
\end{equation}
In the following, it will be shown that the two addends are asymptotically independent. Write
\begin{align}
\lim_{n,q_{n}\rightarrow\infty}&\Pr\bigg(\frac{1}{\sqrt{n}}\frac{\partial l(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{T}}\leq\mathbf{u}\,,\;\frac{1}{\sqrt{n}}\sum_{i\in\mathcal{N}}(1-w_{i}R_{i})\mu_{i}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}\leq\mathbf{v}\bigg)\tag{S.31}\\
&=\lim_{n,q_{n}\rightarrow\infty}E\bigg(I\bigg\{\frac{1}{\sqrt{n}}\frac{\partial l(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{T}}\leq\mathbf{u}\bigg\}\Pr\bigg\{\frac{1}{\sqrt{n}}\sum_{i\in\mathcal{N}}(1-w_{i}R_{i})\mu_{i}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}\leq\mathbf{v}\,\Big|\,\mathcal{F}_{n}\bigg\}\bigg)\nonumber\\
&=E\bigg(\lim_{n,q_{n}\rightarrow\infty}I\bigg\{\frac{1}{\sqrt{n}}\frac{\partial l(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{T}}\leq\mathbf{u}\bigg\}\lim_{n,q_{n}\rightarrow\infty}\Pr\bigg\{\frac{1}{\sqrt{n}}\sum_{i\in\mathcal{N}}(1-w_{i}R_{i})\mu_{i}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}\leq\mathbf{v}\,\Big|\,\mathcal{F}_{n}\bigg\}\bigg)\nonumber\\
&=E\bigg(\lim_{n,q_{n}\rightarrow\infty}I\bigg\{\frac{1}{\sqrt{n}}\frac{\partial l(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{T}}\leq\mathbf{u}\bigg\}\Phi\big(\theta^{-1/2}\boldsymbol{\Psi}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})^{-1/2}\mathbf{v}\big)\bigg)\nonumber\\
&=\lim_{n,q_{n}\rightarrow\infty}E\bigg(I\bigg\{\frac{1}{\sqrt{n}}\frac{\partial l(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{T}}\leq\mathbf{u}\bigg\}\Phi\big(\theta^{-1/2}\boldsymbol{\Psi}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})^{-1/2}\mathbf{v}\big)\bigg)\nonumber\\
&=\lim_{n,q_{n}\rightarrow\infty}\Pr\bigg(\frac{1}{\sqrt{n}}\frac{\partial l(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{T}}\leq\mathbf{u}\bigg)\Phi\big(\theta^{-1/2}\boldsymbol{\Psi}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})^{-1/2}\mathbf{v}\big)\nonumber\\
&=\Phi\big(\boldsymbol{\Sigma}(\boldsymbol{\beta}^{o})^{-1/2}\mathbf{u}\big)\,\Phi\big(\theta^{-1/2}\boldsymbol{\Psi}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})^{-1/2}\mathbf{v}\big)\,,\nonumber
\end{align}
where the dominated convergence theorem justifies interchanging the limit and the expectation.
Since $\widetilde{\mathbf{M}}_{\mathbf{X}}(\boldsymbol{\beta}^{o})$ is a consistent estimator of $\mathbf{M}_{\mathbf{X}}(\boldsymbol{\beta}^{o})$, its consistency for $\boldsymbol{\Sigma}(\boldsymbol{\beta}^{o})$ follows readily. Then, from Slutsky's theorem and Eqs. (S.27), (S.30) and (S.31), it follows that Eq. (S.26) converges in distribution to a multivariate normal distribution with mean zero and a covariance matrix asymptotically equivalent to $\mathbb{H}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})$. The two variance components correspond to two orthogonal sources of variability: the variance of the original full-data MLE, and the additional variance generated by the subsampling procedure.
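To make this decomposition concrete, the sketch below computes plug-in versions of the two components under the simplifying assumptions that $\boldsymbol{\Sigma}(\boldsymbol{\beta}^{o})=\mathbf{M}_{X}(\boldsymbol{\beta}^{o})$ (correctly specified logistic model) and that the subsampling component takes the form $(n/q_{n})\mathbf{M}_{X}^{-1}\mathbf{K}^{R}\mathbf{M}_{X}^{-1}$, as suggested by the trace expression in the next subsection. It uses full-data plug-ins purely for illustration and is not the paper's variance estimator.

```python
# Sketch: plug-in versions of the two variance components of sqrt(n)(beta-tilde - beta^o),
# assuming Sigma = M_X (correct specification) and a subsampling component of the form
# (n/q_n) M_X^{-1} K^R M_X^{-1}.  All quantities below are illustrative.
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(3)
n, q_n = 20_000, 1_000
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
beta_o = np.array([-5.0, 1.0, -0.5])
mu = expit(X @ beta_o)
D = rng.binomial(1, mu)
nonevents = np.flatnonzero(D == 0)
pi = np.full(nonevents.size, 1.0 / nonevents.size)        # uniform pi_i over N (illustrative)

M = (X * (mu * (1 - mu))[:, None]).T @ X / n               # M_X(beta^o), average information
M_inv = np.linalg.inv(M)
Xn, mun = X[nonevents], mu[nonevents]
KR = ((Xn * (mun**2 / pi)[:, None]).T @ Xn - np.outer(mun @ Xn, mun @ Xn)) / n**2
H_mle = M_inv                                              # M^{-1} Sigma M^{-1} with Sigma = M
H_sub = (n / q_n) * M_inv @ KR @ M_inv                     # extra variance from subsampling
print(np.diag(H_mle + H_sub) / n)                          # approximate Var(beta-tilde)
```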
S2.2 Proof of Theorem 3.2

The A-optimality criterion is equivalent to minimizing the asymptotic MSE of $\widetilde{\boldsymbol{\beta}}_{TS}$, which is the trace of $\mathbb{H}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})$. Now,
\[
Tr\big(\mathbb{H}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\big)=Tr\bigg(\frac{n}{q_{n}}\mathbf{M}_{X}^{-1}(\boldsymbol{\beta}^{o})\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\mathbf{M}_{X}^{-1}(\boldsymbol{\beta}^{o})\bigg)+d\,,
\]
where $d$ is a constant that does not involve $\boldsymbol{\pi}$, and
\begin{align*}
Tr\bigg(\frac{n}{q_{n}}\mathbf{M}_{X}^{-1}(\boldsymbol{\beta}^{o})\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})\mathbf{M}_{X}^{-1}(\boldsymbol{\beta}^{o})\bigg)
=Tr\Bigg(\frac{1}{nq_{n}}\mathbf{M}_{X}^{-1}(\boldsymbol{\beta}^{o})\bigg\{\sum_{i\in\mathcal{N}}\frac{\mu^{2}_{i}(\boldsymbol{\beta}^{o})}{\pi_{i}}\mathbf{X}_{i}\mathbf{X}_{i}^{T}-\sum_{i,j\in\mathcal{N}}\mu_{i}(\boldsymbol{\beta}^{o})\mu_{j}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}\mathbf{X}_{j}^{T}\bigg\}\mathbf{M}_{X}^{-1}(\boldsymbol{\beta}^{o})\Bigg)\,.
\end{align*}
By removing the part that does not involve $\boldsymbol{\pi}$ and the factor $(nq_{n})^{-1}$, which does not alter the optimization, we are left with
\begin{align*}
Tr\bigg(\sum_{i\in\mathcal{N}}\frac{\mu^{2}_{i}(\boldsymbol{\beta}^{o})}{\pi_{i}}\mathbf{M}_{X}^{-1}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}\mathbf{X}_{i}^{T}\mathbf{M}_{X}^{-1}(\boldsymbol{\beta}^{o})\bigg)
&=\sum_{i\in\mathcal{N}}\frac{\mu^{2}_{i}(\boldsymbol{\beta}^{o})}{\pi_{i}}Tr\big(\mathbf{X}_{i}^{T}\mathbf{M}_{X}^{-2}(\boldsymbol{\beta}^{o})\mathbf{X}_{i}\big)\\
&=\sum_{i\in\mathcal{N}}\frac{\mu^{2}_{i}(\boldsymbol{\beta}^{o})}{\pi_{i}}\|\mathbf{M}_{X}^{-1}\mathbf{X}_{i}\|_{2}^{2}\,.
\end{align*}
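As a quick numerical check of this objective (a sketch with arbitrary illustrative quantities, not part of the proof), probabilities proportional to $\mu_{i}(\boldsymbol{\beta}^{o})\|\mathbf{M}_{X}^{-1}\mathbf{X}_{i}\|_{2}$, the minimizer derived in the remainder of this subsection, attain a smaller objective value than uniform probabilities over $\mathcal{N}$.

```python
# Sketch: the objective sum_i mu_i^2 ||M_X^{-1} X_i||^2 / pi_i evaluated at uniform
# probabilities and at probabilities proportional to mu_i ||M_X^{-1} X_i||.  Values illustrative.
import numpy as np

rng = np.random.default_rng(4)
m, p = 5_000, 3                                       # m = |N| non-events
X = rng.standard_normal((m, p))
mu = 1.0 / (1.0 + np.exp(-(X @ np.array([0.4, -0.9, 0.2]) - 4.0)))   # small mu_i (rare events)
M = (X * (mu * (1 - mu))[:, None]).T @ X / m          # stand-in for M_X(beta^o)
norms = np.linalg.norm(X @ np.linalg.inv(M), axis=1)  # ||M_X^{-1} X_i||_2 (M is symmetric)

objective = lambda pi: np.sum(mu**2 * norms**2 / pi)
pi_uniform = np.full(m, 1.0 / m)
pi_opt = mu * norms / np.sum(mu * norms)
print(objective(pi_opt), "<=", objective(pi_uniform))
```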
Define the following Lagrangian function, with multiplier $\alpha$,
\[
g(\boldsymbol{\pi})=\sum_{i\in\mathcal{N}}\frac{\mu^{2}_{i}(\boldsymbol{\beta}^{o})}{\pi_{i}}\,\|\mathbf{M}_{X}^{-1}\mathbf{X}_{i}\|_{2}^{2}+\alpha\Big(1-\sum_{i\in\mathcal{N}}\pi_{i}\Big)\,.
\]
Differentiating $g(\boldsymbol{\pi})$ with respect to $\pi_{i}$ for any $i\in\mathcal{N}$ and setting the derivative to 0 gives
\[
\frac{\partial g(\boldsymbol{\pi})}{\partial\pi_{i}}=-\frac{\mu^{2}_{i}(\boldsymbol{\beta}^{o})\,\|\mathbf{M}_{X}^{-1}\mathbf{X}_{i}\|_{2}^{2}}{\pi_{i}^{2}}-\alpha\equiv 0,
\]
and
\[
\pi_{i}=\frac{\mu_{i}(\boldsymbol{\beta}^{o})\,\|\mathbf{M}_{X}^{-1}\mathbf{X}_{i}\|_{2}}{\sqrt{-\alpha}}\,.
\]
Since $\sum_{i\in\mathcal{N}}\pi_{i}=1$,
\[
\sqrt{-\alpha}=\sum_{i\in\mathcal{N}}\mu_{i}(\boldsymbol{\beta}^{o})\,\|\mathbf{M}_{X}^{-1}\mathbf{X}_{i}\|_{2},
\]
and substituting $\sqrt{-\alpha}$ back into the expression for $\pi_{i}$ gives
\[
\pi_{i}=\frac{\mu_{i}(\boldsymbol{\beta}^{o})\,\|\mathbf{M}_{X}^{-1}\mathbf{X}_{i}\|_{2}}{\sum_{j\in\mathcal{N}}\mu_{j}(\boldsymbol{\beta}^{o})\,\|\mathbf{M}_{X}^{-1}\mathbf{X}_{j}\|_{2}},\qquad i\in\mathcal{N},
\]
which yields Eq. (3.9) in the main text. The proof of Eq. (3.10) of the main text follows similarly.
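For concreteness, the resulting probabilities are easy to compute from a pilot estimate. Below is a minimal Python sketch, assuming a logistic-regression setting in which $\mu_{i}(\boldsymbol{\beta})$ is the fitted mean $e^{\mathbf{X}_{i}^{T}\boldsymbol{\beta}}/(1+e^{\mathbf{X}_{i}^{T}\boldsymbol{\beta}})$ and $\mathbf{M}_{X}$ is the corresponding symmetric Hessian-type matrix; the function and variable names are illustrative and not the code used in the paper.

```python
import numpy as np

def optimal_probs(X, beta_pilot, M_X):
    """Sketch of the minimizer derived above: pi_i proportional to
    mu_i(beta) * ||M_X^{-1} X_i||_2, normalized to sum to one.

    X:          n x p covariate matrix
    beta_pilot: pilot estimate of beta (assumption: logistic model)
    M_X:        p x p symmetric matrix playing the role of M_X(beta)
    """
    mu = 1.0 / (1.0 + np.exp(-X @ beta_pilot))     # mu_i(beta), logistic mean (assumption)
    M_inv = np.linalg.inv(M_X)
    norms = np.linalg.norm(X @ M_inv, axis=1)      # ||M_X^{-1} X_i||_2, using symmetry of M_X
    w = mu * norms
    return w / w.sum()                             # optimal pi_i, summing to one
```

In practice one would replace `np.linalg.inv` by a solve against the rows of `X` for numerical stability; the sketch only illustrates the structure of the optimal probabilities.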
S2.3 Proof of Theorem 4.1
Following the main steps of the proof of Theorem 2 in Wang et al. (2018), it is straightforward to show that, given $\mathcal{F}_{n}$,
\[
\frac{1}{n}\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})^{1/2}\frac{\partial l^{*}(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{o}}
=\frac{1}{\sqrt{q_{n}}}\{\mathrm{Var}(\boldsymbol{\eta}_{i}\mid\mathcal{F}_{n})\}^{-1/2}\sum_{i=1}^{q_{n}}\boldsymbol{\eta}_{i}\xrightarrow{D}N(0,\mathbf{I}),
\]
where
\[
\boldsymbol{\eta}_{i}\equiv\frac{\{D_{i}^{*}-\mu_{i}^{*}(\boldsymbol{\beta}^{o})\}\mathbf{X}_{i}^{*}}{n\pi_{i}^{*}},\qquad i=1,\dots,q_{n},
\]
are independent and identically distributed with mean $\mathbf{0}$ and variance $q_{n}\mathbf{K}^{B}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})$. In other words, for all $\mathbf{u}\in\mathbb{R}^{r}$,
\[
\Pr\Big\{n^{-1}\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})^{1/2}\frac{\partial l^{*}(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{o}}\leq\mathbf{u}\,\Big|\,\mathcal{F}_{n}\Big\}\xrightarrow{P}\Phi(\mathbf{u})\,. \tag{S.32}
\]
The conditional probability in Eq. (S.32) is a bounded random variable; thus convergence in probability to a constant implies convergence in the mean. Therefore,
\[
\Pr\Big\{n^{-1}\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})^{1/2}\frac{\partial l^{*}(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{o}}\leq\mathbf{u}\Big\}
=E\bigg[\Pr\Big\{n^{-1}\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})^{1/2}\frac{\partial l^{*}(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{o}}\leq\mathbf{u}\,\Big|\,\mathcal{F}_{n}\Big\}\bigg]\rightarrow\Phi(\mathbf{u})\,,
\]
and therefore
\[
\frac{1}{n}\mathbf{K}^{R}(\boldsymbol{\pi},\boldsymbol{\beta}^{o})^{1/2}\frac{\partial l^{*}(\boldsymbol{\beta}^{o})}{\partial\boldsymbol{\beta}^{o}}\xrightarrow{D}N(0,\mathbf{I})
\]
in the unconditional space. The rest of the proof follows directly from Wang et al. (2018).
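To make the quantities in the proof concrete, the following minimal Python sketch constructs the inverse-probability-weighted subsample score $\sum_{i}\boldsymbol{\eta}_{i}$, with $\boldsymbol{\eta}_{i}=\{D_{i}^{*}-\mu_{i}^{*}(\boldsymbol{\beta}^{o})\}\mathbf{X}_{i}^{*}/(n\pi_{i}^{*})$, for a subsample drawn with replacement. It assumes a logistic mean for $\mu_{i}$; all names are illustrative and not the paper's code.

```python
import numpy as np

def subsample_score(X, D, beta, pi, q_n, rng=None):
    """Sketch: draw q_n indices with replacement using probabilities pi and
    return the weighted score sum_i eta_i, where
    eta_i = (D_i* - mu_i*(beta)) X_i* / (n pi_i*)."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    idx = rng.choice(n, size=q_n, replace=True, p=pi)  # subsampled indices, i = 1, ..., q_n
    Xs, Ds, ps = X[idx], D[idx], pi[idx]
    mus = 1.0 / (1.0 + np.exp(-Xs @ beta))             # mu_i*(beta), logistic mean (assumption)
    eta = ((Ds - mus) / (n * ps))[:, None] * Xs        # one eta_i per subsampled row
    return eta.sum(axis=0)                             # subsample score
```

Conditionally on the data, the rows of `eta` are i.i.d. with mean zero, which is the property driving the central limit argument above.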
S3 Additional Simulation Results
In Fig. S1, we compare the Frobenius norms of three covariance matrices: (i) the covariance matrix of the two-step estimator, $\widetilde{\boldsymbol{\beta}}_{TS}$; (ii) the approximated covariance matrix utilized in Step 1.5; and (iii) the empirical covariance matrix of $\widetilde{\boldsymbol{\beta}}_{TS}$. Fig. S2 demonstrates the validity of the variance estimator (3.11) and the effectiveness of optimal subsampling over uniform subsampling.
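As a rough illustration of the comparison underlying Fig. S1, the sketch below computes the Frobenius norm of the empirical covariance of replicated two-step estimates and of each estimated covariance matrix; the inputs and names are hypothetical and do not reproduce the simulation code used for the figures.

```python
import numpy as np

def frobenius_comparison(beta_reps, cov_estimates):
    """beta_reps:     R x p array, one two-step estimate per simulation replication.
    cov_estimates:    list of estimated p x p covariance matrices to compare.
    Returns the Frobenius norm of the empirical covariance and of each estimate."""
    emp_cov = np.cov(beta_reps, rowvar=False)                   # empirical covariance of the estimator
    est_norms = [np.linalg.norm(C, "fro") for C in cov_estimates]
    return np.linalg.norm(emp_cov, "fro"), est_norms
```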
S4 Linked Birth and Infant Death Data - Additional Results
The covariates in the model are summarized in Tables S1-S3. Tables S4-S6 present the estimated coefficients for each method, with $c=10$. While the results are organized into three tables for clarity, the FDR procedure was executed once, encompassing all coefficients collectively.
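Because the adjustment is applied once to all coefficient p-values jointly rather than separately within each table, a minimal sketch of such a joint adjustment is given below. It assumes a Benjamini-Hochberg-type FDR procedure, which may differ in detail from the exact procedure used for Tables S4-S6; the pooled p-value vector and function name are illustrative.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, applied jointly to all coefficients
    (e.g., the raw p-values pooled across Tables S4-S6)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                            # ascending raw p-values
    ranked = p[order] * m / np.arange(1, m + 1)      # p_(i) * m / i
    adj = np.minimum.accumulate(ranked[::-1])[::-1]  # enforce monotonicity from the largest rank down
    out = np.empty_like(adj)
    out[order] = np.clip(adj, 0, 1)                  # back to the original coefficient order
    return out
```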
Table S1.

                                        Non-events (N = 28,410,519)   Events (N = 176,400)
Mother's age (limited 12-50)
  Mean (SD)                             27.71 (6.09)                   26.9 (6.5)
  Median [Min, Max]                     28 [12, 15]                    26 [12, 50]
Live Birth Order
  Mean (SD)                             2.08 (1.24)                    2.19 (1.4)
  Median [Min, Max]                     2 [1, 8]                       2 [1, 8]
Number of Prenatal Visits
  Mean (SD)                             11.26 (3.94)                   8.21 (5.11)
  Median [Min, Max]                     12 [0, 49]                     8 [0, 49]
Weight Gain (limited to 99 pounds)
  Mean (SD)                             30.51 (14.32)                  22.95 (15.04)
  Median [Min, Max]                     30 [0, 98]                     22 [0, 98]
Five Minute APGAR Score
  Mean (SD)                             8.84 (0.71)                    5.19 (3.44)
  Median [Min, Max]                     9 [0, 10]                      6 [0, 10]
Plurality (limited to 5)
  Mean (SD)                             1.04 (0.19)                    1.65 (0.42)
  Median [Min, Max]                     1 [1, 5]                       1 [1, 5]
Gestation weeks
  Mean (SD)                             38.65 (2.37)                   30.21 (7.68)
  Median [Min, Max]                     39 [17, 47]                    30 [17, 47]
Years after 2007
  Mean (SD)                             2.93 (2.01)                    2.85 (2.01)
  Median [Min, Max]                     3 [0, 6]                       3 [0, 6]
Birth month
  January                               8.19%                          8.36%
  February                              7.63%                          7.72%
  March                                 8.31%                          8.35%
  April                                 7.98%                          8.19%
  May                                   8.33%                          8.56%
  June                                  8.32%                          8.32%
  July                                  8.80%                          8.66%
  August                                8.93%                          8.79%
  September                             8.66%                          8.42%
  October                               8.50%                          8.52%
  November                              8.02%                          7.97%
  December                              8.33%                          8.14%
Birth weekday
  Sunday                                9.31%                          11.36%
  Monday                                15.18%                         14.73%
  Tuesday                               16.59%                         15.51%
  Wednesday                             16.27%                         15.38%
  Thursday                              16.21%                         15.57%
  Friday                                15.84%                         15.25%
  Saturday                              10.60%                         12.20%
Birth place
  In hospital                           98.80%                         98.53%
  Not in hospital                       1.20%                          1.47%
Residence status
  Resident                              72.97%                         65.82%
  Interstate nonresident (type 1)       24.73%                         30.26%
  Interstate nonresident (type 2)       2.11%                          3.83%
  Foreign resident                      0.19%                          0.09%
Table S2.

                                        Non-events (N = 28,410,519)   Events (N = 176,400)
Mother's race
  White                                 76.72%                         64.51%
  Black                                 15.81%                         29.58%
  American Indian / Alaskan Native      1.16%                          1.55%
  Asian / Pacific Islander              6.31%                          4.36%
Mother's marital status
  Married                               59.52%                         45.56%
  Not Married                           40.48%                         54.44%
Father's race
  White                                 63.37%                         47.25%
  Black                                 11.58%                         17.65%
  American Indian / Alaskan Native      0.87%                          1.01%
  Asian / Pacific Islander              5.25%                          3.24%
  Unknown                               18.93%                         30.85%
Diabetes
  Yes                                   5.14%                          4.65%
  No                                    94.86%                         95.35%
Chronic Hypertension
  Yes                                   1.32%                          2.65%
  No                                    98.68%                         97.35%
Pregnancy-Associated Hypertension
  Yes                                   4.27%                          4.82%
  No                                    95.73%                         95.18%
Eclampsia
  Yes                                   0.25%                          0.58%
  No                                    99.75%                         99.42%
Induction of Labor
  Yes                                   23.01%                         13.01%
  No                                    76.99%                         86.99%
Tocolysis
  Yes                                   1.19%                          4.30%
  No                                    98.81%                         95.70%
Meconium
  Yes                                   4.74%                          3.71%
  No                                    95.26%                         96.29%
Precipitous Labor
  Yes                                   2.46%                          4.46%
  No                                    97.54%                         95.54%
Breech
  Yes                                   5.36%                          20.60%
  No                                    94.64%                         79.40%
Forceps delivery
  Yes                                   0.66%                          0.38%
  No                                    99.34%                         99.62%
Vacuum delivery
  Yes                                   3.02%                          1.08%
  No                                    96.98%                         98.92%
Delivery method
  Vaginal                               67.54%                         60.98%
  C-Section                             32.46%                         39.02%
Table S3.

                                        Non-events (N = 28,410,519)   Events (N = 176,400)
Attendant
  Doctor of Medicine (MD)               85.40%                         90.00%
  Doctor of Osteopathy (DO)             5.56%                          4.96%
  Certified Nurse Midwife (CNM)         7.75%                          3.12%
  Other Midwife                         0.64%                          0.27%
  Other                                 0.65%                          1.65%
Sex
  Female                                48.86%                         44.17%
  Male                                  51.14%                         55.83%
Birth Weight
  227-1499 grams                        1.15%                          53.28%
  1500-2499 grams                       6.62%                          14.70%
  2500-8165 grams                       92.23%                         32.02%
Anencephalus
  Yes                                   0.01%                          1.01%
  No                                    99.99%                         98.99%
Spina Bifida
  Yes                                   0.01%                          0.25%
  No                                    99.99%                         99.75%
Omphalocele
  Yes                                   0.03%                          0.68%
  No                                    99.97%                         99.32%
Cleft Lip
  Yes                                   0.07%                          1.08%
  No                                    99.93%                         98.92%
Downs Syndrome
  Yes                                   0.05%                          0.55%
  No                                    99.95%                         99.45%
Estimate Standard Deviation Adjusted P-value MLE A L uniform MLE A L uniform MLE A L uniform Intercept 18.5035 18.4739 18.4791 19.0026 0.2762 0.2782 0.2892 0.6643 0.0000 0.0000 0.0000 0.0000 Mother Age -0.0938 -0.0947 -0.0919 -0.0944 0.0036 0.0038 0.0038 0.0069 0.0000 0.0000 0.0000 0.0000 Live birth order 0.1109 0.1120 0.1111 0.1080 0.0023 0.0024 0.0024 0.0046 0.0000 0.0000 0.0000 0.0000 Number of prenatal visits -0.0134 -0.0132 -0.0134 -0.0151 0.0007 0.0007 0.0007 0.0013 0.0000 0.0000 0.0000 0.0000 Weight gain -0.0029 -0.0029 -0.0030 -0.0030 0.0003 0.0003 0.0003 0.0006 0.0000 0.0000 0.0000 0.0000 Five minute APGAR score -0.5183 -0.5182 -0.5180 -0.5140 0.0019 0.0019 0.0019 0.0044 0.0000 0.0000 0.0000 0.0000 Plurality -0.0872 -0.0846 -0.0811 -0.0553 0.0122 0.0128 0.0127 0.0250 0.0000 0.0000 0.0000 0.0691 Gestation weeks -0.1307 -0.1308 -0.1300 -0.1282 0.0012 0.0013 0.0013 0.0025 0.0000 0.0000 0.0000 0.0000 Year 0.0126 0.0260 0.0215 -0.0901 0.0657 0.0662 0.0690 0.1523 0.8913 0.7523 0.8268 0.7115 Squared mother age 0.0012 0.0012 0.0012 0.0013 0.0001 0.0001 0.0001 0.0001 0.0000 0.0000 0.0000 0.0000 Birth place = not in hospital 0.6151 0.6252 0.6098 0.6420 0.0317 0.0325 0.0331 0.0643 0.0000 0.0000 0.0000 0.0000 Diabetes = no -0.0148 0.0001 -0.0172 -0.0110 0.0285 0.0293 0.0294 0.0511 0.6896 0.9975 0.6611 0.9378 Chronic hypertension = no 0.1177 0.1152 0.0952 0.0178 0.0402 0.0409 0.0415 0.0768 0.0072 0.0101 0.0418 0.9339 Prepregnacny hypertension = no 0.3343 0.3450 0.3208 0.3400 0.0271 0.0280 0.0282 0.0474 0.0000 0.0000 0.0000 0.0000 Eclampsia = no 0.4996 0.4983 0.5299 0.2907 0.0770 0.0774 0.0796 0.1371 0.0000 0.0000 0.0000 0.0823 Induction of labor = no 0.0054 0.0029 0.0020 -0.0083 0.0176 0.0184 0.0183 0.0246 0.8081 0.9082 0.9134 0.8705 Tocolysis = no 0.0992 0.1073 0.1005 -0.0120 0.0313 0.0321 0.0327 0.0640 0.0034 0.0019 0.0047 0.9418 Meconium = no -0.1216 -0.1277 -0.1117 -0.0889 0.0262 0.0272 0.0275 0.0475 0.0000 0.0000 0.0001 0.1242 Precipitous labor = no 0.0361 0.0351 0.0248 -0.0213 0.0286 0.0294 0.0298 0.0622 0.2721 0.3129 0.5080 0.8705 Breech = no -0.1003 -0.0980 -0.1055 -0.1011 0.0147 0.0155 0.0153 0.0330 0.0000 0.0000 0.0000 0.0066 Forceps delivery = no 0.1253 0.1161 0.1195 0.1464 0.0898 0.0902 0.0937 0.1282 0.2293 0.2778 0.2840 0.4185 Vacuum delivery = no 0.2982 0.3145 0.3024 0.2598 0.0508 0.0514 0.0529 0.0568 0.0000 0.0000 0.0000 0.0000 Delivery method = C-Section -0.0401 -0.0432 -0.0377 -0.0368 0.0098 0.0103 0.0102 0.0186 0.0001 0.0001 0.0005 0.1008 Sex = male -0.7590 -0.7601 -0.7388 -0.9213 0.2665 0.2686 0.2796 0.6448 0.0090 0.0099 0.0168 0.2792 Anencephaly = no -4.1255 -4.1271 -4.1709 -4.0831 0.1118 0.1120 0.1172 0.3471 0.0000 0.0000 0.0000 0.0000 Spina Bifida = no -2.1350 -2.1323 -2.1315 -2.0956 0.1488 0.1487 0.1559 0.3737 0.0000 0.0000 0.0000 0.0000 Omphalocele = no -1.7259 -1.7286 -1.7383 -1.8822 0.0829 0.0831 0.0876 0.2115 0.0000 0.0000 0.0000 0.0000 Cleft lip = no -2.8745 -2.8681 -2.8451 -2.9524 0.0656 0.0660 0.0685 0.1456 0.0000 0.0000 0.0000 0.0000 Downs syndrome = no -2.3438 -2.3402 -2.3462 -2.4419 0.0863 0.0868 0.0886 0.1639 0.0000 0.0000 0.0000 0.0000 Birth month vs. 
January Birth month = February 0.0103 0.0119 0.0123 0.0592 0.0145 0.0153 0.0152 0.0268 0.5611 0.5133 0.5178 0.0691 Birth month = March -0.0305 -0.0312 -0.0278 0.0051 0.0143 0.0151 0.0149 0.0265 0.0571 0.0703 0.1064 0.9418 Birth month = April -0.0294 -0.0275 -0.0268 -0.0121 0.0144 0.0152 0.0150 0.0269 0.0711 0.1193 0.1184 0.8082 Birth month = May -0.0434 -0.0441 -0.0402 -0.0014 0.0143 0.0150 0.0149 0.0272 0.0051 0.0073 0.0145 0.9683 Birth month = June -0.0344 -0.0296 -0.0294 0.0006 0.0143 0.0151 0.0149 0.0268 0.0299 0.0883 0.0870 0.9829 Birth month = July -0.0301 -0.0259 -0.0271 0.0163 0.0141 0.0149 0.0147 0.0265 0.0571 0.1339 0.1064 0.7080 Birth month = August -0.0213 -0.0185 -0.0241 0.0174 0.0141 0.0148 0.0147 0.0264 0.2002 0.2945 0.1514 0.6974 Birth month = September -0.0180 -0.0165 -0.0150 0.0418 0.0142 0.0150 0.0148 0.0263 0.2721 0.3511 0.4116 0.2134 Birth month = October -0.0287 -0.0275 -0.0273 0.0314 0.0142 0.0150 0.0148 0.0262 0.0731 0.1136 0.1064 0.3872 Birth month = November -0.0311 -0.0414 -0.0302 0.0029 0.0144 0.0152 0.0150 0.0271 0.0561 0.0129 0.0808 0.9524 Birth month = December -0.0409 -0.0377 -0.0422 -0.0248 0.0143 0.0151 0.0149 0.0274 0.0090 0.0240 0.0100 0.5444 Birth weekday vs. Sunday Birth weekday = Monday 0.0907 0.0910 0.0978 0.1239 0.0118 0.0125 0.0123 0.0228 0.0000 0.0000 0.0000 0.0000 Birth weekday = Tuesday 0.0981 0.1037 0.1045 0.1035 0.0116 0.0123 0.0122 0.0227 0.0000 0.0000 0.0000 0.0000 Birth weekday = Wednesday 0.0941 0.0971 0.0999 0.0781 0.0117 0.0123 0.0122 0.0226 0.0000 0.0000 0.0000 0.0018 Birth weekday = Thursday 0.0902 0.0903 0.0974 0.1145 0.0117 0.0123 0.0122 0.0224 0.0000 0.0000 0.0000 0.0000 Birth weekday = Friday 0.0755 0.0726 0.0783 0.0711 0.0117 0.0124 0.0122 0.0229 0.0000 0.0000 0.0000 0.0060 Birth weekday = Saturday 0.0208 0.0199 0.0205 0.0302 0.0124 0.0131 0.0130 0.0245 0.1504 0.2010 0.1687 0.3731 Resdience status vs. 1 Residence status = 2 0.1156 0.1123 0.1118 0.1159 0.0066 0.0069 0.0068 0.0124 0.0000 0.0000 0.0000 0.0000 Residence status = 3 0.2355 0.2306 0.2443 0.2717 0.0162 0.0169 0.0168 0.0338 0.0000 0.0000 0.0000 0.0000 Residence status = 4 -0.4333 -0.4285 -0.4366 -0.1390 0.0873 0.0876 0.0897 0.0931 0.0000 0.0000 0.0000 0.2513 Mother’s race vs. white Mother’s race = black -0.0156 -0.0196 -0.0136 -0.0678 0.0165 0.0174 0.0174 0.0338 0.4331 0.3441 0.5295 0.0980 Mother’s race = american indian 0.2387 0.2444 0.2206 0.0930 0.0460 0.0467 0.0482 0.0990 0.0000 0.0000 0.0000 0.5240 Mother’s race = asian 0.0121 0.0207 0.0099 0.0364 0.0362 0.0368 0.0375 0.0575 0.7989 0.6485 0.8578 0.7075 Paternity acknowledged = no 0.0991 0.1022 0.0988 0.0945 0.0074 0.0078 0.0076 0.0136 0.0000 0.0000 0.0000 0.0000 Father’s race vs. white Father’s race = black 0.1357 0.1409 0.1389 0.2239 0.0199 0.0209 0.0209 0.0387 0.0000 0.0000 0.0000 0.0000 Father’s race = american indian 0.2200 0.2230 0.2323 0.2298 0.0562 0.0569 0.0591 0.1122 0.0002 0.0002 0.0002 0.0895 Father’s race = asian -0.0198 -0.0329 -0.0227 -0.0100 0.0413 0.0421 0.0426 0.0661 0.7066 0.5132 0.6948 0.9481 Father’s race = unknown 0.1746 0.1724 0.1732 0.2093 0.0141 0.0148 0.0147 0.0265 0.0000 0.0000 0.0000 0.0000 Attendant vs. 
MD Attendant = DO -0.0444 -0.0513 -0.0462 -0.0537 0.0134 0.0140 0.0139 0.0239 0.0020 0.0006 0.0021 0.0656 Attendant = CNM -0.2782 -0.2861 -0.2786 -0.3119 0.0157 0.0165 0.0164 0.0241 0.0000 0.0000 0.0000 0.0000 Attendant = other midwife -0.2361 -0.2453 -0.2209 -0.3163 0.0542 0.0549 0.0563 0.0901 0.0000 0.0000 0.0002 0.0015 Attendant = other 0.1822 0.1765 0.1797 0.2198 0.0311 0.0318 0.0325 0.0617 0.0000 0.0000 0.0000 0.0013 Birth weight recode vs. 1 Birth weight recode = 2 -0.7391 -0.7430 -0.7353 -0.7328 0.0116 0.0122 0.0121 0.0220 0.0000 0.0000 0.0000 0.0000 Birth weight recode = 3 -1.6916 -1.6899 -1.6930 -1.6808 0.0141 0.0149 0.0147 0.0265 0.0000 0.0000 0.0000 0.0000
Estimate Standard Deviation Adjusted P-value MLE A L uniform MLE A L uniform MLE A L uniform Weight gain -0.0012 -0.0012 -0.0010 -0.0013 0.0004 0.0004 0.0004 0.0008 0.0127 0.0155 0.0418 0.1586 Apgar 0.0315 0.0312 0.0302 0.0307 0.0025 0.0026 0.0026 0.0059 0.0000 0.0000 0.0000 0.0000 Plurality 0.0180 0.0165 0.0075 -0.0103 0.0163 0.0172 0.0169 0.0340 0.3447 0.4207 0.7435 0.8911 Gestation week -0.0016 -0.0014 -0.0022 -0.0055 0.0012 0.0013 0.0013 0.0025 0.2518 0.3507 0.1257 0.0701 Diabetes = no 0.0388 0.0253 0.0418 0.0964 0.0266 0.0277 0.0275 0.0470 0.2147 0.4457 0.1877 0.0895 Chronic hypertension = no -0.0586 -0.0510 -0.0428 0.0197 0.0376 0.0386 0.0388 0.0761 0.1846 0.2645 0.3600 0.9196 Prepregnacny hypertension = no -0.0692 -0.0634 -0.0620 -0.0421 0.0260 0.0272 0.0271 0.0478 0.0152 0.0373 0.0418 0.5539 Eclampsia = no -0.0270 -0.0385 -0.0374 -0.0806 0.0740 0.0745 0.0771 0.1286 0.7827 0.6763 0.7254 0.7075 Induction of labor = no 0.0409 0.0411 0.0469 0.0427 0.0173 0.0182 0.0180 0.0246 0.0328 0.0440 0.0183 0.1586 Tocolysis = no -0.0056 -0.0120 -0.0076 0.0724 0.0312 0.0323 0.0326 0.0676 0.8921 0.7620 0.8655 0.4548 Forceps delivery = no 0.0692 0.0866 0.1151 0.0757 0.0875 0.0879 0.0917 0.1252 0.5132 0.4116 0.2903 0.7088 Vacuum delivery = no -0.0215 -0.0308 -0.0060 -0.0080 0.0495 0.0502 0.0518 0.0557 0.7347 0.6169 0.9134 0.9481 Delivery method = C-Section -0.0217 -0.0212 -0.0242 -0.0183 0.0126 0.0133 0.0131 0.0241 0.1380 0.1824 0.1064 0.6312 Anencephaly = no 0.1447 0.1466 0.1047 0.7030 0.1095 0.1096 0.1141 0.3429 0.2518 0.2616 0.4612 0.0895 Spina Bifida = no 0.2920 0.2926 0.2826 -0.1952 0.1514 0.1517 0.1594 0.3568 0.0889 0.0946 0.1202 0.7321 Omphalocele = no -0.0071 -0.0054 -0.0096 -0.1927 0.0799 0.0804 0.0846 0.2031 0.9473 0.9647 0.9134 0.5240 Cleft lip = no 0.3197 0.3159 0.2768 0.4703 0.0644 0.0649 0.0672 0.1370 0.0000 0.0000 0.0001 0.0019 Downs syndrome = no 0.0677 0.0748 0.1029 0.1770 0.0852 0.0857 0.0880 0.1879 0.5132 0.4630 0.3274 0.5240
Coefficient Coefficient sd P-value MLE A L uniform MLE A L uniform MLE A L uniform Father’s race = black -0.0079 -0.0084 -0.0078 -0.0265 0.0057 0.0060 0.0059 0.0112 0.2293 0.2400 0.2751 0.0498 Father’s race = american indian -0.0267 -0.0250 -0.0322 -0.0249 0.0161 0.0163 0.0168 0.0305 0.1522 0.1969 0.0986 0.5991 Father’s race = asian -0.0128 -0.0105 -0.0114 -0.0257 0.0116 0.0119 0.0120 0.0187 0.3447 0.4625 0.4449 0.2938 Father’s race = unknown -0.0029 -0.0028 -0.0019 -0.0129 0.0039 0.0041 0.0041 0.0073 0.5362 0.5733 0.7345 0.1541 Mother’s race = black -0.0174 -0.0169 -0.0169 -0.0012 0.0047 0.0050 0.0050 0.0098 0.0006 0.0016 0.0016 0.9489 Mother’s race = american indian -0.0188 -0.0196 -0.0103 0.0091 0.0132 0.0134 0.0138 0.0270 0.2228 0.2193 0.5481 0.8705 Mother’s race = asian -0.0049 -0.0032 -0.0034 -0.0028 0.0102 0.0104 0.0105 0.0163 0.7066 0.8068 0.8237 0.9448 Diabetes = no -0.0058 -0.0082 -0.0045 -0.0162 0.0066 0.0068 0.0068 0.0118 0.4589 0.3129 0.6081 0.2938 Chronic hypertension = no -0.0028 -0.0047 0.0019 0.0066 0.0094 0.0096 0.0096 0.0185 0.8081 0.6930 0.8772 0.8705 Prepregnacny hypertension = no 0.0162 0.0116 0.0198 0.0127 0.0064 0.0066 0.0066 0.0115 0.0208 0.1339 0.0062 0.4381 Eclampsia = no -0.0232 -0.0204 -0.0319 0.0234 0.0186 0.0187 0.0192 0.0330 0.2758 0.3535 0.1484 0.6627 Induction of labor = no -0.0157 -0.0150 -0.0147 -0.0125 0.0042 0.0044 0.0043 0.0060 0.0004 0.0015 0.0016 0.0861 Tocolysis = no 0.0277 0.0239 0.0259 0.0440 0.0077 0.0079 0.0080 0.0160 0.0007 0.0056 0.0028 0.0182 Meconium = no -0.0008 -0.0004 -0.0033 -0.0074 0.0072 0.0075 0.0076 0.0129 0.9445 0.9658 0.7450 0.7177 Precipitous labor = no -0.0106 -0.0109 -0.0071 -0.0009 0.0077 0.0080 0.0081 0.0165 0.2367 0.2543 0.4840 0.9683 Breech = no -0.0002 -0.0011 0.0010 -0.0012 0.0040 0.0043 0.0042 0.0092 0.9685 0.8443 0.8655 0.9481 Forceps delivery = no -0.0199 -0.0184 -0.0264 -0.0241 0.0215 0.0216 0.0225 0.0304 0.4390 0.4727 0.3274 0.6090 Vacuum delivery = no -0.0176 -0.0187 -0.0234 -0.0137 0.0120 0.0121 0.0125 0.0135 0.2109 0.1968 0.1064 0.4905 Anencephaly = no -0.1061 -0.1052 -0.0938 -0.2437 0.0271 0.0272 0.0284 0.0899 0.0002 0.0003 0.0022 0.0194 Spina Bifida = no 0.0532 0.0520 0.0503 0.1032 0.0373 0.0374 0.0393 0.0749 0.2228 0.2449 0.2840 0.2938 Omphalocele = no -0.0008 -0.0014 0.0043 0.1153 0.0200 0.0200 0.0212 0.0496 0.9685 0.9647 0.8772 0.0549 Cleft lip = no 0.0085 0.0071 0.0022 -0.0026 0.0159 0.0160 0.0166 0.0349 0.6826 0.7202 0.9134 0.9675 Downs syndrome = no 0.0559 0.0533 0.0492 0.0827 0.0211 0.0212 0.0218 0.0428 0.0152 0.0235 0.0443 0.1105