Doctoral Dissertation Defense: Ji Li
Advisor: Dr. Yi Huang
Location
Mathematics/Psychology : 412
Date & Time
June 10, 2025, 3:00 pm – 5:00 pm
Description
Title: Leveraging External Data in Clinical Trial Design: Synthetic Control Arm Construction Using a Causal Inference Integrated Machine Learning Approach
Abstract
The integration of external data to construct synthetic controls marks a paradigm shift from the conventional reliance on concurrent randomized controlled trials (RCTs) in evidence-based medicine. Although initially met with skepticism from the academic community, the use of real-world data (RWD) to form synthetic controls has gained significant traction in recent years and is now widely used in practice. Both the U.S. FDA and the EMA have issued multiple guidance documents supporting the use of real-world evidence (RWE) in drug development and post-market safety evaluation. This momentum is fueled by the high costs and feasibility challenges of traditional RCTs, escalating healthcare expenditures, increased availability of high-quality RWD, and advances in causal inference and Bayesian methodologies for evaluating average causal effects from observational studies and partially controlled trials. A new class of statistical techniques, known as synthetic control methods, has emerged over the past two decades to address the complexities and potential biases that arise from integrating external data, such as covariate heterogeneity and unmeasured confounding. These methods use external controls or RWD to construct equivalent control subjects that mimic traditional control groups, enabling consistent estimation of treatment effects in single-arm trials or RCTs with limited concurrent controls.
Given the increasing adoption of synthetic control arms and the complexities involved in reaching valid causal inference on the average treatment effect (ATE) under such designs, it is crucial to understand the methodology and theoretical foundations that ensure their validity and reliability in clinical research. Chapter 2 of my dissertation compares, through a comprehensive simulation study, the performance of various synthetic control methods that borrow external information to maximize the efficiency of ATE estimation. It also proposes using a one-class Support Vector Machine (OCSVM) in place of propensity score (PS) methods to improve the efficiency and validity of synthetic control methods. I compare four categories of methods (PS methods, the newly proposed machine learning method, Bayesian approaches, and two-stage approaches) across various scenarios, ranging from similar external data to multiple data sets with substantial heterogeneity and non-linear structure that mimic real-world data settings. The Bayesian methods include the randomized power prior and meta-analytic-predictive related priors. The simulation studies indicate that OCSVM identifies the external data points most compatible with the current data set and achieves improved covariate balance relative to PS methods across diverse simulated scenarios, confirming that OCSVM outperforms PS methods in handling non-linear data complexities arising from multiple heterogeneous external data sources.
Building on the promising performance of OCSVM, Chapter 3 explores further enhancements to maximize its capability in addressing the complexities of RWD. This chapter introduces three methodological innovations: (1) a novel tuning approach for the γ parameter in the radial basis function kernel; (2) a weighted OCSVM method designed to mitigate the influence of outliers, employing position- and density-based weights to improve boundary sensitivity; and (3) a specialized mixed-type kernel for datasets containing both continuous and categorical variables. Benchmarking studies demonstrate that these innovations substantially enhance the robustness, adaptability, and generalizability of OCSVM, making it highly suitable for complex, real-world datasets.
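To make the idea of a mixed-type kernel concrete, the following is a minimal sketch of one generic construction: an RBF kernel on the continuous covariates multiplied by an exponential mismatch-count kernel on the categorical covariates. This is an illustrative assumption, not the specialized kernel developed in the dissertation; the function names, the parameter `lam`, and the toy data are all hypothetical.

```python
import math

def mixed_kernel(x_cont, z_cont, x_cat, z_cat, gamma=0.5, lam=1.0):
    """Similarity between two subjects with continuous and categorical covariates.

    Illustrative construction (an assumption, not the dissertation's kernel):
    RBF on the continuous part times exp(-lam * number of categorical mismatches).
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_cont, z_cont))
    mismatches = sum(1 for a, b in zip(x_cat, z_cat) if a != b)
    return math.exp(-gamma * sq_dist) * math.exp(-lam * mismatches)

def gram_matrix(cont, cat, gamma=0.5, lam=1.0):
    """Pairwise kernel matrix for a list of subjects."""
    n = len(cont)
    return [[mixed_kernel(cont[i], cont[j], cat[i], cat[j], gamma, lam)
             for j in range(n)] for i in range(n)]

# Hypothetical toy data: one standardized continuous covariate, two categorical ones.
cont = [[0.0], [1.0], [0.2]]
cat = [["f", "stage1"], ["m", "stage2"], ["f", "stage1"]]
K = gram_matrix(cont, cat)
```

A matrix built this way could, in principle, be supplied to a one-class SVM implementation that accepts precomputed kernels, such as scikit-learn's `OneClassSVM(kernel="precomputed")`.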
Chapter 4 introduces a hybrid approach designed to address an inherent limitation of OCSVM: it treats all external data points within the decision boundary equally, which can introduce bias if covariate distributions differ substantially between the borrowed external data and the current data set. To overcome this, the proposed approach, OCSVM-EB, first employs OCSVM to trim external data points that are incompatible with the current data set, and then applies entropy balancing (EB) to reweight the borrowed external data. By enforcing specific moment constraints, EB ensures that the covariate distributions of the external data closely align with those of the current study, effectively reducing distributional bias. Simulation results confirm that OCSVM-EB consistently achieves better covariate balance than traditional PS methods. Taken together, the methodological innovations in this dissertation strengthen the valid and efficient integration of external data into clinical trials.
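The entropy balancing step can be illustrated with a minimal sketch. It assumes the external subjects have already been trimmed, that covariates are on a standardized scale, and that only first moments (means) are constrained. The weights take the exponential-tilting form implied by entropy balancing and are found here by plain gradient descent on the dual problem. The function name, the solver choice, and the toy data are assumptions made for illustration, not the dissertation's implementation.

```python
import math

def entropy_balance(X, target_means, max_iter=20000, tol=1e-10):
    """Weights on rows of X (summing to 1) whose weighted covariate means
    match target_means while staying as close to uniform as possible in
    the Kullback-Leibler sense. First-moment constraints only (assumption)."""
    n, p = len(X), len(target_means)
    # Centre each external subject at the target (current-trial) means.
    D = [[X[i][j] - target_means[j] for j in range(p)] for i in range(n)]
    # Conservative step size: reciprocal of an upper bound on the dual's curvature.
    step = 1.0 / max(sum(d * d for d in row) for row in D)
    lam = [0.0] * p
    for _ in range(max_iter):
        scores = [sum(l * d for l, d in zip(lam, row)) for row in D]
        top = max(scores)  # subtract the maximum for numerical stability
        expo = [math.exp(s - top) for s in scores]
        total = sum(expo)
        w = [e / total for e in expo]
        # Gradient of the dual = current weighted imbalance in each covariate.
        grad = [sum(w[i] * D[i][j] for i in range(n)) for j in range(p)]
        if max(abs(g) for g in grad) < tol:
            break
        lam = [l - step * g for l, g in zip(lam, grad)]
    return w

# Hypothetical trimmed external controls: (standardized age, binary indicator).
external = [[-1.0, 0.0], [-0.5, 1.0], [0.0, 0.0],
            [0.5, 1.0], [1.0, 0.0], [1.5, 1.0]]
trial_means = [0.5, 0.6]  # hypothetical covariate means in the current trial
weights = entropy_balance(external, trial_means)
balanced = [sum(w * row[j] for w, row in zip(weights, external)) for j in range(2)]
```

After reweighting, `balanced` should agree with `trial_means` up to the solver tolerance. A solution exists only when the target means lie inside the convex hull of the external covariates, which is one reason a trimming step beforehand is useful.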