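Patient Representation Learning
Representation Learning Based on Contrastive Learning and Self-Supervised Pretraining
These works use contrastive learning, self-supervised pretext tasks, and pretraining paradigms to address the scarcity of labeled data in medicine. By aligning different modalities, views, or samples, they produce general-purpose, robust patient representations.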
- An Adaptive Multi-Indicator Contrastive Predictive Coding Framework for Patient Representation Learning(Hongxu Yuan, Yuzheng Yan, Xiaozhu Jing, Wuman Luo, 2025, 2025 International Joint Conference on Neural Networks (IJCNN))
- Patient Embeddings From Diagnosis Codes for Health Care Prediction Tasks: Pat2Vec Machine Learning Framework(E. Steiger, L. Kroll, 2022, JMIR AI)
- Patient contrastive learning: A performant, expressive, and practical approach to electrocardiogram modeling(N. Diamant, Erik Reinertsen, Steven Song, A. Aguirre, Collin M. Stultz, P. Batra, 2021, PLOS Computational Biology)
- Applying Self-Supervised Learning to Medicine: Review of the State of the Art and Medical Implementations(Alexander Chowdhury, Jacob Rosenthal, Jonathan Waring, Renato Umeton, 2021, Informatics)
- Self-Supervised Contrastive Learning for Disease Trajectory Prediction(A. C. Das, Md Shujan Shak, Nabila Rahman, Fuad Mahmud, A. Eva, M. Hasan, 2025, 2025 5th International Conference on Pervasive Computing and Social Networking (ICPCSN))
- PRIME: Pretraining for Patient Condition Representation with Irregular Multimodal Electronic Health Records(Bohao Li, Bowen Du, Junchen Ye, 2025, ACM Transactions on Knowledge Discovery from Data)
- Medical Knowledge-Driven Contrastive Learning for Similar Patient Retrieval(Fanqing Meng, C. Feng, Ge Shi, Xia Liu, Bo Wang, Kaiyuan Zhang, Zhuang Yan, 2026, IEEE Journal of Biomedical and Health Informatics)
- Learning end-to-end patient representations through self-supervised covariate balancing for causal treatment effect estimation(Gino L. Tesei, S. Giampanis, Jingpu Shi, Beau Norgeot, 2023, Journal of Biomedical Informatics)
- Self-supervised learning in medicine and healthcare(R. Krishnan, P. Rajpurkar, E. Topol, 2022, Nature Biomedical Engineering)
- Forecasting the future clinical events of a patient through contrastive learning(Ziqi Zhang, Chao Yan, Xinmeng Zhang, Steve Nyemba, B. Malin, 2022, Journal of the American Medical Informatics Association)
- MedPACL: Medical Patient-Aware Contrastive Learning with Modality-Specific Augmentations for Robust Representation Learning(Abdellah Azizi, Yassine Azizi, M’barek Nasri, 2026, Lecture Notes in Networks and Systems)
- MHGRL: An Effective Representation Learning Model for Electronic Health Records(Feiyan Liu, Liangzhi Li, Xiaoli Wang, Feng Luo, Chang Liu, Jinsong Su, Yiming Qian, 2024, Proceedings of the Language Resources and Evaluation Conference)
- CARE: Contrastive and Adversarial Representation Enhancement for Heart Failure Prediction(Yunfan Zhou, Ying Li, Xuxue Sun, Yongqi Hou, Dongquan Li, Jingsong Shao, Li Wang, Bo Kong, 2025, 2025 IEEE International Conference on Big Data (BigData))
- Universal representations in cardiovascular ECG assessment: A self-supervised learning approach(Zhi-Yong Liu, Ching-Heng Lin, Yu-Chun Hsu, Jung-Sheng Chen, Po-Cheng Chang, Ming-Shien Wen, Chang-Fu Kuo, 2024, International Journal of Medical Informatics)
- GatorCLR: Personalized predictions of patient outcomes on electronic health records using self-supervised contrastive graph representation(Yuxi Liu, Zhenhao Zhang, Jiacong Mi, Shirui Pan, Tianlong Chen, Yi Guo, Xing He, Jiang Bian, 2025, Journal of Biomedical Informatics)
- Contrast Everything: A Hierarchical Contrastive Framework for Medical Time-Series(Yu Han, Haishuai Wang, Yihe Wang, Xiang Zhang, 2023, Advances in Neural Information Processing Systems 36)
- A decision support system in precision medicine: contrastive multimodal learning for patient stratification(Qing Yin, Linda L D Zhong, Yunya Song, Liang Bai, Zhihua Wang, Chen Li, Yida Xu, Xian Yang, 2023, Annals of Operations Research)
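The alignment idea described for this group can be illustrated with an InfoNCE-style loss, a commonly used contrastive objective. This is a minimal sketch, not the method of any specific paper listed above: the function names, the temperature value, and the toy embeddings are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: negative log-softmax score of the positive."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

# Hypothetical embeddings: two views of the same patient, plus two other patients.
view_a = [0.9, 0.1, 0.0]
view_b = [0.8, 0.2, 0.1]
others = [[0.0, 1.0, 0.0], [0.1, 0.0, 1.0]]

loss_aligned = info_nce(view_a, view_b, others)
loss_misaligned = info_nce(view_a, others[0], [view_b, others[1]])
print(loss_aligned, loss_misaligned)
```

The loss is small when the two views of the same patient are closer to each other than to other patients, and large otherwise, which is the signal that lets such models train without labels.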
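Clinical Trajectory Analysis Based on Sequence Modeling and Transformers
These studies treat patient records as time series and use deep architectures such as Transformers, RNNs, and VAEs to capture long-term dependencies and temporal dynamics in the records, yielding deep representations of diagnosis and treatment trajectories.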
- Transformer patient embedding using electronic health records enables patient stratification and progression analysis(Su Xian, M. Grabowska, I. Kullo, Yuan Luo, J. Smoller, T. Walunas, Wei-Qi Wei, Gail P. Jarvik, Sean D. Mooney, D. Crosslin, 2025, npj Digital Medicine)
- Patient Representation From Structured Electronic Medical Records Based on Embedding Technique: Development and Validation Study(Yanqun Huang, Ni Wang, Zhiqiang Zhang, Honglei Liu, Xiaolu Fei, Lan Wei, Hui Chen, 2020, JMIR Medical Informatics)
- DeepEnroll: Patient-Trial Matching with Deep Embedding and Entailment Prediction(Xingyao Zhang, Cao Xiao, Lucas M. Glass, Jimeng Sun, 2020, Proceedings of The Web Conference 2020)
- Language Models Are An Effective Representation Learning Technique For Electronic Health Record Data(E. Steinberg, Kenneth Jung, J. Fries, Conor K. Corbin, S. Pfohl, N. Shah, 2020, Journal of Biomedical Informatics)
- Deep representation learning of electronic health records to unlock patient stratification at scale(I. Landi, B. Glicksberg, Hao-Chih Lee, S. Cherng, Giulia Landi, M. Danieletto, J. Dudley, Cesare Furlanello, Riccardo Miotto, 2020, npj Digital Medicine)
- PROMISE: A pre-trained knowledge-infused multimodal representation learning framework for medication recommendation(Jialun Wu, Xinyao Yu, Kai He, Zeyu Gao, Tieliang Gong, 2024, Information Processing & Management)
- Knowledge enhanced representation learning network for drug recommendation(Xiaobo Li, Xiaodi Hou, Fanjun Meng, Xiaokun Zhang, Mingyu Lu, Hongfei Lin, Yijia Zhang, 2025, Information Processing & Management)
- Contrastive Learning of Temporal Distinctiveness for Survival Analysis in Electronic Health Records(Mohsen Nayebi Kerdabadi, Arya Hadizadeh Moghaddam, Bin Liu, Meitian Liu, Zijun Yao, 2023, Proceedings of the 32nd ACM International Conference on Information and Knowledge Management)
- Language-model-based patient embedding using electronic health records facilitates phenotyping, disease forecasting, and progression analysis(Su Xian, M. Grabowska, I. Kullo, Yuan Luo, J. Smoller, Wei-Qi Wei, Gail P. Jarvik, Sean D. Mooney, D. Crosslin, 2024, Research …)
- HyMaTE: A Hybrid Mamba and Transformer Model for EHR Representation Learning(Md Mozaharul Mottalib, Thao-Ly T. Phan, R. Beheshti, 2025, Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics)
- Patient2Vec: A Personalized Interpretable Deep Representation of the Longitudinal Electronic Health Record(Jinghe Zhang, Kamran Kowsari, James H. Harrison, J. Lobo, Laura E. Barnes, 2018, IEEE Access)
- Patient representation learning and interpretable evaluation using clinical notes(Madhumita Sushil, Simon Suster, Kim Luyckx, Walter Daelemans, 2018, Journal of Biomedical Informatics)
- Predicting Sequences of Clinical Events by Using a Personalized Temporal Latent Embedding Model(Cristóbal Esteban, D. Schmidt, Denis Krompass, Volker Tresp, 2015, 2015 International Conference on Healthcare Informatics)
- Readmission prediction via deep contextual embedding of clinical concepts(Cao Xiao, Tengfei Ma, A. B. Dieng, D. Blei, Fei Wang, 2018, PLOS ONE)
- Modelling Patient Trajectories Using Multimodal Information(J. F. Silva, S. Matos, 2022, Journal of Biomedical Informatics)
- ConCare: Personalized Clinical Feature Embedding via Capturing the Healthcare Context(Liantao Ma, Chaohe Zhang, Yasha Wang, Wenjie Ruan, Jiantao Wang, Wen Tang, Xinyu Ma, Xin Gao, Junyi Gao, 2019, Proceedings of the AAAI Conference on Artificial Intelligence)
- Deep representation learning for clustering longitudinal survival data from electronic health records(Jiajun Qiu, Yao Hu, Li Li, A. M. Erzurumluoglu, Ingrid Braenne, C. Whitehurst, Jochen Schmitz, J. Arora, B. A. Bartholdy, Shrey Gandhi, Pierre Khoueiry, Stefanie Mueller, Boris Noyvert, Zhihao Ding, Jan-Nygaard Jensen, Johann de Jong, 2025, Nature Communications)
- An Effective Patient Representation Learning for Time-series Prediction Tasks Based on EHRs(Liqi Lei, Yangming Zhou, Jie Zhai, Le Zhang, Zhijia Fang, Ping He, Ju Gao, 2018, 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM))
- Representation learning for clinical time series prediction tasks in electronic health records(Tong Ruan, Liqi Lei, Yangming Zhou, Jie Zhai, Le Zhang, Ping He, Ju Gao, 2019, BMC Medical Informatics and Decision Making)
- On Clinical Event Prediction in Patient Treatment Trajectory Using Longitudinal Electronic Health Records(H. Duan, Zhoujian Sun, W. Dong, K. He, Zhengxing Huang, 2019, IEEE Journal of Biomedical and Health Informatics)
- Hi-BEHRT: Hierarchical Transformer-Based Model for Accurate Prediction of Clinical Events Using Multimodal Longitudinal Electronic Health Records(Yikuan Li, M. Mamouei, G. Salimi-Khorshidi, Shishir Rao, A. Hassaine, D. Canoy, Thomas Lukasiewicz, K. Rahimi, 2021, IEEE Journal of Biomedical and Health Informatics)
- Trajectory-Ordered Objectives for Self-Supervised Representation Learning of Temporal Healthcare Data Using Transformers: Model Development and Evaluation Study(Ali Amirahmadi, Farzaneh Etminani, Jonas Björk, Olle Melander, Mattias Ohlsson, 2025, JMIR Medical Informatics)
- Bidirectional Representation Learning from Transformers using Multimodal Electronic Health Record Data to Predict Depression(Yiwen Meng, W. Speier, Michael K. Ong, C. Arnold, 2021, IEEE Journal of Biomedical and Health Informatics)
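Complex-Structure Embedding Based on Graph Neural Networks and Multimodal Fusion
These works focus on graph neural networks and multimodal integration techniques to explicitly model the complex relations among heterogeneous medical entities (drugs, diagnoses, services) and the cross-modal dependencies between text and structured data.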
- Graph-Based Patient Representation for Multimodal Clinical Data: Addressing Data Heterogeneity(Suparna Ghanvatkar, Vaibhav Rajan, 2023, medRxiv)
- Predicting Patient Outcomes with Graph Representation Learning(Catherine Tong, Emma Rocheteau, Petar Veličković, Nicholas D. Lane, Píetro Lió, 2022, Studies in Computational Intelligence)
- Predicting the Survival of Cancer Patients With Multimodal Graph Neural Network(Jianliang Gao, Tengfei Lyu, Fan Xiong, Jianxin Wang, W. Ke, Zhao Li, 2021, IEEE/ACM Transactions on Computational Biology and Bioinformatics)
- Patient Health Representation Learning via Correlational Sparse Prior of Medical Features(Xin Ma, Yasha Wang, Xu Chu, Liantao Ma, Wen Tang, Junfeng Zhao, Ye Yuan, Guoren Wang, 2023, IEEE Transactions on Knowledge and Data Engineering)
- Enhancing Drug Recommendations Via Heterogeneous Graph Representation Learning in EHR Networks(Haijun Zhang, Xian Yang, Liang Bai, Jiye Liang, 2024, IEEE Transactions on Knowledge and Data Engineering)
- Variationally regularized graph-based representation learning for electronic health records(Weicheng Zhu, N. Razavian, 2019, Proceedings of the Conference on Health, Inference, and Learning)
- Multimodal learning for scalable representation of high-dimensional medical data(A. Alsaafin, A. Shafique, S. Alfasly, Krishna R. Kalari, H. Tizhoosh, 2024, Frontiers in Digital Health)
- Deep learning with multimodal representation for pancancer prognosis prediction(Anika Cheerla, O. Gevaert, 2019, Bioinformatics)
- Multimodal representation learning for medical analytics - a systematic literature review(E. Hansen, Tomer Sagi, Katja Hose, 2024, Health Informatics Journal)
- Leveraging graph-based hierarchical medical entity embedding for healthcare applications(Tong Wu, Yunlong Wang, Yue Wang, E. Zhao, Yilian Yuan, 2021, Scientific Reports)
- Multi-layer Representation Learning for Medical Concepts(E. Choi, M. T. Bahadori, Elizabeth Searles, Catherine Coffey, Michael Thompson, J. Bost, Javier Tejedor-Sojo, Jimeng Sun, 2016, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining)
- Two-stage Federated Phenotyping and Patient Representation Learning(Dianbo Liu, Dmitriy Dligach, Timothy Miller, 2019, Proceedings of the 18th BioNLP Workshop and Shared Task)
- A Representation Fusion Framework for Decoupling Diagnostic Information in Multimodal Learning(Sana Tonekaboni, S. Friedman, Xinyi Zhang, Mahnaz Maddah, Caroline Uhler, 2025, npj Digital Medicine)
- Transformer-based unsupervised patient representation learning based on medical claims for risk stratification and analysis(Xianlong Zeng, Simon M. Lin, Chang Liu, 2021, Proceedings of the 12th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics)
- Multimodal Representation Learning Based on Personalized Graph-Based Fusion for Mortality Prediction Using Electronic Medical Records(Abdulrahman Al-Dailami, Hulin Kuang, Jian-xin Wang, 2025, Big Data Mining and Analytics)
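Patient Similarity Mining and Systematic Reviews and Evaluation
This group covers methods that use patient-cohort similarity to enhance representations, together with review articles that systematically summarize the field, evaluate its methodology, and discuss its challenges (such as bias and interpretability).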
- GRASP: Generic Framework for Health Status Representation Learning Based on Incorporating Knowledge from Similar Patients(Chaohe Zhang, Xin Gao, Liantao Ma, Yasha Wang, Jiangtao Wang, Wen Tang, 2021, Proceedings of the AAAI Conference on Artificial Intelligence)
- Mining Patient Cohort Discovery: A Synergy of Medical Embeddings and Approximate Nearest Neighbor Search(Dimitrios Karapiperis, Antonios P. Antoniadis, V. Verykios, 2025, Electronics)
- Early prediction of hepatocellular carcinoma using a risk-embedded longitudinal attention model(Chupeng Ling, Yiwen Zhang, Chengguang Hu, Naying Liao, Jinlong Zhang, Yuanping Zhou, Wei Yang, 2026, Biomedical Signal Processing and Control)
- Deep Representation Learning of Patient Data from Electronic Health Records (EHR): A Systematic Review(Yuqi Si, Jingcheng Du, Zhao Li, Xiaoqian Jiang, T. Miller, Fei Wang, W. J. Zheng, Kirk Roberts, 2020, Journal of Biomedical Informatics)
- A scoping review of self-supervised representation learning for clinical decision making using EHR categorical data(Yuanyuan Zheng, Adel Bensahla, Mina Bjelogrlic, Jamil Zaghir, Hugues Turbé, Bednarczyk Lydie, C. Gaudet-Blavignac, J. Ehrsam, Stéphane Marchand-Maillet, Christian Lovis, 2025, npj Digital Medicine)
- Deep Holistic Representation Learning from EHR(Edmond Zhang, R. Robinson, Bernhard Pfahringer, 2018, 2018 12th International Symposium on Medical Information and Communication Technology (ISMICT))
- Gender-sensitive word embeddings for healthcare(Shunit Agmon, Plia Gillis, E. Horvitz, Kira Radinsky, 2021, Journal of the American Medical Informatics Association)
- Augmentation-Free Longitudinal Modeling Through Structuring Whitened Embeddings(Karel Fonteyn, Lennert Bontinck, T. Dhaene, D. Deschrijver, 2025, IEEE Access)
- Recent advances in representation learning for electronic health records: A systematic review(X Liu, H Wang, T He, Y Liao, C Jian, 2022, Journal of Physics …)
- Embedding Methods for Electronic Health Record Research.(Justin Kauffman, Riccardo Miotto, Eyal Klang, Anthony B. Costa, Beau Norgeot, M. Zitnik, Shameer Khader, Fei Wang, Girish N. Nadkarni, Benjamin S. Glicksberg, 2025, Annual Review of Biomedical Data Science)
- Generic medical concept embedding and time decay for diverse patient outcome prediction tasks(Yupeng Li, Wei Dong, B. Ru, Adam Black, Xinyuan Zhang, Y. Guan, 2022, iScience)
- Temporal and comorbidity-aware representation of longitudinal patient trajectories from electronic health records(M Sreenivasan, S Madhavendranath, 2026, Biomedical Physics & …)
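Research on patient representation learning has developed into a complete body of work built on three technical pillars: self-supervised contrastive learning, sequential trajectory modeling, and graph-structured fusion. Its focus is shifting from simple single-modality representations toward multimodal data consistency, long-range dependencies in clinical time series, and relations among heterogeneous entities, while the field pays increasing attention to evaluating model robustness, bias control, and clinical interpretability.
67 related papers in total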
OBJECTIVES Patient representation learning refers to learning a dense mathematical representation of a patient that encodes meaningful information from Electronic Health Records (EHRs). This is generally performed using advanced deep learning methods. This study presents a systematic review of this field and provides both qualitative and quantitative analyses from a methodological perspective. METHODS We identified studies developing patient representations from EHRs with deep learning methods from MEDLINE, EMBASE, Scopus, the Association for Computing Machinery (ACM) Digital Library, and the Institute of Electrical and Electronics Engineers (IEEE) Xplore Digital Library. After screening 363 articles, 49 papers were included for a comprehensive data collection. RESULTS Publications developing patient representations almost doubled each year from 2015 until 2019. We noticed a typical workflow starting with feeding raw data, applying deep learning models, and ending with clinical outcome predictions as evaluations of the learned representations. Specifically, learning representations from structured EHR data was dominant (37 out of 49 studies). Recurrent Neural Networks were widely applied as the deep learning architecture (Long short-term memory: 13 studies, Gated recurrent unit: 11 studies). Learning was mainly performed in a supervised manner (30 studies) optimized with cross-entropy loss. Disease prediction was the most common application and evaluation (31 studies). Benchmark datasets were mostly unavailable (28 studies) due to privacy concerns of EHR data, and code availability was assured in 20 studies. DISCUSSION & CONCLUSION The existing predictive models mainly focus on the prediction of single diseases, rather than considering the complex mechanisms of patients from a holistic review. We show the importance and feasibility of learning comprehensive representations of patient EHR data through a systematic review. 
Advances in patient representation learning techniques will be essential for powering patient-level EHR analyses. Future work will still be devoted to leveraging the richness and potential of available EHR data. Reproducibility and transparency of reported results will hopefully improve. Knowledge distillation and advanced learning techniques will be exploited to assist the capability of learning patient representation further.
Deriving disease subtypes from electronic health records (EHRs) can guide next-generation personalized medicine. However, challenges in summarizing and representing patient data prevent widespread practice of scalable EHR-based stratification analysis. Here we present an unsupervised framework based on deep learning to process heterogeneous EHRs and derive patient representations that can efficiently and effectively enable patient stratification at scale. We considered EHRs of 1,608,741 patients from a diverse hospital cohort comprising a total of 57,464 clinical concepts. We introduce a representation learning model based on word embeddings, convolutional neural networks, and autoencoders (i.e., ConvAE) to transform patient trajectories into low-dimensional latent vectors. We evaluated these representations as broadly enabling patient stratification by applying hierarchical clustering to different multi-disease and disease-specific patient cohorts. ConvAE significantly outperformed several baselines in a clustering task to identify patients with different complex conditions, with 2.61 entropy and 0.31 purity average scores. When applied to stratify patients within a certain condition, ConvAE led to various clinically relevant subtypes for different disorders, including type 2 diabetes, Parkinson’s disease, and Alzheimer’s disease, largely related to comorbidities, disease progression, and symptom severity. With these results, we demonstrate that ConvAE can generate patient representations that lead to clinically meaningful insights. This scalable framework can help better understand varying etiologies in heterogeneous sub-populations and unlock patterns for EHR-based research in the realm of personalized medicine.
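The abstract above reports clustering quality as entropy and purity scores. The sketch below shows one common way to compute both against reference labels. The exact definitions used in the study (for example the logarithm base) are not stated here, so the base-2, size-weighted entropy and the toy labels are assumptions.

```python
import math
from collections import Counter, defaultdict

def purity_and_entropy(cluster_ids, true_labels):
    """Return (purity, entropy) of a clustering against reference labels."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster].append(label)
    total = len(true_labels)
    majority_hits = 0
    weighted_entropy = 0.0
    for labels in members.values():
        counts = Counter(labels)
        # Purity counts the most frequent reference label in each cluster.
        majority_hits += max(counts.values())
        # Entropy of the label distribution inside this cluster (base 2).
        cluster_entropy = -sum(
            (c / len(labels)) * math.log2(c / len(labels)) for c in counts.values()
        )
        weighted_entropy += (len(labels) / total) * cluster_entropy
    return majority_hits / total, weighted_entropy

# Hypothetical cluster assignments and condition labels for six patients.
clusters = [0, 0, 0, 1, 1, 1]
conditions = ["T2D", "T2D", "PD", "AD", "AD", "AD"]
purity, entropy = purity_and_entropy(clusters, conditions)
print(purity, entropy)
```

Higher purity and lower entropy both indicate clusters that line up more closely with the reference conditions.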
We have three contributions in this work:
1. We explore the utility of a stacked denoising autoencoder and a paragraph vector model to learn task-independent dense patient representations directly from clinical notes. To analyze if these representations are transferable across tasks, we evaluate them in multiple supervised setups to predict patient mortality, primary diagnostic and procedural category, and gender. We compare their performance with sparse representations obtained from a bag-of-words model. We observe that the learned generalized representations significantly outperform the sparse representations when we have few positive instances to learn from, and there is an absence of strong lexical features.
2. We compare the model performance of the feature set constructed from a bag of words to that obtained from medical concepts. In the latter case, concepts represent problems, treatments, and tests. We find that concept identification does not improve the classification performance.
3. We propose novel techniques to facilitate model interpretability. To understand and interpret the representations, we explore the best encoded features within the patient representations obtained from the autoencoder model. Further, we calculate feature sensitivity across two networks to identify the most significant input features for different classification tasks when we use these pretrained representations as the supervised input. We successfully extract the most influential features for the pipeline using this technique.
Electronic Health Records (EHRs) provide possibilities to improve patient care and facilitate clinical research. However, applications of EHRs face many challenges, such as temporality, high dimensionality, sparseness, noise, random error, and systematic bias. In particular, temporal patient information is difficult for traditional machine learning methods to use effectively, even though the sequential information in EHRs is very useful. In this paper, we propose a general-purpose patient representation learning approach to summarize sequential EHRs. Specifically, a recurrent neural network based denoising autoencoder is employed to encode the in-hospital records of each patient into a low-dimensional dense vector. Based on EHR data collected from Shanghai Shuguang Hospital, we experimentally evaluate our proposed method on both mortality prediction and comorbidity prediction tasks. Experimental studies show that our proposed method outperforms other reference methods based on raw EHR data. We also apply the “Deep Feature” represented by our method to track similar patients with t-SNE, which also achieves interesting results.
Proper representations of medical concepts such as diagnosis, medication, and procedure codes and visits from Electronic Health Records (EHR) have broad applications in healthcare analytics. Patient EHR data consists of a sequence of visits over time, where each visit includes multiple medical concepts, e.g., diagnosis, procedure, and medication codes. This hierarchical structure provides two types of relational information, namely the sequential order of visits and the co-occurrence of codes within a visit. In this work, we propose Med2Vec, which not only learns the representations for both medical codes and visits from large EHR datasets with over a million visits, but also allows us to interpret the learned representations, as confirmed positively by clinical experts. In the experiments, Med2Vec shows significant improvement in prediction accuracy in clinical applications compared to baselines such as Skip-gram, GloVe, and stacked autoencoder, while providing clinically meaningful interpretation.
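The two kinds of relational information named above (co-occurrence of codes within a visit, and the sequential order of visits) can be made concrete by enumerating the training pairs each one yields. This sketch only extracts those pairs; it is not the Med2Vec model itself. The function names, the window size, and the example codes are hypothetical.

```python
from itertools import permutations

def intra_visit_pairs(visit):
    """All ordered (target, context) pairs of distinct codes in one visit."""
    return list(permutations(sorted(set(visit)), 2))

def neighbouring_visits(visits, window=1):
    """(visit index, context visit index) pairs within a window of visits."""
    pairs = []
    for i in range(len(visits)):
        for j in range(max(0, i - window), min(len(visits), i + window + 1)):
            if j != i:
                pairs.append((i, j))
    return pairs

# Hypothetical patient: three visits, each a list of medical codes.
patient = [["E11.9", "I10"], ["I10", "N18.3", "metformin"], ["N18.3"]]

code_pairs = [pair for visit in patient for pair in intra_visit_pairs(visit)]
visit_pairs = neighbouring_visits(patient)
print(code_pairs)
print(visit_pairs)
```

A model in this style would then learn code vectors from the first set of pairs and visit vectors from the second.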
A large percentage of medical information is in unstructured text format in electronic medical record systems. Manual extraction of information from clinical notes is extremely time consuming. Natural language processing has been widely used in recent years for automatic information extraction from medical texts. However, algorithms trained on data from a single healthcare provider are error-prone and do not generalize, due to the heterogeneity and uniqueness of medical documents. We develop a two-stage federated natural language processing method that enables utilization of clinical notes from different hospitals or clinics without moving the data, and demonstrate its performance using obesity and comorbidities phenotyping as the medical task. This approach not only improves the quality of a specific clinical task but also facilitates knowledge progression in the whole healthcare system, which is an essential part of a learning health system. To the best of our knowledge, this is the first application of federated machine learning in clinical NLP.
… ) for extracting the patient neighbourhood information. We … patient cases using graph neural networks is a promising research direction, yielding tangible returns in supervised learning …
Background Electronic health records (EHRs) provide possibilities to improve patient care and facilitate clinical research. However, applications of EHRs face many challenges, such as temporality, high dimensionality, sparseness, noise, random error and systematic bias. In particular, temporal information is difficult for traditional machine learning methods to use effectively, even though the sequential information in EHRs is very useful. Method In this paper, we propose a general-purpose patient representation learning approach to summarize sequential EHRs. Specifically, a recurrent neural network based denoising autoencoder (RNN-DAE) is employed to encode the in-hospital records of each patient into a low-dimensional dense vector. Results Based on EHR data collected from Shuguang Hospital affiliated to Shanghai University of Traditional Chinese Medicine, we experimentally evaluate our proposed RNN-DAE method on both the mortality prediction task and the comorbidity prediction task. Extensive experimental results show that our proposed RNN-DAE method outperforms existing methods. In addition, we apply the “Deep Feature” represented by our proposed RNN-DAE method to track similar patients with t-SNE, which also achieves some interesting observations. Conclusion We propose an effective unsupervised RNN-DAE method to summarize patient sequential information in EHR data. Our proposed RNN-DAE method is useful on both the mortality prediction task and the comorbidity prediction task.
Language Models Are An Effective Representation Learning Technique For Electronic Health Record Data
Widespread adoption of electronic health records (EHRs) has fueled the development of using machine learning to build prediction models for various clinical outcomes. However, this process is often constrained by having a relatively small number of patient records for training the model. We demonstrate that using patient representation schemes inspired from techniques in natural language processing can increase the accuracy of clinical prediction models by transferring information learned from the entire patient population to the task of training a specific model, where only a subset of the population is relevant. Such patient representation schemes enable a 3.5% mean improvement in AUROC on five prediction tasks compared to standard baselines, with the average improvement rising to 19% when only a small number of patient records are available for training the clinical prediction model.
Exploiting the correlations between medical features is essential to the success of healthcare data analysis. However, most existing methods either suffer from large estimation variance due to insufficient data or are inflexible because they demand task-specific medical knowledge. In this paper, we propose a novel patient health representation learning framework dubbed SAFARI. SAFARI learns a compact representation by imposing a clinical-fact-inspired, task-agnostic correlational sparsity prior on the correlations of medical feature pairs. Specifically, we learn the compact representation by solving a bi-level optimization problem, which involves solving the high-level inter-group correlations and the nested lower-level intra-group correlations. We leverage the Laplacian kernel as a robust metric for feature grouping and graph neural networks for solving the bi-level optimization problem following the optimal value reformulation paradigm. Experiments on five datasets of various inputs and tasks demonstrate the efficacy of SAFARI. The discovered findings are also consistent with our insights and medical literature, which can provide valuable clinical explanations.
… Therefore, we collect the existing review papers from four perspectives: “representation learning” … graph attention mechanism to learn the node embeddings for patients’ risk prediction. …
The claims data, containing medical codes, services information, and incurred expenditure, can be a good resource for estimating an individual's health condition and medical risk level. In this study, we developed Transformer-based Multimodal AutoEncoder (TMAE), an unsupervised learning framework that can learn efficient patient representation by encoding meaningful information from the claims data. TMAE is motivated by the practical needs in healthcare to stratify patients into different risk levels for improving care delivery and management. Compared to previous approaches, TMAE is able to 1) model inpatient, outpatient, and medication claims collectively, 2) handle irregular time intervals between medical events, 3) alleviate the sparsity issue of the rare medical codes, and 4) incorporate medical expenditure information. We trained TMAE using a real-world pediatric claims dataset containing more than 600,000 patients and compared its performance with various approaches in two clustering tasks. Experimental results demonstrate that TMAE has superior performance compared to all baselines. Multiple downstream applications are also conducted to illustrate the effectiveness of our framework. The promising results confirm that the TMAE framework is scalable to large claims data and is able to generate efficient patient embeddings for risk stratification and analysis.
Deep learning models have been applied to many healthcare tasks based on electronic medical records (EMR) data and shown substantial performance. Existing methods commonly embed the records of a single patient into a representation for medical tasks. Such methods learn inadequate representations and lead to inferior performance, especially when the patient’s data is sparse or low-quality. Aiming at the above problem, we propose GRASP, a generic framework for healthcare models. For a given patient, GRASP first finds patients in the dataset who have similar conditions and similar results (i.e., the similar patients), and then enhances the representation learning and prognosis of the given patient by leveraging knowledge extracted from these similar patients. GRASP defines similarities with different meanings between patients for different clinical tasks, and finds similar patients with useful information accordingly, and then learns cohort representation to extract valuable knowledge contained in the similar patients. The cohort information is fused with the current patient’s representation to conduct final clinical tasks. Experimental evaluations on two real-world datasets show that GRASP can be seamlessly integrated into state-of-the-art models with consistent performance improvements. Besides, under the guidance of medical experts, we verified the findings extracted by GRASP, and the findings are consistent with the existing medical knowledge, indicating that GRASP can generate useful insights for relevant predictions.
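The retrieve-then-fuse pattern described above can be sketched in a few lines. This is a deliberately simplified illustration and not GRASP itself: the abstract describes task-specific similarities and a learned cohort representation, whereas this sketch substitutes fixed cosine similarity, a plain mean over the top-k neighbours, and a fixed mixing weight. All names and values are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k_similar(query, cohort, k):
    """Indices of the k cohort vectors most similar to the query."""
    ranked = sorted(
        range(len(cohort)), key=lambda i: cosine(query, cohort[i]), reverse=True
    )
    return ranked[:k]

def fuse_with_cohort(query, cohort, k=2, weight=0.5):
    """Blend the query with the mean of its k most similar cohort vectors."""
    neighbours = [cohort[i] for i in top_k_similar(query, cohort, k)]
    cohort_mean = [sum(col) / len(neighbours) for col in zip(*neighbours)]
    return [(1 - weight) * q + weight * c for q, c in zip(query, cohort_mean)]

# Hypothetical 2-dimensional patient representations.
patient = [1.0, 0.0]
cohort = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.3]]

neighbour_ids = top_k_similar(patient, cohort, k=2)
fused = fuse_with_cohort(patient, cohort, k=2, weight=0.5)
print(neighbour_ids, fused)
```

The fused vector would then feed the downstream clinical predictor in place of the single-patient representation.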
… a Knowledge Enhanced Representation Learning (KERL) … granularities to enhance patient representation. Meanwhile, we … a dual-path drug representation network to model longitudinal …
Background The secondary use of structured electronic medical record (sEMR) data has become a challenge due to the diversity, sparsity, and high dimensionality of the data representation. Constructing an effective representation for sEMR data is becoming more and more crucial for subsequent data applications. Objective We aimed to apply the embedding technique used in the natural language processing domain for the sEMR data representation and to explore the feasibility and superiority of the embedding-based feature and patient representations in clinical application. Methods The entire training corpus consisted of records of 104,752 hospitalized patients with 13,757 medical concepts of disease diagnoses, physical examinations and procedures, laboratory tests, medications, etc. Each medical concept was embedded into a 200-dimensional real number vector using the Skip-gram algorithm with some adaptive changes from shuffling the medical concepts in a record 20 times. The average of vectors for all medical concepts in a patient record represented the patient. For embedding-based feature representation evaluation, we used the cosine similarities among the medical concept vectors to capture the latent clinical associations among the medical concepts. We further conducted a clustering analysis on stroke patients to evaluate and compare the embedding-based patient representations. The Hopkins statistic, Silhouette index (SI), and Davies-Bouldin index were used for the unsupervised evaluation, and the precision, recall, and F1 score were used for the supervised evaluation. Results The dimension of patient representation was reduced from 13,757 to 200 using the embedding-based representation. The average cosine similarity of the selected disease (subarachnoid hemorrhage) and its 15 clinically relevant medical concepts was 0.973. Stroke patients were clustered into two clusters with the highest SI (0.852). 
Clustering analyses conducted on patients with the embedding representations showed higher applicability (Hopkins statistic 0.931), higher aggregation (SI 0.862), and lower dispersion (Davies-Bouldin index 0.551) than those conducted on patients with reference representation methods. The clustering solutions for patients with the embedding-based representation achieved the highest F1 scores of 0.944 and 0.717 for two clusters. Conclusions The feature-level embedding-based representations can reflect the potential clinical associations among medical concepts effectively. The patient-level embedding-based representation is easy to use as continuous input to standard machine learning algorithms and can bring performance improvements. It is expected that the embedding-based representation will be helpful in a wide range of secondary uses of sEMR data.
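The patient-level step described above (averaging concept vectors, then comparing vectors by cosine similarity) can be illustrated with a minimal sketch. The three-dimensional toy embeddings and the concept names below are hypothetical placeholders, not values from the study, which used 200-dimensional Skip-gram vectors.

```python
import math


def mean_pool(concept_vectors):
    """Average a list of equal-length concept vectors into one patient vector."""
    if not concept_vectors:
        raise ValueError("a patient record needs at least one concept")
    dim = len(concept_vectors[0])
    count = len(concept_vectors)
    return [sum(vec[i] for vec in concept_vectors) / count for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (assumption: in practice these come from a trained model).
toy_embeddings = {
    "subarachnoid_hemorrhage": [0.9, 0.1, 0.0],
    "head_ct": [0.8, 0.2, 0.1],
    "metformin": [0.0, 0.1, 0.9],
}

# A patient record is treated as a bag of medical concepts.
record = ["subarachnoid_hemorrhage", "head_ct"]
patient_vector = mean_pool([toy_embeddings[c] for c in record])
```

Mean pooling discards concept order, which matches the abstract's description of shuffling the concepts within a record during training.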
Clinical trials are essential for drug development but often suffer from expensive, inaccurate and insufficient patient recruitment. The core problem of patient-trial matching is to find qualified patients for a trial, where patient information is stored in electronic health records (EHR) while trial eligibility criteria (EC) are described in text documents available on the web. How to represent longitudinal patient EHR? How to extract complex logical rules from EC? Most existing works rely on manual rule-based extraction, which is time consuming and inflexible for complex inference. To address these challenges, we proposed a cross-modal inference learning model to jointly encode enrollment criteria (text) and patient records (tabular data) into a shared latent space for matching inference. The model applies a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to encode clinical trial information into sentence embeddings, and uses a hierarchical embedding model to represent patient longitudinal EHR. In addition, the model is augmented by a numerical information embedding and entailment module to reason over numerical information in both EC and EHR. These encoders are trained jointly to optimize the patient-trial matching score. We evaluated the model on the trial-patient matching task using real-world datasets, where it outperformed the best baseline by up to 12.4% in average F1.
Predicting the patient's clinical outcome from the historical electronic medical records (EMR) is a fundamental research problem in medical informatics. Most deep learning-based solutions for EMR analysis concentrate on learning the clinical visit embedding and exploring the relations between visits. Although those works have shown superior performances in healthcare prediction, they fail to explore the personal characteristics during the clinical visits thoroughly. Moreover, existing works usually assume that the more recent record weights more in the prediction, but this assumption is not suitable for all conditions. In this paper, we propose ConCare to handle the irregular EMR data and extract feature interrelationship to perform individualized healthcare prediction. Our solution can embed the feature sequences separately by modeling the time-aware distribution. ConCare further improves the multi-head self-attention via the cross-head decorrelation, so that the inter-dependencies among dynamic features and static baseline information can be effectively captured to form the personal health context. Experimental results on two real-world EMR datasets demonstrate the effectiveness of ConCare. The medical findings extracted by ConCare are also empirically confirmed by human experts and medical literature.
Objective Hospital readmission costs a lot of money every year. Many hospital readmissions are avoidable, and excessive hospital readmissions could also be harmful to the patients. Accurate prediction of hospital readmission can effectively help reduce the readmission risk. However, the complex relationship between readmission and potential risk factors makes readmission prediction a difficult task. The main goal of this paper is to explore deep learning models to distill such complex relationships and make accurate predictions. Materials and methods We propose CONTENT, a deep model that predicts hospital readmissions via learning interpretable patient representations by capturing both local and global contexts from patient Electronic Health Records (EHR) through a hybrid Topic Recurrent Neural Network (TopicRNN) model. The experiment was conducted using the EHR of a real world Congestive Heart Failure (CHF) cohort of 5,393 patients. Results The proposed model outperforms state-of-the-art methods in readmission prediction (e.g. 0.6103 ± 0.0130 vs. second best 0.5998 ± 0.0124 in terms of ROC-AUC). The derived patient representations were further utilized for patient phenotyping. The learned phenotypes provide more precise understanding of readmission risks. Discussion Embedding both local and global context in patient representation not only improves prediction performance, but also brings interpretable insights of understanding readmission risks for heterogeneous chronic clinical conditions. Conclusion This is the first of its kind model that integrates the power of both conventional deep neural network and the probabilistic generative models for highly interpretable deep patient representation learning. Experimental results and case studies demonstrate the improved performance and interpretability of the model.
This review aims to elucidate the role and impact of embedding techniques in the analysis and utilization of electronic health record data for research. By integrating multidimensional, incongruent, and often unstructured medical data for machine learning models, embeddings provide a powerful tool for enhancing data utility, especially under certain conditions and for asking certain questions. We explore a variety of embedding methods, including but not limited to word embeddings, graph embeddings, and other deep learning models. We highlight key applications of embeddings that are representative of a variety of areas of research, including predictive modeling, patient stratification, clinical decision support, and beyond. Finally, we show how to evaluate the impact and quality of embeddings in real-world clinical settings, assessing their performance against traditional models and noting areas where they deliver substantial improvements or fall short.
Current studies regarding the secondary use of electronic health records (EHR) predominantly rely on domain expertise and existing medical knowledge. A powerful representation approach can unleash the potential of discovering new medical patterns underlying the EHR. Here, we introduce an unsupervised method for embedding high-dimensional EHR data at the patient level to characterize heterogeneity in complex diseases and identify novel disease patterns linked to disparities in clinical outcomes. We applied this approach to 34,851 unique medical codes across 1,046,649 longitudinal patient events, including 102,740 patients in the Electronic Medical Records and GEnomics (eMERGE) Network. The model achieved strong predictive performance in predicting future disease (median AUROC = 0.87 within one year) and bulk phenotyping (median AUROC = 0.84). Notably, these patient embeddings revealed diverse comorbidity profiles and health outcomes, including distinct subtypes and progression patterns in colorectal cancer and systemic lupus erythematosus.
… each patient, and thus to develop the basis for a future clinical … on a combination of the embedding of entities and events in … We extend existing embedding models to the clinical domain…
Background In health care, diagnosis codes in claims data and electronic health records (EHRs) play an important role in data-driven decision making. Any analysis that uses a patient’s diagnosis codes to predict future outcomes or describe morbidity requires a numerical representation of this diagnosis profile made up of string-based diagnosis codes. These numerical representations are especially important for machine learning models. Most commonly, binary-encoded representations have been used, usually for a subset of diagnoses. In real-world health care applications, several issues arise: patient profiles show high variability even when the underlying diseases are the same, they may have gaps and not contain all available information, and a large number of appropriate diagnoses must be considered. Objective We herein present Pat2Vec, a self-supervised machine learning framework inspired by neural network–based natural language processing that embeds complete diagnosis profiles into a small real-valued numerical vector. Methods Based on German outpatient claims data with diagnosis codes according to the International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10), we discovered an optimal vectorization embedding model for patient diagnosis profiles with Bayesian optimization for the hyperparameters. The calibration process ensured a robust embedding model for health care–relevant tasks by aggregating the metrics of different regression and classification tasks using different machine learning algorithms (linear and logistic regression as well as gradient-boosted trees). The models were tested against a baseline model that binary encodes the most common diagnoses. The study used diagnosis profiles and supplementary data from more than 10 million patients from 2016 to 2019 and was based on the largest German ambulatory claims data set. 
To describe subpopulations in health care, we identified clusters (via density-based clustering) and visualized patient vectors in 2D (via dimensionality reduction with uniform manifold approximation). Furthermore, we applied our vectorization model to predict prospective drug prescription costs based on patients’ diagnoses. Results Our final models outperform the baseline model (binary encoding) with equal dimensions. They are more robust to missing data and show large performance gains, particularly in lower dimensions, demonstrating the embedding model’s compression of nonlinear information. In the future, other sources of health care data can be integrated into the current diagnosis-based framework. Other researchers can apply our publicly shared embedding model to their own diagnosis data. Conclusions We envision a wide range of applications for Pat2Vec that will improve health care quality, including personalized prevention and signal detection in patient surveillance as well as health care resource planning based on subcohorts identified by our data-driven machine learning framework.
Traditional methods for patient cohort identification from Electronic Health Records (EHRs) are often slow, labor-intensive, and fail to capture the rich semantic nuance embedded in unstructured clinical narratives. This paper introduces a scalable, end-to-end framework that creates a synergy between deep medical embeddings and Approximate Nearest Neighbor Search (ANNs) to overcome these limitations. We detail a complete pipeline that begins with preprocessing multi-modal EHR data and creating holistic patient representations using a domain-specific language model combined with an intelligent gated fusion mechanism. These high-dimensional embeddings are then indexed using an ANN method to enable near real-time retrieval. A comprehensive experimental evaluation was conducted on the MIMIC-III and MIMIC-IV datasets, comparing the retrieval performance of ClinicalBERT against BioBERT across several ANN algorithms. The results demonstrate that the combination of ClinicalBERT and HNSW consistently achieves the highest retrieval accuracy, with F1-Scores exceeding 0.78, and query latencies under 10 ms. This framework enables a paradigm shift towards high-speed, semantic patient similarity search, with significant implications for accelerating clinical trial recruitment, augmenting clinical decision support, and paving the way for a new era in data-driven precision medicine.
Automatic representation learning of key entities in electronic health record (EHR) data is a critical step for healthcare data mining that turns heterogeneous medical records into structured and actionable information. Here we propose ME2Vec, an algorithmic framework for learning continuous low-dimensional embedding vectors of the most common entities in EHR: medical services, doctors, and patients. ME2Vec features a hierarchical structure that encapsulates different node embedding schemes to cater for the unique characteristics of each medical entity. To embed medical services, we employ a biased-random-walk-based node embedding that leverages the irregular time intervals of medical services in EHR to embody their relative importance. To embed doctors and patients, we adhere to the principle “it’s what you do that defines you” and derive their embeddings based on their interactions with other types of entities through a graph neural network and proximity-preserving network embedding, respectively. Using real-world clinical data, we demonstrate the efficacy of ME2Vec over competitive baselines on diagnosis prediction, readmission prediction, and recommending doctors to patients based on their medical conditions. In addition, medical service embeddings pretrained using ME2Vec can substantially improve the performance of sequential models in predicting patients’ clinical outcomes. Overall, ME2Vec can serve as a general-purpose representation learning algorithm for EHR data and benefit various downstream tasks in terms of both performance and interpretability.
Advancements in machine learning algorithms have had a beneficial impact on representation learning, classification, and prediction models built using electronic health record (EHR) data. Effort has been directed both at increasing models’ overall performance and at improving their interpretability, particularly regarding the decision-making process. In this study, we present a temporal deep learning model to perform bidirectional representation learning on EHR sequences with a transformer architecture to predict future diagnosis of depression. This model is able to aggregate five heterogeneous and high-dimensional data sources from the EHR and process them in a temporal manner for chronic disease prediction at various prediction windows. We applied the current trend of pretraining and fine-tuning on EHR data to outperform the current state-of-the-art in chronic disease prediction, and to demonstrate the underlying relation between EHR codes in the sequence. The model increased the precision-recall area under the curve (PRAUC) from 0.70 to 0.76 in depression prediction compared to the best baseline model. Furthermore, the self-attention weights in each sequence quantitatively demonstrated the inner relationship between various codes, which improved the model's interpretability. These results demonstrate the model's ability to utilize heterogeneous EHR data to predict depression while achieving high accuracy and interpretability, which may facilitate constructing clinical decision support systems in the future for chronic disease screening and early detection.
The widespread adoption of Electronic Health Records (EHRs) and deep learning, particularly through Self-Supervised Representation Learning (SSRL) for categorical data, has transformed clinical decision-making. This scoping review, following PRISMA-ScR guidelines, examines 46 studies published from January 2019 to April 2024, sourced from PubMed, MEDLINE, Embase, ACM, and Web of Science, focusing on SSRL for unlabeled categorical EHR data. The review systematically assesses research trends in building computationally and data-efficient representations for medical tasks, identifying major trends in model families: Transformer-based (43%), Autoencoder-based (28%), and Graph Neural Network-based (17%) models. The analysis highlights scenarios where healthcare institutions can leverage or develop SSRL technologies. It also addresses current limitations in assessing the impact of these technologies and identifies research opportunities to enhance their influence on clinical practice.
… Section I of this paper provides some background on EHR data and deep learning. Section II looks at recent work that has applied deep learning to EHR data. Section III details our …
Electronic Health Records (EHR) are high-dimensional data with implicit connections among thousands of medical concepts. These connections, for instance, the co-occurrence of diseases and lab-disease correlations can be informative when only a subset of these variables is documented by the clinician. A feasible approach to improving the representation learning of EHR data is to associate relevant medical concepts and utilize these connections. Existing medical ontologies can be the reference for EHR structures, but they place numerous constraints on the data source. Recent progress on graph neural networks (GNN) enables end-to-end learning of topological structures for non-grid or non-sequential data. However, there are problems to be addressed on how to learn the medical graph adaptively and how to understand the effect of medical graph on representation learning. In this paper, we propose a variationally regularized encoder-decoder graph network that achieves more robustness in graph structure learning by regularizing node representations. Our model outperforms the existing graph and non-graph based methods in various EHR predictive tasks based on both public data and real-world clinical data. Besides the improvements in empirical experiment performances, we provide an interpretation of the effect of variational regularization compared to standard graph neural network, using singular value analysis.
Electronic health records (EHRs) contain vast medical information like diagnosis, medication, and procedures, enabling personalized drug recommendations and treatment adjustments. However, current drug recommendation methods only model patients’ health conditions from EHR data, neglecting the rich relationships within the data. This paper seeks to utilize a heterogeneous information network (HIN) to represent EHR and develop a graph representation learning method for medication recommendation. However, three critical issues need to be investigated: (1) co-occurrence of diagnosis and drug for the same patient does not imply their relevance; (2) patients’ directly associated information may not be sufficient to reflect their health conditions; and (3) the cold start problem exists when patients have no historical EHRs. To tackle these challenges, we develop a bi-channel heterogeneous local structural encoder to decouple and extract the diverse information in HIN. Additionally, a global information capture and fusion module, aggregating meta-paths to form a global representation, is introduced to fill the information gaps in records. A longitudinal model using rich structural information available in EHR data is proposed for drug recommendations to new patients. Experimental results on real-world EHR data demonstrate significant improvements over existing approaches.
Electronic Health Records (EHRs) have become a cornerstone of modern-day healthcare. They are crucial for analyzing the progression of patient health; however, their complexity, characterized by long, multivariate sequences, sparsity, and missing values, poses significant challenges for traditional deep learning modeling. While Transformer-based models have demonstrated success in modeling EHR data and predicting clinical outcomes, their quadratic computational complexity and limited context length hinder their efficiency and practical applications. On the other hand, State Space Models (SSMs) like Mamba present a promising alternative, offering linear-time sequence modeling and improved efficiency for handling long sequences, but they focus mostly on mixing sequence-level information rather than channel-level data. To overcome these challenges, we propose HyMaTE (A Hybrid Mamba and Transformer Model for EHR Representation Learning), a novel hybrid model tailored for representing longitudinal data, combining the strengths of SSMs with advanced attention mechanisms. By testing the model on predictive tasks on multiple clinical datasets, we demonstrate HyMaTE's ability to capture an effective, richer, and more nuanced unified representation of EHR data. Additionally, the interpretability of the outcomes achieved by self-attention illustrates the effectiveness of our model as a scalable and generalizable solution for real-world healthcare applications. Code is available at: https://github.com/healthylaife/HyMaTE.
Electronic health records (EHRs) serve as a digital repository storing comprehensive medical information about patients. Representation learning for EHRs plays a crucial role in healthcare applications. In this paper, we propose a Multimodal Heterogeneous Graph-enhanced Representation Learning, denoted as MHGRL, aimed at learning effective EHR representations. To address the challenge posed by data insufficiency of EHRs, MHGRL utilizes a multimodal heterogeneous graph to model an EHR. Specifically, we construct a heterogeneous graph for each EHR and enrich it by incorporating multimodal information with medical ontology and textual notes. With the integration of pre-trained model, graph neural network, and attention mechanism, MHGRL effectively incorporates both node attributes and structural information across a multimodal heterogeneous graph. Moreover, we employ contrastive learning to ensure the consistency of representations for similar EHRs and improve the model robustness. The experimental results show that MHGRL outperforms all baselines on two real clinical datasets in downstream tasks, including EHR clustering and disease prediction. The code is available at https://github.com/emmali808/MHGRL.
Precision medicine requires accurate identification of clinically relevant patient subgroups. Electronic health records provide major opportunities for leveraging machine learning approaches to uncover novel patient subgroups. However, many existing approaches fail to adequately capture complex interactions between diagnosis trajectories and disease-relevant risk events, leading to subgroups that can still display great heterogeneity in event risk and underlying molecular mechanisms. To address this challenge, we implemented VaDeSC-EHR, a transformer-based variational autoencoder for clustering longitudinal survival data as extracted from electronic health records. We show that VaDeSC-EHR outperforms baseline methods on both synthetic and real-world benchmark datasets with known ground-truth cluster labels. In an application to Crohn’s disease, VaDeSC-EHR successfully identifies four distinct subgroups with divergent diagnosis trajectories and risk profiles, revealing clinically and genetically relevant factors in Crohn’s disease. Our results show that VaDeSC-EHR can be a powerful tool for discovering novel patient subgroups in the development of precision medicine approaches. Longitudinal data in electronic health records could be used to improve definitions of patient clusters and therefore inform precision medicine interventions. Here, the authors introduce VaDeSC-EHR, a machine learning model that uses patient longitudinal trajectories and time-to-event data to define clusters.
Objective To analyze gender bias in clinical trials, to design an algorithm that mitigates the effects of biases of gender representation on natural language processing (NLP) systems trained on text drawn from clinical trials, and to evaluate its performance. Materials and Methods We analyze gender bias in clinical trials described by 16 772 PubMed abstracts (2008–2018). We present a method to augment word embeddings, the core building block of NLP-centric representations, by weighting abstracts by the number of women participants in the trial. We evaluate the resulting gender-sensitive embeddings' performance on several clinical prediction tasks: comorbidity classification, hospital length of stay prediction, and intensive care unit (ICU) readmission prediction. Results For female patients, the gender-sensitive model's area under the receiver operating characteristic curve (AUROC) is 0.86 versus the baseline of 0.81 for comorbidity classification, mean absolute error is 4.59 versus the baseline of 4.66 for length of stay prediction, and AUROC is 0.69 versus 0.67 for ICU readmission. All results are statistically significant. Discussion Women have been underrepresented in clinical trials. Thus, using the broad clinical trials literature as training data for statistical language models could result in biased models, with deficits in knowledge about women. The method presented enables gender-sensitive use of publications as training data for word embeddings. In experiments, the gender-sensitive embeddings show better performance than baseline embeddings for the clinical tasks studied. The results highlight opportunities for recognizing and addressing gender and other representational biases in the clinical trials literature. Conclusion Addressing representational biases in data for training NLP embeddings can lead to better results on downstream tasks for underrepresented populations.
BACKGROUND Electronic Health Records (EHRs) aggregate diverse information at the patient level, holding a trajectory representative of the evolution of the patient health status throughout time. Although this information provides context and can be leveraged by physicians to monitor patient health and make more accurate prognoses/diagnoses, patient records can contain information from very long time spans, which combined with the rapid generation rate of medical data makes clinical decision making more complex. Patient trajectory modelling can assist by exploring existing information in a scalable manner, and can contribute in augmenting health care quality by fostering preventive medicine practices (e.g. earlier disease diagnosis). METHODS We propose a solution to model patient trajectories that combines different types of information (e.g. clinical text, standard codes) and considers the temporal aspect of clinical data. This solution leverages two different architectures: one supporting flexible sets of input features, to convert patient admissions into dense representations; and a second exploring extracted admission representations in a recurrent-based architecture, where patient trajectories are processed in sub-sequences using a sliding window mechanism. RESULTS The developed solution was evaluated on two different clinical outcomes, unexpected patient readmission and disease progression, using the publicly available Medical Information Mart for Intensive Care (MIMIC)-III clinical database. The results obtained demonstrate the potential of the first architecture to model readmission and diagnoses prediction using single patient admissions. While information from clinical text did not show the discriminative power observed in other existing works, this may be explained by the need to fine-tune the clinicalBERT model. 
Finally, we demonstrate the potential of the sequence-based architecture using a sliding window mechanism to represent the input data, attaining comparable performances to other existing solutions. CONCLUSION Herein, we explored DL-based techniques to model patient trajectories and propose two flexible architectures that explore patient admissions on an individual and sequence basis. The combination of clinical text with other types of information led to positive results, which can be further improved by including a fine-tuned version of clinicalBERT in the architectures. The proposed solution can be publicly accessed at https://github.com/bioinformatics-ua/PatientTM.
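The sliding window mechanism mentioned above, in which a patient trajectory is processed in sub-sequences of admissions, might look roughly like the sketch below. The function name, the `stride` parameter, and the choice to return a short trajectory as a single window are assumptions made for illustration; the abstract does not specify these details.

```python
def sliding_windows(admissions, window_size, stride=1):
    """Split an ordered list of admission representations into overlapping windows.

    Assumption: a trajectory shorter than window_size is returned as one
    (shorter) window rather than being dropped or padded.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if not admissions:
        return []
    if len(admissions) < window_size:
        return [list(admissions)]
    last_start = len(admissions) - window_size
    return [
        list(admissions[start:start + window_size])
        for start in range(0, last_start + 1, stride)
    ]


# Placeholder identifiers standing in for dense admission representations.
trajectory = ["adm1", "adm2", "adm3", "adm4"]
windows = sliding_windows(trajectory, window_size=3)
```

Each window can then be fed to a recurrent model as one training sequence, so long trajectories contribute several examples.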
Patient data comprises different modalities such as clinical notes, lab results, and radiological investigations. Predictive modeling on this patient data is challenging due to the heterogeneity among patients and the modalities captured, such as ECG recordings being absent for some patients. Ensuring model applicability to all admitted patients requires addressing three key factors: i) handling modalities at disparate time-scales (e.g., medication given once every few hours and 125 Hz ECG need to be handled by the same model), ii) handling missing modalities (e.g., ECG not recorded), and iii) modeling temporal interactions between modalities over time (e.g., an ECG fluctuation triggering a lab test order). Existing literature often doesn't simultaneously address these requirements. Therefore, we propose a novel patient representation approach inspired by clinical workflows, representing each patient as a graph. We categorize patient data into two main components: observations and care team actions, allowing cross-modality temporal interaction when actions consider the previous observations. We define observation nodes to capture modality-specific data within the time between two actions, and action nodes to capture data of the actions undertaken, with edges to capture the temporal dependencies. To address missing modalities and time-scale disparities, we define node types for different modalities and use modality-specific representations for the nodes, so that a missing modality is equivalent to a missing node type in the graph. Aligned with clinical workflows, this patient-graph representation aims to enhance the practicality of predictive systems for various healthcare tasks, from mortality risk assessment to medication recommendations, thereby improving clinical decision support.
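As a rough illustration of the graph construction described above, the sketch below groups observations that fall between two consecutive actions into one node per modality and links them to the following action. The `Node` class, the tuple layouts, and the edge directions are hypothetical choices for this example, not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    kind: str        # "observation" or "action"
    node_type: str   # modality name for observations, action name for actions
    payload: list = field(default_factory=list)


def build_patient_graph(actions, observations):
    """Build (nodes, edges) from time-stamped events.

    actions: list of (time, action_name)
    observations: list of (time, modality, value)
    Edges are (source_id, target_id) pairs pointing forward in time.
    """
    nodes, edges = [], []
    prev_time = float("-inf")
    prev_action_id = None
    for time, name in sorted(actions):
        # Collect observations recorded since the previous action, per modality.
        by_modality = {}
        for obs_time, modality, value in sorted(observations):
            if prev_time < obs_time <= time:
                by_modality.setdefault(modality, []).append(value)
        obs_ids = []
        for modality, values in by_modality.items():
            node = Node(len(nodes), "observation", modality, values)
            nodes.append(node)
            obs_ids.append(node.node_id)
        action = Node(len(nodes), "action", name)
        nodes.append(action)
        # Observations inform the action that follows them.
        for obs_id in obs_ids:
            edges.append((obs_id, action.node_id))
        # Consecutive actions are chained to keep the temporal order.
        if prev_action_id is not None:
            edges.append((prev_action_id, action.node_id))
        prev_action_id, prev_time = action.node_id, time
    return nodes, edges


# Hypothetical toy events; ECG is absent in the second interval, so no ECG node is created there.
toy_actions = [(2.0, "order_lab"), (5.0, "give_medication")]
toy_observations = [(1.0, "ecg", 0.8), (1.5, "ecg", 0.9), (4.0, "lab", 7.1)]
nodes, edges = build_patient_graph(toy_actions, toy_observations)
```

In this sketch a missing modality needs no imputation: the corresponding node is simply never created for that interval.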
Estimating the future course of cancer is invaluable to physicians; however, current clinical methods fail to effectively use the vast amount of multimodal data that is available for cancer patients. To tackle this problem, we constructed a deep neural network based model to predict the survival of patients for 20 different cancer types using gene expressions, microRNA data, clinical data and histopathology whole slide images (WSIs). We developed an unsupervised encoder to compress these four data modalities into a single feature vector for each patient, handling missing data through a resilient, multimodal dropout method. Encoding methods were tailored to each data type, using deep highway networks to extract features from genomic and clinical data, and convolutional neural networks to extract features from pathology images. We then used these feature encodings trained on pancancer data to predict pancancer and single cancer survival data, achieving a C-index of 0.784 overall. This work shows that it is possible to build a pancancer model for prognosis that also predicts prognosis in single cancer sites. Furthermore, our model handles multiple data modalities, efficiently analyzes WSIs, and summarizes patient details flexibly into an unsupervised, informative profile. We thus present a powerful automated tool to accurately determine prognosis, a key step towards personalized treatment for cancer patients.
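The idea behind multimodal dropout, as opposed to ordinary unit-level dropout, is to hide entire modalities during training so that the fused patient vector stays usable when a modality is missing at test time. The sketch below is one simple way to express that idea; the averaging fusion, the rule of always keeping at least one modality, and all names are assumptions for illustration rather than the paper's exact procedure.

```python
import random


def multimodal_dropout(modality_vectors, drop_prob, rng):
    """Randomly hide whole modalities (set to None), always keeping at least one.

    modality_vectors: dict mapping modality name to a feature vector or None.
    """
    present = [name for name, vec in modality_vectors.items() if vec is not None]
    kept = [name for name in present if rng.random() >= drop_prob]
    if not kept and present:
        kept = [rng.choice(present)]  # assumption: never drop every modality
    return {
        name: (vec if name in kept else None)
        for name, vec in modality_vectors.items()
    }


def fuse(modality_vectors):
    """Average the available modality vectors into a single patient vector."""
    available = [vec for vec in modality_vectors.values() if vec is not None]
    if not available:
        raise ValueError("at least one modality is required")
    dim = len(available[0])
    return [sum(vec[i] for vec in available) / len(available) for i in range(dim)]


# Hypothetical 2-dimensional encodings; the microRNA modality is missing for this patient.
patient = {"clinical": [1.0, 3.0], "mirna": None, "gene_expression": [3.0, 5.0]}
rng = random.Random(0)
training_view = multimodal_dropout(patient, drop_prob=0.25, rng=rng)
patient_vector = fuse(patient)
```

Averaging only over the modalities that are present keeps the scale of the fused vector comparable whether one or several modalities are available.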
… To further improve the quality of the patient representations, we introduce a MultiModal … status representation of ICU patients. • We develop a multimodal gated contrastive representation …
With the increasing collection of electronic health records (EHRs), deep learning has become a crucial tool for real-time treatment analysis. However, due to patient privacy concerns, the scarcity of labeled data limits end-to-end models that rely on large training data. Self-supervised pretraining offers a promising solution. Nevertheless, applying pretraining to EHRs faces two key issues: (1) EHRs exhibit multimodality, including monitoring data and recorded clinical notes. For multimodal pretraining, designing a self-supervised task that can establish cross-modal associations while preserving all modality-unique information remains challenging. (2) Both modalities are sequential and irregular, with varying intervals between monitoring or records. Aligning monitoring times with recorded times poses a significant issue for fine-grained cross-modal pretraining. Existing pretraining models either focus on a single modality or model only regular data, failing to address both issues together. To fill this gap and fully utilize unlabeled EHR data, we propose a pretraining model to learn patient representations using unlabeled irregular multimodal EHRs, named PRIME. We first utilize a multi-element encoding module to extract patient condition snapshots from both modalities. Then, to construct multiple aligned cross-modal positive sample pairs that span the entire treatment process from irregular data, we employ patient condition alignment modules that integrate time-aware and feature-aware components to transfer snapshots to the aligned timestamps. Next, to preserve both shared and unique information of each modality, our decoupled representation learning strategy first uses a constraint matrix to separate shared information. We then employ contrastive-based cross-modal learning and reconstruction-based intra-modal learning to model shared and complete information, respectively. 
Extensive experiments on two real-world tasks demonstrate the superiority of PRIME over the state-of-the-art models, especially with limited labels.
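The contrastive cross-modal step described above can be illustrated with a symmetric InfoNCE loss over aligned snapshot pairs. This is a minimal sketch under stated assumptions, not PRIME's actual objective: the abstract does not give the loss form, and the function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: positives[i] matches anchors[i], and every
    other row of `positives` serves as a negative for anchors[i]."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, pj) / temperature for pj in p]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(a)

def symmetric_info_nce(x, y, temperature=0.1):
    """Average the loss in both directions (x -> y and y -> x)."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))

# Toy snapshots (assumed): three aligned timestamps, one vector per modality.
monitoring = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
notes_aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]
notes_shuffled = [notes_aligned[1], notes_aligned[2], notes_aligned[0]]

loss_aligned = symmetric_info_nce(monitoring, notes_aligned)
loss_shuffled = symmetric_info_nce(monitoring, notes_shuffled)
```

Correctly aligned pairs give a lower loss than misaligned ones, which is why obtaining aligned timestamps from irregular data matters before the contrastive step.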
Objectives: Machine learning-based analytics over uni-modal medical data has shown considerable promise and is now routinely deployed in diagnostic procedures. However, patient data consists of diverse types of data. By exploiting such data, multimodal approaches promise to revolutionize our ability to provide personalized care. Attempts to combine two modalities in a single diagnostic task have utilized the evolving field of multimodal representation learning (MRL), which learns a shared latent space between related modality samples. This new space can be used to improve the performance of machine-learning-based analytics. So far, however, our understanding of how modalities have been applied in MRL-based medical applications and which modalities are best suited for specific medical tasks is still unclear, as previous reviews have not addressed the medical analytics domain and its unique challenges and opportunities. Instead, this work aims to review the landscape of MRL for medical tasks to highlight opportunities for advancing medical applications. Methods: This paper presents a framework for positioning MRL techniques and medical modalities. More than 1000 papers related to medical analytics were reviewed, positioned, and classified using the proposed framework in the most extensive review to date. The paper further provides an online tool for researchers and developers of medical analytics to dive into the rapidly changing landscape of MRL for medical applications. Results: The main finding is that work in the domain has been sparse: only a few medical informatics tasks have been the target of much MRL-based work, with the overwhelming majority of tasks being diagnostic rather than prognostic. Similarly, numerous potentially compatible information modality combinations are unexplored or under-explored for most medical tasks. Conclusions: There is much to gain from using MRL in many unexplored combinations of medical tasks and modalities. 
This work can guide researchers working on a specific medical application to identify under-explored modality combinations and identify novel and emerging MRL techniques that can be adapted to the task at hand.
Integrating artificial intelligence (AI) with healthcare data is rapidly transforming medical diagnostics and driving progress toward precision medicine. However, effectively leveraging multimodal data, particularly digital pathology whole slide images (WSIs) and genomic sequencing, remains a significant challenge due to the intrinsic heterogeneity of these modalities and the need for scalable and interpretable frameworks. Existing diagnostic models typically operate on unimodal data, overlooking critical cross-modal interactions that can yield richer clinical insights. We introduce MarbliX (Multimodal Association and Retrieval with Binary Latent Indexed matriX), a self-supervised framework that learns to embed WSIs and immunogenomic profiles into compact, scalable binary codes, termed “monogram.” By optimizing a triplet contrastive objective across modalities, MarbliX captures high-resolution patient similarity in a unified latent space, enabling efficient retrieval of clinically relevant cases and facilitating case-based reasoning. In lung cancer, MarbliX achieves 85%–89% across all evaluation metrics, outperforming histopathology (69%–71%) and immunogenomics (73%–76%). In kidney cancer, real-valued monograms yield the strongest performance (F1: 80%–83%, Accuracy: 87%–90%), with binary monograms slightly lower (F1: 78%–82%).
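The combination of a triplet objective with binary codes for retrieval can be sketched as follows. This is illustrative only: the abstract does not state how MarbliX binarizes its embeddings, so sign thresholding, the margin value, and every name and number below are assumptions.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: the positive should be closer to the
    anchor than the negative by at least `margin`."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)

def binarize(embedding):
    """Sign-threshold a real-valued embedding into a 0/1 code (assumed scheme)."""
    return [1 if x > 0 else 0 for x in embedding]

def hamming(code_a, code_b):
    return sum(a != b for a, b in zip(code_a, code_b))

def retrieve(query_code, database, k=1):
    """Return ids of the k database entries closest in Hamming distance."""
    ranked = sorted(database, key=lambda item: hamming(query_code, item[1]))
    return [case_id for case_id, _ in ranked[:k]]

# Hypothetical embeddings: one patient's WSI and immunogenomic vectors,
# plus another patient's immunogenomic vector.
wsi_a = [0.8, -0.5, 0.3, -0.9]
genomic_a = [0.7, -0.4, 0.2, -0.8]
genomic_b = [-0.6, 0.9, -0.1, 0.4]

loss_good = triplet_loss(wsi_a, genomic_a, genomic_b)  # matched pair is the positive
loss_bad = triplet_loss(wsi_a, genomic_b, genomic_a)   # mismatched pair is the positive

database = [("patient_a", binarize(genomic_a)), ("patient_b", binarize(genomic_b))]
top_hit = retrieve(binarize(wsi_a), database, k=1)
```

Binary codes trade a little accuracy for cheap storage and fast Hamming-distance search, which is consistent with the slightly lower F1 the abstract reports for binary monograms in kidney cancer.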
Modern medicine increasingly relies on multimodal data, ranging from clinical notes to imaging and genomics, to guide diagnosis and treatment. However, integrating these heterogeneous data sources in a principled and interpretable manner remains a major challenge. We present MODES (Multi-mOdal Disentangled Embedding Space), a representation fusion framework that explicitly separates shared and modality-specific factors of variation, offering a structured latent space for multimodal information that improves both prediction and interpretability. By leveraging pre-trained unimodal foundation models, MODES mitigates the dependency on extensive paired datasets, crucial in data-scarce clinical settings. We introduce a masking strategy that optimizes representation dimensionality by eliminating low-information dimensions, to achieve compact, information-rich representations. Our framework demonstrates superior performance in predicting diagnoses and phenotypes compared to unimodal and conventional fusion models. MODES also enables robust diagnostic inference in missing data scenarios, offering an opportunity toward interpretable and efficient multimodal diagnostics in personalized healthcare.
… influence of historical patient visits, and the relevance of similar patient trajectories. … multimodal medication recommendation framework that progressively learns patient representations …
Precision medicine aims to provide personalized healthcare for patients by stratifying them into subgroups based on their health conditions, enabling the development of tailored medical management. Various decision support systems (DSSs) are increasingly being developed in this field, but their performance is limited by their capability to handle large amounts of heterogeneous and high-dimensional electronic health records (EHRs). In this paper, we focus on developing a deep learning model for patient stratification that can identify and explain patient subgroups from multimodal EHRs. The primary challenge is to effectively align and unify heterogeneous information from various modalities, which include both unstructured and structured data. Here, we develop a Contrastive Multimodal learning model for EHR (ConMEHR) based on topic modelling. In ConMEHR, modality-level and topic-level contrastive learning (CL) mechanisms are adopted to obtain a unified representation space and diversify patient subgroups, respectively. ConMEHR is evaluated on two real-world EHR datasets, and the results show that our model outperforms the baseline methods.
Survival prediction for cancer patients is of great significance to global health and has attracted considerable attention in the medical informatics community in recent years. It can be framed as a classification task, and it is both meaningful and challenging. Nevertheless, research in this field is still limited. In this work, we design a novel Multimodal Graph Neural Network (MGNN) framework for predicting cancer survival, which explores the features of real-world multimodal data such as gene expression, copy number alteration, and clinical data in a unified framework. Specifically, we first construct bipartite graphs between patients and multimodal data to explore their inherent relations. Subsequently, the embedding of each patient on the different bipartite graphs is obtained with a graph neural network. Finally, a multimodal fusion neural layer is proposed to fuse the medical features from the different modalities. Comprehensive experiments on real-world datasets demonstrate the superiority of our model, with significant improvements over state-of-the-art methods. Furthermore, the proposed MGNN is shown to be robust on four other cancer datasets.
The wide implementation of electronic health record (EHR) systems facilitates the collection of large-scale health data from real clinical settings. Despite the significant increase in adoption of EHR systems, these data remain largely unexplored, but present a rich data source for knowledge discovery from patient health histories in tasks, such as understanding disease correlations and predicting health outcomes. However, the heterogeneity, sparsity, noise, and bias in these data present many complex challenges. This complexity makes it difficult to translate potentially relevant information into machine learning algorithms. In this paper, we propose a computational framework, Patient2Vec, to learn an interpretable deep representation of longitudinal EHR data, which is personalized for each patient. To evaluate this approach, we apply it to the prediction of future hospitalizations using real EHR data and compare its predictive performance with baseline methods. Patient2Vec produces a vector space with meaningful structure, and it achieves an area under curve around 0.799, outperforming baseline methods. In the end, the learned feature importance can be visualized and interpreted at both the individual and population levels to bring clinical insights.
Current studies regarding the secondary use of electronic health records (EHR) predominantly rely on domain expertise and existing medical knowledge. Though significant efforts have been devoted to investigating the application of machine learning algorithms in the EHR, efficient and powerful representation of patients is needed to unleash the potential of discovering new medical patterns underlying the EHR. Here, we present an unsupervised method for embedding high-dimensional EHR data at the patient level, aimed at characterizing patient heterogeneity in complex diseases and identifying new disease patterns associated with clinical outcome disparities. Inspired by the architecture of modern language models—specifically transformers with attention mechanisms, we use patient diagnosis and procedure codes as vocabularies and treat each patient as a sentence to perform the patient embedding. We applied this approach to 34,851 unique medical codes across 1,046,649 longitudinal patient events, including 102,739 patients from the electronic Medical Records and GEnomics (eMERGE) Network. The resulting patient vectors demonstrated excellent performance in predicting future disease events (median AUROC = 0.87 within one year) and bulk phenotyping (median AUROC = 0.84). We then illustrated the utility of these patient vectors in revealing heterogeneous comorbidity patterns, exemplified by disease subtypes in colorectal cancer and systemic lupus erythematosus, and capturing distinct longitudinal disease trajectories. External validation using EHR data from the University of Washington confirmed robust model performance, with median AUROCs of 0.83 and 0.84 for bulk phenotyping tasks and disease onset prediction, respectively. Importantly, the model reproduced the clustering results of disease subtypes identified in the eMERGE cohort and uncovered variations in overall mortality among these subtypes. 
Together, these results underscore the potential of representation learning in EHRs to enhance patient characterization and associated clinical outcomes, thereby advancing disease forecasting and facilitating personalized medicine.
… longitudinal dynamics, but often collapse trajectory complexity into opaque embeddings. … testing whether representations preserve the complexity of longitudinal patient trajectories. …
This paper introduces Structuring Whitened Embeddings, a modality- and encoder-agnostic framework designed to optimize encoders for extracting smooth, progression-aware feature representations by aligning real-world samples. Proportional inter-sample relationships are preserved, enabling the capture of subtle and continuous changes. By establishing relationships directly between samples and thereby avoiding the need for data augmentation, the approach is particularly well suited for transformation-sensitive data, such as medical time series, where even minor sample changes can lead to disproportionate shifts in interpretation. Experimental results on early atrial fibrillation prediction and timestamp imputation, modeling both inter- and intra-patient dynamics, demonstrate significant performance improvements using the optimized features. The framework’s augmentation-free design and generalizability across tasks and modalities position it as a practical solution for modeling evolution in complex datasets.
Summary Many fields, including Natural Language Processing (NLP), have recently witnessed the benefit of pre-training with large generic datasets to improve the accuracy of prediction tasks. However, there exist key differences between the longitudinal healthcare data (e.g., claims) and NLP tasks, which make the direct application of NLP pre-training methods to healthcare data inappropriate. In this article, we developed a pre-training scheme for longitudinal healthcare data that leverages the pairing of medical history and a future event. We then conducted systematic evaluations of various methods on ten patient-level prediction tasks encompassing adverse events, misdiagnosis, disease risks, and readmission. In addition to substantially reducing model size, our results show that a universal medical concept embedding pretrained with generic big data as well as carefully designed time decay modeling improves the accuracy of different downstream prediction tasks.
… Longitudinal ultrasound images from routine follow-ups offer … We present the risk embedded and longitudinal attention … cumulative risk embedding and a longitudinal attention …
Electronic health records (EHR) represent a holistic overview of patients’ trajectories. Their increasing availability has fueled new hopes to leverage them and develop accurate risk prediction models for a wide range of diseases. Given the complex interrelationships of medical records and patient outcomes, deep learning models have shown clear merits in achieving this goal. However, a key limitation of current studies remains their capacity to process long sequences, and long-sequence modelling and its application in the context of healthcare and EHR remain unexplored. Capturing the whole history of medical encounters is expected to lead to more accurate predictions, but the inclusion of records collected over decades and from multiple resources can inevitably exceed the receptive field of most existing deep learning architectures. This can result in missing crucial, long-term dependencies. To address this gap, we present Hi-BEHRT, a hierarchical Transformer-based model that can significantly expand the receptive field of Transformers and extract associations from much longer sequences. Using a multimodal large-scale linked longitudinal EHR, Hi-BEHRT exceeds state-of-the-art deep learning models by 1% to 5% in area under the receiver operating characteristic curve (AUROC) and by 1% to 8% in area under the precision-recall curve (AUPRC) on average, and by 2% to 8% (AUROC) and 2% to 11% (AUPRC) for patients with long medical histories, for 5-year heart failure, diabetes, chronic kidney disease, and stroke risk prediction. Additionally, because pretraining for hierarchical Transformers is not well established, we provide an effective end-to-end contrastive pre-training strategy for Hi-BEHRT using EHR, improving its transferability to predicting clinical events with relatively small training datasets.
Healthcare processes leave behind a patient treatment trajectory (PTT), described as a sequence of interdependent clinical events associated with a large volume of longitudinal therapy and treatment information. Predicting the future clinical event in a PTT, a vital task for providing insights into the entire treatment trajectory, can serve as an efficient and proactive alerting service for health service delivery. However, it is challenging because there are long-term dependencies between clinical events, which are irregularly distributed along the temporal axis with varying time intervals. This characteristic inevitably impedes the performance of clinical event prediction (CEP) using existing approaches. To address this challenge, we propose a novel approach to learn representative and discriminative PTT features for CEP. In detail, a multivariate Hawkes process (HP) is adopted to uncover the mutual excitation intensities between clinical event pairs in an interpretable manner. Thereafter, the captured spontaneous and interactional intensities of events are incorporated into recurrent neural networks (RNNs) to encode the PTT in latent representations, while jointly performing the CEP task based on the extracted trajectory representations. We evaluate the performance of the proposed approach on a real clinical dataset consisting of 13,545 visits of 2,102 heart failure patients. Compared to state-of-the-art methods, our best model achieves 6.4% and 4.1% AUC performance gains on three-month and one-year CEP tasks, respectively. The experimental results demonstrate that the proposed approach outperforms state-of-the-art models in CEP, and can be profitably exploited as a basis for PTT analysis and optimization.
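The spontaneous and interactional intensities of a multivariate Hawkes process can be made concrete with a small sketch. The exponential decay kernel, the parameter values, and the two event types are assumptions chosen for illustration; the abstract does not specify the kernel or how the intensities are fed into the RNN.

```python
import math

def hawkes_intensity(event_type, t, history, baseline, excitation, decay=1.0):
    """Conditional intensity of `event_type` at time t.

    history    : list of (time, type) pairs for past events
    baseline   : baseline[i] is the spontaneous rate of type i
    excitation : excitation[i][j] is how strongly a past type-j event excites type i
    decay      : rate of the exponential kernel (an assumed kernel choice)
    """
    rate = baseline[event_type]
    for t_past, past_type in history:
        if t_past < t:  # only events strictly before t contribute
            rate += excitation[event_type][past_type] * math.exp(-decay * (t - t_past))
    return rate

# Two hypothetical event types: 0 = lab test ordered, 1 = medication adjusted.
baseline = [0.2, 0.1]
excitation = [[0.0, 0.1],
              [0.5, 0.0]]
history = [(1.0, 0), (2.5, 0)]

soon_after = hawkes_intensity(1, 2.6, history, baseline, excitation)
much_later = hawkes_intensity(1, 10.0, history, baseline, excitation)
no_history = hawkes_intensity(1, 2.6, [], baseline, excitation)
```

Each entry of `excitation` has a direct reading (how much one event type raises the short-term rate of another), which is the sense in which the abstract calls the mutual excitation intensities interpretable.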
… For instance, after clustering on self-supervised representations, medical experts could characterize key visual features shared between examples belonging to the same cluster and …
OBJECTIVE Recently, there has been growing interest in analyzing large amounts of Electronic Health Record (EHR) data. Patient outcome prediction is a major area of interest in EHR analysis that focuses on predicting the future health status of patients using structured data types, such as diagnoses, medications, and procedures collected from longitudinal EHR data. We investigate and design self-supervised learning (SSL) paradigms to learn high-quality representations from longitudinal EHR data, aiming to effectively capture longitudinal relationships and patterns for improved patient outcome predictions. METHODS We propose an end-to-end, novel, and robust model called GatorCLR that aligns with the contrastive SSL paradigm. GatorCLR incorporates graph analysis-based patient modeling into longitudinal EHR data, generating graph representations of nodes and edges representing patients, their relationships, and similarities. A two-layer augmentation technique is further incorporated in our GatorCLR that generates consistent, identity-preserving augmentations from graph representations. RESULTS We evaluate our approach using real-world EHR datasets. Experimental results indicate that our GatorCLR delivers meaningful and robust performance across multiple clinical tasks and datasets and provides transparency of the model decisions. CONCLUSION The proposed approach presents a significant step toward developing a foundation model with longitudinal EHR data, capable of making informed predictions and adaptable to various downstream use cases and tasks. This study should, therefore, be of value to practitioners wishing to leverage longitudinal EHR data for predictive analytics.
Machine learning has become an increasingly ubiquitous technology, as big data continues to inform and influence everyday life and decision-making. Currently, in medicine and healthcare, as well as in most other industries, the two most prevalent machine learning paradigms are supervised learning and transfer learning. Both practices rely on large-scale, manually annotated datasets to train increasingly complex models. However, the requirement of data to be manually labeled leaves an excess of unused, unlabeled data available in both public and private data repositories. Self-supervised learning (SSL) is a growing area of machine learning that can take advantage of unlabeled data. Contrary to other machine learning paradigms, SSL algorithms create artificial supervisory signals from unlabeled data and pretrain algorithms on these signals. The aim of this review is two-fold: firstly, we provide a formal definition of SSL, divide SSL algorithms into their four unique subsets, and review the state of the art published in each of those subsets between the years of 2014 and 2020. Second, this work surveys recent SSL algorithms published in healthcare, in order to provide medical experts with a clearer picture of how they can integrate SSL into their research, with the objective of leveraging unlabeled data.
A causal effect can be defined as a comparison of outcomes that result from two or more alternative actions, with only one of the action-outcome pairs actually being observed. In healthcare, the gold standard for causal effect measurements is randomized controlled trials (RCTs), in which a target population is explicitly defined and each study sample is randomly assigned to either the treatment or control cohorts. The great potential to derive actionable insights from causal relationships has led to a growing body of machine-learning research applying causal effect estimators to observational data in the fields of healthcare, education, and economics. The primary difference between causal effect studies utilizing observational data and RCTs is that for observational data, the study occurs after the treatment, and therefore we do not have control over the treatment assignment mechanism. This can lead to massive differences in covariate distributions between control and treatment samples, making a comparison of causal effects confounded and unreliable. Classical approaches have sought to solve this problem piecemeal, first by predicting treatment assignment and then treatment effect separately. Recent work extended part of these approaches to a new family of representation-learning algorithms, showing that the upper bound of the expected treatment effect estimation error is determined by two factors: the outcome generalization-error of the representation and the distance between the treated and control distributions induced by the representation. To achieve minimal dissimilarity in learning such distributions, in this work we propose a specific auto-balancing, self-supervised objective. Experiments on real and benchmark datasets revealed that our approach consistently produced less biased estimates than previously published state-of-the-art methods. 
We demonstrate that the reduction in error can be directly attributed to the ability to learn representations that explicitly reduce such dissimilarity; further, in case of violations of the positivity assumption (frequent in observational data), we show our approach performs significantly better than the previous state of the art. Thus, by learning representations that induce similar distributions of the treated and control cohorts, we present evidence to support the error bound dissimilarity hypothesis as well as providing a new state-of-the-art model for causal effect estimation.
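The two-factor error bound referred to above can be written schematically as below. This shows only the shape of such a bound, not the exact statement from the cited line of work. The symbols are assumptions introduced here: $\Phi$ is the learned representation, $h$ the outcome model, $\epsilon_F^{t}$ the factual outcome error in arm $t$, $D$ a distance between distributions (for example an integral probability metric), and $C_1, C_2$ problem-dependent constants.

```latex
\epsilon_{\mathrm{effect}}(h, \Phi)
  \;\le\;
  C_1 \left( \epsilon_F^{t=0}(h, \Phi) + \epsilon_F^{t=1}(h, \Phi) \right)
  \;+\;
  C_2 \, D\!\left( p_\Phi^{t=1},\; p_\Phi^{t=0} \right)
```

Read this way, a balancing objective targets the second term: it makes the treated and control distributions induced by $\Phi$ more similar while the outcome loss keeps the first term small.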
Background The growing availability of electronic health records (EHRs) presents an opportunity to enhance patient care by uncovering hidden health risks and improving informed decisions through advanced deep learning methods. However, modeling EHR sequential data, that is, patient trajectories, is challenging due to the evolving relationships between diagnoses and treatments over time. Significant progress has been achieved using transformers and self-supervised learning. While BERT-inspired models using masked language modeling (MLM) capture EHR context, they often struggle with the complex temporal dynamics of disease progression and interventions. Objective This study aims to improve the modeling of EHR sequences by addressing the limitations of traditional transformer-based approaches in capturing complex temporal dependencies. Methods We introduce Trajectory Order Objective BERT (Bidirectional Encoder Representations from Transformers; TOO-BERT), a transformer-based model that advances the MLM pretraining approach by integrating a novel TOO to better learn the complex sequential dependencies between medical events. TOO-BERT enhances the context learned through MLM by pretraining the model to distinguish ordered sequences of medical codes from permuted ones in a patient trajectory. The TOO is enhanced by a conditional selection process that focuses on medical codes or visits that frequently occur together, to further improve contextual understanding and strengthen temporal awareness. We evaluate TOO-BERT on 2 extensive EHR datasets, MIMIC-IV hospitalization records and the Malmo Diet and Cancer Cohort (MDC), comprising approximately 10 and 8 million medical codes, respectively. TOO-BERT is compared against conventional machine learning methods, a transformer trained from scratch, and a transformer pretrained on MLM in predicting heart failure (HF), Alzheimer disease (AD), and prolonged length of stay (PLS).
Results TOO-BERT outperformed conventional machine learning methods and transformer-based approaches in HF, AD, and PLS prediction across both datasets. In the MDC dataset, TOO-BERT improved HF and AD prediction, increasing area under the receiver operating characteristic curve (AUC) scores from 67.7 and 69.5 with the MLM-pretrained Transformer to 73.9 and 71.9, respectively. In the MIMIC-IV dataset, TOO-BERT enhanced HF and PLS prediction, raising AUC scores from 86.2 and 60.2 with the MLM-pretrained Transformer to 89.8 and 60.4, respectively. Notably, TOO-BERT demonstrated strong performance in HF prediction even with limited fine-tuning data, achieving AUC scores of 0.877 and 0.823, compared to 0.839 and 0.799 for the MLM-pretrained Transformer, when fine-tuned on only 50% (442/884) and 20% (176/884) of the training data, respectively. Conclusions These findings demonstrate the effectiveness of integrating temporal ordering objectives into MLM-pretrained models, enabling deeper insights into the complex temporal relationships inherent in EHR data. Attention analysis further highlights TOO-BERT’s capability to capture and represent sophisticated structural patterns within patient trajectories, offering a more nuanced understanding of disease progression.
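The ordered-versus-permuted pretraining signal can be sketched as a small data-generation routine. This is a simplified illustration under stated assumptions: it swaps two whole visits at random and does not implement the conditional selection of frequently co-occurring codes. The function name and the placeholder codes are hypothetical.

```python
import random

def make_order_example(visits, rng, permute_prob=0.5):
    """Return (sequence, label): label 1 for the original order, 0 for a permuted one.

    visits : list of visits, each a list of medical codes
    """
    if len(visits) < 2 or rng.random() >= permute_prob:
        return [list(v) for v in visits], 1
    i, j = rng.sample(range(len(visits)), 2)
    swapped = [list(v) for v in visits]
    swapped[i], swapped[j] = swapped[j], swapped[i]
    # Swapping two identical visits leaves the sequence unchanged, so keep label 1.
    label = 0 if swapped != visits else 1
    return swapped, label

# Hypothetical trajectory; the codes are placeholders, not drawn from MIMIC-IV or MDC.
trajectory = [["I10"], ["E11", "I10"], ["I50"], ["N18", "I50"]]
rng = random.Random(0)
examples = [make_order_example(trajectory, rng) for _ in range(20)]
```

A binary classifier head trained on such pairs has to attend to event order, which masked-code prediction alone does not strictly require.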
BACKGROUND The 12-lead electrocardiogram (ECG) is an established modality for cardiovascular assessment. While deep learning algorithms have shown promising results for analyzing ECG data, the limited availability of labeled datasets hinders broader applications. Self-supervised learning can learn meaningful representations from unlabeled data and transfer the knowledge to downstream tasks. This study presents the development and validation of a self-supervised learning methodology tailored to produce universal ECG representations from longitudinally collected ECG data, applicable across a spectrum of cardiovascular assessments. METHODS We introduced a pre-trained model that utilizes contrastive self-supervised learning to learn universal ECG representations from 4,932,573 ECG tracings from 1,684,298 adult patients on 7 campuses of Chang Gung Memorial Hospital. We extensively evaluated the proposed model using an internal dataset collected from diverse healthcare establishments and an external public dataset encompassing varied cardiovascular conditions and sample sizes. RESULTS The pre-trained model performed on par with conventionally trained models that rely solely on supervised learning, on both internal and external datasets, in assessing atrial fibrillation, atrial flutter, premature rhythm abnormalities, first-degree atrioventricular block, and myocardial infarction. When applied to small sample sizes, the learned ECG representations enhanced the classification models, resulting in an improvement of up to 0.3 in the area under the receiver operating characteristic curve (AUROC). CONCLUSIONS The ECG representations learned from longitudinal ECG data are highly effective, particularly with small sample sizes, and further enhance the learning process and boost robustness.
… , (ii) Patient-aware contrastive representation module: A … by a patient-aware contrastive objective that integrates both class labels and patient identity, thereby enhancing intra-patient …
OBJECTIVE Deep learning models for clinical event forecasting (CEF) based on a patient's medical history have improved significantly over the past decade. However, their transition into practice has been limited, particularly for diseases with very low prevalence. In this paper, we introduce CEF-CL, a novel method based on contrastive learning to forecast in the face of a limited number of positive training instances. MATERIALS AND METHODS CEF-CL consists of two primary components: (1) unsupervised contrastive learning for patient representation and (2) supervised transfer learning over the derived representation. We evaluate the new method along with state-of-the-art model architectures trained in a supervised manner with electronic health records data from Vanderbilt University Medical Center and the All of Us Research Program, covering 48 000 and 16 000 patients, respectively. We assess forecasting for over 100 diagnosis codes with respect to their area under the receiver operator characteristic curve (AUROC) and area under the precision-recall curve (AUPRC). We investigate the correlation between forecasting performance improvement and code prevalence via a Wald Test. RESULTS CEF-CL achieved an average AUROC and AUPRC performance improvement over the state-of-the-art of 8.0%-9.3% and 11.7%-32.0%, respectively. The improvement in AUROC was negatively correlated with the number of positive training instances (P < .001). CONCLUSION This investigation indicates that clinical event forecasting can be improved significantly through contrastive representation learning, especially when the number of positive training instances is small.
… Following common practice, we adopt CLS pooling from the final hidden layer for patient representation in encoder-only models, and last token pooling from the final hidden layer for …
Effective patient representation learning from Electronic Health Records (EHR) is essential for improving disease prediction models, yet it faces critical challenges such as the scarcity of labeled data and the difficulty of capturing complex temporal and multi-indicator relationships. To address these limitations, we propose the Adaptive Multi-Indicator Contrastive Predictive Coding (AMCPC) framework, a self-supervised learning approach designed for EHR data. AMCPC incorporates two key innovations: first, it employs an adaptive optimal window size selection algorithm to segment patient visit sequences into temporal subwindows, which enables the model to focus on localized, context-specific health patterns; second, it extends Contrastive Predictive Coding (CPC) with a multi-indicator approach, leveraging a 2D convolutional neural network (CNN) to capture global correlations among diverse medical indicators within each subwindow. Through extensive experiments on real-world clinical datasets, we demonstrate that AMCPC outperforms both fully-supervised and existing self-supervised methods in disease prediction tasks, particularly when trained on limited labeled data. Our results establish AMCPC as an effective framework for leveraging unlabeled EHR data for self-supervised pretraining, which can then be fine-tuned with a small amount of labeled data to significantly enhance downstream prediction performance, reducing reliance on large-scale labeled datasets.
Heart failure is a prevalent and severe cardiovascular disease with high morbidity, disability, and mortality rates, imposing substantial burdens on global healthcare systems. Early and accurate prediction of heart failure is crucial for improving patient outcomes and reducing medical costs. However, clinical diagnosis relies on integrating rich multimodal patient information, including physiological signals, electronic health records, and clinical texts, which are high-dimensional and heterogeneous, limiting the efficiency and accuracy of manual analysis. To address these challenges, we propose a Contrastive and Adversarial Representation Enhancement framework (CARE) for heart failure prediction. The framework jointly optimizes a cross-modal contrastive objective to explicitly align semantically related modalities and constrains distributional discrepancies via adversarial learning, producing modality-invariant and highly complementary embeddings. A cross-modal attention mechanism further captures semantic correspondences and enables end-to-end integration of structured electronic health records (EHRs), signal annotation reports, and clinical texts. Experimental results on real-world medical datasets demonstrate that CARE outperforms existing approaches, improving AUROC by 0.038 and AUPRC by 0.055 over baseline methods.
Supervised machine learning applications in health care are often limited due to a scarcity of labeled training data. To mitigate the effect of small sample size, we introduce a pre-training approach, Patient Contrastive Learning of Representations (PCLR), which creates latent representations of electrocardiograms (ECGs) from a large number of unlabeled examples using contrastive learning. The resulting representations are expressive, performant, and practical across a wide spectrum of clinical tasks. We develop PCLR using a large health care system with over 3.2 million 12-lead ECGs and demonstrate that training linear models on PCLR representations achieves a 51% performance increase, on average, over six training set sizes and four tasks (sex classification, age regression, and the detection of left ventricular hypertrophy and atrial fibrillation), relative to training neural network models from scratch. We also compared PCLR to three other ECG pre-training approaches (supervised pre-training, unsupervised pre-training with an autoencoder, and pre-training using a contrastive multi ECG-segment approach), and show significant performance benefits in three out of four tasks. We found an average performance benefit of 47% over the other models and an average performance benefit of 9% compared to the best model for each task. We release PCLR to enable others to extract ECG representations at https://github.com/broadinstitute/ml4h/tree/master/model_zoo/PCLR.
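The abstract above does not spell out how contrastive pairs are formed. A minimal sketch, assuming that positives are two different ECGs recorded from the same patient and that ECGs from other patients in the batch act as negatives, might build pairs as follows. The function name and the record layout are hypothetical and are not taken from the released PCLR code.

```python
import random
from collections import defaultdict

def sample_patient_pairs(records, rng):
    """Return one positive pair per patient who has at least two ECGs.

    records: iterable of (patient_id, ecg_id) tuples (hypothetical layout).
    rng: a random.Random instance, passed in so sampling is reproducible.
    """
    by_patient = defaultdict(list)
    for patient_id, ecg_id in records:
        by_patient[patient_id].append(ecg_id)

    pairs = []
    for patient_id in sorted(by_patient):
        ecgs = by_patient[patient_id]
        if len(ecgs) >= 2:
            first, second = rng.sample(ecgs, 2)  # two distinct ECGs, same patient
            pairs.append((patient_id, first, second))
    return pairs

records = [
    ("p1", "e1"), ("p1", "e2"),
    ("p2", "e3"),                      # single ECG: cannot form a positive pair
    ("p3", "e4"), ("p3", "e5"), ("p3", "e6"),
]
pairs = sample_patient_pairs(records, random.Random(0))
```

Under this assumption no labels are needed: patient identity alone supplies the supervision signal, which is what makes the approach usable on large unlabeled ECG archives.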
… In this work, we propose a novel hierarchical contrastive framework to learn representative and generalizable … Similarly, we follow this protocol for the patient-level contrastive block. …
Survival analysis plays a crucial role in many healthcare decisions, where the risk prediction for the events of interest can support an informative outlook for a patient's medical journey. Given the existence of data censoring, an effective way of survival analysis is to enforce the pairwise temporal concordance between censored and observed data, aiming to utilize the time interval before censoring as partially observed time-to-event labels for supervised learning. Although existing studies have mostly employed ranking methods to pursue an ordering objective, contrastive methods, which learn a discriminative embedding by contrasting data points against each other, have not been explored thoroughly for survival analysis. Therefore, in this paper, we propose a novel Ontology-aware Temporality-based Contrastive Survival (OTCSurv) analysis framework that utilizes survival durations from both censored and observed data to define temporal distinctiveness and construct negative sample pairs with adjustable hardness for contrastive learning. Specifically, we first use an ontological encoder and a sequential self-attention encoder to represent the longitudinal EHR data with rich contexts. Second, we design a temporal contrastive loss to capture varying survival durations in a supervised setting through a hardness-aware negative sampling mechanism. Last, we incorporate the contrastive task into the time-to-event predictive task with multiple loss components. We conduct extensive experiments using a large EHR dataset to forecast the risk of hospitalized patients who are in danger of developing acute kidney injury (AKI), a critical and urgent medical condition. The effectiveness and explainability of the proposed model are validated through comprehensive quantitative and qualitative studies.
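To make the pairwise temporal concordance idea above concrete, the sketch below enumerates comparable pairs under right censoring using the standard rule: patient i can be ordered before patient j only if i's event was observed and occurred earlier than j's recorded time. The `hardness_weight` function is purely a hypothetical stand-in for "adjustable hardness" (here, closer durations are treated as harder negatives); the abstract does not give the actual OTCSurv formulation.

```python
import math

def comparable_pairs(durations, events):
    """List index pairs (i, j) whose temporal order is known despite censoring.

    durations: observed time for each patient (event time or censoring time).
    events: 1 if the event was observed, 0 if the patient was censored.
    """
    pairs = []
    for i, (t_i, e_i) in enumerate(zip(durations, events)):
        if not e_i:
            continue  # a censored patient cannot be the earlier member of a pair
        for j, t_j in enumerate(durations):
            if t_i < t_j:
                pairs.append((i, j))
    return pairs

def hardness_weight(t_i, t_j, tau=1.0):
    """Hypothetical weighting: durations that are closer give a weight nearer 1.

    tau is an assumed temperature controlling how quickly hardness decays.
    """
    return math.exp(-abs(t_i - t_j) / tau)

durations = [2, 5, 3]
events = [1, 0, 1]  # the second patient is censored at t = 5
pairs = comparable_pairs(durations, events)
weights = [hardness_weight(durations[i], durations[j]) for i, j in pairs]
```

Note that the censored patient still contributes as the later member of two pairs, which is how the time before censoring serves as a partial label.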
Early and accurate disease trajectory prediction is critical for personalized medicine and proactive healthcare management. In this study, we propose a self-supervised contrastive learning framework for modeling patient disease progression using structured electronic health records (EHRs). Unlike conventional supervised approaches, our method pretrains patient trajectory embeddings by contrasting similar and dissimilar patient histories, enabling effective representation learning with limited labeled data. We evaluated our model on a synthetic healthcare data set and compared it with eight state-of-the-art baselines, including LSTM, BiLSTM, GRU, CNN-LSTM, Transformer, SimCLR, and MoCo. The proposed model achieves an accuracy of 87.5%, an F1-score of 85.6%, and an AUC-ROC of 90.1%, surpassing the best baseline by 1.4% in AUC-ROC. Ablation studies highlight the importance of contrastive learning, with a performance drop of 5.6% in AUC-ROC when it is removed. Additionally, our model demonstrates strong generalization across different disease types, achieving 87.0% accuracy for cardiovascular diseases and 85.9% for diabetes. Computational analysis shows that our method reduces inference time to 7.6 ms, making it a practical solution for real-time clinical applications. These findings suggest that self-supervised contrastive learning is a promising approach for enhancing disease trajectory prediction, with potential applications in early diagnosis, risk stratification, and personalized treatment planning.
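Research on patient representation learning has matured into a complete framework built on three technical pillars: self-supervised contrastive learning, sequential trajectory modeling, and graph-structure fusion. The focus of the field is shifting from simple single-modality representations toward multimodal data consistency, long-range dependencies in clinical time series, and associations among heterogeneous entities. At the same time, the field places growing emphasis on evaluating model robustness, bias control, and clinical interpretability.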