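Bioinformatics
Genome Assembly and Sequencing Analysis Techniques
Covers high-throughput sequencing data processing, whole-genome de novo assembly, scaffolding strategies, and methods for assessing sequencing accuracy, with the aim of addressing the computational and algorithmic complexity that genome sequencing introduces.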
- Bioinformatics, Sequencing Accuracy, and the Credibility of Clinical Genomics.(W. G. Feero, 2020, JAMA)
- A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies(Wenyu Zhang, Jiajia Chen, Yang Yang, Yifei Tang, Jing Shang, Bairong Shen, 2011, PLoS ONE)
- Assembly, Annotation, and Comparative Genomics in PATRIC, the All Bacterial Bioinformatics Resource Center.(A. Wattam, T. Brettin, James J. Davis, S. Gerdes, Ron Kenyon, D. Machi, Chunhong Mao, R. Olson, R. Overbeek, G. Pusch, Maulik Shukla, Rick L. Stevens, Veronika Vonstein, A. Warren, Fangfang Xia, H. Yoo, 2018, Methods in Molecular Biology)
- A comprehensive review of scaffolding methods in genome assembly(Junwei Luo, Yawei Wei, Mengna Lyu, Zhengjiang Wu, Xiaoyan Liu, Huimin Luo, Chaokun Yan, 2021, Briefings in Bioinformatics)
- The Theory and Practice of Genome Sequence Assembly.(J. Simpson, Mihai Pop, 2015, Annual Review of Genomics and Human Genetics)
- The MaSuRCA genome assembler(A. Zimin, G. Marçais, D. Puiu, Michael Roberts, S. Salzberg, J. Yorke, 2013, Bioinformatics)
- Machine learning meets genome assembly(K. Souza, J. Setubal, A. C. P. L. F. D. Carvalho, G. Oliveira, A. Chateau, Ronnie Alves, 2018, Briefings in Bioinformatics)
- Toward a more holistic method of genome assembly assessment(Adam Thrash, F. Hoffmann, A. Perkins, 2020, BMC Bioinformatics)
- The bioinformatics tools for the genome assembly and analysis based on third-generation sequencing.(YongKiat Wee, S. Bhyan, Yining Liu, Jiachun Lu, Xiaoyan Li, Min Zhao, 2018, Briefings in Functional Genomics)
- Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities(Kevin C. Chen, L. Pachter, 2005, PLoS Computational Biology)
- Versatile genome assembly evaluation with QUAST-LG(Alla Mikheenko, Andrey D. Prjibelski, V. Saveliev, D. Antipov, A. Gurevich, 2018, Bioinformatics)
- GATB: Genome Assembly & Analysis Tool Box(E. Drezen, Guillaume Rizk, R. Chikhi, Charles Deltel, C. Lemaitre, P. Peterlongo, D. Lavenier, 2014, Bioinformatics)
- Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing(D. Horner, G. Pavesi, T. Castrignanò, P. D. Meo, S. Liuni, M. Sammeth, E. Picardi, G. Pesole, 2010, Briefings in Bioinformatics)
- Comparative genome assembly(Mihai Pop, A. Phillippy, A. Delcher, S. Salzberg, 2004, Briefings in Bioinformatics)
- A field guide to whole-genome sequencing, assembly and annotation(R. Ekblom, J. Wolf, 2014, Evolutionary Applications)
- The Challenge of Genome Sequence Assembly(Andrew Collins, 2018, The Open Bioinformatics Journal)
- Ten steps to get started in Genome Assembly and Annotation(Victoria Dominguez Del Angel, Erik Hjerde, L. Sterck, S. Capella-Gutiérrez, C. Notredame, Olga Vinnere Pettersson, J. Amselem, L. Bouri, S. Bocs, C. Klopp, J. Gibrat, A. Vlasova, B. Leskosek, Lucile Soler, Mahesh Binzer-Panchal, Henrik Lantz, 2018, F1000Research)
- The Atlas genome assembly system.(P. Havlak, Rui Chen, K. Durbin, Amy Egan, Yanru Ren, Xing-Zhi Song, G. Weinstock, R. Gibbs, 2004, Genome Research)
- Genome assembly reborn: recent computational challenges(Mihai Pop, 2009, Briefings in Bioinformatics)
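Assembly evaluation usually starts from contiguity statistics, and N50 is one of the most widely reported. The sketch below is a minimal illustration, not code from any of the tools listed above: the function name `n50` and the contig lengths are assumptions made for the example. It computes N50 as the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the cumulative length first reaches half of the total.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("at least one contig length is required")


# Hypothetical contig lengths (in bp), for illustration only.
contigs = [80, 70, 50, 40, 30, 20]
print(n50(contigs))
```

N50 says nothing about correctness: a misassembled contig can raise it, which is why reference-based or gene-content checks are normally reported alongside it.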
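Genetic Variant Annotation and Prioritization
Focuses on detecting variants (SNV/SV/CNV) in large-scale sequencing data and on prioritizing and interpreting variant pathogenicity with functional annotation tools and prediction algorithms.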
- AnnTools: a comprehensive and versatile annotation toolkit for genomic variants(Vladimir Makarov, Tina M. O’Grady, Guiqing Cai, J. Lihm, J. Buxbaum, Seungtai Yoon, 2012, Bioinformatics)
- VarSCAT: A computational tool for sequence context annotations of genomic variants(Ning Wang, Sofia Khan, L. Elo, 2022, PLOS Computational Biology)
- Computational Approach to Annotating Variants of Unknown Significance in Clinical Next Generation Sequencing.(W. Schulz, C. Tormey, Richard Torres, 2015, Laboratory Medicine)
- Variant Interpretation for Cancer (VIC): a computational tool for assessing clinical impacts of somatic variants(Max M He, Quan Li, M. Yan, Hui Cao, Yue Hu, K. He, K. Cao, Marilyn M. Li, Kai Wang, 2019, Genome Medicine)
- AnnotSV: an integrated tool for structural variations annotation(Véronique Geoffroy, Y. Hérenger, Arnaud Kress, C. Stoetzel, A. Piton, H. Dollfus, J. Muller, 2018, Bioinformatics)
- Vanno: A Visualization‐Aided Variant Annotation Tool(Po-Jung Huang, Chi-Ching Lee, B. Tan, Yuan-Ming Yeh, Kuo-Yang Huang, Ruei-chi R. Gan, Ting-Wen Chen, Cheng-Yang Lee, Sheng-Ting Yang, Chung-Shou Liao, Hsuan Liu, P. Tang, 2015, Human Mutation)
- Genomic Variant Annotation: A Comprehensive Review of Tools and Techniques(Prajna Hebbar, S. S. Kamath, 2021, Lecture Notes in Networks and Systems)
- Demystifying non-coding GWAS variants: an overview of computational tools and methods(M. Schipper, D. Posthuma, 2022, Human Molecular Genetics)
- Systematic assessment of structural variant annotation tools for genomic interpretation(Xuanshi Liu, Lei Gu, Chanjuan Hao, Wenjian Xu, Fei Leng, Peng Zhang, Wei Li, 2024, Life Science Alliance)
- Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR(Hui Yang, Kai Wang, 2015, Nature Protocols)
- A Survey of Computational Tools to Analyze and Interpret Whole Exome Sequencing Data(Jennifer D. Hintzsche, W. Robinson, A. Tan, 2016, International Journal of Genomics)
- Computational approaches for predicting the biological effect of p53 missense mutations: a comparison of three sequence analysis based methods(E. Mathé, M. Olivier, S. Kato, C. Ishioka, P. Hainaut, S. Tavtigian, 2006, Nucleic Acids Research)
- Deep learning in variant detection and annotation(Shaban Ahmad, Aman Bashar, Kushagra Khanna, Nagmi Bano, Khalid Raza, 2024, Deep Learning in Genetics and Genomics)
- ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data(Kai Wang, Mingyao Li, H. Hakonarson, 2010, Nucleic Acids Research)
- Review on the Computational Genome Annotation of Sequences Obtained by Next-Generation Sequencing(Girum Fitihamlak Ejigu, Jaehee Jung, 2020, Biology)
- Genome-wide functional annotation of variants: a systematic review of state-of-the-art tools, techniques and resources(E. Pilalis, Dimitrios Zisis, Christina Andrinopoulou, T. Karamanidou, Maria Antonara, Thanos G. Stavropoulos, A. Chatziioannou, 2025, Frontiers in Pharmacology)
- A Flexible and Open-Source Tool for Genetic Variant Annotation(A. Bombarda, Matteo Bellini, M. Iascone, Domenico Fabio Savo, 2025, Proceedings of the 18th International Joint Conference on Biomedical Engineering Systems and Technologies)
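Multi-Omics Integration and Artificial Intelligence Applications
Integrates multi-dimensional omics data (genomics, transcriptomics, proteomics, and others) and applies advanced machine learning models such as deep learning and graph neural networks to patient stratification, precision medicine, and biomarker discovery.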
- Methodology for Good Machine Learning with Multi‐Omics Data(T. Coroller, B. Sahiner, Anup K Amatya, Alexej Gossmann, Konstantinos Karagiannis, Conor Moloney, Ravi K Samala, Luis Santana-Quintero, N. Solovieff, Craig Wang, L. Amiri-Kordestani, Qian Cao, Kenny H. Cha, Rosane Charlab, Frank H. Cross, Tingting Hu, Ruihao Huang, Jeffrey Kraft, Peter Krusche, Yutong Li, Zheng Li, Ilya Mazo, R. Paul, Susan Schnakenberg, P. Serra, Sean Smith, Chi Song, F. Su, Mohit Tiwari, Colin Vechery, X. Xiong, J. P. Zarate, Hao Zhu, A. Chakravartty, Qi Liu, D. Ohlssen, Nicholas Petrick, J. Schneider, Mark O Walderhaug, Emmanuel Zuber, 2023, Clinical Pharmacology & Therapeutics)
- Interpretable machine learning methods for predictions in systems biology from omics data(David Sidak, Jana Schwarzerová, W. Weckwerth, Steffen Waldherr, 2022, Frontiers in Molecular Biosciences)
- Machine learning and deep learning methods that use omics data for metastasis prediction(Somayah Albaradei, Maha A. Thafar, A. Alsaedi, Christophe Van Neste, T. Gojobori, M. Essack, Xin Gao, 2021, Computational and Structural Biotechnology Journal)
- Integrated Multi-Omics Analyses in Oncology: A Review of Machine Learning Methods and Tools(G. Nicora, Francesca Vitali, A. Dagliati, N. Geifman, R. Bellazzi, 2020, Frontiers in Oncology)
- Using machine learning approaches for multi-omics data analysis: A review.(P. Reel, S. Reel, E. Pearson, E. Trucco, E. Jefferson, 2021, Biotechnology Advances)
- A comprehensive review of machine learning techniques for multi-omics data integration: challenges and applications in precision oncology.(D. Acharya, Anirban Mukhopadhyay, 2024, Briefings in Functional Genomics)
- Machine Learning: A New Prospect in Multi-Omics Data Analysis of Cancer(B. Arjmand, Shayesteh Kokabi Hamidpour, Akram Tayanloo-Beik, P. Goodarzi, H. Aghayan, H. Adibi, B. Larijani, 2022, Frontiers in Genetics)
- Unlocking plant bioactive pathways: omics data harnessing and machine learning assisting.(Mickael Durand, Sébastien Besseau, Nicolas Papon, V. Courdavault, 2024, Current Opinion in Biotechnology)
- Graph machine learning for integrated multi-omics analysis(N. Valous, Ferdinand Popp, I. Zörnig, D. Jäger, P. Charoentong, 2024, British Journal of Cancer)
- ExCAPE-DB: an integrated large scale dataset facilitating Big Data analysis in chemogenomics(Jiangming Sun, N. Jeliazkova, V. Chupakhin, J. G. Dzib, O. Engkvist, L. Carlsson, J. Wegner, H. Ceulemans, I. Georgiev, V. Jeliazkov, N. Kochev, T. Ashby, Hongming Chen, 2017, Journal of Cheminformatics)
- DeepProg: an ensemble of deep-learning and machine-learning models for prognosis prediction using multi-omics data(Olivier B. Poirion, Zheng Jing, K. Chaudhary, Sijia Huang, L. Garmire, 2021, Genome Medicine)
- Biomarker discovery studies for patient stratification using machine learning analysis of omics data: a scoping review(E. Glaab, A. Rauschenberger, R. Banzi, C. Gerardi, Paula García, J. Demotes, 2021, BMJ Open)
- Machine learning and systems genomics approaches for multi-omics data(E. Lin, H. Lane, 2017, Biomarker Research)
- Machine learning for multi-omics data integration in cancer(Zhaoxiang Cai, R. Poulos, Jia Liu, Qing Zhong, 2022, iScience)
- Identification of the expressome by machine learning on omics data(Ryan C. Sartor, Jaclyn M. Noshay, Nathan M. Springer, S. Briggs, 2019, Proceedings of the National Academy of Sciences)
- Dealing with dimensionality: the application of machine learning to multi-omics data(Dylan Feldner-Busztin, Panos Firbas Nisantzis, S. Edmunds, G. Boza, F. Racimo, S. Gopalakrishnan, Morten Tønsberg Limborg, L. Lahti, G. G. Polavieja, 2023, Bioinformatics)
- Machine learning-based analysis of multi-omics data on the cloud for investigating gene regulations(Minsik Oh, Sungjoon Park, Sun Kim, Heejoon Chae, 2020, Briefings in Bioinformatics)
- Machine learning meets omics: applications and perspectives(Rufeng Li, Lixin Li, Yungang Xu, Juan Yang, 2021, Briefings in Bioinformatics)
- Omics, big data and machine learning as tools to propel understanding of biological mechanisms and to discover novel diagnostics and therapeutics(Nikolaos Perakakis, A. Yazdani, G. Karniadakis, C. Mantzoros, 2018, Metabolism)
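Structural Bioinformatics and Molecular Function Simulation
Examines tertiary structure prediction for biological macromolecules (proteins, RNA, and others), molecular dynamics simulation, and biophysical modeling methods for elucidating molecular function and interactions.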
- Bioinformatics and molecular modeling in glycobiology(M. Frank, S. Schloissnig, 2010, Cellular and Molecular Life Sciences)
- Bioinformatics Tools and Benchmarks for Computational Docking and 3D Structure Prediction of RNA-Protein Complexes(C. Nithin, Pritha Ghosh, J. Bujnicki, 2018, Genes)
- Membrane protein structural bioinformatics.(Timothy Nugent, David T. Jones, 2012, Journal of Structural Biology)
- Protein Structure Prediction(S Agnihotry, RK Pathak, DB Singh, A Tiwari, I Hussain, 2011, SpringerReference)
- Structural Bioinformatics: Methods and Protocols(Zoltán Gáspári, 2020, Methods in Molecular Biology)
- Protein structure prediction: inroads to biology.(Donald Petrey, B. Honig, 2005, Molecular Cell)
- Structural Bioinformatics(J Gu, PE Bourne, 2022, F1000Research Channels)
- Advances in Structural Bioinformatics(Juveriya Israr, Shabroz Alam, Sahabjada Siddiqui, Sankalp Misra, Indrajeet Singh, Ajay Kumar, 2024, Advances in Bioinformatics)
- Bioinformatics methods to predict protein structure and function(Y. Edwards, A. Cottage, 2003, Molecular Biotechnology)
- Prediction of protein structure and function by using bioinformatics.(Y. Edwards, A. Cottage, 2001, Genomics Protocols)
- Advances in protein structure prediction and design(B. Kuhlman, P. Bradley, 2019, Nature Reviews Molecular Cell Biology)
- Molecular modeling of protein structure and function: A bioinformatic approach(M. Liebman, 1988, Journal of Computer-Aided Molecular Design)
- State-of-the-art bioinformatics protein structure prediction tools (Review).(Athanasia Pavlopoulou, I. Michalopoulos, 2011, International Journal of Molecular Medicine)
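Big Data Analysis and Biocomputing Infrastructure
Concerns the storage, management, and processing of massive biological datasets, covering high-performance computing (HPC), GPU-accelerated architectures, visualization tools, and general-purpose bioinformatics algorithm frameworks.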
- Bioinformatic analysis of proteomics data(A. Schmidt, I. Forné, A. Imhof, 2014, BMC Systems Biology)
- Bioinformatics: from genome data to biological knowledge.(Miguel A. Andrade, Chris Sander, 1997, Current Opinion in Biotechnology)
- Exploiting Big Biology: Integrating Large-scale Biological Data for Function Inference(E. Marcotte, Shailesh V. Date, 2001, Briefings in Bioinformatics)
- Algorithms in Bioinformatics: 5th International Workshop, WABI 2005, Mallorca, Spain, October 3-6, 2005, Proceedings (Lecture Notes in Computer Science / Lecture Notes in Bioinformatics)(R. Casadio, G. Myers, 2005, Lecture Notes in Computer Science)
- Biclustering algorithms for biological data analysis: a survey(S. Madeira, Arlindo L. Oliveira, 2004, IEEE/ACM Transactions on Computational Biology and Bioinformatics)
- TBtools - an integrative toolkit developed for interactive analyses of big biological data.(Chengjie Chen, Hao Chen, Yi Zhang, Hannah R Thomas, Margaret H. Frank, Yehua He, Rui Xia, 2020, Molecular Plant)
- High Performance Computational Methods for Biological Sequence Analysis(T. K. Yap, O. Frieder, R. Martino, 1996, Springer US)
- Big Data Analysis in Computational Biology and Bioinformatics.(Prakash Kumar, R. Paul, H. S. Roy, M. Yeasin, Ajit, A. K. Paul, 2024, Methods in Molecular Biology)
- BioSeq-Diabolo: Biological sequence similarity analysis using Diabolo(Hongliang Li, Bin Liu, 2023, PLOS Computational Biology)
- Computational sequence analysis revisited: new databases, software tools, and the research opportunities they engender.(MS Boguski, 1992, Journal of Lipid Research)
- Biological Databases for Human Research(D. Zou, Lina Ma, Jun Yu, Zhang Zhang, 2015, Genomics, Proteomics & Bioinformatics)
- Statistics and bioinformatics in nutritional sciences: analysis of complex data in the era of systems biology(Wenjiang J. Fu, A. Stromberg, K. Viele, R. Carroll, Guoyao Wu, 2010, The Journal of Nutritional Biochemistry)
- Bioinformatics Tools for Mass Spectroscopy-Based Metabolomic Data Processing and Analysis(M. Sugimoto, M. Kawakami, M. Robert, T. Soga, M. Tomita, 2012, Current Bioinformatics)
- Computational solutions to large-scale data management and analysis(E. Schadt, M. Linderman, J. Sorenson, Lawrence Lee, G. Nolan, 2010, Nature Reviews Genetics)
- Wavelets in bioinformatics and computational biology: state of art and perspectives(Pietro Liò, 2003, Bioinformatics)
- Genomics(A Polanski, M Kimmel, 2007, Bioinformatics)
- Computing Platforms for Big Biological Data Analytics: Perspectives and Challenges(Zekun Yin, Haidong Lan, Guangming Tan, Mian Lu, A. Vasilakos, Weiguo Liu, 2017, Computational and Structural Biotechnology Journal)
- Biotite: a unifying open source computational biology framework in Python(Patrick Kunzmann, K. Hamacher, 2018, BMC Bioinformatics)
- Graphics processing units in bioinformatics, computational biology and systems biology(Marco S. Nobile, P. Cazzaniga, A. Tangherloni, D. Besozzi, 2016, Briefings in Bioinformatics)
- Parallel computation in biological sequence analysis(Tieng K. Yap, Ophir Frieder, R.L. Martino, 1998, IEEE Transactions on Parallel and Distributed Systems)
- BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis(S. Durinck, Y. Moreau, A. Kasprzyk, Sean Davis, B. Moor, A. Brazma, W. Huber, 2005, Bioinformatics)
- Overview of Commonly Used Bioinformatics Methods and Their Applications(IM Kapetanovic, S Rosenfeld, 2004, Annals of the New York Academy of Sciences)
- Bioinformatics Methods in Medical Genetics and Genomics(Y. Orlov, A. Baranova, T. Tatarinova, 2020, International Journal of Molecular Sciences)
- Bioinformatics Big Data(M. D'Antonio, 2019, Encyclopedia of Big Data Technologies)
- Data integration in biological research: an overview(Vasileios Lapatas, Michalis Stefanidakis, R. Jiménez, A. Via, M. Schneider, 2015, Journal of Biological Research-Thessaloniki)
- Introduction to bioinformatics.(Tolga Can, 2014, Methods in Molecular Biology)
- Big Data Bioinformatics(C. Greene, J. Tan, Matthew Ung, J. H. Moore, Chao Cheng, 2014, Journal of Cellular Physiology)
- A Survey of Biological Data in a Big Data Perspective(Gabriel Dall'Alba, Pedro Lenz Casa, F. P. Abreu, Daniel Luis Notari, Scheila de Avila e Silva, 2022, Big Data)
- Bioinformatics applications for pathway analysis of microarray data.(T. Werner, 2008, Current Opinion in Biotechnology)
- VisANT: an online visualization and analysis tool for biological interaction data(Zhenjun Hu, Joseph C. Mellor, Jie Wu, Charles DeLisi, 2004, BMC Bioinformatics)
- Visualising biological data: a semantic approach to tool and database integration(S. Pettifer, D. Thorne, P. McDermott, James Marsh, A. Villéger, D. Kell, Teresa K. Attwood, 2009, BMC Bioinformatics)
- Algorithms in Bioinformatics(Daniel G. Brown, Burkhard Morgenstern, 2014, Lecture Notes in Computer Science)
- Challenges in Integrating Biological Data Sources(Susan B. Davidson, G. C. Overton, P. Buneman, 1995, Journal of Computational Biology)
- Parameterized complexity analysis in computational biology(H. Bodlaender, R. Downey, M. Fellows, M. Hallett, T. Wareham, 1995, Bioinformatics)
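Methodological Reviews in Bioinformatics
Collects broad retrospectives of the field, systems biology modeling, clinical translation, and overviews of foundational theory, giving research a field-wide perspective.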
- Bioinformatics and Computational Tools for Next-Generation Sequencing Analysis in Clinical Genetics(R. Pereira, J. Oliveira, M. Sousa, 2020, Journal of Clinical Medicine)
- Understanding Biology Through Bioinformatics(D. Rajpal, 2005, International Journal of Toxicology)
- Computational Modeling, Formal Analysis, and Tools for Systems Biology(E. Bartocci, Pietro Liò, 2016, PLOS Computational Biology)
- bioDBnet: the biological database network(U. Mudunuri, Anney Che, Ming Yi, R. Stephens, 2009, Bioinformatics)
- Big Data Analytics in Biology: A Systematic Review of Methods for Large-Scale Data Processing(Weipan Wang, Bing Zhang, Manman Li, 2024, Computational Molecular Biology)
- An Overview of Bioinformatics and Its Applications(Anushka Garhwal, Mangala Hegde, Anjana Sajeev, Ajaikumar B. Kunnumakkara, 2026, Bioinformatics, Computational Chemistry, and AI in Drug Innovation)
- Hidden Markov models in computational biology. Applications to protein modeling.(Anders Krogh, 1993, Journal of Molecular Biology)
- Computational approaches to identify functional genetic variants in cancer genomes(A. González-Pérez, Ville Mustonen, B. Reva, G. Ritchie, Pau Creixell, R. Karchin, M. Vázquez, J. L. Fink, K. Kassahn, J. Pearson, Gary D Bader, P. Boutros, L. Muthuswamy, B. F. Ouellette, J. Reimand, R. Linding, T. Shibata, A. Valencia, A. Butler, S. Dronov, Paul Flicek, N. Shannon, H. Carter, L. Ding, C. Sander, J. Stuart, Lincoln D. Stein, N. López-Bigas, 2013, Nature Methods)
- Sequence Alignment in Molecular Biology(A. Apostolico, R. Giancarlo, 1998, Journal of Computational Biology)
- Computational identification of transcriptional regulatory elements in DNA sequence(D. GuhaThakurta, 2006, Nucleic Acids Research)
- ProteoLens: a visual analytic tool for multi-scale database-driven biological network data mining(T. Huan, A. Sivachenko, S. Harrison, J. Chen, 2008, BMC Bioinformatics)
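This report systematically organizes the bioinformatics literature into a knowledge framework with six dimensions: genome sequencing and assembly, variant functional analysis, multi-omics big data mining and artificial intelligence, structural bioinformatics, computing architecture and infrastructure, and methodological reviews of the field. The classification spans the full life cycle, from low-level sequencing data processing and algorithmic and architectural innovation to high-level biological model interpretation and clinical precision medicine, and reflects the discipline's movement toward more systematic, intelligent, and high-performance handling of complex biological data.
113 related publications in total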
… accurately, such as the significance analysis of microarrays [31]. There are also some other data mining related problems in microarray data analysis. Some clustering problems can be …
… amount of data. In computational biology and bioinformatics, big data analysis is widely used for the analysis of genomics, transcriptomics, proteomics, and other high-throughput …
A large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the results from the application of standard clustering methods to genes are limited. This limitation is imposed by the existence of a number of experimental conditions where the activity of genes is uncorrelated. A similar limitation exists when clustering of conditions is performed. For this reason, a number of algorithms that perform simultaneous clustering on the row and column dimensions of the data matrix have been proposed. The goal is to find submatrices, that is, subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated activities for every condition. In this paper, we refer to this class of algorithms as biclustering. Biclustering is also referred to in the literature as coclustering and direct clustering, among other names, and has also been used in fields such as information retrieval and data mining. In this comprehensive survey, we analyze a large number of existing approaches to biclustering, and classify them in accordance with the type of biclusters they can find, the patterns of biclusters that are discovered, the methods used to perform the search, the approaches used to evaluate the solution, and the target applications.
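To make the idea of a coherent submatrix concrete, the sketch below scores a candidate bicluster with the mean squared residue, a coherence measure associated with Cheng and Church's biclustering work. It is only one of many possible evaluation criteria and is not presented as the survey's own method; the function name and the small expression matrix are assumptions made for illustration. A score of zero means the selected genes move in parallel across the selected conditions.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix selected by rows x cols.

    Each residue is a_ij - rowmean_i - colmean_j + overall_mean, computed
    inside the submatrix; lower scores mean a more coherent bicluster.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(sub), len(sub[0])
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue * residue
    return total / (n_rows * n_cols)


# Hypothetical expression matrix: rows are genes, columns are conditions.
expression = [
    [1.0, 2.0, 9.0, 3.0],
    [2.0, 3.0, 1.0, 4.0],
    [5.0, 6.0, 7.0, 7.0],
    [0.0, 8.0, 2.0, 1.0],
]

# Genes 0-2 shift by a constant across conditions 0, 1 and 3.
coherent = mean_squared_residue(expression, [0, 1, 2], [0, 1, 3])
everything = mean_squared_residue(expression, range(4), range(4))
print(coherent, everything)
```

The whole matrix scores far worse than the selected submatrix, which is the situation the abstract describes: genes that look uncorrelated over all conditions can still be tightly coordinated over a subset of them.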
… as ‘in silico’ biology. This article attempts to present a simple overview of bioinformatics, and an … is made to discuss annotation, which plays a crucial role in our understanding of biology. …
Bioinformatics, an interdisciplinary field blending science, mathematics, statistics, and genetics with computational technology, has rapidly evolved, catalyzing scientific progress and research. Computational methods began integrating into biological data analysis as early as the 1950s. Initially focused on sequence analysis and structure prediction, bioinformatics has expanded its scope to encompass genomics, proteomics, transcriptomics, and metabolomics. The milestone Human Genome Project (HGP) exemplifies bioinformatics' pivotal role in research. Recent advancements highlight bioinformatics' critical contributions, notably in combating the COVID-19 pandemic through in silico tools and analyzing protein structures to facilitate therapeutic development. High-throughput technologies further accelerate progress in biological science and drug discovery. Despite these advancements, bioinformatics faces ongoing challenges in managing, compiling, and interpreting vast biological datasets, necessitating continuous development of databases, tools, and servers. This chapter offers a comprehensive overview of bioinformatics, delving into its origins, diverse applications across disciplines, and its crucial role in disease management, drug discovery, diagnosis, and personalized medicine. The chapter emphasizes the transformative impact of bioinformatics across domains and its potential for future innovations.
… bioinformatics is making a key contribution to the organization and analysis of the massive amount of biological data … ', robotized and miniaturized methods of biological experimentation. …
… Fast and up-to-date data retrieval is possible as the package executes direct SQL queries to … data analysis in Bioconductor creating a powerful environment for biological data mining. …
… a high degree of interactive data analysis and visualization. … large scientific data and more or less used for bioinformatics … workflows or a pipelined data analysis. Availability of an easily …
Most biochemical reactions in a cell are regulated by highly specialized proteins, which are the prime mediators of the cellular phenotype. Therefore the identification, quantitation and characterization of all proteins in a cell are of utmost importance to understand the molecular processes that mediate cellular physiology. With the advent of robust and reliable mass spectrometers that are able to analyze complex protein mixtures within a reasonable timeframe, the systematic analysis of all proteins in a cell becomes feasible. Besides the ongoing improvements of analytical hardware, standardized methods to analyze and study all proteins have to be developed that allow the generation of testable new hypotheses based on the enormous pre-existing amount of biological information. Here we discuss current strategies on how to gather, filter and analyze proteomic data sets using available software packages.
… Some of the fuzzy techniques that have been employed for biomedical data analysis include fuzzy clustering, fuzzy classification, and hybrid systems, such as combinations of fuzzy …
… data to discover underlying biology?” This review deals with the latter question. We will introduce key concepts in the analysis of big data… successfully discovered underlying biology. We …
Over the past two decades, there have been revolutionary developments in life science technologies characterized by high throughput, high efficiency, and rapid computation. Nutritionists now have the advanced methodologies for the analysis of DNA, RNA, protein, low-molecular-weight metabolites, as well as access to bioinformatics databases. Statistics, which can be defined as the process of making scientific inferences from data that contain variability, has historically played an integral role in advancing nutritional sciences. Currently, in the era of systems biology, statistics has become an increasingly important tool to quantitatively analyze information about biological macromolecules. This article describes general terms used in statistical analysis of large, complex experimental data. These terms include experimental design, power analysis, sample size calculation, and experimental errors (type I and II errors) for nutritional studies at population, tissue, cellular, and molecular levels. In addition, we highlighted various sources of experimental variations in studies involving microarray gene expression, real-time polymerase chain reaction, proteomics, and other bioinformatics technologies. Moreover, we provided guidelines for nutritionists and other biomedical scientists to plan and conduct studies and to analyze the complex data. Appropriate statistical analyses are expected to make an important contribution to solving major nutrition-associated problems in humans and animals (including obesity, diabetes, cardiovascular disease, cancer, ageing, and intrauterine fetal retardation).
Biological systems are increasingly being studied in a holistic manner, using omics approaches, to provide quantitative and qualitative descriptions of the diverse collection of cellular components. Among the omics approaches, metabolomics, which deals with the quantitative global profiling of small molecules or metabolites, is being used extensively to explore the dynamic response of living systems, such as organelles, cells, tissues, organs and whole organisms, under diverse physiological and pathological conditions. This technology is now used routinely in a number of applications, including basic and clinical research, agriculture, microbiology, food science, nutrition, pharmaceutical research, environmental science and the development of biofuels. Of the multiple analytical platforms available to perform such analyses, nuclear magnetic resonance and mass spectrometry have come to dominate, owing to the high resolution and large datasets that can be generated with these techniques. The large multidimensional datasets that result from such studies must be processed and analyzed to render this data meaningful. Thus, bioinformatics tools are essential for the efficient processing of huge datasets, the characterization of the detected signals, and to align multiple datasets and their features. This paper provides a state-of-the-art overview of the data processing tools available, and reviews a collection of recent reports on the topic. Data conversion, pre-processing, alignment, normalization and statistical analysis are introduced, with their advantages and disadvantages, and comparisons are made to guide the reader.
… biological processes often represented in the form of gene-ontology (GO) categories or metabolic and regulatory pathways as derived from literature analysis… biological network analysis…
A massive volume of biological sequence data is available in over 36 different databases worldwide, including the sequence data generated by the Human Genome Project. These databases, which also contain biological and bibliographical information, are growing at an exponential rate. Consequently, the computational demands needed to explore and analyze the data contained in these databases are quickly becoming a great concern. To meet these demands, we must use high performance computing systems, such as parallel computers and distributed networks of workstations. We present two parallel computational methods for analyzing these biological sequences. The first method is used to retrieve sequences that are homologous to a query sequence. The biological information associated with the homologous sequences found in the database may provide important clues to the structure and function of the query sequence. The second method, which helps in the prediction of the function, structure, and evolutionary history of biological sequences, is used to align a number of homologous sequences with each other. These two parallel computational methods were implemented and evaluated on an Intel iPSC/860 parallel computer. The resulting performance demonstrates that parallel computational methods can significantly reduce the computational time needed to analyze the sequences contained in large databases.
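The first method described above, retrieving database sequences similar to a query, lends itself to a partition-and-merge decomposition: split the database, score each partition independently, then merge the hits. The sketch below illustrates only that decomposition and is not the authors' implementation. The shared k-mer count is a deliberately crude stand-in for a real alignment score, the sequences are made up, and a thread pool is used to keep the example portable; CPU-bound scoring in practice would run on separate processes or nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def kmers(seq, k=3):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def score_chunk(query, chunk, k=3):
    """Score every (name, sequence) pair in one database partition."""
    query_kmers = kmers(query, k)
    return [(len(query_kmers & kmers(seq, k)), name) for name, seq in chunk]


def parallel_search(query, database, workers=2, top=2):
    """Partition the database, score partitions concurrently, merge hits."""
    items = sorted(database.items())
    chunks = [items[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(lambda c: score_chunk(query, c), chunks))
    merged = [hit for part in partial for hit in part]
    # Highest score first; ties broken by name so the result is deterministic.
    merged.sort(key=lambda hit: (-hit[0], hit[1]))
    return [(name, score) for score, name in merged[:top]]


# Hypothetical toy database, for illustration only.
database = {
    "seqA": "ACGTACGTGA",
    "seqB": "TTTTGGGGCC",
    "seqC": "ACGTACGAGA",
    "seqD": "GGGGCCCCAA",
}
print(parallel_search("ACGTACGTGA", database))
```

Because each partition is scored without reference to the others, the only sequential step is the final merge, which is why this kind of search scales well as workers are added.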
As the key to biological sequence structure and function prediction, disease diagnosis and treatment, biological sequence similarity analysis has attracted more and more attention. However, the existing computational methods fail to accurately analyse the biological sequence similarities because of the various data types (DNA, RNA, protein, disease, etc.) and their low sequence similarities (remote homology). Therefore, new concepts and techniques are desired to solve this challenging problem. Biological sequences (DNA, RNA and protein sequences) can be considered as the sentences of “the book of life”, and their similarities can be considered as the biological language semantics (BLS). In this study, we are seeking the semantics analysis techniques derived from the natural language processing (NLP) to comprehensively and accurately analyse the biological sequence similarities. 27 semantics analysis methods derived from NLP were introduced to analyse biological sequence similarities, bringing new concepts and techniques to biological sequence similarity analysis. Experimental results show that these semantics analysis methods are able to facilitate the development of protein remote homology detection, circRNA-disease associations identification and protein function annotation, achieving better performance than the other state-of-the-art predictors in the related fields. Based on these semantics analysis methods, a platform called BioSeq-Diabolo has been constructed, which is named after a popular traditional sport in China. The users only need to input the embeddings of the biological sequence data. BioSeq-Diabolo will intelligently identify the task, and then accurately analyse the biological sequence similarities based on biological language semantics. BioSeq-Diabolo will integrate different biological sequence similarities in a supervised manner by using Learning to Rank (LTR), and the performance of the constructed methods will be evaluated and analysed so as to recommend the best methods for the users. The web server and stand-alone package of BioSeq-Diabolo can be accessed at http://bliulab.net/BioSeq-Diabolo/server/.
… various types of sequence-comparison analyses, eg, multiple … However, the number of sequences that can be examined at … for these analyses, where k is the number of sequences and …
As the amount of biological data in the public domain grows, so does the range of modeling and analysis techniques employed in systems biology. In recent years, a number of theoretical computer science developments have enabled modeling methodology to keep pace. The growing interest in systems biology in executable models and their analysis has necessitated the borrowing of terms and methods from computer science, such as formal analysis, model checking, static analysis, and runtime verification. Here, we discuss the most important and exciting computational methods and tools currently available to systems biologists. We believe that a deeper understanding of the concepts and theory highlighted in this review will produce better software practice, improved investigation of complex biological processes, and even new ideas and better feedback into computer science.
Prediction of the biological effect of missense substitutions has become important because they are often observed in known or candidate disease susceptibility genes. In this paper, we carried out a 3-step analysis of 1514 missense substitutions in the DNA-binding domain (DBD) of TP53, the most frequently mutated gene in human cancers. First, we calculated two types of conservation scores based on a TP53 multiple sequence alignment (MSA) for each substitution: (i) Grantham Variation (GV), which measures the degree of biochemical variation among amino acids found at a given position in the MSA; (ii) Grantham Deviation (GD), which reflects the ‘biochemical distance’ of the mutant amino acid from the observed amino acid at a particular position (given by GV). Second, we used a method that combines GV and GD scores, Align-GVGD, to predict the transactivation activity of each missense substitution. We compared our predictions against experimentally measured transactivation activity (yeast assays) to evaluate their accuracy. Finally, the prediction results were compared with those obtained by the program Sorting Intolerant from Tolerant (SIFT) and Dayhoff's classification. Our predictions yielded high prediction accuracy for mutants showing a loss of transactivation (∼88% specificity) with lower prediction accuracy for mutants with transactivation similar to that of the wild-type (67.9 to 71.2% sensitivity). Align-GVGD results were comparable to SIFT (88.3 to 90.6% and 67.4 to 70.3% specificity and sensitivity, respectively) and outperformed Dayhoff's classification (80 and 40.9% specificity and sensitivity, respectively). These results further demonstrate the utility of the Align-GVGD method, which was previously applied to BRCA1. Align-GVGD is available online at .
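As a reading aid for the specificity and sensitivity figures quoted above, the sketch below computes both from paired predictions and assay outcomes. It assumes, following the abstract's usage, that the positive class is "transactivation similar to wild-type" and the negative class is "loss of transactivation". The function name and the toy labels are hypothetical and are not taken from the Align-GVGD software.

```python
def sensitivity_specificity(predicted, observed):
    """Return (sensitivity, specificity) for paired boolean labels.

    True marks the positive class. Here that is assumed to mean
    "transactivation similar to wild-type"; False means "loss of
    transactivation".
    """
    pairs = list(zip(predicted, observed))
    tp = sum(1 for p, o in pairs if p and o)          # true positives
    fn = sum(1 for p, o in pairs if not p and o)      # false negatives
    tn = sum(1 for p, o in pairs if not p and not o)  # true negatives
    fp = sum(1 for p, o in pairs if p and not o)      # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must occur in the observed labels")
    sensitivity = tp / (tp + fn)  # fraction of observed positives recovered
    specificity = tn / (tn + fp)  # fraction of observed negatives recovered
    return sensitivity, specificity


# Toy data, invented purely for illustration.
observed = [True, True, True, False, False, False, False]
predicted = [True, True, False, False, False, False, True]
sens, spec = sensitivity_specificity(predicted, observed)
```

With this convention, a predictor that is good at flagging loss-of-function mutants but misses some wild-type-like ones shows high specificity and lower sensitivity, which is the pattern the abstract reports.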
… Biological sequence analysis includes the mathematical and computational methods for … In support of this project, hundreds of computational biology groups have been established …
Several studies in Bioinformatics, Computational Biology and Systems Biology rely on the definition of physico-chemical or mathematical models of biological systems at different scales and levels of complexity, ranging from the interaction of atoms in single molecules up to genome-wide interaction networks. Traditional computational methods and software tools developed in these research fields share a common trait: they can be computationally demanding on Central Processing Units (CPUs), which limits their applicability in many circumstances. To overcome this issue, general-purpose Graphics Processing Units (GPUs) are gaining increasing attention from the scientific community, as they can considerably reduce the running time required by standard CPU-based software and allow more intensive investigations of biological systems. In this review, we present a collection of GPU tools recently developed to perform computational analyses in life science disciplines, emphasizing the advantages and the drawbacks of using these parallel architectures. The complete list of GPU-powered tools reviewed here is available at http://bit.ly/gputools.
The increasing quantity and complexity of sequences and structural data for proteins and nucleic acids create both problems and opportunities for biomedical researchers. Fortunately, a new generation of practical computer tools for data analysis and integrated information retrieval is emerging. Recent developments in fast database searching, multiple sequence alignment, and molecular modeling are discussed and windows-based, mouse-driven software for CD-ROM and network information retrieval are described. Each method is illustrated with a practical example pertinent to lipid research. In particular, the connection among cholesteryl ester transfer protein, bactericidal permeability-increasing protein, and lipopolysaccharide-binding proteins is determined; novel repetitive sequence motifs in mammalian farnesyltransferase subunits and related yeast prenyltransferases are derived; biochemical insights from a three-dimensional model of human apolipoprotein D based on two insect lipocalins are discussed; the relationship between apolipoprotein D and gross cystic disease fluid protein from human breast is reviewed; and prospects for modeling apolipoprotein E-related proteins are described. In addition, information on a number of general and special-purpose sequence, motif, and structural databases is included.
Motivation: At a recent meeting†, the wavelet transform was depicted as a small child kicking back at its father, the Fourier transform. Wavelets are more efficient and faster than Fourier methods in capturing the essence of data. Nowadays there is a growing interest in using wavelets in the analysis of biological sequences and molecular biology-related signals. Results: This review is intended to summarize the potential of state of the art wavelets, and in particular wavelet statistical methodology, in different areas of molecular biology: genome sequence, protein structure and microarray data analysis. I conclude by discussing the use of wavelets in modeling biological structures. Contact: plio@hgmp.mrc.ac.uk † XIX SMC 2001 ‘Wavelets in Statistics’, Vico Equense, Naples, I, 2–7 April 2001.
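To make the idea of a wavelet decomposition of a sequence concrete, here is a minimal sketch of a single level of the Haar transform applied to a GC-indicator signal. The choice of the Haar wavelet, the GC encoding, and the function names are illustrative assumptions; the review does not prescribe them.

```python
import math


def gc_indicator(sequence):
    """Encode a DNA string as 1.0 for G or C and 0.0 for any other symbol."""
    return [1.0 if base in ("G", "C") else 0.0 for base in sequence.upper()]


def haar_step(signal):
    """Apply one level of the orthonormal Haar wavelet transform.

    Returns (approximation, detail): scaled pairwise sums capture the
    local average, scaled pairwise differences capture local change.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = 1.0 / math.sqrt(2.0)
    approximation = [(signal[i] + signal[i + 1]) * scale
                     for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * scale
              for i in range(0, len(signal), 2)]
    return approximation, detail


# A GC-rich half followed by an AT-rich half.
approx, detail = haar_step(gc_indicator("GGCCATAT"))
```

In this toy signal the detail coefficients are all zero because composition is constant within each pair, while the approximation coefficients drop from a positive value to zero where the GC-rich block ends.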
… Molecular biology is becoming a computationally intense realm of contemporary science … compare and analyze effectively large and growing numbers of bio-sequences are found of …
Identification and annotation of all the functional elements in the genome, including genes and the regulatory sequences, is a fundamental challenge in genomics and computational biology. Since regulatory elements are frequently short and variable, their identification and discovery using computational algorithms is difficult. However, significant advances have been made in the computational methods for modeling and detection of DNA regulatory elements. The availability of complete genome sequence from multiple organisms, as well as mRNA profiling and high-throughput experimental methods for mapping protein-binding sites in DNA, have contributed to the development of methods that utilize these auxiliary data to inform the detection of transcriptional regulatory elements. Progress is also being made in the identification of cis-regulatory modules and higher order structures of the regulatory sequences, which is essential to the understanding of transcription regulation in the metazoan genomes. This article reviews the computational approaches for modeling and identification of genomic regulatory elements, with an emphasis on the recent developments, and current challenges.
As molecular biology creates an increasing amount of sequence and structure data, the number of software tools for analyzing these data is also rising. Most programs are made for a specific task, so the user often needs to combine multiple programs to reach a goal. This can make data processing unwieldy, inflexible and even inefficient due to the overhead of read/write operations. Therefore, it is crucial to have a comprehensive, accessible and efficient computational biology framework in a scripting language to overcome these limitations. We have developed the Python package Biotite: a general computational biology framework that represents sequence and structure data based on NumPy ndarrays. Furthermore, the package contains seamless interfaces to biological databases and external software. The source code is freely accessible at https://github.com/biotite-dev/biotite. Biotite is unifying in two ways. First, it bundles popular tasks in sequence analysis and structural bioinformatics in a consistently structured package. Second, it addresses two groups of users: novice programmers get easy access to Biotite thanks to its simplicity and comprehensive documentation, while advanced users can profit from its high performance and extensibility. They can implement their algorithms on top of Biotite, so they can skip writing code for general functionality (like file parsers) and focus on what makes their software unique.
… The rate of generation of sequence data in recent years provides abundant opportunities for the development of new approaches to problems in computational biology. In this paper, we …
… genomics, the relevance of the problem is therefore constrained by the need for sound, efficient and specialized algorithms, … Indeed the ultimate goal is to implement algorithms capable …
… The most fundamental differences between available mapping algorithms are, arguably, whether the genome or the sequence reads are indexed, and the indexing method applied. …
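The "index the genome" side of this distinction can be sketched as follows: a hash table maps every k-mer of the reference to its positions, and each read is seeded by its first k-mer and then verified. This is a minimal illustration that handles exact matches only; the function names, the choice of a hash-based k-mer index, and the toy sequences are assumptions and do not describe any particular mapper.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map each k-mer in the genome to the list of its start positions."""
    index = defaultdict(list)
    for start in range(len(genome) - k + 1):
        index[genome[start:start + k]].append(start)
    return index


def map_read(read, genome, index, k):
    """Return genome positions where the read matches exactly.

    The read's first k-mer is used as a seed; each seed hit is then
    verified against the full read.
    """
    if len(read) < k:
        return []
    hits = []
    for start in index.get(read[:k], []):
        if genome[start:start + len(read)] == read:
            hits.append(start)
    return hits


# Hypothetical toy reference and reads.
genome = "ACGTACGTTAGC"
k = 4
index = build_kmer_index(genome, k)
positions = map_read("ACGTT", genome, index, k)
```

Indexing the reads instead would invert the roles: the k-mer table is built from the read set and the genome is streamed past it. Practical mappers also replace the plain hash table with more compact structures and tolerate mismatches, which this sketch does not attempt.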
Medical genomics relies on next-gen sequencing methods to decipher underlying molecular mechanisms of gene expression. This special issue collects materials originally presented at the “Centenary of Human Population Genetics” Conference-2019, in Moscow. Here we present some recent developments in computational methods tested on actual medical genetics problems dissected through genomics, transcriptomics and proteomics data analysis, gene networks, protein–protein interactions and biomedical literature mining. We have selected materials based on systems biology approaches, database mining. These methods and algorithms were discussed at the Digital Medical Forum-2019, organized by I.M. Sechenov First Moscow State Medical University presenting bioinformatics approaches for the drug targets discovery in cancer, its computational support, and digitalization of medical research, as well as at “Systems Biology and Bioinformatics”-2019 (SBB-2019) Young Scientists School in Novosibirsk, Russia. Selected recent advancements discussed at these events in the medical genomics and genetics areas are based on novel bioinformatics tools.
… diagnostic laboratory providing genome sequencing depends at least in part on the bioinformatics algorithms used in the sequencing process. The choice of algorithm can be a source …
… all aspects of algorithms and data structure in molecular biology, genomics, and phylogeny … The selected papers cover a wide range of topics from sequence and genome analysis …
… Protein structure prediction by using bioinformatics can involve sequence similarity searches, multiple sequence alignments, identification and characterization of domains, …
… Protein structure prediction using bioinformatics can involve … domains, secondary structure prediction, solvent accessibility … Comparative molecular modeling is the most successful and …
… In this review, we presented a bioinformatics ‘toolkit’ particularly useful for bench biologists. We suggest a hierarchical approach to protein structure prediction that would consist of a …
… The structural details of X-ray crystallography and nuclear magnetic resonance spectroscopy-verified protein structures … Proteins are important molecules that play a significant …
… We examine below the identification and classification of protein secondary structure in the … correlation and for potential incorporation within a structure prediction algorithm. In addition, …
… for understanding the mechanisms of molecular function. The … information or improve our understanding of protein structures … in subcategories of structure prediction efforts within CASP …
… Analysis of known structures, the application of machine learning techniques, molecular dynamics simulations and protein structure prediction have enabled significant advances to be …
… in the critically acclaimed Methods in Molecular Biology series. The series was the first to … analysis of related protein structures. The classic protein structure comparison method Dali …
The prediction of protein three-dimensional structure from amino acid sequence has been a grand challenge problem in computational biophysics for decades, owing to its intrinsic scientific interest and also to the many potential applications for robust protein structure prediction algorithms, from genome interpretation to protein function prediction. More recently, the inverse problem — designing an amino acid sequence that will fold into a specified three-dimensional structure — has attracted growing attention as a potential route to the rational engineering of proteins with functions useful in biotechnology and medicine. Methods for the prediction and design of protein structures have advanced dramatically in the past decade. Increases in computing power and the rapid growth in protein sequence and structure databases have fuelled the development of new data-intensive and computationally demanding approaches for structure prediction. New algorithms for designing protein folds and protein–protein interfaces have been used to engineer novel high-order assemblies and to design from scratch fluorescent proteins with novel or enhanced properties, as well as signalling proteins with therapeutic potential. In this Review, we describe current approaches for protein structure prediction and design and highlight a selection of the successful applications they have enabled. Predicting how proteins fold enables inferring their function. Conversely, rational protein design allows for engineering novel protein functionalities. Recent improvements in computational algorithms and technological advances have dramatically increased the accuracy and speed of protein structure modelling, providing novel opportunities for controlling protein function, with potential applications in biomedicine, industry and research.
RNA-protein (RNP) interactions play essential roles in many biological processes, such as regulation of co-transcriptional and post-transcriptional gene expression, RNA splicing, transport, storage and stabilization, as well as protein synthesis. An increasing number of RNP structures would aid in a better understanding of these processes. However, due to the technical difficulties associated with experimental determination of macromolecular structures by high-resolution methods, studies on RNP recognition and complex formation present significant challenges. As an alternative, computational prediction of RNP interactions can be carried out. Structural models obtained by theoretical predictive methods are, in general, less reliable than models based on experimental measurements, but they can be sufficiently accurate to serve as a basis for formulating functional hypotheses. In this article, we present an overview of computational methods for 3D structure prediction of RNP complexes. We discuss currently available methods for macromolecular docking and, in particular, for scoring 3D structural models of RNP complexes. We also review benchmarks that have been developed to assess the accuracy of these methods.
… In the field of bioinformatics, the study of protein structure prediction encompasses various … In conclusion, structural bioinformatics has advanced biological molecule investigation to …
… possible by structure prediction. Despite remaining challenges, protein structure prediction is becoming an extremely useful tool in understanding phenomena in modern molecular …
The field of glycobiology is concerned with the study of the structure, properties, and biological functions of the family of biomolecules called carbohydrates. Bioinformatics for glycobiology is a particularly challenging field, because carbohydrates exhibit a high structural diversity and their chains are often branched. Significant improvements in experimental analytical methods over recent years have led to a tremendous increase in the amount of carbohydrate structure data generated. Consequently, the availability of databases and tools to store, retrieve and analyze these data in an efficient way is of fundamental importance to progress in glycobiology. In this review, the various graphical representations and sequence formats of carbohydrates are introduced, and an overview of newly developed databases, the latest developments in sequence alignment and data mining, and tools to support experimental glycan analysis are presented. Finally, the field of structural glycoinformatics and molecular modeling of carbohydrates, glycoproteins, and protein–carbohydrate interaction are reviewed.
Multi-omics data analysis is an important aspect of cancer molecular biology studies and has led to ground-breaking discoveries. Many efforts have been made to develop machine learning methods that automatically integrate omics data. Here, we review machine learning tools categorized as either general-purpose or task-specific, covering both supervised and unsupervised learning for integrative analysis of multi-omics data. We benchmark the performance of five machine learning approaches using data from the Cancer Cell Line Encyclopedia, reporting accuracy on cancer type classification and mean absolute error on drug response prediction, and evaluating runtime efficiency. This review provides recommendations to researchers regarding suitable machine learning method selection for their specific applications. It should also promote the development of novel machine learning methodologies for data integration, which will be essential for drug discovery, clinical trial design, and personalized treatments.
Cancer is a large group of diseases associated with abnormal cell growth and uncontrolled cell division, which may invade other tissues of the body through metastasis. Cancer is so important because its incidence is growing worldwide, with major health, economic, and even social impacts on both patients and governments. Early cancer prognosis, diagnosis, and treatment can therefore play a crucial role at the front line of combating cancer. The onset and progression of cancer can occur under the influence of complicated mechanisms and alterations at the level of the genome, proteome, transcriptome, metabolome, etc. Consequently, the advent of omics science and its broad research branches (such as genomics, proteomics, transcriptomics, metabolomics, and so forth) as revolutionary biological approaches has opened new doors to a comprehensive view of the cancer landscape. Due to the complexities of the formation and development of cancer, the study of mechanisms underlying cancer has gone beyond a single omics field. Therefore, connecting the data from different branches of omics science and examining them in a multi-omics setting can facilitate the discovery of novel prognostic, diagnostic, and therapeutic approaches. As the volume and complexity of data from omics studies in cancer are increasing dramatically, leading-edge technologies such as machine learning can play a promising role in assessing cancer research data. Machine learning is a subset of artificial intelligence that aims at data parsing, classification, and pattern identification by applying statistical methods and algorithms. This acquired knowledge subsequently allows computers to learn and to improve the accuracy of their predictions through experience with data processing. In this context, the application of machine learning as a novel computational technology offers new opportunities for achieving in-depth knowledge of cancer through analysis of data from multi-omics studies. Therefore, it can be concluded that artificial intelligence technologies such as machine learning can play revolutionary roles in the fight against cancer.
In light of recent advances in biomedical computing, big data science, and precision medicine, there is a mammoth demand for establishing algorithms in machine learning and systems genomics (MLSG), together with multi-omics data, to weigh probable phenotype-genotype relationships. Software frameworks in MLSG are extensively employed to analyze hundreds of thousands of multi-omics data by high-throughput technologies. In this study, we reviewed the MLSG software frameworks and future directions with respect to multi-omics data analysis and integration. Our review was targeted at researching recent approaches and technical solutions for the MLSG software frameworks using multi-omics platforms.
Motivation: Machine learning (ML) methods are motivated by the need to automate information extraction from large datasets in order to support human users in data-driven tasks. This is an attractive approach for integrative joint analysis of vast amounts of omics data produced in next generation sequencing and other -omics assays. A systematic assessment of the current literature can help to identify key trends and potential gaps in methodology and applications. We surveyed the literature on ML multi-omic data integration and quantitatively explored the goals, techniques and data involved in this field. We were particularly interested in examining how researchers use ML to deal with the volume and complexity of these datasets. Results: Our main finding is that the methods used are those that address the challenges of datasets with few samples and many features. Dimensionality reduction methods are used to reduce the feature count alongside models that can also appropriately handle relatively few samples. Popular techniques include autoencoders, random forests and support vector machines. We also found that the field is heavily influenced by the use of The Cancer Genome Atlas dataset, which is accessible and contains many diverse experiments. Availability and implementation: All data and processing scripts are available at this GitLab repository: https://gitlab.com/polavieja_lab/ml_multi-omics_review/ or in Zenodo: https://doi.org/10.5281/zenodo.7361807. Supplementary information: Supplementary data are available at Bioinformatics online.
With the development of modern high-throughput omic measurement platforms, it has become essential for biomedical studies to undertake an integrative (combined) approach to fully utilise these data to gain insights into biological systems. Data from various omics sources such as genetics, proteomics, and metabolomics can be integrated to unravel the intricate working of systems biology using machine learning-based predictive algorithms. Machine learning methods offer novel techniques to integrate and analyse the various omics data enabling the discovery of new biomarkers. These biomarkers have the potential to help in accurate disease prediction, patient stratification and delivering of precision medicine. This review paper explores different integrative machine learning methods which have been used to provide an in-depth understanding of biological systems during normal physiological functioning and in the presence of a disease. It provides insight and recommendations for interdisciplinary professionals who envisage employing machine learning skills in multi-omics studies.
Significance: Our new method uses only epigenomic patterns to classify the expression potential of annotated genes and identifies pseudogenes that are difficult to classify based solely on sequence. Genes were divided into those with protein expression, those with mRNA expression, and those that are silent. A large fraction of annotated genes are constitutively silent in one lineage but can be transcribed in others. We refer to the species-wide set of transcribed genes as the expressome and show that it is much larger than the expressible gene set in any individual. Additionally, we find that DNA methylation patterns within the gene body can differentiate between genes that express proteins and genes that express only RNAs. Accurate annotation of plant genomes remains complex due to the presence of many pseudogenes arising from whole-genome duplication-generated redundancy or the capture and movement of gene fragments by transposable elements. Machine learning on genome-wide epigenetic marks, informed by transcriptomic and proteomic training data, could be used to improve annotations through classification of all putative protein-coding genes as either constitutively silent or able to be expressed. Expressed genes were subclassified as able to express both mRNAs and proteins or only RNAs, and CG gene body methylation was associated only with the former subclass. More than 60,000 protein-coding genes have been annotated in the reference genome of maize inbred B73. About two-thirds of these genes are transcribed and are designated the filtered gene set (FGS). Classification of genes by our trained random forest algorithm was accurate and relied only on histone modifications or DNA methylation patterns within the gene body; promoter methylation was unimportant. Other inbred lines are known to transcribe significantly different sets of genes, indicating that the FGS is specific to B73. We accurately classified the sets of transcribed genes in additional inbred lines, arising from inbred-specific DNA methylation patterns. This approach highlights the potential of using chromatin information to improve annotations of functional genes.
Multi-omics data play a crucial role in precision medicine, mainly to understand the diverse biological interaction between different omics. Machine learning approaches have been extensively employed in this context over the years. This review aims to comprehensively summarize and categorize these advancements, focusing on the integration of multi-omics data, which includes genomics, transcriptomics, proteomics and metabolomics, alongside clinical data. We discuss various machine learning techniques and computational methodologies used for integrating distinct omics datasets and provide valuable insights into their application. The review emphasizes both the challenges and opportunities present in multi-omics data integration, precision medicine and patient stratification, offering practical recommendations for method selection in various scenarios. Recent advances in deep learning and network-based approaches are also explored, highlighting their potential to harmonize diverse biological information layers. Additionally, we present a roadmap for the integration of multi-omics data in precision oncology, outlining the advantages, challenges and implementation difficulties. Hence this review offers a thorough overview of current literature, providing researchers with insights into machine learning techniques for patient stratification, particularly in precision oncology. Contact: anirban@klyuniv.ac.in
The knowledge that metastasis is the primary cause of cancer-related deaths has incentivized research directed towards unraveling the complex cellular processes that drive metastasis. Advances in technology, and specifically the advent of high-throughput sequencing, provide knowledge of such processes. This knowledge has led to the development of therapeutic and clinical applications, and is now being used to predict the onset of metastasis to improve diagnostics and disease therapies. In this regard, predicting metastasis onset has also been explored using artificial intelligence approaches based on machine learning and, more recently, deep learning. This review summarizes the different machine learning and deep learning-based metastasis prediction methods developed to date. We also detail the different types of molecular data used to build the models and the critical signatures derived from the different methods. We further highlight the challenges associated with using machine learning and deep learning methods, and provide suggestions to improve the predictive performance of such methods.
… era of ‘big data’. Extracting inherent valuable knowledge from various omics data remains a … analysis and computational modeling of multi-omics data helped address such needs in an …
Multi-omics data are good resources for prognosis and survival prediction; however, these are difficult to integrate computationally. We introduce DeepProg, a novel ensemble framework of deep-learning and machine-learning approaches that robustly predicts patient survival subtypes using multi-omics data. It identifies two optimal survival subtypes in most cancers and yields significantly better risk-stratification than other multi-omics integration methods. DeepProg is highly predictive, exemplified by two liver cancer (C-index 0.73–0.80) and five breast cancer datasets (C-index 0.68–0.73). Pan-cancer analysis associates common genomic signatures in poor survival subtypes with extracellular matrix modeling, immune deregulation, and mitosis processes. DeepProg is freely available at https://github.com/lanagarmire/DeepProg
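The C-index values quoted above measure how well predicted risk orders patients by survival time. The following is a minimal sketch of Harrell's concordance index for right-censored data, assuming that a higher risk score should correspond to a shorter survival time. It is not DeepProg's implementation: the function name and toy data are hypothetical, and pairs with tied survival times are simply skipped.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    times: observed follow-up times
    events: 1 if the event (e.g. death) was observed, 0 if censored
    risk_scores: predicted risk; higher means shorter expected survival
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                # Subject i had the event before subject j's follow-up ended.
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort, invented for illustration; the third patient is censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.8, 0.1]
c_index = concordance_index(times, events, risk)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives a scale for reading the reported ranges of 0.68 to 0.80.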
Machine learning has become a powerful tool for systems biologists, from diagnosing cancer to optimizing kinetic models and predicting the state, growth dynamics, or type of a cell. Potential predictions from complex biological data sets obtained by “omics” experiments seem endless, but are often not the main objective of biological research. Often we want to understand the molecular mechanisms of a disease to develop new therapies, or we need to justify a crucial decision that is derived from a prediction. In order to gain such knowledge from data, machine learning models need to be extended. A recent trend to achieve this is to design “interpretable” models. However, the notions around interpretability are sometimes ambiguous, and a universal recipe for building well-interpretable models is missing. With this work, we want to familiarize systems biologists with the concept of model interpretability in machine learning. We consider data sets, data preparation, machine learning methods, and software tools relevant to omics research in systems biology. Finally, we try to answer the question: “What is interpretability?” We introduce views from the interpretable machine learning community and propose a scheme for categorizing studies on omics data. We then apply these tools to review and categorize recent studies where predictive machine learning models have been constructed from non-sequential omics data.
In recent years, high-throughput sequencing technologies have provided unprecedented opportunities to depict cancer samples at multiple molecular levels. The integration and analysis of these multi-omics datasets is a crucial step towards gaining actionable knowledge in a precision medicine framework. This paper explores recent data-driven methodologies that have been developed and applied to respond to major challenges of stratified medicine in oncology, including patients' phenotyping, biomarker discovery, and drug repurposing. We systematically retrieved peer-reviewed journal articles published from 2014 to 2019, then selected and thoroughly described the tools presenting the most promising innovations regarding the integration of heterogeneous data, the machine learning methodologies that successfully tackled the complexity of multi-omics data, and the frameworks to deliver actionable results for clinical practice. The review is organized according to the applied methods: Deep learning, Network-based methods, Clustering, Feature Extraction and Transformation, and Factorization. We provide an overview of the tools available in each methodological group and underline the relationship among the different categories. Our analysis revealed how multi-omics datasets could be exploited to drive precision oncology, but also current limitations in the development of multi-omics data integration.
In 2020, Novartis Pharmaceuticals Corporation and the U.S. Food and Drug Administration (FDA) started a 4‐year scientific collaboration to approach complex new data modalities and advanced analytics. The scientific question was to find novel radio‐genomics‐based prognostic and predictive factors for HR+/HER2− metastatic breast cancer under a Research Collaboration Agreement. This collaboration has been providing valuable insights to help successfully implement future scientific projects, particularly using artificial intelligence and machine learning. This tutorial aims to provide tangible guidelines for a multi‐omics project that includes multidisciplinary expert teams spanning different institutions. We cover key ideas, such as “maintaining effective communication” and “following good data science practices,” followed by the four steps of exploratory projects, namely (1) plan, (2) design, (3) develop, and (4) disseminate. We break each step into smaller concepts with strategies for implementation and provide illustrations from our collaboration to give readers further actionable guidance.
Gene expression is subtly regulated by quantifiable properties of genetic molecules, such as interactions with other genes, methylation, mutations, transcription factors and histone modifications. Integrative analysis of multi-omics data can help scientists understand condition-specific or patient-specific gene regulation mechanisms. However, analysis of multi-omics data is challenging since it requires not only the analysis of multiple omics data sets but also mining complex relations among different genetic molecules by using state-of-the-art machine learning methods. In addition, analysis of multi-omics data requires a fairly large computing infrastructure. Moreover, interpretation of the analysis results requires collaboration among many scientists, often requiring the analysis to be repeated from different perspectives. Many of the aforementioned technical issues can be handled well when machine learning tools are deployed on the cloud. In this survey article, we first survey machine learning methods that can be used for gene regulation studies, and we categorize them according to five different goals: gene regulatory subnetwork discovery, disease subtype analysis, survival analysis, clinical prediction and visualization. We also summarize the methods in terms of multi-omics input types. Then, we explain why the cloud is potentially a good solution for the analysis of multi-omics data, followed by a survey of two state-of-the-art cloud systems, Galaxy and BioVLAB. Finally, we discuss important issues that arise when the cloud is used for the analysis of multi-omics data in gene regulation studies.
Plant bioactives hold immense potential in the medicine and food industry. The recent advancements in omics applied in deciphering specialized metabolic pathways underscore the importance of high-quality genome releases and the wealth of data in metabolomics and transcriptomics. While harnessing data, whether integrated or standalone, has proven successful in unveiling plant natural product (PNP) biosynthetic pathways, the democratization of machine learning in biology opens exciting new opportunities for enhancing the exploration of these pathways. This review highlights the recent breakthroughs in disrupting plant-specialized biosynthetic pathways through the utilization of omics data harnessing and machine learning techniques.
Medical research focuses on identifying the causes and deciphering the mechanisms related to a disease, aiming to eventually develop accurate diagnostic tools and effective treatments. With the breakthrough technological advances of the last decades, the “educated” guess that had been previously used for raising scientific hypotheses is rapidly being replaced by the knowledge provided through untargeted high-throughput methods that are able to generate enormous data sets in a short amount of time and in a cost-effective manner. Fortunately, major advances have also been observed in computational mathematics, which enable the accurate analysis of the “big” data sets deriving from high-throughput approaches. Here, we summarize the most important “-omics” procedures and describe the current challenges related to their use. Additionally, we describe the novel methods of data-mining and machine learning analysis, and particularly, how they can be used in a hierarchical manner to produce robust results for medicine from “big” data.
Objective: To review biomarker discovery studies using omics data for patient stratification that led to clinically validated FDA-cleared tests or laboratory developed tests, in order to identify common characteristics and derive recommendations for future biomarker projects. Design: Scoping review. Methods: We searched PubMed, EMBASE and Web of Science to obtain a comprehensive list of articles from the biomedical literature published between January 2000 and July 2021, describing clinically validated biomarker signatures for patient stratification, derived using statistical learning approaches. All documents were screened to retain only peer-reviewed research articles, review articles or opinion articles, covering supervised and unsupervised machine learning applications for omics-based patient stratification. Two reviewers independently confirmed the eligibility. Disagreements were resolved by consensus. We focused the final analysis on omics-based biomarkers that achieved the highest level of validation, that is, clinical approval of the developed molecular signature as a laboratory developed test or an FDA-approved test. Results: Overall, 352 articles fulfilled the eligibility criteria. The analysis of validated biomarker signatures identified multiple common methodological and practical features that may explain the successful test development and guide future biomarker projects. These include study design choices to ensure sufficient statistical power for model building and external testing, suitable combinations of non-targeted and targeted measurement technologies, the integration of prior biological knowledge, strict filtering and inclusion/exclusion criteria, and the adequacy of statistical and machine learning methods for discovery and validation.
Conclusions: While most clinically validated biomarker models derived from omics data have been developed for personalised oncology, first applications for non-cancer diseases show the potential of multivariate omics biomarker design for other complex disorders. Distinctive characteristics of prior success stories, such as early filtering and robust discovery approaches, continuous improvements in assay design and experimental measurement technology, and rigorous multicohort validation approaches, enable the derivation of specific recommendations for future studies.
Multi-omics experiments at bulk or single-cell resolution facilitate the discovery of hypothesis-generating biomarkers for predicting response to therapy, as well as aid in uncovering mechanistic insights into cellular and microenvironmental processes. Many methods for data integration have been developed for the identification of key elements that explain or predict disease risk or other biological outcomes. The heterogeneous graph representation of multi-omics data provides an advantage for discerning patterns suitable for predictive/exploratory analysis, thus permitting the modeling of complex relationships. Graph-based approaches—including graph neural networks—potentially offer a reliable methodological toolset that can provide a tangible alternative to scientists and clinicians that seek ideas and implementation strategies in the integrated analysis of their omics sets for biomedical research. Graph-based workflows continue to push the limits of the technological envelope, and this perspective provides a focused literature review of research articles in which graph machine learning is utilized for integrated multi-omics data analyses, with several examples that demonstrate the effectiveness of graph-based approaches.
The application of third-generation sequencing (TGS) technology in genetics and genomics has provided opportunities to categorize and explore the individual genomic landscapes and mutations relevant for diagnosis and therapy using whole genome sequencing and de novo genome assembly. In general, the emerging TGS technology can produce high quality long reads for the determination of overlapping reads and transcript isoforms. However, this technology still faces challenges such as limited accuracy in identifying nucleotide bases and high error rates. Here, we surveyed 39 TGS-related tools for de novo assembly and genome analysis to identify the differences among their characteristics, such as the required input, the interaction with the user, sequencing platforms, type of reads, error models, the possibility of introducing coverage bias, the simulation of genomic variants and outputs provided. Decision trees are summarized to help researchers find the most suitable tools for analyzing TGS data. Our comprehensive survey and evaluation of computational features of existing methods for TGS may provide a valuable guideline for researchers.
In the field of genome assembly, scaffolding methods make it possible to obtain a more complete and contiguous reference genome, which is the cornerstone of genomic research. Scaffolding methods typically utilize the alignments between contigs and sequencing data (reads) to determine the orientation and order among contigs and to produce longer scaffolds, which are helpful for genomic downstream analysis. With the rapid development of high-throughput sequencing technologies, diverse types of reads have emerged over the past decade, especially in long-range sequencing, which have greatly enhanced the assembly quality of scaffolding methods. As the number of scaffolding methods increases, biology and bioinformatics researchers need to perform in-depth analyses of state-of-the-art scaffolding methods. In this article, we focus on the difficulties in scaffolding, the differences in characteristics among various kinds of reads, the methods by which current scaffolding methods address these difficulties, and future research opportunities. We hope this work will benefit the design of new scaffolding methods and the selection of appropriate scaffolding methods for specific biological studies.
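To make the idea of deriving contig order and orientation from read alignments concrete, here is a minimal Python sketch. It is not taken from the review: it assumes a forward-reverse paired-end library (mates face each other), takes already-computed alignments as hypothetical `(contig, strand, contig, strand)` tuples, and only counts supporting pairs per link. Real scaffolders also use insert sizes, alignment positions, and repeat handling. The names `infer_links` and `canonical_link` are illustrative.

```python
from collections import Counter

FLIP = {"+": "-", "-": "+"}

def canonical_link(a, oa, b, ob):
    """Return one representative of a link and its reverse-complement twin."""
    # The link (a oa -> b ob), read from the other strand, is
    # (b flip(ob) -> a flip(oa)); keep the form whose first name sorts first.
    if a <= b:
        return (a, oa, b, ob)
    return (b, FLIP[ob], a, FLIP[oa])

def infer_links(pairs, min_support=2):
    """Count contig-to-contig links implied by read pairs.

    Each pair is (contig1, strand1, contig2, strand2) for the two mates,
    assuming a forward-reverse library in which mates face each other.
    """
    support = Counter()
    for c1, s1, c2, s2 in pairs:
        if c1 == c2:
            continue  # both mates on one contig: no scaffolding information
        # Mate 1 points toward mate 2, so contig 1 is oriented like mate 1
        # and contig 2 is oriented opposite to mate 2.
        support[canonical_link(c1, s1, c2, FLIP[s2])] += 1
    return {link: n for link, n in support.items() if n >= min_support}

pairs = [
    ("ctgA", "+", "ctgB", "-"),
    ("ctgA", "+", "ctgB", "-"),
    ("ctgB", "-", "ctgA", "+"),  # same adjacency, mates reported in swapped order
    ("ctgA", "+", "ctgC", "+"),  # a single pair: below the support threshold
    ("ctgA", "+", "ctgA", "-"),  # both mates inside one contig: ignored
]
links = infer_links(pairs)
```

In this toy input, three pairs agree that `ctgA` is followed by `ctgB` in the same orientation, while the lone pair touching `ctgC` is discarded as unsupported.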
… of genome sequence analysis is genome assembly. Even today, few centres have the resources, in both software and hardware, to assemble a genome … comparative assembly method …
… assembly that form the basis of virtually all modern genome sequence assemblers. Note, however, that even with such formulations, genome assembly … method of sequence assembly …
… Since the introduction of the chain termination sequencing method by Frederick Sanger in 1977 [1], the genomes of more than 800 bacteria and 100 eukaryotes have been sequenced, …
Background: Although whole genome sequencing is enabling numerous advances in many fields, achieving complete chromosome-level sequence assemblies for diverse species presents difficulties. The problems in part reflect the limitations of current sequencing technologies. Chromosome assembly from ‘short read’ sequence data is confounded by the presence of repetitive genome regions with numerous similar sequence tracts which cannot be accurately positioned in the assembled sequence. Longer sequence reads often have higher error rates and may still be too short to span the larger gaps between contigs. Objective: Given the emergence of exciting new applications using sequencing technology, such as the Earth BioGenome Project, it is necessary to further develop and apply a range of strategies to achieve robust chromosome-level sequence assembly. Reviewed here is a range of methods to enhance assembly, including the use of cross-species synteny to understand relationships between sequence contigs, the development of independent genetic and/or physical scaffold maps as frameworks for assembly (for example, radiation hybrid, optical motif and chromatin interaction maps) and the use of patterns of linkage disequilibrium to help position, orient and locate contigs. Results and Conclusion: A range of methods exists which might be further developed to facilitate cost-effective large-scale sequence assembly for diverse species. A combination of strategies is required to best assemble sequence data into chromosome-level assemblies. There are a number of routes towards the development of maps which span chromosomes (including physical, genetic and linkage disequilibrium maps), and construction of these whole chromosome maps greatly facilitates the ordering and orientation of sequence contigs.
Genome sequencing projects were long confined to biomedical model organisms and required the concerted effort of large consortia. Rapid progress in high‐throughput sequencing technology and the simultaneous development of bioinformatic tools have democratized the field. It is now within reach for individual research groups in the eco‐evolutionary and conservation community to generate de novo draft genome sequences for any organism of choice. Because of the cost and considerable effort involved in such an endeavour, the important first step is to thoroughly consider whether a genome sequence is necessary for addressing the biological question at hand. Once this decision is taken, a genome project requires careful planning with respect to the organism involved and the intended quality of the genome draft. Here, we briefly review the state of the art within this field and provide a step‐by‐step introduction to the workflow involved in genome sequencing, assembly and annotation with particular reference to large and complex genomes. This tutorial is targeted at scientists with a background in conservation genetics, but more generally, provides useful practical guidance for researchers engaging in whole‐genome sequencing projects.
As a part of the ELIXIR-EXCELERATE efforts in capacity building, we present here 10 steps to facilitate researchers getting started in genome assembly and genome annotation. The guidelines given are broadly applicable, intended to be stable over time, and cover all aspects from start to finish of a general assembly and annotation project. Intrinsic properties of genomes are discussed, as is the importance of using high quality DNA. Different sequencing technologies and generally applicable workflows for genome assembly are also detailed. We cover structural and functional annotation and encourage readers to also annotate transposable elements, something that is often omitted from annotation workflows. The importance of data management is stressed, and we give advice on where to submit data and how to make your results Findable, Accessible, Interoperable, and Reusable (FAIR).
Motivation: Efficient and fast next-generation sequencing (NGS) algorithms are essential to analyze the terabytes of data generated by the NGS machines. A serious bottleneck can be the design of such algorithms, as they require sophisticated data structures and advanced hardware implementation. Results: We propose an open-source library dedicated to genome assembly and analysis to speed up the process of developing efficient software. The library is based on a recent optimized de Bruijn graph implementation allowing complex genomes to be processed on desktop computers using fast algorithms with low memory footprints. Availability and implementation: The GATB library is written in C++ and is available at the following Web site http://gatb.inria.fr under the A-GPL license. Contact: lavenier@irisa.fr Supplementary information: Supplementary data are available at Bioinformatics online.
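For readers unfamiliar with the data structure, the following Python sketch builds a toy de Bruijn graph from reads. It is purely illustrative and makes several simplifying assumptions: it uses a plain in-memory dictionary, ignores reverse complements and sequencing errors, and does not reflect the GATB C++ API or its optimized, low-memory representation. The function name `de_bruijn` is hypothetical.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the sorted list of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        # Every k-mer in a read contributes one edge: prefix -> suffix.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(nexts) for node, nexts in graph.items()}

reads = ["ACGTC", "CGTCA"]
graph = de_bruijn(reads, k=3)
# The two overlapping reads collapse into one path: AC -> CG -> GT -> TC -> CA
```

Because identical k-mers from different reads map to the same edge, overlap between reads is captured implicitly, without comparing reads pairwise.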
A key use of high throughput sequencing technology is the sequencing and assembly of full genome sequences. These genome assemblies are commonly assessed using statistics relating to contiguity of the assembly. Measures of contiguity are not strongly correlated with information about the biological completion or correctness of the assembly, and a commonly reported metric, N50, can be misleading. Over the years, multiple research groups have rejected the overuse of N50 and sought to develop more informative metrics. This paper presents a review of problems that arise from relying solely on contiguity as a measure of genome assembly quality as well as current alternative methods. Alternative methods are compared on the basis of how informative they are about the biological quality of the assembly and how easy they are to use. A comprehensive method for using multiple metrics of measuring assembly quality is presented. This study aims to report on the status of assembly assessment methods and compare them, as well as to offer a comprehensive method that incorporates multiple facets of quality assessment. Weaknesses and strengths of varying methods are presented and explained, with recommendations based on speed of analysis and user friendliness.
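As a reminder of what the contiguity statistic measures, here is a short Python sketch of N50 under its usual definition: the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the assembly size. The contig lengths are invented for illustration and the function name `n50` is hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running sum to >= 50%.
        if 2 * running >= total:
            return length
    raise ValueError("N50 is undefined for an empty assembly")

correct = [50, 40, 30, 20, 10]   # hypothetical contig lengths
misjoined = [90, 30, 20, 10]     # same bases, but the 50 and 40 contigs wrongly joined

n50_correct = n50(correct)
n50_misjoined = n50(misjoined)
```

In this toy case the erroneous join raises N50 from 40 to 90 even though the assembly has become less correct, which is one way a contiguity-only metric can mislead.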
Motivation: The emergence of high‐throughput sequencing technologies revolutionized genomics in the early 2000s. The next revolution came with the era of long‐read sequencing. These technological advances, along with novel computational approaches, became the next step towards automatic pipelines capable of assembling nearly complete mammalian‐size genomes. Results: In this manuscript, we demonstrate the performance of state‐of‐the‐art genome assembly software on six eukaryotic datasets sequenced using different technologies. To evaluate the results, we developed QUAST‐LG, a tool that compares large genomic de novo assemblies against reference sequences and computes relevant quality metrics. Since genomes generally cannot be reconstructed completely due to complex repeat patterns and low coverage regions, we introduce a concept of upper bound assembly for a given genome and set of reads, and compute theoretical limits on assembly correctness and completeness. Using QUAST‐LG, we show how close the assemblies are to the theoretical optimum, and how far this optimum is from the finished reference. Availability and implementation: http://cab.spbu.ru/software/quast‐lg
Motivation: Second-generation sequencing technologies produce high coverage of the genome by short reads at a low cost, which has prompted development of new assembly methods. In particular, multiple algorithms based on de Bruijn graphs have been shown to be effective for the assembly problem. In this article, we describe a new hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies, and which allows variable read lengths while tolerating a significant level of sequencing error. Our method transforms large numbers of paired-end reads into a much smaller number of longer ‘super-reads’. The use of super-reads allows us to assemble combinations of Illumina reads of differing lengths together with longer reads from 454 and Sanger sequencing technologies, making it one of the few assemblers capable of handling such mixtures. We call our system the Maryland Super-Read Celera Assembler (abbreviated MaSuRCA and pronounced ‘mazurka’). Results: We evaluate the performance of MaSuRCA against two of the most widely used assemblers for Illumina data, Allpaths-LG and SOAPdenovo2, on two datasets from organisms for which high-quality assemblies are available: the bacterium Rhodobacter sphaeroides and chromosome 16 of the mouse genome. We show that MaSuRCA performs on par with or better than Allpaths-LG and significantly better than SOAPdenovo on these data, when evaluated against the finished sequence. We then show that MaSuRCA can significantly improve its assemblies when the original data are augmented with long reads. Availability: MaSuRCA is available as open-source code at ftp://ftp.genome.umd.edu/pub/MaSuRCA/. Previous (pre-publication) releases have been publicly available for over a year. Contact: alekseyz@ipst.umd.edu Supplementary information: Supplementary data are available at Bioinformatics online.
… PATRIC’s genome assembly landing page where researchers can assemble either single- … or Nanopore, with an enlargement showing the different assembly strategies available …
The advent of next-generation sequencing technologies has been accompanied by the development of many whole-genome sequence assembly methods and software tools, especially for de novo fragment assembly. Because little is known about the applicability and performance of these software tools, choosing a suitable assembler is a difficult task. Here, we provide information on the adaptivity of each program and, above all, compare the performance of eight distinct tools against eight groups of simulated datasets from the Solexa sequencing platform. Considering the computational time, maximum random access memory (RAM) occupancy, assembly accuracy and integrity, our study indicates that string-based assemblers and overlap-layout-consensus (OLC) assemblers are well suited to very short reads and to longer reads of small genomes, respectively. For large datasets of more than a hundred million short reads, De Bruijn graph-based assemblers would be more appropriate. In terms of software implementation, string-based assemblers are superior to graph-based ones, among which SOAPdenovo requires a complex configuration file. Our comparison study will assist researchers in selecting a well-suited assembler and offer essential information for the improvement of existing assemblers or the development of novel assemblers.
Atlas is a suite of programs developed for assembly of genomes by a "combined approach" that uses DNA sequence reads from both BACs and whole-genome shotgun (WGS) libraries. The BAC clones afford advantages of localized assembly with reduced computational load, and provide a robust method for dealing with repeated sequences. Inclusion of WGS sequences facilitates use of different clone insert sizes and reduces data production costs. A core function of Atlas software is recruitment of WGS sequences into appropriate BACs based on sequence overlaps. Because construction of consensus sequences is from local assembly of these reads, only small (<0.1%) units of the genome are assembled at a time. Once assembled, each BAC is used to derive a genomic layout. This "sequence-based" growth of the genome map has greater precision than with non-sequence-based methods. Use of BACs allows correction of artifacts due to repeats at each stage of the process. This is aided by ancillary data such as BAC fingerprint, other genomic maps, and syntenic relations with other genomes. Atlas was used to assemble a draft DNA sequence of the rat genome; its major components including overlapper and split-scaffold are also being used in pure WGS projects.
Motivation: With the recent advances in DNA sequencing technologies, the study of the genetic composition of living organisms has become more accessible for researchers. Several advances have been achieved because of it, especially in the health sciences. However, many challenges which emerge from the complexity of sequencing projects remain unsolved. Among them is the task of assembling DNA fragments from previously unsequenced organisms, which is classified as an NP-hard (nondeterministic polynomial time hard) problem, for which no efficient computational solution with reasonable execution time exists. However, several tools that produce approximate solutions have been used with results that have facilitated scientific discoveries, although there is ample room for improvement. As with other NP-hard problems, machine learning algorithms have been one of the approaches used in recent years in an attempt to find better solutions to the DNA fragment assembly problem, although still at a low scale. Results: This paper presents a broad review of pioneering literature comprising artificial intelligence-based DNA assemblers, particularly the ones that use machine learning, to provide an overview of state-of-the-art approaches and to serve as a starting point for further study in this field.
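To illustrate what an approximate solution to fragment assembly can look like, the sketch below implements a classic greedy heuristic in Python: repeatedly merge the two fragments with the longest exact suffix-prefix overlap. This is a textbook simplification and not a method from the review, and it is not a machine learning approach. It assumes error-free fragments from a single strand, and the result is not guaranteed to be the shortest possible superstring. The function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments):
    """Greedily merge fragments by largest exact overlap; return what remains."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    # Drop fragments that are wholly contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps: leave the pieces as separate contigs
        merged = frags[i] + frags[j][n:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)] + [merged]
    return frags

contigs = greedy_assemble(["ATTAGACC", "GACCTGCC", "TGCCGGAA"])
```

The pairwise comparison makes each round quadratic in the number of fragments, which is one reason practical assemblers rely on indexed or graph-based formulations instead.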
The application of whole-genome shotgun sequencing to microbial communities represents a major development in metagenomics, the study of uncultured microbes via the tools of modern genomic analysis. In the past year, whole-genome shotgun sequencing projects of prokaryotic communities from an acid mine biofilm, the Sargasso Sea, Minnesota farm soil, three deep-sea whale falls, and deep-sea sediments have been reported, adding to previously published work on viral communities from marine and fecal samples. The interpretation of this new kind of data poses a wide variety of exciting and difficult bioinformatics problems. The aim of this review is to introduce the bioinformatics community to this emerging field by surveying existing techniques and promising new approaches for several of the most interesting of these computational problems.
Summary: AnnTools is a versatile bioinformatics application designed for comprehensive annotation of a full spectrum of human genome variation: novel and known single-nucleotide substitutions (SNP/SNV), short insertions/deletions (INDEL) and structural variants/copy number variation (SV/CNV). The variants are interpreted by interrogating data compiled from 15 constantly updated sources. In addition to detailed functional characterization of the coding variants, AnnTools searches for overlaps with regulatory elements, disease/trait associated loci, known segmental duplications and artifact prone regions, thereby offering an integrated and comprehensive analysis of genomic data. The tool conveniently accepts user-provided tracks for custom annotation and offers flexibility in input data formats. The output is generated in the universal Variant Call Format. High annotation speed makes AnnTools suitable for high-throughput sequencing facilities, while a low-memory footprint and modest CPU requirements allow it to operate on a personal computer. The application is freely available for public use; the package includes installation scripts and a set of helper tools. Availability: http://anntools.sourceforge.net/ Contact: vladimir.makarov@mssm.edu; chris.yoon@mssm.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Whole Exome Sequencing (WES) is the application of next-generation sequencing technology to determine the variations in the exome and is becoming a standard approach in studying genetic variants in diseases. Understanding the exomes of individuals at single base resolution allows the identification of actionable mutations for disease treatment and management. WES technologies have shifted the bottleneck from experimental data production to computationally intensive informatics-based data analysis. Novel computational tools and methods have been developed to analyze and interpret WES data. Here, we review some of the current tools that are being used to analyze WES data. These tools range from the alignment of raw sequencing reads all the way to linking variants to actionable therapeutics. Strengths and weaknesses of each tool are discussed for the purpose of helping researchers make more informed decisions when selecting the best tools to analyze their WES data.
… We detail the various tools and methods that have been developed for variant annotation … Hence, there is a need for powerful computational tools for aligning these sequencing reads…
Simple Summary: Due to the development of high-throughput sequencing technologies, computational genome annotation of sequences has become one of the principal research areas in computational biology. First, we reviewed comparative annotation tools and pipelines for both structural and functional annotation, which enable us to comprehend gene functions and their genome evolution. Second, we compared genome annotation tools that utilize homology-based and ab initio methods, depending on the similarity of sequences or the lack of evidence. Third, we explored visualization tools that aid the annotation process and stressed the need for quality control of annotations and re-annotations, because misannotations may arise from experimental errors or from genes missed by preceding technologies. Finally, we highlighted how emerging technologies can be used in future annotations. Abstract: Next-Generation Sequencing (NGS) has made it easier to obtain genome-wide sequence data and it has shifted the research focus toward genome annotation. The challenging tasks involved in annotation rely on the currently available tools and techniques to decode the information contained in nucleotide sequences. This information will improve our understanding of general aspects of life and evolution and improve our ability to diagnose genetic disorders. Here, we present a summary of both structural and functional annotations, as well as the associated comparative annotation tools and pipelines. We highlight visualization tools that immensely aid the annotation process and the contributions of the scientific community to the annotation. Further, we discuss quality-control practices and the need for re-annotation, and highlight the future of annotation.
The sequence contexts of genomic variants play important roles in understanding the biological significance of variants and potential sequencing-related variant calling issues. However, methods for assessing the diverse sequence contexts of genomic variants such as tandem repeats and unambiguous annotations have been limited. Herein, we describe the Variant Sequence Context Annotation Tool (VarSCAT) for annotating the sequence contexts of genomic variants, including breakpoint ambiguities, flanking sequences, variant nomenclatures, adjacent variants, and tandem repeats with user customizable options. Our analyses demonstrate that VarSCAT is more versatile and customizable than current methods or strategies for annotating variants in short tandem repeat (STR) regions. Variant sequence context annotations of high-confidence human variant sets with VarSCAT revealed that more than 75% of all human individual germline and clinically relevant insertions and deletions (indels) have breakpoint ambiguities. Moreover, we illustrate that more than 80% of human individual germline small variants in STR regions are indels and that the sizes of these indels correlate with STR motif sizes. VarSCAT is available at https://github.com/elolab/VarSCAT. Author Summary: Sequence contexts have significant impacts on the biological and technical aspects of genomic variants. Sequence contexts, such as tandem repeats or nearby indels, may increase the mutation rate of a region compared to other genome regions. Besides, variants in specific sequence contexts like STRs may also have distinct biological consequences, which can lead to certain human diseases, and thus they may be used as biomarkers for disease diagnosis and treatment. Moreover, potentially ambiguous variant representations such as equivalent or redundant indels are also related to their sequence contexts, which may complicate variant harmonization from different sources.
Our previous study demonstrated that more than half of false positive indel calls detected in next generation sequencing data are related to STRs. Thus, the sequence contexts of genomic variants are important and cannot be ignored. However, the current methods or strategies for assessing the sequence contexts of genomic variants are limited and not practical to use. Here, we developed a computational tool, VarSCAT, for sequence context annotation of genomic variants. Our tool provides diverse sequence context annotations, giving users information to further explore the variants of interest. By applying VarSCAT to high confidence human variant sets, we demonstrate the influence of sequence context of genomic variants and emphasize the importance of sequence context assessment.
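The notion of a breakpoint ambiguity can be illustrated with a small Python sketch: inside a repeat, deleting the same number of bases at several neighbouring positions yields an identical sequence, so the deletion has no single well-defined position. The code below is a generic illustration, not VarSCAT's algorithm or interface. It assumes 0-based coordinates on a plain reference string, and the function name and example sequence are hypothetical.

```python
def deletion_start_range(ref, start, length):
    """Return (leftmost, rightmost) 0-based starts of all deletions of the
    given length that produce the same sequence as deleting ref[start:start+length]."""
    if length <= 0 or start < 0 or start + length > len(ref):
        raise ValueError("deletion must lie within the reference")
    left = start
    # Shifting left by one is equivalent when the base entering the deleted
    # window equals the base leaving it.
    while left > 0 and ref[left - 1] == ref[left + length - 1]:
        left -= 1
    right = start
    # The symmetric condition applies when shifting right.
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1
    return left, right

ref = "GGCACACATT"                               # contains a (CA)3 tandem repeat
ambiguous = deletion_start_range(ref, 4, 2)      # delete the middle "CA"
unambiguous = deletion_start_range("ACGTAC", 2, 1)
```

Here the two-base deletion can start anywhere from position 2 to position 6 and still give `GGCACATT`, whereas the single-base deletion outside a repeat has exactly one valid position.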
The recent advancement of sequencing technologies marks a significant shift in the character and complexity of the digital genomic data universe, encompassing diverse types of molecular data, screened through manifold technological platforms. As a result, a plethora of fully assembled genomes are generated that span vertically the evolutionary scale. Notwithstanding the tsunami of thriving innovations that accomplish unprecedented, nucleotide-level, structural and functional annotation, an exhaustive, systemic, massive genome-wide functional annotation remains elusive, particularly when the criterion is automation and efficiency in data-agnostic interpretation. The latter is of paramount importance for the elaboration of strategies for sophisticated, data-driven genome-wide annotation, which aim to impart a sustainable and comprehensive systemic approach to addressing whole genome variation. Therefore, it is essential to develop methods and tools that promote systematic functional genomic annotation, with emphasis on mechanistic information exceeding the limits of coding regions, and exploiting the chunks of pertinent information residing in non-coding regions, including promoter and enhancer sequences, non-coding RNAs, DNA methylation sites, transcription factor binding sites, transposable elements and more. This review provides an overview of the current state-of-the-art in genome-wide functional annotation of genetic variation, including existing bioinformatic tools, resources, databases and platforms currently available or reported in the literature. Particular emphasis is placed on the functional annotation of variants that lie outside protein-coding genomic regions (intronic or intergenic), their potential co-localization with regulatory element areas, such as putative non-coding RNA regions, and the assessment of their functional impact on the investigated phenotype. 
In addition, state-of-the-art tools that leverage data obtained from WGS and GWAS-based analyses are discussed, along with future bioinformatics directions and developments. These future directions emphasize efficient, comprehensive, and largely automated functional annotation of both coding and non-coding genomic variants, as well as their optimal evaluation.
Genome-wide association studies (GWAS) have found the majority of disease-associated variants to be non-coding. Major efforts to chart the non-coding regulatory landscape have allowed for the development of tools and methods that aim to aid in the identification of causal variants and their mechanisms of action. In this review, we give an overview of current tools and methods for the analysis of non-coding GWAS variants in disease. We provide a workflow that allows for the accumulation of in silico evidence to generate novel hypotheses on mechanisms underlying disease and to prioritize targets for follow-up study using non-coding GWAS variants. Lastly, we discuss the need for comprehensive benchmarks and novel tools for the analysis of non-coding variants.
Clinical laboratories implement a variety of measures to classify somatic sequence variants and identify clinically significant variants to facilitate the implementation of precision medicine. To standardize the interpretation process, the Association for Molecular Pathology (AMP), American Society of Clinical Oncology (ASCO), and College of American Pathologists (CAP) published guidelines for the interpretation and reporting of sequence variants in cancer in 2017. These guidelines classify somatic variants using a four-tiered system with ten criteria. Even with the standardized guidelines, assessing the clinical impact of somatic variants remains tedious. Additionally, manual implementation of the guidelines may vary among professionals and may lack reproducibility when the supporting evidence is not documented in a consistent manner. We developed a semi-automated tool called “Variant Interpretation for Cancer” (VIC) to accelerate the interpretation process and minimize individual biases. VIC takes pre-annotated files and automatically classifies sequence variants based on several criteria, with the ability for users to integrate additional evidence to optimize the interpretation of clinical impact. We evaluated VIC using several publicly available databases and compared it with several predictive software programs. We found that VIC is time-efficient and conservative in classifying somatic variants under default settings, especially for variants with strong and/or potential clinical significance. Additionally, we tested VIC on two cancer-panel sequencing datasets to show its effectiveness in facilitating manual interpretation of somatic variants. Although VIC cannot replace human reviewers, it will accelerate the interpretation of somatic variants. VIC can also be customized by clinical laboratories to fit into their analytical pipelines to facilitate the laborious process of somatic variant interpretation. 
VIC is freely available at https://github.com/HGLab/VIC/.
… several computational tools used to predict whether a variant has clinical significance. In addition to describing the role of these tools … and benign variants in hematologic malignancies. …
… developed a tool called ANNOVAR 14 to rapidly annotate genetic variants and predict their functionalities. Besides ANNOVAR, several other similar annotation tools have also been …
… variants, we can move to the next step of variant annotation, where we annotate the VCF file containing variants … The output of variant annotation tools includes information about the …
Advances in genomic research have significantly enhanced our understanding of the genetic factors influencing human health. A key output of this research is the VCF (Variant Call Format) file, which documents genetic variations detected through DNA sequencing. These files, however, provide limited information, making it challenging to interpret the biological significance of the variants without additional data. Annotation, the process of enriching VCF files with information from publicly available biomedical datasets, is essential for facilitating variant interpretation in research. In this paper, we present VCFAnnotator, a tool developed to adapt the ANNOVAR software used in genetic research, enabling the annotation of entire directories with a single command and facilitating the use of any relevant external database. Additionally, VCFAnnotator can scrape the websites of the biomedical databases in use, ensuring that researchers remain informed of any updates.
Summary: Structural variations (SVs) are a major source of variability in the human genome and have shaped its current structure during evolution. Moreover, many human diseases are caused by SVs, highlighting the need to accurately detect these genomic events but also to annotate them and assist their biological interpretation. Therefore, we developed AnnotSV, which compiles functionally, regulatory and clinically relevant information and aims at providing annotations useful to (i) interpret the potential pathogenicity of SVs and (ii) filter out potential false-positive SVs. In particular, AnnotSV reports heterozygous and homozygous counts of single nucleotide variations (SNVs) and small insertions/deletions called within each SV for the analyzed patients; this genomic information is extremely useful to support or question the existence of an SV. We also report the computed allelic frequency relative to overlapping variants from DGV (MacDonald et al., 2014), which is especially powerful for filtering out common SVs. To delineate the strength of AnnotSV, we annotated the 4751 SVs from one sample of the 1000 Genomes Project, integrating the sample information of four million SNVs/indels, in less than 60 s. Availability and implementation: AnnotSV is implemented in Tcl and runs in command line on all platforms. The source code is available under the GNU GPL license. Source code, README and Supplementary data are available at http://lbgi.fr/AnnotSV/. Supplementary information: Supplementary data are available at Bioinformatics online.
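The idea of counting heterozygous and homozygous small variants inside each SV can be sketched as follows. This is a hypothetical illustration, not AnnotSV's Tcl code: the tuple layouts, function names, and inclusive 1-based coordinates are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

SmallVariant = Tuple[str, int, str]  # (chromosome, position, genotype such as "0/1")
SV = Tuple[str, int, int]            # (chromosome, start, end), both ends inclusive


def zygosity(genotype: str) -> Optional[str]:
    """Classify a diploid genotype string as 'het' or 'hom'; None otherwise."""
    alleles = genotype.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return None  # missing or non-diploid call
    if alleles[0] == alleles[1]:
        return None if alleles[0] == "0" else "hom"  # 0/0 is homozygous reference
    return "het"


def count_small_variants_in_svs(svs: List[SV],
                                small_variants: List[SmallVariant]) -> List[Dict]:
    """For each SV, count het and hom small variants falling inside it."""
    by_chrom: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for chrom, pos, gt in small_variants:
        by_chrom[chrom].append((pos, gt))
    positions: Dict[str, List[int]] = {}
    for chrom, calls in by_chrom.items():
        calls.sort()
        positions[chrom] = [pos for pos, _ in calls]

    results = []
    for chrom, start, end in svs:
        counts = {"het": 0, "hom": 0}
        pos_list = positions.get(chrom, [])
        lo = bisect_left(pos_list, start)   # first call at or after start
        hi = bisect_right(pos_list, end)    # one past the last call at or before end
        for _, gt in by_chrom[chrom][lo:hi] if pos_list else []:
            kind = zygosity(gt)
            if kind is not None:
                counts[kind] += 1
        results.append({"sv": (chrom, start, end), **counts})
    return results
```

One way such counts can be read: a region carrying a heterozygous deletion is hemizygous, so many heterozygous SNV calls inside it would argue against the deletion call.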
This study benchmarks eight structural variant prioritization tools, highlighting their comparable effectiveness in predicting pathogenicity and providing insights for improved genomic research. Structural variants (SVs) over 50 base pairs play a significant role in phenotypic diversity and are associated with various diseases, but their analysis is complex and resource-intensive. Numerous computational tools have been developed for SV prioritization, yet their effectiveness in biomedicine remains unclear. Here we benchmarked eight widely used SV prioritization tools, categorized into knowledge-driven (AnnotSV, ClassifyCNV) and data-driven (CADD-SV, dbCNV, StrVCTVRE, SVScore, TADA, XCNV) groups in accordance with the ACMG guidelines. We assessed their accuracy, robustness, and usability across diverse genomic contexts, biological mechanisms and computational efficiency using seven carefully curated independent datasets. Our results revealed that both groups of methods exhibit comparable effectiveness in predicting SV pathogenicity, although performance varies among tools, emphasizing the importance of selecting the appropriate tool based on specific research purposes. Furthermore, we pinpointed the potential improvement of expanding these tools for future applications. Our benchmarking framework provides a crucial evaluation method for SV analysis tools, offering practical guidance for biomedical research and facilitating the advancement of better genomic research tools.
Clinical genetics has an important role in the healthcare system in providing a definitive diagnosis for many rare syndromes. It can also influence genetic prevention and disease prognosis and assist in selecting the best options of care/treatment for patients. Next-generation sequencing (NGS) has transformed clinical genetics, making it possible to analyze hundreds of genes at an unprecedented speed and at a lower price compared with conventional Sanger sequencing. Despite the growing literature concerning NGS in a clinical setting, this review aims to fill the gap that exists among (bio)informaticians, molecular geneticists and clinicians, by presenting a general overview of the NGS technology and workflow. First, we will review the current NGS platforms, focusing on the two main platforms, Illumina and Ion Torrent, and discussing the major strengths and weaknesses intrinsic to each platform. Next, the NGS analytical bioinformatic pipelines are dissected, giving some emphasis to the algorithms commonly used to generate process data and to analyze sequence variants. Finally, the main challenges around NGS bioinformatics are placed in perspective for future developments. Even with the huge achievements made in NGS technology and bioinformatics, further improvements in bioinformatic algorithms are still required to deal with complex and genetically heterogeneous disorders.
… In Figure 1, we list a subset of the computational tools used in each of the approaches. In the sections that follow, we review the rationale and tools for each approach and conclude by …
Next‐generation sequencing (NGS) technologies have revolutionized the field of genetics and are trending toward clinical diagnostics. Exome and targeted sequencing in a disease context represent a major NGS clinical application, considering their utility and cost‐effectiveness. With the ongoing discovery of disease‐associated genes, various gene panels have been launched for both basic research and diagnostic tests. However, the fundamental inconsistencies among the diverse annotation sources, software packages, and data formats have complicated the subsequent analysis. To manage disease‐associated NGS data, we developed Vanno, a Web‐based application for in‐depth analysis and rapid evaluation of disease‐causative genome sequence alterations. Vanno integrates information from biomedical databases, functional predictions from available evaluation models, and mutation landscapes from TCGA cancer types. A highly integrated framework that incorporates filtering, sorting, clustering, and visual analytic modules is provided to facilitate exploration of oncogenomics datasets at different levels, such as gene, variant, protein domain, or three‐dimensional structure. Such a design is crucial for extracting knowledge from sequence alterations and translating biological insights into clinical applications. Taken together, Vanno supports almost all disease‐associated gene tests and exome sequencing panels designed for NGS, providing a complete solution for targeted and exome sequencing analysis. Vanno is freely available at http://cgts.cgu.edu.tw/vanno.
High-throughput sequencing platforms are generating massive amounts of genetic variation data for diverse genomes, but it remains a challenge to pinpoint a small subset of functionally important variants. To fill these unmet needs, we developed the ANNOVAR tool to annotate single nucleotide variants (SNVs) and insertions/deletions, such as examining their functional consequence on genes, inferring cytogenetic bands, reporting functional importance scores, finding variants in conserved regions, or identifying variants reported in the 1000 Genomes Project and dbSNP. ANNOVAR can utilize annotation databases from the UCSC Genome Browser or any annotation data set conforming to Generic Feature Format version 3 (GFF3). We also illustrate a ‘variants reduction’ protocol on 4.7 million SNVs and indels from a human genome, including two causal mutations for Miller syndrome, a rare recessive disease. Through a stepwise procedure, we excluded variants that are unlikely to be causal, and identified 20 candidate genes including the causal gene. Using a desktop computer, ANNOVAR requires ∼4 min to perform gene-based annotation and ∼15 min to perform variants reduction on 4.7 million variants, making it practical to handle hundreds of human genomes in a day. ANNOVAR is freely available at http://www.openbioinformatics.org/annovar/.
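A stepwise "variants reduction" of the kind described above can be sketched as a chain of filters followed by a recessive-model check per gene. The concrete steps, thresholds, and field names below are assumptions chosen for illustration; they are not ANNOVAR's actual protocol or command-line interface.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Variant = Dict[str, object]

# Hypothetical effect labels treated as protein-altering in this sketch.
PROTEIN_ALTERING = {"nonsynonymous", "splicing", "frameshift", "stopgain"}


def reduce_variants(variants: List[Variant],
                    max_pop_freq: float = 0.01) -> Tuple[List[str], List[Tuple[str, int]]]:
    """Apply filters in order, logging how many variants survive each step,
    then keep genes compatible with a recessive disease model."""
    steps: List[Tuple[str, Callable[[Variant], bool]]] = [
        ("protein-altering", lambda v: v["effect"] in PROTEIN_ALTERING),
        ("rare", lambda v: v["pop_freq"] <= max_pop_freq),
        ("not in dbSNP", lambda v: not v["in_dbsnp"]),
    ]
    remaining = list(variants)
    log = [("input", len(remaining))]
    for name, keep in steps:
        remaining = [v for v in remaining if keep(v)]
        log.append((name, len(remaining)))

    per_gene: Dict[str, List[Variant]] = defaultdict(list)
    for v in remaining:
        per_gene[v["gene"]].append(v)

    # Recessive model: one homozygous variant, or at least two variants in the
    # same gene (a possible compound heterozygote).
    candidates = sorted(
        gene for gene, vs in per_gene.items()
        if len(vs) >= 2 or any(v["zygosity"] == "hom" for v in vs)
    )
    return candidates, log
```

Logging the count after each step mirrors how such a protocol is usually reported: each filter is justified by how many variants it removes, and the final gene list stays small enough for manual review.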
Chemogenomics data generally refers to the activity data of chemical compounds on an array of protein targets and represents an important source of information for building in silico target prediction models. The increasing volume of chemogenomics data offers exciting opportunities to build models based on Big Data. Preparing a high quality data set is a vital step in realizing this goal and this work aims to compile such a comprehensive chemogenomics dataset. This dataset comprises over 70 million SAR data points from publicly available databases (PubChem and ChEMBL) including structure, target information and activity annotations. Our aspiration is to create a useful chemogenomics resource reflecting industry-scale data not only for building predictive models of in silico polypharmacology and off-target effects but also for the validation of cheminformatics approaches in general.
… sets and combining them with all available functional data to draw inferences … data for genome-wide functional inference and describes several methods by which these disparate data …
… for large-scale data processing in biological research. We … user-friendly interfaces for complex data analysis, as well as … to process and analyze large-scale biological data is crucial …
Today we can generate hundreds of gigabases of DNA and RNA sequencing data in a week for less than US$5,000. The astonishing rate of data generation by these low-cost, high-throughput technologies in genomics is being matched by that of other technologies, such as real-time imaging and mass spectrometry-based flow cytometry. Success in the life sciences will depend on our ability to properly interpret the large-scale, high-dimensional data sets that are generated by these technologies, which in turn requires us to adopt advances in informatics. Here we discuss how we can master the different types of computational environments that exist — such as cloud and heterogeneous computing — to successfully tackle our big data problems.
Background: New systems biology studies require researchers to understand how interplay among myriads of biomolecular entities is orchestrated in order to achieve high-level cellular and physiological functions. Many software tools have been developed in the past decade to help researchers visually navigate large networks of biomolecular interactions with built-in template-based query capabilities. To further advance researchers' ability to interrogate global physiological states of cells through multi-scale visual network explorations, new visualization software tools still need to be developed to empower the analysis. A robust visual data analysis platform driven by database management systems to perform bi-directional data processing-to-visualizations with declarative querying capabilities is needed. Results: We developed ProteoLens as a Java-based visual analytic software tool for creating, annotating and exploring multi-scale biological networks. It supports direct database connectivity to either Oracle or PostgreSQL database tables/views, on which SQL statements using both Data Definition Languages (DDL) and Data Manipulation languages (DML) may be specified. The robust query languages embedded directly within the visualization software help users to bring their network data into a visualization context for annotation and exploration. ProteoLens supports graph/network represented data in standard Graph Modeling Language (GML) formats, and this enables interoperation with a wide range of other visual layout tools. 
The architectural design of ProteoLens enables the de-coupling of complex network data visualization tasks into two distinct phases: 1) creating network data association rules, which are mapping rules between network node IDs or edge IDs and data attributes such as functional annotations, expression levels, scores, synonyms, descriptions, etc.; 2) applying network data association rules to build the network and perform the visual annotation of graph nodes and edges according to associated data values. We demonstrated the advantages of these new capabilities through three biological network visualization case studies: human disease association network, drug-target interaction network and protein-peptide mapping network. Conclusion: The architectural design of ProteoLens makes it suitable for bioinformatics expert data analysts who are experienced with relational database management to perform large-scale integrated network visual explorations. ProteoLens is a promising visual analytic platform that will facilitate knowledge discoveries in future network and systems biology studies.
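The two-phase design described above (first define association rules, then apply them while building the network) can be sketched with plain data structures. The class and function names are hypothetical, only node rules are shown, and nothing here reflects ProteoLens's actual Java code or its SQL-backed rule definitions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class AssociationRule:
    """Phase 1: a mapping from node IDs to the values of one data attribute."""
    attribute: str
    values: Dict[str, object]


def build_annotated_network(edges: List[Tuple[str, str]],
                            rules: List[AssociationRule]) -> Dict[str, object]:
    """Phase 2: build the node set from the edge list, then apply each rule
    to annotate the nodes that appear in the network."""
    nodes: Dict[str, Dict[str, object]] = {}
    for source, target in edges:
        nodes.setdefault(source, {})
        nodes.setdefault(target, {})
    for rule in rules:
        for node_id, value in rule.values.items():
            if node_id in nodes:  # ignore rule entries for nodes not in the network
                nodes[node_id][rule.attribute] = value
    return {"nodes": nodes, "edges": list(edges)}


if __name__ == "__main__":
    # Toy data: a single interaction and one expression-level rule.
    network = build_annotated_network(
        edges=[("TP53", "MDM2")],
        rules=[AssociationRule("expression", {"TP53": 2.5})],
    )
    print(network["nodes"])
```

Keeping rules separate from the network means the same edge list can be re-annotated with different attributes (scores, synonyms, expression levels) without rebuilding the graph, which is the decoupling the abstract emphasizes.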
The amount of available data is continuously growing. This phenomenon has given rise to a new concept, named big data. The key technologies related to big data are cloud computing (infrastructure) and Not Only SQL (NoSQL; data storage). In addition, for data analysis, machine learning algorithms such as decision trees, support vector machines, artificial neural networks, and clustering techniques present promising results. In a biological context, big data has many applications due to the large number of biological databases available. Some limitations of biological big data are related to the inherent features of these data, such as high degrees of complexity and heterogeneity, since biological systems provide information from an atomic level to interactions between organisms or their environment. Such characteristics make most bioinformatics-based applications difficult to build, configure, and maintain. Although the rise of big data is relatively recent, it has contributed to a better understanding of the underlying mechanisms of life. The main goal of this article is to provide a concise and reliable survey of the application of big data-related technologies in biology. As such, some fundamental concepts of information technology, including storage resources, analysis, and data sharing, are described along with their relation to biological data.
The completion of the Human Genome Project lays a foundation for systematically studying the human genome from evolutionary history to precision medicine against diseases. With the explosive growth of biological data, there is an increasing number of biological databases that have been developed in aid of human-related research. Here we present a collection of human-related biological databases and provide a mini-review by classifying them into different categories according to their data types. As human-related databases continue to grow not only in count but also in volume, challenges are ahead in big data storage, processing, exchange and curation.
Summary: bioDBnet is an online web resource that provides interconnected access to many types of biological databases. It has integrated many of the most commonly used biological databases and in its current state has 153 database identifiers (nodes) covering all aspects of biology including genes, proteins, pathways and other biological concepts. bioDBnet offers various ways to work with these databases including conversions, extensive database reports, custom navigation and has various tools to enhance the quality of the results. Importantly, the access to bioDBnet is updated regularly, providing access to the most recent releases of each individual database. Availability: http://biodbnet.abcc.ncifcrf.gov Contact: stephensr@mail.nih.gov Supplementary information: Supplementary data are available at Bioinformatics online
… data on a huge scale in both volume and complexity, significant technical challenges still remain for the smaller scale data … advanced analysis and data source interrogation capabilities …
Data sharing, integration and annotation are essential to ensure the reproducibility of the analysis and interpretation of experimental findings. Often these activities are perceived as a role that bioinformaticians and computer scientists have to take on with little or no input from the experimental biologist. On the contrary, biological researchers, being the producers and often the end users of such data, have a big role in enabling biological data integration. The quality and usefulness of data integration depend on the existence and adoption of standards, shared formats, and mechanisms that are suitable for biological researchers to submit and annotate the data, so that the data can be easily searched, conveniently linked and consequently used for further biological analysis and discovery. Here, we provide background on what data integration is from a computational science point of view, how it has been applied to biological research, which key aspects contributed to its success, and future directions.
The rapid development of high-throughput sequencing (HTS) techniques has led biology into the big-data era. Data analyses using various bioinformatics tools rely on programming and command-line environments, which are challenging and time-consuming for most wet-lab biologists. Here, we present TBtools (a Toolkit for Biologists integrating various biological data handling tools), a stand-alone software package with a user-friendly interface. The toolkit incorporates over 130 functions, which are designed to meet the increasing demand for big-data analyses, ranging from bulk sequence processing to interactive data visualization. A wide variety of graphs can be prepared in TBtools, with a new plotting engine ("JIGplot") developed to maximize their interactive ability, which allows quick point-and-click modification of almost every graphic feature. TBtools is platform-independent software that can be run under all operating systems with Java Runtime Environment 1.6 or newer. It is freely available to non-commercial users at https://github.com/CJ-Chen/TBtools/releases.
The last decade has witnessed an explosion in the amount of available biological sequence data, due to the rapid progress of high-throughput sequencing projects. However, the volume of biological data is becoming so great that traditional data analysis platforms and methods can no longer meet the need to rapidly perform data analysis tasks in the life sciences. As a result, both biologists and computer scientists are facing the challenge of gaining a profound insight into the deepest biological functions from big biological data. This in turn requires massive computational resources. Therefore, high performance computing (HPC) platforms are highly needed, as well as efficient and scalable algorithms that can take advantage of these platforms. In this paper, we survey the state-of-the-art HPC platforms for big biological data analytics. We first list the characteristics of big biological data and popular computing platforms. Then we provide a taxonomy of different biological data analysis applications and a survey of the way they have been mapped onto various computing platforms. After that, we present a case study to compare the efficiency of different computing platforms for handling the classical biological sequence alignment problem. Finally, we discuss the open issues in big biological data analytics.
Motivation: In the biological sciences, the need to analyse vast amounts of information has become commonplace. Such large-scale analyses often involve drawing together data from a variety of different databases, held remotely on the internet or locally on in-house servers. Supporting these tasks are ad hoc collections of data-manipulation tools, scripting languages and visualisation software, which are often combined in arcane ways to create cumbersome systems that have been customised for a particular purpose, and are consequently not readily adaptable to other uses. For many day-to-day bioinformatics tasks, the sizes of current databases, and the scale of the analyses necessary, now demand increasing levels of automation; nevertheless, the unique experience and intuition of human researchers is still required to interpret the end results in any meaningful biological way. Putting humans in the loop requires tools to support real-time interaction with these vast and complex data-sets. Numerous tools do exist for this purpose, but many do not have optimal interfaces, most are effectively isolated from other tools and databases owing to incompatible data formats, and many have limited real-time performance when applied to realistically large data-sets: much of the user's cognitive capacity is therefore focused on controlling the software and manipulating esoteric file formats rather than on performing the research. Methods: To confront these issues, harnessing expertise in human-computer interaction (HCI), high-performance rendering and distributed systems, and guided by bioinformaticians and end-user biologists, we are building reusable software components that, together, create a toolkit that is both architecturally sound from a computing point of view, and addresses both user and developer requirements. 
Key to the system's usability is its direct exploitation of semantics, which, crucially, gives individual components knowledge of their own functionality and allows them to interoperate seamlessly, removing many of the existing barriers and bottlenecks from standard bioinformatics tasks. Results: The toolkit, named Utopia, is freely available from http://utopia.cs.man.ac.uk/.
Background: New techniques for determining relationships between biomolecules of all types (genes, proteins, noncoding DNA, metabolites and small molecules) are now making a substantial contribution to the widely discussed explosion of facts about the cell. The data generated by these techniques promote a picture of the cell as an interconnected information network, with molecular components linked with one another in topologies that can encode and represent many features of cellular function. This networked view of biology brings the potential for systematic understanding of living molecular systems. Results: We present VisANT, an application for integrating biomolecular interaction data into a cohesive, graphical interface. This software features a multi-tiered architecture for data flexibility, separating back-end modules for data retrieval from a front-end visualization and analysis package. VisANT is a freely available, open-source tool for researchers, and offers an online interface for a large range of published data sets on biomolecular interactions, including those entered by users. This system is integrated with standard databases for organized annotation, including GenBank, KEGG and SwissProt. VisANT is a Java-based, platform-independent tool suitable for a wide range of biological applications, including studies of pathways, gene regulation and systems biology. Conclusion: VisANT has been developed to provide interactive visual mining of biological interaction data sets. The new software provides a general tool for mining and visualizing such data in the context of sequence, pathway, structure, and associated annotations. Interaction and predicted association data can be combined, overlaid, manipulated and analyzed using a variety of built-in functions. VisANT is available at http://visant.bu.edu.
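This report systematically organizes the bioinformatics literature into a knowledge framework spanning six dimensions: genomic sequencing and assembly; variant functional analysis; multi-omics big-data mining and artificial intelligence; structural bioinformatics; computing architecture and infrastructure; and reviews of the field's methodology. The classification covers the full life cycle, from low-level sequencing data processing and innovations in algorithm architecture to high-level interpretation of biological models and clinical precision-medicine applications, and it reflects the discipline's trend toward more systematic, intelligent, and high-performance handling of complex biological data.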