publications
Publications by category in reverse chronological order.
2024
- Genomic data processing with GenomeFlow. Junseok Park, Eduardo A Maury, Changhoon Oh, and 3 more authors. BMC Bioinformatics, 2024
Advances in genome sequencing technologies generate massive amounts of sequence data that are increasingly analyzed and shared through public repositories. On-demand infrastructure services on cloud computing platforms enable the processing of such large-scale genomic sequence data in distributed processing environments with a significant reduction in analysis time. However, parallel processing on cloud computing platforms presents many challenges to researchers, even skillful bioinformaticians. In particular, it is difficult to design a computing architecture optimized to reduce the cost of computing and disk storage as genomic data analysis pipelines often employ many heterogeneous tools with different resource requirements. To address these issues, we developed GenomeFlow, a tool for automated development of computing architecture and resource optimization on Google Cloud Platform, which allows users to process a large number of samples at minimal cost. We outline multiple use cases of GenomeFlow demonstrating its utility to significantly reduce computing time and cost associated with analyzing genomic and transcriptomic data from hundreds to tens of thousands of samples from several consortia. Here, we describe a step-by-step protocol on how to use GenomeFlow for a common genomic data processing task. We introduce this example protocol geared toward a bioinformatician with little experience in cloud computing and large data processing and estimate that it will take <1 hour to execute.
- Human cytomegalovirus harnesses host L1 retrotransposon for efficient replication. Sung-Yeon Hwang*, Hyewon Kim*, Danielle Denisko*, and 9 more authors. 2024
Genetic parasites, including viruses and transposons, exploit components from the host for their own replication. However, little is known about virus-transposon interactions within host cells. Here, we discover a strategy where human cytomegalovirus (HCMV) hijacks L1 retrotransposon encoded protein during its replication cycle. HCMV infection upregulates L1 expression by enhancing both the expression of L1-activating transcription factors, YY1 and RUNX3, and the chromatin accessibility of L1 promoter regions. Increased L1 expression in turn promotes HCMV replicative fitness. Affinity proteomics reveals UL44, HCMV DNA polymerase subunit, as the most abundant viral binding protein of the L1 ribonucleoprotein (RNP) complex. UL44 directly interacts with L1 ORF2p, inducing DNA damage responses in replicating HCMV compartments. While increased L1-induced mutagenesis is not observed in HCMV for genetic adaptation, the interplay between UL44 and ORF2p accelerates viral DNA replication by resolving stalled replication forks. Our findings shed light on how HCMV exploits host retrotransposons for enhanced viral fitness.
2023
- Motif elucidation in ChIP-seq datasets with a knockout control. Danielle Denisko, Coby Viner, and Michael M Hoffman. Bioinformatics Advances, 2023
Chromatin immunoprecipitation-sequencing is widely used to find transcription factor binding sites, but suffers from various sources of noise. Knocking out the target factor mitigates noise by acting as a negative control. Paired wild-type and knockout (KO) experiments can generate improved motifs but require optimal differential analysis. We introduce peaKO, a computational method to automatically optimize motif analyses with KO controls, which we compare to two other methods. PeaKO often improves elucidation of the target factor and highlights the benefits of KO controls, which far outperform input controls. PeaKO is freely available at https://peako.hoffmanlab.org. Contact: michael.hoffman@utoronto.ca
2022
- Assessing and assuring interoperability of a genomics file format. Yi Nian Niu, Eric G Roberts, Danielle Denisko, and 1 more author. Bioinformatics, 2022
Bioinformatics software tools operate largely through the use of specialized genomics file formats. Often these formats lack formal specification, making it difficult or impossible for the creators of these tools to robustly test them for correct handling of input and output. This causes problems in interoperability between different tools that, at best, wastes time and frustrates users. At worst, interoperability issues could lead to undetected errors in scientific results. We developed a new verification system, Acidbio, which tests for correct behavior in bioinformatics software packages. We crafted tests to unify correct behavior when tools encounter various edge cases, potentially unexpected inputs that exemplify the limits of the format. To analyze the performance of existing software, we tested the input validation of 80 Bioconda packages that parsed the Browser Extensible Data (BED) format. We also used a fuzzing approach to automatically perform additional testing. Of 80 software packages examined, 75 achieved less than 70% correctness on our test suite. We categorized multiple root causes for the poor performance of different types of software. Fuzzing detected other errors that the manually designed test suite could not. We also created a badge system that developers can use to indicate more precisely which BED variants their software accepts and to advertise the software’s performance on the test suite. Acidbio is available at https://github.com/hoffmangroup/acidbio. Supplementary data are available at Bioinformatics online.
2021
- GA4GH: International policies and standards for data sharing across genomic research and healthcare. Heidi L. Rehm, Angela J.H. Page, Lindsay Smith, and 199 more authors. Cell Genomics, 2021
The Global Alliance for Genomics and Health (GA4GH) aims to accelerate biomedical advances by enabling the responsible sharing of clinical and genomic data through both harmonized data aggregation and federated approaches. The decreasing cost of genomic sequencing (along with other genome-wide molecular assays) and increasing evidence of its clinical utility will soon drive the generation of sequence data from tens of millions of humans, with increasing levels of diversity. In this perspective, we present the GA4GH strategies for addressing the major challenges of this data revolution. We describe the GA4GH organization, which is fueled by the development efforts of eight Work Streams and informed by the needs of 24 Driver Projects and other key stakeholders. We present the GA4GH suite of secure, interoperable technical standards and policy frameworks and review the current status of standards, their relevance to key domains of research and clinical care, and future plans of GA4GH. Broad international participation in building, adopting, and deploying GA4GH standards and frameworks will catalyze an unprecedented effort in data sharing that will be critical to advancing genomic medicine and ensuring that all populations can access its benefits.