Background: DNA modifications such as DNA methylation and DNA damage can play critical regulatory roles in biological systems. High-throughput DNA modification profiling can provide a more comprehensive view of the composition of DNA sequences, offering insights into biological phenomena beyond what can be discerned from As, Gs, Cs, and Ts alone. Pacific Biosciences' single-molecule, real-time (SMRT) sequencing technology generates DNA sequences as well as DNA polymerase kinetic information, which can be used for the direct detection of DNA modifications. However, no statistical model exists for modeling the polymerase kinetic information. Herein, we propose a flexible hierarchical model that can greatly improve the accuracy of SMRT sequencing-based DNA modification detection and reduce sequencing cost.

Methods: We demonstrate that local sequence context has a strong impact on DNA polymerase kinetics in the neighborhood of the incorporation site during the DNA synthesis reaction. This makes it possible to estimate the expected kinetic rate of the enzyme at the incorporation site using kinetic rate information collected from existing SMRT sequencing data (historical data) covering the same local sequence contexts of interest. We develop a flexible hierarchical model that can detect DNA modifications accurately by incorporating historical data.

Results: When a negative control sample exists, the hierarchical model outperforms the naive case-control method, in which the kinetics from whole-genome amplified (WGA) DNA (control) are compared to those from the corresponding native DNA (case) to detect kinetic variation events. When there is no negative control sample, the hierarchical model can still achieve reasonably good accuracy for detecting modifications that have a strong signal-to-noise ratio.

Conclusions: We highlight the importance of local sequence context for third-generation sequencing-based DNA modification detection. By incorporating historical data, the proposed model can increase detection accuracy and reduce sequencing cost.
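
The idea of borrowing strength from historical data can be illustrated with a minimal sketch. It is not the model proposed in the paper, whose exact form is not given here. It assumes a simple normal-normal model on log-transformed kinetic measurements, in which historical data for a local sequence context supply a prior mean and variance for the unmodified kinetic rate, and an optional control sample updates that prior. All function names, parameters, and numeric values are hypothetical.

```python
import math
from statistics import mean


def posterior_null(prior_mean, prior_var, control_obs, noise_var):
    """Combine a historical-data prior with control observations.

    Assumed model (illustrative only): the unmodified kinetic rate for a
    sequence context is Normal(prior_mean, prior_var), and each control
    observation is that rate plus Normal(0, noise_var) noise.
    Returns the posterior mean and variance of the unmodified rate.
    """
    n = len(control_obs)
    if n == 0:
        # No control sample: fall back on the historical prior alone.
        return prior_mean, prior_var
    prior_prec = 1.0 / prior_var
    data_prec = n / noise_var
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * mean(control_obs))
    return post_mean, post_var


def modification_z(native_obs, null_mean, null_var, noise_var):
    """Z-score of the native-sample mean against the expected unmodified rate."""
    m = len(native_obs)
    se = math.sqrt(null_var + noise_var / m)
    return (mean(native_obs) - null_mean) / se


if __name__ == "__main__":
    # Hypothetical numbers on a log scale, for illustration only.
    null_mean, null_var = posterior_null(0.0, 0.04, [0.1, -0.05, 0.02], 0.25)
    z = modification_z([0.9, 1.1, 0.8, 1.2], null_mean, null_var, 0.25)
    print(f"expected rate = {null_mean:.3f}, z = {z:.2f}")
```

With an empty control list the sketch relies on the historical prior only, which mirrors the no-negative-control setting described in the Results. In that case the prior variance sets how strong a signal must be before it stands out.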