Analysis on n-gram statistics and linguistic features of whole genome protein sequences

Source: Journal of Harbin Institute of Technology (哈尔滨工业大学学报)
To support statistical analysis of the large number of genomic and proteomic sequences now available for different organisms, the n-grams of whole genome protein sequences from 20 organisms were extracted. Their linguistic features were analyzed with two tests, the Zipf power law and Shannon entropy, both developed for the analysis of natural languages and symbolic sequences. Natural genome proteins were compared with artificial genome proteins, and several statistical features of n-grams were identified. The results show that: the n-grams of whole genome protein sequences approximately follow the Zipf law when n is larger than 4; the Shannon n-gram entropy of natural genome proteins is lower than that of artificial proteins; a simple unigram model can distinguish different organisms; and there exist organism-specific usages of "phrases" in protein sequences. It is suggested that further detailed analysis of the n-grams of whole genome protein sequences will result in a powerful model for mapping the relationship among protein sequence, structure and function.
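The two tests named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names (`ngram_counts`, `shannon_entropy`, `zipf_slope`) and the toy sequences are hypothetical, and the Zipf check is assumed here to be an ordinary least-squares fit of log frequency against log rank, where a slope near -1 would indicate Zipf-like behavior.

```python
import math
from collections import Counter


def ngram_counts(sequences, n):
    """Count overlapping n-grams across a collection of protein sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def shannon_entropy(counts):
    """Shannon entropy (in bits) of the empirical n-gram distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def zipf_slope(counts):
    """Least-squares slope of log(frequency) versus log(rank).

    Assumption: a slope close to -1 is read as Zipf-like behavior.
    """
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct n-grams to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy sequences made up for illustration only (not real proteome data).
toy_proteins = ["MKVLAAGIVG", "MKTLLAGIVA", "MAVLKAGLVG"]
unigrams = ngram_counts(toy_proteins, 1)
bigrams = ngram_counts(toy_proteins, 2)
print("unigram entropy (bits):", shannon_entropy(unigrams))
print("bigram rank-frequency slope:", zipf_slope(bigrams))
```

Under the same assumptions, an "artificial" comparison set could be produced by shuffling or randomly sampling residues and running the same functions on it; how the paper constructed its artificial proteins is not stated in the abstract.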