Inference of patterns and associations using dictionary models

Source: IMS-China International Conference on Statistics and Probabi | Citations: 0 | Uploaded by: shanghui
  Pattern discovery is a ubiquitous problem in many disciplines. It has become especially prominent in recent years because of our greatly improved data-generation capabilities in science and technology. The method I present here is motivated by the "motif-finding" and "module-finding" problems in biology, i.e., to find sequence patterns ("words") that appear more frequently than expected in a given set of text sequences ("sentences"), and to find which of these "words" tend to co-occur in a sentence. A challenge in the motif-finding problem is that there is no spacing or punctuation between the words, and the dictionary of "words" is unknown to us. Existing methods are mostly "bottom-up" approaches: they build up the dictionary starting with single-letter words and then concatenate existing words that occur next to each other in sentences more frequently than expected by chance. Our new approach is a top-down strategy, which uses a tree structure to represent the relationships among all possible existing words and uses the EM algorithm to estimate the usage frequency of each word. It automatically trims away most of the incorrect "words" by letting their usage frequencies converge to zero. The module-finding problem is closely related to the well-known "market basket" problem, in which one attempts to mine association rules among the items in a supermarket based on customers' transaction records. It is also related to the two-way clustering problem. In this problem, we assume that the words are given, and our goal is to find subsets of words that tend to co-occur in a sentence. We call a set of co-occurring words (not necessarily in order) a "theme" or a "module". We can generalize the dictionary model to the "theme" model and use a similar EM strategy to infer these themes. I will demonstrate its applications in a few examples, including an analysis of Chinese medicine prescriptions and an analysis of a Chinese novel.
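The EM step for word usage frequencies can be illustrated with a minimal sketch. It is not the procedure from the talk: it assumes a fixed, hand-supplied candidate list instead of the tree of possible words, and it assumes each sentence is a concatenation of words drawn independently with probabilities `theta`. The function name `estimate_word_frequencies` and its parameters are hypothetical. The E-step sums over all segmentations of a sentence with a forward-backward recursion; the M-step renormalizes the expected word counts.

```python
from collections import defaultdict


def estimate_word_frequencies(sentences, candidates, n_iter=50):
    """EM estimate of word usage frequencies under a simple dictionary model.

    Assumption (illustrative only): a sentence is an unsegmented
    concatenation of words drawn independently with probabilities theta.
    """
    # Start from uniform usage frequencies over the candidate words.
    theta = {w: 1.0 / len(candidates) for w in candidates}

    for _ in range(n_iter):
        counts = defaultdict(float)
        for s in sentences:
            n = len(s)

            # fwd[i]: total probability of all segmentations of s[:i].
            fwd = [0.0] * (n + 1)
            fwd[0] = 1.0
            for i in range(1, n + 1):
                for w, p in theta.items():
                    k = len(w)
                    if k <= i and s[i - k:i] == w:
                        fwd[i] += fwd[i - k] * p

            # bwd[i]: total probability of all segmentations of s[i:].
            bwd = [0.0] * (n + 1)
            bwd[n] = 1.0
            for i in range(n - 1, -1, -1):
                for w, p in theta.items():
                    k = len(w)
                    if i + k <= n and s[i:i + k] == w:
                        bwd[i] += p * bwd[i + k]

            if fwd[n] == 0.0:
                continue  # sentence cannot be segmented with current words

            # E-step: expected number of times each word is used in s.
            for i in range(n):
                for w, p in theta.items():
                    k = len(w)
                    if i + k <= n and s[i:i + k] == w:
                        counts[w] += fwd[i] * p * bwd[i + k] / fwd[n]

        total = sum(counts.values())
        if total == 0.0:
            break
        # M-step: renormalize expected counts into usage frequencies.
        theta = {w: counts[w] / total for w in candidates}

    return theta
```

Called on toy input such as `estimate_word_frequencies(["abab", "ababab"], ["a", "b", "ab", "ba"])`, the estimate for "ab" should approach 1 while the spurious candidate "ba" shrinks toward zero, which mirrors the trimming behavior described above. A real implementation would work in log space or rescale the recursions to avoid underflow on long sequences.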