Inference of patterns and associations using dictionary models

Source: IMS-China International Conference on Statistics and Probabi | Citations: 0 | Uploaded by: shanghui

Author: Jun Liu

Institution: Harvard University

Source: IMS-China International Conference on Statistics and Probabi

Publication date: 2008, Issue 6


Excerpt from the paper

Pattern discovery is a ubiquitous problem in many disciplines. It has become especially prominent in recent years because of our greatly improved data-generation capabilities in science and technology. The method I present here is motivated by the "motif-finding" and "module-finding" problems in biology: to find sequence patterns ("words") that appear more frequently than expected in a given set of text sequences ("sentences"), and to find which of these words tend to co-occur in a sentence. A challenge in the motif-finding problem is that there are no spaces or punctuation between the words, and the dictionary of words is unknown to us. Existing methods are mostly "bottom-up" approaches: they build up the dictionary starting with single-letter words and then concatenate existing words that occur next to each other in sentences more frequently than expected by chance. Our new approach is a top-down strategy, which uses a tree structure to represent the relationships among all possible words and uses the EM algorithm to estimate the usage frequency of each word. It automatically trims away most of the incorrect words by letting their usage frequencies converge to zero.

The module-finding problem is closely related to the well-known "market basket" problem, in which one attempts to mine association rules among the items in a supermarket from customers' transaction records. It is also related to the two-way clustering problem. In this problem, we assume that the words are given, and our goal is to find subsets of words that tend to co-occur in a sentence. We call a set of co-occurring words (not necessarily in order) a "theme" or a "module". We can generalize the dictionary model to a "theme" model and use a similar EM strategy to infer these themes. I will demonstrate its applications in a few examples, including an analysis of Chinese medicine prescriptions and an analysis of a Chinese novel.
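The EM step for word usage frequencies mentioned in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: it assumes a small, explicitly supplied list of distinct, non-empty candidate words (the tree structure over candidates is not modeled), it assumes each sentence is a concatenation of words drawn independently with probabilities `theta`, and the function name `em_word_frequencies` and its parameters are hypothetical. A forward-backward pass sums over all segmentations of each unsegmented sentence, and words that are never needed have their frequencies shrink toward zero.

```python
import math


def em_word_frequencies(sequences, candidates, n_iter=50):
    """Illustrative EM for a dictionary model (hypothetical helper).

    sequences:  unsegmented strings ("sentences").
    candidates: distinct, non-empty candidate "words".
    Returns (theta, log_liks): estimated usage frequencies and the
    log-likelihood evaluated at the start of each iteration.
    No rescaling is done, so this is only suitable for short sequences.
    """
    theta = {w: 1.0 / len(candidates) for w in candidates}
    log_liks = []
    for _ in range(n_iter):
        counts = {w: 0.0 for w in candidates}
        log_lik = 0.0
        for s in sequences:
            n = len(s)
            # fwd[i]: total probability of all segmentations of s[:i]
            fwd = [0.0] * (n + 1)
            fwd[0] = 1.0
            for i in range(1, n + 1):
                for w, p in theta.items():
                    if i >= len(w) and s[i - len(w):i] == w:
                        fwd[i] += p * fwd[i - len(w)]
            # bwd[i]: total probability of all segmentations of s[i:]
            bwd = [0.0] * (n + 1)
            bwd[n] = 1.0
            for i in range(n - 1, -1, -1):
                for w, p in theta.items():
                    if s.startswith(w, i):
                        bwd[i] += p * bwd[i + len(w)]
            if fwd[n] == 0.0:
                raise ValueError(f"cannot segment {s!r} with current words")
            log_lik += math.log(fwd[n])
            # E-step: expected number of times each word is used in s
            for i in range(n):
                for w, p in theta.items():
                    if s.startswith(w, i):
                        counts[w] += fwd[i] * p * bwd[i + len(w)] / fwd[n]
        # M-step: renormalize expected counts into usage frequencies
        total = sum(counts.values())
        theta = {w: c / total for w, c in counts.items()}
        log_liks.append(log_lik)
    return theta, log_liks


if __name__ == "__main__":
    theta, _ = em_word_frequencies(["abab", "ab", "ababab"],
                                   ["a", "b", "ab", "zz"])
    for word, freq in sorted(theta.items()):
        print(word, round(freq, 4))
```

In this toy input, every sentence can be explained by the single word "ab", so the frequencies of "a" and "b" decay across iterations, and a candidate that never occurs ("zz") receives zero expected count after the first update. A realistic version would need numerical rescaling for long sequences and some way to enumerate and prune candidate words.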
