A Brief Analysis of Dream of the Red Chamber (《红楼梦》) from the Perspective of Character Frequency

Source: 现代语文(语言研究) (Modern Chinese, Language Research edition)
  I. Corpus Overview
  
  Basic information about the electronic text of the classic novel Dream of the Red Chamber is listed below (Table 1):
  


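  The corpus of Dream of the Red Chamber analyzed in this paper was downloaded from http://ling.ccnu.edu.cn/ylk/gudian.htm. It comprises 120 chapters and, besides Chinese characters, contains 210 distinct special symbols, such as graphic symbols, structural marks, punctuation, Arabic numerals, Japanese characters, and Latin letters.
  These special symbols are listed below in two groups according to whether they are displayed. The first group consists of symbols that are visible in the text. The second consists of symbols that show nothing in the text but still occupy code points of their own. The first group (Table 2) contains 173 symbols and the second (Table 3) contains 37. Because the second group is not displayed, each symbol's hexadecimal code is given in parentheses.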
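  The split into visible and invisible symbols can be sketched in code. The following is a minimal illustration, not the procedure used in the paper: it assumes the text has been decoded to Unicode, treats only the basic CJK Unified Ideographs block as Chinese characters, and counts every character whose Unicode general category starts with C (control, format) or Z (separator, space) as invisible. The function names are hypothetical, and the hexadecimal codes in Table 3 may refer to a different encoding than the Unicode code points printed here.

```python
import unicodedata


def is_han(ch: str) -> bool:
    """Rough test for a Chinese character (basic CJK Unified Ideographs block only)."""
    return "\u4e00" <= ch <= "\u9fff"


def split_symbols(text: str):
    """Return (visible, invisible) sets of the distinct non-Han symbols in text."""
    visible, invisible = set(), set()
    for ch in set(text):
        if is_han(ch):
            continue
        # Assumption: categories C* (control/format) and Z* (separators/spaces)
        # produce no visible glyph; everything else is treated as visible.
        if unicodedata.category(ch)[0] in ("C", "Z"):
            invisible.add(ch)
        else:
            visible.add(ch)
    return visible, invisible


def describe_invisible(chars):
    """List invisible symbols by hexadecimal code point, since they cannot be shown."""
    return sorted(f"U+{ord(ch):04X}" for ch in chars)


if __name__ == "__main__":
    sample = "宝玉,\u3000A1\n"
    vis, invis = split_symbols(sample)
    print(sorted(vis))
    print(describe_invisible(invis))
```

  On a real corpus, len(vis) and len(invis) would be the counterparts of the 173 and 37 reported above, although the exact numbers depend on how "invisible" is defined.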
  


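  All of these symbols are part of the text of Dream of the Red Chamber, but this paper is mainly concerned with the features of the visible Chinese characters, so the non-Chinese special symbols are left out of the analysis and statistics. Because punctuation accounts for a comparatively large share of the text, it is analyzed separately in the next section.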
  
  II. Punctuation Statistics
  
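  Punctuation marks also carry meaning and contribute to both the understanding of the novel and its linguistic expression.
  Non-Chinese characters occur 137,850 times in Dream of the Red Chamber, and 137,540 of those occurrences are punctuation marks. The novel contains 868,996 characters in total, so punctuation accounts for 137540/868996 ≈ 0.158, that is, 15.8% of the text. The two most frequent punctuation marks are the comma and the full stop, with 59,357 and 29,400 occurrences respectively. They are also the two most frequent characters in the novel overall, which means that the most repeated symbols in the text are punctuation marks rather than Chinese characters. Sentence-final marks such as the full stop and the exclamation mark allow a rough estimate of the novel's scale, but these figures alone say nothing about its content. Only by combining characters, words, and punctuation can the text be properly understood and analyzed. We now turn to the individual characters of the novel.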
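  The arithmetic above is easy to reproduce. The sketch below recomputes the punctuation share from the counts reported in this section and shows one possible way to count sentence-final marks. The set of sentence-final marks and the function names are assumptions made for illustration, not definitions taken from the paper.

```python
from collections import Counter

# Assumption: full-width full stop, exclamation mark and question mark end a sentence.
SENTENCE_ENDERS = "。!?"


def punctuation_share(punct_count: int, total_count: int) -> float:
    """Share of all characters that are punctuation marks."""
    return punct_count / total_count


def count_sentence_enders(text: str) -> int:
    """Count sentence-final marks as a rough proxy for the number of sentences."""
    counts = Counter(text)
    return sum(counts[ch] for ch in SENTENCE_ENDERS)


if __name__ == "__main__":
    # Counts reported in the paper.
    share = punctuation_share(137540, 868996)
    print(f"punctuation share: {share:.3f}")
    print(count_sentence_enders("宝玉笑道:好!你去罢。"))
```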
  
  III. Character Frequency Statistics and Character Associations
  
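  Because non-Chinese symbols contribute little to the analysis of Dream of the Red Chamber, they are excluded from the character frequency statistics. The resulting figures are 4,316 distinct Chinese characters and 731,146 character occurrences in total. It is impractical to analyze more than four thousand characters one by one, and some of them occur only once or a few times, so the study is restricted to representative characters. One possible criterion is to take the most frequent characters until their cumulative count exceeds half of all occurrences (731,146). However, the statistics show that a few particles occur far more often than the verbs and nouns that carry concrete meaning, so a cumulative cutoff of this kind would be reached largely by function characters. We therefore chose the characters whose frequency exceeds 0.1%, and the results show that this choice is reasonable.
  There are 194 characters with a frequency above 0.1%. Together they occur 502,347 times, or 68.7% of all Chinese character occurrences in the novel, yet they make up only 194/4316 ≈ 0.0449, under 5%, of the distinct characters. All high-frequency characters are listed below in descending order of frequency. Records are separated by semicolons, and each record gives the character, its number of occurrences, and its share of all Chinese character occurrences as a decimal fraction (not a percentage), separated by commas.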
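  A selection of this kind can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' program: only the basic CJK Unified Ideographs block counts as Chinese characters, "above 0.1%" is read as strictly greater than 0.001, and the function names are hypothetical.

```python
from collections import Counter


def is_han(ch: str) -> bool:
    """Rough test for a Chinese character (basic CJK Unified Ideographs block only)."""
    return "\u4e00" <= ch <= "\u9fff"


def high_frequency_chars(text: str, threshold: float = 0.001):
    """Return (rows, coverage, type_share) for characters above the threshold.

    rows       -- (character, count, share) tuples in descending order of count
    coverage   -- share of all Han occurrences covered by the selected characters
    type_share -- selected characters as a share of all distinct Han characters
    """
    counts = Counter(ch for ch in text if is_han(ch))
    total = sum(counts.values())
    if total == 0:
        return [], 0.0, 0.0
    rows = [(ch, n, n / total)
            for ch, n in counts.most_common()
            if n / total > threshold]
    coverage = sum(n for _, n, _ in rows) / total
    type_share = len(rows) / len(counts)
    return rows, coverage, type_share


if __name__ == "__main__":
    # Toy input with a high threshold so the selection is visible.
    rows, coverage, type_share = high_frequency_chars("了了了的的不,。", threshold=0.2)
    for ch, n, share in rows:
        print(ch, n, round(share, 4))
    print(round(coverage, 3), round(type_share, 3))
```

  Applied to the full corpus with the default threshold, coverage and type_share correspond to the 68.7% and the under-5% figures reported above.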
  了,21193,0.028986;的,15720,0.0215;不,15025,0.02055;一,12149,0.016616;来,11429,0.01563;道,11059,0.0151;人,10542,0.0144;是,10142,0.01387;说,9692,0.013256;我,9173,0.012546;这,7810,0.010682;他,7737,0.01058;你,7142,0.009768;去,6186,0.00846;着,6166,0.00843;也,6106,0.00835;儿,6074,0.008308;玉,6051,0.008276;有,5987,0.008189;宝,5820,0.00796;个,5656,0.007736;子,5466,0.007476;又,5220,0.007139;贾,5201,0.00711;里,5143,0.00703;那,4909,0.00671;们,4893,0.00669;见,4804,0.00657;只,4677,0.006397;太,4302,0.00588;便,4078,0.005578;好,4042,0.005528;在,4002,0.00547;笑,3957,0.00541;家,3917,0.005357;上,3809,0.0052;么,3670,0.00502;得,3610,0.004937;大,3466,0.00474;姐,3443,0.004709;头,3403,0.00465;听,3301,0.004515;就,3253,0.004449;出,3225,0.00441;回,3070,0.004199;知,2922,0.003996;日,2917,0.00399;要,2903,0.00397;下,2775,0.003795;都,2677,0.00366;心,2655,0.00363;事,2641,0.00361;二,2630,0.003597;老,2602,0.003559;过,2584,0.00353;话,2504,0.003425;还,2496,0.0034;起,2477,0.003388;自,2455,0.003358;如,2357,0.0032;看,2353,0.003218;叫,2267,0.0031;到,2243,0.003068;没,2243,0.003068;两,2230,0.00305;母,2206,0.003017;些,2172,0.00297;时,2156,0.002949;之,2139,0.002926;今,2117,0.002895;小,2020,0.00276;问,2001,0.002737;因,1977,0.0027;凤,1949,0.002666;奶,1947,0.00266;等,1938,0.00265;娘,1871,0.002559;可,1863,0.002548;什,1855,0.002537;呢,1826,0.002497;忙,1822,0.00249;夫,1805,0.002469;想,1792,0.00245;面,1781,0.002436;爷,1773,0.002425;才,1771,0.0024;中,1672,0.002287;王,1661,0.00227;打,1588,0.00217;进,1548,0.002117;此,1538,0.0021;倒,1534,0.002098;罢,1525,0.002086;样,1507,0.00206;吃,1455,0.00199;和,1453,0.001987;正,1411,0.0019;几,1400,0.001915;无,1400,0.001915;姑,1395,0.001908;后,1388,0.001898;黛,1383,0.00189;天,1362,0.00186;然,1292,0.001767;前,1281,0.00175;为,1274,0.00174;意,1261,0.001725;别,1253,0.0017;再,1253,0.0017;门,1242,0.001699;丫,1232,0.001685;走,1222,0.00167;外,1221,0.00167;袭,1213,0.001659;作,1212,0.001658;怎,1206,0.001649;三,1203,0.001645;众,1189,0.001626;妹,1188,0.001625;方,1170,0.0016;生,1170,0.0016;多,1164,0.00159;明,1157,0.00158;将,1156,0.00158;已,1150,0.00157;身,1142,0.00156;把,1141,0.00156;以,1133,0.00155;气,1125,0.001539;钗,1119,0.0015;何,1117,0.001528;亲,1087,0.001487;给,1077,0.00147;拿,1066,0.001458;与,1059,0.001448;手,1054,0.00144;坐,1054,0.00144;年,1048,0.00143;若,1038,0.0014;十,1036,0.001417;用,1036,0.001417;请,1031,0.0014;房,1027,0.001405;发,993,0.001358;薛,993,0.001358;且,991,0.001355;春,983,0.001344;妈,979,0.001339;政,978,0.001338;命,972,0.001329;姨,959,0.0013;原,952,0.00130;花,950,0.001299;所,948,0.001297;处,934,0.001277;先,909,0.00124;边,904,0.001236;谁,902,0.001234;己,899,0.00123;平,899,0.00123;瞧,895,0.001224;琏,892,0.00122;内,888,0.001215;住,887,0.001213;管,886,0.001212;女,880,0.001204;死,866,0.001184;送,856,0.001171;连,834,0.001141;至,831,0.001137;告,830,0.001135;早,823,0.001126;会,817,0.001117;东,815,0.001115;香,812,0.001111;林,807,0.001104;往,802,0.001097;西,802,0.001097;月,797,0.00109;带,794,0.001086;虽,790,0.00108;应,785,0.001074;必,772,0.001056;从,770,0.001053;口,767,0.001049;分,765,0.001046;怕,761,0.001041;声,758,0.001037;四,754,0.001031;当,746,0.00102;放,745,0.001019;能,744,0.001018;未,744,0.001018;云,736,0.001007
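  Records in the "character, count, share" form used above can be read back and checked against the total of 731,146 Chinese character occurrences. The sketch below is only an illustration: it assumes well-formed records separated by ASCII semicolons with three comma-separated fields each, and the function names are hypothetical.

```python
def parse_records(line: str):
    """Parse 'char,count,share;char,count,share;...' into (char, count, share) tuples."""
    records = []
    for item in line.split(";"):
        item = item.strip()
        if not item:
            continue
        ch, count, share = item.split(",")
        records.append((ch, int(count), float(share)))
    return records


def recompute_shares(records, total: int):
    """Recompute each share as count / total so it can be compared with the listed value."""
    return [(ch, n, n / total) for ch, n, _ in records]


if __name__ == "__main__":
    sample = "了,21193,0.028986;的,15720,0.0215;不,15025,0.02055"
    listed = parse_records(sample)
    for (ch, n, given), (_, _, computed) in zip(listed, recompute_shares(listed, 731146)):
        print(ch, n, given, round(computed, 6))
```

  Comparing the listed and recomputed shares is a quick way to spot entries that were rounded to fewer digits or mistyped.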
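  From the statistics above, we can observe the following:
  1) Function characters are used very frequently in Dream of the Red Chamber, including 了, 的, 不, 着, 也, 个, 又, 得, 就, 还, 之, and so on.
  Although function words are fewer than content words, their meanings are more complex. They generally modify content words and produce a variety of meanings in combination with them, so their role can only be understood and analyzed from their collocations in the text of the novel.
  2) Nouns also account for a large share, for example 人, 儿, 子, 玉, 宝, 贾, 家, 姐, 头, 母, 凤, 奶, 娘, 夫, 爷, 王, 姑, 黛, 丫, 妹, 薛, 妈, 姨, 女, and so on.
  These frequent nouns show that Dream of the Red Chamber revolves around people, chiefly the affairs of the four great families Jia (贾), Wang (王), Shi (史), and Xue (薛). The characters 宝, 玉, and 黛, which appear in the protagonists' names, are also frequent. From the relations among these nouns we can infer a large family with sons and daughters, in which 爷, 奶, 母, 姐, 妹, 姑, and 夫 are all present and female roles make up a large share. Combining 丫 with 头 to form 丫头 (maidservant) further suggests that the novel describes a wealthy household with many maids.
  3) High-frequency verbs: 来, 道, 是, 去, 有, 见, 笑, 听, 出, 知, 要, 看, 叫, 到, 死, and so on.
  It is hard to infer the main activities of the novel's characters from these verbs. Each may belong to several parts of speech in the text, so a single character in isolation is ambiguous and its exact meaning cannot be determined. The relationship between these verbs and the content of the novel therefore has to be analyzed in context.
  4) Some other nouns are also frequent, such as 香, 月, 云, 花, and 春. These characters suggest that Dream of the Red Chamber contains a good deal of poetic imagery and romantic love.
  These high-frequency characters also hint at the author's habits of language use. In addition, computing character frequencies separately for each chapter can, to some extent, reveal subtle shifts in the story and trace the thematic thread of the novel.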
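  Whether two frequent characters such as 丫 and 头 really occur together can be checked by counting adjacent character pairs. The following is a minimal sketch of that idea, not part of the original study, and the function names are hypothetical.

```python
from collections import Counter


def bigram_counts(text: str) -> Counter:
    """Count every pair of adjacent characters in text."""
    return Counter(a + b for a, b in zip(text, text[1:]))


def follow_ratio(text: str, first: str, second: str) -> float:
    """Fraction of occurrences of `first` that are immediately followed by `second`."""
    occurrences = text.count(first)
    if occurrences == 0:
        return 0.0
    return bigram_counts(text)[first + second] / occurrences


if __name__ == "__main__":
    sample = "丫头们叫丫头,小丫鬟"
    print(bigram_counts(sample)["丫头"])
    print(follow_ratio(sample, "丫", "头"))
```

  A high ratio on the full text would support reading the two characters as parts of one word rather than as independent high-frequency items.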
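  Per-chapter counting of this kind might look like the sketch below. It rests on an assumption about the corpus file that the paper does not state: each chapter begins on a new line with a heading of the form 第…回 written in Chinese numerals. Any text before the first heading is ignored, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: chapter headings start a line and look like 第一回, 第二回, ..., 第一百二十回.
CHAPTER_HEADING = re.compile(r"^\s*第[零〇一二三四五六七八九十百]+回", re.MULTILINE)


def split_chapters(text: str):
    """Split text into chapters at each heading; return [text] if none is found."""
    starts = [m.start() for m in CHAPTER_HEADING.finditer(text)]
    if not starts:
        return [text]
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


def chapter_profile(text: str, chars: str):
    """For each chapter, count how often each character in `chars` occurs."""
    profile = []
    for chapter in split_chapters(text):
        counts = Counter(chapter)
        profile.append({ch: counts[ch] for ch in chars})
    return profile


if __name__ == "__main__":
    sample = "第一回 宝玉笑道\n宝玉去了\n第二回 黛玉来了\n"
    for number, row in enumerate(chapter_profile(sample, "宝黛"), start=1):
        print(number, row)
```

  Plotting such per-chapter counts for characters like 宝 or 黛 is one way to follow when particular figures move into or out of the foreground of the story.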
  
  IV. Conclusion
  
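  The statistics above show that although high-frequency characters are few, they play a decisive role in conveying the story. High-frequency characters here are those that each account for more than 0.1% of the novel's Chinese character occurrences. Together they make up nearly 70% of the text, so they can reasonably stand for the novel's high-frequency characters as a whole. Few as they are, they form the core of the novel's meaning. From them one can roughly infer the main characters and the general direction of the content, but more precise conclusions require returning to the text for further analysis and more detailed data. Frequency alone is not enough.
  Most of these characters are nouns, verbs, pronouns, and particles, which reflects how the novel uses the written language. Their usage has changed little over time, and they have largely remained high-frequency characters.
  As a next step, we will extend this research from characters to words and from Dream of the Red Chamber to other classic works, identify what these works share and where they differ, and on that basis summarize regularities in the development of the language and explore the close connection between vocabulary and plot.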
  
  (那日松 吉日嘎拉, School of Broadcasting and Hosting Arts, Communication University of China)