Abstract: Building a domain corpus is the foundation for developing industry-specific artificial intelligence, and word frequency analysis is a key step in building such a corpus. Taking the Aeronautical Information Publication (AIP), which is widely used in civil aviation, as the corpus source, this paper proposes a word frequency analysis method for aeronautical information corpora based on text feature extraction. First, building on an analysis of the original structure of the AIP, a domain dictionary is added to improve word segmentation. Next, the word frequency g-index method is applied to the word frequency statistics to determine the high-frequency word threshold, and co-word clustering analysis is performed on the extracted results. The method is validated on AIP samples, and the experimental results show that it can effectively extract text features from the AIP, laying a foundation for applications of the domain corpus.
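The thresholding and co-word steps named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the common adaptation of the g-index to word frequencies (terms ranked by descending frequency, with g the largest rank whose top-g cumulative frequency is at least g squared), it assumes the text has already been segmented into tokens, and it stops at counting co-occurrences without performing the clustering itself. All function names and sample tokens are hypothetical.

```python
from collections import Counter
from itertools import combinations


def g_index_threshold(freqs):
    """Largest rank g whose top-g cumulative frequency is >= g**2.

    Assumed definition: the g-index adapted to word frequencies.
    """
    ordered = sorted(freqs, reverse=True)
    g, cumulative = 0, 0
    for rank, freq in enumerate(ordered, start=1):
        cumulative += freq
        if cumulative >= rank * rank:
            g = rank
        else:
            # With frequencies in descending order, the condition
            # cannot hold again once it has failed.
            break
    return g


def high_frequency_terms(tokenized_docs):
    """Return the top-g terms, where g is the word frequency g-index."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    g = g_index_threshold(counts.values())
    return [term for term, _ in counts.most_common(g)]


def cooccurrence_counts(tokenized_docs, terms):
    """Count how many documents contain each pair of the given terms."""
    keep = set(terms)
    pairs = Counter()
    for doc in tokenized_docs:
        present = sorted(keep.intersection(doc))
        pairs.update(combinations(present, 2))
    return pairs


# Hypothetical, already-segmented sample (not taken from any real AIP).
sample_docs = [
    ["runway", "runway", "taxiway", "apron"],
    ["runway", "taxiway", "NOTAM"],
    ["runway", "taxiway", "apron", "runway"],
]
top_terms = high_frequency_terms(sample_docs)
pair_counts = cooccurrence_counts(sample_docs, top_terms)
```

The resulting pair counts form a co-word matrix that a clustering routine could take as input. Each pair is stored in alphabetical order, so it should be looked up that way, for example `pair_counts[("runway", "taxiway")]`.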
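Basic information:
DOI:
Chinese Library Classification (CLC) number: TP391.1; V35
Citation:
[1] Lai Xin, Zhang Hengyan, Feng Jiayu, et al. Word frequency analysis of aeronautical information corpus based on text feature extraction [J]. Journal of Civil Aviation Flight University of China, 2025, 36(03): 21-26.
Funding:
Natural Science Foundation of Sichuan Province (2023NSFSC0903); Fundamental Research Funds for the Central Universities, Key Project (ZJ2023-003)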