Research Progress on Named Entity Recognition in Chinese Electronic Medical Records (CNER)

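    Chinese clinical named entity recognition (Chinese-CNER) for electronic medical records aims to identify and extract clinically relevant entity mentions from the plain text of a given electronic medical record, and to classify them into predefined categories. I recently put the CNER research progress I had collected and organized on GitHub. It mainly includes a list of Chinese-CNER papers and the state-of-the-art results on each of the main datasets. I hope it helps readers interested in CNER.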

GitHub repository: https://github.com/lingluodlut/Chinese-BioNLP
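To make the task definition concrete, here is a minimal sketch of turning per-character labels into entity mentions. It assumes a character-level BIO tagging scheme, which is a common choice for Chinese NER but is not prescribed by this post. The function name `decode_bio`, the tag names `SYMPTOM` and `EXAM`, and the example sentence are all hypothetical.

```python
from typing import List, Optional, Tuple

# (start, end_exclusive, category, text); an illustrative layout, not a standard one
Entity = Tuple[int, int, str, str]

def decode_bio(chars: List[str], tags: List[str]) -> List[Entity]:
    """Collect entity mentions from character-level BIO tags."""
    entities: List[Entity] = []
    start: Optional[int] = None
    label: Optional[str] = None
    # A trailing "O" sentinel flushes an entity that ends on the last character.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            # The open entity ends just before position i.
            entities.append((start, i, label, "".join(chars[start:i])))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return entities

# Hypothetical example: "患者頭痛行CT檢查" (the patient has a headache and undergoes a CT scan)
chars = list("患者頭痛行CT檢查")
tags = ["O", "O", "B-SYMPTOM", "I-SYMPTOM", "O",
        "B-EXAM", "I-EXAM", "I-EXAM", "I-EXAM"]
mentions = decode_bio(chars, tags)
```

In this toy example, `mentions` holds two entities: the symptom "頭痛" at characters 2 to 4 and the examination "CT檢查" at characters 5 to 9. An `I-` tag with no matching open entity is silently ignored in this sketch; a real system might handle that case differently.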

Papers on Entity Recognition in Chinese Electronic Medical Records

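    Many methods have been proposed for entity recognition in Chinese electronic medical records. This work concentrates on exploring domain-specific features: building on general-domain NER methods, it studies Chinese character features, electronic medical record knowledge features, and similar signals to improve model performance.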

Survey papers

  1. 電子病歷命名實體識別和實體關係抽取研究綜述 [A survey of named entity recognition and entity relation extraction in electronic medical records]. 楊錦鋒, 於秋濱, 關毅 et al. 自動化學報 (Acta Automatica Sinica), 2014, 40(8):1537-1561.
  2. 中文電子病歷的命名實體識別研究進展 [Research progress on named entity recognition in Chinese electronic medical records]. 楊飛洪, 張宇, 覃露 et al. 中國數字醫學 (China Digital Medicine), 2020, 15(02):9-12.
  3. Overview of CCKS 2018 Task 1: Named Entity Recognition in Chinese Electronic Medical Records. Zhang J, Li J, Jiao Z, et al. In China Conference on Knowledge Graph and Semantic Computing, Springer, 2019:158-164.
  4. Overview of the CCKS 2019 Knowledge Graph Evaluation Track: Entity, Relation, Event and QA. Han X, Wang Z, Zhang J, et al. arXiv preprint, 2020, arXiv:2003.03875.

Method papers

  1. HITSZ_CNER: a hybrid system for entity recognition from Chinese clinical text. Hu J, Shi X, Liu Z, et al. Proceedings of the Evaluation Tasks at the China Conference on Knowledge Graph and Semantic Computing (CCKS 2017), Chengdu, China, 2017:1-6.
  2. Clinical named entity recognition from Chinese electronic health records via machine learning methods. Zhang Y, Wang X, Hou Z, et al. JMIR Medical Informatics, 2018, 6(4):e50.
  3. A BiLSTM-CRF Method to Chinese Electronic Medical Record Named Entity Recognition. Ji B, Liu R, Li S, et al. In Proceedings of the 2018 International Conference on Algorithms, Computing and Artificial Intelligence, 2018:1-6.
  4. A multitask bi-directional RNN model for named entity recognition on Chinese electronic medical records. Chowdhury S, Dong X, Qian L, et al. BMC Bioinformatics, 2018, 19(17):75-84.
  5. A Conditional Random Fields Approach to Clinical Name Entity Recognition. Yang X, Huang W. Proceedings of the Evaluation Tasks at the China Conference on Knowledge Graph and Semantic Computing (CCKS 2018), Tianjin, China, 2018:1-6.
  6. DUTIR at the CCKS-2018 Task1: A Neural Network Ensemble Approach for Chinese Clinical Named Entity Recognition. Luo L, Li N, Li S, et al. Proceedings of the Evaluation Tasks at the China Conference on Knowledge Graph and Semantic Computing (CCKS 2018), Tianjin, China, 2018:1-6.
  7. Incorporating dictionaries into deep neural networks for the chinese clinical named entity recognition. Wang Q, Zhou Y, Ruan T, et al. Journal of Biomedical Informatics, 2019, 92:103133.
  8. A hybrid approach for named entity recognition in Chinese electronic medical record. Ji B, Liu R, Li S, et al. BMC Medical Informatics and Decision Making, 2019, 19(2):149-158.
  9. Chinese Clinical Named Entity Recognition Using Residual Dilated Convolutional Neural Network with Conditional Random Field. Qiu J, Zhou Y, Wang Q, et al. IEEE Transactions on NanoBioscience, 2019, 18(3):306-315.
  10. An attention-based deep learning model for clinical named entity recognition of Chinese electronic medical records. Li L, Zhao J, Hou L, et al. BMC Medical Informatics and Decision Making, 2019, 19(5):1-1.
  11. Chinese clinical named entity recognition with word-level information incorporating dictionaries. Lu N, Zheng J, Wu W, et al. In 2019 International Joint Conference on Neural Networks (IJCNN), 2019:1-8.
  12. Fine-tuning BERT for joint entity and relation extraction in Chinese medical text. Xue K, Zhou Y, Ma Z, et al. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2019:892-897.
  13. Chinese clinical named entity recognition with radical-level feature and self-attention mechanism. Yin M, Mou C, Xiong K, et al. Journal of Biomedical Informatics, 2019, 98:103289.
  14. Adversarial training based lattice LSTM for Chinese clinical named entity recognition. Zhao S, Cai Z, Chen H, et al. Journal of Biomedical Informatics, 2019, 99:103290.
  15. 基於句子級 Lattice-長短記憶神經網路的中文電子病歷命名實體識別 [Named entity recognition in Chinese electronic medical records based on a sentence-level lattice long short-term memory network]. 潘璀然, 王青華, 湯步洲 et al. 第二軍醫大學學報 (Academic Journal of Second Military Medical University), 2019, 40(05):497-507.
  16. 基於BERT與模型融合的醫療命名實體識別 [Medical named entity recognition based on BERT and model ensembling]. 喬銳, 楊笑然, 黃文亢. Proceedings of the Evaluation Tasks at the China Conference on Knowledge Graph and Semantic Computing (CCKS 2019).
  17. Noisy Label Learning for Chinese Medical Named Entity Recognition Based on Uncertainty Strategy. Li Z, Gan Z, Zhang B, et al. Proceedings of the Evaluation Tasks at the China Conference on Knowledge Graph and Semantic Computing (CCKS 2020).
  18. 基於BERT與字形字音特徵的醫療命名實體識別 [Medical named entity recognition based on BERT with glyph and pronunciation features]. 晏陽天, 趙新宇, 吳賢. Proceedings of the Evaluation Tasks at the China Conference on Knowledge Graph and Semantic Computing (CCKS 2020).
  19. Cross domains adversarial learning for Chinese named entity recognition for online medical consultation. Wen G, Chen H, Li H, et al. Journal of Biomedical Informatics, 2020, 112:103608.
  20. Chinese medical named entity recognition based on multi-granularity semantic dictionary and multimodal tree. Wang C, Wang H, Zhuang H, et al. Journal of Biomedical Informatics, 2020, 111:103583.
  21. Chinese Clinical Named Entity Recognition in Electronic Medical Records: Development of a Lattice Long Short-Term Memory Model With Contextualized Character Representations. Li Y, Wang X, Hui L, et al. JMIR Medical Informatics, 2020, 8(9):e19848.
  22. Chinese clinical named entity recognition with variant neural structures based on BERT methods. Li X, Zhang H, Zhou XH. Journal of Biomedical Informatics, 2020, 107:103422.
  23. 融入語言模型和注意力機制的臨床電子病歷命名實體識別 [Named entity recognition in clinical electronic medical records incorporating a language model and an attention mechanism]. 唐國強, 高大啟, 阮彤 et al. 電腦科學 (Computer Science), 2020, 47(03):211-216.
  24. 基於筆畫ELMo和多工學習的中文電子病歷命名實體識別研究 [Named entity recognition in Chinese electronic medical records based on stroke ELMo and multi-task learning]. 羅凌, 楊志豪, 宋雅文 et al. 計算機學報 (Chinese Journal of Computers), 2020, 43(10):1943-1957.

  

Performance of Existing Methods for Chinese EMR Entity Recognition

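    This section lists the datasets for the Chinese EMR entity recognition task and the performance of systems on each of them. Publicly available annotated Chinese EMR data is very scarce. To advance CNER systems on Chinese clinical text, the China Conference on Knowledge Graph and Semantic Computing (CCKS) has organized a named entity recognition evaluation task for Chinese electronic medical records in each of the past several years. Below we focus on results on the CCKS CNER datasets.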

  • CCKS 2017
  • CCKS 2018
  • CCKS 2019
  • CCKS 2020

CCKS 2017

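CCKS17 dataset: The original dataset is split into a training set and a test set. The training set contains 300 medical records, manually annotated with five entity types (symptoms and signs, examinations and tests, diseases and diagnoses, treatments, and body parts). The test set contains 100 medical records.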

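Corpus statistics

| | Symptoms and signs | Examinations and tests | Diseases and diagnoses | Treatments | Body parts | Total |
|---|---|---|---|---|---|---|
| Training set | 7,831 | 9,546 | 722 | 1,048 | 10,719 | 29,866 |
| Test set | 2,311 | 3,143 | 553 | 465 | 3,021 | 9,493 |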

Performance comparison of existing methods (F-score, %)

| Method | Symptoms and signs | Examinations and tests | Diseases and diagnoses | Treatments | Body parts | Overall | Paper |
|---|---|---|---|---|---|---|---|
| HIT-CNER (Hu et al., 2017) Top1 | 96.00 | 94.43 | 78.97 | 81.47 | 87.48 | 91.14 | HITSZ_CNER: a hybrid system for entity recognition from Chinese clinical text |
| BiLSTM-CRF-DIC (Wang et al., 2019) | - | - | - | - | - | 91.24 | Incorporating dictionaries into deep neural networks for the chinese clinical named entity recognition |
| RD-CNN-CRF (Qiu et al., 2019) | - | - | - | - | - | 91.32 | Chinese Clinical Named Entity Recognition Using Residual Dilated Convolutional Neural Network with Conditional Random Field |
| Tang et al. (2019) | - | - | - | - | - | 91.34 | 融入語言模型和注意力機制的臨床電子病歷命名實體識別 |
| PDET Feature in Model-II (Lu et al., 2019) | - | - | - | - | - | 92.68 | Chinese Clinical Named Entity Recognition with Word-Level Information Incorporating Dictionaries |
| BiLSTM-CRF-SP+ELMo (Luo et al., 2020) | 95.37 | 94.94 | 81.13 | 83.32 | 88.74 | 91.75 | 基於筆畫ELMo和多工學習的中文電子病歷命名實體識別研究 |
| FT-BERT + BiLSTM + CRF + Fea (Li et al., 2020) | 96.57 | 94.09 | 81.26 | 82.62 | 88.37 | 91.60 | Chinese clinical named entity recognition with variant neural structures based on BERT methods |

Note: "Top" marks a system that ranked in the top three of the evaluation at the time.
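The F-scores reported for these systems are entity-level scores. As a rough sketch of how such a number could be computed, the snippet below assumes strict matching, where a predicted entity counts as correct only if its record, span boundaries, and category all match a gold annotation. This post does not state which matching criterion each paper used, so treat that as an assumption. The function name `strict_f1` and the toy data are hypothetical.

```python
from typing import Set, Tuple

# (record_id, start, end_exclusive, category); an illustrative layout
Entity = Tuple[str, int, int, str]

def strict_f1(gold: Set[Entity], pred: Set[Entity]) -> float:
    """Entity-level F1 in percent under exact span-and-category matching."""
    true_pos = len(gold & pred)
    if true_pos == 0:
        # Covers empty inputs and the no-overlap case without dividing by zero.
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 100.0 * 2 * precision * recall / (precision + recall)

# Toy data (assumed): three gold entities, four predictions, two exact matches.
gold = {("r1", 2, 4, "SYMPTOM"), ("r1", 5, 9, "EXAM"), ("r2", 0, 3, "DRUG")}
pred = {("r1", 2, 4, "SYMPTOM"), ("r1", 5, 8, "EXAM"),
        ("r2", 0, 3, "DRUG"), ("r2", 6, 8, "BODY")}
score = strict_f1(gold, pred)
```

Here precision is 2/4 and recall is 2/3, so `score` comes out at about 57.14. The prediction `("r1", 5, 8, "EXAM")` gets no credit under strict matching because its right boundary is off by one character. A relaxed, overlap-based criterion would count it, which is one reason scores from different papers are comparable only when the criterion is the same.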

CCKS 2018

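CCKS18 dataset: The original dataset includes a training set and a test set. The training set contains 600 medical records, manually annotated with five entity types (anatomical parts, symptom descriptions, independent symptoms, drugs, and surgeries). The test set contains 400 raw medical records.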

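Corpus statistics

| | Anatomical parts | Symptom descriptions | Independent symptoms | Drugs | Surgeries | Total |
|---|---|---|---|---|---|---|
| Training set | 9,472 | 2,484 | 3,712 | 1,221 | 1,329 | 18,218 |
| Test set | 6,339 | 918 | 1,327 | 813 | 735 | 10,132 |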

Performance comparison of existing methods (F-score, %)

| Method | Anatomical parts | Symptom descriptions | Independent symptoms | Drugs | Surgeries | Overall | Paper |
|---|---|---|---|---|---|---|---|
| Alihealth Lab (Yang and Huang, 2018) Top1 | 87.97 | 90.59 | 92.45 | 94.49 | 85.43 | 89.13 | A Conditional Random Fields Approach to Clinical Name Entity Recognition |
| DUTIR (Luo et al., 2018) Top3 | 87.59 | 90.77 | 91.72 | 91.53 | 86.41 | 88.63 | DUTIR at the CCKS-2018 Task1: A Neural Network Ensemble Approach for Chinese Clinical Named Entity Recognition |
| BiLSTM-CRF (Ji et al., 2018) | 86.65 | 89.13 | 90.69 | 91.15 | 85.61 | 87.68 | A BiLSTM-CRF Method to Chinese Electronic Medical Record Named Entity Recognition |
| Lattice-LSTM (潘璀然 et al., 2019) | - | - | - | - | - | 89.75 | 基於句子級 Lattice-長短記憶神經網路的中文電子病歷命名實體識別 |
| Attention-BiLSTM-CRF + all (Ji et al., 2019) | - | - | - | - | - | 90.82 | A hybrid approach for named entity recognition in Chinese electronic medical record |
| MSD_DT_NER (Wang et al., 2020) | 88.01 | 92.57 | 90.71 | 94.58 | 85.62 | 89.88 | Chinese medical named entity recognition based on multi-granularity semantic dictionary and multimodal tree |
| BiLSTM-CRF-SP+ELMo (Luo et al., 2020) | 89.69 | 91.83 | 92.01 | 91.30 | 86.22 | 90.05 | 基於筆畫ELMo和多工學習的中文電子病歷命名實體識別研究 |
| FT-BERT + BiLSTM + CRF + Fea (Li et al., 2020) | 89.12 | 90.66 | 92.94 | 87.99 | 87.59 | 89.56 | Chinese clinical named entity recognition with variant neural structures based on BERT methods |

Note: "Top" marks a system that ranked in the top three of the evaluation at the time.

CCKS 2019

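CCKS19 dataset: The original dataset includes a training set and a test set. The training set contains 1,000 medical records, manually annotated with six entity types (diseases and diagnoses, examinations, laboratory tests, surgeries, drugs, and anatomical parts). The test set contains 379 raw medical records.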

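Corpus statistics (number of unique entities)

| | Diseases and diagnoses | Examinations | Laboratory tests | Surgeries | Drugs | Anatomical parts | Total |
|---|---|---|---|---|---|---|---|
| Training set | 2,116 | 222 | 318 | 765 | 456 | 1,486 | 5,363 |
| Test set | 682 | 91 | 193 | 140 | 263 | 447 | 1,816 |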

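Performance comparison of existing methods (F-score, %)

| Method | Diseases and diagnoses | Examinations | Laboratory tests | Surgeries | Drugs | Anatomical parts | Overall | Paper |
|---|---|---|---|---|---|---|---|---|
| Alihealth (喬銳 et al., 2019) Top1 | 84.29 | 86.29 | 76.94 | 83.33 | 96.02 | 86.18 | 85.62 | 基於BERT與模型融合的醫療命名實體識別 |
| MSIIP (Liu et al., 2019) Top2 | - | - | - | - | - | - | 85.59 | Team MSIIP at CCKS 2019 Task 1 |
| DUTIR (Li et al., 2019) Top3 | 82.81 | 88.01 | 75.65 | 86.79 | 94.49 | 85.99 | 85.16 | DUTIR at the CCKS-2019 Task 1: Improving Chinese clinical named entity recognition using stroke ELMo and transfer learning |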

Note: "Top" marks a system that ranked in the top three of the evaluation at the time.

CCKS 2020

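CCKS20 dataset: The original dataset includes a training set and a test set. The training set contains 1,050 medical records, manually annotated with six entity types (diseases and diagnoses, examinations, laboratory tests, surgeries, drugs, and anatomical parts). The test set has not been released.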

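Corpus statistics

| | Diseases and diagnoses | Examinations | Laboratory tests | Surgeries | Drugs | Anatomical parts | Total |
|---|---|---|---|---|---|---|---|
| Training set | 4,345 | 1,002 | 1,297 | 923 | 1,935 | 8,811 | 18,313 |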

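Performance comparison of existing methods (F-score, %)

| Method | Diseases and diagnoses | Examinations | Laboratory tests | Surgeries | Drugs | Anatomical parts | Overall | Paper |
|---|---|---|---|---|---|---|---|---|
| CASIA_Unisound (Li et al., 2020) Top1 | 90.93 | 89.96 | 85.94 | 94.85 | 93.56 | 91.62 | 91.56 | Noisy Label Learning for Chinese Medical Named Entity Recognition Based on Uncertainty Strategy |
| TMAIL (晏陽天 et al., 2020) Top2 | 90.53 | 88.47 | 83.50 | 96.21 | 93.75 | 92.00 | 91.54 | 基於BERT與字形字音特徵的醫療命名實體識別 |
| ChiEHRBert (楊文明 et al., 2020) Top3 | 91.10 | 88.62 | 85.71 | 95.52 | 92.93 | 91.16 | 91.24 | 基於 ChiEHRBert 與多模型融合的醫療命名實體識別 |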

Note: "Top" marks a system that ranked in the top three of the evaluation at the time.
