[1] YAO Lin, LIU Yi, LI Xinxin, et al. Chinese named entity recognition via word boundary based character embedding[J]. CAAI Transactions on Intelligent Systems, 2016, 11(1): 37-42. [doi:10.11992/tis.201507065]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume:
11
Issue:
2016, No. 1
Page number:
37-42
Column:
Academic Papers: Natural Language Processing and Understanding
Publication date:
2016-02-25
- Title:
-
Chinese named entity recognition via word boundary based character embedding
- Author(s):
-
YAO Lin (1, 2, 3); LIU Yi (1); LI Xinxin (4); LIU Hong (2)
-
1. Shenzhen High-Tech Industrial Park, Shenzhen 518057, China;
2. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China;
3. School of Software, Harbin Institute of Technology, Harbin 150001, China;
4. School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, China
-
- Keywords:
-
machine learning; Chinese named entity recognition; deep neural networks; feature vector; feature extraction
- CLC:
-
TP391.1
- DOI:
-
10.11992/tis.201507065
- Abstract:
-
Most machine-learning-based Chinese named entity recognition systems rely on a large number of manually extracted features, and extracting those features is time-consuming and laborious. To remove this dependence on feature extraction, this paper presents a Chinese named entity recognition system based on word boundary based character embedding. The method automatically extracts feature information from a large amount of unlabeled data and generates word feature vectors, which are then used to train a neural network. Because Chinese characters are not the most basic unit of Chinese semantics, a simple character vector can cause semantic ambiguity. Observing that the same character may have different meanings at different positions within a word, this paper proposes a character vector method that incorporates word boundary information, constructs a deep neural network system for Chinese named entity recognition, and achieves an F1 score of 89.18% on the SIGHAN Bakeoff-3 2006 MSRA corpus. The result is close to state-of-the-art performance and shows that the system can avoid relying on feature extraction and reduce character ambiguity.
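The idea of attaching word boundary information to characters can be illustrated with a minimal sketch. The abstract does not specify the tagging scheme, so the B/M/E/S position tags (begin, middle, end, single), the `boundary_tagged_chars` and `build_vocab` helpers, and the `char_TAG` token format below are all assumptions made for illustration, not the authors' implementation. The sketch also assumes the text has already been segmented into words.

```python
from typing import Dict, List


def boundary_tagged_chars(words: List[str]) -> List[str]:
    """Turn a segmented sentence into position-tagged character tokens.

    Assumed tag set (hypothetical, not taken from the paper):
    B = first character of a word, M = middle, E = last, S = single-character word.
    """
    tokens: List[str] = []
    for word in words:
        if len(word) == 1:
            tokens.append(f"{word}_S")
            continue
        last = len(word) - 1
        for i, ch in enumerate(word):
            if i == 0:
                tag = "B"
            elif i == last:
                tag = "E"
            else:
                tag = "M"
            tokens.append(f"{ch}_{tag}")
    return tokens


def build_vocab(sentences: List[List[str]]) -> Dict[str, int]:
    """Assign one embedding row index per (character, position) token."""
    vocab: Dict[str, int] = {}
    for words in sentences:
        for token in boundary_tagged_chars(words):
            vocab.setdefault(token, len(vocab))
    return vocab


if __name__ == "__main__":
    # The character 大 appears word-initially in 大学 and word-finally in 强大,
    # so it maps to two distinct embedding indices.
    vocab = build_vocab([["大学", "强大"]])
    print(vocab)
```

With this kind of keying, an embedding table trained on unlabeled segmented text would learn separate vectors for the same character in different word positions, which is the mechanism the abstract credits with reducing character ambiguity.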