[1]QU Zhaowei,WU Chunye,WANG Xiaoru.Aspects extraction based on semi-supervised self-training[J].CAAI Transactions on Intelligent Systems,2019,14(4):635-641.[doi:10.11992/tis.201806006]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
Volume:
14
Issue:
2019, No. 4
Page number:
635-641
Column:
Academic Papers - Machine Learning
Publication date:
2019-07-02
- Title:
Aspects extraction based on semi-supervised self-training
- Author(s):
QU Zhaowei¹; WU Chunye¹; WANG Xiaoru²
1. Institute of Network Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China;
2. College of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
- Keywords:
aspect extraction; word vector; semi-supervised; self-training; unlabeled data; opinion mining; seed words; similar words
- CLC:
TP18
- DOI:
10.11992/tis.201806006
- Abstract:
Aspect extraction is a key step in opinion mining and sentiment analysis. With the development of social networks, users increasingly base their decisions on review information and pay more attention to the fine-grained information in comments. It is therefore important to support these decisions by quickly mining information from massive volumes of comments. Most topic-based models and clustering methods perform poorly in terms of the consistency of the extracted aspects. Traditional supervised learning methods work well, but they require a large amount of annotated text as training data, and labeling text is labor-intensive. To address these issues, this paper proposes a method for aspect extraction based on semi-supervised self-training (AESS). The method takes advantage of the large amount of unlabeled data available on the web. A word vector model is used to find words in the unlabeled datasets that are similar to seed words, and multiple aspect word sets that are most related to the dataset are constructed. The approach avoids large-scale text annotation and makes full use of unlabeled data, and it performs well on both Chinese and English datasets.
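
The seed-word expansion described in the abstract can be illustrated with a small sketch. The abstract does not specify the similarity measure, the acceptance threshold, or the iteration scheme, so everything below is an assumption made for illustration: cosine similarity, a fixed threshold of 0.9, repeated rounds until no new word is added, and hand-written toy vectors standing in for a trained word vector model. The names `cosine` and `expand_aspects` are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_aspects(vectors, seeds, threshold=0.9, max_rounds=5):
    """Grow one word set per aspect, starting from its seed words.

    vectors: dict mapping word -> list of floats (assumed word vectors)
    seeds:   dict mapping aspect name -> list of seed words
    A word joins the aspect whose current members it is most similar to,
    provided that similarity reaches the (assumed) threshold.
    """
    aspect_sets = {aspect: set(words) for aspect, words in seeds.items()}
    for _ in range(max_rounds):
        assigned = set().union(*aspect_sets.values())
        added = False
        for word, vec in vectors.items():
            if word in assigned:
                continue
            best_aspect, best_score = None, threshold
            for aspect, members in aspect_sets.items():
                # Score against the closest word already in this aspect set.
                score = max(
                    (cosine(vec, vectors[m]) for m in members if m in vectors),
                    default=0.0,
                )
                if score >= best_score:
                    best_aspect, best_score = aspect, score
            if best_aspect is not None:
                aspect_sets[best_aspect].add(word)
                added = True
        if not added:
            break  # no new words this round, so the sets have stabilized
    return aspect_sets


# Toy vectors (illustrative only, not learned from data).
vectors = {
    "battery": [1.0, 0.0, 0.0],
    "charge": [0.95, 0.1, 0.0],
    "power": [0.9, 0.3, 0.0],
    "screen": [0.0, 1.0, 0.0],
    "display": [0.1, 0.95, 0.0],
    "resolution": [0.3, 0.9, 0.0],
    "price": [0.0, 0.0, 1.0],
}
seeds = {"battery": ["battery"], "screen": ["screen"]}
aspect_sets = expand_aspects(vectors, seeds)
```

With these toy inputs, "charge" and "power" join the battery set, "display" and "resolution" join the screen set, and "price" stays unassigned because it is not close to any seed. In practice the vectors would come from a word vector model trained on the unlabeled review corpus, and the threshold would need tuning per dataset.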