NLP – Understanding Scientific Language
Project Goal
This use case covers the analysis of scientific articles with Natural Language Processing (NLP) in order to distinguish different types of gene mutations. Currently, this classification of gene mutations is done manually; we developed an algorithm that automates it.
Dataset Used
A training dataset with 3321 samples and a test dataset with 368 samples were used. The training data contain full-text scientific publications together with the corresponding gene and its mutation class; the genes are divided into 9 mutation classes. The test data have the same structure as the training data, but without the mutation class.
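The layout described above can be illustrated with a small synthetic example. The field names (`gene`, `text`, `mutation_class`) and all values here are assumptions for illustration, not the actual dataset schema:

```python
# Minimal sketch of the assumed dataset layout (synthetic values, not real data).
train_samples = [
    {"gene": "BRCA1", "text": "The missense variant showed loss of function ...", "mutation_class": 4},
    {"gene": "TP53",  "text": "Truncating mutations were observed in ...",        "mutation_class": 1},
]

# Test samples share the layout but lack the label:
test_samples = [
    {"gene": "EGFR", "text": "The in-frame deletion confers sensitivity to ..."},
]

# Distinct labels present in this toy set (the real data span 9 classes).
classes = sorted({s["mutation_class"] for s in train_samples})
```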
Challenges
The classes of the training data are highly imbalanced. The number of training samples is relatively small for a 9-class NLP classification task. The usefulness of pre-trained models may be limited by the high specificity of the scientific literature.
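One common way to counteract such imbalance is inverse-frequency class weighting. The sketch below uses a made-up label distribution (the real class frequencies are not stated in this summary) to show the "balanced" weighting heuristic n / (k * count_c):

```python
from collections import Counter

# Hypothetical, made-up label distribution illustrating the imbalance.
labels = [1] * 900 + [2] * 450 + [3] * 80 + [7] * 950 + [8] * 20

counts = Counter(labels)
n, k = len(labels), len(counts)

# Inverse-frequency ("balanced") class weights: rare classes get larger weights,
# which many classifiers accept to penalize errors on minority classes more.
weights = {c: n / (k * counts[c]) for c in counts}
```

By construction, the weighted count of every class is equal (counts[c] * weights[c] == n / k), so each class contributes equally to the loss.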
Applied Methods
For analysis, the texts are first split into lists of "tokens" (single words) and then mapped to numerical vectors, a step known as word embedding. Three vectorization methods were compared: Bag of Words, TF-IDF, and Word2Vec (both Google's pre-trained Word2Vec and a self-trained Word2Vec). For the subsequent classification, three machine learning algorithms were used and compared: logistic regression, random forest, and support vector machine. In addition, BERT, a pre-trained language model from Google, was used for both embedding and classification.
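The tokenization and TF-IDF steps can be sketched in plain Python. This is a simplified illustration of the technique, not the project's actual implementation (which would typically use a library such as scikit-learn); the whitespace tokenizer and the smoothed-idf variant shown here are assumptions:

```python
import math
from collections import Counter

def tokenize(text):
    # Naive lowercase whitespace tokenization; real pipelines use richer tokenizers.
    return text.lower().split()

def tfidf(corpus):
    """Return one {token: tf-idf weight} dict per document, using smoothed idf."""
    docs = [tokenize(t) for t in corpus]
    df = Counter(tok for d in docs for tok in set(d))   # document frequency per token
    n = len(docs)
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append({
            # term frequency * smoothed inverse document frequency
            t: (tf[t] / len(d)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t in tf
        })
    return vectors

corpus = [
    "missense mutation in brca1",
    "truncating mutation in tp53",
]
vecs = tfidf(corpus)
```

Tokens occurring in every document (here "mutation", "in") receive lower weights than tokens specific to one document (here "missense", "brca1"), which is the property that makes TF-IDF vectors useful inputs for the classifiers listed above.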
Project outcome
The result is a prediction of the genetic mutation class for each scientific article. The combination of self-trained Word2Vec and a random forest model performed best, achieving an accuracy of 63.5%. This information can also support physicians in diagnosis; early detection of disease enables timely treatment and can thereby increase the likelihood of a cure.

Category
NLP
Technologies
AI
BERT
tf-idf
Bag of Words