
Wednesday, October 20, 2021

Simple Rule Based Text Classification using Orange Data Mining

Typically, we reach for algorithms such as Decision Trees, SVM, Naive Bayes, or Logistic Regression even for simple classification tasks. How about using simple rules instead? For example, can't we look for specific words to decide whether a text carries a positive or a negative sentiment? Words like pathetic, worst, and poor suggest a negative sentiment, whereas great, fabulous, and superb suggest a positive one.
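To make the idea concrete before moving to Orange, here is a minimal Python sketch of such a rule. The keyword lists come from the examples above; the function name `classify` and the "unknown" label for ties are assumptions made for illustration, not part of the Orange workflow.

```python
import re

# Keyword lists taken from the examples in the text above.
NEGATIVE_WORDS = {"pathetic", "worst", "poor"}
POSITIVE_WORDS = {"great", "fabulous", "superb"}


def classify(text: str) -> str:
    """Label a text by comparing positive and negative keyword counts."""
    # Lowercase the text and split it into alphabetic tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in POSITIVE_WORDS for token in tokens)
    neg = sum(token in NEGATIVE_WORDS for token in tokens)
    # Assumption: ties (including texts with no keywords) are labelled "unknown".
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "unknown"


print(classify("A great film with a superb cast"))
print(classify("The worst, most pathetic plot"))
```

How ties are handled is a design choice. A real rule set could just as well default to one of the two classes.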

Let us see how we can use the Orange Data Mining tool to achieve the above objective. Following is the Orange Data Mining flow.


You can download the workflow from GitHub: dineshasanka/Orange-Data-Mining---Text-Analyitics (github.com)

Let us go through the workflow one component at a time.
1. Import Documents was used to load a film review data set.
2. Preprocess Text was used to convert the text to lowercase and remove URLs.
3. Statistics is the key component in this workflow. This is where you identify the keywords.


4. Then two aggregate columns were used to create the POS and NEG columns. These hold the summed counts of the positive and negative keywords, respectively.


5. At this point, this is the dataset.



6. Two feature constructors were introduced. If you are comfortable with Python, you can use a Python Script component instead.

7. Depending on the positive and negative keyword counts, we can introduce a new column, predicted, as follows.


8. Let us look at the confusion matrix from the Pivot Table component.

You can see that the overall accuracy is 70%, while the accuracy for negative sentiments is more than 85%.
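The pivot table in step 8 is essentially a cross-tabulation of actual labels against predicted labels. As a rough sketch of how the overall and per-class figures are derived, the Python below computes them from two label lists. The labels are made up for illustration and are not the film review results; the function names are also hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs, like a pivot table."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of rows where the predicted label matches the actual one."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def per_class_accuracy(actual, predicted, label):
    """Fraction of rows with the given actual label that were predicted correctly."""
    rows = [(a, p) for a, p in zip(actual, predicted) if a == label]
    return sum(a == p for a, p in rows) / len(rows)


# Made-up labels for illustration only; these are NOT the film review results.
actual = ["neg", "neg", "neg", "neg", "pos", "pos", "pos", "pos"]
predicted = ["neg", "neg", "neg", "pos", "pos", "pos", "neg", "neg"]

matrix = confusion_matrix(actual, predicted)
# Rows are actual labels, columns are predicted labels.
for a in ("neg", "pos"):
    print(a, [matrix[(a, p)] for p in ("neg", "pos")])
print("overall:", accuracy(actual, predicted))
print("negative:", per_class_accuracy(actual, predicted, "neg"))
```

Reporting the per-class figure alongside the overall one matters here, because a keyword rule can do much better on one class than on the other.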

This technique shows that you do not always need to rely on complex algorithms; a simple rule-based technique can already give you reasonable accuracy.
