
Monday, October 4, 2021

Orange Data Mining - Text Processing

Over the last few blog posts, we discussed how image processing can be used for different purposes with the popular Orange Data Mining tool. Now let us move on to text analytics, another complex data source.
The Text tab is not available in the Orange Data Mining tool by default; it has to be added by installing the Text add-on. Add-ons also receive updates from time to time, so they need to be updated periodically.

Now let us look at the basic functions that Orange Data Mining offers for text analytics.



Today let us use a few important features for text mining. As you can see, there are existing corpora (text datasets) that you can use, as shown below.


Further, if you have the proper access, you can extract data from sources such as the NY Times, PubMed, and Twitter.

Let us use a custom dataset to perform some simple text preprocessing, loading it with the Import Documents widget. The IMDB review dataset was used, as it has 2000 positive and negative reviews. You can download the relevant Orange Data Mining workflows from https://github.com/dineshasanka/Orange-Data-Mining---Text-Analyitics.git and follow the entire workflow.
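If you want to see what importing a folder of documents amounts to outside the tool, here is a minimal Python sketch. It is not how the Import Documents widget is implemented; it simply assumes the reviews are stored as .txt files in subfolders named after their class (for example, hypothetical pos and neg folders), and the function name load_documents is made up for illustration.

```python
from pathlib import Path


def load_documents(root):
    """Return a list of (label, text) pairs, one per .txt file under root.

    Assumption: each document sits in a subfolder whose name is its label,
    e.g. root/pos/review1.txt and root/neg/review2.txt.
    """
    documents = []
    # sorted() keeps the order stable between runs
    for path in sorted(Path(root).rglob("*.txt")):
        label = path.parent.name  # the folder name is used as the class label
        text = path.read_text(encoding="utf-8", errors="replace")
        documents.append((label, text))
    return documents
```

Calling load_documents on the folder that holds the reviews gives one (label, text) pair per file, which is roughly the shape of a labelled corpus.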


After the 2000 documents are imported, we need to apply some preprocessing techniques to clean the data. This is a very important task in text analytics, as text tends to contain many issues because of its free-form nature.


The above configuration covers the basic preprocessing techniques: tokenization, filtering, and transformation. In this example, we have not used the other available preprocessing techniques, such as n-grams, normalization, and POS tagging.
Tokenization decides how words are separated from sentences, and filtering decides which unnecessary words are removed. Here we used English stopwords together with a custom word list to remove the non-semantic words. The basic transformation consisted of removing URLs and converting the text to lowercase.
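To make these steps concrete, here is a rough Python sketch of the same idea using only the standard library. It is not what Orange runs internally: the stopword set is a tiny illustrative sample rather than a full English list, the custom words are hypothetical, and the regular expressions are simple assumptions about what counts as a URL or a word.

```python
import re

# Assumption: a simple pattern that is good enough for typical URLs
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

# Assumption: a very small sample of English stopwords, for illustration only
STOPWORDS = {"the", "a", "an", "and", "is", "it", "of", "to", "in", "this", "was"}

# Hypothetical custom list of non-semantic words for a movie review corpus
CUSTOM_STOPWORDS = {"movie", "film"}


def preprocess(text, extra_stopwords=CUSTOM_STOPWORDS):
    """Lowercase, strip URLs, tokenize, and filter out stopwords."""
    # Transformation: lowercase and remove URLs
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    # Tokenization: keep runs of word characters as tokens
    tokens = re.findall(r"\w+", text)
    # Filtering: drop general and custom stopwords
    return [t for t in tokens if t not in STOPWORDS and t not in extra_stopwords]


print(preprocess("This movie is GREAT, see http://example.com/review now!"))
# prints ['great', 'see', 'now']
```

Note that the sketch applies the transformation before tokenizing, so that a URL is removed as a whole instead of being split into several tokens first.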
Now you can visualize the words in the word cloud as shown in the following screen.
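A word cloud is essentially a picture of word frequencies: the more often a word occurs, the larger it is drawn. As a small sketch of the counting behind it (the function name word_frequencies is made up, and the input is assumed to be one list of already cleaned tokens per document):

```python
from collections import Counter


def word_frequencies(token_lists, top_n=10):
    """Count tokens across all documents and return the top_n most frequent."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(top_n)


# Two tiny example documents, already tokenized
docs = [["great", "plot"], ["great", "cast"]]
print(word_frequencies(docs, top_n=1))
# prints [('great', 2)]
```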


Apart from word cloud analysis, you can compute simple statistics on your text documents using the Statistics widget.
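As an illustration of the kind of per-document statistics meant here, the following sketch computes a word count, a character count, and an average word length in plain Python. The choice of these three measures and the function name document_statistics are assumptions for the example; the widget itself offers its own set of options.

```python
def document_statistics(text):
    """Return simple counts for one document, splitting words on whitespace."""
    words = text.split()
    word_count = len(words)
    # Characters are counted without the whitespace between words
    character_count = sum(len(word) for word in words)
    average_word_length = character_count / word_count if word_count else 0.0
    return {
        "word_count": word_count,
        "character_count": character_count,
        "average_word_length": average_word_length,
    }


print(document_statistics("a bb ccc"))
# prints {'word_count': 3, 'character_count': 6, 'average_word_length': 2.0}
```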

Then you can view the data either from a data table or from the feature statistics.  
