Sunday, October 24, 2021

Federalist Papers: Case for Naïve Bayes Text Classification

Alexander Hamilton, James Madison, and John Jay

The Federalist Papers is a collection of 85 articles and essays written by Alexander Hamilton, James Madison, and John Jay. After independence from Great Britain, many in 1787 were of the view that the 13 states should govern themselves independently. 

John Jay, James Madison, and Alexander Hamilton wrote letters independently to persuade the public that the US should have a strong central government alongside the individual state governments. Between 1787 and 1788 these papers were published under the pseudonym Publius. While the authorship of 73 of The Federalist essays is fairly certain, the identities of those who wrote the twelve remaining essays are disputed by some scholars. In 1963 this dispute was settled by Mosteller and Wallace using Bayesian methods.

Let us do a simple analysis of these papers by analysing their titles using the Orange Data Mining tool. You can retrieve the sample files and the Orange workflow from dineshasanka/FederalistPapersOrangeDataMining (github.com)

Following is the Orange Data Mining workflow; let us go through the important controls. 

After importing the CSV, the text was preprocessed and a word cloud was generated to identify the word distribution. 


A Bag of Words is used to identify the keywords. Then six classifiers are used: Neural Network, Naive Bayes, Decision Tree, Random Forest, SVM, and AdaBoost. Following are the evaluation results, which show that the Random Forest technique has the edge over the other techniques. 
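Outside Orange, the same bag-of-words plus classifier pipeline can be sketched in a few lines of Python with scikit-learn. This is a minimal sketch, not the workflow from the repository: the file name titles.csv and the column names title and author are assumptions about the dataset layout.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical CSV: one row per essay with its title and (claimed) author
df = pd.read_csv("titles.csv")

# Bag of words over the titles feeding a multinomial Naive Bayes classifier
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())

# Cross-validated accuracy, mirroring Orange's Test & Score widget
scores = cross_val_score(model, df["title"], df["author"], cv=3)
print(f"Mean accuracy: {scores.mean():.2f}")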

We can build the decision tree as shown below. 

Saturday, October 23, 2021

Monitor Your System Metrics With InfluxDB Cloud in Under One Minute

InfluxDB is a time series stack that provides end-to-end features to capture, store, and present time series data. 

Following is the InfluxDB 2.0 stack that covers all aspects of time series data.

InfluxDB 2.0 stack

Let us see how we can use InfluxDB 2.0 to monitor the operating system. 

First, you need to create an account in InfluxDB Cloud. 

Next, create a bucket to store the data. 


Creating a Bucket in InfluxDB

After the bucket is created, you need to configure the Telegraf plugin. 

Configuring the Plugin

Next, you need to set the environment variable and execute Telegraf using the following commands. 



Then you can create a query as follows. 
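The query itself is written in the InfluxDB Cloud UI, but the same data can also be pulled from code with the official influxdb-client Python package. This is a minimal sketch under assumptions: the bucket name system is a guess, and the URL, token, and org values are placeholders you would replace with your own.

from influxdb_client import InfluxDBClient

# Placeholder credentials; the bucket name "system" is an assumption
client = InfluxDBClient(url="https://cloud2.influxdata.com",
                        token="YOUR_TOKEN", org="your-org")

# Flux query: CPU idle percentage collected by Telegraf over the last hour
flux = '''
from(bucket: "system")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_idle")
'''

for table in client.query_api().query(flux):
    for record in table.records:
        print(record.get_time(), record.get_value())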


You can create the graph as shown below. 
 

Thursday, October 21, 2021

Resumable Index Rebuilding in SQL Server 2017

Index rebuilding is one of the important tasks in index maintenance. As you know, an Index Rebuild runs in a transaction, which means that if you abort it, you have to start all over again. Index Reorganize is not transactional, which means you can resume from where you left off. However, since an Index Rebuild resolves both internal and external index fragmentation, it is the better option to go with. 

As an Index Rebuild consumes a lot of system resources, many users would like to perform it in a scheduled manner. This is now possible with SQL Server 2017 and SQL Server 2019, through a feature called Resumable Index. 

Let us demonstrate this feature by creating a table and populating it with the following code.

CREATE TABLE SampleData (
    ID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
    Name CHAR(1000),
    AddressI CHAR(100)
);


INSERT INTO SampleData (Name, AddressI)
VALUES ('DINESH', 'AddressI');
-- GO <n> repeats the batch above n times, inserting 2,000,000 rows
GO 2000000
Now we can rebuild the index with the RESUMABLE option set to ON.

ALTER INDEX [PK__SampleD] ON [dbo].[SampleData]
REBUILD WITH (ONLINE = ON, RESUMABLE = ON);

Now you can PAUSE and RESUME the index rebuild. 

-- Pause the in-flight rebuild; progress so far is preserved
ALTER INDEX [PK__SampleD] ON [SampleData] PAUSE;

-- Resume from where the rebuild left off
ALTER INDEX [PK__SampleD] ON [SampleData] RESUME;

-- A paused rebuild can also be abandoned entirely
ALTER INDEX [PK__SampleD] ON [SampleData] ABORT;

After pausing the index, you can see what percentage of the index rebuild has completed, along with a few other details.

SELECT total_execution_time,
       percent_complete,
       name,
       state_desc,
       last_pause_time,
       page_count
FROM sys.index_resumable_operations;



However, an important fact to remember is that the transaction log is utilised until you complete the index rebuild. 

 

Wednesday, October 20, 2021

Simple Rule Based Text Classification using Orange Data Mining

Typically, we use algorithms such as Decision Trees, SVM, Naive Bayes, and Logistic Regression to perform simple classification. How about using simple rules instead? For example, can't we look at specific words to decide whether a sentiment is positive or negative? Words like pathetic, worst, and poor suggest negative sentiment, whereas great, fabulous, and superb suggest positive. 

Let us see how we can use the Orange Data Mining tool to achieve the above objective. Following is the Orange Data Mining workflow.


You can download the workflow from GitHub: dineshasanka/Orange-Data-Mining---Text-Analyitics (github.com) 

Let us go through the widgets one by one. 
1. Using the Import Documents widget, a film review data set was extracted.
2. Preprocess Text was used to convert the text to lowercase and remove URLs. 
3. Statistics is the key widget in this workflow. This is where you identify the keywords.


4. Then two Aggregate Columns widgets were used to create POS and NEG columns, summing the positive and negative keyword counts into the two columns respectively.


5. This is the dataset as it stands now. 



6. Two Feature Constructor widgets were introduced. If you are good at Python, you can use a Python Script widget instead. 

7. Depending on the positive and negative keyword counts, we can introduce a new Predicted column as follows. 


8. Let us look at the confusion matrix from the Pivot Table widget.

You can see that it has 70% overall accuracy, and more than 85% accuracy for negative sentiments. 
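For readers who prefer code to widgets, here is a minimal sketch of the same rule-based idea in plain Python. The keyword lists come from the post; the function name and the tie-breaking rule are illustrative choices, not part of the Orange workflow.

import re

# Keyword lists taken from the post; everything else is illustrative
POSITIVE = {"great", "fabulous", "superb"}
NEGATIVE = {"pathetic", "worst", "poor"}

def predict(review: str) -> str:
    """Count positive and negative keywords and pick the larger side."""
    words = re.findall(r"[a-z]+", review.lower())
    pos = sum(word in POSITIVE for word in words)    # the POS column
    neg = sum(word in NEGATIVE for word in words)    # the NEG column
    return "positive" if pos >= neg else "negative"  # ties go to positive

print(predict("A fabulous film with a superb cast"))   # positive
print(predict("The worst, most pathetic plot ever"))   # negative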

This shows that you do not always need to rely on complex algorithms; a simple rule-based technique can already give you reasonable accuracy. 

Wednesday, October 6, 2021

Time Series CheatSheet - v 9.0

This time we have a few more updates to the Time Series cheat sheet, which can be seen in the following image. The image size was changed as we are covering a few more components, and you can get the original file from Time-Series-Cheat-Sheet


Improvements in v 9.0.

1. Benchmark Datasets
A few research papers have indicated that there are benchmark datasets for time series analysis, so those are now included. 

2. Timestamp Attribute Derivation
During the analysis of datasets, it was found that some datasets do not have an explicit timestamp attribute. In some datasets, the time attribute is distributed across multiple columns such as year, month, day, hour, and minute; in others there is no timestamp attribute at all and one has to be generated. A quick sketch of the first case is shown below. 
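As a small illustration of deriving a timestamp from split columns, pandas can assemble one directly; the column names here are hypothetical.

import pandas as pd

# Hypothetical frame where the time attribute is split across columns
df = pd.DataFrame({"year": [2021, 2021], "month": [10, 10],
                   "day": [6, 7], "hour": [9, 14]})

# pd.to_datetime assembles a single timestamp from the component columns
df["timestamp"] = pd.to_datetime(df[["year", "month", "day", "hour"]])
print(df)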

3. Time Series Reconstruction
Looking at a few more research papers, we identified that there are many different techniques for time series reconstruction. 

Monday, October 4, 2021

Orange Data Mining - Text Processing

During the last few blog posts, we have discussed how image processing can be used for different purposes using the popular Orange Data Mining tool. Now let us move our discussion to text analytics, another complex data source. 
The Text tab in the Orange Data Mining tool is not available by default; it has to be added via the Text add-on, which also receives updates that have to be applied periodically. 

Now let us see the basic functions of Orange Data Mining for text analytics.



Today let us use a few important features for text mining. As you can see, there are existing corpora (text datasets) for you to use, as shown below. 


Further, if you have the proper access, you can extract data from NY Times, PubMed, Twitter, etc. 

Let us use a customized dataset to perform a simple text preprocessing technique via the Import Documents widget. The IMDB review dataset was used, as it has 2000 positive and negative reviews. You can download the relevant Orange Data Mining workflows from https://github.com/dineshasanka/Orange-Data-Mining---Text-Analyitics.git; the entire workflow follows.


After the 2000 documents are imported, we need to perform some preprocessing techniques in order to clean the data. This is a very important task in text analytics, as you tend to see a lot of issues due to the free-form nature of text data. 


The above configuration covers the basic preprocessing techniques: tokenization, filtering, and transformation. In this example, we have not used the other available preprocessing techniques such as n-grams, normalization, and POS tagging. 
Tokenization decides how words are separated from the sentences, and filtering decides which unnecessary words to remove; we used English stopwords plus a customized list to remove non-semantic words. The basic transformation removed URLs and converted the text to lowercase. A rough sketch of these steps in plain Python is shown below. 
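Orange performs all of this through the Preprocess Text widget; purely as an illustration, the same three steps look roughly like this in plain Python (the stopword list here is a tiny stand-in, not the full English list Orange uses).

import re

# Tiny stand-in stopword list; Orange ships a full English list
STOPWORDS = {"the", "a", "an", "is", "it", "and", "of", "to"}

def preprocess(text: str) -> list[str]:
    text = text.lower()                               # transformation: lowercase
    text = re.sub(r"https?://\S+", "", text)          # transformation: remove URLs
    tokens = re.findall(r"[a-z']+", text)             # rough word tokenization
    return [t for t in tokens if t not in STOPWORDS]  # stopword filtering

print(preprocess("The plot is great! More at https://example.com"))
# ['plot', 'great', 'more', 'at']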
Now you can visualize the words in the word cloud as shown in the following screen.


Apart from word cloud analysis, you can perform simple statistics on your text documents using the Statistics widget. 

Then you can view the data either in a data table or in the feature statistics. 

Saturday, October 2, 2021

Article: Text Classification in Azure Machine Learning using Word Vector


WEKA, or the Waikato Environment for Knowledge Analysis, developed at the University of Waikato, New Zealand, is a good tool for text information retrieval as it has a lot of features such as Term Frequency (TF), Inverse Document Frequency (IDF), n-gram tokenization, stopwords, stemming, and document length. 

The latest article, Text Classification in Azure Machine Learning using Word Vectors, describes how the word vector output of WEKA can be used in Azure Machine Learning to achieve better classification.
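As a rough illustration of what such word vectors look like (sketched here with scikit-learn rather than WEKA's own filter, so the details differ), TF-IDF turns documents into a numeric document-term matrix that any classifier can consume.

from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy documents; TF-IDF weights each term by how distinctive it is
docs = ["a great superb film", "a poor pathetic film"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(docs)         # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # unigram and bigram features
print(X.toarray().round(2))                # one weighted row per document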

Following is the table of contents for the article series on Azure Machine Learning.

Introduction to Azure Machine Learning using Azure ML Studio
Data Cleansing in Azure Machine Learning
Prediction in Azure Machine Learning
Feature Selection in Azure Machine Learning
Data Reduction Technique: Principal Component Analysis in Azure Machine Learning
Prediction with Regression in Azure Machine Learning
Prediction with Classification in Azure Machine Learning
Comparing models in Azure Machine Learning
Cross Validation in Azure Machine Learning
Clustering in Azure Machine Learning
Tune Model Hyperparameters for Azure Machine Learning models
Time Series Anomaly Detection in Azure Machine Learning
Designing Recommender Systems in Azure Machine Learning
Language Detection in Azure Machine Learning with basic Text Analytics Techniques
Azure Machine Learning: Named Entity Recognition in Text Analytics
Filter based Feature Selection in Text Analytics
Latent Dirichlet Allocation in Text Analytics
Recommender Systems for Customer Reviews
AutoML in Azure Machine Learning
AutoML in Azure Machine Learning for Regression and Time Series
Building Ensemble Classifiers in Azure Machine Learning
Text Classification in Azure Machine Learning using Word Vectors