Data is everywhere, but?: Data Mining


Saturday, July 29, 2023

Clustering as Pre-Processing Technique for Classification

Clustering is often used to identify natural groups in a dataset. Since clustering does not depend on a target (dependent) variable, it is said to be unsupervised learning. Classification, in contrast, is supervised, as it models data for a target or dependent variable. This post describes clustering as a pre-processing task for classification, and uses the Orange Data Mining tool to demonstrate the scenario.

The following is the complete Orange Data Mining workflow, which is also available on GitHub.

The attrition sample dataset is used for this demonstration. It contains 18 variables and 1,470 instances. The Jaccard distance is used to measure the distance between instances; keep in mind that you may have to choose a different distance measure to improve the results.
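For readers who have not met the Jaccard distance before, here is a minimal sketch in plain Python. It is only an illustration of the measure itself, not of how Orange computes it internally, and the employee attributes (`Dept`, `Travel`, `OverTime`) are hypothetical values made up for the example.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two collections treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Two empty sets are treated as identical (distance 0) by convention.
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Hypothetical employees described by attribute=value pairs.
emp_1 = {"Dept=Sales", "Travel=Rarely", "OverTime=Yes"}
emp_2 = {"Dept=Sales", "Travel=Frequently", "OverTime=Yes"}

# Two shared values out of four distinct values in total.
print(jaccard_distance(emp_1, emp_2))  # prints 0.5
```

A distance of 0 means the two instances share every value, and a distance of 1 means they share none.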

Hierarchical clustering is then used to produce three clusters, as too many clusters would each be small and may cause overfitting. The following is the cluster distribution for the three clusters.

After the clustering is completed, the data for each cluster is selected and the unnecessary columns (Cluster and Selected) are removed. Different classification techniques are then executed for each cluster. This scenario uses the SVM, Logistic Regression, Neural Network, Random Forest, AdaBoost, Naive Bayes, Tree, and kNN classification techniques.
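Outside Orange, the split-and-clean step described above could be sketched in plain Python roughly as follows. The rows and the `Age` and `Attrition` columns are hypothetical sample values; only the `Cluster` and `Selected` column names and the C1/C2 labels come from the workflow in this post.

```python
from collections import defaultdict

# Helper columns added by the clustering step that the classifiers must not see.
META_COLUMNS = ("Cluster", "Selected")


def split_by_cluster(rows):
    """Group rows by their Cluster value and drop the helper columns."""
    groups = defaultdict(list)
    for row in rows:
        cleaned = {k: v for k, v in row.items() if k not in META_COLUMNS}
        groups[row["Cluster"]].append(cleaned)
    return dict(groups)


# Hypothetical rows as they might look after clustering.
rows = [
    {"Age": 41, "Attrition": "Yes", "Cluster": "C1", "Selected": "Yes"},
    {"Age": 49, "Attrition": "No", "Cluster": "C2", "Selected": "No"},
    {"Age": 37, "Attrition": "Yes", "Cluster": "C1", "Selected": "Yes"},
]

subsets = split_by_cluster(rows)
for name, subset in sorted(subsets.items()):
    print(name, len(subset), subset[0])
```

Each entry in `subsets` can then be handed to its own set of classifiers, which is what the per-cluster evaluation does.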

When classification techniques were executed for the entire dataset, without clustering, the best classification technique was Logistic Regression with a Classification accuracy of 87%.

The following are the classification accuracies and other evaluation measures for each cluster, evaluated with all eight classification techniques listed above.

The above table shows that the C1 and C2 clusters have higher accuracy than the full dataset. However, the third cluster's performance is not as good as the classification on the overall dataset.

This technique works better when there are large volumes of data; otherwise, clustering reduces the data available to each model, which can lead to overfitting.

Sunday, February 14, 2021

Data Mining Techniques in Prevention and Diagnosis of Non Communicable Diseases

During the pandemic, the entire world has been concerned about human health. Non-communicable diseases such as diabetes mellitus, heart disease, hypertension, and cancer have been troubling societies for a long time. This research was done to prevent and diagnose non-communicable diseases using data mining techniques, and it was carried out using a data sample from a semi-rural area in Sri Lanka.

The major challenge in the health sector in rural, underdeveloped areas is that patients do not attend the medical clinics. Non-attendance is even higher among males. Having identified the challenge of getting males to the medical clinics, we used spouse data to predict the health conditions of the other partner.

Logistic regression analysis, Classification and Regression Tree (CART), decision tree, Chi-squared Automatic Interaction Detector (CHAID), exhaustive CHAID, and discriminant analysis techniques were used in this research.

Read the research paper at ResearchGate.

Friday, October 16, 2020

Orange, Color or Fruit or…?

Orange is a tool for data visualization and data mining. It has a variety of features to perform predictive analytics. This session will discuss image analytics in the Orange tool, covering image clustering and image classification techniques.

Join me on 21st October 2020 at 4:30 PM Sri Lanka time to discuss image analytics in Orange.

Sri Lankan Data Community October 2020 Online Meetup https://www.meetup.com/en-AU/sldatacommunity/events/273971825/

Monday, October 5, 2020

Data Mining in SQL Server

Data mining, or prediction, has become a buzzword not only in academia but also in industry. SQL Server has provided a rich set of algorithms to support data mining for a long time. However, most of these features go unused for many reasons. The following article series, which I completed at SQLShack, provides details of how to use data mining in SQL Server. The major advantage is that you can use the existing data in SQL Server for data mining directly. Further, you have the option of using the MS BI family for data mining.

Enjoy the article series here.

Introduction to SQL Server Data Mining

Naive Bayes Prediction in SQL Server

Microsoft Decision Trees in SQL Server

Microsoft Time Series in SQL Server

Association Rule Mining in SQL Server

Microsoft Clustering in SQL Server

Microsoft Linear Regression in SQL Server

Implement Artificial Neural Networks (ANNs) in SQL Server

Implementing Sequence Clustering in SQL Server

Measuring the Accuracy in Data Mining in SQL Server

Data Mining Query in SSIS

Text Mining in SQL Server

Wednesday, August 28, 2019

Naive Bayes Prediction in SQL Server

This is the second article of the data mining series.
https://www.sqlshack.com/naive-bayes-prediction-in-sql-server/

Wednesday, July 31, 2019

Introduction to SQL Server Data Mining

Prediction, is it a new thing for you? You won’t believe it, but you are predicting all the way from your bed to the office and back to bed again. Just imagine you have a meeting at 9 AM at the office. If you are using public transport, you need to predict what time you have to leave so that you can reach the office in time for the meeting. This may vary with the time of day, the day of the week, the traffic conditions, and so on. Before you leave home, you might predict whether it will rain today, and you might want to take an umbrella or suitable clothes with you. If you are using your own vehicle, the predicted time would be different. In that case, you don’t need to worry about the rain, but you do need to consider the fuel level you need to reach the office. From this simple example, you can see how critical prediction is, and that all these predictions are made from experience rather than by any scientific method.

Read full article https://www.sqlshack.com/introduction-to-sql-server-data-mining/

Tuesday, October 20, 2015

Time Series Algorithms in SQL Server

This is the fourth article in the data mining series. The previous articles in this series are listed below.

Shopping Basket Analysis in SQL Server

Using Decision Trees in SQL Server

Data Mining Cluster Analysis in SQL Server

This article focuses on time series algorithms, which are a forecasting technique. Time series algorithms are among the most common algorithms used in industry, and they can answer questions about future values such as the sales volume for the next season or petrol prices in winter. In most cases, time series algorithms are limited to prices and quantities. However, using the same theories and techniques, they are capable of predicting the trajectory of a moving object or the next video frame.
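To give a feel for what "answering questions on future values" means, here is a minimal sketch in plain Python. It fits a first-order autoregression (each value predicted from the previous one) by least squares. This is a deliberately simple stand-in and not the algorithm SQL Server uses; the `monthly_sales` figures are hypothetical, and the series needs at least two values.

```python
def fit_ar1(series):
    """Least-squares fit of y[t] = intercept + slope * y[t-1]."""
    xs, ys = series[:-1], series[1:]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # A constant history carries no slope information; predict the mean.
        return mean_y, 0.0
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


def forecast(series, steps):
    """Roll the fitted model forward, feeding each prediction back in."""
    intercept, slope = fit_ar1(series)
    last = series[-1]
    predictions = []
    for _ in range(steps):
        last = intercept + slope * last
        predictions.append(last)
    return predictions


# Hypothetical sales volumes that grow by 2 units per period.
monthly_sales = [10, 12, 14, 16, 18]
print(forecast(monthly_sales, 2))
```

On this steadily growing series the sketch simply continues the trend; real sales data with seasonality would need a richer model.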

Saturday, August 15, 2015

SS SLUG August 2015 Meet-up

Wednesday, August 19


11th Floor, DHPL Building
No. 42 Nawam Mawatha
Colombo 00200, Sri Lanka
18:00 - 20:30 (UTC+05:30) Sri Jayawardenepura
Language: English

We're back after a month's hiatus. We're sorry that we couldn't hold a meet-up last month. This month we've got two exciting sessions lined up for you...

Sessions:

SQL Server Full Text Search

Ravin Perera, Tech Lead Geveo Australasia

Natural text-search capabilities are a must in modern applications. As SQL Server developers, what are the technologies we can use to provide search-engine-like features in our applications? "Lucene.Net" is very popular, but comes with a huge integration and administrative overhead. In this session we'll explore the SQL Server built-in Full-Text Search feature which allows developers to fulfill most of the common natural text search needs with zero management overhead. You can integrate natural text-search capabilities into your databases right away!

About Ravin:
Ravin is a Microsoft certified developer and certified scrum master, who has worked on various types of projects. Connect with Ravin on LinkedIn (https://lk.linkedin.com/pub/ravin-perera/2b/466/61b) and Facebook (https://www.facebook.com/ravinsp).

Data Mining: Microsoft Time Series

Dinesh Asanka, Senior Database Specialist Pearson Lanka

Continuing with his data mining series, Dinesh joins us with the Microsoft Time Series algorithm. This algorithm provides regression algorithms that are optimized for the forecasting of continuous values, such as product sales, over time. Learn how to predict trends based on the original data set and add new data to the model and automatically use this data in your analysis.

About Dinesh:
Dinesh is a SQL Server MVP and a long-time database enthusiast, and has been contributing to the database community for several years through his blog, technical forums and various speaking engagements. He is also a visiting lecturer at the Sri Lanka Institute of Information Technology.

Wednesday, May 27, 2015

Shopping Basket Analysis in SQL Server

A famous supermarket chain in the USA once observed that men who buy beer for the weekend also tend to buy nappies for their kids. This revelation enabled the chain to increase sales volume and revenue by placing the items in close proximity to each other. An alternative approach would have been to move the items apart to encourage store exploration.
Read how to improve your sales using shopping basket analysis.

http://www.sql-server-performance.com/2015/shopping-basket-analysis-sql-server/
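The beer-and-nappies observation can be expressed with the three standard measures of an association rule: support, confidence, and lift. The sketch below computes them in plain Python. The baskets are hypothetical data invented for the example, and this is not how SQL Server implements association rules; it only shows the arithmetic behind a rule such as "beer implies nappies".

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    if n == 0:
        raise ValueError("at least one transaction is required")
    a, c = set(antecedent), set(consequent)
    count_a = sum(1 for t in transactions if a <= t)
    count_c = sum(1 for t in transactions if c <= t)
    count_both = sum(1 for t in transactions if (a | c) <= t)
    # Support: share of all baskets that contain both sides of the rule.
    support = count_both / n
    # Confidence: share of antecedent baskets that also contain the consequent.
    confidence = count_both / count_a if count_a else 0.0
    # Lift: confidence relative to how common the consequent is overall.
    lift = confidence / (count_c / n) if count_c else 0.0
    return support, confidence, lift


# Hypothetical shopping baskets.
baskets = [
    {"beer", "nappies", "crisps"},
    {"beer", "nappies"},
    {"beer", "bread"},
    {"nappies", "milk"},
    {"bread", "milk"},
]

support, confidence, lift = rule_metrics(baskets, {"beer"}, {"nappies"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two items appear together more often than they would if they were bought independently.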

Friday, January 16, 2015

SQL Server Sri Lanka User Group - January 2015 Meet-up

Welcome to a new year of data goodness!

To start off this year, we have a distinguished personality in the data space in Sri Lanka, along with a SQL Server MVP, speaking on a couple of interesting topics at the January meet-up.

See attached image for details, and ssslug.sqlpass.org for more.
(Also, it's not required, but it would be great to hear if you would be participating: Click here to confirm)

Thursday, January 15, 2015

SS SLUG January 2015 Meet-up

Saturday, November 8, 2014

Workshop on Data Mining

I will be doing the session on Microsoft.

Thursday, July 31, 2014

Data Mining Cluster Analysis in SQL Server

Grouping is something we naturally do in our day-to-day life. We group foods depending on taste, and we group friends depending on their different attributes.

Clustering is an algorithm which finds natural groupings inside your data when these groupings are not obvious. It finds the hidden variable that accurately classifies your data.

Read the article on Clustering here.

Monday, April 21, 2014

APRIL 2014 MEET-UP

More at http://www.sqlserveruniverse.com/SSSLUG.aspx

Friday, January 17, 2014

January 2014 Meet-up

Session #1

TITLE: Predictive Modeling with the Microsoft Naïve Bayes algorithm

Join this session where Dinesh showcases the capabilities of predictive analysis using the Microsoft Naïve Bayes algorithm. Naïve Bayes is a classification algorithm that ships with SQL Server Analysis Services and is used to mine for and predict outcomes based on selected parameters.

CATEGORY: Business Intelligence (Data Mining)

SPEAKER: Dinesh Asanka (MVP), Database Specialist (Pearson Lanka)
Linked.In | Blog | Facebook | @dineshasanka

40 minutes approx.

Session #2

TITLE: Writing Resilient T-SQL Code - Part II

Continuing from where he left off in November's session, Gogula will guide you through writing better T-SQL code that is more resilient to unexpected issues and common code failures. This session is based on the book Defensive Database Programming by Alex Kuznetsov.

CATEGORY: Development

SPEAKER: Gogula G. Aryalingam (MVP), Technical Architect (Navantis)
Linked.In | @gogula | Blog

40 minutes approx.

Time & Location

JANUARY 22, 2013 - 6:00 PM Onwards at MICROSOFT SRI LANKA

11th Floor, DHPL Building, No. 42, Nawam Mawatha, Colombo 2, SRI LANKA

An excellent opportunity to network and learn. Refreshments provided.

Entrance FREE

Tuesday, October 8, 2013

Getting Started with Data Mining in SQL Server

As database professionals, we typically work in a field of exact science. For example, a common practice in business intelligence (BI) solutions is creating duplicate copies of data sets, then comparing the results from the different sources to make sure they're the same. If you extract five years' worth of data from an application's database and put it into a data mart, the results in the data mart must be the same as the results in the application's database, even if the table structures were changed and older records were archived. You might build a cube or semantic model and again check to make sure the results are exactly the same as the source system. If the numbers don't add up, the results are rejected because you know that something is wrong and must be corrected. I have to confess that not getting a conclusive result when working on a tough data problem sometimes keeps me up at night.

Tuesday, September 10, 2013

Weka 3: Data Mining Software in Java

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.

Found only on the islands of New Zealand, the Weka is a flightless bird with an inquisitive nature.

Weka is open source software issued under the GNU General Public License.

Data Mining with Weka, a 5 week MOOC, starting on September 9th 2013, is now open for enrolment:

http://weka.waikato.ac.nz

Tuesday, March 5, 2013

IEEE ICCSE 2013

I will be presenting at IEEE ICCSE 2013 in the data mining category. I will post more information when I get it.

Wednesday, September 26, 2012

Data Mining Add-ins Error

FAQ for data mining add-ins.

Sunday, June 10, 2012

SQL Server 2012 Data Mining add-in for Excel 2010 Released

Microsoft has released the SQL Server 2012 Data Mining add-in for Excel 2010. This has been a much-awaited release for those invested in data mining.

The previous release was only available for Excel 2007, and if you tried to use it in Excel 2010 it only worked on 32-bit machines. This was a major issue in the previous release. Both 32-bit and 64-bit versions are now available.

Go download and start playing with the latest release here.

http://www.microsoft.com/download/en/details.aspx?id=29061
