Data is everywhere, but?: 2020

Thursday, December 31, 2020

Templates in Draw.IO

In a post around two months ago, we discussed how Draw.IO can be used to draw diagrams. In addition to the options discussed there, Draw.IO offers more than 100 templates to choose from, as shown below.

Let us look at a few templates that Draw.IO has for Azure and AWS.

The following figure was designed in Draw.IO using its native controls.

Tuesday, December 29, 2020

Power BI End-To-End Features

Power BI is a business analytics framework from Microsoft, which according to Gartner has been a leader in BI for the last 12 years. Since we work in a dynamic environment, it helps to know the capabilities and integrations of this tool.

The following link will take you to the end-to-end features of Power BI.

https://static1.squarespace.com/static/5d28ebb6fbc5cd000177d261/t/5ede3952ea5e395a8580fbd2/1591621971557/PowerBIEndToEndDiagram_MelissaCoates.pdf

Wednesday, December 23, 2020

Creating your First Azure SQL Database

As the cloud has become unavoidable in the current technology race, it is important to understand the options you have for creating a database on the Azure platform. As shown in the image below, there are different architectural options.

If you are interested in only the Azure cloud option, then you have the following options.

To elaborate on how to create an Azure SQL database, here is the latest article at SQLShack.

https://www.sqlshack.com/creating-your-first-azure-sql-database/

Wednesday, December 16, 2020

Defining Fuzzy Membership Function Using Box Plot

The membership function is the key component in fuzzy techniques. When fuzzy techniques were extended to the data warehouse, so that decisions can be made using fuzzy techniques in a data warehouse, it was identified that many implementations do not have a data-driven technique for defining the fuzzy membership function.

In this research paper, which is part of the research project on Investigation and Development for Fuzzy Data Warehouse, we used the well-known Box Plot technique to derive a fuzzy membership function. In this technique, we mapped the fuzzy function parameters to the Box Plot parameters as shown below.

With this technique, you can define a three-state or five-state membership function that combines trigonometric and trapezoidal functions. The following is the three-state membership function defined from the Box Plot.
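To make the idea of mapping Box Plot parameters to membership function parameters concrete, here is a minimal Python sketch. It is not the mapping from the research paper: it assumes purely linear (trapezoidal and triangular) shapes with breakpoints at Q1, the median and Q3, and the function names are hypothetical.

```python
import statistics


def box_plot_parameters(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def three_state_membership(x, params):
    """Return membership degrees of x in 'low', 'medium' and 'high'.

    Assumed mapping (illustrative only): 'low' is fully true up to Q1 and
    fades out at the median, 'medium' peaks at the median, and 'high' fades
    in from the median and is fully true from Q3 upwards.
    """
    _, q1, median, q3, _ = params

    def falling(value, start, end):
        # 1 at or before start, 0 at or after end, linear in between
        if value <= start:
            return 1.0
        if value >= end:
            return 0.0
        return (end - value) / (end - start)

    low = falling(x, q1, median)
    high = 1.0 - falling(x, median, q3)
    if x <= median:
        medium = 1.0 - falling(x, q1, median)
    else:
        medium = falling(x, median, q3)
    return {"low": low, "medium": medium, "high": high}


params = box_plot_parameters([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(three_state_membership(4, params))
```

With these assumptions, a value halfway between Q1 and the median belongs equally to "low" and "medium", which is the kind of overlap a fuzzy function is meant to express.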

Read the full research article. It has all the implementation details as well as the evaluation techniques. The article already has more than 10 citations, excluding self-citations.

Saturday, December 12, 2020

Data Warehouse in SQL Server

Data Warehouse is a comprehensive technology that provides the key people within an enterprise with access to any level of the required information within the enterprise. It is an enterprise-wide framework that permits the management of all enterprise information.

Let us see how we can utilise Microsoft technologies at various stages of a Data Warehouse.

Let us look at how data warehouse design concepts can be applied with Microsoft technologies. First of all, you need to look at the infrastructure planning for a data warehouse. During the data warehouse design, it is important to include surrogate keys in dimension tables. The date dimension is a special dimension that is used in data warehouse modelling. Historical data is an important aspect of a data warehouse, and it is handled through Slowly Changing Dimensions (SCD).
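As a small illustration of a date dimension with a surrogate key, here is a Python sketch. The column names and the YYYYMMDD key convention are assumptions made for the example, not something prescribed by the linked articles.

```python
from datetime import date, timedelta


def build_date_dimension(start, end):
    """Return one row per calendar day from start to end (inclusive)."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            # Surrogate key in YYYYMMDD form (an assumed convention)
            "date_key": current.year * 10000 + current.month * 100 + current.day,
            "full_date": current,
            "year": current.year,
            "quarter": (current.month - 1) // 3 + 1,
            "month": current.month,
            "day_of_week": current.isoweekday(),  # 1 = Monday ... 7 = Sunday
        })
        current += timedelta(days=1)
    return rows


for row in build_date_dimension(date(2020, 12, 30), date(2021, 1, 1)):
    print(row["date_key"], row["quarter"], row["day_of_week"])
```

Fact tables would then reference date_key instead of storing the date itself, which keeps the calendar attributes in one place.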

Friday, December 11, 2020

RDBMS -> NoSQL -> NewSQL

https://www.thepsi.com/rdbms-vs-nosql-vs-newsql-which-one-to-choose/

Nowadays there are many data formats to cater to different needs. Relational Database Management Systems (RDBMS) have been used for many years. Then came NoSQL, to support horizontal scaling and distributed computing. With NoSQL, you typically lose the ACID properties of transactions. With the evolution of technology and user needs, we are now looking at distributed databases that retain the ACID properties. This has led to the new paradigm of NewSQL.

Let us look at the comparisons as shown below.

Source: https://medium.com/rabiprasadpadhy/google-spanner-a-newsql-journey-or-beginning-of-the-end-of-the-nosql-era-3785be8e5c38

Look at these comparisons in detail at https://www.thepsi.com/rdbms-vs-nosql-vs-newsql-which-one-to-choose/ and https://www.xenonstack.com/blog/sql-vs-nosql-vs-newsql/

Thursday, December 10, 2020

Customized Transaction Log Backups

Transaction log backups are important in a production environment. They let you manage the log file size and keep backups in case you need to restore.

I am pretty sure most of you have scheduled transaction log backups. If you schedule a transaction log backup every 15 minutes, you get four log backups every hour, which results in nearly 100 backup files a day (4 × 24 = 96) and around 700 a week (96 × 7 = 672). Unlike differential backups, you need all your log backups to recover. Sometimes there might be few or no transactions, but there will still be a log backup.

Now the question is: can we take a transaction log backup only when the log has grown to a sufficient size? Yes, you can, if you are running SQL Server 2017 or later.

The sys.dm_db_log_stats Dynamic Management Function (DMF) has a column called log_since_last_log_backup_mb, which tells you how much log (in MB) has been generated since the last log backup.

Using the following script, you can perform transaction log backups when the log file size is more than a specific size.

-- Back up the transaction log only when enough log has accumulated since the last log backup.
DECLARE @log_since_last_log_backup_mb NUMERIC(9, 2)
DECLARE @ThresholdSizeMB INT = 25 -- minimum log size (MB) that triggers a backup
DECLARE @folderName VARCHAR(30) = 'D:\DBBACKUP'
DECLARE @DatabaseName VARCHAR(30) = 'LB1'

-- Log generated (MB) since the last log backup
SELECT @log_since_last_log_backup_mb = log_since_last_log_backup_mb
FROM sys.dm_db_log_stats(DB_ID(@DatabaseName))

IF @log_since_last_log_backup_mb > @ThresholdSizeMB
BEGIN
	-- File name: <folder>\<database><yyyymmddhhmi>.bak
	DECLARE @fileName NVARCHAR(400) = @folderName + '\' + @DatabaseName
		+ LEFT(REPLACE(CONVERT(VARCHAR(10), GETDATE(), 111), '/', '')
			+ REPLACE(CONVERT(VARCHAR(8), GETDATE(), 108), ':', ''), 12)
		+ '.bak'

	-- Use the variable so the database name is defined in one place only
	BACKUP LOG @DatabaseName TO DISK = @fileName
	WITH NOFORMAT
		,NOINIT
		,SKIP
		,NOREWIND
		,NOUNLOAD
		,STATS = 10
END
ELSE
	PRINT 'No BACKUP'

Monday, December 7, 2020

Technology Initiatives

If someone asks what the top three technology priorities are, what do you say? Is it Cloud, DevOps, Machine Learning or IoT? The following is a survey done by Flexera with 303 respondents.

DevOps, Machine Learning and Big Data are still not on the priorities list, though many of us are talking about those topics. Digital transformation, cybersecurity and cloud migration are the top technology initiatives.

Friday, December 4, 2020

Database Design and Modeling with PostgreSQL

This is a self-published book on PostgreSQL.

https://www.researchgate.net/publication/341931233_Database_Design_and_Modeling_with_PostgreSQL

It discusses the basics of database modelling and implementation in PostgreSQL with a few case studies.

Wednesday, December 2, 2020

Epidemic Mathematical Model

Source: https://www.cirad.fr/

In these times of Covid, epidemic has become a buzzword everywhere. While saluting the health professionals and others who are putting in their utmost effort to save people all over the world, do you know that there is a mathematical model for epidemics? This model is called the Epidemic Protocol or, more famously, the Gossip Protocol.

The theory is based on a population that contains an infected node: uninfected nodes that come into contact with it become infected in turn, as we are observing in the current Covid-19 pandemic.

Let us look at this mathematical theory.

Though this theory is mainly used to model the propagation of epidemics, it is also used for communication in peer-to-peer systems. https://flopezluis.github.io/gossip-simulator/ provides a simulator for the Gossip / Epidemic theory.
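The spreading behaviour can also be tried out in a few lines of Python. This is a minimal sketch of a push-style gossip round, assuming every infected node contacts a fixed number of randomly chosen nodes per round; the function name and parameters are hypothetical, and it is not the algorithm behind the simulator linked above.

```python
import random


def simulate_gossip(node_count, fanout=1, seed=0):
    """Simulate push-style gossip; return the infected count after each round."""
    rng = random.Random(seed)
    infected = {0}                      # node 0 starts with the rumour
    history = [len(infected)]
    while len(infected) < node_count:
        newly_infected = set()
        for _node in infected:
            # Each infected node contacts `fanout` random nodes per round
            # (a node may pick itself or an already infected node)
            for _ in range(fanout):
                newly_infected.add(rng.randrange(node_count))
        infected |= newly_infected
        history.append(len(infected))
    return history


print(simulate_gossip(50))
```

With a fanout of one, the infected count can at most double in each round, which is why the spread looks exponential at the beginning and slows down as fewer uninfected nodes remain.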

Tuesday, December 1, 2020

Hierarchies for Data Analytics in SSAS

In most data analytics, hierarchies play a vital role. They provide a much easier way to analyse and present data.

There are several types of hierarchies that you can create, such as natural hierarchies, bucketing hierarchies, unbalanced hierarchies etc.

This article describes how to create hierarchies in SQL Server Analysis Services multidimensional models.

Read the full article at https://www.sqlshack.com/enhancing-data-analytics-with-ssas-dimension-hierarchies/

Monday, November 30, 2020

Sri Lanka Qualifications Framework (SLQF) for Higher Education

There are various types of courses available: BSc, Postgraduate Diploma, BA, MSc, MA, MBA, MDA, MPhil, PhD etc. Many of us do not know how these courses are ordered and are unaware that there is a framework for these qualifications.

In 2013, the Ministry of Higher Education, with funding from the World Bank, defined the SLQF for higher education in Sri Lanka. In 2015 this was updated in line with world standards.

The following is the SLQF in summary.

As you can see, this shows the clear organization of the different qualifications.

Each qualification has minimum requirements as shown in the following table.

If you hold any qualification, verify whether you have met the necessary requirements, as many institutes find ways to bypass them. If they have done so, ultimately you will be in trouble, not them.

Importantly, you do not need to go step-by-step as there are defined pathways as shown below.

If you want to read the details of this report you can go to the following link.

https://www.ugc.ac.lk/attachments/1156_SLQF_2016_en.pdf

Sunday, November 29, 2020

Linguistic Analytics in Data Warehouse Using Fuzzy Techniques

A data warehouse is no longer a "nice-to-have" system; it is an integral part of your data strategy for facing fierce competition. Most of the time, we do crisp analytics in the data warehouse, such as "High", "Low" etc. What about linguistic analytics, as shown below?

By using Fuzzy techniques this was achieved. Read the research paper at IEEE.

https://www.researchgate.net/publication/332081971_Linguistic_Analytics_in_Data_Warehouse_Using_Fuzzy_Techniques

Using linguistic analytics, the following is the employee count for the different experience levels of employees in a factory.

This research is part of the Investigation and Development for Fuzzy Data Warehouse Project.

Thursday, November 26, 2020

Troubleshooting using Wait Stats in SQL Server

Troubleshooting is an art, not a science, in any domain. The same symptom may be due to different reasons. If your query is slow, it can be due to many reasons, ranging from hardware to other queries. This latest article at SQLShack discusses how to troubleshoot SQL Server using wait stats. In my personal experience, I always start my troubleshooting with wait stats: they give the complete picture and they do not add a cost to your system.

Read the article at https://www.sqlshack.com/troubleshooting-using-wait-stats-in-sql-server/

Wednesday, November 25, 2020

Different Types of Clustering Techniques

Clustering is an unsupervised technique that is used to perform natural grouping. Though K-Means, Hierarchical and Fuzzy Clustering are the most common clustering techniques, there are a number of other clustering techniques, as shown in the figure below.
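To show what "natural grouping" means in practice, here is a minimal Python sketch of K-Means on one-dimensional data. The starting centroids are supplied by the caller, which is a simplification chosen for the example; real implementations usually pick them randomly or with a seeding heuristic.

```python
def k_means_1d(points, centroids, iterations=10):
    """Cluster 1-D points around the given starting centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


centroids, clusters = k_means_1d([1, 2, 3, 10, 11, 12], [0, 5])
print(centroids, clusters)
```

No labels are given to the algorithm; the two groups emerge only from the distances between the points, which is what makes the technique unsupervised.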

Who are the best players in Meeting Solutions?

Source: https://www.liquit.com/

During this pandemic, meeting solutions are playing a huge role by keeping professionals, teachers and students at home while still letting them carry on their work and studies, whether in IT or non-IT fields.

You might have your own favourite tool for communicating within your teams and groups, but which is the best of them all? Let us hear Gartner's opinion on these meeting solutions. They have published their traditional Magic Quadrants for Unified Communications as a Service and for Meeting Solutions, released in November 2020.

In both quadrants, Microsoft, Zoom and Cisco are leading, and they fall in the Leaders quadrant. LogMeIn and Google are challenging these leaders.

Download a copy of the Gartner Magic Quadrant for Meeting Solutions report to get more details.

In 2019, Cisco was leading and Microsoft was lagging far behind. However, Microsoft has made significant progress over two years.

Tuesday, November 24, 2020

Image Classification in Orange

We have discussed how the Orange tool can be used for Image Clustering. Now let us look at how we can perform Classification in Orange.

Like before, let us select an image set that is organised in classified folders. The folder names will be taken as the class names.

The following is the set of images that was used for the Image Classification.

These images are separated into six categories: Animals, Birds, Flowers, People, Places and Trees.

The following is the Orange model.

Let us go through the above model step by step.

In the Image Embedding, Inception V3 was used as the image embedder. Logistic Regression, Neural Network, SVM, Random Forest and Naive Bayes were used as the classifiers.

In Test and Score, you can find the accuracy metrics.

According to the above figure, Logistic Regression has the highest accuracy along with the highest F1 score and Precision.

In classification, the Confusion matrix is an important measure to find out your prediction distributions.

As you know, in the Confusion Matrix, the diagonal holds the correctly predicted results. If you want to find out the reason for an incorrect prediction, you can select the relevant cell and view that data in the Image Viewer.
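The relationship between the confusion matrix and the scores reported in Test and Score can be sketched in Python as follows. The matrix used here is hypothetical and is not the one from the figures in this post; rows are taken as actual classes and columns as predicted classes.

```python
def per_class_metrics(matrix, labels):
    """Compute accuracy and per-class precision, recall and F1.

    matrix[i][j] is the number of items whose actual class is labels[i]
    and whose predicted class is labels[j].
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))  # the diagonal
    metrics = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        predicted = sum(row[i] for row in matrix)   # column total
        actual = sum(matrix[i])                     # row total
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return correct / total, metrics


# Hypothetical counts: two Flowers images are predicted as Birds
accuracy, metrics = per_class_metrics([[8, 0], [2, 10]], ["Birds", "Flowers"])
print(accuracy, metrics)
```

The off-diagonal cell lowers the precision of the predicted class and the recall of the actual class, which is why inspecting those cells is worthwhile.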

In the confusion matrix, there are two images that were predicted as Birds although they were actually classified as Flowers. Look at the (2, 4) cell.

Let us look at what those two images are.

OK, you can't really blame the tool or the algorithm, as these images have a mix of birds and flowers. Though we categorized them as Flowers, the classification algorithm predicted them as Birds.

We will look at a few interesting scenarios in future posts.

Monday, November 23, 2020

Model Comparison in Azure Machine Learning

We build models in Machine Learning. How do you know these models are correct? What are the accuracy levels of these models? As we know, there are a lot of parameters to verify. In the case of classification, Recall, Precision and the F1 measure are the most common evaluation measures apart from accuracy. This article shows how we can compare models that were built in Azure Machine Learning.

Following are the other articles in the series.

Introduction to Azure Machine Learning using Azure ML Studio

Data Cleansing in Azure Machine Learning

Prediction in Azure Machine Learning

Feature Selection in Azure Machine Learning

Data Reduction Technique: Principal Component Analysis in Azure Machine Learning

Prediction with Regression in Azure Machine Learning

Prediction with Classification in Azure Machine Learning

Saturday, November 21, 2020

IPL and Overfitting

After a long wait, the Indian Premier League started and was completed for the thirteenth time, with Mumbai Indians declared the undisputed winner.

However, this post is not another cricket analysis, in which I am not an expert anyway.

We discussed overfitting using the USA and Sri Lankan presidential elections, and this post is to understand overfitting from a different perspective. Overfitting is when your machine learning or predictive model is too accurate on the data it was built from. You might wonder how high accuracy can be a problem, as we always try to increase the accuracy of predictive models. Overfitting can occur due to two reasons.

Too little data - When you have too little data, it is very likely that the model has high accuracy, but it can go wrong with the next data point. Further, with many combinations of attributes, you are making your data even smaller.
When unnecessary data is collected - We can end up predicting with uncorrelated data. For example, "no US president who is divorced is left-handed" or "no Sri Lankan President who has a moustache is re-elected" are classic examples of overfitting.
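The two causes above can be seen in a small Python experiment. It is a sketch on purely random, synthetic data (no real election or cricket records): each season gets random yes/no attributes for its champion, and a rule of the form "no champion has ever had both attribute a and attribute b" is counted if it holds for every season. The function name and parameters are hypothetical.

```python
import random
from itertools import combinations


def count_spurious_rules(season_count, attribute_count=6, seed=1):
    """Count 'never happened' rules that hold by chance in random data.

    Because the attributes are random, every rule that holds for all
    seasons is a coincidence and has no predictive value.
    """
    rng = random.Random(seed)
    seasons = [
        [rng.random() < 0.5 for _ in range(attribute_count)]
        for _ in range(season_count)
    ]
    holding = 0
    for a, b in combinations(range(attribute_count), 2):
        if not any(season[a] and season[b] for season in seasons):
            holding += 1
    return holding


for n in (0, 3, 13, 1000):
    print(n, count_spurious_rules(n))
```

With no seasons at all, every one of the 15 rules "holds" with 100% historical accuracy, and rules can only be eliminated as more seasons are added. That is the sense in which few data points and combined attributes produce rules that look perfect and predict nothing.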

Now let us look at overfitting criteria in the IPL for the champions over the last thirteen years.

Year	2008	2009	2010
Conclusion	No team has won an IPL title	No 4th-placed team in the preliminary round has won the IPL title	No team led by an Indian player has won the championship. The Man of the Series never loses the final
Event	Until Rajasthan Royals won the championship	Until Deccan Chargers, who were ranked 4th in the preliminary round, won the championship.	Until Dhoni's Chennai Super Kings won the championship. Previously it was won by teams captained by Shane Warne and Adam Gilchrist. In the previous editions, Shane Watson (RR) and Adam Gilchrist (DC), both from the winning teams, were the Man of the Series. In 2010, Sachin Tendulkar, who was not part of Chennai Super Kings, was the Man of the Series.
Reason	If predicting with little data is bad, what about predicting with no data?	Predicting with little data is bad.	Though the captain plays a huge role in winning, whether he is Indian or foreign makes no difference.

Year	2011	2012	2013
Conclusion	No team has won the championship in consecutive years	The 4th-ranked team in the preliminary round was never beaten in the final	Chennai Super Kings were never beaten twice in finals. MI never beat CSK in a final
Event	Chennai Super Kings won the championship in 2010 and 2011, becoming the first team to win the championship as defending champions.	Kolkata Knight Riders won the championship by beating Chennai Super Kings, who were 4th in the preliminary round.	In 2012 and 2013 CSK were beaten in the finals by KKR and MI respectively. MI and CSK had met before in the 2010 final, where CSK became the winner.
Reason	With four data points, what you can predict is very limited, even though historically your accuracy is 100%.	Again, this prediction was correct until 2012; a 4th-ranked team had made it to the final of the IPL only once before.	When you are predicting with combinations of events, it is obvious that your accuracy will be very high, as there are only a handful of events.

Year	2014	2015	2016
Conclusion	The team of the highest run-scorer has never won the IPL title	Chennai Super Kings were never beaten twice in finals. MI never beat CSK in a final	The third-ranked team in the preliminary stage has never beaten the second-ranked team in the final.
Event	Robin Uthappa of KKR was the highest scorer in the tournament and was a member of the eventual winners, KKR. On three of the previous occasions, the highest scorer represented the runner-up team but not the champions.	In 2012 and 2013 CSK were beaten in the finals by KKR and MI respectively. MI and CSK had met before in the 2010 final, where CSK became the winner.	Sunrisers Hyderabad beat Royal Challengers Bangalore in the final, which is the first time a third-ranked team beat the second-ranked team in the final. In 2010, CSK, who were the third-ranked team, became the champions by beating the first-ranked team.
Reason	Again, there were only six events before, and so little data does not make for good predictions.	When you are predicting with combinations of events, it is obvious that your accuracy will be very high, as there are only a handful of events.	Too many combinations in little data should not be used for predictions.

Year	2017	2018	2019
Conclusion	Whenever MI won, the highest wicket-taker was never an Indian bowler. The number-one-ranked team has never beaten the second-ranked team in the final.	CSK has never won by chasing	MI has never beaten CSK when MI was ranked first
Event	In 2013 and 2015 MI won the championship. In both those years, the highest wicket-taker was Bravo from the West Indies. In 2017, Bhuvneshwar Kumar, who is an Indian, was the highest wicket-taker.	In 2010 and 2011 CSK won, but by defending. This is the first time that they have won the championship by chasing.	MI became champions in 2017 when they were ranked first in the preliminary stage. However, when they beat CSK in 2013 they were the second-ranked team. In 2019 they were ranked first and were able to defeat second-ranked CSK.
Reason	Well, this is a combination of data and uncorrelated data.	Again, by combining the team and method, you are reducing the data.	Again, by combining multiple teams and rankings, you are reducing the data.

Year	2020	2021	2022
Conclusion	MI has never won in an even year. MI has never won chasing. The highest six-hitter was never in the championship-winning team. The Fair Play team has never won the title.
Event	MI won the championship in 2013, 2015, 2017 and 2019, and this is the first time that they have won while chasing. Ishan Kishan (MI) was the highest six-hitter, and this is the first time that the highest six-hitter was part of the tournament champions. The Fair Play award was initiated in 2012, and this is the first time that the Fair Play award winner has won the championship.
Reason	Though these are somewhat valid predictions, the accuracy is 100% only because of the few data points.

So, it is important to collect adequate data for prediction. But even with a large data set, when you make predictions for combinations of your attributes, you are making your data smaller. Further, even when you have data, attributes such as Fair Play winner and foreign captain are not necessarily correlated with the outcome.
