
Monday, August 30, 2021

Physical Join Operators in SQL Server

Every developer knows the logical join operators in SQL Server, such as Inner Join, Left Outer Join, and Right Outer Join.


Do you know how these operators work internally? This article discusses the internals of the Physical Join Operators (Nested Loops Join, Hash Match Join & Merge Join) in SQL Server.
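To give a rough feel for the three strategies, here is a minimal Python sketch. It is a conceptual illustration only, not how SQL Server implements these operators; the function names and the sample rows are made up for this example, and the merge join assumes both inputs are already sorted on the join key.

```python
from collections import defaultdict

def nested_loops_join(outer, inner, key):
    """Compare every outer row with every inner row."""
    return [(o, i) for o in outer for i in inner if o[key] == i[key]]

def hash_match_join(build, probe, key):
    """Build a hash table on one input, then probe it with the other."""
    table = defaultdict(list)
    for b in build:
        table[b[key]].append(b)
    return [(b, p) for p in probe for b in table.get(p[key], [])]

def merge_join(left, right, key):
    """Walk two inputs that are ALREADY sorted on the join key."""
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            # Pair the current left row with the whole run of equal right keys.
            k = j
            while k < len(right) and right[k][key] == left[i][key]:
                result.append((left[i], right[k]))
                k += 1
            i += 1
    return result

# Hypothetical sample rows, both lists sorted by "id".
customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}, {"id": 3, "name": "Cy"}]
orders = [{"id": 1, "item": "pen"}, {"id": 1, "item": "ink"}, {"id": 3, "item": "cup"}]

for join in (nested_loops_join, hash_match_join, merge_join):
    print(join.__name__, len(join(customers, orders, "id")))
```

All three functions return the same matched pairs; they differ only in how much work they do to find them, which is the kind of trade-off a query optimizer weighs when choosing a physical operator.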



Friday, August 27, 2021

MindsDB : AutoML Predictive Layer

 

MindsDB provides you with a Machine Learning platform that can work with many standard databases such as SQL Server and MongoDB; the full list is shown below. 


MindsDB, which is still in its early days, supports AutoML: the user provides a clean dataset and MindsDB selects the algorithms. It supports basic, classification, and Time Series predictions.

You can connect to a database or upload a dataset as shown below.


Then the configuration can be done as shown below. 

After the training is done, you can query the forecast value as shown below. 
MindsDB also supports a few use cases. 




Tuesday, August 24, 2021

Language Detection using Python


Language detection is an important function in Natural Language Processing (NLP), as it is one of the basic pre-processing tasks that you perform. In the previous article, we discussed how language detection can be done using Azure Machine Learning. 

Let us do the same task for the same dataset using Python. Language Detection.ipynb - Colaboratory provides the entire Python code; let us understand the code step by step. 

First, let us install the relevant libraries. 

!pip install textblob polyglot spacy spacy_cld langdetect langid 

Let us add the sample code with different languages. 


The following code detects the language and displays the language code.
 

One of the important tasks in language detection is filtering for the required language. The following simple query displays the English texts.
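The detect-then-filter flow described above can be sketched without any third-party package. The notebook itself relies on the libraries installed earlier; the snippet below deliberately swaps in a toy stopword-overlap heuristic so that it runs anywhere. The stopword lists, the `detect_language` helper, and the sample sentences are all assumptions made for this illustration, and the heuristic is far less accurate than a real detector.

```python
# Toy stand-in for a real detector: score each language by stopword overlap.
STOPWORDS = {
    "en": {"the", "is", "and", "this", "of", "a", "to"},
    "fr": {"le", "la", "est", "et", "ceci", "un", "une", "de"},
    "es": {"el", "la", "es", "y", "esto", "un", "una", "de"},
    "de": {"der", "die", "das", "ist", "und", "dies", "ein", "eine"},
}

def detect_language(text):
    """Return the language code with the most stopword hits, or 'unknown'."""
    words = set(text.lower().split())
    scores = {lang: len(words & stops) for lang, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

# Hypothetical sample texts in different languages.
samples = [
    "This is the first sentence and it is written in English",
    "Ceci est une phrase et elle est en français",
    "Esto es una frase y el idioma es español",
    "Dies ist ein Satz und die Sprache ist Deutsch",
]

# Step 1: detect the language code of every text.
detected = [(detect_language(text), text) for text in samples]
for code, text in detected:
    print(code, "|", text)

# Step 2: keep only the texts detected as English.
english_only = [text for code, text in detected if code == "en"]
print(english_only)
```

With a real library, only `detect_language` would change; the filtering step stays a simple comparison against the required language code.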
 


ETL Tools



For any data integration process, Extract-Transform-Load (ETL) is an important and challenging step, mainly because you need to deal with heterogeneous data sources such as RDBMS, NoSQL, text files, email, images, chats, etc. In data warehousing specifically, ETL is used heavily, as data warehouses typically have read-optimized data structures that are different from those of the operational data sources.

Source: icedq.com

Apart from the different types of sources, you need to integrate non-compatible data sources. Incompatibility may arise from technology as well as from the domain. Given these complexities, it is clear that we need special tools for ETL.
Then the question is: what are the best ETL tools, and what features do they offer? 
Here is a list of twenty-two ETL tools from which you can find one that matches your case. 


This article lists ETL tools such as SSIS, Pentaho, Informatica, Matillion, etc. Which ETL tool are you using, and why?

If you wish to read a few technical articles on SSIS, follow this link.

Monday, August 23, 2021

Trends in Database Market



The global market for database management systems (DBMS) is estimated at nearly $63.1 billion for the year 2020 and is projected to reach $125.6 billion by 2026, growing at a Compound Annual Growth Rate (CAGR) of 12.4% over the period. 
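For readers who want to check such figures, CAGR is the constant yearly growth rate that takes the start value to the end value. The sketch below applies the standard formula to the rounded endpoints quoted above, assuming a six-year span (2020 to 2026); the `cagr` helper is written just for this example.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    return (end_value / start_value) ** (1 / years) - 1

# Rounded endpoints from the text; the six-year span is an assumption.
rate = cagr(63.1, 125.6, 2026 - 2020)
print(f"{rate:.1%}")
```

Under these assumptions the result is roughly 12.2%, close to but not identical to the quoted 12.4%; the difference may come from the report using unrounded values or a different base period.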
The database market has followed different trends over the years. What new trends lie ahead of us? Current Database Trends & Applications 2021 shows details about future trends, so let us summarize them. 

1. SQL 
As we know, NoSQL technologies have been emerging over the last couple of decades. However, according to this report, classical SQL is coming back. Further, new technologies called NewSQL (CockroachDB, Google Spanner) are trying to exploit the features of both SQL and NoSQL technologies. Even newer machine learning-based offerings, such as MindsDB's ML framework and AWS Redshift ML, have adopted SQL as the default query language, which confirms this trend. 

2. ML Driven Databases
Machine learning is something that you cannot avoid. Since databases can hold a large volume of data, a database with built-in ML capabilities can deliver predictive and prescriptive analytics directly from the existing data. 
Like SQL Server, which has machine learning capabilities, newer databases such as MindsDB and SingleStore also support machine learning.

3. Microservice Integration
Modern applications are moving towards microservices, and databases need to facilitate this too. The most notable NoSQL databases, MongoDB and AWS DynamoDB, provide the schema flexibility, redundancy/scalability, and serverless architecture pattern support required for microservices.


4. In-Memory Databases
Today, mission-critical applications need high-performance databases, and in-memory databases cater to these requirements. SQL Server supports in-memory tables.
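As a small illustration of the in-memory idea, the sketch below uses SQLite's `:memory:` mode through Python's standard `sqlite3` module. This is a stand-in chosen because it runs anywhere; it is not SQL Server's in-memory table feature, and the table and values are made up for the example.

```python
import sqlite3

# ":memory:" keeps the whole database in RAM; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.5), ("a", 2.5), ("b", 4.0)],  # hypothetical sample rows
)
rows = conn.execute(
    "SELECT sensor, AVG(value) FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()
conn.close()  # the data disappears once the connection is closed
print(rows)
```

The trade-off is visible even here: access avoids disk I/O, but the data does not survive the connection unless it is persisted separately.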

Though there are many other features, the ones mentioned above are the important trending features. What do you think? What new features would you like to see in your databases?



Monday, August 16, 2021

Article: AutoML in Azure Machine Learning Service Regression and Time Series

Having discussed the AutoML features in Azure Machine Learning, the latest article of the series discusses AutoML for Regression and Time Series in Azure Machine Learning. 

The following are the other articles in the series. 




Saturday, August 14, 2021

Impact on Time Series from External Factors

Many statistical time series methods consider only the trend, cyclic, and seasonal factors of the series' own values. However, real-world data rarely behaves that way. External environmental factors affect the time series most of the time. For example, oil prices do not depend only on historical oil prices. They also depend on the world political situation, especially the political situation in the Middle East. 


As you can see from the above figure, cigarette sales in the USA are very much dependent on external factors. For example, there is a rise in cigarette sales after World War II ended in 1945. However, the introduction of the federal tax and the ban on in-flight smoking resulted in significant drops in cigarette sales. 
Further, the lung cancer time series has the same shape as the cigarette sales series. This suggests that, by controlling cigarette sales, you can control lung cancer. This is called cross prediction, and it can be handled by the ARTXP algorithm. Further, in Azure AutoML, public datasets are used to model the above factors. 
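One simple way to check whether an external series leads a target series is to correlate the two at different time lags. The sketch below is not ARTXP and not what Azure AutoML does; it is a plain lagged Pearson correlation, and the numbers are synthetic values invented for the illustration (the target simply repeats the driver two steps later).

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def lagged_correlation(driver, target, lag):
    """Correlate driver[t] with target[t + lag]."""
    if lag == 0:
        return pearson(driver, target)
    return pearson(driver[:-lag], target[lag:])

# Synthetic data for illustration only: target copies driver two steps later.
driver = [5, 7, 9, 12, 11, 8, 6, 4, 3, 5, 8, 10]
target = [0, 0] + driver[:-2]

# Pick the lag (0 to 4 steps) at which the driver best explains the target.
best_lag = max(range(5), key=lambda k: lagged_correlation(driver, target, k))
print(best_lag)
```

A strong correlation at a positive lag hints that the external series is worth including as an input to the forecasting model, although correlation alone does not establish that one series causes the other.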


Thursday, August 12, 2021

Matching People when they are Getting Older

We have done multiple image processing tasks with Orange over the last few blog posts. We looked at how to match models with and without their usual makeup, and did this for both Hollywood and Bollywood beauties. Further, we looked at how to classify people with and without face masks after looking at a few image classification techniques. In the last post, we discussed how to identify actresses without their fancy hair. 

Today we are looking at how to identify these models as they grow older. 

These are the images of actresses who have grown older. 


We can use the Orange Data Mining tool similar to what we did last time. 


The OpenFace embedder was used for Image Embedding, while cosine distance was used as the distance measure in the Neighbors widget. 
Though we had higher success rates in the previous activities, this activity was able to match only 11 out of 42 images. Does that mean that, although you can match these beauties with no makeup or less hair, it is difficult to match them as they grow older?
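The matching step described above boils down to finding, for each query embedding, the stored embedding with the smallest cosine distance. Here is a minimal sketch of that idea outside Orange; the three-dimensional vectors and the labels are hypothetical, whereas real face embeddings have many more dimensions.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest(query, gallery):
    """Return the gallery label whose embedding is closest to the query."""
    return min(gallery, key=lambda label: cosine_distance(query, gallery[label]))

# Hypothetical embeddings of earlier photos, keyed by person.
younger = {"person_a": [0.9, 0.1, 0.2], "person_b": [0.1, 0.8, 0.3]}

# Hypothetical embedding of a recent photo of one of them.
older_query = [0.8, 0.2, 0.3]

match = nearest(older_query, younger)
print(match)
```

A match counts as correct when the nearest neighbour carries the same label as the query, which is how a figure such as 11 out of 42 can be tallied.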

Tuesday, August 10, 2021

Article: AutoML in Azure Machine Learning Service

AutoML is a new concept in Machine Learning where most of the steps are carried out by the AutoML process. The following are the typical steps in a standard machine learning process. 


In the AutoML process, once the data is accepted, all the other steps such as data preparation, feature engineering, model tuning, etc. are done automatically. AutoML in Azure Machine Learning is the latest article in the Azure Machine Learning article series.  
The following are the other articles in the series. 

Introduction to Azure Machine Learning using Azure ML Studio
Data Cleansing in Azure Machine Learning
Prediction in Azure Machine Learning
Feature Selection in Azure Machine Learning
Data Reduction Technique: Principal Component Analysis in Azure Machine Learning
Prediction with Regression in Azure Machine Learning
Prediction with Classification in Azure Machine Learning
Comparing models in Azure Machine Learning
Cross Validation in Azure Machine Learning
Clustering in Azure Machine Learning
Tune Model Hyperparameters for Azure Machine Learning models
Time Series Anomaly Detection in Azure Machine Learning
Designing Recommender Systems in Azure Machine Learning
Language Detection in Azure Machine Learning with basic Text Analytics Techniques
Azure Machine Learning: Named Entity Recognition in Text Analytics
Filter based Feature Selection in Text Analytics
Latent Dirichlet Allocation in Text Analytics
Recommender Systems for Customer Reviews
AutoML in Azure Machine Learning



 

Wednesday, August 4, 2021

Time Series Cheatsheet 6.0

We have been releasing the Time Series Cheatsheet during this year in order to capture all the features of Time Series Forecasting. This is the newest release of that effort. 


You can access this list from this link.
The following are the major changes from the previous versions. 
1. Separation of forecasting algorithms into four main categories. 
            a) Time Series Non-Linear Methods
            b) Classical Statistical Time Series Linear Methods
            c) Supervised Machine Learning Methods
            d) Regression Methods

2. Inclusion of Stationarity Validation Techniques

Sunday, August 1, 2021

Our World in Data or Democratized Data

As we know, data is everywhere, but do we have access to it? In other words, is data democratized? Some efforts have been made to provide data access to the general public, and Our World in Data is one of them.

Beyond data, this site also provides a few publications. 


Our World in Data Publications

We will leave these discussions for later. Today, let us look at some features of Our World in Data. There are many topics such as Health, Food & Agriculture, Education and Knowledge, etc., and each topic has many subtopics. 


Let us look at one of the most "popular" datasets these days, which is COVID-19. 


You can subscribe to this site so that you will get the latest data in your inbox. We will discuss the datasets in the coming blog posts.