
Monday, June 21, 2021

Data Analysis for Singlish Texts

Like people in many other non-English-speaking nations, we Sri Lankans are used to typing Sinhala words using English letters. Though many word processing tools and apps are available for Sinhala text, we still see a lot of people using Singlish words.
Not only are these Singlish texts difficult to read, but at the research level it is also difficult to identify these words.
In any text-related research, the first task is identifying the language. We discussed how to detect a language using Azure Machine Learning in a previous article.
This post looks at whether we can detect Singlish text using Azure Machine Learning. The following is the configured Azure Machine Learning experiment.




You can download the experiment from the Azure AI gallery. Let us look at some important findings in this experiment. 

Out of the 1400+ texts, 35% were identified as English, possibly because the texts are written in English letters. The big surprise is that many Singlish texts were identified as Indonesian and Romanian, at 26% and 13% respectively. It is not clear whether there is a relationship between Singlish and the Indonesian and Romanian languages.
Another important finding is that Singlish texts were identified as 40 different languages, such as Malay, Turkish, Polish, and Irish.
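The percentages above are simply the share of texts assigned to each detected language. As a rough sketch, assuming the predicted labels have been exported from the experiment as a plain list of language names, the tally could be computed like this in Python. The `language_shares` function and the sample labels are hypothetical and are not the experiment's actual data.

```python
from collections import Counter


def language_shares(detected):
    """Return (language, percent) pairs, largest share first."""
    total = len(detected)
    if total == 0:
        return []
    counts = Counter(detected)
    # most_common() sorts by count, so the biggest language comes first.
    return [(lang, round(100 * n / total, 1)) for lang, n in counts.most_common()]


# Hypothetical labels for ten Singlish sentences (illustration only).
sample = [
    "English", "Indonesian", "English", "Romanian", "Indonesian",
    "English", "Malay", "English", "Turkish", "Indonesian",
]

for lang, pct in language_shares(sample):
    print(f"{lang}: {pct}%")
```

The number of distinct languages is then just the length of the returned list.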




Tuesday, June 15, 2021

ETL Framework for Document Databases & Relational Databases

The following two research papers aim to achieve ETL functionality between document databases and RDBMS databases. In both studies, MongoDB and SQL Server were used as the proof of concept.

The first study is about building a replication layer between document and relational databases. The basic architecture is shown in the figure below.


During this research, CPU usage was compared, as shown in the figure below.



You can see that SQL Server's processing percentage is in the range of 70 - 100%, while MongoDB's does not even reach 1%.
Using the findings of the above research, a second study was done to define an ETL framework between DocumentDB and SQL Server.
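To give a feel for what moving data from a document database to a relational one involves, here is a minimal Python sketch that splits one nested document into a parent row and child rows. This is not the framework from the papers. The `flatten_order` function, the field names, and the sample document are all hypothetical and only illustrate the general idea.

```python
def flatten_order(doc):
    """Split one nested document into a parent row and a list of child rows."""
    parent = {
        "order_id": doc["_id"],
        "customer": doc["customer"]["name"],
    }
    # Each embedded array element becomes a child row keyed by the parent id.
    children = [
        {"order_id": doc["_id"], "line_no": i, "sku": item["sku"], "qty": item["qty"]}
        for i, item in enumerate(doc.get("items", []), start=1)
    ]
    return parent, children


# Hypothetical document, shaped like something a document database might store.
order = {
    "_id": 101,
    "customer": {"name": "Nimal"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

parent_row, child_rows = flatten_order(order)
print(parent_row)
print(child_rows)
```

The parent and child rows would map to two tables joined on `order_id` on the relational side.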

Sunday, June 13, 2021

Few SSIS Articles

SQL Server Integration Services (SSIS) is an Extract-Transform-Load (ETL) tool in the SQL Server family. As shown in the following image, ETL is an important part of a data warehouse solution.


Conditional Split and Fuzzy Lookup are important controls in SSIS. Further, data sources can be created using the SSIS script component. Slowly Changing Dimensions (SCD) can be configured in SSIS to support the historical aspects of data warehouses.
Change Data Capture (CDC) enables the extraction of incremental data for the data warehouse. SSIS supports CDC with multiple controls.
You can execute Data Mining queries from SSIS to query the Data Mining Model. There are also options to retry an SSIS package when it fails.
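The retry idea is not specific to SSIS. As a generic sketch in Python, and not the SSIS configuration itself, a failing step can be re-run a limited number of times before giving up. The `run_with_retry` helper and the `flaky_load` task below are hypothetical names used only for illustration.

```python
import time


def run_with_retry(task, max_attempts=3, delay_seconds=0.0):
    """Call task() until it succeeds or max_attempts is used up."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            last_error = exc
            # Wait before the next attempt, but not after the final one.
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    raise last_error


calls = {"count": 0}


def flaky_load():
    """Hypothetical load step that fails on its first two calls."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "loaded"


result = run_with_retry(flaky_load, max_attempts=3)
print(result, "after", calls["count"], "attempts")
```

In practice a delay between attempts gives a temporary problem, such as a dropped connection, time to clear.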

Sri Lankan Data Community June 2021 Online Meetup

 


Join me at SL Data Community to discuss Time Series Analysis using the TICK stack.
Register here