Friday, February 09, 2024

NLP pitstop using GPT-3.5: translate, summarize, sentiment analysis and Named Entity Extraction!!!


Introduction:

Large Language Models (LLMs) represent a groundbreaking advancement in artificial intelligence, particularly in natural language processing. These models are designed to understand, generate, and manipulate human-like text on a massive scale. One of the most notable examples is OpenAI's GPT-3 (Generative Pre-trained Transformer 3), which is built on the Transformer architecture. What sets LLMs apart is their ability to learn from vast amounts of diverse data, enabling them to perform a wide range of language-related tasks such as text completion, translation, summarization, and even creative writing. These models are pre-trained on extensive datasets and can then be fine-tuned for specific applications.

The Transformer architecture underlying LLMs uses attention mechanisms to capture relationships between words and phrases in a text, which allows the models to grasp context and generate coherent, contextually relevant responses. GPT-3, for instance, has a staggering number of parameters, reaching hundreds of billions, contributing to its ability to understand and generate complex language patterns. As researchers continue to explore and improve the capabilities of Large Language Models, the technology is expected to play a pivotal role in shaping the future of human-computer interaction, information retrieval, and content generation. However, it is crucial to approach the development and deployment of LLMs with careful consideration of their ethical implications and societal impact.


NLP pitstop powered by GenAI [gpt-3.5-turbo]:












    

This app is based on the Gadget framework [a serverless Node.js environment for running your JavaScript code, accompanied by Actions and HTTP routes that are used to execute and organize your backend code]. Above, it has defined three data models: user, session, and translation. Below is an example of how it describes a relation between models, e.g., a user has many translations. You may supply multiple target languages separated by spaces or commas.











Once a user logs in and is authenticated into the application, it stores the session records as below.









Below is the JS that calls the OpenAI API endpoint [openai.chat.completions]:
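
(The JS itself appears above only as a screenshot; as a rough, hypothetical equivalent, here is a minimal sketch using the OpenAI Python SDK. The API key, model, and prompt wording are placeholders, not the app's actual code.)

# Minimal sketch of an equivalent chat.completions call (OpenAI Python SDK).
# Key, model, and prompt are illustrative assumptions.
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user",
         "content": "Translate 'Good morning' into French and German, "
                    "then give the sentiment of the sentence."}
    ],
)
print(response.choices[0].message.content)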















Access the Gadget-based NLP pitstop:
To access it, you can authenticate and log in with Gmail: translate and capture sentiment here

Hope you had fun using this simple translator with sentiment capture!!

Thursday, February 01, 2024

Pentaho (PDI) to translate human language (an NLP task) using OpenAI!

LLMs (Large Language Models) such as [LLaMA, GPT-4, ChatGPT, Megatron-Turing NLG, Mistral 7B] continue to fascinate with the NLP tasks [question answering, translation of human languages, sentiment analysis, text summarization] they are able to solve! Here is a question for OpenAI.

"Who was the first person to land on the moon?" The answer from the system is basically: given the statistical distribution of words in a public corpus, which words are likely to follow the sequence? Here: Neil Armstrong!!


How about human language translation, which can increase productivity among teams coming from different cultural backgrounds? LLMs can be trained to convert one language into several others! Here is an example.



Wow, how easily OpenAI can translate an English sentence into several languages!

How about using this in a data pipeline, with an ETL/ELT tool like Pentaho Data Integration (PDI) calling the REST endpoint of OpenAI Chat Completions? Below is a transformation (data pipeline) that calls the above API endpoint, supplying a language, the input data, and which model to use.



The input is a 'Data grid' step; then a JSON payload is prepared (using an 'Add constants' step), and string replacement substitutes the LANG, SENTENCE, and GEN_AI_MODEL tokens with the input coming from the 'Data grid' step. Once the JSON body is prepared, it is passed to the endpoint by calling the 'REST client' step in Spoon.
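
For readers who don't use Spoon, here is a rough Python sketch of what the transformation does end to end. The placeholder tokens LANG, SENTENCE, and GEN_AI_MODEL come from the post; the payload shape simply follows the Chat Completions API, and the key and row values are illustrative, not the actual pipeline.

import json
import requests

API_KEY = "YOUR_OPENAI_API_KEY"  # placeholder

# Template payload with placeholder tokens, as prepared in the 'Add constants' step
template = json.dumps({
    "model": "GEN_AI_MODEL",
    "messages": [
        {"role": "user",
         "content": "Translate the following sentence into LANG: SENTENCE"}
    ]
})

# Values that would come from the 'Data grid' step
row = {"GEN_AI_MODEL": "gpt-3.5-turbo",
       "LANG": "French, German",
       "SENTENCE": "The shipment will arrive on Monday."}

# String replacement, as in the 'Replace in string' logic
payload = template
for token, value in row.items():
    payload = payload.replace(token, value)

# The 'REST client' step: POST the prepared body to the Chat Completions endpoint
resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}",
             "Content-Type": "application/json"},
    data=payload,
)
print(resp.status_code, resp.json()["choices"][0]["message"]["content"])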


















Once the 'REST client' step invokes the OpenAI endpoint and gets the result (code 200 on success), you can see the data coming out of the 'Dummy' step as below.

Translated Output


Conclusion: This simple translator in a data pipeline shows how you can use the OpenAI REST endpoint while solving business scenarios where you are dealing with customer feedback and VoC data sets: increasing sales efficiency, strengthening partnerships, avoiding misunderstandings/disputes, etc.

The challenges you may run into with this model are performance (how long OpenAI takes to translate a paragraph or even a whole document) and cost (these systems charge by number of tokens, so for large text sizes an open-source LLM may be another option).

Hope you enjoyed!







Saturday, September 16, 2023

Pentaho PDI: working with Mongo & Kafka while processing hierarchical JSON!!

Need: When you are working with JSON objects, sometimes you need to create hierarchical JSON and store it in a Mongo collection. Then you may read the Mongo collection and stream it under a Kafka topic!

Here, flight data comes from various CSV files; once the files are collected, a date dimension is used to capture a few more columns as part of the JSON. That is the source JSON. In preview mode, the data looks like this.







The data is then processed through the hierarchical JSON plugin [a plugin for PDI], which needs to be installed before launching Spoon.





Once the hierarchical collection is stored, the data is read from the Mongo collection and produced under the Kafka stream. Provide the necessary parameters for a secured Kafka broker as below.
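
(As a rough illustration of the same Mongo-to-Kafka flow outside PDI, here is a minimal Python sketch using pymongo and kafka-python. The database, collection, and topic names and the SASL_SSL parameters are placeholders, not the actual setup.)

import json
from pymongo import MongoClient
from kafka import KafkaProducer

# Read the hierarchical documents back from the Mongo collection
client = MongoClient("mongodb://localhost:27017/")
collection = client["flights"]["flight_json"]  # assumed db/collection names

# Producer against a secured (SASL_SSL) Kafka broker -- parameters are placeholders
producer = KafkaProducer(
    bootstrap_servers="broker:9093",
    security_protocol="SASL_SSL",
    sasl_mechanism="PLAIN",
    sasl_plain_username="USER",
    sasl_plain_password="PASSWORD",
    value_serializer=lambda doc: json.dumps(doc).encode("utf-8"),
)

for doc in collection.find({}, {"_id": 0}):
    producer.send("flight-topic", doc)  # assumed topic name
producer.flush()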















A Kafka listener (consumer) is already listening (the secured Kafka parameters are supplied on the consumer side as well, as given below). Leave this transformation running; as long as topics are arriving in the Kafka broker, they keep being processed through the consumer. This consumer transformation refers to a stream transformation and then produces the data as text file output or any other output format desired.
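
(The consumer side can be sketched the same way; again a hypothetical Python equivalent with placeholder parameters, writing each message to a text file as the PDI consumer transformation does.)

import json
from kafka import KafkaConsumer

# Counterpart consumer sketch (same assumed security parameters as the producer)
consumer = KafkaConsumer(
    "flight-topic",
    bootstrap_servers="broker:9093",
    security_protocol="SASL_SSL",
    sasl_mechanism="PLAIN",
    sasl_plain_username="USER",
    sasl_plain_password="PASSWORD",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

with open("flights_out.txt", "a") as out:  # text file output, as in the PDI step
    for message in consumer:
        out.write(json.dumps(message.value) + "\n")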






Friday, April 28, 2023

OpenAI APIs via Pentaho-PDI (aka Spoon)

OpenAI has made its generative AI models publicly available (i.e., ChatGPT is one usage of generative AI from OpenAI). If you don't want to use the bot, OpenAI offers APIs you can invoke for AI application usage.


In order to test the API offerings from OpenAI, you need to have an account and then generate the API keys to use in your application. You can use Postman to test endpoints (like the example below); then I will show how to use Pentaho's Spoon to invoke the OpenAI APIs using the 'REST client' step.




While using Spoon, once you supply the JSON body, it invokes the above API endpoint and gives you the result back. You need to authenticate with the token (API key) generated under your account > "API Keys" within the OpenAI interface.

JSON :{"model":"code-davinci-edit-001","input":"What day of the wek is it?","instruction":"Fix the spelling mistakes"}
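
Outside Postman or Spoon, the same call can be made with a few lines of Python (a sketch; /v1/edits is the legacy endpoint that the edit models used, and the API key is a placeholder):

import requests

API_KEY = "YOUR_OPENAI_API_KEY"  # placeholder

resp = requests.post(
    "https://api.openai.com/v1/edits",  # legacy endpoint for the edit models
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"model": "code-davinci-edit-001",
          "input": "What day of the wek is it?",
          "instruction": "Fix the spelling mistakes"},
)
print(resp.json()["choices"][0]["text"])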


Similarly, you can access other endpoints, for example to parse unstructured data:


A table summarizing the fruits from Goocrux: There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy. There are also loheckles, which are a grayish blue fruit and are very tart, a little bit like a lemon. Pounits are a bright green color and are more savory than sweet. There are also plenty of loopnovas which are a neon pink flavor and taste like cotton candy. Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them. | Fruit | Color | Flavor |




Endpoint API : https://api.openai.com/v1/completions
JSON body: {"model": "text-davinci-003", "prompt": "A table summarizing the fruits from Goocrux:\n\nThere are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy. There are also loheckles, which are a grayish blue fruit and are very tart, a little bit like a lemon. Pounits are a bright green color and are more savory than sweet. There are also plenty of loopnovas which are a neon pink flavor and taste like cotton candy. Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them.\n\n | Fruit | Color | Flavor |",  "temperature": 0,  "max_tokens": 100,  "top_p": 1.0,  "frequency_penalty": 0.0,  "presence_penalty": 0.0}
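
The same pattern works here (a sketch; text-davinci-003 is a legacy completions model, and the API key is a placeholder):

import requests

body = {"model": "text-davinci-003",
        "prompt": "A table summarizing the fruits from Goocrux: ...",  # full prompt as in the JSON body above
        "temperature": 0, "max_tokens": 100}

resp = requests.post("https://api.openai.com/v1/completions",
                     headers={"Authorization": "Bearer YOUR_OPENAI_API_KEY"},
                     json=body)
print(resp.json()["choices"][0]["text"])  # the completed table rows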

The APIs from OpenAI are very powerful, and you can try many other APIs like image generation, embeddings, audio, and moderation. You may try out the OpenAI APIs here: https://platform.openai.com/docs/api-reference

Hope you enjoyed this blog on OpenAI APIs usage using Spoon(PDI)!


Wednesday, July 29, 2020

Solve NLP tasks: PyTorch Transformer Pipeline





Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural language.

Sentiment analysis (SA) is the interpretation and classification of emotions (positive, negative and neutral) within text data using text analysis techniques. Sentiment analysis tools allow businesses to identify customer sentiment toward products, brands or services in online feedback.


Here we are going to try out the PyTorch-based transformers pipeline (deep learning NLP), whose default question-answering model was trained on SQuAD from Stanford (References [1]).


# Q&A pipeline
!pip install -qq transformers
from transformers import pipeline

qapipe = pipeline('question-answering')
qapipe({
    'question': """how can question answering service produce answers""",
    'context': """One such task is reading comprehension. Given a passage of text, we can ask questions about the passage that can be answered by referencing short excerpts from the text. For instance, if we were to ask about this paragraph, "how can a question be answered in a reading comprehension task" ..."""
})

Output:

{'score': 0.38941961529900837,
 'start': 128,
 'end': 169,
 'answer': 'referencing short excerpts from the text.'}


# Sentiment Analysis pipeline
from transformers import pipeline

sentiment_pipe = pipeline('sentiment-analysis')
sentiment_pipe("I sure would like to see a resurrection of a up dated Seahunt series with the tech they have today it would bring back the kid excitement in me.I grew up on black and white TV and Seahunt with Gunsmoke were my hero's every week.You have my vote for a comeback of a new sea hunt.We need a change of pace in TV and this would work for a world of under water adventure.Oh by the way thank you for an outlet like this to view many viewpoints about TV and the many movies.So any ole way I believe I've got what I wanna say.Would be nice to read some more plus points about sea hunt.If my rhymes would be 10 lines would you let me submit,or leave me out to be in doubt and have me to quit,If this is so then I must go so lets do it.")

Output:
[{'label': 'POSITIVE', 'score': 0.91602623462677}]


Experiments: the Python code above was run in Google Colab. You can try other pipelines like NER (Named Entity Recognition) and feature extraction. The Hugging Face link in References [2] is useful.
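
For instance, a NER pipeline takes only a couple of lines (a quick sketch; in the transformers version of that era, grouped_entities=True merges word pieces into whole entities, and the example sentence is mine):

# NER pipeline
from transformers import pipeline

ner_pipe = pipeline('ner', grouped_entities=True)
ner_pipe("Hugging Face is based in New York City.")
# returns entity groups such as ORG 'Hugging Face' and LOC 'New York City'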

Conclusion:
The advancement of deep learning in NLP is growing rapidly via transformer-driven architectures, and it's pretty convenient to put these models into practice with minimal coding while maintaining high accuracy for a given NLP task. Taking an existing model and running epochs (train/test) on a custom data set is also feasible, and in most cases you can achieve higher accuracy than the baseline model. So keep exploring the new transformer-based models!

References:
1. The Stanford SQuAD data set for Q&A is available here: https://rajpurkar.github.io/SQuAD-explorer/
2. More on Transformer Pipeline: https://huggingface.co/transformers/main_classes/pipelines.html


Monday, March 23, 2020

COVID-19 data analysis using Pentaho tools..



The world is in a pandemic caused by a novel coronavirus (the disease is COVID-19). In order to understand its impact around the world, JHU's Coronavirus Resource Center has provided data sets.
Data Source: https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases

For the analysis here, I have chosen the global narrow data sets (dates are in one column) for ETL processing (they also offer data sets where each date is a column).


Pentaho Data Integration is used here. A job was designed to download the files (Confirmed, Recovered, Deaths), and a transformation was created to load the data into a MySQL table. Later, the DSW [Data Source Wizard] from the PUC [Pentaho User Console], which builds a Mondrian-based model, was used to generate Pentaho Analyzer (PAZ) reports to place on a PDD (Pentaho Dashboard Designer) dashboard. More snapshots of the process are below.
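
For reference, a rough Python equivalent of the download-and-load portion of that job (the URLs, credentials, and table name are placeholders; the skipped row reflects the HXL tag row that the HDX narrow files carry under the header):

import pandas as pd
from sqlalchemy import create_engine

# Rough equivalent of the PDI job + transformation: fetch each narrow CSV
# (Confirmed, Recovered, Deaths) and append it to a MySQL table.
engine = create_engine("mysql+pymysql://user:password@localhost/covid")  # placeholder

files = {
    "confirmed": "CONFIRMED_CSV_URL",  # placeholders for the HDX file URLs
    "recovered": "RECOVERED_CSV_URL",
    "deaths": "DEATHS_CSV_URL",
}
for measure, url in files.items():
    df = pd.read_csv(url, skiprows=[1])  # skip the HXL tag row in the narrow files
    df["measure"] = measure
    df.to_sql("covid_cases", engine, if_exists="append", index=False)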

ETL via Pentaho Data Integration:

















Dashboard for analysis:

As you can see, three reports are collected, controlled through two dashboard prompts (country and date). From the confirmed cases it is clear that China's curve has already flattened (stable health), whereas other countries are trending upward. The next prominent country is Italy, where we have also seen many deaths, and improvement in medical recovery is still challenging (recovered).

Mar 22, 2020




Mar 27, 2020

As you can see, the US and Italy have surpassed China in confirmed cases and deaths, but we have yet to see their recovery numbers grow higher.


Hoping the recovery curve will be uplifted in the weeks to come (sooner).

Apr 3, 2020

You can see the numbers are still rising in confirmed cases (scaled independently for each country). It will still take some time for the recovery and confirmed lines to merge (or come close, as in China, for example). Hoping it is sooner.



Social Distancing Does Matter - running a Python program via PDI shows that a spread variable can make a difference in the infected population (a toy sketch of such a simulation follows the list below):
- Spread = 1: with no social distancing, all 100K of the population were infected.
- Spread = 0.5 [for every 1 person infected, 1 other person maintained social distancing]: that brought the infected population down to 80K.
- Spread = 0.25 [1 infected, 3 maintained social distancing]: it brought the infected population down to 35K.
- Spread = 0.2 [1 infected, 4 maintained social distancing]: it brought the infected population down to 20K.
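
The original script isn't shown in the post; purely as an illustration, here is a toy SIR-style sketch where a spread factor scales the contact rate (parameters and dynamics are assumptions and won't reproduce the exact numbers above):

# Toy infection simulation: 'spread' scales the effective contact rate.
# Illustrative only -- not the original PDI-invoked script.
def simulate(spread, population=100_000, beta=0.3, gamma=0.1, days=365):
    s, i, total = population - 1.0, 1.0, 1.0
    for _ in range(days):
        new_infections = beta * spread * s * i / population
        recoveries = gamma * i
        s -= new_infections
        i += new_infections - recoveries
        total += new_infections
    return round(total)

for spread in (1.0, 0.5, 0.25, 0.2):
    print(f"spread={spread}: total infected ~ {simulate(spread)}")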



Apr 13, 2020:

This chart captures the daily delta in confirmed COVID-19 cases, and you can see an early sign that the US has started flattening. We can see a similar trend in European countries as well. Over the next few weeks, it will be better to maintain social distancing to completely flatten out these curves!

Apr 27, 2020:

The scale is independent for each country.
The recovery curve is still progressing slowly in the US, and the UK's recovery is very small. Germany and Spain are doing better on the recovery trajectory.

July 29, 2020:

As you can see, cases have now been growing in countries like Brazil and India. Here is a projection of the daily delta in confirmed cases.














Here is the trend by country near the end of July 2020. The US, India, and Brazil are going upward, as is Spain. It's clear that Germany, Italy, and the UK are tending toward a stable curve.



Chart of the deaths trend by country:

Sep 10, 2020:

Projecting the number of cases as a delta count compared to the previous day, India has crossed Brazil and the US.




















Recovery pattern analysis: Brazil, India, and Germany are following a pattern where the recovery rate is close to the case rate, whereas in Spain, the US, and Italy the cases and recoveries are not close. The UK's recovery number may be an exception to this.


















Nov 11, 2020:

US confirmed cases are increasing pretty rapidly. The current daily numbers are more than two times the peak from July.
















Nov 19, 2020:

Just looking at the 5-day moving average of confirmed cases, it's clear that in the US, cases are going up.
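
The dashboards here are built with Pentaho; as an aside, the same 5-day moving average is easy to express in pandas (a sketch; the file and column names follow the narrow data set's layout but are assumptions):

import pandas as pd

# Narrow-format file: one row per country/date, cumulative count in 'Value'
df = pd.read_csv("time_series_covid19_confirmed_global_narrow.csv",
                 skiprows=[1], parse_dates=["Date"]).sort_values("Date")

# Daily delta from the cumulative counts, then a 5-day moving average
df["daily"] = df.groupby("Country/Region")["Value"].diff().fillna(0)
df["ma5"] = (df.groupby("Country/Region")["daily"]
               .transform(lambda s: s.rolling(5).mean()))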



















Apr 12, 2021:

As the world gradually moves toward vaccination and reopening economies, the outcomes coming out of many countries are mixed. Cases are spiking up in India and Brazil, which is clear from this when comparing the number of vaccinated individuals to the population size.

-- 5-Day Moving Average, Confirmed Cases
















-- Daily Delta, Confirmed Cases













Apr 20, 2021:

As coronavirus cases spike with the second wave in India, guidelines have come from the CDC: Guidelines on travel to India

It's going to be some time until we see the curve flatten for India.

Apr 27, 2021:

As you can see, daily cases in India are still growing; with the second wave of the virus, cases are surging in a matter of weeks.







Wednesday, November 27, 2019

OCR via pytesseract (Capture text from the image)!!


As taken from the site:
Python-tesseract is an optical character recognition (OCR) tool for python. That is, it will recognize and “read” the text embedded in images. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.

This is based on Google's Tesseract: https://github.com/tesseract-ocr/tesseract


Here I will show you how to run it via Google's Colab interface.

Image File (test.png):

In Google Colab > open a Python 3 notebook:

#Install these Python Libraries
!sudo pip install pytesseract
!sudo apt install tesseract-ocr
!sudo apt install libtesseract-dev

#Read image and extract text
from PIL import Image
import pytesseract
from google.colab import drive

# Mount Google Drive to read the image file
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"
ocr_file = root_dir + 'YOUR_DRIVE/test.png'

# Point pytesseract at the tesseract binary installed by apt above
pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'

img = Image.open(ocr_file)
output = pytesseract.image_to_string(img, lang='eng')
print(output)

Outcome: