This demo was created by Centroida in response to a specific request from two financial institutions. Both banks were interested in conducting real-time sentiment analysis of textual data at scale.
Two separate engineering challenges were addressed during the development of the demo:
- Building an NLP model that achieves near state-of-the-art results in sentiment analysis
- Building a high-performance, scalable data pipeline that uses the model to provide real-time analytics over Twitter data
On SemEval 2017, a standard benchmark dataset in this area, we achieve near state-of-the-art results of 67.23% validation accuracy and 65.24% recall. To put this in perspective, if you take Microsoft Azure’s Text Analytics API and evaluate it on the same dataset, it scores about 54% in accuracy and 57% in recall.
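For readers who want to see how two such figures can be computed, here is a minimal sketch in plain Python. It assumes that recall is macro-averaged over the sentiment classes, which the text above does not specify, and the labels and toy data are made up for illustration; this is not the evaluation script used for the demo.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall (assumed averaging scheme)."""
    support = Counter(y_true)  # gold examples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    return sum(hits[c] / support[c] for c in support) / len(support)


# Toy data for illustration only; these are not SemEval 2017 examples.
gold = ["positive", "negative", "neutral", "neutral", "neutral", "negative"]
pred = ["positive", "neutral", "neutral", "neutral", "negative", "negative"]

print(f"accuracy:     {accuracy(gold, pred):.4f}")
print(f"macro recall: {macro_recall(gold, pred):.4f}")
```

Because macro-averaging weights every class equally, the two numbers can diverge when the classes are imbalanced, which is one reason both are worth reporting.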
Our pipeline doesn’t fall short either. To simulate a realistic load, we loaded 336 million tweets and processed them in batches, inferring each tweet’s sentiment with our model. The current demo sustains an average throughput of about 50,000 tweets per second, a figure that can be scaled further given sufficient resources.
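To make the batch setup and the throughput figure more concrete, here is a rough single-process sketch in Python. The `predict_batch` function is a hypothetical stand-in for the model, and the batch size is an arbitrary assumption; the actual pipeline behind the demo is not shown here and is presumably distributed.

```python
import time
from itertools import islice


def batched(items, size):
    """Yield consecutive lists of up to `size` items from any iterable."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def predict_batch(tweets):
    """Hypothetical placeholder for model inference: labels everything neutral."""
    return ["neutral" for _ in tweets]


def run(tweets, batch_size=1024):
    """Process tweets in batches; return (tweets processed, tweets per second)."""
    start = time.perf_counter()
    processed = 0
    for batch in batched(tweets, batch_size):
        labels = predict_batch(batch)
        processed += len(labels)
    elapsed = time.perf_counter() - start
    return processed, (processed / elapsed if elapsed > 0 else float("inf"))


def estimated_seconds(total_tweets, tweets_per_second):
    """Wall-clock time needed to process a backlog at a given throughput."""
    return total_tweets / tweets_per_second


if __name__ == "__main__":
    n, tps = run(f"tweet {i}" for i in range(100_000))
    print(f"processed {n} tweets at roughly {tps:,.0f} tweets/s")
    # 336 million tweets at 50,000 tweets/s: 6,720 s, i.e. 112 minutes.
    print(estimated_seconds(336_000_000, 50_000) / 60, "minutes")
```

At the quoted average of 50,000 tweets per second, the full set of 336 million tweets takes about 6,720 seconds, or roughly 112 minutes, to process.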
Although the demo is currently tailored to Twitter data, the underlying approach can be applied to any textual dataset, e.g., Facebook feeds, news feeds, user reviews, and comments. As a result, it can deliver significant value across a wide range of domains.