Sentiment Analysis
This project employs machine learning and natural language processing (NLP) to classify tweet sentiments as positive, negative, or neutral, leveraging tokenization, TF-IDF vectorization, and five classification algorithms. The Decision Tree model achieved 97% accuracy, demonstrating the value of sentiment analysis for tracking public opinion. A Flask API was developed to deploy the model and hosted on Render, making it easily accessible for broader use and analysis.
Understanding Sentiment Analysis
In today’s data-driven world, understanding public sentiment is more critical than ever. Sentiment analysis, a powerful intersection of machine learning and natural language processing (NLP), empowers businesses and organizations to extract meaningful insights from unstructured text data. By classifying sentiments as positive, negative, or neutral, it helps shape marketing strategies, improve customer experiences, and monitor brand reputation. From analyzing social media trends to gauging consumer feedback, sentiment analysis drives informed decision-making and fosters success in an increasingly competitive landscape.
Solving Organization Challenges
In an era dominated by unstructured data from social media, reviews, and online interactions, businesses and organizations face challenges in effectively understanding public sentiment. Traditional methods of analyzing feedback are time-intensive, lack scalability, and fail to provide actionable insights promptly. This limitation impacts their ability to monitor brand reputation, track trends, and respond to consumer needs efficiently.
To address this, the project employs machine learning techniques to classify sentiment in tweet text, providing an automated, accurate, and scalable solution for real-time sentiment analysis that enhances strategic decision-making and outcomes.
Data
The data files used in this project were sourced from Kaggle, a popular platform for datasets and data competitions.
Workflow
Statistical and descriptive analysis, both quantitative and visual, was conducted to better understand the data, identify patterns, and determine necessary preprocessing and feature engineering steps.
Preprocessing
The preprocessing in this project involved several steps to clean and prepare the text data:
RegEx
Used to remove special characters (e.g., /, $, %, &, #) that do not add semantic value.
BeautifulSoup
Employed to strip HTML tags (e.g., <br>).
Lowercasing
Converted text to lowercase for consistency.
Stopword Removal
Used a stopword corpus to eliminate common, non-informative words (e.g., "the," "a," "an").
Lemmatization
Reduced words to their canonical forms (e.g., "running" to "run").
TF-IDF Vectorization
Transformed text into numerical vectors for machine learning using Term Frequency-Inverse Document Frequency.
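The cleaning steps above can be sketched in a single function. This is a minimal, dependency-light stand-in: the stopword set and lemma map below are tiny illustrative placeholders (the project uses a full stopword corpus and a proper lemmatizer), and a regex substitutes for BeautifulSoup's HTML-tag stripping.

```python
import re

# Illustrative stand-ins; the project uses a full stopword corpus and lemmatizer.
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "it"}
LEMMAS = {"running": "run", "loved": "love", "tweets": "tweet"}

def clean_tweet(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags (BeautifulSoup in the project)
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # remove special characters (/, $, %, &, #, ...)
    tokens = text.lower().split()               # lowercase and tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    tokens = [LEMMAS.get(t, t) for t in tokens]         # lemmatization
    return " ".join(tokens)

print(clean_tweet("<br>Running to the store & loved it! #fun"))  # → run store love fun
```

The cleaned string is what would then be fed into the TF-IDF vectorizer.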
Model Building & Training
A Random Forest classification model was developed and trained on the training dataset. Random Forest was chosen because an ensemble of decision trees, each voting on the predicted class, reduces overfitting compared to a single decision tree and improves accuracy. Upon testing and evaluation, the model achieved 94.2% accuracy. After hyperparameter tuning, its accuracy improved to 96%, demonstrating its effectiveness in classifying outcomes.
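The training-and-tuning flow can be sketched with scikit-learn. The corpus below is a toy stand-in for the Kaggle tweet data, and the parameter grid is hypothetical; the actual dataset, grid, and tuned values are not specified here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy corpus standing in for the tweet dataset (1 = positive, 0 = negative).
texts = ["love this", "great product", "happy day", "awful service",
         "hate it", "terrible experience", "really love it", "so bad"] * 10
labels = [1, 1, 1, 0, 0, 0, 1, 0] * 10

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42
)

# Hypothetical grid; the actual tuned hyperparameters are not given in the source.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
    cv=3,
)
grid.fit(X_train, y_train)
acc = accuracy_score(y_test, grid.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```

`GridSearchCV` refits the best parameter combination on the full training split, so `grid` can be used directly for prediction afterward.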
Web API
Model Deployment
The pre-trained model was deployed on Render. A Flask web API was developed to handle requests: a vectorizer joblib file transforms new text input into vectorized data, and the model joblib file then classifies the sentiment of the text and returns the classification result.
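The serving flow can be sketched as a small Flask app. The joblib filenames and the `/predict` route below are hypothetical (the source does not name them), and for self-containment the sketch trains and persists tiny stand-in artifacts before loading them back.

```python
import joblib
from flask import Flask, jsonify, request
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Stand-in artifacts so the sketch runs end to end; in the project these
# .joblib files come from the model-building step (filenames hypothetical).
vec = TfidfVectorizer().fit(["love it", "hate it"])
clf = DecisionTreeClassifier(random_state=0).fit(
    vec.transform(["love it", "hate it"]), ["positive", "negative"]
)
joblib.dump(vec, "vectorizer.joblib")
joblib.dump(clf, "model.joblib")

app = Flask(__name__)
vectorizer = joblib.load("vectorizer.joblib")  # vectorization joblib file
model = joblib.load("model.joblib")            # model joblib file

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json().get("text", "")
    features = vectorizer.transform([text])    # transform raw text into a TF-IDF vector
    sentiment = model.predict(features)[0]     # classify the sentiment
    return jsonify({"sentiment": str(sentiment)})

# Exercise the endpoint with Flask's built-in test client.
resp = app.test_client().post("/predict", json={"text": "love it"})
print(resp.get_json())
```

Loading the vectorizer and model once at startup, rather than per request, keeps response latency low on a hosted service like Render.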