Royce Zhai (mzhai4@illinois.edu)
This project performs sentiment analysis on tweets from Twitter. Tweets are passed to a trained model, which predicts the sentiment of each one. This is interesting because it shows whether people are being positive or negative in their posts. The models are trained on a labeled dataset of tweets.
- Python 3.7
- Jupyter - notebooks used to train and test the models
- Pickle - used to save the trained models and vectors as binary files
- Pandas, NumPy - load and manipulate data using DataFrames
- NLTK - used in data pre-processing and cleaning
- Scikit-learn - machine learning algorithm toolkit
- Tweepy - Twitter API to stream live tweets
Make sure you have Anaconda installed.
Create the environment: conda create -n rbmint python=3.7
Activate the environment: conda activate rbmint
Install packages: pip install -r environment_setup.txt
Install stopwords: python -c "import nltk; nltk.download('stopwords')"
- app.py - Main application file that interacts with the tweets and the models
- TrainModel.ipynb - This notebook contains the pre-processing and model training
- environment_setup.txt - File containing the Python requirements for this project
- Test/ directory
  - Test.ipynb - Notebook containing test code to unpack and load the model for predictions
  - twitter_analysis.py - Initial tests using the Twitter API and the trained models
  - twitter_api.py - Initial tests setting up the Twitter API
- Pickled data/ directory
  - LR.pickle - Pickled trained Logistic Regression model
  - naive-bayes.pickle - Pickled trained Naive Bayes model
  - nn.pickle - Pickled trained Neural Network model
  - vector.pickle - Pickled TF-IDF vector to transform the data
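As a rough sketch of how the pickled files listed above might be used together, the snippet below loads a vectorizer and a model and maps predictions to labels. The helper names `load_pickle`, `predict_sentiment`, and `load_and_predict` are hypothetical and not taken from this repository. The sketch assumes the pickled objects expose scikit-learn-style `transform` and `predict` methods and that 1 means Positive and 0 means Negative, as in the training data.

```python
import pickle
from pathlib import Path

# Directory and file names as listed in this README.
PICKLE_DIR = Path("Pickled data")


def load_pickle(path):
    """Load one pickled object. Only unpickle files you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def predict_sentiment(texts, vectorizer, model):
    """Vectorize already-cleaned tweets and return a label per tweet.

    Assumes `vectorizer.transform` and `model.predict` behave like their
    scikit-learn counterparts.
    """
    features = vectorizer.transform(texts)
    labels = model.predict(features)
    return ["Positive" if int(label) == 1 else "Negative" for label in labels]


def load_and_predict(texts, model_file="LR.pickle"):
    """Hypothetical convenience wrapper: load both pickles, then predict."""
    vectorizer = load_pickle(PICKLE_DIR / "vector.pickle")
    model = load_pickle(PICKLE_DIR / model_file)
    return predict_sentiment(texts, vectorizer, model)
```

Calling `load_and_predict(["what a great day"])` from the project root would then return a one-element list of labels, provided the tweets have gone through the same cleaning that was used during training.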
- 1.6 million individual tweets, each with a 1 (Positive) or 0 (Negative) label
- Data cleaning involved the following steps
  - Convert the tweet to lowercase, remove stopwords
  - Remove the hashtag symbol (#)
  - Remove @mentions, websites
  - Perform stemming
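The cleaning steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code from TrainModel.ipynb: `clean_tweet` and `DEMO_STOPWORDS` are hypothetical names, the regular expressions and the tokenization (which also drops punctuation) are guesses, and the tiny stopword set only stands in for the NLTK stopword list so that the example runs without downloads.

```python
import re

# Small stand-in list so the sketch runs offline. In the project this would
# presumably be nltk.corpus.stopwords.words("english").
DEMO_STOPWORDS = {"a", "an", "the", "is", "are", "to", "and", "of", "in", "at"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")  # websites
MENTION_RE = re.compile(r"@\w+")               # @mentions


def clean_tweet(text, stopwords=DEMO_STOPWORDS, stem=None):
    """Lowercase, strip URLs, @mentions and '#', drop stopwords, then stem.

    `stem` is any callable mapping a word to its stem, for example
    nltk.stem.PorterStemmer().stem (the README does not say which stemmer
    was used). With stem=None the stemming step is skipped.
    """
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")  # keep the hashtag word, drop the symbol
    tokens = re.findall(r"[a-z0-9']+", text)
    tokens = [t for t in tokens if t not in stopwords]
    if stem is not None:
        tokens = [stem(t) for t in tokens]
    return " ".join(tokens)


print(clean_tweet("Loving the #Illini game! @friend check https://t.co/abc"))
# prints: loving illini game check
```

Passing the stemmer in as a parameter keeps the sketch free of third-party imports; swapping in an NLTK stemmer is a one-argument change.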
- Logistic Regression - 77%
- Naive Bayes - 76%
- Neural Network - 71%
app.py is a command line app that supports the following arguments
- Tweets from a specific user
  - --user or -u - username of the user to fetch tweets from (example - taylorswift13 (without the @))
  - --count or -c - number of tweets to fetch and analyze (example - 5, defaults to 10)
- Stream tweets for a list of topics
  - --stream - list of topics to fetch live tweets from Twitter and perform analysis (example - "illinois" "football")
  - --time or -t - total duration of the stream in seconds (example - 10, defaults to 5)
  - --visualize or -v - visualizes the predictions using a pie chart
- Stream tweets for a list of topics: python app.py --stream "illinois" "football" --time 10
- Tweets from a specific user: python app.py --user taylorswift13 --count 10
- Stream tweets and visualize the predictions: python app.py --stream "pokemon" "winter" --time 10 --visualize
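For readers curious how such a command line could be defined, here is a minimal `argparse` sketch that accepts the flags and defaults documented above. It is an assumption about the structure, not the actual contents of app.py; `build_parser` is a hypothetical name, and whether app.py rejects `--user` and `--stream` used together is not stated, so the sketch does not enforce it.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        prog="app.py", description="Sentiment analysis on tweets"
    )
    # Tweets from a specific user
    parser.add_argument("--user", "-u",
                        help="username to fetch tweets from, without the @")
    parser.add_argument("--count", "-c", type=int, default=10,
                        help="number of tweets to fetch and analyze")
    # Stream tweets for a list of topics
    parser.add_argument("--stream", nargs="+", metavar="TOPIC",
                        help="topics to stream live tweets for")
    parser.add_argument("--time", "-t", type=int, default=5,
                        help="total duration of the stream in seconds")
    parser.add_argument("--visualize", "-v", action="store_true",
                        help="show the predictions as a pie chart")
    return parser


args = build_parser().parse_args(["--stream", "illinois", "football", "--time", "10"])
print(args.stream, args.time, args.visualize)
# prints: ['illinois', 'football'] 10 False
```

Using `nargs="+"` lets `--stream` take one or more quoted topics, which matches the example invocations.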
Please see the link below: https://drive.google.com/file/d/1YjGNmhE63XVLJRw_414mOsBBqZgNmJem/view?usp=sharing