Twitch Chat Analytics: Analyzing jinnytty’s streams
Context
Yoo Yoonjin (Korean: 유윤진; born July 28, 1992), better known as Jinnytty, is best known for her IRL Twitch livestreams and has been streaming for five years. From 2020 to 2022, Yoo traveled to and streamed live in more than 20 countries across Asia, Europe, and North America.
Personal Motivation
The Twitch community is well known for being tech-savvy, so I decided to develop the analysis around this community because they could give me good feedback about the report and the code. In addition, Jinnytty is one of my favorite streamers.
Business Task
There isn’t a business task involved. I mainly did this project to improve my skills with pandas and related packages, and to practice the process of developing a BI analysis.
Key Stakeholders
As I said before, this is a personal project, but while working on it I found some people who are interested in it.
Preparation and first cleaning
Twitch is a website focused on streaming. Its chat is built on the Internet Relay Chat (IRC) protocol, and its data can also be accessed through an API. Knowing that, we are going to use two tools for getting chat logs: RechatTool and Chatterino.
Neither of them provides the date and timestamp in a ready-to-use form, so I’ll explain how I handle the data for each one separately.
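For context, here is a minimal sketch of what reading Twitch chat over IRC looks like. This is only an illustration of the underlying protocol, not the internals of either tool; the nick and channel are placeholders (Twitch accepts an anonymous, read-only "justinfan" nick without an OAuth token).
import socket

# Connect anonymously to Twitch's IRC endpoint and join the channel
sock = socket.create_connection(("irc.chat.twitch.tv", 6667))
sock.sendall(b"NICK justinfan12345\r\n")
sock.sendall(b"JOIN #jinnytty\r\n")

buffer = ""
while True:
    buffer += sock.recv(4096).decode("utf-8", errors="ignore")
    lines = buffer.split("\r\n")
    buffer = lines.pop()  # keep any partial line for the next read
    for line in lines:
        if line.startswith("PING"):  # answer keep-alives or get disconnected
            sock.sendall(b"PONG :tmi.twitch.tv\r\n")
        elif "PRIVMSG" in line:  # an actual chat message
            user = line.split("!", 1)[0].lstrip(":")
            message = line.split(" :", 1)[-1]
            print(user, message)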
RechatTool
The first cleaning is done with bash and is automated:
#!/bin/bash
# Find every quotation mark and delete it; stray quotes could interfere with code I run later
sed 's/"//g' *.txt > withoutcomillas.txt &&
# I use vertical bars as separators, so any bar inside a message would create an extra column when reading the file with pandas
sed 's/|//g' withoutcomillas.txt > datawithits.txt &&
# Because this analysis is about Twitch, where the main way to express yourself is with emotes, I normalize common words with apostrophes
sed -r "s/It’s/its/g" datawithits.txt | sed -r "s/it’s/its/g" | sed -r "s/That’s/Thats/g" | sed -r "s/M&M's/MMs/g" | sed -r "s/don't/dont/g" > yyjdata.txt &&
# Splitting the data into separate files makes the cleaning easier; I will merge them at the end. First, the timestamp
awk '{print $1}' yyjdata.txt | awk '{print substr($0,2,8);}' > time.csv &&
# Then the user name
awk '{print $2}' yyjdata.txt | awk -F: '{print $1}' > user.csv &&
# And finally the message itself
awk -F: '{ for(i=1; i<=3; i++){ $i="" }; print $0 }' yyjdata.txt | awk '{print substr($0, 5, length($0))}' > messages.csv &&
# Merging all the files into one
paste -d '|' time.csv user.csv messages.csv > readydata.txt &&
# An intro always plays at the start of the stream and some users spam during it, which would bias
# the analysis, so I need to find where the intro ends. That is easy because a bot usually sends a message
# saying the scene switched to Live, so I find that row and delete everything before it
sed '1,/super_stream_server|Scene switched to Live/d' readydata.txt > awkcleaning.txt &&
# Then I delete the rows with bot and system messages that spawn NaN values in the last column
awk '!/just earned/ && !/sending messages too quickly/ && !/emote-only/ && !/You can find your currently available/ && !/raiders from/ && !/redeemed/ && !/streamelements/ && !/innytty is live!/ && !/StreamElements/' awkcleaning.txt > yyj.csv &&
# Finally, delete the intermediate files
rm datawithits.txt time.csv user.csv messages.csv withoutcomillas.txt yyjdata.txt readydata.txt awkcleaning.txt
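After the script finishes, every surviving line of the output file should follow the pattern time|user|message, roughly like this (the user and message here are invented for illustration):
12:34:56|someviewer|KEKW she did it again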
Next, I open the file with pandas, because we need to add the date and a corrected timestamp to every row.
import pandas as pd
df = pd.read_csv('../yyj.csv', delimiter='|', encoding='utf8', header=None, names=["Time", "User", "Message"])
# Sometimes I do the analysis the day after the stream aired, so I need to subtract one day
df["Time"] = pd.to_datetime(df["Time"]) - pd.Timedelta(days=1)
# Shift the clock times by an offset (presumably the stream's start time, here 6 hours 3 minutes)
df["Time"] = df["Time"] + pd.Timedelta(hours=6, minutes=3)
# Next, I split the date from the timestamp into its own column
df["Day"] = pd.to_datetime(df["Time"]).dt.strftime('%Y-%m-%d')
# And keep only the time of day in the Time column
df["Time"] = pd.to_datetime(df["Time"]).dt.strftime('%H:%M:%S')
# Reorder the columns; the file is now ready for the analysis
df = df[["Day", "Time", "User", "Message"]]
df.to_csv('../yyj.txt',index=False, header=False, sep='|')
Chatterino
With Chatterino the cleaning process is similar to RechatTool, although we need some extra steps.
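For reference, a raw Chatterino log line looks roughly like the one below (format hedged from memory). Note that the date lives in the log file's name, e.g. ./jinnytty-2022-06-05.log, not in the line itself, which is why the first step prepends the file name to every line:
[12:34:56] someviewer: KEKW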
# Chatterino stores the date in the log file name, so prepend each file's name to every line and merge everything into data.txt
for i in $(find . -name "*.log" -type f); do
    awk -v f="$i" '{print f, $0}' "$i"
done > data.txt
# Cleaning the quotation marks
sed 's/"//g' data.txt > withoutcomillas.txt &&
# Cleaning the vertical bars
sed 's/|//g' withoutcomillas.txt > datawithits.txt &&
# Normalizing words with apostrophes (It's, it's, That's, M&M's)
sed -r "s/It’s/its/g" datawithits.txt | sed -r "s/it’s/its/g" | sed -r "s/That’s/Thats/g" | sed -r "s/M&M's/MMs/g" > yyjdata.txt &&
# Extract the date substring from the first column (the file name) into a new file
awk '{print substr($1,12,10); }' yyjdata.txt > a.txt &&
# Copy the time column to another file
awk '{print $2}' yyjdata.txt > time.txt &&
# Remove the first and last characters ("[" and "]") of every line in time.txt, then export the data to b.txt
sed 's/.//;s/.$//' time.txt > b.txt &&
# Extract the user name from the third colon-separated field
awk -F: '{if (NR!=0) {print substr($3, 6, length($3))}}' yyjdata.txt > c.txt &&
# Extracting the messages
awk -F: '{ for(i=1; i<=3; i++){ $i="" }; print $0 }' yyjdata.txt > mssgs.txt &&
# Deleting the leading spaces left over in mssgs.txt
awk '{print substr($0, 5, length($0))}' mssgs.txt > d.txt &&
# Merging the files a, b, c, and d into one file
paste -d '|' a.txt b.txt c.txt d.txt > readydata.txt &&
# Delete everything before the bot message announcing the scene switched to Live
sed '1,/super_stream_server|Scene switched to Live/d' readydata.txt > awkcleaning.txt &&
# Delete the lines containing bot and system messages
awk '!/just earned/ && !/sending messages too quickly/ && !/emote-only/ && !/You can find your currently available/ && !/raiders from/ && !/redeemed/ && !/streamelements/ && !/innytty is live!/' awkcleaning.txt > yyj.txt
# Removing the temporary files
rm a.txt b.txt c.txt d.txt data.txt yyjdata.txt time.txt readydata.txt awkcleaning.txt mssgs.txt withoutcomillas.txt datawithits.txt
Loading dependencies and data
Now that the data is organized, we can start handling it with pandas.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import dates as mpl_dates
import seaborn as sns
from scipy import stats
from textblob import TextBlob
import math
from datetime import datetime, timedelta
from jinja2 import Environment, FileSystemLoader
# The monthly file was written with '|' as the separator, so read it back the same way
df = pd.read_csv('june.txt', delimiter='|', encoding='utf8', header=None, names=["day", "time", "user", "message"])
Summary
df.describe()
df.info()
Normalizing and cleaning the data
We already did a lot of cleaning, but a few extra steps are needed.
Because our two sources record user names differently, we are going to standardize them.
First, all user names need to be lowercase:
df['user'] = df['user'].str.lower()
Then map the users who appear under two names to a single one:
df["user"].replace({"센트23 vincentt23":"빈센트23", "루이스와이푸 haruiswaifu":"하루이스와이푸", "雞力 winterreise1988":"愛雞力", "죠우":"코죠우"}, inplace=True)
The last step is dropping all NA values:
clean_wothoutNA = df.dropna()
Processing the data
Identifying which emotes were used the most
As I said before, the Twitch community likes to use emotes to express itself, so it is really interesting to see which emote was used the most. I start from the data frame without NaN values, split every message into words, and then group and count them.
I take only the first seventy words and then apply a filter that removes common English words, leaving mostly emotes.
most_used_words = pd.Series(' '.join(clean_wothoutNA['message']).split()).value_counts()[:70].reset_index()
clean1 = most_used_words.replace('', np.nan, regex=True)
text_withoutNan = clean1.dropna(axis='rows').replace({'\'': '', '\\)': ''}, regex=True)
text_withoutNan
# Common words and tokens to filter out so that mostly emotes remain
values = ["The", "the", "it", "be", "is", "you", "a", "to", "no", "in", "that", "she", "this", "for",
"not", "good", "I", "on", "and", "i", "1", "2", "Lmao", "Lol", "You", "like", "just", "its", "?", "lol", "all", "so", "will",
"of", "are", "they", "bye", "⠀", "yes", "he", "can", "11", "go", "him", "your", "back", "her", "D", "u", "do", "take", "need",
"more", "why", "have", "what", "with", "dont", "get", "eat", "drink", "jinny", "was", "my", "we"
,"nice", "too", "me", "one", "yuggie", "at", "how", "it\'s", "ye", "yea", "!bet", "hair", "milk", "🥕", "wind",
"S", "yeah", "ok", "mode", "water", "there", "drone", "lacari", "love", "buzz", "ass", "now", "Kofu", "kofu", "suck", "WEAR",
"IT!", "MMs", "Buy", "IT!WEAR", "lolz", "ur", "hahaha", "eye", "see", "SIMBA", "did", "never", "No", "Jinny", "!yc"
,"lmao", "+100", "up", "bar", "hidden", "🇸🇬", "🤝", "🇲🇾", "F", "❌", "", "KOFU", "DO", "THIS", "IRL", "cunt", "YOU", "balls", "A", ".", "shoey", "us",
"buy", "HSP", "chat", "don\"t", "💨", "🌊","yr", "yup", "jimbo", "lul", "uh", "cool", "aw", "oh", "time", "drunk", "phone", "gone",
"That\'s", "IT", "LOOKS", "ICANT", "LOCK", "AND", "YOUR", "ICANT", "if", "😃", "😂", "🤣", "well", "😆", "🙂", "👋🙂", "🤣🤣", "f","🤳"]
#drop rows that contain any value in the list
textwithoutmostusedwords = text_withoutNan[~text_withoutNan['index'].isin(values)]
#another way to search for most used words
#df.message.value_counts().reset_index()
#--------------------------------------------------Saving in a document----------------------------------------------------------------
textwithoutmostusedwords.to_csv("topemotes.txt", sep=' ', header=False, index=False)
#--------------------------------------------------------------------------------------------------------------------------------------
Cleanreadytop20chatters = pd.read_csv("topemotes.txt", delimiter=' ', encoding='utf8', header=None, names=["Emote", "Times Used"])
Cleanreadytop20chatters
Outcome
Top chatters with their most used emote
The first time I did the analysis and showed it to some friends, they asked me which emote each user uses the most and how many messages they send. I found that the people most interested in the analysis are usually the users who send the most messages per stream.
top20Chatters = df.user.value_counts()[:20].reset_index()
nametop_1 = [None] * 20
Searching_by_User_top_1 = [None] * 20
Searching_by_UserTop_1_emote = [None] * 20
textwithoutmostusedwords_byuser1 = [None] * 20
topemotefromtop1chatter = [None] * 20
howManyTimesWasUsedThe_topemotefrom_top1chatter = [None] * 20
topemotefromtop_1chatter_second_emote = [None] * 20
howManyTimesWasUsedThe_topemotefrom_top_1chatter_second_emote = [None] * 20
# For each of the top 20 chatters, find their two most used emotes
for m in range(20):
    nametop_1[m] = top20Chatters.loc[m, 'index']
    Searching_by_User_top_1[m] = df[(df["user"] == nametop_1[m]) & (df["message"].notna())]
    Searching_by_UserTop_1_emote[m] = pd.Series(' '.join(Searching_by_User_top_1[m]['message']).split()).value_counts()[:50].reset_index().replace('', np.nan, regex=True).dropna(axis='rows')
    textwithoutmostusedwords_byuser1[m] = Searching_by_UserTop_1_emote[m][~Searching_by_UserTop_1_emote[m]['index'].isin(values)].reset_index(drop=True)
    topemotefromtop1chatter[m] = textwithoutmostusedwords_byuser1[m].loc[0, 'index']
    howManyTimesWasUsedThe_topemotefrom_top1chatter[m] = textwithoutmostusedwords_byuser1[m].loc[0, 0]
    topemotefromtop_1chatter_second_emote[m] = textwithoutmostusedwords_byuser1[m].loc[1, 'index']
    howManyTimesWasUsedThe_topemotefrom_top_1chatter_second_emote[m] = textwithoutmostusedwords_byuser1[m].loc[1, 0]
top20Chatters['Most used emote by user'] = pd.Series(topemotefromtop1chatter)
top20Chatters['Times used'] = pd.Series(howManyTimesWasUsedThe_topemotefrom_top1chatter)
top20Chatters['Second most used emote by user'] = pd.Series(topemotefromtop_1chatter_second_emote)
top20Chatters['Total for second emote'] = pd.Series(howManyTimesWasUsedThe_topemotefrom_top_1chatter_second_emote)
top20Chatters.to_csv('top20Chatters.csv', header=False, index=False)
top20Chatters
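For comparison, the same table can be built more compactly. This is only a sketch: it assumes the values stop-word list defined earlier, and it would raise an IndexError if a chatter used fewer than two non-filtered tokens.
# Compact alternative: for each top-20 chatter, count their non-stop-word
# tokens and keep the two most frequent ones (presumably emotes)
rows = []
for user, total in df['user'].value_counts().head(20).items():
    words = pd.Series(' '.join(df.loc[df['user'] == user, 'message'].dropna()).split())
    counts = words[~words.isin(values)].value_counts()
    rows.append({'User': user, 'Messages': total,
                 'Top emote': counts.index[0], 'Used': int(counts.iloc[0]),
                 'Second emote': counts.index[1], 'Used (2nd)': int(counts.iloc[1])})
pd.DataFrame(rows)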
Outcome
Total messages per day
Finding which day had the most interactions:
new_df = clean_wothoutNA.day.value_counts().reset_index()
new_df.to_csv("prueba.txt", sep=' ', header=False, index=False)
Total_messages_per_day = pd.read_csv("prueba.txt", delimiter=' ', encoding='utf8', header=None, names=["Day", "Messages"])
Total_messages_per_day
Total users per day
A similar process to the one before, but this time I find how many unique users chatted on each day:
days_month = Total_messages_per_day["Day"].to_numpy()
Total_users_per_day = [len(df[(df["day"] == day)].user.value_counts()) for day in days_month]
read_saving_prueba = Total_messages_per_day
read_saving_prueba["Chatters"] = pd.Series(Total_users_per_day)
newplot = read_saving_prueba.sort_values(by='Day')
newplot
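As a side note, both columns can also be computed in one pass with groupby. This is a sketch; the counts may differ slightly from the frames above because it does not drop NA messages first:
# One-pass alternative: total messages and unique chatters per day
per_day = (df.groupby('day')
             .agg(Messages=('message', 'size'), Chatters=('user', 'nunique'))
             .reset_index()
             .rename(columns={'day': 'Day'})
             .sort_values('Day'))
per_day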
Total user and messages per day graphs
plt.figure(figsize=(30, 10))
plt.plot(newplot.Day, newplot.Messages, "g.-")
plt.title("Messages per day")
plt.xlabel('Days', fontsize=18)
plt.ylabel('Messages', fontsize=16)
plt.xticks(fontsize=12, rotation=0)
plt.yticks(fontsize=13)
plt.show()
plt.figure(figsize=(30, 10))
plt.plot(newplot.Day, newplot.Chatters, "b.-")
plt.title("Unique chatters per day")
plt.xlabel('Days', fontsize=18)
plt.ylabel('Unique chatters', fontsize=18)
plt.xticks(fontsize=12, rotation=0)
plt.yticks(fontsize=13)
plt.show()
Finding the weekday and the week of the month for each stream
import calendar
calendar.setfirstweekday(6)

def get_week_of_month(year, month, day):
    x = np.array(calendar.monthcalendar(year, month))
    week_of_month = np.where(x == day)[0][0] + 1
    return week_of_month
cell = [None] * len(newplot)
d = [None] * len(newplot)
for g in range(len(newplot)):
    cell[g] = newplot.loc[g, 'Day']
    d[g] = cell[g][8:]  # the day of the month from the 'YYYY-MM-DD' string
newplot['day'] = pd.Series(d)
# Day of the month on which each calendar week of the analyzed month starts
Week_1 = 1
Week_2 = 5
Week_3 = 12
Week_4 = 19
Week_5 = 26
last_day = 30
def applyFunc(s):
    if s == Week_1 or s < Week_2:
        return 'Week 1'
    elif s == Week_2 or s < Week_3:
        return 'Week 2'
    elif s == Week_3 or s < Week_4:
        return 'Week 3'
    elif s == Week_4 or s < Week_5:
        return 'Week 4'
    elif s == Week_5 or s <= last_day:
        return 'Week 5'
    return ''
newplot["Day"] = pd.to_datetime(newplot["Day"])
newplot['day_of_the_week'] = pd.Series(newplot['Day'].dt.day_name())
newplot['day'] = newplot['day'].astype(int)
newplot['Week'] = newplot['day'].apply(applyFunc)
newplot["Day"] = newplot["Day"].dt.strftime('%Y-%m-%d')
newplot = newplot.reset_index(drop=True)
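The same bucketing can be written more declaratively with pd.cut. This is a sketch for a 30-day month; the bin edges mirror the Week_1 to Week_5 constants above:
# Equivalent week labels via interval bins: (0,4]=Week 1, (4,11]=Week 2, ...
newplot['Week'] = pd.cut(newplot['day'],
                         bins=[0, 4, 11, 18, 25, 31],
                         labels=['Week 1', 'Week 2', 'Week 3', 'Week 4', 'Week 5'])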
Heatmaps and pivot tables
Messages pivot table
heatmap = newplot.pivot_table(index="Week", columns="day_of_the_week", values="Messages").fillna(0)
week_pivot = heatmap.reindex(columns=['Sunday','Monday','Tuesday', 'Wednesday','Thursday','Friday','Saturday'])
week_pivot
Messages heatmap
plt.figure(figsize=(20, 10), facecolor="w")
plt.title("Messages heatmap")
sns.heatmap(week_pivot, annot=False, cbar_kws={'shrink': 0.9})
plt.xticks(fontsize=12, rotation=45)
plt.yticks(fontsize=15)
Chatters pivot table
chatters_heatmap = newplot.pivot_table(index="Week", columns="day_of_the_week", values="Chatters").fillna(0)
chatters_heatmap = chatters_heatmap.reindex(columns=['Sunday','Monday','Tuesday', 'Wednesday','Thursday','Friday','Saturday'])
chatters_heatmap
Chatters heatmap
plt.figure(figsize=(20, 10))
plt.title("Chatters heatmap")
sns.heatmap(chatters_heatmap, annot=False)
plt.xticks(fontsize=12, rotation=45)
plt.yticks(fontsize=15)
Distribution of messages between top 5 and top 20
The pie chart needs three totals that were never defined earlier in the notebook, so I rebuild them here from the per-user message counts; the top-5 / ranks-6-to-20 / everyone-else split is my best guess at the original intent.
plt.rcParams['font.size'] = 15
counts = df.user.value_counts()
totalTop5Chatters = counts[:5].sum()    # messages sent by the top 5 chatters
top20withouttop5 = counts[5:20].sum()   # messages sent by chatters ranked 6-20
notopChatter = counts[20:].sum()        # messages sent by everyone else
top_chatters_frame = pd.DataFrame({'Top Chatters': [notopChatter, top20withouttop5, totalTop5Chatters]},
                                  index=['Everyone else', 'Top 6-20', 'Top 5'])
top_chatters_frame
plot = top_chatters_frame.plot.pie(y='Top Chatters', figsize=(10, 10), fontsize=10)
plt.savefig('plot')
Mean, median, and mode of messages per user
table_user_and_total_messages = df.user.value_counts()
aaa = table_user_and_total_messages.reset_index().rename(columns={"index": "User", "user": "Total_Messeges_Per_User"})
Messeges_Per_User = aaa["Total_Messeges_Per_User"].to_numpy()
mean = np.mean(Messeges_Per_User)
median = np.median(Messeges_Per_User)
Mode = stats.mode(Messeges_Per_User)
table_total_user_and_total_messages = len(pd.unique(df['user']))
print('Total chatters =', table_total_user_and_total_messages)
print('Mean =', mean)
Mean_text = 'Mean ' + str(mean)
print('Median =', median)
Median_text = 'Median ' + str(median)
print('Mode =', Mode)
Mode_text = 'Mode ' + str(Mode)
Total chatters = 40122
Mean = 72.05341342439122
Median = 2.0
Mode = ModeResult(mode=array([1], dtype=int64), count=array([16121]))
So I discovered that most users send only 1 or 2 messages per stream, so I decided to make a visualization of it.
Messages sent by users (grouped by quantity)
df_Per_User = aaa['Total_Messeges_Per_User'].value_counts().reset_index().rename(columns={"index": "Total",})
df_Per_User
plt.figure(figsize=(20, 10))
plt.xticks(fontsize=12, rotation=90)
plt.yticks(fontsize=13)
plt.plot(df_Per_User.Total, df_Per_User.Total_Messeges_Per_User)
plt.title("Messages sent by users (grouped by quantity)")
plt.xlabel('Total of messages', fontsize=20)
plt.ylabel('Number of users', fontsize=20)
Unfortunately, as you can see, the graph contains too much information, so I decided to remove outliers to get a better view of what happens with the majority of viewers.
df2 = df_Per_User[(df_Per_User["Total"] <= 10)]
df2
Now that we have a new dataset without the outliers, we are going to redo the graph.
plt.figure(figsize=(12, 10))
plt.xticks(fontsize=12)
plt.yticks(fontsize=13)
plt.bar(df2.Total, df2.Total_Messeges_Per_User)
plt.title("Messages sent by users (grouped by quantity)")
plt.xlabel('Total of messages', fontsize=20)
plt.ylabel('Number of users', fontsize=20)
plt.xlim(1, 10)
plt.savefig('msbu')
Sentiment analysis
# Work on a copy so the cleaned frame is not modified in place
sentiment_df = clean_wothoutNA.copy()

def sentiment_calc(text):
    try:
        return TextBlob(text).sentiment.polarity
    except Exception:
        return None

sentiment_df['sentiment'] = sentiment_df['message'].apply(sentiment_calc)
# Zero-polarity rows are mostly emotes and other tokens TextBlob does not recognize, so drop them
sentiment_df = sentiment_df[sentiment_df.sentiment != 0]
Sentiment_of_chat = sentiment_df["sentiment"].to_numpy()
mean = np.mean(Sentiment_of_chat)
median = np.median(Sentiment_of_chat)
Mode = stats.mode(Sentiment_of_chat)
print('Mean =', mean)
Mean_text = 'Mean ' + str(mean)
print('Median =', median)
Median_text = 'Median ' + str(median)
print('Mode =', Mode)
Mode_text = 'Mode ' + str(Mode)
plt.figure(figsize=(12, 10))
plt.boxplot(sentiment_df.sentiment)
Mean = 0.189543889185649
Median = 0.25
Mode = ModeResult(mode=array([0.5]), count=array([51712]))
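As a quick sanity check, here is how TextBlob scores a few invented messages. Polarity runs from -1 (negative) to +1 (positive), and tokens TextBlob does not recognize, such as emote names, score 0, which is why the zero-polarity rows were dropped above.
# Polarity of a few made-up chat messages
for msg in ["I love this stream", "this is so boring", "KEKW"]:
    print(msg, "->", TextBlob(msg).sentiment.polarity)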