Twitch Chat Analytics: Analyzing jinnytty’s streams
Context
Yoo Yoonjin (Korean: 유윤진; born July 28, 1992), better known as Jinnytty, is best known for her IRL Twitch livestreams and has been streaming for five years. From 2020 to 2022, Yoo traveled to and streamed live in more than 20 countries across Asia, Europe, and North America.
Personal Motivation
The Twitch community is well known for being tech-savvy, so I decided to develop the analysis around this community because they could give me good feedback about the report and the code. In addition, Jinnytty is one of my favorite streamers.
Business Task
There isn’t a business task involved. I mainly did this project to improve my skills with pandas and related packages, and to practice the process of developing a BI analysis.
Key Stakeholders
As I said before, this is a personal project, but while working on it I found some people who are interested in it.
Preparation and first cleaning
Twitch is a website focused on streaming. Its chat is built on the Internet Relay Chat (IRC) protocol, and its data can also be accessed through an API. Knowing that, we are going to use two tools for getting chat logs: RechatTool and Chatterino.
Neither of them provides the date and timestamp in a ready-to-use form, so I’ll explain how I handle the data for each one separately.
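For context, here is a minimal sketch of what reading Twitch chat over IRC looks like. This is only an illustration of the underlying protocol, not the internals of either tool; the nick and channel are placeholders (Twitch accepts an anonymous, read-only "justinfan" nick without an OAuth token).
import socket

# Connect anonymously to Twitch's IRC endpoint and join the channel
sock = socket.create_connection(("irc.chat.twitch.tv", 6667))
sock.sendall(b"NICK justinfan12345\r\n")
sock.sendall(b"JOIN #jinnytty\r\n")

buffer = ""
while True:
    buffer += sock.recv(4096).decode("utf-8", errors="ignore")
    lines = buffer.split("\r\n")
    buffer = lines.pop()  # keep any partial line for the next read
    for line in lines:
        if line.startswith("PING"):  # answer keep-alives or get disconnected
            sock.sendall(b"PONG :tmi.twitch.tv\r\n")
        elif "PRIVMSG" in line:  # an actual chat message
            user = line.split("!", 1)[0].lstrip(":")
            message = line.split(" :", 1)[-1]
            print(user, message)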
RechatTool
The first cleaning is done with bash and is automated:
#!/bin/bash
# Find every quotation mark and delete it; stray quotes could interfere with code I run later
sed 's/"//g' *.txt > withoutcomillas.txt &&
# I use vertical bars as separators, so any bar inside a message would create an extra column when reading the file with pandas
sed 's/|//g' withoutcomillas.txt > datawithits.txt &&
# Because this analysis is about Twitch, where the main way to express yourself is with emotes, I normalize common words with apostrophes
sed -r "s/It’s/its/g" datawithits.txt | sed -r "s/it’s/its/g" | sed -r "s/That’s/Thats/g" | sed -r "s/M&M's/MMs/g" | sed -r "s/don't/dont/g" > yyjdata.txt &&
# Splitting the data into separate files makes the cleaning easier; I will merge them at the end. First, the timestamp
awk '{print $1}' yyjdata.txt | awk '{print substr($0,2,8);}' > time.csv &&
# Then the user name
awk '{print $2}' yyjdata.txt | awk -F: '{print $1}' > user.csv &&
# And finally the message itself
awk -F: '{ for(i=1; i<=3; i++){ $i="" }; print $0 }' yyjdata.txt | awk '{print substr($0, 5, length($0))}' > messages.csv &&
# Merging all the files into one
paste -d '|' time.csv user.csv messages.csv > readydata.txt &&
# An intro always plays at the start of the stream and some users spam during it, which would bias
# the analysis, so I need to find where the intro ends. That is easy because a bot usually sends a message
# saying the scene switched to Live, so I find that row and delete everything before it
sed '1,/super_stream_server|Scene switched to Live/d' readydata.txt > awkcleaning.txt &&
# Then I delete the rows with bot and system messages that spawn NaN values in the last column
awk '!/just earned/ && !/sending messages too quickly/ && !/emote-only/ && !/You can find your currently available/ && !/raiders from/ && !/redeemed/ && !/streamelements/ && !/innytty is live!/ && !/StreamElements/' awkcleaning.txt > yyj.csv &&
# Finally, delete the intermediate files
rm datawithits.txt time.csv user.csv messages.csv withoutcomillas.txt yyjdata.txt readydata.txt awkcleaning.txt
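After the script finishes, every surviving line of the output file should follow the pattern time|user|message, roughly like this (the user and message here are invented for illustration):
12:34:56|someviewer|KEKW she did it again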
Next, I open the file with pandas, because we need to add the date and a corrected timestamp to every row.
import pandas as pd
df = pd.read_csv('../yyj.csv', delimiter='|', encoding='utf8', header=None, names=["Time", "User", "Message"])
# Sometimes I do the analysis the day after the stream aired, so I need to subtract one day
df["Time"] = pd.to_datetime(df["Time"]) - pd.Timedelta(days=1)
# Shift the clock times by an offset (presumably the stream's start time, here 6 hours 3 minutes)
df["Time"] = df["Time"] + pd.Timedelta(hours=6, minutes=3)
# Next, I split the date from the timestamp into its own column
df["Day"] = pd.to_datetime(df["Time"]).dt.strftime('%Y-%m-%d')
# And keep only the time of day in the Time column
df["Time"] = pd.to_datetime(df["Time"]).dt.strftime('%H:%M:%S')
# Reorder the columns; the file is now ready for the analysis
df = df[["Day", "Time", "User", "Message"]]
df.to_csv('../yyj.txt',index=False, header=False, sep='|')
Chatterino
With Chatterino the cleaning process is similar to RechatTool, although we need some extra steps.
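For reference, a raw Chatterino log line looks roughly like the one below (format hedged from memory). Note that the date lives in the log file's name, e.g. ./jinnytty-2022-06-05.log, not in the line itself, which is why the first step prepends the file name to every line:
[12:34:56] someviewer: KEKW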
# Chatterino stores the date in the log file name, so prepend each file's name to every line and merge everything into data.txt
for i in $(find . -name "*.log" -type f); do
    awk -v f="$i" '{print f, $0}' "$i"
done > data.txt
# Cleaning the quotation marks
sed 's/"//g' data.txt > withoutcomillas.txt &&
# Cleaning the vertical bars
sed 's/|//g' withoutcomillas.txt > datawithits.txt &&
# Normalizing words with apostrophes (It's, it's, That's, M&M's)
sed -r "s/It’s/its/g" datawithits.txt | sed -r "s/it’s/its/g" | sed -r "s/That’s/Thats/g" | sed -r "s/M&M's/MMs/g" > yyjdata.txt &&
# Extract the date substring from the first column (the file name) into a new file
awk '{print substr($1,12,10); }' yyjdata.txt > a.txt &&
# Copy the time column to another file
awk '{print $2}' yyjdata.txt > time.txt &&
# Remove the first and last characters ("[" and "]") of every line in time.txt, then export the data to b.txt
sed 's/.//;s/.$//' time.txt > b.txt &&
# Extract the user name from the third colon-separated field
awk -F: '{if (NR!=0) {print substr($3, 6, length($3))}}' yyjdata.txt > c.txt &&
# Extracting the messages
awk -F: '{ for(i=1; i<=3; i++){ $i="" }; print $0 }' yyjdata.txt > mssgs.txt &&
# Deleting the leading spaces left over in mssgs.txt
awk '{print substr($0, 5, length($0))}' mssgs.txt > d.txt &&
# Merging the files a, b, c, and d into one file
paste -d '|' a.txt b.txt c.txt d.txt > readydata.txt &&
# Delete everything before the bot message announcing the scene switched to Live
sed '1,/super_stream_server|Scene switched to Live/d' readydata.txt > awkcleaning.txt &&
# Delete the lines containing bot and system messages
awk '!/just earned/ && !/sending messages too quickly/ && !/emote-only/ && !/You can find your currently available/ && !/raiders from/ && !/redeemed/ && !/streamelements/ && !/innytty is live!/' awkcleaning.txt > yyj.txt
# Removing the temporary files
rm a.txt b.txt c.txt d.txt data.txt yyjdata.txt time.txt readydata.txt awkcleaning.txt mssgs.txt withoutcomillas.txt datawithits.txt
Loading dependencies and data
Now that the data is organized, we can start handling it with pandas.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import dates as mpl_dates
import seaborn as sns
from scipy import stats
from textblob import TextBlob
import math
from datetime import datetime, timedelta
from jinja2 import Environment, FileSystemLoader
# The monthly file was written with '|' as the separator, so read it back the same way
df = pd.read_csv('june.txt', delimiter='|', encoding='utf8', header=None, names=["day", "time", "user", "message"])
Summary
df.describe()
df.info()
Normalizing and cleaning the data
We already did a lot of cleaning, but a few extra steps are needed.
Because our two sources record user names differently, we are going to standardize them.
First, all user names need to be lowercase:
df['user'] = df['user'].str.lower()
Then map the users who appear under two names to a single one:
df["user"].replace({"센트23 vincentt23":"빈센트23", "루이스와이푸 haruiswaifu":"하루이스와이푸", "雞力 winterreise1988":"愛雞力", "죠우":"코죠우"}, inplace=True)
The last step is dropping all NA values:
clean_wothoutNA = df.dropna()
Processing the data
Identifying which emotes were used the most
As I said before, the Twitch community likes to use emotes to express itself, so it is really interesting to see which emote was used the most. I start from the data frame without NaN values, split every message into words, and then group and count them.
I take only the first seventy words and then apply a filter that removes common English words, leaving mostly emotes.
most_used_words = pd.Series(' '.join(clean_wothoutNA['message']).split()).value_counts()[:70].reset_index()
clean1 = most_used_words.replace('', np.nan, regex=True)
text_withoutNan = clean1.dropna(axis='rows').replace({'\'': '', '\\)': ''}, regex=True)
text_withoutNan
# Common words and tokens to filter out so that mostly emotes remain
values = ["The", "the", "it", "be", "is", "you", "a", "to", "no", "in", "that", "she", "this", "for",
"not", "good", "I", "on", "and", "i", "1", "2", "Lmao", "Lol", "You", "like", "just", "its", "?", "lol", "all", "so", "will",
"of", "are", "they", "bye", "⠀", "yes", "he", "can", "11", "go", "him", "your", "back", "her", "D", "u", "do", "take", "need",
"more", "why", "have", "what", "with", "dont", "get", "eat", "drink", "jinny", "was", "my", "we"
,"nice", "too", "me", "one", "yuggie", "at", "how", "it\'s", "ye", "yea", "!bet", "hair", "milk", "🥕", "wind",
"S", "yeah", "ok", "mode", "water", "there", "drone", "lacari", "love", "buzz", "ass", "now", "Kofu", "kofu", "suck", "WEAR",
"IT!", "MMs", "Buy", "IT!WEAR", "lolz", "ur", "hahaha", "eye", "see", "SIMBA", "did", "never", "No", "Jinny", "!yc"
,"lmao", "+100", "up", "bar", "hidden", "🇸🇬", "🤝", "🇲🇾", "F", "❌", "", "KOFU", "DO", "THIS", "IRL", "cunt", "YOU", "balls", "A", ".", "shoey", "us",
"buy", "HSP", "chat", "don\"t", "💨", "🌊","yr", "yup", "jimbo", "lul", "uh", "cool", "aw", "oh", "time", "drunk", "phone", "gone",
"That\'s", "IT", "LOOKS", "ICANT", "LOCK", "AND", "YOUR", "ICANT", "if", "😃", "😂", "🤣", "well", "😆", "🙂", "👋🙂", "🤣🤣", "f","🤳"]
#drop rows that contain any value in the list
textwithoutmostusedwords = text_withoutNan[~text_withoutNan['index'].isin(values)]
#another way to search for most used words
#df.message.value_counts().reset_index()
#--------------------------------------------------Saving in a document----------------------------------------------------------------
textwithoutmostusedwords.to_csv("topemotes.txt", sep=' ', header=False, index=False)
#--------------------------------------------------------------------------------------------------------------------------------------
Cleanreadytop20chatters = pd.read_csv("topemotes.txt", delimiter=' ', encoding='utf8', header=None, names=["Emote", "Times Used"])
Cleanreadytop20chatters
Outcome
Top chatters with their most used emote
The first time I did the analysis and showed it to some friends, they asked me which emote each user uses the most and how many messages they send. I found that the people most interested in the analysis are usually the users who send the most messages per stream.
top20Chatters = df.user.value_counts()[:20].reset_index()
nametop_1 = [None] * 20
Searching_by_User_top_1 = [None] * 20
Searching_by_UserTop_1_emote = [None] * 20
textwithoutmostusedwords_byuser1 = [None] * 20
topemotefromtop1chatter = [None] * 20
howManyTimesWasUsedThe_topemotefrom_top1chatter = [None] * 20
topemotefromtop_1chatter_second_emote = [None] * 20
howManyTimesWasUsedThe_topemotefrom_top_1chatter_second_emote = [None] * 20
# For each of the top 20 chatters, find their two most used emotes
for m in range(20):
    nametop_1[m] = top20Chatters.loc[m, 'index']
    Searching_by_User_top_1[m] = df[(df["user"] == nametop_1[m]) & (df["message"].notna())]
    Searching_by_UserTop_1_emote[m] = pd.Series(' '.join(Searching_by_User_top_1[m]['message']).split()).value_counts()[:50].reset_index().replace('', np.nan, regex=True).dropna(axis='rows')
    textwithoutmostusedwords_byuser1[m] = Searching_by_UserTop_1_emote[m][~Searching_by_UserTop_1_emote[m]['index'].isin(values)].reset_index(drop=True)
    topemotefromtop1chatter[m] = textwithoutmostusedwords_byuser1[m].loc[0, 'index']
    howManyTimesWasUsedThe_topemotefrom_top1chatter[m] = textwithoutmostusedwords_byuser1[m].loc[0, 0]
    topemotefromtop_1chatter_second_emote[m] = textwithoutmostusedwords_byuser1[m].loc[1, 'index']
    howManyTimesWasUsedThe_topemotefrom_top_1chatter_second_emote[m] = textwithoutmostusedwords_byuser1[m].loc[1, 0]
top20Chatters['Most used emote by user'] = pd.Series(topemotefromtop1chatter)
top20Chatters['Times used'] = pd.Series(howManyTimesWasUsedThe_topemotefrom_top1chatter)
top20Chatters['Second most used emote by user'] = pd.Series(topemotefromtop_1chatter_second_emote)
top20Chatters['Total for second emote'] = pd.Series(howManyTimesWasUsedThe_topemotefrom_top_1chatter_second_emote)
top20Chatters.to_csv('top20Chatters.csv', header=False, index=False)
top20Chatters
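For comparison, the same table can be built more compactly. This is only a sketch: it assumes the values stop-word list defined earlier, and it would raise an IndexError if a chatter used fewer than two non-filtered tokens.
# Compact alternative: for each top-20 chatter, count their non-stop-word
# tokens and keep the two most frequent ones (presumably emotes)
rows = []
for user, total in df['user'].value_counts().head(20).items():
    words = pd.Series(' '.join(df.loc[df['user'] == user, 'message'].dropna()).split())
    counts = words[~words.isin(values)].value_counts()
    rows.append({'User': user, 'Messages': total,
                 'Top emote': counts.index[0], 'Used': int(counts.iloc[0]),
                 'Second emote': counts.index[1], 'Used (2nd)': int(counts.iloc[1])})
pd.DataFrame(rows)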
Outcome
Total messages per day
Finding which day had the most interactions:
new_df = clean_wothoutNA.day.value_counts().reset_index()
new_df.to_csv("prueba.txt", sep=' ', header=False, index=False)
Total_messages_per_day = pd.read_csv("prueba.txt", delimiter=' ', encoding='utf8', header=None, names=["Day", "Messages"])
Total_messages_per_day
Total users per day
A similar process to the one before, but this time I find how many unique users chatted on each day:
days_month = Total_messages_per_day["Day"].to_numpy()
Total_users_per_day = [len(df[(df["day"] == day)].user.value_counts()) for day in days_month]
read_saving_prueba = Total_messages_per_day
read_saving_prueba["Chatters"] = pd.Series(Total_users_per_day)
newplot = read_saving_prueba.sort_values(by='Day')
newplot
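As a side note, both columns can also be computed in one pass with groupby. This is a sketch; the counts may differ slightly from the frames above because it does not drop NA messages first:
# One-pass alternative: total messages and unique chatters per day
per_day = (df.groupby('day')
             .agg(Messages=('message', 'size'), Chatters=('user', 'nunique'))
             .reset_index()
             .rename(columns={'day': 'Day'})
             .sort_values('Day'))
per_day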
Total user and messages per day graphs
plt.figure(figsize=(30, 10))
plt.plot(newplot.Day, newplot.Messages, "g.-")
plt.title("Messages per day")
plt.xlabel('Days', fontsize=18)
plt.ylabel('Messages', fontsize=16)
plt.xticks(fontsize=12, rotation=0)
plt.yticks(fontsize=13)
plt.show()
plt.figure(figsize=(30, 10))
plt.plot(newplot.Day, newplot.Chatters, "b.-")
plt.title("Unique chatters per day")
plt.xlabel('Days', fontsize=18)
plt.ylabel('Unique chatters', fontsize=18)
plt.xticks(fontsize=12, rotation=0)
plt.yticks(fontsize=13)
plt.show()
Finding the weekday and the week of the month for each stream
import calendar
calendar.setfirstweekday(6)

def get_week_of_month(year, month, day):
    x = np.array(calendar.monthcalendar(year, month))
    week_of_month = np.where(x == day)[0][0] + 1
    return week_of_month
cell = [None] * len(newplot)
d = [None] * len(newplot)
for g in range(len(newplot)):
    cell[g] = newplot.loc[g, 'Day']
    d[g] = cell[g][8:]  # the day of the month from the 'YYYY-MM-DD' string
newplot['day'] = pd.Series(d)
# Day of the month on which each calendar week of the analyzed month starts
Week_1 = 1
Week_2 = 5
Week_3 = 12
Week_4 = 19
Week_5 = 26
last_day = 30
def applyFunc(s):
    if s == Week_1 or s < Week_2:
        return 'Week 1'
    elif s == Week_2 or s < Week_3:
        return 'Week 2'
    elif s == Week_3 or s < Week_4:
        return 'Week 3'
    elif s == Week_4 or s < Week_5:
        return 'Week 4'
    elif s == Week_5 or s <= last_day:
        return 'Week 5'
    return ''
newplot["Day"] = pd.to_datetime(newplot["Day"])
newplot['day_of_the_week'] = pd.Series(newplot['Day'].dt.day_name())
newplot['day'] = newplot['day'].astype(int)
newplot['Week'] = newplot['day'].apply(applyFunc)
newplot["Day"] = newplot["Day"].dt.strftime('%Y-%m-%d')
newplot = newplot.reset_index(drop=True)
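The same bucketing can be written more declaratively with pd.cut. This is a sketch for a 30-day month; the bin edges mirror the Week_1 to Week_5 constants above:
# Equivalent week labels via interval bins: (0,4]=Week 1, (4,11]=Week 2, ...
newplot['Week'] = pd.cut(newplot['day'],
                         bins=[0, 4, 11, 18, 25, 31],
                         labels=['Week 1', 'Week 2', 'Week 3', 'Week 4', 'Week 5'])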
Heatmaps and pivot tables
Messages pivot table
heatmap = newplot.pivot_table(index="Week", columns="day_of_the_week", values="Messages").fillna(0)
week_pivot = heatmap.reindex(columns=['Sunday','Monday','Tuesday', 'Wednesday','Thursday','Friday','Saturday'])
week_pivot
Messages heatmap
plt.figure(figsize=(20, 10), facecolor="w")
plt.title("Messages heatmap")
sns.heatmap(week_pivot, annot=False, cbar_kws={'shrink': 0.9})
plt.xticks(fontsize=12, rotation=45)
plt.yticks(fontsize=15)
Chatters pivot table
chatters_heatmap = newplot.pivot_table(index="Week", columns="day_of_the_week", values="Chatters").fillna(0)
chatters_heatmap = chatters_heatmap.reindex(columns=['Sunday','Monday','Tuesday', 'Wednesday','Thursday','Friday','Saturday'])
chatters_heatmap
Chatters heatmap
plt.figure(figsize=(20, 10))
plt.title("Chatters heatmap")
sns.heatmap(chatters_heatmap, annot=False)
plt.xticks(fontsize=12, rotation=45)
plt.yticks(fontsize=15)
Distribution of messages between top 5 and top 20
The pie chart needs three totals that were never defined earlier in the notebook, so I rebuild them here from the per-user message counts; the top-5 / ranks-6-to-20 / everyone-else split is my best guess at the original intent.
plt.rcParams['font.size'] = 15
counts = df.user.value_counts()
totalTop5Chatters = counts[:5].sum()    # messages sent by the top 5 chatters
top20withouttop5 = counts[5:20].sum()   # messages sent by chatters ranked 6-20
notopChatter = counts[20:].sum()        # messages sent by everyone else
top_chatters_frame = pd.DataFrame({'Top Chatters': [notopChatter, top20withouttop5, totalTop5Chatters]},
                                  index=['Everyone else', 'Top 6-20', 'Top 5'])
top_chatters_frame
plot = top_chatters_frame.plot.pie(y='Top Chatters', figsize=(10, 10), fontsize=10)
plt.savefig('plot')
Mean, median, and mode of messages per user
table_user_and_total_messages = df.user.value_counts()
aaa = table_user_and_total_messages.reset_index().rename(columns={"index": "User", "user": "Total_Messeges_Per_User"})
Messeges_Per_User = aaa["Total_Messeges_Per_User"].to_numpy()
mean = np.mean(Messeges_Per_User)
median = np.median(Messeges_Per_User)
Mode = stats.mode(Messeges_Per_User)
table_total_user_and_total_messages = len(pd.unique(df['user']))
print('Total chatters =', table_total_user_and_total_messages)
print('Mean =', mean)
Mean_text = 'Mean ' + str(mean)
print('Median =', median)
Median_text = 'Median ' + str(median)
print('Mode =', Mode)
Mode_text = 'Mode ' + str(Mode)
Total chatters = 40122
Mean = 72.05341342439122
Median = 2.0
Mode = ModeResult(mode=array([1], dtype=int64), count=array([16121]))
So I discovered that most users send only 1 or 2 messages per stream, so I decided to make a visualization of it.
Messages sent by users (grouped by quantity)
df_Per_User = aaa['Total_Messeges_Per_User'].value_counts().reset_index().rename(columns={"index": "Total",})
df_Per_User
plt.figure(figsize=(20, 10))
plt.xticks(fontsize=12, rotation=90)
plt.yticks(fontsize=13)
plt.plot(df_Per_User.Total, df_Per_User.Total_Messeges_Per_User)
plt.title("Messages sent by users (grouped by quantity)")
plt.xlabel('Total of messages', fontsize=20)
plt.ylabel('Number of users', fontsize=20)
Unfortunately, as you can see, the graph contains too much information, so I decided to remove outliers to get a better view of what happens with the majority of viewers.
df2 = df_Per_User[(df_Per_User["Total"] <= 10)]
df2
Now that we have a new dataset without the outliers, we are going to redo the graph.
plt.figure(figsize=(12, 10))
plt.xticks(fontsize=12)
plt.yticks(fontsize=13)
plt.bar(df2.Total, df2.Total_Messeges_Per_User)
plt.title("Messages sent by users (grouped by quantity)")
plt.xlabel('Total of messages', fontsize=20)
plt.ylabel('Number of users', fontsize=20)
plt.xlim(1, 10)
plt.savefig('msbu')
Sentiment analysis
# Work on a copy so the cleaned frame is not modified in place
sentiment_df = clean_wothoutNA.copy()

def sentiment_calc(text):
    try:
        return TextBlob(text).sentiment.polarity
    except Exception:
        return None

sentiment_df['sentiment'] = sentiment_df['message'].apply(sentiment_calc)
# Zero-polarity rows are mostly emotes and other tokens TextBlob does not recognize, so drop them
sentiment_df = sentiment_df[sentiment_df.sentiment != 0]
Sentiment_of_chat = sentiment_df["sentiment"].to_numpy()
mean = np.mean(Sentiment_of_chat)
median = np.median(Sentiment_of_chat)
Mode = stats.mode(Sentiment_of_chat)
print('Mean =', mean)
Mean_text = 'Mean ' + str(mean)
print('Median =', median)
Median_text = 'Median ' + str(median)
print('Mode =', Mode)
Mode_text = 'Mode ' + str(Mode)
plt.figure(figsize=(12, 10))
plt.boxplot(sentiment_df.sentiment)
Mean = 0.189543889185649
Median = 0.25
Mode = ModeResult(mode=array([0.5]), count=array([51712]))
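As a quick sanity check, here is how TextBlob scores a few invented messages. Polarity runs from -1 (negative) to +1 (positive), and tokens TextBlob does not recognize, such as emote names, score 0, which is why the zero-polarity rows were dropped above.
# Polarity of a few made-up chat messages
for msg in ["I love this stream", "this is so boring", "KEKW"]:
    print(msg, "->", TextBlob(msg).sentiment.polarity)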