Cats - Unsupervised! · rcphillips.github.io

Hello Everyone!

For the first project conducted exclusively for this blog, I wanted to work on something near and dear to my heart. Cats.

[Image: ripley]

Where did this data come from?

If you haven't heard, https://toolbox.google.com/datasetsearch is awesome.

As I often do when exposed to a new tool, I typed in "cat" to see what would happen.

In so doing, I was rewarded with the Cat Personality Dataset from https://data.unisa.edu.au/Dataset.aspx?DatasetID=271178

Published in March 2018, this is a FIFTY-TWO item dataset quantifying the personality of 2802 cats.

They are from Australia and New Zealand, but I am going to assume no major differences at this point.

I should also point out that each entry is a 1-7 rating filled out by the owner, not the cat. So these cat owners saw something similar to "Is your cat Erratic? (rate 1-7)", and then answered 51 more questions like it.

What sort of questions can we answer?

I wanted to understand what types of cats exist in this data set, and so, potentially, the world.

Every cat is special, and the number of possible answer combinations in this data set is 7^52, or roughly 8.8 × 10^43 possible cats. This isn't useful at all.
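As a quick sanity check on that count, Python's arbitrary-precision integers can compute it exactly. This is only a sketch, and the variable names are mine rather than anything from the dataset.

```python
# Each of the 52 items is rated on a 7-point scale,
# so the number of distinct answer sheets is 7 ** 52.
n_items = 52
scale_points = 7

possible_cats = scale_points ** n_items  # Python ints do not overflow

print(f"{possible_cats:,}")    # exact value with thousands separators
print(f"{possible_cats:.2e}")  # scientific notation for readability
```

Doing this with floats instead of ints would round away the low-order digits, which is why the exact integer form is used here.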

How can I reduce this number to something I can get my brain around?

The answer is clustering. This is a form of unsupervised learning in which we apply an algorithm to the data in order to group similar cats together. Looking at each group then shows which features (cat attributes, in this case) tend to go together, and which ones distinguish different categories of cats.
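To make the idea concrete, here is a minimal sketch on made-up data: two invented groups of "cats" rated on three traits, which k-means should pull apart. The groups, sizes, and names (`mellow`, `intense`, `toy_X`) are assumptions for illustration only, not part of the real dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two hypothetical groups of 50 cats, each rated on 3 traits (1-7 scale)
mellow = rng.integers(1, 4, size=(50, 3))   # ratings between 1 and 3
intense = rng.integers(5, 8, size=(50, 3))  # ratings between 5 and 7
toy_X = np.vstack([mellow, intense])

# Ask k-means for two clusters; n_init=10 tries ten random starts
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_X)
labels = toy_kmeans.labels_

print(labels[:50])   # labels assigned to the mellow cats
print(labels[50:])   # labels assigned to the intense cats
print(toy_kmeans.cluster_centers_.round(2))
```

Nothing told the algorithm which cats belonged together; it only saw the ratings. Real survey data will be far messier than this toy example.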

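Other approaches exist, such as reducing the dimensions first with PCA or using a different clustering algorithm, but k-means is a fine starting point. (k-means++, incidentally, is not an alternative to k-means but a way of choosing its starting centroids, and it is the default initialization in scikit-learn's KMeans.)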

Given the attributes included in the survey, it seems CERTAIN that some will group together ("Gentle" and "Calm") whereas others will be quite distinct ("Predictable" and "Erratic")
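One way to check a hunch like this is a correlation matrix. The sketch below uses synthetic data in which two hidden traits (`calmness` and `steadiness`, both invented here) drive four attributes; the values are continuous rather than 1-7 ratings, so treat it purely as an illustration of `DataFrame.corr()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_cats = 500

# Two hypothetical hidden traits that drive the observed attributes
calmness = rng.normal(size=n_cats)
steadiness = rng.normal(size=n_cats)


def noise():
    """Independent measurement noise for one attribute."""
    return rng.normal(scale=0.5, size=n_cats)


toy = pd.DataFrame({
    "Gentle": calmness + noise(),
    "Calm": calmness + noise(),
    "Predictable": steadiness + noise(),
    "Erratic": -steadiness + noise(),  # built to oppose Predictable
})

# Pairwise Pearson correlations between attributes
corr = toy.corr()
print(corr.round(2))
```

On the real survey, the same call on the attribute columns would show whether pairs like these actually move together or in opposite directions.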

import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn import metrics
from scipy.spatial.distance import cdist

# loading the data
data_folder = Path("E:/Downloads")
file_to_open = data_folder / "Cat_personality_data.csv"
df = pd.read_csv(file_to_open)
df.head()

	Personality1_Vigilant	Personality2_Stable	Personality3_Bold	Personality4_Clumsy	Personality5_Defiant	Personality6_Gentle	Personality7_Constrained	Personality8_Inquisitive	Personality9_Inventive	Personality10_Irritable	...	Personality45_Decisive	Personality46_Self_assured	Personality47_Anxious	Personality48_Trusting	Personality49_Active	Personality50_Cooperative	Personality51_Shy	Personality52_Eccentric	Country	Cat_sex
0	6	5	7	5	5	6	1	5	4	3	...	4	6	4	5	7	2	2	4	NewZealand	Male
1	7	3	4	1	7	7	3	4	3	2	...	4	3	6	6	1	6	2	5	NewZealand	Female
2	7	5	7	1	2	3	3	7	7	2	...	7	6	3	4	7	2	3	4	Australia	Male
3	7	4	1	1	7	2	7	1	3	7	...	5	3	3	2	2	2	6	2	Australia	Female
4	5	7	7	1	4	7	7	4	4	1	...	5	7	1	7	7	6	2	7	Australia	Male

5 rows × 54 columns

# cleaning up names: strip the "PersonalityN_" prefix from each trait column
# (splitting on the first underscore avoids hard-coding prefix lengths)
names = []
for col in df.columns:
    if "Personality" in col:
        names.append(col.split("_", 1)[1])
    else:
        names.append(col)  # Country and Cat_sex are left unchanged
df.columns = names
# visualizing the results
df.head()

	Vigilant	Stable	Bold	Clumsy	Defiant	Gentle	Constrained	Inquisitive	Inventive	Irritable	...	Decisive	Self_assured	Anxious	Trusting	Active	Cooperative	Shy	Eccentric	Country	Cat_sex
0	6	5	7	5	5	6	1	5	4	3	...	4	6	4	5	7	2	2	4	NewZealand	Male
1	7	3	4	1	7	7	3	4	3	2	...	4	3	6	6	1	6	2	5	NewZealand	Female
2	7	5	7	1	2	3	3	7	7	2	...	7	6	3	4	7	2	3	4	Australia	Male
3	7	4	1	1	7	2	7	1	3	7	...	5	3	3	2	2	2	6	2	Australia	Female
4	5	7	7	1	4	7	7	4	4	1	...	5	7	1	7	7	6	2	7	Australia	Male

5 rows × 54 columns

df=df.iloc[:,:-2] # we drop gender and nationality for now

# entering it into k-means clustering
X = df.values
kmeans = KMeans(
    n_clusters=5,
    random_state=0).fit(X)

kmeans.labels_ #these are the clusters it came up with

array([2, 3, 2, ..., 0, 4, 4])

df['cluster']=kmeans.labels_

So, we have 5 clusters of cats. Is that the right number?

For this, I turned to what's known as an elbow plot (a close cousin of the "Scree Plot" used with PCA). AKA, trying a bunch of different numbers.

It's worth noting that I ran this several times because, in k-means, you can get a sub-optimal clustering depending on the initial conditions.
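Here is a rough sketch of that sensitivity on made-up ratings (the array `toy_X` is random, not the cat data). Each single run starts from a different random seed, and the final inertia (the sum of squared distances to the nearest center) can differ between runs; the `n_init` parameter asks scikit-learn to try several starts and keep the best one.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: 300 "cats" rated 1-7 on 10 traits
toy_X = rng.integers(1, 8, size=(300, 10)).astype(float)

# One random start per run: results depend on the seed
inertias = []
for seed in range(5):
    model = KMeans(n_clusters=5, init="random", n_init=1,
                   random_state=seed).fit(toy_X)
    inertias.append(model.inertia_)
print([round(v, 1) for v in inertias])

# Ten starts in one call: scikit-learn keeps the lowest-inertia result
best_of_ten = KMeans(n_clusters=5, n_init=10, random_state=0).fit(toy_X)
print(round(best_of_ten.inertia_, 1))
```

Setting `n_init` explicitly is a simple alternative to rerunning the cell by hand.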

# thanks to: 
# https://pythonprogramminglanguage.com/kmeans-elbow-method/ 
# k means determine k
distortions = []
K = range(1,20)
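for k in K:
    # fit once per k (the chained .fit() already returns the fitted model)
    kmeanModel = KMeans(n_clusters=k).fit(X)
    # distortion: mean distance from each cat to its nearest cluster center
    distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])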
# Plot the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()

[Figure: elbow plot of distortion against k]

This indicates that while our clusters fit better and better as we allow for more and more types of cats, most of the gain has been made by k = 10, or maybe even k = 7.
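Since reading an elbow is a judgment call, a second opinion can help. One common option, which I did not use for the results in this post, is the silhouette score from `sklearn.metrics`. The sketch below runs it on synthetic blobs with three known centers (the centers and sizes are made up) to show the mechanics.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (hypothetical centers)
toy_X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Silhouette needs at least 2 clusters, so start at k=2
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy_X)
    scores[k] = silhouette_score(toy_X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(k, round(score, 3))
print("best k:", best_k)
```

Higher is better here, so the preferred k is the one with the largest score rather than the location of a bend.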

One of the advantages of k-means is that it was trivial to work in 52-dimensional space. Now, however, that makes visualization a bit mind-boggling. What we can do is examine the weakest and strongest average attributes for each cluster.

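# repeating what we did above with clusters = 7
# (drop the old 'cluster' column first so the earlier labels are not used as a feature)
X = df.drop(columns='cluster').values
kmeans = KMeans(
    n_clusters=7,
    random_state=0).fit(X)
df['cluster']=kmeans.labels_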

cluster_attributes=df.groupby(['cluster']).mean()

# Get top 3 and bottom 3 for one cluster
zero_cluster=cluster_attributes[cluster_attributes.index==0].T
sorted_zero_cluster=zero_cluster.sort_values(by=0)  # ascending: lowest first
top_sorted_zero_cluster=sorted_zero_cluster.iloc[-3:]  # three highest
bottom_sorted_zero_cluster=sorted_zero_cluster.iloc[:3]  # three lowest

bottom_sorted_zero_cluster.index.tolist()

['Aggressive_to_people', 'Clumsy', 'Erratic']

bottom_sorted_zero_cluster

cluster	0
Aggressive_to_people	1.459290
Clumsy	1.812109
Erratic	2.008351

# Repeat for each cluster:
top_attributes = []
bot_attributes = []
for i in range(7):
    curr_cluster=cluster_attributes[cluster_attributes.index==i].T
    sorted_curr_cluster=curr_cluster.sort_values(by=i)  # ascending: lowest first
    # reverse the last three rows so they run highest first, matching the column labels below
    top_sorted_curr_cluster=sorted_curr_cluster.iloc[-3:].iloc[::-1]
    bottom_sorted_curr_cluster=sorted_curr_cluster.iloc[:3]
    top_attributes.append(top_sorted_curr_cluster.index.tolist())
    bot_attributes.append(bottom_sorted_curr_cluster.index.tolist())

top_df = pd.DataFrame(data=top_attributes, columns=['highest', 'second highest', 'third highest'])

bot_df = pd.DataFrame(data=bot_attributes,columns=['lowest', 'second lowest', 'third lowest'])

results_df=top_df.join(bot_df, rsuffix='_bot')

results_df.T

	0	1	2	3	4	5	6
highest	Smart	Affectionate	Vigilant	Self_assured	Suspicious	Gentle	Affectionate
second highest	Vigilant	Playful	Suspicious	Smart	Insecure	Affectionate	Gentle
third highest	Affectionate	Friendly_to_people	Smart	Inquisitive	Predictable	Friendly_to_people	Friendly_to_people
lowest	Aggressive_to_people	Aggressive_to_people	Clumsy	Clumsy	Aggressive_to_people	Aggressive_to_people	Aggressive_to_people
second lowest	Clumsy	Irritable	Friendly_other_cats	Submissive	Bullying	Erratic	Irritable
third lowest	Erratic	Fearful_of_people	Submissive	Aggressive_to_people	Bold	Fearful_of_people	Tense

What do we think of these?

First, consider the top-rated attributes of each cluster. Friendly_to_people is among the top three in 3 of our 7 clusters (1, 5, and 6). While it makes sense that people have cats that love them, I wonder what makes those clusters different? Affectionate ranks very highly in those clusters too.

For clusters 1 and 5, Fearful_of_people is also among the lowest. Makes sense.

More interesting perhaps, are the "smart cats". The Ravenclaws, if you will.
Cluster 0 is still affectionate, but is also rated highly for Vigilance and Smarts.
Cluster 2 has actually lost affection as one of their top attributes, and has "Suspicious" instead.
Cluster 3 is rated highly as inquisitive and smart, as well as self assured.

Finally, poor cluster 4. They're Predictable, Insecure, and Suspicious.
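Because attributes such as Aggressive_to_people are low in nearly every cluster, raw averages can make the clusters look more alike than they are. One possible refinement, sketched here on a tiny made-up table (the numbers are invented, not taken from the survey), is to rank attributes by how far each cluster's mean sits from the overall mean.

```python
import pandas as pd

# Hypothetical ratings for four cats in two clusters
toy = pd.DataFrame({
    "Friendly_to_people":   [7, 7, 6, 6],
    "Smart":                [3, 4, 5, 6],
    "Aggressive_to_people": [1, 1, 2, 1],
    "cluster":              [0, 0, 1, 1],
})

traits = toy.drop(columns="cluster")
cluster_means = toy.groupby("cluster").mean()

# Raw view: the highest-rated attribute in each cluster
raw_top = cluster_means.idxmax(axis=1)

# Relative view: the attribute furthest above the all-cat average
relative = cluster_means - traits.mean()
most_distinctive = relative.idxmax(axis=1)

print(raw_top)
print(most_distinctive)
```

In this toy table both clusters share the same raw top attribute, while the relative view separates them. Whether that holds for the real cats is something I have not checked.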

So it looks like we have some interesting information, but this data probably needs a bit more massaging. To follow up on what we've done here, let's just see how evenly the cats are spread across our clusters.

df.groupby('cluster').count()

	Vigilant	Stable	Bold	Clumsy	Defiant	Gentle	Constrained	Inquisitive	Inventive	Irritable	...	Playful	Vocal	Decisive	Self_assured	Anxious	Trusting	Active	Cooperative	Shy	Eccentric
cluster
0	479	479	479	479	479	479	479	479	479	479	...	479	479	479	479	479	479	479	479	479	479
1	317	317	317	317	317	317	317	317	317	317	...	317	317	317	317	317	317	317	317	317	317
2	476	476	476	476	476	476	476	476	476	476	...	476	476	476	476	476	476	476	476	476	476
3	473	473	473	473	473	473	473	473	473	473	...	473	473	473	473	473	473	473	473	473	473
4	349	349	349	349	349	349	349	349	349	349	...	349	349	349	349	349	349	349	349	349	349
5	259	259	259	259	259	259	259	259	259	259	...	259	259	259	259	259	259	259	259	259	259
6	449	449	449	449	449	449	449	449	449	449	...	449	449	449	449	449	449	449	449	449	449

7 rows × 52 columns

Pretty good distribution. Cluster 5 is the smallest with 259 cats, and the four largest clusters each have roughly 450 to 480.
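The same counts are easier to read as a single column with percentages. The sketch below simply re-enters the cluster sizes from the table above; on the live DataFrame, something like `df['cluster'].value_counts().sort_index()` should give the same counts without the 52 repeated columns.

```python
import pandas as pd

# Cluster sizes copied from the groupby output above
sizes = pd.Series(
    {0: 479, 1: 317, 2: 476, 3: 473, 4: 349, 5: 259, 6: 449},
    name="cats",
)

# Share of all cats that falls in each cluster, in percent
shares = (sizes / sizes.sum() * 100).round(1)

summary = pd.DataFrame({"cats": sizes, "percent": shares})
print(summary)
print("total cats:", sizes.sum())
```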

Finally, I want to look at how one of these common attributes, Friendly_to_people, is distributed within different clusters.

#all cats included
df.loc[:,['Friendly_to_people']].hist(bins=7)

[Figure: histogram of Friendly_to_people ratings for all cats]

# just one of the friendlies
one=df[df.cluster==1]
one.loc[:,['Friendly_to_people']].hist(bins=7)

[Figure: histogram of Friendly_to_people ratings for cluster 1]

# the insecure cats
four=df[df.cluster==4]
four.loc[:,['Friendly_to_people']].hist(bins=7)

[Figure: histogram of Friendly_to_people ratings for cluster 4]

There's a lot more to do here, including PCA and visualization to get an idea of how good the separation is here, but I'm out of time. Gotta go feed my cat.
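For anyone who wants to pick this up, a PCA projection might start out something like the sketch below. It runs on random stand-in ratings (`toy_X` is not the cat data), so the clusters it finds are meaningless; it only shows the mechanics of squeezing 52 dimensions into 2 with `sklearn.decomposition.PCA`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical stand-in: 200 "cats" rated 1-7 on 52 traits
toy_X = rng.integers(1, 8, size=(200, 52)).astype(float)

# Cluster in the full 52-dimensional space, as in the post
toy_labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(toy_X)

# Project to 2 dimensions for plotting only
pca = PCA(n_components=2)
coords = pca.fit_transform(toy_X)
explained = pca.explained_variance_ratio_

print(coords.shape)
print(explained.round(3))  # share of variance kept by each component
```

The two columns of `coords` could then be passed to `plt.scatter` with `c=toy_labels` to color each point by cluster. If the two components explain only a small share of the variance, overlapping clusters in the plot would not necessarily mean poor separation in the full space.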

Tentative findings:
- There are about 7 types of cats.
- Four flavors of friendly cat.
- Two flavors of smart cat.
- One flavor of scaredy cat.
