It’s quite common the case that one needs to cluster categorical variables. When it comes to clustering, a number of well known techniques abound such as k-means, k-medoids, and hierarchical clustering. With k-means and k-medoids clustering, you must have an idea of the number of clusters \(k\) desired. Furthermore, the final clusters can vary significantly depending on the number of initial clusters chosen. With hierarchical clustering, an intuitive cluster hierarchy can be observed, however computational efficiency and immediate utility is less clear. Nonetheless, the hierarhical clustering approach lends itself well to interpretation so let us consider this approach first.

d=dist(as.matrix(mtcars))
hc=hclust(d)
plot(hc)

We can further decide on an appropriate number of clusters, say 3, and observe membership.

clustercut = cutree(hc,3)
sort(clustercut)
##           Mazda RX4       Mazda RX4 Wag          Datsun 710 
##                   1                   1                   1 
##           Merc 240D            Merc 230            Merc 280 
##                   1                   1                   1 
##           Merc 280C            Fiat 128         Honda Civic 
##                   1                   1                   1 
##      Toyota Corolla       Toyota Corona           Fiat X1-9 
##                   1                   1                   1 
##       Porsche 914-2        Lotus Europa        Ferrari Dino 
##                   1                   1                   1 
##          Volvo 142E      Hornet 4 Drive             Valiant 
##                   1                   2                   2 
##          Merc 450SE          Merc 450SL         Merc 450SLC 
##                   2                   2                   2 
##    Dodge Challenger         AMC Javelin   Hornet Sportabout 
##                   2                   2                   3 
##          Duster 360  Cadillac Fleetwood Lincoln Continental 
##                   3                   3                   3 
##   Chrysler Imperial          Camaro Z28    Pontiac Firebird 
##                   3                   3                   3 
##      Ford Pantera L       Maserati Bora 
##                   3                   3

It is not always the case that cluster memberships break up so easily into evenly sized, appropriate clusters.

For this case, we will use the Markov Clustering Algorithm because this will allow us to distinguish between overlapping categories.

Stijn van Dongen, Graph Clustering by Flow Simulation, PhD thesis, University of Utrecht, May 2000. ( http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm )

For example, consider a group of movies. Movies are usually grouped under multiple genres such as drama, action, documentary, etc. This makes it tricky to truly distinguish one movie from the next. Perhaps there are more reprentative genre groupings. A sample of our incoming dataset is given here. This data comes courtesy of IMDb and is available here.

head(df[,c('movie_title','genres')])
##                                               movie_title
## 1                                                 Avatar 
## 2               Pirates of the Caribbean: At World's End 
## 3                                                Spectre 
## 4                                  The Dark Knight Rises 
## 5 Star Wars: Episode VII - The Force Awakens             
## 6                                            John Carter 
##                            genres
## 1 Action|Adventure|Fantasy|Sci-Fi
## 2        Action|Adventure|Fantasy
## 3       Action|Adventure|Thriller
## 4                 Action|Thriller
## 5                     Documentary
## 6         Action|Adventure|Sci-Fi

In order to use hierarchical clustering, we need to assign a distance value between every element and every other element. For \(n\) elements, this usually involves needing to compute \((n^2-n)/2 = n(n-1)/2\) values. In our list we have 5043 movies. So we need to do 1.27134e+07 computations. To simplify this task we will sample 100 movies, a more reasonable level.

set.seed(123)
sample_size = 100
sample_index = sample(dim(df)[1], sample_size, replace=FALSE)
df_sample = df[sample_index,]

One simple measure is the Jaccard index: the intersection of sets over the union of sets.
That is, given movies \(A\) and \(B\), \[J(A,B) = \frac{|A\cap B|}{|A\cup B|}\] Where the absolute values represent the size of the sets in question. We can preallocate a vector to represent distance values.

distances = rep(0,(sample_size^2-sample_size)/2)

Then we simply need to iterate over the movies in our sample set to see what the Jaccard index of any pair of movies is according to genre tags.

npairs = (sample_size^2-sample_size)/2
distances = rep(0,npairs)
movie_titlesA = rep(" ", npairs)
movie_titlesB = rep(" ", npairs)
k=0
for (i in 1:(sample_size-1)){
  for (j in (i+1):sample_size){
    k=k+1
    A = unlist(strsplit(df_sample[i,'genres'],"[| ]"))
    B = unlist(strsplit(df_sample[j,'genres'],"[| ]"))
    movie_titlesA[k] = df_sample[i,'movie_title']
    movie_titlesB[k] = df_sample[j,'movie_title']
    distances[k]= length(intersect(A,B))/length(union(A,B))
  }
}
df_out = data.frame(cbind(movie_titlesA,movie_titlesB,distances),stringsAsFactors=FALSE)
df_out = df_out[df_out$distances > 0,]

After we export the dataframe to a file `

We call the mcl algorithm mcl movie_pairs.abc --abc -I 4.0. 4.0 is the extent to which we would like the algorithm to differentiate. This one parameter allows us to vary the final number of clusters we arrive at.

The Nativity Story  The Holy Girl  Twin Falls Idaho  Chloe  The Age of Adaline  Queen of the Mountains  Illuminata  The Clan of the Cave Bear  Remember the Titans  Forsaken  The End of the Affair  Everest  Courage  The Crow  Dracula Untold  A Farewell to Arms  The Celebration  The Passion of the Christ  All or Nothing  The Express  Day One  Adam  Solomon and Sheba  The Outsiders  Freeheld  Mooz-Lum  Evita 

Cluster 1: Elysium  The Scorpion King  X-Men Origins: Wolverine  The Mask  London Has Fallen  In the Valley of Elah  N-Secure  Airlift  Grosse Pointe Blank  Spider-Man 3  The Heat  Ride Along  Beverly Hills Cop II  The Living Daylights  Wing Commander  12 Rounds  The Dead Undead  Fast Five  Desperado  The Dark Knight Rises 

Cluster 2: But I’m a Cheerleader  Bran Nue Dae  Tumbleweeds  Shortbus  Friday  Agora  Quigley Down Under  Sonny with a Chance  Coyote Ugly  Bandits  The Intern  The Love Letter  Of Horses and Men  Cold Mountain  Topsy-Turvy  Annie Get Your Gun 

Cluster 3: Connie and Carla  Harold & Kumar Go to White Castle  For Your Consideration  Airplane!  Monster-in-Law  Team America: World Police  The Big Bounce  Corky Romano  Supporting Characters  The Wash  An Alan Smithee Film: Burn Hollywood Burn  Chicago 

Cluster 4: The Adventurer: The Curse of the Midas Box  The Secret of Kells  Monty Python and the Holy Grail  Chicken Little  Stuart Little 2  Big Fat Liar  Yogi Bear  The Wild Thornberrys Movie  Journey 2: The Mysterious Island  The Last Legion  Dragon Hunters 

Cluster 5: The Call of Cthulhu  The Mothman Prophecies  A Nightmare on Elm Street 2: Freddy’s Revenge  Premonition  The Jacket  Urban Legend  The Visit  Red Eye  Thir13en Ghosts 

Cluster 6: The Adjustment Bureau  The Invasion  Beyond the Black Rainbow 

Censored Voices  My Date with Drew