Analysis of Ours to Shape Comments, Part 5

Introduction

In the penultimate post of this series, we’ll use some unsupervised learning approaches to uncover comment clusters and latent themes among the comments to President Ryan’s Ours to Shape website.

The full code to recreate the analysis in the blog posts is available on GitHub.

Cluster Analysis

Cluster analysis is about discovering groups in data such that the observations within each group share an internal cohesiveness: they look more like one another than like members of other clusters. This is exploratory data analysis; we don’t know the groups ahead of time.

Let’s start with some preliminary visualization. First, I stem the words to reduce the dimensionality a bit further. Then I turn the document-feature matrix into a tf-idf weighted matrix (we’re looking for the words that help distinguish documents), normalize it for document length, and finally extract the principal components. The figure plots each comment along the first two principal components.

Scatterplot of principal components 1 and 2
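
The weighting steps described above can be sketched in a few lines. This is a minimal Python illustration rather than the code behind the post (the analysis itself is in R). The toy comments, the function name `tfidf_matrix`, and the specific scheme (raw count times log10 of N over document frequency, then rows scaled to unit length) are assumptions made for illustration. Stemming and the principal components step are left out.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, rows) where rows[i][j] is the tf-idf weight of term j in doc i.

    Assumed scheme: raw term count times log10(N / document frequency),
    followed by scaling each row to unit Euclidean length.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log10(n_docs / doc_freq[tok]) for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in row))
        # Length normalization; all-zero rows are left untouched.
        rows.append([w / norm for w in row] if norm > 0 else row)
    return vocab, rows

# Toy comments (hypothetical, not drawn from the corpus).
docs = [
    "ice arena hockey community",
    "ice arena skate community",
    "climate energy emission community",
]
vocab, rows = tfidf_matrix(docs)
```

A word that appears in every document ("community" here) gets a weight of zero, which is the point of tf-idf: it does nothing to distinguish one comment from another.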

To be clear, we haven’t implemented a cluster analysis yet; we’re just looking at the data a little differently. The first two principal components account for only a small amount of the variation in the overall text data, so we won’t be able to visually distinguish differences along many dimensions. But in these first two dimensions, we do see some clumps.

Let’s look at that clump to the far right, above 0.75 on PC1:


1 What if the University of Virginia built a state of the art Ice Sports Arena? We would have a place of true physical education for all ages and abilities. An ice arena is a venue for participation. 500 athletes a day (with 1/3rd coming from the non-University community) would come to skate with family or to make new friends while discovering the joy of movement whether it be hockey, figure skating or just ice skating. The UVA ICE PARK would be a bridge between Town and Gown and a way for the school to be of service to the community while helping all people discover the joy of movement and exercise.

Here’s our ice arena lobby! There are 10 comments here, mostly versions of the example comment above. What about the other comments spread along the first dimension, between 0.15 and 0.75? Two of the 31 comments in this range are printed below:


1 I've recently learned that there might be an opportunity for the University to add an arena for ice sports. It has always seemed strange that UVA did not already have a facility for its varsity hockey teams as well as all of the other purposes an arena would serve for the UVA stduents, faculty, staff, and broader community. Thanks for supporting that. I think it would be worthwhile.
2 In the past year, Charlottesville has seen the loss of its downtown ice arena which has left a significant hole in our community. Over the past two years, I have witnessed our students come together to watch our men's and women's hockey teams and the figure-skating program, as well as bond by participating in other activities such as public skating, floor hockey, and broomball. It was truly a focal point, not just for student life but also for bringing UVA and C’ville together. Building a new ice arena on Grounds would finally provide students, faculty, alumni and local residents with a permanent fix so we can all enjoy the true spirit of a shared community.

Ah, more ice arena advocacy, but without the duplication. Let’s see what’s distinguishing comments high on the second principal component, say, above 0.15. Here are a few of the 48 comments occupying this space on the graph:


1 Global warming is becoming a bigger problem with every new day that passes, and we are now reaching rates of global warming we have never experienced before. Temperature changes that were previously predicted to take thousands of years are now occurring within the span of a few decades.
2 I am writing a letter this afternoon to discuss the concrete steps that the University of Virginia must take in order to limit global warming and reduce fossil fuel emissions in Charlottesville. In the last century, Charlottesville has had less than five days of extreme, deadly 
3 I wanted to reach out to you as a student of the University of Virginia concerned with our university’s attempts to mitigate climate change. The topic has become thoroughly debated in recent politics, and being currently enrolled in Professor Deborah Lawrence’s Introduction to Climate 

Actually, I’m only showing the first part of these comments – they are all really, really long! But we seem to have found a climate and sustainability cluster!

Let’s formalize this in a cluster analysis!

Hierarchical Clustering

There are a lot of choices in clustering: multiple clustering algorithms, distance metrics and weights, optimization methods, and tuning parameters within given algorithms. And, of course, there is the choice of k, the number of clusters you want the model to define (choosing k is a tricky thing and I’m going to gloss over it here). I’m going to stick with some common choices, starting with hierarchical, or connectivity-based, methods.

In particular, I use the tf-idf weighted dfm, calculate the Euclidean distance, and create links through agglomeration (beginning with n partitions and successively fusing the clusters that are closest) using Ward’s method to determine which clusters are closest. I chose k=20 clusters to start.
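
To make the agglomeration step concrete, here is a minimal Python sketch of Ward-style merging on a handful of made-up two-dimensional points. It is not the code behind the post: the function name `ward_clusters` and the toy data are hypothetical, and the naive pairwise search is only practical for tiny inputs. At each step it fuses the two clusters whose merger increases the within-cluster sum of squares the least.

```python
def ward_clusters(points, k):
    """Agglomerate points into k clusters using Ward's merging criterion."""
    # Each cluster is a list of point indices; start with n singletons.
    clusters = [[i] for i in range(len(points))]
    dim = len(points[0])

    def centroid(members):
        return [sum(points[i][d] for i in members) / len(members) for d in range(dim)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if a and b are fused.
        ca, cb = centroid(a), centroid(b)
        sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * sq_dist

    while len(clusters) > k:
        # Pick the pair of clusters that is cheapest to fuse.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: merge_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy data: two tight clumps and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
clusters = ward_clusters(points, k=3)
```

With these points the isolated observation ends up as its own cluster, a small-scale version of the single-member clusters that show up in the real results. The cluster sizes reported in this post come from the full analysis, not from this toy example.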


comments_hc_20
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
757  18   8   1   1   2   1  41   1   1   1   1   1   1   1   1   1   1 
 19  20 
  1   2

This produced 20 clusters, because I asked it to, but most of the clusters are singles, and the bulk of observations are thrown into that first massive cluster. So, not very satisfying. Still, let’s see what that handful of more distinct groups (groups 2 and 8) contains.

To do that, I pull out the 10 highest weighted words (features are weighted by term frequency-inverse document frequency) among the comments within each cluster.
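
One way to read "highest weighted words within each cluster" is to add up each feature's tf-idf weight over the comments in a cluster and keep the largest totals. The sketch below assumes that interpretation; the helper name `top_terms_by_cluster` and the toy weights are hypothetical, not taken from the post's code or data.

```python
from collections import defaultdict

def top_terms_by_cluster(weights, labels, n=10):
    """Sum term weights within each cluster and return the n largest per cluster.

    weights: list of {term: weight} dicts, one per document.
    labels:  cluster label for each document, in the same order.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for doc_weights, label in zip(weights, labels):
        for term, w in doc_weights.items():
            totals[label][term] += w
    return {
        label: sorted(term_sums.items(), key=lambda kv: kv[1], reverse=True)[:n]
        for label, term_sums in totals.items()
    }

# Toy tf-idf weights for three documents in two clusters (made-up values).
weights = [
    {"ice": 2.0, "arena": 1.0},
    {"ice": 1.5, "hockey": 2.5},
    {"energi": 3.0, "emiss": 1.0},
]
labels = [8, 8, 2]
top = top_terms_by_cluster(weights, labels, n=2)
```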


$`2`
       energi         reduc climate_chang         emiss          warm 
     78.19342      76.57719      74.56030      73.20347      71.92488 
        chang          wast          X1.5          food           can 
     53.86263      53.45833      49.39675      45.57220      44.22610


$`8`
      ice     skate    hockey     arena      rink     sport  movement 
147.27017  92.31054  91.49594  72.40954  57.62955  49.36716  37.67839 
      joy      team     figur 
 36.14503  34.41481  33.64702

Cluster 2 seems to contain at least some of the climate change and sustainability comments, though from the exploratory analysis above, I know there are more than 18 of these.

Cluster 8 is the by-now-familiar ice arena group. The 41 comments in this cluster match the 41 comments I saw in the exploratory graph above (10 above 0.75 on PC1 plus 31 between 0.15 and 0.75), so that’s nice, at least.

But this hasn’t pulled out the kind of internal cohesion that might help me summarize the comments or generate insight.

Let’s try one more clustering algorithm.

K-means Clustering

Centroid-based clustering offers another approach, where observations are divvied up into groups by minimizing some numerical criterion – k-means is the most common partitioning approach.

K-means starts with k centroids (the points that will be the centers of the clusters), assigns each data point to the nearest centroid, updates each centroid to be the average of the data points assigned to it, and re-assigns each data point to the nearest centroid (repeating these steps until there is minimal change). Because the initial randomly chosen centroids can affect the outcome, we generally try a lot of starting points and let the algorithm select the solution that minimizes the within-cluster sum of squares (and so maximizes within-cluster homogeneity).
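
The assign-and-update loop, along with the multiple random starts, can be sketched as follows. This is a minimal Python illustration on made-up points, not the post's code; the function name `kmeans`, its parameters, and the stopping rule (stop when assignments no longer change) are assumptions for the example.

```python
import random

def kmeans(points, k, n_starts=10, max_iter=100, seed=0):
    """Run k-means from several random starts; return (labels, wss) of the best run."""
    rng = random.Random(seed)

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    best_labels, best_wss = None, float("inf")
    for _start in range(n_starts):
        centroids = rng.sample(points, k)  # random initial centroids
        labels = None
        for _ in range(max_iter):
            # Assignment step: nearest centroid for every point.
            new_labels = [
                min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
            ]
            if new_labels == labels:
                break  # assignments stopped changing
            labels = new_labels
            # Update step: each centroid becomes the mean of its points.
            for c in range(k):
                members = [p for p, lab in zip(points, labels) if lab == c]
                if members:  # keep the old centroid if a cluster empties
                    centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        # Within-cluster sum of squares for this start.
        wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
        if wss < best_wss:
            best_labels, best_wss = labels, wss
    return best_labels, best_wss

# Toy data: two well-separated clumps.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
labels, wss = kmeans(points, k=2)
```

Keeping the run with the smallest within-cluster sum of squares is what guards against an unlucky set of starting centroids.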

I’m going to use k=20 again.


  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
  1   1   1 768   1   1   1   1  30   7   9   1   1   3   1   9   1   1 
 19  20 
  1   3

Again, we have lots of groups of one, and only a handful of potentially distinctive clusters. Let’s look at the key distinctive words for clusters 9, 10, 11, and 16.


$`9`
      ice     skate    hockey     arena      rink  movement       joy 
130.47621  85.20973  82.07724  62.27221  47.75020  37.67839  36.14503 
    sport     figur    discov 
 33.35619  32.18411  29.25828


$`10`
   energi      dine       use       can transport       car      hall 
 32.04649  30.11296  25.28746  25.19208  24.69838  23.54779  21.90415 
    reduc      food       bus 
 20.99697  20.83300  20.72311


$`11`
climate_chang         reduc         emiss        energi          warm 
     54.86513      43.22906      40.33660      32.04649      31.27169 
            c        climat         chang          X1.5        carbon 
     28.25879      25.64948      23.08398      19.75870      18.32846


$`16`
  energi     heat     warm     wast    emiss  celsius     food    degre 
42.30136 40.23208 39.08961 35.16995 34.36081 33.23036 32.55157 31.24357 
   chang    reduc 
30.77864 29.64278

Cluster 9 is our hockey/ice skating enthusiasts; clusters 10, 11, and 16 are all variations on energy use and climate change – so potentially some subgroups within a larger sustainability theme. But that’s about it.

Overall, this is kind of disappointing, what with that undifferentiated mass of comments. It’s possible we haven’t tried a big enough k, but given all of the singles we’re already getting, that’s doubtful. It’s also possible that there really aren’t many common themes, nothing that holds a subset together. But more likely is that many of these comments are long and involved and possibly not single-themed. So let’s try something different.

Topic Models

Another approach to uncover the main themes (or topics) in an unstructured corpus is topic modeling.

Topic models require no prior information, no training set, no special annotation of the texts beforehand; only, as in cluster analysis, a decision about the number of topics, K. Unlike in cluster analysis, documents are not required to belong to a single topic or cluster (i.e., single membership), but may simultaneously belong to several topics.

I’m going to use an R package called stm – for structural topic models – which implements a correlated topic model and allows for the inclusion of covariates as predictors of topic prevalence. I reduce the dimensionality of my document feature matrix a bit and convert this into the form the stm package expects before estimating a topic model with K=20 topics (it takes about a minute on my machine).

Let’s see the top words for each estimated topic.


Topic 1 Top Words:
      Highest Prob: univers, uva, polici, can, work, communiti, parent 
      FREX: polici, leav, parent, adopt, paid, execut, governor 
Topic 2 Top Words:
      Highest Prob: uva, univers, presid, divers, communiti, group, ask 
      FREX: presid, slaveri, august, equiti, stop, respect, ask 
Topic 3 Top Words:
      Highest Prob: center, short, uva, marc, miller, invit, hire 
      FREX: marc, miller, short, invit, center, speaker, racism 
Topic 4 Top Words:
      Highest Prob: uva, follow, futur, learn, way, past, student 
      FREX: follow, uniqu, futur, past, monticello, relationship, method 
Topic 5 Top Words:
      Highest Prob: energi, reduc, uva, can, univers, emiss, wast 
      FREX: initi, emiss, heat, wast, energi, reduc, planet 
Topic 6 Top Words:
      Highest Prob: communiti, uva, staff, univers, charlottesvill, student, peopl 
      FREX: sens, staff, cvill, communiti, region, welcom, greater 
Topic 7 Top Words:
      Highest Prob: year, uva, student, school, communiti, first, charlottesvill 
      FREX: music, dorm, concert, youth, kid, year, free 
Topic 8 Top Words:
      Highest Prob: student, faculti, learn, school, can, uva, studi 
      FREX: innov, studi, explor, abroad, intern, busi, perspect 
Topic 9 Top Words:
      Highest Prob: divers, univers, encourag, faculti, polit, educ, thought 
      FREX: thought, opinion, divers, polit, correct, belief, liber 
Topic 10 Top Words:
      Highest Prob: can, uva, student, food, univers, dine, make 
      FREX: transport, bus, food, dine, car, electr, meat 
Topic 11 Top Words:
      Highest Prob: ice, uva, hockey, skate, communiti, arena, sport 
      FREX: hockey, skate, arena, rink, ice, sport, figur 
Topic 12 Top Words:
      Highest Prob: research, faculti, uva, teach, book, univers, time 
      FREX: research, librari, book, alderman, collabor, medic, renov 
Topic 13 Top Words:
      Highest Prob: univers, uva, system, fratern, greek, student, scienc 
      FREX: fratern, greek, footbal, neighborhood, system, sexual, behavior 
Topic 14 Top Words:
      Highest Prob: experi, student, uva, univers, residenti, knowledg, way 
      FREX: residenti, experi, art, space, museum, classroom, knowledg 
Topic 15 Top Words:
      Highest Prob: climate_chang, chang, make, communiti, need, world, uva 
      FREX: climate_chang, climat, negat, warm, technolog, chang, action 
Topic 16 Top Words:
      Highest Prob: servic, student, communiti, need, work, help, provid 
      FREX: servic, special, children, commonwealth, volunt, requir, hour 
Topic 17 Top Words:
      Highest Prob: student, hous, connect, uva, alumni, faculti, opportun 
      FREX: connect, hous, outsid, career, alumni, opportun, write 
Topic 18 Top Words:
      Highest Prob: sustain, univers, ground, like, improv, can, communiti 
      FREX: sustain, improv, aggress, neutral, environment, green, although 
Topic 19 Top Words:
      Highest Prob: student, staff, live, uva, better, univers, make 
      FREX: wage, lot, staff, pay, month, citi, better 
Topic 20 Top Words:
      Highest Prob: must, mani, valu, univers, focus, one, feel 
      FREX: valu, must, truth, focus, honor, mani, tradit 

The “Highest Prob” words are what folks are often accustomed to seeing in topic model output: the words with the highest probability of being in that topic (based on frequency). The problem is, many of the high probability words appear across multiple topics because they appear frequently throughout the corpus. I like the “FREX” descriptors, which balance the frequency with which words appear in a topic and the exclusivity with which they appear in that topic, as they better reflect distinctive but important words.
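
To see why a frequency-and-exclusivity score can rank words differently from raw probability, here is a simplified Python sketch of a FREX-style score: a weighted harmonic mean of a word's within-topic frequency rank and its within-topic exclusivity rank. The topic-word probabilities are invented, the function name `frex_scores` is hypothetical, and the stm package's own calculation may differ in its details.

```python
def frex_scores(beta, w=0.5):
    """beta[k][v]: probability of word v under topic k.

    Returns scores[k][v], a weighted harmonic mean of the within-topic
    empirical-CDF ranks of frequency and exclusivity.
    """
    n_topics, n_words = len(beta), len(beta[0])
    # Exclusivity: share of word v's total probability that belongs to topic k.
    col_sums = [sum(beta[k][v] for k in range(n_topics)) for v in range(n_words)]
    excl = [[beta[k][v] / col_sums[v] for v in range(n_words)] for k in range(n_topics)]

    def ecdf(values):
        # Fraction of entries less than or equal to each value.
        return [sum(other <= x for other in values) / len(values) for x in values]

    scores = []
    for k in range(n_topics):
        freq_rank, excl_rank = ecdf(beta[k]), ecdf(excl[k])
        scores.append([1.0 / (w / f + (1.0 - w) / e) for f, e in zip(freq_rank, excl_rank)])
    return scores

def argmax(row):
    return max(range(len(row)), key=lambda v: row[v])

# Made-up topic-word probabilities for three hypothetical topics.
vocab = ["uva", "student", "communiti", "hockey", "climat"]
beta = [
    [0.35, 0.20, 0.14, 0.30, 0.01],  # an "ice arena" style topic
    [0.35, 0.20, 0.14, 0.01, 0.30],  # a "climate" style topic
    [0.52, 0.27, 0.19, 0.01, 0.01],  # a generic topic
]
scores = frex_scores(beta)
top_prob = [vocab[argmax(row)] for row in beta]
top_frex = [vocab[argmax(row)] for row in scores]
```

In this toy example the most probable word is "uva" in every topic, while the FREX-style score puts "hockey" and "climat" at the top of the first two topics.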

There’s a lot to unpack here. Topic 2 appears to reflect comments addressing racial diversity, the events of August 11-12, the President’s Commission on Slavery and the University, and related ideas. Topic 3 pulls out references to the Miller Center’s hiring of Marc Short. Topic 11 captures the comments of the ice arena lobby. Topic 12 is picking up mentions of Alderman Library and renovation plans. And Topics 5, 15, and 18 address dimensions of sustainability and environmental change.

Let’s visualize the overall prevalence of each of these topics in our collection of comments:

Bar plot of overall topic prevalence by topic.
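
A common way to summarize overall prevalence, and the one assumed in this sketch, is to average each topic's share across documents. The document-topic proportions below are made up, and `topic_prevalence` is a hypothetical helper rather than anything from the post's code. The rows also show the mixed membership that sets topic models apart from clustering: each document spreads its weight over several topics.

```python
def topic_prevalence(theta):
    """theta[d][k]: estimated share of document d devoted to topic k (rows sum to 1).

    Returns the mean share of each topic across all documents.
    """
    n_docs = len(theta)
    return [sum(doc[k] for doc in theta) / n_docs for k in range(len(theta[0]))]

# Toy document-topic proportions for three documents and three topics.
theta = [
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.25, 0.50, 0.25],
]
prevalence = topic_prevalence(theta)
# Topic indices ordered from most to least prevalent.
ranked = sorted(range(len(prevalence)), key=lambda k: prevalence[k], reverse=True)
```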

Topics 17, 6, and 8 are the most frequent. On the surface, these topics aren’t quite as clear (to me) as some of the ones I referenced earlier. Here are the top words for these top topics.


Topic 17 Top Words:
      Highest Prob: student, hous, connect, uva, alumni, faculti, opportun 
      FREX: connect, hous, outsid, career, alumni, opportun, write 
Topic 6 Top Words:
      Highest Prob: communiti, uva, staff, univers, charlottesvill, student, peopl 
      FREX: sens, staff, cvill, communiti, region, welcom, greater 
Topic 8 Top Words:
      Highest Prob: student, faculti, learn, school, can, uva, studi 
      FREX: innov, studi, explor, abroad, intern, busi, perspect 

Looking at comments that score high on Topic 17 confirms that this one is a bit of a hodgepodge, combining attention to affordable housing and to research support.


[1] "UVa needs to take a broader view of community to include the City of Charlottesville. UVa student demand is an obstacle to affordable housing in the city. Today students are paying $1000 for one room in direct competition with underprivileged city residents. A significant investment in on campus student housing away from the city and requiring more students to live on campus would reduce housing demand in the area. This would allow the city to promote affordable housing growth."                 
[2] "There is an affordable housing crisis in Charlottesville for limited income residents. The university has a desire to offer more on-grounds housing options for upperclassmen students. \n\nThis problem presents an opportunity for a public-private partnership of mixed UVA student & Charlottesville community member housing. Alumnae Micaela Connery, CLAS '09 is researching current efforts similar to this. This demonstrates our commitment to the city and presents learning opportunities for students."
[3] "We need to create a centralize, focused effort on research.  In addition, we need to lower the overhead rate and apply what is collected to supporting the infrastructure."                                                                                                                                                                                                                                                                                                                                         
[4] "It is critical to foster an environment where diverse ideas, views, and perspectives are fostered and explored. Too often on campuses across the U.S. these days, certain points of view are attacked or shut down entirely to the great detriment of the academic community there. The University of Chicago has set a great example recently in this area and I hope that UVA follows suit and continues to be a place where the ideals of free speech and debate are celebrated."                                
[5] "Hire faculty who are tops in their field but who also have a commitment to engaging undergraduate in their research, not just their classrooms. Provide more opportunities for students, even undergrads, to work with our superstars --- provide more students with experiences like the Harrison Undergraduate Research Awards."

The comments that exemplify Topic 6 suggest a little more coherence, primarily around town-gown relations.


[1] "“The community” comes into UVA every day--as faculty, staff, students, and patients. The community staffs and maintains the University. UVA is the product of the local community. The community IS here.  But more of the community should be here. The enormous holdings of the UVA libraries are available to any citizen of the Commonwealth, exhibitions drawn from the rich collections of the Small Library and the Fralin should be go-to opportunities for community members, as should the wealth of o...n-grounds music, dance, and drama events. Community news outlets like the Daily Progress, Cville, and WVIR could be more energetically and creatively used to publicize UVA offerings and events. But UVA’s lack of short-term parking for access to these resources is a deterrent to members of the community coming to UVA with their family and friends for an event—or from coming here at all.  That is especially true for retirees (a rapidly growing proportion of the local population) and for people with children.   \n\nThe UVA community itself would benefit from (and probably appreciate) seeing “ordinary” UVA faculty and staff members featured on the front end of the UVA website in addition to the award winners, innovators, and news pieces that at present populate that high profile space.\nRead More"                                                                                                                                                                                                                                     
[2] "I think we need to walk the walk with supporting people, whether it be students, staff, or faculty. There is distrust within the University community and between UVA and the larger community. But there is an opportunity there--to do better, to be better. We can earn trust among ourselves, which will enable us to be better equipped to build trust in the Charlottesville and Albemarle County community. Let's be true to our word. Let's be clear and transparent in our communication. Let's be kind. Let's be steadfast in our integrity and willing to acknowledge mistakes and challenges and then all roll up our sleeves to do better because of them."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
[3] "Model it. We need to start modeling a sense of strength as a community among faculty and staff, so that we have something to stand on when we expect it of our students or the city. We should come to collaborative meetings ready to works together, rather than dig our heals in. We need leadership that is willing to listen and make the tough decisions about our priorities so that everyone isn't fighting over resources. We need to make sure that we are not abdicating our responsibilities. We are an extremely competitive university, even and maybe especially among units and divisions of the university. We need to move away from turf guarding and instead toward our strengths as teammates."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
[4] "Community is ultimately spiritual, because community requires sacrifice, including prestige, career, and power, to help those in need, to the glory of God; and the community has to be worthy of that sacrifice. Violence is the antithesis of sacrifice, and results in destruction and isolation. So, the university needs to separate itself as much as possible from organizations that rely on violence for any reason, such as survival, power, and prestige. Unfortunately most of UVa's grants come from su...ch organizations, and those grants undermine the community partly because they require resources that prevent the sacrifice needed to build community, and partly because they direct the results of research toward further violence. Until those who lead UVa place higher priority on community than those grants, UVa will not have a sense of community. UVa at least needs to avoid punishing those who witness againts violence, no matter the justification for the violence, whether it be for security, research, education, health care, poverty assistance, or anti-hate. As long as UVa is under heavy influence of organizations that rely on violence, UVa needs to cooperate with the leadership provided by the churches, houses of worship, and civic organizations that maintain a core principle of nonviolence, and hope that they have the spiritual guidance to inspire the sacrifice that creates community. Without compulsion, the leadership of UVa would do well to set individual examples of supporting such organizations.\nRead More"
[5] "Engage in the broader community with humility, as community members, as partners. Ensure that community agencies that support experiential learning opportunities for students get a useful product and get thanked. Engage in regional planning with the county and city in earnest, committing to improved communication and joint problem-solving. Consider steps to mitigate the impact UVA students have on the affordable housing landscape of greater Charlottesville."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
[6] "Community is ultimately spiritual, because it requires sacrifice, including prestige, career, and power; and the community has to be worthy of that sacrifice. So, UVa needs to cooperate with the leadership by churches, houses of worship, and civic organizations that maintain a core principle of nonviolence. Without compulsion, the leadership of UVa would do well to set individual examples of supporting such organizations. Nothing justifies violence, whether security, education, or health care."

But Topic 8 (not shown) appears a bit muddled, with comments centering on the computer science department, innovation, the A-School, Nazis, centralization, and a sports arena.

Topic modeling is always an iterative process – estimate, evaluate, and reiterate. These results represent only the first iteration. Ideally, we’d try some different values of K, as I suspect from the somewhat unclear categories above that K=20 isn’t sufficient for this corpus.

Finally, how does attention to these topics, as defined by our sense of the distinctive words, vary across comment category?

Barplot of topic prevalence by comment category.

I’ve sequenced the panels by each topic’s overall prevalence in the corpus. We can see:

  • Among comments submitted under “community”, Topic 6 appears most common
  • Among comments submitted under “discovery”, Topic 12 is most common
  • Among comments submitted under “service”, Topic 16 is most common

But overall, the comment categories contain a lot of overlap, suggesting that contributors aren’t operating with the same definitions of these categories (and that these categories, therefore, may not be especially meaningful).

Almost Done

In the final exploration in this series, we'll look at the relationship between many of the features we've extracted from this corpus to see what more we can learn.


Michele Claibourn
Director, Research Data Services
University of Virginia Library
January 31, 2019


For questions or clarifications regarding this article, contact statlab@virginia.edu.
