Introduction
In the penultimate post of this series, we’ll use some unsupervised learning approaches to uncover clusters and latent themes among the comments submitted to President Ryan’s Ours to Shape website.
The full code to recreate the analysis in the blog posts is available on GitHub.
Cluster Analysis
Cluster analysis is about discovering groups in data, such that the observations within each group share an internal cohesiveness: they look more like one another than like members of other clusters. This is exploratory data analysis; we don’t know the groups ahead of time.
Let’s start with some preliminary visualization. First, I stem the words to reduce the dimensionality a bit further, then weight the resulting document-feature matrix by tf-idf (we’re looking for the words that help distinguish documents), normalize it for document length, and finally extract the principal components. The figure plots each comment along the first two principal components.
To be clear, we haven’t implemented a cluster analysis yet; we’re just looking at the data a little differently. The first two principal components account for only a small amount of the variation in the overall text data, so we won’t be able to visually distinguish differences along many dimensions. But in these first two dimensions, we do see some clumps.
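For readers who want to see the weighting and length-normalization steps spelled out, here is a minimal standalone sketch in Python (the analysis in this post was done in R, and this is not that code). It assumes one common tf-idf variant, raw term count times log10(N / document frequency), followed by scaling each document to unit Euclidean length; the function name and the toy documents are made up for illustration, and the principal components step is not shown.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: raw term count * log10(N / document frequency),
    then each row is scaled to unit Euclidean length so that long
    comments do not dominate purely because of their length.
    """
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = {t: c * math.log10(n_docs / doc_freq[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in row.values()))
        if norm > 0:
            row = {t: w / norm for t, w in row.items()}
        rows.append(row)
    return rows

# Hypothetical, already-stemmed toy documents (not the real comments).
toy_docs = [
    ["ice", "arena", "skate", "uva"],
    ["ice", "hockey", "arena", "uva", "uva"],
    ["climat", "energi", "emiss", "uva"],
]
weights = tfidf_rows(toy_docs)
```

In the toy data, "uva" appears in every document, so its weight is zero everywhere, while a word found in only one document gets the largest weight in that document. That is the sense in which tf-idf favors words that help distinguish documents.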
Let’s look at that clump to the far right, above 0.75 on PC1:
1 What if the University of Virginia built a state of the art Ice Sports Arena? We would have a place of true physical education for all ages and abilities. An ice arena is a venue for participation. 500 athletes a day (with 1/3rd coming from the non-University community) would come to skate with family or to make new friends while discovering the joy of movement whether it be hockey, figure skating or just ice skating. The UVA ICE PARK would be a bridge between Town and Gown and a way for the school to be of service to the community while helping all people discover the joy of movement and exercise.
Here’s our ice arena lobby! There are 10 comments here, mostly versions of the example comment above. What about the other comments spread along the first dimension, between 0.15 and 0.75? Two of the 31 comments in this range are printed below:
1 I've recently learned that there might be an opportunity for the University to add an arena for ice sports. It has always seemed strange that UVA did not already have a facility for its varsity hockey teams as well as all of the other purposes an arena would serve for the UVA stduents, faculty, staff, and broader community. Thanks for supporting that. I think it would be worthwhile.
2 In the past year, Charlottesville has seen the loss of its downtown ice arena which has left a significant hole in our community. Over the past two years, I have witnessed our students come together to watch our men's and women's hockey teams and the figure-skating program, as well as bond by participating in other activities such as public skating, floor hockey, and broomball. It was truly a focal point, not just for student life but also for bringing UVA and C’ville together. Building a new ice arena on Grounds would finally provide students, faculty, alumni and local residents with a permanent fix so we can all enjoy the true spirit of a shared community.
Ah, more ice arena advocacy, but without the duplication. Let’s see what’s distinguishing comments high on the second principal component, say, above 0.15. Here are a few of the 48 comments occupying this space on the graph:
1 Global warming is becoming a bigger problem with every new day that passes, and we are now reaching rates of global warming we have never experienced before. Temperature changes that were previously predicted to take thousands of years are now occurring within the span of a few decades.
2 I am writing a letter this afternoon to discuss the concrete steps that the University of Virginia must take in order to limit global warming and reduce fossil fuel emissions in Charlottesville. In the last century, Charlottesville has had less than five days of extreme, deadly
3 I wanted to reach out to you as a student of the University of Virginia concerned with our university’s attempts to mitigate climate change. The topic has become thoroughly debated in recent politics, and being currently enrolled in Professor Deborah Lawrence’s Introduction to Climate
Actually, I’m only showing the first part of these comments – they are all really, really long! But we seem to have found a climate and sustainability cluster!
Let’s formalize this in a cluster analysis!
Hierarchical Clustering
There are a lot of choices in clustering: multiple clustering algorithms, distance metrics and weights, optimization methods, and tuning parameters within given algorithms. And, of course, there is the choice of k, the number of clusters you want the model to define (choosing k is a tricky thing and I’m going to gloss over it here). I’m going to stick with some common choices here, starting with hierarchical, or connectivity-based, methods.
In particular, I use the tf-idf weighted document-feature matrix (dfm), calculate the Euclidean distance between comments, and create links through agglomeration (beginning with n single-comment clusters and successively fusing the two that are closest), using Ward’s method to determine which clusters are closest. I chose k=20 clusters to start.
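To make the agglomeration step concrete, here is a small standalone Python sketch of Ward-style hierarchical clustering (again, not the R code behind this post). It is a naive implementation for illustration: at each step it fuses the pair of clusters whose merger adds the least to the within-cluster sum of squares. The function name and the toy points are hypothetical.

```python
from itertools import combinations

def ward_clusters(points, k):
    """Agglomerative clustering with Ward's criterion; returns a label per point."""
    # Each cluster is (member indices, centroid). Start with one cluster per point.
    clusters = [([i], list(p)) for i, p in enumerate(points)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if clusters a and b are fused.
        (ma, ca), (mb, cb) = a, b
        sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(ma) * len(mb) / (len(ma) + len(mb)) * sq_dist

    while len(clusters) > k:
        # Find the cheapest pair to fuse.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        (ma, ca), (mb, cb) = clusters[i], clusters[j]
        na, nb = len(ma), len(mb)
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append((ma + mb, centroid))

    labels = [None] * len(points)
    for label, (members, _) in enumerate(clusters):
        for m in members:
            labels[m] = label
    return labels

# Toy two-dimensional points, not the comment data.
toy_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
labels = ward_clusters(toy_points, k=3)
```

Note that the isolated point at (9, 0) ends up as a cluster of one, a small-scale version of the singleton clusters that show up in the real results.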
comments_hc_20
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
757  18   8   1   1   2   1  41   1   1   1   1   1   1   1   1   1   1   1   2
This produced 20 clusters, because I asked it to, but most of the clusters are singles, and the bulk of observations are thrown into that first massive cluster. So, not very satisfying. Still, let’s see what that handful of more distinct groups (groups 2 and 8) contain.
To do that, I pull out the 10 highest weighted words (features are weighted by term frequency-inverse document frequency) among the comments within each cluster.
$`2`
       energi         reduc climate_chang         emiss          warm
     78.19342      76.57719      74.56030      73.20347      71.92488
        chang          wast          X1.5          food           can
     53.86263      53.45833      49.39675      45.57220      44.22610
$`8`
      ice     skate    hockey     arena      rink     sport  movement
147.27017  92.31054  91.49594  72.40954  57.62955  49.36716  37.67839
      joy      team     figur
 36.14503  34.41481  33.64702
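The per-cluster word lists above come from ranking words by their tf-idf weight within each cluster. A minimal standalone Python sketch of that idea follows (not the R code used for the post). It assumes the ranking is by each word's weight summed over the comments in a cluster; the function name, toy rows, and toy labels are hypothetical.

```python
from collections import defaultdict

def top_terms_by_cluster(rows, labels, n=10):
    """Sum each term's weight within a cluster and keep the n largest totals."""
    totals = defaultdict(lambda: defaultdict(float))
    for row, label in zip(rows, labels):
        for term, weight in row.items():
            totals[label][term] += weight
    return {
        label: sorted(terms.items(), key=lambda kv: kv[1], reverse=True)[:n]
        for label, terms in totals.items()
    }

# Hypothetical tf-idf rows and cluster labels, for illustration only.
toy_rows = [
    {"ice": 2.0, "arena": 1.0},
    {"ice": 1.5, "hockey": 1.0},
    {"energi": 1.2, "emiss": 0.8},
]
toy_labels = [8, 8, 2]
top = top_terms_by_cluster(toy_rows, toy_labels, n=2)
```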
Cluster 2 seems to contain at least some of the climate change and sustainability comments, though from the exploratory analysis above, I know there are more than 18 of these.
Cluster 8 is the by-now familiar ice arena group. The 41 comments in this cluster match the 41 comments I saw in the exploratory graph above (the 10 near-duplicates plus the 31 spread along the first dimension), so that’s nice, at least.
But this hasn’t pulled out the kind of internal cohesion that might help me summarize the comments or generate insight.
Let’s try one more clustering algorithm.
K-means Clustering
Centroid-based clustering offers another approach, where observations are divvied up into groups by minimizing some numerical criterion – k-means is the most common partitioning approach.
K-means starts with k centroids (the points that will be the centers of the clusters), assigns each data point to the nearest centroid, updates each centroid to be the average of the data points assigned to it, and re-assigns each data point to the nearest centroid, repeating until there is minimal change. Because the initial randomly chosen centroids can affect the outcome, we generally try a lot of starting points and let the algorithm select the solution that minimizes the within-cluster sum of squares (and so maximizes within-cluster homogeneity).
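The loop just described is short enough to sketch directly. This is a minimal standalone Python version (not the R code used for the post) with multiple random starts, keeping the run with the smallest within-cluster sum of squares. The function names, the number of starts, the seed, and the toy points are all assumptions for illustration.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from a random start; returns (assignment, within-cluster SS)."""
    centroids = rng.sample(points, k)  # pick k distinct positions as starting centroids
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the solution has converged
        assignment = new_assignment
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    wss = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, wss

def kmeans(points, k, n_starts=25, seed=1):
    """Run k-means from several random starts and keep the lowest within-cluster SS."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(n_starts)), key=lambda r: r[1])

# Toy two-dimensional points, not the comment data.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
km_labels, km_wss = kmeans(toy_points, k=2)
```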
I’m going to use k=20 again.
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
  1   1   1 768   1   1   1   1  30   7   9   1   1   3   1   9   1   1   1   3
Again, we have lots of groups of one, and only a handful of potentially distinctive clusters. Let’s look at the key distinctive words for clusters 9, 10, 11, and 16.
$`9`
      ice     skate    hockey     arena      rink  movement       joy
130.47621  85.20973  82.07724  62.27221  47.75020  37.67839  36.14503
    sport     figur    discov
 33.35619  32.18411  29.25828
$`10`
   energi      dine       use       can transport       car      hall
 32.04649  30.11296  25.28746  25.19208  24.69838  23.54779  21.90415
    reduc      food       bus
 20.99697  20.83300  20.72311
$`11`
climate_chang         reduc         emiss        energi          warm
     54.86513      43.22906      40.33660      32.04649      31.27169
            c        climat         chang          X1.5        carbon
     28.25879      25.64948      23.08398      19.75870      18.32846
$`16`
  energi     heat     warm     wast    emiss  celsius     food    degre
42.30136 40.23208 39.08961 35.16995 34.36081 33.23036 32.55157 31.24357
   chang    reduc
30.77864 29.64278
Cluster 9 is our hockey/ice skating enthusiasts; clusters 10, 11, and 16 are all variations on energy use and climate change – so potentially some subgroups within a larger sustainability theme. But that’s about it.
Overall, this is kind of disappointing, what with that undifferentiated mass of comments. It’s possible we haven’t tried a big enough k, but given all of the singles we’re already getting, that’s doubtful. It’s also possible that there really aren’t many common themes, nothing that holds a subset together. But more likely is that many of these comments are long and involved and possibly not single-themed. So let’s try something different.
Topic Models
Another approach to uncover the main themes (or topics) in an unstructured corpus is topic modeling.
Topic models require no prior information, no training set, no special annotation of the texts beforehand; only, as in cluster analysis, a decision about the number of topics, K. Unlike in cluster analysis, documents are not required to belong to a single topic or cluster (i.e., single membership), but may simultaneously belong to several topics.
I’m going to use an R package called stm – for structural topic models – which implements a correlated topic model and allows for the inclusion of covariates as predictors of topic prevalence. I reduce the dimensionality of my document feature matrix a bit and convert this into the form the stm package expects before estimating a topic model with K=20 topics (it takes about a minute on my machine).
Let’s see the top words for each estimated topic.
Topic 1 Top Words:
Highest Prob: univers, uva, polici, can, work, communiti, parent
FREX: polici, leav, parent, adopt, paid, execut, governor
Topic 2 Top Words:
Highest Prob: uva, univers, presid, divers, communiti, group, ask
FREX: presid, slaveri, august, equiti, stop, respect, ask
Topic 3 Top Words:
Highest Prob: center, short, uva, marc, miller, invit, hire
FREX: marc, miller, short, invit, center, speaker, racism
Topic 4 Top Words:
Highest Prob: uva, follow, futur, learn, way, past, student
FREX: follow, uniqu, futur, past, monticello, relationship, method
Topic 5 Top Words:
Highest Prob: energi, reduc, uva, can, univers, emiss, wast
FREX: initi, emiss, heat, wast, energi, reduc, planet
Topic 6 Top Words:
Highest Prob: communiti, uva, staff, univers, charlottesvill, student, peopl
FREX: sens, staff, cvill, communiti, region, welcom, greater
Topic 7 Top Words:
Highest Prob: year, uva, student, school, communiti, first, charlottesvill
FREX: music, dorm, concert, youth, kid, year, free
Topic 8 Top Words:
Highest Prob: student, faculti, learn, school, can, uva, studi
FREX: innov, studi, explor, abroad, intern, busi, perspect
Topic 9 Top Words:
Highest Prob: divers, univers, encourag, faculti, polit, educ, thought
FREX: thought, opinion, divers, polit, correct, belief, liber
Topic 10 Top Words:
Highest Prob: can, uva, student, food, univers, dine, make
FREX: transport, bus, food, dine, car, electr, meat
Topic 11 Top Words:
Highest Prob: ice, uva, hockey, skate, communiti, arena, sport
FREX: hockey, skate, arena, rink, ice, sport, figur
Topic 12 Top Words:
Highest Prob: research, faculti, uva, teach, book, univers, time
FREX: research, librari, book, alderman, collabor, medic, renov
Topic 13 Top Words:
Highest Prob: univers, uva, system, fratern, greek, student, scienc
FREX: fratern, greek, footbal, neighborhood, system, sexual, behavior
Topic 14 Top Words:
Highest Prob: experi, student, uva, univers, residenti, knowledg, way
FREX: residenti, experi, art, space, museum, classroom, knowledg
Topic 15 Top Words:
Highest Prob: climate_chang, chang, make, communiti, need, world, uva
FREX: climate_chang, climat, negat, warm, technolog, chang, action
Topic 16 Top Words:
Highest Prob: servic, student, communiti, need, work, help, provid
FREX: servic, special, children, commonwealth, volunt, requir, hour
Topic 17 Top Words:
Highest Prob: student, hous, connect, uva, alumni, faculti, opportun
FREX: connect, hous, outsid, career, alumni, opportun, write
Topic 18 Top Words:
Highest Prob: sustain, univers, ground, like, improv, can, communiti
FREX: sustain, improv, aggress, neutral, environment, green, although
Topic 19 Top Words:
Highest Prob: student, staff, live, uva, better, univers, make
FREX: wage, lot, staff, pay, month, citi, better
Topic 20 Top Words:
Highest Prob: must, mani, valu, univers, focus, one, feel
FREX: valu, must, truth, focus, honor, mani, tradit
The “Highest Prob” words are what folks are often accustomed to seeing in topic model output: the words with the highest probability of being in that topic (based on frequency). The problem is, many of the high probability words appear across multiple topics because they appear frequently throughout the corpus. I like the “FREX” descriptors, which balance the frequency with which words appear in a topic against the exclusivity with which they appear in that topic, as they better reflect distinctive but important words.
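To show how a frequency-exclusivity balance can reorder words, here is a minimal standalone Python sketch of a FREX-style score (not the stm implementation). It assumes one common formulation: the weighted harmonic mean of a word's within-topic rank on frequency and its within-topic rank on exclusivity, where exclusivity is the word's probability in this topic divided by its summed probability across all topics. The 0.5 weight, the function names, and the toy probabilities are assumptions for illustration.

```python
def ecdf_scores(values):
    """Empirical CDF: the fraction of values less than or equal to each value."""
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]

def frex(beta, weight=0.5):
    """beta[k][v] is the probability of word v in topic k.

    Returns FREX-style scores with the same shape as beta. `weight` is the
    share of emphasis placed on exclusivity (0.5 is an assumed default).
    """
    n_words = len(beta[0])
    col_totals = [sum(topic[v] for topic in beta) for v in range(n_words)]
    scores = []
    for topic in beta:
        exclusivity = [topic[v] / col_totals[v] for v in range(n_words)]
        freq_rank = ecdf_scores(topic)
        excl_rank = ecdf_scores(exclusivity)
        scores.append([
            1.0 / (weight / e + (1.0 - weight) / f)
            for e, f in zip(excl_rank, freq_rank)
        ])
    return scores

# Hypothetical vocabulary and topic-word probabilities for two topics.
vocab = ["uva", "ice", "hockey", "energi", "emiss"]
toy_beta = [
    [0.40, 0.30, 0.20, 0.05, 0.05],  # an "ice arena" topic
    [0.40, 0.05, 0.05, 0.30, 0.20],  # an "energy" topic
]
frex_scores = frex(toy_beta)
top_by_prob = [vocab[max(range(len(vocab)), key=lambda v: t[v])] for t in toy_beta]
top_by_frex = [vocab[max(range(len(vocab)), key=lambda v: s[v])] for s in frex_scores]
```

In this toy example, "uva" is the most probable word in both topics, but the FREX-style score puts "ice" on top for the first topic and "energi" for the second, because "uva" is not exclusive to either.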
There’s a lot to unpack here. Topic 2 appears to reflect comments addressing racial diversity, the events of August 11-12, the President’s Commission on Slavery and the University, and related ideas. Topic 3 pulls out references to the Miller Center’s hiring of Marc Short. Topic 11 captures the comments of the ice arena lobby. Topic 12 is picking up mentions of Alderman Library and renovation plans. And Topics 5, 15, and 18 address dimensions of sustainability and environmental change.
Let’s visualize the overall prevalence of each of these topics in our collection of comments:
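One simple way to summarize overall prevalence is to average each topic's share across documents. Here is a minimal standalone Python sketch (not the R code used for the post). It assumes a fitted model gives a document-by-topic matrix of proportions, called theta here, in which each row sums to one; the function name and toy values are made up.

```python
def topic_prevalence(theta):
    """theta[d][k] is the share of document d devoted to topic k.

    Returns the mean share of each topic across all documents.
    """
    n_docs = len(theta)
    n_topics = len(theta[0])
    return [sum(doc[k] for doc in theta) / n_docs for k in range(n_topics)]

# Hypothetical proportions for three documents and three topics.
toy_theta = [
    [0.70, 0.20, 0.10],
    [0.10, 0.60, 0.30],
    [0.25, 0.50, 0.25],
]
prevalence = topic_prevalence(toy_theta)
# Topic indices (0-based) ordered from most to least prevalent.
ranked = sorted(range(len(prevalence)), key=lambda k: prevalence[k], reverse=True)
```

Each toy document spreads its weight over several topics, which is the mixed membership that distinguishes topic models from the single-membership clusters above.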
Topics 17, 6, and 8 are the most frequent. On the surface, these topics aren’t quite as clear (to me) as some of the ones I referenced earlier. Here are the top words for these top topics.
Topic 17 Top Words:
Highest Prob: student, hous, connect, uva, alumni, faculti, opportun
FREX: connect, hous, outsid, career, alumni, opportun, write
Topic 6 Top Words:
Highest Prob: communiti, uva, staff, univers, charlottesvill, student, peopl
FREX: sens, staff, cvill, communiti, region, welcom, greater
Topic 8 Top Words:
Highest Prob: student, faculti, learn, school, can, uva, studi
FREX: innov, studi, explor, abroad, intern, busi, perspect
Looking at comments that score high on Topic 17 confirms that this one is a bit of a hodgepodge, combining attention to affordable housing and to research support.
[1] "UVa needs to take a broader view of community to include the City of Charlottesville. UVa student demand is an obstacle to affordable housing in the city. Today students are paying $1000 for one room in direct competition with underprivileged city residents. A significant investment in on campus student housing away from the city and requiring more students to live on campus would reduce housing demand in the area. This would allow the city to promote affordable housing growth."
[2] "There is an affordable housing crisis in Charlottesville for limited income residents. The university has a desire to offer more on-grounds housing options for upperclassmen students. \n\nThis problem presents an opportunity for a public-private partnership of mixed UVA student & Charlottesville community member housing. Alumnae Micaela Connery, CLAS '09 is researching current efforts similar to this. This demonstrates our commitment to the city and presents learning opportunities for students."
[3] "We need to create a centralize, focused effort on research. In addition, we need to lower the overhead rate and apply what is collected to supporting the infrastructure."
[4] "It is critical to foster an environment where diverse ideas, views, and perspectives are fostered and explored. Too often on campuses across the U.S. these days, certain points of view are attacked or shut down entirely to the great detriment of the academic community there. The University of Chicago has set a great example recently in this area and I hope that UVA follows suit and continues to be a place where the ideals of free speech and debate are celebrated."
[5] "Hire faculty who are tops in their field but who also have a commitment to engaging undergraduate in their research, not just their classrooms. Provide more opportunities for students, even undergrads, to work with our superstars --- provide more students with experiences like the Harrison Undergraduate Research Awards."
The comments that exemplify Topic 6 suggest a little more coherence, primarily around town-gown relations.
[1] "“The community” comes into UVA every day--as faculty, staff, students, and patients. The community staffs and maintains the University. UVA is the product of the local community. The community IS here. But more of the community should be here. The enormous holdings of the UVA libraries are available to any citizen of the Commonwealth, exhibitions drawn from the rich collections of the Small Library and the Fralin should be go-to opportunities for community members, as should the wealth of o...n-grounds music, dance, and drama events. Community news outlets like the Daily Progress, Cville, and WVIR could be more energetically and creatively used to publicize UVA offerings and events. But UVA’s lack of short-term parking for access to these resources is a deterrent to members of the community coming to UVA with their family and friends for an event—or from coming here at all. That is especially true for retirees (a rapidly growing proportion of the local population) and for people with children. \n\nThe UVA community itself would benefit from (and probably appreciate) seeing “ordinary” UVA faculty and staff members featured on the front end of the UVA website in addition to the award winners, innovators, and news pieces that at present populate that high profile space.\nRead More"
[2] "I think we need to walk the walk with supporting people, whether it be students, staff, or faculty. There is distrust within the University community and between UVA and the larger community. But there is an opportunity there--to do better, to be better. We can earn trust among ourselves, which will enable us to be better equipped to build trust in the Charlottesville and Albemarle County community. Let's be true to our word. Let's be clear and transparent in our communication. Let's be kind. Let's be steadfast in our integrity and willing to acknowledge mistakes and challenges and then all roll up our sleeves to do better because of them."
[3] "Model it. We need to start modeling a sense of strength as a community among faculty and staff, so that we have something to stand on when we expect it of our students or the city. We should come to collaborative meetings ready to works together, rather than dig our heals in. We need leadership that is willing to listen and make the tough decisions about our priorities so that everyone isn't fighting over resources. We need to make sure that we are not abdicating our responsibilities. We are an extremely competitive university, even and maybe especially among units and divisions of the university. We need to move away from turf guarding and instead toward our strengths as teammates."
[4] "Community is ultimately spiritual, because community requires sacrifice, including prestige, career, and power, to help those in need, to the glory of God; and the community has to be worthy of that sacrifice. Violence is the antithesis of sacrifice, and results in destruction and isolation. So, the university needs to separate itself as much as possible from organizations that rely on violence for any reason, such as survival, power, and prestige. Unfortunately most of UVa's grants come from su...ch organizations, and those grants undermine the community partly because they require resources that prevent the sacrifice needed to build community, and partly because they direct the results of research toward further violence. Until those who lead UVa place higher priority on community than those grants, UVa will not have a sense of community. UVa at least needs to avoid punishing those who witness againts violence, no matter the justification for the violence, whether it be for security, research, education, health care, poverty assistance, or anti-hate. As long as UVa is under heavy influence of organizations that rely on violence, UVa needs to cooperate with the leadership provided by the churches, houses of worship, and civic organizations that maintain a core principle of nonviolence, and hope that they have the spiritual guidance to inspire the sacrifice that creates community. Without compulsion, the leadership of UVa would do well to set individual examples of supporting such organizations.\nRead More"
[5] "Engage in the broader community with humility, as community members, as partners. Ensure that community agencies that support experiential learning opportunities for students get a useful product and get thanked. Engage in regional planning with the county and city in earnest, committing to improved communication and joint problem-solving. Consider steps to mitigate the impact UVA students have on the affordable housing landscape of greater Charlottesville."
[6] "Community is ultimately spiritual, because it requires sacrifice, including prestige, career, and power; and the community has to be worthy of that sacrifice. So, UVa needs to cooperate with the leadership by churches, houses of worship, and civic organizations that maintain a core principle of nonviolence. Without compulsion, the leadership of UVa would do well to set individual examples of supporting such organizations. Nothing justifies violence, whether security, education, or health care."
But Topic 8 (not shown) appears a bit muddled, with comments centering on the computer science department, innovation, the A-School, Nazis, centralization, and a sports arena.
Topic modeling is always an iterative process – estimate, evaluate, and reiterate. These results represent only the first iteration. Ideally, we’d try some different values of K, as I suspect from the somewhat unclear categories above that K=20 isn’t sufficient for this corpus.
Finally, how does attention to these topics, as defined by our sense of the distinctive words, vary across comment category?
I’ve sequenced the panels by the topic’s overall prevalence in the corpus. We can see:
- Among comments submitted under “community”, Topic 6 appears most common
- Among comments submitted under “discovery”, Topic 12 is most common
- Among comments submitted under “service”, Topic 16 is most common
But overall, the comment categories contain a lot of overlap, suggesting that contributors aren’t operating with the same definitions of these categories (and that these categories, therefore, may not be especially meaningful).
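A rough way to make this kind of comparison is to average topic proportions within each comment category. The standalone Python sketch below does that with simple group means, as a stand-in for a model-based estimate and not the R code used for the post. The function name, toy proportions, and toy category labels are hypothetical.

```python
from collections import defaultdict

def prevalence_by_category(theta, categories):
    """Average each topic's share over the documents in each category."""
    grouped = defaultdict(list)
    for doc, category in zip(theta, categories):
        grouped[category].append(doc)
    return {
        category: [sum(col) / len(docs) for col in zip(*docs)]
        for category, docs in grouped.items()
    }

# Hypothetical proportions (two topics) and category labels for three comments.
toy_theta = [[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]]
toy_categories = ["community", "community", "service"]
by_category = prevalence_by_category(toy_theta, toy_categories)
# Most common topic (0-based index) within each category.
top_topic = {c: max(range(len(p)), key=lambda k: p[k]) for c, p in by_category.items()}
```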
Almost Done
In the final exploration in this series, we'll look at the relationship between many of the features we've extracted from this corpus to see what more we can learn.
Michele Claibourn
Director, Research Data Services
University of Virginia Library
January 31, 2019
For questions or clarifications regarding this article, contact statlab@virginia.edu.