As part of a series of workshops on quantitative analysis of text this fall, I started examining the comments submitted to President Ryan’s Ours to Shape website. The site invites people to share their ideas and insights for UVA going forward, particularly in the domains of service, discovery, and community. The website was only one venue for providing suggestions and voicing possibilities – President Ryan hosted community discussions as well – but the website afforded an opportunity for individuals to chime in multiple times and at their convenience, so in theory it should represent an especially inclusive collection.
After spending several weeks developing this example for the workshops, I thought I’d share some interesting bits in a series of blog posts.
I’m going to focus on results rather than code, but if you’re interested, the full code – to scrape the content (using RSelenium and rvest), read in and clean up the comments (using quanteda), and analyze the text (mostly with quanteda) in multiple ways – is available on GitHub. If you’re new to quanteda and would like to know more, check out our new StatLab article (by Leah Malkovich) on getting started with quanteda!
Comments, Categories, and Connections
First, let’s get to know the data!
community discovery   service
      481       212       155
We have 848 comments (as of December 7, 2018), with over half of the comments added under the “community” tag. I was actually pretty excited about that at first, all that attention to community, but we’ll dig a little further into what that means in a later post.
When contributing ideas on the website, individuals were asked to identify their connection to UVA – let’s see who’s adding their thoughts.
  alumni community staff student supporter faculty parent number
1   0.47      0.23  0.22     0.2      0.15    0.14   0.13   1.53
Alumni are the most frequent contributors, with 47% indicating that status, followed by members of the community, university staff, students, individuals identifying as university supporters, faculty, and parents of students. But individuals could identify with multiple (pre-defined) roles, and on average, they identified 1.5 connections.
The bulk of comments (66%) come from individuals who identified only a single connection. A small number intersect with UVA in multiple ways, but so few that I truncated the upper end of the number of roles to 4 or more in a measure of connection strength/number of connections.
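To make the truncation concrete, here is a minimal sketch in Python (the actual analysis was done in R; this is not the code from the GitHub repository). The role names come from the table above, while the `commenters` records and the function names are made up for illustration.

```python
# The seven pre-defined connections listed in the table above.
ROLES = ["alumni", "community", "staff", "student", "supporter", "faculty", "parent"]


def count_roles(connections):
    """Number of distinct recognized connections a commenter selected."""
    return len(set(connections) & set(ROLES))


def cap_roles(n, cap=4):
    """Collapse counts at or above `cap` into a single top category ("4 or more")."""
    return min(n, cap)


# Hypothetical commenters, invented purely for illustration.
commenters = [
    ["alumni"],
    ["alumni", "parent"],
    ["faculty", "alumni", "community", "supporter", "parent"],
]

counts = [count_roles(c) for c in commenters]
capped = [cap_roles(n) for n in counts]
print(counts)
print(capped)
```

The capped value plays the part of the truncated "number of roles" measure: anyone with four or more connections lands in the same top category.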
It’s also worth thinking about each individual’s primary connection, so I also construct a mutually exclusive categorization of each commenter’s primary role. That is, if someone is both a faculty member at UVA and an alumnus of UVA, I’ll argue that the current faculty role is the primary role, the one that has the strongest influence on one’s current thinking. Similarly, if one is both a student and a supporter, one’s experience as a student is likely to have a stronger influence. And if an individual identifies as both UVA staff and a community member, her experience as an employee is likely to have a primary influence. One can disagree with any of these rankings, of course, so it’s worth being clear on the primacy I’ve given to each role.
- Faculty, staff, or student will take precedence over alumni, parent, community member, and supporter
- Faculty will take precedence over staff and student (on the presumption that individuals claiming this role are faculty)
- Student will take precedence over staff (on the presumption that these are students who are also employed by UVA)
- Alumni will take precedence over community, supporter, and parent (on the presumption that one’s experience within UVA will be the more powerful)
- Parent will take precedence over community and supporter
- Community will take precedence over supporter
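The rules above amount to a single ranked list, which can be sketched in a few lines of Python (again, not the R code used for the analysis). The ordering below is my reading of the rules, including the assumption that faculty ranks ahead of student and staff; `PRECEDENCE` and `primary_role` are illustrative names.

```python
# Ranked from strongest to weakest claim on "primary role", as read from the
# rules above (faculty first among faculty/student/staff is an assumption).
PRECEDENCE = ["faculty", "student", "staff", "alumni", "parent", "community", "supporter"]


def primary_role(connections):
    """Return the highest-precedence role among a commenter's connections, or None."""
    selected = set(connections)
    for role in PRECEDENCE:
        if role in selected:
            return role
    return None


# The three examples discussed in the text.
print(primary_role(["alumni", "faculty"]))
print(primary_role(["student", "supporter"]))
print(primary_role(["staff", "community"]))
```

Keeping the ranking in one list makes it easy to reorder if you disagree with any of the choices.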
Given this operationalization of “primary role”, the distribution of contributors looks like…
   alumni community   faculty    parent     staff   student supporter
      312        44       110        39       172       163         8
The comments are still dominated by individuals whose primary connection is as alumni (37%), followed by staff and students (about 20% each), then faculty (13%), community members and parents (5% each), and institutional supporters (1%).
Do those with more connections contribute more to comments about community, discovery, or service?
There is no apparent relationship between the number of connections a contributor identifies and the category of their comments (a chi-square test confirms the absence of any statistically discernible difference in category distribution by number of connections). Okay, what about the nature of the primary connection?
Here there are differences. In particular, faculty are more likely than others to comment in the category of “discovery” and are the least likely to add ideas to the “community” field. Supporters, too, add to the “discovery” category at rates higher than other contributors. The tradeoffs are primarily between community and discovery; contributions to the “service” category are relatively low across all primary connections.
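For readers who want to see what the chi-square test mentioned above is doing, here is a small hand-rolled sketch in Python. The counts are invented for illustration and are not the Ours to Shape data; in practice one would use a library routine such as R's `chisq.test` or SciPy's `chi2_contingency`, which also return a p-value.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Made-up counts: rows are two groups of commenters,
# columns are community / discovery / service.
made_up_counts = [
    [30, 10, 10],
    [20, 20, 10],
]
stat, dof = chi_square(made_up_counts)
print(f"chi-square = {stat:.2f} on {dof} degrees of freedom")
```

With 2 degrees of freedom the conventional 5% critical value is about 5.99, so a statistic below that would not count as a statistically discernible difference.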
Length and Readability
So far, we’ve been looking at the document metadata. Next, let’s create a corpus from the text of the comments, extract the length of each comment (number of words), and compare across comment categories and commenter type.
Corpus consisting of 848 documents and 14 docvars.

# A tibble: 3 x 3
  type      `mean(words)` `sd(words)`
1 community         132.        181.
2 discovery         106.        128.
3 service            82.8        51.1
Not only are there more “community” comments, but these also tend to be a bit longer, though the length of these comments is also highly variable (check out that standard deviation – bigger than the mean!). Indeed, the figure below shows that comments have similar lengths across categories, but the community category attracts a small number of extremely long comments.
Length by number of connections?
# A tibble: 4 x 3
  numroles4 `mean(words)` `sd(words)`
1         1         135.        185.
2         2          82.9        35.9
3         3          77.4        30.2
4         4          83.7        28.4
Those identifying only one connection do leave lengthier comments, on average. This surprised me a bit as my prior was that those who intersect with UVA in multiple ways would have more to say, or would speak to multiple dimensions. What about comment length by a contributor’s primary connection?
# A tibble: 7 x 3
  primary   `mean(words)` `sd(words)`
1 alumni             84.7        64.6
2 community         128.        139.
3 faculty           118.        193.
4 parent             87.2        56.8
5 staff             111.        145.
6 student           187         234.
7 supporter          99.9        42.7
Some suggestive differences arise by primary connection – in particular, the cluster of relatively wordy comments by students, compared to other groups.
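The length comparisons in this section boil down to counting words per comment and summarizing by group. Here is a rough Python sketch of that idea using only the standard library. The toy comments are invented, and splitting on whitespace is a simplification that won't match quanteda's tokenizer exactly.

```python
from collections import defaultdict
from statistics import mean, stdev


def word_count(text):
    """Crude word count: split on whitespace."""
    return len(text.split())


def length_by_group(records):
    """Map each group to (mean, standard deviation) of its comment lengths."""
    lengths = defaultdict(list)
    for group, text in records:
        lengths[group].append(word_count(text))
    return {
        group: (mean(values), stdev(values) if len(values) > 1 else 0.0)
        for group, values in lengths.items()
    }


# Invented comments, for illustration only.
records = [
    ("community", "Build more shared spaces on Grounds"),
    ("community", "Listen"),
    ("service", "Expand volunteer programs for students"),
    ("service", "Partner with local schools"),
]

summary = length_by_group(records)
for group, (avg, sd) in summary.items():
    print(group, round(avg, 1), round(sd, 1))
```

Even in this toy example, one very short and one longer comment in the same group is enough to push the standard deviation above the mean.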
Comment length is one way of considering complexity. Another is the readability of the text. Readability here is a measure of how easy a comment is to read based on vocabulary and sentence complexity. Let’s extract readability and compare it across groups.
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -2.23   10.09   12.17   12.46   14.46   29.45
On average, comments to Ours to Shape are written at just over a 12th-grade reading level. That’s a reasonably high level of complexity – newspapers hover around the 11th-grade reading level and common advice is to write material intended for the mass public at the 9th-grade level. Does readability/grade-level of comments differ across comment categories?
No, not really. There’s no real difference in the readability/complexity of the comments by category (community, service, discovery). What about by the number of connections a contributor has with UVA (numroles4)?
Again, no apparent differences. And here’s the average readability by primary connection:
# A tibble: 7 x 4
  primary   `mean(read)` `sd(read)` `n()`
1 alumni            12.1       3.57   312
2 community         11.3       4.37    44
3 faculty           13.2       3.87   110
4 parent            11.1       4.36    39
5 staff             13.2       3.76   172
6 student           12.4       3.64   163
7 supporter         14.7       5.77     8
Nothing, really – though I think the violin plots are pretty! There are no substantive differences by comment category, number of connections, or primary connection. We’re all writing, on average, equally complex feedback!
For fun, here’s the least “easy-to-read” comment based on this measure:
     type     read
1 service 29.44754

text
1 Prepare students for the real world—a world without “safe spaces” and “trigger warnings”, a world where opinions must be based on facts instead of feelings, and where if they don't become mature, responsible adults (instead of coddled children) they will be utterly subjugated by China (which produces more engineers in a year than we produce college graduates and doesn't care about your feelings) by the time they're 50
Huh… okay… not as much fun as I’d hoped…
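To give a sense of how a grade-level score like this is built, here is a rough Python sketch of the Flesch-Kincaid grade level. I'm assuming a Flesch-Kincaid-style measure because the scores are reported as grade levels; treat that choice as an assumption, along with the syllable counter, which is a crude vowel-group heuristic. Real implementations, including quanteda's, will give somewhat different numbers.

```python
import re


def count_syllables(word):
    """Very rough syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def fk_grade(text):
    """Flesch-Kincaid grade level using crude sentence, word, and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Invented examples: short sentences with short words versus one long sentence.
short_text = "We like it. It is good."
long_text = (
    "Universities should prioritize interdisciplinary collaboration because "
    "complicated institutional challenges require comprehensive perspectives"
)

print(round(fk_grade(short_text), 2))
print(round(fk_grade(long_text), 2))
```

Two things follow from the formula: very short, simple text can score below zero, and a single long sentence full of long words scores very high, which is likely part of why a one-sentence comment can top the list.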
Key Words in Context
Moving on, let’s look at a few key words and see how they’re used. This is a really useful way to get to know a corpus. Here I’m just going straight to my own interests, looking for occurrences of words around equity/equality and around library. First up, equity! Here are all 27 occurrences of equity/equitable, equal(s)/equality across 23 Ours to Shape comments.
[24, 225] , which almost | equals | to the GDP
[32, 69] everyone is given | equal | opportunities. This
[34, 493] fitness to be | equal | to the development
[55, 66] staff members as | equals | . More should
[57, 125] action to support | equity | .
[88, 128] where I had | equal | male and female
[292, 33] Pathways to Health | Equity | "" report
[295, 17] resources to promote | equity | . The University
[295, 26] Action for Racial | Equity | "" Call
[295, 62] point for promoting | equity | .
[327, 86] after separate but | equal | status in the
[352, 51] do to incentivize | equity | , sustainability,
[357, 56] attachment to an | equitable | workplace as well
[359, 42] basic values of | equality | and justice.
[388, 42] Action for Racial | Equity | ) and especially
[409, 47] a leader in | equality | , equity and
[409, 49] in equality, | equity | and academic excellence
[428, 10] online learners as | equal | members of our
[431, 10] the community toward | equality | for all people
[431, 22] beyond appearances. | Equality | means a decent
[472, 30] our values of | equity | and inclusion in
[549, 92] for a more | equitable | understanding of the
[602, 10] that promotes health | equity | , research should
[697, 44] every one has | equal | respect. One
[715, 85] . Only as | equals | can we serve
[772, 73] service in which | equity | and the environment
[811, 56] all opinions are | equal | , nor are
All but a handful appear to be referencing notions of equity, though, to be honest, this represents less attention to equity than I’d expected.
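If you haven't seen a keyword-in-context listing before, the idea is simple enough to sketch in Python: tokenize each comment, find tokens matching a pattern, and keep a few tokens on either side. This is a simplified stand-in for quanteda's `kwic()`, not a reimplementation of it. The documents, the `kwic` helper, and the pattern below are made up for illustration.

```python
import re


def kwic(docs, pattern, window=3):
    """Return (doc number, token position, left context, keyword, right context)."""
    matcher = re.compile(pattern, re.IGNORECASE)
    hits = []
    for doc_number, text in enumerate(docs, start=1):
        # Words and punctuation marks each count as one token.
        tokens = re.findall(r"\w+|[^\w\s]", text)
        for i, token in enumerate(tokens):
            if matcher.fullmatch(token):
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                hits.append((doc_number, i + 1, left, token, right))
    return hits


# Invented documents, for illustration only.
docs = [
    "Treat staff members as equals. More should follow.",
    "The library is great.",
    "We value equity and inclusion in hiring.",
]

for doc_number, position, left, keyword, right in kwic(docs, r"equit\w*|equal\w*"):
    print(f"[{doc_number}, {position}] {left} | {keyword} | {right}")
```

Each printed line has the same shape as the listing above: the document number and token position in brackets, then the keyword set off from its neighbors.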
How (often) does the library come up?
[3, 59] the UVA | libraries | are available
[3, 78] the Small | Library | and the
[14, 426] in Alderman | library | resulted in
[24, 664] our school | library | . I
[118, 32] of Alderman | Library | . While
[118, 38] believe the | library | does need
[118, 109] using the | library | in their
[125, 42] of Alderman | Library | . While
[125, 48] believe the | library | does need
[125, 123] using the | library | in their
[131, 13] at Alderman | library | not be
[131, 97] have completed | library | renovations by
[133, 88] and Clemons | libraries | ' storage
[243, 61] in the | library | who have
[483, 21] the Alderman | Library | . Without
[483, 43] other UVA | libraries | ) remain
[519, 4] the Alderman | Library | renovation,
[524, 132] of Alderman | Library | . Our
[524, 150] of Virginia | Library | system.
[524, 257] of Alderman | Library | dramatically lower
[524, 512] main research | library | at the
[524, 587] at Alderman | Library | and the
[524, 595] Special Collections | Library | , all
[527, 30] access to | library | holdings-
[529, 5] , competitive | library | collection with
[530, 16] a vibrant | library | . The
[530, 89] trend in | libraries | has a
[531, 6] plans for | library | renovation from
[531, 32] . Alderman | Library | used to
[532, 27] of Alderman | Library | , as
[532, 41] by the | library's | administration with
[532, 88] use the | library | for our
[533, 28] the current | library | volume will
[533, 67] with their | library | capacities decreasing
[535, 10] of Alderman | Library | prioritizes keeping
[535, 41] adjusting the | library's | layout.
[624, 45] world class | library | collections,
[626, 4] faculty, | Library | and library
[626, 6] Library and | library | help are
[626, 19] , the | Library | became a
[626, 32] . But | Library | services have
[644, 17] . UVA's | libraries | , lecture
[646, 8] of Alderman | library | MUST be
[659, 8] a world-class | library | that provides
[718, 9] renovate Alderman | Library | are both
[718, 51] and easiest | library | that was
There are 46 references to the library across 24 comments, mostly connected to the planned renovation of Alderman Library. Sigh. Notice the repeated comment, though? Comments 118 and 125? This is the first hint of a behavior we’ll see again later, the same comment submitted multiple times...
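Spotting repeats like this by eye only works for small listings. A quick way to flag exact resubmissions is to normalize each comment and group identical texts, as in the Python sketch below. The comments are invented, the helper names are my own, and this catches only exact matches after normalizing case and whitespace, so lightly edited resubmissions would slip through.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't hide a repeat."""
    return " ".join(text.lower().split())


def find_duplicates(comments):
    """Return lists of 1-based comment numbers whose normalized text is identical."""
    groups = defaultdict(list)
    for number, text in enumerate(comments, start=1):
        groups[normalize(text)].append(number)
    return [numbers for numbers in groups.values() if len(numbers) > 1]


# Invented comments, for illustration only.
comments = [
    "Renovate the library.",
    "More parking please.",
    "renovate  the Library.",
]

print(find_duplicates(comments))
```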
Still to Come
In the next post, we’ll create a document feature matrix and start examining word frequencies, relative frequencies by groups, distinctive words, and ngrams. Later, we’ll take some forays into document similarity and feature co-occurrence, sentiment analysis, document clustering, and topic modeling.
Director, Research Data Services
University of Virginia Library
December 13, 2018