Analysis of Ours to Shape Comments, Part 1


As part of a series of workshops on quantitative analysis of text this fall, I started examining the comments submitted to President Ryan’s Ours to Shape website. The site invites people to share their ideas and insights for UVA going forward, particularly in the domains of service, discovery, and community. The website was only one venue for providing suggestions and voicing possibilities – President Ryan hosted community discussions as well – but the website afforded an opportunity for individuals to chime in multiple times and at their convenience, so in theory should represent an especially inclusive collection.

After spending several weeks developing this example for the workshops, I thought I’d share some interesting bits in a series of blog posts.

I’m going to focus on results rather than code, but if you’re interested, the full code – to scrape the content (using RSelenium and rvest), read in and clean up the comments (using quanteda), and analyze the text (mostly with quanteda) in multiple ways – is available on GitHub. If you’re new to quanteda and would like to know more, check out our new StatLab article (by Leah Malkovich) on getting started with quanteda!

Comments, Categories, and Connections

First, let’s get to know the data!

community discovery   service 
      481       212       155

We have 848 comments (as of December 7, 2018), with over half of the comments added under the “community” tag. I was actually pretty excited about that at first, all that attention to community, but we’ll dig a little further into what that means in a later post.

When contributing ideas on the website, individuals were asked to identify their connection to UVA – let’s see who’s adding their thoughts.

  alumni community staff student supporter faculty parent number
1   0.47      0.23  0.22     0.2      0.15    0.14   0.13   1.53

Alumni are the most frequent contributors, with 47% indicating that status, followed by members of the community, university staff, students, individuals identifying as university supporters, faculty, and parents of students. But individuals could identify with multiple (pre-defined) roles, and on average, they identified 1.5 connections.

Bar plot of number of roles selected.

The bulk of comments (66%) come from individuals who identified only a single connection. A small number intersect with UVA in multiple ways, but so few that I truncated the upper end of the number of roles to “4 or more” when measuring connection strength (number of connections).
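To make the bookkeeping concrete, here is a minimal sketch in Python (not the R code from the GitHub repository) of how role shares, the average number of roles, and a count capped at “4 or more” could be computed. The function name `role_summary` and the toy records are hypothetical and for illustration only.

```python
from collections import Counter

ROLES = ["alumni", "community", "staff", "student",
         "supporter", "faculty", "parent"]

def role_summary(selections, cap=4):
    """Summarize multi-select roles.

    selections: list of sets, one set of selected roles per comment.
    Returns (share of comments selecting each role,
             mean number of roles per comment,
             Counter of role counts with values above `cap` folded into `cap`).
    """
    n = len(selections)
    share = {role: sum(role in s for s in selections) / n for role in ROLES}
    counts = [len(s) for s in selections]
    mean_roles = sum(counts) / n
    capped = Counter(min(c, cap) for c in counts)
    return share, mean_roles, capped

# Hypothetical toy data, not the Ours to Shape comments
toy = [
    {"alumni"},
    {"alumni", "parent"},
    {"staff", "community"},
    {"student"},
    {"alumni", "faculty", "community", "supporter", "parent"},
]
share, mean_roles, capped = role_summary(toy)
print(share["alumni"], mean_roles, dict(capped))
```

Because roles are not mutually exclusive, the shares can sum to more than 1, which is why the proportions in the table above add up to roughly the average number of connections.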

It’s also worth thinking about each individual’s primary connection, so I construct a mutually exclusive categorization of each commenter’s primary role. That is, if someone is both a faculty member at UVA and an alumnus of UVA, I’ll argue that the current faculty role is the primary role, the one that has the strongest influence on one’s current thinking. Similarly, if one is both a student and a supporter, one’s experience as a student is likely to have a stronger influence. And if an individual identifies as both UVA staff and a community member, her experience as an employee is likely to have a primary influence. One can disagree with any of these rankings, of course, so it’s worth being clear on the primacy I’ve given to each role.

  • Faculty, staff, or student will take precedence over alumni, parent, community member, and supporter
  • Faculty will take precedence over staff and student (on the presumption that individuals claiming this role are faculty)
  • Student will take precedence over staff (on the presumption that these are students who are also employed by UVA)
  • Alumni will take precedence over community, supporter, and parent (on the presumption that one’s experience within UVA will be the more powerful)
  • Parent will take precedence over community and supporter
  • Community will take precedence over supporter
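
The rules above amount to a single ranking of roles, so one way to sketch the assignment in Python (again, not the original R code) is to walk down that ranking and return the first role a commenter selected. The name `primary_role` is hypothetical, and placing faculty above student is an assumption read from the second rule.

```python
# Precedence from highest to lowest, read off the rules above.
# Faculty above student is an assumption; the rest follows the list directly.
PRECEDENCE = ["faculty", "student", "staff", "alumni",
              "parent", "community", "supporter"]

def primary_role(roles):
    """Return the highest-precedence role in `roles`, or None if none match."""
    for role in PRECEDENCE:
        if role in roles:
            return role
    return None

print(primary_role({"faculty", "alumni"}))     # faculty
print(primary_role({"student", "supporter"}))  # student
print(primary_role({"staff", "community"}))    # staff
```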

Given this operationalization of “primary role”, the distribution of contributors looks like…

   alumni community   faculty    parent     staff   student supporter 
      312        44       110        39       172       163         8

The comments are still dominated by individuals whose primary connection is as alumni (37%), followed by staff (20%) and students (19%), then faculty (13%), community members and parents (5% each), and institutional supporters (1%).

Do those with more connections contribute more to comments about community, discovery, or service?

Distribution of comment categories by number of connections.

There is no apparent relationship between the number of connections a contributor identifies and the category of their comments (a chi-square test confirms the absence of any statistically discernible difference in category distribution by number of connections). Okay, what about the nature of the primary connection?
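
For readers who want to see what the chi-square test computes, here is a minimal Python sketch of the Pearson statistic for a contingency table (rows could be number of connections, columns comment categories). The function name `chi_square` and the toy counts are hypothetical; the actual cross-tabulation is not reproduced in this post.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom.

    table: list of rows of observed counts (all rows the same length,
    no row or column with a zero total).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts: the second row is exactly proportional to the first,
# so the statistic is zero (no association at all).
stat_none, dof_none = chi_square([[10, 20], [20, 40]])
# Hypothetical counts with a visible association.
stat_some, dof_some = chi_square([[20, 10], [10, 20]])
print(stat_none, dof_none)
print(stat_some, dof_some)
```

The statistic is then compared against a chi-square distribution with the returned degrees of freedom; a small statistic relative to that distribution is what “no statistically discernible difference” refers to.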

Distribution of comment categories by primary connections.

Here there are differences. In particular, faculty are more likely than others to comment in the category of “discovery” and are the least likely to add ideas to the “community” field. Supporters, too, add to the “discovery” category at rates higher than other contributors. The tradeoffs are primarily between community and discovery; contributions to the “service” category are relatively low across all primary connections.

Length and Readability

So far, we’ve been looking at the document metadata. Next, let’s create a corpus from the text of the comments, extract the length of each comment (number of words), and compare across comment categories and commenter type.

Corpus consisting of 848 documents and 14 docvars.

# A tibble: 3 x 3
  type      `mean(words)` `sd(words)`
1 community         132.        181. 
2 discovery         106.        128. 
3 service            82.8        51.1

Not only are there more “community” comments, these tend to be a bit longer, though the length of these comments is also highly variable (check out that standard deviation – bigger than the mean!). Indeed, the figure below shows that comments have similar lengths across categories, but the community category attracts a small number of extremely long comments.
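
As a rough illustration of the summary above, here is a Python sketch that counts words per comment and reports the mean and standard deviation by group. It splits on whitespace, which is an assumption and will not match quanteda’s tokenizer exactly; `length_by_group` and the toy comments are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

def length_by_group(comments):
    """comments: iterable of (group, text) pairs.

    Returns {group: (mean word count, sample standard deviation)}.
    The standard deviation is NaN for groups with a single comment.
    """
    lengths = defaultdict(list)
    for group, text in comments:
        lengths[group].append(len(text.split()))  # crude whitespace tokens
    return {
        group: (mean(vals), stdev(vals) if len(vals) > 1 else float("nan"))
        for group, vals in lengths.items()
    }

# Hypothetical toy comments
toy_comments = [
    ("community", "build more shared spaces"),
    ("community", "invest in local schools and public transit now"),
    ("service", "expand volunteering"),
]
summary = length_by_group(toy_comments)
print(summary)
```

A standard deviation larger than the mean, as in the community and discovery rows, is a sign of a right-skewed distribution: word counts cannot go below zero, so the spread has to come from a few very long comments.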

Boxplots of number of words by comment category.

Length by number of connections?

# A tibble: 4 x 3
  numroles4 `mean(words)` `sd(words)`
1         1         135.        185. 
2         2          82.9        35.9
3         3          77.4        30.2
4         4          83.7        28.4

Stripchart of number of words by number of connections.

Those identifying only one connection do leave lengthier comments, on average. This surprised me a bit as my prior was that those who intersect with UVA in multiple ways would have more to say, or would speak to multiple dimensions. What about comment length by a contributor’s primary connection?

# A tibble: 7 x 3
  primary   `mean(words)` `sd(words)`
1 alumni             84.7        64.6
2 community         128.        139. 
3 faculty           118.        193. 
4 parent             87.2        56.8
5 staff             111.        145. 
6 student           187         234. 
7 supporter          99.9        42.7

Stripchart of number of words by primary connections.

Some suggestive differences arise by primary connection – in particular, the cluster of relatively wordy comments by students, compared to other groups.

Comment length is one way of considering complexity. Another is the readability of the text. Readability here is a measure of how easy a comment is to read based on vocabulary and sentence complexity. Let’s extract readability and compare it across groups.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -2.23   10.09   12.17   12.46   14.46   29.45

On average, comments to Ours to Shape are written at just over a 12th-grade reading level. That’s a reasonably high level of complexity – newspapers hover around the 11th-grade reading level and common advice is to write material intended for the mass public at the 9th-grade level. Does readability/grade-level of comments differ across comment categories?
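
The post does not say which readability index was used, so the following Python sketch assumes the Flesch-Kincaid grade level, a common grade-level measure: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The syllable counter is a crude vowel-group heuristic, and both function names are hypothetical; scores will not match quanteda’s exactly.

```python
import re

def count_syllables(word):
    """Very rough syllable count: number of vowel groups, at least 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    """Flesch-Kincaid grade level (assumed measure, see lead-in)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return float("nan")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

short = fk_grade("The cat sat.")
longer = fk_grade("Universities should prioritize interdisciplinary "
                  "collaboration across departments.")
print(round(short, 2), longer > short)
```

Under this formula a very short sentence of one-syllable words scores below zero, and a single long sentence with no breaks scores very high, which is one plausible reading of the negative minimum and the large maximum in the summary above.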

Violin plots of readability by comment category.

No, not really. There’s no real difference in the readability/complexity of the comments by category (community, service, discovery). What about by the number of connections a contributor has with UVA (numroles4)?

Violin plots of readability by number of connections.

Again, no apparent differences. And here’s the average readability by primary connection:

# A tibble: 7 x 4
  primary   `mean(read)` `sd(read)` `n()`
1 alumni            12.1       3.57   312
2 community         11.3       4.37    44
3 faculty           13.2       3.87   110
4 parent            11.1       4.36    39
5 staff             13.2       3.76   172
6 student           12.4       3.64   163
7 supporter         14.7       5.77     8

Violin plots of readability by primary connection.

Nothing, really – though I think the violin plots are pretty! There are no substantive differences by comment category, number of connections, or primary connection. We’re all writing, on average, equally complex feedback!

For fun, here’s the least “easy-to-read” comment based on this measure:

     type     read
1 service 29.44754
1 Prepare students for the real world—a world without “safe spaces” and “trigger warnings”, a world where opinions must be based on facts instead of feelings, and where if they don't become mature, responsible adults (instead of coddled children) they will be utterly subjugated by China (which produces more engineers in a year than we produce college graduates and doesn't care about your feelings) by the time they're 50

Huh… okay… not as much fun as I’d hoped…

Key Words in Context

Moving on, let’s look at a few key words and see how they’re used. This is a really useful way to get to know a corpus. Here I’m just going straight to my own interests, looking for occurrences of words around equity/equality and around library. First up, equity! Here are all 27 occurrences of equity/equitable, equal(s)/equality across 23 Ours to Shape comments.

 [24, 225]       , which almost |  equals   | to the GDP             
  [32, 69]    everyone is given |   equal   | opportunities. This    
 [34, 493]        fitness to be |   equal   | to the development     
  [55, 66]     staff members as |  equals   | . More should          
 [57, 125]    action to support |  equity   | .                      
 [88, 128]          where I had |   equal   | male and female        
 [292, 33]   Pathways to Health |  Equity   | "" report              
 [295, 17] resources to promote |  equity   | . The University       
 [295, 26]    Action for Racial |  Equity   | "" Call                
 [295, 62]  point for promoting |  equity   | .                      
 [327, 86]   after separate but |   equal   | status in the          
 [352, 51]    do to incentivize |  equity   | , sustainability,      
 [357, 56]     attachment to an | equitable | workplace as well      
 [359, 42]      basic values of | equality  | and justice.           
 [388, 42]    Action for Racial |  Equity   | ) and especially       
 [409, 47]          a leader in | equality  | , equity and           
 [409, 49]         in equality, |  equity   | and academic excellence
 [428, 10]   online learners as |   equal   | members of our         
 [431, 10] the community toward | equality  | for all people         
 [431, 22]  beyond appearances. | Equality  | means a decent         
 [472, 30]        our values of |  equity   | and inclusion in       
 [549, 92]           for a more | equitable | understanding of the   
 [602, 10] that promotes health |  equity   | , research should      
 [697, 44]        every one has |   equal   | respect. One           
 [715, 85]            . Only as |  equals   | can we serve           
 [772, 73]     service in which |  equity   | and the environment    
 [811, 56]     all opinions are |   equal   | , nor are

All but a handful appear to be referencing notions of equity, though, to be honest, this represents less attention to equity than I’d expected.
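
For a sense of how a key-words-in-context listing like the one above is built, here is a minimal Python sketch. It assumes the bracketed pairs are the document index and token position; the tokenizer, the function name `kwic`, and the toy documents are hypothetical and simpler than what quanteda does.

```python
import re

def kwic(docs, pattern, window=3):
    """Return (doc_id, position, left context, keyword, right context) rows.

    docs: list of strings; pattern: regex that must match a whole token.
    doc_id and position are 1-based.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    rows = []
    for doc_id, text in enumerate(docs, start=1):
        # Words (with optional apostrophe part) and single punctuation marks
        tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
        for pos, tok in enumerate(tokens, start=1):
            if rx.fullmatch(tok):
                left = " ".join(tokens[max(0, pos - 1 - window):pos - 1])
                right = " ".join(tokens[pos:pos + window])
                rows.append((doc_id, pos, left, tok, right))
    return rows

# Hypothetical toy documents
toy_docs = ["We value equity and inclusion.", "All members are equal."]
rows = kwic(toy_docs, r"equit\w*|equal\w*")
for doc_id, pos, left, kw, right in rows:
    print(f"[{doc_id}, {pos}] {left:>20} | {kw} | {right}")
```

Matching on a pattern rather than a fixed word is what lets one search pick up equity, equitable, equal, equals, and equality together.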

How (often) does the library come up?

    [3, 59]             the UVA | libraries | are available        
    [3, 78]           the Small |  Library  | and the              
  [14, 426]         in Alderman |  library  | resulted in          
  [24, 664]          our school |  library  | . I                  
  [118, 32]         of Alderman |  Library  | . While              
  [118, 38]         believe the |  library  | does need            
 [118, 109]           using the |  library  | in their             
  [125, 42]         of Alderman |  Library  | . While              
  [125, 48]         believe the |  library  | does need            
 [125, 123]           using the |  library  | in their             
  [131, 13]         at Alderman |  library  | not be               
  [131, 97]      have completed |  library  | renovations by       
  [133, 88]         and Clemons | libraries | ' storage            
  [243, 61]              in the |  library  | who have             
  [483, 21]        the Alderman |  Library  | . Without            
  [483, 43]           other UVA | libraries | ) remain             
   [519, 4]        the Alderman |  Library  | renovation,          
 [524, 132]         of Alderman |  Library  | . Our                
 [524, 150]         of Virginia |  Library  | system.              
 [524, 257]         of Alderman |  Library  | dramatically lower   
 [524, 512]       main research |  library  | at the               
 [524, 587]         at Alderman |  Library  | and the              
 [524, 595] Special Collections |  Library  | , all                
  [527, 30]           access to |  library  | holdings-            
   [529, 5]       , competitive |  library  | collection with      
  [530, 16]           a vibrant |  library  | . The                
  [530, 89]            trend in | libraries | has a                
   [531, 6]           plans for |  library  | renovation from      
  [531, 32]          . Alderman |  Library  | used to              
  [532, 27]         of Alderman |  Library  | , as                 
  [532, 41]              by the | library's | administration with  
  [532, 88]             use the |  library  | for our              
  [533, 28]         the current |  library  | volume will          
  [533, 67]          with their |  library  | capacities decreasing
  [535, 10]         of Alderman |  Library  | prioritizes keeping  
  [535, 41]       adjusting the | library's | layout.              
  [624, 45]         world class |  library  | collections,         
   [626, 4]            faculty, |  Library  | and library          
   [626, 6]         Library and |  library  | help are             
  [626, 19]               , the |  Library  | became a             
  [626, 32]               . But |  Library  | services have        
  [644, 17]             . UVA's | libraries | , lecture            
   [646, 8]         of Alderman |  library  | MUST be              
   [659, 8]       a world-class |  library  | that provides        
   [718, 9]   renovate Alderman |  Library  | are both             
  [718, 51]         and easiest |  library  | that was

There are 46 references to the library across 24 comments, mostly connected to the planned renovation of Alderman Library. Sigh. Notice the repeated comment, though? Comments 118 and 125? This is the first hint of a behavior we’ll see again later: the same comment submitted multiple times...

Still to Come

In the next post, we’ll create a document feature matrix and start examining word frequencies, relative frequencies by groups, distinctive words, and ngrams. Later, we’ll take some forays into document similarity and feature co-occurrence, sentiment analysis, document clustering, and topic modeling.

Michele Claibourn
Director, Research Data Services
University of Virginia Library
December 13, 2018
