

Using natural language processing to examine word frequency in spoken vs. written English

Poster C17 in Poster Session C, Friday, October 7, 10:15 am - 12:00 pm EDT, Millennium Hall

Ann Marie Finley, Temple University

People convey messages in writing and speaking in fundamentally different ways. Yet much of our understanding of spoken language rests on written language norms, in part because, until recently, spoken language corpora were considerably limited in scope and size relative to written language corpora. Recent advances in natural language processing and the rise of podcasting have intersected to create a sample of spoken language data readily available for research. We report the steps we took to create a podcast-derived spoken word frequency dataset and contrast spoken vs. written English word frequencies. First, we identified a large spoken language podcast corpus consisting of audio files and transcriptions for 105,360 episodes of English-language podcasts. We developed custom data processing and cleaning pipelines in the R programming language and ran all scripts on a high-performance interactive-use server. We then used the ‘quanteda’ R package to convert the cleaned texts into a corpus object, which we tokenized into single words. We transformed these tokens into a document-feature matrix and calculated word frequency for 110,168 words, drawn from a corpus of 622,115,467 spoken words. We calculated word frequency on the Zipf scale, which ranges from 1 (low frequency) to 7 (high frequency). To compare spoken vs. written English word frequencies, we applied the same cleaning processes to a comparable corpus of written English. This yielded written word frequency data for 191,495 unique words drawn from a corpus of 641,410,953 total words. We hypothesized that word frequency would differ between spoken and written English. We examined word frequency by modality using the 108,948 words with frequency measures present in both the spoken and the written corpora. We first compared the proportion of high:low frequency words in written vs. spoken English.
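The pipeline above (corpus object, tokenization, document-feature matrix, Zipf-scaled frequencies) can be sketched in R with quanteda. This is an illustrative sketch, not the authors' scripts; the `transcripts` vector is placeholder data standing in for the cleaned podcast transcriptions.

```r
# Minimal sketch of the frequency pipeline, assuming `transcripts` is a
# character vector of cleaned episode transcripts (placeholder data here).
library(quanteda)

transcripts <- c("we talked about podcasts today",
                 "podcasts are a rich source of spoken language")

corp  <- corpus(transcripts)                 # convert cleaned texts to a corpus object
toks  <- tokens(corp, remove_punct = TRUE)   # tokenize into single words
dfmat <- dfm(toks)                           # build a document-feature matrix

counts <- featfreq(dfmat)                    # raw count of each word across documents
total  <- sum(counts)                        # total corpus size in tokens

# Zipf scale: log10 of frequency per million words, plus 3, giving a scale
# that runs from roughly 1 (low frequency) to 7 (high frequency).
zipf <- log10(counts / total * 1e6) + 3
head(sort(zipf, decreasing = TRUE))
```

On the full 622-million-word corpus the same steps would simply operate on a much larger `transcripts` vector, with `counts` covering the 110,168 word types reported above.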
Using the Jaccard similarity index, we found that spoken and written English contain roughly similar proportions of high:low frequency words (two-sample z test with a Yates continuity correction, α = .05: X2 = 3.72, df = 1, p = 0.97, 95% CI = [2.6e-03, 1]). Next, we examined whether a greater proportion of spoken vs. written words would overlap across modalities (i.e., occur at least once in each modality). This prediction was supported (two-sample z test with a Yates continuity correction, significant at α = .05: X2 = 61,501, df = 1, p < .001, 95% CI = [0.42, 1]). Finally, we used the Wasserstein distance to compare word frequency distributions by modality. Word frequency distributions differed by modality (Wasserstein distance d significant at p < .0001 for all contrasts: the full distribution, the 40,000 most frequent words, the 40,000 least frequent words, and the overlapping words). Our results indicate that word frequency differs by language modality across many different contexts and genres. Additionally, the effect of modality on word frequency appears to depend on the relative frequency of the word in question. Future research should include more fine-grained analyses to examine interaction effects of modality within different contexts and genres.
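The two comparisons above can be sketched in base R plus the `transport` package. This is an illustrative sketch under stated assumptions, not the authors' analysis code: the high-frequency counts are hypothetical placeholders, and `zipf_spoken` / `zipf_written` stand in for the per-word Zipf vectors produced by the pipeline.

```r
# Sketch of the modality comparisons (illustrative; counts are hypothetical).
library(transport)  # provides wasserstein1d() for 1-D Wasserstein distance

# Hypothetical number of high-frequency words out of all measured words,
# for the spoken and written corpora respectively.
high_freq <- c(spoken = 52000,  written = 95000)
totals    <- c(spoken = 110168, written = 191495)

# Two-sample test of equal proportions; prop.test() applies the Yates
# continuity correction by default.
prop.test(high_freq, totals)

# Compare the two frequency distributions directly. Placeholder vectors
# stand in for the real per-word Zipf values in each modality.
zipf_spoken  <- rnorm(1000, mean = 3.5, sd = 1)
zipf_written <- rnorm(1000, mean = 3.2, sd = 1)
wasserstein1d(zipf_spoken, zipf_written)
```

Restricting the input vectors to subsets (the 40,000 most or least frequent words, or only overlapping words) yields the per-contrast distances reported above.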

Topic Areas: Computational Approaches, Methods