
Poster B29, Tuesday, August 20, 2019, 3:15 – 5:00 pm, Restaurant Hall

A cross-method, cross-language comparison of semantic feature norms

Sasa Kivisaari¹, Annika Hultén¹, Riitta Salmelin¹; ¹Department of Neuroscience and Biomedical Engineering, Aalto University

Introduction: We perceive the physical world around us as rich with meaning. Models of semantics often assume that these meanings are composed of features such as the taste, feel or function of an object. Presently, multiple approaches are used to quantify this semantic feature space, but it is not known how the different approaches relate to one another. For this study, we collected a set of Finnish behavioral production norms for 300 words (99 abstract + 201 concrete) using an online questionnaire. We compared these semantic feature norms with an existing English behavioral production norm set, the Centre for Speech, Language and the Brain (CSLB) concept property norms. In addition, we compared word embeddings derived from large-scale Finnish and English text corpora with the Word2vec algorithm. This allowed us to compare two different acquisition methods (behavioral production norms vs. Word2vec) and two genetically dissimilar languages (Finnish and English), and to evaluate the extent to which they produce similar information.

Method: A total of 273 respondents filled in an online questionnaire. Each target word (e.g. apple, Finnish: omena) was presented with 15 open text fields in which respondents entered the attributes they considered relevant for that item. The open-field responses were first automatically lemmatized using the Omorfi parser. Synonyms or similar words were collapsed into one feature (e.g. small, smallish, little and miniature → small) and the production frequency was normalized by the number of respondents. Only concrete words were examined in this study. The CSLB property norms were obtained from www.csl.psychol.cam.ac.uk/propertynorms. Word embeddings were based on a 6-billion-token internet-derived text corpus for English and a 1.5-billion-token internet corpus for Finnish (lemmatized). In both cases, the semantic space was built using a Word2vec skip-gram model with a maximum context of 5 + 5 words (5 words before and 5 words after the word of interest).
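As an illustration only, the norm-construction step described above (collapsing synonymous responses into one feature and normalizing production frequency by the number of respondents) might be sketched as follows. The synonym map, the function name `build_norms` and the toy responses are hypothetical, the input is assumed to be already lemmatized (the Omorfi step is not shown), and counting a feature at most once per respondent is an assumption rather than something the abstract states.

```python
from collections import Counter

# Hypothetical synonym map: each lemma points to the feature it collapses into.
SYNONYMS = {"smallish": "small", "little": "small", "miniature": "small"}


def build_norms(responses, n_respondents, synonyms=SYNONYMS):
    """Return {feature: production frequency / number of respondents}.

    responses: one list of already-lemmatized attributes per respondent.
    """
    counts = Counter()
    for attributes in responses:
        # Assumption: a feature counts at most once per respondent,
        # even if the respondent listed several of its synonyms.
        collapsed = {synonyms.get(a, a) for a in attributes}
        counts.update(collapsed)
    return {feature: n / n_respondents for feature, n in counts.items()}


# Toy responses for a single target word (invented for illustration).
responses = [
    ["small", "red", "sweet"],
    ["little", "round"],
    ["miniature", "red"],
    ["sweet", "smallish", "small"],
]
norms = build_norms(responses, n_respondents=len(responses))
```

Each target word then gets one such feature-to-weight mapping, and stacking these over a shared feature list gives the feature vectors that the comparison operates on.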
The norm sets were compared using second-order correlations between dissimilarity matrices based on the cosine distance between feature vectors.

Results: All norm sets demonstrated a clear and comparable taxonomical category structure, in that words from the same semantic category clustered together. The norm sets also correlated significantly with one another. The highest correlation was between the two Word2vec-based word embeddings in Finnish and English (rho = 0.56, p < 0.001). The second highest correlation was between the Finnish Word2vec model and the CSLB production norms (Spearman rho = 0.36, p < 0.001), on par with that between the English Word2vec model and the CSLB production norms (Spearman rho = 0.33, p < 0.001). The Spearman rho between the Finnish production norms and the English and Finnish word embeddings was 0.25 (p < 0.001) and 0.23 (p < 0.001), respectively.

Conclusions: Semantic features extracted using different methods provide comparable information. The highest similarity was observed between the two word embeddings based on different languages, which suggests that the norm collection method may be more important than cultural variations in semantics. Differences between production norms and word embeddings may in part reflect the fact that the latter are more heavily influenced by associative relationships.
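The second-order comparison described in the Method can be sketched roughly as below: build a cosine-distance dissimilarity matrix for the same words in each feature space, then rank-correlate the upper triangles of the two matrices. This is a minimal standard-library sketch, not the authors' actual pipeline. All function names and the toy vectors are hypothetical, and it does not compute the p-values reported in the Results.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dissimilarities(vectors):
    """Upper triangle of the pairwise cosine-distance matrix, as a flat list."""
    return [cosine_distance(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)]


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in order[start:end + 1]:
            ranks[k] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def second_order_similarity(space_a, space_b):
    """Spearman rho between the dissimilarity structures of two feature spaces."""
    d_a = dissimilarities(space_a)
    d_b = dissimilarities(space_b)
    # Spearman rho is the Pearson correlation of the rank-transformed values.
    return pearson(average_ranks(d_a), average_ranks(d_b))


# Toy data: the same four words, in the same order, in two invented spaces.
norm_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.0, 0.8, 1.0]]
embedding_vectors = [[0.2, 0.9], [0.3, 0.8], [0.9, 0.1], [0.7, -0.3]]
rho = second_order_similarity(norm_vectors, embedding_vectors)
```

Because only the word-by-word dissimilarities are compared, the two spaces do not need the same number of dimensions. This is what makes it possible to relate sparse production norms to dense word embeddings.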

Themes: Meaning: Lexical Semantics, Methods
Method: Behavioral
