The Vanderbilt University Data Science Institute is working on a research project that aims to better understand how story narratives evolve and how deep learning models capture and learn information. The project uses the open-source BERT model from HuggingFace to extract embeddings from the last hidden layer of an encoder model. Our researchers are also exploring various methods for visualizing the embedding space.
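The embedding-extraction step described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline; it assumes a standard pretrained checkpoint (`bert-base-uncased`, whose hidden states are 768-dimensional, matching the dimensionality mentioned later) and mean-pools token embeddings into a single vector per text window.

```python
# Sketch: extracting last-hidden-layer embeddings with HuggingFace Transformers.
# Assumes the "bert-base-uncased" checkpoint (768-dim hidden states).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

text = "It was a dark and stormy night."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, sequence_length, hidden_size=768).
token_embeddings = outputs.last_hidden_state
# Mean-pool over the token dimension to get one vector per window.
window_embedding = token_embeddings.mean(dim=1).squeeze(0)
print(window_embedding.shape)  # torch.Size([768])
```

The same pattern applies to any encoder checkpoint; only the hidden size and tokenizer change.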
During the Spring 2023 semester, the project team consisted of DS MS students George Lyu, Sophia Tannir, Yunfei Lyu, Jordan Nieusma, Yuning Wu, and Amogh Vig. The Principal Investigator for the project is Assistant Professor of Mathematics Education and the Learning Sciences Dr. Corey Brady. They collected data from the Project Gutenberg website, which has a library of over 70,000 free eBooks with different genres and narrative styles.
Exploratory visualizations were conducted to understand the variance in the feature space, using the dimension reduction techniques Uniform Manifold Approximation and Projection (UMAP) and t-SNE. The team then conducted a cosine similarity analysis of terms in the first 1000-token slice window, which showed that unrelated terms like 'mystery' and 'boredom' had high cosine similarity in the 768-dimensional embedding space.
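The cosine similarity computation behind this analysis can be sketched in a few lines of NumPy. The vectors below are random stand-ins for term embeddings such as 'mystery' and 'boredom', used purely for illustration; the project's actual vectors come from the BERT encoder.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 768-dim embeddings standing in for term vectors.
rng = np.random.default_rng(0)
mystery = rng.normal(size=768)
boredom = rng.normal(size=768)

sim = cosine_similarity(mystery, boredom)
print(sim)                                   # a value in [-1, 1]
print(cosine_similarity(mystery, mystery))   # identical vectors give 1.0
```

A value near 1 indicates the two vectors point in nearly the same direction in the 768-dimensional space, which is why high similarity between semantically unrelated terms was a surprising result worth investigating.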
The researchers on the Narrative Arcs Project explored two theories to explain the high similarity between unrelated terms. The first suggests that the embeddings of these terms form one cluster, while the term-window relationships form another, distant cluster. The second suggests that there may be a natural limitation in how deep learning models capture semantics.
To further validate their results, the project team plans to collaborate with domain experts from the humanities to check whether the results make sense. They will also investigate the high similarity between terms to determine whether it stems from a problem with their approach or from a natural limitation of capturing semantics with deep learning models.
The Narrative Arcs Project has made significant progress, processing over a thousand books on an NVIDIA DGX server and generating preliminary results through visualizations. Moving forward, the project team aims to build a better understanding of the transformer embedding space.