Networks of characters
My latest paper may be of interest to both authors and scientists. Using a mix of social network analysis and artificial intelligence algorithms, our research team analyzed the social networks found in fictional works.
In a novel or film, there is a social network with the nodes representing characters, and the edges corresponding to co-occurrence: characters appearing together in the text or scene. We call these character networks. They arise in all cultural works, even non-fiction such as historical books or biographies.
Here is a link to the paper:
Character networks are typically small with a few dozen nodes, but can have thousands of edges. For example, Harry Potter has over 4,000 edges to other characters in Harry Potter and the Goblet of Fire. They can grow to hundreds of nodes (as is the case for the Lord of the Rings; see this paper), or even thousands of nodes (as is the case for the Marvel universe).
What we did
We had two separate directions in our study.
- Mining the networks for their properties. This is a deep dive into the analysis of individual works, extracting network properties not readily seen by the reader. Think of this as putting an individual character network under a microscope.
- Modelling the networks. This gives a big picture view, considering mathematical models and finding the best fitting model for the networks. Think of this as a macroscopic view of multiple character networks.
Inspired by the recent paper Network of Thrones, where the social network of Game of Thrones is studied, we considered three novels: Twilight by Stephenie Meyer, Stephen King’s The Stand, and J.K. Rowling’s Harry Potter and the Goblet of Fire. We chose these as they are well-known and have many characters.
Below are the visualizations of these networks, with the names of more prominent characters in larger font than the others.
The fun thing was that we derived these graphs without directly reading any of the text of the novels! We devised a graph extraction algorithm (analogous to the one in Network of Thrones) that is based on character names being fifteen or fewer words apart in the text.
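For the curious, here is a minimal Python sketch of the co-occurrence idea. It is an illustration under simplifying assumptions, not the extraction code from the paper: the function name `cooccurrence_edges` is hypothetical, each character is assumed to be mentioned by a single-word name, and the weight of an edge is simply the number of times two names fall within the window.

```python
import re
from collections import Counter


def cooccurrence_edges(text, names, window=15):
    """Count pairs of distinct names that appear at most `window` words apart.

    Assumption: every character is identified by one single-word name.
    """
    words = re.findall(r"[A-Za-z']+", text)
    # Positions in the word stream where a known character name occurs.
    mentions = [(i, w) for i, w in enumerate(words) if w in names]
    edges = Counter()
    for a in range(len(mentions)):
        i, u = mentions[a]
        for b in range(a + 1, len(mentions)):
            j, v = mentions[b]
            if j - i > window:
                break  # later mentions are even farther away
            if u != v:
                edges[tuple(sorted((u, v)))] += 1
    return edges


if __name__ == "__main__":
    sample = ("Harry and Ron walked to class. " + "word " * 20 +
              "Hermione waved at Ron.")
    print(cooccurrence_edges(sample, {"Harry", "Ron", "Hermione"}))
```

A real pipeline would also need to handle nicknames, titles, and multi-word names, which this sketch ignores.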
The colors in the figures represent communities, picked out by a network parameter called modularity. We also used centrality measures like PageRank and betweenness to find the top characters in each work. For example, Harry, Hermione and Ron were identified as the most influential characters in Harry Potter and the Goblet of Fire. The communities in this network are: Hogwarts, the Dursleys, the Weasleys, Slytherin, and the inseparable friends Seamus and Dean.
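To give a feel for how a centrality measure ranks characters, here is a small Python sketch of PageRank computed by power iteration. It is only an illustration: the toy graph is made up, the damping factor of 0.85 is the conventional default rather than a value taken from the paper, and the code assumes an undirected, unweighted graph with no isolated nodes. Betweenness and modularity are not shown.

```python
def pagerank(adj, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected graph.

    `adj` maps each node to the set of its neighbors.
    Assumption: every node has at least one neighbor.
    """
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {}
        for v in nodes:
            # Each neighbor u shares its rank equally among its own edges.
            incoming = sum(rank[u] / len(adj[u]) for u in adj[v])
            new_rank[v] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical toy network, not data from the novels.
    toy = {
        "Harry": {"Ron", "Hermione", "Draco"},
        "Ron": {"Harry", "Hermione"},
        "Hermione": {"Harry", "Ron"},
        "Draco": {"Harry"},
    }
    scores = pagerank(toy)
    for name in sorted(scores, key=scores.get, reverse=True):
        print(name, round(scores[name], 3))
```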
The character networks we studied showed many of the properties of complex networks: skewed degree distribution, the small world property, and strong community structure. Complex networks range from the web graph and its links, to on-line social networks like Facebook, to protein interaction networks in living cells.
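Two of these properties are easy to measure directly. The sketch below, which is illustrative and uses a made-up toy graph, computes the average shortest-path length (short paths are one half of the small world property) and the average local clustering coefficient (a rough indicator of tightly knit groups). The function names are my own for this example.

```python
from collections import deque


def average_path_length(adj):
    """Mean shortest-path length over all connected ordered pairs of nodes."""
    total, pairs = 0, 0
    for source in adj:
        # Breadth-first search from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def average_clustering(adj):
    """Mean fraction of each node's neighbor pairs that are themselves linked."""
    coeffs = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        coeffs.append(2 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)


if __name__ == "__main__":
    toy = {
        "Harry": {"Ron", "Hermione", "Draco"},
        "Ron": {"Harry", "Hermione"},
        "Hermione": {"Harry", "Ron"},
        "Draco": {"Harry"},
    }
    print(average_path_length(toy), average_clustering(toy))
```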
Enter machine learning
The unique and novel part of our work was to use artificial intelligence algorithms to tell us how to model character networks. Mathematical models exist for all kinds of phenomena in nature, ranging from weather to the spread of a virus. Complex network models have been around for under two decades, and have become quite sophisticated at simulating features such as the small world property, which demands short paths between random nodes; in a slogan: six degrees of separation.
For this approach, we employed machine learning algorithms such as support vector machines and decision trees. We also used a large data set of 800 social networks in movies catalogued at moviegalaxies.com.
We used four different machine learning algorithms to test how well network models fit our character networks data. The algorithms used motifs (or small subgraph counts) and eigenvalue histograms to compare the character networks to graphs generated by the models. Think of the motifs as the DNA of a network: if two networks have close motif counts, then they are similar.
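As a rough illustration of what a motif count looks like, the Python sketch below counts three of the smallest subgraphs: edges, wedges (paths on three nodes), and triangles. This is only a toy version under my own assumptions: the study may have used larger motifs, and the function name `motif_counts` is hypothetical. Note that wedges are counted by their center node, so every triangle also contributes three wedges.

```python
from itertools import combinations


def motif_counts(adj):
    """Count edges, wedges, and triangles in an undirected graph.

    `adj` maps each node to the set of its neighbors.
    """
    edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    # A wedge is a pair of neighbors sharing a center node.
    wedges = sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())
    # Each triangle is seen once from each of its three corners.
    closed = sum(
        1
        for v in adj
        for a, b in combinations(sorted(adj[v]), 2)
        if b in adj[a]
    )
    return {"edges": edges, "wedges": wedges, "triangles": closed // 3}


if __name__ == "__main__":
    toy = {
        "Harry": {"Ron", "Hermione", "Draco"},
        "Ron": {"Harry", "Hermione"},
        "Hermione": {"Harry", "Ron"},
        "Draco": {"Harry"},
    }
    print(motif_counts(toy))
```

Two networks can then be compared by looking at how close their count vectors are, which is the intuition behind the DNA analogy.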
One model, the Chung-Lu (CL) model, was the decisive winner. The model is named after graph theorists Fan Chung Graham and Linyuan Lu. (In an interesting coincidence, I interviewed Fan in a previous post.) In the CL model, networks are randomly generated so that each node's expected degree matches a given degree sequence. What this means practically is that the model is better attuned to the importance of a character. For modeling character networks, the CL model outperforms the configuration model (CFG), preferential attachment (PA) model, and binomial random graphs (ER).
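Here is a minimal Python sketch of a Chung-Lu style generator. Each node i gets a weight w_i (its target expected degree), and nodes i and j are joined independently with probability min(1, w_i * w_j / sum of all weights). This is a simplified variant written for illustration: it omits the self-loops allowed in the standard formulation, and the function name and example weights are my own.

```python
import random


def chung_lu(weights, seed=None):
    """Sample edges of a Chung-Lu style random graph on nodes 0..n-1.

    Edge {i, j} appears with probability min(1, w_i * w_j / sum(w)).
    Simplification: self-loops are not generated.
    """
    rng = random.Random(seed)
    total = sum(weights)
    if total == 0:
        return set()
    n = len(weights)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            p = min(1.0, weights[i] * weights[j] / total)
            if rng.random() < p:
                edges.add((i, j))
    return edges


if __name__ == "__main__":
    # Hypothetical weights: a few "main characters" and many minor ones.
    weights = [12, 10, 9] + [2] * 20
    graph = chung_lu(weights, seed=1)
    print(len(graph), "edges")
```

Nodes with large weights attract most of the edges, which is how the model reflects the importance of a character.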
What does this all mean?
To sum up:
We discovered a simple model for the structure of social networks in novels and movies.
The CL model skews the creation of random edges towards the more influential nodes. This mirrors, perhaps, the psychology of how writers create the social networks for their characters. For instance, J.K. Rowling may have decided to make Harry, Hermione and Ron the top characters in her books, and then built the network of lesser characters around them. We found that the CL model also has dense structures like cliques (subgraphs where all nodes are joined, like H3 and F10 above) not readily found in the other models.
The big data of literature
Big data-theoretic ideas are now increasingly used to support or debate literary notions. I wrote about this in my blog on the shape of stories popularized by Kurt Vonnegut, now supported by data from thousands of cultural works.
We may be moving toward the automation of literary analysis. Such a move will be an adjunct to traditional analysis and will not replace it. The advantage of the method is the ability to visualize the social networks in a fictional work in a new way, using thousands of data points. Such an approach gives us both local and global information in a way not possible even 15 years ago.
Interested in visualizing your novel’s social network? Contact me!