
<nettime> Computing and Visualizing the 19th-Century Literary Genome.
Felix Stalder on Mon, 20 Aug 2012 14:20:45 +0200 (CEST)




[Another quantitative study of cultural history, like Moretti's
Graphs, Maps, Trees (2003) or Lev Manovich's work in cultural
analytics. Fascinating stuff, if I only knew what to make of it. The
figures, for example, are really beautiful, though, for me, entirely
incomprehensible. Ah, the joys of visualization. Felix ]


Jockers, Matthew, Stanford University, USA, mjockers {AT} stanford.edu

http://tinyurl.com/9khetrl

Overview


In literary studies, we have no shortage of anecdotal wisdom regarding
the role of influence on creativity. Consider just a few of the most
prominent voices:

    - 'Talents imitate, geniuses steal' - Oscar Wilde (1854-1900?).

    - 'All ideas are second hand, consciously and unconsciously drawn
from a million outside sources' - Mark Twain (1903).

    - 'The historical sense compels a man to write not merely with his
own generation in his bones, but with a feeling that the whole of the
literature ... has a simultaneous existence.' - T. S. Eliot (1920).

    - 'The elements of which the artwork is created are external to the
author and independent of him.' - Osip Brik (1929).

    - Anxiety of Influence - Harold Bloom (1973).

Whether consciously influenced by a predecessor or not, it might be
argued that every book is in some sense a necessary descendant of,
or necessarily 'connected to', those before it. Influence may be
direct, as when a writer models his or her writing on another writer,
or influence may be indirect in the form of unconscious borrowing.
Influence may even be 'oppositional' as in the case of a writer
who wishes to make his or her writing intentionally different from
that of a predecessor. The aforementioned thinkers offer informed
but anecdotal evidence in support of their claims of influence.
My research brings a complementary quantitative and macroanalytic
dimension to the discussion of influence. For this, I employ the tools
and techniques of stylometry, corpus linguistics, machine learning,
and network analysis to measure influence in a corpus of late 18th-
and 19th-century novels. 

Method

The 3,592 books in my corpus span from 1780 to 1900 and were written
by authors from Britain, Ireland, and America; the corpus is almost
even in terms of gender representation. From each of these books, I
extracted stylistic information using techniques similar to those
employed in authorship attribution analysis: the relative frequencies
of every word and mark of punctuation are calculated and the resulting
data winnowed so as to exclude features not meeting a preset relative
frequency threshold. From each book I also extracted thematic (or
'topical') information using Latent Dirichlet Allocation (Blei, Ng
et al. 2003; Blei, Griffiths et al. 2004; Chang, Boyd-Graber et al.
2009). The thematic data includes information about the percentages of
each theme/topic found in each text. I combine these two categories
of data - stylistic and thematic - to create 'book signals' composed
of 592 unique feature measurements. The Euclidean metric is then
used to calculate every book's distance from every other book in the
corpus. The result is a distance matrix of dimension 3,592 x 3,592.
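
The signal-and-distance step described above might be sketched
roughly as follows. This is only an illustration under stated
assumptions, not the author's pipeline: the function names, the toy
texts, the threshold value, and the topic proportions are all made up
for the example, and the topic proportions are hard-coded rather than
produced by an actual Latent Dirichlet Allocation run.

```python
import math
import re
from collections import Counter

def relative_frequencies(text):
    """Relative frequency of every word and punctuation mark in one text."""
    tokens = re.findall(r"[a-z']+|[.,;:!?]", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

def select_features(freq_tables, threshold):
    """Keep features whose mean relative frequency meets the threshold."""
    vocab = set().union(*freq_tables)
    n = len(freq_tables)
    return sorted(f for f in vocab
                  if sum(t.get(f, 0.0) for t in freq_tables) / n >= threshold)

def book_signal(freqs, features, topic_proportions):
    """Stylistic features followed by (assumed, precomputed) topic shares."""
    return [freqs.get(f, 0.0) for f in features] + list(topic_proportions)

def distance_matrix(signals):
    """Euclidean distance between every pair of book signals."""
    return [[math.dist(a, b) for b in signals] for a in signals]

# Hypothetical three-book corpus; titles, texts and topic shares are invented.
texts = {
    "Book A": "The ship sailed. The sea was calm, and the ship was old.",
    "Book B": "The ship sank. The sea was wild, and the crew was lost.",
    "Book C": "She smiled; he bowed, and the ball began.",
}
topics = {"Book A": [0.7, 0.3], "Book B": [0.6, 0.4], "Book C": [0.1, 0.9]}

tables = [relative_frequencies(t) for t in texts.values()]
features = select_features(tables, threshold=0.03)
signals = [book_signal(tbl, features, topics[name])
           for name, tbl in zip(texts, tables)]
D = distance_matrix(signals)
```

In this toy case D is a 3 x 3 matrix; with the corpus described
above the same shape of computation would yield the 3,592 x 3,592
matrix.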

While measuring and tracking 'actual' or 'true' influence -
conscious or unconscious - is impossible, it is possible to use the
stylistic-thematic distance/similarity measurements as a proxy for
influence. Network visualization software can then be used as a way
to organize, visualize, and study the presence of influence among
the books in my corpus. To prepare the data for use in a network
environment, I converted the distance matrix into a long-form table
with 12,902,464 rows and three columns in which each row captures a
distance relationship between two books. The first cell contains a
'source' book, the second cell a 'target' book, and the third cell the
measured distance between the two. After removing all of the records
in which the target book was published before, or in the same year
as, the source book, the data was reduced from 12,902,464 records
to 6,447,640. This data and a separate table of metadata were then
imported into the open source network analysis software package Gephi
(2009) for analysis and visualization.
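
The reshaping and filtering just described could look something like
the sketch below. It is a hedged illustration only: the helper names,
book titles, publication years and distances are hypothetical, and
the original work may have prepared its Gephi import quite
differently.

```python
def to_long_form(titles, dist):
    """Flatten an n x n distance matrix into (source, target, distance) rows."""
    n = len(titles)
    return [(titles[i], titles[j], dist[i][j])
            for i in range(n) for j in range(n)]

def keep_forward_edges(rows, year_of):
    """Drop rows whose target appeared before, or in the same year as, the source."""
    return [(s, t, d) for s, t, d in rows if year_of[t] > year_of[s]]

# Hypothetical toy data: three books, two of them from the same year.
titles = ["Book A", "Book B", "Book C"]
year_of = {"Book A": 1800, "Book B": 1820, "Book C": 1820}
dist = [
    [0.0, 0.4, 0.9],
    [0.4, 0.0, 0.7],
    [0.9, 0.7, 0.0],
]

rows = to_long_form(titles, dist)            # 3 * 3 = 9 rows
forward = keep_forward_edges(rows, year_of)  # only earlier -> later pairs

# The row count quoted in the text is simply the square of the corpus size.
full_table_rows = 3592 ** 2                  # 12,902,464
```

Because self-pairs always share a publication year, the filter
removes them along with every backward-pointing and same-year pair.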


Networks are constructed out of nodes (books) and edges (distances).
When plotted, nodes with less similarity (i.e. larger distances
between them) will spread out further in the network. Figure 1 offers
a simplified example of three imaginary books.

Figure 1 http://tinyurl.com/9khetrl

Figure 1: a sample network with edge numbers representing measured
distances between nodes

While it is not possible to show the details of the entire network
here, it is possible to display several of the most obvious
macro-structures. Figure 2, for example, presents a zoomed-out
view of the network with book nodes colored according to dates of
publication.

Figure 2 http://tinyurl.com/9khetrl

Figure 2: The 19th-century novel network colored according to
publication date

The shading of nodes and edges according to publication date reveals
the inherently chronological nature of stylistic and thematic change.
The progressive darkening of the nodes from east to west allows us to
see, at the macro-scale, how style and theme are changing and evolving
over time. Also seen in this image is a 'satellite' of books in the
northwest. This satellite represents a 'community' of novels that are
highly self-similar but at the same time markedly different from the
books in the main network cluster. When the network is recolored
according to gender (Figure 3), a new axis can be seen splitting the
network into northern and southern sectors along gender lines.

Figure 3 http://tinyurl.com/9khetrl

Figure 3: The 19th-century novel network colored according to
author-gender

This visualization (Figure 3) reveals that works by female authors
(colored light gray) and male authors (black) are more stylistically
and thematically homogeneous within their respective gender classes.
As a result of this similarity in 'signals,' female-authored books
cluster together on the south side of the main network, while
male-authored books are drawn together in the north. These two
'views' of the network allow us to begin imagining the larger
macro-history of thematic-stylistic change and influence in the
19th-century novel. What is not obvious in this macro-view, however,
is that a great many of the individual books we have traditionally
studied are in fact 'mutations' or outliers from the general trends.
Harriet Beecher Stowe's Uncle Tom's Cabin, for example, clusters
closer to the works of male authors, and Maria Edgeworth's Belinda
has a signal that does not become dominant for forty years after the
date of Belinda's publication. Also absent from the macro-view are
the individual thematic-stylistic 'legacies'. Using three measures
of network significance (weighted in-degree, weighted out-degree and
PageRank), I will end my presentation with the argument that Jane
Austen and Walter Scott are at once the least influenced (i.e. most
original) of the early writers in the network and, at the same time,
the most influential in terms of the longevity, or 'fitness,' of their
thematic-stylistic signals. The signals introduced by Austen and Scott
position them at the beginning of a stylistic-thematic genealogy; they
are, in this sense, the literary equivalent of Homo erectus or, if you
prefer, Adam and Eve.
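
For readers unfamiliar with the three network measures named above,
here is a minimal, generic sketch of how they can be computed on a
small weighted, directed graph. Several things in it are assumptions
and not taken from the abstract: the edge weights are treated as
positive connection strengths (how distances are turned into weights
is not stated in the text), edges point from source to target, and
the PageRank routine is a plain power iteration with a conventional
damping factor of 0.85, not Gephi's own implementation.

```python
def weighted_degrees(edges):
    """Return (in_degree, out_degree) dicts that sum edge weights per node."""
    w_in, w_out = {}, {}
    for s, t, w in edges:
        w_out[s] = w_out.get(s, 0.0) + w
        w_in[t] = w_in.get(t, 0.0) + w
        w_out.setdefault(t, 0.0)
        w_in.setdefault(s, 0.0)
    return w_in, w_out

def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration over (source, target, weight) edges."""
    nodes = sorted({n for s, t, _ in edges for n in (s, t)})
    _, w_out = weighted_degrees(edges)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing weight is spread evenly.
        dangling = sum(rank[node] for node in nodes if w_out[node] == 0.0)
        new = {node: (1.0 - damping) / n + damping * dangling / n
               for node in nodes}
        for s, t, w in edges:
            if w > 0:
                new[t] += damping * rank[s] * w / w_out[s]
        rank = new
    return rank

# Hypothetical three-book network; the weights are invented for illustration.
edges = [("Book A", "Book B", 0.8),
         ("Book A", "Book C", 0.5),
         ("Book B", "Book C", 0.9)]

w_in, w_out = weighted_degrees(edges)
ranks = pagerank(edges)
```

Which direction and weighting best capture 'influence' is a modeling
choice; the sketch only shows the mechanics of the three measures.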




