Nettime mailing list archives

<nettime> Statistically Improbable Phrases and the 'real reader'
Tjebbe van Tijen/Imaginary Museum Projects on Tue, 27 Dec 2005 22:18:06 +0100 (CET)


A few months ago Amazon Books introduced a new device:
Statistically Improbable Phrases (shortened to SIPs).

To give an example, for the book

Armstrong, David F. ()/2000, William C. Stokoe Jr ()/Wilcox, Sherman ()
"Gesture and the nature of language" 1995/Cambridge University Press

The following SIPs are given:

> Statistically Improbable Phrases (SIPs): (learn more)
> primary sign languages, visible gestural, spoken language phonology,
> language modular, visible gestures, signed languages, sublexical
> level, sign language word, gestural approach, semantic phonology,
> spatial syntax, grammar module, gestural theory, vocal gestures, deaf
> signers, associationist theories, perceptual categorization, image
> schemata, grammatical processing, primary consciousness, global
> mappings, iconic gestures, modular theories, adaptive complex, order
> consciousness

Clicking on one of these 'phrases' generates a web page that lists other
books containing the same phrase, together with the number of occurrences
of that particular SIP in each.

The idea of SIP is explained on the Amazon site:

> Amazon.com's Statistically Improbable Phrases, or "SIPs", are the most
> distinctive phrases in the text of books in the Search Inside!
> program. To identify SIPs, our computers scan the text of all books in
> the Search Inside! program. If they find a phrase that occurs a large
> number of times in a particular book relative to all Search Inside!
> books, that phrase is a SIP in that book.
> SIPs are not necessarily improbable within a particular book, but they
> are improbable relative to all books in Search Inside!.

and a new Wikipedia entry reads:

> Statistically Improbable Phrases is a system developed by Amazon.com
> to compare all of the books they index and find phrases in each that
> are the most unlikely to be found in any other book indexed.
> The system is used to find the most unique portions of books for use
> as a summary or keyword.
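
Neither description says how the comparison is actually computed. As a rough
illustration only, and not Amazon's algorithm, the ratio they describe (how
often a phrase occurs in one book relative to how often it occurs in all
indexed books) could be sketched as below. The two-word phrase length, the
function names and the three toy "books" are my own assumptions.

```python
from collections import Counter

def bigrams(text):
    """Lowercase the text, split on whitespace, return adjacent word pairs."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

def sip_scores(book_id, corpus):
    """Score each two-word phrase of one book against the whole corpus.

    corpus is a dict mapping a book id to its text (hypothetical format).
    Score = (phrase rate in this book) / (phrase rate in all books).
    """
    per_book = {bid: Counter(bigrams(text)) for bid, text in corpus.items()}
    overall = Counter()
    for counts in per_book.values():
        overall.update(counts)
    book_total = sum(per_book[book_id].values())
    corpus_total = sum(overall.values())
    scores = {}
    for phrase, n in per_book[book_id].items():
        rate_in_book = n / book_total
        rate_in_corpus = overall[phrase] / corpus_total
        scores[phrase] = rate_in_book / rate_in_corpus
    return scores

# Toy corpus, invented for illustration.
corpus = {
    "gesture": "signed languages and the visible gestures and the signed languages",
    "cooking": "the soup and the bread and the wine",
    "travel": "the road and the river and the hills",
}
scores = sip_scores("gesture", corpus)
for phrase, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{score:.2f}  {phrase}")
```

In this toy corpus "and the" scores below 1 because it is common everywhere,
while every phrase found only in the "gesture" text gets the same high score,
whether it occurs once or twice. A bare ratio therefore cannot be the whole
story; how raw counts and corpus size enter the calculation is exactly what
remains unstated.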

This new device prompted me to the following reaction to the Amazon
Book team:

(1) Well, statistics of what? That is my first question. I suggest you supply
basic statistics about the source of your SIPs:

- how many books/titles have you indexed
- are these full text indexes or just indexes of the 'inside the book'
  pages you do supply on the web
- how many million words
- how many sentences

When these figures are not given, the SIPs are like the manipulative percentages
of a census or opinion poll quoted without the total number of people on which
they are based.
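
The figures asked for are simple to produce once the texts are at hand. A
minimal sketch, assuming the indexed texts are available as a dict of plain
strings; the function name, the naive sentence splitting and the sample texts
are my own assumptions, not anything Amazon has published.

```python
import re

def corpus_statistics(corpus):
    """Return title, word and sentence counts for a dict of book id -> text."""
    n_words = 0
    n_sentences = 0
    for text in corpus.values():
        # Words: whitespace-separated tokens.
        n_words += len(text.split())
        # Sentences: naive split on runs of '.', '!' or '?'.
        parts = re.split(r"[.!?]+", text)
        n_sentences += len([p for p in parts if p.strip()])
    return {"titles": len(corpus), "words": n_words, "sentences": n_sentences}

# Sample texts, invented for illustration.
sample = {
    "book-a": "Signed languages are languages. Gestures matter!",
    "book-b": "Is grammar a module? Some say so.",
}
stats = corpus_statistics(sample)
print(stats)
```

Published next to the SIPs, totals of this kind would give each phrase count
the denominator it now lacks.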

(2) Though it might seem stupid to say, I would also like you to state explicitly
that these SIPs are generated <automatically> according to a certain algorithm, and
to explain in more detail what that algorithm entails.

(3) Just as people have been trying to move 'up' Google's search engine
rankings, an unanticipated effect might be that writers, editors and publishers
check a new text before publication for the occurrence of SIPs and make
alterations to get a higher score. This might produce a text that is more
"outstanding" only in the statistical sense.

(4) We still need to value ourselves, us humans, the most, because we are the only
ones that can 'read' (machines can process text all right, but there is no
form of understanding in the sense that each human reader becomes a re-writer when
"processing" a text in her or his personal way). The readers' reviews on your
website do give that kind of understanding and are often very helpful in learning
about a book and its reception. Recently I started to archive some of the Amazon
Books customer reviews in my bibliographical database. The on-line reader reviews
are part of a very old tradition, like the Renaissance 'commonplace books' and the
Greek/Roman 'hypomnemata', filled with quotations and remarks that students would
make to keep for themselves and show to each other.

The value of readers' comments lies in the rephrasing and synthesizing of the
content of a book, something that can only be achieved by 'reading'.

The mechanisms of 'rating', choosing the top ten, hundred or whatever, are an
undeniable part of our market-oriented culture. Still, even in a purely
commercial setting like Amazon Books, there can be a prominent place for personal
exchange of opinions between 'real readers', beyond any automated statistics. An
exchange that allows for both praise and critique outside the realm of
professional and commercial reviewing.

(5) SIP ratings may well develop into a useful search instrument, but let it be
well understood that they are just a product of 'machine processing', a
secondary tool at most.

Tjebbe van Tijen

Imaginary Museum Projects
dramatizing historical information

#  distributed via <nettime>: no commercial use without permission
#  <nettime> is a moderated mailing list for net criticism,
#  collaborative text filtering and cultural politics of the nets
#  more info: majordomo {AT} bbs.thing.net and "info nettime-l" in the msg body
#  archive: http://www.nettime.org contact: nettime {AT} bbs.thing.net