Douglas Bagnall on Fri, 27 May 2016 16:11:09 +0200 (CEST)



Re: <nettime> artfcity: Turbulence.org Going Offline


As it happens, according to some measures[1][2], I am the reigning
world champion of amateur author identification, a hobby I picked up a
couple of years ago in response to a particular political situation in
New Zealand.

On 26/05/16 03:22, t byfield wrote:

> Stylometry tries to measure a tiny handful of aspects of the parts we
> think we understand.

And for this reason it failed badly, and the term became an
embarrassment. Trying to hand-pick stylistic features (because we think
we understand them) is a ridiculous practice. There is really very little
information in text -- you can count the bytes, and they are sparsely
packed -- so the trick is to avoid discarding it prematurely. Use a
model that can find the patterns where they lie, and which makes sense
in terms of information theory. That will send you toward the right
asymptote. Even so, questions about authorship will remain uncertain:
most of the very little information available in text is actually
devoted to conveying and contextualising the "message", and not to
leaking identity.
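
As a rough illustration of "a model that makes sense in terms of
information theory", here is a toy sketch: train a character-level
n-gram model on text of known authorship, then measure how many bits
per character it needs to encode a disputed text. Lower means the
disputed text is more predictable under that author's model. This is
not the method behind the results in [1][2]; the function names, the
order-2 context and the add-one smoothing are all assumptions made for
the example.

```python
import math
from collections import Counter


def train_char_model(text, order=2):
    """Count how often each character follows each `order`-character context."""
    context_counts = Counter()
    ngram_counts = Counter()
    padded = " " * order + text
    for i in range(order, len(padded)):
        ctx = padded[i - order:i]
        context_counts[ctx] += 1
        ngram_counts[ctx + padded[i]] += 1
    alphabet = set(padded)
    return ngram_counts, context_counts, alphabet, order


def bits_per_char(model, text):
    """Average cost, in bits per character, of encoding `text` with `model`."""
    if not text:
        raise ValueError("text must not be empty")
    ngram_counts, context_counts, alphabet, order = model
    # Add-one smoothing; one extra slot is shared by all unseen characters,
    # which is crude but enough for a toy comparison.
    vocab = len(alphabet) + 1
    padded = " " * order + text
    total_bits = 0.0
    for i in range(order, len(padded)):
        ctx = padded[i - order:i]
        p = (ngram_counts[ctx + padded[i]] + 1) / (context_counts[ctx] + vocab)
        total_bits -= math.log2(p)
    return total_bits / len(text)


# Hypothetical usage: the strings stand in for real corpora.
known = "the cat sat on the mat. " * 20
model = train_char_model(known)
print(bits_per_char(model, "the cat sat on the mat."))
print(bits_per_char(model, "zzzz qqqq xxxx jjjj"))
```

No features are chosen by hand here: whatever regularities exist in
the character stream are available to the model. The scores only mean
something relative to each other, for example the same disputed text
scored against models trained on different candidate authors.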

> But Anonymouth is just one of many stylometric projects. Many will
> result in libraries, and many of those libraries will be included in the
> neatly packaged-up software tools -- mainly for identifying speakers and
> attributing utterances. Over time, and probably not very much of it,
> this 'many eyes' effect will outstrip the artisanal editorial skills of
> the kind I mentioned. So, on a certain level, the orientation,
> sophistication, and quality of Anonymouth is immaterial to the fact that
> writing is becoming biometric. And this issue will be a properly
> informational problem, not in the simplistic sense of "we have 'vast
> amounts' of data" but in the more classical sense of measuring the
> reduction of uncertainty -- in this case the uncertainty of whether X
> wrote Y or Y was written by X (which are completely different
> questions).

Yes.

Vast amounts of data can fall two ways. If you have a lot of text from
the target sources (deciding, say, whether $FAMOUS_AUTHOR wrote
$ANONYMOUS_BOOK) the evidence piles up nicely. But if you are trying
to figure out which of a million suspects is running a naughty
Twitter account, the combinatorics are against you -- even if you are
really, really big and clever. If you narrow those suspects down to a
handful, the game is back on. Thus my opsec advice to Phineas Phisher
and the Panama Papers entity would be to cut back on the manifestos.
They can't be picked out of the multitude, but it won't help if they
ever make the shortlist.
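
The shortlist effect can be put in numbers with a small Bayesian
sketch. Assume every suspect is equally likely a priori, and that the
stylistic evidence is some fixed factor more probable if a given
suspect wrote the text than if any one of the others did. The factor
of 1000 below is an invented, illustrative figure, not a measured one,
and the function name is hypothetical.

```python
def posterior_for_suspect(n_suspects, likelihood_ratio):
    """Posterior probability that one particular suspect is the author.

    Assumes a uniform prior of 1/n_suspects and evidence that is
    `likelihood_ratio` times more probable under this suspect than
    under each of the others.
    """
    if n_suspects < 1 or likelihood_ratio <= 0:
        raise ValueError("need n_suspects >= 1 and likelihood_ratio > 0")
    # Bayes' rule with the common 1/n_suspects factor cancelled out.
    return likelihood_ratio / (likelihood_ratio + n_suspects - 1)


# Same strength of evidence, very different pools of suspects.
print(posterior_for_suspect(1_000_000, 1000))  # about 0.001
print(posterior_for_suspect(5, 1000))          # about 0.996
```

The evidence is identical in both calls; only the size of the pool
changes, which is why making the shortlist matters more than the
manifestos themselves.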

All the money seems to be in "author profiling", which is of course
that boring stuff about targeting ads without fixing identity.

It turned out in the case of the scandal that led me into this field
that I could find patterns of deception, but this was not nearly as
convincing as the already available facts (you know, leaked emails).
And, of course, I was months late.

cheers,
Douglas

[1] http://www.uni-weimar.de/medien/webis/events/pan-15/pan15-papers-final/pan15-authorship-verification/stamatatos15-overview.pdf
[2] http://arxiv.org/abs/1506.04891

#  distributed via <nettime>: no commercial use without permission
#  <nettime>  is a moderated mailing list for net criticism,
#  collaborative text filtering and cultural politics of the nets
#  more info: http://mx.kein.org/mailman/listinfo/nettime-l
#  archive: http://www.nettime.org contact: nettime@kein.org
#  @nettime_bot tweets mail w/ sender unless #ANON is in Subject: