Eugene Thacker on 15 Oct 2000 05:05:59 -0000



<nettime> GEML: Gene Expression Markup Language


GEML: Gene Expression Markup Language

Eugene Thacker


"We tried to create what we called a discovery system instead of a data
retrieval system. A data retrieval system is something that works if you
get the data out that you put in. A discovery system is one that finds
connections you didn't know about before."
James Ostell, NCBI (National Center for Biotechnology Information)


Despite the Web's talent for reducing the body to a clicking finger and
scanning eye, there's no shortage of bodies of all types on the Web
itself, from public bodies on outdoor Web-cams, to the private bodies of
video chat, to the modular bodies of streaming porn. But recently, a
very special kind of Web-body has emerged. It isn't a body that's
visually represented over the Web, but rather a body that is directly
encoded into computer databases and through software applications. It
isn't a file that's referred to, the way a Webpage refers to a Flash
movie or a Quicktime clip, but it is, in itself, a code, a programming
language. 

This past September, Rosetta Inpharmatics - their name itself
significant - a bioinformatics company from the Pacific Northwest,
announced that it had developed a cross-platform standard for
computer-based molecular genetics and biotech research. Called Gene
Expression Markup Language - or GEML - this markup language is
designed to ease the file-format and compatibility problems that
currently hamper computer-based biotech research. Now researchers in biotech
can, with the ease of cross-platform standardization, work between
widely different biological databases, from Celera's human genome
sequence, to the SWISS-PROT protein database, to the Human Gene Mutation
database. All online, all digital, and now all in a standardized backend
markup language. GEML is based upon XML - or Extensible Markup
Language - and operates independently of any particular database file
format or schema. GEML manages two types of genetic data - genetic patterns
(gene expression, or analyses of which sets of genes are switched on or
off, and which may include biochemical pathway information or
gene-protein relationships) and genetic profiles (digital scans of
microarray chips, which are used to speedily and efficiently analyze
genetic samples). GEML can keep meticulous track of a given piece of
genetic data, noting what the original file format was, where the data
was retrieved from, the type of database search method used, and the
operations performed with a given data file - presumably biotech
research, like Photoshop, now includes several levels of "undo" as well.
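
To make the record-keeping described above concrete, here is a minimal
sketch in Python of an XML document that carries its own history and of
code that reads that history back out. The element and attribute names
(expression_record, provenance, operation, pattern) are invented for
illustration and are not taken from the published GEML specification;
only the general XML approach comes from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: these element and attribute names are assumptions
# made for illustration, NOT the actual GEML vocabulary.
RECORD = """
<expression_record id="example-1">
  <provenance>
    <original_format>tab-delimited</original_format>
    <source>example-database</source>
    <search_method>keyword</search_method>
    <operation step="1">normalize</operation>
    <operation step="2">filter</operation>
  </provenance>
  <pattern gene="geneA" state="on"/>
  <pattern gene="geneB" state="off"/>
</expression_record>
"""


def summarize(xml_text):
    """Return the provenance trail and the switched-on genes of a record."""
    root = ET.fromstring(xml_text)
    prov = root.find("provenance")
    # Order the operations by their step number so the history reads in sequence.
    operations = sorted(prov.findall("operation"),
                        key=lambda op: int(op.get("step")))
    return {
        "original_format": prov.findtext("original_format"),
        "source": prov.findtext("source"),
        "search_method": prov.findtext("search_method"),
        "operations": [op.text for op in operations],
        "genes_on": [p.get("gene") for p in root.findall("pattern")
                     if p.get("state") == "on"],
    }


summary = summarize(RECORD)
print(summary)
```

The point of the sketch is only that the data and the account of what
has been done to the data travel together in one file, which is what
makes something like an "undo" history thinkable in the first place.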

The impetus for the development of universal standards like GEML is the
emerging field of bioinformatics. Put briefly, bioinformatics is the use
of computer and networking technology to handle large amounts of genetic
data. Uses of computer databases in molecular biology go back to the
1970s, but with the development of advanced computer processing power
and the Internet, researchers began discovering that they could
potentially do a lot more than just catalogue data. The deluge of genomic
data generated from the human genome project has made it necessary to
develop more sophisticated, standardized means of managing that
information. But, once you've encoded the genetic body, you make an
incredibly important shift, from genetic "codes" to computer "codes."
The former you could still experiment with in the lab, using recombinant
DNA techniques, plasmids, vectors, and other micro-organismic tools. But
a shift (or an uploading) to computer codes adds another set of
requirements to that genetic data. It demands, first of all, an
abstraction of the body (already abstracted at a first, bio-scientific
level through the discourse of genetic information) to the level of
binary on-off switches, translating genetic molecules into pulses of light.
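
As a rough illustration of that second abstraction, the following Python
sketch packs a string of bases into bits and unpacks it again. The
two-bits-per-base mapping is one arbitrary choice among several, and the
function names are hypothetical; no particular database is claimed to
store sequences this way.

```python
# Assumed mapping: two bits per base. The assignment of bit patterns to
# letters is arbitrary and chosen only for this example.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = {value: base for base, value in CODES.items()}


def encode(seq):
    """Pack a sequence of A/C/G/T into an integer, two bits per base."""
    bits = 0
    for base in seq:
        bits = (bits << 2) | CODES[base]
    # The length is returned too, since leading 'A's (00) would otherwise be lost.
    return bits, len(seq)


def decode(bits, length):
    """Unpack an integer produced by encode() back into a base string."""
    out = []
    for i in range(length):
        shift = 2 * (length - 1 - i)
        out.append(BASES[(bits >> shift) & 0b11])
    return "".join(out)


packed, n = encode("GATTACA")
print(bin(packed), n)
print(decode(packed, n))
```

The round trip works only within the computer: decode() returns letters,
not molecules, which is precisely the gap at issue here.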

At this level, we get a strange mixture of two systems - a
genetic/cellular one and a computer/network one. With sophisticated
online databases, genomic analysis software (including the use of
intelligent agents and data mining protocols), and microarrays or DNA
chips, bioinformatics is promising to automate the gene discovery
process. Set up your search parameters, load in a blank CD-R, and press
enter - now go home, relax, and come back tomorrow morning, when the
results of the search will tell you if there are any hits, what novel
genes and/or drug targets have been isolated as candidates, what their
expression patterns are, what pathways they're involved in, and whether
or not patents currently exist for those genes. The recent race to
sequence the human genome would not have been possible without the
technical advancements made by tools-companies such as Perkin-Elmer (the
primary provider of automated DNA sequencing computers). And the recent
interest of the computer industry in biotech (Sun, IBM, Compaq,
Motorola) has made bioinformatics a research field in itself. Last
spring, the investment research firm Oscar Gruss projected that within
five years, bioinformatics' market value could exceed $2 billion.

All of which is to say that, with biotech research, the computer is no
longer just a tool. It has become that, and much more, extending its
range of operations, bringing in computer science and programming, and
transforming the "wet" biology lab into a networked computer lab.
Unfortunately, researchers and companies still seem to treat
bioinformatics tools, such as the GEML language, as neutral instruments,
something that will transparently aid in the advancement of science
research, without fundamentally altering research itself or the objects
of study. When a GEML-based application accesses several genomic
databases, is it also accessing genetic bodies? Or is this simply
data acting on data, and if so, where exactly are the points of
connection to material, biological bodies? No one is asking whether such
bioinformatics techniques fundamentally change our notion of what the
body is, or whether the level of complexity that bioinformatics can deal
with will fundamentally challenge traditional bioscience and genetics
research. 

This places an "object" like GEML in a very strange position. On the one
hand, GEML, as a markup language, refers to or points to something
in its tags, the same way that the tag <IMG SRC="mycells.jpg"> points to
a digital image of my cells on a Web page. In this sense, GEML operates
not only as HTML does (with referring tags and attributes), but it also
operates according to the traditional signifier-signified relationships that
characterize modern linguistics. Only, the "thing" the language points
to is another type of data, a genetic code, itself with its own set of
rules and protocols for functioning. This also means that, as an
XML-based language, GEML is developed from the ground up, so to speak,
so that the types of tags and attributes used, as well as their
interrelationships, will be dictated by the ways in which genetic data
itself operates. Each GEML document needs to be tied to a Document Type
Definition, or DTD file, which declares the types of elements and
attributes that may be used. This DTD file will be based, in the case of
GEML, on the ways in which genetic code operates in the body - that is,
according to sequences for genes, chromosomal positioning, gene-protein
relationships, promoter-terminator regions, splice variants, gene
polymorphisms, and so on. In other words, the DTD file for GEML is based
on the current state of knowledge in biotech research - how reductive or
complex that knowledge is, how rigid or flexible it is, etc. What's
being produced with GEML, then, is a kind of meta-code for approaching
the molecular code of the genome. In a sense GEML doesn't add or
modify anything in the genome - it is not genetic engineering in
the traditional sense of the term.
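
The way a DTD fixes in advance what can be said may be easier to see in
a small sketch. The Python below is a hand-rolled stand-in for the idea,
not real DTD syntax or real DTD validation (Python's standard-library
XML parsers do not validate against DTDs). The content model, the
element names, and the violations() helper are all hypothetical and are
not drawn from the GEML DTD itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model, standing in for a DTD:
# element name -> (allowed child elements, required attributes).
# These names are assumptions for illustration, not the GEML DTD.
MODEL = {
    "gene": ({"sequence", "position", "variant"}, {"name"}),
    "sequence": (set(), set()),
    "position": (set(), {"chromosome"}),
    "variant": (set(), {"kind"}),
}


def violations(elem):
    """List the ways an element tree departs from MODEL (empty if none)."""
    if elem.tag not in MODEL:
        return [f"unknown element <{elem.tag}>"]
    allowed, required = MODEL[elem.tag]
    problems = []
    for attr in sorted(required - set(elem.attrib)):
        problems.append(f"<{elem.tag}> is missing attribute '{attr}'")
    for child in elem:
        if child.tag not in allowed:
            problems.append(
                f"<{child.tag}> is not allowed inside <{elem.tag}>")
        else:
            problems.extend(violations(child))
    return problems


good = ET.fromstring(
    '<gene name="geneA"><sequence>ATGC</sequence>'
    '<position chromosome="7"/></gene>')
bad = ET.fromstring('<gene><sequence>ATGC</sequence><protein/></gene>')

print(violations(good))
print(violations(bad))
```

In this toy model a gene-protein relationship simply cannot be
expressed, because nothing in MODEL allows for it. A schema is only as
rich as the knowledge written into it.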

This interrelationship between molecular genetics and computer science
means that bioinformatics will only be as complex, technically
sophisticated, and potentially transformative as its DTD file - or the
types of knowledge input into bioinformatics code. The conventional
truism of molecular genetics - that, in a causal, linear fashion, "DNA
makes RNA makes protein" - will only produce a bioinformatics of that
level of complexity. However, as researchers and laboratories have been
acknowledging, most diseases and phenotypic markers are the product of
multiple genetic triggers and multiple biochemical pathways, not to
mention networked interactions with context or environment. This is why
the most interesting alternative approaches within biotech research -
such as systems biology - have demanded that both molecular genetics and
computer science transform the discourse of genetics and biotech, moving
away from the over-determinism of single-gene theories, and towards more
distributed and networked approaches.

There are hundreds of biological databases in existence, from human
genome to protein to human gene mutation to tissue banks, some owned by
research institutes, some owned by universities, some owned by
corporations. In all of this, no one has addressed a basic question:
where is "the body"? Or better, where is "the biological"? Addressing
this question means going back to the fundamentals of molecular
genetics, when the discourse of the "genetic code" first began gaining
momentum. This is not just a scientific issue, but an issue concerning
the possible tensions between bodies and machines, biologies and technologies.

The central compatibility issue for bioinformatics approaches such as
GEML is not that between different computer-based genomic databases.
Basically, the databases of Celera, DoubleTwist, or the public
consortium all consist of digital files encoding sequences of As, Ts,
Cs, and Gs, themselves encoded from a series of DNA samples from
anonymous human donors. Creating trans-database compatibility is just a
matter of writing more code. The real challenge - not just a technical
one, but a philosophical one, an ontological challenge - is to create
that same compatibility between "wet" cells and silicon databases,
between genetic "codes" and computer "codes."

After all, DNA in a blood sample is not a computer database...or is it?
After all, you can encode DNA from a blood sample into a digital format,
but not the other way around...or can you?


Links:

Celera Genomics <http://www.celera.com>.

Fikes, Bradley. "Bioinformatics Tries to Find a Common Means to Express
Biological Data." DoubleTwist (22 September 2000): <http://www.doubletwist.com>.

HUGO Mutation Database Initiative <http://ariel.ucs.unimelb.edu.au:80/~cotton/mdi.htm>.

Primer on Molecular Genetics (U.S. Dept. of Energy) <http://www.bis.med.jhmi.edu/Dan/DOE/intro.html>.

Rosetta Inpharmatics <http://www.rosettainpharmatics.com>.

SWISS-PROT <http://www.expasy.ch/sprot>.


 

Eugene Thacker
e: maldoror@eden.rutgers.edu
w: http://gsa.rutgers.edu/maldoror/index.html
Pgrm. in Comparative Literature, Rutgers Univ.

CURRENT:
"Participating in the Biotech Industry: 
Notes on the Gene Trust" @ nettime: 
<http://www.nettime.org>.

"The Post-Genomic Era Has Already Happened"
@ Biopolicy Journal <http://bioline.bdt.org.br/py>

"SF, Technoscience, Net.art: The Politics 
of Extrapolation" @ Art Journal 59:3 
<http://www.collegeart.org/caa/publications/AJ/artjournal.html>

"Point-and-Click Biology: Why Programming is
the Future of Biotech" @ MUTE (Issue 17 - archives
at http://www.metamute.com)

"Fakeshop: Science Fiction, Future Memory & the
Technoscientific Imaginary" @ CTHEORY
<http://www.ctheory.com>

also:
FAKESHOP <http://www.fakeshop.com>


#  distributed via <nettime>: no commercial use without permission
#  <nettime> is a moderated mailing list for net criticism,
#  collaborative text filtering and cultural politics of the nets
#  more info: majordomo@bbs.thing.net and "info nettime-l" in the msg body
#  archive: http://www.nettime.org contact: nettime@bbs.thing.net