Processing data, information and knowledge
Here's an episode from my column that mentions Architext and NSA's
statistical text searching techniques. Try the 'concept search' at hotwired
(www.wired.com) or Time Inc's Pathfinder (www.timeinc.com), or look at the
demo at Architext - www.atext.com.
Rishab
---
Electric Dreams
Weekly column for The Asian Age by Rishab Aiyer Ghosh
#40, 12/December/1994: Processing data, information and knowledge
Computers are good at processing data. Juggling numbers,
indexing names and addresses, these are the rudimentary
tasks upon which much of the world's infrastructure
depends. Computers are mediocre at processing information,
the layering of data with complex inter-relationships. But
they are simply pathetic at handling knowledge, the models
based on piles of information used to understand and
predict an aspect of the world around us, expressed by
humans not in tables and charts, but in articles and
books.
Computers are organized. They can understand streams of
homogeneous inputs, they can follow links between data
that are made clear and detailed. This preference for
structure makes it somewhat difficult to get computers to
process more naturally expressed concepts and knowledge
embodied in human-language text.
Passing over the entirely academic debate about the
ability or otherwise of machines to ever understand human
ideas, the fact is that most attempts at getting computers
to process or aid in processing such ideas have
concentrated on making computers 'artificially
intelligent' - making them form their own structured model
of relatively unstructured text.
Computer systems for natural language processing try to
find meaning in a text by translating it into some
internal representation, with the aid of a detailed
grammar-book far more explicit than most humans could
bear. Most natural language processing is either too slow,
too inaccurate, or too limited to a particular human
language or set of concepts to be practically useful on a
large scale. While it may be pretty good for simple voice-
based interfaces, NLP is unlikely in the near future to be
able to, for instance, quickly go through 2 years of Time
magazine and identify the US government's changing policy
on the war in Bosnia.
While NLP begins with the assumption that machines need
some sort of understanding to process text, other methods
concentrate more on practical applications. These usually
abandon any attempt to search for a structure in textual
inputs, and rely instead on identifying a vague pattern.
Neural networks, which try to simulate the working of the
brain, are frequently used to identify patterns in images,
sounds and financial data. Though they are often quite
successful at their limited tasks, they are not normally
used to process text. One reason for this is perhaps that
text processing has two main uses, and neural networks
suit neither. Text either needs to be interpreted in the
small chunks of conversation, which requires the knowledge
of grammar that conventional NLP provides, or it needs to
be organized in huge volumes, for which neural networks
are too slow.
The alternative comes, strangely enough, from the US
National Security Agency. It has always been suspected
that the NSA searches through e-mail traffic for
'sensitive' material, which for the large volumes involved
would require considerable help from computers. Earlier
this year, the agency began soliciting collaborations from
businesses to develop commercial applications of its
technique. It claimed to be able to quickly search through
large quantities of text, in any language, for
similarities to sample documents, and even automatically
sort documents according to topics that it identifies. A
similar though independently developed system is available
from California-based Architext.
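To give a feel for how text can be matched against a sample
document without any grammar or understanding, here is a
minimal sketch in Python. It is not the NSA's method or
Architext's, neither of which is described here in any
detail. It assumes one simple and well-known statistical
approach: count overlapping three-character sequences in
each text and compare the counts with a cosine measure. The
function names and the example documents are hypothetical,
chosen only for illustration.

```python
import math
from collections import Counter


def ngram_profile(text, n=3):
    """Count overlapping character n-grams, ignoring case and extra whitespace."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 to 1.0)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(sample, documents, n=3):
    """Return (score, document) pairs, most similar to the sample first."""
    sample_profile = ngram_profile(sample, n)
    scored = [(cosine_similarity(sample_profile, ngram_profile(doc, n)), doc)
              for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical documents, made up for this illustration.
    documents = [
        "The government announced a new policy on the war in Bosnia.",
        "Neural networks identify patterns in images and sounds.",
        "Stock prices fell sharply after the central bank's decision.",
    ]
    sample = "US policy on the Bosnian war keeps changing."
    for score, doc in rank_by_similarity(sample, documents):
        print(f"{score:.2f}  {doc}")
```

Because the sketch looks only at character sequences, it
needs no dictionary or grammar and works the same way on
text in any language, which is the property claimed for
the statistical systems above. Grouping documents by topic
could, under the same assumption, be done by comparing the
documents with each other rather than with a sample.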
Though statistical techniques for text processing are not
entirely new, the continuing development in the area is a
sign of the growing use of computers as knowledge-
processing aids. By identifying patterns more-or-less
blindly, without any attempt at understanding the concepts
they represent, they can help us make some sense of the
ocean of information that otherwise threatens to swamp us.
Rishab Aiyer Ghosh is a freelance technology consultant
and writer. You can reach him through voice mail (+91 11
3760335) or e-mail ([email protected]).
--====(C) Copyright 1994 Rishab Aiyer Ghosh. ALL RIGHTS RESERVED====--
This article may be redistributed in electronic form only, PROVIDED
THAT THE ARTICLE AND THIS NOTICE REMAIN INTACT. This article MAY NOT
UNDER ANY CIRCUMSTANCES be redistributed in any non-electronic form,
or redistributed in any form for compensation of any kind, WITHOUT
PRIOR WRITTEN PERMISSION from Rishab Aiyer Ghosh ([email protected])
--==================================================================--