
Skeptical About the Technical Feasibility of TIA

The federal government’s Total Information Awareness project has been talked about for months by cross-discipline geniuses with appropriate amounts of fear, skepticism, and disgust. It is a project with loose goals, an infinite capacity for abuse, and little public oversight, all directed by a five-time federal felon. It’s designed to collect random bits of information, both public and private, about American citizens, and to attempt to find patterns which could lead to preventing future terrorist attacks. The screw-turning irony enhancers have been on the job, too, showing how, even with information currently available in the public domain, anyone could pinpoint on maps and satellite photographs the home of John M. and Linda Poindexter of 10 Barrington Fare, Rockville, MD, 20850. John is the man in charge of the TIA project, and a familiar name if you were alive (or awake) in the Eighties. Now, the subtext says, imagine what kind of abuse there could be of a secret database which collects everything about you: every single fact which can be found in public, private, commercial, governmental, or institutional databases.

As much as this idea repulses me, one part of me (the geek part which reads O’Reilly and Slashdot, not the skeptical part which reads Huxley and Orwell) wants to see such a system in action. Wired says, “TIA program directors make it clear they also believe the task to be beyond current technology.” So do I. And I’ll tell you why: even with terabytes of data and the brightest minds in the data-mining world, I don’t see how they could have a sufficient data set to train such a system. I want to see a proof of concept.

Consider Bayesian spam filtering, the common nethead’s technical fad of the last six months. To train a Bayesian spam filter, you have to show it both spam (messages considered negative) and ham (non-spam messages considered positive), and lots of each, totaling thousands of messages. After that, the filter will apply itself to incoming messages and judge whether they are more likely to be spam or ham. To keep such a filter accurate, all new data entering the system is rated, so the filter keeps up with trends, such as spammers using new techniques to defeat it. Such a basic system can accurately judge 99 percent of messages. (A rough sketch of the mechanics appears below.)

Now, this is an oversimplification, but according to what I have read, the TIA would work in a similar fashion. Only the TIA is looking for terrorists, not spam.

My doubts about the technical abilities of such a system derive from this: you have to throw a large corpus of data at it, both good data and bad. Given the relatively small number of terrorist attacks in the history of the world, and the variety of methods used, how could we possibly accrue a large enough sample of terrorist “spam”? I think the results are bound to be skewed. And so I predict that, to correct for this, a large artificial data set will be created of markers which are believed to signal the presence of terrorist activity, thus influencing or biasing the discovery of new, previously unknown markers by the TIA. In other words, TIA will be weighted towards the same kinds of institutional and professional judgment biases which have led to the various intelligence failures to date.
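For the curious, here is a minimal sketch of the Bayesian approach, in Python. Nothing below comes from the TIA documents or from any particular spam filter; the class, its method names, and the toy training messages are all my own inventions for illustration:

    import math
    from collections import Counter

    class NaiveBayesFilter:
        """Toy Bayesian text classifier: 'spam' vs. 'ham'."""

        def __init__(self):
            self.word_counts = {"spam": Counter(), "ham": Counter()}
            self.message_counts = {"spam": 0, "ham": 0}

        def train(self, message, label):
            # Count how often each word appears in each kind of message.
            self.message_counts[label] += 1
            self.word_counts[label].update(message.lower().split())

        def classify(self, message):
            total = sum(self.message_counts.values())
            n_vocab = len(self.word_counts["spam"] | self.word_counts["ham"])
            scores = {}
            for label in ("spam", "ham"):
                # Prior: the fraction of training messages with this label.
                score = math.log(self.message_counts[label] / total)
                n_words = sum(self.word_counts[label].values())
                for word in message.lower().split():
                    # Laplace smoothing so an unseen word doesn't zero everything out.
                    count = self.word_counts[label][word] + 1
                    score += math.log(count / (n_words + n_vocab))
                scores[label] = score
            return max(scores, key=scores.get)

    f = NaiveBayesFilter()
    f.train("cheap pills buy now", "spam")
    f.train("lunch at noon tomorrow", "ham")
    print(f.classify("buy cheap pills"))   # -> "spam"

Train it on a few thousand real messages instead of these two toy ones, and this same arithmetic is what produces accuracy figures like the 99 percent quoted above.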
It reminds me of a story someone told me on Saturday. When she lived in a bad part of Philadelphia, she was robbed. She called the cops; they came and got her, and they drove around looking for the man she had described: tall, black, skinny, with cornrows in his hair. The Philly cops, she said, stopped every black man they saw: short, fat, afroed. Then, at the police station, when she couldn’t identify the robber from mug shots, they would push a photo across the table and say, “Isn’t this the guy? Isn’t it? He’s a really bad guy.” Instead of using the new data, the police continued to work with the old. All they wanted was a flag from the victim which would confirm their preferences. And they wanted a result; coming back with nothing was unacceptable.

That’s a large part of the public fear of the TIA system: when you go looking for possible terrorists, you are bound to find as many of them as you want. In such a system we are all criminals waiting to be caught, because a suspected terrorist may only have matched a terrorist-predictive pattern (in a system weighted with biased data), rather than having actually committed acts in preparation for true terrorism. Run even a 99-percent-accurate test across a population of hundreds of millions, and the false positives will bury the handful of genuine plotters.

It’s a pyramid of suspicion: at the tiny point are the terrorists who have already committed terrorist acts. Just below that are the small number of terrorists with written plans, known plots, and dangerous matériel which most people would accept as sufficient evidence of the intention to do evil. But below that, what? An ever-widening base of suspicion without verification.

Poindexter says they will weight the system with invented data during tests: “To proceed with development, without intruding on domestic or foreign concerns, we are creating a data base of synthetic transactions using a simulation model. This will generate billions of transactions constituting realistic background noise. We will insert into this noise simulated transactions by a red team acting as a terrorist organization to see if we can detect and understand this activity.”

So what’s the point of such a predictive system if it is to be tested with invented data? If you already have difficulties in detecting terrorist activities, how can you successfully simulate them?
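Poindexter’s test is easy to mock up in miniature. Here is a hedged sketch, again in Python, of a synthetic-transaction experiment like the one he describes; every merchant category, amount, threshold, and count below is invented by me for illustration:

    import random

    random.seed(1)  # reproducible runs

    # Synthetic "background noise": ordinary transactions drawn at random.
    MERCHANTS = ["grocery", "gas", "rent", "flight", "fertilizer", "rental_truck"]

    def noise_transaction():
        return {"merchant": random.choice(MERCHANTS),
                "amount": round(random.uniform(5, 2000), 2)}

    def red_team_transaction():
        # The simulated terrorist cell buys from a scripted short list.
        return {"merchant": random.choice(["fertilizer", "rental_truck", "flight"]),
                "amount": round(random.uniform(500, 5000), 2)}

    def detector(txn):
        # A detector tuned, by hand, to the very pattern the red team
        # was told to produce. Circular by construction.
        return txn["merchant"] in {"fertilizer", "rental_truck"} and txn["amount"] > 400

    population = [noise_transaction() for _ in range(100_000)]
    red_team = [red_team_transaction() for _ in range(20)]

    flagged_innocent = sum(detector(t) for t in population)
    flagged_red = sum(detector(t) for t in red_team)

    print(f"red-team transactions flagged: {flagged_red} of 20")
    print(f"innocent transactions flagged: {flagged_innocent} of 100,000")

Notice the circularity: the detector catches the red team only because the red team and the detector were written from the same script, and meanwhile it flags tens of thousands of innocent transactions. A test like this can tell you whether the system finds a pattern you already know about; it cannot tell you whether it will find one you don’t.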