The federal government’s Total Information Awareness project has been talked about for months by cross-discipline geniuses with appropriate amounts of fear, skepticism and disgust. It is a project with loose goals, an infinite capacity for abuse, and little public oversight, all directed by a five-time federal felon. It’s designed to collect random bits of information, both public and private, about American citizens, and attempt to find patterns which could lead to preventing future terrorist attacks.

The screw-turning irony enhancers have been on the job, too, showing how even with currently available information in the public domain, anyone could pinpoint on maps and satellite photographs the home of John M. and Linda Poindexter of 10 Barrington Fare, Rockville, MD, 20850. John is the man in charge of the TIA project, and a familiar name if you were alive (or awake) in the Eighties. Now, the subtext says, imagine what kind of abuse there could be of a secret database which collects everything—every single fact which can be found in public, private, commercial, governmental or institutional databases—about you.

As much as this idea repulses me, one part of me (the geek part which reads O’Reilly and Slashdot, not the skeptical part which reads Huxley and Orwell) wants to see such a system in action. Wired says, “TIA program directors make it clear they also believe the task to be beyond current technology.” So do I. And I’ll tell you why: even with terabytes of data, and the brightest minds in the data-mining world, I don’t see how they could have a sufficient data set to train such a system. I want to see a proof-of-concept.

Consider Bayesian spam filtering, the common nethead’s technical fad of the last six months. With a Bayesian spam filter, you have to show the filter both spam (messages considered negative) and ham (non-spam messages considered positive), and lots of each, totaling thousands of messages.
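For the curious, the training step just described can be sketched in a few lines of Python. This is only a minimal naive Bayes sketch under my own assumptions, not the code of any real filter: the function names (`tokenize`, `train`, `spam_probability`), the 50/50 prior and the four-message corpus are all invented for illustration, and a real filter would want thousands of messages.

```python
import math
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def train(messages):
    """Count how often each word appears across a list of messages."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts

def spam_probability(text, spam_counts, ham_counts, prior_spam=0.5):
    """Naive Bayes estimate of P(spam | text) with add-one smoothing."""
    vocabulary = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values()) + len(vocabulary)
    ham_total = sum(ham_counts.values()) + len(vocabulary)
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for word in tokenize(text):
        if word not in vocabulary:
            continue  # a word never seen in training is no evidence either way
        log_spam += math.log((spam_counts[word] + 1) / spam_total)
        log_ham += math.log((ham_counts[word] + 1) / ham_total)
    # Turn the two log scores back into a single probability.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

# Invented toy corpus, far too small for real use.
spam = ["cheap pills buy now", "buy cheap watches now"]
ham = ["lunch on saturday", "notes from the meeting on saturday"]
spam_counts, ham_counts = train(spam), train(ham)

print(spam_probability("buy cheap pills", spam_counts, ham_counts))
print(spam_probability("lunch on saturday", spam_counts, ham_counts))
print(spam_probability("something entirely new", spam_counts, ham_counts))
```

The last call is the interesting one: a message made only of words the filter has never seen comes back at exactly the prior, because the filter has nothing to go on. It can only recognize what it has been shown.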
After that, the filter will apply itself to incoming messages and judge whether they are more likely to be spam or ham. To keep such a filter accurate, all new data entering the system is rated to keep up with trends in data, such as spammers using new techniques to defeat the filters. Such a basic system can accurately judge 99 percent of messages.

Now, this is an oversimplification, but according to what I have read, the TIA would work in a similar fashion. Only the TIA is looking for terrorists, not spam. My doubts about the technical abilities of such a system derive from this: You have to throw a large corpus of data at such a system, both good data and bad. Given the relatively small number of terrorist attacks in the history of the world, and the variety of methods used, how could we possibly accrue a large enough sample of terrorist “spam”? I think the results are bound to be skewed.

And so I predict that, to correct for this, a large artificial data set will be created of markers which are believed to signal the presence of terrorist activity, thus influencing or biasing the discovery of new, previously unknown markers by the TIA. In other words, TIA will be weighted towards the same kinds of institutional and professional judgment biases which have led to the various intelligence failures to date.

It reminds me of a story someone told me on Saturday. When she lived in a bad part of Philadelphia, she was robbed. She called the cops, they came and got her, and they drove around looking for the man she described as tall, black, skinny, with cornrows in his hair. The Philly cops, she said, stopped every black man they saw: short, fat, afroed. Then, at the police station, when she couldn’t identify the robber from mug shots, they would push a photo across the table and say, “Isn’t this the guy? Isn’t it? He’s a really bad guy.” Instead of using the new data, the police continued to work with old.
All they wanted was a flag from the victim which would confirm their preferences. And they wanted a result. Having no result was unacceptable.

That’s a large part of the public fear of the TIA system: when you go looking for possible terrorists, you are bound to find as many of them as you want. In such a system we are all criminals waiting to be caught, because a suspected terrorist may have only matched a terrorist-predictive pattern (in a system weighted with biased data), rather than have actually committed acts in preparation for true terrorism.

It’s a pyramid of suspicion: at the tiny point are the terrorists who have already committed terrorist acts. Just below that are the small number of terrorists who have written plans, known plots and dangerous matériel which most people would accept as sufficient evidence of the intention to do evil. But below that, what? An ever-widening base of suspicion without verification.

Poindexter says they will weight the system with invented data during tests: “To proceed with development, without intruding on domestic or foreign concerns, we are creating a data base of synthetic transactions using a simulation model. This will generate billions of transactions constituting realistic background noise. We will insert into this noise simulated transactions by a red team acting as a terrorist organization to see if we can detect and understand this activity.”

So what’s the point of such a predictive system if it is to be tested with invented data? If you already have difficulties in detecting terrorist activities, how can you successfully fake such activities to test against?

Of course, I hope I’m wrong. But consider this: if we were truly capable of collecting, managing and predicting based upon large data sets, wouldn’t we have the stock market cracked by now?
Nowhere else in current or past worlds has there been such a concentrated, distributed effort to gather and analyze data in order to make predictions with the hope of beating the averages. That same data set is also under constant scrutiny for atypical patterns which stand out against randomized background noise and tidal market shifts and are likely to be criminal. Yet, we find that individual agents—such as dishonest accountants, inside traders and junk-pushers—are capable of hiding within the system at least long enough to commit crime.

The Electronic Privacy Information Center expresses similar doubts in its comments on proposed screening of consumer aviation and ticket-purchase records: “Fraud management systems rely on capturing deviations from the norm—a challenging task even when tracking a relatively simple problem of credit card fraud. Neural networks at the core of data mining programs rely on a very large number of examples of deviance to ‘train’ the system, it is unclear what examples the TSA will use and whether those examples are reliable indicators of future terrorist action. Even if the system were used to find non-threats, it is not clear what criteria would go into developing a non-threat model and whether the system might operate discriminatorily or punish non-conformity. The tolerance for failure and imprecision in the law enforcement context is significantly different, and the stakes for misidentification are not trivial.”

When this system is active, we should demand the following, either through congressional committees or Freedom of Information Act requests:
— All sources of data used.
— Ratio of negative to positive hits.
— Types, quantity and quality of human verification of TIA hits.

That’s just a start. The better we are informed about the ways such a system works, the more capable we are of monitoring it.
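
To see why the ratio of negative to positive hits belongs on that list, here is a rough back-of-the-envelope calculation in Python. Every input is a made-up assumption for illustration, except the 99 percent figure, which I have borrowed from the spam-filter accuracy mentioned earlier. Nothing here describes how TIA actually performs, and the function name `screening_outcomes` is my own.

```python
def screening_outcomes(population, actual_threats, accuracy):
    """Return (true_hits, false_hits) for a screen that is right
    `accuracy` of the time on both threats and non-threats."""
    innocents = population - actual_threats
    true_hits = actual_threats * accuracy
    false_hits = innocents * (1.0 - accuracy)
    return true_hits, false_hits

# Hypothetical inputs: 300 million people screened, 1,000 real plotters among them.
true_hits, false_hits = screening_outcomes(300_000_000, 1_000, 0.99)
precision = true_hits / (true_hits + false_hits)

print(f"correctly flagged: {true_hits:,.0f}")
print(f"wrongly flagged:   {false_hits:,.0f}")
print(f"share of flags that are real: {precision:.4%}")
```

Under those invented numbers, a screen that is right 99 percent of the time flags roughly three million innocent people in order to catch fewer than a thousand real plotters. Change the inputs however you like; as long as real threats are rare, the wrongly flagged swamp the rest, which is why the human verification of hits matters so much.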
Posted April 15, 2003