Thursday, October 16, 2003

Text Mining and Web Log Analysis

"With the aid of text mining, Michael N. Liebman, a research director at the University of Pennsylvania, is exploring whether bearing a child later in life has any link to postmenopausal diseases."

"Text-mining programs, unlike search engines that display lists of documents that contain certain keywords, go further, categorizing information, making links between otherwise unconnected documents and providing visual maps (some look like tree branches or spokes on a wheel) to lead users down new pathways that they might not have been aware of."

"Currently these programs are used by academic researchers and companies, but information scientists expect that to change. Lower-cost text-mining tools eventually will be offered to ordinary people who want to dig into medical or political issues using public documents. Madan Pandit, an expert in text analysis in Bangalore, India, who runs a Web site called K-Praxis, has suggested that text mining could help people make sense of voluminous documents that are already on the Web, like the 858-page report on the congressional inquiry into intelligence failures regarding the 9/11 terrorist attacks."

As we develop online communities of practice that share observations and daily discoveries through technologies such as web logs, text mining tools could prove invaluable for analyzing disparate findings and pointing out possible relationships and new paths of inquiry. Although the two vendors mentioned in the article, ClearForest and SPSS, offer products in the $75,000-and-up range, less expensive products are already in development. Products like PolyAnalyst from Megaputer provide taxonomy-based categorization for approximately $2,100 (education pricing). Netica's Bayesian network tools can be purchased for as little as $285 (education pricing). Perhaps the ultimate solution will be a hybrid of some of these programs.
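To make the idea of "pointing out possible relationships" among web log entries a little more concrete, here is a minimal sketch in Python. It is not how ClearForest, SPSS, PolyAnalyst, or Netica work; it assumes a plain TF-IDF weighting with cosine similarity, which is one common way to link documents that share unusual terms. The sample posts, the function names, and the 0.1 threshold are all invented for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Hypothetical web log entries; the text is invented for illustration.
posts = {
    "post_a": "Late childbearing and bone density after menopause",
    "post_b": "Bone density loss reported in women after menopause",
    "post_c": "Congressional inquiry report on intelligence failures",
}


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Map each document name to a {term: TF-IDF weight} dictionary."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for words in tokens.values() for term in set(words))
    vectors = {}
    for name, words in tokens.items():
        counts = Counter(words)
        vectors[name] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_pairs(docs, threshold=0.1):
    """Return (name, name, score) for pairs scoring at or above the threshold."""
    vectors = tfidf_vectors(docs)
    pairs = []
    for a, b in combinations(sorted(docs), 2):
        score = cosine(vectors[a], vectors[b])
        if score >= threshold:
            pairs.append((a, b, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


if __name__ == "__main__":
    for a, b, score in related_pairs(posts):
        print(f"{a} <-> {b}: {score:.2f}")
```

With these three sample posts, only the two entries about bone density and menopause are linked; the entry about the congressional inquiry shares no terms with either. A real tool would add stemming, stop-word handling, and a taxonomy or probabilistic model on top of something like this.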

