| "Top 10 Data Mining Mistakes" |
| Abstract: Data Mining is still as much an art as a science, and fancy new tools make it easy to do wrong things with one's data even faster. We'll examine the major "cracks in the crystal ball" through case studies, both simple and complex, of (often personal) errors drawn from real-world consulting engagements. Best Practices for Data Mining will be (accidentally) illuminated by their (rarely described) opposites. These common errors range from allowing anachronistic variables into the pool of candidate inputs to subtly inflating results through early up-sampling. You'll hear cautionary tales of endangered projects and embarrassed teams - but also the keys to avoiding such a fate yourself. |
| 2:00 pm ~3:00 pm, Monday, November 28, 2005 |
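As an illustration of the "early up-sampling" error named in the abstract above, here is a minimal sketch. It is not taken from the talk: the toy data set, the helper names (`upsample`, `split`, `leaked_ids`) and the 70/30 split are all assumptions made for the example. It shows that duplicating minority-class records before the train/test split lets copies of the same record land on both sides, which can inflate test results, while splitting first keeps the test set clean.

```python
import random

def upsample(rows, rng):
    """Balance classes by re-drawing minority rows with replacement.
    Rows are (record_id, label) pairs; this helper is hypothetical."""
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(rows)  # nothing to duplicate
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(rows) + extra

def split(rows, test_fraction, rng):
    """Shuffle and cut into (train, test)."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def leaked_ids(train, test):
    """Record IDs that occur in both the training and the test set."""
    return {r[0] for r in train} & {r[0] for r in test}

rng = random.Random(0)
# Toy data (assumed): 10 positive records (ids 0-9) and 90 negatives.
data = [(i, 1 if i < 10 else 0) for i in range(100)]

# Risky order: up-sample first, then split.
train_a, test_a = split(upsample(data, rng), 0.3, rng)
early_leaks = leaked_ids(train_a, test_a)

# Safer order: split first, then up-sample the training part only.
train_b, test_b = split(data, 0.3, rng)
train_b = upsample(train_b, rng)
late_leaks = leaked_ids(train_b, test_b)

print("records shared by train and test (up-sample first):", len(early_leaks))
print("records shared by train and test (split first):", len(late_leaks))
```

Only the duplicated minority records can leak in this sketch, since every other record ID occurs once.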
| "The Million Book Digital Library Project: Research Problems in Data Mining And Discovery" |
| Abstract: Creating a universal, free-to-read digital library containing all the books ever published is technically feasible today. Google, Yahoo and Microsoft have all announced their intention to scan and make available books of interest to the public. Unfortunately, many of these will be in English and inaccessible to over 80% of the world's population. Even when books in other languages become available online, their content will remain incomprehensible to most people. Natural Language Processing technology is not yet perfect but promises to provide a way out of this conundrum. In this talk, we will discuss some of the special and unique research problems in data discovery arising in digital libraries and other online content, such as multi-lingual search, translation and summarization. |
| 9:00 am ~10:00 am, Monday, November 28, 2005 |
| "Graphical models for structure extraction and information integration" |
| Abstract: Recent advances in supervised learning over multiple inter-dependent variables have paved the way for accurate and automated methods for information extraction and integration. We present various graphical models for extraction, starting from traditional chain models for plain text, to segmentation models for exploiting matches with existing entities, and general graph models for extracting from visual 2D layouts as in web pages. Such models are trained either via conditional likelihood maximization or margin maximization, leading to constrained convex optimization problems. Inferencing often involves more than a simple message passing algorithm because of the presence of constraints that are not captured in the dependency graph. We present algorithms for such constrained inferencing and optimization tricks for reducing the computation of expensive features, like matches with large external dictionaries. There is much scope for further research in handling diverse unstructured sources, continuous model refinement, efficient training and inferencing, and probabilistic query answering in the presence of source uncertainties. |
| 9:00 am ~10:00 am, Tuesday, November 29, 2005 |
| "Efficient Indexing Technology for Data Mining of Scientific Data" |
| Abstract: Data mining in scientific applications usually involves searches over a large number of objects in the multidimensional space of their properties, or searches for known patterns. This is in contrast to mining for associations between objects, or discovering new patterns. Examples are searching over billions of objects to find rare objects by expressing numerical range conditions on their properties, or finding flame fronts in large-volume, spatio-temporal combustion simulation data by expressing multiple conditions over the data values associated with the cells in the 3D space. A critical issue in supporting such directed searches over large data volumes is the efficiency of the indexing method, which must be high enough to allow real-time exploration of the data. In this talk, we will describe a specialized bitmap indexing method, called FastBit, which has proved especially appropriate for numeric multidimensional data common in scientific applications. We will illustrate the use of this technology with several examples. |
| 9:00 am ~10:00 am, Wednesday, November 30, 2005 |
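The general idea of answering multi-condition range queries with a bitmap index, as described in the last abstract, can be sketched as follows. This is a toy illustration only and not FastBit's actual design or API: the class `BinnedBitmapIndex`, the helper `rows_of`, the bin edges, and the `temperature` and `oxygen` columns are all assumptions made for the example, and Python integers stand in for bitmaps without any compression. Each bin keeps one bitmap with a bit set per row; a range condition ORs the bitmaps of bins fully inside the range and checks raw values only for bins that straddle a boundary, and several conditions are combined with a bitwise AND.

```python
from bisect import bisect_right

def rows_of(bits):
    """Row numbers whose bit is set in an integer bitmap."""
    rows, row = [], 0
    while bits:
        if bits & 1:
            rows.append(row)
        bits >>= 1
        row += 1
    return rows

class BinnedBitmapIndex:
    """One bitmap per bin; bin k covers [edges[k], edges[k + 1])."""

    def __init__(self, values, edges):
        self.values = values
        self.edges = edges
        self.bitmaps = [0] * (len(edges) - 1)
        for row, v in enumerate(values):
            k = bisect_right(edges, v) - 1
            if not 0 <= k < len(self.bitmaps):
                raise ValueError(f"value {v} lies outside the bin edges")
            self.bitmaps[k] |= 1 << row

    def range_query(self, lo, hi):
        """Bitmap of rows with lo <= value < hi."""
        result = 0
        for k, bits in enumerate(self.bitmaps):
            b_lo, b_hi = self.edges[k], self.edges[k + 1]
            if b_hi <= lo or b_lo >= hi:
                continue  # bin lies entirely outside the range
            if lo <= b_lo and b_hi <= hi:
                result |= bits  # bin lies entirely inside: no raw data needed
            else:
                # Boundary bin: check the raw values of its rows only.
                for row in rows_of(bits):
                    if lo <= self.values[row] < hi:
                        result |= 1 << row
        return result

# Hypothetical per-cell values from a simulation snapshot.
temperature = [300.0, 1500.0, 2100.0, 800.0, 1900.0, 2500.0]
oxygen = [0.21, 0.05, 0.02, 0.18, 0.03, 0.08]

temp_idx = BinnedBitmapIndex(temperature, [0, 1000, 2000, 3000])
oxy_idx = BinnedBitmapIndex(oxygen, [0.0, 0.1, 0.2, 0.3])

# Cells that are hot AND oxygen-poor: combine conditions with bitwise AND.
hits = temp_idx.range_query(1800, 3000) & oxy_idx.range_query(0.0, 0.04)
print(rows_of(hits))  # [2, 4]
```

The per-condition bitmaps are cheap to combine, which is why this layout suits queries that place several range conditions on different properties of the same objects.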