The SQoUT Project at NYU/Columbia
Unstructured text data is ubiquitous and, not surprisingly, many users and applications rely on textual data for a variety of tasks. The current paradigm for handling text data, popularized by search engines, is essentially a keyword ``lookup'' operation, followed by a sophisticated ranking of the results. There is very limited support for ``structured'' queries, no support for queries that need to combine information from multiple sources, and no support for queries that need to aggregate results from multiple web pages. Furthermore, users have to go through a large number of returned documents to identify and construct the required answer. In recent years, research in information extraction has shown how to retrieve structured information from unstructured textual data. Information extraction systems allow users to ask complicated questions over unstructured text and get concrete answers, so that users spend less time searching for information and more time analyzing and understanding the results. Our work focuses on:
Cost-based Query Optimization for Text-Centric Tasks: With hundreds or even thousands of information extraction systems available, it becomes possible to execute many complex queries over the web. With billions of available documents, it is crucial to find efficient methods for optimizing the execution of such queries. Research in this component focuses on making information extraction a ``first-class'' citizen of a web-scale database system, so that a query optimizer can automatically choose the best execution plan for a given task and return the desired answer as quickly as possible.
Quality-aware Query Optimization for Text-Centric Tasks: Extracting structured information from unstructured text is inherently a noisy process, and the returned results do not have perfect ``precision'' and ``recall'' (i.e., they are neither fully correct nor complete). Our research focuses on making the quality of the results an integral part of the query optimization process. The goal is to let users specify the desired result quality and have the optimizer automatically choose the appropriate extraction system(s), the appropriate configuration for each system, and the appropriate execution plan for the given task.
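The two optimization goals above can be illustrated with a minimal sketch. It is not the project's optimizer: the `Plan` class, the `choose_plan` function, the plan names, and all time and quality estimates are hypothetical and chosen only for illustration. The sketch assumes the optimizer already has, for each candidate plan, an estimated execution time and estimated precision and recall; it then picks the fastest plan that meets the user's quality targets. With no quality targets, the choice reduces to purely cost-based selection.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Plan:
    """A candidate execution plan with hypothetical optimizer estimates."""
    name: str
    est_seconds: float    # estimated execution time
    est_precision: float  # estimated fraction of extracted tuples that are correct
    est_recall: float     # estimated fraction of correct tuples that are extracted


def choose_plan(plans: Sequence[Plan],
                min_precision: float = 0.0,
                min_recall: float = 0.0) -> Optional[Plan]:
    """Return the fastest plan whose estimates meet the quality targets.

    With both targets left at 0.0 this is plain cost-based selection.
    Returns None when no plan satisfies the targets.
    """
    feasible = [p for p in plans
                if p.est_precision >= min_precision
                and p.est_recall >= min_recall]
    return min(feasible, key=lambda p: p.est_seconds, default=None)


# Made-up estimates for three illustrative plans (assumptions, not measurements).
PLANS = [
    Plan("scan_all_documents", 900.0, 0.80, 0.95),
    Plan("keyword_filter_then_extract", 120.0, 0.85, 0.60),
    Plan("strict_extractor_on_scan", 1100.0, 0.95, 0.70),
]

if __name__ == "__main__":
    print(choose_plan(PLANS))                      # cost only
    print(choose_plan(PLANS, min_recall=0.9))      # recall target
    print(choose_plan(PLANS, min_precision=0.9))   # precision target
```

In this toy setting, tightening a quality target can force the choice of a slower plan, and a target that no plan can meet yields no plan at all. A real optimizer would also have to produce the estimates themselves, which this sketch simply takes as given.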