Probabilistic Ranking of Database Query Results


Why Probabilistic in IR?

In traditional IR systems, matching between each document and query is attempted in a semantically imprecise space of index terms. Probabilities provide a principled foundation for uncertain reasoning.

An automated approach for ranking database query results is presented that leverages data and workload statistics and associations. The ranking functions are based on the principles of probabilistic models from Information Retrieval, adapted for structured data, and are encoded into an intermediate knowledge representation layer. The system is generic, as the ranking functions can be further customized for different applications. A key benefit of this approach is that the adapted probabilistic information retrieval (PIR) techniques leverage relational/structured data, in which columns provide natural groupings of data values. This permits the inference and use of pair-wise associations between data values across columns, which are usually not possible with text data.

Probabilistic in IR:

  • Classical probabilistic retrieval model
    -> Probability ranking principle, etc.
  • (Naïve) Bayesian Text Categorization
  • Bayesian networks for text retrieval
  • Probabilistic methods are one of the oldest but also one of the currently hottest topics in IR.
    -> Traditionally: neat ideas, but they’ve never won on performance. It may be different now.

Introduction:

In probabilistic information retrieval, the goal is to estimate the probability of relevance P(R | qk, dm), i.e., the probability that a document dm will be judged relevant by a user with request qk. To estimate this probability, a large number of probabilistic models have been developed.

Typically, such a model is based on representations of queries and documents (e.g., as sets of terms); in addition to this, probabilistic assumptions about the distribution of elements of these representations within relevant and nonrelevant documents are required.

By collecting relevance feedback data on a few documents, the model can then be applied to estimate the probability of relevance for the remaining documents in the collection.

As the name suggests, ‘ranking’ is the process of ordering a set of values (or data items) by some parameter that is of high relevance to the user of the ranking process.

Ranking and returning the most relevant results of a user’s query is a popular paradigm in information retrieval.

Ranking and Database:

Comparatively little work has been done on ranking the results of queries in database systems.

We have all seen examples of ranked results on the Internet. The most common example is Internet search engines (such as Google): a set of web pages satisfying the user’s search criteria is returned, with the most relevant results featured at the top of the list.

In contrast to the WWW, databases support only a Boolean query model. For example, a selection query on a SQL database returns all tuples that satisfy the conditions specified in the query. Depending on those conditions, two situations may arise:

Empty Answers: when the query is too selective, the answer may be empty.
Many Answers: when the query is not selective enough, the answer may contain too many tuples.

We next consider these two scenarios in detail and look at various mechanisms for producing ranked results in these circumstances.

The Empty Answers Problem:

The empty answers problem is the consequence of an overly selective query in a database system. In this case it would be desirable to return a ranked list of ‘approximately’ matching tuples without burdening the user with specifying any additional conditions; in other words, an automated approach for ranking and returning approximately matching tuples.

Automated Ranking Functions:

Automated ranking of query results is the process of taking a user query and mapping it to a Top-K query with a ranking function that depends on the conditions specified in the user query. A ranking function should work well even for large databases and have minimal side effects on query processing.

Automated ranking functions for the ‘Empty Answers Problem’:

  • IDF Similarity
  • QF Similarity
  • QFIDF Similarity

IDF Similarity:

IDF (inverse document frequency) similarity is an adaptation of a popular IR technique based on the philosophy that frequently occurring words convey less information about a user’s needs than rarely occurring words, and thus should be weighted less.
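As a minimal sketch of this idea applied to a database column (the function name and sample values are illustrative, not from the original work), IDF-style weights can be computed as:

```python
from collections import Counter
import math

def idf_weights(column):
    """IDF-style weights for the distinct values of a column:
    IDF(x) = log(n / freq(x)), where n is the number of tuples.
    Rare values get high weights; frequent values get low ones."""
    n = len(column)
    freq = Counter(column)
    return {x: math.log(n / f) for x, f in freq.items()}

weights = idf_weights(["sedan", "sedan", "sedan", "convertible"])
# the rare value "convertible" outweighs the common value "sedan"
```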

QF Similarity – leveraging workloads:

There may be instances where the relevance of an attribute value is due to factors other than the frequency of its occurrence. QF similarity is based on this very philosophy: the importance of attribute values is directly related to the frequency of their occurrence in query strings in the workload.
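A minimal sketch, assuming the workload is available simply as the list of attribute values referenced by past queries (all names here are illustrative):

```python
from collections import Counter

def qf_weights(workload_values):
    """QF weights: a value's importance is proportional to how often past
    queries referenced it, QF(x) = RQF(x) / RQFMax, where RQF(x) is the raw
    frequency of x in the workload and RQFMax is the largest such frequency."""
    rqf = Counter(workload_values)
    rqf_max = max(rqf.values())
    return {x: c / rqf_max for x, c in rqf.items()}

w = qf_weights(["convertible", "convertible", "sedan"])
# "convertible" was queried most often, so it gets the maximum weight 1.0
```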

QFIDF Similarity:

QF is purely workload based, i.e., it does not use the data at all. This may be a disadvantage in situations where we have insufficient or unreliable workloads.
QFIDF similarity is a remedy in such situations: it combines QF and IDF weights. This way, even if a value is never referenced in the workload, it gets a small non-zero QF weight.
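A sketch of the combination (the `+1` smoothing is what guarantees the small non-zero QF mentioned above; names and sample data are illustrative):

```python
from collections import Counter
import math

def qfidf_weights(column, workload_values):
    """QFIDF(x) = QF(x) * IDF(x), with QF smoothed so that values never
    seen in the workload still get a small non-zero weight:
    QF(x) = (RQF(x) + 1) / (RQFMax + 1)."""
    n = len(column)
    freq = Counter(column)
    rqf = Counter(workload_values)
    rqf_max = max(rqf.values()) if rqf else 0
    return {
        x: ((rqf[x] + 1) / (rqf_max + 1)) * math.log(n / f)
        for x, f in freq.items()
    }

w = qfidf_weights(["sedan", "sedan", "convertible", "coupe"],
                  ["convertible", "convertible"])
# "coupe" never occurs in the workload but still gets a non-zero weight
```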

In the case of the many answers problem, the ranking functions discussed above might fail to perform, because many tuples may tie for the same similarity score. Such a scenario could arise for the empty answers problem as well.
Breaking this tie requires looking beyond the attributes specified in the query, i.e., at the missing attributes.

Many Answers Problem:

We know by now that the many answers problem in database systems is the consequence of insufficiently selective queries.
Such a query produces a large number of tuples that satisfy the conditions specified in it.
Let us see how ranking of results is accomplished in such a scenario.

Basic Approach:

Any ranking function for the many answers problem has to look beyond the attributes specified in the query, since all, or a large number of, tuples satisfy the specified conditions.
Determining the unspecified attributes precisely is a challenging task. We show an adaptation of Probabilistic Information Retrieval (PIR) ranking methods.

Ranking function for Many Answers Problem:

The ranking function for the many answers problem is developed by adapting the PIR models that best capture data dependencies and correlations.
The ranking function of a tuple depends on two factors: (a) a global score, and (b) a conditional score.
These scores can be computed through workload as well as data analysis.
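The two-factor structure can be sketched as follows. The probability tables and names here are hypothetical stand-ins for the quantities a real workload/data analysis would produce, and log-probabilities are summed rather than multiplied to avoid floating-point underflow:

```python
import math

def score(tuple_values, specified, global_p, cond_p):
    """Sketch of two-factor ranking: each unspecified attribute value
    contributes a global score (its overall popularity) and a conditional
    score (its association with the values the query specified)."""
    s = 0.0
    for attr, y in tuple_values.items():
        if attr in specified:
            continue  # attributes fixed by the query do not discriminate
        s += math.log(global_p.get((attr, y), 1e-9))         # global part
        for x in specified.values():
            s += math.log(cond_p.get(((attr, y), x), 1e-9))  # conditional part
    return s

# Hypothetical example: querying for convertibles, ranking by color
global_p = {("color", "red"): 0.3, ("color", "black"): 0.6}
cond_p = {(("color", "red"), "convertible"): 0.8,
          (("color", "black"), "convertible"): 0.2}
spec = {"type": "convertible"}
s_red = score({"type": "convertible", "color": "red"}, spec, global_p, cond_p)
s_black = score({"type": "convertible", "color": "black"}, spec, global_p, cond_p)
# red convertibles outrank black ones despite black being globally more common
```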

Ranking Function: Adaptation of PIR Models for Structured Data:

The basic philosophy of PIR models is that, given a document collection ‘D’, the set of relevant documents ‘R’, and the set of irrelevant documents ‘R̄’ (= D − R), any document ‘t’ in ‘D’ can be ranked by computing score(t), the probability of ‘t’ belonging to the relevant set ‘R’.

Problem in adapting this approach:

The problem in computing score(t) using a PIR model for databases is that the relevant set ‘R’ is unknown at query time.
This approach is well suited to the IR domain, as ‘R’ is usually determined through user feedback.
User-feedback-based estimation of ‘R’ might be attempted in databases as well, but we propose an automated approach.

Architecture of Ranking System:

[Figure: architecture of the ranking system]

Implementation:

Pre-Processing: the pre-processing component is composed of the ‘Atomic Probabilities Module’ and the ‘Index Module’.

Atomic Probabilities Module – responsible for computing the atomic probabilities necessary for computing the ranking function, Score(t).

Index Module – responsible for pre-computing the ranked lists necessary for improving the efficiency of the query processing module.

Intermediate Layer: the atomic probabilities and the lists computed by the Index Module are all stored as database tables in the intermediate layer. All the tables in the intermediate layer are indexed on appropriate attributes for fast access during the later stages.

The primary purpose of the intermediate layer is to avoid computing the score from scratch each time a query is received, by storing pre-computed results of all atomic computations.

The Index Module:

  • The Index Module pre-computes the ranked lists of tuples for each possible atomic query.
  • Its purpose is to take the run-time load off the query processing component.
  • To assist the query processing component in returning the Top-K tuples, ranked lists of the tuples for all possible “atomic” queries are pre-computed.
  • Taking the association rules and the database as input, a Conditional List and a Global List are created for each distinct value ‘x’ in the database.
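The list construction above can be sketched roughly as follows; the scoring callbacks are hypothetical placeholders for the conditional and global scores produced by the Atomic Probabilities Module:

```python
def build_lists(tuples, cond_score, glob_score):
    """For each distinct value x, build a Conditional List (tuples
    containing x, sorted by conditional score descending) and a Global
    List (the same tuples, sorted by global score descending).
    Each entry is a (score, tuple_id) pair."""
    cond, glob = {}, {}
    for tid, t in enumerate(tuples):
        for x in t.values():
            cond.setdefault(x, []).append((cond_score(tid, x), tid))
            glob.setdefault(x, []).append((glob_score(tid), tid))
    for x in cond:
        cond[x].sort(reverse=True)
        glob[x].sort(reverse=True)
    return cond, glob

cond, glob = build_lists(
    [{"type": "sedan"}, {"type": "sedan"}, {"type": "coupe"}],
    cond_score=lambda tid, x: float(tid),   # hypothetical stand-in scores
    glob_score=lambda tid: float(-tid),
)
```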

Query Processing Component:

The list merge algorithm is the key player in the query processing component.

Its function is to take the user query, compute scores for all the tuples that satisfy the conditions specified in the query, rank the tuples in sorted order of their scores, and then return the Top-K tuples.
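As an illustrative stand-in (not the actual list merge algorithm, which merges the pre-computed sorted lists lazily so it can stop early), a naive version that sums per-condition scores and returns the Top-K might look like:

```python
import heapq
from collections import defaultdict

def top_k(score_lists, k):
    """Naive Top-K: each query condition contributes a list of
    (score, tuple_id) pairs; sum each tuple's scores across conditions
    and return the k highest-scoring tuples as (tuple_id, total) pairs."""
    total = defaultdict(float)
    for lst in score_lists:
        for s, tid in lst:
            total[tid] += s
    return heapq.nlargest(k, total.items(), key=lambda kv: kv[1])

result = top_k([[(0.9, "t1"), (0.4, "t2")],
                [(0.5, "t2"), (0.3, "t1")]], k=1)
# "t1" wins with the highest combined score
```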

Space Requirements:

To build the conditional and global lists, the space consumed is O(mn) bytes (where ‘m’ is the number of attributes and ‘n’ is the number of tuples of the database table).

There may be applications where space is an expensive resource.

In such cases, only a subset of the lists may be stored at pre-processing time, but this will come at the expense of an increase in query processing time.

Next Steps:

The ranking function presented here works on single-table databases and does not allow the presence of NULL values. A very interesting but nevertheless challenging extension of this work would be to develop ranking functions that work on multi-table databases and allow NULLs as well as non-text data in database columns.

