Archive for October, 2009

Magic Quotes in PHP

Why use Magic Quotes

  • Useful for beginners: Magic quotes were implemented in PHP to help keep code written by beginners from being dangerous. Although SQL injection is still possible with magic quotes on, the risk is reduced.
  • Convenience: For inserting data into a database, magic quotes essentially runs addslashes() on all GET, POST, and COOKIE data, and does so automagically.

Why not to use Magic Quotes

  • Portability: Assuming the setting to be on, or off, affects portability. Use get_magic_quotes_gpc() to check for it, and code accordingly.
  • Performance: Because not every piece of escaped data is inserted into a database, escaping all of it costs performance. Simply calling the escaping functions (like addslashes()) at runtime, where they are needed, is more efficient. Although php.ini-dist enables these directives by default, php.ini-recommended disables them. This recommendation is mainly for performance reasons.
  • Inconvenience: Because not all data needs escaping, it’s often annoying to see escaped data where it shouldn’t be. For example, emailing from a form and seeing a bunch of \' within the email. Fixing this may require excessive use of stripslashes().

Disabling Magic Quotes

The magic_quotes_gpc directive may only be disabled at the system level, and not at runtime. In other words, use of ini_set() is not an option.

Example #1 Disabling magic quotes server side

An example that sets the value of these directives to Off in php.ini. For additional details, read the manual section titled How to change configuration settings.

; Magic quotes
;

; Magic quotes for incoming GET/POST/Cookie data.
magic_quotes_gpc = Off

; Magic quotes for runtime-generated data, e.g. data from SQL, from exec(), etc.
magic_quotes_runtime = Off

; Use Sybase-style magic quotes (escape ' with '' instead of \').
magic_quotes_sybase = Off

If access to the server configuration is unavailable, use of .htaccess is also an option. For example:

php_flag magic_quotes_gpc Off

In the interest of writing portable code (code that works in any environment), for cases where changing the setting at the server level is not possible, here’s an example that undoes the effect of magic_quotes_gpc at runtime. This method is inefficient, so it’s preferable to set the appropriate directives elsewhere.

Example #2 Disabling magic quotes at runtime

<?php
if (get_magic_quotes_gpc()) {
    // Recursively strip slashes from a value or a (nested) array of values.
    function stripslashes_deep($value)
    {
        $value = is_array($value) ?
            array_map('stripslashes_deep', $value) :
            stripslashes($value);

        return $value;
    }

    // Undo the automatic escaping on every request superglobal.
    $_POST = array_map('stripslashes_deep', $_POST);
    $_GET = array_map('stripslashes_deep', $_GET);
    $_COOKIE = array_map('stripslashes_deep', $_COOKIE);
    $_REQUEST = array_map('stripslashes_deep', $_REQUEST);
}
?>
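To see what the recursive stripping does without depending on how the server is configured, here is a small sketch that applies the same idea to a hand-made array. The sample data and the helper name strip_slashes_recursive() are made up for illustration. Note too that get_magic_quotes_gpc() no longer exists as of PHP 8, so the last line guards the call with function_exists().

```php
<?php
// Hypothetical helper: same idea as stripslashes_deep(), under another name.
function strip_slashes_recursive($value)
{
    return is_array($value)
        ? array_map('strip_slashes_recursive', $value)
        : stripslashes($value);
}

// Made-up input, escaped the way magic_quotes_gpc would have escaped it.
$input = array(
    'lastname' => "O\\'reilly",
    'tags'     => array("it\\'s", 'say \\"hi\\"'),
);

$clean = strip_slashes_recursive($input);

echo $clean['lastname'], "\n";
echo $clean['tags'][1], "\n";

// Guarded check: the function was removed in PHP 8, and "@" hides the
// deprecation notice that PHP 7.4 raises for it.
$magicQuotesOn = function_exists('get_magic_quotes_gpc') && @get_magic_quotes_gpc();
```

Nested arrays (for example form fields named tags[]) are the reason a plain array_map('stripslashes', ...) is not enough.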

What are Magic Quotes

When on, all ' (single-quote), " (double quote), \ (backslash) and NULL characters are escaped with a backslash automatically. This is identical to what addslashes() does.
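Since the escaping is the same as what addslashes() does, a quick way to get a feel for it is to call addslashes() by hand. The sample string below is made up; the sketch only shows the four characters being escaped and does not touch any ini setting.

```php
<?php
// Made-up value containing a single quote, double quotes and a backslash.
$raw = "O'reilly said \"C:\\temp\"";

$escaped = addslashes($raw);

echo $raw, "\n";      // O'reilly said "C:\temp"
echo $escaped, "\n";  // O\'reilly said \"C:\\temp\"

// A NUL byte becomes the two characters backslash and zero.
echo addslashes("a\0b") === 'a\\0b' ? "NUL escaped\n" : "unexpected\n";
```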

There are three magic quote directives:

  • magic_quotes_gpc Affects HTTP Request data (GET, POST, and COOKIE). Cannot be set at runtime, and defaults to on in PHP. See also get_magic_quotes_gpc().
  • magic_quotes_runtime If enabled, most functions that return data from an external source, including databases and text files, will have quotes escaped with a backslash. Can be set at runtime, and defaults to off in PHP. See also set_magic_quotes_runtime() and get_magic_quotes_runtime().
  • magic_quotes_sybase If enabled, a single-quote is escaped with a single-quote instead of a backslash. If on, it completely overrides magic_quotes_gpc. Having both directives enabled means only single quotes are escaped, as ''. Double quotes, backslashes and NULLs will remain untouched and unescaped. See also ini_get() for retrieving its value.
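The difference between the backslash style and the Sybase style is easiest to see side by side. This sketch imitates both on a made-up value with ordinary string functions; it does not read or change any magic quotes directive.

```php
<?php
// Made-up value with a single quote and double quotes.
$value = "O'reilly \"Tim\"";

// Backslash style: what magic_quotes_gpc alone would produce.
$backslashStyle = addslashes($value);

// Sybase style, imitated with str_replace(): only ' is touched, and it is doubled.
$sybaseStyle = str_replace("'", "''", $value);

echo $backslashStyle, "\n";
echo $sybaseStyle, "\n";
```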

get_magic_quotes_gpc()

(PHP 4, PHP 5)

get_magic_quotes_gpc — Gets the current configuration setting of magic quotes gpc

Description

int get_magic_quotes_gpc ( void )

Returns the current configuration setting of magic_quotes_gpc

Keep in mind that attempting to set magic_quotes_gpc at runtime will not work.

For more information about magic_quotes, see this security section.

Return Values

Returns 0 if magic quotes gpc are off, 1 otherwise.

Examples

Example #1 get_magic_quotes_gpc() example

<?php
echo get_magic_quotes_gpc();         // 1
echo $_POST['lastname'];             // O\'reilly
echo addslashes($_POST['lastname']); // O\\\'reilly

// Escape only when magic quotes has not already done so.
if (!get_magic_quotes_gpc()) {
    $lastname = addslashes($_POST['lastname']);
} else {
    $lastname = $_POST['lastname'];
}

echo $lastname; // O\'reilly
$sql = "INSERT INTO lastnames (lastname) VALUES ('$lastname')";
?>

Notes

Note: If the directive magic_quotes_sybase is ON it will completely override magic_quotes_gpc. So even when get_magic_quotes_gpc() returns TRUE, neither double quotes, backslashes nor NULs will be escaped. Only single quotes will be escaped. In this case they’ll look like: ''

get_magic_quotes_runtime

(PHP 4, PHP 5)

get_magic_quotes_runtime — Gets the current active configuration setting of magic_quotes_runtime

Description

int get_magic_quotes_runtime ( void )

Returns the current active configuration setting of magic_quotes_runtime.

Return Values

Returns 0 if magic quotes runtime is off, 1 otherwise.


Array_walk_recursive in PHP

The array_walk_recursive() function runs each array element through a user-made function. The array’s keys and values are parameters of that function. The difference between this function and array_walk() is that this one also works with deeper arrays (an array inside an array). It returns TRUE on success or FALSE on failure.

In other words, it applies a user function recursively to every member of an array.

Syntax:

bool array_walk_recursive ( array &$input , callback $funcname [, mixed $userdata ] )

The array_walk_recursive() function applies a user-defined callback function to every element of an array and will recurse into nested arrays. Normally the callback takes two arguments: the first is the value of the current element, and the second is its key or index. The third, optional argument of array_walk_recursive() is an additional value that is passed on to the callback as its third parameter. The function returns true on success and false on failure.
If your callback function needs to work with the actual values of the array, specify the first parameter of the callback as a reference. Then any changes made to those elements will be made in the original array itself.
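Here is a minimal sketch of the by-reference callback together with the optional third argument. The $prices array, the scale_price() name and the factor of 10 are made up for illustration.

```php
<?php
// Made-up data: one nested array and one plain value.
$prices = array(
    'fruit' => array('apple' => 2, 'pear' => 3),
    'misc'  => 10,
);

// $value is taken by reference, so the change is written back into $prices.
// $factor receives the optional third argument of array_walk_recursive().
function scale_price(&$value, $key, $factor)
{
    $value = $value * $factor;
}

$ok = array_walk_recursive($prices, 'scale_price', 10);

print_r($prices);
```

Without the & in front of $value, the callback would only change its own copy and $prices would stay as it was.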

Example:

<html>
<body>
<div align="center">
<?php
// A two-dimensional array: three rows of four numbers each.
$numbers = array(
    array(1, 2, 3, 4),
    array(4, 8, 10, 12),
    array(20, 25, 30, 35),
);

// Callback: prints the element's index, then replaces the element with its cube.
function cube(&$element, $index)
{
    print $index;
    $element = $element * $element * $element;
}
?>
<table border='1'><caption><font size='-1'> The<em> array_walk_recursive()</em> function</font></caption>
<?php
array_walk_recursive($numbers, 'cube');
for ($i = 0; $i < 3; $i++) {
    echo "<tr bgcolor='#999FFF'>";
    for ($j = 0; $j < 4; $j++) {
        echo "<td><b>" . $numbers[$i][$j] . "</b></td>";
    }
    echo "</tr>";
}
echo "</table>";
?>
</div>
</body>
</html>

OUTPUT:

[Screenshot: arrayfunction]

Explain:
$numbers is a numeric array containing three arrays.
The callback function takes a reference to the current element as its first argument and $index, the key of that element within its array, as its second. Its job is to cube each element as the array is walked.
The array_walk_recursive() function takes the array as its first argument, and the name of the function, 'cube', as a string value for its second argument. The cube() function is the callback that will be applied to each element of the two-dimensional array, $numbers.
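One detail worth seeing for yourself: the callback is only called for the innermost values, never for the nested arrays themselves. The sketch below uses a made-up array with named rows and collects every key the callback receives; it uses an anonymous function, so it assumes PHP 5.3 or later.

```php
<?php
// Made-up array: two named rows, each holding two numbers.
$numbers = array(
    'row1' => array(1, 2),
    'row2' => array(3, 4),
);

$seen = array();

// Collect every key the callback is called with.
array_walk_recursive($numbers, function ($element, $index) use (&$seen) {
    $seen[] = $index;
});

// The row names never show up, only the keys of the innermost values.
echo implode(',', $seen), "\n"; // 0,1,0,1
```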


Upgrading WordPress

Before you get started, make sure you meet the minimum requirements.

If you are upgrading because of security issues, ensure that the underlying software is kept current as well, for example your PHP version or the version of the MySQL database.

Automatic Upgrade

Recent versions of WordPress feature an Automatic Upgrade. You can launch the automatic upgrade by clicking the link in the new version banner (if it’s there) or by going to the Tools -> Upgrade menu. After that it should be straightforward.

Automatic Upgrades do fail sometimes, though, so remember to back up your database first, and deactivate your plugins before starting the upgrade.

Note that your files all need to be owned by the user under which your Apache server executes, or you will receive a dialog box asking for “connection information,” and you will find that no matter what you enter, it won’t work.

If the automatic upgrade doesn’t work for you, don’t panic; just try a manual upgrade.

Three Step Manual Upgrade

These are the short instructions. If you want more detail, or if you experience problems with the Three Step Upgrade, review the extended upgrade instructions.

For these instructions, it is assumed that your blog’s URL is http://example.com/wordpress/. Note that during the upgrade process access to your blog may not work for your visitors.

A Warning before you start

If you run into problems upgrading WordPress with the three steps described here, you need to revert to your old version first (i.e., restore the backup made in Step 0) before using the more detailed upgrade instructions. Even though you might not run into any errors with this process right away, you might run into problems later down the line. Then it may not be possible to revert far enough back to fix the problem without losing recent changes.

So if you use plugins and themes other than the ones that come with the default WordPress installation, it is advisable to start over with the more detailed upgrade instructions.

Step 0: Before You Get Started

  • Just in case something goes wrong, make sure you have a backup. WordPress Backups is a comprehensive guide.
  • Make sure the database user name registered to WordPress has permission to create, modify, and delete database tables. If you installed WordPress in the standard way, and nothing has changed since then, you are fine.
  • Deactivate your plugins. A plugin might not be compatible with the new version, so it’s nice to check for new versions of them and deactivate any that may cause problems. You can reactivate plugins one-by-one after the upgrade. This is particularly important when upgrading to WordPress 2.7!

Step 1: Replace WordPress files

  1. Get the latest WordPress. Either download and extract it to your computer or download it directly to the server.
    1. As a reminder, to extract a tar.gz to a folder use this command, replacing (folder name) with the name of your folder: tar -xvzf latest.tar.gz -C ./(folder name)
  2. Delete your old wp-includes and wp-admin directories.
  3. Copy the new WordPress files to your server, overwriting old files in the root, except perhaps the wp-content folder (see “NOTE” below). You may use FTP or shell commands to do so. Note that this means *all* the files, including all the files in the root directory as well. If you use the default or classic theme and have customized it, then you can skip that theme.

NOTE The wp-content folder requires special handling, as do the plugins and themes folders. You should copy over the contents of these folders, not the entire folder. In some cases, copying the entire folder may overwrite all your customizations and added content.

Also take care to preserve the content of the wp-config.php file in the root directory. This file contains current settings for your existing installation, e.g. database sign-in information. Occasionally new versions of WordPress add statements to this file. (E.g. in version 2.5 the SECRET_KEY variable was added, see Extended upgrade instructions). Compare your existing file with the new installation file which is named wp-config-sample.php. Either transfer your settings to the sample-file and rename it to wp-config.php or copy the new statements from the sample file into your current file.
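Comparing the two config files by eye is error-prone, so here is a rough sketch of how the comparison could be scripted. The defined_names() helper and both file contents are made up for illustration; in practice the two strings would come from file_get_contents() on your wp-config.php and the new wp-config-sample.php, and the simple pattern only catches plainly written define() calls.

```php
<?php
// Hypothetical helper: returns the constant names passed to define() in PHP source text.
function defined_names($source)
{
    preg_match_all('/define\(\s*[\'"]([A-Za-z0-9_]+)[\'"]/', $source, $matches);
    return $matches[1];
}

// Made-up file contents standing in for the real files.
$current = "define('DB_NAME', 'blog');\ndefine('DB_USER', 'me');\n";
$sample  = "define('DB_NAME', 'putyourdbnamehere');\n"
         . "define('DB_USER', 'usernamehere');\n"
         . "define('SECRET_KEY', 'put your unique phrase here');\n";

// Names that the new sample file defines but the existing config does not.
$missing = array_values(array_diff(defined_names($sample), defined_names($current)));

print_r($missing);
```

Anything listed in $missing is a statement you would copy from the sample file into your current file.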

Step 2: Upgrade your installation

Visit your main WordPress admin page at /wp-admin. You may be asked to log in again. If a database upgrade is necessary at this point, WordPress will detect it and give you a link to a URL like http://example.com/wordpress/wp-admin/upgrade.php. Follow that link and follow the instructions. This will update your database to be compatible with the latest code. If you fail to do this step, your blog might look funny.

Step 3: Do something nice for yourself

If you have caching enabled, your changes will appear to users more immediately if you clear the cache at this point (and if you don’t, you may get confused when you see the old version number in page footers when you check to see if the upgrade worked).

Your WordPress installation is successfully upgraded. That’s as simple as we can make it without Updating WordPress Using Subversion.

Consider rewarding yourself with a blog post about the upgrade, reading that book or article you’ve been putting off, or simply sitting back for a few moments and letting the world pass you by.

Troubleshooting

If anything has gone wrong the first thing to do is go through all the steps in our extended upgrade instructions. That page also has information about some of the most common problems we see.


WordPress Plugin Actions

WordPress actions allow you, as a plugin author, to hook into the WordPress application and execute a piece of code. For example, you might want to execute some code after a user has published a post or left a comment.

Some of the actions that I use heavily are:

  • admin_menu: Allows you to set up an admin panel for your plugin.
  • wp_head: Allows you to insert code into the <head> tag of a blog.

Actions in Action:

While defining the structure of a WordPress plugin, I left a placeholder for some actions. In this example, we are going to set up a piece of code that will run inside the <head> tag of a WordPress blog.
First we need to add a function into our DevloungePluginSeries class.
PHP:
function addHeaderCode() {
    ?>
<!-- Devlounge Was Here -->
    <?php
}
All the above function does is output an HTML comment. Rather simple, but you could output just about anything. To call this function, we add an action.
PHP:
//Actions and Filters
if (isset($dl_pluginSeries)) {
    //Actions
    add_action('wp_head', array(&$dl_pluginSeries, 'addHeaderCode'), 1);
    //Filters
}
From the WordPress Plugin API page, the add_action structure is as follows:
add_action ( 'hook_name', 'your_function_name', [priority], [accepted_args] );
Since we are calling a function inside of a class, we pass the action an array with a reference to our class variable (dl_pluginSeries) and the function name we wish to call (addHeaderCode). We have given our plugin a priority level of 1, with lower numbers executed first.
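To make the priority idea concrete outside of WordPress, here is a toy hook registry. It is only an illustration and not WordPress's own code: ToyHooks, ToyPlugin and the default priority of 10 used here are assumptions made for the sketch. It shows an array callback (object plus method name) registered with priority 1 running before a callback registered earlier with a higher number.

```php
<?php
// Toy hook registry for illustration only; this is not how WordPress implements hooks.
class ToyHooks
{
    private static $hooks = array();

    public static function add($hook, $callback, $priority = 10)
    {
        self::$hooks[$hook][$priority][] = $callback;
    }

    public static function run($hook)
    {
        if (empty(self::$hooks[$hook])) {
            return;
        }
        ksort(self::$hooks[$hook]); // lower priority numbers run first
        foreach (self::$hooks[$hook] as $callbacks) {
            foreach ($callbacks as $callback) {
                call_user_func($callback);
            }
        }
    }
}

// Stand-in for the plugin class; it records calls instead of printing HTML.
class ToyPlugin
{
    public $log = array();

    function addHeaderCode()
    {
        $this->log[] = 'plugin header code';
    }
}

$plugin = new ToyPlugin();

$later = function () use ($plugin) {
    $plugin->log[] = 'default priority callback';
};

ToyHooks::add('wp_head', $later);                              // priority 10
ToyHooks::add('wp_head', array($plugin, 'addHeaderCode'), 1);  // priority 1

ToyHooks::run('wp_head');
print_r($plugin->log);
```

Even though the priority 1 callback was registered second, it ends up first in the log.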

Running the Code:

If the Devlounge Plugin Series plugin is activated, the comment of “Devlounge was here” should show up when you go to View->Source in your web browser when looking at your main blog site.

Removing Actions:

If your plugin dynamically adds actions, you can dynamically remove them as well with the remove_action function. The structure is as follows:
remove_action('action_hook', 'action_function');

Structure of a WordPress Plugin

One of the more important aspects of developing a WordPress plugin is how you structure it. This post will go over some tips on how to structure your plugin to organize your plugin resources and avoid naming collisions. Each plugin author is different in the way they structure a plugin, so these tips are merely my own personal preference. I’ll first briefly describe how a WordPress plugin works and then go into a plugin’s structure.

How a WordPress Plugin Works:

After placing a WordPress plugin into the “wp-content/plugins/” folder, the plugin should automatically be available to install.

When a plugin is “Activated”, this tells WordPress to load your bit of code on “each” page (including admin pages). This is why if you have many plugins activated, your WordPress installation may be very slow due to the amount of code being included. Since WordPress loads your code automatically when the plugin is activated, you can take advantage of this by tapping into the WordPress Plugin Application Program Interface (API). You can also access the WordPress template tags or create your own.
I suggest reading into the WordPress loop if you plan on making changes to the post content or comments. The WordPress loop is the loop that displays your posts. Some template tags will not work outside of this loop, so it is imperative that you know exactly where your code is executing. You can control this by taking advantage of actions and filters, which will be explained in later posts.

Folder Structure:

All WordPress plugins will be installed in the wp-content/plugins directory. Some plugin authors simply include a PHP file for their plugin, but I recommend always creating a folder to store your plugin.

I typically structure my plugin in this folder structure:

Plugin Folder Name (The name of your plugin with no spaces or special characters)

-> Main plugin php file
-> js folder (for JavaScript files)
-> css folder (for StyleSheet files)
-> php folder (for other PHP includes)

For example purposes, here is a sample structure I have created:

devlounge-plugin-series
-> devlounge-plugin-series.php
-> js
-> css
-> php

Within the devlounge-plugin-series folder, I would include just the main PHP file and put all other files in their respective folders. This structure will assist other plugin authors who look at your code to be able to tell what the main plugin file is and where all the supporting files are located.
WordPress also recommends placing images in their own directory and including a readme file for your plugin.

Main Plugin File:

When you start a new plugin file, the first seven lines are the lines that describe your plugin.
PHP:
1.  <?php
2.  /*
3.  Plugin Name: Your Plugin Name Here
4.  Plugin URI: Your Plugin URI
5.  Version: Current Plugin Version
6.  Author: Who Are You?
7.  Description: What does your plugin do?
8.  */

Line 3 allows you to name your plugin. Line 4 allows you to point a user to the web location of your plugin. Line 5 allows you to specify the current version. Line 6 allows you to specify the author of the plugin. Line 7 allows you to describe your plugin.
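If you are curious how such "Field: value" lines can be picked out of a file, here is a rough sketch. It is my own imitation, not the code WordPress uses; read_header_fields() and the sample header text are made up for illustration.

```php
<?php
// Made-up header text, following the layout of the plugin header.
$source = "<?php\n/*\nPlugin Name: Devlounge Plugin Series\nVersion: v1.00\nAuthor: Ronald Huereca\n*/\n";

// Hypothetical helper: pulls "Field: value" pairs out of the header comment.
function read_header_fields($source, array $fields)
{
    $found = array();
    foreach ($fields as $field) {
        // Allow leading spaces, tabs and comment characters before the field name.
        $pattern = '/^[ \t\/*#]*' . preg_quote($field, '/') . ':(.*)$/mi';
        if (preg_match($pattern, $source, $match)) {
            $found[$field] = trim($match[1]);
        }
    }
    return $found;
}

$header = read_header_fields($source, array('Plugin Name', 'Version', 'Author'));
print_r($header);
```

The takeaway is simply that the field names must be spelled exactly and each field must sit on its own line.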

Shown below is an example of the code filled out:
PHP:
1.  <?php
2.  /*
3.  Plugin Name: Devlounge Plugin Series
4.  Plugin URI: http://www.devlounge.net/
5.  Version: v1.00
6.  Author: <a href="http://www.ronalfy.com/">Ronald Huereca</a>
7.  Description: A sample plugin for a <a href="http://www.devlounge.net">Devlounge</a> series.
8.  */

Set Up a Class Structure:

You don’t have to be incredibly familiar with PHP Classes to develop a WordPress plugin, but it sure helps. A class structure is necessary in order to avoid naming collisions with other WordPress plugins. If someone out there sets up the same function name as yours in a plugin, an error will result and WordPress will be rendered inoperable until that plugin is removed. To avoid naming collisions, it is imperative that all plugins incorporate a PHP class structure. Here is some bare-bones code that will allow you to set up a class structure.

PHP:
if (!class_exists("DevloungePluginSeries")) {
    class DevloungePluginSeries {
        function DevloungePluginSeries() { //constructor

        }

    }

} //End Class DevloungePluginSeries

What the above code does is check for the existence of a class named DevloungePluginSeries. If the class doesn’t exist, the class is created.

Initialize Your Class:

The next bit of code will initialize (instantiate) your class.
PHP:
if (class_exists("DevloungePluginSeries")) {
    $dl_pluginSeries = new DevloungePluginSeries();
}

All the above code checks for is if the class DevloungePluginSeries has been created. If it has, a variable called $dl_pluginSeries is created with an instance of the DevloungePluginSeries class.
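Here is a small sketch of why the class_exists() guard matters. It puts two competing definitions of the same class in one script, which stands in for two plugins that happen to pick the same class name; the $source property is made up so we can tell which definition won.

```php
<?php
// First "plugin" defines the class.
if (!class_exists('DevloungePluginSeries')) {
    class DevloungePluginSeries
    {
        public $source = 'first definition';
    }
}

// A second definition of the same name is skipped by the guard
// instead of causing a fatal redeclaration error.
if (!class_exists('DevloungePluginSeries')) {
    class DevloungePluginSeries
    {
        public $source = 'second definition';
    }
}

// Instantiate only if the class is available.
if (class_exists('DevloungePluginSeries')) {
    $dl_pluginSeries = new DevloungePluginSeries();
    echo $dl_pluginSeries->source, "\n";
}
```

The guard prevents the crash, but notice that the second plugin silently gets the first plugin's class, which is one more reason to pick a class name that is unlikely to be used by anyone else.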

Set Up Actions and Filters:

The next bit of code sets up a place holder for WordPress actions and filters (which I will go over in a later post).
PHP:
//Actions and Filters
if (isset($dl_pluginSeries)) {
    //Actions

    //Filters
}

?>

The above code checks to make sure the $dl_pluginSeries variable is set. If it is (and that’s only if the class exists), then the appropriate actions and filters are set up.

Seven Reasons to Write a WordPress Plugin

While writing the “How to Write a Plugin” series, I thought it would be beneficial to list some reasons why WordPress users would want to write a WordPress plugin in the first place. Listed below are seven reasons why a WordPress user should consider writing a WordPress plugin.

=>You like a plugin’s idea, but don’t like the plugin’s implementation:

Whether discovering WordPress plugins on Weblog Tools Collection, the official WordPress plugins directory, or the WordPress Plugin Database, you will inevitably find a plugin that meets your needs — sort of. You like the idea of the plugin, but not really the approach the plugin author took with it. Why not run with the original idea and create your own separate version?

=>You want to modify existing plugin code:

Sometimes the plugin’s output needs to be tweaked a little bit or some functionality you would like is missing. You can try convincing the plugin author to add your feature, but plugin authors are usually quite busy or they may not like your suggestion. It takes a lot of effort by a plugin author to provide support and field feature and bug requests for a plugin that is free. Sometimes the plugin is no longer supported by anyone. In the event the plugin author is unable to meet your needs, it will be up to you to take the initiative and modify the existing plugin code. If you do a good enough job and make enough changes, you can re-release the plugin as long as the original plugin was released under a GPL compatible license. Usually one of the first things I do when I install or test a new plugin is to look at the code and see what I can modify, what I can’t modify, and what I can possibly add or take away.

=>You want to extend a plugin:

Sometimes a plugin is good as it is, but you would like to build upon it and release your own version. For example, you may think a plugin would work better using AJAX, or would like to add more hooks so that it is compatible with other plugins. You may want to add an admin panel so you don’t have to dig through the code to change the output. As stated earlier, if a plugin is released as GPL compatible, you are free to release your own version.

=>You want portable theme code:

For those of us who opted to build a custom theme from scratch rather than download one, you may find yourself re-using code snippets all over the place. Wouldn’t it be better just to write your own plugin that combined all the little code snippets so that you could use them as template tags? The beauty of template tags is that you can re-use them over and over for your theme and any future ones you build. And you only have one place to change the code rather than several.

=>You are a theme designer:

I would argue that if you are a template designer for WordPress, the next logical step is to be a plugin author. Writing plugins gives you a more intimate knowledge of how WordPress behaves and allows you to extend the functionality of your released themes.

=>You want to make money:

A good plugin author can usually get paid on the side for custom work. Some plugin authors take donations as well or charge extra for providing support or for consulting. If you are a custom theme designer, you can package your custom plugins in with the theme for an extra charge.

=>You want incoming links:

When launching the Reader Appreciation Project, one of the goals I had was to rapidly build incoming links. The best way I knew how was to write some WordPress plugins and promote them. One of my plugins (WP Ajax Edit Comments) turned out to be very popular and has currently generated more than 100 incoming links.


Probabilistic Ranking of Database Query Results

Why Probabilistic in IR?

In traditional IR systems, matching between each document and query is attempted in a semantically imprecise space of index terms. Probabilities provide a principled foundation for uncertain reasoning.

A system and methods rank results of database queries. An automated approach for ranking database query results is disclosed that leverages data and workload statistics and associations. Ranking functions are based upon the principles of probabilistic models from Information Retrieval that are adapted for structured data. The ranking functions are encoded into an intermediate knowledge representation layer. The system is generic, as the ranking functions can be further customized for different applications. Benefits of the disclosed system and methods include the use of adapted probabilistic information retrieval (PIR) techniques that leverage relational/structured data, such as columns, to provide natural groupings of data values. This permits the inference and use of pair-wise associations between data values across columns, which are usually not possible with text data.

Probabilistic in IR:

  • Classical probabilistic retrieval model
    -> Probability ranking principle, etc.
  • (Naïve) Bayesian Text Categorization
  • Bayesian networks for text retrieval
  • Probabilistic methods are one of the oldest but also one of the currently hottest topics in IR.
    -> Traditionally: neat ideas, but they’ve never won on performance. It may be different now.

Introduction:

In probabilistic information retrieval, the goal is the estimation of the probability of relevance P(R | q_k, d_m) that a document d_m will be judged relevant by a user with request q_k. In order to estimate this probability, a large number of probabilistic models have been developed.

Typically, such a model is based on representations of queries and documents (e.g., as sets of terms); in addition to this, probabilistic assumptions about the distribution of elements of these representations within relevant and nonrelevant documents are required.

By collecting relevance feedback data from a few documents, the model then can be applied in order to estimate the probability of relevance for the remaining documents in the collection.

As the name suggests, ‘Ranking’ is the process of ordering a set of values (or data items) based on some parameter that is of high relevance to the user of the ranking process.

Ranking and returning the most relevant results of a user’s query is a popular paradigm in information retrieval.

Ranking and Database:

Not much work has been done on ranking the results of queries in database systems.

We have all seen examples of ranked results on the internet. The most common example is the internet search engine (like Google). A set of web pages (satisfying the user’s search criteria) is returned, with the most relevant results featuring at the top of the list.

In contrast to the WWW, databases support only a Boolean query model. For example a selection query on a SQL database schema returns all tuples that satisfy the conditions specified in the query. Depending on the conditions specified in the query, two situations may arise:

Empty Answers: when the query is too selective, the answer may be empty.
Many Answers: when the query is not selective enough, there may be too many tuples in the answer.

We next consider these two scenarios in detail and look at various mechanisms to produce ranked results in these circumstances.

The Empty Answers Problem:

The empty answers problem is the consequence of a very selective query in a database system. In this case it would be desirable to return a ranked list of ‘approximately’ matching tuples without burdening the user to specify any additional conditions. In other words, what is needed is an automated approach for ranking and returning approximately matching tuples.

Automated Ranking Functions:

Automated ranking of query results is the process of taking a user query and mapping it to a Top-K query with a ranking function that depends on conditions specified in the user query. A ranking function should be able to work well even for large databases and have minimum side effects on query processing.
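The mapping from a Boolean query to a Top-K query can be sketched in a few lines. Everything here is made up for illustration: the table of homes, the query, and above all the scoring rule (a plain count of satisfied conditions), which is only a placeholder for a real ranking function.

```php
<?php
// Made-up table and query. No tuple satisfies all three conditions,
// so a strict Boolean query would return an empty answer.
$tuples = array(
    array('id' => 1, 'city' => 'Seattle', 'beds' => 3, 'view' => 'lake'),
    array('id' => 2, 'city' => 'Seattle', 'beds' => 2, 'view' => 'none'),
    array('id' => 3, 'city' => 'Tacoma',  'beds' => 2, 'view' => 'none'),
);
$query = array('city' => 'Seattle', 'beds' => 3, 'view' => 'mountain');

// Placeholder ranking function: number of query conditions the tuple satisfies.
function score(array $tuple, array $query)
{
    $score = 0;
    foreach ($query as $attribute => $wanted) {
        if ($tuple[$attribute] === $wanted) {
            $score++;
        }
    }
    return $score;
}

// Sort by descending score and keep the first $k tuples.
function top_k(array $tuples, array $query, $k)
{
    usort($tuples, function ($a, $b) use ($query) {
        return score($b, $query) - score($a, $query);
    });
    return array_slice($tuples, 0, $k);
}

$best = top_k($tuples, $query, 2);
foreach ($best as $tuple) {
    echo $tuple['id'], ' scored ', score($tuple, $query), "\n";
}
```

Sorting the whole table is fine for a toy, but it is exactly the kind of cost a ranking function for large databases has to avoid.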

Automated Ranking functions for the ‘Empty Answers Problem’ :

  • IDF Similarity
  • QF Similarity
  • QFIDF Similarity

IDF Similarity:

IDF (inverse document frequency) is an adaptation of a popular IR technique based on the philosophy that frequently occurring words convey less information about the user’s needs than rarely occurring words, and thus should be weighted less.
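As a small sketch of the idea applied to a database column, the code below weights each value by the logarithm of (number of tuples / number of tuples containing the value). That is the textbook form of IDF and an assumption on my part; the exact formula used by a particular ranking system may differ. The column data and the idf_weights() name are made up.

```php
<?php
// Made-up column values, one per tuple.
$cityColumn = array('Seattle', 'Seattle', 'Seattle', 'Tacoma');

// Assumed formula: idf(value) = log(n / frequency of the value in the column).
function idf_weights(array $column)
{
    $n = count($column);
    $weights = array();
    foreach (array_count_values($column) as $value => $frequency) {
        $weights[$value] = log($n / $frequency);
    }
    return $weights;
}

$idf = idf_weights($cityColumn);

// The rare value gets the larger weight.
printf("Seattle: %.3f, Tacoma: %.3f\n", $idf['Seattle'], $idf['Tacoma']);
```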

QF Similarity – leveraging workloads:

There may be instances where the relevance of an attribute value is due to factors other than the frequency of its occurrence in the data. QF similarity is based on this very philosophy. According to QF Similarity, the importance of attribute values is directly related to the frequency of their occurrence in query strings in the workload.

QFIDF Similarity:

QF is purely workload based, i.e., it does not use data at all. This may be a disadvantage in situations wherein we have insufficient or unreliable workloads.
QFIDF Similarity is a remedy in such situations. It combines QF and IDF weights. This way even if a value is never referenced in the workload, it gets a small non-zero QF.
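A rough sketch of how the two weights could be combined follows. Both the smoothing, (count + 1) / (max count + 1), and the combination by multiplication are assumptions chosen to match the description above (unseen values keep a small non-zero QF); the workload and the function names are made up.

```php
<?php
// Made-up workload: the city value each past query asked for.
$workload = array('Seattle', 'Seattle', 'Bellevue', 'Seattle', 'Bellevue');

// Assumed smoothing: (count + 1) / (max count + 1), so unseen values stay above zero.
function qf(array $workload, $value)
{
    $counts = array_count_values($workload);
    $max = $counts ? max($counts) : 0;
    $count = isset($counts[$value]) ? $counts[$value] : 0;
    return ($count + 1) / ($max + 1);
}

// Assumed combination: multiply the workload weight by a data-based IDF weight.
function qfidf(array $workload, $value, $idf)
{
    return qf($workload, $value) * $idf;
}

printf("QF Seattle: %.2f\n", qf($workload, 'Seattle'));
printf("QF Tacoma (never queried): %.2f\n", qf($workload, 'Tacoma'));
printf("QFIDF Tacoma with idf 2.0: %.2f\n", qfidf($workload, 'Tacoma', 2.0));
```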

Why Probabilistic in IR?

In traditional IR systems, matching between each document and query is attempted in a semantically imprecise space of index terms. Probabilities provide a principled foundation for uncertain reasoning.

A system and methods rank results of database queries. An automated approach for ranking database query results is disclosed that leverages data and workload statistics and associations. Ranking functions are based upon the principles of probabilistic models from Information Retrieval that are adapted for structured data. The ranking functions are encoded into an intermediate knowledge representation layer. The system is generic, as the ranking functions can be further customized for different applications. Benefits of the disclosed system and methods include the use of adapted probabilistic information retrieval (PIR) techniques that leverage relational/structured data, such as columns, to provide natural groupings of data values. This permits the inference and use of pair-wise associations between data values across columns, which are usually not possible with text data.

Probabilistic in IR:

  • Classical probabilistic retrieval model
    -> Probability ranking principle, etc.
  • (Naïve) Bayesian Text Categorization
  • Bayesian networks for text retrieval
  • Probabilistic methods are one of the oldest but also one of the currently hottest topics in IR.
    -> Traditionally: neat ideas, but they’ve never won on performance. It may be different now.

Introduction:

In probabilistic information retrieval, the goal is the estimation of the probability of relevance P(R l qk, dm) that a document dm will be judged relevant by a user with request qk. In order to estimate this probability, a large number of probabilistic models have been developed.

Typically, such a model is based on representations of queries and documents (e.g., as sets of terms); in addition to this, probabilistic assumptions about the distribution of elements of these representations within relevant and nonrelevant documents are required.

By collecting relevance feedback data from a few documents, the model then can be applied in order to estimate the probability of relevance for the remaining documents in the collection.

As the name suggests, ‘Ranking’ is the process of ordering a set of values (or data items) by some parameter that is highly relevant to the user.

Ranking the results of a user’s query and returning the most relevant ones is a popular paradigm in information retrieval.

Ranking and Database:

Not much work has been done on ranking query results in database systems.

We have all seen examples of ranked results on the internet. The most common example is the internet search engine (like Google): a set of web pages satisfying the user’s search criteria is returned, with the most relevant results featured at the top of the list.

In contrast to the WWW, databases support only a Boolean query model. For example, a selection query on a SQL database returns all tuples that satisfy the conditions specified in the query. Depending on those conditions, two situations may arise:

Empty Answers: when the query is too selective, the answer may be empty.
Many Answers: when the query is not selective enough, the answer may contain too many tuples.
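
To make the two situations concrete, here is a small sketch using Python's built-in sqlite3 module. The `homes` table, its columns, and its rows are invented for illustration and do not come from any real workload.

```python
import sqlite3

# Hypothetical toy table; schema and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE homes (city TEXT, bedrooms INTEGER, price INTEGER)")
conn.executemany(
    "INSERT INTO homes VALUES (?, ?, ?)",
    [
        ("Seattle", 3, 450000),
        ("Seattle", 4, 520000),
        ("Seattle", 2, 380000),
        ("Bellevue", 3, 610000),
    ],
)

# Too selective: no tuple satisfies every condition, so the answer is empty.
empty = conn.execute(
    "SELECT * FROM homes WHERE city = ? AND bedrooms = ? AND price < ?",
    ("Bellevue", 4, 300000),
).fetchall()

# Not selective enough: every Seattle row qualifies, and the Boolean model
# says nothing about which of them the user would prefer.
many = conn.execute("SELECT * FROM homes WHERE city = ?", ("Seattle",)).fetchall()

print(len(empty), len(many))
```

In both cases the Boolean model gives no ordering by relevance, which is the gap a ranking function is meant to fill.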

We next consider these two scenarios in detail and look at various mechanisms for producing ranked results in these circumstances.

The Empty Answers Problem:

The empty answers problem is the consequence of a very selective query in a database system. In this case it would be desirable to return a ranked list of ‘approximately’ matching tuples without burdening the user with specifying any additional conditions. In other words, we want an automated approach for ranking and returning approximately matching tuples.

Automated Ranking Functions:

Automated ranking of query results is the process of taking a user query and mapping it to a Top-K query with a ranking function that depends on the conditions specified in the user query. A ranking function should work well even for large databases and have minimal side effects on query processing.

Automated Ranking Functions for the ‘Empty Answers Problem’:

  • IDF Similarity
  • QF Similarity
  • QFIDF Similarity

IDF Similarity:

IDF (inverse document frequency) similarity is an adaptation of a popular IR technique based on the philosophy that frequently occurring words convey less information about the user’s needs than rarely occurring words, and thus should be weighted less.
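
A minimal sketch of how this idea might carry over to table rows. It assumes the usual IR form IDF(v) = log(n / F(v)), where n is the number of tuples and F(v) is the number of tuples holding value v, and it only handles exact matches on categorical values. The function names and the toy car table are hypothetical.

```python
import math
from collections import Counter

def idf_weights(rows, attribute):
    """Return IDF(v) = log(n / F(v)) for each value v of one attribute."""
    n = len(rows)
    freq = Counter(row[attribute] for row in rows)
    return {value: math.log(n / count) for value, count in freq.items()}

def idf_similarity(query, row, weights_by_attr):
    """Sum the IDF weight of every query condition the row matches exactly."""
    return sum(
        weights_by_attr[attr].get(value, 0.0)
        for attr, value in query.items()
        if row.get(attr) == value
    )

# Illustrative data only.
rows = [
    {"make": "Honda", "color": "red"},
    {"make": "Honda", "color": "blue"},
    {"make": "Honda", "color": "blue"},
    {"make": "Ferrari", "color": "red"},
]
weights = {attr: idf_weights(rows, attr) for attr in ("make", "color")}

# No row is a blue Ferrari, so the Boolean answer is empty; the rows can
# still be ordered by how much rare information they share with the query.
query = {"make": "Ferrari", "color": "blue"}
ranked = sorted(rows, key=lambda r: idf_similarity(query, r, weights), reverse=True)
```

Here the red Ferrari ranks first: matching on the rare value "Ferrari" counts for more than matching on the common value "blue".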

QF Similarity – leveraging workloads:

There may be instances where the relevance of an attribute value is due to factors other than the frequency of its occurrence. QF similarity is based on this very philosophy. According to QF similarity, the importance of an attribute value is directly related to how frequently it occurs in the query strings of the workload.
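
As a rough sketch, QF weights could be computed by counting how often each value appears in past queries. The normalisation by the most frequent value (so that weights fall in (0, 1]) is an assumption made here for illustration, and the workload below is invented.

```python
from collections import Counter

def qf_weights(workload, attribute):
    """Return QF(v) = RQF(v) / RQFMax for the values of one attribute.

    RQF(v) is the number of workload queries that reference value v;
    dividing by the maximum is an assumed normalisation.
    """
    raw = Counter(q[attribute] for q in workload if attribute in q)
    if not raw:
        return {}
    rqf_max = max(raw.values())
    return {value: count / rqf_max for value, count in raw.items()}

# Each past query is modelled as a dict of attribute = value conditions.
workload = [
    {"city": "Seattle"},
    {"city": "Seattle", "bedrooms": 3},
    {"city": "Seattle"},
    {"city": "Tacoma"},
]
city_qf = qf_weights(workload, "city")
```

Values that users ask for often ("Seattle" here) receive the highest weight, regardless of how common they are in the data.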

QFIDF Similarity:

QF is purely workload based, i.e., it does not use data at all. This may be a disadvantage in situations wherein we have insufficient or unreliable workloads.
QFIDF Similarity is a remedy in such situations. It combines QF and IDF weights. This way even if a value is never referenced in the workload, it gets a small non-zero QF.
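
One way the combination might look, as a sketch only: multiply the IDF weight by a smoothed QF. The add-one smoothing (RQF + 1) / (RQFMax + 1) is an assumption chosen so that unreferenced values keep a small non-zero QF; the function name and data are hypothetical.

```python
import math
from collections import Counter

def qfidf_weights(rows, workload, attribute):
    """Return smoothed-QF * IDF for every value of one attribute in the data."""
    n = len(rows)
    data_freq = Counter(row[attribute] for row in rows)
    raw_qf = Counter(q[attribute] for q in workload if attribute in q)
    rqf_max = max(raw_qf.values(), default=0)
    weights = {}
    for value, count in data_freq.items():
        idf = math.log(n / count)
        # Add-one smoothing (assumed): values never seen in the workload
        # still receive a small positive QF instead of zero.
        qf = (raw_qf[value] + 1) / (rqf_max + 1)
        weights[value] = qf * idf
    return weights

# Illustrative data: "Tacoma" occurs in the table but never in the workload.
rows = [{"city": "Seattle"}, {"city": "Seattle"}, {"city": "Tacoma"}, {"city": "Kent"}]
workload = [{"city": "Seattle"}, {"city": "Seattle"}, {"city": "Kent"}]
weights = qfidf_weights(rows, workload, "city")
```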

In the case of the many answers problem, the ranking functions discussed above may fail to perform.
This is because many tuples may tie for the same similarity score. Such a scenario can arise in the empty answers problem as well.
Breaking these ties requires looking beyond the attributes specified in the query, i.e., at the missing attributes.

Many Answers Problem:

We know by now that the many answers problem in database systems is the consequence of queries that are not selective enough.
Such a query produces a large number of tuples that satisfy the conditions specified in the query.
Let us see how ranking of results is accomplished in such a scenario.

Basic Approach:

Any ranking function for the many answers problem has to look beyond the attributes specified in the query, since all or a large number of tuples satisfy the specified conditions.
Determining precisely which unspecified attributes matter is a challenging task. We show an adaptation of Probabilistic Information Retrieval (PIR) ranking methods.

Ranking function for Many Answers Problem:

The ranking function for the many answers problem is developed by adapting the PIR models that best capture data dependencies and correlations.
The ranking score of a tuple depends on two factors: (a) a global score, and (b) a conditional score.
These scores can be computed through workload analysis as well as data analysis.

Ranking Function: Adaptation of PIR Models for Structured Data:

The basic philosophy of PIR models is that given a document collection, ‘D’, the set of relevant documents, ‘R’, and the set of irrelevant documents,
R̄ (= D – R), any document ‘t’ in ‘D’ can be ranked by finding out score(t). The score(t) is the probability of ‘t’ belonging to the relevant set, ‘R’.
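
Written out, with \bar{R} denoting the irrelevant set D – R, the standard PIR argument goes as follows. This is the textbook derivation via Bayes' rule, offered as a sketch rather than as the exact formulation used by any particular system.

```latex
\begin{align*}
\mathrm{score}(t) &= P(R \mid t) = \frac{P(t \mid R)\,P(R)}{P(t)} \\[4pt]
\frac{P(R \mid t)}{P(\bar{R} \mid t)}
  &= \frac{P(t \mid R)\,P(R)}{P(t \mid \bar{R})\,P(\bar{R})}
  \;\propto\; \frac{P(t \mid R)}{P(t \mid \bar{R})}
\end{align*}
```

Since P(R) and P(\bar{R}) are the same for every ‘t’, and the odds grow monotonically with P(R | t), ordering documents by the ratio P(t | R) / P(t | \bar{R}) gives the same ranking as ordering them by score(t).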

Problem in adapting this approach:

The problem in computing score(t) with a PIR model for databases is that the relevant set, ‘R’, is unknown at query time.
The approach is well suited to the IR domain, where ‘R’ is usually determined through user feedback.
Feedback-based estimation of ‘R’ might be attempted in databases as well, but we propose an automated approach.

Architecture of Ranking System:

[Figure: architecture of the ranking system]

Implementation:

Pre-Processing: the pre-processing component is composed of the ‘Atomic Probabilities Module’ and the ‘Index Module’.

Atomic Probabilities Module – is responsible for computing the atomic probabilities needed to evaluate the ranking function, Score(t).

Index Module – is responsible for pre-computing the ranked lists that improve the efficiency of the query processing module.

Intermediate Layer: the atomic probabilities and the lists computed by the Index Module are all stored as database tables in the intermediate layer. All the tables in the intermediate layer are indexed on appropriate attributes for fast access during the later stages.

The primary purpose of the intermediate layer is to avoid computing the score from scratch each time a query is received, by storing pre-computed results of all atomic computations.

The Index Module:

  • The Index Module pre-computes the ranked lists of tuples for each possible atomic query.
  • Its purpose is to take the run-time load off the query processing component.
  • To assist the query processing component in returning the Top-K tuples, ranked lists of the tuples for all possible “atomic” queries are pre-computed.
  • Taking the association rules and the database as input, a Conditional List and a Global List are created for each distinct value ‘x’ in the database.
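
The pre-computation described above could be sketched roughly as follows. The sketch assumes an atomic query is a single attribute = value condition, and `score` is a hypothetical callable standing in for whichever global or conditional score is being indexed; how those scores are derived from the association rules is not shown.

```python
from collections import defaultdict

def build_ranked_lists(rows, score):
    """Precompute, for every atomic query (attribute = value), the matching
    tuple ids sorted by descending score.

    `score` is a hypothetical callable (tuple_id, row) -> float.
    """
    lists = defaultdict(list)
    for tid, row in enumerate(rows):
        s = score(tid, row)
        for attr, value in row.items():
            lists[(attr, value)].append((s, tid))
    # Sort each list by score (highest first), breaking ties by tuple id.
    return {
        key: [(tid, s) for s, tid in sorted(entries, key=lambda e: (-e[0], e[1]))]
        for key, entries in lists.items()
    }

# Illustrative rows and scores only.
rows = [
    {"city": "Seattle", "view": "lake"},
    {"city": "Seattle", "view": "none"},
    {"city": "Tacoma", "view": "lake"},
]
toy_scores = [0.2, 0.9, 0.5]
index = build_ranked_lists(rows, lambda tid, row: toy_scores[tid])
```

At query time, the list for a condition such as city = Seattle can then be read in score order instead of being computed from scratch.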

Query Processing Component:

The list merge algorithm is the core of the query processing component.

Its function is to take the user query, compute scores for all the tuples that satisfy the condition specified in the query, rank the tuples in a sorted order of the scores and then return the Top-K tuples.
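
The input and output of this step can be illustrated with a naive baseline: scan the rows, keep those that satisfy the query, and pick the K best with a heap. This is not the list merge algorithm itself, which works from the pre-computed lists; it only shows what result that algorithm is expected to return. The `score` callable and the weights below are hypothetical.

```python
import heapq

def top_k(rows, query, score, k):
    """Return the k highest-scoring (score, tuple_id) pairs among the rows
    that satisfy every equality condition in `query`."""
    matches = (
        (score(row), tid)
        for tid, row in enumerate(rows)
        if all(row.get(attr) == value for attr, value in query.items())
    )
    return heapq.nlargest(k, matches)

# Illustrative data: the query names only the city, so the ranking has to
# rely on an attribute the user did not specify (here, the view).
rows = [
    {"city": "Seattle", "view": "lake"},
    {"city": "Seattle", "view": "none"},
    {"city": "Seattle", "view": "mountain"},
    {"city": "Tacoma", "view": "lake"},
]
view_weight = {"lake": 0.9, "mountain": 0.6, "none": 0.1}
result = top_k(rows, {"city": "Seattle"}, lambda r: view_weight[r["view"]], 2)
```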

Space Requirements:

To build the conditional and the global lists, the space consumed is O(mn) bytes (where ‘m’ is the number of attributes and ‘n’ is the number of tuples in the database table).

There may be applications where space is an expensive resource.

In such cases, only a subset of the lists may be stored at pre-processing time, but this comes at the expense of an increase in query processing time.

Next…..

The ranking function presented here works on single-table databases and does not allow NULL values. A very interesting but challenging extension to this work would be to develop ranking functions that work on multi-table databases and allow NULLs as well as non-text data in database columns.
