Thursday, December 22, 2005

The importance of contextual metadata... (Part 1)

As the amount of information in the world increases, so does the need to search it effectively. The two main tools we have to improve search are statistical metrics and metadata. (I'm ignoring things such as anchor weighting, since I'm assuming a non-web environment, and constant boosting of certain authors, since I'm assuming that all information is important.)

When I say statistical metrics, I'm referring to things like this: in a search for "soy or linseed", statistically one word will occur more often than the other, so the two terms should probably not carry equal weight (the rarer one is usually the more telling). Likewise, a document containing both soy and linseed is more important than one containing just one of the search terms, and so on.
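
To make that a bit more concrete, here is a minimal sketch of one way such weighting could work. It assumes inverse document frequency (IDF) as the statistic, which is my choice for illustration and not the only option. The toy corpus and the function names (tokenize, idf, score, search) are all made up for the example.

```python
import math

# Hypothetical toy corpus; the documents are invented for illustration.
DOCS = {
    "d1": "soy milk and soy sauce",
    "d2": "linseed oil",
    "d3": "soy and linseed bread",
    "d4": "wheat bread",
    "d5": "soy flour",
}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def idf(term, docs):
    """Smoothed inverse document frequency: rarer terms get a larger weight."""
    containing = sum(1 for text in docs.values() if term in tokenize(text))
    return math.log((1 + len(docs)) / (1 + containing)) + 1


def score(query_terms, text, docs):
    """Sum the IDF weight of every query term that appears in the document."""
    tokens = set(tokenize(text))
    return sum(idf(term, docs) for term in query_terms if term in tokens)


def search(query_terms, docs):
    """Return ids of matching documents, best score first (ties broken by id)."""
    scored = [(score(query_terms, text, docs), doc_id) for doc_id, text in docs.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in ranked if s > 0]


if __name__ == "__main__":
    print(search(["soy", "linseed"], DOCS))
```

In this toy corpus linseed appears in fewer documents than soy, so a linseed-only document outranks a soy-only one, and the document containing both terms comes out on top.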

The next aid to finding interesting results is metadata. Pre-existing metadata for a document is nice to have but is often incorrect or inaccurate, so judgements on how much weight to give it must be made, usually on a case-by-case basis (that is, by inspecting the data to be searched over).

Next we have created metadata. This data helps to define things about the document itself. For example, people or places can be extracted, which allows us to drill down on those values in searches. Other contextual information can be gathered from the text of the document, such as identifying a title or a heading and giving it more weight.
What does a search for bush give us? A plant, a president, and a pro footballer. By recognising people we can limit the results to the president and the pro footballer, by searching for bush inside the people metadata.
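
The bush example can be sketched roughly as follows. This assumes an extraction step has already run and stored the people it found in a "people" field on each document. The documents, the field name, and the two search functions are hypothetical, and the extraction itself is not shown.

```python
# Hypothetical documents; the "people" field is assumed to have been
# filled in by an earlier entity-extraction step (not shown here).
DOCUMENTS = [
    {"id": "gardening", "text": "Prune the bush in early spring.", "people": []},
    {"id": "politics", "text": "President Bush addressed the nation.", "people": ["Bush"]},
    {"id": "football", "text": "Bush ran for two touchdowns.", "people": ["Bush"]},
]


def search_text(term, documents):
    """Plain full-text search: match the term anywhere in the body text."""
    term = term.lower()
    return [doc["id"] for doc in documents if term in doc["text"].lower()]


def search_people(term, documents):
    """Metadata search: match the term only against extracted people."""
    term = term.lower()
    return [
        doc["id"]
        for doc in documents
        if any(term == person.lower() for person in doc["people"])
    ]


if __name__ == "__main__":
    print(search_text("bush", DOCUMENTS))    # the plant, the president, the footballer
    print(search_people("bush", DOCUMENTS))  # only the two people
```

The plain text search returns all three documents, while the search against the people field drops the plant.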