Information retrieval, as the name implies, concerns retrieving relevant information from databases. It is chiefly concerned with facilitating the user’s access to large amounts of (predominantly textual) information.
“Process of searching within a document collection for a particular information need which is called a query”- Langville & Meyer
“Information retrieval deals with the representation, storage, organization of, and access to information items, in order to give the user the possibility to easily access the desired information”- Baeza-Yates
Information Retrieval (IR) is the activity of obtaining information from large collections of information sources in response to a need.
The Information Retrieval process works as follows:
- The Information Retrieval process starts when a user enters a query into the system through a graphical interface.
- These user-defined queries are statements of the information needed, for example the queries entered by users in search engines.
- In IR, a single query does not match exactly one data object; instead, it matches several data objects in the collection, from which the most relevant documents are taken for further evaluation.
- The matching documents are ranked to find the ones most related to the given query.
- This ranking is the key difference between database searching and Information Retrieval.
- The query is sent to the core of the system, which has access to the content management module that is directly linked with the back end, i.e. the large collections of data objects.
- Once the core system has generated the results, they are returned to the user through the graphical user interface.
- The process repeats and the results are refined until the user is satisfied with what he or she was looking for.
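The contrast between exact-match database searching and ranked retrieval can be sketched in a few lines of Python. This is only a minimal illustration, assuming a toy three-document collection and a simple term-overlap score; the names `DOCS`, `exact_match`, and `rank` are hypothetical, and real engines use more refined weighting schemes.

```python
import re
from collections import Counter

# Toy collection (an assumption for illustration only).
DOCS = {
    "d1": "information retrieval ranks documents by relevance",
    "d2": "a database returns rows that match a key exactly",
    "d3": "search engines rank web documents for a user query",
}

def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def exact_match(query):
    """Database-style lookup: only a document identical to the query matches."""
    return [doc_id for doc_id, text in DOCS.items() if text == query]

def rank(query):
    """IR-style lookup: score every document by the number of query-term
    occurrences it contains and return the matching ids, best first."""
    q_terms = set(tokens(query))
    scores = {}
    for doc_id, text in DOCS.items():
        counts = Counter(tokens(text))
        score = sum(counts[t] for t in q_terms)
        if score > 0:
            scores[doc_id] = score
    # Sort by descending score; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

print(exact_match("rank documents"))  # no document equals the query
print(rank("rank documents"))         # several documents match, in ranked order
```

Note that `d1` scores lower than `d3` here only because "ranks" and "rank" are treated as different terms; this is the kind of mismatch that stemming is meant to reduce.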
The operations on the textual data of documents are illustrated in the figure (see the embedded document below), which sketches the processing typically performed by an Information Retrieval engine: it takes a document as input and yields its index terms.
- Documents come from different sources and in different combinations of languages, formats, and character sets; a single document may even contain more than one language, e.g. a Spanish email with some parts in French.
- Document parsing therefore deals with the overall document structure. In this phase, the document is broken down into discrete components. In the preprocessing phase, unit documents are created, for example one unit document representing an email and another representing an additional specific part of it.
- In lexical analysis, tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. These tokens are then passed on to part-of-speech tagging.
- Typically, tokenization occurs at the word level.
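Word-level tokenization can be sketched with a single regular expression. The function name `tokenize` and the rule for keeping internal apostrophes and hyphens are assumptions made for this illustration, not a standard; production tokenizers handle many more cases.

```python
import re

def tokenize(text):
    """Split text into lowercase word-level tokens.

    Runs of word characters, optionally joined by an internal apostrophe
    or hyphen, are kept as tokens; other punctuation is dropped.
    """
    return re.findall(r"\w+(?:['-]\w+)*", text.lower())

print(tokenize("The user's query: state-of-the-art IR systems!"))
```

Because `\w` matches Unicode letters in Python 3, the same pattern also copes with accented characters in Spanish or French text.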
Stemming and Lemmatization
- In English, grammatical sentence structure requires us to use different forms of a word, e.g. go, going, goes. Stemming is the process of cutting off the affixes so that the root of the word is left. A regular plural, for instance, is formed as noun + plural affix, and stemming removes that affix.
- Lemmatization usually refers to doing this properly, using a vocabulary and morphological analysis of words, and aims to remove inflectional endings only.
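The difference between the two can be sketched as follows. This is a deliberately naive illustration: the suffix list `SUFFIXES` and the vocabulary `LEMMAS` are hypothetical toy data, and `naive_stem` is not the Porter algorithm or any other published stemmer.

```python
# Toy suffix list, checked in order (an assumption; real stemmers such as
# Porter's use many more rules and conditions).
SUFFIXES = ("ing", "es", "s")

# Tiny hand-made vocabulary mapping inflected forms to lemmas (hypothetical).
LEMMAS = {"goes": "go", "going": "go", "went": "go", "gone": "go"}

def naive_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    """Look the word up in the vocabulary; fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ["go", "going", "goes", "went", "cats"]:
    print(w, naive_stem(w), lemmatize(w))
```

Suffix stripping reduces "going" and "goes" to "go" but leaves the irregular form "went" untouched, whereas the vocabulary lookup maps it to "go"; conversely, the lookup can only handle words that are in its vocabulary.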
COMPONENTS OF INFORMATION RETRIEVAL
The figure shows the functional approach of the information retrieval system. The system has three major components:
- A set of information items
- A set of requests
- A set of mapping mechanisms
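The three components can be made concrete with a small sketch. It assumes toy data for the items and requests and uses an inverted index with Boolean AND matching as one possible mapping mechanism; the names `ITEMS`, `REQUESTS`, `build_index`, and `match` are hypothetical.

```python
from collections import defaultdict

# 1. A set of information items (toy documents, assumed for illustration).
ITEMS = {
    1: "text retrieval from the web",
    2: "multimedia retrieval of radio broadcasts",
    3: "text analysis and automatic indexing",
}

# 2. A set of requests (queries).
REQUESTS = ["text retrieval", "automatic indexing", "video"]

def build_index(items):
    """Build an inverted index: term -> set of ids of items containing it."""
    index = defaultdict(set)
    for item_id, text in items.items():
        for term in text.lower().split():
            index[term].add(item_id)
    return index

def match(request, index):
    """3. A mapping mechanism: return the ids of items that contain every
    term of the request (a simple Boolean AND)."""
    term_sets = [index.get(term, set()) for term in request.lower().split()]
    return sorted(set.intersection(*term_sets)) if term_sets else []

INDEX = build_index(ITEMS)
for request in REQUESTS:
    print(request, "->", match(request, INDEX))
```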
APPLICATIONS OF INFORMATION RETRIEVAL
Text Information Retrieval
Perhaps one of the most common and well-known applications of information retrieval is the retrieval of text documents from the internet. With its recent growth, the internet is fast becoming the main medium of communication for business and academic information. It is thus essential to be able to pick the right document out of this vast ocean of information. This is, in fact, one of the main driving forces behind the development of information retrieval. To date, many relatively successful systems have been developed. Some examples include:
NetOwl: NetOwl is an advanced information retrieval system with automatic indexing and summarization capabilities. The product provides an easy, cost-efficient way for common users to benefit from text analysis aimed at intelligence analysts.
NetOwl makes use of a combination of computational linguistics and knowledge-based pattern matching methods to analyze natural language and determine the categories of words in the language. By identifying key concepts and relationships, it allows users to quickly find relevant content, eliminate inappropriate materials, and get the information they need. An additional feature is that NetOwl is capable of building an electronic “back of the book” type index on a company’s own web server, which enables users to spot important information or launch a request for information.
EUROSPIDER: The EUROSPIDER system is an Information Retrieval (IR) system which searches very large and complex data collections for relevant information. It is a commercial version of the IR system SPIDER, developed by the Swiss Federal Institute of Technology. EUROSPIDER can be used in various ways:
1. as a standalone IR system
2. as an add-on to a World-Wide Web server which makes a data collection accessible through a private or public network
3. added to a commercial database (DB) system to access possibly very dynamic and structured data.
The EUROSPIDER retrieval system provides advanced Information Retrieval (IR) functions such as relevance ranking, feedback searches, linguistic document analysis, and automatic indexing. Document analysis and indexing optionally include fuzzy term matching to cope with recognition errors of OCR devices.
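The idea behind fuzzy term matching can be illustrated with Python's standard `difflib` module. This is a generic sketch and not EUROSPIDER's actual method, which the text does not describe; the vocabulary, the helper `fuzzy_lookup`, and the cutoff of 0.8 are assumptions chosen for the example.

```python
import difflib

# Index vocabulary (toy example, assumed for illustration).
VOCABULARY = ["information", "retrieval", "indexing", "document", "ranking"]

def fuzzy_lookup(term, vocabulary, cutoff=0.8):
    """Return the vocabulary term most similar to `term`, or None if no
    candidate reaches the similarity cutoff (between 0.0 and 1.0)."""
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# "rn" read instead of "m" and "1" instead of "l" are typical OCR confusions.
print(fuzzy_lookup("inforrnation", VOCABULARY))
print(fuzzy_lookup("retrieva1", VOCABULARY))
print(fuzzy_lookup("xyz", VOCABULARY))
```

A lower cutoff tolerates more recognition errors but also increases the risk of mapping a term to the wrong index entry.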
Multimedia Information Retrieval
In this era of information overload, the amount of information available to us is so large that it is virtually impossible to deal with it efficiently. One solution to this problem is to set up databases for multimedia data. Hundreds of television and radio broadcasts would then be covered by a database application that keeps track of the information available. This vast amount of information could then be captured and managed efficiently.
STRATEGIES OF RETRIEVAL PROCESS
1. The user has an information need
2. The user forms a query
3. The user sends the query to a system
4. The system returns an answer set
5. The user views and evaluates the results
6. If the user is satisfied, s/he stops
7. If the user is not satisfied, s/he modifies the query and returns to step 3
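The loop described by these steps can be sketched as a small simulation. Everything here is an assumption made for illustration: the toy `COLLECTION`, the all-words-must-match `search` function, and a scripted "user" represented by a list of query reformulations and a satisfaction test.

```python
# Toy collection (an assumption for illustration only).
COLLECTION = {
    "d1": "jaguar speed and habitat in the wild",
    "d2": "jaguar car dealership prices",
    "d3": "big cats of south america",
}

def search(query):
    """System side: return the ids of documents containing every query word."""
    words = query.lower().split()
    return [doc_id for doc_id, text in COLLECTION.items()
            if all(w in text.split() for w in words)]

def retrieval_session(reformulations, is_satisfied):
    """User side: send the queries one by one and stop as soon as the
    answer set is judged satisfactory; otherwise move on to the next
    (modified) query."""
    for query in reformulations:
        answer_set = search(query)
        if is_satisfied(answer_set):
            return query, answer_set
    return None, []  # the user ran out of reformulations

# The information need: the animal, not the car.
queries = ["jaguar", "jaguar animal", "jaguar habitat"]
print(retrieval_session(queries, lambda answers: answers == ["d1"]))
```

The first query is too broad, the second returns nothing, and only the third reformulation yields an answer set the user accepts.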
Querying
– documents are described explicitly with query words (keywords)
– the result is ad hoc document clusters
A search engine query is a request for information made using a search engine. Every time a user types a string of characters into a search engine and presses “Enter”, a search engine query is made. The string of characters (often one or more words) acts as keywords that the search engine uses to algorithmically match results with the query. These results are displayed on the search engine results page (SERP), ranked in order of significance according to the algorithm.
Browsing
– the user starts from some possibly interesting topic/idea/document and browses documents to find relevant ones
– if no relevant documents are found, the user will move somewhere else
– the starting point can be found by querying
– assumption: documents on the same topic are organised together
Navigating
– the user follows hyperlinks towards a known goal (e.g. Department of Education, AMU, Aligarh)
– the route is assumed to be known, or it is easily found out while navigating
Scanning
– the user scans the titles of the answer list, documents, hyperlinks, metadata, etc.
– auxiliary operation: e.g. when scanning, the seeker selects a hyperlink to follow
Filtering
– the goal is to select interesting documents for a person or an organisation from a document flow (e.g. today’s news, emails), or to remove unwanted ones
Routing
– a document from a document flow is routed to a person who is interested in the document or to whose field of activities it belongs (e.g. questions by customers are routed to different experts)
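Selecting or removing documents from a flow, and routing them to interested people, can be sketched with simple keyword profiles. The profiles, the blocked keywords, and the function names `keep` and `route` are hypothetical; real systems typically rely on weighted or learned profiles rather than plain keyword overlap.

```python
# Hypothetical interest profiles: recipient -> keywords of their field.
PROFILES = {
    "billing_expert": {"invoice", "payment", "refund"},
    "tech_expert": {"error", "crash", "install"},
}

# Hypothetical keywords that mark unwanted documents.
BLOCKED = {"lottery", "prize"}

def words(text):
    """Lowercase the text and return its set of words."""
    return set(text.lower().split())

def keep(document):
    """Filtering: keep a document unless it contains a blocked keyword."""
    return not (words(document) & BLOCKED)

def route(document):
    """Routing: return the recipients whose profile shares at least one
    keyword with the document, sorted by name."""
    doc_words = words(document)
    return sorted(name for name, profile in PROFILES.items()
                  if doc_words & profile)

# A small document flow, e.g. incoming customer emails.
flow = [
    "my payment failed and i need a refund",
    "the app shows an error after install",
    "you won a lottery prize",
]
for doc in flow:
    if keep(doc):
        print(doc, "->", route(doc))
```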