Authors
Raúl Ramos Pollán, Álvaro Barreiro
Publication venue
Conference: EELA2, Bogotá (Colombia)
Date: 25-27 February 2009
Publication type: Oral
Abstract
Information Retrieval (IR) deals with the representation, storage and organization of, and access to, information items. It is the core of today's search engines and underlies their popularity and usefulness. Behind this success lies a well-established experimental methodology based on large corpora of data (documents, queries and relevance judgements), against which new IR models and software implementations are accurately tested and evaluated. Recently, distributed IR models and technologies have become increasingly important, as there is an emerging need to search across federated collections of documents, such as those that might exist in Grid environments, where different resource centers may hold different collections that must be searched in a distributed manner in response to a user's information request.
Testing distributed IR models and technologies systematically requires significant resources: each time we want to test a new model or piece of software, we need to deploy many collections of documents and run thousands of queries against them, measuring effectiveness and efficiency. Our goal is to use the Grid itself for this purpose.
In this work we present the set of technologies we developed to run large-scale distributed IR experiments on Grid infrastructures. These technologies allow us to easily design, set up and run distributed IR experiments using standard Grid job submission mechanisms. We accomplish this by tightly integrating virtualization and cloud computing techniques within a gLite environment, in a model that can be easily generalized for use by other scientific disciplines. This also constitutes a significant step forward in making Grid infrastructures easily exploitable by the IR community.