The Internet Grabber
The Internet Grabber module takes care of content acquisition from the Internet, producing essences that can contain images, videos or text. This component, due to the heterogeneous nature of the material it deals with, must fulfill some technical and functional requirements:
- It must be scalable, in order to exploit computing and networking resources;
- It must be flexible, in order to conveniently address a large number of target web sites;
- It must provide tools to process different kinds of downloaded media;
- It must be easily configured.
The Internet Grabber architecture has therefore been designed around the four main steps of the grabbing process:
- The Fetcher downloads content of the desired type from web sites;
- The Filter accepts or discards essences according to given criteria;
- The Comparer compares each artifact against a cache, in order to discard already stored essences;
- The Transformer performs further processing on artifacts, when needed. Currently it can only handle Flash animations, converting them into MPEG-4 video streams and animated GIFs.
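The Filter and Comparer steps above can be sketched as follows. This is only an illustration: the Essence record, the acceptance criteria (media type and minimum size) and the use of a SHA-256 content digest as the cache key are all assumptions, since the text does not specify how essences are represented or compared.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Essence:
    # Hypothetical representation of a downloaded item.
    url: str
    media_type: str  # e.g. "image", "video", "text"
    data: bytes


def accept(essence, allowed_types, min_size):
    """Filter step: keep an essence only if it meets the given criteria."""
    return essence.media_type in allowed_types and len(essence.data) >= min_size


class Comparer:
    """Comparer step: discard essences whose content is already cached."""

    def __init__(self):
        # Assumed cache: SHA-256 digests of content already stored.
        self._seen = set()

    def is_new(self, essence):
        digest = hashlib.sha256(essence.data).hexdigest()
        if digest in self._seen:
            return False  # same bytes were stored before, whatever the URL
        self._seen.add(digest)
        return True
```

Keying the cache on the content rather than on the URL means that the same file downloaded from two different sites is stored only once; a real deployment would need a persistent cache rather than an in-memory set.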
As depicted in the pictures below, the Internet Grabber modules create a processing chain, where essences are passed between modules by means of queues: when an essence is successfully processed by a module, it is enqueued for the next one.
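A minimal sketch of such a queue-based chain is given here, assuming each module can be modelled as a function that returns the processed essence, or None when the essence is discarded. The names run_stage, run_chain and STOP are hypothetical, and threads stand in for modules that may in practice run on separate machines.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def run_stage(process, inbox, outbox):
    """Consume essences from inbox; enqueue successful results on outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            return
        try:
            result = process(item)
        except Exception:
            continue  # processing failed: the essence is not passed on
        if result is not None:  # None means the essence was discarded
            outbox.put(result)


def run_chain(stages, items):
    """Wire stages together with queues and return what leaves the last one."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(stage, queues[i], queues[i + 1]))
        for i, stage in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(STOP)
    for t in threads:
        t.join()
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            return results
        results.append(item)
```

Because each stage only reads from its own inbox and writes to the next queue, a stage can be moved to another process or host by replacing the in-memory queue with a networked one, without changing the stage logic.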
A library is provided for efficient and secure essence transfer. It uses the most efficient channel available to move content from a source to a destination: if the two are co-located, the copy uses the file system API; otherwise an FTP transfer is used. Both the client and the server side are configured to use the fastest channel.
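The channel selection described above might look like the following sketch. It is not the library's actual interface: the function names, the way co-location is detected (a fixed list of local host names) and the FTP credentials are assumptions made for illustration. Only standard-library calls (shutil.copyfile and ftplib.FTP) are used.

```python
import shutil
from ftplib import FTP
from pathlib import Path


def is_colocated(dest_host, local_hosts=("localhost", "127.0.0.1")):
    """Assumed co-location check: the destination is one of our own host names."""
    return dest_host in local_hosts


def transfer(src_path, dest_host, dest_path, ftp_user="anonymous", ftp_password=""):
    """Copy src_path to dest_path using the cheapest available channel."""
    if is_colocated(dest_host):
        # Same machine: a plain file system copy avoids any network overhead.
        Path(dest_path).parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src_path, dest_path)
        return "filesystem"
    # Different machine: fall back to an FTP upload.
    with FTP(dest_host) as ftp:
        ftp.login(ftp_user, ftp_password)
        with open(src_path, "rb") as fh:
            ftp.storbinary(f"STOR {dest_path}", fh)
    return "ftp"
```

Plain FTP sends data unencrypted; where the transfer must also be secure over an untrusted network, ftplib's FTP_TLS class could be substituted for FTP in this sketch.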
As a result, the system can be deployed in many ways: all four modules can reside on the same server, if processing power is sufficient, or they can be distributed across different machines for heavyweight tasks.
The system can be conveniently configured via a graphical interface, through which both the descriptive (anagraphic) data and the runtime process parameters can be set.
