ISODAC: A high performance solution for indexing and searching heterogeneous data

Abstract
Searching for words or sentences within large sets of textual documents can be very challenging unless an index of the data has been created in advance. However, indexing can be very time-consuming, especially if the text is not readily available and has to be extracted from files stored in different formats. Several solutions based on the MapReduce paradigm have been proposed to accelerate index creation. These solutions perform well when the data are already distributed across the hosts involved in the processing; otherwise, the cost of distributing the data can introduce noticeable overhead. We propose ISODAC, a new approach aimed at improving efficiency without sacrificing reliability. Our solution reduces the number of I/O operations to the bare minimum by using a stream of in-memory operations to extract and index text. We further improve performance by using GPUs for the most computationally intensive tasks of the indexing procedure. ISODAC indexes heterogeneous documents up to 10.6x faster than other widely adopted solutions, such as Apache Spark. As a proof of concept, we developed a tool to index forensic disk images that investigators can easily use through a web interface.
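To illustrate the general idea of extracting and indexing text in memory, the following minimal Python sketch builds an inverted index from raw document bytes without writing intermediate files. It is not ISODAC's implementation: the names `extract_text`, `build_index`, and `search`, the UTF-8 decoding step, and the sample documents are all assumptions made for this example, and the GPU acceleration mentioned in the abstract is not shown.

```python
import io
import re
from collections import defaultdict

TOKEN = re.compile(r"\w+")


def extract_text(raw):
    """Hypothetical extractor: decode bytes from an in-memory stream.

    A real extractor would dispatch on file format; here every
    document is assumed to be UTF-8 text.
    """
    with io.BytesIO(raw) as stream:
        return stream.read().decode("utf-8", errors="ignore")


def build_index(documents):
    """Map each lowercase token to the set of document names containing it."""
    index = defaultdict(set)
    for name, raw in documents:
        for token in TOKEN.findall(extract_text(raw).lower()):
            index[token].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word of the query."""
    words = TOKEN.findall(query.lower())
    if not words:
        return set()
    return set.intersection(*(index.get(word, set()) for word in words))


# Sample documents (invented for this example).
docs = [
    ("a.txt", b"Forensic disk images are indexed in memory."),
    ("b.txt", b"Spark distributes data across hosts."),
]
index = build_index(docs)
print(sorted(search(index, "disk images")))
```

Once the index exists, a query only intersects a few sets instead of rescanning every document, which is why building the index in advance pays off for repeated searches.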
Year
2016
IAC Authors
Publication type
Other Authors
Totaro G.; Bernaschi M.; Carbone G.; Cianfriglia M.; Di Marco A.
Publisher
Elsevier North Holland
Journal
The Journal of Systems and Software