Publications Roadrunner

Tutorials

1. Christos Doulkeridis, Processing Joins over Big Data in MapReduce, 12th Hellenic Data Management Symposium (HDMS'14), Athens, Greece, July 2014.

International Journals

1. Dimitris Pertesis, Christos Doulkeridis, Efficient Skyline Query Processing in SpatialHadoop, In Information Systems (to appear), doi:10.1016/j.is.2014.10.003.

Abstract

This paper studies the problem of computing the skyline of a vast-sized spatial dataset in SpatialHadoop, an extension of Hadoop that supports spatial operations efficiently. The problem is particularly interesting due to the advent of Big Spatial Data that are generated by modern applications run on mobile devices, and also because of the importance of the skyline operator for decision-making and supporting business intelligence. To this end, we present a scalable and efficient framework for skyline query processing that operates on top of SpatialHadoop, and can be parameterized by individual techniques related to filtering of candidate points as well as merging of local skyline sets. Then, we introduce two novel algorithms that follow the pattern of the framework and boost the performance of skyline query processing. Our algorithms employ specific optimizations based on effective filtering and efficient merging, the combination of which is responsible for improved efficiency. We compare our solution against the state-of-the-art skyline algorithm in SpatialHadoop. The results show that our techniques are more efficient and outperform the competitor significantly, especially in the case of large skyline output size.
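As an illustration of the general pattern mentioned in the abstract (local skyline sets computed per partition, then merged), here is a minimal single-machine sketch in Python. It is not the paper's algorithm and does not use SpatialHadoop; the function names, the "smaller is better" convention, and the way partitions are supplied are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """p dominates q if p is no worse in every dimension and strictly
    better in at least one (assumption: smaller values are better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points: Sequence[Point]) -> List[Point]:
    """Simple nested-loop skyline: keep only points not dominated by any other."""
    result: List[Point] = []
    for p in points:
        if any(dominates(s, p) for s in result):
            continue  # p is dominated by a current skyline point
        # p survives; drop any current skyline points that p dominates
        result = [s for s in result if not dominates(p, s)]
        result.append(p)
    return result


def partitioned_skyline(partitions: Sequence[Sequence[Point]]) -> List[Point]:
    """Compute a local skyline per partition, then merge the local skylines."""
    local_skylines = [skyline(part) for part in partitions]
    candidates = [p for local in local_skylines for p in local]
    return skyline(candidates)


# Hypothetical data: two partitions of 2-d points.
partitions = [
    [(1, 5), (2, 2), (3, 3)],
    [(5, 1), (4, 4), (2, 6)],
]
global_skyline = partitioned_skyline(partitions)
```

The merge step is correct because a point dominated inside its own partition can never belong to the global skyline, so only local skyline points need to be compared across partitions.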

2. Orestis Gkorgkas, Akrivi Vlachou, Christos Doulkeridis, Kjetil Nørvåg, Exploratory product search using top-k join queries, In Information Systems (to appear), 2016.

Abstract

Given a relation that contains main products and a set of relations corresponding to accessory products that can be combined with a main product, the Exploratory Top-k Join query retrieves the k best combinations of main and accessory products based on user preferences. As a result, the user is presented with a set of k combinations of distinct main products, where a main product is combined with accessory products only if the combination has a better score than the single main product. We model this problem as a rank-join problem, where each combination is represented by a tuple from the main relation and a set of tuples from (some of) the accessory relations. The nature of the problem is challenging because the inclusion of accessory products is not predefined by the user, but instead all potential combinations (joins) are explored during query processing in order to identify the highest scoring combinations. Existing approaches cannot be directly applied to this problem, as they are designed for joining a predefined set of relations. In this paper, we present algorithms for processing exploratory top-k joins that adopt the pull-bound framework for rank-join processing. We introduce a novel algorithm (XRJN) which employs a more efficient bounding scheme and allows earlier termination of query processing. We also provide theoretical guarantees on the performance of this algorithm, by proving that XRJN is instance-optimal. In addition, we consider a pulling strategy that boosts the performance of query processing even further. Finally, we conduct a detailed experimental study that demonstrates the efficiency of the proposed algorithms in various setups.
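For readers unfamiliar with rank-join processing, the sketch below shows the classic two-input idea behind pull-bound style algorithms: pull tuples from score-sorted inputs, and stop as soon as the k-th best join result reaches an upper bound on any result not yet produced. This is not XRJN and does not handle exploratory (variable-size) combinations; the sum scoring function, the alternating pulling strategy, and all names are assumptions made for this example.

```python
import heapq
from typing import Dict, List, Tuple

Row = Tuple[str, float]  # (join key, score)


def rank_join(left: List[Row], right: List[Row], k: int) -> Tuple[List[Tuple[float, str]], int]:
    """Top-k join of two inputs sorted by descending score.

    Returns the top-k (combined score, join key) pairs and the number of
    tuples pulled. Assumption: the combined score is the sum of both scores.
    """
    if k <= 0:
        return [], 0
    inputs = (left, right)
    seen: Tuple[Dict[str, List[float]], Dict[str, List[float]]] = ({}, {})
    pos = [0, 0]                       # next position to read per input
    top: List[Tuple[float, str]] = []  # min-heap holding the best k results
    turn = 0
    while pos[0] < len(left) or pos[1] < len(right):
        if pos[turn] >= len(inputs[turn]):
            turn = 1 - turn            # skip an exhausted input
        key, score = inputs[turn][pos[turn]]
        pos[turn] += 1
        seen[turn].setdefault(key, []).append(score)
        # Join the new tuple with matching tuples already seen on the other side.
        for other in seen[1 - turn].get(key, []):
            heapq.heappush(top, (score + other, key))
            if len(top) > k:
                heapq.heappop(top)
        turn = 1 - turn
        if pos[0] > 0 and pos[1] > 0:
            last_left = left[pos[0] - 1][1]
            last_right = right[pos[1] - 1][1]
            # Upper bound on the score of any join result not yet produced.
            threshold = max(left[0][1] + last_right, last_left + right[0][1])
            if len(top) == k and top[0][0] >= threshold:
                break                  # early termination
    return sorted(top, reverse=True), pos[0] + pos[1]


# Hypothetical inputs, each sorted by descending score.
left = [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.1)]
right = [("b", 0.95), ("a", 0.7), ("d", 0.2), ("c", 0.1)]
best, pulled = rank_join(left, right, k=1)
```

On these inputs the loop stops before reading every tuple, which is the point of the bounding scheme: a tighter bound lets processing terminate earlier.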

3. Nikos Pelekis, Panagiotis Tampakis, Marios Vodas, Christos Doulkeridis, Yannis Theodoridis, On temporal-constrained sub-trajectory cluster analysis, Data Mining and Knowledge Discovery, 1-37. DOI 10.1007/s10618-017-0503-4.

Abstract

Cluster analysis over Moving Object Databases (MODs) is a challenging research topic that has attracted the attention of the mobility data mining community. In this paper, we study the temporal-constrained sub-trajectory cluster analysis problem, where the aim is to discover clusters of sub-trajectories given an ad-hoc, user-specified temporal constraint within the dataset’s lifetime. The problem is challenging because: (a) the time window is not known in advance, instead it is specified at query time, and (b) the MOD is continuously updated with new trajectories. Existing solutions first filter the trajectory database according to the temporal constraint, and then apply a clustering algorithm from scratch on the filtered data. However, this approach is extremely inefficient, when considering explorative data analysis where multiple clustering tasks need to be performed over different temporal subsets of the database, while the database is updated with new trajectories. To address this problem, we propose an incremental and scalable solution to the problem, which is built upon a novel indexing structure, called Representative Trajectory Tree (ReTraTree). ReTraTree acts as an effective spatio-temporal partitioning technique; partitions in ReTraTree correspond to groupings of sub-trajectories, which are incrementally maintained and assigned to representative (sub-)trajectories. Due to the proposed organization of sub-trajectories, the problem under study can be efficiently solved as simply as executing a query operator on ReTraTree, while insertion of new trajectories is supported. Our extensive experimental study performed on real and synthetic datasets shows that our approach outperforms a state-of-the-art in-DBMS solution supported by PostgreSQL by orders of magnitude.

International Conferences

1. Orestis Gkorgkas, Akrivi Vlachou, Christos Doulkeridis, and Kjetil Nørvåg, Efficient Processing of Exploratory Top-k Joins, In Proceedings of 26th International Conference on Scientific and Statistical Database Management (SSDBM'14), Aalborg, Denmark, June 30 - July 2, 2014.

Abstract

In this paper, we address the problem of discovering a ranked set of k distinct main objects combined with additional (accessory) objects that best fit the given preferences. This problem is challenging because it considers object combinations of variable size, where objects are combined only if the combination produces a higher score, and thus becomes more preferable to a user. In this way, users can explore overviews of combinations that are more suited to their preferences than single objects, without the need to explicitly specify which objects should be combined. We model this problem as a rank-join problem where each combination is represented by a set of tuples from different relations and we call the respective query eXploratory Top-k Join query. Existing approaches fall short of tackling this problem because they impose a fixed size of combinations, they do not distinguish combinations based on the main objects or they do not take into account user preferences. We introduce a more efficient bounding scheme that can be used on an adaptation of the rank-join algorithm, which exploits some key properties of our problem and allows earlier termination of query processing. Our experimental evaluation demonstrates the efficiency of the proposed bounding technique.

2. Orestis Gkorgkas, Akrivi Vlachou, Christos Doulkeridis and Kjetil Nørvåg, Finding the Most Diverse Products using Preference Queries, In Proceedings of 18th International Conference on Extending Database Technology (EDBT'15), Brussels, Belgium, March 23-27, 2015.

Abstract

In this paper, given a product database and a set of customer preferences, we address the problem of discovering a bounded set of r diverse products that attract the interests of different customers. This problem finds numerous applications in electronic marketplaces, e.g., for selecting the products that are placed in the home page of an online shop. Existing approaches to tackle this problem fall short because they ignore customer preferences, and instead rely solely on products’ attributes. We model this problem as a diversity problem, where each product is represented by its reverse top-k result set, and seek r products that maximize their diversity value. Since the problem is NP-hard, we employ a greedy algorithm that takes as input the reverse top-k result sets of all candidate products. To further improve performance, we also design a more efficient approximate algorithm that does not require the computation of all reverse top-k sets. Our experimental evaluation demonstrates the performance of the proposed algorithms and quality of the selected diverse products.
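To make the idea of greedy selection over reverse top-k result sets concrete, here is a minimal Python sketch. The abstract does not define the diversity value, so this example assumes Jaccard distance between the customer sets and a max-min selection rule; the seeding choice, the function names, and the sample data are likewise hypothetical and should not be read as the paper's algorithm.

```python
from typing import Dict, List, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |a ∩ b| / |a ∪ b|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def greedy_diverse(rtopk: Dict[str, Set[str]], r: int) -> List[str]:
    """Pick up to r products whose reverse top-k sets are mutually dissimilar.

    rtopk maps each product to the set of customers that have it in their
    top-k result (its reverse top-k set).
    """
    if r <= 0 or not rtopk:
        return []
    products = sorted(rtopk)  # fixed order makes tie-breaking deterministic
    # Assumed seed: the product that appears in the most customers' top-k.
    selected = [max(products, key=lambda p: len(rtopk[p]))]
    while len(selected) < min(r, len(products)):
        remaining = [p for p in products if p not in selected]
        # Add the product farthest (by minimum distance) from those already chosen.
        best = max(
            remaining,
            key=lambda p: min(jaccard_distance(rtopk[p], rtopk[s]) for s in selected),
        )
        selected.append(best)
    return selected


# Hypothetical reverse top-k sets: product -> customers.
rtopk_sets = {
    "p1": {"u1", "u2", "u3"},
    "p2": {"u1", "u2"},
    "p3": {"u4", "u5"},
    "p4": {"u3", "u4"},
}
chosen = greedy_diverse(rtopk_sets, r=3)
```

Here p2 is skipped because its customers largely overlap with those of p1, while p3 and p4 reach customers that p1 does not.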

3. Maria Karanasou, Christos Doulkeridis and Maria Halkidi, DsUniPi: An SVM-based approach for Sentiment Analysis of Figurative Language on Twitter, In Proceedings of 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado, June 4-5, 2015.

Abstract

The DsUniPi team participated in the SemEval 2015 Task#11: Sentiment Analysis of Figurative Language in Twitter. The proposed approach employs syntactical and morphological features, which indicate sentiment polarity in both figurative and non-figurative tweets. These features were combined with others that indicate presence of figurative language in order to predict a fine-grained sentiment score. The method is supervised and makes use of structured knowledge resources, such as SentiWordNet sentiment lexicon for assigning sentiment score to words and WordNet for calculating word similarity. We have experimented with different classification algorithms (Naïve Bayes, Decision trees, and SVM), and the best results were achieved by an SVM classifier with linear kernel.

4. Orestis Gkorgkas, Akrivi Vlachou, Christos Doulkeridis and Kjetil Nørvåg, Maximizing Influence of Spatio-Textual Objects Based on Keyword Selection, In Proceedings of 14th International Symposium on Spatial and Temporal Databases (SSTD'15), Seoul, South Korea, August 26-28, 2015.

Abstract

In modern applications, spatial objects are often annotated with textual descriptions, and users are offered the opportunity to formulate spatio-textual queries. The result set of such a query consists of spatio-textual objects ranked according to their distance from a desired location and to their textual relevance to the query. In this context, a challenging problem is how to select a set of at most b keywords to enhance the description of the facilities of a spatial object, in order to make the object appear in the top-k results of as many users as possible. In this paper, we formulate this problem, called Best-terms and we show that it is NP-hard. Hence, we present a baseline algorithm that provides an approximate solution to the problem. Then, we introduce a novel algorithm for keyword selection that greatly improves the efficiency of query processing. By means of a thorough experimental evaluation, we demonstrate the performance gains attained by our approach.

5. Christos Doulkeridis, Akrivi Vlachou, Panagiotis Nikitopoulos, Panagiotis Tampakis, Mei Saouk, The RoadRunner Framework for Efficient and Scalable Processing of Big Data, In Proceedings of the 19th Panhellenic Conference on Informatics (PCI'15), Athens, Greece, 1-3 October 2015.

Abstract

In this paper, we present the overall architecture of RoadRunner, a Hadoop-based framework that enhances the efficiency of rank-aware query processing by introducing various optimizations to Hadoop, without changing its internal operation. RoadRunner focuses on a specific class of queries that involve ranking, such as top-k queries and top-k joins, as well as on preference-aware queries, such as skyline queries, which are tightly related. For this class of queries, we identify improvements on various stages of MapReduce processing, which result in improved performance without sacrificing scalability. We describe the RoadRunner framework, along with individual modules and their roles, and we demonstrate the merits of the proposed framework by means of showcase query examples.

6. Maria Karanasou, Anneta Ampla, Christos Doulkeridis, Maria Halkidi, Scalable and Real-time Sentiment Analysis on Twitter Data, In Proceedings of 6th ICDM Workshop on Sentiment Elicitation from Natural Text for Information Retrieval and Extraction (SENTIRE'16), Barcelona, Spain, December 12, 2016.

Abstract

In this paper, we present a system for scalable and real-time sentiment analysis of Twitter data. The proposed system relies on feature extraction from tweets, using both morphological features and semantic information. For the sentiment analysis task, we adopt a supervised learning approach, where we train various classifiers based on the extracted features. Finally, we present the design and implementation of a real-time system architecture in Storm, which contains the feature extraction and classification tasks, and scales well with respect to input data size and data arrival rate. By means of an experimental evaluation, we demonstrate the merits of the proposed system, both in terms of classification accuracy as well as scalability and performance.

7. Mei Saouk, Christos Doulkeridis, Akrivi Vlachou, Kjetil Nørvåg, Efficient Processing of Top-k Joins in MapReduce, In Proceedings of IEEE International Conference on Big Data (IEEE BigData'16), Washington D.C., USA, December 5-8, 2016.

Abstract

Top-k join is an essential tool for data analysis, since it enables selective retrieval of the k best combined results that come from multiple different input datasets. In the context of Big Data, processing top-k joins over huge datasets requires a scalable platform, such as the widely popular MapReduce framework. However, such a solution does not necessarily imply efficient processing, due to inherent limitations related to MapReduce. In particular, these include the lack of an early termination mechanism for accessing only a subset of the input data, as well as the lack of an appropriate load balancing mechanism tailored to the top-k join problem. Apart from these issues, a significant research problem is how to determine the subset of the inputs that is guaranteed to produce the correct top-k join result. In this paper, we address these challenges by proposing an algorithm for efficient top-k join processing in MapReduce. Our experimental evaluation clearly demonstrates the efficiency of our approach, which compromises neither its scalability nor any other salient feature of MapReduce processing.

8. Christos Doulkeridis, Akrivi Vlachou, Dimitris Mpestas, Nikos Mamoulis, Parallel and Distributed Processing of Spatial Preference Queries using Keywords, In Proceedings of 20th International Conference on Extending Database Technology (EDBT'17), Venice, Italy, March 21-24, 2017.

Abstract

Advanced queries that combine spatial constraints with textual relevance to retrieve interesting objects have attracted increased attention recently due to the ever-increasing rate of user-generated spatio-textual data. Motivated by this trend, in this paper, we study the novel problem of parallel and distributed processing of spatial preference queries using keywords, where the input data is stored in a distributed way. Given a set of keywords, a set of spatial data objects and a set of spatial feature objects that are additionally annotated with textual descriptions, the spatial preference query using keywords retrieves the top-k spatial data objects ranked according to the textual relevance of feature objects in their vicinity. This query type is processing-intensive, especially for large datasets, since any data object may belong to the result set while the spatial range defines the score, and the k data objects with the highest score need to be retrieved. Our solution has two notable features: (a) we propose a deliberate re-partitioning mechanism of input data to servers, which allows parallelized processing, thus establishing the foundations for a scalable query processing algorithm, and (b) we boost the query processing performance in each partition by introducing an early termination mechanism that delivers the correct result by examining only a few data objects. Capitalizing on this, we implement parallel algorithms that solve the problem in the MapReduce framework. Our experimental study using both real and synthetic data in a cluster of sixteen physical machines demonstrates the efficiency of our solution.
