Relational database management systems (RDBMS) have become the standard for efficient data storage and analytics, yet the algorithms driving modern applications demand increasingly sophisticated query capabilities that go beyond the limits of traditional relational algebra. In this line of research, we aim to tackle expressiveness and performance problems that arise at the intersection of database systems and programming languages.
Developers face a multi-dimensional impedance mismatch: balancing the algebraic efficiency of SQL against the intuitive semantics of general-purpose languages often means sacrificing performance or correctness. Many systems support user-defined functions (UDFs) as a way to extend database systems with novel functionality, but this added expressiveness comes at a performance cost. Specialized systems, such as Datalog engines that target recursive queries, can express wider classes of problems but forgo the billions of dollars of engineering and research invested in commercial relational databases.
Recursive Query Optimization
Performance-critical industrial applications, including large-scale program, network, and distributed-system analyses, rely on fixed-point computations. The introduction of recursive common table expressions (CTEs) via the WITH RECURSIVE keyword in SQL:1999 extended SQL to handle fixed-point computations, yet techniques and optimizations for recursive queries still lag behind those for traditional, non-recursive queries, and recursive SQL remains difficult to master.
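To make the construct concrete, here is a minimal, runnable sketch of a recursive CTE computing the transitive closure of a toy graph. The edge table, its columns, and the use of SQLite are our own illustrative choices, not drawn from the text above.

```python
import sqlite3

# Toy graph: a chain 1 -> 2 -> 3 -> 4. Table and column names
# (edge, src, dst) are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edge(src INTEGER, dst INTEGER);
    INSERT INTO edge VALUES (1, 2), (2, 3), (3, 4);
""")

# WITH RECURSIVE unions a non-recursive base case with a recursive
# step that re-joins the working table against edge until no new
# rows appear, i.e., until a fixed point is reached.
rows = conn.execute("""
    WITH RECURSIVE reach(src, dst) AS (
        SELECT src, dst FROM edge                 -- base case
        UNION
        SELECT r.src, e.dst                       -- recursive step
        FROM reach r JOIN edge e ON r.dst = e.src
    )
    SELECT * FROM reach ORDER BY src, dst;
""").fetchall()
print(rows)  # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```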
Most relational database query engines are not optimized for the specific needs of recursive queries: during the execution of a recursive query, the relations produced by the current iteration are fed as input into the next iteration of query evaluation. As a result, the input data can change unpredictably between iterations, and the number of iterations required to evaluate the query is unknown ahead of time. Existing query optimization techniques assume that input data does not change during execution, and so cannot guarantee high-quality join orders beyond the initial iteration.
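The iteration structure this describes can be sketched in a few lines. The toy semi-naive evaluator below is our own illustration (the relation and variable names are hypothetical) of why an optimizer's initial statistics go stale: each iteration joins a delta relation whose size and content did not exist when the plan was chosen.

```python
# Semi-naive fixed-point evaluation over sets of tuples: only the
# facts discovered in the previous iteration are re-joined against
# the base relation.
def transitive_closure(edges):
    total = set(edges)   # all facts derived so far
    delta = set(edges)   # facts new in the last iteration
    iteration = 0
    while delta:         # number of iterations is unknown up front
        # Join the new facts (a, b) against base edges (b, d).
        new_facts = {(a, d) for (a, b) in delta
                            for (c, d) in edges if b == c}
        delta = new_facts - total
        total |= delta
        iteration += 1
        print(f"iteration {iteration}: {len(delta)} new tuples")
    return total

transitive_closure({(1, 2), (2, 3), (3, 4)})
# iteration 1: 2 new tuples
# iteration 2: 1 new tuples
# iteration 3: 0 new tuples
```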
Our goal is to introduce innovative techniques that shift recursive query optimization and code generation from compile time to runtime using principled metaprogramming, enabling dynamic optimization and re-optimization both before and after query execution has begun.
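As a schematic of what runtime re-optimization could look like (a toy sketch of the general idea, not the system described here), the evaluator below re-picks a join strategy before every iteration based on the cardinalities it has just observed, rather than on compile-time estimates:

```python
# Transitive closure with per-iteration re-planning. The "cost
# model" is deliberately trivial: iterate over the smaller relation
# and index the larger one. A real system would also reconsider
# access paths and regenerate code.
def tc_with_replanning(edges):
    by_src = {}                       # hash index on edge.src
    for s, d in edges:
        by_src.setdefault(s, set()).add(d)

    total, delta = set(edges), set(edges)
    while delta:
        # Re-plan each round using observed cardinalities.
        if len(delta) <= len(edges):
            # Small delta: scan delta, probe the edge index.
            new = {(a, d) for (a, b) in delta
                          for d in by_src.get(b, ())}
        else:
            # Large delta: index delta instead and scan the edges.
            by_mid = {}
            for a, b in delta:
                by_mid.setdefault(b, set()).add(a)
            new = {(a, d) for (s, d) in edges
                          for a in by_mid.get(s, ())}
        delta = new - total
        total |= delta
    return total

print(sorted(tc_with_replanning({(1, 2), (2, 3), (3, 4)})))
```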
Adaptive Recursive Query Optimization
2024. IEEE 40th International Conference on Data Engineering (ICDE), Utrecht, Netherlands, May 13–17, 2024, pp. 368–381. DOI: 10.1109/ICDE60146.2024.00035.

Heterogeneous Pipeline Optimization
The ever-growing demand for sophisticated applications creates an unprecedented need for complex, domain-specific, and heterogeneous building blocks that surpass the capabilities of traditional data processing engines. Fortunately, data practitioners have a plethora of ready-made building blocks at their disposal, ranging from new statistical libraries to specialized utilities. Multi-system pipelines, however, suffer from overheads that stem from integrating these utilities rather than from inefficiencies in their internal core logic: gluing together multiple small, off-the-shelf utilities introduces costly materializations that lie beyond the responsibility of any single utility. Similarly, exploiting the available hardware requires orchestrating the utilities for data and task parallelism, often beyond the intra-utility parallelization each one implements. As a result, existing approaches force a tradeoff between expressiveness, i.e., using off-the-shelf coarse-grained components to express the required operations, and performance, i.e., having an efficient runtime.
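As a small illustration of the materialization overhead described above, the sketch below glues two ordinary Unix utilities together; the file names and the choice of sort and uniq are hypothetical stand-ins for the coarse-grained building blocks the text refers to.

```python
import pathlib
import subprocess

# Create a toy input so the sketch is self-contained.
pathlib.Path("words.txt").write_text("pipe\nglue\npipe\n")

# Glued naively, each step materializes its full output to disk,
# a cost that lies outside the responsibility of either utility:
subprocess.run("sort words.txt > sorted.txt", shell=True, check=True)
subprocess.run("uniq -c sorted.txt > counts.txt", shell=True, check=True)

# Streaming the same pipeline keeps the intermediate result in a
# pipe buffer, removing the materialization entirely:
subprocess.run("sort words.txt | uniq -c > counts.txt", shell=True, check=True)
```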
Our goal is an ecosystem-centric architecture that allows engines to extend their functionality without sacrificing performance by inspecting and optimizing data pipelines. We optimize the interoperability and orchestration of data pipelines by i) using both compile-time and runtime information to detect and remove unnecessary serialization operations, and ii) using orchestration contracts to parallelize and batch data pipeline operations.
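The sketch below shows one way an orchestration contract, as in point ii) above, could look. The Contract fields, the batching scheduler, and the stand-in utility are all our own illustrative assumptions, not the system's actual interface.

```python
from concurrent.futures import ProcessPoolExecutor
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical contract: each utility declares properties (here,
# just record-wise purity) that license the orchestrator to batch
# its input and run batches in parallel.
@dataclass
class Contract:
    fn: Callable[[List[str]], List[str]]
    record_wise: bool   # True: each record is processed independently

def run_utility(contract, records, batch_size=2):
    if not contract.record_wise:
        return contract.fn(records)   # must see the whole input at once
    # The contract licenses batching and data parallelism.
    batches = [records[i:i + batch_size]
               for i in range(0, len(records), batch_size)]
    with ProcessPoolExecutor() as pool:
        out = []
        for part in pool.map(contract.fn, batches):
            out.extend(part)
        return out

def upper(batch):                     # a stand-in "utility"
    return [r.upper() for r in batch]

if __name__ == "__main__":
    print(run_utility(Contract(upper, record_wise=True), ["a", "b", "c"]))
    # ['A', 'B', 'C']
```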