Designing a Join Sampling Framework for Optimizing Approximate Queries in Open-source Database Systems

An image of the web interface to pgAQP, an approximate query processing extension to the open-source database system, PostgreSQL.

This project aims to design a join sampling framework that enables very fast approximate queries in open-source database systems. 

Project is Not Currently Available

This project has reached full capacity for the current term. Please check back next semester for updates.

Project description

Join sampling is a technique for drawing random samples from the result of a complex database join query without computing the join in full. It can be used to quickly approximate aggregates over the join results. Existing algorithms are often implemented outside the DBMS kernel and have rigid designs that do not consider the trade-off between sampling and cost relative to full join computation. In this project, we aim to design a new join sampling framework integrated into open-source database systems, so that the query optimizer can evaluate the cost/accuracy trade-offs of different algorithms and potentially use hybrid algorithms that combine full join computation with join sampling. Students will be introduced to our existing systems based on PostgreSQL, a widely used open-source database system, and to background on random sampling in database systems. After that, students are expected to design and implement a join sampling framework in the iterator model that can express existing join sampling algorithms, and to explore query optimization strategies within the framework.
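
To make the idea of sampling a join without materializing it more concrete, here is a minimal sketch in Python. It is not the pgAQP implementation and is not written against PostgreSQL internals. It illustrates one classic weighted-sampling approach for a two-table equi-join: pick a row of R with probability proportional to its number of join partners in S, then pick one of those partners uniformly, so that every join result is equally likely. The class name JoinSampleIterator, its open()/next() methods (loosely mirroring an iterator-model operator), and the helper estimate_sum are hypothetical names chosen for illustration, and rows are assumed to be plain dictionaries.

```python
import bisect
import random
from itertools import accumulate


class JoinSampleIterator:
    """Iterator-style operator: each next() returns one uniform sample of R JOIN S."""

    def __init__(self, r_rows, s_rows, r_key, s_key, seed=None):
        self.r_rows, self.s_rows = r_rows, s_rows
        self.r_key, self.s_key = r_key, s_key
        self.rng = random.Random(seed)
        self.join_size = 0

    def open(self):
        # Index S on the join key so the partners of an R row are found without a scan.
        self.s_index = {}
        for s in self.s_rows:
            self.s_index.setdefault(s[self.s_key], []).append(s)
        # Weight each R row by its number of join partners; prefix sums let us
        # pick a row with probability proportional to that weight.
        weights = [len(self.s_index.get(r[self.r_key], ())) for r in self.r_rows]
        self.cumulative = list(accumulate(weights))
        self.join_size = self.cumulative[-1] if self.cumulative else 0

    def next(self):
        if self.join_size == 0:
            return None  # empty join: nothing to sample
        # A position in [0, join_size) selects an R row in proportion to its weight.
        pos = self.rng.randrange(self.join_size)
        r = self.r_rows[bisect.bisect_right(self.cumulative, pos)]
        # Choosing a partner uniformly makes every (r, s) pair equally likely.
        s = self.rng.choice(self.s_index[r[self.r_key]])
        return r, s


def estimate_sum(sampler, value, n):
    """Estimate SUM(value(r, s)) over the join as |join| times the sample mean."""
    sampler.open()
    if sampler.join_size == 0:
        return 0.0
    total = 0.0
    for _ in range(n):
        r, s = sampler.next()
        total += value(r, s)
    return sampler.join_size * total / n


if __name__ == "__main__":
    # Toy tables (made-up data) joined on column "k".
    R = [{"k": i % 5, "a": i} for i in range(20)]
    S = [{"k": j % 3, "b": j} for j in range(30)]
    sampler = JoinSampleIterator(R, S, "k", "k", seed=42)
    approx = estimate_sum(sampler, lambda r, s: s["b"], n=5000)
    exact = sum(s["b"] for r in R for s in S if r["k"] == s["k"])
    print(f"approx={approx:.1f} exact={exact}")
```

The sketch still scans both inputs once to build the index and the weights, but it never enumerates the join results themselves. Increasing n trades more sampling work for a tighter estimate, which is the kind of cost/accuracy trade-off the project description refers to.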

Project outcome

The specific outcomes of this project will be identified by the faculty mentor at the beginning of your collaboration. 

  • Learn database query processing frameworks.
  • Learn random sampling algorithms for database join queries.
  • Get hands-on practice hacking on database system internals.
  • Practice system experiment design and execution.
  • The final project outcome is expected to be a peer-reviewed conference paper or poster. 

Project details

Timing, eligibility and other details
Length of commitment: Year-long
Start time: Spring
In-person, remote, or hybrid? Hybrid
Level of collaboration: Small group project (2-3 students)
Benefits: Stipend; Potential Academic Credit
Who is eligible: Sophomores, juniors, and seniors who have taken CSE 220, CSE 250, and CSE 331. Students should be proficient in the C programming language and comfortable with data structures and algorithms.

Project mentor

Zhuoyue Zhao

Assistant Professor

Computer Science and Engineering

Phone: (716) 645-4735

Email: zzhao35@buffalo.edu

Start the project

  1. Email the project mentor using the contact information above to express your interest and get approval to work on the project. (Here are helpful tips on how to contact a project mentor.)
  2. After you receive approval from the mentor to start this project, click the button to start the digital badge. (Learn more about ELN's digital badge options.) 

Preparation activities

Once you begin the digital badge series, you will have access to all the necessary activities and instructions. Your mentor has indicated they would like you to also complete the specific preparation activities below. After you’re approved to begin the project, your mentor will send the relevant materials. Please reference this when you get to Step 2 of the Preparation Phase. 

Students should be able to complete onboarding training for developing code in the public code base: https://github.com/UB-ADBLAB/aqp_demo_public/.

By the time they start, they should have set up the development environment locally or on our research server, and be able to build and set up the PostgreSQL server with the pgAQP extension, execute an approximate single-table query, and use GDB to debug the code. Please contact the project mentor for access to the servers if needed.

Keywords

computer science, engineering, C programming, database management