Debugging CodeQL Queries: Lessons from Gradio Vulnerability Research
Sylwia Budzynska explores advanced debugging techniques for CodeQL queries, based on her experience addressing a complex deserialization vulnerability in a Python Gradio application.
Authored by Sylwia Budzynska, this guide expands on practical debugging strategies for when CodeQL queries don’t behave as expected. The tutorial is built around a real security vulnerability, unsafe deserialization via Python’s pickle.load in a Gradio app, and offers both the investigative process and technical solutions for similar CodeQL challenges.
The Challenge: Unexpected CodeQL Results
When writing CodeQL queries, you may encounter situations where a query that appears logically correct returns no results. This blog details how to debug such cases by using CodeQL’s tooling and a deeper understanding of its data flow mechanisms.
Example Vulnerability (Python + Gradio)
The vulnerability revolves around insecure deserialization in a Gradio-powered application:
- The user uploads a file using gr.File.
- The file name is passed to open() in Python.
- Its contents are loaded via pickle.load, a known dangerous operation when processing untrusted files.
Code Simplification:
import pickle
import gradio as gr

def load_config_from_file(config_file):
    """Load settings from a UUID.pkl file."""
    try:
        # config_file.name is the path of the user-uploaded file
        with open(config_file.name, 'rb') as f:
            # Unsafe: deserializes bytes that the user controls
            settings = pickle.load(f)
        return settings
    except Exception as e:
        return f"Error loading configuration: {str(e)}"

with gr.Blocks(title="Configuration Loader") as demo:
    config_file_input = gr.File(label="Load Config File")
    load_config_button = gr.Button("Load Existing Config From File", variant="primary")
    config_status = gr.Textbox(label="Status")
    load_config_button.click(
        fn=load_config_from_file,
        inputs=[config_file_input],
        outputs=[config_status]
    )

demo.launch()
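To make the risk concrete, the following minimal sketch shows why calling pickle.load on untrusted bytes is dangerous: a pickled object can instruct the loader to call an arbitrary callable. The Payload class is hypothetical and not part of the original application, and the harmless expression passed to eval stands in for whatever an attacker would actually run.

```python
import io
import pickle


class Payload:
    """Hypothetical attacker-crafted object (illustration only)."""

    def __reduce__(self):
        # pickle records "call eval('6 * 7')" instead of the object's state.
        # A real attacker would substitute something far less harmless.
        return (eval, ("6 * 7",))


# Bytes an attacker could place in the uploaded file.
malicious_bytes = pickle.dumps(Payload())

# The victim only calls pickle.load, yet the chosen callable runs.
result = pickle.load(io.BytesIO(malicious_bytes))
print(result)
```

The loaded value is whatever the attacker-chosen call returned, not a Payload instance, which is why the data flow from the uploaded file to pickle.load is worth detecting.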
CodeQL Query Debugging Workflow
1. Minimal Reproducing Example
Reduce the codebase and create a CodeQL database for just this example. This minimizes noise and helps identify if the bug is in the query logic or the target code.
codeql database create codeql-zth5 --language=python
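If you rebuild the database often while shrinking the example, it can help to script the command. The sketch below only assembles the argument list for the CLI; the ./minimal-example directory is a hypothetical location for the reduced code, and the helper names are illustrative.

```python
import shutil
import subprocess


def build_create_command(database_dir, source_root, language="python"):
    """Return the argv for `codeql database create` (nothing is run here)."""
    return [
        "codeql", "database", "create", database_dir,
        f"--language={language}",
        f"--source-root={source_root}",
    ]


def run_if_available(command):
    """Run the command only when the CodeQL CLI is on PATH."""
    if shutil.which(command[0]) is None:
        return None
    return subprocess.run(command, check=True)


# "./minimal-example" is an assumed path holding the reduced application.
command = build_create_command("codeql-zth5", "./minimal-example")
print(" ".join(command))
```

Calling run_if_available(command) would then create the database, assuming the CLI is installed and the source directory exists.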
2. Predicate/Quick Evaluation
Use CodeQL’s quick evaluation feature to test if your predicate logic for sources and sinks is correct. For instance, check if your isSource and isSink predicates hit expected AST nodes.
3. Inspecting the AST
The CodeQL extension for VS Code lets you view the abstract syntax tree, which shows how CodeQL parses elements such as argument nodes and attribute reads. This is valuable for writing predicates that target the right nodes.
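As a rough analogy, Python's own ast module can show which call and attribute nodes exist in the example before you write predicates for them. This is a sketch using the standard library, not CodeQL's AST viewer, and CodeQL's node classes differ from Python's; the dotted_name helper is illustrative.

```python
import ast

SOURCE = '''
import pickle

def load_config_from_file(config_file):
    with open(config_file.name, 'rb') as f:
        settings = pickle.load(f)
    return settings
'''


def dotted_name(node):
    """Return 'a.b' for simple Name/Attribute chains, else None."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = dotted_name(node.value)
        return f"{base}.{node.attr}" if base else None
    return None


tree = ast.parse(SOURCE)

# Every call site: candidates for sinks such as pickle.load.
calls = [
    (node.lineno, dotted_name(node.func))
    for node in ast.walk(tree)
    if isinstance(node, ast.Call)
]

# Every attribute read: candidates for steps such as config_file.name.
attribute_reads = [
    (node.lineno, dotted_name(node))
    for node in ast.walk(tree)
    if isinstance(node, ast.Attribute) and isinstance(node.ctx, ast.Load)
]

print(sorted(calls))
print(sorted(attribute_reads))
```

Seeing that config_file.name is an attribute read nested inside the open() call is the kind of detail that tells you which node a source, sink, or taint step should match.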
4. Partial Path Graphs
Partial path graphs (forward/backward) visualize how far taint/data propagates and where it halts. This is essential for discovering missing taint steps, for example between an object and its attribute, or after a function like open().
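The idea behind partial path graphs can be sketched with a toy flow graph: explore forward from the source and backward from the sink, then compare where each exploration stops. The node labels and edges below are assumptions chosen to mirror the Gradio example; they are not CodeQL output or CodeQL API names.

```python
from collections import deque

# Steps the analysis is assumed to know about already (illustrative only).
DEFAULT_STEPS = {
    "gr.File input": ["config_file parameter"],
    "config_file.name": ["open() argument"],
    "open() result f": ["pickle.load(f) argument"],
}


def reachable(start, steps):
    """Breadth-first list of nodes reachable from start, in visit order."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in steps.get(node, []):
            if nxt not in seen:
                seen.append(nxt)
                queue.append(nxt)
    return seen


def reverse(steps):
    """Flip every edge so the same search can run backward from the sink."""
    flipped = {}
    for src, dsts in steps.items():
        for dst in dsts:
            flipped.setdefault(dst, []).append(src)
    return flipped


forward = reachable("gr.File input", DEFAULT_STEPS)
backward = reachable("pickle.load(f) argument", reverse(DEFAULT_STEPS))

print("forward stops at: ", forward[-1])
print("backward stops at:", backward[-1])
```

In this toy model the forward exploration halts at the config_file parameter and the backward exploration halts at the result of open(), which points at the places where a step is missing.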
5. Custom Taint Steps
Often, default CodeQL flow models won’t propagate taint through attributes like .name or through custom constructs. You can manually add taint steps. For example:
// Propagate taint from an object to any read of its "name" attribute.
predicate isAdditionalFlowStep(DataFlow::Node nodeFrom, DataFlow::Node nodeTo) {
  exists(DataFlow::AttrRead attr |
    attr.accesses(nodeFrom, "name") and nodeTo = attr
  )
}
You can chain multiple taint steps for nuanced tracking, such as:
- .name attribute from gr.File to variable
- File name argument to open() result
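The effect of chaining extra steps can be illustrated with a small path search: with only the default edges no path exists, and adding the two steps listed above connects source to sink. This is a toy model in Python, not CodeQL; every node label and the split between default and additional edges are assumptions made for illustration.

```python
from collections import deque

# Edges assumed to be covered by the default flow models (illustrative).
DEFAULT_STEPS = {
    "gr.File input": ["config_file parameter"],
    "config_file.name": ["open() argument"],
    "open() result f": ["pickle.load(f) argument"],
}

# The two extra steps: object -> .name read, open() argument -> open() result.
ADDITIONAL_STEPS = {
    "config_file parameter": ["config_file.name"],
    "open() argument": ["open() result f"],
}


def merge_steps(*step_maps):
    """Combine several edge maps without mutating any of them."""
    merged = {}
    for step_map in step_maps:
        for src, dsts in step_map.items():
            merged.setdefault(src, []).extend(dsts)
    return merged


def find_path(source, sink, steps):
    """Return the first breadth-first path from source to sink, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == sink:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in steps.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


SOURCE, SINK = "gr.File input", "pickle.load(f) argument"
path_before = find_path(SOURCE, SINK, DEFAULT_STEPS)
path_after = find_path(SOURCE, SINK, merge_steps(DEFAULT_STEPS, ADDITIONAL_STEPS))

print(path_before)
print(" -> ".join(path_after))
```

Both additional steps are needed in this model; with only one of them the search still returns no path.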
6. Refinement and Generalization
The blog also describes how to generalize these taint steps to avoid excessive false positives. For security research, it’s common to use “hacky” propagation steps, but production workflows demand more precise modeling—e.g., only propagating through attributes for specific object types.
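The trade-off between a broad and a type-restricted step can be sketched as two filters over the same attribute reads. The records below are invented for illustration: each pairs an assumed object origin with the attribute being read, and only the first is the case the query is actually after.

```python
# (object origin, attribute read) pairs; all values are illustrative.
attribute_reads = [
    ("gr.File", "name"),
    ("logging.Logger", "name"),
    ("threading.Thread", "name"),
]


def broad_step(read):
    """Hacky research step: any read of .name propagates taint."""
    _origin, attribute = read
    return attribute == "name"


def narrow_step(read):
    """Refined step: only .name reads on objects that come from gr.File."""
    origin, attribute = read
    return attribute == "name" and origin == "gr.File"


broad_matches = [read for read in attribute_reads if broad_step(read)]
narrow_matches = [read for read in attribute_reads if narrow_step(read)]

print(len(broad_matches), "reads matched by the broad step")
print(len(narrow_matches), "read matched by the narrow step")
```

The broad filter is quicker to write and useful while exploring, while the narrow one avoids flagging unrelated objects that happen to expose a name attribute.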
7. Upstreaming and Maintenance
Finally, if your taint step models are broadly reusable and accurate (as in the shown nameAttrRead for Gradio files), consider contributing them upstream to help the broader CodeQL community.
Conclusion
Debugging CodeQL queries can be complex, especially when modeling edge cases like deserialization vulnerabilities or complex framework behaviors. By employing careful query structuring, quick evaluation, AST inspection, partial path tracing, and targeted taint steps, you can craft effective, reliable queries.
For further assistance, the GitHub Security Lab’s public Slack and CodeQL discussions forum are recommended support channels.
References:
- CodeQL zero to hero part 5: Debugging queries
- GitHub Security Lab Slack
- CodeQL Community Packs
- Full code and queries on GitHub
This post appeared first on “The GitHub Blog”.