Databao docs

Get started

Databao is a powerful data agent that helps you query, clean, and visualize your enterprise data securely within your environment.

In this tutorial, you’ll learn how to install the Databao Python SDK, configure sample data sources, and get reliable answers to your questions about a sample dataset. You'll also explore how to work with agents and threads, retrieve results in different formats, and try some of advanced Databao’s features.

Prefer digging into code right away? Follow this tutorial in Colab

Before you begin

This tutorial has been tested in Google Colab. You can reproduce the steps in another Jupyter notebook or Python environment, but it may require adjusting some steps.

  1. If you use an environment other than Colab, make sure the Python version is 3.10, 3.11, 3.12, or 3.13.
  2. If you want to use a cloud OpenAI LLM instead of a local model running in Ollama, get an API key.
  3. If you follow this tutorial on your local machine and want to use a local Ollama model, install Ollama and download a model. We recommend using Qwen3-8B.

Install Databao and other packages

  1. Open a new Colab notebook

    If you don’t want to copy and paste code into a new notebook, you can follow this tutorial in Colab

  2. Install Databao and other packages:

    Notebook cell
    !pip install -q databao duckdb matplotlib pandas seaborn
  3. Import the packages and download the sample dataset

    Notebook cell
    # Import packages
    import duckdb
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from pathlib import Path
    
    # Import Databao
    import databao
    from databao import LLMConfig
    from databao.configs import LLMConfigDirectory
    
    # Download the sample DuckDB database
    !gdown https://drive.google.com/uc?id=16aN4_vFeIASKheAtlo8js3BS1NhrAdDU
    !gdown https://drive.google.com/uc?id=1mUus17rWWYGK3ikbBFcx7PJBdliv9poE
    
    # Connect to the local sample database
    DB_PATH = "web_shop.duckdb"
    conn = duckdb.connect(DB_PATH, read_only=False)
    print(f"Connected to DuckDB database: {DB_PATH}")
    Cell output
    Downloading...
    From: https://drive.google.com/uc?id=16aN4_vFeIASKheAtlo8js3BS1NhrAdDU
    To: /content/web_shop.duckdb
    100% 8.66M/8.66M [00:00<00:00, 291MB/s]
    Downloading...
    From: https://drive.google.com/uc?id=1mUus17rWWYGK3ikbBFcx7PJBdliv9poE
    To: /content/duckdb_schema_overview.md
    100% 3.84k/3.84k [00:00<00:00, 20.5MB/s]
    Connected to DuckDB database: web_shop.duckdb

Configure an LLM

Databao is designed to work with both local and cloud LLMs. You can choose whether you want to use a local Ollama model or an OpenAI cloud model.

The local model runs isolated in the Colab compute. If you choose this option, downloading and setting up the model will take a few minutes.

The responses depend on the available compute resource and may be slower compared the cloud model.

  1. To download and setup ollama with the Qwen3-8B model, add the following code cell:

    Notebook cell
    !gdown https://drive.google.com/uc?id=1W_HuqVsj25lAI88zjE7jxWkj6A59-GFL
    !unzip models_qwen3_8b.zip -d ollama_models
    !curl https://ollama.ai/install.sh | sh
    
    import os
    import subprocess
    
    OLLAMA_HOST = "127.0.0.1:1234"
    os.environ["OLLAMA_HOST"] = OLLAMA_HOST
    os.environ["OLLAMA_CONTEXT_LENGTH"] = "8192"
    os.environ["OLLAMA_MODELS"] = "/content/ollama_models/models"
    subprocess.Popen(["ollama", "serve"])
  2. Add the LLM config:

    Notebook cell
    llm_config = LLMConfigDirectory.QWEN3_8B_OLLAMA

The cloud model requires an API key and will use tokens from your account.

  1. Add the LLM config:

    %env OPENAI_API_KEY=your_api_key
    
    llm_config = LLMConfig(name="gpt-4.1-2025-04-14", temperature=0)

Initialize an agent and add data sources

An agent in Databao acts as the main interface for database connections and context. It can handle multiple threads or conversations, each operating independently on the same data sources.

Notebook cell
agent = databao.new_agent(llm_config=llm_config)

# Register a DuckDB data source and context about the source
agent.add_db(conn, context= Path('/content/duckdb_schema_overview.md'))

# Register an example dataframe
cal = pd.DataFrame({
    "month": pd.date_range("2017-01-01", periods=6, freq="MS").strftime("%Y-%m"),
    "is_qtr_end": [False, False, True, False, False, True]
})

agent.add_df(cal, name="calendar_helper", context="Small helper calendar with monthly rows and a qtr-end flag.")

# Add project-wide context
agent.add_context(
    """
    Project‑wide notes:
    - Monetary values are in EUR unless stated otherwise.
    - Orders with status 'canceled' should be excluded from KPIs unless explicitly requested.
    """
)

print("Registered DBs:", list(agent.dbs.keys()))
print("Registered DFs:", list(agent.dfs.keys()))
Output
Registered DBs: ['db1']
Registered DFs: ['calendar_helper']

Start a thread

Threads act as individual conversations within an agent. While multiple threads share the data sources and context managed by an agent, each thread has its own message history allowing you to ask follow-up questions and build on previous answers within that thread.

By default, threads execute immediately (eager mode) and materialize the result. You will learn how to turn it off and use lazy mode in later steps.

  1. To create a new thread and ask a question that will execute immediately, use the following code:

    Notebook cell
    thread = agent.thread()
    
    thread.ask(
        """
        Return a compact KPI table with:
        - orders_count
        - total_revenue
        - average_order_value
        """
    )
    Output (shortened)
    ======== <THINKING> ========
    
    To answer your request, I will:
    
    1. Calculate the total number of orders (orders_count), excluding orders with status 'canceled'.
    
    ...
    
    [tool_call_output: 'submit_query_id']
    '''
    Query 6-0 submitted successfully. Your response is now visible to the user.
    '''
    
    ======== </THINKING> ========
    
       orders_count  total_revenue  average_order_value
    0           542       664673.8          1226.335424
  2. You can access the result as a dataframe. Because this request has already been executed, the results are cached in the thread.

    Notebook cell
    df_kpis = thread.df()
    df_kpis
    Output
       orders_count  total_revenue  average_order_value
    0           542       664673.8          1226.335424
  3. To review the code Databao generated to produce results, use the .code() method:

    Notebook cell
    print(thread.code())
    Output
    SELECT
    COUNT(DISTINCT o.order_id) AS orders_count,
    SUM(p.payment_value) AS total_revenue,
    SUM(p.payment_value) / COUNT(DISTINCT o.order_id) AS average_order_value
    FROM db1.main.dim_orders o
    JOIN db1.main.fct_order_payments p ON o.order_id = p.order_id
    WHERE o.order_status != 'canceled'

Ask follow-up questions

Threads have memory, so new requests can reference previous answers in the same thread.

  1. In the same thread, ask a new question about the results calculated previously:

    Notebook cell
    thread.ask("Add average delivery days and average review score to the KPI table. Keep it tidy.")
    Output (shortened)
    ======== <THINKING> ========
    
    To add the requested KPIs, I will:
    
    ...
    
    [tool_call_output: 'submit_query_id']
    '''
    Query 15-0 submitted successfully. Your response is now visible to the user.
    '''
    
    ======== </THINKING> ========
    
      orders_count	total_revenue	average_order_value	average_delivery_days	average_review_score
    0          542	   2535254.74         4677.591771              9.983549             3.382784
  2. You can access the results in a dataframe:

    Notebook cell
    df_kpis_extended = thread.df()
    df_kpis_extended
    Output
      orders_count	total_revenue	average_order_value	average_delivery_days	average_review_score
    0          542	   2535254.74         4677.591771              9.983549             3.382784

Try chaining requests in lazy mode

If you know that your flow requires several steps, you can chain several .ask() calls in lazy mode and then materialize the final result using .df(), .plot(), .text(), or .code().

  1. Initialize a new thread and ask several chained question. Because the thread is created in lazy mode, Databao won’t compute it yet.

    Notebook cell
    lazy_thread = agent.thread(lazy=True)
    (
        lazy_thread
            .ask("Compute monthly revenue from june till september of 2025.")
            .ask("Include a 2-month moving average.")
            .ask("Name the columns month, revenue, revenue_ma2.")
    )
    Output
    Unmaterialized Pipe.
  2. Calculate the response

    Notebook cell
    df_trend = lazy_thread.df()
    Output (shortened)
    ======== <THINKING> ========
    
    To answer this, I will:
    
    1. Define "revenue" as the sum of item_price from sales, excluding orders with status 'canceled'.
    
    ...
    
    [df: name=df, showing 4 / 4 rows]
    | month   |   revenue |   revenue_ma2 |
    |:--------|----------:|--------------:|
    | 2025-06 |    166537 |        166537 |
    | 2025-07 |    179467 |        173002 |
    | 2025-08 |    164677 |        172072 |
    | 2025-09 |    162770 |        163723 |
    
    
    
    [tool_call: 'submit_query_id']
    '''
    {"query_id":"20-0","result_description":"Columns are named as requested: 'month' (YYYY-MM), 'revenue' (monthly revenue in EUR), and 'revenue_ma2' (2-month moving average of revenue). The moving average is the average of the current and previous month's revenue. For June, it is just June's revenue, as there is no previous month in the selected range.","visualization_prompt":"Line chart: x-axis month, y-axis revenue and revenue_ma2, two lines, title 'Monthly Revenue and 2-Month Moving Average (June-Sep 2025)'."}
    '''
    
    
    [tool_call_output: 'submit_query_id']
    '''
    Query 20-0 submitted successfully. Your response is now visible to the user.
    '''
    
    
    ======== </THINKING> ========
  3. To visualize calculation results, use the .plot() method.

    Notebook cell
    lazy_thread.plot("Line chart of revenue and revenue_ma2 by month")

    Databao uses Vega-Lite for charts, and you can specify any supported chart type in your prompt

    Line chart of revenue and revenue_ma2 by month generated by the .plot() method in Databao

Ask for explanations

In addition to dataframes and visualizations, Databao can provide text answers about your data which is convenient for summaries.

lazy_thread.ask("Write 3 bullet insights about the revenue trend you just computed. Keep it concise.", stream=False)
print(lazy_thread.text())
- Revenue peaked in July 2025 at €179,467, showing a strong mid-summer sales boost.
- After July, revenue declined steadily through August and September, dropping to €162,770 by September.
- The 2-month moving average smooths out fluctuations, highlighting a gradual downward trend after the July peak.

Visualize dataframes manually

The dataframes returned by .df() are standard pandas dataframes, so you can use them wherever you see fit. For example, you can visualize the result dataframe using seaborn.

if {"month", "revenue"}.issubset(df_trend.columns):
    plt.figure(figsize=(7, 3.5))
    sns.lineplot(data=df_trend, x="month", y="revenue", marker="o")
    plt.xticks(rotation=45, ha="right")
    plt.title("Monthly revenue (from Databao df)")
    plt.tight_layout()
    plt.show()
else:
    print("df_trend columns:", df_trend.columns.tolist(), "— adjust the plotting code to your column names.")

A line chart generated by seaborn from a pandas dataframe

(Optional). Try advanced features

Disable streaming of the model’s thinking process

If you prefer to get shorter responses from Databao, you can turn off the streaming of the LLM's thinking process. This will make the output cleaner and more readable, but it may sometimes seem like the model is taking longer to respond.

There are three ways to disable streaming:

  • To disable streaming for a new thread, specify stream_ask=False when creating one:

    Notebook cell
    no_stream = agent.thread(stream_ask=False)
    no_stream.ask("Compute the top 5 products by revenue in june 2025.")
    df_top5 = no_stream.df()
    df_top5
    Output
                             product_id  product_category_name_english  total_revenue
    0  437e682e8b761b3191da4fca20307e94              Furniture & Decor        8278.09
    1  60690a285907cb1676b2f5f5e544d5af            Musical Instruments        7418.62
    2  6fee3c3366b596ecfae722087b1a58c7                      Computers        5567.23
    3  c658bc04de920254d81ccd80b33078f4                      Computers        5081.73
    4  8efe1caed4e9e8bf475630c46b8108d2              Furniture & Decor        5060.11
  • To disable streaming for an .ask() call in an existing thread, add stream=False in the call:

    Notebook cell
    thread.ask("Now compute total orders by category.", stream=False)
    orders_by_country = thread.df()
    orders_by_country.head()
    Output
              category  total_orders
    0       Cool Stuff           103
    1             Toys            95
    2             Pets            86
    3  Office Supplies            85
    4  Fashion & Shoes            84
  • To make threads lazy by default in a new agent, create the agent as follows:

    Notebook cell
    lazy_agent = databao.new_agent(
        llm_config=llm_config,
        default_lazy_threads=True,
    )
    lazy_agent.thread().ask("This won’t run until you call .df() because lazy is now the default.")

Enable interactive charts

  1. To make charts interactive, add .interactive() when calling .plot().

    This is an experimental feature in Databao, so things might not work perfectly yet.

    Notebook cell
    thread = agent.thread(stream_ask=False)
    thread.ask(
        """
        Return a KPI table with:
        - orders_count
        - total_revenue
        - average_order_value
        per category for july 2025
        """
    )
    thread.plot().interactive()

    Interactive chart in Databao

Adjust row limits

  1. To limit the number of rows in the dataframe output, use row_limit when calling .ask():

    Notebook cell
    small_sample = agent.thread(lazy=False)
    small_sample.ask("Return all raw orders rows", rows_limit=3).df()
    Output
       order_id  customer_id  order_status  order_purchase_timestamp    order_approved_at order_delivered_carrier_date  order_delivered_customer_date   order_estimated_delivery_date
    0  ord_0001    cust_0001       shipped       2025-08-31 11:26:00  2025-08-31 12:47:00          2025-09-04 02:26:00                            NaT             2025-09-12 11:26:00
    1  ord_0002    cust_0002      invoiced       2025-09-08 19:19:00  2025-09-08 21:54:00                          NaT                            NaT             2025-09-14 19:19:00
    2  ord_0003    cust_0003      invoiced       2025-06-10 04:42:00  2025-06-10 05:59:00                          NaT                            NaT             2025-06-29 04:42:00

What’s next

Now that you’ve tried Databao using a sample project, connect it to your data and see how it can help you with your daily tasks.

On this page