
Who Is the Target User? — LLMs for Text-to-SQL Discussion Series
For those who've read my earlier post about my first impressions of LLMs for Text-to-SQL (link to the original post), I wanted to share some updated thoughts after spending a few months working on a related project. This post builds on the questions I raised earlier and explores them in more depth.
Disclaimer: The ideas shared here are my personal reflections and do not represent the specific design or solutions I'm working on at my company. The discussion assumes we aim to create a tool that end users can truly adopt for daily use with a small team working on it.
Who Is the Target User?
One of the questions I raised in my last post was about identifying the target users for a Text-to-SQL tool. After months of work, this has proven to be the most critical consideration for product design.
If the application is built for data professionals
Having worked in this field for a few years, across roles involving feature stores, data ingestion, and more, I have a clear view of the pain points of day-to-day data work. Based on this experience, I see significant potential in a phased approach to building a Text2SQL application for data professionals.
Phase 1: Laying the Foundation
The success of any Text2SQL tool depends heavily on the availability of a robust data catalog: properly documented table descriptions and column metadata. Ideally, a unified "business metric definition dictionary" (e.g., a clear definition of metrics like GMV) also exists.
- In some companies I've worked with, such tools exist. However, in many organisations, this foundational layer is missing, which makes it harder to deliver a sustainable Text2SQL product.
- Without a reliable data catalog, your GenAI team might end up spending most of their time building one internally, only to realise it's not shared across the company or integrated into other workflows.
If you have this foundation, you can focus on addressing the most time-consuming parts of data work first, such as:
- Finding the correct table(s) for a business request.
- Identifying the right filters for specific use cases.
- Resolving conflicts when multiple data sources contain the same information.
- Determining the correct contact point for clarification on business logic.
These tasks often take days or even weeks to solve. A well-integrated Text2SQL tool could drastically reduce this effort.
Suggested Features:
The lowest-hanging fruit: integrate the tool into the existing data catalog page as an "input cell" that allows users to:
- Describe their needs in natural language.
- Use embedding similarity to fetch relevant table schemas and column definitions.
- Shortlist potential tables and columns based on the input query.
- Provide reference queries or suggest starting points.
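The retrieval step behind this "input cell" can be sketched in a few lines. This is a toy version: it substitutes bag-of-words cosine similarity for a real embedding model, and the catalog table names and descriptions are invented for illustration.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words vector. A real system would call an
    # embedding model and store vectors in a vector index instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical catalog entries: table name -> description from the data catalog.
CATALOG = {
    "orders_daily": "daily order transactions with gmv revenue and order counts",
    "user_profiles": "user demographic attributes and signup dates",
    "ad_spend": "marketing ad spend by channel and campaign",
}

def shortlist_tables(request: str, top_k: int = 2) -> list[str]:
    """Rank catalog tables by similarity to a natural-language request."""
    q = embed(request)
    scored = sorted(CATALOG, key=lambda t: cosine(q, embed(CATALOG[t])), reverse=True)
    return scored[:top_k]

print(shortlist_tables("what was last week's GMV and revenue"))
```

The point of the sketch is that the hard part is not the ranking function; it is having catalog descriptions good enough to rank against, which is exactly what Phase 1 provides.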
Even this minimal functionality can save significant time for DAs, DSs, and DEs. It might even reduce the need to hire additional people into these roles.
Phase 2 and Beyond: Incremental Enhancements
With a solid foundation in place, you can gradually expand the tool's capabilities:
This can take one (or more) of the following forms:
- In-house SQL generation: use LLMs to generate SQL queries from the detailed column descriptions and user inputs.
- Code assistants: integrate existing tools such as GCP BigQuery's Text2SQL feature or general-purpose assistants (e.g., Gemini Code Assist, Cursor) to further streamline query creation.
- SQL improvement agent: explore using LLMs for SQL query optimization to enhance performance.
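For the in-house SQL generation route, most of the engineering is in assembling catalog metadata into the prompt rather than in the model call itself. A minimal sketch, where the schema text, metric definition, and the commented-out `call_llm` function are all hypothetical placeholders:

```python
# Hypothetical catalog metadata; in practice this comes from the
# data catalog built in Phase 1.
TABLE_SCHEMA = """\
Table orders_daily:
  order_date DATE   -- partition column
  gmv        FLOAT  -- see metric dictionary
  country    STRING -- ISO country code
"""

METRIC_DEFS = "GMV: gross merchandise value, pre-refund, pre-discount."

def build_sql_prompt(user_request: str) -> str:
    """Combine schema, metric definitions, and the user request into one prompt."""
    return (
        "You are a SQL assistant. Use only the tables below.\n\n"
        f"Schema:\n{TABLE_SCHEMA}\n"
        f"Metric definitions:\n{METRIC_DEFS}\n\n"
        f"Request: {user_request}\n"
        "Return a single SQL query."
    )

prompt = build_sql_prompt("GMV by country for the last 7 days")
# sql = call_llm(prompt)  # hypothetical: send to whichever LLM provider you use
print(prompt)
```

Note how the quality of the output is bounded by the quality of `TABLE_SCHEMA` and `METRIC_DEFS`: without the Phase 1 foundation, there is nothing reliable to put in the prompt.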
If your goal is to reduce the workload or lower the skill barrier for DAs, DSs, or DEs, it's not necessary to aim for a fully operational Text2SQL system right away. A step-by-step rollout — starting with foundational features — can already deliver significant value while mitigating risks.
Writing SQL is not inherently the hardest or most time-consuming part of data work; finding the correct inputs is.
All of the features described here are straightforward to implement. By focusing on foundational improvements first, you ensure the product adds value incrementally, building user trust and adoption along the way.
The only drawback: it isn't a flashy product to show off inside a company.
If the application is built for non-data professionals
When designing a product for non-data professionals, things get a lot trickier — and way more unpredictable. Some companies might want to go all-in on replacing DA/DE/DS roles entirely (maybe even this year!), but let's be honest — pulling that off right now is a huge challenge.
From my own experience, even helping a newbie data scientist use GPT tools effectively isn't easy. Sure, they can get some next-level results, but writing the kind of prompts that a senior DS would? That's a whole different game. It's like when I tried to learn full-stack coding — I spent a month on React and Node.js, and while I could use code assistants to get things done, my lack of deeper knowledge really limited how much I could actually do. The same goes for LLMs: your own knowledge limits how much the tool can help you. Knowing how to ask the right questions is already a skill, and it's not as simple as it seems.
So, honestly, I doubt that an in-house LLM app (not talking about building foundation models here) can fully replace a DA for a business user right now. It's just not that straightforward.
What Would Replacing a DA Actually Look Like?
If you really want to build something that can replace a DA for business folks, here's what it would need to do:
- Write SQL queries that pull exactly the right data for the user's needs.
- Create clear, useful plots from that data.
- Build dashboards (think Tableau) if needed.
- Explain the charts or data in a way that makes sense to non-data people.
- Put together a report with all the findings.
It's not just about generating SQL — far from it. You're talking about a tool that understands the context, adapts to different situations, and handles multiple steps in a workflow seamlessly. That's no small task.
And you'll find your business stakeholders want the whole process to be as simple as one click.
The Real Problem: Workflow Design
Each of these tasks would need its own properly scoped workflow. Trying to build one giant system that covers every possible scenario would turn into an endless game of patching it for new cases.
If you really want to replace a DA, you'd need to design each workflow to be powerful and flexible. And honestly, each of these workflows — SQL generation, visualization, reporting — could probably be its own standalone product or even the foundation of a startup. That's how big the opportunity is if it's done right.
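To make the decomposition concrete, the "one click" a stakeholder expects is really a chain of independently scoped workflows. In this sketch every step body is a stub standing in for what would, in reality, each be a substantial system (the names and return values are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class AnalysisResult:
    sql: str = ""
    chart_path: str = ""
    explanation: str = ""
    report: str = ""

def generate_sql(request: str) -> str:
    return f"-- SQL for: {request}"        # stub: real step = Text2SQL workflow

def make_chart(sql: str) -> str:
    return "chart.png"                     # stub: real step = visualization workflow

def explain(sql: str, chart: str) -> str:
    return "Plain-language summary."       # stub: real step = narration workflow

def compile_report(res: AnalysisResult) -> str:
    # stub: real step = reporting/dashboard workflow
    return f"{res.sql}\n{res.chart_path}\n{res.explanation}"

def one_click(request: str) -> AnalysisResult:
    """Chain the scoped workflows into the 'one click' the stakeholder expects."""
    res = AnalysisResult()
    res.sql = generate_sql(request)
    res.chart_path = make_chart(res.sql)
    res.explanation = explain(res.sql, res.chart_path)
    res.report = compile_report(res)
    return res
```

The orchestration itself is trivial; the point is that each stub hides a workflow that needs its own scoping, evaluation, and error handling, which is why each one could plausibly be a standalone product.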
My takeaway is this: if your goal is to reduce the number of DAs you need or cut down on the time they spend on tasks, that's absolutely doable. But if you're aiming to fully replace a DA, that's a much bigger and more ambitious challenge.
You need to ask yourself: are you just looking to create a nice PoC to show off in a demo, or do you want to build out the entire solution? If it's the latter, it might actually be more practical to integrate services from existing providers rather than trying to develop everything from scratch.