Eat Study Love

먹고 공부하고 사랑하라

Data Science/Research

PostgreSQL SQL2NL(6)

eatplaylove 2025. 3. 14. 18:21

지난 연구내용 요약

원활히 진행중인 NL2SQL에서 Prompt Engineering을 어떻게 진행하는지 확인하고, 다시 이와 유사하게 SQL2NL Prompt Engineering을 진행해봐야겠다.

 

Prompt Engineering이 제대로 되지 않은 거 같아서 다시 시도를 해보려한다.

 

어떤 것들을 해봐야할까?

 

1. GPT4.0 기준 일반 SQL -> NL 과 SQL -> NL(+ Metadata ) 라도 제대로 비교해놔야겠다. 저번엔 이것을 생짜로 GPT에 복붙했었는데, GPT API를 이용한 code로 좀 더 정밀하게 NL변환작업을 해야겠다.

 

2. 1번에서 코드가 완성된다면, 여기다가 Parsing정보(from PSQL)를 추가해서도 NL변환작업을 진행해본다.

 

Gold SQL Predicted Text Predicted Text ( + DB Schema metadata )
SELECT `Free Meal Count (Ages 5-17)` / `Enrollment (Ages 5-17)` FROM frpm WHERE `Educational Option Type` = 'Continuation School' AND `Free Meal Count (Ages 5-17)` / `Enrollment (Ages 5-17)` IS NOT NULL ORDER BY `Free Meal Count (Ages 5-17)` / `Enrollment (Ages 5-17)` ASC LIMIT 3 The command is asking to retrieve the ratio of the count of free meals for ages 5-17 to the enrollment count for ages 5-17 from the 'frpm' table. However, it only wants to consider rows where the educational option type is 'Continuation School' and where the ratio of free meal count to enrollment count is not null. The results should be ordered in ascending order based on this ratio. Only the first three results that meet these criteria should be returned. The query is requesting the ratio of the count of free meals provided to students aged 5-17 to the total enrollment of students aged 5-17. This is being requested from the 'frpm' table for schools that are classified as 'Continuation School' under the 'Educational Option Type'. The query is specifically asking for cases where this ratio is not null, meaning both the free meal count and enrollment for ages 5-17 are available. The results are then ordered in ascending order, meaning from the lowest to the highest ratio. The query is limited to the first 3 results, so it will only return the three lowest ratios.
Predicted Text ( + Parsing info) Predicted Text ( + Metadata, Parsing info) Gold Text
The query is selecting the ratio of the 'Free Meal Count (Ages 5-17)' to the 'Enrollment (Ages 5-17)' from the 'frpm' data table. It is only selecting records where the 'Educational Option Type' is 'Continuation School' and where the calculated ratio is not null. The results are sorted in ascending order based on this ratio. The query will return at most 3 records. The SQL query is selecting the ratio of the count of free meals provided to students aged 5-17 to the total number of students aged 5-17 enrolled. This is being done from the 'frpm' table where the type of educational option is a 'Continuation School' and where this ratio is not null. The results are then sorted in ascending order by this ratio. The query will return at most 3 records. Please list the lowest three eligible free rates for students aged 5-17 in continuation schools.

 

결과는 좀 참혹하다.

 

뭔가 주절주절 말만 많지 천하의 LLM이라도 Gold-text처럼 좀 깔끔하게 문장을 만들어내진 못한다.

그리고 Query 20개 정도로 test를 해보았을 뿐인데.. API 이용요금이 ㅎㄷㄷ..

실험 한 5분 했나? 근데 거의 4~5천원이 빨렸다.

AI기업들이 대거 폭리를 취할 수 있는 이유를 조금이나마 느끼게 되더라는..ㅠㅠ

 

뭔가 문항들을 분석해보니, DB table schema만 전달했을 때 좀 더 사람에게 친숙한? Language가 반환되는 거 같다. parsing info가 추가되니까 좀 더 'SQL'문에 대한 해석이 Main이 된다.

 

일단 목적은, 사람이 이해하기 쉬워야 하니까 DB-schema data를 추가한 버전이 내 Test 상에선 제일 적합했다. 근데 이게 또 정확도를 사람의 눈으로 책정하다보니까.. 주관적인 의견이 들어가긴 해서 애매하긴하다.

 

LLM을 돌릴 때 Humantic(?)한 문구로 반환을 요청이라도 해야하나.. 고민이 된다.

 

위 예시보단, 아래 예시에서 내가 느낀 부분이 좀 더 잘 보일 것이다.

Gold SQL Predicted Text Predicted Text ( + DB Schema metadata )
SELECT T2.Phone FROM frpm AS T1 INNER JOIN schools AS T2 ON T1.CDSCode = T2.CDSCode WHERE T1.`Charter Funding Type` = 'Directly funded' AND T1.`Charter School (Y/N)` = 1 AND T2.OpenDate > '2000-01-01' The command is asking to retrieve the phone numbers from the 'schools' table. However, it only wants the phone numbers for the schools that meet certain conditions. These conditions are that the school must be directly funded according to the 'frpm' table, it must be a charter school according to the 'frpm' table, and it must have an open date later than January 1, 2000 according to the 'schools' table. The 'frpm' and 'schools' tables are linked by a common field, 'CDSCode'. The SQL query is asking for the phone numbers of all schools that are directly funded charter schools, which opened after January 1, 2000. The data is being pulled from two tables: the 'frpm' table and the 'schools' table. The tables are linked by a common field, 'CDSCode', which is a unique identifier for each school.
Predicted Text ( + Parsing info) Predicted Text ( + Metadata, Parsing info) Gold Text
The query is selecting the phone numbers from the 'schools' table. It does this by joining the 'frpm' table and the 'schools' table on the common field 'CDSCode'. However, it only selects the records where the 'Charter Funding Type' in the 'frpm' table is 'Directly funded', the 'Charter School (Y/N)' field in the 'frpm' table is 1 (indicating it is a charter school), and the 'OpenDate' in the 'schools' table is later than January 1, 2000. The SQL query is asking for the phone numbers of schools. It is looking in two tables: 'frpm' and 'schools'. It is specifically looking for schools that are directly funded charter schools, which opened after January 1, 2000. The 'frpm' and 'schools' tables are connected by a common field called 'CDSCode'. Please list the phone numbers of the direct charter-funded schools that are opened after 2000/1/1.

 

어떻게 하면 좀 더 Human friendly 한 NL반환이 될 지 고민해봐야겠다.

 

무작정 LLM을 돌리면서 고민하자니 돈이 너무 깨지니까, 고민을 잘 하고 수행하자.