GitHub Search CLI

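This document summarizes the project's implementation and test results, from the baseline, failure analysis, and rule-based hardening through to the Part 2 LLM evaluation pipeline.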

TL;DR

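The assignment's target of 85% on all three models was not reached before the deadline; the recorded scores are 76% for OpenAI GPT5-mini and 66% for MiniMax.
The architecture is: user input -> LLM rewrites it into an explicit requirement -> program validates it -> program output
For the analysis report, see process\039-blind30-reviewer-evidence-report.md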

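Conclusion

I used a hybrid of rule-based parsing and an LLM, because rules, however detailed, cannot cover most cases. I hoped the LLM would handle semantic understanding and fill the gaps left by the rule-based approach, but the 30-question blind test showed it was not that simple. The first scores were very poor, essentially guessing, and only climbed gradually after adjustments.

Getting an LLM to behave is hard, especially when the answer depends on facts that have to be looked up. Good tools are necessary to keep hallucination out.

Evaluation is very hard. Whether the output for a query is wrong, right, or semantically ambiguous affects accuracy a great deal. So far I have not reached a conclusion about which metric is best, and the evaluation is still being iterated on. The early scores were generally low because the scoring program was too strict: case, ordering, and singular/plural differences, none of which affect the correctness of the query result, were all counted as errors. An LLM judge could be introduced as a secondary fallback.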

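1. Project goals and scope

GitHub is one of the most commonly used tools in software development, so I chose it for this take-home assignment. Because the GitHub API is very large, I narrowed the scope to search. The goal is a CLI tool: the user enters a natural-language query, the system converts it into a structured query that the GitHub Search API understands, calls the API, and returns the results.

GitHub Search API analysis

GitHub's search API has three key parts: endpoint, qualifier, and syntax.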

Endpoint

We start with the endpoint. It is the entry point of a query and determines which kind of data can actually be retrieved:

Endpoint Example
Search code:
https://api.github.com/search/code?q=Q

Search commits:
https://api.github.com/search/commits?q=Q

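Other endpoints include labels, users, and issues. This means that when a user sends a compound question (one that spans multiple endpoints), the query must be split into two or more searches. That makes the implementation harder, but it is also where an LLM can be put to good use.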

Qualifier

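Once the endpoint has received the request, qualifiers filter the results. Each endpoint has its own set of usable qualifiers. How the LLM learns which qualifiers are available, and which terms to attach to them so that they work, is what the implementation has to solve.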

Qualifier Example
in:title	warning in:title matches issues with "warning" in their title.
in:body	error in:title,body matches issues with "error" in their title or body.
in:comments	shipit in:comments matches issues mentioning "shipit" in their comments.

Syntax

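Syntax governs how Boolean operators, dates, and other relationships between numbers and text are expressed. The examples here cover date expressions.

Syntax Example
>YYYY-MM-DD	cats created:>2016-04-29 matches issues with the word "cats" that were created after April 29, 2016.
YYYY-MM-DD..YYYY-MM-DD	cats pushed:2016-04-30..2016-07-04 matches repositories with the word "cats" that were pushed to between the end of April and July of 2016.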

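2. Problem discovery: passing natural language straight to GitHub

The earliest baseline is direct_search.py. This version does no processing: it wraps the user's entire input as the query and sends it.

This design does call the GitHub API, so it meets the minimum "baseline execution" requirement. But it quickly exposes the core problem of natural-language queries: the GitHub Search API does not know which words are instructions and which are search content.

For example, a test from the process records: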

Input:
top python repositories about machine learning

Direct passthrough query:
q=top python repositories about machine learning&sort=stars&order=desc&per_page=5

Observed result:
GitHub only returned 2 repositories.

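The problem is not that the API cannot search, but that top, repositories, and about are all literal search terms to GitHub. It does not automatically understand that:

top should map to sort=stars.
python should become language:Python.
repositories is the search target and should not stay in the query text.
about machine learning means the actual topic term is machine learning.

A more reasonable structured interpretation is: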

q=machine learning language:Python
sort=stars
order=desc
per_page=5

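I tried having incoming input fill slots first, to stabilize the output. This requires an interpretation layer that splits the user's sentence into slots, after which code generates the GitHub Search API request, instead of sending the whole natural-language sentence to the API.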
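The slot idea can be sketched as follows. This is a minimal illustration, not the project's actual code: the RepoSlots class, its field names, and build_repo_request are assumptions made for the example. Only the endpoint path and the q, sort, order, and per_page parameters come from the text above.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode


@dataclass
class RepoSlots:
    """Hypothetical slot container; the field names are illustrative."""
    topic: str
    language: Optional[str] = None
    sort: Optional[str] = None
    order: str = "desc"
    per_page: int = 5


def build_repo_request(slots: RepoSlots) -> str:
    """Render slots into a /search/repositories URL in a fixed order."""
    q_parts = [slots.topic]
    if slots.language:
        q_parts.append(f"language:{slots.language}")
    params = {"q": " ".join(q_parts)}
    if slots.sort:
        # sort and order are only sent when a ranking word was detected
        params["sort"] = slots.sort
        params["order"] = slots.order
    params["per_page"] = str(slots.per_page)
    return "https://api.github.com/search/repositories?" + urlencode(params)


# Slots for "top python repositories about machine learning"
url = build_repo_request(
    RepoSlots(topic="machine learning", language="Python", sort="stars")
)
print(url)
```

Because the URL is rendered by code from a fixed set of fields, the same slots always yield the same request, which is the stability the slot layer is meant to provide.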

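3. Break It: deliberately tested failure types

The Part 1 break tests focus on natural-language inputs that make the baseline or a simple parser fail. Representative cases listed in README.md and process include: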

uv run python direct_search.py "??python ? machine learning ??repo"
uv run python direct_search.py "python repositories created after 2020 about web server"
uv run python direct_search.py "popular repositories"
uv run python direct_search.py "python repos about machine learning and web server"
uv run python direct_search.py "top pythn repos about machien learing"

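These cases correspond to several kinds of problems:

Mixed language / encoding-corrupted input

The direct baseline does not understand Chinese or mixed-language prompts, and some strings even show up as mojibake. The baseline sends these tokens to GitHub unchanged, which pollutes the query.

Date constraint

created after 2020 is a time condition in natural language, but the direct baseline treats it as literal words. The correct behavior is to produce created:>=2020-01-01.

Overly broad query

popular repositories has no topic, language, owner, repo, or other constraint. Forcing the search produces overly broad, unspecific results. The later hardened CLI asks the user for more detail instead of pretending to know the intent.

Multiple topics mixed together

python repos about machine learning and web server raises two possible topics at once. Merging them directly yields an over-constrained keyword query, which may also not be what the user wants. The later design asks for clarification.

Typo

Misspellings such as pythn, machien, and learing make the direct baseline return nothing or drift off target. Later hardening adds a small correction dictionary that fixes only the known demo typos and does not claim to handle every typo.

A later live adversarial check added another important problem: conflicting date constraints. Input: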

python repositories created after 2024 before 2023 about web server

The parser at the time produced:

web server language:Python created:>=2024-01-01

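It ignored before 2023 and did not detect the conflict. This is recorded as an open issue: date parsing should detect multiple or mutually contradictory date constraints instead of consuming only the first one.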
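One possible shape for such a check is sketched below. It is not the project's parser: the function names are hypothetical, only the "after YYYY" and "before YYYY" forms are handled, and it assumes the convention used above that "after 2020" renders as created:>=2020-01-01.

```python
import re
from typing import Optional, Tuple


def parse_year_bounds(text: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (after_year, before_year); a bound is None when it is absent."""
    after = re.search(r"\bafter (\d{4})\b", text)
    before = re.search(r"\bbefore (\d{4})\b", text)
    return (
        int(after.group(1)) if after else None,
        int(before.group(1)) if before else None,
    )


def created_qualifier(text: str) -> str:
    """Build a created: qualifier, raising ValueError on contradictory bounds."""
    after, before = parse_year_bounds(text)
    if after is not None and before is not None:
        if after >= before:
            # The range would be empty, so stop instead of keeping one bound
            raise ValueError(
                f"conflicting date constraints: after {after} but before {before}"
            )
        return f"created:{after}-01-01..{before - 1}-12-31"
    if after is not None:
        return f"created:>={after}-01-01"
    if before is not None:
        return f"created:<{before}-01-01"
    return ""


for query in (
    "python repositories created after 2020 about web server",
    "python repositories created after 2024 before 2023 about web server",
):
    try:
        print(created_qualifier(query))
    except ValueError as exc:
        print("needs clarification:", exc)
```

Raising an error lets the caller turn the contradiction into a clarification question, in line with how the CLI already handles broad and mixed-topic requests.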

4. Part 1 implementation design: rule-based slot pipeline

Part 1 does not introduce an LLM; it keeps a rule-based parser. A later refactor organized the CLI explicitly into the following pipeline:

user text -> extractor -> normalized slots -> validator -> query builder -> GitHub API

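The core reason for this design is to separate "understanding the user's sentence" from "generating GitHub Search syntax", so that the API request does not come out in a different format each time.

Capabilities added during rule-based hardening include:

language aliases: such as python, js, typescript, golang.
topic markers: such as about, for, related to, plus some Chinese or mixed-language markers.
ranking words: such as top, popular, most stars, mapped to sort=stars&order=desc.
typo correction: fixes a few explicit demo typos, such as pythn, machien, learing.
date constraints: handles limited forms such as created after 2020, this year, last year, recent.
ambiguity stop: replies with a clarification for empty / overly broad / mixed-topic requests instead of calling the API.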
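A few of the capabilities listed above can be sketched in one small extractor. This is an illustrative reduction, not the contents of github_search.py: the table contents, the extract_slots name, and the returned dictionary keys are assumptions, and only single-word markers are handled.

```python
import re
from typing import Any, Dict, List

# Illustrative tables; the real rule set is larger than this.
LANGUAGE_ALIASES = {
    "python": "Python", "js": "JavaScript",
    "typescript": "TypeScript", "golang": "Go",
}
RANKING_WORDS = {"top", "popular"}
TYPO_FIXES = {"pythn": "python", "machien": "machine", "learing": "learning"}
STOP_WORDS = {"repo", "repos", "repositories", "about", "for"}


def extract_slots(text: str) -> Dict[str, Any]:
    """Turn a sentence into slots, or signal that clarification is needed."""
    warnings: List[str] = []
    tokens: List[str] = []
    for raw in re.findall(r"[a-z0-9+#.-]+", text.lower()):
        fixed = TYPO_FIXES.get(raw, raw)
        if fixed != raw:
            warnings.append(f"corrected '{raw}' to '{fixed}'")
        tokens.append(fixed)

    language = None
    sort = None
    topic_words: List[str] = []
    for tok in tokens:
        if tok in LANGUAGE_ALIASES:
            language = LANGUAGE_ALIASES[tok]
        elif tok in RANKING_WORDS:
            sort = "stars"
        elif tok not in STOP_WORDS:
            topic_words.append(tok)

    topic = " ".join(topic_words)
    # Ambiguity stop: with no topic and no language there is nothing to search
    action = "ask_user" if not topic and language is None else "search"
    return {"action": action, "topic": topic, "language": language,
            "sort": sort, "warnings": warnings}


print(extract_slots("top pythn repos about machien learing"))
print(extract_slots("popular repositories"))
```

Every behavior here is a table lookup, which keeps each rule easy to explain and also makes it clear that any word missing from the tables is not handled.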

Part 1 was later extended from repositories to issues / PRs / commits:

Repository search endpoint: /search/repositories
Issue / PR search endpoint: /search/issues
Commit search endpoint: /search/commits

New slots include:

repo
labels
state
item_type
author

This lets the CLI handle inputs such as:

open bug issues in microsoft/vscode

which produces:

repo:microsoft/vscode is:issue is:open label:bug

and:

commits fixing typo in github/gitignore after 2024

which produces:

fixing typo repo:github/gitignore author-date:>=2024-01-01

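With these rules added, the CLI handles the terms the rules cover well. However, this is not a complete GitHub Search parser; it is minimal, explainable hardening aimed at the high-risk cases that were found. The problems it still runs into are covered in section 5.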
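The issue example can be rendered from slots roughly as follows. This is a sketch under assumptions: build_issue_query and its validation sets are hypothetical, while the slot names (repo, item_type, state, labels, author) and the is:, label:, repo:, and author: qualifiers come from the text and from GitHub's search syntax.

```python
from typing import List, Optional

VALID_ITEM_TYPES = {"issue", "pr"}
VALID_STATES = {"open", "closed"}


def build_issue_query(
    repo: Optional[str] = None,
    item_type: Optional[str] = None,
    state: Optional[str] = None,
    labels: Optional[List[str]] = None,
    author: Optional[str] = None,
    text: str = "",
) -> str:
    """Render issue / PR slots into a q string in a fixed qualifier order."""
    if item_type is not None and item_type not in VALID_ITEM_TYPES:
        raise ValueError(f"unsupported item_type: {item_type}")
    if state is not None and state not in VALID_STATES:
        raise ValueError(f"unsupported state: {state}")

    parts: List[str] = []
    if text:
        parts.append(text)
    if repo:
        parts.append(f"repo:{repo}")
    if item_type:
        parts.append(f"is:{item_type}")
    if state:
        parts.append(f"is:{state}")
    for label in labels or []:
        # Labels containing whitespace must be quoted in the search syntax
        parts.append(f'label:"{label}"' if " " in label else f"label:{label}")
    if author:
        parts.append(f"author:{author}")
    return " ".join(parts)


# Slots for "open bug issues in microsoft/vscode"
print(build_issue_query(repo="microsoft/vscode", item_type="issue",
                        state="open", labels=["bug"]))
```

Rejecting unknown item_type or state values in the builder keeps an invalid slot from silently becoming an invalid query.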

5. Part 1 validation results and remaining gaps

Checks recorded as passing in Part 1 include:

uv run python -m py_compile github_search.py github_repo_search.py main.py direct_search.py
uv run python github_search.py "open bug issues in microsoft/vscode"
uv run python github_search.py "commits fixing typo in github/gitignore after 2024"
uv run python github_search.py "top python repositories about machine learning"

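The results of the five adversarial live checks were:

The mixed Chinese/English repository query was acceptable: it was converted to machine learning language:Python, and the API returned the expected Python machine-learning repositories.
popular repositories was judged too broad; the CLI asked the user for a topic, language, or time constraint and did not call the API blindly.
python repos about machine learning and web server was judged a mixed-topic ambiguity; the CLI asked the user to pick one topic.
The typo-heavy query top pythn repos about machien learing was corrected to machine learning language:Python, with typo correction warnings printed in the CLI.
The date-conflict case failed: created after 2024 before 2023 kept only the after condition and did not detect the contradiction.

So the Part 1 conclusion is: the rule-based pipeline is already more reliable than direct passthrough and handles several of the key failure cases the assignment asks for. But it still has clear limits:

multilingual support is marker-based, not real multilingual understanding.
typo handling covers only the explicit dictionary.
recent is only approximated with pushed:>= over the last 365 days, which is not the same as GitHub Trending.
multiple topics are currently handled by clarification, not by automatically splitting into several searches.
issue / PR / commit support covers only common fields and does not fully implement GitHub qualifiers such as milestone, review status, reactions, merge queue, commit hash, and author email.
detection of conflicting date constraints is still insufficient.

Takeaway: I found that the growing pile of typos, edge cases, and multilingual cases can never be fully maintained, and the code becomes brittle and bloated.

Introducing an LLM may be a good approach.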

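6. Part 2: dataset and validating model accuracy

Before implementing, I first defined how to test: the LLM under test runs a dataset of 30 "requirement -> query" items, each verified to be executable, and its query results are compared against the expected ones. These 30 questions act as a held-out validation set, so their results are not used to tune the prompt or the architecture.

The goal of Part 2 shifts from "building a CLI" to "designing a multi-model eval pipeline". The focus is not only on what the API returns, but on whether the model can reliably turn natural language into the correct structured query intent. Before building the dataset and evaluator, the project first strengthened its basis in the GitHub Search rules: github_search_rule records the official GitHub REST Search documentation.

The purpose of this step is not to change CLI behavior, but to let the rules for the later dataset, ground truth, and evaluator trace back to the GitHub REST Search documentation itself, instead of writing qualifiers from memory.

The project adds generate_github_search_dataset.py to generate the 30-question adversarial dataset. The dataset design follows the local github_search_rule documents and covers: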

repository search
code search
issue / PR search
commit search
user search
discussion / package and other areas that easily fall into unsupported or non-REST-Search scope
ranges、dates、exclusions、quoted phrases、sorting qualifiers

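Each record contains a manually labeled ground_truth. The ground truth has two kinds of expected behavior:

write_query: the model should produce one or more GitHub Search query / URL / slot-level fields.
ask_user: when the request is ambiguous or cannot be safely converted into a GitHub Search, the model should ask the user.

docs/github_search_candidate_pool.md was also created later as an additional candidate question pool. This candidate pool does not directly overwrite the official 30-question dataset, because the records explicitly leave the choice of the final set to the user.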

7. Part 2: LLM extractor pipeline design

The LLM version does not let the model freely produce the final URL; it reuses the slot pipeline:

natural language -> extractor -> slots -> validator -> deterministic query builder

llm_search_pipeline.py implements:

prompt YAML loading
OpenAI / Fireworks chat completion HTTP callers
model JSON output parsing
extractor output validation
deterministic GitHub query rendering
live GitHub smoke checks
slot scoring helpers
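The "model JSON output parsing" and "extractor output validation" steps might look roughly like the sketch below. The JSON shape (action, endpoint, slots, question) and both function names are assumptions for illustration; only the write_query / ask_user actions and the endpoint families come from this document.

```python
import json
from typing import Any, Dict, List

ALLOWED_ACTIONS = {"write_query", "ask_user"}
ALLOWED_ENDPOINTS = {"repositories", "code", "issues", "commits", "users"}


def parse_extractor_output(raw: str) -> Dict[str, Any]:
    """Pull the JSON object out of a model reply that may contain extra text."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw[start:end + 1])  # JSONDecodeError is a ValueError
    if not isinstance(data, dict):
        raise ValueError("model output is not a JSON object")
    return data


def validate_extractor_output(data: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the output is usable."""
    errors: List[str] = []
    action = data.get("action")
    if action not in ALLOWED_ACTIONS:
        errors.append(f"unknown action: {action!r}")
    if action == "write_query":
        if data.get("endpoint") not in ALLOWED_ENDPOINTS:
            errors.append(f"unsupported endpoint: {data.get('endpoint')!r}")
        if not isinstance(data.get("slots"), dict):
            errors.append("slots must be an object")
    if action == "ask_user" and not data.get("question"):
        errors.append("ask_user requires a question")
    return errors


reply = ('Here is the result: {"action": "write_query", "endpoint": "issues", '
         '"slots": {"repo": "microsoft/vscode", "state": "open"}}')
parsed = parse_extractor_output(reply)
print(validate_extractor_output(parsed))
```

Returning a list of errors instead of raising on the first one makes it easier to log every schema problem in a single eval run.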

Related runners / scorers include:

run_llm_eval.py: runs the multi-model eval.
score_llm_slots.py: scores at the slot stage, without calling GitHub.
run_direct_query_eval.py: the control group that generates raw GitHub queries directly.
score_direct_queries.py: evaluates the direct-query baseline.

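Slot-level scoring was chosen because GitHub search results change over time, whereas correct slots let the deterministic builder produce a stable query. It also lets error analysis separate model misunderstanding, schema errors, and drift in the search results.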
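A slot-level scorer that ignores case and list order could be sketched like this. It is not score_llm_slots.py: the function names and the metric (fraction of expected slots matched) are assumptions, and list-valued slots are assumed to contain strings only.

```python
from typing import Any, Dict


def normalize(value: Any) -> Any:
    """Lowercase and collapse whitespace in strings; sort lists of strings."""
    if isinstance(value, str):
        return " ".join(value.lower().split())
    if isinstance(value, (list, tuple, set)):
        return sorted(normalize(v) for v in value)
    return value


def score_slots(predicted: Dict[str, Any], expected: Dict[str, Any]) -> float:
    """Fraction of expected slots the prediction matches after normalization.

    Extra predicted slots are not penalized in this sketch.
    """
    if not expected:
        return 1.0 if not predicted else 0.0
    hits = sum(
        1 for key, want in expected.items()
        if normalize(predicted.get(key)) == normalize(want)
    )
    return hits / len(expected)


predicted = {"language": "python", "labels": ["Bug", "help wanted"], "state": "open"}
expected = {"language": "Python", "labels": ["help wanted", "bug"], "state": "closed"}
print(score_slots(predicted, expected))
```

Here language and labels match despite the case and order differences, while state does not, so two of the three expected slots are counted as correct.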

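The models used in Part 2 are:

closed-source: openai:gpt5-mini
open-weight / hosted via Fireworks: fireworks:qwen3-8b
open-weight / hosted via Fireworks: fireworks:minimax-m2p5

These models were chosen because they are relatively cheap, suitable for everyday use, and roughly comparable in capability.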

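8. Implementation process

The process was compiled by AI in the process\ folder. For a guided tour, read process\Workthrough.md

9. Entering the blind 30-question test

See process\039-blind30-reviewer-evidence-report.md

10. Key findings and design trade-offs

Why a slot-level pipeline instead of direct query?

The direct-query baseline failed badly (openai:gpt5-mini scored 2/10), which shows that letting the model write GitHub Search queries directly is not reliable. The advantages of the slot pipeline: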

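Validation layer: the validator checks slot completeness before the final URL is produced
Controlled rendering: the deterministic builder keeps the GitHub Search syntax correct and prevents the model from inventing unsupported qualifiers
Error analysis: it is clear whether the intent or the schema was wrong
Stable scoring: GitHub search results change, but slot scores do not

Why are a resolver and a two-stage architecture needed?

A single slot extractor fails in the following situations:

Profile term ambiguity: in Java developer, is Java a language or a description of the user's skills?
Label resolution: different projects use different label names, so a label the model guesses may not exist; a resolver can look the label up
Multi-step queries: one request has to be split into 2 to 3 GitHub Search calls
Repo ownership: microsoft/vscode vs VSCode. The LLM does not know which author or organization goes in front of vscode, so an extra tool is needed to look it up

Two-stage + resolver separates these:

Stage 1 understands the intent
Stage 2 extracts the slots
Tools do the "lookup" work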
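The separation can be illustrated with a small orchestration sketch. Everything here is hypothetical: the two stages are passed in as plain callables standing in for LLM calls, and REPO_TABLE is a hard-coded stand-in for a lookup tool that would query GitHub in a real resolver.

```python
from typing import Any, Callable, Dict, Optional

# Stand-in for a lookup tool; a real resolver would query GitHub instead.
REPO_TABLE = {"vscode": "microsoft/vscode", "gitignore": "github/gitignore"}


def resolve_repo(name: str) -> Optional[str]:
    """Return owner/repo for a bare repo name, or None when it is unknown."""
    if "/" in name:
        return name
    return REPO_TABLE.get(name.lower())


def run_two_stage(
    text: str,
    classify_intent: Callable[[str], str],
    extract_slots: Callable[[str, str], Dict[str, Any]],
) -> Dict[str, Any]:
    intent = classify_intent(text)       # Stage 1: understand the intent
    slots = extract_slots(text, intent)  # Stage 2: extract the slots
    repo = slots.get("repo")
    if repo is not None:
        resolved = resolve_repo(repo)    # Tool: lookup instead of guessing
        if resolved is None:
            return {"action": "ask_user",
                    "question": f"Which owner does '{repo}' belong to?"}
        slots = {**slots, "repo": resolved}
    return {"action": "write_query", "intent": intent, "slots": slots}


# Stub stages standing in for the two LLM calls
result = run_two_stage(
    "open bug issues in VSCode",
    classify_intent=lambda text: "issue_search",
    extract_slots=lambda text, intent: {"repo": "VSCode", "state": "open"},
)
print(result)
```

Keeping the resolver outside the model means an unknown repo name becomes a question to the user instead of a guessed owner.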

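Rule-based hardening vs LLM trade-offs

Looking back at the Part 1 conclusion: the rule-based CLI handles common cases but has a high maintenance cost and poor extensibility. The Part 2 LLM approach is more elegant, but it is clearly constrained by its design pattern.

A practical balance is:

For parts that are "error-prone but deterministic" (date formats, label names, repo formats), use tools or rules
For parts that "need semantic understanding" (intent, profile terms, multi-step decisions), use the LLM
Design the schema conservatively, so the model has fewer boundaries where it must guess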
