[ACL 2025 main] SCAR: Data Selection via Style Consistency Aware Response Ranking for Efficient Instruction Tuning of Large Language Models
Overview
SCAR is a data-selection method that makes instruction tuning of large language models more efficient. It ranks responses by style consistency to identify the training examples most beneficial for instruction tuning, improving the performance of models fine-tuned on the selected subset.
Installation
Ensure you have Python 3.8+ installed, then run:
pip install scar-tool
Requirements
torch >= 2.3
transformers >= 4.37
huggingface_hub >= 0.23
scikit-learn
tqdm
nltk
datasketch
These packages install automatically with scar-tool.
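The package installs under the style_ranker namespace used in the examples below, so a quick import check confirms the installation:

# Sanity check: these imports should succeed after `pip install scar-tool`
from style_ranker.rank import rank_and_filter
from style_ranker.ranker.model import StyleRanker
print("scar-tool installed correctly")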
Usage
Basic example (Hugging Face Transformers)
import torch
from transformers import AutoTokenizer
from style_ranker.ranker.model import StyleRanker

# Load the ranker and its tokenizer
model_path = "lizhuang144/scar-gte-base"
model = StyleRanker.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Example instruction-response pairs; the first answer is deliberately
# off-style and should receive a lower score
instructions = ["Write a poem about spring",
                "Explain quantum computing"]
answers = ["I am sorry. Who are you? Why should I tell you anything about poem",
           "Quantum computing is a type of computation..."]

# Tokenize instructions and answers separately
ins = tokenizer(instructions, return_tensors="pt", padding=True,
                truncation=True, max_length=512)
ans = tokenizer(answers, return_tensors="pt", padding=True,
                truncation=True, max_length=512)

# Score each pair; higher scores indicate better style consistency
model.eval()
with torch.no_grad():
    scores = model(ins.input_ids, ins.attention_mask,
                   ans.input_ids, ans.attention_mask)

for i, (instr, answ, s) in enumerate(zip(instructions, answers, scores)):
    print(f"{i+1}. {instr}\n   {answ}\n   Score: {s.item():.2f}\n")
Advanced: rank and filter
import torch
from style_ranker.rank import rank_and_filter

model_path = "lizhuang144/scar-gte-base"
device = "cuda" if torch.cuda.is_available() else "cpu"

instructions = ["Write a poem about spring",
                "Explain quantum computing",
                "Describe the water cycle"]
answers = ["I am sorry. Who are you? Why should I tell you anything about poem",
           "Quantum computing is a type of computation...",
           "The water cycle, also known as..."]

# Three ways to filter: keep the k best pairs, keep pairs scoring above a
# threshold, or keep a fixed fraction of the data
topk_pairs = rank_and_filter(model_path, instructions, answers, topk=2, device=device)
threshold_pairs = rank_and_filter(model_path, instructions, answers, threshold=-2.0, device=device)
ratio_pairs = rank_and_filter(model_path, instructions, answers, ratio=0.5, device=device)
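Each call returns the selected pairs. Assuming the items are (instruction, answer, score) tuples (check the returned structure in your version of scar-tool), a short sketch for exporting the selection as JSONL for fine-tuning:

import json

# Assumed item layout: (instruction, answer, score); adjust if it differs
with open("selected_pairs.jsonl", "w", encoding="utf-8") as f:
    for instruction, answer, score in topk_pairs:
        f.write(json.dumps({"instruction": instruction, "answer": answer}) + "\n")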
Tip: SCAR models currently do not support non‑English data or automatic de‑duplication. Exclude non‑English examples and remove duplicates before filtering.
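Both steps are left to the user, so here is a minimal pre-filtering sketch. The ASCII-ratio heuristic is a crude stand-in for a real language-identification model, and the near-duplicate removal uses MinHash LSH from the datasketch package already listed in the requirements; all thresholds are illustrative:

from datasketch import MinHash, MinHashLSH

def is_probably_english(text, ascii_threshold=0.95):
    # Crude heuristic: mostly-ASCII text is assumed English; swap in a real
    # language-ID model for anything serious
    return bool(text) and sum(c.isascii() for c in text) / len(text) >= ascii_threshold

def dedupe(pairs, threshold=0.9, num_perm=128):
    # Drop near-duplicates using MinHash LSH over whitespace tokens
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for idx, (instruction, answer) in enumerate(pairs):
        m = MinHash(num_perm=num_perm)
        for token in f"{instruction} {answer}".split():
            m.update(token.encode("utf-8"))
        if not lsh.query(m):  # no similar pair seen yet
            lsh.insert(str(idx), m)
            kept.append((instruction, answer))
    return kept

pairs = [(i, a) for i, a in zip(instructions, answers)
         if is_probably_english(i + " " + a)]
pairs = dedupe(pairs)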
Citation (ACL 2025)
@inproceedings{li-etal-2025-scar,
    title = "{SCAR}: Data Selection via Style Consistency-Aware Response Ranking for Efficient Instruction-Tuning of Large Language Models",
    author = "Li, Zhuang and
      Hua, Yuncheng and
      Vu, Thuy-Trang and
      Zhan, Haolan and
      Qu, Lizhen and
      Haffari, Gholamreza",
    editor = "Che, Wanxiang and
      Nabende, Joyce and
      Shutova, Ekaterina and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.625/",
    pages = "12756--12790",
    ISBN = "979-8-89176-251-0"
}
License
MIT License – © 2024 Zhuang Li
Unlock efficient instruction tuning for your next LLM project with SCAR!