[ACL 2025 main] SCAR: Data Selection via Style Consistency Aware Response Ranking for Efficient Instruction Tuning of Large Language Models

Overview

SCAR is a data-selection method for instruction tuning of large language models. It ranks instruction-response pairs with a style-consistency-aware ranker and selects the highest-ranked pairs as training data, with the goal of improving the performance of the tuned model.


Installation

Ensure you have Python 3.8+ installed, then run:

pip install scar-tool

Requirements

torch >= 2.3, transformers >= 4.37, huggingface_hub >= 0.23, scikit-learn, tqdm, nltk, datasketch. These packages install automatically with scar-tool.

Usage

Basic example (Hugging Face Transformers)

import torch
from transformers import AutoTokenizer
from style_ranker.ranker.model import StyleRanker

model_path = "lizhuang144/scar-gte-base"
model = StyleRanker.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

instructions = ["Write a poem about spring",
                "Explain quantum computing"]
answers = ["I am sorry. Who are you? Why should I tell you anything about poem",
           "Quantum computing is a type of computation..."]

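# Tokenize instructions and answers separately; the ranker takes both as inputs.
ins = tokenizer(instructions, return_tensors="pt", padding=True,
                truncation=True, max_length=512)
ans = tokenizer(answers, return_tensors="pt", padding=True,
                truncation=True, max_length=512)

# Score each instruction-answer pair in inference mode, without tracking gradients.
model.eval()
with torch.no_grad():
    scores = model(ins.input_ids, ins.attention_mask,
                   ans.input_ids, ans.attention_mask)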

for i, (instr, answ, s) in enumerate(zip(instructions, answers, scores)):
    print(f"{i+1}. {instr}\n   {answ}\n   Score: {s.item():.2f}\n")
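
The scores printed above can also be used directly to select pairs. The following is a minimal sketch in plain Python, assuming that a higher score marks a more desirable response. The helper name `select_pairs`, its inclusive threshold, and the example scores are hypothetical and not part of scar-tool; it only illustrates what top-k, threshold, and ratio selection mean over a list of scores.

```python
from typing import List, Optional, Sequence, Tuple

Pair = Tuple[str, str, float]


def select_pairs(
    instructions: Sequence[str],
    answers: Sequence[str],
    scores: Sequence[float],
    topk: Optional[int] = None,
    threshold: Optional[float] = None,
    ratio: Optional[float] = None,
) -> List[Pair]:
    """Hypothetical helper: keep the best-scoring (instruction, answer, score) triples."""
    if sum(x is not None for x in (topk, threshold, ratio)) != 1:
        raise ValueError("set exactly one of topk, threshold, ratio")

    # Sort from highest to lowest score (assumption: higher is better).
    ranked = sorted(zip(instructions, answers, scores),
                    key=lambda p: p[2], reverse=True)

    if topk is not None:
        return ranked[:topk]
    if ratio is not None:
        # Keep the top fraction of pairs, rounded down.
        return ranked[:int(len(ranked) * ratio)]
    # Inclusive threshold is a choice made for this sketch.
    return [p for p in ranked if p[2] >= threshold]


# Made-up scores for illustration only; these are not real model outputs.
demo_instructions = ["a", "b", "c"]
demo_answers = ["x", "y", "z"]
demo_scores = [-3.1, 0.4, -1.2]

top2 = select_pairs(demo_instructions, demo_answers, demo_scores, topk=2)
above = select_pairs(demo_instructions, demo_answers, demo_scores, threshold=-2.0)
half = select_pairs(demo_instructions, demo_answers, demo_scores, ratio=0.5)
```

With real model output, the tensor scores would first need converting to floats, for example with `s.item()` as in the loop above.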

Advanced: rank and filter

from style_ranker.rank import rank_and_filter
import torch

model_path = "lizhuang144/scar-gte-base"
device = "cuda" if torch.cuda.is_available() else "cpu"

instructions = ["Write a poem about spring",
                "Explain quantum computing",
                "Describe the water cycle"]
answers = ["I am sorry. Who are you? Why should I tell you anything about poem",
           "Quantum computing is a type of computation...",
           "The water cycle, also known as..."]

# Keep the 2 highest-ranked pairs.
topk_pairs      = rank_and_filter(model_path, instructions, answers, topk=2, device=device)
# Keep pairs selected by a score threshold of -2.0.
threshold_pairs = rank_and_filter(model_path, instructions, answers, threshold=-2.0, device=device)
# Keep the top 50% of pairs.
ratio_pairs     = rank_and_filter(model_path, instructions, answers, ratio=0.5, device=device)

Tip: SCAR models currently support neither non-English data nor automatic de-duplication. Remove non-English examples and duplicates from your data before filtering.
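
A pre-cleaning pass along the lines of the tip could be sketched as follows. This is an illustration under stated assumptions, not part of scar-tool: `clean_pairs`, `looks_english`, and `normalize` are hypothetical helpers, duplicates are detected only by exact match after case and whitespace normalization, and "English" is approximated by the share of ASCII characters, which is a crude heuristic. A dedicated language-identification tool and near-duplicate detection would likely be more robust.

```python
from typing import List, Sequence, Tuple


def looks_english(text: str, min_ascii_ratio: float = 0.9) -> bool:
    """Crude heuristic (assumption): treat mostly-ASCII text as English."""
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ch.isascii())
    return ascii_chars / len(text) >= min_ascii_ratio


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def clean_pairs(
    instructions: Sequence[str], answers: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Drop pairs that look non-English, then drop exact duplicates."""
    seen = set()
    kept_instructions: List[str] = []
    kept_answers: List[str] = []
    for ins, ans in zip(instructions, answers):
        if not (looks_english(ins) and looks_english(ans)):
            continue
        key = (normalize(ins), normalize(ans))
        if key in seen:
            continue  # Same pair already kept once.
        seen.add(key)
        kept_instructions.append(ins)
        kept_answers.append(ans)
    return kept_instructions, kept_answers


raw_instructions = ["Explain quantum computing",
                    "explain  quantum computing",
                    "解释量子计算"]
raw_answers = ["Quantum computing is...",
               "Quantum computing is...",
               "量子计算是..."]

clean_instructions, clean_answers = clean_pairs(raw_instructions, raw_answers)
```

The cleaned lists can then be passed to `rank_and_filter` in place of the raw ones.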

Citation (ACL 2025)

@inproceedings{li-etal-2025-scar,
    title = "{SCAR}: Data Selection via Style Consistency-Aware Response Ranking for Efficient Instruction-Tuning of Large Language Models",
    author = "Li, Zhuang  and
      Hua, Yuncheng  and
      Vu, Thuy-Trang  and
      Zhan, Haolan  and
      Qu, Lizhen  and
      Haffari, Gholamreza",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.625/",
    pages = "12756--12790",
    ISBN = "979-8-89176-251-0"
}

License

MIT License – © 2024 Zhuang Li
