LLM Prompt Recovery: 24th Place Competition Summary

Overview

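I'm glad to have finished 24th in this competition, and I learned a lot from it. Among the top teams, some climbed the leaderboard with tricks, as we did, while others got there by training strong models. Below is a brief summary of the competition.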

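In this competition we are given two texts: an original text and a version of it rewritten by an LLM. The task is to recover, as closely as possible, the prompt that was given to the LLM to rewrite the text. Scoring is simple: it is based directly on how similar the predicted prompt is to the real one.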
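To make "similarity" concrete, here is a minimal sketch. It assumes the score is a cosine similarity between sentence embeddings of the predicted and true prompts, sharpened by an exponent (3 is assumed here). The function name and the toy vectors are illustrative and are not taken from the competition code.

```python
import math

def sharpened_cosine_similarity(a, b, exponent=3):
    """Cosine similarity of two embedding vectors, raised to `exponent`."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = dot / (norm_a * norm_b)
    # Sharpening keeps the sign and pushes mediocre matches toward zero.
    return math.copysign(abs(cosine) ** exponent, cosine)

# Toy 3-dimensional "embeddings"; real ones would come from an embedding model.
predicted = [1.0, 0.0, 0.0]
actual = [1.0, 1.0, 0.0]
score = sharpened_cosine_similarity(predicted, actual)
```

Under this assumption, a prediction that is only roughly right is penalized much more heavily than with plain cosine similarity, which helps explain why small wording changes in the submitted prompt move the score.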

Models

Our best model was the one fine-tuned by LUMOS, LB: 0.66.

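In addition, we used the dataset collected by TOMOO INUBUSHI. After a simple cleaning pass over it, I trained a Phi2 model. Starting from LUMOS's original notebook and changing only the prompt used to make the model generate prompts, this model reached just LB: 0.60.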

Below are some results we collected using LUMOS's notebook as the baseline.

Model	LB	Changed generation prompt	Changed mean prompt	Changed final prompt
LUMOS's phi2	0.61	False	False	False
LUMOS's phi2	0.63	True	False	False
LUMOS's phi2	0.65	True	True	False
LUMOS's phi2	0.67	True	True	True
Our phi2	0.60	True	False	False
Our phi2	0.62	True	False	False
Our phi2	0.65	True	True	False
Our phi2	0.66	True	True	True

For the final submission, we simply ensembled these two models.

Final scores:

Public leaderboard: 0.6730

Private leaderboard: 0.6697

About tricks

Next, here are some of the tricks we used to improve our score.

Structured prompts

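Although the baseline already contained a basic structured prompt, we improved it so that the input text is laid out more clearly. This can strengthen the model's logical reasoning and make the prompts it generates closer to what was actually used. For example: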

Original prompt:

prompt = f"Instruct: Original Text:{ori_text}\nRewritten Text:{rew_text}\nWrite a prompt that was likely given to the LLM to rewrite original text to rewritten text.\nOutput:"

Model output:

Rewrite the text as a catchy song chorus.
Rewrite the text as a dialogue between two characters.
Rewrite the text as a series of instructions for a recipe.
Rewrite the text as a motivational speech.
Rewrite the text as a news headline.
Rewrite the text as a poem stanza.
Rewrite the text as a dialogue between two characters in a movie.
Rewrite the text as a series of instructions for a recipe.
Rewrite the ...

As you can see, the model's output does not look like a prompt a human would actually write. Here is our optimized prompt:

prompt = f"You are an expert that expressing clearly.\n##Original Text:\n{ori_text}\n##Rewritten Text:{rew_text}\n##Task:\nGive you an original text and a rewritten text, you need to return a summary and a prompt that was likely given to the LLM to expressing the original text to the rewritten text.\n##Your Output:"

Model output:

Summary: Rewrite the text as a song, with verses and a chorus.
Prompt: Rewrite the text as a song, with verses and a chorus. Use the rewritten text as the lyrics for the song.
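The structured template and the "Summary / Prompt" output format above lend themselves to a small pair of helpers. This is a minimal sketch: `build_prompt` simply fills the template shown above, while `extract_prompt` is a hypothetical parser (not taken from our notebook) that assumes the model labels its answer with "Prompt:".

```python
def build_prompt(ori_text: str, rew_text: str) -> str:
    """Fill the structured template shown above with one text pair."""
    return (
        "You are an expert that expressing clearly.\n"
        f"##Original Text:\n{ori_text}\n"
        f"##Rewritten Text:{rew_text}\n"
        "##Task:\n"
        "Give you an original text and a rewritten text, you need to return "
        "a summary and a prompt that was likely given to the LLM to expressing "
        "the original text to the rewritten text.\n"
        "##Your Output:"
    )

def extract_prompt(model_output: str) -> str:
    """Return the text after the first 'Prompt:' label, or '' if it is missing."""
    for line in model_output.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("prompt:"):
            return stripped[len("prompt:"):].strip()
    return ""

output = (
    "Summary: Rewrite the text as a song, with verses and a chorus.\n"
    "Prompt: Rewrite the text as a song, with verses and a chorus. "
    "Use the rewritten text as the lyrics for the song."
)
recovered = extract_prompt(output)
```

Returning an empty string when the label is missing makes it easy to detect a failed generation downstream.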

Changing the mean prompt

The mean prompt is a default prompt that is used when the model fails to output a prompt or when the generated prompt is unsatisfactory.

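We tried modifying the mean prompt ourselves but could never get past a score of 0.61. Later, guided by the mean prompt in the LB: 0.63 notebook and by several excellent threads in the discussion forum (discussion1, discussion2, discussion3), I changed the mean prompt using the methods described there.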
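The fallback role of the mean prompt can be sketched as follows. This is only an illustration: `MEAN_PROMPT` is a placeholder rather than the string we submitted, and the checks in `choose_prompt` (minimum word count, repeated lines) are assumed heuristics for what counts as an unsatisfactory generation.

```python
# Placeholder only: the mean prompt we actually submitted is not reproduced here.
MEAN_PROMPT = "Rewrite this text."

def choose_prompt(generated: str, mean_prompt: str = MEAN_PROMPT, min_words: int = 3) -> str:
    """Return the generated prompt if it looks usable, else the mean prompt."""
    candidate = generated.strip()
    too_short = len(candidate.split()) < min_words
    # Repeated lines, as in the baseline output shown earlier, count as a failure.
    lines = [ln.strip() for ln in candidate.splitlines() if ln.strip()]
    repetitive = len(lines) > 1 and len(set(lines)) < len(lines)
    if not candidate or too_short or repetitive:
        return mean_prompt
    return candidate
```

For example, `choose_prompt("")` falls back to the mean prompt, while a single well-formed sentence is passed through unchanged.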

Concatenating the model output

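Following the approach described in the discussion forum, we concatenated the model's output prompt with some fixed strings. We then bluntly deleted some adverbs from the prompt (yes, deleted outright rather than rewritten), along with some words that we rarely use in everyday writing. This way of assembling the prompt gave our score a big boost.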
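A minimal sketch of this post-processing step follows. The suffix and the word list are hypothetical stand-ins: the actual strings and deleted words are not reproduced in this write-up, and `postprocess` is an illustrative name.

```python
import re

# Hypothetical values, used only to illustrate the mechanics.
SUFFIX = " Improve the text."
WORDS_TO_DROP = ("carefully", "creatively", "eloquently")

def postprocess(prompt, suffix=SUFFIX, words_to_drop=WORDS_TO_DROP):
    """Delete unwanted words outright, tidy whitespace, then append a fixed string."""
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words_to_drop) + r")\b"
    text = re.sub(pattern, "", prompt, flags=re.IGNORECASE)
    # Collapse the gaps left behind by the deleted words.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove any space that now sits directly before punctuation.
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    return text + suffix

final_prompt = postprocess("Rewrite the text carefully.")
```

Matching on word boundaries keeps the deletion from touching longer words that merely contain one of the listed words.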

Training the Phi2 model

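This part covers training Phi2. The first step is data preparation: the training data came directly from the dataset compiled by TOMOO, on which we performed some simple cleaning.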

Data link: https://www.kaggle.com/datasets/ibrahim2002/all-in-one-dataset-with-embedding

Data cleaning:

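Remove null values
Remove duplicates
Remove samples whose rewritten text starts with "sure" (unqualified output from one of Google's LLMs that does not meet the requirements)
Remove samples whose text is unchanged by the rewrite
Filter out samples whose prompt has fewer than 5 words
Filter out prompts containing special symbols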
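The last two rules can be sketched in plain Python. This is an illustration under assumptions: "special symbols" is taken to mean anything outside letters, digits, whitespace and common punctuation, and `keep_prompt` is a hypothetical helper rather than the check used in our notebook.

```python
import re

# Assumption: characters outside this set count as "special symbols".
ALLOWED = re.compile(r"^[A-Za-z0-9\s.,;:!?'\"()\-]+$")

def keep_prompt(prompt, min_words=5):
    """Return True if the prompt is long enough and has no special symbols."""
    if prompt is None:
        return False
    text = prompt.strip()
    if len(text.split()) < min_words:
        return False
    return bool(ALLOWED.match(text))

prompts = [
    "Rewrite the text as a poem.",
    "Make it short",
    "Rewrite the text as a poem ###",
]
kept = [p for p in prompts if keep_prompt(p)]
```

The allowed character set is the main knob here: a stricter set removes more noisy prompts but also discards legitimate prompts that contain unusual characters.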

Below is part of the code I used for data cleaning:

# The cleaning runs on Kaggle, so faiss-gpu and sentence-transformers need to be installed
!pip install faiss-gpu sentence-transformers

import numpy as np
import pandas as pd
import os
import gc
import sys
import re
import faiss
import swifter  # speeds up pandas apply
import torch
import polars as pl
from pathlib import Path
import torch.nn.functional as F
from tqdm.notebook import tqdm
from sklearn.metrics.pairwise import cosine_similarity

# Input data paths
path1 = "/kaggle/input/all-in-one-dataset-with-embedding/df_with_emb.parquet"
path2 = "/kaggle/input/llmpr-public-10k-unique/public_10k_unique_rewrite_prompt.csv"

# Read the data and save it as CSV first, which makes the later merge easier
data1 = pd.read_parquet(path1)
data1 = data1[["dataset_id", "original_text", "rewrite_prompt", "rewritten_text"]]
data1 = data1.rename(columns={'dataset_id': 'id'})
data1.to_csv("clear_vector.csv", index=False)

data2 = pd.read_csv(path2)
data2.head()

# After confirming that the rows and columns look right, merge the data
df = pd.concat([data1, data2], ignore_index=True, axis=0)
df.head()

# Start cleaning the data
print(f'Before removal: {df.shape}')
# Drop rows with missing values
df = df[~df.original_text.isnull()].reset_index(drop=True)
df = df[~df.rewritten_text.isnull()].reset_index(drop=True)
df = df[~df.rewrite_prompt.isnull()].reset_index(drop=True)
print(f'Remove NaN: {df.shape}')
# Drop duplicates
df = df.drop_duplicates(['original_text','rewritten_text','rewrite_prompt']).reset_index(drop=True)
print(f'Remove duplicates: {df.shape}')
# Drop rows whose rewritten text starts with "sure"
df = df[~df['rewritten_text'].str.lower().str.startswith('sure')].reset_index(drop=True)
print(f'Remove rewritten texts starting with sure: {df.shape}')
# Drop rows where the rewrite left the text unchanged
df = df[df.original_text != df.rewritten_text ].reset_index(drop=True)
print(f'Remove unchanged texts: {df.shape}')
print(f'After removal: {df.shape}')
# Save the data after this simple cleaning pass
df.to_csv("llm_prompts_rewrite.csv", index=False)

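# Continue cleaning with polars; the dataset is fairly large and polars handles it faster
pldf = pl.read_csv("llm_prompts_rewrite.csv")
# Keep only prompts with at least 5 space-separated words
pldf = pldf.filter(pl.col("rewrite_prompt").str.split(" ").list.len() >= 5)
# Drop prompts that are empty or literal null markers
pldf = pldf.filter(~pl.col("rewrite_prompt").str.contains(r'^(?:\s*|NULL|null|NaN)$'))
# Disabled: special-symbol filter (check_string is not included in this excerpt)
# pldf = pldf.filter(pl.col("original_text").apply(check_string))
pldf.glimpse()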

--------------------------------

# Filter out near-duplicate prompts
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
vector = model.encode(pldf["rewrite_prompt"].to_numpy(), batch_size=128, show_progress_bar=True, device="cuda", convert_to_tensor=True)

torch.cuda.empty_cache()

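# Similarity threshold
threshold = 0.93
# Number of nearest neighbors retrieved per query
n_neighbors = 1000
# Number of queries processed per batch
batch_size = 1000
# (row index, indices of similar rows) pairs
similar_vectors = []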

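# Build a Faiss IndexFlatIP index.
# "IP" stands for inner product, which equals cosine similarity
# only when both the indexed and the query vectors are L2-normalized.
# CPU alternative: index = faiss.IndexFlatIP(dim)
normalized = F.normalize(vector).cpu().numpy()
dim = normalized.shape[1]  # 1024 for mxbai-embed-large-v1
res = faiss.StandardGpuResources()
flat_config = faiss.GpuIndexFlatConfig()
flat_config.device = 0
index = faiss.GpuIndexFlatIP(res, dim, flat_config)
index.add(normalized)
for i in tqdm(range(0, len(normalized), batch_size)):
    # Process the data in batches; queries use the same normalized vectors
    batch_data = normalized[i:i + batch_size]
    # Retrieve the most similar vectors and their similarities
    similarities, indices = index.search(batch_data, n_neighbors)

    # Collect neighbors above the threshold so they can be removed later
    for j in range(similarities.shape[0]):
        current = i + j
        close_vectors = indices[j, similarities[j] >= threshold]
        # Only mark later rows, so the first prompt of each similar group is kept
        # (marking every neighbor would delete both members of each similar pair)
        close_vectors = close_vectors[close_vectors > current]
        similar_vectors.append((current, close_vectors))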

pldf = pldf.with_columns(pl.Series(values=list(range(len(pldf))), name="index"))
pldf = pldf.filter(~pl.col("index").is_in(np.unique(np.concatenate([x for _, x in similar_vectors])).tolist()))

pldf.write_csv("prompts_rewrite.csv")
pldf.head()
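For a small number of prompts, the idea behind the similarity filter can be sketched without Faiss or a GPU. This is a brute-force, pure-Python variant written for illustration, not the code we ran: it compares every vector against the ones already kept and keeps the first prompt of each group of near-duplicates. The function names and toy vectors are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def drop_near_duplicates(vectors, threshold=0.93):
    """Return indices to keep: a row is dropped if it is too similar to a kept row."""
    unit = [normalize(v) for v in vectors]
    kept = []
    for i, u in enumerate(unit):
        is_duplicate = any(
            sum(a * b for a, b in zip(u, unit[k])) >= threshold for k in kept
        )
        if not is_duplicate:
            kept.append(i)
    return kept

# Toy 2-dimensional "embeddings": rows 0 and 1 point in almost the same direction.
toy_vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept_rows = drop_near_duplicates(toy_vectors)
```

This costs O(n²) comparisons, which is why an index such as Faiss is the practical choice for a dataset of this size.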

Training stage:

Switch the prompt to the more structured prompt introduced above, in order to control the model's output
PEFT + SFT training

Reference Phi2 training code from another participant: https://www.kaggle.com/code/mozhiwenmzw/0-61-llmpr-phi2-sft-model-training?scriptVersionId=169400826

Other

I will later post write-ups of some of the top teams' solutions, as well as summaries of other competitions. Stay tuned!