• 作者:老汪软件技巧
  • 发表时间:2024-08-30 10:02
  • 浏览量:

这个项目是吴恩达博士的一个开源项目,对理解吴恩达博士的agent思路非常有帮助。

项目地址:/andrewyng/t…

项目核心代码都在src\translation_agent的utils.py下。入口方法为translate。

源码解读

def translate(
    source_lang,
    target_lang,
    source_text,
    country,
    max_tokens=MAX_TOKENS_PER_CHUNK,
):
    """Translate the source_text from source_lang to target_lang."""
    # 计算输入文本的token数,这个方法中使用了openai提供的tiktoken库,对输入的文本进行token计算
    num_tokens_in_text = num_tokens_in_string(source_text)
		# ic 使用icecream进行文本打印。关于ic打印相比print打印的区别参考:
    ic(num_tokens_in_text)
		# 判断token是否支持一次翻译。MAX_TOKENS_PER_CHUNK默认参数是1000
    if num_tokens_in_text < max_tokens:
        ic("Translating text as a single chunk")
				# 核心翻译方法
        final_translation = one_chunk_translate_text(
            source_lang, target_lang, source_text, country
        )
        return final_translation
    else:
		    # 大部分场景应该是这个分支
        ic("Translating text as multiple chunks")
				# 计算每个trunk的size
        token_size = calculate_chunk_size(
            token_count=num_tokens_in_text, token_limit=max_tokens
        )
        ic(token_size)
				# 调用langchain的文本分割方法切片
        text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
            model_name="gpt-4",
            chunk_size=token_size,
            chunk_overlap=0,
        )
        source_text_chunks = text_splitter.split_text(source_text)
				# 核心翻译方法
        translation_2_chunks = multichunk_translation(
            source_lang, target_lang, source_text_chunks, country
        )
        return "".join(translation_2_chunks)

多chunks翻译

def multichunk_translation(
    source_lang, target_lang, source_text_chunks, country: str = ""
):
    """
    Improves the translation of multiple text chunks based on the initial translation and reflection.
    Args:
        source_lang (str): The source language of the text chunks.
        target_lang (str): The target language for translation.
        source_text_chunks (List[str]): The list of source text chunks to be translated.
        translation_1_chunks (List[str]): The list of initial translations for each source text chunk.
        reflection_chunks (List[str]): The list of reflections on the initial translations.
        country (str): Country specified for the target language
    Returns:
        List[str]: The list of improved translations for each source text chunk.
    """
		# 扮演语言学家,进行文本翻译。除了翻译之外,不要提供任何解释和文本
    translation_1_chunks = multichunk_initial_translation(
        source_lang, target_lang, source_text_chunks
    )
		# 同样扮演语言学家,进行反思。任务是改进翻译。要求准确、流利、保持风格和术语正确。
    reflection_chunks = multichunk_reflect_on_translation(
        source_lang,
        target_lang,
        source_text_chunks,
        translation_1_chunks,
        country,
    )
		# 同样是扮演语言学家,任务是需要考虑专家建议和建设性的批评。
    translation_2_chunks = multichunk_improve_translation(
        source_lang,
        target_lang,
        source_text_chunks,
        translation_1_chunks,
        reflection_chunks,
    )
    return translation_2_chunks

one_chunk_translate_text原理类似,比这个相对简单。

提示语解读翻译任务

system_message = f"You are an expert linguist, specializing in translation from {source_lang} to {target_lang}."
    translation_prompt = """Your task is to provide a professional translation from {source_lang} to {target_lang} of PART of a text.
The source text is below, delimited by XML tags  and . Translate only the part within the source text
delimited by  and . You can use the rest of the source text as context, but do not translate any
of the other text. Do not output anything other than the translation of the indicated part of the text.

{tagged_text}

To reiterate, you should translate only this part of the text, shown here again between  and :

{chunk_to_translate}

Output only the translation of the portion you are asked to translate, and nothing else.
"""

system_message:扮演语言学家,专门从事语言翻译任务。

translation_prompt :进行专业分段翻译任务。这个比较有意思的是中提供了整个原始要翻译的文本。并且把需要翻译的内容用包裹起来。最后又重申了一下,只翻译TRANSLATE_THIS的内容。

进行反思

system_message = f"You are an expert linguist specializing in translation from {source_lang} to {target_lang}. \
You will be provided with a source text and its translation and your goal is to improve the translation."
reflection_prompt = """Your task is to carefully read a source text and part of a translation of that text from {source_lang} to {target_lang}, and then give constructive criticism and helpful suggestions for improving the translation.
The final style and tone of the translation should match the style of {target_lang} colloquially spoken in {country}.
The source text is below, delimited by XML tags  and , and the part that has been translated
is delimited by  and  within the source text. You can use the rest of the source text
as context for critiquing the translated part.

{tagged_text}

To reiterate, only part of the text is being translated, shown here again between  and :

{chunk_to_translate}

The translation of the indicated part, delimited below by  and , is as follows:

{translation_1_chunk}

When writing suggestions, pay attention to whether there are ways to improve the translation's:\n\
(i) accuracy (by correcting errors of addition, mistranslation, omission, or untranslated text),\n\
(ii) fluency (by applying {target_lang} grammar, spelling and punctuation rules, and ensuring there are no unnecessary repetitions),\n\
(iii) style (by ensuring the translations reflect the style of the source text and take into account any cultural context),\n\
(iv) terminology (by ensuring terminology use is consistent and reflects the source text domain; and by only ensuring you use equivalent idioms {target_lang}).\n\
Write a list of specific, helpful and constructive suggestions for improving the translation.
Each suggestion should address one specific part of the translation.
Output only the suggestions and nothing else."""

_精度吴恩达博士开源的AI智能体项目translation-agent_精度吴恩达博士开源的AI智能体项目translation-agent

system_message:前部分和翻译任务类似,并给出这个任务的目标是改进翻译任务。

reflection_prompt :要求这个任务需要仔细阅读原始输入,并且给出建设性的意见和有帮助的改进建议。并且要求符合特定国家的口语风格。tagged_text部分类似翻译任务的要求,不同的是告诉模型可以使用source text的其它部分作为评判翻译的上下文。最终的数据格式类似于:xxxxx翻译。

TRANSLATE_THIS: 又重申了一下这只是一部分的翻译内容,并且被TRANSLATE_THIS分割。

TRANSLATION: 类似TRANSLATE_THIS

最后要求写一段建议,认真的寻找建议。准确、流利、保持风格和术语正确

改进翻译

system_message = f"You are an expert linguist, specializing in translation editing from {source_lang} to {target_lang}."
    improvement_prompt = """Your task is to carefully read, then improve, a translation from {source_lang} to {target_lang}, taking into
account a set of expert suggestions and constructive criticisms. Below, the source text, initial translation, and expert suggestions are provided.
The source text is below, delimited by XML tags <SOURCE_TEXT> and SOURCE_TEXT>, and the part that has been translated
is delimited by <TRANSLATE_THIS> and TRANSLATE_THIS> within the source text. You can use the rest of the source text
as context, but need to provide a translation only of the part indicated by <TRANSLATE_THIS> and TRANSLATE_THIS>.
<SOURCE_TEXT>
{tagged_text}
SOURCE_TEXT>
To reiterate, only part of the text is being translated, shown here again between <TRANSLATE_THIS> and TRANSLATE_THIS>:
<TRANSLATE_THIS>
{chunk_to_translate}
TRANSLATE_THIS>
The translation of the indicated part, delimited below by <TRANSLATION> and TRANSLATION>, is as follows:
<TRANSLATION>
{translation_1_chunk}
TRANSLATION>
The expert translations of the indicated part, delimited below by <EXPERT_SUGGESTIONS> and EXPERT_SUGGESTIONS>, are as follows:
<EXPERT_SUGGESTIONS>
{reflection_chunk}
EXPERT_SUGGESTIONS>
Taking into account the expert suggestions rewrite the translation to improve it, paying attention
to whether there are ways to improve the translation's
(i) accuracy (by correcting errors of addition, mistranslation, omission, or untranslated text),
(ii) fluency (by applying {target_lang} grammar, spelling and punctuation rules and ensuring there are no unnecessary repetitions), \
(iii) style (by ensuring the translations reflect the style of the source text)
(iv) terminology (inappropriate for context, inconsistent use), or
(v) other errors.
Output only the new translation of the indicated part and nothing else."""

整体和反思比较类似,区别是添加了一个专家意见。只进行指定部分的翻译改进。

总结

博观而约取,厚积而薄发。

吴恩达博士开源的翻译任务项目,可以学习到以下内容:

背后的核心逻辑就是基于LLM作为翻译引擎的心脏,进行翻译、反思以及优化。其中反思是核心。翻译任务执行之后,在把结果扔到LLM中进行反思。再把原始数据、翻译结果、反思结果扔到LLM中进行优化。

提示词的书写。从这个项目中可以看出提示词书写是有规律可循的。

首先system_prompt中指名这个任务是谁,要做什么事。user_prompt中详细描述这个任务要进行的动作,输入的数据形式(必要的时候需要进行重申)。要求。例如这里的要求是需要考虑改进建议重写进行改进。然后进行1234的仔细要求。最后要求只输出指定部分的翻译内容,不输出其它内容。