Self-Improvement of Large Language Models: A Technical Overview and Future Outlook

Enable Large Language Models (LLMs) to improve autonomously by having them evaluate their own outputs. This action pack guides you through setting up an LLM to critique its own responses, identify deficiencies, and suggest improvements, reducing reliance on costly human supervision.

Intermediate · 15 min · 3 steps

The play

1. Define Evaluation Criteria
Establish specific metrics (e.g., accuracy, completeness, relevance, coherence, conciseness, safety) that your LLM will use to judge its own output. These criteria will form the basis of its self-critique.
2. Construct a Self-Critique Prompt
Design a detailed prompt that instructs the LLM to analyze its previous output against the defined criteria, score it, and provide explanations and concrete suggestions for improvement. This prompt acts as the LLM's internal critic.
3. Execute Self-Evaluation
Pass the original user prompt and the LLM's initial response, along with your self-critique prompt, back to the LLM. It will then generate a detailed evaluation of its own performance and suggest modifications.
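
The first two steps can be sketched as a criteria mapping plus a small prompt builder. This is a minimal illustration under stated assumptions, not a definitive implementation: `CRITERIA` and `build_critique_prompt` are hypothetical names, and the wording of each question (including the one for safety) is illustrative only.

```python
# Hypothetical names; the question wording for each criterion is an assumption.
CRITERIA = {
    "Accuracy": "Is the information factually correct?",
    "Completeness": "Does it fully address all aspects of the prompt?",
    "Relevance": "Is all information pertinent to the prompt?",
    "Coherence": "Is it well-structured, logical, and easy to understand?",
    "Conciseness": "Is it free of unnecessary verbosity or repetition?",
    "Safety": "Does it avoid harmful or inappropriate content?",
}


def build_critique_prompt(original_prompt: str, llm_output: str,
                          criteria: dict[str, str] = CRITERIA) -> str:
    """Assemble a self-critique prompt from a mapping of criterion -> question."""
    # Number the criteria so the critique can refer to them by position.
    criteria_lines = "\n".join(
        f"{i}. {name}: {question}"
        for i, (name, question) in enumerate(criteria.items(), start=1)
    )
    return (
        "You are an expert reviewer assessing an LLM output for the given prompt.\n\n"
        f"Original Prompt:\n{original_prompt}\n\n"
        f"LLM Output to Evaluate:\n{llm_output}\n\n"
        f"Critique the output based on these criteria:\n{criteria_lines}\n\n"
        "Provide a score (1-10) for each criterion, then an overall score, "
        "and finally suggest concrete, actionable improvements.\n\n"
        "Critique:"
    )
```

Keeping the criteria in one mapping means adding or removing a metric changes the prompt without editing the template text.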

Starter code

import openai  # Needed only to construct a client; any client with an OpenAI-style chat interface can be passed in

def self_evaluate(llm_client, original_prompt: str, llm_output: str) -> str:
    """
    Instructs an LLM to critically evaluate its own output.
    """
    evaluation_prompt = f"""
    You are an expert AI assistant tasked with critically assessing the following LLM output for the given prompt.

    Original Prompt:
    {original_prompt}

    LLM Output to Evaluate:
    {llm_output}

    Critique the output based on these criteria:
    1.  **Accuracy**: Is the information factually correct?
    2.  **Completeness**: Does it fully address all aspects of the prompt?
    3.  **Relevance**: Is all information pertinent to the prompt? No tangents.
    4.  **Coherence**: Is it well-structured, logical, and easy to understand?
    5.  **Conciseness**: Is it free of unnecessary verbosity or repetition?

    Provide a score (1-10) for each criterion, then give an overall score, and finally, suggest concrete improvements to make the output better. Focus on actionable advice.

    Critique:
    """
    # With no client, return the constructed prompt so it can be inspected.
    if llm_client is None:
        return evaluation_prompt
    # Assumes an OpenAI-style chat client; adapt this call for other providers.
    response = llm_client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content": evaluation_prompt}],
    )
    return response.choices[0].message.content

# Example usage: with no client, self_evaluate returns the constructed prompt.
if __name__ == "__main__":
    original_question = "What is the capital of France?"
    llm_output_to_critique = "The capital of France is Berlin."
    critique = self_evaluate(None, original_question, llm_output_to_critique)
    print(critique)
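
Once a critique comes back, the per-criterion scores may need to be pulled out of the text for logging or thresholding. The following is a minimal sketch, assuming the model writes each score close to the criterion name, for example `Accuracy: 7/10` or `**Accuracy**: 7`. The model is not guaranteed to follow that format, so criteria that cannot be matched are simply left out. `parse_scores` is a hypothetical helper, not part of any library.

```python
import re


def parse_scores(critique: str, criteria: list[str]) -> dict[str, int]:
    """Extract 1-10 integer scores that appear right after a criterion name."""
    scores = {}
    for name in criteria:
        # Criterion name, up to six non-word characters (colon, spaces,
        # markdown asterisks), then an integer from 1 to 10.
        pattern = rf"{re.escape(name)}\W{{0,6}}(10|[1-9])\b"
        match = re.search(pattern, critique, flags=re.IGNORECASE)
        if match:
            scores[name] = int(match.group(1))
    return scores
```

A caller could compare the returned values against a threshold to decide whether a response needs revision; treating a missing criterion as "unknown" rather than as a low score avoids penalizing formatting drift.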