This project evaluates the agentic quality of text samples using a multi-agent reasoning pipeline. It leverages Azure OpenAI services and a set of predefined prompts to assess content quality on a scale of 1 to 10, providing a score and constructive feedback for each sample.
- Multi-Agent Evaluation: Includes discussion, criticism, and ranking agents to evaluate text samples (sketched after this list).
- Customizable Prompts: Prompts for agents can be tailored to specific evaluation needs.
- Scoring and Feedback: Provides a score (1–10) and detailed feedback for each sample.
- Azure Integration: Utilizes Azure OpenAI services for model inference.
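Conceptually, each agent applies its own system prompt to the sample, and later agents see the earlier agents' output. Below is a minimal sketch of such a reviewer → critic → ranker chain using the `openai` SDK's `AzureOpenAI` client; the prompt texts, the wiring, and the `api_version` are illustrative assumptions, not this repo's actual implementation:

```python
# Illustrative sketch of a reviewer -> critic -> ranker chain.
# The real pipeline in this repo (main.py, AgentEvalPrompts) may wire the
# agents differently; the prompts and api_version below are placeholders.
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_ENDPOINT"],
    api_key=os.environ["API_TOKEN"],
    api_version="2024-02-01",  # assumption: use the version your deployment supports
)
DEPLOYMENT = os.environ["AZURE_DEPLOYMENT"]


def ask(system_prompt: str, user_content: str) -> str:
    """Run one agent turn: a system prompt plus the accumulated context."""
    response = client.chat.completions.create(
        model=DEPLOYMENT,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
    )
    return response.choices[0].message.content


sample = "Text sample to evaluate..."
review = ask("You review the sample and discuss its quality.", sample)
critique = ask("You criticize the review for gaps or bias.",
               f"{sample}\n\nReview:\n{review}")
ranking = ask("You assign a final 1-10 score with a short justification.",
              f"{sample}\n\nReview:\n{review}\n\nCritique:\n{critique}")
print(ranking)
```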
- Python: Ensure Python 3.8+ is installed.
- Dependencies: Install the required Python packages:

  ```
  pip install -r requirements.txt
  ```
- Azure Credentials: Set up Azure credentials and the following environment variables: `AZURE_DEPLOYMENT`, `MODEL_NAME`, `AZURE_ENDPOINT`, `API_TOKEN`.
- Input File: Prepare a `.jsonl` file containing JSON objects (one per line) that include all the information needed by the evaluation model (see the input sketch after this list).
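The exact fields the evaluation model needs depend on your prompts. The snippet below only sketches how such a `.jsonl` file could be built; the field names (`sample_id`, `text`) are hypothetical placeholders, not fields the repo requires.

```python
# Sketch of writing a .jsonl input file, one JSON object per line.
# The field names (sample_id, text) are hypothetical placeholders;
# use whatever fields your prompts and evaluation model actually expect.
import json

samples = [
    {"sample_id": 1, "text": "First text sample to be judged."},
    {"sample_id": 2, "text": "Second text sample to be judged."},
]

with open("samples.jsonl", "w", encoding="utf-8") as handle:
    for sample in samples:
        handle.write(json.dumps(sample, ensure_ascii=False) + "\n")
```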
- Clone the Repository:

  ```
  git clone https://github.com/microsoft-mousa/agentAsAJudge
  cd agentAsAJudge
  ```
- Set Up Environment Variables: Create a `.env` file in the project root and add the following (a loading sketch follows these setup steps):

  ```
  AZURE_DEPLOYMENT=<your-deployment-name>
  MODEL_NAME=<your-model-name>
  AZURE_ENDPOINT=<your-endpoint-url>
  API_TOKEN=<your-api-token>
  ```
- Change System Prompts: To customize the system prompts:
  - Create a new directory under `metrics` and add your prompt files (e.g., `.md` files).
  - Update the initialization of the `AgentEvalPrompts` object in `main.py` with the paths to your new prompt files:

    ```
    agent_eval_prompts = AgentEvalPrompts(
        reviewer_prompt="<path-to-your-reviewer-prompt>",
        critic_prompt="<path-to-your-critic-prompt>",
        ranker_prompt="<path-to-your-ranker-prompt>"
    )
    ```
- Run Evaluation: Execute the script:

  ```
  python main.py <path-to-jsonl-file>
  ```
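If `main.py` does not already load the `.env` file for you, the variables can be loaded before running the pipeline. A minimal sketch, assuming `python-dotenv` is available (whether the repo already depends on it is an assumption):

```python
# Minimal sketch: read the .env values defined in the setup step above.
# Assumption: python-dotenv is installed (pip install python-dotenv);
# the repo itself may already handle this inside main.py.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

azure_deployment = os.environ["AZURE_DEPLOYMENT"]
model_name = os.environ["MODEL_NAME"]
azure_endpoint = os.environ["AZURE_ENDPOINT"]
api_token = os.environ["API_TOKEN"]

print(f"Using deployment {azure_deployment} at {azure_endpoint}")
```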
- Validation: The `.jsonl` file is checked before evaluation (a standalone sketch of this check appears after this list):
  - If the `.jsonl` file is valid:

    ```
    ✅ All lines are valid JSON objects!
    ```
  - If there are issues:

    ```
    ❌ Found issues in the file:
    - Line X: <error-description>
    ```
- Evaluation: For each sample, the output includes:
  - Score: A numeric value (1–10).
  - Feedback: Detailed reasoning for the score.

  Example:

  ```
  🔍 Evaluating Sample 1...
  📊 Score: 4
  🗣 Review: The content is well-structured and informative.
  ```
- Errors: If evaluation fails for a sample:

  ```
  ❌ Error evaluating sample X: <error-description>
  ```
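The validation step can also be reproduced by hand. A minimal sketch, assuming only that every line must parse as a JSON object; it mirrors the messages shown above but is not the repo's own implementation:

```python
# Standalone sketch of the pre-run check described above:
# every non-empty line of the .jsonl file must parse as a JSON object.
import json
import sys


def validate_jsonl(path: str) -> bool:
    issues = []
    with open(path, "r", encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # allow blank lines; adjust if the repo is stricter
            try:
                record = json.loads(line)
                if not isinstance(record, dict):
                    issues.append(f"Line {line_no}: not a JSON object")
            except json.JSONDecodeError as exc:
                issues.append(f"Line {line_no}: {exc}")
    if issues:
        print("❌ Found issues in the file:")
        for issue in issues:
            print(f"- {issue}")
        return False
    print("✅ All lines are valid JSON objects!")
    return True


if __name__ == "__main__":
    validate_jsonl(sys.argv[1])
```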
Feel free to contribute by improving prompts, adding new metrics, or enhancing the evaluation pipeline.