@@ -0,0 +1,306 @@
# Create Analyzer with Labeled Training Data

This sample demonstrates how to create a custom analyzer using labeled training data to improve extraction accuracy. Labeled data consists of document samples that have been manually annotated with expected field values, which helps the analyzer learn from examples and better recognize patterns in similar documents.

## Prerequisites

Before running this sample, ensure you have:

1. **Azure Content Understanding Resource**: Set up an Azure AI Content Understanding resource
2. **Model Deployments**: Deploy required models (GPT-4.1 and text-embedding-3-large)
3. **Azure Blob Storage**: An Azure Storage Account with a blob container to store training data
4. **Training Data**: Documents with corresponding `.labels.json` and `.result.json` files

### Setting up Training Data Storage

You have two options for configuring access to your Azure Blob Storage:

#### Option 1: Using Storage Account and Container Name (Recommended)

Set the following environment variables, and the SDK will automatically generate a SAS token using Azure Identity:

```bash
# Storage account and container (SAS token will be auto-generated)
export TRAINING_DATA_STORAGE_ACCOUNT="mmigithubsamplesstorage"
export TRAINING_DATA_CONTAINER_NAME="mmi-github-samples-blob-container"
export TRAINING_DATA_PATH="document_training/" # Optional, defaults to "training_data/"
```

This approach requires you to be signed in to Azure (e.g., via `az login`) so that `DefaultAzureCredential` can obtain a token, and your identity must have appropriate permissions on the storage account.
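
Conceptually, generating a SAS from an Azure AD identity uses a *user delegation key*. The following is a minimal sketch of that flow, assuming `Azure.Storage.Blobs` and `Azure.Identity` are referenced and your signed-in identity can request delegation keys; it illustrates the idea and is not the sample's actual implementation:

```C#
using System;
using System.Threading.Tasks;
using Azure.Identity;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;
using Azure.Storage.Sas;

// Hypothetical helper: builds a container-scoped SAS URL from an Azure AD identity.
static async Task<Uri> GetTrainingDataSasUrlAsync(string accountName, string containerName)
{
    var serviceClient = new BlobServiceClient(
        new Uri($"https://{accountName}.blob.core.windows.net"),
        new DefaultAzureCredential());

    // A user delegation key stands in for the account key when signing the SAS
    UserDelegationKey key = await serviceClient.GetUserDelegationKeyAsync(
        startsOn: DateTimeOffset.UtcNow,
        expiresOn: DateTimeOffset.UtcNow.AddHours(2));

    var sasBuilder = new BlobSasBuilder
    {
        BlobContainerName = containerName,
        Resource = "c", // "c" = container-scoped SAS
        ExpiresOn = DateTimeOffset.UtcNow.AddHours(2)
    };

    // racwdl: the full permission set described in Option 2 below
    sasBuilder.SetPermissions(
        BlobContainerSasPermissions.Read | BlobContainerSasPermissions.Add |
        BlobContainerSasPermissions.Create | BlobContainerSasPermissions.Write |
        BlobContainerSasPermissions.Delete | BlobContainerSasPermissions.List);

    string sasToken = sasBuilder.ToSasQueryParameters(key, accountName).ToString();
    return new Uri($"https://{accountName}.blob.core.windows.net/{containerName}?{sasToken}");
}
```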

#### Option 2: Using Pre-generated SAS URL

Alternatively, you can provide a pre-generated SAS URL:

```bash
# Pre-generated SAS URL
export TRAINING_DATA_SAS_URL="https://mystorageaccount.blob.core.windows.net/mycontainer?sv=2023-01-03&ss=b&srt=co&sp=racwdl&se=..."
export TRAINING_DATA_PATH="document_training/" # Optional
```

**Important**: The SAS token must include the following permissions:
- **Read** (r): To read existing blobs
- **Add** (a): To add new blobs (append operations)
- **Create** (c): **Required** to create new blobs when uploading training files
- **Write** (w): To write to blobs
- **Delete** (d): To delete blobs if needed
- **List** (l): To enumerate blobs in the container

The full permission string should be: `sp=racwdl`
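
As a quick guard against missing permissions, a sketch like the one below can validate the `sp` parameter of a pre-generated SAS URL before any uploads are attempted. The environment variable name matches the one above; the parsing itself is illustrative and not part of the sample:

```C#
using System;
using System.Linq;

string sasUrl = Environment.GetEnvironmentVariable("TRAINING_DATA_SAS_URL")
    ?? throw new InvalidOperationException("TRAINING_DATA_SAS_URL is not set");

// Pull the sp= parameter out of the SAS query string
string? sp = new Uri(sasUrl).Query.TrimStart('?').Split('&')
    .Select(pair => pair.Split('=', 2))
    .Where(kv => kv[0] == "sp")
    .Select(kv => Uri.UnescapeDataString(kv[1]))
    .FirstOrDefault();

// Every permission in racwdl must be present for the upload step to work
foreach (char required in "racwdl")
{
    if (sp is null || !sp.Contains(required))
        throw new InvalidOperationException($"SAS token is missing the '{required}' permission (sp={sp}).");
}
```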

### Training Data Structure

Each training document requires three files:

1. **Original Document**: The source file (PDF, image, etc.)
   - Example: `receipt.jpg`

2. **Labels File**: Contains the manually annotated field values
   - Example: `receipt.jpg.labels.json`
   - Structure:

     ```json
     {
       "$schema": "https://schema.ai.azure.com/mmi/2024-12-01-preview/labels.json",
       "fieldLabels": {
         "MerchantName": {
           "type": "string",
           "valueString": "Contoso"
         },
         "TotalPrice": {
           "type": "string",
           "valueString": "$14.50"
         }
       }
     }
     ```

3. **OCR Results File**: Contains the OCR analysis results from `prebuilt-documentSearch`
   - Example: `receipt.jpg.result.json`

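Before uploading, it can help to verify that every document has both companion files. A minimal pre-flight check, assuming the same local `TestData/document_training` folder used in the snippet below:

```C#
using System;
using System.IO;
using System.Linq;

string folder = Path.Combine("TestData", "document_training");
var documents = Directory.GetFiles(folder)
    .Where(f => !f.EndsWith(".labels.json") && !f.EndsWith(".result.json"));

foreach (string doc in documents)
{
    // Each document must be accompanied by <name>.labels.json and <name>.result.json
    if (!File.Exists(doc + ".labels.json"))
        Console.WriteLine($"Missing labels file for {Path.GetFileName(doc)}");
    if (!File.Exists(doc + ".result.json"))
        Console.WriteLine($"Missing OCR results file for {Path.GetFileName(doc)}");
}
```
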
## Usage

```C# Snippet:ContentUnderstandingCreateAnalyzerWithLabels
// Generate a unique analyzer ID
string analyzerId = $"receipt_analyzer_{DateTimeOffset.UtcNow.ToUnixTimeSeconds()}";

// Step 1: Upload training data to Azure Blob Storage
// Get training data configuration from environment
string trainingDataSasUrl = Environment.GetEnvironmentVariable("TRAINING_DATA_SAS_URL")
    ?? throw new InvalidOperationException("TRAINING_DATA_SAS_URL environment variable is required");
string trainingDataPath = Environment.GetEnvironmentVariable("TRAINING_DATA_PATH") ?? "training_data/";

// Ensure the path ends with /
if (!string.IsNullOrEmpty(trainingDataPath) && !trainingDataPath.EndsWith("/"))
{
    trainingDataPath += "/";
}

// Upload training documents with their labels and OCR results
string trainingDocsFolder = Path.Combine(
    Path.GetDirectoryName(System.Reflection.Assembly.GetExecutingAssembly().Location) ?? string.Empty,
    "TestData",
    "document_training");

if (Directory.Exists(trainingDocsFolder))
{
    var containerClient = new BlobContainerClient(new Uri(trainingDataSasUrl));
    var files = Directory.GetFiles(trainingDocsFolder);

    foreach (var file in files)
    {
        string fileName = Path.GetFileName(file);

        // For each source document (skipping the label/result files themselves),
        // upload the document plus its labels.json and result.json companions
        if (!fileName.EndsWith(".labels.json") && !fileName.EndsWith(".result.json"))
        {
            // Upload the main document
            string blobPath = trainingDataPath + fileName;
            var blobClient = containerClient.GetBlobClient(blobPath);

            using (var fileStream = File.OpenRead(file))
            {
                await blobClient.UploadAsync(fileStream, overwrite: true);
            }

            // Upload the associated labels.json
            string labelsFile = file + ".labels.json";
            if (File.Exists(labelsFile))
            {
                string labelsBlobPath = trainingDataPath + fileName + ".labels.json";
                var labelsBlobClient = containerClient.GetBlobClient(labelsBlobPath);
                using (var labelsStream = File.OpenRead(labelsFile))
                {
                    await labelsBlobClient.UploadAsync(labelsStream, overwrite: true);
                }
            }

            // Upload the associated result.json
            string resultFile = file + ".result.json";
            if (File.Exists(resultFile))
            {
                string resultBlobPath = trainingDataPath + fileName + ".result.json";
                var resultBlobClient = containerClient.GetBlobClient(resultBlobPath);
                using (var resultStream = File.OpenRead(resultFile))
                {
                    await resultBlobClient.UploadAsync(resultStream, overwrite: true);
                }
            }
        }
    }
    Console.WriteLine("Training data uploaded to blob storage successfully.");
}

// Step 2: Define the field schema for receipt extraction
// Create the Items array item definition (an object with properties)
var itemDefinition = new ContentFieldDefinition
{
    Type = ContentFieldType.Object,
    Method = GenerationMethod.Extract,
    Description = "Individual item details"
};
itemDefinition.Properties.Add("Quantity", new ContentFieldDefinition
{
    Type = ContentFieldType.String,
    Method = GenerationMethod.Extract,
    Description = "Quantity of the item"
});
itemDefinition.Properties.Add("Name", new ContentFieldDefinition
{
    Type = ContentFieldType.String,
    Method = GenerationMethod.Extract,
    Description = "Name of the item"
});
itemDefinition.Properties.Add("Price", new ContentFieldDefinition
{
    Type = ContentFieldType.String,
    Method = GenerationMethod.Extract,
    Description = "Price of the item"
});

var fieldSchema = new ContentFieldSchema(
    new Dictionary<string, ContentFieldDefinition>
    {
        ["MerchantName"] = new ContentFieldDefinition
        {
            Type = ContentFieldType.String,
            Method = GenerationMethod.Extract,
            Description = "Name of the merchant"
        },
        ["Items"] = new ContentFieldDefinition
        {
            Type = ContentFieldType.Array,
            Method = GenerationMethod.Generate,
            Description = "List of items purchased",
            ItemDefinition = itemDefinition
        },
        ["TotalPrice"] = new ContentFieldDefinition
        {
            Type = ContentFieldType.String,
            Method = GenerationMethod.Extract,
            Description = "Total price on the receipt"
        }
    })
{
    Name = "receipt_schema",
    Description = "Schema for receipt extraction with labeled training data"
};

// Step 3: Configure the knowledge source with labeled data
var knowledgeSource = new LabeledDataKnowledgeSource(new Uri(trainingDataSasUrl), string.Empty)
{
    Prefix = trainingDataPath
};

// Step 4: Create the analyzer configuration
var config = new ContentAnalyzerConfig
{
    EnableFormula = false,
    EnableLayout = true,
    EnableOcr = true,
    EstimateFieldSourceAndConfidence = true,
    ReturnDetails = true
};

// Step 5: Create the custom analyzer with the knowledge source
var customAnalyzer = new ContentAnalyzer
{
    BaseAnalyzerId = "prebuilt-document",
    Description = "Receipt analyzer trained with labeled data",
    Config = config,
    FieldSchema = fieldSchema
};

// Add the knowledge source
customAnalyzer.KnowledgeSources.Add(knowledgeSource);

// Add model mappings (required when using knowledge sources)
customAnalyzer.Models.Add("completion", "gpt-4.1");
customAnalyzer.Models.Add("embedding", "text-embedding-3-large");

// Create the analyzer
var operation = await client.CreateAnalyzerAsync(
    WaitUntil.Completed,
    analyzerId,
    customAnalyzer,
    allowReplace: true);

ContentAnalyzer result = operation.Value;
Console.WriteLine($"Analyzer '{analyzerId}' created successfully with labeled training data!");
```

## Cleanup

After testing, you can delete the analyzer:

```C# Snippet:ContentUnderstandingDeleteAnalyzerWithLabels
// Clean up: delete the analyzer (for testing purposes only)
// In production, trained analyzers are typically kept and reused
await client.DeleteAnalyzerAsync(analyzerId);
Console.WriteLine($"Analyzer '{analyzerId}' deleted successfully.");
```

## Key Concepts

### What is Labeled Data?

Labeled data consists of document samples that have been manually annotated with expected field values. This training data helps the analyzer:

- **Learn from Examples**: Understand how to extract specific fields from similar documents
- **Improve Accuracy**: Better recognize patterns and variations in document formats
- **Handle Edge Cases**: Learn to handle unusual or complex document layouts

### Knowledge Sources

Knowledge sources provide additional context to improve extraction accuracy. When using labeled data:

- **Kind**: Set to `KnowledgeSourceKind.LabeledData`
- **ContainerUrl**: Azure Blob Storage SAS URL containing training files
- **Prefix**: Folder path within the container where training data is stored
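
Putting those together, and using the single-argument convenience constructor added in this PR (which defaults `FileListPath` to an empty string), construction looks like the sketch below. `trainingDataSasUrl` is the same value used earlier, and the kind is presumably fixed to `LabeledData` by the derived type:

```C#
// Kind does not need to be set explicitly: a LabeledDataKnowledgeSource
// represents KnowledgeSourceKind.LabeledData.
var knowledgeSource = new LabeledDataKnowledgeSource(new Uri(trainingDataSasUrl))
{
    Prefix = "document_training/" // folder within the container holding the training triplets
};
```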

### When to Use Training Data

Use labeled training data when:

- You have domain-specific documents that prebuilt analyzers don't handle well
- You need higher accuracy for specific field extractions
- You have a collection of labeled documents ready for training
- You want to improve extraction for custom document types

### Model Requirements

When using knowledge sources with labeled data:

- **Completion Model**: Required for generating and extracting field values (e.g., "gpt-4.1")
- **Embedding Model**: Required for semantic search and matching (e.g., "text-embedding-3-large")
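
The dictionary keys (`completion`, `embedding`) name the model roles, while the values are your deployment names. A short sketch that reads the deployment names from environment variables rather than hard-coding them; `COMPLETION_DEPLOYMENT` and `EMBEDDING_DEPLOYMENT` are illustrative variable names, not ones the sample defines:

```C#
// Fallbacks match the deployments used in the usage snippet above
string completion = Environment.GetEnvironmentVariable("COMPLETION_DEPLOYMENT") ?? "gpt-4.1";
string embedding = Environment.GetEnvironmentVariable("EMBEDDING_DEPLOYMENT") ?? "text-embedding-3-large";

customAnalyzer.Models.Add("completion", completion);
customAnalyzer.Models.Add("embedding", embedding);
```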

## Related Samples

- [Sample04_CreateAnalyzer](Sample04_CreateAnalyzer.md) - Create a basic custom analyzer
- [Sample06_GetAnalyzer](Sample06_GetAnalyzer.md) - Retrieve analyzer details
- [Sample07_ListAnalyzers](Sample07_ListAnalyzers.md) - List all analyzers

## Additional Resources

- [Azure AI Content Understanding Documentation](https://learn.microsoft.com/azure/ai-services/content-understanding/)
- [Analyzer Training Concepts](https://learn.microsoft.com/azure/ai-services/content-understanding/concepts/analyzer-training)
- [How to Label Training Data](https://learn.microsoft.com/azure/ai-services/content-understanding/how-to/training-data)
@@ -0,0 +1,18 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

using System;

namespace Azure.AI.ContentUnderstanding
{
    /// <summary> Labeled data knowledge source. </summary>
    public partial class LabeledDataKnowledgeSource
    {
        /// <summary> Initializes a new instance of <see cref="LabeledDataKnowledgeSource"/> with a container URL. FileListPath is set to an empty string to avoid null validation. </summary>
        /// <param name="containerUrl"> The URL of the blob container containing labeled data. </param>
        /// <exception cref="ArgumentNullException"> <paramref name="containerUrl"/> is null. </exception>
        public LabeledDataKnowledgeSource(Uri containerUrl) : this(containerUrl, string.Empty)
        {
        }
    }
}
@@ -9,6 +9,7 @@
<PackageReference Include="NUnit3TestAdapter" />
<PackageReference Include="Microsoft.NET.Test.Sdk" />
<PackageReference Include="Moq" />
<PackageReference Include="Azure.Storage.Blobs" />
</ItemGroup>

<ItemGroup>
@@ -86,6 +86,42 @@ public class ContentUnderstandingClientTestEnvironment : TestEnvironment
    /// </summary>
    public string? TargetKey => GetRecordedOptionalVariable("TARGET_KEY", options => options.IsSecret());

    /// <summary>
    /// Gets the training data SAS URL for analyzer training with labeled data (optional).
    /// </summary>
    /// <remarks>
    /// This is the Azure Blob Storage SAS URL that contains training documents with their
    /// corresponding .labels.json and .result.json files.
    /// </remarks>
    public string? TrainingDataSasUrl => GetRecordedOptionalVariable("TRAINING_DATA_SAS_URL", options => options.IsSecret("https://sanitizedstorage.blob.core.windows.net/sanitizedcontainer?sanitizedsas"));

    /// <summary>
    /// Gets the training data path prefix within the blob container (optional).
    /// </summary>
    /// <remarks>
    /// This is the folder path within the container where training data files are stored.
    /// Example: "training_data/" or "labeling-data/"
    /// </remarks>
    public string? TrainingDataPath => GetRecordedOptionalVariable("TRAINING_DATA_PATH") ?? "training_data/";

    /// <summary>
    /// Gets the storage account name for training data (optional).
    /// </summary>
    /// <remarks>
    /// Used when TRAINING_DATA_SAS_URL is not provided. The system will generate a SAS URL
    /// using this storage account name and the container name.
    /// </remarks>
    public string? TrainingDataStorageAccount => GetRecordedOptionalVariable("TRAINING_DATA_STORAGE_ACCOUNT", options => options.IsSecret("sanitizedstorage"));

    /// <summary>
    /// Gets the container name for training data (optional).
    /// </summary>
    /// <remarks>
    /// Used when TRAINING_DATA_SAS_URL is not provided. The system will generate a SAS URL
    /// using this container name and the storage account name.
    /// </remarks>
    public string? TrainingDataContainerName => GetRecordedOptionalVariable("TRAINING_DATA_CONTAINER_NAME") ?? "training-data";

    /// <summary>
    /// Creates a file path for a test asset file.
    /// </summary>
@@ -0,0 +1 @@
{"$schema":"https://schema.ai.azure.com/mmi/2024-12-01-preview/labels.json","fileId":"","fieldLabels":{"MerchantName":{"type":"string","valueString":"Contoso","source":"D(1,811,150,945,136,946,163,812,178)","kind":"corrected"},"TotalPrice":{"type":"string","valueString":"$14,50","spans":[{"offset":342,"length":6}],"confidence":0.581,"source":"D(1,928,767,1027,771,1027,805,927,801)","kind":"confirmed"},"Items":{"type":"array","kind":"confirmed","valueArray":[{"type":"object","kind":"confirmed","valueObject":{"Quantity":{"type":"string","valueString":"1","spans":[{"offset":127,"length":1}],"confidence":0.966,"source":"D(1,697,462,703,462,702,489,696,489)","kind":"confirmed"},"Name":{"type":"string","valueString":"Cappuccino","spans":[{"offset":129,"length":10}],"confidence":0.956,"source":"D(1,720,463,819,467,819,494,719,490)","kind":"confirmed"},"Price":{"type":"string","valueString":"$2.20","spans":[{"offset":140,"length":5}],"confidence":0.289,"source":"D(1,949,460,997,459,997,485,949,487)","kind":"confirmed"}}},{"type":"object","kind":"confirmed","valueObject":{"Quantity":{"type":"string","valueString":"1","spans":[{"offset":147,"length":1}],"confidence":0.984,"source":"D(1,695,536,700,536,700,561,695,561)","kind":"confirmed"},"Name":{"type":"string","valueString":"BACON & EGGS","spans":[{"offset":149,"length":5},{"offset":155,"length":1},{"offset":157,"length":4}],"confidence":0.961,"source":"D(1,717,536,773,537,772,562,717,561);D(1,779,537,791,537,791,562,779,562);D(1,798,537,842,537,842,562,798,562)","kind":"confirmed"},"Price":{"type":"string","valueString":"$9.5","spans":[{"offset":176,"length":4}],"confidence":0.287,"source":"D(1,957,569,996,569,996,597,957,598)","kind":"confirmed"}}}]}},"metadata":{"displayName":"receipt2.png","createdDateTime":"2024-12-13T02:00:04.775Z","mimeType":"image/png"}}