Skip to content

Conversation

@shangxinli
Copy link
Contributor

Replace GenericDatum intermediate layer with direct Avro decoder access to improve manifest I/O performance.

Changes:

  • Add avro_direct_decoder_internal.h with DecodeAvroToBuilder API
  • Add avro_direct_decoder.cc implementing direct Avro→Arrow decoding
    • Primitive types: bool, int, long, float, double, string, binary, fixed
    • Temporal types: date, time, timestamp
    • Logical types: uuid, decimal
    • Nested types: struct, list, map
    • Union type handling for nullable fields
  • Modify avro_reader.cc to use DataFileReaderBase with direct decoder
    • Replace DataFileReader with DataFileReaderBase
    • Use decoder.decodeInt(), decodeLong(), etc. directly
    • Remove GenericDatum allocation and extraction overhead
  • Update CMakeLists.txt to include new decoder source

Performance improvement:

  • Before: Avro binary → GenericDatum → Extract → Arrow (3 steps)
  • After: Avro binary → decoder.decodeInt() → Arrow (2 steps)

This matches Java implementation which uses Decoder directly via ValueReader interface, avoiding intermediate object allocation.

All avro_test cases pass.

Issue: #332

@shangxinli shangxinli force-pushed the avro_reader branch 6 times, most recently from eae14b1 to e7b55b2 Compare December 1, 2025 00:52
Replace GenericDatum intermediate layer with direct Avro decoder access
to improve manifest I/O performance.

Changes:
- Add avro_direct_decoder_internal.h with DecodeAvroToBuilder API
- Add avro_direct_decoder.cc implementing direct Avro→Arrow decoding
  - Primitive types: bool, int, long, float, double, string, binary, fixed
  - Temporal types: date, time, timestamp
  - Logical types: uuid, decimal (with validation)
  - Nested types: struct, list, map
  - Union type handling with bounds checking
  - Field skipping with proper multi-block handling for arrays/maps
- Modify avro_reader.cc to use DataFileReaderBase with direct decoder
  - Replace DataFileReader<GenericDatum> with DataFileReaderBase
  - Use decoder.decodeInt(), decodeLong(), etc. directly
  - Remove GenericDatum allocation and extraction overhead
- Update CMakeLists.txt to include new decoder source

Validation added:
- Union branch bounds checking
- Decimal byte width validation (uses schema fixedSize, not calculated)
- Decimal precision sufficiency validation
- Logical type presence validation
- Type mismatch error handling

Documentation:
- Comprehensive API documentation in header
- Schema evolution handling via SchemaProjection explained
- Error handling behavior documented
- Limitations noted (default values not supported)

Performance improvement:
- Before: Avro binary → GenericDatum → Extract → Arrow (3 steps)
- After: Avro binary → decoder.decodeInt() → Arrow (2 steps)

This matches Java implementation which uses Decoder directly via
ValueReader interface, avoiding intermediate object allocation.

All 173 avro_test cases pass.

Issue: apache#332
ICEBERG_RETURN_UNEXPECTED(
AppendDatumToBuilder(reader_->readerSchema().root(), *context_->datum_,
projection_, *read_schema_, context_->builder_.get()));
DecodeAvroToBuilder(reader_->readerSchema().root(), reader_->decoder(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to use a feature flag to enable users to use the old reader just in case there is any bug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion!

@shangxinli shangxinli force-pushed the avro_reader branch 2 times, most recently from 2334ea3 to d16b0f1 Compare December 1, 2025 03:17
@wgtmac
Copy link
Member

wgtmac commented Dec 2, 2025

Could you help add some test cases? The original avro reader is simply a wrapper on top of avro-cpp so existing test cases are very minimal. We need better test coverage to be confident. Perhaps we can use the avro writer to create in-memory avro files (using MakeMockFileIO and then use the new reader to read them. It is also nice to add some benchmarks to prove that this indeed improves the reading performance compared to the original implementation.

@shangxinli shangxinli force-pushed the avro_reader branch 2 times, most recently from 043bb08 to eae860e Compare December 3, 2025 15:22
Add extensive test coverage to validate the direct decoder implementation:
- All primitive types (boolean, int, long, float, double, string, binary)
- Temporal types (date, time, timestamp)
- Complex nested structures (nested structs, lists, maps)
- Null handling and optional fields
- Large datasets (1000+ rows)
- Direct decoder vs GenericDatum comparison tests

Add benchmark tool to measure performance improvements:
- Benchmarks with various data patterns (primitives, nested, lists, nulls)
- Compares direct decoder vs GenericDatum performance
- Expected speedup: 1.5x - 2.5x due to eliminated intermediate copies

Add feature flag for direct Avro decoder:
- ReaderProperties::kAvroUseDirectDecoder (default: true)
- Allows fallback to GenericDatum implementation if issues arise
- Dual-path implementation with helper functions to reduce code duplication

Test results:
- 16 comprehensive Avro reader tests (vs 5 before)
- 180 total tests in avro_test suite
- 100% passing rate

This addresses review feedback from wgtmac to provide better test coverage
and prove performance improvements of the direct decoder implementation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants