BlobReadChannel discards its buffer on any seek, and therefore so does CloudStorageReadChannel, which wraps it.
This causes unexpected/pathological behavior, e.g. in hadoop-bam/spark-bam where the read pattern is roughly "read 100 bytes, rewind 99 bytes, repeat 1000x"; basically the same 2MB is fetched over the network at every iteration.
spark-bam uses a CachingChannel abstraction that LRU-caches blocks of the underlying channel, which fixes this.
Just wanted to call out this "gotcha"; might be worth fixing here.
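
For illustration, here is a minimal sketch of the block-caching idea described above. It is not spark-bam's actual CachingChannel, and not a proposed patch for this library. The class name BlockCachingChannel, its constructor parameters, and the fetches counter are all hypothetical. It assumes a read-only, single-threaded caller and an underlying java.nio SeekableByteChannel.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical read-only wrapper that LRU-caches fixed-size blocks of a channel. */
class BlockCachingChannel {
    private final SeekableByteChannel underlying;
    private final int blockSize;
    private final Map<Long, byte[]> cache;
    private long position = 0;
    long fetches = 0; // number of blocks actually read from the underlying channel

    BlockCachingChannel(SeekableByteChannel underlying, int blockSize, final int maxBlocks) {
        this.underlying = underlying;
        this.blockSize = blockSize;
        // An access-ordered LinkedHashMap evicts the least recently used block.
        this.cache = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > maxBlocks;
            }
        };
    }

    /** Seeking only moves the logical position; cached blocks are kept. */
    void position(long newPosition) {
        position = newPosition;
    }

    long position() {
        return position;
    }

    /** Fills dst from cached blocks, fetching a block only on a cache miss. */
    int read(ByteBuffer dst) throws IOException {
        int total = 0;
        while (dst.hasRemaining()) {
            long index = position / blockSize;
            byte[] block = cache.get(index);
            if (block == null) {
                block = fetch(index);
                cache.put(index, block);
            }
            int offset = (int) (position % blockSize);
            if (offset >= block.length) {
                break; // past the end of the underlying data
            }
            int n = Math.min(dst.remaining(), block.length - offset);
            dst.put(block, offset, n);
            position += n;
            total += n;
        }
        return (total == 0 && dst.hasRemaining()) ? -1 : total;
    }

    /** Reads one whole block (shorter at end of data) from the underlying channel. */
    private byte[] fetch(long index) throws IOException {
        fetches++;
        ByteBuffer buf = ByteBuffer.allocate(blockSize);
        underlying.position(index * blockSize);
        while (buf.hasRemaining() && underlying.read(buf) >= 0) {
            // keep reading until the block is full or the channel is exhausted
        }
        return Arrays.copyOf(buf.array(), buf.position());
    }
}
```

With a wrapper along these lines, the "read 100 bytes, rewind 99 bytes" pattern touches the underlying channel only when the position crosses into a block that is not cached, so repeated short rewinds are served from memory.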