Skip to content

AzureBlobDataStorageService::Cache issues #40

@drub0y

Description

@drub0y

I'm noticing a few issues with the implementation of this method that I'd like to discuss...

Using a sync method to download

The AzureBlobDataStorageService::Cache method is itself async, but it's calling the synchronous version of BlockBlobClient::Download here:

It should be changed to use the asynchronous DownloadAsync at bare minimum.

Deserialization implementation is suboptimal

Looking at the implementation overall though, I would suggest a more optimal solution which will reduce code as well as runtime allocations:

            using (var blobJsonStream = new MemoryStream())
            {
                await Blob.DownloadToAsync(blobJsonStream);

                CachedCollection = JsonSerializer.Deserialize<List<TStoredModel>>(new ReadOnlySpan<byte>(blobJsonStream.GetBuffer(), 0, (int)blobJsonStream.Length));
            }

NOTE: this code is based off the WIP branch to make the switch to System.Text.Json, assumes that will be committed and is meant to build on that.

So, first, it allows the BlobDowloadClient to optimize the download and delivery to the actual MemoryStream with the additional bonus of reducing a bunch of code in this code base to get to CopyTo. Second it removes the allocation/copy of the bytes to a byte[] that is caused by the ToArray() call right now. It does this by simply using the existing underlying buffer of the MemoryStream itself and wrapping it in a ReadOnlySpan while being sure to specify that the span ends at the logical Length of the stream, not the totally size of the allocated buffer. Third, it avoids the conversion of the byte[] into a string which not only is another allocation/copy, as it goes through the overhead of the GetString(...) conversion. Since we already know the bytes to be UTF8 bytes, we can simply pass those bytes through raw to the JsonSerializer and let it work right on top of them directly which is all it wants in the first place.

Now, the only behavioral difference to consider here is if the writer of the JSON blob happened to put a byte order mark (BOM) in place this approach would not skip over that vs the GetString approach which would handle that. Thus, I'm making the assumption that these JSON files are stored without a BOM with this code. If they are definitively stored with a BOM we can account for that. If it's possible that they might be stored with or without a BOM then we can still account for that with more checks too. However, since it's the Commit method that seems to be the source of the blob content and we control the writing of that blob content and its implementation will always write without a BOM right now, then I think my change is safe.

This method doesn't actually "cache"?

Finally, the method is called Cache and it's called as a preamble in almost every other method (e.g. Get, GetOne, etc), but it doesn't actually cache the results... it will execute the download logic every single time. This doesn't seem like a caching method as I traditionally think of it. I would think that at least for the lifetime of the instance once Cache() is called it would not download the results again. Is this the intended behavior of this method?

Bigger picture, it's treated as a singleton by the application, so... in effect there is no caching going on here and you're always going to the source to download the latest. If there's multiple writers that might make a lot of sense, but then I guess I'd change the name from "cache" since it's always doing a read-through to the backing store. Or, if we really do want it to still act like a cache, this could be further optimized to remember the ETag from the last read and only perform the download/deserialization if the ETag has changed. This would at least optimize for the many read, few write scenarios.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workinginvalidThis doesn't seem right

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions