Enhance GSON Converters, hash data only once, cache ULID generator class#137
Enhance GSON Converters, hash data only once, cache ULID generator class#137Override-6 wants to merge 7 commits intoproject-alvarium:mainfrom
Conversation
|
By the way, a lot of the diff is caused by reformat (i'm an addict to ctrl+alt+l in IJ!), what do you think of opening a pull request to setup the maven formatter plugin ? |
6e5f534 to
bc8e871
Compare
There was a problem hiding this comment.
Hi @Override-6, thanks for the suggestions. I have split the review into two parts, taking on the high-level approach you've suggested which I'll discuss here, and some code-specific comments which I'll write down below.
Hashing in the SDK top-level calls
While I think that the hashing of a piece of data should be the role of the Annotator since it is a field on the Annotation being created, I can't deny that the performance gained from this change is significant enough for us to overlook this design choice. I think we should give this a greenlight.
Additionally, I would have hoped that we can replace the data parameter in the annotator execute method to be the hash parameter, but as you already probably know the PkiAnnotator relies on the data, and is also the only annotator that does so. I think we should keep this in mind as it has been discussed that we want to remove the reliance of that annotator on the incoming data schema... just pinning a thought here.
Base64 Encoding Annotations in Publish Wrappers
I believe it's been discussed that breaking changes should not be entertained at this time. Other Alvarium services expect that the publish wrapper will contain the annotation list as a Base64-encoded value. This change should probably be reverted.
ULID Generation
I think this is on the right track, I am not a fan of static variables though. Do you think the same idea you have here can be approached differently ? I was thinking something along the lines of a ULID provider class, similar to how hashing and signing works. Interested to hear your ideas here.
Annotation Serialization
I think this is a straight up improvement on what was already there, just some implementation comments below.
| import java.util.ArrayList; | ||
| import java.util.HashMap; | ||
| import java.util.List; | ||
|
|
||
| import com.alvarium.annotators.Annotator; | ||
| import com.alvarium.annotators.AnnotatorConfig; | ||
| import com.alvarium.annotators.AnnotatorException; | ||
| import com.alvarium.annotators.AnnotatorFactory; | ||
| import com.alvarium.contracts.Annotation; | ||
| import com.alvarium.contracts.AnnotationList; | ||
| import com.alvarium.contracts.AnnotationType; | ||
| import com.alvarium.hash.HashProviderFactory; | ||
| import com.alvarium.hash.HashTypeException; | ||
| import com.alvarium.streams.StreamException; | ||
| import com.alvarium.streams.StreamProvider; | ||
| import com.alvarium.streams.StreamProviderFactory; | ||
| import com.alvarium.utils.ImmutablePropertyBag; | ||
| import com.alvarium.utils.PropertyBag; | ||
|
|
||
| import org.apache.logging.log4j.Logger; | ||
|
|
||
| import java.util.ArrayList; | ||
| import java.util.HashMap; | ||
| import java.util.List; | ||
|
|
There was a problem hiding this comment.
Please refer to our style guide which contains a section for imports and the correct ordering and grouping. It's a combination of industry styling guides used by Google and Twitter.
I understand these changes are probably due to a formatter, but what maintainers here usually do is create configuration on the editor of choice prior to making changes that reflect what is in that style guide.
There was a problem hiding this comment.
No problem, why not trying to use a maven plugin to help apply the code style policy ?
There was a problem hiding this comment.
I don't see why not, we can look into it.
src/main/java/com/alvarium/serializers/AlvariumPersistence.java
Outdated
Show resolved
Hide resolved
| public class AlvariumPersistence { | ||
|
|
||
|
|
||
| public static final Gson GSON = new GsonBuilder() |
There was a problem hiding this comment.
From what you've seen in your tests, does just instantiating this object each time a serialization is made take a lot away from performance ? I can understand the reasoning behind making this static since it's not really a changing value, but if it doesn't affect performance I don't see why not instantiate a new AlvariumPersistence each time we need to serialize something.
There was a problem hiding this comment.
It doesnt that much, I could create a getGson() but a Gson object is not something we gonna mutate, so why not place it in a static place and access it.
I made the GSON instance private, and now expose three static functions from the classs, so i think placing it in a static field is even more adapted now :
public class PeristenceFunctions {
private static final Gson GSON = ...;
public static String serializeWrapper(PublishWrapper wrapper) {
return GSON.toJson(wrapper);
}
public static String serializeAnnotation(Annotation annotation) {
return GSON.toJson(annotation);
}
public static Annotation deserializeAnnotation(String jsonAnnotation) {
return GSON.fromJson(jsonAnnotation, Annotation.class);
}
}There was a problem hiding this comment.
Well the reason behind my suggestion was that we could just do something like this:
public class JSONSerializer<T> {
private final Gson gson;
public JSONSerializer() {
gson = new GsonBuilder()
.registerTypeAdapter(PublishWrapper.class, new PublishWrapperConverter())
.registerTypeAdapter(ZonedDateTime.class, new ZonedDateTimeConverter())
.registerTypeAdapter(Annotation.class, new AnnotationConverter())
.disableHtmlEscaping()
.create();
}
public String toJson(T serializable) {
return this.gson.toJson(serializable);
}
public T fromJson(String json, java.lang.Class<T> classOfT) { // maybe there is a way to skip the second parameter, I haven't dug deep enough to check
return this.gson.fromJson(json, classOfT);
}
}and to use it we can just do new JSONSerializer<Annotation>().toJson(annotation) or new JSONSerializer<Annotation>().fromJson(json, Annotation.class). No static instances of anything, and I assume no detriment to performance.
Thoughts ?
There was a problem hiding this comment.
I see two issues here :
-
The Gson is configured to be only used by Alvarium (due to the ZonedDateTime adapter that is specific to our use.), i don't see why we want to use a generic method here, as we will only need to support persistance operations for Annotation and PublishWrapper.
I think that constraining the JSONSerializer with more typed method (serializeWrapper,serializeAnnotationetc) helps to encapsulate the Gson, and so we ensure that we only use the serializer for what it is intended for (as we saw that using it for unknown structures leads to serious performances degradation). -
new JSONSerializer().toJson()is semantically the same asJSONSerializer.toJson(), as there is no possible way to mutate the wrapped Gson instance.
A Gson instance is thread safe for both serialization and deserialization.
I really dont see any advantage (in both semantic and runtime) in not using a static constant
There was a problem hiding this comment.
That does make sense. I like the idea of enforcing the use of the serializer to only specific types in order to avoid performance issues, so I am sold on that part. But you do realize this will require a fair bit of coupling between the serializers package and any serializable class... Do you think it is possible to move the serialization logic as a member of the respective class ? i.e., something on the lines of String json = publishWrapper.toJson() while keeping all performance improvements ?
There was a problem hiding this comment.
Sorry i don't understand, do you want to move the functions of the PersistanceFunctions.serializeX to X.toJson() ? If yes then the methods already exists in PublishWrapper and Annotation.
Or do you want to have a GSON instance in PublishWrapper and Annotation and then those gson are specialized to the class they are associated ? If yes then this would be like it is done currently, and this will force you to ensure that the changes you make are set in all Gsons (if you add an adpater then you'll need to not forget to place it in all Gson) this looks like code duplication i think 🤔
There was a problem hiding this comment.
Or do you want to have a GSON instance in PublishWrapper and Annotation and then those gson are specialized to the they are associated ? If yes then would be like it is done currently, and this will force you to ensure that the changes you make are set in all Gsons (if you add an adpater then you'll need to not forget to place it in all Gson) this looks like code duplication i think 🤔
Yes, this is what I meant. I see the implication of code duplication here, and yes it would be a worse approach. I was checking to see if you have any ideas that can improve on the coupling across packages without any downsides, as I have none, but it seems this really is the best approach we have on hand at the moment.
My only gripe at the moment is with the class name still 😄 "Persistence" is confusing in a software development space, which is usually used for long-term persistence such as DBs and storage, and when someone reads PersistenceFunctions.serializeWrapper(wrapper), they might think the output is actually being stored somewhere, somehow. Just a suggestion, how about Serialization.toJson(wrapper) which tells me that only serialization is happening as well as using which format. This makes extensibility to other formats easier as well.
public class Serialization {
private static final Gson GSON = new GsonBuilder()
.registerTypeAdapter(PublishWrapper.class, new PublishWrapperConverter())
.registerTypeAdapter(ZonedDateTime.class, new ZonedDateTimeConverter())
.registerTypeAdapter(Annotation.class, new AnnotationConverter())
.disableHtmlEscaping()
.create();
public static String toJson(PublishWrapper wrapper) {
return GSON.toJson(wrapper);
}
public static String toJson(Annotation annotation) {
return GSON.toJson(annotation);
}
public static Annotation annotationFromJson(String jsonAnnotation) {
return GSON.fromJson(jsonAnnotation, Annotation.class);
}
}
src/main/java/com/alvarium/serializers/AlvariumPersistence.java
Outdated
Show resolved
Hide resolved
src/main/java/com/alvarium/serializers/AnnotationConverter.java
Outdated
Show resolved
Hide resolved
|
Hey @Ali-Amin, Thanks for the review. Hashing in the SDK top-level callsFor me the All those things makes the implementation bother with additional operations it shouldnt, and this is why you had to make the hashing process for each annotation, or to inject the I think the responsibility to create ULID GenerationWhy not simply using FormatI think we could open a PR to set up a code formatter and configure it to apply the code style convention? |
4f31b02 to
9167729
Compare
9167729 to
b5374cb
Compare
Why not removing the In this scala sdk poc, I wanted to use a more typesafe way to inject things at "annotation evaluation time", maybe we could use it here, which could allow the PkiAnnotator to stay in the sdk! |
|
I think that might be a good approach yes, because the data is needed for the annotation's 'satisfaction' logic, and we usually pass info related to that 'satisfaction' logic in the property bag for other annotators. So it makes sense to me to do the same here. |
|
I understand for ULID, i forgot that changing the ID generation algorithm would also be a breaking change ! For simplicity's sake, i'm personally more in favor of keeping the ULID in a static field (I indeed could create a I think i'll open a new PR to setup a formatter
Do you want me to implement this idea here or do i create another PR for this (once this one will be merged) ? I am interested by what you mean by Envelope, do you have any thing i can read to better understand the idea ? Thanks you ! |
I think the advantages of an ID generator outweigh those of a static instantiation of ULID, having a default value for the ID generator config as well might be a good idea to prevent headache for other devs.
Awesome, looking forward to it.
I think this PR is large enough, lets keep any more changes and suggestions into their own single units to keep reviews simple and fast.
I do not have anything on hand unfortunately, I will try to fetch the TSC call recording link we had where we discussed plans for it. |
Ali-Amin
left a comment
There was a problem hiding this comment.
Most of everything else looks good to me. As soon as this is fixed, I'll plug these changes into a working Jenkins pipeline and ensure scores are being calculated without error.
This recording https://wiki.lfedge.org/download/attachments/68584065/Alvarium-TSC-11-Mar-2024.mp4?api=v2 should have the relevant info, the discussion starts around 9:50. |
83c50cd to
2859b54
Compare
2859b54 to
01cd65f
Compare
|
@Ali-Amin @Override-6 Thanks for the collaboration on this. Please have some patience w/r/t the timing of getting this work approved and merged. We have an effort in front of it that is proving difficult and I'd like to get that resolved before considering these changes. There is also some pending vacation time coming up which will add a delay. Just FYI that we're still tracking this PR. |
Hello,
This Pull Request implements the three non-breaking changes ideas as discussed in the report mentioned in #136 (section 4.1, page 6).
What has been improved
JSON Optimisation
The main improvement is by optimising GSON serialization.
In the report, I say that 25% of a data annotation process goes to serializing the annotations to JSON.
Avoid GSON reflexion serializer
The profiling results shows that the main part of this overhead is that Gson has to use the java reflection library to automatically serialize the annotations, as no
JsonSerializeris properly registered in theGSONcontext.Currently, the GSON library seems to be misused :
alvarium-sdk-java/src/main/java/com/alvarium/serializers/AnnotationConverter.java
Lines 28 to 37 in 7e53f04
By doing this, the
Annotationis converted to a json string (usingAnnotation#toJson), which is then parsed back to aJSONElement.The implementation of
Annotation#toJsonis the following :alvarium-sdk-java/src/main/java/com/alvarium/contracts/Annotation.java
Lines 116 to 120 in 7e53f04
No converter is set to properly depict the structure of an annotation, so, in addition to this JSON indirection, Gson will have to use reflection to serialize, which is very unefficient. So I defined a JsonSerializer for the
Annotationtype to avoid this.InstantConverter and ZonedDateTime
I also saw that the InstantConverter's serialization side had a formatting/parsing step to convert
Instantobjects to aZonedDateTime, so I directly usedZonedDateTimeforAnnotationtimestamps.alvarium-sdk-java/src/main/java/com/alvarium/serializers/InstantConverter.java
Lines 36 to 46 in 7e53f04
I dont think using an abstract type here is relevant as the annotation instantiation process should only be managed by the Alvarium SDK. As the serialization is also defined and peformed by the SDK, this
timestampfield is fully internally managed, so i think we can sacrifice the O/C principle here (which is useless as the SDK has no reason to use multiple implementations of Instant) to gain some significant performances improvements.ULID Generation
Each
Annotationobject have a unique identifier.The
ULIDclass is instanciated in the constructor ofAnnotation, which will initialize a newSecureRandom.Initializing a
SecureRandomis expensive as shown by the profiling results, so I moved theULIDobject in a static field of theAnnotationclass, and generate an identifier from it :Hash Data Only Once
The second big bottleneck that can be strongly reduced is that data is hashed by each annotators.
This is caused by a conception issue where the
Annotatorinterface is responsible to instantiateAnnotationobjects, which is a quite complex type, as annotations :As the annotators creates new annotations, they are also responsible to hash the data, but the hash is always the same for each annotations, so what i did is to hash the data ahead of time, by the
Sdkimplementation, and then add it as an input to theAnnotator#annotatemethod.This is a quickfix as I dont want to make this pull request refactor too much things, because I think a whole PR is needed to reconsider the utility of having as-said
Annotators, where what we only want is asatisfaction verdict(look at all the implementations of theAnnotatorclass, 75% of the code is duplicated, only few lines are used to build up an actual verdict).Thanks you.