Merged
39 changes: 29 additions & 10 deletions README.md
@@ -27,34 +27,53 @@ This is the core Java component of the DataSketches library. It contains all of

This component is also a dependency of other components of the library that create adaptors for target systems, such as the [Apache Pig adaptor](https://github.com/apache/datasketches-pig), the [Apache Hive adaptor](https://github.com/apache/datasketches-hive), and others.

Note that we have a parallel core component for C++, Python and GO implementations of many of the same sketch algorithms,
[datasketches-cpp](https://github.com/apache/datasketches-cpp), [datasketches-python](https://github.com/apache/datasketches-python), and
[datasketches-go](https://github.com/apache/datasketches-go).
Note that we have parallel core components for C++, Python and Go implementations of many of the same sketch algorithms:

- [datasketches-cpp](https://github.com/apache/datasketches-cpp),
- [datasketches-python](https://github.com/apache/datasketches-python),
- [datasketches-go](https://github.com/apache/datasketches-go).

Please visit the main [DataSketches website](https://datasketches.apache.org) for more information.

If you are interested in making contributions to this site please see our [Community](https://datasketches.apache.org/docs/Community/) page for how to contact us.

---
## Major Changes with this Release
This is a major release in which we took the opportunity to do some significant refactoring that introduces incompatible changes from previous releases. Any incompatibility with prior releases is an inconvenience to users who wish to simply upgrade to the latest release and run. However, some of the code in this library was written in 2013, and the Java language has evolved enormously since then. We chose to use this major release as the opportunity to modernize some of the code to achieve the following goals:

### Eliminate the dependency on the DataSketches-Memory component.
The DataSketches-Memory component was originally developed in 2014 to address the need for fast access to off-heap memory data structures and used Unsafe and other JVM internals as there were no satisfactory Java language features to do this at the time.

The Foreign Function & Memory (FFM) capabilities, introduced into the language in Java 22, are now part of the Java 25 LTS release, which we support. Since the capabilities of FFM are a superset of the original DataSketches-Memory component, it made sense to rewrite the code to eliminate the dependency on DataSketches-Memory and use FFM instead. This impacted code across the entire library.

This provided several advantages to the code base. With the dependency on DataSketches-Memory removed, the library now has no runtime dependencies! This should make integrating this library into other Java systems much simpler. Since FFM is tightly integrated into the Java language, it has improved performance, especially with bulk operations.

- As an added note: There are numerous other improvements to the Java language that we could perhaps take advantage of in a rewrite, e.g., records, text blocks, switch expressions, sealed classes, var, modules, patterns, etc. However, faced with the risk of accidentally creating bugs due to too many changes at one time, we focused on FFM, which actually improves performance, as opposed to changes that are just syntactic sugar.
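
To make the FFM discussion above concrete, here is a minimal sketch of off-heap access using only the standard `java.lang.foreign` API (Java 22 or later). It is not code from this library: the class name `FfmOffHeapDemo` and its two methods are hypothetical, chosen only to illustrate element access and a bulk copy.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical demo class; not part of the DataSketches library.
final class FfmOffHeapDemo {

  /** Writes the values 1..n as longs into off-heap memory, reads them back, and returns their sum. */
  static long sumOffHeap(final int n) {
    // A confined arena releases its off-heap memory deterministically when closed.
    try (Arena arena = Arena.ofConfined()) {
      // Allocate n longs, aligned to 8 bytes.
      final MemorySegment seg = arena.allocate((long) n * Long.BYTES, Long.BYTES);
      for (int i = 0; i < n; i++) {
        seg.setAtIndex(ValueLayout.JAVA_LONG, i, i + 1L);
      }
      long sum = 0;
      for (int i = 0; i < n; i++) {
        sum += seg.getAtIndex(ValueLayout.JAVA_LONG, i);
      }
      return sum;
    }
  }

  /** Bulk-copies a heap byte array into off-heap memory and back again. */
  static byte[] roundTrip(final byte[] src) {
    try (Arena arena = Arena.ofConfined()) {
      final MemorySegment offHeap = arena.allocate(src.length);
      // One bulk copy instead of a byte-by-byte loop.
      MemorySegment.copy(MemorySegment.ofArray(src), 0, offHeap, 0, src.length);
      return offHeap.toArray(ValueLayout.JAVA_BYTE);
    }
  }

  public static void main(final String[] args) {
    System.out.println(sumOffHeap(10)); // prints 55
    System.out.println(roundTrip(new byte[] {1, 2, 3}).length); // prints 3
  }
}
```

Nothing here uses `Unsafe` or any third-party jar, which is the property the section above attributes to the move to FFM.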

### Align public sketch class names so that the sketch family name is part of the class name.
For example, the Theta sketch was the first sketch written for the library and its base class was called *Sketch*. Obviously, because it was the only sketch! The Tuple sketch evolved soon after and its base class was also called *Sketch*. Oops, bad idea. If a user wanted to use both the Theta and Tuple sketches in the same class one of them had to be fully qualified every time it was referenced. Ugh!

Unfortunately, this habit propagated to some of the other early sketches, where we ended up, for example, with two different sketches that each had an *ItemsSketch* class. For the more recent additions to the library, we started including the sketch family name in all the relevant sketch-like public classes of a sketch family.

In this release we have refactored these older sketches with new names that now include the sketch family name. Yes, this is an incompatible change for user code moving from earlier releases, but it can usually be fixed with search-and-replace tools. This release is not perfect, but it is hopefully more consistent across all the different sketch families.
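
The naming problem described above can be illustrated with a small self-contained sketch. The nested classes `Theta.Sketch` and `Tuple.Sketch` are hypothetical stand-ins for two packages that each declared a class called *Sketch*, and `ThetaSketch` / `TupleSketch` are assumed names used only to show the idea of putting the family name in the class name. None of these are the library's actual classes.

```java
// Hypothetical demo; the nested classes only simulate same-named classes in two packages.
final class SketchNamingDemo {

  // Stand-in for a "theta" package whose base class was simply called Sketch.
  static final class Theta {
    static final class Sketch {
      String family() { return "theta"; }
    }
  }

  // Stand-in for a "tuple" package that reused the same simple name.
  static final class Tuple {
    static final class Sketch {
      String family() { return "tuple"; }
    }
  }

  // Assumed renamed classes that carry the family name themselves.
  static final class ThetaSketch {
    String family() { return "theta"; }
  }

  static final class TupleSketch {
    String family() { return "tuple"; }
  }

  /** Old style: each reference needs a qualifier to say which "Sketch" is meant. */
  static String oldStyle() {
    final Theta.Sketch a = new Theta.Sketch();
    final Tuple.Sketch b = new Tuple.Sketch();
    return a.family() + "+" + b.family();
  }

  /** New style: the simple class names are already unambiguous. */
  static String newStyle() {
    final ThetaSketch a = new ThetaSketch();
    final TupleSketch b = new TupleSketch();
    return a.family() + "+" + b.family();
  }

  public static void main(final String[] args) {
    System.out.println(oldStyle()); // prints theta+tuple
    System.out.println(newStyle()); // prints theta+tuple
  }
}
```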


## Build & Runtime Dependencies

### Installation Directory Path
**NOTE:** This component accesses resource files for testing. As a result, the directory elements of the full absolute path of the target installation directory must qualify as Java identifiers. In other words, the directory elements must not have any space characters (or non-Java identifier characters) in any of the path elements. This is required by the Oracle Java Specification in order to ensure location-independent access to resources: [See Oracle Location-Independent Access to Resources](https://docs.oracle.com/javase/8/docs/technotes/guides/lang/resources.html)

### OpenJDK Version 24
An OpenJDK-compatible build of Java 24, provided by one of the Open-Source JVM providers, such as Azul Systems, Red Hat, SAP, Eclipse Temurin, etc, is required.
All of the testing of this release has been performed with an Eclipse Temurin build.

This release uses the new Java Foreign Function & Memory (FFM) features that were made part of the Java language in Java 22.
### OpenJDK Version 25
At minimum, an OpenJDK-compatible build of Java 25, provided by one of the Open-Source JVM providers, such as *Azul Systems*, *Red Hat*, *SAP*, *Eclipse Temurin*, etc, is required.
All of the testing of this release has been performed with the *Eclipse Temurin* build.

## Compilation and Test using Maven
This DataSketches component is structured as a Maven project, and Maven is the recommended tool for compiling and testing.

#### A Toolchain is required

* You must have a JDK type toolchain defined in location *~/.m2/toolchains.xml* that specifies where to find a locally installed OpenJDK-compatible version 24.
* Your default \$JAVA\_HOME compiler must be OpenJDK compatible, specified in the toolchain, and may be a version greater than 24. Note that if your \$JAVA\_HOME is set to a Java version greater than 24, Maven will automatically use the Java 24 version specified in the toolchain instead. The included pom.xml specifies the necessary JVM flags, so no further action should be required.
* You must have a JDK type toolchain defined in location *~/.m2/toolchains.xml* that specifies where to find a locally installed OpenJDK-compatible version 25.
* Your default \$JAVA\_HOME compiler must be OpenJDK compatible, specified in the toolchain, and may be a version greater than 25. Note that if your \$JAVA\_HOME is set to a Java version greater than 25, Maven will automatically use the Java 25 version specified in the toolchain instead. The included pom.xml specifies the necessary JVM flags, if required, so no further action is needed.
* Note that the paths specified in the toolchain must be fully qualified direct paths to the OpenJDK version locations. Using environment variables will not work.
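
As a rough illustration of the toolchain requirements above, a minimal *~/.m2/toolchains.xml* could look like the following. The `vendor` value and the `jdkHome` path are placeholders (assumptions, not values taken from this project), and the path must be a fully qualified direct path to your own JDK 25 installation.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<toolchains>
  <toolchain>
    <type>jdk</type>
    <provides>
      <version>25</version>
      <!-- placeholder vendor; use whatever matches your installed JDK -->
      <vendor>temurin</vendor>
    </provides>
    <configuration>
      <!-- placeholder path; environment variables will not work here -->
      <jdkHome>/absolute/path/to/your/jdk-25</jdkHome>
    </configuration>
  </toolchain>
</toolchains>
```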

#### To run normal unit tests:
24 changes: 12 additions & 12 deletions src/main/java/org/apache/datasketches/hll/BaseHllSketch.java
@@ -35,7 +35,7 @@

/**
* Although this class is package-private, it provides a single place to define and document
* the common public API for both HllSketch and Union.
* the common public API for both HllSketch and HllUnion.
* @author Lee Rhodes
* @author Kevin Lang
*/
@@ -115,7 +115,7 @@ public static final int getSerializationVersion(final MemorySegment seg) {
* Gets the current (approximate) Relative Error (RE) asymptotic values given several
* parameters. This is used primarily for testing.
* @param upperBound return the RE for the Upper Bound, otherwise for the Lower Bound.
* @param oooFlag set true if the sketch is the result of a non qualifying union operation.
* @param oooFlag set true if the sketch is the result of a non-qualifying HllUnion operation.
* @param lgConfigK the configured value for the sketch.
* @param numStdDev the given number of Standard Deviations. This must be an integer between
* 1 and 3, inclusive.
@@ -206,8 +206,8 @@ public boolean isEstimationMode() {
* inquire of the sketch if it has, in fact, moved itself.
*
* @param seg the given MemorySegment
* @return true if the given MemorySegment refers to the same underlying resource as this sketch or
* union.
* @return true if the given MemorySegment refers to the same underlying resource as this HllSketch or
* HllUnion.
*/
@Override
public abstract boolean isSameResource(MemorySegment seg);
@@ -219,17 +219,17 @@ public boolean isEstimationMode() {

/**
* Serializes this sketch as a byte array in compact form. The compact form is smaller in size
* than the updatable form and read-only. It can be used in union operations as follows:
* than the updatable form and read-only. It can be used in HllUnion operations as follows:
* <pre>{@code
* Union union; HllSketch sk, sk2;
* HllUnion union; HllSketch sk, sk2;
* int lgK = 12;
* sk = new HllSketch(lgK, TgtHllType.HLL_4); //can be 4, 6, or 8
* for (int i = 0; i < (2 << lgK); i++) { sk.update(i); }
* byte[] arr = sk.toCompactByteArray();
* //...
* union = Union.heapify(arr); //initializes the union using data from the array.
* union = HllUnion.heapify(arr); //initializes the HllUnion using data from the array.
* //OR, if used in an off-heap environment:
* union = Union.heapify(MemorySegment.ofArray(arr)); //same as above, except from MemorySegment object.
* union = HllUnion.heapify(MemorySegment.ofArray(arr)); //same as above, except from MemorySegment object.
*
* //To recover an updatable heap sketch:
* sk2 = HllSketch.heapify(arr);
@@ -250,17 +250,17 @@ public boolean isEstimationMode() {
/**
* Serializes this sketch as a byte array in an updatable form. The updatable form is larger than
* the compact form. The use of this form is primarily in environments that support updating
* sketches in off-heap MemorySegment. If the sketch is constructed using HLL_8, sketch updating and
* union updating operations can actually occur in MemorySegment, which can be off-heap:
* sketches in off-heap MemorySegment. If the sketch is constructed using HLL_8, HllSketch updating and
* HllUnion updating operations can actually occur in MemorySegment, which can be off-heap:
* <pre>{@code
* Union union; HllSketch sk;
* HllUnion union; HllSketch sk;
* int lgK = 12;
* sk = new HllSketch(lgK, TgtHllType.HLL_8); //must be 8
* for (int i = 0; i < (2 << lgK); i++) { sk.update(i); }
* byte[] arr = sk.toUpdatableByteArray();
* MemorySegment wseg = MemorySegment.ofArray(arr);
* //...
* union = Union.writableWrap(wseg); //no deserialization!
* union = HllUnion.writableWrap(wseg); //no deserialization!
* }</pre>
* @return this sketch as an updatable byte array.
*/
Original file line number Diff line number Diff line change
@@ -136,7 +136,7 @@ void putNibble(final int slotNo, final int nibValue) {
}

@Override
//Would be used by Union, but not used because the gadget is always HLL8 type
//Would be used by HllUnion, but not used because the gadget is always HLL8 type
void updateSlotNoKxQ(final int slotNo, final int newValue) {
throw new SketchesStateException("Improper access.");
}
Original file line number Diff line number Diff line change
@@ -83,7 +83,7 @@ void putNibble(final int slotNo, final int nibValue) {
}

@Override
//Would be used by Union, but not used because the gadget is always HLL8 type
//Would be used by HllUnion, but not used because the gadget is always HLL8 type
void updateSlotNoKxQ(final int slotNo, final int newValue) {
throw new SketchesStateException("Improper access.");
}
Original file line number Diff line number Diff line change
@@ -86,7 +86,7 @@ void putNibble(final int slotNo, final int nibValue) {
}

@Override
//Used by Union when source is not HLL8
//Used by HllUnion when source is not HLL8
void updateSlotNoKxQ(final int slotNo, final int newValue) {
final int oldValue = getSlotValue(slotNo);
if (newValue > oldValue) {
2 changes: 1 addition & 1 deletion src/main/java/org/apache/datasketches/hll/Hll4Array.java
@@ -136,7 +136,7 @@ void putNibble(final int slotNo, final int nibValue) {
}

@Override
//Would be used by Union, but not used because the gadget is always HLL8 type
//Would be used by HllUnion, but not used because the gadget is always HLL8 type
void updateSlotNoKxQ(final int slotNo, final int newValue) {
throw new SketchesStateException("Improper access.");
}
2 changes: 1 addition & 1 deletion src/main/java/org/apache/datasketches/hll/Hll6Array.java
@@ -93,7 +93,7 @@ void putNibble(final int slotNo, final int nibValue) {
}

@Override
//Would be used by Union, but not used because the gadget is always HLL8 type
//Would be used by HllUnion, but not used because the gadget is always HLL8 type
void updateSlotNoKxQ(final int slotNo, final int newValue) {
throw new SketchesStateException("Improper access.");
}
2 changes: 1 addition & 1 deletion src/main/java/org/apache/datasketches/hll/Hll8Array.java
@@ -92,7 +92,7 @@ void putNibble(final int slotNo, final int nibValue) {
}

@Override
//Used by Union when source is not HLL8
//Used by HllUnion when source is not HLL8
void updateSlotNoKxQ(final int slotNo, final int newValue) {
final int oldValue = getSlotValue(slotNo);
hllByteArr[slotNo] = (byte) Math.max(newValue, oldValue);
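
The hunk above keeps the larger of the old and new value for a slot. As a simplified, stdlib-only illustration of that slot-wise maximum (not the library's union implementation, which maintains additional state not shown in this diff), the hypothetical class below merges two equally sized byte arrays of slot values.

```java
import java.util.Arrays;

// Hypothetical demo class; only illustrates the slot-wise max rule.
final class SlotMaxMergeDemo {

  /** Stores the larger of the current slot value and the incoming value. */
  static void updateSlot(final byte[] slots, final int slotNo, final int newValue) {
    final int oldValue = slots[slotNo] & 0xFF; // read the stored byte as unsigned
    slots[slotNo] = (byte) Math.max(newValue, oldValue);
  }

  /** Returns a new array holding the slot-wise maximum of a and b. */
  static byte[] merge(final byte[] a, final byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("Slot arrays must have the same length");
    }
    final byte[] out = a.clone();
    for (int i = 0; i < b.length; i++) {
      updateSlot(out, i, b[i] & 0xFF);
    }
    return out;
  }

  public static void main(final String[] args) {
    final byte[] merged = merge(new byte[] {1, 5, 0, 3}, new byte[] {2, 4, 0, 7});
    System.out.println(Arrays.toString(merged)); // prints [2, 5, 0, 7]
  }
}
```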
14 changes: 7 additions & 7 deletions src/main/java/org/apache/datasketches/hll/HllSketch.java
@@ -203,7 +203,7 @@ public static final HllSketch heapify(final MemorySegment srcSeg) {
return heapify(srcSeg, true);
}

//used by union and above
//used by HllUnion and above
static final HllSketch heapify(final MemorySegment srcSeg, final boolean checkRebuild) {
Objects.requireNonNull(srcSeg, "Source MemorySegment must not be null");
checkBounds(0, 8, srcSeg.byteSize()); //need min 8 bytes
@@ -218,7 +218,7 @@ static final HllSketch heapify(final MemorySegment srcSeg, final boolean checkRe
} else { //Hll_8
heapSketch = new HllSketch(Hll8Array.heapify(srcSeg));
if (checkRebuild) {
Union.checkRebuildCurMinNumKxQ(heapSketch);
HllUnion.checkRebuildCurMinNumKxQ(heapSketch);
}
}
} else if (curMode == CurMode.LIST) {
@@ -245,7 +245,7 @@ public static final HllSketch writableWrap(final MemorySegment srcWseg) {
return writableWrap(srcWseg, true);
}

//used by union and above
//used by HllUnion and above
static final HllSketch writableWrap( final MemorySegment srcWseg, final boolean checkRebuild) {
Objects.requireNonNull(srcWseg, "Source MemorySegment must not be null");
checkBounds(0, 8, srcWseg.byteSize()); //need min 8 bytes
@@ -268,8 +268,8 @@ static final HllSketch writableWrap( final MemorySegment srcWseg, final boolean
directSketch = new HllSketch(new DirectHll6Array(lgConfigK, srcWseg));
} else { //Hll_8
directSketch = new HllSketch(new DirectHll8Array(lgConfigK, srcWseg));
if (checkRebuild) { //union only uses HLL_8, we allow non-finalized from a union call.
Union.checkRebuildCurMinNumKxQ(directSketch);
if (checkRebuild) { //HllUnion only uses HLL_8, we allow non-finalized from a HllUnion call.
HllUnion.checkRebuildCurMinNumKxQ(directSketch);
}
}
} else if (curMode == CurMode.LIST) {
@@ -305,8 +305,8 @@ public static final HllSketch wrap(final MemorySegment srcSeg) { //read only
directSketch = new HllSketch(new DirectHll6Array(lgConfigK, srcSeg, true));
} else { //Hll_8
directSketch = new HllSketch(new DirectHll8Array(lgConfigK, srcSeg, true));
//rebuild if srcSeg came from a union and was not finalized, rather than throw exception.
Union.checkRebuildCurMinNumKxQ(directSketch);
//rebuild if srcSeg came from a HllUnion and was not finalized, rather than throw exception.
HllUnion.checkRebuildCurMinNumKxQ(directSketch);
}
} else if (curMode == CurMode.LIST) {
directSketch =