I checked if computing the proposals with a parallel stream would be
beneficial. Turns out the stream uses several threads, but the overall
computation is not faster, because each individual computation is
slower.
The IntLists were no longer sorted since we made the initialization run
in parallel. Therefore a much slower implementation for
intersection/union was used.
When we parallelized the initialization we forgot to
synchronize the docIdToDoc list.
Luckily there is a high probability, that queries return
results, that are obviously wrong.
This reduces memory usage by 1 or 2 MB.
33% of an ArrayList can be free. If the list is 1 million entries long,
then the list wastes 2.6 MB.
The Doc objects in the list are much bigger.
This reduces the size of the old generation by 100MB (300MB down to
200MB). Unfortunately the total JVM size didn't change and is still
512MB.
Doc stores the path as byte array instead of Path.
We are using ANTLR listeners to find out where in the
query the cursor is. Then we generate a list of keys/values
that might fit at that position. With that information we
can generate new queries and sort them by the number
of results they yield.
This is done in preparation for the proposal API.
In order to compute proposals we need to consume the
API of the DataStore, but the code does not need to
be in the DataStore.
Extracting the API allows us to separate these concerns.
LuDB has a few disadvantages.
1. Most notably disk space. H2 wastes a lot of valuable disk space.
For my test data set with 44 million entries it is 14 MB
(sometimes a lot more; depends on H2 internal cleanup). With
data-store it is 15 KB.
Overall I could reduce the disk space from 231 MB to 200 MB (13.4 %
in this example). That is an average of 4.6 bytes per entry.
2. Speed:
a) Liquibase is slow. The first time it takes approx. three seconds
b) Query and insertion. with data-store we can insert entries
up to 1.6 times faster.
Data-store uses a few tricks to save disk space:
1. We encode the tags into the file names.
2. To keep them short we translate the key/value of the tag into
shorter numbers. For example "foo" -> 12 and "bar" to 47. So the
tag "foo"/"bar" would be 12/47.
We then translate this number into a numeral system of base 62
(a-zA-Z0-9), so it can be used for file names and it is shorter.
That way we only have to store the mapping of string to int.
3. We do that in a simple tab separated file.