UniqueStringIntegerPairs stores mappings of integers
0-n to strings and vice versa. Mapping integers to
strings does not need a TreeMap, it can be done with
a List.
Makes insertions 3 times (when using the in-memory
variant that does not write to disk) and 7 times faster
for int to string mapping.
One bottleneck was the blocking queue used to transport entries
from the listener thread to the ingestor thread.
Reduced the bottleneck by batching entries.
Interestingly the batch size of 100 was better than batch size
of 1000 and better than 10.
Compared to FastISODateParser.parse, which returns an
OffsetDateTime object, parseAsEpochMilli returns the
epoch time millis. The performance improvement for
date parsing alone is roughly 100% (8m dates/s to
18m dates/s).
Insertion speed improved from 13-14s for 1.6m entries
to 11.5-12.5s.
Replaced Tags.filenameBytes with a SortedSet<Tag>. Tags are now
stored as longs (variable length encoded) in the PersistenMap.
Tags.filenameBytes was introduced to reduce memory consumption, when
all tags were hold in memory. Tags are now stored in a PersistentMap
and only read when needed.
Moved the VariableByteEncoder into its own project, because it was
needed by pdb-api.
Replaces the use of in-memory data structures with the PersistentMap.
This is the crucial step in reducing memory usage for both persistent
storage and main memory.
- The DiskStorage uses only one file instead of millions.
Also the block size is only 512 byte instead of 4kb, which
helps to reduce the memory usage for short sequences.
- Update primitiveCollections to get the new LongList.range
and LongList.rangeClosed methods.
- BSFile now stores Time&Value sequences and knows how to
encode the time values with delta encoding.
- Doc had to do some magic tricks to save memory. The path
was initialized lazy and stored as byte array. This is no
longer necessary. The patch was replaced by the
rootBlockNumber of the BSFile.
- Had to temporarily disable the 'in' queries.
- The stored values are now processed as stream of LongLists
instead of Entry. The overhead for creating Entries is
gone, so is the memory overhead, because Entry was an
object and had a reference to the tags, which is
unnecessary.
The last improvement of memory usage introduced a performance
regression. The ingestion performance dropped by 50%-80%, because
for every inserted entry the Tags were created inefficient.
Before we could only group by a single field. But it is acutally
very useful to group by multiple fields. For example to see the
graph for a small set of methods grouped by host and project.
Store values in sequences of variable length. Instead of using 8 bytes
per entry we are now using between 2 and 20 bytes. But we are also able
to store every non-negative long value.