The cache must remove/evict writers after a few seconds, but EhCache
only evicts entries when a new entry is added. That is not acceptable
for us, because it would leave lots of files open and we would need
a second mechanism to close them.
Therefore I wrote a simple wrapper around a ConcurrentHashMap that
evicts entries after timeToLive+5s.
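A minimal sketch of such an actively evicting map (class and method
names here are illustrative, not the actual implementation): each
entry carries a last-access timestamp, a scheduled task sweeps the
map independently of writes, and an eviction callback closes the
expired writer.

    import java.util.Map;
    import java.util.concurrent.*;
    import java.util.function.Consumer;

    /* Illustrative sketch: a map that actively evicts idle entries. */
    public class ExpiringMap<K, V> {
        private static final class Timestamped<V> {
            final V value;
            volatile long lastAccess = System.nanoTime();
            Timestamped(V value) { this.value = value; }
        }

        private final ConcurrentHashMap<K, Timestamped<V>> map =
            new ConcurrentHashMap<>();
        private final long ttlNanos;
        private final Consumer<V> onEvict; // e.g. close the writer

        public ExpiringMap(long ttlMillis, Consumer<V> onEvict,
                           ScheduledExecutorService scheduler) {
            this.ttlNanos = TimeUnit.MILLISECONDS.toNanos(ttlMillis);
            this.onEvict = onEvict;
            // Sweep periodically, independent of put() activity,
            // unlike EhCache, which only evicts on insert.
            scheduler.scheduleAtFixedRate(this::evictExpired,
                ttlMillis, ttlMillis, TimeUnit.MILLISECONDS);
        }

        public V get(K key) {
            Timestamped<V> t = map.get(key);
            if (t == null) return null;
            t.lastAccess = System.nanoTime();
            return t.value;
        }

        public void put(K key, V value) {
            map.put(key, new Timestamped<>(value));
        }

        private void evictExpired() {
            long now = System.nanoTime();
            for (Map.Entry<K, Timestamped<V>> e : map.entrySet()) {
                Timestamped<V> t = e.getValue();
                if (now - t.lastAccess > ttlNanos
                        && map.remove(e.getKey(), t)) {
                    // Removed first, closed second: nobody can get
                    // hold of an already-closed writer.
                    onEvict.accept(t.value);
                }
            }
        }
    }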
- PdbFiles no longer require dates to be monotonically
increasing. Therefore TagsToFile does not have to ensure
this. => We only have one file per Tags.
- Use EhCache instead of HashMap.
- The DiskStorage uses only one file instead of millions.
Also, the block size is now 512 bytes instead of 4 KB, which
helps to reduce the memory usage for short sequences.
- Update primitiveCollections to get the new LongList.range
and LongList.rangeClosed methods.
- BSFile now stores Time&Value sequences and knows how to
encode the time values with delta encoding (see the sketch
after this list).
- Doc had to do some magic tricks to save memory: the path
was initialized lazily and stored as a byte array. This is
no longer necessary; the path was replaced by the
rootBlockNumber of the BSFile.
- Had to temporarily disable the 'in' queries.
- The stored values are now processed as a stream of LongLists
instead of Entries. The overhead of creating Entries is
gone, and so is the memory overhead: Entry was an object
and held a reference to the tags, which is unnecessary.
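For illustration, delta encoding keeps the first timestamp and
stores every later value as the difference to its predecessor;
consecutive timestamps are usually close together, so the deltas
stay small. This sketch shows only the delta step; how BSFile
actually writes those deltas to disk may differ.

    /* Sketch of delta encoding for timestamps; the on-disk
       format of BSFile may differ. */
    public final class DeltaCodec {

        /* Encode: store each timestamp as the difference to the
           previous one (the first is kept as-is). */
        public static long[] encode(long[] times) {
            long[] deltas = new long[times.length];
            long previous = 0;
            for (int i = 0; i < times.length; i++) {
                deltas[i] = times[i] - previous;
                previous = times[i];
            }
            return deltas;
        }

        /* Decode by accumulating the deltas back into absolute
           timestamps. */
        public static long[] decode(long[] deltas) {
            long[] times = new long[deltas.length];
            long previous = 0;
            for (int i = 0; i < deltas.length; i++) {
                previous += deltas[i];
                times[i] = previous;
            }
            return times;
        }
    }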
The old implementation opened a new buffered reader every time
getPathByOffset was called. This took 1/20th of a second or
longer, so queries that visited thousands of files could take
a long time.
We now use a RandomAccessFile that is opened once. The
average time spent in getPathByOffset is now down to 0.11ms.
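The pattern, roughly (names are hypothetical, and the listing
file is assumed to hold one path per line):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    /* Sketch: resolve paths by offset from a listing file that
       is opened once and kept open. */
    public class PathResolver implements AutoCloseable {
        private final RandomAccessFile listing; // reused for every lookup

        public PathResolver(String listingFile) throws IOException {
            this.listing = new RandomAccessFile(listingFile, "r");
        }

        /* Seek straight to the offset and read one record; no new
           reader is created per call. Synchronized because
           seek + read is not atomic. */
        public synchronized String getPathByOffset(long offset)
                throws IOException {
            listing.seek(offset);
            return listing.readLine();
        }

        @Override
        public void close() throws IOException {
            listing.close();
        }
    }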
The path in Doc is now optional. This reduces memory consumption,
because we only have to store a long (the offset in the listing file).
This assumes that only a small percentage of Docs is requested.
The last improvement of memory usage introduced a performance
regression: ingestion performance dropped by 50%-80%, because
the Tags were created inefficiently for every inserted entry.
Now we can find writers much faster, because we don't have to
execute a query for documents that match the tags; we can just
look up the documents in the map.
Speedup: 2-4ms -> 0.002-0.01ms
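A sketch of the direct lookup (Tags and PdbWriter are type
parameters standing in for the project's classes; the real Tags
class must implement equals()/hashCode() to serve as a map key):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;

    /* Sketch: look writers up by Tags instead of running a
       document query for every inserted entry. */
    public class WriterRegistry<Tags, PdbWriter> {
        private final ConcurrentHashMap<Tags, PdbWriter> writers =
            new ConcurrentHashMap<>();
        private final Function<Tags, PdbWriter> openWriter;

        public WriterRegistry(Function<Tags, PdbWriter> openWriter) {
            this.openWriter = openWriter;
        }

        public PdbWriter writerFor(Tags tags) {
            // One hash lookup replaces the 2-4ms document query.
            return writers.computeIfAbsent(tags, openWriter);
        }
    }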
Plotting can take a long time and use a lot of resources;
multiple plot requests can make the machine run out of memory.
We are now allowing plots for up to 500k files again. The limit
is mainly there to prevent unwanted plots of everything.
We used to open all PdbReaders in a search result and then iterate
over them. This used a lot of heap space (> 8GB) for 400k files.
Now the PdbReaders are only opened while they are in use. Heap usage
was less than 550 MB while reading more than 400k files.
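A sketch of the just-in-time pattern, with hypothetical names
(ReaderT stands in for PdbReader): at most one reader is open at
any point during the iteration.

    import java.util.function.Consumer;
    import java.util.function.Function;

    /* Sketch: open each file's reader only when the iteration
       reaches it, and close it before the next file is touched. */
    public class LazyReaders<FileT, ReaderT extends AutoCloseable> {

        public void forEach(Iterable<FileT> files,
                            Function<FileT, ReaderT> open,
                            Consumer<ReaderT> consume) throws Exception {
            for (FileT file : files) {
                try (ReaderT reader = open.apply(file)) { // opened just now
                    consume.accept(reader);
                } // closed here: the heap holds one reader, not 400k
            }
        }
    }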
1. Log the exception in PdbFileIterator with a logger instead
of just printing it to stderr.
2. Increase log level for exceptions when inserting entries.
3. Log the exception when the creation of an entry fails in TcpIngestor.
We are using ANTLR listeners to find out where in the
query the cursor is. Then we generate a list of keys/values
that might fit at that position. With that information we
can generate new queries and sort them by the number
of results they yield.
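A sketch of how an ANTLR listener can find the rule under the
cursor. QueryBaseListener would be generated by ANTLR from the
query grammar, so its name is hypothetical; enterEveryRule,
getStart and getStopIndex are real ANTLR 4 APIs.

    import org.antlr.v4.runtime.ParserRuleContext;

    /* Sketch: record the innermost parse rule that contains
       the cursor position. */
    public class CursorListener extends QueryBaseListener {
        private final int cursorOffset;      // character index of the cursor
        private ParserRuleContext innermost; // deepest rule around the cursor

        public CursorListener(int cursorOffset) {
            this.cursorOffset = cursorOffset;
        }

        @Override
        public void enterEveryRule(ParserRuleContext ctx) {
            // Called outermost-first, so the last match is the
            // innermost rule that encloses the cursor.
            int start = ctx.getStart().getStartIndex();
            int stop = ctx.getStop() != null
                ? ctx.getStop().getStopIndex() : start;
            if (start <= cursorOffset && cursorOffset <= stop) {
                innermost = ctx;
            }
        }

        public ParserRuleContext contextAtCursor() {
            return innermost;
        }
    }

The walk itself would be driven by
ParseTreeWalker.DEFAULT.walk(listener, tree); the rule found at
the cursor determines which keys/values are proposed.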
This is done in preparation for the proposal API.
In order to compute proposals we need to consume the
API of the DataStore, but the code does not need to
be in the DataStore.
Extracting the API allows us to separate these concerns.