The old implementation opened a new buffered reader every time
getPathByOffset was called. This took 1/20th of a second or
longer, so queries that visited thousands of files could take a
long time.
We are now using a RandomAccessFile that is opened once. The
average time spent in getPathByOffset is now down to 0.11ms.
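Roughly, the change has this shape (PathLookup and the listing-file
layout are illustrative, not the actual classes):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Hypothetical sketch: one RandomAccessFile is opened for the listing
    // file and reused for every lookup, instead of creating a new buffered
    // reader per call.
    class PathLookup implements AutoCloseable {
        private final RandomAccessFile listing;

        PathLookup(String listingFile) throws IOException {
            this.listing = new RandomAccessFile(listingFile, "r");
        }

        // Seek directly to the stored offset and read the line with the path.
        synchronized String getPathByOffset(long offset) throws IOException {
            listing.seek(offset);
            return listing.readLine();
        }

        @Override
        public void close() throws IOException {
            listing.close();
        }
    }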
The path in Doc is now optional. This reduces memory consumption,
because we only have to store a long (the offset in the listing file).
This assumes that only a small percentage of Docs is requested.
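Building on the sketch above, a Doc could then look roughly like this
(class and method names are made up):

    import java.io.IOException;

    // Hypothetical sketch: a Doc keeps only the 8-byte offset and
    // materializes the path string on demand.
    class Doc {
        private final long pathOffset;

        Doc(long pathOffset) {
            this.pathOffset = pathOffset;
        }

        // Resolving the path is explicit, so it only happens for the small
        // percentage of Docs whose path is actually requested.
        String resolvePath(PathLookup lookup) throws IOException {
            return lookup.getPathByOffset(pathOffset);
        }
    }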
The last improvement of memory usage introduced a performance
regression. The ingestion performance dropped by 50%-80%, because
for every inserted entry the Tags were created inefficiently.
Now we can find the writer much faster, because we don't have to execute
a query for documents that match the tags. We can just look up the
documents in the map.
Speedup: 2-4ms -> 0.002-0.01ms
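Conceptually the fix replaces the per-entry query with a plain map
lookup keyed by the tags. A rough sketch with assumed names (in the
real code the map holds the documents; here it is collapsed to the
writer lookup for brevity):

    import java.io.Writer;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Function;

    // Hypothetical sketch: the writer for a tag combination is cached in a
    // map keyed by a canonical string of the tags, so finding it is a hash
    // lookup instead of a query over all documents that match the tags.
    class WriterRegistry {
        private final Map<String, Writer> writersByTags = new HashMap<>();

        Writer writerFor(String tagKey, Function<String, Writer> createWriter) {
            // computeIfAbsent creates the writer only the first time a tag
            // combination is seen; afterwards it is a constant-time lookup.
            return writersByTags.computeIfAbsent(tagKey, createWriter);
        }
    }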
Plotting can take a long time and use a lot of resources.
Multiple plot requests can cause the machine to run out of memory.
We are now allowing plots for up to 500k files again; the limit is
mainly there to prevent unwanted plots of everything.
We used to open all PdbReaders in a search result and then iterate
over them. This used a lot of heap space (> 8 GB) for 400k files.
Now the PdbReaders are only opened while they are in use. Heap usage
was less than 550 MB while reading more than 400k files.
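The pattern, illustrated generically with BufferedReader standing in
for PdbReader (the real API may differ):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    // Generic illustration: each reader is open only while its file is being
    // consumed and is closed before the next one is opened, so heap usage
    // stays proportional to one file instead of the whole result set.
    class LazyReading {
        static void readAll(List<Path> resultFiles) throws IOException {
            for (Path file : resultFiles) {
                try (BufferedReader reader = Files.newBufferedReader(file)) {
                    reader.lines().forEach(LazyReading::process);
                } // reader is closed here, before the next file is opened
            }
        }

        private static void process(String line) {
            // placeholder for whatever consumes an entry
        }
    }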
1. Log the exception in PdbFileIterator with a logger instead
of just printing it to stderr (a small sketch follows after this list).
2. Increase log level for exceptions when inserting entries.
3. Log the exception when the creation of an entry fails in TcpIngestor.
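For the first point, the change is roughly of this shape (the actual
logging framework and class names are assumptions; java.util.logging
is used here only for illustration):

    import java.util.logging.Level;
    import java.util.logging.Logger;

    // Hypothetical sketch: the exception goes to a logger instead of stderr,
    // so it carries a level, a timestamp and the originating class.
    class PdbFileIteratorLoggingExample {
        private static final Logger LOGGER =
                Logger.getLogger(PdbFileIteratorLoggingExample.class.getName());

        void onReadError(Exception e) {
            // before: e.printStackTrace();
            LOGGER.log(Level.WARNING, "Failed to read next entry from pdb file", e);
        }
    }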
We are using ANTLR listeners to find out where in the
query the cursor is. Then we generate a list of keys/values
that might fit at that position. With that information we
can generate new queries and sort them by the number
of results they yield.
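A rough sketch of the cursor lookup with a plain ANTLR listener (the
generated query grammar and the proposal generation around it are not
shown; the class name is made up):

    import org.antlr.v4.runtime.ParserRuleContext;
    import org.antlr.v4.runtime.tree.ErrorNode;
    import org.antlr.v4.runtime.tree.ParseTreeListener;
    import org.antlr.v4.runtime.tree.TerminalNode;

    // Hypothetical sketch: remembers the innermost rule context that covers
    // the cursor position while the parse tree is walked.
    class CursorLocator implements ParseTreeListener {
        private final int cursorIndex;              // character offset of the cursor
        private ParserRuleContext contextAtCursor;

        CursorLocator(int cursorIndex) {
            this.cursorIndex = cursorIndex;
        }

        @Override
        public void enterEveryRule(ParserRuleContext ctx) {
            int start = ctx.getStart().getStartIndex();
            int stop = ctx.getStop() == null ? start : ctx.getStop().getStopIndex();
            // Rules are entered outermost-first, so the last match is the
            // innermost rule that contains the cursor.
            if (start <= cursorIndex && cursorIndex <= stop) {
                contextAtCursor = ctx;
            }
        }

        ParserRuleContext contextAtCursor() {
            return contextAtCursor;
        }

        @Override public void exitEveryRule(ParserRuleContext ctx) { }
        @Override public void visitTerminal(TerminalNode node) { }
        @Override public void visitErrorNode(ErrorNode node) { }
    }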
This is done in preparation for the proposal API.
In order to compute proposals we need to consume the
API of the DataStore, but the code does not need to
be in the DataStore.
Extracting the API allows us to separate these concerns.
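A hypothetical illustration of the split (interface and method names
are invented for this sketch, not the real extracted API):

    // Hypothetical sketch: the read side is pulled out into an interface so
    // the proposal code can depend on it without living inside the DataStore.
    interface DataStoreApi {
        long countResults(String query);   // assumed operation used by proposals
    }

    class ProposalGenerator {
        private final DataStoreApi dataStore;

        ProposalGenerator(DataStoreApi dataStore) {
            this.dataStore = dataStore;
        }

        // Proposals only need the extracted API, not the DataStore internals.
        long resultCountFor(String candidateQuery) {
            return dataStore.countResults(candidateQuery);
        }
    }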
LuDB has a few disadvantages.
1. Most notably disk space. H2 wastes a lot of valuable disk space.
For my test data set with 44 million entries it is 14 MB
(sometimes a lot more; depends on H2 internal cleanup). With
data-store it is 15 KB.
Overall I could reduce the disk space from 231 MB to 200 MB (13.4 %
in this example). That is an average of 4.6 bytes per entry.
2. Speed:
a) Liquibase is slow. The first time it takes approx. three seconds.
b) Query and insertion: with data-store we can insert entries
up to 1.6 times faster.
Data-store uses a few tricks to save disk space:
1. We encode the tags into the file names.
2. To keep them short we translate the key/value of the tag into
shorter numbers. For example "foo" -> 12 and "bar" -> 47. So the
tag "foo"/"bar" would be 12/47.
We then translate this number into a base-62 numeral system
(a-zA-Z0-9), so it can be used in file names and stays short.
That way we only have to store the mapping of string to int
(see the sketch after this list).
3. We store that mapping in a simple tab-separated file.
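A minimal sketch of the string-to-int mapping and the base-62
rendering (class and method names are made up):

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch: a tag key/value is mapped to a small integer once,
    // and that integer is rendered in base 62 so it stays short enough for
    // file names.
    class TagEncoder {
        private static final char[] ALPHABET =
                "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
                        .toCharArray();

        private final Map<String, Integer> idsByString = new HashMap<>();

        // The first occurrence of a string gets the next free id; this mapping
        // is what ends up in the tab-separated file.
        int idFor(String keyOrValue) {
            return idsByString.computeIfAbsent(keyOrValue, s -> idsByString.size());
        }

        // Render a non-negative id in base 62, e.g. 0 -> "a", 61 -> "9", 62 -> "ba".
        static String toBase62(int id) {
            if (id == 0) {
                return String.valueOf(ALPHABET[0]);
            }
            StringBuilder sb = new StringBuilder();
            for (int n = id; n > 0; n /= 62) {
                sb.append(ALPHABET[n % 62]);
            }
            return sb.reverse().toString();
        }
    }

With this alphabet 12 is rendered as "m" and 47 as "V", so the example
tag only needs two characters in a file name.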
Date increments usually have higher values.
I had hoped to reduce the file size by a lot. But in my example data
with 44 million entries (real life data) it only reduced the storage
size by 1.5%.
Also fixed a bug in PdbReader that prevented other values for the
CONTINUATION byte.
Also added a small testing tool that prints the content of a pdb file.
It is not (yet) made available as a standalone tool, but it is
very useful during debugging sessions.
Before we could only group by a single field. But it is actually
very useful to group by multiple fields. For example to see the
graph for a small set of methods grouped by host and project.
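A minimal sketch of the idea using a composite key (the record and
field names are illustrative):

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    // Hypothetical sketch: grouping by several fields just means grouping by
    // a composite key, here a List of the field values (host and project).
    class MultiFieldGrouping {
        record Entry(String host, String project, String method, long durationMs) { }

        static Map<List<String>, List<Entry>> groupByHostAndProject(List<Entry> entries) {
            return entries.stream()
                    .collect(Collectors.groupingBy(e -> List.of(e.host(), e.project())));
        }
    }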
If no new entry is received for 10 seconds, all open files are
flushed and closed.
We do this to make sure that we do not lose data when the
process is killed.
There is still a risk of data loss if the process is killed
while entries are being received.
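A rough sketch of the idle detection (names and the 1-second check
interval are assumptions; a real implementation would also avoid
re-flushing while it is already idle):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical sketch: a periodic task compares the time of the last
    // received entry with the current time and triggers a flush/close of all
    // writers once the gap exceeds 10 seconds.
    class IdleFlusher {
        private static final long IDLE_MILLIS = 10_000;

        private final AtomicLong lastEntryAt = new AtomicLong(System.currentTimeMillis());
        private final Runnable flushAndCloseAll;   // closes all open files
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        IdleFlusher(Runnable flushAndCloseAll) {
            this.flushAndCloseAll = flushAndCloseAll;
            scheduler.scheduleAtFixedRate(this::checkIdle, 1, 1, TimeUnit.SECONDS);
        }

        // Called for every entry that is ingested.
        void onEntryReceived() {
            lastEntryAt.set(System.currentTimeMillis());
        }

        private void checkIdle() {
            if (System.currentTimeMillis() - lastEntryAt.get() >= IDLE_MILLIS) {
                flushAndCloseAll.run();
            }
        }
    }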