To make it possible to write stable unit tests, the CSV upload
can optionally wait until all entries have been flushed to disk.
This is necessary for tests that ingest data and then read it back.
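One way such a wait can be implemented, shown as a self-contained
sketch; whether the repo actually uses a CountDownLatch (or these
names) is an assumption, only the idea of optionally blocking until
the data is on disk comes from this change:

    import java.util.concurrent.CountDownLatch;

    public class WaitForFlushExample {
        public static void main(String[] args) throws InterruptedException {
            // One latch per upload: the writer counts it down after the
            // final flush, the test thread blocks on it before reading.
            CountDownLatch flushed = new CountDownLatch(1);

            Thread writer = new Thread(() -> {
                // ... ingest entries, flush them to disk ...
                flushed.countDown();   // signal: everything is on disk
            });
            writer.start();

            flushed.await();           // the optional wait used by stable unit tests
            System.out.println("safe to read the ingested data now");
        }
    }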
Recommind uses a pseudo-ISO date format for its log files.
It uses a comma instead of a dot as the seconds-to-milliseconds
separator and it does not include a timezone. Dates without a
timezone are assumed to be UTC.
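A minimal parsing sketch with java.time, assuming the timestamps
look like "2021-03-05T14:22:31,123" (whether date and time are
separated by 'T' or a space is an assumption; adjust the pattern
accordingly):

    import java.time.LocalDateTime;
    import java.time.OffsetDateTime;
    import java.time.ZoneOffset;
    import java.time.format.DateTimeFormatter;

    public class RecommindDateExample {
        // Comma as the fraction separator, no zone in the pattern.
        private static final DateTimeFormatter FMT =
                DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss,SSS");

        public static void main(String[] args) {
            OffsetDateTime ts = LocalDateTime.parse("2021-03-05T14:22:31,123", FMT)
                    .atOffset(ZoneOffset.UTC); // no timezone in the input, assume UTC
            System.out.println(ts);            // 2021-03-05T14:22:31.123Z
        }
    }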
Refactoring to prepare for the addition of CSV parsing rules. The
parsing rules will contain information about which columns to ingest
or ignore. This will be used to add the ability to upload files via
HTTP POST in addition to the TcpIngestor.
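The rules might end up looking roughly like the following sketch;
the record and field names are assumptions, not the actual design:

    import java.util.Set;

    // Hypothetical shape of the planned parsing rules.
    public record CsvParsingRules(String timestampColumn, Set<String> ignoredColumns) {
        public boolean shouldIngest(String column) {
            return !ignoredColumns.contains(column);
        }
    }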
Remove the controller method that returned the VueJS index page.
Add resource handlers that redirect to the Angular application.
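A plausible shape for this, assuming the backend uses Spring MVC;
the class name, redirect target, and resource location below are
assumptions:

    import org.springframework.context.annotation.Configuration;
    import org.springframework.web.servlet.config.annotation.ResourceHandlerRegistry;
    import org.springframework.web.servlet.config.annotation.ViewControllerRegistry;
    import org.springframework.web.servlet.config.annotation.WebMvcConfigurer;

    @Configuration
    public class AngularForwardingConfig implements WebMvcConfigurer {

        @Override
        public void addViewControllers(ViewControllerRegistry registry) {
            // Redirect the root path to the Angular entry point.
            registry.addRedirectViewController("/", "/index.html");
        }

        @Override
        public void addResourceHandlers(ResourceHandlerRegistry registry) {
            // Serve the built Angular bundle as static resources.
            registry.addResourceHandler("/**")
                    .addResourceLocations("classpath:/static/");
        }
    }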
I added two implementations, but activated only one. At the moment I am
not sure which solution is better. We keep both so that we can
easily switch if the need arises.
In the previous changeset the code that determined
which axis the plots used was implemented as a
side effect of getting the Gnuplot definition of
an axis.
Changed that to an explicit update call with simpler
logic.
We do not know which fields exist at compile time.
But it is a great help to have some pre-selected
fields in groupBy.
Solved by adding a configuration option.
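One possible way to expose such an option, assuming Spring Boot
configuration properties are in use; the prefix and property name
are assumptions:

    import java.util.ArrayList;
    import java.util.List;
    import org.springframework.boot.context.properties.ConfigurationProperties;

    @ConfigurationProperties(prefix = "groupby")
    public class GroupByProperties {
        /** Field names offered as pre-selected choices in groupBy. */
        private List<String> preselectedFields = new ArrayList<>();

        public List<String> getPreselectedFields() { return preselectedFields; }
        public void setPreselectedFields(List<String> fields) { this.preselectedFields = fields; }
    }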
When using autocomplete to return field values I missed
that autocomplete cuts values at dots. So instead of
returning full field values, only the prefix up to the
first dot was returned.
Fixed by making the cut-at-dot feature optional.
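The old and new behaviour in isolation, as a hypothetical helper
(the real autocomplete method in this repo has a different signature):

    public class CutAtDotExample {
        // cutAtDot = true reproduces the old behaviour: only the prefix
        // up to the first dot is returned.
        static String completeValue(String fullValue, boolean cutAtDot) {
            if (!cutAtDot) {
                return fullValue;   // new behaviour: the full field value
            }
            int dot = fullValue.indexOf('.');
            return dot < 0 ? fullValue : fullValue.substring(0, dot);
        }

        public static void main(String[] args) {
            System.out.println(completeValue("com.example.Service", true));  // com
            System.out.println(completeValue("com.example.Service", false)); // com.example.Service
        }
    }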
We had a method that returned the values of a field
with respect to a query. That method was inefficient
because it executed the query, fetched all Docs,
and collected the values.
The autocomplete method we introduced a while back
can answer the same question but much more efficiently.
In order to prevent files from getting too big and to
make it easier to implement retention policies, we
are splitting all files into chunks. Each chunk
contains the data for a time interval (one month by
default).
This first changeset introduces the ClusteredPersistentMap,
which implements this for PersistentMap. It is used
for a couple of the indices (not all of them).
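As a small, runnable illustration of the chunking granularity; how
ClusteredPersistentMap actually names or keys its chunks is an
assumption here:

    import java.time.Instant;
    import java.time.YearMonth;
    import java.time.ZoneOffset;

    public class ChunkKeyExample {
        // Derive the chunk (one month per default) that an entry's
        // timestamp falls into.
        static YearMonth chunkOf(long epochMilli) {
            return YearMonth.from(Instant.ofEpochMilli(epochMilli).atZone(ZoneOffset.UTC));
        }

        public static void main(String[] args) {
            System.out.println(chunkOf(1_614_556_800_000L)); // 2021-03
        }
    }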
Due to a mistake in Tag, which added every string used
by Tag to the String dictionary, the dictionary ended up
containing all values that were used in queries.
We are now reading the CSV input without transforming
the data into strings. This reduces the number of bytes
that have to be converted and copied.
We also made Tag smaller. It no longer stores pointers
to strings; instead it stores integers obtained by
compressing the strings (see StringCompressor). This
reduces memory usage and speeds up hashCode and
equals, which speeds up access to the writer cache.
Performance gain is almost 100%:
- 330k entries/s -> 670k entries/s, top speed measured over a second
- 62s -> 32s, to ingest 16 million entries
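A rough sketch of what the slimmed-down Tag could look like; the
field names and the idea of plain int ids from StringCompressor are
assumptions, not the actual class:

    public final class Tag {
        private final int name;    // compressed field name (id from StringCompressor)
        private final int value;   // compressed field value

        public Tag(int name, int value) {
            this.name = name;
            this.value = value;
        }

        @Override
        public int hashCode() {
            return 31 * name + value;   // two int reads, no string hashing
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Tag t && name == t.name && value == t.value;
        }
    }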
One bottleneck was the blocking queue used to transport entries
from the listener thread to the ingestor thread.
Reduced the bottleneck by batching entries.
Interestingly, a batch size of 100 performed better than
both a batch size of 1000 and a batch size of 10.
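A minimal, self-contained sketch of the batching idea over a
BlockingQueue; the entry type and queue capacity are stand-ins, only
the batch size of 100 comes from the measurements above:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class BatchingExample {
        static final int BATCH_SIZE = 100;

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(64);

            // Listener side: collect entries into batches and hand over whole
            // lists, so the queue synchronisation cost is paid once per batch.
            List<String> batch = new ArrayList<>(BATCH_SIZE);
            for (int i = 0; i < 250; i++) {
                batch.add("entry-" + i);
                if (batch.size() == BATCH_SIZE) {
                    queue.put(batch);                 // one blocking operation per 100 entries
                    batch = new ArrayList<>(BATCH_SIZE);
                }
            }
            if (!batch.isEmpty()) {
                queue.put(batch);                     // flush the trailing partial batch
            }
            System.out.println(queue.size() + " batches queued"); // 3 batches queued
        }
    }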
I was finally able to show that there is a tiny but measurable
effect of this buffer. I think it was not visible before
because the parsing was too slow. But now that I have replaced
the date parser, the ingestion thread is twice as fast as the
insertion thread. Therefore the buffer makes more sense.
Compared to FastISODateParser.parse, which returns an
OffsetDateTime object, parseAsEpochMilli returns the
epoch time in milliseconds. The performance improvement for
date parsing alone is roughly 100% (8m dates/s to
18m dates/s).
Insertion speed improved from 13-14s for 1.6m entries
to 11.5-12.5s.
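A rough, self-contained sketch of the parse-straight-to-epoch-millis
idea, assuming the comma-separated format from the earlier commit and
UTC. This is not the actual FastISODateParser code; the real parser
may differ, e.g. it may also avoid the LocalDate used here for the
day arithmetic:

    import java.time.LocalDate;

    public class EpochMilliParseExample {
        // Parse "2021-03-05T14:22:31,123" (assumed UTC) straight into epoch
        // millis, skipping the DateTimeFormatter and OffsetDateTime allocations.
        static long parseAsEpochMilli(CharSequence s) {
            int year  = digits(s, 0, 4);
            int month = digits(s, 5, 2);
            int day   = digits(s, 8, 2);
            int hour  = digits(s, 11, 2);
            int min   = digits(s, 14, 2);
            int sec   = digits(s, 17, 2);
            int milli = digits(s, 20, 3);
            long epochDay = LocalDate.of(year, month, day).toEpochDay();
            return ((epochDay * 24 + hour) * 60 + min) * 60_000L + sec * 1000L + milli;
        }

        private static int digits(CharSequence s, int from, int len) {
            int v = 0;
            for (int i = from; i < from + len; i++) {
                v = v * 10 + (s.charAt(i) - '0');
            }
            return v;
        }

        public static void main(String[] args) {
            System.out.println(parseAsEpochMilli("2021-03-05T14:22:31,123")); // 1614954151123
        }
    }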