Commit Graph

81 Commits

Author SHA1 Message Date
andi 6eaf4e10fc add maxSize parameter to HotEntryCache 2019-08-24 19:24:20 +02:00
andi feda901f6d remove event types
We only have removal events. The additional complexity
of having a generic interface for many different event
types does not pay off.
2019-08-18 20:30:25 +02:00
andi 4d9ea6d2a8 switch back to my own HotEntryCache implementation
Guava's cache does not evict elements reliably by
time. If you configure a cache with a lifetime of n
seconds, you cannot expect that an element is
actually evicted after n seconds with Guava.
2019-08-18 20:14:14 +02:00
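Below, a minimal sketch of the strict time-based eviction a hand-rolled cache can enforce on every read; the class name SimpleHotEntryCache is hypothetical, and the real HotEntryCache is not shown in this log:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch: the lifetime is checked on every get(), so an
    // entry older than 'lifetime' is never returned, unlike a cache that
    // only evicts lazily during maintenance.
    final class SimpleHotEntryCache<K, V> {
        private static final class Timed<V> {
            final V value;
            final Instant insertedAt;
            Timed(V value, Instant insertedAt) { this.value = value; this.insertedAt = insertedAt; }
        }

        private final Map<K, Timed<V>> entries = new ConcurrentHashMap<>();
        private final Duration lifetime;

        SimpleHotEntryCache(Duration lifetime) { this.lifetime = lifetime; }

        void put(K key, V value) { entries.put(key, new Timed<>(value, Instant.now())); }

        V get(K key) {
            Timed<V> t = entries.get(key);
            if (t == null) return null;
            if (t.insertedAt.plus(lifetime).isBefore(Instant.now())) {
                entries.remove(key);  // enforce the lifetime on access
                return null;
            }
            return t.value;
        }
    }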
andi 3252fcf42d improve trace logging
- Add filename for trace logs for read/write operations.
2019-08-18 09:25:49 +02:00
andi 0b3eb97b96 Fix toString for maps with values of type Empty
The MAX_KEY inserted into the tree had a value of one byte. This
triggered an assertion for maps with values of type Empty, because
the assertion expects values to be empty.
Fixed by using an empty array for the value of the MAX_KEY.
2019-08-12 08:35:40 +02:00
andi 9fb1a136c8 cache last used date prefix
The 99.9999% use case is to ingest data
from the same month.
2019-04-22 09:51:44 +02:00
andi 56085061ed do not return anything if the field/value does not exist
The computation of proposals is done by searching for values in a
combined index. If one of the values didn't exist, then the algorithm
returned all values. Fixed by checking that we query only existing
field/values from the combined index.
2019-04-20 19:48:51 +02:00
andi dbe0e02517 rename cluster to partition
We are not clustering the indices, we
are partitioning them.
2019-04-14 10:10:16 +02:00
andi 2a1885a77f cluster the indices 2019-03-31 09:01:55 +02:00
andi 95f2f26966 handle IOExceptions earlier 2019-03-17 11:13:46 +01:00
andi 5d0ceb112e add clustering for DiskStore 2019-03-17 10:53:02 +01:00
andi b5e2d0a217 introduce clustering for query completion indices 2019-03-16 10:19:28 +01:00
andi fb9f8592ac make ClusteredPersistentMap easier to use 2019-02-24 19:20:44 +01:00
andi 59aea1a15f introduce index clustering (part 1)
In order to prevent files from getting too big and
make it easier to implement retention policies, we
are splitting all files into chunks. Each chunk
contains the data for a time interval (1 month per
default).
This first changeset introduces the ClusteredPersistentMap
that implements this for PersistentMap. It is used
for some (but not all) of the indices.
2019-02-24 16:50:57 +01:00
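One plausible reading of the chunking scheme, sketched below; MonthlyChunks and chunkKeyFor are hypothetical names, and an in-memory HashMap stands in for the persistent storage:

    import java.time.Instant;
    import java.time.YearMonth;
    import java.time.ZoneOffset;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch: route each entry to the chunk that owns its month,
    // so no single file grows without bound and retention can drop whole months.
    final class MonthlyChunks<V> {
        private final Map<YearMonth, Map<Long, V>> chunks = new HashMap<>();

        private static YearMonth chunkKeyFor(Instant time) {
            return YearMonth.from(time.atZone(ZoneOffset.UTC));
        }

        void put(Instant time, long key, V value) {
            chunks.computeIfAbsent(chunkKeyFor(time), month -> new HashMap<>())
                  .put(key, value);
        }

        // Retention policy: deleting a month means dropping a single chunk.
        void dropMonth(YearMonth month) {
            chunks.remove(month);
        }
    }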
andi 372a073b6d PdbWriter is no longer in the API of DataStore 2019-02-16 16:24:14 +01:00
andi 92a47d9b56 remove TagsToFile
Remove one layer of abstraction by moving the code into the DataStore.
2019-02-16 16:06:46 +01:00
andi 117ef4ea34 use guava's cache as implementation for the HotEntryCache
My own implementation was faster, but it did not
support a size limit.
2019-02-16 10:23:52 +01:00
andi 7b00eede86 refactoring: extract EncoderDecoders from DataStore 2019-02-16 09:16:15 +01:00
andi cbcb7714bb split BSFile into a TimeSeries and a LongStream file
BSFile was used to store two types of data. This makes
the API complex. I split the API into two files with
simpler and clearer APIs. Interestingly the API of
BSFile is still rather complex and has to consider both
use cases.
2019-02-10 09:59:16 +01:00
andi 27b83234cc group proposals as if they were hierarchical
We interpret dots ('.') as hierarchy delimiters.
That way we can reduce the number of proposed values
and show only those for the next level.
2019-02-09 15:21:35 +01:00
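A minimal sketch of that grouping, assuming dot-delimited values; the class and method names are hypothetical:

    import java.util.List;
    import java.util.Set;
    import java.util.TreeSet;

    final class ProposalGrouping {
        // Reduce full values to their next hierarchy level below 'prefix'.
        // E.g. with prefix "a.", the values "a.b.c" and "a.b.d" collapse
        // into the single proposal "a.b".
        static Set<String> nextLevel(List<String> values, String prefix) {
            Set<String> proposals = new TreeSet<>();
            for (String value : values) {
                if (!value.startsWith(prefix)) continue;
                int dot = value.indexOf('.', prefix.length());
                proposals.add(dot < 0 ? value : value.substring(0, dot));
            }
            return proposals;
        }
    }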
andi 493971bcf3 values used in queries were added to the keys.csv
Due to a mistake in Tag, which added all strings used
by Tag to the String dictionary, the dictionary
contained all values that were used in queries.
2019-02-09 08:28:23 +01:00
andi ea5884a5e6 move creation of PdbWriter to the DataStore 2019-02-07 18:06:41 +01:00
andi 99cdf557b3 add metric logger for query completion evaluation 2019-02-06 15:51:41 +00:00
andi 668d73c926 introduced a new custom file format used for backup and ingestion
The new file format reduces repetition, is easy to parse,
easy to generate in any language, and is human readable.
2019-02-03 15:44:35 +01:00
andi d4d1685f9f replace stdout with logger 2019-02-02 16:49:21 +01:00
andi 151e9363e1 remove obsolete classes 2019-02-02 16:45:34 +01:00
andi 76e5d441de rewrite query completion
The old implementation searched for all possible values and then
executed each query to see what matches.
The new implementation uses several indices to find only
the matching values.
2019-02-02 15:35:56 +01:00
andi 72e9a9ebe3 prepare more efficient query completion
adding an index that answers the question:
given a query "a=b and c=", what are the possible
values for c?
2019-01-13 10:22:17 +01:00
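A sketch of what such a combined index could look like; the representation (a string key joining the fixed tag with the open field) is an assumption, not the repo's actual layout:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical sketch of an index answering: given "a=b and c=",
    // which values of c actually co-occur with a=b?
    final class CombinedIndex {
        private final Map<String, Set<String>> index = new HashMap<>();

        private static String key(String fixedField, String fixedValue, String openField) {
            return fixedField + '=' + fixedValue + '|' + openField;
        }

        // Called during ingestion for every (fixed tag, other field) pair of an entry.
        void add(String fixedField, String fixedValue, String openField, String openValue) {
            index.computeIfAbsent(key(fixedField, fixedValue, openField), k -> new HashSet<>())
                 .add(openValue);
        }

        Set<String> complete(String fixedField, String fixedValue, String openField) {
            return index.getOrDefault(key(fixedField, fixedValue, openField), Set.of());
        }
    }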
andi 5197063ae3 the union of many small lists is expensive
The reason seems to be the number of memory allocations. In order
to create the union of 100 lists we have 99 memory allocations.
The first needs the space for the first two lists, the second the
space for the first three lists, and so on.

We can reduce the number of allocations drastically (in many
cases to one) by leveraging the fact that many of the lists
were already sorted, non-overlapping and increasing, so that
we can simply concatenate them.
2019-01-05 08:52:56 +01:00
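A sketch of the optimization under exactly that precondition, assuming the lists are plain long arrays: size the result once, then concatenate.

    import java.util.List;

    final class SortedListUnion {
        // Assumes every list is sorted and list i ends before list i+1
        // starts, so the union degenerates to a single concatenation
        // into one pre-sized array: one allocation instead of n-1.
        static long[] unionOfDisjointSorted(List<long[]> lists) {
            int total = 0;
            for (long[] list : lists) total += list.length;  // size once
            long[] result = new long[total];                 // one allocation
            int offset = 0;
            for (long[] list : lists) {
                System.arraycopy(list, 0, result, offset, list.length);
                offset += list.length;
            }
            return result;
        }
    }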
andi 4cde10a9f2 read csv using input stream instead of reader
We are now reading the CSV input without transforming
the data into strings. This reduces the amount of bytes
that have to be converted and copied.
We also made Tag smaller. It no longer stores pointers
to strings; instead it stores integers obtained by
compressing the strings (see StringCompressor). This
reduces memory usage and speeds up hashCode and
equals, which speeds up access to the writer cache.

Performance gain is almost 100%:
- 330k entries/s -> 670k entries/s, top speed measured over a second
- 62s -> 32s, to ingest 16 million entries
2019-01-01 08:31:28 +01:00
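A sketch of the idea behind compressing strings to integers; StringCompressorSketch and TagSketch are hypothetical stand-ins for the repo's StringCompressor and Tag:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch: intern each distinct string to a dense int id.
    final class StringCompressorSketch {
        private final Map<String, Integer> ids = new HashMap<>();

        synchronized int compress(String s) {
            return ids.computeIfAbsent(s, key -> ids.size());
        }
    }

    // A tag of two ints instead of two object references: smaller, and
    // hashCode/equals become trivially cheap integer operations.
    final class TagSketch {
        final int key;    // compressed field name
        final int value;  // compressed field value

        TagSketch(int key, int value) { this.key = key; this.value = value; }

        @Override public int hashCode() { return 31 * key + value; }

        @Override public boolean equals(Object o) {
            return o instanceof TagSketch
                && ((TagSketch) o).key == key
                && ((TagSketch) o).value == value;
        }
    }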
andi e537e94d39 HotEntryCache will update Instants only once per second
Calling Instant.now() several hundred thousand times per
second can be expensive. In my measurements, >10% of the
time spent when loading new data was spent calling
Instant.now().
Fixed this by storing an Instant as a static member and
updating it periodically in a separate thread.
2018-12-21 19:16:55 +01:00
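A minimal sketch of that fix: a volatile Instant refreshed once per second by a daemon thread. The class name CoarseClock is hypothetical:

    import java.time.Instant;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch: hot paths read the cached field instead of
    // calling Instant.now() hundreds of thousands of times per second.
    final class CoarseClock {
        private static volatile Instant now = Instant.now();

        static {
            ScheduledExecutorService ticker =
                Executors.newSingleThreadScheduledExecutor(runnable -> {
                    Thread thread = new Thread(runnable, "coarse-clock");
                    thread.setDaemon(true);  // never blocks JVM shutdown
                    return thread;
                });
            ticker.scheduleAtFixedRate(() -> now = Instant.now(), 1, 1, TimeUnit.SECONDS);
        }

        static Instant now() { return now; }  // at most about one second stale
    }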
andi 253bbabd19 cleanup
remove debug output
2018-11-25 07:49:23 +00:00
andi 593752470c cleanup 2018-11-25 07:46:58 +01:00
andi f78f69328b add cache for docId to Doc mapping
A Doc does not change once it is created, so it is easy to cache.
The time dropped from 1ms per Doc to 3ms for all 444 Docs (≈0.00676ms/Doc).
2018-11-22 19:51:07 +01:00
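Because a Doc is immutable, a plain memoizing map suffices; a minimal sketch with hypothetical names:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.LongFunction;

    // Hypothetical sketch: cached entries never go stale because a Doc
    // does not change once it is created.
    final class DocCache<D> {
        private final Map<Long, D> cache = new ConcurrentHashMap<>();
        private final LongFunction<D> loader;  // loads a Doc from storage

        DocCache(LongFunction<D> loader) { this.loader = loader; }

        D get(long docId) {
            return cache.computeIfAbsent(docId, id -> loader.apply(id));
        }
    }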
andi afd1e36066 fix unsupported operation exception when adding to an unmodifiable set 2018-11-19 19:19:51 +01:00
andi 135ab42cd8 tags are now stored as variable-length encoded longs
Replaced Tags.filenameBytes with a SortedSet<Tag>. Tags are now
stored as longs (variable-length encoded) in the PersistentMap.
Tags.filenameBytes was introduced to reduce memory consumption when
all tags were held in memory. Tags are now stored in a PersistentMap
and only read when needed.

Moved the VariableByteEncoder into its own project, because it was
needed by pdb-api.
2018-11-17 20:03:46 +01:00
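A sketch of a standard variable-byte encoding for longs, in the spirit of the VariableByteEncoder mentioned above; this is the textbook scheme (7 payload bits per byte, high bit as continuation flag), not necessarily the project's exact wire format:

    import java.io.ByteArrayOutputStream;

    final class VByte {
        // Encode an unsigned long: 7 payload bits per byte, high bit set
        // on every byte except the last. Small values take a single byte.
        static void encode(long value, ByteArrayOutputStream out) {
            while ((value & ~0x7FL) != 0) {
                out.write((int) ((value & 0x7F) | 0x80));
                value >>>= 7;
            }
            out.write((int) value);
        }

        static long decode(byte[] bytes, int offset) {
            long value = 0;
            int shift = 0;
            while (true) {
                byte b = bytes[offset++];
                value |= (long) (b & 0x7F) << shift;
                if ((b & 0x80) == 0) return value;
                shift += 7;
            }
        }
    }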
andi fce0f6a04d use PersistentMap in DataStore
Replaces the use of in-memory data structures with the PersistentMap.
This is the crucial step in reducing memory usage for both persistent
storage and main memory.
2018-11-17 09:45:35 +01:00
andi bd88c63aff ensure BSFiles use blocks that are aligned to 512-byte offsets 2018-10-14 09:00:26 +02:00
andi 0539080200 use byte offsets instead of block numbers
We want to allow arbitrary allocations in DiskStorage. The
first step was to change the hard-coded block size into a
dynamic one.
2018-10-12 08:10:43 +02:00
andi 979d3269fa remove obsolete classes and methods 2018-10-04 18:46:51 +02:00
andi 8939332004 remove the wrapper class PdbDB
It did not serve any purpose and could be replaced by DataStore.
2018-10-04 18:43:27 +02:00
andi 24fcfd7763 prepare the addition of a date index 2018-09-28 19:07:01 +02:00
andi a2e63cca44 cleanup 2018-09-13 08:11:15 +02:00
andi c6a1291ee6 the pattern must match the property value exactly,
when matching property values to the query. This is important when
you have a property value that is a prefix of another property value,
e.g., AuditService.logEvent and AuditService.logEvents.
2018-09-13 07:55:13 +02:00
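Whether that comparison is string equality or an anchored regex is not visible from the log; a regex-flavored sketch of the fix:

    import java.util.regex.Pattern;

    final class ValueMatch {
        // Pattern.matcher(...).matches() requires the whole value to match,
        // so the pattern for "AuditService.logEvent" no longer also accepts
        // "AuditService.logEvents"; find() would match the prefix.
        static boolean matches(Pattern pattern, String propertyValue) {
            return pattern.matcher(propertyValue).matches();
        }
    }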
andi 61f131571a add CamelCase matching to the query language 2018-09-12 13:42:23 +02:00
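A sketch of one common way to implement CamelCase matching; this is not necessarily the repo's algorithm:

    final class CamelCaseMatch {
        // Case-sensitive camel-hump sketch: pattern characters must appear
        // in order, and an uppercase candidate character can only be skipped
        // while the pending pattern character is not itself uppercase.
        // E.g. "AS" matches "AuditService" but not "AuditLogService".
        static boolean matches(String pattern, String candidate) {
            return matches(pattern, 0, candidate, 0);
        }

        private static boolean matches(String p, int pi, String c, int ci) {
            if (pi == p.length()) return true;   // whole pattern consumed
            if (ci == c.length()) return false;  // candidate exhausted first
            if (p.charAt(pi) == c.charAt(ci) && matches(p, pi + 1, c, ci + 1)) {
                return true;
            }
            boolean humpBoundary = Character.isUpperCase(c.charAt(ci));
            if (humpBoundary && Character.isUpperCase(p.charAt(pi))) {
                return false;  // an expected hump was not matched
            }
            return matches(p, pi, c, ci + 1);  // skip a non-hump character
        }
    }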
andi 86b8f93752 replace 'in' queries with a simpler syntax
field in (val1, val2)
was replaced with
field=val1,val2
or
field=(val1, val2)
2018-09-12 10:10:01 +02:00
andi 1182d76205 replace the FolderStorage with DiskStorage
- The DiskStorage uses only one file instead of millions.
  Also the block size is only 512 bytes instead of 4 KB, which
  helps to reduce the memory usage for short sequences.
- Update primitiveCollections to get the new LongList.range
  and LongList.rangeClosed methods.
- BSFile now stores Time&Value sequences and knows how to
  encode the time values with delta encoding.
- Doc had to do some magic tricks to save memory. The path
  was initialized lazily and stored as a byte array. This is no
  longer necessary. The path was replaced by the
  rootBlockNumber of the BSFile.
- Had to temporarily disable the 'in' queries.
- The stored values are now processed as a stream of LongLists
  instead of Entries. The overhead for creating Entries is
  gone, and so is the memory overhead: Entry was an
  object holding a reference to the tags, which is
  unnecessary.
2018-09-12 09:35:07 +02:00
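A sketch of the delta encoding mentioned for BSFile's time values, assuming an ascending sequence of longs; the resulting gaps are small and pair naturally with a variable-byte encoding:

    final class DeltaEncoding {
        // Assumes 'times' is sorted ascending.
        static long[] encode(long[] times) {
            long[] deltas = new long[times.length];
            long previous = 0;
            for (int i = 0; i < times.length; i++) {
                deltas[i] = times[i] - previous;  // first delta is the value itself
                previous = times[i];
            }
            return deltas;
        }

        static long[] decode(long[] deltas) {
            long[] times = new long[deltas.length];
            long previous = 0;
            for (int i = 0; i < deltas.length; i++) {
                previous += deltas[i];
                times[i] = previous;
            }
            return times;
        }
    }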
andi ea5e16fad5 expressions now support in-queries 2018-08-18 10:31:49 +02:00
andi 182d1edd97 add a datetime picker
Unfortunately the datetime picker does not support seconds. But it is
one of the few that support both date and time and are flexible enough to
be used with VueJS.
2018-08-04 08:32:04 +00:00
andi b61a34a0e6 use existing RandomAccessFile when updating the listing file
Ingestion speed dropped drastically with the old implementation,
in some situations to 7 entries per second over a 10-second period
(sic!). When using the already opened RandomAccessFile, the speed
is back to previous values of 40k-50k entries per second on my
10-year-old machine with an encrypted spinning disk.
2018-05-10 17:41:50 +02:00
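A sketch of the general pattern with hypothetical names: keep one RandomAccessFile open per listing file instead of reopening it for every append:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Hypothetical sketch: the slow variant opened and closed the file for
    // each appended entry; reusing a single open handle avoids that cost.
    final class ListingFile implements AutoCloseable {
        private final RandomAccessFile file;

        ListingFile(String path) throws IOException {
            this.file = new RandomAccessFile(path, "rw");
        }

        void append(byte[] record) throws IOException {
            file.seek(file.length());  // no reopen, just seek to the end
            file.write(record);
        }

        @Override public void close() throws IOException {
            file.close();
        }
    }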