112 Commits

Author SHA1 Message Date
11beda5432 make logging of insertion speed a little nicer 2020-11-24 10:00:53 +01:00
3e77c2a103 various fixes 2020-08-11 16:12:18 +02:00
9a311313ec use US locale to format strings
This is especially important for all strings that are
passed to gnuplot. Because gnuplot uses the US locale
during parsing.
2020-03-12 19:40:20 +01:00
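A minimal sketch of the idea, assuming the values are formatted with `String.format`. The class name `GnuplotNumberFormat` and the `%.3f` pattern are made up for illustration and are not taken from the project.

```java
import java.util.Locale;

// Hypothetical helper: formats numbers that end up in a gnuplot script.
final class GnuplotNumberFormat {

    // Passing Locale.US explicitly keeps '.' as the decimal separator,
    // regardless of the default locale of the JVM.
    static String format(double value) {
        return String.format(Locale.US, "%.3f", value);
    }

    public static void main(String[] args) {
        System.out.println(format(1234.5));
        // For comparison: a German locale would use ',' as the decimal separator.
        System.out.println(String.format(Locale.GERMANY, "%.3f", 1234.5));
    }
}
```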
5d8df6888d move Entry and Entries to data-store 2019-12-13 18:15:10 +01:00
550d7ba44e add flag to make CSV upload wait until entries are flushed
To make it easier/possible to write stable unit tests, the CSV upload
can optionally wait until all entries have been flushed to disk.
This is necessary for tests that ingest data and then read the data.
2019-12-13 18:05:20 +01:00
07ad62ddd9 use Junit5 instead of TestNG
We want to be able to use @SpringBootTest tests that fully initialize
the Spring application. This is much easier done with Junit than TestNG.
Gradle does not support (at least not easily) running both Junit and TestNG
tests. Therefore we switch to Junit for all tests.
The original reason for using TestNG was that Junit didn't support
data providers. But that finally changed in Junit5 with
ParameterizedTest.
2019-12-13 14:33:20 +01:00
85679ca0c8 send CSV file via REST 2019-12-08 18:39:43 +01:00
06b379494f apply new code formatter and save action 2019-11-24 10:20:43 +01:00
2f35978184 fetch available values for gallery via autocomplete method
We had a method that returned the values of a field
with respect to a query. That method was inefficient,
because it executed the query, fetched all Docs
and collected the values.
The autocomplete method we introduced a while back
can answer the same question but much more efficiently.
2019-08-25 18:52:05 +02:00
dfe9579726 use DateTimeRange.max() instead of arbitrary relative range 2019-04-20 20:36:26 +02:00
dbe0e02517 rename cluster to partition
We are not clustering the indices, we
are partitioning them.
2019-04-14 10:10:16 +02:00
5d0ceb112e add clustering for DiskStore 2019-03-17 10:53:02 +01:00
b5e2d0a217 introduce clustering for query completion indices 2019-03-16 10:19:28 +01:00
59aea1a15f introduce index clustering (part 1)
In order to prevent files from getting too big and
make it easier to implement retention policies, we
are splitting all files into chunks. Each chunk
contains the data for a time interval (1 month by
default).
This first changeset introduces the ClusteredPersistentMap
that implements this for PersistentMap. It is used
for a couple (not all) of indices.
2019-02-24 16:50:57 +01:00
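The chunking idea can be sketched roughly as follows. This is not the project's ClusteredPersistentMap. `MonthlyPartitionedMap` is a hypothetical in-memory stand-in that assumes one partition per calendar month in UTC. It only illustrates why retention becomes cheap: dropping old data means dropping whole partitions.

```java
import java.time.Instant;
import java.time.YearMonth;
import java.time.ZoneOffset;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration: one inner map per calendar month (UTC).
final class MonthlyPartitionedMap<K, V> {
    private final TreeMap<YearMonth, Map<K, V>> partitions = new TreeMap<>();

    // Maps a timestamp to the partition (month) it belongs to.
    static YearMonth partitionOf(long epochMilli) {
        return YearMonth.from(Instant.ofEpochMilli(epochMilli).atOffset(ZoneOffset.UTC));
    }

    void put(long epochMilli, K key, V value) {
        partitions.computeIfAbsent(partitionOf(epochMilli), p -> new HashMap<>()).put(key, value);
    }

    V get(long epochMilli, K key) {
        Map<K, V> partition = partitions.get(partitionOf(epochMilli));
        return partition == null ? null : partition.get(key);
    }

    // Retention: removes every partition strictly older than the cutoff month
    // and returns how many partitions were dropped.
    int dropPartitionsBefore(YearMonth cutoff) {
        SortedMap<YearMonth, Map<K, V>> expired = partitions.headMap(cutoff);
        int dropped = expired.size();
        expired.clear();
        return dropped;
    }

    int partitionCount() {
        return partitions.size();
    }
}
```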
372a073b6d PdbWriter is no longer in the API of DataStore 2019-02-16 16:24:14 +01:00
92a47d9b56 remove TagsToFile
Remove one layer of abstraction by moving the code into the DataStore.
2019-02-16 16:06:46 +01:00
117ef4ea34 use guava's cache as implementation for the HotEntryCache
My own implementation was faster, but was not able to
implement a size limitation.
2019-02-16 10:23:52 +01:00
493971bcf3 values used in queries were added to the keys.csv
Due to a mistake in Tag which added all strings used
by Tag into the String dictionary, the dictionary
did contain all values that were used in queries.
2019-02-09 08:28:23 +01:00
ea5884a5e6 move creation of PdbWriter to the DataStore 2019-02-07 18:06:41 +01:00
58bfba23bb reset lastEpochMilli when opening a new export file 2019-02-06 15:52:37 +00:00
668d73c926 introduced a new custom file format used for backup and ingestion
The new file format reduces repetition, is easy to parse,
easy to generate in any language and is human readable.
2019-02-03 15:44:35 +01:00
f2d16b6758 make CacheKey comparable
The CacheKey is used as a key in a HashMap. Lookup can
be faster if the CacheKey is comparable when there are
hash collisions.
In this case I was not able to measure any effect. I am
keeping the comparables nonetheless, because they can
only have a positive effect.
2019-01-01 08:47:48 +01:00
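For context: since Java 8, `HashMap` turns buckets with many colliding keys into balanced trees and uses `compareTo` to order keys that implement `Comparable`. The sketch below uses a hypothetical `ExampleCacheKey` (the fields of the real CacheKey are not shown in this log) with a deliberately weak hash code so that all keys collide.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical key class; field names are assumptions for illustration.
final class ExampleCacheKey implements Comparable<ExampleCacheKey> {
    private final String name;
    private final long id;

    ExampleCacheKey(String name, long id) {
        this.name = name;
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ExampleCacheKey)) return false;
        ExampleCacheKey other = (ExampleCacheKey) o;
        return id == other.id && name.equals(other.name);
    }

    // Deliberately weak hash: keys with the same name land in the same bucket.
    @Override
    public int hashCode() {
        return name.hashCode();
    }

    // Consistent with equals; lets HashMap order colliding keys inside a tree bin.
    @Override
    public int compareTo(ExampleCacheKey other) {
        int byName = name.compareTo(other.name);
        return byName != 0 ? byName : Long.compare(id, other.id);
    }

    // Builds a map in which every key collides with every other key.
    static Map<ExampleCacheKey, Long> collidingMap(int size) {
        Map<ExampleCacheKey, Long> map = new HashMap<>();
        for (long i = 0; i < size; i++) {
            map.put(new ExampleCacheKey("same", i), i);
        }
        return map;
    }
}
```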
e537e94d39 HotEntryCache will update Instants only once per second
Calling Instant.now() several hundred thousand times per
second can be expensive. In my measurements >10% of the
time spent loading new data was spent calling
Instant.now().
Fixed this by storing an Instant as static member and
updating it periodically in a separate thread.
2018-12-21 19:16:55 +01:00
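One way such a coarse clock could look, as a sketch only: the class name `CoarseClock`, the one-second refresh interval and the use of a `ScheduledExecutorService` are assumptions, not the actual HotEntryCache code.

```java
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical coarse clock: readers get a cached Instant that is at most
// about one second old instead of calling Instant.now() on every access.
final class CoarseClock {
    // volatile so that readers see the value written by the updater thread
    private static volatile Instant now = Instant.now();

    private static final ScheduledExecutorService UPDATER =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread thread = new Thread(runnable, "coarse-clock-updater");
                thread.setDaemon(true); // must not keep the JVM alive
                return thread;
            });

    static {
        UPDATER.scheduleAtFixedRate(() -> now = Instant.now(), 1, 1, TimeUnit.SECONDS);
    }

    private CoarseClock() {
    }

    // Cheap read of the cached timestamp.
    static Instant now() {
        return now;
    }
}
```

The trade-off is precision: timestamps read through such a clock can lag behind the real time by up to the refresh interval.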
d95a71e32e batch entries between TcpIngestor and PerformanceDB
One bottleneck was the blocking queue used to transport entries
from the listener thread to the ingestor thread.
Reduced the bottleneck by batching entries.
Interestingly, a batch size of 100 performed better than both
1000 and 10.
2018-12-21 13:11:35 +01:00
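The batching approach might look roughly like this. `BatchingSender` is a hypothetical name, and the real code between TcpIngestor and PerformanceDB may differ. The point is that the blocking queue is touched once per batch instead of once per entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

// Hypothetical producer side: collects entries and hands them over in batches.
final class BatchingSender<T> {
    private final BlockingQueue<List<T>> queue;
    private final int batchSize;
    private List<T> batch;

    BatchingSender(BlockingQueue<List<T>> queue, int batchSize) {
        this.queue = queue;
        this.batchSize = batchSize;
        this.batch = new ArrayList<>(batchSize);
    }

    // Adds one entry; the queue is only accessed when the batch is full.
    void send(T entry) {
        batch.add(entry);
        if (batch.size() >= batchSize) {
            flush();
        }
    }

    // Hands over the current (possibly partial) batch, e.g. at end of input.
    void flush() {
        if (batch.isEmpty()) {
            return;
        }
        try {
            queue.put(batch); // blocks while the queue is full
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while handing over a batch", e);
        }
        batch = new ArrayList<>(batchSize);
    }
}
```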
40f4506e13 use FastISODateParser.parseAsEpochMilli
Compared to FastISODateParser.parse, which returns an
OffsetDateTime object, parseAsEpochMilli returns the
epoch time millis. The performance improvement for
date parsing alone is roughly 100% (8m dates/s to
18m dates/s).
Insertion speed improved from 13-14s for 1.6m entries
to 11.5-12.5s.
2018-12-16 19:24:47 +01:00
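To illustrate why going straight to epoch millis can be cheaper, here is a sketch of a parser that allocates no date objects. It is not the project's FastISODateParser. `IsoToEpochMilli` is a hypothetical class that assumes a fixed-width UTC format like `2018-12-16T19:24:47.123Z`, checks only the length and the trailing `Z`, and converts the date with the well-known days-from-civil calculation.

```java
// Hypothetical parser for timestamps of the exact form yyyy-MM-ddTHH:mm:ss.SSSZ.
final class IsoToEpochMilli {

    private IsoToEpochMilli() {
    }

    // Days since 1970-01-01 for a proleptic Gregorian date
    // (the "days from civil" algorithm published by Howard Hinnant).
    static long daysFromCivil(long year, int month, int day) {
        year -= month <= 2 ? 1 : 0;
        long era = (year >= 0 ? year : year - 399) / 400;
        long yearOfEra = year - era * 400;
        long dayOfYear = (153L * (month + (month > 2 ? -3 : 9)) + 2) / 5 + day - 1;
        long dayOfEra = yearOfEra * 365 + yearOfEra / 4 - yearOfEra / 100 + dayOfYear;
        return era * 146097 + dayOfEra - 719468;
    }

    // Reads the decimal digits in [from, to) as an int.
    private static int digits(CharSequence s, int from, int to) {
        int value = 0;
        for (int i = from; i < to; i++) {
            int digit = s.charAt(i) - '0';
            if (digit < 0 || digit > 9) {
                throw new IllegalArgumentException("not a digit at index " + i);
            }
            value = value * 10 + digit;
        }
        return value;
    }

    static long parseAsEpochMilli(CharSequence s) {
        if (s.length() != 24 || s.charAt(23) != 'Z') {
            throw new IllegalArgumentException("expected yyyy-MM-ddTHH:mm:ss.SSSZ");
        }
        long days = daysFromCivil(digits(s, 0, 4), digits(s, 5, 7), digits(s, 8, 10));
        long hours = days * 24 + digits(s, 11, 13);
        long minutes = hours * 60 + digits(s, 14, 16);
        long seconds = minutes * 60 + digits(s, 17, 19);
        return seconds * 1000 + digits(s, 20, 23);
    }
}
```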
f78f69328b add cache for docId to Doc mapping
A Doc does not change once it is created, so it is easy to cache.
Speedup was from 1ms per Doc to 3ms for 444 Docs (0.00675ms/Doc).
2018-11-22 19:51:07 +01:00
eaa234bfa5 rename put to putEntries
The method name put is used so often that Eclipse has a
hard time finding references.
2018-10-11 19:25:01 +02:00
979e001efd TcpIngestor can handle csv files 2018-10-11 18:56:16 +02:00
979d3269fa remove obsolete classes and methods 2018-10-04 18:46:51 +02:00
8939332004 remove the wrapper class PdbDB
It did not serve any purpose and could be replaced by DataStore.
2018-10-04 18:43:27 +02:00
01b93e32ca replace EhCache with a custom implementation
The cache must remove/evict writers after a few seconds, but EhCache
only evicts entries when a new entry is added. That is not acceptable
for us, because that would leave lots of files open and we would need
a second mechanism to close them.
Therefore I wrote a simple wrapper for a ConcurrentHashMap that evicts
entries after timeToLive+5s.
2018-10-03 20:22:45 +02:00
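A rough sketch of such a wrapper, under stated assumptions: `ExpiringCache`, its method names and the explicit `nowMillis` parameter are invented for illustration. The sketch assumes a background thread calls `evictExpired` periodically. If that sweep ran every 5 seconds, an entry could live for up to timeToLive+5s, which is one possible reading of the number above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical time-based cache on top of ConcurrentHashMap.
final class ExpiringCache<K, V> {

    private static final class Timed<V> {
        final V value;
        final long createdMillis;

        Timed(V value, long createdMillis) {
            this.value = value;
            this.createdMillis = createdMillis;
        }
    }

    private final ConcurrentHashMap<K, Timed<V>> map = new ConcurrentHashMap<>();
    private final long timeToLiveMillis;
    private final Consumer<V> onEvict; // e.g. closes a writer and its file

    ExpiringCache(long timeToLiveMillis, Consumer<V> onEvict) {
        this.timeToLiveMillis = timeToLiveMillis;
        this.onEvict = onEvict;
    }

    void put(K key, V value, long nowMillis) {
        map.put(key, new Timed<>(value, nowMillis));
    }

    V get(K key) {
        Timed<V> timed = map.get(key);
        return timed == null ? null : timed.value;
    }

    // Intended to be called periodically; evicts independently of new insertions.
    int evictExpired(long nowMillis) {
        int evicted = 0;
        for (Map.Entry<K, Timed<V>> entry : map.entrySet()) {
            Timed<V> timed = entry.getValue();
            boolean expired = nowMillis - timed.createdMillis >= timeToLiveMillis;
            // remove(key, value) only succeeds if the entry was not replaced meanwhile
            if (expired && map.remove(entry.getKey(), timed)) {
                onEvict.accept(timed.value);
                evicted++;
            }
        }
        return evicted;
    }
}
```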
c9dcc77b53 reuse existing PdbFiles 2018-10-03 16:49:46 +02:00
60578b45ec PdbWriters are now closed by the cache TagsToFile
we do not have to close the files when the input streams are idle.
2018-10-03 16:47:29 +02:00
ad630fc6b2 simplify caching in TagsToFile
- PdbFiles no longer require dates to be monotonically
  increasing. Therefore TagsToFile does not have to ensure
  this. => We only have one file per Tags.
- Use EhCache instead of HashMap.
2018-09-30 10:38:25 +02:00
24fcfd7763 prepare the addition of a date index 2018-09-28 19:07:01 +02:00
84350c4dfb move TimeStampDeltaDecoder to BSFile
Now the encoding and decoding code is in the same class.
2018-09-13 13:08:45 +02:00
a2e63cca44 cleanup 2018-09-13 08:11:15 +02:00
1182d76205 replace the FolderStorage with DiskStorage
- The DiskStorage uses only one file instead of millions.
  Also the block size is only 512 bytes instead of 4 KB, which
  helps to reduce the memory usage for short sequences.
- Update primitiveCollections to get the new LongList.range
  and LongList.rangeClosed methods.
- BSFile now stores Time&Value sequences and knows how to
  encode the time values with delta encoding.
- Doc had to do some magic tricks to save memory. The path
  was initialized lazily and stored as a byte array. This is no
  longer necessary. The path was replaced by the
  rootBlockNumber of the BSFile.
- Had to temporarily disable the 'in' queries.
- The stored values are now processed as a stream of LongLists
  instead of Entry. The overhead for creating Entries is
  gone, so is the memory overhead, because Entry was an
  object and had a reference to the tags, which is
  unnecessary.
2018-09-12 09:35:07 +02:00
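The delta encoding of time values mentioned in the list above can be illustrated with a small sketch. `TimestampDeltaCodec` is a hypothetical class and says nothing about the on-disk layout of BSFile. It only shows the transformation: each timestamp is stored as the difference to its predecessor, which yields small numbers that a variable-length encoding can store compactly.

```java
// Hypothetical codec: turns absolute timestamps into deltas and back.
final class TimestampDeltaCodec {

    private TimestampDeltaCodec() {
    }

    // The first delta is relative to 0, so it equals the first timestamp.
    static long[] encode(long[] timestamps) {
        long[] deltas = new long[timestamps.length];
        long previous = 0;
        for (int i = 0; i < timestamps.length; i++) {
            deltas[i] = timestamps[i] - previous;
            previous = timestamps[i];
        }
        return deltas;
    }

    // Reverses encode() by summing up the deltas.
    static long[] decode(long[] deltas) {
        long[] timestamps = new long[deltas.length];
        long current = 0;
        for (int i = 0; i < deltas.length; i++) {
            current += deltas[i];
            timestamps[i] = current;
        }
        return timestamps;
    }
}
```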
911062e26b use RandomAccessFile in FolderStorage.getPathByOffset()
The old implementation opened a new buffered reader every time
getPathByOffset was called. This took 1/20th of a second or
longer. For queries that visited thousands of files this could
take a long time.
We are now using a RandomAccessFile that is opened once. The
average time spent in getPathByOffset is now down to 0.11ms.
2018-05-10 10:22:25 +02:00
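A sketch of the technique, not the actual FolderStorage code: `OffsetLineReader` and `lineAt` are hypothetical names. The file is opened once, and each lookup only seeks to the byte offset and reads one line.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Path;

// Hypothetical reader that keeps one RandomAccessFile open for all lookups.
final class OffsetLineReader implements AutoCloseable {
    private final RandomAccessFile file;

    OffsetLineReader(Path path) {
        try {
            this.file = new RandomAccessFile(path.toFile(), "r");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the line that starts at the given byte offset.
    // synchronized because seek and readLine share one file pointer.
    // Note: RandomAccessFile.readLine maps each byte to one char, so this
    // sketch is only suitable for ASCII/Latin-1 content.
    synchronized String lineAt(long offset) {
        try {
            file.seek(offset);
            return file.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            file.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```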
82b8a8a932 reduce memory footprint by lazily initializing the path in Doc
The path in Doc is now optional. This reduces memory consumption,
because we only have to store a long (the offset in the listing file).
This assumes that only a small percentage of Docs is requested.
2018-05-06 12:58:10 +02:00
e3102c01d4 use listing.csv instead of iterating through all folders
The hope is that it is faster to read a single file instead of listing
hundreds of folders.
2018-05-05 10:46:16 +02:00
22c99f8517 fix null pointer exception
Filenames were generated without '$', but the parsing code expected
the '$'.
2018-03-28 19:34:48 +02:00
81711d551f fix performance regression
The last improvement of memory usage introduced a performance
regression. The ingestion performance dropped by 50%-80%, because
for every inserted entry the Tags were created inefficiently.
2018-03-27 19:30:18 +02:00
5343c0d427 reduce memory usage
Reduce memory usage by storing the filename as string instead of
individual tags.
2018-03-19 19:21:57 +01:00
ahr
3387ebc134 use epoch millis instead of creating a date object
We only have to check if one timestamp is newer than another.
We don't have to create an expensive date object to do that.
2018-03-09 08:43:37 +01:00
ahr
7e5b762c0d pre-compute firstByteMaxValue
this operation is executed very often during ingestion
2018-03-09 08:38:58 +01:00
ahr
6b60fd542c add percentile plots 2018-03-03 08:19:26 +01:00
9f45eb24ca add trace logging for creation of new writer 2018-01-21 08:36:40 +01:00
ahr
740cb1cb2d print metrics every 10 seconds, not every 10.001 seconds 2018-01-14 09:52:08 +01:00
ahr
d98c45e8bd add index for tags-to-documents
Now we can find the writer much faster, because we don't have to execute
a query for documents that match the tags. We can just look up the
documents in the map.
Speedup: 2-4ms -> 0.002-0.01ms
2018-01-14 09:51:37 +01:00
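The lookup described here boils down to a map from a complete tag set to its documents. In the sketch below, `TagsToDocsIndex` is hypothetical, tags are modelled as a plain `Map<String, String>`, and documents are represented by their ids. The real Tags and Doc types are not shown in this log.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical index: exact tag set -> ids of the documents with those tags.
final class TagsToDocsIndex {
    private final Map<Map<String, String>, List<Long>> docIdsByTags = new HashMap<>();

    void add(Map<String, String> tags, long docId) {
        // Copy the tags so that later changes by the caller cannot corrupt the key.
        docIdsByTags.computeIfAbsent(new HashMap<>(tags), t -> new ArrayList<>()).add(docId);
    }

    // A single hash lookup instead of running a query over all documents.
    List<Long> lookup(Map<String, String> tags) {
        return docIdsByTags.getOrDefault(tags, Collections.emptyList());
    }
}
```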