Commit Graph

111 Commits

Author SHA1 Message Date
b5e2d0a217 introduce clustering for query completion indices 2019-03-16 10:19:28 +01:00
59aea1a15f introduce index clustering (part 1)
In order to prevent files from getting too big and
make it easier to implement retention policies, we
are splitting all files into chunks. Each chunk
contains the data for a time interval (1 month per
default).
This first changeset introduces the ClusteredPersistentMap
that implements this for PersistentMap. It is used
for a couple (not all) of indices.
2019-02-24 16:50:57 +01:00
372a073b6d PdbWriter is no longer in the API of DataStore 2019-02-16 16:24:14 +01:00
92a47d9b56 remove TagsToFile
Remove one layer of abstraction by moving the code into the DataStore.
2019-02-16 16:06:46 +01:00
117ef4ea34 use guava's cache as implementation for the HotEntryCache
My own implementation was faster, but was not able to
implement a size limitation.
2019-02-16 10:23:52 +01:00
493971bcf3 values used in queries were added to the keys.csv
Due to a mistake in Tag which added all strings used
by Tag into the String dictionary, the dictionary
did contain all values that were used in queries.
2019-02-09 08:28:23 +01:00
ea5884a5e6 move creation of PdbWriter to the DataStore 2019-02-07 18:06:41 +01:00
58bfba23bb reset lastEpochMilli when opening a new export file 2019-02-06 15:52:37 +00:00
668d73c926 introduced a new custom file format used for backup and ingestion
The new file format reduces repetition, is easy to parse,
easy to generate in any language and is human readable.
2019-02-03 15:44:35 +01:00
f2d16b6758 make CacheKey comparable
The CacheKey is used as a key in a HashMap. Lookup can
be faster if the CacheKey is comparable when there are
hash collisions.
In this case I was not able to measure any effect. I am
keeping the comparables nonetheless, because the can
only have a positive effect.
2019-01-01 08:47:48 +01:00
e537e94d39 HotEntryCache will update Instants only once per second
Calling Instant.now() several hundred thousand times per
second can be expensive. In my measurements >10% of the
time spend when loading new data was spend calling
Instant.now().
Fixed this by storing an Instant as static member and
updating it periodically in a separate thread.
2018-12-21 19:16:55 +01:00
d95a71e32e batch entries between TcpIngestor and PerformanceDB
One bottleneck was the blocking queue used to transport entries
from the listener thread to the ingestor thread.
Reduced the bottleneck by batching entries.
Interestingly the batch size of 100 was better than batch size
of 1000 and better than 10.
2018-12-21 13:11:35 +01:00
40f4506e13 use FastISODateParser.parseAsEpochMilli
Compared to FastISODateParser.parse, which returns an
OffsetDateTime object, parseAsEpochMilli returns the
epoch time millis. The performance improvement for
date parsing alone is roughly 100% (8m dates/s to
18m dates/s).
Insertion speed improved from 13-14s for 1.6m entries
to 11.5-12.5s.
2018-12-16 19:24:47 +01:00
f78f69328b add cache for docId to Doc mapping
A Doc does not change once it is created, so it is easy to cache.
Speedup was from 1ms per Doc to 3ms for 444 Docs (0.00675ms/Doc).
2018-11-22 19:51:07 +01:00
cc0157fe0b update java 3rd-party libs 2018-11-20 19:13:59 +01:00
eaa234bfa5 rename put to putEntries
The method name put is used too often so that eclipse has a
hard time finding references.
2018-10-11 19:25:01 +02:00
979e001efd TcpIngestor can handle csv files 2018-10-11 18:56:16 +02:00
979d3269fa remove obsolete classes and methods 2018-10-04 18:46:51 +02:00
8939332004 remove the wrapper class PdbDB
It did not serve any purpose and could be replaced by DataStore.
2018-10-04 18:43:27 +02:00
01b93e32ca replace EhCache with a custom implementation
The cache must remove/evict writers after a few seconds, but EhCache
only evicts entries when a new entry is added. That is not acceptable
for us, because that would leave lots of files open and we would need
a second mechanism to close them.
Therefore I write a simple wrapper for a ConcurrentHashMap that evicts
entries after timeToLive+5s.
2018-10-03 20:22:45 +02:00
c9dcc77b53 reuse existing PdbFiles 2018-10-03 16:49:46 +02:00
60578b45ec PdbWriters are now closed by the cache TagsToFile
we do not have to close the files when the input streams are idle.
2018-10-03 16:47:29 +02:00
ad630fc6b2 simplify caching in TagsToFile
- PdbFiles no longer require dates to be monotonically
  increasing. Therefore TagsToFile does not have to ensure
  this. => We only have one file per Tags.
- Use EhCache instead of HashMap.
2018-09-30 10:38:25 +02:00
f07977c27a update java, gradle and third party libs 2018-09-29 09:08:29 +02:00
24fcfd7763 prepare the addition of a date index 2018-09-28 19:07:01 +02:00
84350c4dfb move TimeStampDeltaDecoder to BSFile
Now the encoding and decoding code is in the same class.
2018-09-13 13:08:45 +02:00
a2e63cca44 cleanup 2018-09-13 08:11:15 +02:00
1182d76205 replace the FolderStorage with DiskStorage
- The DiskStorage uses only one file instead of millions.
  Also the block size is only 512 byte instead of 4kb, which
  helps to reduce the memory usage for short sequences.
- Update primitiveCollections to get the new LongList.range
  and LongList.rangeClosed methods.
- BSFile now stores Time&Value sequences and knows how to
  encode the time values with delta encoding.
- Doc had to do some magic tricks to save memory. The path
  was initialized lazy and stored as byte array. This is no
  longer necessary. The patch was replaced by the
  rootBlockNumber of the BSFile.
- Had to temporarily disable the 'in' queries.
- The stored values are now processed as stream of LongLists
  instead of Entry. The overhead for creating Entries is
  gone, so is the memory overhead, because Entry was an
  object and had a reference to the tags, which is
  unnecessary.
2018-09-12 09:35:07 +02:00
89840cf9e9 update dependencies 2018-07-28 08:50:42 +02:00
daaa0e6907 update dependencies
gradle to 4.8
jackson to 2.9.6
spring-boot to 2.0.3
guava to 25.1-jre
gradle-versions-plugin to 0.19.0
2018-06-17 08:59:48 +02:00
911062e26b use RandomAccessFile in FolderStorage.getPathByOffset()
The old implementation opened a new buffered reader everytime
getPathByOffset was called. This took 1/20th of a second or
longer. For queries that visited thousands of files this could
take a long time.
We are now using a RandomAccessFile, that is opened once. The
average time spend in getPathByOffset is now down to 0.11ms.
2018-05-10 10:22:25 +02:00
82b8a8a932 reduce memory footprint by lazily intializing the path in Doc
The path in Doc is not optional. This reduces memory consumption,
because we only have to store a long (the offset in the listing file).
This assumes, that only a small percentage of Docs is requested.
2018-05-06 12:58:10 +02:00
e3102c01d4 use listing.csv instead of iterating through all folders
The hope is, that it is faster to read a single file instead of listing
hundreds of folders.
2018-05-05 10:46:16 +02:00
b06ccb0d00 update 3rd party libs
spring boot to 2.0.1
guava to 24.1-jre
jackson to 2.9.5
log4j2 to 2.10.0 (same version as pulled by spring boot)
testng to 6.14.3
2018-04-21 20:01:39 +02:00
22c99f8517 fix null pointer exception
filename were generated without '$', but the parsing code expected
the '$'.
2018-03-28 19:34:48 +02:00
81711d551f fix performance regression
The last improvement of memory usage introduced a performance
regression. The ingestion performance dropped by 50%-80%, because
for every inserted entry the Tags were created inefficient.
2018-03-27 19:30:18 +02:00
5343c0d427 reduce memory usage
Reduce memory usage by storing the filename as string instead of
individual tags.
2018-03-19 19:21:57 +01:00
ahr
3387ebc134 use epoch millis instead of creating a date object
We only have to check if one timestamp is newer than another.
We don't have to create an expensive date object to do that.
2018-03-09 08:43:37 +01:00
ahr
7e5b762c0d pre-compute firstByteMaxValue
this operation is executed very often during ingestion
2018-03-09 08:38:58 +01:00
ahr
6b60fd542c add percentile plots 2018-03-03 08:19:26 +01:00
9f45eb24ca add trace logging for creation of new writer 2018-01-21 08:36:40 +01:00
ahr
740cb1cb2d print metrics every 10 seconds, not every 10.001 seconds 2018-01-14 09:52:08 +01:00
ahr
d98c45e8bd add index for tags-to-documents
Now we can find writer much faster, because we don't have to execute
a query for documents that match the tags. We can just look up the 
documents in the map.
Speedup: 2-4ms -> 0.002-0.01ms
2018-01-14 09:51:37 +01:00
ahr
64613ce43c add metric logging for getWriter 2018-01-13 10:32:03 +01:00
ahr
3cc512f73d update third party libs
testng 6.11 -> 6.13.1
jackson-databind 2.9.1 -> 2.9.3
guava 23.0 -> 23.6-jre
2017-12-30 10:06:57 +01:00
ahr
cafaa7343c remove obsolete method 2017-12-16 19:20:38 +01:00
ahr
04b029e1be add trace logging 2017-12-16 19:19:12 +01:00
ahr
d63fabc85d prevent parallel plot requests
Plotting can take a long time and use a lot of resources. 
Multiple plot requests can cause the machine to run OOM.

We are now allowing plots for 500k files again. This is mainly to
prevent unwanted plots of everything.
2017-12-15 17:20:12 +01:00
ahr
8d48726472 remove unnecessary mapping to TagSpecificBaseDir 2017-12-15 16:52:20 +01:00
ahr
8860a048ff remove call of listRecursively on a file
The call was needed in a very early version.
2017-12-10 17:55:16 +01:00