Commit Graph

99 Commits

Author SHA1 Message Date
40f4506e13 use FastISODateParser.parseAsEpochMilli
Compared to FastISODateParser.parse, which returns an
OffsetDateTime object, parseAsEpochMilli returns the
epoch time millis. The performance improvement for
date parsing alone is roughly 100% (8m dates/s to
18m dates/s).
Insertion speed improved from 13-14s for 1.6m entries
to 11.5-12.5s.
2018-12-16 19:24:47 +01:00
f78f69328b add cache for docId to Doc mapping
A Doc does not change once it is created, so it is easy to cache.
Speedup was from 1ms per Doc to 3ms for 444 Docs (0.00675ms/Doc).
2018-11-22 19:51:07 +01:00
cc0157fe0b update java 3rd-party libs 2018-11-20 19:13:59 +01:00
eaa234bfa5 rename put to putEntries
The method name put is used too often so that eclipse has a
hard time finding references.
2018-10-11 19:25:01 +02:00
979e001efd TcpIngestor can handle csv files 2018-10-11 18:56:16 +02:00
979d3269fa remove obsolete classes and methods 2018-10-04 18:46:51 +02:00
8939332004 remove the wrapper class PdbDB
It did not serve any purpose and could be replaced by DataStore.
2018-10-04 18:43:27 +02:00
01b93e32ca replace EhCache with a custom implementation
The cache must remove/evict writers after a few seconds, but EhCache
only evicts entries when a new entry is added. That is not acceptable
for us, because that would leave lots of files open and we would need
a second mechanism to close them.
Therefore I write a simple wrapper for a ConcurrentHashMap that evicts
entries after timeToLive+5s.
2018-10-03 20:22:45 +02:00
c9dcc77b53 reuse existing PdbFiles 2018-10-03 16:49:46 +02:00
60578b45ec PdbWriters are now closed by the cache TagsToFile
we do not have to close the files when the input streams are idle.
2018-10-03 16:47:29 +02:00
ad630fc6b2 simplify caching in TagsToFile
- PdbFiles no longer require dates to be monotonically
  increasing. Therefore TagsToFile does not have to ensure
  this. => We only have one file per Tags.
- Use EhCache instead of HashMap.
2018-09-30 10:38:25 +02:00
f07977c27a update java, gradle and third party libs 2018-09-29 09:08:29 +02:00
24fcfd7763 prepare the addition of a date index 2018-09-28 19:07:01 +02:00
84350c4dfb move TimeStampDeltaDecoder to BSFile
Now the encoding and decoding code is in the same class.
2018-09-13 13:08:45 +02:00
a2e63cca44 cleanup 2018-09-13 08:11:15 +02:00
1182d76205 replace the FolderStorage with DiskStorage
- The DiskStorage uses only one file instead of millions.
  Also the block size is only 512 byte instead of 4kb, which
  helps to reduce the memory usage for short sequences.
- Update primitiveCollections to get the new LongList.range
  and LongList.rangeClosed methods.
- BSFile now stores Time&Value sequences and knows how to
  encode the time values with delta encoding.
- Doc had to do some magic tricks to save memory. The path
  was initialized lazy and stored as byte array. This is no
  longer necessary. The patch was replaced by the
  rootBlockNumber of the BSFile.
- Had to temporarily disable the 'in' queries.
- The stored values are now processed as stream of LongLists
  instead of Entry. The overhead for creating Entries is
  gone, so is the memory overhead, because Entry was an
  object and had a reference to the tags, which is
  unnecessary.
2018-09-12 09:35:07 +02:00
89840cf9e9 update dependencies 2018-07-28 08:50:42 +02:00
daaa0e6907 update dependencies
gradle to 4.8
jackson to 2.9.6
spring-boot to 2.0.3
guava to 25.1-jre
gradle-versions-plugin to 0.19.0
2018-06-17 08:59:48 +02:00
911062e26b use RandomAccessFile in FolderStorage.getPathByOffset()
The old implementation opened a new buffered reader everytime
getPathByOffset was called. This took 1/20th of a second or
longer. For queries that visited thousands of files this could
take a long time.
We are now using a RandomAccessFile, that is opened once. The
average time spend in getPathByOffset is now down to 0.11ms.
2018-05-10 10:22:25 +02:00
82b8a8a932 reduce memory footprint by lazily intializing the path in Doc
The path in Doc is not optional. This reduces memory consumption,
because we only have to store a long (the offset in the listing file).
This assumes, that only a small percentage of Docs is requested.
2018-05-06 12:58:10 +02:00
e3102c01d4 use listing.csv instead of iterating through all folders
The hope is, that it is faster to read a single file instead of listing
hundreds of folders.
2018-05-05 10:46:16 +02:00
b06ccb0d00 update 3rd party libs
spring boot to 2.0.1
guava to 24.1-jre
jackson to 2.9.5
log4j2 to 2.10.0 (same version as pulled by spring boot)
testng to 6.14.3
2018-04-21 20:01:39 +02:00
22c99f8517 fix null pointer exception
filename were generated without '$', but the parsing code expected
the '$'.
2018-03-28 19:34:48 +02:00
81711d551f fix performance regression
The last improvement of memory usage introduced a performance
regression. The ingestion performance dropped by 50%-80%, because
for every inserted entry the Tags were created inefficient.
2018-03-27 19:30:18 +02:00
5343c0d427 reduce memory usage
Reduce memory usage by storing the filename as string instead of
individual tags.
2018-03-19 19:21:57 +01:00
ahr
3387ebc134 use epoch millis instead of creating a date object
We only have to check if one timestamp is newer than another.
We don't have to create an expensive date object to do that.
2018-03-09 08:43:37 +01:00
ahr
7e5b762c0d pre-compute firstByteMaxValue
this operation is executed very often during ingestion
2018-03-09 08:38:58 +01:00
ahr
6b60fd542c add percentile plots 2018-03-03 08:19:26 +01:00
9f45eb24ca add trace logging for creation of new writer 2018-01-21 08:36:40 +01:00
ahr
740cb1cb2d print metrics every 10 seconds, not every 10.001 seconds 2018-01-14 09:52:08 +01:00
ahr
d98c45e8bd add index for tags-to-documents
Now we can find writer much faster, because we don't have to execute
a query for documents that match the tags. We can just look up the 
documents in the map.
Speedup: 2-4ms -> 0.002-0.01ms
2018-01-14 09:51:37 +01:00
ahr
64613ce43c add metric logging for getWriter 2018-01-13 10:32:03 +01:00
ahr
3cc512f73d update third party libs
testng 6.11 -> 6.13.1
jackson-databind 2.9.1 -> 2.9.3
guava 23.0 -> 23.6-jre
2017-12-30 10:06:57 +01:00
ahr
cafaa7343c remove obsolete method 2017-12-16 19:20:38 +01:00
ahr
04b029e1be add trace logging 2017-12-16 19:19:12 +01:00
ahr
d63fabc85d prevent parallel plot requests
Plotting can take a long time and use a lot of resources. 
Multiple plot requests can cause the machine to run OOM.

We are now allowing plots for 500k files again. This is mainly to
prevent unwanted plots of everything.
2017-12-15 17:20:12 +01:00
ahr
8d48726472 remove unnecessary mapping to TagSpecificBaseDir 2017-12-15 16:52:20 +01:00
ahr
8860a048ff remove call of listRecursively on a file
The call was needed in a very early version.
2017-12-10 17:55:16 +01:00
ahr
3ee6336125 log time of query execution 2017-12-10 17:52:32 +01:00
ahr
06d25e7ceb do not allow search results with more than 100k docs
a) they take a long time to compute
b) danger of OOM
c) they should drill down
2017-12-10 09:19:28 +01:00
cc49a8cf2a open PdbReaders only when reading
We used to open all PdbReaders in a search result and then interate over
them. This used a lot of heap space (> 8GB) for 400k files.
Now the PdbReaders are only opened while they are used. Heap usage was
less than 550 while reading more than 400k files.
2017-11-18 10:12:22 +01:00
ahr
64db4c48a2 add plots for percentiles 2017-11-06 16:57:22 +01:00
a7cd918fc6 skip empty files 2017-09-24 17:12:17 +02:00
347f1fdc74 update 3rd-party libraries 2017-09-23 18:24:51 +02:00
38873300c8 print last inserted entry during ingestion 2017-09-23 10:55:03 +02:00
dc716f8ac4 log more information in a more predictable manner when inserting entries 2017-04-19 19:32:23 +02:00
a99f6a276e fix missing/wrong logging
1. Log the exception in PdbFileIterator with a logger instead
   of just printing it to stderr.
2. Increase log level for exceptions when inserting entries.
3. Log exception when creation of entry failed in TcpIngestor.
2017-04-17 18:27:25 +02:00
c58e7baf69 make sure there is an exception if the file is corrupt 2017-04-17 17:52:11 +02:00
bcb2e6ca83 add query completion
We are using ANTLR listeners to find out where in the
query the cursor is. Then we generate a list of keys/values
that might fit at that position. With that information we
can generate new queries and sort them by the number
of results they yield.
2017-04-17 16:25:14 +02:00
f6a9fc2394 propose for an empty query 2017-04-16 10:39:17 +02:00