Commit Graph

120 Commits

Author SHA1 Message Date
ahr 740cb1cb2d print metrics every 10 seconds, not every 10.001 seconds 2018-01-14 09:52:08 +01:00
ahr d98c45e8bd add index for tags-to-documents
Now we can find the writer much faster, because we don't have to execute
a query for documents that match the tags. We can just look up the
documents in the map.
Speedup: 2-4ms -> 0.002-0.01ms
2018-01-14 09:51:37 +01:00
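The index described above can be sketched roughly as follows. This is a minimal Python illustration of the idea (the project itself is Java, and all names here are illustrative, not from the actual code base): map each tag directly to the set of documents carrying it, so a lookup is a set intersection instead of a query over all documents.

```python
from collections import defaultdict

class TagIndex:
    """Toy sketch of a tags-to-documents index."""

    def __init__(self):
        self._docs_by_tag = defaultdict(set)

    def add(self, doc_id, tags):
        # tags is an iterable of (key, value) pairs
        for tag in tags:
            self._docs_by_tag[tag].add(doc_id)

    def lookup(self, *tags):
        # Intersect the per-tag sets instead of scanning all documents.
        sets = [self._docs_by_tag[t] for t in tags]
        if not sets:
            return set()
        return set.intersection(*sets)

index = TagIndex()
index.add("doc-1", [("host", "a"), ("project", "p")])
index.add("doc-2", [("host", "b"), ("project", "p")])
matches = index.lookup(("project", "p"), ("host", "a"))
```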
ahr 64613ce43c add metric logging for getWriter 2018-01-13 10:32:03 +01:00
ahr 3cc512f73d update third party libs
testng 6.11 -> 6.13.1
jackson-databind 2.9.1 -> 2.9.3
guava 23.0 -> 23.6-jre
2017-12-30 10:06:57 +01:00
ahr cafaa7343c remove obsolete method 2017-12-16 19:20:38 +01:00
ahr 04b029e1be add trace logging 2017-12-16 19:19:12 +01:00
ahr d63fabc85d prevent parallel plot requests
Plotting can take a long time and use a lot of resources. 
Multiple plot requests can cause the machine to run OOM.

We are now allowing plots for up to 500k files again. The limit is mainly
there to prevent unwanted plots of everything.
2017-12-15 17:20:12 +01:00
ahr 8d48726472 remove unnecessary mapping to TagSpecificBaseDir 2017-12-15 16:52:20 +01:00
ahr 8860a048ff remove call of listRecursively on a file
The call was needed in a very early version.
2017-12-10 17:55:16 +01:00
ahr 3ee6336125 log time of query execution 2017-12-10 17:52:32 +01:00
ahr 06d25e7ceb do not allow search results with more than 100k docs
a) they take a long time to compute
b) danger of OOM
c) users should drill down instead
2017-12-10 09:19:28 +01:00
andi cc49a8cf2a open PdbReaders only when reading
We used to open all PdbReaders in a search result and then iterate over
them. This used a lot of heap space (> 8 GB) for 400k files.
Now the PdbReaders are only opened while they are used. Heap usage was
less than 550 MB while reading more than 400k files.
2017-11-18 10:12:22 +01:00
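The difference between the two approaches can be sketched in a few lines of Python (names and the reader callback are illustrative, not the actual PdbReader API): the eager variant materializes every reader before iterating, while the lazy variant keeps at most one reader open at a time.

```python
def read_all_eager(paths, open_reader):
    # Old approach: open every reader up front, then iterate.
    # All readers (and their buffers) live on the heap at once.
    readers = [open_reader(p) for p in paths]
    for r in readers:
        yield from r

def read_all_lazy(paths, open_reader):
    # New approach: open each reader only while it is being consumed.
    for p in paths:
        r = open_reader(p)
        try:
            yield from r
        finally:
            close = getattr(r, "close", None)
            if close is not None:
                close()
```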
ahr 64db4c48a2 add plots for percentiles 2017-11-06 16:57:22 +01:00
andi a7cd918fc6 skip empty files 2017-09-24 17:12:17 +02:00
andi 347f1fdc74 update 3rd-party libraries 2017-09-23 18:24:51 +02:00
andi 38873300c8 print last inserted entry during ingestion 2017-09-23 10:55:03 +02:00
andi dc716f8ac4 log more information in a more predictable manner when inserting entries 2017-04-19 19:32:23 +02:00
andi a99f6a276e fix missing/wrong logging
1. Log the exception in PdbFileIterator with a logger instead
   of just printing it to stderr.
2. Increase log level for exceptions when inserting entries.
3. Log exception when creation of entry failed in TcpIngestor.
2017-04-17 18:27:25 +02:00
andi c58e7baf69 make sure there is an exception if the file is corrupt 2017-04-17 17:52:11 +02:00
andi bcb2e6ca83 add query completion
We are using ANTLR listeners to find out where in the
query the cursor is. Then we generate a list of keys/values
that might fit at that position. With that information we
can generate new queries and sort them by the number
of results they yield.
2017-04-17 16:25:14 +02:00
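Leaving the ANTLR parsing aside, the proposal step itself can be sketched like this (a simplified Python illustration; the tokenization by whitespace and the `count_results` callback are assumptions, not the real grammar): take the token under the cursor, generate candidate queries for every fitting key/value, and sort by result count.

```python
def propose(query, cursor, vocabulary, count_results):
    # Find the token under the cursor. A real implementation uses
    # ANTLR listeners to locate this position in the parse tree.
    start = query.rfind(" ", 0, cursor) + 1
    prefix = query[start:cursor]
    # Generate candidate queries for every key/value that fits ...
    candidates = [query[:start] + word + query[cursor:]
                  for word in vocabulary if word.startswith(prefix)]
    # ... and sort them by the number of results they yield.
    return sorted(candidates, key=count_results, reverse=True)
```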
andi f6a9fc2394 propose for an empty query 2017-04-16 10:39:17 +02:00
andi 44f30aafee add a new facade in front of DataStore
This is done in preparation for the proposal API.
In order to compute proposals we need to consume the
API of the DataStore, but the code does not need to
be in the DataStore. 
Extracting the API allows us to separate these concerns.
2017-04-16 10:11:46 +02:00
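The separation described above is the classic facade pattern; a minimal Python sketch (all class and method names here are illustrative, not from the project):

```python
class DataStore:
    # Imaginary stand-in: the real DataStore has a much larger API.
    def query(self, q):
        return ["result for " + q]

class ProposalApi:
    """Facade in front of DataStore: proposal code consumes the
    DataStore API without living inside the DataStore, keeping the
    two concerns separate."""

    def __init__(self, store):
        self._store = store

    def results_for(self, q):
        return self._store.query(q)
```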
andi 43d6eba7b7 skip entries if we cannot search for the pdb file
Happened when the project was 'http:'.
2017-04-16 09:49:21 +02:00
andi ac1ee20046 replace ludb with data-store
LuDB has a few disadvantages. 
  1. Most notably disk space. H2 wastes a lot of valuable disk space.
     For my test data set with 44 million entries it is 14 MB 
     (sometimes a lot more; depends on H2 internal cleanup). With 
     data-store it is 15 KB.
     Overall I could reduce the disk space from 231 MB to 200 MB (13.4 %
     in this example). That is an average of 4.6 bytes per entry.
  2. Speed:
     a) Liquibase is slow. The first run alone takes approx. three seconds.
     b) Query and insertion: with data-store we can insert entries
        up to 1.6 times faster.

Data-store uses a few tricks to save disk space:
  1. We encode the tags into the file names.
  2. To keep them short we translate the key/value of the tag into 
     shorter numbers. For example "foo" -> 12 and "bar" -> 47. So the
     tag "foo"/"bar" would be 12/47. 
     We then translate this number into a numeral system of base 62
     (a-zA-Z0-9), so it can be used for file names and it is shorter.
     That way we only have to store the mapping of string to int.
  3. We do that in a simple tab separated file.
2017-04-16 09:07:28 +02:00
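The base-62 shortening from step 2 can be sketched as follows. This Python version assumes the digit order a-z, A-Z, 0-9 given in the message; the actual digit ordering and file layout of data-store may differ.

```python
import string

# a-zA-Z0-9, base 62, as described in the commit message
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits

def to_base62(n):
    """Render a non-negative int in base 62, short enough for file names."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

# The tag "foo"/"bar" with the mapping foo -> 12, bar -> 47 becomes 12/47,
# and each number is shortened via base 62 for use in file names.
key_id, value_id = 12, 47
tag_path = f"{to_base62(key_id)}/{to_base62(value_id)}"
```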
andi f22be73b42 switch the byte prefix of DATE_INCREMENT and MEASUREMENT
Date increments usually have higher values.
I had hoped to reduce the file size by a lot. But in my example data
with 44 million entries (real life data) it only reduced the storage 
size by 1.5%.
Also fixed a bug in PdbReader that prevented other values for the 
CONTINUATION byte.
Also added a small testing tool that prints the content of a pdb file.
It is not (yet) made available as standalone tool, but during
debugging sessions it is very useful.
2017-04-13 20:19:29 +02:00
andi 58f8606cd3 use special logger for insertion metrics
This allows us to enable/disable metric logging without having to log 
other stuff.
2017-04-13 20:12:00 +02:00
andi 8baf05962f group by multiple fields
Before we could only group by a single field. But it is actually
very useful to group by multiple fields. For example to see the
graph for a small set of methods grouped by host and project.
2017-04-12 19:16:19 +02:00
andi b8b4a6d760 remove deprecated constructor and getter 2017-04-10 20:15:22 +02:00
andi ac8ad8d30f close open files when no new entries are received
If no new entry is received for 10 seconds, then all open
files are flushed and closed.
We do this to make sure that we do not lose data when
we kill the process.
There is still a risk of data loss if we kill the process
while entries are being received.
2017-04-10 20:13:10 +02:00
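The idle-flush logic can be sketched like this (a Python illustration with a fake clock for testability; the real code's structure is not shown in the log): record the time of the last entry, and let a periodic check flush and close everything once the idle threshold is reached.

```python
import time

IDLE_SECONDS = 10  # close everything after 10s without new entries

class FileCache:
    def __init__(self, now=time.monotonic):
        self._now = now            # injectable clock for testing
        self._open_files = {}
        self._last_entry = self._now()

    def write(self, path, entry):
        self._open_files.setdefault(path, []).append(entry)
        self._last_entry = self._now()

    def maybe_close_all(self):
        # Called periodically; flushes and closes all open files once
        # no entry has arrived for IDLE_SECONDS. Returns True if it did.
        idle = self._now() - self._last_entry
        if idle >= IDLE_SECONDS and self._open_files:
            self._open_files.clear()  # a real impl would flush + close here
            return True
        return False
```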
andi d72d6df0f4 update third-party libraries 2017-04-08 08:18:39 +02:00
andi 2d78a70883 duration for inserts was wrong
The bug was that we computed the difference between millis and nanos.
Also log duration for flushes.
2017-04-02 11:15:24 +02:00
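The class of bug is easy to reproduce; here is a Python sketch (the project is Java, where the same mistake arises from mixing `System.currentTimeMillis()` and `System.nanoTime()` — an assumption about the original code, since the log does not show it): subtracting timestamps taken in different units, and from different clocks, yields nonsense durations.

```python
import time

def insert_duration_wrong(insert):
    start_ms = int(time.time() * 1000)   # wall clock, milliseconds
    insert()
    end_ns = time.monotonic_ns()         # monotonic clock, nanoseconds
    return end_ns - start_ms             # BUG: mixes units and clocks

def insert_duration_fixed(insert):
    # Take both timestamps from the same monotonic clock, in the same
    # unit, and convert once at the end.
    start = time.monotonic_ns()
    insert()
    return (time.monotonic_ns() - start) / 1e6  # duration in milliseconds
```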
andi ee00ecb4b5 remove obsolete class 2017-03-20 19:02:01 +01:00
andi 9ab5d76d93 better exception logging 2017-03-19 09:08:41 +01:00
andi a01c8b3907 fix flaky test and improve error handling
just ignore invalid entries
2017-03-18 10:14:41 +01:00
andi 513c256352 update third party libraries 2017-03-17 16:23:21 +01:00
andi 3456177291 add date range filter 2017-03-17 11:17:57 +01:00
andi 5aee6f5e4d use label '<none>' for values that have no value for the groupBy field 2017-02-12 18:56:37 +01:00
andi 562dadb692 group plots by field 2017-02-12 09:59:14 +01:00
andi b238849d65 use text input for filtering, again 2017-02-12 09:32:46 +01:00
andi 0c9195011a use log4j in pdb-ui 2017-02-05 11:20:00 +01:00
andi 3722ba02b1 add slf4j via log4j 2 logging 2017-02-05 09:53:25 +01:00
andi 175a866c90 update third-party libraries 2017-02-05 08:54:49 +01:00
andi 4f77515bbd test for keywords db performance 2017-01-07 09:10:42 +01:00
andi c283568757 group plots by a single field 2016-12-30 18:45:01 +01:00
andi 62437f384f minor unimportant changes 2016-12-30 13:16:30 +01:00
andi 58bb64c80a save 12ms when checking if the cached writer can be used 2016-12-29 19:33:45 +01:00
andi f520f18e13 leverage the cached pdbwriters
This increased performance from 500 entries per second to 4000.
2016-12-29 19:24:16 +01:00
andi de241ceb6d finalize refactoring 2016-12-29 18:27:15 +01:00
andi 68ac1dd631 reuse pdb writers 2016-12-28 08:39:20 +01:00
andi db0b3d6d24 new file format
Store values in sequences of variable length. Instead of using 8 bytes
per entry we are now using between 2 and 20 bytes. But we are also able
to store every non-negative long value.
2016-12-27 10:24:56 +01:00
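The variable-length idea behind the new format can be illustrated with the classic base-128 varint (this is not the actual pdb layout — the commit's 2-20 byte range suggests additional prefix/continuation bytes — just the standard technique of spending only as many bytes as the value needs while still covering every non-negative long):

```python
def encode_varint(n):
    """Base-128 varint: 7 payload bits per byte, high bit set while
    more bytes follow. Small values take 1 byte, a full 63-bit long 9."""
    if n < 0:
        raise ValueError("only non-negative values are supported")
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(b)
            return bytes(out)

def decode_varint(data):
    n = shift = 0
    for b in data:
        n |= (b & 0x7F) << shift
        if not b & 0x80:          # continuation bit cleared: last byte
            return n
        shift += 7
    raise ValueError("truncated varint")
```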