Commit Graph

27 Commits

SHA1 Message Date
60f1a79816 add settings to file upload
This makes it possible to define properties for
the uploaded CSV files. Currently we can define the
separator and which columns are to be ignored.
2019-12-08 20:20:13 +01:00
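
A minimal sketch (in Java, matching the codebase) of what such upload settings could look like; the class name CsvReaderSettings appears in commit 30504672bc below, but the fields and method names here are assumptions:

    import java.util.Set;

    // Sketch of per-upload CSV settings: the separator character and the
    // columns to skip. Only the class name is taken from the history.
    public class CsvReaderSettings {
        private final char separator;
        private final Set<String> ignoredColumns;

        public CsvReaderSettings(char separator, Set<String> ignoredColumns) {
            this.separator = separator;
            this.ignoredColumns = Set.copyOf(ignoredColumns);
        }

        public char getSeparator() {
            return separator;
        }

        // True if the given column header should not be ingested.
        public boolean isIgnored(String columnName) {
            return ignoredColumns.contains(columnName);
        }
    }
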
85679ca0c8 send CSV file via REST 2019-12-08 18:39:43 +01:00
30504672bc extend CsvReaderSettings by list of columns that will not be indexed 2019-12-01 09:21:07 +01:00
08b1be5334 extract CSV reading code to new file
Refactoring to prepare the addition of CSV parsing rules. The parsing
rules will contain information about which columns to ingest or ignore.
This will be used to add the ability to upload files via HTTP post in
addition to the TcpIngestor.
2019-11-30 17:58:01 +01:00
06b379494f apply new code formatter and save action 2019-11-24 10:20:43 +01:00
10a7710940 fix CSV parser corrupting the duration when it is the last element in a line 2019-11-14 18:40:14 +01:00
b7c4fe4c1f move scatter plot creation into an AggregateHandler 2019-10-20 08:11:09 +02:00
2cb81e5acd make it possible to ignore columns using the CSV ingestor 2019-07-04 09:51:33 +02:00
dbe0e02517 rename cluster to partition
We are not clustering the indices; we
are partitioning them.
2019-04-14 10:10:16 +02:00
59aea1a15f introduce index clustering (part 1)
In order to prevent files from getting too big and
to make it easier to implement retention policies, we
are splitting all files into chunks. Each chunk
contains the data for a time interval (1 month by
default).
This first changeset introduces the ClusteredPersistentMap
that implements this for PersistentMap. It is used
for some (but not all) of the indices.
2019-02-24 16:50:57 +01:00
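
A minimal sketch of the chunking idea: each timestamp maps to a month-sized chunk key, so individual chunk files stay bounded and old months can be dropped wholesale. Epoch-millis input and UTC are assumptions; the real key scheme of the ClusteredPersistentMap is not shown in the log.

    import java.time.Instant;
    import java.time.YearMonth;
    import java.time.ZoneOffset;

    public class ChunkKeys {
        // Maps an epoch-millis timestamp to its month chunk,
        // e.g. 2019-02-24T16:50Z -> "2019-02".
        public static String chunkKeyFor(long epochMilli) {
            YearMonth month = YearMonth.from(
                    Instant.ofEpochMilli(epochMilli).atZone(ZoneOffset.UTC));
            return month.toString(); // ISO format, e.g. "2019-02"
        }
    }
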
668d73c926 introduce a new custom file format used for backup and ingestion
The new file format reduces repetition, is easy to parse,
easy to generate in any language, and is human readable.
2019-02-03 15:44:35 +01:00
4cde10a9f2 read csv using input stream instead of reader
We are now reading the CSV input without transforming
the data into strings. This reduces the number of bytes
that have to be converted and copied.
We also made Tag smaller. It no longer stores pointers
to strings; instead it stores integers obtained by
compressing the strings (see StringCompressor). This
reduces memory usage and speeds up hashCode and
equals, which speeds up access to the writer cache.

Performance gain is roughly 100%:
- 330k entries/s -> 670k entries/s, top speed measured over a second
- 62s -> 32s, to ingest 16 million entries
2019-01-01 08:31:28 +01:00
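
A minimal sketch of the string compression described above: an interning map that hands out dense int ids, so a Tag can hold two ints whose hashCode and equals are trivially cheap. Only the name StringCompressor comes from the commit; everything else is an assumption.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hands out dense int ids for strings; the ids replace the
    // string references inside Tag.
    public class StringCompressor {
        private final Map<String, Integer> ids = new HashMap<>();
        private final List<String> strings = new ArrayList<>();

        // Returns the existing id for s, or assigns the next free one.
        public synchronized int compress(String s) {
            return ids.computeIfAbsent(s, key -> {
                strings.add(key);
                return strings.size() - 1;
            });
        }

        // Inverse mapping, e.g. for rendering query results.
        public synchronized String decompress(int id) {
            return strings.get(id);
        }
    }
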
40f4506e13 use FastISODateParser.parseAsEpochMilli
Compared to FastISODateParser.parse, which returns an
OffsetDateTime object, parseAsEpochMilli returns the
epoch time in milliseconds. The performance improvement
for date parsing alone is more than 100% (8M dates/s to
18M dates/s).
Insertion speed for 1.6M entries improved from 13-14 s
to 11.5-12.5 s.
2018-12-16 19:24:47 +01:00
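
A sketch of why parseAsEpochMilli can be so much faster: for a fixed ISO-8601 layout the digit positions are known, so the value can be computed directly without allocating an OffsetDateTime. This simplified version assumes the 'Z' (UTC) form with milliseconds and skips validation; it is not the real FastISODateParser.

    import java.time.LocalDate;

    public class EpochMilliSketch {
        // Parses "2011-12-03T10:15:30.123Z" straight to epoch millis.
        public static long parseAsEpochMilli(String s) {
            long epochDay = LocalDate.of(
                    digits(s, 0, 4), digits(s, 5, 2), digits(s, 8, 2)).toEpochDay();
            long secondOfDay = digits(s, 11, 2) * 3600L
                    + digits(s, 14, 2) * 60L
                    + digits(s, 17, 2);
            return (epochDay * 86_400L + secondOfDay) * 1000L + digits(s, 20, 3);
        }

        private static int digits(String s, int from, int len) {
            int value = 0;
            for (int i = from; i < from + len; i++) {
                value = value * 10 + (s.charAt(i) - '0');
            }
            return value;
        }
    }
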
23f800a441 add date parsing method that returns epochMillis instead of a date object 2018-12-16 15:38:26 +01:00
218ea9ed68 use custom date parser
A specialized date parser that can only handle ISO-8601-like dates
(2011-12-03T10:15:30.123Z or 2011-12-03T10:15:30+01:00) but does this
roughly 10 times faster than DateTimeFormatter and 5 times
faster than the FastDateParser of commons-lang3.
2018-11-19 19:23:57 +01:00
979e001efd TcpIngestor can handle CSV files 2018-10-11 18:56:16 +02:00
6d4e3da672 add test for sending entries with negative values to the ingestor 2018-10-07 09:08:25 +02:00
ad630fc6b2 simplify caching in TagsToFile
- PdbFiles no longer require dates to be monotonically
  increasing. Therefore TagsToFile does not have to ensure
  this. => We only have one file per Tags.
- Use EhCache instead of HashMap.
2018-09-30 10:38:25 +02:00
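
A minimal sketch of a bounded Ehcache 3 cache for the Tags-to-file lookup; the String key/value types, cache name, and size are assumptions standing in for the project's own classes.

    import org.ehcache.Cache;
    import org.ehcache.CacheManager;
    import org.ehcache.config.builders.CacheConfigurationBuilder;
    import org.ehcache.config.builders.CacheManagerBuilder;
    import org.ehcache.config.builders.ResourcePoolsBuilder;

    public class TagsToFileCacheSketch {
        public static void main(String[] args) {
            // Unlike a plain HashMap, the cache is bounded and evicts
            // old entries instead of growing without limit.
            CacheManager manager = CacheManagerBuilder.newCacheManagerBuilder()
                    .withCache("tagsToFile",
                            CacheConfigurationBuilder.newCacheConfigurationBuilder(
                                    String.class, String.class,
                                    ResourcePoolsBuilder.heap(10_000)))
                    .build(true);
            Cache<String, String> cache =
                    manager.getCache("tagsToFile", String.class, String.class);
            cache.put("host=web1,metric=load", "/data/a1b2c3.pdb");
            System.out.println(cache.get("host=web1,metric=load"));
            manager.close();
        }
    }
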
1182d76205 replace the FolderStorage with DiskStorage
- The DiskStorage uses only one file instead of millions.
  Also the block size is only 512 bytes instead of 4 KB, which
  helps to reduce the memory usage for short sequences.
- Update primitiveCollections to get the new LongList.range
  and LongList.rangeClosed methods.
- BSFile now stores Time&Value sequences and knows how to
  encode the time values with delta encoding.
- Doc had to do some magic tricks to save memory: the path
  was initialized lazily and stored as a byte array. This is
  no longer necessary; the path was replaced by the
  rootBlockNumber of the BSFile.
- Had to temporarily disable the 'in' queries.
- The stored values are now processed as a stream of LongLists
  instead of Entries. The overhead of creating Entries is
  gone, and so is the memory overhead: Entry was an object
  that held a reference to the tags, which is unnecessary.
2018-09-12 09:35:07 +02:00
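
A minimal sketch of the delta encoding mentioned for BSFile: store each timestamp as the difference to its predecessor. Successive timestamps are close together, so the deltas are small and encode compactly; the actual on-disk layout of BSFile is not shown in the log.

    public class DeltaEncodingSketch {
        // Replaces each timestamp with its delta to the previous one.
        public static long[] encode(long[] times) {
            long[] out = new long[times.length];
            long previous = 0;
            for (int i = 0; i < times.length; i++) {
                out[i] = times[i] - previous;
                previous = times[i];
            }
            return out;
        }

        // Inverse: a running sum restores the original timestamps.
        public static long[] decode(long[] deltas) {
            long[] out = new long[deltas.length];
            long previous = 0;
            for (int i = 0; i < deltas.length; i++) {
                previous += deltas[i];
                out[i] = previous;
            }
            return out;
        }
    }
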
bda2de672e improvements
- split the 'sortby' select field into two fields
- sort by average
- legend shows plotted and total values in the date range
- removed InlineDataSeries, because it was not used anymore
2018-05-01 17:32:25 +02:00
5343c0d427 reduce memory usage
Reduce memory usage by storing the filename as a string instead of
individual tags.
2018-03-19 19:21:57 +01:00
5a9aae70af handle corrupt JSON
Entries must be separated by a newline. This allows
us to handle corrupt JSON entries, because we know
that an entry can only start at the beginning of a line.
2018-03-03 09:58:50 +01:00
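
A minimal sketch of that recovery logic: because an entry can only start at the beginning of a line, a corrupt line can simply be skipped without losing the rest of the stream. Jackson is an assumed choice here; the log does not say which JSON library the project uses.

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class NdjsonRecoverySketch {
        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Parses one JSON entry per line, skipping lines that fail to parse.
        public static List<JsonNode> readEntries(BufferedReader reader)
                throws IOException {
            List<JsonNode> entries = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                try {
                    entries.add(MAPPER.readTree(line));
                } catch (IOException corrupt) {
                    // Corrupt entry: drop this line; the next entry
                    // starts at the next line beginning.
                }
            }
            return entries;
        }
    }
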
ac1ee20046 replace ludb with data-store
LuDB has a few disadvantages.
  1. Most notably disk space. H2 wastes a lot of valuable disk space.
     For my test data set with 44 million entries the overhead is 14 MB
     (sometimes a lot more; it depends on H2's internal cleanup). With
     data-store it is 15 KB.
     Overall I could reduce the disk space from 231 MB to 200 MB (13.4 %
     in this example). That is an average of 4.6 bytes per entry.
  2. Speed:
     a) Liquibase is slow. The first start alone takes approx. three seconds.
     b) Query and insertion: with data-store we can insert entries
        up to 1.6 times faster.

Data-store uses a few tricks to save disk space:
  1. We encode the tags into the file names.
  2. To keep them short we translate the key/value of the tag into
     shorter numbers, for example "foo" -> 12 and "bar" -> 47. So the
     tag "foo"/"bar" would be 12/47.
     We then translate each number into a numeral system of base 62
     (a-zA-Z0-9), so it can be used in file names and is shorter.
     That way we only have to store the mapping of string to int.
  3. We store that mapping in a simple tab-separated file.
2017-04-16 09:07:28 +02:00
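
A minimal sketch of the base-62 trick from the commit above: once a tag key or value has been translated to a small int (e.g. "foo" -> 12, "bar" -> 47), the int is rendered in base 62 so it stays short and filename-safe. The alphabet order a-zA-Z0-9 follows the commit text.

    public class Base62Sketch {
        private static final String ALPHABET =
                "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

        // Renders a non-negative int in base 62, e.g. 12 -> "m", 47 -> "V".
        public static String encode(int n) {
            if (n == 0) {
                return String.valueOf(ALPHABET.charAt(0));
            }
            StringBuilder sb = new StringBuilder();
            while (n > 0) {
                sb.append(ALPHABET.charAt(n % 62));
                n /= 62;
            }
            return sb.reverse().toString();
        }
    }
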
72436e9c8c extract utility method
a method to send json over tcp can be used by several tests
2017-04-08 08:21:50 +02:00
ed17d84da4 remove obsolete and disabled test 2017-04-08 08:18:17 +02:00
aadc9cbd21 move TcpIngestor to pdb-ui
and start it in the web application.
Also use the Spring way of handling property files.
2017-03-19 08:00:18 +01:00
d1e39513f3 create web application 2016-12-21 17:48:36 +01:00