InfluxDB CLI import fails on inserts when using a huge file

I am currently working on parsing NASDAQ data and inserting it into an InfluxDB database. I have taken care of all the data insertion rules (escaping special characters and organizing the data according to the format: <measurement>[,<tag-key>=<tag-value>...] <field-key>=<field-value>[,<field2-key>=<field2-value>...] [unix-nano-timestamp]).
Below is a sample of my data:
apatel17#*****:~/output$ head S051018-v50-U.csv
U,StockLoc=6445,OrigOrderRef=22159,NewOrderRef=46667 TrackingNum=0,Shares=200,Price=73.7000 1525942800343419608
U,StockLoc=6445,OrigOrderRef=20491,NewOrderRef=46671 TrackingNum=0,Shares=200,Price=73.7800 1525942800344047668
U,StockLoc=952,OrigOrderRef=65253,NewOrderRef=75009 TrackingNum=0,Shares=400,Price=45.8200 1525942800792553625
U,StockLoc=7092,OrigOrderRef=51344,NewOrderRef=80292 TrackingNum=0,Shares=100,Price=38.2500 1525942803130310652
U,StockLoc=7092,OrigOrderRef=80292,NewOrderRef=80300 TrackingNum=0,Shares=100,Price=38.1600 1525942803130395217
U,StockLoc=7092,OrigOrderRef=82000,NewOrderRef=82004 TrackingNum=0,Shares=300,Price=37.1900 1525942803232492698
I have also created the database: NASDAQData inside influx.
The problem I am facing is this:
The file has approximately 13 million rows (12,861,906 to be exact). I am trying to insert this data using the CLI import command as below:
influx -import -path=S051118-v50-U.csv -precision=ns -database=NASDAQData
I usually get up to 5,000,000 lines in before insertion errors start. I have run the import multiple times, and sometimes the errors start as early as 3,000,000 lines. To narrow this down, I split the data into files of 500,000 lines each and ran the same command on every part; all 26 smaller files imported successfully.
Has anybody else run into this, or does somebody know a fix? A huge file produces errors during insertion, but the same data imports perfectly when it is broken into smaller files.
Any help is appreciated. Thanks

As recommended by the InfluxDB documentation, it may be necessary to split your data file into several smaller ones, because the HTTP request used for issuing your writes can time out after 5 seconds:
If your data file has more than 5,000 points, it may be necessary to
split that file into several files in order to write your data in
batches to InfluxDB. We recommend writing points in batches of 5,000
to 10,000 points. Smaller batches, and more HTTP requests, will result
in sub-optimal performance. By default, the HTTP request times out
after five seconds. InfluxDB will still attempt to write the points
after that time out but there will be no confirmation that they were
successfully written.
Alternatively, you can limit how many points are written per second using the pps option. This should relieve some of the stress on your InfluxDB instance.
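The splitting step can be scripted. Below is a minimal Python sketch, assuming the file holds only line-protocol points, one per line. The function name split_file and the ".partNNNNN" naming scheme are made up for illustration, and 5,000 is simply the lower end of the batch size quoted above.

```python
from pathlib import Path


def split_file(source, lines_per_file=5000, out_dir=None):
    """Split a text file into numbered parts of at most lines_per_file lines.

    Returns the list of part paths in the order they were written.
    """
    source = Path(source)
    out_dir = Path(out_dir) if out_dir else source.parent
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    out = None
    with source.open() as src:
        for i, line in enumerate(src):
            # Start a new part file every lines_per_file lines.
            if i % lines_per_file == 0:
                if out:
                    out.close()
                # Hypothetical naming scheme: <stem>.part00000<suffix>
                part = out_dir / f"{source.stem}.part{len(parts):05d}{source.suffix}"
                parts.append(part)
                out = part.open("w")
            out.write(line)
    if out:
        out.close()
    return parts
```

Each returned part can then be passed to the same influx -import command used for the full file.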


Talend- Memory issues. Working with big files

Before the admins eat me alive, I would like to say in my defense that I cannot comment on the original posts because I do not have the privilege to, so I have to ask about this again.
I have issues running a job in Talend (Open Studio for Big Data!). I have an archive of 3 GB. I do not consider this too much, since my computer has 32 GB of RAM.
While trying to run my job, I first got an error related to heap memory, then a garbage collector error, and now it doesn't even give me an error (it just does nothing and then stops).
I found these solutions:
a) Talend performance
@Kailash commented that parallelization is only available if I subscribe to one of the Talend Platform solutions. My comment/question: so is there no other similar option to parallelize a job with a 3 GB archive?
b) Talend 10 GB input and lookup out of memory error
@54l3d mentioned that one option is to split the lookup file into manageable chunks (maybe 500M), then perform the join in many stages, one per chunk. My comment/cry for help/question: how can I do that? I do not know how to split the lookup; can someone explain this to me a little more graphically?
c) How to push a big file data in talend?
just to mention that I also went through the "c" but I don't have any comment about it.
The job I am performing (thanks to @iMezouar) looks like this:
1) I have an input, MySQLInput, coming from a DB in MySQL (3 GB)
2) I used tFirstRows to make the process easier (not working)
3) I used tSplitRow to transform the data from many similar columns into only one column.
4) MySQLOutput
Thanks again for reading, and double thanks for answering.
From what I understand, your query returns a lot of data (3 GB), and that is causing an error in your job. I suggest the following:
1. Filter data on the database side: replace tSampleRow with a WHERE clause in your tMysqlInput component in order to retrieve fewer rows in Talend.
2. The MySQL JDBC driver by default retrieves all data into memory, so you need to use the stream option in tMysqlInput's advanced settings in order to stream rows.

Reading large gzip JSON files from Google Cloud Storage via Dataflow into BigQuery

I am trying to read about 90 gzipped JSON logfiles from Google Cloud Storage (GCS), each about 2GB large (10 GB uncompressed), parse them, and write them into a date-partitioned table to BigQuery (BQ) via Google Cloud Dataflow (GCDF).
Each file holds 7 days of data, the whole date range is about 2 years (730 days and counting). My current pipeline looks like this:
p.apply("Read logfile", TextIO.Read.from(bucket))
 .apply("Repartition", Repartition.of())
 .apply("Parse JSON", ParDo.of(new JacksonDeserializer()))
 .apply("Extract and attach timestamp", ParDo.of(new ExtractTimestamps()))
 .apply("Format output to TableRow", ParDo.of(new TableRowConverter()))
 .apply("Window into partitions", Window.into(new TablePartWindowFun()))
 .apply("Write to BigQuery", BigQueryIO.Write
     .to(new DayPartitionFunc("someproject:somedataset", tableName)));
The Repartition is something I built in while trying to make the pipeline reshuffle after decompressing; I have tried running the pipeline with and without it. Parsing JSON works via a Jackson ObjectMapper and corresponding classes as suggested here. The TablePartWindowFun is taken from here; it is used to assign a partition to each entry in the PCollection.
The pipeline works for smaller files and not too many of them, but breaks for my real data set. I've selected large enough machine types and tried setting a maximum number of workers, as well as using autoscaling up to 100 n1-highmem-16 machines. I've tried streaming and batch mode and diskSizeGb values from 250 up to 1200 GB per worker.
The possible solutions I can think of at the moment are:
1. Uncompress all files on GCS, thereby enabling dynamic work splitting between workers, as it is not possible to leverage GCS's gzip transcoding.
2. Build "many" parallel pipelines in a loop, with each pipeline processing only a subset of the 90 files.
Option 2 seems to me like programming "around" a framework. Is there another solution?
With Repartition after Reading the gzip JSON files in batch mode with 100 workers max (of type n1-highmem-4), the pipeline runs for about an hour with 12 workers and finishes the Reading as well as the first stage of Repartition. Then it scales up to 100 workers and processes the repartitioned PCollection. After it is done the graph looks like this:
Interestingly, when reaching this stage, it first processes up to 1.5 million elements/s, then the progress goes down to 0. The size of the OutputCollection of the GroupByKey step in the picture first goes up and then down from about 300 million to 0 (there are about 1.8 billion elements in total), as if it were discarding something. Also, the ExpandIterable and ParDo(Streaming Write) run-time in the end is 0. The picture shows it slightly before running "backwards".
In the logs of the workers I see some exception thrown while executing request messages that are coming from the logger, but I can't find more info in Stackdriver.
Without Repartition after Reading the pipeline fails using n1-highmem-2 instances with out of memory errors at exactly the same step (everything after GroupByKey) - using bigger instance types leads to exceptions like
CANCELLED: Received RST_STREAM with error code 8 dataflow-...-harness-5l3s
talking to frontendpipeline-..-harness-pc98:12346
Thanks to Dan from the Google Cloud Dataflow Team and the example he provided here, I was able to solve the issue. The only changes I made:
Looping over the days in chunks of 175 days (25 weeks), running one pipeline after the other so as not to overwhelm the system. In the loop, make sure the last files of the previous iteration are re-processed and the startDate is moved forward at the same speed as the underlying data (175 days). As WriteDisposition.WRITE_TRUNCATE is used, incomplete days at the end of a chunk are overwritten with correct, complete data this way.
Using the Repartition/Reshuffle transform mentioned above, after reading the gzipped files, to speed up the process and allow smoother autoscaling
Using DateTime instead of Instant types, as my data is not in UTC
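The chunking loop from the first change can be sketched in Python, leaving out the pipeline itself. This is only an illustration under stated assumptions: chunk_windows and files_for_window are hypothetical helpers, each file is identified by the first day it covers, and a file is picked for a chunk whenever its 7-day range overlaps the chunk. That overlap rule is what makes a file straddling a chunk boundary get processed twice.

```python
from datetime import date, timedelta

CHUNK_DAYS = 175  # 25 weeks, as in the answer above


def chunk_windows(first_day, last_day, chunk_days=CHUNK_DAYS):
    """Yield (start, end) date pairs, end exclusive, covering first_day..last_day."""
    start = first_day
    while start <= last_day:
        end = min(start + timedelta(days=chunk_days), last_day + timedelta(days=1))
        yield start, end
        start += timedelta(days=chunk_days)


def files_for_window(file_starts, start, end, file_days=7):
    """Return the start dates of files whose date range overlaps [start, end)."""
    span = timedelta(days=file_days)
    return [f for f in file_starts if f < end and f + span > start]


if __name__ == "__main__":
    first = date(2015, 1, 1)            # made-up start of the data
    last = first + timedelta(days=729)  # 730 days in total
    for start, end in chunk_windows(first, last):
        # One pipeline run per window would go here.
        print(start, end)
```

With 730 days of data, this yields four full windows of 175 days and a shorter final one.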
UPDATE (Apache Beam 2.0):
With the release of Apache Beam 2.0 the solution became much easier. Sharding BigQuery output tables is now supported out of the box.
It may be worthwhile trying to allocate more resources to your pipeline by setting --numWorkers with a higher value when you run your pipeline. This is one of the possible solutions discussed in the “Troubleshooting Your Pipeline” online document, at the "Common Errors and Courses of Action" sub-chapter.

Compressing/encoding SQL commands for faster network transmission

I am developing an iOS app which stores its data in an sqlite3 database. Every insert, update or delete operation is logged locally and then pushed up to iCloud, and other devices running the app can download these transaction logs and execute the SQL commands within them to keep all devices running the app synchronised. This is working extremely well.
I am now looking into optimising the process and it occurs to me that logging the whole SQL command results in a lot of redundant data being pushed to and pulled from the cloud, which will ultimately result in longer sync times and increased data usage.
The SQL queries are very predictable (there is only one format each of insert, update and delete used in the app) so I am considering using an encoding/decoding routine which will compress the SQL command for storage in the transaction log, and then decompress it from the log for execution.
The string compression methods I have found don't seem to do too well with SQL queries, so I've devised my own:
Single byte to identify the SQL command type
Table and column names indexed in arrays in the app, and the names are encoded using their index position in the array
String of tab separated digits to represent groups of columns, and tab separated values (e.g. in a VALUES() clause)
Encoded check column and value (for the WHERE clause in an update or delete command)
Using this format I have compressed one example query of 186 bytes down to just 78 bytes. This has clear advantages for speed of data transmission and amount of data usage.
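To make the scheme concrete, here is a minimal Python sketch for the INSERT case only. It is not the app's actual format: the table and column arrays, the "I" command byte, and the unit-separator character between groups are all assumptions chosen for illustration, and values are assumed to contain no tab or separator characters.

```python
# Hypothetical lookup arrays; every device would need identical copies.
TABLES = ["contacts", "notes"]
COLUMNS = ["id", "name", "email", "body"]
GROUP_SEP = "\x1f"  # separates the groups of one record (assumed absent from values)


def encode_insert(table, row):
    """Encode an INSERT from a table name and a {column: value} dict."""
    cols = "\t".join(str(COLUMNS.index(c)) for c in row)
    vals = "\t".join(str(v) for v in row.values())
    return GROUP_SEP.join(["I", str(TABLES.index(table)), cols, vals])


def decode_insert(record):
    """Rebuild a parameterised INSERT statement and its list of values."""
    cmd, table_idx, cols, vals = record.split(GROUP_SEP)
    if cmd != "I":
        raise ValueError("not an INSERT record")
    names = [COLUMNS[int(i)] for i in cols.split("\t")]
    placeholders = ", ".join("?" for _ in names)
    sql = (
        f"INSERT INTO {TABLES[int(table_idx)]} "
        f"({', '.join(names)}) VALUES ({placeholders})"
    )
    return sql, vals.split("\t")
```

Note that the decoded values come back as strings, so a real implementation would also have to carry or infer column types.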
The disadvantage I foresee is that it will require more processing on the client end to encode and decode the commands. I am wondering whether anybody has done anything similar and has any advice to offer.
To make it clearer what I am asking: in general, is it better to minimise the amount of data being synced and increase the burden on the client to interpret those data, or is it preferable to just sync the data as-is and leave the client to use it as-is?
I'm posting an answer to my own question because I have some information which may provide advice to others who are looking in to the same thing.
I spent some time yesterday writing SQL query compression and decompression functions in Objective C. These functions use the method detailed in my original question and in so doing reduce all non-data parts of the queries (SQL command [insert/update/delete], table names and column names) to a single digit to represent each, and remove all of the remaining SQL syntax (key words like "FROM", spaces, commas, brackets...).
I have done some testing by creating five records both with and without the query compression enabled, and here are the results for the file sizes:
Full SQL queries (logged unmodified to the transactions file)
Zip compressed: 747 bytes
Uncompressed: 1,515 bytes
Compressed SQL queries (compressed using my custom format to the transactions file)
Zip compressed: 673 bytes
Uncompressed: 785 bytes
As you can see, the greatest benefit came from using both types of compression. Encoding the queries achieved ~50% compression. Encoding and then zipping the queries achieved about 10% compression compared to zipping the unencoded queries.
The question I really need to ask myself now is whether it's worth the additional overhead to encode and then zip the queries, and unzip and decode them after downloading the transaction logs from other devices. In this example I only saved 74 bytes overall by encoding and then compressing the transaction log. With five queries this averages 14.8 bytes saved per query (compared to zipping alone). This is about 14,800 bytes (roughly 14.5 KB) per 1,000 records, which does not seem a lot.
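The figures above can be re-derived from the four measured file sizes. The short Python check below uses only those sizes and the count of five queries; the variable names are made up for illustration.

```python
full_raw, full_zip = 1515, 747  # unmodified SQL log: raw and zipped, in bytes
enc_raw, enc_zip = 785, 673     # custom-encoded log: raw and zipped, in bytes
queries = 5

# Fraction saved by the custom encoding alone (no zip).
encoding_saving = 1 - enc_raw / full_raw
# Extra fraction saved by encoding before zipping, versus zipping alone.
zip_gain = 1 - enc_zip / full_zip
# Absolute saving per query and per 1,000 queries, versus zipping alone.
saved_per_query = (full_zip - enc_zip) / queries
saved_per_1000 = (full_zip - enc_zip) * 1000 / queries

print(f"encoding alone saves {encoding_saving:.1%}")
print(f"encode + zip saves {zip_gain:.1%} over zip alone")
print(f"{saved_per_query} bytes per query, {saved_per_1000:.0f} bytes per 1,000")
```

This assumes the saving scales linearly with the number of queries, which five records cannot really confirm.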

Import of large dataset in Neo4j taking really long (>12 hours) with Neo4j import tool

I have a large dataset (about 1B nodes and a few billion relationships) that I am trying to import into Neo4j. I am using the Neo4j import tool. The nodes finished importing in an hour, however since then, the importer is stuck in a node index preparation phase (unless I am reading the output below incorrectly) for over 12 hours now.
Available memory:
Free machine memory: 184.49 GB
Max heap memory : 26.52 GB
[>:23.39 MB/s---|PROPERTIE|NODE:|LAB|*v:37.18 MB/s---------------------------------------------] 1B
Done in 1h 7m 18s 54ms
Prepare node index
[*SORT:11.52 GB--------------------------------------------------------------------------------]881M
My question is how can I speed this up? I am thinking the following:
1. Split up the import command for nodes and relationships and do the nodes import.
2. Create indexes on the nodes
3. Do merge/match to get rid of dupes
4. Do rels import.
Will this help? Is there something else I should try? Is the heap size too large (I think not, but would like an opinion)?
I also tried importing exactly half that data on the same machine and it gets stuck again in that phase at roughly the same amount of time (proportionally). So I have mostly eliminated disk space and memory as an issue.
I have also checked my headers (since I noticed that other people ran into this problem when they had incorrect headers) and they seem correct to me. Any suggestions on what else I should be looking at?
Ok so now it is getting kind of ridiculous. I reduced my data size down to just one large file (about 3G). It only contains nodes of a single kind and only has ids. So the data looks something like this
and the header (in a separate file) looks like this
And my import still gets stuck in the sort phase. I am pretty sure I am doing something wrong here. But I really have no clue what. Here is my command line to invoke this
/var/lib/neo4j/bin/neo4j-import --into data/db/graph.db --id-type string --delimiter "|" \
--bad-tolerance 1000000000 --skip-duplicate-nodes true --stacktrace true --ignore-empty-strings true \
--nodes:Author "data/author/author_header_label.csv,data/author/author_half_label.csv.gz"
Most of the options for bad-tolerance and skip-duplicate-nodes are there to see if I can make it get through the import somehow at least once.
I think I found the issue. I was using some of the tips here, where it says I can re-use the same csv file with different headers, once for nodes and once for relationships. I underestimated the one-to-many nature of the data I was using, which caused a lot of duplicates on the ID. That stage was almost entirely spent on trying to sort and then dedupe. Re-working my queries to extract the data split into separate nodes and rels files fixed that problem. Thanks for looking into this!
So, ideally, always having separate files for each type of node and rel will give the fastest results (at least in my tests).
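If the extraction queries cannot easily be changed, duplicate IDs can also be dropped from a nodes file before the import. The Python sketch below assumes a pipe-delimited file with the ID in the first column, and assumes the distinct IDs fit in memory, which may not hold at a billion nodes. The helper name unique_node_ids is made up for illustration.

```python
import csv
import io


def unique_node_ids(rows, id_column=0):
    """Yield each node ID once, in first-seen order, skipping empty rows."""
    seen = set()
    for row in rows:
        if not row:
            continue
        node_id = row[id_column]
        if node_id not in seen:
            seen.add(node_id)
            yield node_id


if __name__ == "__main__":
    # Made-up sample standing in for a nodes file with repeated IDs.
    sample = io.StringIO("a1|x\na2|y\na1|z\n")
    for node_id in unique_node_ids(csv.reader(sample, delimiter="|")):
        print(node_id)
```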
Have a look at the batch importer I wrote for a stress test:
I used both the Neo4j index and an in-memory map between two commits. It is really fast and works for both versions of Neo4j.
Ignore the tests and get the batch importer.

importing and processing data from a CSV File in Delphi

I had a pre-interview task, which I completed, and the solution works. However, I was marked down and did not get an interview because I had used a TADODataset. I basically imported a CSV file which populated the dataset. The data had to be processed in a specific way, so I used filtering and sorting of the dataset to make sure that the data was ordered the way I wanted, and then I did the logic processing in a while loop. The feedback I received said that this was bad, as it would be very slow for large files.
My main question here is: if using an in-memory dataset is slow for processing large files, what would have been a better way to access the information from the CSV file? Should I have used String Lists or something like that?
It really depends on how "big" the file is and on the resources (in this case RAM) available for the task.
"The feedback that was received said that this was bad as it would be very slow for large files."
CSV files are usually used for moving data around without worrying too much (if at all) about import/export, since the format is extremely simple. In most cases that I've encountered, files are ~1 MB up to ~10 MB, but that's not to say that others would not dump more data in CSV format.
Suppose you have an 80 MB CSV file. That's a file you want to process in chunks; otherwise (depending on your processing) you can eat hundreds of MB of RAM. In this case, what I would do is:
while dataToProcess do begin
  // step 1: read up to <X> lines from the file, where <X> is the maximum
  // number of lines you read in one go; if fewer lines remain (e.g. you're
  // down to 50 lines and X is 100), read just those
  // step 2: process the information
  // step 3: generate output, database inserts, etc.
end;
In the above case, you're not loading 80MB of data into RAM, but only a few hundred KB, and the rest you use for processing, i.e. linked lists, dynamic insert queries(batch insert), etc.
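The same loop can be written as a short runnable sketch. It is in Python rather than Delphi, purely to illustrate the chunking idea; read_in_chunks and the chunk size are made-up names and values.

```python
from itertools import islice


def read_in_chunks(lines, max_lines):
    """Yield lists of at most max_lines items from an iterable of lines."""
    iterator = iter(lines)
    while True:
        # Step 1: read up to max_lines lines; the last chunk may be shorter.
        chunk = list(islice(iterator, max_lines))
        if not chunk:
            break
        yield chunk


def process_file(path, max_lines=100):
    """Walk a text file chunk by chunk and return the number of lines seen."""
    total = 0
    with open(path) as handle:
        for chunk in read_in_chunks(handle, max_lines):
            # Steps 2 and 3 (processing, batch inserts) would go here.
            total += len(chunk)
    return total
```

Only one chunk is held in memory at a time, because a file object yields its lines lazily.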
"...however I was marked down and did not get an interview due to having used a TADODataset."
I'm not surprised; they were probably looking to see whether you are capable of creating algorithms and providing simple solutions on the spot without using "ready-made" solutions.
They were probably expecting to see you use dynamic arrays and write one (or more) sorting algorithms.
"Should I have used String Lists or something like that?"
The response might have been the same, again, I think they wanted to see how you "work".
The interviewer was quite right.
The correct, scalable and fastest solution for any file of medium size and upwards is to use an 'external sort'.
An 'External Sort' is a 2 stage process, the first stage being to split each file into manageable and sorted smaller files. The second stage is to merge these files back into a single sorted file which can then be processed line by line.
It is extremely efficient on any CSV file with over, say, 200,000 lines. The amount of memory the process uses can be controlled, so the danger of running out of memory can be eliminated.
I have implemented many such sort processes and in Delphi would recommend a combination of TStringList, TList and TQueue classes.
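The two stages can be sketched compactly. This is a Python illustration rather than the Delphi implementation recommended above. It assumes each line is a string without embedded newlines, and external_sort and its parameters are hypothetical names.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(lines, max_lines_in_memory, key=None):
    """Yield the items of `lines` in sorted order via sorted temporary run files."""
    run_paths = []
    iterator = iter(lines)
    try:
        # Stage 1: cut the input into sorted runs that each fit in memory.
        while True:
            run = sorted(islice(iterator, max_lines_in_memory), key=key)
            if not run:
                break
            fd, path = tempfile.mkstemp(suffix=".run", text=True)
            with os.fdopen(fd, "w") as handle:
                handle.writelines(line + "\n" for line in run)
            run_paths.append(path)
        # Stage 2: merge the runs, reading them back line by line.
        handles = [open(path) for path in run_paths]
        try:
            streams = [(line.rstrip("\n") for line in h) for h in handles]
            yield from heapq.merge(*streams, key=key)
        finally:
            for h in handles:
                h.close()
    finally:
        # Remove the temporary run files once the merge is finished.
        for path in run_paths:
            os.remove(path)
```

The max_lines_in_memory parameter is what bounds memory use: only one run is sorted in memory at a time, and the merge reads each run back sequentially.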
Good Luck
