Elasticsearch data binary ran out of memory

I'm trying to upload an 800GB file to Elasticsearch, but I keep getting a memory error telling me the data binary is out of memory. I have 64GB of RAM on my system and 3TB of storage.
curl -XPOST 'http://localhost:9200/carrier/doc/1/_bulk' --data-binary @carrier.json
I'm wondering if there is a setting in the config file to increase the amount of memory so I can upload this file.
thanks

800GB is quite a lot to send in one shot. ES has to hold the entire request body in memory in order to process it, so that is probably too big for the amount of memory you have.
One way around this is to split your file into several smaller ones and send them one after another. You can achieve this with a small shell script like the one below.
#!/bin/sh
# split the main file into files containing 10,000 lines max
split -l 10000 -a 10 carrier.json /tmp/carrier_bulk
# send each split file
BULK_FILES=/tmp/carrier_bulk*
for f in $BULK_FILES; do
curl -s -XPOST http://localhost:9200/_bulk --data-binary @$f
done
UPDATE
If you want to interpret the ES response you can do so easily by piping the response to a small python one-liner like this:
curl -s -XPOST $ES_HOST/_bulk --data-binary @$f | python -c 'import json,sys;obj=json.load(sys.stdin);print(" <- Took %s ms with errors: %s" % (obj["took"], obj["errors"]))'
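If you would rather not depend on Python, a similar check can be done with jq (assuming jq is installed; this variant is not part of the original answer):
curl -s -XPOST $ES_HOST/_bulk --data-binary @$f | jq '{took: .took, errors: .errors}'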

Related

How to download a huge console output from Jenkins

My job executes ansible-playbook in debug mode (ansible-playbook -vvv), which generates a lot of output.
After the job finishes, it's very difficult to search the log in the browser because the page is very slow and gets stuck.
I tried to download it with curl/wget, but the file is incomplete (I guess only about 10% was downloaded):
curl http://j:8080/job/my-job/5/consoleText -O
wget http://j:8080/job/my-job/5/consoleText
curl fails with this error:
curl: (18) transfer closed with outstanding read data remaining
Try the following.
curl -u "admin":"admin" "http://localhost:8080/job/my-job/5/logText/progressiveText?start=0"
If that doesn't work, I think your best option is to get the log from the server itself. The log can be found at ${JENKINS_HOME}/jobs/${JOB_NAME}/builds/${BUILD_NUMBER}/log.
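If the log is too large to come back in one response, a rough sketch like the one below fetches it in chunks via the progressive-text API. The URL, job name and credentials are placeholders, and it assumes the usual X-Text-Size / X-More-Data response headers:
#!/bin/sh
# fetch the console log chunk by chunk and append it to console.log
START=0
HEADERS=$(mktemp)
while : ; do
    curl -s -u "admin:admin" -D "$HEADERS" \
        "http://localhost:8080/job/my-job/5/logText/progressiveText?start=$START" >> console.log
    # X-Text-Size gives the offset to continue from;
    # X-More-Data is present while the build is still producing output
    START=$(tr -d '\r' < "$HEADERS" | awk 'tolower($1) == "x-text-size:" {print $2}')
    tr -d '\r' < "$HEADERS" | grep -qi '^x-more-data: true' || break
    sleep 2
done
rm -f "$HEADERS"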

Google Dataflow creates only one worker for large .bz2 file

I am trying to process the Wikidata json dump using Cloud Dataflow.
I have downloaded the file from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 and uploaded it to a GCS bucket. It's a large (50GB) .bz2 file containing a list of JSON dicts (one per line).
I understand that apache_beam.io.ReadFromText can handle .bz2 (I tested that on toy datasets) and that .bz2 is splittable. Therefore I was hoping that multiple workers would be created and would work in parallel on different blocks of that single file (though I'm not totally clear on how blocks would map to workers).
Ultimately I want to do some analytics on each line (each json dict) but as a test for ingestion I am just using the project's wordcount.py:
python -m apache_beam.examples.wordcount \
--input gs://MYBUCKET/wikidata/latest-all.json.bz2 \
--output gs://MYBUCKET/wikidata/output/entities-all.json \
--runner DataflowRunner \
--project MYPROJECT \
--temp_location gs://MYBUCKET/tmp/
At startup, autoscaling quickly increases the number of workers from 1 to 6, but only one worker does any work, and autoscaling then scales back from 6 to 1 after a couple of minutes (jobid: 2018-10-11_00_45_54-9419516948329946918).
If I disable autoscaling and set the number of workers explicitly, then all but one remain idle.
Can parallelism be achieved on this sort of input? Many thanks for any help.
Unlike Hadoop, Apache Beam has not yet implemented bzip2 splitting: https://issues.apache.org/jira/browse/BEAM-683
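Until that is implemented, one possible workaround (a sketch, not part of the original answer, assuming GNU split with --filter and the gsutil CLI) is to re-shard the dump into many smaller .bz2 files so that Dataflow can assign different files to different workers via a glob pattern:
# decompress, split into ~1M-line chunks, recompress each chunk
bzcat latest-all.json.bz2 | split -l 1000000 --filter='bzip2 > $FILE.bz2' - shard_
gsutil -m cp shard_*.bz2 gs://MYBUCKET/wikidata/shards/
# then point the pipeline at the shards:
#   --input gs://MYBUCKET/wikidata/shards/shard_*.bz2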

curl post data and file simultaneously

I have tried:
curl -v --http1.0 --data "mac=00:00:00" -F "userfile=@/tmp/02-02-02-02-02-22" http://url_address/getfile.php
but it fails with the following message:
Warning: You can only select one HTTP request!
How can I send a mix of data and file by curl? Is it possible or not?
Thank you
Read up on how -F actually works! You can add any number of data parts and file parts to the multipart formpost that -F makes. -d, however, makes a "standard" clean POST, and you cannot mix -d with -F.
You first need to figure out which kind of POST you want, and then pick either -d or -F accordingly.
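For example, if a multipart formpost is acceptable to the receiving getfile.php, the mac value can simply become another -F part (a sketch based on the command from the question):
curl -v --http1.0 -F "mac=00:00:00" -F "userfile=@/tmp/02-02-02-02-02-22" http://url_address/getfile.php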

curl needs to send '\r\n' - need transformation of a working solution

I need a transformation of the following working curl command:
curl --data-binary @"data.txt" http://www.example.com/request.asp
The data.txt includes this:
foo=bar
parameter1=4711
parameter2=4712
The key point is that I need to send the line breaks, and they are \r\n. It works with the file because it has the right encoding, but how do I get this curl command to run without the file? In other words, a one-liner that sends the parameters with the correct \r\n at the end of each.
All my tests with different URL encodings, etc. didn't work; I never got the same result as with the file.
I need this because I am having serious trouble getting this POST to work in my Ruby on Rails app using net/http.
Thanks!
One way to solve it is to generate the data on the fly with something like the printf command and have curl read it from stdin:
printf 'foo=bar\r\nparameter1=4711\r\nparameter2=4712' | curl --data-binary @- http://example.com
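If your shell is bash (or another shell that supports ANSI-C $'...' quoting), the same request can be written without printf; this variant is an assumption about your shell, not part of the original answer:
curl --data-binary $'foo=bar\r\nparameter1=4711\r\nparameter2=4712' http://www.example.com/request.asp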

how to tar last few lines

Is it possible to create a tar of only the last few lines of a file?
Something like this does not seem to be working.
tail abc.xml | tar -zcf bac.tar.gz
I am trying to keep the compressed file as small as possible, and I also want to transfer it over the network as fast as possible.
tar can read from standard input, but what are you actually trying to do? There is no need to create a tar archive for a single file; you can pipe directly to gzip:
tail abc.xml | gzip - > bac.gz
bac.gz will then contain the last 10 lines of your file (compressed).
But I suspect that your question does not reflect what you actually want to achieve: do you really want to send just the last part of the XML file as a compressed gzip file?
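If the goal really is just to ship the tail of the file over the network as compactly as possible, one way (a sketch, assuming ssh access to the receiving host; host and paths are placeholders) is to compress while sending:
tail -n 100 abc.xml | gzip | ssh user@remotehost 'cat > /tmp/abc-tail.xml.gz'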

Resources