I need to load ~29 million nodes from a CSV file (with USING PERIODIC COMMIT) but I'm getting "Unknown error" after the first ~75k nodes are loaded. I've tried changing the commit size (250, 500, and 1000), increasing the Java heap (-Xmx4096m), and using memory mapping, but nothing changes (except the number of nodes that get loaded - with commit size 500 I get "Unknown error" after 75,499 nodes and with commit size 250 I get "Unknown error" after 75,749 nodes).
I'm doing it in the browser, using Neo4j 2.1.7 on a remote machine with 10GB of RAM and Windows Server 2012. Here's my code:
USING PERIODIC COMMIT 1000
LOAD CSV FROM "file:/C:/Users/thiago.marzagao/Desktop/CSVs/cnpj.csv" AS node
CREATE (:PessoaJuridica {id: node[0], razaoSocial: node[1], nomeFantasia: node[2], CNAE: node[3], porte: node[4], dataAbertura: node[5], situacao: node[6], dataSituacao: node[7], endereco: node[8], CEP: node[9], municipio: node[10], UF: node[11], tel: node[12], email: node[13]})
The really bad part is that the nioneo_logical.log files have some weird encoding that no text editor can figure out. All I see is eÿÿÿÿ414141, ÿÿÿÿÿÿÿÿ, etc. The messages file, in turn, ends with hundreds of garbage collection warnings, like these:
2015-02-05 17:16:54.596+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 304ms.
2015-02-05 17:16:55.033+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 238ms.
2015-02-05 17:16:55.471+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 231ms.
I've found somewhat related questions but not exactly what I'm looking for.
What am I missing?
The browser is the worst choice for running such an import, not least because of HTTP timeouts.
Enough RAM helps as well as a fast disk.
Try to use bin/Neo4jShell.bat, which connects to the running server. It's also best to make sure the CSV file is available locally.
Those nioneo.*log files are logical logs (write-ahead logs for transactions).
The log files you're looking for are data/log/*.log and data/graph.db/messages.log.
Something else you can do is open the browser inspector, go to the Network/Requests tab, and re-run the query so that you can get the raw HTTP response. We just discussed that and will try to dump it directly to the JS console in the future.
We are trying to use Dask to clean up some data as part of an ETL process.
The original file is a CSV of over 3GB.
When we run the code on a subset (1GB), it runs successfully, with a few user warnings about cleaning procedures such as these:
import re
ddf[id1] = ddf[id1].str.extract(r'(\d+)')
repeater = re.compile(r'((\d)\2{5,})')
mask_repeater = ddf[id1].str.contains(repeater, regex=True)
ddf = ddf[~mask_repeater]
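The two cleaning steps above (keep the first run of digits, then drop IDs containing one digit repeated six or more times in a row) can be checked on plain strings with the standard re module before running them through Dask. This is only a sketch: clean_id is a hypothetical helper, and the single-group form of the pattern is an assumption about the intent.

```python
import re

# One digit followed by five or more copies of itself (six or more in total).
REPEATER = re.compile(r'(\d)\1{5,}')


def clean_id(raw):
    """Return the first run of digits in raw, or None if it is missing
    or contains a long run of one repeated digit."""
    match = re.search(r'\d+', raw)
    if match is None:
        return None
    digits = match.group(0)
    if REPEATER.search(digits):
        return None
    return digits
```

For example, clean_id("AB123456") keeps the digits, while clean_id("x1111111") is rejected because of the repeated 1s.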
On the 3GB file the process nearly completes (there is only one task left - drop-duplicates-agg) and then restarts from the middle (that is what I can see from the Bokeh status website). We also see the following warning, which is the same one that appears when the script starts to run:
RuntimeWarning: Couldn't detect a suitable IP address for reaching '8.8.8.8', defaulting to '127.0.0.1'...
I'm running on a single offline 64-bit Windows workstation with 24 cores.
Any suggestions?
I am running 7 worker processes on a single machine with 4 cores. I may have made a poor choice with this loop while waiting for the result of map_async:
while not result.ready():
    time.sleep(10)
    for out in result.stdout:
        print out
rec_file_list = result.get()
result.stdout keeps growing with all the printed output from the 7 processes running, and it caused the console that initiated the map to hang. The activity monitor on my MacBook Pro shows the 7 processes are still running, and the terminal running the Controller is still active. What are my options here? Is there any way to acquire the result once the processes have completed?
I found an answer:
Remote introspection of AsyncResult objects is possible from another client as long as a 'database backend' has been enabled by the controller with:
ipcontroller --dictdb # or --mongodb or --sqlitedb
Then, it is possible to create a new client instance and retrieve the results with:
client.get_result(task_id)
where the task_ids can be retrieved with:
client.hub_history()
Also, a simple way to avoid the buffer overflow I encountered is to periodically print just the last few lines from each engine's stdout history, and to flush the buffer like:
from IPython.display import clear_output
import sys
import time

while not result.ready():
    clear_output()
    for stdout in result.stdout:
        if stdout:
            # show only the last few lines of each engine's output
            lines = stdout.split('\n')
            for line in lines[-4:-1]:
                if line:
                    print line
    sys.stdout.flush()
    time.sleep(30)
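The "last few lines per engine" step can also be pulled out into a small pure function, which makes it easy to try without a running cluster. This is a sketch in Python 3; tail_nonempty is a hypothetical helper name, and it takes a plain list of strings standing in for result.stdout.

```python
def tail_nonempty(stdout_buffers, n=3):
    """Return, for each engine's stdout text, its last n non-empty lines."""
    tails = []
    for text in stdout_buffers:
        # Drop blank lines first so a trailing newline does not use up a slot.
        lines = [line for line in text.split('\n') if line]
        tails.append(lines[-n:])
    return tails
```

Inside the polling loop, the nested lists it returns could be printed instead of the full buffers.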
In our unit testing suite, we create and destroy a large number of SQLite databases that use the path of ":memory:". Occasionally, and only when running on the iOS simulator, the creation of those databases fails with the rather enigmatic message:
Database ":memory:": unable to open database file
99% of the time, these requests succeed. (Subsequent tests within the same test run typically succeed after this failure occurs.) But when you're using this in an automated build-acceptance test, you want 100%.
We've instrumented for memory consumption (it's within normal limits) and disk-space availability (20GB+ available).
Any ideas?
UPDATE: Captured this happening with extra logging per Richard's suggestion below. Here's the log output:
SQLITE ERROR: (28) attempt to open "/Users/xxx/Library/Developer/CoreSimulator/Devices/CF762060-7D23-4C79-A466-7F20AB6233E7/data/Containers/Data/Application/582E1ED0-81E0-4CC7-A6F6-DBEBC101BBE8/tmp/etilqs_1ghbf1MSTa8ilSj" as
SQLITE ERROR: (14) cannot open file at line 30595 of [f66f7a17b7]
SQLITE ERROR: (14) os_unix.c:30595: (17) open(/Users/xxx/Library/Developer/CoreSimulator/Devices/CF762060-7D23-4C79-A466-7F20AB6233E7/data/Containers/Data/Application/582E1ED0-81E0-4CC7-A6F6-DBEBC101BBE8/tmp/etilqs_1ghbf1MST
I've noticed that even a :memory: database will create files on disk if you create a temporary table. The temporary file names on Unix systems are generated by a PRNG, so there is a non-zero chance of name collision if lots and lots of temporary files are created simultaneously. Or, if the disk is full, the create could fail. Or the Unix temp directory may be inaccessible for some reason, either because it has been deleted or because its permissions are invalid.
For example, I turned on several loggers in the sqlite3 command-line shell by adding these command-line arguments to llvm-gcc: -DSQLITE_DEBUG_OS_TRACE=1 -DSQLITE_TEST=1 -DSQLITE_DEBUG=1. Then I observed a temp file being created from the command line using this SQL:
$ ./sqlite3
SQLite version 3.8.8.2 2015-01-30 14:30:45
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> create temporary table t( x );
OPENX 3 /var/folders/nf/l1cw8sn1707b73zy5nqycrpw0000gn/T//etilqs_fvwR6KbMm518S4w 01002
OPEN 3
WRITE 3 512 0 0
OPENX 4 /var/folders/nf/l1cw8sn1707b73zy5nqycrpw0000gn/T//etilqs_OJJJ1lrTtQIFnUO 05402
OPEN 4
WRITE 4 1024 0 0
WRITE 4 1024 1024 0
WRITE 3 28 0 0
sqlite>
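Since the temp files come from temporary tables, one thing that may be worth trying is SQLite's PRAGMA temp_store setting, which asks the library to keep temporary tables and indices in memory rather than in files. The sketch below uses Python's sqlite3 module purely for illustration, not the iOS test suite; whether this removes the intermittent failure is an assumption, and a library compiled to force file storage will ignore the request.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Ask SQLite to keep temporary tables and indices in memory (2 == MEMORY).
# Set this before creating any temporary objects on the connection.
conn.execute("PRAGMA temp_store = MEMORY")

conn.execute("CREATE TEMPORARY TABLE t (x)")
conn.execute("INSERT INTO t VALUES (1)")

temp_store = conn.execute("PRAGMA temp_store").fetchone()[0]
rows = conn.execute("SELECT x FROM t").fetchall()
conn.close()
```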
No ideas. But perhaps turning on the error and warning log will give some clues.
I am monitoring Java heap usage on all managed servers in the Weblogic 10.3 domain using WLST. I have written a Jython script to achieve this. This script first logs into the admin server in the domain. Following is the code snippet that fetches the heap statistics for each managed server:
def getServerJavaHeap():
    domainRuntime()
    servers=domainRuntimeService.getServerRuntimes()
    for server in servers:
        free = int(server.getJVMRuntime().getHeapFreeCurrent())/(1024*1024)
        freePct = int(server.getJVMRuntime().getHeapFreePercent())
        current = int(server.getJVMRuntime().getHeapSizeCurrent())/(1024*1024)
        max = int(server.getJVMRuntime().getHeapSizeMax())/(1024*1024)
        print 'Domain Name #', cmo.getName()
        print 'Server Name #', server.getName()
        print 'Current Heap Size #', current
        print 'Current Heap Free #', free
        print 'Maximum Heap Size #', max
        print 'Percentage Heap Free #', freePct
The heap statistics that the above code fetches are different from what the Weblogic admin console shows. For instance, for managed server123, the above code reports heap usage of 1.25GB while the admin console shows 3GB.
I am wondering why there is a discrepancy between what the admin console shows and the output of the above code. I am trying to determine whether I am looking in the right place and invoking the right method calls (listed here in the docs) to get the heap statistics on each managed server.
I am sure the time when the script ran is also a factor. I was wondering how frequently the admin console refreshes these tables.
I can't see anything wrong with your approach tbh. The admin console page won't update automatically unless you click on the auto-refresh icon (the two arrows forming a circle) in the top left of the table. By default the refresh interval is 10 seconds but this can be set from the 'Preferences' page - the link is on the banner of every page.
I tried on both an Admin server and a managed server and as long as I ran the code close to a refresh, the numbers tied up. I can only assume a garbage collection ran between the time the console displayed the data and the time your script ran.
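When comparing the two sources it may also help to be explicit about which figure is which: the script prints the current heap size and the free heap, and "used" heap is the difference between them. A minimal sketch of that arithmetic in Python, where heap_summary and its byte-count inputs are hypothetical stand-ins for the values returned by getHeapSizeCurrent() and getHeapFreeCurrent():

```python
MB = 1024 * 1024


def heap_summary(heap_size_bytes, heap_free_bytes):
    """Derive used heap and free percentage from size and free byte counts."""
    used_bytes = heap_size_bytes - heap_free_bytes
    return {
        "size_mb": heap_size_bytes // MB,
        "free_mb": heap_free_bytes // MB,
        "used_mb": used_bytes // MB,
        # Percentage of the *current* heap size that is free (integer part).
        "free_pct_of_current": heap_free_bytes * 100 // heap_size_bytes,
    }


# Hypothetical reading: 3 GB current heap with 1 GB of it free.
example = heap_summary(3 * 1024 * MB, 1024 * MB)
```

With these made-up inputs, used_mb comes out at 2048, so a "3 GB" heap size and a much smaller "usage" can both describe the same moment.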
I wrote an admin script that tails a Heroku log and, every n seconds, summarizes averages and notifies me if I cross a certain threshold (yes I know and love New Relic -- but I want to do custom stuff).
Here is the entire script.
I have never been a master of IO and threads, so I wonder if I am making a silly mistake. I have a couple of daemon threads that have while(true){}, which could be the culprit. For example:
# read new lines
f = File.open(file, "r")
f.seek(0, IO::SEEK_END)
while true do
  select([f])
  line = f.gets
  parse_heroku_line(line)
end
I use one daemon to watch for new lines of a log, and the other to periodically summarize.
Does someone see a way to make it less processor-intensive?
This probably runs hot because you never really block while reading from the temporary file. IO::select is a thin layer over POSIX select(2). It looks like you're trying to block until the file is ready for reading, but select(2) considers EOF to be ready ("a file descriptor is also ready on end-of-file"), so you always return right away from select then call gets which returns nil at EOF.
You can get a truer EOF reading and nice blocking behavior by dropping the thread that writes to the temp file and instead using IO::popen to fork the %x[heroku logs --ps router --tail --app pipewave-cedar] log tailer, connected to a Ruby IO object. You can then loop over gets on that object, exiting when gets returns nil (indicating the log tailer finished). gets on the pipe from the tailer will block when there's nothing to read, and your script will only run as hot as it takes to do your line parsing and reporting.
EDIT: I'm not set up to actually try your code, but you should be able to replace the log tailer thread and your temp file read loop with this code to get the behavior described above:
IO.popen( %w{ heroku logs --ps router --tail --app my-heroku-app } ) do |logf|
  while line = logf.gets
    parse_heroku_line(line) if line =~ /^/
  end
end
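The same blocking-pipe pattern exists in other languages too. As a rough sketch in Python (not a replacement for the Ruby above), iterating over a child process's stdout blocks until a line arrives and ends when the child exits. stream_lines is a hypothetical helper, and the demo command is a stand-in for the real log tailer.

```python
import subprocess
import sys


def stream_lines(cmd):
    """Yield lines from cmd's stdout, blocking until each one arrives."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")


# Stand-in for the log tailer: a child process that prints two lines and exits.
demo_cmd = [sys.executable, "-c", "print('one'); print('two')"]
lines = list(stream_lines(demo_cmd))
```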
I also notice your reporting thread does not do anything to synchronize access to #total_lines, #total_errors, etc. So, you have some minor race conditions where you can get inconsistent values from the instance vars that parse_heroku_line method updates.
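One common way to close that kind of race is to guard the counters with a lock and have the reporter take its snapshot and reset under the same lock. A minimal sketch in Python, where LogStats and its method names are hypothetical and only mirror the counters mentioned above:

```python
import threading


class LogStats:
    """Counters shared by a parsing thread and a reporting thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._total_lines = 0
        self._total_errors = 0

    def record(self, is_error):
        # Called by the parser for every line it handles.
        with self._lock:
            self._total_lines += 1
            if is_error:
                self._total_errors += 1

    def snapshot_and_reset(self):
        # Called by the reporter: read both counters and zero them atomically.
        with self._lock:
            snapshot = (self._total_lines, self._total_errors)
            self._total_lines = 0
            self._total_errors = 0
        return snapshot
```

Because the read and the reset happen under one lock, the reporter never sees a line count from one moment paired with an error count from another.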
select is about whether a read would block. f is just a plain old file, so when you get to the end, reads don't block; they just return nil instantly. As a result, select returns instantly rather than waiting for something to be appended to the file, as I assume you're expecting. Because of this you're sitting in a tight busy loop, so high CPU is to be expected.
If you are at EOF (you could check either f.eof? or whether gets returns nil), then you could either start sleeping (perhaps with some sort of back-off) or use something like listen to be notified of filesystem changes.
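The sleep-with-back-off option can be sketched as follows, in Python for brevity. The names follow and next_delay, the 0.1 s base, and the 5 s cap are all assumptions for illustration; max_idle_polls exists only so that the demo terminates.

```python
import io
import time


def next_delay(current, got_data, base=0.1, cap=5.0):
    """Reset to base after data arrives; otherwise double, up to cap."""
    if got_data:
        return base
    return min(current * 2, cap)


def follow(f, handle, max_idle_polls=None, sleep=time.sleep):
    """Call handle(line) for each line read from f, sleeping when at EOF."""
    delay = 0.1
    idle = 0
    while max_idle_polls is None or idle < max_idle_polls:
        line = f.readline()
        if line:
            handle(line)
            idle = 0
            delay = next_delay(delay, True)
        else:
            # At EOF: sleep instead of spinning, and wait longer next time.
            sleep(delay)
            idle += 1
            delay = next_delay(delay, False)


# Demo with an in-memory "file" and a fake sleep that records the delays.
seen, sleeps = [], []
follow(io.StringIO("a\nb\n"), seen.append, max_idle_polls=2, sleep=sleeps.append)
```

In the demo the two lines are handled immediately, and the loop then backs off from 0.1 s to 0.2 s before giving up.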