websocket memory management - memory

I am new to websockets and am pulling some data via the Coinbase websocket, upon every message I append to a pandas df in memory (I know this is bad practice, just trying to get a working version). Every minute I upload this df to TimescaleDB and clear out old data from the df.
I am noticing though that on occasion I am failing to upload the df and as a result I am not clearing the df of old values, and eventually the df consumes all the memory.
Is this a feature of websockets? Or is something off with my scheduler?
This is my scheduler for reference -
while True
nowtime = datetime.now()
floornow = pd.Timestamp(nowtime).floor(freq='1S')
candlefloor = floornow.floor(freq=f'{candle_len}T')
if (floornow == candlefloor):
try:
upload_timescale()
if (candlefloor != 'last datetime'):
timetowait = candle_len*60-(datetime.now() - candlefloor).total_seconds()-0.05
time.sleep(timetowait)
except:
raise ValueError('bug in uploading to timescaledb')
else:
tts = candle_len*60-(nowtime - candlefloor).total_seconds()
if (tts > 2):
time.sleep(tts)
What is the best way to handle a use case where I need to store websocket data intermittently to process, before cleaning it out?
Thanks!

Related

Redis - monitoring maximum memory before inserts fail?

While this Q/A does not address the actual issue of: How to detect with client (eg redis-py) that redis is running out of memory constraint not by machine but by the maxmem configuration? Before inserts fail which command to use in the programm to detect about to be full?
My first guess is: info and check if used_memory_peak < maxmem setting. Is this correct?
(Besides, for out of machine memory, since defrag, use which setting, none of the returned INFO fields help here)
Well should i just try an insert and see if fail (but that would be after the fact then.)
Trail and error, good enough tested by running
while true; do redis-cli lpush mm longstringhere; done; results on maxmem - used_memory < 0.1MB with insert failures:
(error) OOM command not allowed when used memory > 'maxmemory'.
So i have set i poll it via redis-py client and once the diff goes <1mb threshold throw up, sry raise Error of course. Make sure the user_memory memory addon of your longest command is < threshold too of course otherwise you run into it on insert.
I try to figure how to calc the ~percentage of used mem so i get notification way earlier eg 90% of maxmem, therefore this solution is fine.
Info dump:
# Memory
used_memory:3126272
used_memory_human:2.98M
used_memory_rss:5292032
used_memory_rss_human:5.05M
used_memory_peak:4914296
used_memory_peak_human:4.69M
used_memory_peak_perc:63.62%
used_memory_overhead:696654...
Furthermore maxmem is not a hardcap, when running it further by eg adding members to existing set.
used_memory:3162584
used_memory_human:3.02M
code to get percent 0-100
rmem_info = pipe.info(section='memory')
{'redis_mem_percent': math.ceil(rmem_info['used_memory'] / rmem_info['maxmemory'] *100)}

Override dask scheduler to concurrently load data on multiple workers

I want to run graphs/futures on my distributed cluster which all have a 'load data' root task and then a bunch of training tasks that run on that data. A simplified version would look like this:
from dask.distributed import Client
client = Client(scheduler_ip)
load_data_future = client.submit(load_data_func, 'path/to/data/')
train_task_futures = [client.submit(train_func, load_data_future, params)
for params in train_param_set]
Running this as above the scheduler gets one worker to read the file, then it spills that data to disk to share it with the other workers. However, loading the data is usually reading from a large HDF5 file, which can be done concurrently, so I was wondering if there was a way to force all workers to read this file concurrently (they all compute the root task) instead of having them wait for one worker to finish then slowly transferring the data from that worker.
I know there is the client.run() method which I can use to get all workers to read the file concurrently, but how would you then get the data you've read to feed into the downstream tasks?
I cannot use the dask data primitives to concurrently read HDF5 files because I need things like multi-indexes and grouping on multiple columns.
Revisited this question and found a relatively simple solution, though it uses internal API methods and involves a blocking call to client.run(). Using the same variables as in the question:
from distributed import get_worker
client_id = client.id
def load_dataset():
worker = get_worker()
data = {'load_dataset-0': load_data_func('path/to/data')}
info = worker.update_data(data=data, report=False)
worker.scheduler.update_data(who_has={key: [worker.address] for key in data},
nbytes=info['nbytes'], client=client_id)
client.run(load_dataset)
Now if you run client.has_what() you should see that each worker holds the key load_dataset-0. To use this in downstream computations you can simply create a future for the key:
from distributed import Future
load_data_future = Future('load_dataset-0', client=client)
and this can be used with client.compute() or dask.delayed as usual. Indeed the final line from the example in the question would work fine:
train_task_futures = [client.submit(train_func, load_data_future, params)
for params in train_param_set]
Bear in mind that it uses internal API methods Worker.update_data and Scheduler.update_data and works fine as of distributed.__version__ == 1.21.6 but could be subject to change in future releases.
As of today (distributed.__version__ == 1.20.2) what you ask for is not possible. The closest thing would be to compute once and then replicate the data explicitly
future = client.submit(load, path)
wait(future)
client.replicate(future)
You may want to raise this as a feature request at https://github.com/dask/distributed/issues/new

Checking if a Variable exists without using v.get() in Dask Distributed Client

When using a Dask Distributed Client, I can easily get a variable like this:
v = Variable('myvar')
v.get(timeout=5)
But if I do this on a none existing key:
v = Variable('doesntexist')
v.get(timeout=5)
Gives me, after 5 seconds:
distributed.utils - ERROR -
tornado.gen.TimeoutError
Is it possible to check beforehand if a variable exists without relying on timeouts?
Regards,
Niklas

Using Neo4j with React JS

Can we use graph database neo4j with react js? If not so is there any alternate option for including graph database in react JS?
Easily, all you need is neo4j-driver: https://www.npmjs.com/package/neo4j-driver
Here is the most simplistic usage:
neo4j.js
//import { v1 as neo4j } from 'neo4j-driver'
const neo4j = require('neo4j-driver').v1
const driver = neo4j.driver('bolt://localhost', neo4j.auth.basic('username', 'password'))
const session = driver.session()
session
.run(`
MATCH (n:Node)
RETURN n AS someName
`)
.then((results) => {
results.records.forEach((record) => console.log(record.get('someName')))
session.close()
driver.close()
})
It is best practice to close the session always after you get the data. It is inexpensive and lightweight.
It is best practice to only close the driver session once your program is done (like Mongo DB). You will see extreme errors if you close the driver at a bad time, which is incredibly important to note if you are beginner. You will see errors like 'connection to server closed', etc. In async code, for example, if you run a query and close the driver before the results are parsed, you will have a bad time.
You can see in my example that I close the driver after, but only to illustrate proper cleanup. If you run this code in a standalone JS file to test, you will see node.js hangs after the query and you need to press CTRL + C to exit. Adding driver.close() fixes that. Normally, the driver is not closed until the program exits/crashes, which is never in a Backend API, and not until the user logs out in the Frontend.
Knowing this now, you are off to a great start.
Remember, session.close() immediately every time, and be careful with the driver.close().
You could put this code in a React component or action creator easily and render the data.
You will find it no different than hooking up and working with Axios.
You can run statements in a transaction also, which is beneficial for writelocking affected nodes. You should research that thoroughly first, but transaction flow is like this:
const session = driver.session()
const tx = session.beginTransaction()
tx
.run(query)
.then(// same as normal)
.catch(// errors)
// the difference is you can chain multiple transactions:
const tx1 = await tx.run().then()
// use results
const tx2 = await tx.run().then()
// then, once you are ready to commit the changes:
if (results.good !== true) {
tx.rollback()
session.close()
throw error
}
await tx.commit()
session.close()
const finalResults = { tx1, tx2 }
return finalResults
// in my experience, you have to await tx.commit
// in async/await syntax conditions, otherwise it may not commit properly
// that operation is not instant
tl;dr;
Yes, you can!
You are mixing two different technologies together. Neo4j is graph database and React.js is framework for front-end.
You can connect to Neo4j from JavaScript - http://neo4j.com/developer/javascript/
Interesting topic. I am using the driver in a React App and recently experienced some issues. I am closing the session every time a lifecycle hook completes like in your example. When there where more intensive queries I would see a timeout error. Going back to my setup decided to experiment by closing the driver in some more expensive queries and it looks like (still need more testing) the crashes are gone.
If you are deploying a real-world application I would urge you to think about Authentication and Authorization when using a DB-React setup only as you would have to store username/password of the neo4j server in the client. I am looking into options of having the Neo4J server issuing a token and receiving it for Authorization but the best practice is for sure to have a Node.js server in the middle with something like Passport to handle Authentication.
So, all in all, maybe the best scenario is to only use the driver in Node and have the browser always communicating with the Node server using axios...

Reading from TCPSocket is slow in Ruby / Rails

I have this simple piece of code that writes to a socket, then reads the response from the server. The server is very fast (responds within 5ms every time). However, while writing to the socket is quick -- reading the response from the socket is always MUCH slower. Any clues?
module UriTester
module UriInfo
class << self
def send_receive(socket, xml)
# socket = TCPSocket.open("service.server.com","2316")
begin
start = Time.now
socket.print(xml) # Send request
puts "just printed the xml into socket #{Time.now - start}"
rescue Errno::ECONNRESET
puts "looks like there is an issue!!!"
socket = TCPSocket.open("service.server.com","2316")
socket.print(xml) # Send request
end
response=""
while (line =socket.recv(1024))
response += line
break unless line.grep(/<\/bcap>/).empty?
end
puts "SEND_RECEIVE COMPLETED. IN #{Time.now - start}"
# socket.close
response
end
end
end
end
Thanks!
Writing to the socket will always be way faster than reading in this case because the write is local to the machine and the read has to wait for a response to come over the network.
In more detail, when the call to write / send returns all the system is telling you is that N bytes have been successfully copied to the sockets kernel space buffer. It does not mean that the data has actually been sent across the network yet. In fact, the data can sit in the socket buffer for quite a long time ( assuming you're using TCP ). This is because of something called the Nagle algorithm which is intended to make efficient use of network bandwidth. This unseen Nagle delay adds to the round trip time and how long till you get a response. The server might also delay it's response for the same reason adding even more to the response time.
So when you time the write and it returns quickly, that doesn't actually mean anything useful.
As mentioned earlier the read from the socket will be way longer since when you time the read you are actually timing the round trip travel time plus server response time, which is always going to be way slower than the time it takes to copy data from a user space program to a kernel buffer.
What is it that you are actually trying to measure and why?

Resources