Real Time computation - firebase-realtime-database

I have an algorithm written in Python and MySQL that takes a CSV file and some properties as input, then runs for 20-25 minutes to produce its output.
I want to make it real-time, so that if a new input CSV is uploaded or a property is changed, the output is updated without having to re-run the whole algorithm.
Note: the data the algorithm runs on can be very large.
I need help making the computation real-time.
I am trying to move from MySQL to a NoSQL DB, but the algorithm still takes time to run and is not real-time.

You should try using one of the streaming services for event-driven updates. Instead of a CSV, write the data to a stream such as Kafka or Kinesis. Then write a consumer that reads the incoming events and updates the result incrementally, without re-running the full algorithm. You might also be able to use Apache Flink for aggregations against Kafka.
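A minimal sketch of that consumer loop, assuming the kafka-python client and a hypothetical "input-rows" topic that carries one CSV row per message as JSON (topic name, brokers and the aggregate are made up for illustration):

```python
# Event-driven sketch: keep a running aggregate and update it per event
# instead of re-running the 20-25 minute batch job on every change.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "input-rows",                        # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

# Running state; in production this would live in RocksDB, Redis, a DB, etc.
totals = {}

for message in consumer:
    row = message.value                  # e.g. {"key": "A", "amount": 12.5}
    key = row["key"]
    # Incremental update: adjust only the affected aggregate rather than
    # recomputing everything from the full CSV.
    totals[key] = totals.get(key, 0.0) + row["amount"]
    # ...publish or persist the updated result here (DB, another topic, etc.)
```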

Related

How to download InfluxDB 2.0 data in CSV format?

I am not familiar with the InfluxDB command line, especially InfluxDB 2.0, so I chose to use the InfluxDB frontend on port 8086. However, when I try to download a .csv through the frontend, too much data makes the browser crash, which ultimately makes the download fail.
I have read the InfluxDB 2.0 documentation and found no answer. Do I have to use the command line, and if so, which commands should I use? Thanks a lot in advance.
I have the same issue using Flux in a browser session.
If you need a lot of data, use the InfluxDB API and capture the result in a file. See the 'example query request' using curl on the referenced page. I find this to be very quick and haven't had it fall over when returning large data sets.
(If the amount of data is enormous you can also ask InfluxDB to gzip it before download, but of course this may load up the machine it's running on.)
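For reference, here is a rough Python equivalent of that curl request against the InfluxDB 2.0 /api/v2/query endpoint; the token, org, bucket and Flux range below are placeholders:

```python
# Query the InfluxDB 2.0 HTTP API with a Flux script and stream the CSV
# response straight into a file, so large results never sit in the browser.
import requests

url = "http://localhost:8086/api/v2/query"
headers = {
    "Authorization": "Token YOUR_API_TOKEN",   # placeholder token
    "Accept": "application/csv",
    "Content-Type": "application/vnd.flux",
}
flux = 'from(bucket: "my-bucket") |> range(start: -30d)'  # placeholder query

with requests.post(url, params={"org": "my-org"}, headers=headers,
                   data=flux, stream=True) as resp:
    resp.raise_for_status()
    with open("export.csv", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)   # written in 1 MB chunks to keep memory flat
```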

Best practices for Kafka Streams

We have a prediction service written in Python that provides the machine learning functionality: you send it a set of data and it returns anomaly detection results, predictions, and so on.
I want to use Kafka Streams to process the real-time data.
There are two options to choose between:
The Kafka Streams job only performs the ETL: it loads the data, applies simple transforms, and saves it to Elasticsearch. A timer then periodically loads data from ES, calls the prediction service to compute results, and saves them back to ES.
The Kafka Streams job does everything beyond the ETL as well: once the ETL is done it sends the data to the prediction service, saves the computed result to Kafka, and a consumer forwards the result from Kafka to ES.
I think the second way is more real-time, but I don't know whether it is a good idea to run so many prediction tasks inside streaming jobs.
Are there any common patterns or advice for such an application?
Yes, I'd opt for the second option as well.
What you can do is use Kafka as the data pipeline between your ML training module and your prediction module. Both modules could very well be implemented in Kafka Streams.
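As a rough illustration of the second option's data flow (Kafka Streams itself is a Java/Scala library, so this is just a plain Python consumer/producer sketch; the topic names and predict-service URL are hypothetical):

```python
# Option 2 as a pipeline: read cleaned events from Kafka, call the Python
# predict service over HTTP, and publish predictions back to Kafka. A
# separate, dumb consumer can then forward results from Kafka into ES.
import json
import requests
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "raw-events",                           # hypothetical ETL output topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    features = message.value
    # Call the prediction service for each record (batching would reduce
    # HTTP overhead if the service supports it).
    resp = requests.post("http://predict-service:8000/predict", json=features)
    resp.raise_for_status()
    producer.send("prediction-results", resp.json())
```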

Apache Hive and record updates

I have streaming data coming into my consumer app that I ultimately want to show up in Hive/Impala. One option would be to use the Hive-based APIs to insert the updates into the Hive table in batches.
The alternative approach is to write the data directly into HDFS as Avro/Parquet files and let Hive detect the new data and pick it up.
I tried both approaches in my dev environment, and the 'only' drawbacks I noticed were the high latency of writing to Hive and the failure conditions I need to account for in my code.
Is there an architectural design pattern or best practice to follow?
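For what it's worth, a sketch of the second approach (append a Parquet file under the table's HDFS location, then register the partition so Hive can see it) might look like the following, assuming pyarrow with libhdfs configured and PyHive for the DDL; the paths, hosts and table/partition names are made up:

```python
# Write a micro-batch as Parquet into the Hive table's HDFS directory,
# then add the partition so Hive picks it up (Impala would need a REFRESH).
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs
from pyhive import hive

# Micro-batch of records accumulated by the consumer app (toy data).
batch = pa.table({"event_id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

hdfs = fs.HadoopFileSystem(host="namenode", port=8020)
pq.write_table(
    batch,
    "/warehouse/events/dt=2024-01-01/part-0001.parquet",
    filesystem=hdfs,
)

# Register the new partition with Hive.
conn = hive.Connection(host="hiveserver2", port=10000)
cur = conn.cursor()
cur.execute("ALTER TABLE events ADD IF NOT EXISTS PARTITION (dt='2024-01-01')")
```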

Apache Samza local storage - OrientDB / Neo4J graph instead of KV store

Apache Samza uses RocksDB as the storage engine for its local storage. This allows for stateful stream processing, and here's a very good overview.
My use case:
I have multiple streams of events that I wish to process taken from a system such as Apache Kafka.
These events create state - the state I wish to track is based on previous messages received.
I wish to generate new stream events based on the calculated state.
The input stream events are highly connected, and a graph database such as OrientDB / Neo4j would be the ideal medium for querying the data to create the new stream events.
My question:
Is it possible to use a non-KV store as the local storage for Samza? Has anyone ever done this with OrientDB / Neo4J and is anyone aware of an example?
I've been evaluating Samza and I'm by no means an expert, but I'd recommend reading the official documentation, and even reading through the source code; other than the fact that it's in Scala, it's remarkably approachable.
In this particular case, toward the bottom of the documentation's page on State Management you have this:
Other storage engines
Samza’s fault-tolerance mechanism (sending a local store’s writes to a replicated changelog) is completely decoupled from the storage engine’s data structures and query APIs. While a key-value storage engine is good for general-purpose processing, you can easily add your own storage engines for other types of queries by implementing the StorageEngine interface. Samza’s model is especially amenable to embedded storage engines, which run as a library in the same process as the stream task.
Some ideas for other storage engines that could be useful: a persistent heap (for running top-N queries), approximate algorithms such as bloom filters and hyperloglog, or full-text indexes such as Lucene. (Patches accepted!)
I actually read through the code for the default StorageEngine implementation about two weeks ago to gain a better sense of how it works. I definitely don't know enough to say much intelligently about it, but I can point you at it:
https://github.com/apache/samza/tree/master/samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv
https://github.com/apache/samza/tree/master/samza-kv/src/main/scala/org/apache/samza/storage/kv
The major implementation concerns seem to be:
Logging all changes to a topic so that the store's state can be restored if a task fails.
Restoring the store's state in a performant manner.
Batching writes and caching frequent reads in order to save on trips to the raw store.
Reporting metrics about the use of the store.
Do the input stream events define one global graph, or multiple graphs, one per matching Kafka/Samza partition? That is important, as Samza state is local, not global.
If it's one global graph, you can update/query a separate graph system from the Samza task's process method. Titan on Cassandra would be one such graph system.
If it's multiple separate graphs, you can use the existing RocksDB KV store to mimic graph database operations. Titan on Cassandra does just that: it uses the Cassandra KV store to store and query the graph. Graphs are stored either as an adjacency matrix (set [i,j] to 1 if the nodes are connected) or as an adjacency (edge) list. For each node, use it as the key and store its set of neighbors as the value.
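As a toy illustration of that adjacency-list idea (a plain Python dict stands in for the per-task RocksDB store here; a real Samza task would use the store handed to it by the container and serialize values to bytes):

```python
# Node id is the key, the serialized set of neighbors is the value.
import json

store = {}  # stand-in for the per-task RocksDB key-value store

def add_edge(src, dst):
    """Record a directed edge src -> dst in the KV store."""
    neighbors = set(json.loads(store.get(src, "[]")))
    neighbors.add(dst)
    store[src] = json.dumps(sorted(neighbors))

def neighbors_of(node):
    """Return the set of nodes reachable from `node` in one hop."""
    return set(json.loads(store.get(node, "[]")))

add_edge("user:1", "user:2")
add_edge("user:1", "user:3")
print(neighbors_of("user:1"))  # {'user:2', 'user:3'}
```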

InfluxDB on a Raspberry Pi: send data periodically to a logging host

I would like to use InfluxDB with my Raspberry Pi/openHAB home automation.
I am just worried about DB size/performance.
So my plan would be: log only one month of data on the Pi and let older data be cleaned up automatically.
Cleaning, as I understand it, is easy with a retention policy (it automatically clears old data).
For long-term analysis I want to collect all the data on a server.
Now the question: how can I export the data on the Pi to a flat file before retention removes it, and afterwards import that data into a separate InfluxDB on a different server?
(Or even better: is there a way to do this in a sort of cluster mode?)
Thanks a lot,
Chris
I use InfluxDB on a Pi for sensor logs. I currently log 4 records every 5 seconds and have done so for more than 3 months, and performance on my Pi is really good. I don't have the file size to hand, but it was no more than 10 MB.
You can use InfluxDB in cluster mode, but I am not sure it will answer your question about data cleaning.
To export data, you can use the InfluxDB API to get all the series in the database, then all the data, and flush that into a JSON file. You can then use the API to load that file into another DB.
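A rough sketch of that export/replay flow against the InfluxDB 1.x HTTP API (hosts, database name and the data handling are simplified assumptions: every returned column is re-written as a field, only numeric values are copied, and tags are not handled separately):

```python
# Dump every measurement from the Pi's InfluxDB and replay it as line
# protocol on the long-term server.
import requests

SRC = "http://raspberrypi:8086"     # placeholder source host
DST = "http://logging-host:8086"    # placeholder target host
DB = "openhab"                      # placeholder database name

def query(base, q):
    r = requests.get(f"{base}/query", params={"db": DB, "q": q, "epoch": "ns"})
    r.raise_for_status()
    return r.json()

# 1. Discover all measurements (series) in the source database.
measurements = [v[0] for v in
                query(SRC, "SHOW MEASUREMENTS")["results"][0]
                .get("series", [{}])[0].get("values", [])]

# 2. Dump each measurement and replay it on the target via /write.
for m in measurements:
    series = query(SRC, f'SELECT * FROM "{m}"')["results"][0].get("series", [])
    lines = []
    for s in series:
        cols = s["columns"]              # first column is "time"
        for row in s["values"]:
            ts, values = row[0], row[1:]
            pairs = ",".join(f"{c}={v}" for c, v in zip(cols[1:], values)
                             if isinstance(v, (int, float)))
            if pairs:
                lines.append(f"{m} {pairs} {ts}")
    if lines:
        r = requests.post(f"{DST}/write", params={"db": DB},
                          data="\n".join(lines))
        r.raise_for_status()
```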
