Jena Query Optimization

I am pretty new to SPARQL and Apache Jena, so please forgive my naivety.
I loaded the Wikidata dump (705 GB) using the TDB2 loader and executed some example queries from the Wikidata Query Service.
Most of the queries take longer in Jena compared to the Wikidata Query Service.
My machine has 750 GB of RAM and 80 CPUs.
My questions are:
Why is the Wikidata Query Service faster than Jena?
How can I improve query performance without rewriting the queries? Maybe some indexing techniques, or specific server configurations?
I looked through all the Stack Overflow questions with the [jena] tag and didn't find anything about this. If you can point me to tutorials or other material besides the official Jena website, that would be great.

You can try using the next-generation TDB2 (instead of TDB1):
tdb2.tdbloader --loc /path/to/tdb2/ /path/to/some.ttl
Also, building a TDB2 dataset like that does not generate statistics by default; you have to do that manually. First cd into the TDB2 directory you created (following the example above, that is /path/to/tdb2) and run (in bash):
tdb2.tdbstats --loc=`pwd` > /tmp/stats.opt
mv /tmp/stats.opt /path/to/tdb2/Data-0001/
The statistics "guide the optimizer in choosing one execution plan over another", which could help you achieve better query performance.
https://jena.apache.org/documentation/tdb/optimizer.html#running-tdbstats
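To check whether the statistics actually change anything, it helps to time the same query before and after. A minimal sketch, assuming the dataset is served by a local Fuseki instance (the dataset name /wikidata and the endpoint URL are assumptions) and using Python's SPARQLWrapper:

    import time
    from SPARQLWrapper import SPARQLWrapper, JSON

    # Hypothetical setup: the TDB2 directory is served with
    #   fuseki-server --tdb2 --loc=/path/to/tdb2 /wikidata
    endpoint = SPARQLWrapper("http://localhost:3030/wikidata/sparql")
    endpoint.setReturnFormat(JSON)

    # Placeholder query; substitute one of your slow Wikidata examples.
    endpoint.setQuery("""
        SELECT (COUNT(*) AS ?n)
        WHERE { ?s <http://www.wikidata.org/prop/direct/P31> ?o }
    """)

    start = time.time()
    results = endpoint.query().convert()
    print(results["results"]["bindings"][0]["n"]["value"])
    print("took %.2fs" % (time.time() - start))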

Related

Which time-series database would be more suitable for sensor data - InfluxDB or Prometheus?

I have searched for it in many blogs, but it seems all the blogs present a biased view. I myself am leaning a little towards Prometheus now; however, I did not find any good article that explains a use case of Prometheus for sensor data.
In my case, we manufacture IoT devices and we have a lot of data coming in. Until now we have been using MongoDB for everything, but now I want to switch to a time-series database, and I am really confused about whether I can choose Prometheus or not.
I am comfortable writing my own metric converter that turns my sensor data into the Prometheus metrics format (if something doesn't exist already).
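For what such a converter can look like: a minimal sketch using the official prometheus_client library, where read_sensor is a hypothetical stand-in for however the device readings arrive:

    import random
    import time
    from prometheus_client import Gauge, start_http_server

    # One gauge per measurement; the label identifies the device.
    temperature = Gauge("sensor_temperature_celsius",
                        "Temperature reported by a sensor",
                        ["device_id"])

    def read_sensor():
        # Hypothetical stand-in for reading a real device.
        return {"device_id": "dev-42", "temperature": 20 + random.random()}

    if __name__ == "__main__":
        start_http_server(8000)  # Prometheus scrapes http://host:8000/metrics
        while True:
            reading = read_sensor()
            temperature.labels(device_id=reading["device_id"]).set(reading["temperature"])
            time.sleep(5)

Keep in mind that Prometheus pulls from this endpoint on its own scrape schedule, so it samples whatever the gauge holds at scrape time rather than ingesting every individual reading.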
Don't feel bad, lots of folks start out trying MongoDB for IoT applications because Mongo claims it's great for IoT. The only problem is, it's terrible for IoT. :-)
What you need is a true Time Series Database (TSDB). If you want to be able to query your data with SQL, try out QuestDB. It's the fastest open source TSDB out there and it's small.
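Since QuestDB speaks the PostgreSQL wire protocol, a quick way to try it from Python is an ordinary Postgres client; a sketch assuming a default local install (port 8812, admin/quest credentials) and a made-up sensor_data table:

    import psycopg2

    # QuestDB exposes the Postgres wire protocol on port 8812 by default.
    conn = psycopg2.connect(host="localhost", port=8812,
                            user="admin", password="quest", dbname="qdb")
    conn.autocommit = True
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS sensor_data (
            device_id SYMBOL,
            temperature DOUBLE,
            ts TIMESTAMP
        ) TIMESTAMP(ts) PARTITION BY DAY
    """)
    cur.execute("INSERT INTO sensor_data VALUES ('dev-42', 21.5, systimestamp())")

    # A typical time-series query: the latest reading per device.
    cur.execute("SELECT * FROM sensor_data LATEST ON ts PARTITION BY device_id")
    print(cur.fetchall())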
I think I found it: VictoriaMetrics. I haven't seen anything as amazing as VM. First, it supports both the Prometheus and InfluxDB write protocols (and not just these; it supports some other time-series database protocols as well) and a query language similar to Prometheus's. It has vmagent, of which you can easily run multiple instances. It has cluster support, and performance-wise there is nothing like it.
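To illustrate the InfluxDB write protocol support mentioned above: a single-node VictoriaMetrics instance accepts Influx line protocol over HTTP, so pushing a point needs nothing more than an HTTP POST (the address and metric names here are assumptions):

    import time
    import requests

    # Single-node VictoriaMetrics listens on 8428 and accepts
    # InfluxDB line protocol on the /write endpoint.
    VM_WRITE_URL = "http://localhost:8428/write"

    # Format: measurement,tag=value field=value timestamp(ns)
    line = "sensor_temperature,device_id=dev-42 value=21.5 %d" % time.time_ns()
    requests.post(VM_WRITE_URL, data=line).raise_for_status()

    # The point then becomes queryable via MetricsQL/PromQL as
    # sensor_temperature_value{device_id="dev-42"}.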

Neo4j to grafana

I want to present release data complexity, which is associated with each node (epic, user story, etc.), in Grafana in the form of charts, but Grafana does not support the Neo4j database. Is there any way, directly or indirectly, to present Neo4j data in Grafana?
I'm having the same issues and found this question among others. From my research I cannot agree with this answer completely, so I felt I should point some things out here.
Just to clarify: a graph database may seem structurally different from a relational or time series database, but it is possible to build Cypher queries that basically return graph data as tables with proper columns as it would be with any other supported data source. Therefore this sentence of the above mentioned answer:
So what you want to do is just not possible.
is not absolutely true, I'd say.
The actual problem is, there is no datasource plugin for Neo4j available at the moment. You would need to implement one on your own, which will be a lot of work (as far as I can see), but I suspect it to be possible. For me at least, this will be too much work to do, so I won't use any approach to read data directly from Neo4j into Grafana.
As a (possibly dirty) workaround in my case, a service will regularly copy the relevant portions of the Neo4j graph into a relational database (or a time-series database, if the data model is simple enough for that), which Grafana is aware of (see datasource plugins), so I can query it from there. This is basically the replication idea also given in the above-mentioned answer. You obviously end up with at least two different database systems and an additional service, which is not so insanely great, but at the moment it seems to be the quickest way around the missing datasource plugin. Maybe this is applicable in your case, too.
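A minimal sketch of such a replication service, using the official neo4j Python driver and SQLite as the relational target; the Epic label, its properties, and the credentials are assumptions about your graph:

    import sqlite3
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))
    db = sqlite3.connect("grafana_source.db")
    db.execute("""CREATE TABLE IF NOT EXISTS epics
                  (name TEXT, complexity REAL, release_name TEXT)""")

    # Flatten the relevant part of the graph into rows Grafana can query.
    with driver.session() as session:
        rows = session.run(
            "MATCH (e:Epic) RETURN e.name AS name, "
            "e.complexity AS complexity, e.release AS release")
        db.executemany("INSERT INTO epics VALUES (?, ?, ?)",
                       [(r["name"], r["complexity"], r["release"]) for r in rows])
    db.commit()

Run on a schedule (cron or similar), this keeps a table that Grafana can chart through any of its relational datasource plugins.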
Using Neo4j's Graphite metrics you can actually configure data to be sent to Grafana, and from there build whichever dashboards you like.
Up until recently, Graphite/Grafana wasn't supported, but it is now (in the recent 3.4 series releases), along with Prometheus and other options.
Update July 2021
There is a new plugin called Node Graph Panel (currently in beta) that can visualise graph structures in Grafana. A prerequisite for displaying your graph is to make sure that you have an API that exposes two data frames, one for nodes and one for edges, and that you set frame.meta.preferredVisualisationType = 'nodeGraph' on both data frames. See the Data API specification for more information.
So, one option would be to set up an API around your Neo4j instance that returns the nodes and edges according to the specification above. Note that I haven't tried it myself (yet), but it seems like a viable way to get Neo4j data into Grafana; a rough sketch of such an API follows below.
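A rough sketch of that idea with Flask and the neo4j Python driver; the field names follow the node graph conventions (id/title for nodes, id/source/target for edges), but the route, credentials, and property names are assumptions, and you still need a datasource that can feed the response to the panel:

    from flask import Flask, jsonify
    from neo4j import GraphDatabase

    app = Flask(__name__)
    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    @app.route("/api/graph")
    def graph():
        with driver.session() as session:
            nodes = [{"id": r["id"], "title": r["title"]}
                     for r in session.run(
                         "MATCH (n) RETURN id(n) AS id, n.name AS title")]
            edges = [{"id": r["id"], "source": r["source"], "target": r["target"]}
                     for r in session.run(
                         "MATCH (a)-[rel]->(b) RETURN id(rel) AS id, "
                         "id(a) AS source, id(b) AS target")]
        return jsonify({"nodes": nodes, "edges": edges})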
Grafana supports these databases, but not Neo4j: Graphite, InfluxDB, OpenTSDB, Prometheus, Elasticsearch, CloudWatch.
So what you want to do is just not possible.
You can replicate your Neo4j data into one of those databases, but the data models are really different (time series vs. graph).
If you just want to have some charts, you can use Apache Zeppelin for that.

Neo4j performance in server mode

I am learning Neo4j. I am accessing Neo4j via the REST APIs supported by the server mode, and CRUD operations are implemented using neo4jOperations. For experimentation, I have benchmarked its read operations, but I have found that the methods 'query' and 'queryForObjects' take a huge amount of time to execute, although I am querying via a field which is indexed. The traversals are not complex.
I have around 500K+ nodes and 900K+ relationships.
Neo4j version: 3.0.8.
Is there any way to improve query performance on Neo4j in server mode?
Without looking at your actual queries and model it is hard to say why the performance would not be up to your expectations. Try to run the queries through the Neo4j browser and either EXPLAIN or PROFILE them, that may give you a hint of where the issue is.
Having said that, you really should move to version 3.2.1 and access the server over the bolt:// protocol. That by itself should already improve things significantly.
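For example, with the official Python driver over bolt you can benchmark a query and pull its profiled plan in one go; the query and credentials below are placeholders:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    with driver.session() as session:
        # PROFILE executes the query and records the operators actually used.
        result = session.run("PROFILE MATCH (p:Person) WHERE p.name = $name "
                             "RETURN p", name="Alice")
        summary = result.consume()  # exhausts the stream and returns the summary
        print(summary.profile)      # plan tree with db hits per operator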
Regards,
Tom

Time-series database with rethinkdb

I'm currently evaluating RethinkDB for use as a time-series database for my charting needs.
I'm looking for existing tutorials and snippets for RethinkDB, if they even exist. (I already know that MongoDB, Redis, TempoDB, and Cassandra have similar resources.)
You have some resources on the RethinkDB website:
General introduction to the query language -- http://www.rethinkdb.com/docs/introduction-to-reql/
Cookbook (probably what you are looking for) -- http://www.rethinkdb.com/docs/cookbook/javascript/
API docs (there are some useful examples there) -- http://www.rethinkdb.com/api/javascript/
There are some full apps too -- http://www.rethinkdb.com/docs/examples/
I am not aware of specific snippets for time series, though.
But you just have to create an index on your time-series data and use orderBy({index: "time"}) (something like that).
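In Python that idea looks roughly like the following; the database, table, and field names are made up for the example:

    from rethinkdb import RethinkDB

    r = RethinkDB()
    conn = r.connect("localhost", 28015)

    # Create a secondary index on the timestamp field (run once);
    # it makes ordered range reads cheap.
    r.db("charts").table("measurements").index_create("time").run(conn)
    r.db("charts").table("measurements").index_wait("time").run(conn)

    # Last 100 points, newest first, served straight from the index.
    points = (r.db("charts").table("measurements")
               .order_by(index=r.desc("time"))
               .limit(100)
               .run(conn))
    for p in points:
        print(p)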

How to compute custom metrics using Elasticsearch + Kibana?

It seems like a pretty easy question, but for some reason I still can't understand how to solve it. I have an Elasticsearch cluster which is using the Twitter river to download tweets. I would like to implement a sentiment analysis module which takes each tweet and computes a score (+ve/-ve). I would like the score to be computed for each of the existing tweets as well as for new tweets, and then to visualize it using Kibana.
However, I am not sure where I should place the call to this sentiment analysis module in the Elasticsearch pipeline.
I have considered the option of modifying the Twitter river plugin, but that will not work retrospectively.
Essentially, I need to answer two questions:
1) How to call Python/Java code while indexing a document, so that I can modify the JSON accordingly.
2) How to use the same code to modify all the existing documents in ES.
If you don't want an external application to do the analysis before indexing the documents in Elasticsearch, the best way, I guess, is to write a plugin that does it. You can write a plugin that implements a custom analyzer that does the sentiment analysis, then define in the mapping which fields you want to run your analyzer on.
See examples of analysis plugins -
https://github.com/barminator/elasticsearch-analysis-annotation
https://github.com/yakaz/elasticsearch-analysis-combo/
To run the analysis on all existing documents you will need to reindex them after defining the correct mapping.
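If you do go the external-application route for the existing documents instead, here is a sketch of scoring and updating them in bulk with the elasticsearch Python client; compute_sentiment is a hypothetical stand-in for your module, and the index and field names are assumptions:

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import scan, bulk

    es = Elasticsearch("http://localhost:9200")

    def compute_sentiment(text):
        # Hypothetical stand-in for your real sentiment module.
        return 1.0 if "good" in text.lower() else -1.0

    def updates():
        # Stream every existing tweet and emit a partial-document update.
        for hit in scan(es, index="tweets", query={"query": {"match_all": {}}}):
            yield {
                "_op_type": "update",
                "_index": "tweets",
                "_id": hit["_id"],
                "doc": {"sentiment": compute_sentiment(hit["_source"]["text"])},
            }

    bulk(es, updates())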
