Elasticsearch / Kibana: Application-side joins

Is it possible with Kibana (preferably the shiny new version 4 beta) to perform application-side joins?
I know that ES / Kibana is not built to replace relational databases, and it is normally a better idea to denormalize my data. In this use case, however, that is not the best approach, since the index size is exploding and performance is dropping:
I'm indexing billions of documents containing session information of network flows like this: source ip, source port, destination ip, destination port, timestamp.
Now I also want to collect additional information for each IP address, such as geolocation, ASN, reverse DNS, etc. Adding this information to every single session document makes the whole database unmanageable: there are millions of documents with the same IP addresses, and the redundancy of adding the same additional information to all of them leads to massive bloat and an unresponsive user experience, even on a cluster with hundreds of gigabytes of RAM.
Instead, I would like to create a separate index containing only the unique IP addresses and the metadata I have collected for each one of them.
The question is: how can I still analyze my data using Kibana? For each document returned by the query, Kibana should perform a lookup in the IP index and "virtually enrich" each IP address with this information. Something like adding virtual fields, so the structure would look like this (on the fly):
source ip, source port, source country, source asn, source fqdn
I'm aware that this would come at the cost of multiple queries.
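Since Kibana itself can't do this, here is a minimal sketch of what such a "virtual enrichment" could look like when done application-side against the Elasticsearch REST API. The index names (sessions, ip-metadata), the field names, and the assumption that the metadata index uses the IP address as its document _id are all placeholders, not part of the original setup:

```python
import requests

ES = "http://localhost:9200"   # assumed cluster address
SESSIONS = "sessions"          # assumed index holding the flow documents
IP_META = "ip-metadata"        # assumed index keyed by IP address (_id == ip)

def search_sessions(query, size=1000):
    # Query 1: fetch the raw flow documents matching the filter.
    r = requests.post(f"{ES}/{SESSIONS}/_search",
                      json={"query": query, "size": size})
    r.raise_for_status()
    return [hit["_source"] for hit in r.json()["hits"]["hits"]]

def lookup_ips(ips):
    # Query 2: one multi-get against the IP index instead of one query per IP.
    r = requests.post(f"{ES}/{IP_META}/_mget", json={"ids": sorted(ips)})
    r.raise_for_status()
    return {d["_id"]: d["_source"] for d in r.json()["docs"] if d.get("found")}

def enrich(sessions):
    # Merge the IP metadata into each session document on the fly.
    ips = {s["source_ip"] for s in sessions} | {s["destination_ip"] for s in sessions}
    meta = lookup_ips(ips)
    for s in sessions:
        s["source_meta"] = meta.get(s["source_ip"], {})
        s["destination_meta"] = meta.get(s["destination_ip"], {})
    return sessions

if __name__ == "__main__":
    flows = search_sessions({"term": {"destination_port": 443}})
    for flow in enrich(flows)[:3]:
        print(flow)
```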

I don't think there is such a thing, but you could play around with filters:
You create simple visualizations that each filter on a different type and display a single piece of data.
You put these different visualizations in a dashboard in order to display all the data associated with one type of join.
You use the filters as your join key and use the full dashboard, composed of the different panels, to get insights into specific join keys (IPs in your case, or sessions).
You need to create one dashboard for every type of join that you want to make.
Note that you will need to harmonize the names and mappings of the fields in your different documents!
Keep us updated, that's an interesting problem; I would like to know how it turns out with so many documents.

Related

Per-partition GroupByKey in Beam

Beam's GroupByKey groups records by key across all partitions and outputs a single iterable per key per window. This "brings associated data together into one location".
Is there a way I can group records by key locally, so that I still get a single iterable per key per window as the output, but only over the local records in the partition instead of a global group-by-key over all locations?
If I understand your question correctly, you don't want to transfer data over the network if part of it (a partition) was produced on the same machine and can therefore be grouped locally.
Normally, Beam doesn't give you details about where and how your code will run, since that varies by runner, engine and resource manager. However, if you can fetch some unique piece of information about your worker (like a hostname, IP or MAC address), then you can make it part of your key and group all related data by it (see the sketch below). Quite likely, in that case, these data partitions won't be moved to other machines, since all the needed input data is already sitting on the same machine and can be processed locally. Though, as far as I know, there is no 100% guarantee of that.
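A minimal sketch of that idea in the Beam Python SDK, assuming a PCollection of (key, value) pairs; the hostname is just one possible per-worker identifier, and as said above there is no guarantee the runner won't still move the data:

```python
import socket
import apache_beam as beam

def tag_with_worker(element):
    """Prefix the key with an identifier of the worker that processes the
    element, so the following GroupByKey only merges records seen on the
    same worker."""
    key, value = element
    return ((socket.gethostname(), key), value)

with beam.Pipeline() as pipeline:
    (pipeline
     | "Create" >> beam.Create([("a", 1), ("a", 2), ("b", 3)])   # toy input
     | "TagWorker" >> beam.Map(tag_with_worker)
     | "LocalGroup" >> beam.GroupByKey()                          # per (worker, key)
     | "DropWorker" >> beam.MapTuple(lambda k, values: (k[1], list(values)))
     | "Print" >> beam.Map(print))
```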

Where does raw geoip data come from?

This question is a general version of a more specific question asked here. However, those answers were unusable.
Question: What is the raw source for geoIP data?
Many websites will tell me where my IP is, but they all appear to be using databases from fewer than 5 companies (most are using a database from MaxMind). These companies offer limited free versions of their databases, but I'm trying to determine what they use for their source data.
I've tried using Linux/Unix commands such as ping, traceroute, dig, whois, etc., but they don't provide predictably accurate information.
Preamble: I believe this is actually a valid question for the SO website, as understanding how such things work is important to understanding how such datasets can be used in software. However, the answer is rather complex and full of historical remarks.
First, it is worth mentioning that there is NO unified raw geoip data. Such a thing simply does not exist. Second, the data comes from multiple sources and is often unreliable and/or outdated.
To understand how that came to be, one needs to know how the Internet came into existence and spread around the world. A short summary is below:
IANA is a global [non-profit] organization which manages the assignment of IP blocks to regional organizations (see https://www.iana.org/numbers). This happens upon request, and the regional organization requests a specific block size.
Regional organizations may assign those IP blocks either to ISPs directly or to country-level sub-organizations (which then assign them to ISPs).
ISPs assign IP addresses to local branches, etc.
From the above you can easily see that:
There is no single body responsible for assigning IP blocks to this or that location.
Decisions about how (and whether) to release information on which IP belongs to which location are not made uniformly; each organization decides for itself how, and whether, to release that information.
All of the above creates a whole lot of mess. It takes a lot of dedication and a long time to obtain, aggregate and sort this data. And this is why the most up-to-date and detailed geoip datasets are a commercial commodity.
Whoever takes on the challenge of building their own dataset has to obtain this information directly from the end of the chain (the ISPs), because higher-level organizations do not know to which location each IP address will be assigned. Higher-level organizations only distribute IP blocks among applicants (and keep some in reserve for faster processing); it is the lowest-level organizations who decide which location gets which IP address, and they are not obligated to release this information publicly.
Update:
To start building your own dataset, you can begin with this list of blocks and how they are assigned.
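As a hedged starting point, and assuming the list in question is one of the regional registries' delegated statistics files (the pipe-separated dumps the RIRs publish), a sketch of turning such a file into (network, country) rows could look like this; the file path is a placeholder:

```python
import csv
import ipaddress

def parse_delegated(path):
    """Parse an RIR 'delegated' statistics file (pipe-separated) into
    (network, country_code, registry) tuples for IPv4 blocks."""
    blocks = []
    with open(path, newline="") as fh:
        for row in csv.reader(fh, delimiter="|"):
            # Skip comments, the version header and the summary lines.
            if not row or row[0].startswith("#") or len(row) < 7:
                continue
            registry, cc, rtype, start, value, _date, status = row[:7]
            if rtype != "ipv4" or status not in ("allocated", "assigned"):
                continue
            # 'value' is the number of addresses in the block, which is not
            # always a single CIDR prefix, hence summarize_address_range.
            first = ipaddress.IPv4Address(start)
            last = ipaddress.IPv4Address(int(first) + int(value) - 1)
            for net in ipaddress.summarize_address_range(first, last):
                blocks.append((net, cc, registry))
    return blocks

if __name__ == "__main__":
    for net, cc, registry in parse_delegated("delegated-ripencc-latest")[:10]:
        print(net, cc, registry)
```

Note that this only gets you registry-level, country-granularity data; the city-level accuracy is exactly the part the commercial vendors add on top.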

Can I have some keyspaces replicated to some nodes?

I am trying to build multiple APIs whose data I want to store with Cassandra. I am designing it as if I will have multiple hosts, but the hosts I envision would be of two types: trusted and non-trusted.
Because of that, there is certain data which I don't want to end up replicated on one group of hosts, while the rest of the data should be replicated everywhere.
I considered simply making a node for public data and one for protected data, but that would require the trusted hosts to run two nodes, and it would also complicate the way the API interacts with the data.
I am also building it in Docker containers, and I expect that there will be frequent node creation/destruction, both trusted and not trusted.
I want to know if it is possible to use keyspaces in order to achieve my required replication strategy.
You could have two datacenters, one holding your public data and the other the private data. You can configure keyspace replication to replicate the data to only one (or both) of the DCs. See the docs on replication for NetworkTopologyStrategy.
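A minimal sketch of what that could look like via the Python driver; the contact point, datacenter names and replication factors are placeholders:

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # placeholder contact point
session = cluster.connect()

# Public data: replicas kept in both datacenters.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS public_data
    WITH replication = {'class': 'NetworkTopologyStrategy',
                        'trusted_dc': 3, 'untrusted_dc': 3}
""")

# Protected data: replicas only in the trusted datacenter,
# so the untrusted nodes never store this keyspace.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS protected_data
    WITH replication = {'class': 'NetworkTopologyStrategy',
                        'trusted_dc': 3}
""")

cluster.shutdown()
```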
However, there are security concerns here, since all the nodes need to be able to reach one another via the gossip protocol, and your client applications might need to contact both DCs for different reads and writes.
I would suggest you look into configuring security, perhaps SSL for starters and then internal authentication. Note that Kerberos is also supported, but this might be too complex for what you need, at least for now.
You may also consider taking a look at the firewall docs to see what ports are used between nodes and from clients so you know which ones to lock down.
Finally, as the other answer mentions, destroying and creating nodes too often is not good practice. Cassandra is designed to let you grow and shrink your cluster while running, but it can be a costly operation, as it involves not only streaming data to or from the node being added or removed but also other nodes shuffling token ranges around to rebalance.
You can run nodes in Docker containers; however, note that you need to take care not to have several containers all competing for the same physical resources. Cassandra is quite sensitive to I/O latency, for example, and several containers sharing the same physical disk might cause performance problems.
In short: no, you can't.
All nodes in a Cassandra cluster form a complete ring across which your data is distributed by your selected partitioner.
You can have multiple keyspaces, plus authentication and authorization within Cassandra, and split your trusted and untrusted data into different keyspaces. Or you can go with two clusters to split your data.
From my experience, you also should not make creating and destroying Cassandra nodes your usual daily business. Adding and removing nodes is costly and needs to be monitored, as your cluster needs to maintain replication and so on. So it might be good to separate the Cassandra clusters from your API nodes.

What is Mnesia replication strategy?

What strategy does Mnesia use to define which nodes will store replicas of particular table?
Can I force Mnesia to use specific number of replicas for each table? Can this number be changed dynamically?
Are there any sources (besides the source code) with detailed (not just overview) description of Mnesia internal algorithms?
Manual. You're responsible for specifying what is replicated where.
Yes, as above, manually. This can be changed dynamically.
I'm afraid (though I may be wrong) that there are none besides the source code.
In terms of documentation, the whole Erlang distribution is hardly the leader in the software world.
Mnesia does not automatically manage the number of replicas of a given table.
You are responsible for specifying each node that will store a table replica (and hence their number). A replica may then be:
stored in memory,
stored on disk,
stored both in memory and on disk,
not stored on that node - in this case the table will be accessible but data will be fetched on demand from some other node(s).
It's possible to reconfigure the replication strategy when the system is running, though to do it dynamically (based on a node-down event for example) you would have to come up with the solution yourself.
The Mnesia system events could be used to discover when a node goes down; given that you know which tables were stored on that node, you could check the number of their remaining online replicas on the nodes that are still up and then add replicas if needed.
I'm not aware of any application/library which already manages this kind of thing, and it seems like quite an advanced endeavor (from my point of view, at least) to build one.
However, Riak is a database which manages data distribution among its nodes transparently to the user and is configurable with respect to the options you mentioned. That may be the way to go for you.

How do you build a torrent file indexer?

I am curious about the technology behind a search engine like torrentz.com. From what I could observe, it doesn't host any torrent files, but rather connects you to other servers that do.
You search for keywords, and it brings up a list of potential titles matching your search.
Then you pick one of these, and it provides you with another list of potential servers hosting the corresponding torrent file.
What I'm interested in particularly is the strategy behind gathering and indexing all that content:
How do they collect then aggregate the data?
Is it a submission base service, where each of these servers submits its content for indexing?
Is it a crawling algorithm? If so, how do you even start crawling a site like piratebay.org?
Do they have access to these other servers' databases?
My knowledge and understanding of the BitTorrent protocol is not very deep, and the documentation that I found online pointed me more toward the processes involved in building a tracker service, which isn't exactly what I'm interested in. Any insight and recommended reading material is appreciated.
To begin with, start indexing their RSS feeds and gathering data from them (a sketch follows below). The next step would be indexing the portals' pages (Mininova, TPB, etc.), but watch out for the fact that you can get banned (IP-based) for doing so, since that would mean a huge amount of data being requested from their servers (I don't think they would be too happy about that).
That said, I doubt that they have access to the other servers' databases; rather, it's crawling plus RSS.
Another thing you can do: when somebody makes a query for an item which you don't have in your database, you make the query against the main BT portals, cache the result in your DB, and then display the results. Then, if another user makes the same query (which is a pretty common scenario), you can show them the cached data plus new data from the RSS feeds.
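A rough sketch of the RSS-indexing idea, using feedparser and a throwaway SQLite cache; the feed URL is a placeholder and real portals publish their own feeds:

```python
import sqlite3
import feedparser  # third-party: pip install feedparser

FEED_URL = "https://example-torrent-portal.invalid/rss"  # placeholder feed

def update_index(db_path="torrents.db"):
    # Poll the portal's RSS feed and cache new entries locally.
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS torrents
                   (title TEXT, link TEXT PRIMARY KEY, published TEXT)""")
    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        # INSERT OR IGNORE keeps repeated polls from duplicating rows.
        con.execute("INSERT OR IGNORE INTO torrents VALUES (?, ?, ?)",
                    (entry.get("title", ""), entry.get("link", ""),
                     entry.get("published", "")))
    con.commit()
    con.close()

def search(term, db_path="torrents.db"):
    # Naive title search over the cached entries.
    con = sqlite3.connect(db_path)
    rows = con.execute("SELECT title, link FROM torrents WHERE title LIKE ?",
                       ("%" + term + "%",)).fetchall()
    con.close()
    return rows
```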

Resources