Apache Hive and record updates - apache-hive

I have streaming data coming into my consumer app that I ultimately want to show up in Hive/Impala. One way would be to use Hive-based APIs to insert the updates in batches into the Hive table.
The alternate approach is to write the data directly into HDFS as Avro/Parquet files and let Hive detect the new data and pick it up.
I tried both approaches in my dev environment, and the 'only' drawback I noticed was high latency writing to Hive and/or failure conditions I need to account for in my code.
Is there an architectural design pattern/best practices to follow?
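For concreteness, here is a minimal sketch of what the second approach might look like from the consumer side, assuming an external, partitioned Hive table (the table name, HDFS path, and partition column below are invented for illustration): the app writes the batch as Parquet/Avro files into a new HDFS directory and then registers that directory as a partition over Hive JDBC.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RegisterPartition {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, database, and credentials are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-server:10000/default", "etl", "");
             Statement stmt = conn.createStatement()) {
            // The streaming consumer has already written this batch as Parquet/Avro
            // files under the HDFS path below; this just makes Hive aware of it.
            stmt.execute(
                "ALTER TABLE events ADD IF NOT EXISTS PARTITION (dt='2016-05-01') " +
                "LOCATION '/data/events/dt=2016-05-01'");
        }
    }
}
```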

Related

Apache Ignite vs Informix Warehouse Accelerator vs Infinispan?

What's the difference between Apache Ignite, IWA (Informix Warehouse Accelerator), and Infinispan?
I have an application that accepts a large volume of data and processes many transactions per second. Both response time and data integrity are very important for us. Which in-memory database is the best solution for me? I'm confused about how to choose between them. I also use Java EE, and the application server is JBoss.
We are looking for the best in-memory database solution for processing data in real time.
Update:
I use a relational database, and I am looking for an in-memory database to select, insert, and update against in order to decrease response time. Data integrity is very important, and it is also very important to persist data on disk.
Apache Ignite and Infinispan are both data grids / memory-centric databases with similar feature lists, the biggest difference being that Apache Ignite has SQL support for querying data.
Informix Warehouse Accelerator seems to be a narrow-use-case product, so it's hard to say whether it's useful for your use case or not.
Otherwise, there's too little information in your question about the specifics of your project to say whether either of those is a good fit, or neither of them.
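To make the SQL difference concrete, here is a minimal Ignite sketch; the cache name, value class, and query are invented for illustration. The data lives in an in-memory cache but can be queried with plain SQL.

```java
import java.util.List;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.cache.query.annotations.QuerySqlField;
import org.apache.ignite.configuration.CacheConfiguration;

public class IgniteSqlExample {
    public static class Trade {
        @QuerySqlField(index = true)
        String symbol;
        @QuerySqlField
        double price;
        Trade(String symbol, double price) { this.symbol = symbol; this.price = price; }
    }

    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // Register the value class as an indexed type so it is visible to SQL.
            CacheConfiguration<Long, Trade> cfg = new CacheConfiguration<>("trades");
            cfg.setIndexedTypes(Long.class, Trade.class);
            IgniteCache<Long, Trade> trades = ignite.getOrCreateCache(cfg);

            trades.put(1L, new Trade("ACME", 120.5));

            // Plain SQL over the cached, in-memory data.
            List<List<?>> rows = trades.query(
                new SqlFieldsQuery("select symbol, price from Trade where price > ?")
                    .setArgs(100.0)).getAll();
            rows.forEach(System.out::println);
        }
    }
}
```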

gpfdist vs gpload greenplum

I am setting up Greenplum for the first time and am following the documentation. I want to set up a connection from SQL to a Greenplum database, and I am currently figuring out the best way to achieve this. I came across gpfdist and gpload.
How are the two different? Both use external tables, both work on the segment nodes, and both are used for parallel loading. So is there any advantage of using one over the other?
Answering your question about "I want to set up a connection from SQL to a Greenplum database"...
It's ambiguous which SQL database you are referring to.
Also, there are no direct connectivity drivers available to connect a non-Greenplum database to a Greenplum database.
However, if you want to migrate data from Oracle to Greenplum, you can use Informatica's Fast Clone tool.
To answer the second part of your question regarding gpfdist and gpload: GPFDIST is a file-distribution process which runs on a host system and serves files in parallel to many segments. When initialising an external table to read from or write to a file, you need to specify which process will serve the file; in your case it will be GPFDIST. There are other processes too, like FTP, GPHDFS, and HTTP.
GPLOAD is a wrapper utility which makes your work easier by automatically creating gpfdist processes and external tables.
Also be aware that GPLOAD can only create readable external tables.
gpfdist and gpload do the same thing; with gpfdist you do it manually, while with gpload you can automate the activities by making entries in a config (YAML) file.
GPLOAD is a wrapper around GPFDIST, so when you load data via gpload it will internally use gpfdist.
If you want to load or migrate data from any other RDBMS to Greenplum and you are using an ETL or migration tool, it will use the normal COPY command. If you enable gpload while loading/migrating (nowadays the latest versions of most ETL and migration tools support the gpload feature when loading data into Greenplum), it will load data in parallel by using gpfdist internally.
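To make the gpfdist pattern concrete, here is a rough sketch of the two steps gpload automates, assuming a gpfdist process is already serving a staging directory (for example, gpfdist -d /data/staging -p 8081 on an ETL host). Host names, ports, credentials, and table definitions are invented for illustration; Greenplum speaks the PostgreSQL wire protocol, so the standard PostgreSQL JDBC driver is used here.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class GpfdistLoad {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://gp-master:5432/analytics", "gpadmin", "secret");
             Statement stmt = conn.createStatement()) {

            // Step 1: declare an external table backed by the running gpfdist process.
            stmt.execute(
                "CREATE EXTERNAL TABLE ext_sales (id int, amount numeric) " +
                "LOCATION ('gpfdist://etl-host:8081/sales*.csv') " +
                "FORMAT 'CSV' (DELIMITER ',')");

            // Step 2: the segments pull the files from gpfdist in parallel during this insert.
            stmt.execute("INSERT INTO sales SELECT * FROM ext_sales");
        }
    }
}
```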

How to do Neo4j Cache-based Sharding?

I've been reading Neo4j's Operational Manual on cache sharding, and posts all over the web; however, I can hardly find any detailed example of how to configure HAProxy for cache sharding on a real-world graph, which may contain multiple node labels (yes, the one in the Operational Manual is rather brief).
Has anyone ever done this before? Would be lovely if you could share your experience.
Moreover, I'm a bit confused about the mechanism for sharding the graph using HAProxy. How do sub-graphs get cached on certain slaves merely by providing rules in HAProxy? It surprised me to learn that cache sharding isn't handled by Neo4j itself.
The goal is to send queries hitting the same region of your graph always to the same instance. This of course means that the request data has to indicate the region. What to use as the "region indicator" depends heavily on the structure and shape of your graph.
In a lot of customer-facing applications, people have successfully used the current user id, set as an additional HTTP header, which is then evaluated by HAProxy.
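As an illustration of that pattern, here is a hedged client-side sketch: every request carries the user id as an extra HTTP header, and HAProxy is configured with something like balance hdr(X-User-Id) so all requests for the same user land on the same Neo4j slave, whose cache then warms up with that user's sub-graph. The URL, header name, and Cypher statement are assumptions, and the Neo4j HTTP transactional endpoint is assumed to be the way queries reach the cluster.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ShardedCypherClient {
    public static void main(String[] args) throws Exception {
        String userId = "42";
        String payload = "{\"statements\":[{\"statement\":"
            + "\"MATCH (u:User {id: 42})-[:FRIEND]->(f) RETURN f.name\"}]}";

        // haproxy-host is the HAProxy front end sitting in front of the Neo4j cluster.
        URL url = new URL("http://haproxy-host:7474/db/data/transaction/commit");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setRequestProperty("Accept", "application/json");
        // The "region indicator": HAProxy hashes on this header to pick a backend.
        conn.setRequestProperty("X-User-Id", userId);
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```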

Apache Samza local storage - OrientDB / Neo4J graph instead of KV store

Apache Samza uses RocksDB as the storage engine for local storage. This allows for stateful stream processing and here's a very good overview.
My use case:
I have multiple streams of events that I wish to process taken from a system such as Apache Kafka.
These events create state - the state I wish to track is based on previous messages received.
I wish to generate new stream events based on the calculated state.
The input stream events are highly connected, and a graph database such as OrientDB / Neo4j is the ideal medium for querying the data to create the new stream events.
My question:
Is it possible to use a non-KV store as the local storage for Samza? Has anyone ever done this with OrientDB / Neo4J and is anyone aware of an example?
I've been evaluating Samza and I'm by no means an expert, but I'd recommend you read the official documentation, and even read through the source code; other than the fact that it's in Scala, it's remarkably approachable.
In this particular case, toward the bottom of the documentation's page on State Management you have this:
Other storage engines
Samza’s fault-tolerance mechanism (sending a local store’s writes to a replicated changelog) is completely decoupled from the storage engine’s data structures and query APIs. While a key-value storage engine is good for general-purpose processing, you can easily add your own storage engines for other types of queries by implementing the StorageEngine interface. Samza’s model is especially amenable to embedded storage engines, which run as a library in the same process as the stream task.
Some ideas for other storage engines that could be useful: a persistent heap (for running top-N queries), approximate algorithms such as bloom filters and hyperloglog, or full-text indexes such as Lucene. (Patches accepted!)
I actually read through the code for the default StorageEngine implementation about two weeks ago to gain a better sense of how it works. I definitely don't know enough to say much intelligently about it, but I can point you at it:
https://github.com/apache/samza/tree/master/samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv
https://github.com/apache/samza/tree/master/samza-kv/src/main/scala/org/apache/samza/storage/kv
The major implementation concerns seem to be:
Logging all changes to a topic so that the store's state can be restored if a task fails.
Restoring the store's state in a performant manner.
Batching writes and caching frequent reads in order to save on trips to the raw store.
Reporting metrics about the use of the store.
Do the input stream events define one global graph, or multiple graphs, one for each matching Kafka/Samza partition? That is important, as Samza state is local, not global.
If it's one global graph, you can update/query a separate graph system from the Samza task process method. Titan on Cassandra would be one such graph system.
If it's multiple separate graphs, you can use the current RocksDB KV store to mimic graph database operations. Titan on Cassandra does just that: it uses the Cassandra KV store to store and query the graph. Graphs are stored either via a matrix (set [i,j] to 1 if connected) or an edge list. For each node, use it as the key and store its set of neighbors as the value.
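As a rough sketch of that adjacency-list idea on top of Samza's local KV store: the store name, the "source,target" message format, and the serdes are assumptions (a real job would need a serde capable of handling the neighbor set, e.g. a JSON serde declared in the job config).

```java
import java.util.HashSet;
import java.util.Set;

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class GraphStateTask implements StreamTask, InitableTask {
    // Assumes a store named "graph-store" is declared in the job config.
    private KeyValueStore<String, Set<String>> adjacency;

    @Override
    @SuppressWarnings("unchecked")
    public void init(Config config, TaskContext context) {
        adjacency = (KeyValueStore<String, Set<String>>) context.getStore("graph-store");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        // Assumed message format: "sourceNodeId,targetNodeId"
        String[] edge = ((String) envelope.getMessage()).split(",");
        Set<String> neighbors = adjacency.get(edge[0]);
        if (neighbors == null) {
            neighbors = new HashSet<>();
        }
        neighbors.add(edge[1]);
        // The node id is the key; its neighbor set is the value.
        adjacency.put(edge[0], neighbors);
        // From here you could walk the adjacency lists to compute derived state
        // and emit new stream events via the collector.
    }
}
```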

How do you build a torrent file indexer?

I am curious about the technology behind a search engine like torrentz.com. From what I could observe, it doesn't host any torrent files, but rather connects you to other servers that do.
You search for keywords, and it brings up a list of potential titles matching your search.
Then you pick one of these, and it provides you with another list of potential servers hosting the corresponding torrent file.
What I'm interested in particularly is the strategy behind gathering and indexing all that content:
How do they collect then aggregate the data?
Is it a submission-based service, where each of these servers submits its content for indexing?
Is it a crawling algorithm? If so, how do you even start crawling a site like piratebay.org?
Do they have access to these other servers' databases?
My knowledge and understanding of the BitTorrent protocol is not very deep, but the documentation that I found online pointed me more toward the processes involved in building a tracker service, which isn't exactly what I'm interested in. Any insight and recommended reading material is appreciated.
To begin with, start indexing their RSS feeds and gathering data from them. The next step would be indexing the portals' pages (like Mininova, TPB, etc.), but watch out for the fact that you can be banned (IP-based) for doing so, since that would mean a huge amount of data being requested from their servers (I don't think they would be too happy about that).
That said, I doubt that they have access to other servers' databases; rather, it's crawling plus RSS.
Another thing you can do: when somebody makes a query for an item which you don't have in your database, you make the query against the main BitTorrent portals, cache the result in your DB, and then display the results. Then, if another user makes the same query (which is a pretty common scenario), you can show them the cached data plus new data from RSS.
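As a starting point for the RSS indexing idea, here is a small hedged sketch using only the JDK's XML parser. The feed URL and element names are placeholders; real portal feeds differ in which elements carry the .torrent or magnet link.

```java
import java.net.URL;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RssIndexer {
    public static void main(String[] args) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Placeholder feed URL for illustration only.
        Document feed = builder.parse(new URL("http://example-portal.org/rss").openStream());

        NodeList items = feed.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String title = item.getElementsByTagName("title").item(0).getTextContent();
            String link  = item.getElementsByTagName("link").item(0).getTextContent();
            // Here you would normalize the title into keywords and upsert
            // (title, link, firstSeen) into your search index / database.
            System.out.println(title + " -> " + link);
        }
    }
}
```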

Resources