Memory Configurations For Cloudera Impala

I'm using Impala, and I know Impala does its processing in memory. I've searched for a list of Impala configuration options, but I haven't found any thorough documentation on this, particularly with regard to memory/heap. Does Impala have such settings, or does it rely on the HDFS/DataNode heap space? I know you can cap Impala memory usage with -mem_limit, but I'm trying to better understand how this is done.

As of the Impala 1.4.0 release, included in CDH 5.1.0, Impala uses both memory and disk during query processing. To learn more about how to control Impala's use of memory, I recommend reading through the Cloudera documentation on Impala, especially:
Using YARN Resource Management with Impala
Using HDFS Caching with Impala
Modifying Impala Startup Options
You'll find more information on how to configure many aspects of Impala's memory use, including integration with HDFS caching and Hadoop YARN (via Llama). For more on HDFS caching, see Andrew Wang and Colin McCabe's presentation from Hadoop Summit 2014. For more on Llama, see Henry Robinson's presentation from Hadoop World NYC 2013.
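As a rough sketch of the most common knob: on a package-based install the impalad startup flags usually live in /etc/default/impala, and you can append -mem_limit to IMPALA_SERVER_ARGS there. The file location, the surrounding flags, and the 70% value below are assumptions to adapt to your own cluster:
IMPALA_SERVER_ARGS=" \
    -log_dir=${IMPALA_LOG_DIR} \
    -state_store_host=${IMPALA_STATE_STORE_HOST} \
    -mem_limit=70%"
After editing the file, restart the impalad service so the new per-node limit takes effect.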

Related

Add Neo4j to Gremlin Server - how to?

I have downloaded Gremlin Server with an intention of being able to use Gremlin to traverse a Neo4j DB.
Now, speaking of the latter, it has to be somehow added to the Gremlin Server installation, but I'm having difficulty finding any up-to-date guidance on how to do that. There are a few posts here on SO describing various kinds of problems people run into, but no definitive solution, much less one for the current versions of both TinkerPop and Neo4j.
Would appreciate specific links, tips etc.
Thanks!
There is a "TIP" describing Gremlin Server configuration in the TinkerPop reference documentation found here. Basically, you -install Neo4j dependencies:
bin/gremlin-server.sh install org.apache.tinkerpop neo4j-gremlin 3.3.4
then you edit your Gremlin Server YAML configuration file to connect to your database. Gremlin Server contains a sample file to get you started, which is found in the conf/ directory of the installation. Of critical note is this entry:
graphs: {
  graph: conf/neo4j-empty.properties}
It specifies the Neo4j configuration to use, and the sample one that ships with Gremlin Server looks like this:
gremlin.graph=org.apache.tinkerpop.gremlin.neo4j.structure.Neo4jGraph
gremlin.neo4j.directory=/tmp/neo4j
gremlin.neo4j.conf.dbms.auto_index.nodes.enabled=true
gremlin.neo4j.conf.dbms.auto_index.relationships.enabled=true
As you can see, the configuration basically just passes through Neo4j specific configuration to Neo4j itself. Only the first two lines are TinkerPop options. In this case, it sets up Neo4j for embedded mode, meaning Neo4j runs within the Gremlin Server JVM. You can make Gremlin Server part of a Neo4j HA cluster with instructions found in the reference documentation here.
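As a quick way to check that the wiring works (a sketch only; the file names below are the samples shipped with the distribution, so verify them against your version), you can start the server with the bundled Neo4j config and then connect from Gremlin Console:
bin/gremlin-server.sh conf/gremlin-server-neo4j.yaml
bin/gremlin.sh
gremlin> :remote connect tinkerpop.server conf/remote.yaml
gremlin> :> g.V().count()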
Note that you asked for "current" versions of both TinkerPop and Neo4j. While these instructions are current for TinkerPop, I'm afraid that the Neo4j version TinkerPop supports is well behind their latest release. It would be nice if someone had time to issue a pull request for that.

WebDriver browser heap analysis

We are running our integration tests using Selenium WebDriver (Chrome/IE/Firefox).
Are there any options to analyze browser memory usage (heap analysis) from WebDriver tests?
(or)
How can I integrate this with my integration tests?
Is there an option to save a browser heap snapshot while running the WebDriver test?
Please suggest.
Not sure if this helps, but I just ran my tests locally and looked at the system usage on my computer.
It's not exact, but it gave me a ballpark of about 512 MB in use per thread.
Basically, I added up the Chrome processes and the chromedriver process, and I just used the heaviest Java process.
If you have found a better way, please share :)
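One more concrete option, offered as a sketch rather than a definitive approach: with Chrome you can read the non-standard, Chrome-only window.performance.memory counters through JavascriptExecutor and log them at interesting points in a test. The URL below is a placeholder.
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class BrowserHeapCheck {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://example.com"); // placeholder page under test
            // window.performance.memory is a non-standard Chrome API,
            // so this will not work in IE or Firefox.
            Long usedJsHeap = (Long) ((JavascriptExecutor) driver)
                    .executeScript("return window.performance.memory.usedJSHeapSize;");
            System.out.println("Used JS heap (bytes): " + usedJsHeap);
        } finally {
            driver.quit();
        }
    }
}
This only samples the aggregate JS heap size; saving full heap snapshots would mean driving the browser's DevTools interface separately.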

Cassandra and Analytics on single node

With DataStax Enterprise, is it possible to set up a Cassandra cluster that can do Cassandra "realtime" and analytics on a single machine? Obviously, this is not for production, but for tiny little proofs of concept / logical experiments, I'd rather fire up a single Linux VM rather than 2 or 3. Would this be possible with a tarball install, if not through apt-get?
Yes. On the latest versions of DSE 3.1.x, 3.2.x, and 4.0.x, it should be possible to turn on both the Solr and Hadoop features on the same node for development purposes.
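With a tarball install, the sketch below is how that would typically look with the DSE launcher (flag meanings are from the DSE 3.x/4.0 docs: -t enables the Hadoop/analytics trackers and -s enables Solr; double-check them against your exact version):
bin/dse cassandra -t -s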

what is the suggested integration pattern for running Mahout jobs using the data stored in Cassandra?

I use Cassandra from DataStax Enterprise (version 3.1.4). I would like Mahout to access the data stored in Cassandra instead of requiring an HDFS file.
How can a Mahout job access data stored in a Cassandra CQL table? I'm not able to run a Mahout job that depends on the DataStax CQL JDBC driver; it complains that the driver, as well as the related CQL classes, are not on the classpath. This error is seen despite adding the CQL driver jar files to the Mahout classpath. We found that the Hector APIs are bundled with the Mahout jars, but not the CQL Java driver. Can the CQL APIs be used with Mahout?
Have you checked out the CqlStorage loader for Pig?
You can grab a column family and map/reduce on it, e.g. https://github.com/apache/cassandra/blob/trunk/examples/pig/test/test_cql_storage.pig?source=cc, and use the org.apache.mahout.pig.LogisticRegression UDF for Pig with Mahout.
There are also DSE commands for Mahout: http://www.datastax.com/docs/datastax_enterprise3.1/solutions/mahout#mahout-example
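For reference, a minimal sketch of what the CqlStorage side looks like in Pig (the keyspace and table names are made up; in DSE you would typically run this via the dse pig launcher):
rows = LOAD 'cql://my_keyspace/my_table' USING org.apache.cassandra.hadoop.pig.CqlStorage();
DUMP rows;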

Neo4j supports Windows but the recommended filesystem is ext4

The Neo4j manual shows that Windows is supported, but the recommended filesystem is ext4. What are the compromises of using NTFS?
Neo4j is written in Java and uses the JDK's filesystem abstraction, so the developers can recommend a particular operating system, but in theory it will work on FAT or even on the proprietary OS of your cooler (as long as it can run Java).
Just measure performance yourself on the candidate target operating systems and pick the best one.
