Team,
Curious to know if anyone has succeeded in executing the query for the Cloudera Twitter example?
I added the mentioned SerDe jar to the Beeswax file resources as a jar, but I still get the following error for any query.
Query:
SELECT
  t.retweeted_screen_name,
  sum(retweets) AS total_retweets,
  count(*) AS tweet_count
FROM (SELECT
        retweeted_status.user.screen_name AS retweeted_screen_name,
        retweeted_status.text,
        max(retweet_count) AS retweets
      FROM tweets
      GROUP BY retweeted_status.user.screen_name,
               retweeted_status.text) t
GROUP BY t.retweeted_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
Your query has the following error(s):
Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
INFO : Number of reduce tasks not specified. Estimated from input data size: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=
WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO : number of splits:1
INFO : Submitting tokens for job: job_1432914212475_0002
INFO : The url to track the job: http://quickstart.cloudera:8088/proxy/application_1432914212475_0002/
INFO : Starting Job = job_1432914212475_0002, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1432914212475_0002/
INFO : Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1432914212475_0002
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO : 2015-05-29 10:20:59,400 Stage-1 map = 0%, reduce = 0%
INFO : 2015-05-29 10:21:35,687 Stage-1 map = 100%, reduce = 100%
ERROR : Ended Job = job_1432914212475_0002 with errors
Resolved!
Don't use the prebuilt SerDe jar available for download; it may be outdated.
Compile it yourself!
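For reference, a minimal hedged sketch of registering the self-compiled jar for the Hive session (the path and jar name are illustrative, not from the thread; in Hue/Beeswax you can equally point the file-resources setting at the newly built jar):

-- register the jar you compiled yourself; adjust the path to your build output
ADD JAR /home/cloudera/hive-serdes-1.0-SNAPSHOT.jar;

After that, re-run the query against the tweets table.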
I am working on extracting graph embeddings by training the GraphSAGE algorithm. I am working with a large graph consisting of 82,339,589 nodes and 219,521,164 edges. When I check with the ":queries" command, the query is listed as running. The algorithm started 6 days ago. When I look at the logs with "docker logs xxx", the last entries are:
2021-12-01 12:03:16.267+0000 INFO Relationship Store Scan (RelationshipScanCursorBasedScanner): Imported 352,492,468 records and 0 properties from 16247 MiB (17,036,668,320 bytes); took 59.057 s, 5,968,663.57 Relationships/s, 275 MiB/s (288,477,487 bytes/s) (per thread: 1,492,165.89 Relationships/s, 68 MiB/s (72,119,371 bytes/s))
2021-12-01 12:03:16.269+0000 INFO [neo4j.BoltWorker-3 [bolt] [/10.0.0.6:56143] ] LOADING
INFO [neo4j.BoltWorker-3 [bolt] [/10.0.0.6:56143] ] LOADING Actual memory usage of the loaded graph: 8602 MiB
INFO [neo4j.BoltWorker-3 [bolt] [/10.0.0.6:64076] ] GraphSageTrain :: Start
Is there a way to see detailed logs about the training process? And is it normal for it to take 6 days for a graph of this size?
It is normal for GraphSAGE to take a long time compared to FastRP or Node2Vec. Starting in GDS 1.7, you can use
CALL gds.beta.listProgress(jobId: String)
YIELD
jobId,
taskName,
progress,
progressBar,
status,
timeStarted,
elapsedTime
If you call without passing in a jobId, it will return a list of all running jobs. If you call with a jobId, it will give you details about a running job.
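For example, to list everything that is currently running (derived from the signature above):

CALL gds.beta.listProgress()
YIELD jobId, taskName, progress, status
RETURN jobId, taskName, progress, status;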
This query will summarize the details for job 03d90ed8-feba-4959-8cd2-cbd691d1da6c.
CALL gds.beta.listProgress("03d90ed8-feba-4959-8cd2-cbd691d1da6c")
YIELD taskName, status
RETURN taskName, status, count(*)
Here's the documentation for progress logging. The system monitoring procedures might also be helpful to you.
I have a use case where I want to create a bunch of GroupedFlux by a PartitionKey and, within each group, delay elements by 100 milliseconds. However, I want multiple groups to start at the same time, so if there are 3 groups, I expect 3 messages emitted every 100 milliseconds. But with the following code I see only 1 message every 100 milliseconds.
This is the code that I was expecting to work.
final Flux<GroupedFlux<String, TData>> groupedFlux =
    flux.groupBy(Event::getPartitionKey);

groupedFlux.subscribe(g -> g.delayElements(Duration.ofMillis(100))
    .flatMap(this::doWork)
    .doOnError(throwable -> log.error("error: ", throwable))
    .onErrorResume(e -> Mono.empty())
    .subscribe());
This is the log.
21:24:29.318 parallel-5] : GroupByKey : 2
21:24:29.424 parallel-6] : GroupByKey : 3
21:24:29.529 parallel-7] : GroupByKey : 1
21:24:29.634 parallel-8] : GroupByKey : 2
21:24:29.739 parallel-9] : GroupByKey : 3
21:24:29.844 parallel-10] : GroupByKey : 1
21:24:29.953 parallel-11] : GroupByKey : 2
21:24:30.059 parallel-12] : GroupByKey : 3
21:24:30.167 parallel-1] : GroupByKey : 1
(Note the almost 100 ms difference between each log statement; the first column is the timestamp.)
Upon further analysis I found out that it was working fine: my tests had incorrect data for the PartitionKey, which resulted in a single GroupedFlux.
Answering my own question in case someone ever doubts whether delayElements works differently on a GroupedFlux. It does not.
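For anyone who wants to verify this, here is a minimal self-contained sketch (with made-up keys and values instead of the original Event/doWork types): with three distinct partition keys, each GroupedFlux gets its own delayElements timer, so roughly three elements are emitted per 100 ms window.

import java.time.Duration;
import java.time.Instant;
import reactor.core.publisher.Flux;

public class GroupDelayDemo {
    public static void main(String[] args) throws InterruptedException {
        // 30 values spread across three partition keys (0, 1, 2)
        Flux.range(0, 30)
            .groupBy(i -> i % 3)
            .subscribe(group -> group
                .delayElements(Duration.ofMillis(100))
                .subscribe(v -> System.out.println(
                        Instant.now() + " key=" + (v % 3) + " value=" + v)));

        Thread.sleep(2_000); // keep the JVM alive long enough to observe the output
    }
}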
I'm doing some scalability testing with Hyperledger Iroha using Docker containers. I increase the number of nodes in the network step by step, write some transactions into the ledger, and determine the average latency for transaction processing.
The problem is that at around 30 nodes the consensus seems to stop working properly, i.e., it takes seconds until a transaction is committed.
I have already tried to vary some configuration parameters like the vote delay, but this does not change Iroha's behavior.
This is my configuration for the Iroha nodes:
{
  "block_store_path": "/tmp/block_store/",
  "torii_port": 50051,
  "internal_port": 10001,
  "pg_opt": "host=some-postgres port=5432 user=postgres password=password1234",
  "max_proposal_size": 50,
  "proposal_delay": 5000,
  "vote_delay": 5000,
  "mst_enable": false,
  "mst_expiration_time": 1440,
  "max_rounds_delay": 50,
  "stale_stream_max_rounds": 2
}
This sometimes leads to transaction processing times of around 10 seconds: https://gist.github.com/dltuser12/913e036efd735b2996d387b1423096c9 (Iroha log file for a corresponding example).
In the IBExpert tool, when executing a query, it shows the query performance info like this:
------ Performance info ------
Prepare time = 0ms
Execute time = 0ms
Avg fetch time = 0,00 ms
Current memory = 14*717*320
Max memory = 16*060*920
Memory buffers = 3*000
Reads from disk to cache = 69
Writes from cache to disk = 0
Fetches from cache = 572
I would like to get this information in my own tool. Does anyone know how to retrieve it using FireDAC and an FDQuery on a Firebird 2.5 database in Delphi?
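One hedged avenue (not confirmed in this thread): Firebird 2.5 exposes comparable memory and I/O counters through its MON$ monitoring tables, which a plain TFDQuery can read with ordinary SQL, roughly like this (check the column names against the Firebird documentation for your server version):

SELECT m.MON$MEMORY_USED,
       m.MON$MEMORY_ALLOCATED,
       io.MON$PAGE_READS,
       io.MON$PAGE_WRITES,
       io.MON$PAGE_FETCHES
FROM MON$ATTACHMENTS a
JOIN MON$MEMORY_USAGE m ON m.MON$STAT_ID = a.MON$STAT_ID
JOIN MON$IO_STATS io ON io.MON$STAT_ID = a.MON$STAT_ID
WHERE a.MON$ATTACHMENT_ID = CURRENT_CONNECTION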
In my current research I'm trying to find out how big the impact of ad-hoc sentiment on daily stock returns is.
The calculations worked quite well and the results are plausible.
The calculations so far, using the quantmod package and Yahoo financial data, look like this:
getSymbols(c("^CDAXX",Symbols) , env = myenviron, src = "yahoo",
from = as.Date("2007-01-02"), to = as.Date("2016-12-30")
Returns <- eapply(myenviron, function(s) ROC(Ad(s), type="discrete"))
ReturnsDF <- as.data.table(do.call(merge.xts, Returns))
# adjust column names
colnames(ReturnsDF) <- gsub(".Adjusted","",colnames(ReturnsDF))
ReturnsDF <- as.data.table(ReturnsDF)
However, to make it more robust against the noisy influence of penny-stock data, I wonder how it is possible to exclude stocks that at any point in the period drop below a certain value x, let's say 1€.
I guess the best approach would be to exclude them before calculating the returns and merging the xts results, or even better, before downloading them with the getSymbols command.
Does anybody have an idea how this could work best? Thanks in advance.
Try this:
build a price frame of the Adj. closing prices of your symbols
(I use the PF function of the quantmod add-on package qmao, which has lots of other useful functions for this type of analysis: install.packages("qmao", repos="http://R-Forge.R-project.org").)
check by column if any price is below your minimum trigger price
select only columns which have no closings below the trigger price
To stay more flexible I would suggest taking a sub-period, let's say no price below 5 during the last 21 trading days. The toy example below may illustrate my point.
I use AAPL, FB and MSFT as the symbol universe.
> symbols <- c('AAPL','MSFT','FB')
> getSymbols(symbols, from='2018-02-01')
[1] "AAPL" "MSFT" "FB"
> prices <- PF(symbols, silent = TRUE)
> prices
AAPL MSFT FB
2018-02-01 167.0987 93.81929 193.09
2018-02-02 159.8483 91.35088 190.28
2018-02-05 155.8546 87.58855 181.26
2018-02-06 162.3680 90.90299 185.31
2018-02-07 158.8922 89.19102 180.18
2018-02-08 154.5200 84.61253 171.58
2018-02-09 156.4100 87.76771 176.11
2018-02-12 162.7100 88.71327 176.41
2018-02-13 164.3400 89.41000 173.15
2018-02-14 167.3700 90.81000 179.52
2018-02-15 172.9900 92.66000 179.96
2018-02-16 172.4300 92.00000 177.36
2018-02-20 171.8500 92.72000 176.01
2018-02-21 171.0700 91.49000 177.91
2018-02-22 172.5000 91.73000 178.99
2018-02-23 175.5000 94.06000 183.29
2018-02-26 178.9700 95.42000 184.93
2018-02-27 178.3900 94.20000 181.46
2018-02-28 178.1200 93.77000 178.32
2018-03-01 175.0000 92.85000 175.94
2018-03-02 176.2100 93.05000 176.62
Let's assume you would like any instrument which traded below 175.40 during the last 6 trading days to be excluded from your analysis :-).
As you can see, that will exclude AAPL and MSFT.
apply and the base function any applied(!) to a 6-day subset of prices will give us exactly what we want. Showing the last 3 days of prices excluding the instruments which did not meet our condition:
> tail(prices[,apply(tail(prices),2, function(x) any(x < 175.4)) == FALSE],3)
FB
2018-02-28 178.32
2018-03-01 175.94
2018-03-02 176.62
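If you want the original ask instead (drop every symbol whose adjusted close ever falls below the trigger over the whole period, before computing the returns), a hedged plain-quantmod variant of the same idea, reusing myenviron from the question, might look like this:

# keep only the symbols whose adjusted close never falls below the trigger
trigger <- 1
keep    <- names(Filter(function(s) all(Ad(s) >= trigger, na.rm = TRUE),
                        as.list(myenviron)))

# compute returns only for the surviving symbols
Returns   <- eapply(myenviron, function(s) ROC(Ad(s), type = "discrete"))
ReturnsDF <- as.data.table(do.call(merge.xts, Returns[keep]))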