Reusing Clang's metadata - clang

I want to analyze a big codebase that takes minutes to compile. Is there a way to save all the "metadata" generated by clang, so that I don't have to wait every time I run the analysis?

Related

Get execution time of every part of a profile plan

Is there a way to measure the time it takes to perform each part of a Neo4j execution plan?
I can see the total execution time and total db hits, as well as the db hits and estimated rows for each part of the execution plan, but not the time it takes to perform each part, for example a 'Filter' or 'Expand(All)' operation.
No, you can't.
But you do have the number of db hits in each box, so you are already aware of the resource consumption of each part.
Why do you want to know the time of each part?
Update after comment
A db hit is an abstract unit of work for the database, so the more db hits a box has, the more work has to be done on it, and the more time it takes.
On the other hand, execution time depends a lot on the state of your computer. Do you have a lot of processes using the CPU, memory, network, or hard drive?
So comparing execution times is a bad habit; you should compare db hits instead.
Db hits are always related to the execution time of a query, but the opposite is not necessarily true.
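If you want to read those per-operator db hits programmatically instead of from the PROFILE boxes in the Neo4j Browser, a minimal sketch using the Neo4j Java driver (4.x API assumed; the URI, credentials and query are placeholders) might look like this:

```java
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;
import org.neo4j.driver.summary.ProfiledPlan;

public class ProfileDbHits {

    // Walk the profiled plan tree and print db hits and rows per operator.
    static void printPlan(ProfiledPlan plan, int depth) {
        System.out.printf("%s%s  dbHits=%d  rows=%d%n",
                "  ".repeat(depth), plan.operatorType(), plan.dbHits(), plan.records());
        for (ProfiledPlan child : plan.children()) {
            printPlan(child, depth + 1);
        }
    }

    public static void main(String[] args) {
        // URI, credentials and query are placeholders; adjust for your setup.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "secret"));
             Session session = driver.session()) {
            Result result = session.run("PROFILE MATCH (n:Person)-[:KNOWS]->(m) RETURN count(m)");
            // Consuming the result discards remaining records and exposes the summary,
            // which carries the profiled plan because the query was run with PROFILE.
            printPlan(result.consume().profile(), 0);
        }
    }
}
```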

VoltDB Stored Procedures history

I have many stored procedures running in Volt, and it seems like one of them is causing CPU spikes every now and then, but I don't know which one.
Is there somewhere I can see the history of all the stored procedures that ran so that I could pinpoint the problematic one based on the time it occurred?
I have tried turning Command Logging on, but it is a binary file, so I have no way of reading it.
My next option is to log from inside the stored procedures, but I prefer to keep that as a last resort because it will require some extra development and deployment, and it won't be relevant for internal procedures.
Is there any way to log/somehow see when stored procedures ran?
There isn't a log of every transaction in VoltDB that a user can review. The command log is not meant to be readable and only includes writes. However, there are some tools you can use to identify poorly performing or long-running procedures.
You can call "exec #Statistics PROCEDUREPROFILE 0;" to get a summary of all the procedures that have been executed, including the number of invocations and the average execution time in nanoseconds. If one particular procedure is the problem, it may stick out.
You can also grep the volt.log file for the phrase "is taking a long time", which is a message printed when a procedure or SQL statement takes longer than 1 second to execute.
Also, there is a script in the tools subdirectory called watch_performance.py, which can be used to monitor the performance. It is similar to calling "exec @Statistics PROCEDUREPROFILE 0;" at regular intervals, except that it gathers some columns from additional @Statistics selectors and formats the output for readability. "./watch_performance.py -h" will output help and usage information. For example, you might run this during a performance load to get a picture of the workload. Or, you might run it over a longer period of time, perhaps at less granular intervals, to see the fluctuations in the workload over time.
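If you prefer to poll those statistics from code rather than from sqlcmd, a rough sketch with the VoltDB Java client could look like the following; the host is a placeholder and the column names are assumed from the PROCEDUREPROFILE output, so verify them against your VoltDB version:

```java
import org.voltdb.VoltTable;
import org.voltdb.client.Client;
import org.voltdb.client.ClientFactory;
import org.voltdb.client.ClientResponse;

public class ProcedureProfilePoller {
    public static void main(String[] args) throws Exception {
        Client client = ClientFactory.createClient();
        client.createConnection("localhost"); // placeholder host

        // Second parameter 0 = statistics since the server started;
        // pass 1 to get the delta since the previous interval call.
        ClientResponse response = client.callProcedure("@Statistics", "PROCEDUREPROFILE", 0);
        VoltTable table = response.getResults()[0];

        // Column names (PROCEDURE, INVOCATIONS, AVG) are assumed from the
        // PROCEDUREPROFILE output and may differ between VoltDB versions.
        while (table.advanceRow()) {
            System.out.printf("%s  invocations=%d  avg_ns=%d%n",
                    table.getString("PROCEDURE"),
                    table.getLong("INVOCATIONS"),
                    table.getLong("AVG"));
        }
        client.close();
    }
}
```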
Disclosure: I work for VoltDB

Dataflow batch vs streaming: window size larger than batch size

Let's say we have log data with timestamps that can either be streamed into BigQuery or stored as files in Google Storage, but not streamed directly to the unbounded collection source types that Dataflow supports.
We want to analyse this data based on timestamp, either relatively or absolutely, e.g. "how many hits in the last 1 hour?" and "how many hits between 3pm and 4pm on 5th Feb 2018"?
Having read the documentation on windows and triggers, it's not clear how we would divide our incoming data into batches in a way that is supported by Dataflow if we want to have a large window - potentially we want to aggregate over the last day, 30 days, 3 months, etc.
For example, if our batched source is a BigQuery query, run every 5 mins, for the last 5 mins worth of data, will Dataflow keep the windows open between job runs, even though the data is arriving in 5 min chunks?
Similarly, if the log files are rotated every 5 mins, and we start Dataflow as a new file is saved to the bucket, the same question applies - is the job stopped and started, and all knowledge of previous jobs discarded, or does the large window (e.g. up to a month) remain open for new events?
How do we change/modify this pipeline without disturbing the existing state?
Apologies if these are basic questions, even a link to some docs would be appreciated.
It sounds like you want arbitrary interactive aggregation queries on your data. Beam / Dataflow are not a good fit for this per se; however, one of the most common use cases of Dataflow is to ingest data into BigQuery (e.g. from GCS files or from Pub/Sub), and BigQuery is a very good fit for that kind of querying.
A few more comments on your question:
it's not clear how we would divide our incoming data into batches
Windowing in Beam is simply a way to specify the aggregation scope in the time dimension. E.g. if you're using sliding windows of size 15 minutes every 5 minutes, then a record whose event-time timestamp is 14:03 counts towards aggregations in three windows: 13:50..14:05, 13:55..14:10, 14:00..14:15.
So: in the same way that you don't need to divide your incoming data into "keys" when grouping by a key (the data processing framework performs the group-by-key for you), you don't divide it into windows either (the framework performs the group-by-window implicitly as part of every aggregating operation).
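To make that concrete, here is a minimal Beam Java sketch of the sliding-window example above; the tiny Create-based input is only for illustration, since in a real job the timestamps would come from the source:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class SlidingWindowCount {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // A tiny bounded input just for illustration.
        PCollection<String> hits = p.apply(Create.timestamped(
            TimestampedValue.of("hit", Instant.parse("2018-02-05T14:03:00Z"))));

        // 15-minute windows starting every 5 minutes, matching the example above:
        // the 14:03 element lands in the 13:50, 13:55 and 14:00 windows.
        PCollection<Long> hitsPerWindow = hits
            .apply(Window.<String>into(
                SlidingWindows.of(Duration.standardMinutes(15))
                              .every(Duration.standardMinutes(5))))
            // Beam performs the group-by-window implicitly inside the aggregation.
            .apply(Count.<String>globally().withoutDefaults());

        p.run();
    }
}
```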
will Dataflow keep the windows open between job runs
I'm hoping this is addressed by the previous point, but to clarify more: No. Stopping a Dataflow job discards all of its state. However, you can "update" a job with new code (e.g. if you've fixed a bug or added an extra processing step) - in that case state is not discarded, but I think that's not what you're asking.
if the log files are rotated every 5 mins, and we start Dataflow as a new file is saved
It sounds like you want to ingest data continuously. The way to do that is to write a single continuously running streaming pipeline that ingests the data continuously, rather than to start a new pipeline every time new data arrives. In the case of files arriving into a bucket, you can use TextIO.read().watchForNewFiles() if you're reading text files, or its various analogues if you're reading some other kind of files (most general is FileIO.matchAll().continuously()).
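For the file-watching case, a hedged sketch of a single long-running pipeline (the bucket path and polling interval are placeholders):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WatchLogFiles {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Poll the bucket every 30 seconds for new files and never stop watching;
        // this makes the PCollection unbounded, so the job runs as one long-lived
        // streaming pipeline instead of being restarted for each file.
        PCollection<String> logLines =
            p.apply(TextIO.read()
                .from("gs://my-bucket/logs/*.log")   // placeholder path
                .watchForNewFiles(Duration.standardSeconds(30), Watch.Growth.never()));

        // ... apply windowing / aggregation to logLines here ...

        p.run();
    }
}
```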

Initial state for a dataflow job

I'm trying to figure out how we "seed" the window state for some of our streaming dataflow jobs. The scenario: we have a stream of forum messages, and we want to emit a running count of messages for each topic for all time, so we have a streaming dataflow job with a global window and triggers that emit each time a record for a topic comes in. All good so far. But prior to the stream source, we have a large file which we'd like to process to get our historical counts. Also, because topics live forever, we need the historical count to inform the outputs from the stream source, so we kind of need the same logic to run over the file, then start running over the stream source when the file is exhausted, while keeping the window state.
Current ideas:
Write a custom unbounded source that does just that. Reads over the file until it's exhausted and then starts reading from the stream. Not much fun because writing custom sources is not much fun.
Run the logic in batch mode over the file, and as the last step emit the state to a stream sink somehow, then have a streaming version of the logic start up that reads from both the state stream and the data stream, and somehow combines the two. This seems to make some sense, but not sure how to make sure that the streaming job reads everything from the state source, to initialise, before reading from the data stream.
Pipe the historical data into a stream, write a job that reads from both the streams. Same problems as the second solution, not sure how to make sure one stream is "consumed" first.
EDIT: Latest option, and what we're going with, is to write the calculation job such that it doesn't matter at all what order the events arrive in, so we'll just push the archive to the pub/sub topic and it will all work. That works in this case, but obviously it affects the downstream consumer (need to either support updates or retractions) so I'd be interested to know what other solutions people have for seeding their window states.
You can do what you suggested in bullet point 2: run two pipelines (in the same main), where the first populates a Pub/Sub topic from the large file. This is similar to what the StreamingWordExtract example does.
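A rough sketch of that shape, with the topic and file paths as placeholders and the actual counting logic elided, could be:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class SeedThenStream {
    public static void main(String[] args) {
        String topic = "projects/my-project/topics/forum-messages"; // placeholder

        // Pipeline 1: batch job that replays the historical file into Pub/Sub.
        Pipeline seed = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        seed.apply(TextIO.read().from("gs://my-bucket/history/*.json")) // placeholder
            .apply(PubsubIO.writeStrings().to(topic));
        seed.run().waitUntilFinish(); // block until the archive is fully published

        // Pipeline 2: streaming job that consumes the same topic, which now
        // contains the historical records followed by live traffic.
        Pipeline streaming = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        PCollection<String> messages =
            streaming.apply(PubsubIO.readStrings().fromTopic(topic));
        // ... apply the global-window, per-topic counting logic to `messages` ...
        streaming.run();
    }
}
```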

Is MobileServiceSQLiteStore.DefineTable<T> necessary on every run and if so why?

I'm trying to improve app launch performance for subsequent logins (every login after the first) with my mobile app, and after adding some stopwatch diagnostics I can see that defining my 8 tables with MobileServiceSQLiteStore.DefineTable<T> takes on average 2.5 seconds. Every time.
On an iPhone 4 running iOS 7 the loading time would be less than a second if it weren't for having to define these tables every time. I would expect them to only need to be defined on the first run of the app, when the SQLite database is set up. I've tried removing the definitions on subsequent logins and just getting the sync tables, but it fails with "Table is not defined".
So, it seems this is the intended behavior. Can you explain why they need to be defined each time and/or whether there is any workaround for this? It may be negligible considering my phone is pretty old now... but it is still something I would like to remove if possible.
Yes, it is required to be called every time because the SDK uses it to know how to deserialize data if you read it via the untyped interface, i.e. IMobileServiceSyncTable instead of IMobileServiceSyncTable<T>.
As of now there is no workaround to avoid calling it each time. I'm surprised, however, that it is taking 2.5 seconds for you, because DefineTable does not do any database operations. It merely inspects the members on your type/JObject and maintains an in-memory dictionary for later reuse.
I would recommend downloading and compiling the SDK and debugging your way through it to figure out where the time is actually spent.
