CDC Source Issue in SSIS 2012

I'm using the CDC Source component in one of my packages and was getting a CDC timeout error. After searching for a solution, I changed the Command Timeout property from 30 to 0 (zero).
Now retrieving 300K records takes more than 30 minutes. How can I reduce that time?

Previously we were using the CDC Source component with the Net CDC processing mode; after changing it to All, records are retrieved faster than expected.

Related

Execution time of Neo4j Cypher query

I'm trying to find the execution time of GDS algorithms using the Community Edition of Neo4j. Is there any way to find it other than query logging, since that facility is specific to the Enterprise Edition?
Update:
I did as suggested. Why is the result 0 for computeMillis and preProcessingMillis?
Update 2:
The following table indicates the time in ms required to run the Yen algorithm to retrieve one path for each topology. However, the time does not depend on the graph size. Why? Is it normal to have such results?
When you execute the mutate or the write mode of the algorithm, you can YIELD the computeMillis property, which tells you the execution time of the algorithm (see the example query after this list). Note that some algorithms, like PageRank, have more properties available to be YIELDed:
preProcessingMillis - Milliseconds for preprocessing the graph.
computeMillis - Milliseconds for running the algorithm.
postProcessingMillis - Milliseconds for computing the centralityDistribution.
writeMillis - Milliseconds for writing result data back.
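For example, a minimal sketch of the write mode (this assumes a projected in-memory graph named 'myGraph'; the graph name and writeProperty are placeholders, not from the question):

CALL gds.pageRank.write('myGraph', { writeProperty: 'pagerank' })
YIELD preProcessingMillis, computeMillis, postProcessingMillis, writeMillis
RETURN preProcessingMillis, computeMillis, postProcessingMillis, writeMillis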

Dataflow abnormality in time to complete the job and the total CPU hours with reshuffle via random key

I have created a dataflow which takes input from Datastore and applies a transform to convert it to BigQuery TableRow. I attach a timestamp to each element in a transform. Then a window of one day is applied to the PCollection. The windowed output is written to a partition in a BigQuery table using Apache Beam's BigQueryIO.
Before writing to BigQuery, it uses reshuffle via random key as an intermediate step to avoid fusion.
The pipeline behaviour is :
1. For 2.8 million entities in the input:
Total vCPU time: 5.148 vCPU hr
Time to complete job: 53 min 9 sec
Current workers: 27
Target workers: 27
Job ID: 2018-04-04_04_20_34-1951473901769814139
2. For 7 million entities in the input:
Total vCPU time: 247.772 vCPU hr
Time to complete job: 3 hr 45 min
Current workers: 69
Target workers: 1000
Job ID: 2018-04-02_21_59_47-8636729278179820259
I can't understand why the second case takes so much more time and so many more CPU hours to finish the job.
The dataflow pipeline at a high level is:
// Read from datastore
PCollection<Entity> entities =
pipeline.apply("ReadFromDatastore",
DatastoreIO.v1().read().withProjectId(options.getProject())
.withQuery(query).withNamespace(options.getNamespace()));
// Apply processing to convert it to BigQuery TableRow
PCollection<TableRow> tableRow =
entities.apply("ConvertToTableRow", ParDo.of(new ProcessEntityFn()));
// Apply timestamp to TableRow element, and then apply windowing of one day on that
PCollection<TableRow> tableRowWindow =
tableRow.apply("tableAddTimestamp", ParDo.of(new ApplyTimestampFn())).apply(
"tableApplyWindow",
Window.<TableRow> into(CalendarWindows.days(1).withTimeZone(
DateTimeZone.forID(options.getTimeZone()))));
//Apply reshuffle with random key for avoiding fusion
PCollection<TableRow> ismTableRowWindow =
tableRowWindow.apply("ReshuffleViaRandomKey",
Reshuffle.<TableRow> viaRandomKey());
// Write windowed output to BigQuery partitions
ismTableRowWindow.apply(
"WriteTableToBQ",
BigQueryIO
.writeTableRows()
.withSchema(BigqueryHelper.getSchema())
.to(TableRefPartition.perDay(options.getProject(),
options.getBigQueryDataset(), options.getTableName()))
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
I saw you posted a similar question here, and now you have added to your code the step:
//Apply reshuffle with random key for avoiding fusion
...
As someone already told you in the other question:
"The OOM might be symptomatic of a hot key"
So in this case it looks like something similar is still happening (you can find further information about hot key problems here).
If this is the case and some worker is stuck, then the number of entities vs. the time to complete the job doesn't have to follow any linearity. And the vCPU consumption is more a matter of optimizing the code to avoid the hot key issue.

Reading large gzip JSON files from Google Cloud Storage via Dataflow into BigQuery

I am trying to read about 90 gzipped JSON logfiles from Google Cloud Storage (GCS), each about 2 GB (10 GB uncompressed), parse them, and write them into a date-partitioned BigQuery (BQ) table via Google Cloud Dataflow (GCDF).
Each file holds 7 days of data; the whole date range is about 2 years (730 days and counting). My current pipeline looks like this:
p.apply("Read logfile", TextIO.Read.from(bucket))
.apply("Repartition", Repartition.of())
.apply("Parse JSON", ParDo.of(new JacksonDeserializer()))
.apply("Extract and attach timestamp", ParDo.of(new ExtractTimestamps()))
.apply("Format output to TableRow", ParDo.of(new TableRowConverter()))
.apply("Window into partitions", Window.into(new TablePartWindowFun()))
.apply("Write to BigQuery", BigQueryIO.Write
.to(new DayPartitionFunc("someproject:somedataset", tableName))
.withSchema(TableRowConverter.getSchema())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
The Repartition is something I've built in while trying to make the pipeline reshuffle after decompressing; I have tried running the pipeline with and without it. Parsing JSON works via a Jackson ObjectMapper and corresponding classes, as suggested here. The TablePartWindowFun is taken from here; it is used to assign a partition to each entry in the PCollection.
The pipeline works for smaller files and not too many of them, but breaks for my real data set. I've selected large enough machine types and tried setting a maximum number of workers, as well as using autoscaling up to 100 n1-highmem-16 machines. I've tried streaming and batch mode and diskSizeGb values from 250 up to 1200 GB per worker.
The possible solutions I can think of at the moment are:
Uncompress all files on GCS, thus enabling the dynamic work splitting between workers, as it is not possible to leverage GCS's gzip transcoding
Building "many" parallel pipelines in a loop, with each pipeline processing only a subset of the 90 files.
Option 2 seems to me like programming "around" a framework; is there another solution?
Addendum:
With Repartition after Reading the gzip JSON files in batch mode with a maximum of 100 workers (of type n1-highmem-4), the pipeline runs for about an hour with 12 workers and finishes the Reading as well as the first stage of Repartition. Then it scales up to 100 workers and processes the repartitioned PCollection. After it is done, the graph looks like this:
Interestingly, when reaching this stage, it first processes up to 1.5 million elements/s, then the progress goes down to 0. The size of the output collection of the GroupByKey step in the picture first goes up and then down, from about 300 million to 0 (there are about 1.8 billion elements in total), as if it were discarding something. Also, the ExpandIterable and ParDo(Streaming Write) run-times at the end are 0. The picture shows the graph slightly before it runs "backwards".
In the logs of the workers I see some exceptions thrown while executing request messages, coming from the com.google.api.client.http.HttpTransport logger, but I can't find more info in Stackdriver.
Without Repartition after Reading, the pipeline fails on n1-highmem-2 instances with out-of-memory errors at exactly the same step (everything after GroupByKey); using bigger instance types leads to exceptions like
java.util.concurrent.ExecutionException: java.io.IOException:
CANCELLED: Received RST_STREAM with error code 8 dataflow-...-harness-5l3s
talking to frontendpipeline-..-harness-pc98:12346
Thanks to Dan from the Google Cloud Dataflow Team and the example he provided here, I was able to solve the issue. The only changes I made:
Looping over the days in large chunks of 175 days (25 weeks), running one pipeline after the other, so as not to overwhelm the system (see the sketch after this list). In the loop, make sure the last files of the previous iteration are re-processed and that the startDate is moved forward at the same speed as the underlying data (175 days). As WriteDisposition.WRITE_TRUNCATE is used, incomplete days at the end of a chunk are overwritten with correct, complete data this way.
Using the Repartition/Reshuffle transform mentioned above, after reading the gzipped files, to speed up the process and allow smoother autoscaling
Using DateTime instead of Instant types, as my data is not in UTC
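A minimal sketch of that chunked loop (the helper name, dates, and time zone are illustrative assumptions, not the original code):

import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;

public class ChunkedBackfill {
  // 25 weeks per chunk; WRITE_TRUNCATE re-writes any partial days at a
  // chunk boundary with complete data on the next iteration.
  static final int CHUNK_DAYS = 175;

  public static void main(String[] args) {
    // Illustrative dates and zone; the poster's data is not in UTC.
    DateTimeZone zone = DateTimeZone.forID("Europe/Berlin");
    DateTime startDate = new DateTime(2015, 1, 1, 0, 0, zone);
    DateTime endOfData = new DateTime(2017, 1, 1, 0, 0, zone);

    while (startDate.isBefore(endOfData)) {
      DateTime chunkEnd = startDate.plusDays(CHUNK_DAYS);
      // Hypothetical helper: assembles the Read -> Repartition -> Parse ->
      // Window -> BigQueryIO.Write pipeline restricted to files covering
      // [startDate, chunkEnd) and blocks until the run finishes.
      buildAndRunPipeline(startDate, chunkEnd);
      startDate = chunkEnd;
    }
  }

  static void buildAndRunPipeline(DateTime from, DateTime to) {
    // ... assemble and run the pipeline shown above for this date range ...
  }
}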
UPDATE (Apache Beam 2.0):
With the release of Apache Beam 2.0 the solution became much easier. Sharding BigQuery output tables is now supported out of the box.
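For illustration, a hedged sketch of what that can look like in Beam 2.0, routing each windowed TableRow to its day partition with a table function (the table spec is a placeholder, tableRows stands for the windowed PCollection<TableRow> from the pipeline above, and one-day interval windows are assumed):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.apache.beam.sdk.values.ValueInSingleWindow;
import org.joda.time.format.DateTimeFormat;

tableRows.apply("WriteShardedToBQ", BigQueryIO.writeTableRows()
    .to(new SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>() {
      @Override
      public TableDestination apply(ValueInSingleWindow<TableRow> input) {
        // Derive the day partition from the element's window.
        IntervalWindow window = (IntervalWindow) input.getWindow();
        String day = DateTimeFormat.forPattern("yyyyMMdd").print(window.start());
        // The "$yyyymmdd" suffix targets the matching table partition.
        return new TableDestination("someproject:somedataset.sometable$" + day, null);
      }
    })
    .withSchema(TableRowConverter.getSchema())
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));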
It may be worthwhile trying to allocate more resources to your pipeline by setting --numWorkers to a higher value when you run your pipeline. This is one of the possible solutions discussed in the "Troubleshooting Your Pipeline" online document, in the "Common Errors and Courses of Action" sub-chapter.

No results when running query against data written to InfluxDB

I have InfluxDB 0.13 and I'm sending data via the HTTP API. I'm getting status code 204 back, which I assume means OK. I can see the series if I query with SHOW SERIES; I see the measurement and tags. But I cannot query any of the data; it just says no results (query: SELECT * FROM "sql-query").
This is the raw data sent to Influx from Fiddler. Any idea what's wrong?
sql-query,Environment=QA,Service=XTAM_Lag SubscriberName="TXXOff",LagMinutes=141278i 1472628420980000000
sql-query,Environment=QA,Service=XTAM_Lag SubscriberName="TXXTIMEDEPOT",LagMinutes=248i 1472628420980000000
sql-query,Environment=QA,Service=XTAM_Lag SubscriberName="TXXOffMirror",LagMinutes=0i 1472628420980000000
sql-query,Environment=QA,Service=XTAM_Lag SubscriberName="TXXOffMirrorQA",LagMinutes=527i 1472628420980000000
sql-query,Environment=QA,Service=XTAM_Lag SubscriberName="TXXOff",LagMinutes=141279i 1472628480390000128
sql-query,Environment=QA,Service=XTAM_Lag SubscriberName="TXXTIMEDEPOT",LagMinutes=249i 1472628480390000128
sql-query,Environment=QA,Service=XTAM_Lag SubscriberName="TXXOffMirror",LagMinutes=0i 1472628480390000128
sql-query,Environment=QA,Service=XTAM_Lag SubscriberName="TXXOffMirrorQA",LagMinutes=528i 1472628480390000128
By default, all InfluxDB queries with no time constraint will use the current time in UTC on the InfluxDB server as an implicit upper time bound. Essentially, the query SELECT * FROM "sql-query" is interpreted as SELECT * FROM "sql-query" WHERE time < now().
The current UTC time on the server running InfluxDB can be different from the current time on the server generating metrics. This difference can be due to either a bad clock or, more likely, the use of a time zone other than UTC.
If there is an offset, new data will sometimes be written with timestamps in the relative future. Due to the implicit upper time bound on queries explained above, those points will then be excluded from a basic query.
To confirm whether this is the issue, try running a query with the upper time bound set a few days in the future.
SELECT * FROM "sql-query" WHERE time < now() + 1w
The query above will return all points in the sql-query measurement, plus any points written with a relative time up to a week in the future.

BIDS SSRS Report query timeout issue while using Stored Procedure with timeout settings set appropriately

I've run into a timeout issue while executing a stored procedure for an SSRS report I've created in Business Intelligence Development Studio (BIDS). My stored procedure is pretty large and on average takes nearly 4 minutes to execute in SQL Server Management Studio, so I've accommodated for this by increasing the "Time out (in seconds)" setting to 600 seconds (10 mins). I've also increased both the Query Timeout and the Connection Timeout under Tools -> Options -> Business Intelligence Designers to 600 seconds.
Lastly, I've since created two other reports that use stored procedures with no problems (they are a lot smaller and take roughly 30 seconds to execute). For my dataset properties, I always use Query type "Text" and call the stored procedure with the EXEC command.
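For example, such a Text-type dataset query looks like this (the procedure name and parameters are illustrative, not the actual report's):

EXEC dbo.usp_MyLargeReport @StartDate = @StartDate, @EndDate = @EndDate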
Any ideas as to why my stored procedure of interest is still timing out?
Below is the error message I receive after clicking "Refresh Fields":
"Could not create a list of fields for the query. Verify that you can connect to the data source and that your query syntax is correct."
Details
"Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
The statement has been terminated."
Thank you for your time.
Check the <Add Key="DatabaseQueryTimeout" Value="120"/> setting in your rsreportserver.config file. You may need to increase it there as well.
More info on that file:
http://msdn.microsoft.com/en-us/library/ms157273.aspx
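For example, to match the 10-minute timeout used above, the entry in rsreportserver.config would look something like this (a sketch; adjust the value to your needs):

<Add Key="DatabaseQueryTimeout" Value="600"/>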
Also, in addition to what the first commenter on your post stated, in my experience reports rendered to PDF can time out as well. Your large dataset is returned within a reasonable amount of time, but rendering the PDF can take forever. Try rendering to Excel. The BIDS results will render rather quickly, but exporting the results is what can cause an issue.
