Why does 'doit list' run for hours on a large dataset? - doit

I have a large number of files, 200k+, with a sub-task defined for each file. When I run anything, even 'doit list', it takes 2-3 hours before doit responds.
I'd like to understand what it is doing during that time.
I have run 'doit list' and 'doit --check_file_uptodate timestamp list', and each ran for 2-3 hours.
I tried the latter because I assumed the time was being spent calculating MD5 hashes for all of these files. It resulted in an error after 2.5 hours: ERROR: Invalid parameter: "list". Must be a command, task, or a target.
I have also tried adding the timestamp directive in dodo.py as shown below, but got similar results.
DOIT_CONFIG = {'check_file_uptodate': 'timestamp'}
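For context, below is a minimal dodo.py sketch of this kind of setup; the process command, the data/ and out/ paths, and the all_files list are hypothetical stand-ins for the real 200k+ inputs. Note that even doit list has to import the module and run the task generator to build every sub-task, so with this many files a large share of the time is likely spent simply creating tasks, before any up-to-date check runs.

# dodo.py -- minimal sketch of the setup described above; the `process`
# command and the data/out paths are hypothetical stand-ins.
from pathlib import Path

DOIT_CONFIG = {'check_file_uptodate': 'timestamp'}

all_files = sorted(Path('data').glob('*.dat'))  # 200k+ files in the real case

def task_process():
    # One sub-task per input file. Even `doit list` runs this generator
    # for every file, so task creation alone can dominate the runtime.
    for src in all_files:
        out = Path('out') / (src.stem + '.csv')
        yield {
            'name': src.name,
            'actions': [f'process {src} {out}'],
            'file_dep': [str(src)],
            'targets': [str(out)],
        }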

Related

Time series - not periodic, despite having included frequency

This is actually part of my thesis research, where I have to run a time series analysis on pollution and economic growth of a single country.
I have over 144 years of data for the two variables, with each value representing a single year. I imported the data, set the values as numeric, attached the dataset through the console, and ran:
ts_gdp <- ts(data = `GDP per capita`, start = 1871, end = 2014, frequency = 1, names = "gdp")
I can see all the values for the first variable, and I then follow up with stl(), but I get the error below. Any clues why this shows up, even though I have set frequency = 1, which is the number of observations per unit of time, in this case a year? Thank you in advance!
Error in stl(GDP, s.window = "periodic") :
series is not periodic or has less than two periods

NetLogo BehaviorSpace: how to save data not per tick but based on a reporter

I have a NetLogo model for which a run takes about 15 minutes but goes through a lot of ticks, because not much happens in any single tick. I want to do quite a few runs in a BehaviorSpace experiment. The output (table output only) will contain all the output and input variables for every tick. However, not all of this data is relevant: it's only relevant once per day (day is a variable in the model; a run lasts 1095 days).
The result is that the model gets very slow when running experiments via BehaviorSpace. Not only would it be nicer to have output data with just 1095 rows; writing every tick perhaps also causes the experiment to slow down tremendously.
How can I fix this?
It is possible to write your own output file in a BehaviorSpace experiment. Program your code to create and open an output file that contains only the results you want.
The problem is keeping BehaviorSpace from trying to open the same output file from different model runs running on different processors, which causes a runtime error. I have tried two solutions.
1. Tell BehaviorSpace to use only one processor for the experiment. Then you can use the same output file for all model runs. If you want the output lines to include which model run they came from, use the primitive behaviorspace-run-number.
2. Have each model run create its own output file with a unique name. Open the file using something like:
file-open (word "Output-for-run-" behaviorspace-run-number ".csv")
so the output files will be named Output-for-run-1.csv etc.
(If you are not familiar with it, the CSV extension is very useful for writing output files. You can put everything you want to output on a big list, and then when the model finishes write the list into a CSV file with:
csv:to-file (word "Output-for-run-" behaviorspace-run-number ".csv") the-big-list
)

Stored procedure execution gives different results in SSIS vs. SSMS

I execute a stored procedure which takes 7 parameters and returns an integer code. When I run the EXEC statement in SSMS, it works perfectly and gives a valid result. But when I run the same thing using SSIS, I get the exact opposite of the result I am expecting.
I ran a trace in Profiler and saw what was being passed. The parameters have the correct sequence and correct values. I am not sure what is going on with the SSIS execution.
I verified the data types too, and they look good to me. One thing I noticed in Profiler was that even the integer-valued columns were being passed as varchars with single quotes around them. Does that make any difference? Below is what I got from Profiler:
exec sp_executesql N'EXEC [dbo].[ProcName] @P1,@P2,@P3,@P4,@P5,@P6,@P7',N'@P1 varchar(6),@P2 nvarchar(9),@P3 datetime2(0),@P4 varchar(1),@P5 varchar(1),@P6 varchar(4),@P7 nvarchar(5)','743290',N'000000034','2018-07-25 00:00:00','2','2','1002',N'Swift'
Thanks,
RV
OK, so this was a mistake on my part. The query I was running in SSMS was running against the primary instance of the cluster, whereas the SSIS package points to the read-only instance of the cluster.
I was under the impression that the data was the same on these two servers, but that was my mistake. Once I pointed the SSMS query at the read-only instance, it gave exactly the result I was expecting. Hence, closing this out.

Airflow list_dags times out after exactly 30 seconds

I have a dynamic Airflow DAG (backfill_dag) that basically reads an Admin Variable (JSON) and builds itself. Backfill_dag is used for backfilling/history loading, so for example if I want to history-load DAGs x, y, and z in some order (x and y run in parallel, z depends on x), then I describe this in a particular JSON format and put it in the Admin Variable of backfill_dag.
Backfill_dag then:
parses the JSON,
renders the tasks of DAGs x, y, and z, and
builds itself dynamically with x and y in parallel and z depending on x (a simplified sketch of this kind of self-building DAG follows this list).
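For illustration only, here is a simplified sketch of this kind of self-building DAG. It is not the poster's actual approach (which renders the tasks of x, y, and z directly rather than triggering those DAGs); Airflow 1.10-era imports, a Variable named backfill_config, and a trigger-based stand-in are assumptions.

# Simplified stand-in, not the poster's actual code; Airflow 1.10-era imports
# and a Variable named "backfill_config" are assumed.
from datetime import datetime
from airflow import DAG
from airflow.models import Variable
from airflow.operators.dagrun_operator import TriggerDagRunOperator

# Example Variable value: {"dags": ["x", "y", "z"], "depends": {"z": ["x"]}}
# Note: Variable.get runs at parse time, so it is itself part of what
# `airflow list_dags` has to wait for.
config = Variable.get("backfill_config", default_var={"dags": [], "depends": {}},
                      deserialize_json=True)

dag = DAG(dag_id="backfill_dag_sketch", start_date=datetime(2019, 1, 1),
          schedule_interval=None, catchup=False)

# One trigger task per DAG to backfill.
triggers = {
    name: TriggerDagRunOperator(task_id="trigger_" + name, trigger_dag_id=name, dag=dag)
    for name in config["dags"]
}

# Wire dependencies from the JSON: here z depends on x, while x and y stay parallel.
for downstream, upstreams in config.get("depends", {}).items():
    for upstream in upstreams:
        triggers[upstream] >> triggers[downstream]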
Issue:
It works fine as long as Backfill_dag can be listed within 30 seconds.
Since Backfill_dag is a bit complex here, it takes more than 30 seconds to list (airflow list_dags -sd Backfill_dag.py), hence it times out and the DAG breaks.
Tried:
I tried setting a parameter, dagbag_import_timeout = 100, in the scheduler's airflow.cfg file, but that did not help.
I fixed my code.
Fix:
I had some aws s3 cp commands in my DAG that were running during compilation, hence my list_dags command was taking more than 30 seconds. I removed them (or rather, moved them into a BashOperator task), and now my code compiles (list_dags) in a couple of seconds; a sketch of this change is below.
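As an illustration of that fix, here is a minimal sketch, assuming Airflow 1.10-era imports (matching the airflow list_dags CLI above) and a hypothetical S3 path: the aws s3 cp call is kept out of module-level code and put in a BashOperator, so parsing the file stays fast and the copy only runs when the task executes.

# Minimal sketch, not the actual backfill_dag; an Airflow 1.10-era import path
# and a hypothetical bucket/path are assumed.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Slow (what caused the timeout): running the copy at import time, e.g.
#   subprocess.check_call(["aws", "s3", "cp", "s3://my-bucket/backfill.json", "/tmp/backfill.json"])
# executes on every parse, including during `airflow list_dags`.

dag = DAG(
    dag_id="backfill_dag_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
    catchup=False,
)

# Fast to parse: the copy only happens when this task actually executes.
fetch_config = BashOperator(
    task_id="fetch_backfill_config",
    bash_command="aws s3 cp s3://my-bucket/backfill.json /tmp/backfill.json",
    dag=dag,
)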
Besides fixing your code, you can also increase core.dagbag_import_timeout, which defaults to 30 seconds. For me, increasing it to 150 helped.
core.dagbag_import_timeout
default: 30 seconds
The number of seconds before importing a Python file times out.
You can use this option to free up resources by increasing the time it takes before the Scheduler times out while importing a Python file to extract the DAG objects. This option is processed as part of the Scheduler "loop," and must contain a value lower than the value specified in core.dag_file_processor_timeout.
core.dag_file_processor_timeout
default: 50 seconds
The number of seconds before the DagFileProcessor times out processing a DAG file.
You can use this option to free up resources by increasing the time it takes before the DagFileProcessor times out. We recommend increasing this value if you're seeing timeouts in your DAG processing logs that result in no viable DAGs being loaded.
You can also try changing other Airflow configs, such as:
AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT
AIRFLOW__CORE__DEFAULT_TASK_EXECUTION_TIMEOUT
and, as mentioned above:
AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT
AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT

BIDS SSRS Report query timeout issue while using Stored Procedure with timeout settings set appropriately

I've run into a timeout issue while executing a stored procedure for an SSRS report I've created in Business Intelligence Development Studio (BIDS). My stored procedure is pretty large and on average takes nearly 4 minutes to execute in SQL Server Management Studio. So I've accommodated for this by increasing the "Time out (in seconds)" setting to 600 seconds (10 minutes). I've also increased both the Query Timeout and Connection Timeout under Tools -> Options -> Business Intelligence Designers to 600 seconds.
Lastly, I've since created two other reports that use stored procedures with no problems (they are a lot smaller and take roughly 30 seconds to execute). For my dataset properties, I always use query type "Text" and call the stored procedure with the EXEC command.
Any ideas as to why my stored procedure of interest is still timing out?
Below is the error message I receive after clicking "Refresh Fields":
"Could not create a list of fields for the query. Verify that you can connect to the data source and that your query syntax is correct."
Details
"Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
The statement has been terminated."
Thank You for your time.
Check the Add Key="DatabaseQueryTimeout" Value="120" setting in your rsreportserver.config file. You may need to increase it there as well.
More info on that file:
http://msdn.microsoft.com/en-us/library/ms157273.aspx
Also, in addition to what the first commenter on your post stated, in my experience, if you are rendering to PDF, those renders can time out as well. Your large dataset is returned within a reasonable amount of time; however, rendering the PDF can take forever. Try rendering to Excel. The BIDS results will render rather quickly, but exporting the results is what can cause an issue.
