Snowflake task for historical data load, time limit

I'm trying to do a full historical load, including transformations, from one table in Snowflake to another. I expect it to take over an hour.
I would like to schedule this load and have it run overnight so that I don't have to stay connected to the network or risk a connection issue.
I tried to use a scheduled task to do this.
Problem: there is a 60-minute limit. You can make it shorter (STATEMENT_TIMEOUT_IN_SECONDS = 60), but it appears you can't make it longer. https://docs.snowflake.net/manuals/user-guide/tasks-ts.html#task-timed-out-or-exceeded-the-schedule-window
Anyone else experience this and have a suggestion?

Another approach is to scale up your WAREHOUSE, which makes sense if your TASK is parallelizable.

This is a soft limit on task execution, set per account. You should be able to contact Snowflake support to get it extended.

Yes, you should raise a case with Snowflake support and request that the task limit be increased. Check how long the task usually takes to complete and request that value. The limit will be set at the account level. The development team is also working on a future feature that will allow the timeout to be set at the task level.
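For reference, a rough sketch of what the overnight task could look like once support has raised the account-level limit; the task, warehouse, and table names here are placeholders for illustration, not anything from the original question:

CREATE OR REPLACE TASK historical_load_task
  WAREHOUSE = load_wh_xl                     -- a larger warehouse, if the transformation parallelizes well
  SCHEDULE  = 'USING CRON 0 2 * * * UTC'     -- run overnight at 02:00 UTC
AS
  INSERT INTO target_table
  SELECT *                                   -- plus your transformations
  FROM source_table;

ALTER TASK historical_load_task RESUME;      -- tasks are created suspended and must be resumed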

Related

Suggestion for trigger that sends email if threshold is broken

This is quite a broad question but I'll try to summarise it as best I can.
I have an MVC front end which displays/allows processing of records which are classed as outstanding. I also have a scheduled console app which runs nightly and attempts to resolve each of these records using some logic I wrote.
I have a new requirement, which is to have an email sent every time the total number of outstanding records exceeds a certain amount, this amount needs to be configurable.
The table will contain every record with a flag to say whether it has been resolved or not, so I will need to count the outstanding records and then fire an email to notify if the threshold is broken.
I initially thought about adding a SQL Server trigger on insert however I soon realised that if no more records were added for a few days but the total number stayed above the threshold because nobody resolved them, then no further email would be sent.
I need the email to send every day on a schedule independently of insert/update.
So now I'm thinking possibly a SQL Server job, or an SSIS package or even a service which runs, but I'm aware this threshold number needs to be configurable.
So, what would be the quickest, simplest solution for my requirements? I'm open to any suggestion as long as it ticks all the boxes.
Given that the OP already has a console app running on a schedule, the most logical choice would be to simply add this check to the console app along with the email-sending logic. It will be much easier to send emails that way anyway, especially if you employ something like Postal, which will let you use MVC-style views to create your emails.
An SQL Server scheduled job seems to me to be the simplest way to go.
You can add a table to your database that will hold the threshold number and read its value from there.
In many cases a GeneralParams table is a good thing to have anyway.
The other option you mentioned (a Windows service) is also configurable in many ways: you can use a GeneralParams table, or the App.config file of the service (but you will have to restart it every time you change the App.config), or even a simple text file; anything goes. The downside is that it lives outside of your SQL Server, but the upside is that it is probably easier to send emails from.
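As a rough sketch of the SQL Server Agent route (the table, column, and mail-profile names below are assumptions for illustration, not from the question), a nightly job step could read the configurable threshold from a params table, count the unresolved rows, and send mail only when the threshold is exceeded:

DECLARE @threshold INT, @outstanding INT;

SELECT @threshold = CAST(ParamValue AS INT)
FROM dbo.GeneralParams
WHERE ParamName = 'OutstandingThreshold';

SELECT @outstanding = COUNT(*)
FROM dbo.Records
WHERE IsResolved = 0;

IF @outstanding > @threshold
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = 'DefaultMailProfile',    -- assumes Database Mail is already configured
        @recipients   = 'team@example.com',
        @subject      = 'Outstanding records above threshold',
        @body         = 'The number of outstanding records has exceeded the configured threshold.';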

Background job taking twice as long as the same operation within Rails

In my Rails application, I have a long calculation requiring a lot of database access.
In short, my calculation takes 25 seconds.
When implementing the same calculation within a background job (a single big worker), it takes twice as long (i.e. 50 seconds). I have tried several techniques to move the job into a background process, but none of them changes my performance: DelayedJob, Sidekiq, and running the work in a thread created inside my Rails process all show the same 2x impact.
This performance difference only exists in the Rails 'production' environment. It looks like there is an optimisation done by Rails that is not done in my background job.
My technical environment is the following:
I am using Ruby 2.0 / Rails 4.
I am using Unicorn (but I have the same problem without it).
The job uses Rails.cache to store some partial computations.
I am using PostgreSQL.
Does anybody have a clue where this impact might come from?
I'm assuming you're comparing the background job speed to the speed of running the operation during a web request? If so, you're likely benefiting from Rails's QueryCache, which caches DB queries during a web request. Try disabling it as described here:
Disabling Rails SQL query caching globally
If that causes the web request version of the job to take as long as the background job, you've found your culprit. You can then enable the query cache on your background job to speed it up (if it makes sense for your application).
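If it does turn out to be the query cache, a minimal sketch of re-enabling it inside the worker might look like this (the worker and calculation method names are placeholders):

class CalculationWorker
  include Sidekiq::Worker

  def perform(record_id)
    # Wrap the calculation in ActiveRecord's query cache so that repeated,
    # identical SELECTs inside the block are served from memory, as they
    # would be during a web request.
    ActiveRecord::Base.connection.cache do
      run_long_calculation(record_id)
    end
  end
end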
A background job is not something to be used for speeding things up. Its main purpose is 'fire and forget': removing the 25 seconds of synchronous calculation and doing the work asynchronously instead, so you can tell the user that their request is being processed and return with the result later.
You may get a speed gain from a background job by splitting the big task into several small ones and running them at the same time. In your case I think that is impossible, because of the dependencies between the operations in your calculation.
So if you want to speed up your calculation, you need to look into denormalizing your data structure: store some pre-calculated values for your big calculation at the moment the source data for that calculation is updated. You will then calculate less when the user requests results and more when data is stored, and that is a good place to use a background job. When you finish updating the data, you create a background task to update the caches; if the user requests the calculation before this task has finished, they will still have to wait for the cache to fill up.
Update: I think I still need to answer your main question. Basically, the additional time in background task processing comes from the implementation. Because of the 'fire and forget' approach, nobody wants the background task scheduler to consume a large amount of processor time just monitoring for new jobs. I am not completely sure, but I think that if your calculation were twice as complex, the extra time would still be the same 25 seconds.
My guess is that the extra time is coming from the need for your background worker to load rails and all of your application. My clue is that you said the difference was greatest with Rails in production mode. In production mode, subsequent calls to the app make use of the app and class cache.
How to check this hypothesis:
Change your background job to do the following:
1. Print a log message before you initiate the worker.
2. Start the worker.
3. Run your calculation. As part of your calculation startup, print a log message.
4. Print another log message.
5. Run your calculation again.
6. Print another log message.
Then compare the two times for running your calculation.
Of course, you'll also gain some extra time benefits from database caching, code might remain resident in memory, etc. But if the second run is much much faster, then the fact that the second run didn't restart Rails is more significant.
Also, the time between the log message from steps 1 and 3 will also help you understand the start up times.
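A rough sketch of that experiment, assuming the calculation is wrapped in a method you can call twice (the method name is a placeholder):

Rails.logger.info "worker starting: #{Time.now}"        # step 1, before the worker boots
Rails.logger.info "first run starting: #{Time.now}"     # step 3, after Rails and the app have loaded
run_long_calculation
Rails.logger.info "first run finished: #{Time.now}"     # step 4
run_long_calculation                                     # step 5
Rails.logger.info "second run finished: #{Time.now}"    # step 6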
Fixes
Why wait?
Most important: why do you need the results faster? E.g., tell your user that the result will be emailed to them after it is calculated, or let your user see that the calculation is proceeding in the background and show them the result later.
The key for any long running calculation is to do it in the background and encourage the user to not wait for the result. They should be able to do something else until they get the result.
Start the calculation automatically: As soon as the user logs in, or after they do something interesting, start the calculation. That way, when (and if) the user asks for the calculation, the answer will either already be done or will soon be done.
Cache the result and bust the cache as needed: Similar to the above, start the calculation periodically and automatically. If the user changes some data, restart the calculation by busting the cache. There are also ways to halt an ongoing calculation if data is changed during the calculation.
Pre-calculate part of the calculation: Why is a DBMS calculation taking 25 seconds or more? It could be that you should change the calculation. Investigate adding indexes, summary tables, de-normalizing, splitting the calculation into smaller steps that can be pre-calculated, etc.

opening and closing streaming clients for specific durations

I'd like to infrequently open a Twitter streaming connection with TweetStream and listen for new statuses for about an hour.
How should I go about opening the connection, keeping it open for an hour, and then closing it gracefully?
Normally for background processes I would use Resque or Sidekiq, but from my understanding those are for completing tasks as quickly as possible, not chilling and keeping a connection open.
I thought about using a global variable like $twitter_client but that wouldn't horizontally scale.
I also thought about building a second application that runs on one box to handle this functionality, but that seems excessive if it can be integrated into the main app somehow.
To clarify, I have no trouble starting a process, capturing tweets, and using them appropriately. I'm just not sure what I should be starting. A new app? A daemon of some sort?
I've never encountered a problem like this, and am completely lost. Any direction would be much appreciated!
Although not a direct fix, this is what I would look at:
Time
You're working with time, so I'd look at what time-centric processes could be used to open the connection for an hour.
Specifically, I'd look at running some sort of job on the server, which you could fire at specific times (programmatically if required), to open & close the connection. I only have experience with Resque, but as you say, it's probably not up to the job. If I find any better solutions, I'll certainly update the answer.
Storage
Once you've connected to TweetStream, you'll want to look at how you can capture the tweets for that time period. It seems a waste to create a data table just for the job, so I'd be inclined to use something like Redis to store the tweets that you need
This can then be used to output the tweets you need, allowing you to simulate storing/capturing them, and then delete them after the hour window has passed.
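As a rough sketch only (the keyword, Redis key, and one-hour window are assumptions, and it presumes TweetStream has already been configured with your Twitter credentials), a cron-fired rake task along these lines could listen for roughly an hour and park the tweets in Redis:

require 'tweetstream'
require 'redis'

namespace :tweets do
  desc 'Capture tweets for roughly an hour and stash them in Redis'
  task capture: :environment do
    redis    = Redis.new
    deadline = Time.now + 1.hour
    client   = TweetStream::Client.new

    client.track('your_keyword') do |status|
      redis.rpush('captured_tweets', status.text)
      redis.expire('captured_tweets', 2 * 60 * 60)   # let the data expire after the window
      client.stop if Time.now >= deadline            # close the stream once the hour is up
    end
  end
end

One caveat with this sketch: the stream only stops when the next status arrives after the deadline, so for a quiet keyword an EventMachine timer would close the connection more promptly.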
Delivery
I don't know what context you're using this feature in, so I'll just give you as generic a process idea as possible.
To display the tweets, I'd personally create some sort of record in the DB to show the time you're pinging TweetStream that day (if it changes; if it's constant, just set a constant in an initializer), and then include some logic to try to get the tweets from Redis. If you're able to collect them, show them as you wish; otherwise don't print anything.
Hope that gives you a broader spectrum of ideas?

Is there a maximum number of connections allowed on Oracle9i DB?

This is my very first question and I hope it's well explained so I can find an answer.
I work on the website project for a delivery company that has all its data on an Oracle9i server.
Most of the web users just want to know when they're going to get their package, but I'm sure there are also robots that query that info several times a day to update their systems.
I'm working on code to stop those robots (asking for a captcha after the 3rd query in 15 minutes, for example) because we have some web services they can use to query all the data in bulk.
Now, my problem is that at peak hours (12:00-14:00) the database starts to answer very slowly.
Here is some data I've parsed from the web application. I don't have logs at this level for the web services, but there was also a lot of queries there.
It shows the timestamp when I request a connection from the datasource, the Integer.toHexString(connection.hashCode()), the name of the datasource, the timestamp when I close the connection and the difference between both timestamps.
Most of the time the queries end in less than a second, but yesterday I had this strange delay of more than 2 minutes.
Is there some kind of maximum number of connections allowed on the database, so that when it surpasses that limit the database queues my query for some time before trying again?
Thanks in advance.
Is there some kind of maximum number of connections allowed on the database
Yes.
SESSIONS is one of the basic initialization parameters and specifies the maximum number of sessions that can be created in the system. Because every login requires a session, this parameter effectively determines the maximum number of concurrent users in the system.
The default value is derived from the PROCESSES parameter (1.5 times PROCESSES plus 22); therefore, if you didn't change the PROCESSES parameter (default 100), the maximum number of sessions for your database will be 172.
You can determine the value by querying V$PARAMETER:
SQL> select value
2 from v$parameter
3 where name = 'sessions';
VALUE
--------------------------------
480
so that when it surpasses that limit the database queues my query for some time before trying again?
No.
When you attempt to exceed the value of the SESSIONS parameter the exception ORA-00018: maximum number of sessions exceeded will be raised.
Something may well be queuing your query but it will be within your own code and not specified by Oracle.
It sounds as though you should find out more information. If you're not at the maximum number of sessions then you need to capture the query that's taking a long time and profile it; this would, I think, be the more likely scenario. If you are at the maximum number of sessions then you need to look at your (company's) code to determine what's happening.
You haven't really explained anything about your application but it sounds as though you're opening a session (or more) per user. You might want to reconsider whether this is the correct approach.
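If you have access to the V$ views, a quick sanity check is to compare the live session count against the configured limit; this is only a diagnostic query, nothing more:

select (select count(*) from v$session)                        as current_sessions,
       (select value from v$parameter where name = 'sessions') as max_sessions
from dual;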
I've also found the real problem.
I had synchronized the method that asks the datasource for a connection, and it was causing locks when requesting connections at peak hours. I've removed the synchronization and everything is working fine.

Dealing with API rate limits?

I've an app that's set up to make scheduled calls to a number of APIs once a day. This works very nicely but i'm aware that some of the APIs i'm calling (Twitter for example) have a rate limit. As the number of calls i'm making is set to continually grow, can anyone recommend a way to throttle my calls so I can send in bursts of x per hour/minute etc?
I've found the Glutton Ratelimit gem, is anyone using this and is it any good? Are there others I should be looking at?
If you're using some kind of background worker to perform your API calls, you could reschedule the task to be performed again in the next time slot, when the rate limits have been reset.
class TwitterWorker
  include Sidekiq::Worker

  def perform(status_id)
    status = Twitter.status(status_id)
    # ...
  rescue Twitter::Error::TooManyRequests
    # Reschedule the query to be performed in the next time slot
    TwitterWorker.perform_in(15.minutes, status_id)
  end
end
Not a scientific solution though; there's, for example, the risk that a query gets rescheduled again and again if you try to perform far more API calls in a day than the rate limit allows for. But until then, something easy might do the trick!
Another solution is to buy proxies, which allow you to send requests from different IP addresses.
Use the standard HTTP lib: http://ruby-doc.org/stdlib-2.0/libdoc/net/http/rdoc/Net/HTTP.html#method-c-Proxy
I am not sure that you will not be blocked, but maybe it is worth a try. A randomly chosen IP should increase your limits.
Unless you're making concurrent requests there's not much to it:
Figure out how much delay you need per request.
Check the time before the request, subtract it from the time after the request, and sleep for the rest (a sketch follows below).
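A minimal sketch of that sequential approach; the 180-calls-per-15-minutes budget and the block's API call are placeholders, not real limits:

# e.g. 180 calls allowed per 15-minute window => one call every 5 seconds
DELAY_PER_REQUEST = (15 * 60) / 180.0

def throttled_each(items)
  items.each do |item|
    started = Time.now
    yield item                                   # perform the API call here
    elapsed = Time.now - started
    sleep(DELAY_PER_REQUEST - elapsed) if elapsed < DELAY_PER_REQUEST
  end
end

# Usage: throttled_each(status_ids) { |id| Twitter.status(id) }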
With concurrent requests you can be more accurate, I once blogged about that here
I know this is an old question, but wanted to mention something in case it helps others with the same question.
If the work can be queued to jobs using Resque, then you could use the gem I've just released, which pauses a queue when you hit a rate limit and unpauses it some time later.
https://github.com/pavoni/resque-rate_limited_queue
