Compute Engine based on Docker image crashes after 30 minutes - docker

I have a custom Docker image based on 7.4-Apache that is being used on an f1 Compute Engine instance. I deployed it successfully and my website is reachable, but after around 30 minutes or less the health check times out and the container crashes.
I tried to see whether there are any logs that would show if this is an application issue or something else.
Are there any logs I can check to see what's going on, and if not, how can I add logging?

Since I don't know your exact configuration, I can only point you to some documentation at this point.
First, have a look at the Cloud Audit Logs documentation - it describes how to view logs and find what you need. You will find more here about viewing audit logs.
Try looking for related entries in the Logs Viewer.
Also have a look at how to construct a query to extract the information you're looking for.
If you provide more details about your configuration, I might be able to give a more precise answer.
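As a starting point, and assuming the gcloud CLI is set up (INSTANCE_ID below is a placeholder for your instance's numeric ID), a query like the following pulls the most recent entries for the instance:
gcloud logging read 'resource.type="gce_instance" AND resource.labels.instance_id="INSTANCE_ID"' --limit=50 --freshness=1h
If nothing shows up there, check that logging is enabled on the instance; for Container-Optimized OS instances, container output is only forwarded to Cloud Logging when the google-logging-enabled metadata key is set.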


Custom REST end point is not called at all from doccano auto labelling

I have filed an issue in the official doccano repo (here). However, I am also posting here in the hope of getting some idea of what I am doing wrong.
I have two EC2 instances, both running Ubuntu 20.
On one of them I have set up doccano and uploaded some data.
I annotated a bit of that data and then trained a custom model using Hugging Face.
On the second EC2 instance I have uploaded the trained model and created a FastAPI-based API to serve the results.
I want to set up auto labeling (it is a Sequence Labeling project).
I followed the steps in the official documentation and also took help from here.
Everything goes right; at the second step, when I test the API connectivity, doccano successfully connects and fetches the data.
Once all is done, I go to one of the documents and try to do the auto labeling. And surprise:
NOTHING HAPPENS.
There is no log entry in the model server, which shows that no request ever reached it!
Both doccano and the model server are running via Docker inside the EC2 instances.
What am I doing wrong?
Please help.
Warm regards
OK, I found the reason, thanks to a reply in the GitHub issue. I am going to quote the reply as it is:
The website is not behaving intuitively from the UX perspective. When you turn on Auto Labeling, just try going to the next example (arrow on the top right of the page) and your Auto Labeling API should be called. Then go back and it will be called on the first example. Also, it's called only on examples that are not marked as "labeled".
So if anyone is having difficulty with the same problem, hopefully this will help.

cloud run logs show scanned <tens of gigs>

I have a simple service running that doesn't log at all. The logs view currently shows 31.1 GB of logs, and the number is growing fast. What's going on?
This number represents the size of all logs for all services across your project. The Cloud Run logging page scans all of the project's logs and filters for entries from the Cloud Run resource.
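If you only want to look at one service's logs, you can scope the query yourself in the Logs Explorer, for example (SERVICE_NAME is a placeholder for your service's name):
resource.type="cloud_run_revision"
resource.labels.service_name="SERVICE_NAME"
This limits what is displayed to entries from that one service.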

Tasks will not run in Spring Cloud Data Flow (Docker/K8S)

Last week I installed the Docker/Kubernetes based version of Spring Cloud Data Flow.
Although there were no overt errors, things are not working correctly.
I am able to create streams and tasks in the web UI and Spring Cloud Data Flow Shell but nothing runs.
I am most interested in Tasks.
When I create them, they all show with a Task Status of UNKNOWN.
Unfortunately, no matter how many times I launch them, the status always remains UNKNOWN.
I'm able to delete them but what magic must I use to make them run?
There's nothing apparent from the description as to what has failed. Perhaps if you can update it with more details, that would be useful.
From a troubleshooting standpoint, when deploying streams or when the launch of tasks fails for any reason, the failures will be logged in the SCDF-server/Skipper-server logs. You'd have to tail the logs of the respective pod to learn more about the failures.
Also, it'd be useful to check the output of kubectl describe pod/<POD_NAME> to see what's causing the stream/task pods not to start successfully. The reasons are usually listed towards the end of this command's output.
The usual suspects are pod health-check failures and/or stream/task application Docker images that aren't resolvable at runtime. You'll see the reasons in the logs, of course.
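For example (the pod names below are placeholders - use the first command to find the actual names in your installation):
kubectl get pods
kubectl logs -f <SCDF_SERVER_POD_NAME>
kubectl describe pod/<TASK_POD_NAME>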
This was a misconfiguration on my end.
I'm able to run as expected now.

Error creating the GCE VMs or starting Dataflow

I'm getting the following error in the recent jobs I'm trying to submit:
2015-01-07T15:51:56.404Z: (893c24e7fd2fd6de): Workflow failed.
Causes: (893c24e7fd2fd601):
There was a problem creating the GCE VMs or starting Dataflow on the VMs so no data was processed. Possible causes:
1. A failure in user code on in the worker.
2. A failure in the Dataflow code.
Next Steps:
1. Check the GCE serial console for possible errors in the logs.
2. Look for similar issues on http://stackoverflow.com/questions/tagged/google-cloud-dataflow.
There are no other errors.
What does this error mean?
Sorry for the trouble.
Dataflow starts up VM instances and then launches an agent on those VMs. Those agents then do the heavy lifting of executing your code (e.g. ParDos, reading and writing your data).
The error indicates the job failed because no agents were requesting work. As a result, the service marked the job as a failure because it wasn't making any progress and never would, since there weren't any agents to process your data.
So we need to figure out where in the agent startup process things failed.
The first thing to check is whether the VMs actually started. When you run your job, do you see any VMs created in your project? It might take a minute or two for the VMs to start up, but they should appear shortly after the runner prints the message "Starting worker pool setup". The VMs should be named something like
<PREFIX-OF-JOB-NAME>-<TIMESTAMP>-<random hexadecimal number>-<instance number>
Only a prefix of the job name is used to ensure we don't exceed GCE name limits.
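If you have the gcloud CLI available, one quick way to check is to list instances matching that prefix (JOB_NAME_PREFIX is a placeholder):
gcloud compute instances list --filter="name ~ ^JOB_NAME_PREFIX"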
If the VMs started up, the next thing to do is to inspect the worker logs for errors indicating problems in launching the agent.
The easiest way to access the logs is through the UI. Go to the Google Cloud Console and select the Dataflow option in the left-hand panel. You should see a list of your jobs. Click on the job in question; this should show you a graph of the job. On the right side you should see a "view logs" button. Click that, and you should see a UI for navigating the logs where you can look for errors.
The second option is to look for the logs on GCS. The location to look for is:
gs://PATH TO YOUR STAGING DIRECTORY/logs/JOB-ID/VM-ID/LOG-FILE
You might see multiple log files. The one we are most interested in is the one that starts with "start_java_worker". If that log file doesn't exist, then the worker didn't make enough progress to actually upload the file, or else there might have been a permission problem uploading it.
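For example, with the gsutil tool (the angle-bracketed parts are placeholders for your staging path, job ID and VM ID):
gsutil ls gs://<STAGING-DIRECTORY>/logs/<JOB-ID>/
gsutil cat gs://<STAGING-DIRECTORY>/logs/<JOB-ID>/<VM-ID>/start_java_worker*.log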
In that case the best thing to do is to try to ssh into one of the VMs before it gets torn down. You should have about 15 minutes before the job fails and the VMs are deleted.
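For example (instance name and zone are placeholders - use the name of one of the worker VMs you found earlier):
gcloud compute ssh <WORKER-VM-NAME> --zone=<ZONE>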
Once you login to the VM you can find all the logs in
/var/log/dataflow/...
The log we care most about at this point is:
/var/log/dataflow/taskrunner/harness/start_java_worker-SOME ID.log
If there is a problem starting the code that runs on the VM that log should tell us. That log and the other logs should also tell us if there is a permission problem that prevents the code running on the worker from being able to access Dataflow.
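Once logged in, something like this shows the end of that log (the exact file name includes an ID, so a wildcard is used here):
tail -n 100 /var/log/dataflow/taskrunner/harness/start_java_worker-*.log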
Please take a look and let us know if you find anything.
Apart from Jeremy Lewi's great answer, I would like to add that I've seen this error appear when you don't enable the proper Google APIs in the Developers Console, as mentioned here, which leads to a permission issue, like Jeremy said.

Agresso payment creation via acrbatchinput

We're attempting to generate payments in an Agresso 5.5 system. The mechanism we've been told to use is to write new payment data into table acrbatchinput where it will be picked up and processed by a regular job running in agrbibat.dll. We have code that worked on a previous version of Agresso but following the upgrade our payments get rejected by the agrbibat job. Sometimes it generates useful messages in the log, sometimes it doesn't, and working through failures without good information is becoming a bit of a slog.
Is there some documentation we're missing? In particular, it would be useful to have a full list of the validation rules the job is using so we can implement these ourselves rather than trying to infer them from the log. I can't find any - there's not a lot for acrbatchinput on Google. Does this list or some other documentation exist? Is agrbibat something easily decompilable, e.g. .NET?
Thanks. The test system we have is running against Oracle on Solaris with the Agresso jobs hosted on Windows. We have limited access to the Oracle and Agresso systems because (I think!) the same Oracle server is hosting the live payment system, but I could probably talk finance into giving us agrbibat.dll if that might help. We're unlikely to get enough access to their servers to debug it in place.
It turns out that our problem is partly because the new test system we've been given access to wasn't set up correctly, so we might be able to progress this without extra information - we're waiting on the financial team here for input.
However, we're still interested in acrbatchinput or agrbibat documentation or information. You've missed the bounty I set, but ticks, votes and gratitude are still available.
I know this is an ancient old question, but here's my response anyway for anyone else that finds it.
The only documentation is the usual Agresso help files from within the desktop client. Meaningful information is only gleaned through trial and error, however!
The required fields differ depending on whether a given record is a GL, AP/AR or tax transaction. (That much is, at least, explained in the help.)
In addition to using the log file, it's often helpful to look at GL07's report output for errors.
