Visual Studio Test Run: Run Coded UI automated tests from test plans in the Test hub - tfs

Build Definition
I am using the same machine for both the build and release configurations. The build was created successfully, and with it I can run Coded UI scripts in the Visual Studio Test task; they work fine. My build definition configuration is shown below.
Release Destination
After the build definition succeeded and the test scripts ran, my next step is to run automated tests from test plans in the Test hub. I have also associated my test scripts with test cases. Please have a look at the image of my release definition, where I have chosen to select tests using Test run.
The notification I receive after a failed execution of an automated test from a test plan in the Test hub is:
Deployment of release Release-11 Rejected in Deploy Test Scripts.
Log
2018-02-21T14:24:20.8978238Z AgentName: EVSRV017-DEVSRV017-4
2018-02-21T14:24:20.8978238Z AgentId: 29
2018-02-21T14:24:20.9038250Z ServiceUrl: https://mytfsserver/tfs/DefaultCollection/
2018-02-21T14:24:20.9038250Z TestPlatformVersion:
2018-02-21T14:24:20.9038250Z EnvironmentUri: dta://env/Calculator/_apis/release/16/20/1
2018-02-21T14:24:20.9038250Z QueryForTaskIntervalInMilliseconds: 3000
2018-02-21T14:24:20.9038250Z MaxQueryForTaskIntervalInMilliseconds: 10000
2018-02-21T14:24:20.9048252Z QueueNotFoundDelayTimeInMilliseconds: 3000
2018-02-21T14:24:20.9058254Z MaxQueueNotFoundDelayTimeInMilliseconds: 50000
2018-02-21T14:24:20.9058254Z RetryCountWhileConnectingToTfs: 3
2018-02-21T14:24:20.9058254Z ===========================================
2018-02-21T14:24:21.3909224Z Initializing the Test Execution Engine
Warning
2018-02-21T14:25:02.1240674Z ##[warning]Failure attempting to call the restapis. Exception: System.AggregateException: One or more errors occurred. ---> System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.Net.WebException: The underlying connection was closed: An unexpected error occurred on a send. ---> System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
2018-02-21T14:25:02.1240674Z at System.Net.Sockets.Socket.EndReceive(IAsyncResult asyncResult)
2018-02-21T14:25:02.1240674Z at System.Net.Sockets.NetworkStream.EndRead(IAsyncResult asyncResult)
2018-02-21T14:25:02.1240674Z --- End of inner exception stack trace ---
ERROR:
2018-02-22T10:10:42.0007605Z ##[error]System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.Net.WebException: The underlying connection was closed: An unexpected error occurred on a send. ---> System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
I added the debug variable, and the failure appears to occur while the test settings file is being created, with something like a System.IO exception.
Enabled DEBUG LOG
2018-02-22T20:17:53.8151287Z Initializing the Test Execution Engine
2018-02-22T20:17:53.8161287Z ##[debug]Creating test settings. test settings name : 44de4d5b-f134-4ba2-b0de-ebd8d30b4d22
2018-02-22T20:18:35.3911287Z ##[warning]Failure attempting to call the restapis. Exception: System.AggregateException: One or more errors occurred. ---> System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.Net.WebException: The underlying connection was closed: An unexpected error occurred on a send. ---> System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
2018-02-22T20:18:35.3931287Z ##[debug]Processed: ##vso[task.logissue type=warning;]Failure attempting to call the restapis. Exception: System.AggregateException: One or more errors occurred. ---> System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.Net.WebException: The underlying connection was closed: An unexpected error occurred on a send. ---> System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
TFS Agent Log
[INFO VstsAgentWebProxy] No proxy setting found.
[INFO ConfigurationStore] IsServiceConfigured: False
[INFO ConfigurationManager] Is service configured: False
Worker Log
[2018-02-23 19:32:17Z INFO VstsAgentWebProxy] No proxy setting found.
[2018-02-23 19:32:52Z INFO JobServerQueue] Try to append 1 batches web console lines, success rate: 1/1.
[2018-02-23 19:32:52Z INFO JobServerQueue] Try to append 1 batches web console lines, success rate: 1/1.
[2018-02-23 19:32:53Z INFO JobServerQueue] Try to append 1 batches web console lines, success rate: 1/1.
[2018-02-23 19:33:34Z INFO JobServerQueue] Catch exception during update timeline records, try to update these timeline records next time.
[2018-02-23 19:33:34Z INFO ProcessInvoker] Finished process with exit code 0, and elapsed time 00:00:49.0055812.
[2018-02-23 19:33:34Z INFO StepsRunner] Step result: Failed
[2018-02-23 19:33:34Z INFO StepsRunner] Update job result with current step result 'Failed'.
[2018-02-23 19:33:34Z INFO StepsRunner] Current state: job state = 'Failed'
[2018-02-23 19:33:34Z INFO JobRunner] Job result after all job steps finish: Failed
[2018-02-23 19:33:34Z INFO JobRunner] Run all post-job steps.
[2018-02-23 19:33:34Z INFO JobRunner] Job result after all post-job steps finish: Failed
[2018-02-23 19:33:34Z INFO JobRunner] Completing the job execution context.
[2018-02-23 19:33:34Z INFO JobServerQueue] Try to append 2 batches web console lines, success rate: 2/2.
[2018-02-23 19:33:34Z INFO JobRunner] Shutting down the job server queue.
[2018-02-23 19:33:34Z ERR JobServerQueue] Microsoft.VisualStudio.Services.Common.VssServiceException: String or binary data would be truncated.
at Microsoft.VisualStudio.Services.WebApi.VssHttpClientBase.HandleResponse(HttpResponseMessage response)
at Microsoft.VisualStudio.Services.WebApi.VssHttpClientBase.<SendAsync>d__48.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.ConfiguredTaskAwaitable`1.ConfiguredTaskAwaiter.GetResult()
at Microsoft.VisualStudio.Services.WebApi.VssHttpClientBase.<SendAsync>d__45`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.ConfiguredTaskAwaitable`1.ConfiguredTaskAwaiter.GetResult()
at Microsoft.VisualStudio.Services.WebApi.VssHttpClientBase.<SendAsync>d__27`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.ConfiguredTaskAwaitable`1.ConfiguredTaskAwaiter.GetResult()
at Microsoft.VisualStudio.Services.WebApi.VssHttpClientBase.<SendAsync>d__26`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.VisualStudio.Services.Agent.JobServerQueue.<ProcessTimelinesUpdateQueueAsync>d__32.MoveNext()
[2018-02-23 19:33:34Z INFO JobServerQueue] Fire signal to shutdown all queues.
[2018-02-23 19:33:35Z INFO JobServerQueue] All queue process task stopped.
[2018-02-23 19:33:35Z INFO JobServerQueue] Try to append 1 batches web console lines, success rate: 1/1.
[2018-02-23 19:33:35Z INFO JobServerQueue] Web console line queue drained.
[2018-02-23 19:33:35Z INFO JobServerQueue] Try to upload 2 log files or attachments, success rate: 2/2.
[2018-02-23 19:33:35Z INFO JobServerQueue] File upload queue drained.
[2018-02-23 19:33:35Z INFO JobServerQueue] Timeline update queue drained.
[2018-02-23 19:33:35Z INFO JobServerQueue] All queue process tasks have been stopped, and all queues are drained.
[2018-02-23 19:33:35Z INFO JobRunner] Raising job completed event.
[2018-02-23 19:33:35Z INFO Worker] Job completed.
I would be grateful if anyone could identify what I am missing or what I need to fix so that I can execute automated tests from test plans in the Test hub.
Regards

Based on the error message "Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host." and your clarification, this is likely related to a known issue on Windows Server 2008 R2. Please refer to the article below for details:
Team Foundation Server: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.
However the bug has been fixed by the Windows team and they have released a QFE for it. You can find the QFE here. You will need to install it on all of your ATs.
So, install the hotfix, restart the computer after applying it, and then try again.
You can also try the initial workaround listed in the blog:
Open the IIS Manager.
In the Connections pane, make sure the name of your AT is selected.
In the middle pane (titled “ Home”), make sure you are in the “Features View” (bottom) and scroll down to the Management section.
Double-click the “Configuration Editor” icon.
The middle pane should now have the title “Configuration Editor”.
In the Section pull down near the top, expand the system.applicationHost and select “webLimits”.
You should now see a bunch of property value pairs, one of which is named “minBytesPerSecond”. Its value is most likely 240. You will want to lower this value for the workaround.
Another possibility is that the proxy server is causing it; try bypassing the proxy server and then check again.

Related

Integration test with testcontainers (Redis) Failing to run

I'm building integration test infrastructure for my microservice (Java Spring).
I'm having an issue with Testcontainers as I try to create a base class for tests that starts Redis in a container, to serve as Redis for the service under test.
The abstract test looks like this:
@SpringBootTest(classes = Application.class, webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
@ActiveProfiles(Constants.SPRING_PROFILE_DEVELOPMENT)
public class AbstractRedisContainerTest {
    // JUnit 4 rule: starts the Redis container before each test and stops it afterwards
    @Rule
    public GenericContainer redis = new GenericContainer("redis:3.0.6")
            .withExposedPorts(6379);
}
And my log shows the following failure message:
2020-03-04 12:28:55.545 ERROR [myService,,,] 27709 --- [main] o.t.d.DockerClientProviderStrategy:
Could not find a valid Docker environment. Please check configuration. Attempted configurations were:
2020-03-04 12:28:55.546 ERROR [myService,,,] 27709 --- [main] o.t.d.DockerClientProviderStrategy:
EnvironmentAndSystemPropertyClientProviderStrategy:
failed with exception InvalidConfigurationException (ping failed)
2020-03-04 12:28:55.546 ERROR [myService,,,] 27709 --- [main] o.t.d.DockerClientProviderStrategy:EnvironmentAndSystemPropertyClientProviderStrategy:
failed with exception InvalidConfigurationException (ping failed)
2020-03-04 12:28:55.546 ERROR [myService,,,] 27709 --- [main] o.t.d.DockerClientProviderStrategy:UnixSocketClientProviderStrategy:
failed with exception InvalidConfigurationException (ping failed). Root cause LastErrorException ([13])
2020-03-04 12:28:55.546 ERROR [myService,,,] 27709 --- [main] o.t.d.DockerClientProviderStrategy: ProxiedUnixSocketClientProviderStrategy:
failed with exception InvalidConfigurationException (ping failed). Root cause TimeoutException (null)
2020-03-04 12:28:55.546 ERROR [myService,,,] 27709 --- [main] o.t.d.DockerClientProviderStrategy: As no valid configuration was found, execution cannot continue
org.testcontainers.containers.ContainerLaunchException: Container startup failed
where this is the error for the failing line in the code:
Caused by: org.testcontainers.containers.ContainerFetchException: Can't get Docker image:
RemoteDockerImage(imageNameFuture=java.util.concurrent.CompletableFuture@3186b07d[Completed normally],
imagePullPolicy=DefaultPullPolicy(), dockerClient=LazyDockerClient.INSTANCE)
Any idea how to configure my environment, or what is missing here?
It is important to note that I have succeeded in running Docker images locally (including the one mentioned here), and I have a functioning Docker environment from the CLI. I run this test from IDEA.
Not sure if it is related, but I run it as a regular user (not as root).

Why is "java.nio.channels.ClosedByInterruptException" thrown when calling multiple groupBy with pyspark?

I am running a pyspark job (Python 3.5, Spark 2.1, Java 8) in yarn-client mode from an edge node with spark2-submit. The job succeeds, and the result dataframe is written to HDFS and seems correct (we have not yet found any error in the data in that dataframe).
The issue is that I see a lot (6'000) of ERROR messages, and I would like to understand what is wrong and whether this impacts the final dataframe.
All ERROR messages look like this one:
18/06/01 14:08:36 INFO codegen.CodeGenerator: Code generated in 45.712788 ms
18/06/01 14:08:37 INFO executor.Executor: Finished task 33.0 in stage 34.0 (TID 2312). 4600 bytes result sent to driver
18/06/01 14:08:37 INFO executor.Executor: Finished task 117.0 in stage 34.0 (TID 2316). 3801 bytes result sent to driver
18/06/01 14:08:40 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 2512
18/06/01 14:08:40 INFO executor.Executor: Running task 190.1 in stage 34.0 (TID 2512)
18/06/01 14:08:40 INFO storage.ShuffleBlockFetcherIterator: Getting 28 non-empty blocks out of 193 blocks
18/06/01 14:08:40 INFO storage.ShuffleBlockFetcherIterator: Started 5 remote fetches in 1 ms
18/06/01 14:08:40 INFO executor.Executor: Executor is trying to kill task 190.1 in stage 34.0 (TID 2512)
18/06/01 14:08:40 ERROR storage.DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /...../yarn/nm/usercache/../appcache/application_xxxx/blockmgr-xxxx/temp_shuffle_xxxxx
java.nio.channels.ClosedByInterruptException
at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.FileChannelImpl.truncate(FileChannelImpl.java:372)
at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:212)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:238)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The ERRORs start after quite a bit of feature engineering (select, groupby, ...), and I see them when adding these lines:
df = (df.groupby('x','y')
.agg(func.sum('x').alias('x_sum'))
.groupby('y')
.agg(func.mean('y').alias('py_sum_avg')))
So I guess the data shuffle is triggered by groupBy.
I first thought it was a memory issue, so I added much more memory and memory overhead for the driver and executors, without real success (this is what some other threads suggest). The code contains other groupBy calls as well, and the issue seems to arise at this stage.
I also read that it could be related to too many open files or a full disk, but the ERROR messages are a bit different in those two cases.
I am quite new to pyspark, so I am looking for advice on debugging such an issue.
How can I find out why java.nio.channels.ClosedByInterruptException is thrown? I guess this is what triggers ERROR storage.DiskBlockObjectWriter. Is this correct? Is it triggered by "Executor: Executor is trying to kill task 190"? If killing some tasks is a standard process, why does it trigger ERRORs? Can I get some hints from the Spark UI (I see that some tasks were killed)? Can I get more info from the traceback?
How can I fix these issues? Any suggestions on how to debug such things? I am not sure where to look (memory, an issue in the pyspark code, an issue with the cluster setup or with my Spark parameters).
I am working on a Hadoop Data Lake with Cloudera CDH 5.8.
There is an issue with using spark.speculation in Spark 2.1, which I am using.
The related upstream bug is SPARK-19293. The exception stack trace in my situation is slightly different from the one in SPARK-19293. After adding
--conf spark.speculation=false
the ERRORs are gone in my test.
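To check whether the ERROR lines in a given executor log are tied to tasks being killed (speculative execution kills the slower duplicate attempt once one attempt finishes), one option is to scan the log for kill requests that immediately precede the ERROR. The following is a minimal sketch, assuming log lines shaped like the excerpt in the question; `killed_task_errors` and the `window` parameter are hypothetical names for illustration, not part of any Spark API.

```python
import re

# Matches lines such as:
# "... INFO executor.Executor: Executor is trying to kill task 190.1 in stage 34.0 (TID 2512)"
KILL_RE = re.compile(r"Executor is trying to kill task \S+ in stage \S+ \(TID \d+\)")
# Matches the ERROR line emitted by DiskBlockObjectWriter in the excerpt.
ERROR_RE = re.compile(r"ERROR storage\.DiskBlockObjectWriter")


def killed_task_errors(lines, window=5):
    """Count DiskBlockObjectWriter ERROR lines that follow a kill request.

    Returns (errors_after_kill, errors_total). An ERROR counts as
    "after a kill" if a kill request appeared within the previous
    `window` lines of the same log.
    """
    errors_after_kill = 0
    errors_total = 0
    last_kill_index = None
    for index, line in enumerate(lines):
        if KILL_RE.search(line):
            last_kill_index = index
        elif ERROR_RE.search(line):
            errors_total += 1
            if last_kill_index is not None and index - last_kill_index <= window:
                errors_after_kill += 1
    return errors_after_kill, errors_total
```

If the two counts are equal for every executor log, that is consistent with the ERRORs being a side effect of killed task attempts rather than a sign of corrupted output, though it does not prove it.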

Build agent is unable to download available workspaces

We recently upgraded from TFS 2015 to TFS 2018 and switched the build agents in our infrastructure to the new agent version 2.122.1.
When developers commit their changesets or a scheduled build is executed, sometimes the build is processed as expected, but sometimes we receive a strange error while retrieving sources from the TFS repository.
In the build log it is logged as:
2018-01-03T15:01:25.6074314Z Querying workspace information.
2018-01-03T15:01:26.5136788Z ##[error]There is an error in XML document (1, 1).
If I open the agent's detailed log, I get the following information:
[2018-01-03 15:01:25Z INFO ProcessInvoker] Starting process:
[2018-01-03 15:01:25Z INFO ProcessInvoker] File name: 'tf'
[2018-01-03 15:01:25Z INFO ProcessInvoker] Arguments: 'vc workspaces
/format:xml /collection:http://servername:8080/tfs/ProjectCollection/
/loginType:OAuth /login:.,******** /noprompt'
[2018-01-03 15:01:25Z INFO ProcessInvoker] Working directory:
'C:\Agent2017\_work\10\s'
[2018-01-03 15:01:25Z INFO ProcessInvoker] Require exit code zero:
'True'
[2018-01-03 15:01:25Z INFO ProcessInvoker] Encoding web name:
windows-1252 ; code page: '1252'
[2018-01-03 15:01:25Z INFO ProcessInvoker] Force kill process on
cancellation: 'False'
[2018-01-03 15:01:25Z INFO ProcessInvoker] Process started with
process id 3524, waiting for process exit.
[2018-01-03 15:01:25Z INFO JobServerQueue] Try to append 1 batches web
console lines, success rate: 1/1.
[2018-01-03 15:01:25Z INFO JobServerQueue] Try to upload 1 log files
or attachments, success rate: 1/1.
[2018-01-03 15:01:26Z INFO ProcessInvoker] Finished process with exit
code 0, and elapsed time 00:00:00.5240505.
[2018-01-03 15:01:26Z ERR StepsRunner] Caught exception from step:
System.InvalidOperationException: There is an error in XML document
(1, 1). ---> System.Xml.XmlException: Data at the root level is
invalid. Line 1, position 1. at
System.Xml.XmlTextReaderImpl.Throw(Exception e) at
System.Xml.XmlTextReaderImpl.ParseRootLevelWhitespace() at
System.Xml.XmlTextReaderImpl.ParseDocumentContent() at
System.Xml.XmlTextReaderImpl.Read() at
System.Xml.XmlReader.MoveToContent() at
Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializationReaderTFWorkspaces.Read5_Workspaces()
--- End of inner exception stack trace --- at System.Xml.Serialization.XmlSerializer.Deserialize(XmlReader
xmlReader, String encodingStyle, Object events) at
System.Xml.Serialization.XmlSerializer.Deserialize(TextReader
textReader) at
Microsoft.VisualStudio.Services.Agent.Worker.Build.TFCommandManager.d__31.MoveNext()
--- End of stack trace from previous location where exception was thrown --- at
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at
System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task
task) at
Microsoft.VisualStudio.Services.Agent.Worker.Build.TfsVCSourceProvider.d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown --- at
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at
System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task
task) at
Microsoft.VisualStudio.Services.Agent.Worker.Build.BuildJobExtension.d__17.MoveNext()
--- End of stack trace from previous location where exception was thrown --- at
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at
System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task
task) at
Microsoft.VisualStudio.Services.Agent.Worker.JobExtensionRunner.d__20.MoveNext()
--- End of stack trace from previous location where exception was thrown --- at
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at
System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task
task) at
Microsoft.VisualStudio.Services.Agent.Worker.StepsRunner.d__1.MoveNext()
[2018-01-03 15:01:26Z INFO StepsRunner] Step result: Failed
It seems that the agent tries to get the list of available workspaces but fails. If I execute the command with appropriate permissions on the computer where the agent is running, I get a proper list of workspaces.
Try queuing the build with system.debug = true to get more information in the log, then check the log for the detailed error message.
Remove all workspaces created by the build process (with names like ws_1_2) in VS/Team Explorer, set the clean option in the Get Sources task to true, and then run the build again.
I finally found an answer to my problem here.
The problem was caused by cached agent information. When I manually executed the vc workspaces /format:xml /collection:http://servername:8080/tfs/ProjectCollection/ command on the build machine, sometimes I got the workspace list, but sometimes I got a warning followed by the workspace list (the result is not the same on every call; it depends on cache usage). Due to a warning like
Local path C:\Agent2017\_work\10\s is mapped both in workspace ws_11_09;ProjectCollection Build Service on server http://rwstfs:8080/tfs/ProjectCollection/ and workspace ws_09_10;ProjectCollection Build Service on server http://rwstfs:8080/tfs/ProjectCollection/. Removing workspace ws_09_10;ProjectCollection Build Service on server http://rwstfs:8080/tfs/ProjectCollection/ from the cache. Please remove conflicting mappings.
the agent is not able to parse the XML document, because it is preceded by non-XML text.
So I went to the
%LocalAppData%\Microsoft\Team Foundation\7.0\Cache
directory, deleted the cache content, and the problem was gone.
EDIT:
An alternative is to delete the cached workspaces with the command
tf workspaces /remove:*
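The failure mode described above (an XML parser rejecting output at line 1, position 1 because warning text comes before the document) can be reproduced outside the agent. The following is a minimal sketch using Python's standard library; the element and attribute names in the sample are illustrative assumptions, not the real tf output schema, and `parse_tf_output` is a hypothetical helper, not something the agent does.

```python
import xml.etree.ElementTree as ET


def parse_tf_output(raw):
    """Parse XML from command output that may start with warning text.

    Everything before the first '<' is dropped before parsing. This is a
    workaround sketch only; it assumes the warning itself contains no '<'.
    """
    start = raw.find("<")
    if start == -1:
        raise ValueError("no XML found in output")
    return ET.fromstring(raw[start:])


# Hypothetical output: one warning line followed by the XML document.
clean = '<Workspaces><Workspace name="ws_11_09"/></Workspaces>'
noisy = "Local path is mapped both in workspace A and workspace B.\n" + clean

try:
    ET.fromstring(noisy)  # strict parse of the raw output
except ET.ParseError as error:
    print("strict parse failed:", error)

root = parse_tf_output(noisy)
print([workspace.get("name") for workspace in root])
```

Clearing the cache, as described above, removes the cause of the warning; stripping the prefix only illustrates why the parse fails.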

sos berlin scheduler -- job chain - how to trigger other job after job timeout

I'm using the SOS Berlin scheduler (version linux-x64 1.10.5).
Normally, when a job in a job_chain times out, the scheduler kills the job process and sends an email.
Based on this, I want to trigger another job.
However, I have tried two ways, and neither works.
Way 1:
Add a function “spooler_task_after()” to the job.
I guess this fails because the job creates a process on the Linux system; when the job times out, the scheduler kills that process and, with it, the function “spooler_task_after()”.
Code:
<job timeout="00:00:09">
<script language="shell"><![CDATA[
echo aa
sleep 10s
echo bb
]]></script>
<monitor name="exit_code" ordering="0">
<script language="java:javascript"><![CDATA[
function spooler_task_after(){
var exitCode = spooler_task.exit_code;
spooler_log.info ("Exit Code is: " + exitCode);
/*
call other job
*/
result = true;
return result;
}
]]></script>
</monitor>
<run_time/>
</job>
Result:
2017-07-27 21:22:21.251+0800 [info]
2017-07-27 21:22:21.251+0800 [info] Task sample_errorhandling/job1:23026 - Protocol starts in /httx/opt/sos-scheduler/ldw-scheduler-test1/logs/task.sample_errorhandling,job1.log
2017-07-27 21:22:21.250+0800 [info] SCHEDULER-842 Task is going to process Order sample_errorhandling/job_chain3:12, state=aaa, on JobScheduler 'http://xxxx:4444', Order's Process_class
2017-07-27 21:22:21.268+0800 [info] SCHEDULER-726 Task runs on this JobScheduler 'http://jt-host-kvm-72:4444'
2017-07-27 21:22:21.268+0800 [info] SCHEDULER-918 state=starting (at=never)
2017-07-27 21:22:22.466+0800 [info] SCHEDULER-987 Starting process: '/bin/sh' '-c' '"/tmp/admin/sos.gBdCm8"'
2017-07-27 21:22:23.520+0800 [info] [stdout] aa
2017-07-27 21:22:30.326+0800 [ERROR] SCHEDULER-272 Terminating task after reaching deadline <job timeout="9">
2017-07-27 21:22:30.359+0800 [ERROR] SCHEDULER-202 Connection to task has been lost, state=running_remote_process: Z-REMOTE-101 Separate process: pid=0: Connection lost / zschimmer::com::object_server::Connection::pop_operation
2017-07-27 21:22:30.359+0800 [ERROR] SCHEDULER-202 Connection to task has been lost, state=release: Z-REMOTE-122 Separate process pid=0: Caller has killed process
2017-07-27 21:22:30.384+0800 [ERROR] SCHEDULER-280 Process terminated with exit code 1 (0x63)
2017-07-27 21:22:30.384+0800 [WARN] SCHEDULER-845 Task ended without processing the order. The order remains in job's order queue in the same state
2017-07-27 21:22:30.384+0800 [info] SCHEDULER-843 Task has ended processing of Order sample_errorhandling/job_chain3:12, state=aaa, on JobScheduler 'http:/xxxx:4444'
Way 2:
Add a return code handler to the job chain node.
This works when the job finishes successfully or with an error, but fails when the job is killed on timeout.
Code in job chain:
<job_chain >
<job_chain_node state="aaa" job="job1" next_state="success" error_state="error">
<on_return_codes >
<on_return_code return_code="1">
<add_order xmlns="https://jobscheduler-plugins.sos-berlin.com/NodeOrderPlugin" job_chain="/error_handling/sendmail"/>
</on_return_code>
</on_return_codes>
</job_chain_node>
<job_chain_node state="success"/>
<job_chain_node state="error"/>
</job_chain>
You can use the error_state= attribute.
When JobScheduler kills a task because of a timeout, this is handled as an error situation.
Please note that the next_state of the errorHandling state is error, to indicate in JOC that this was an error, and that the errorHandling state has its own error_state to indicate whether the error handler itself fails.
<job_chain>
<job_chain_node state="100" job="job1" next_state="200" error_state="errorHandling"/>
<job_chain_node state="200" job="job2" next_state="success" error_state="errorHandling"/>
<job_chain_node state="errorHandling" job="errorHandlerJob" next_state="error" error_state="errorInErrorHandling"/>
<job_chain_node state="success"/>
<job_chain_node state="errorInErrorHandling"/>
<job_chain_node state="error"/>
</job_chain>
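The routing in the job chain above can be traced with a small toy model. This is only a sketch of the next_state / error_state logic as described in the answer, written in Python for illustration; it does not use or represent any JobScheduler API, and `walk` and `failing_states` are hypothetical names.

```python
# Toy model of the job chain above: state -> (next_state, error_state).
# End nodes (success, error, errorInErrorHandling) have no job and
# therefore no outgoing transitions.
CHAIN = {
    "100": ("200", "errorHandling"),
    "200": ("success", "errorHandling"),
    "errorHandling": ("error", "errorInErrorHandling"),
}


def walk(chain, start, failing_states):
    """Return the list of states an order visits.

    `failing_states` is the set of states whose job ends in error
    (a task killed on timeout counts as an error).
    """
    path = [start]
    state = start
    while state in chain:
        next_state, error_state = chain[state]
        state = error_state if state in failing_states else next_state
        path.append(state)
    return path


print(walk(CHAIN, "100", failing_states=set()))    # ['100', '200', 'success']
print(walk(CHAIN, "100", failing_states={"100"}))  # ['100', 'errorHandling', 'error']
```

In this model a timeout in job1 sends the order to errorHandling, where errorHandlerJob can trigger whatever follow-up is needed, and the order then ends in error.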

Flume agent throws java.net.ConnectException: Connection refused

I have been using Flume for a while now, with the agent and collector running on the same machine.
Configuration
agent: exec("/usr/bin/tail -n +0 -F /path/to/file") | agentE2ESink("hostname", 35855)
collector: collectorSource(35855) | collector(10000) { collectorSink("/hdfs/path/to/sink","name") }
I am facing issues on the agent node:
2012-06-04 19:13:33,625 [naive file wal consumer-27] INFO debug.InsistentOpenDecorator: open attempt 0 failed, backoff (1000ms): Failed to open thrift event sink to hostname:35855 : java.net.ConnectException: Connection refused
2012-06-04 19:13:34,625 [logicalNode hostname-19] ERROR connector.DirectDriver: Expected ACTIVE but timed out in state OPENING
2012-06-04 19:13:34,632 [naive file wal consumer-27] INFO debug.InsistentOpenDecorator: open attempt 1 failed, backoff (2000ms): Failed to open thrift event sink to hostname:35855 : java.net.ConnectException: Connection refused
2012-06-04 19:13:36,635 [naive file wal consumer-27] INFO debug.InsistentOpenDecorator: open attempt 2 failed, backoff (4000ms): Failed to open thrift event sink to hostname:35855 : java.net.ConnectException: Connection refused
and then empty ACKs are sent continuously:
2012-06-04 19:19:56,960 [Roll-TriggerThread-0] INFO endtoend.AckListener$Empty: Empty Ack Listener began 20120604-191956958+0530.881565921235084.00000026
2012-06-04 19:20:07,043 [Roll-TriggerThread-0] INFO hdfs.SeqfileEventSink: closed /tmp/flume-user1/agent/hostname/writing/20120604-191956958+0530.881565921235084.00000026
I don't understand why the connection is refused. Are there any system-level changes that need to be made?
Note: the collector is listening on the port, but the agent is unable to send data through port 35855.
Can anyone help me with this problem?
Thanks
If you are running both the agent and the collector on the same box, you should be using localhost as the address.
agentE2ESink("localhost", 35855)
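A quick way to see whether the collector port is reachable under a given name is to attempt a plain TCP connection to it. The following is a minimal sketch in Python, independent of Flume; `can_connect` is a hypothetical helper name, and the assumption is that a refused connection under the machine's hostname combined with a successful one under localhost points to an address or binding mismatch.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


def compare_hosts(port, hosts):
    """Map each host name to whether the port accepts connections there."""
    return {host: can_connect(host, port) for host in hosts}
```

For the setup in the question, calling compare_hosts(35855, [socket.gethostname(), "localhost"]) on the machine running the collector would show which of the two names the agent can actually reach.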
