Chronos framework connect and disconnect from Mesos - docker

This is my first question, so I hope I'm doing everything correctly.
I have 3 Docker containers on different hosts running ZooKeeper, Mesos, and Chronos.
The Mesos slaves are correctly subscribed to the master, and the Chronos tasks are synchronized with each host.
The problem is that the Chronos framework keeps connecting and disconnecting:
I0915 12:12:11.132375 49 master.cpp:2231] Received SUBSCRIBE call for framework 'chronos-2.4.0' at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740
I0915 12:12:11.132647 49 master.cpp:2302] Subscribing framework chronos-2.4.0 with checkpointing enabled and capabilities [ ]
I0915 12:12:11.133229 49 master.cpp:2312] Framework 71c69a28-ef16-4ed1-b869-04df66f84b5d-0000 (chronos-2.4.0) at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740 already subscribed, resending acknowledgement
W0915 12:12:11.133322 49 master.hpp:1764] Master attempted to send message to disconnected framework 71c69a28-ef16-4ed1-b869-04df66f84b5d-0000 (chronos-2.4.0) at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740
E0915 12:12:11.133745 55 process.cpp:1958] Failed to shutdown socket with fd 41: Transport endpoint is not connected
I0915 12:12:25.648849 52 master.cpp:2231] Received SUBSCRIBE call for framework 'chronos-2.4.0' at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740
I0915 12:12:25.649029 52 master.cpp:2302] Subscribing framework chronos-2.4.0 with checkpointing enabled and capabilities [ ]
I0915 12:12:25.649060 52 master.cpp:2312] Framework 71c69a28-ef16-4ed1-b869-04df66f84b5d-0000 (chronos-2.4.0) at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740 already subscribed, resending acknowledgement
W0915 12:12:25.649116 52 master.hpp:1764] Master attempted to send message to disconnected framework 71c69a28-ef16-4ed1-b869-04df66f84b5d-0000 (chronos-2.4.0) at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740
E0915 12:12:25.649433 55 process.cpp:1958] Failed to shutdown socket with fd 41: Transport endpoint is not connected
I0915 12:13:15.146510 50 master.cpp:2231] Received SUBSCRIBE call for framework 'chronos-2.4.0' at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740
I0915 12:13:15.146759 50 master.cpp:2302] Subscribing framework chronos-2.4.0 with checkpointing enabled and capabilities [ ]
I0915 12:13:15.146848 50 master.cpp:2312] Framework 71c69a28-ef16-4ed1-b869-04df66f84b5d-0000 (chronos-2.4.0) at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740 already subscribed, resending acknowledgement
W0915 12:13:15.146939 50 master.hpp:1764] Master attempted to send message to disconnected framework 71c69a28-ef16-4ed1-b869-04df66f84b5d-0000 (chronos-2.4.0) at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740
E0915 12:13:15.147408 55 process.cpp:1958] Failed to shutdown socket with fd 41: Transport endpoint is not connected
I0915 12:14:04.957185 51 master.cpp:2231] Received SUBSCRIBE call for framework 'chronos-2.4.0' at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740
I0915 12:14:04.957341 51 master.cpp:2302] Subscribing framework chronos-2.4.0 with checkpointing enabled and capabilities [ ]
I0915 12:14:04.957363 51 master.cpp:2312] Framework 71c69a28-ef16-4ed1-b869-04df66f84b5d-0000 (chronos-2.4.0) at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740 already subscribed, resending acknowledgement
W0915 12:14:04.957392 51 master.hpp:1764] Master attempted to send message to disconnected framework 71c69a28-ef16-4ed1-b869-04df66f84b5d-0000 (chronos-2.4.0) at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740
E0915 12:14:04.957844 55 process.cpp:1958] Failed to shutdown socket with fd 41: Transport endpoint is not connected
In this case, mesos-master and the Chronos framework run in the same Docker container, but I suspect the master cannot connect back to Chronos on port 61740 (which is an ephemeral port).
netstat capture:
tcpdump capture:
root@HOSTNAME:/# tcpdump -i eth0 port 61740 -v
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
12:30:41.013731 IP (tos 0x0, ttl 64, id 12013, offset 0, flags [DF], proto TCP (6), length 60)
172.xxx.xxx.xxx.29468 > HOSTNAME.61740: Flags [S], cksum 0xb989 (incorrect -> 0xa894), seq 1155265525, win 14600, options [mss 1460,sackOK,TS val 852942104 ecr 0,nop,wscale 6], length 0
12:30:41.013780 IP (tos 0x0, ttl 64, id 49727, offset 0, flags [DF], proto TCP (6), length 40)
HOSTNAME.61740 > 172.xxx.xxx.xxx.29468: Flags [R.], cksum 0x595a (correct), seq 0, ack 1155265526, win 0, length 0
12:31:18.129849 IP (tos 0x0, ttl 64, id 64040, offset 0, flags [DF], proto TCP (6), length 60)
172.xxx.xxx.xxx.30564 > HOSTNAME.61740: Flags [S], cksum 0xb989 (incorrect -> 0x97fb), seq 535270461, win 14600, options [mss 1460,sackOK,TS val 852979221 ecr 0,nop,wscale 6], length 0
12:31:18.129892 IP (tos 0x0, ttl 64, id 6441, offset 0, flags [DF], proto TCP (6), length 40)
HOSTNAME.61740 > 172.xxx.xxx.xxx.30564: Flags [R.], cksum 0xd9be (correct), seq 0, ack 535270462, win 0, length 0
12:31:36.451417 IP (tos 0x0, ttl 64, id 21303, offset 0, flags [DF], proto TCP (6), length 60)
172.xxx.xxx.xxx.31103 > HOSTNAME.61740: Flags [S], cksum 0xb989 (incorrect -> 0x10c7), seq 186377873, win 14600, options [mss 1460,sackOK,TS val 852997542 ecr 0,nop,wscale 6], length 0
12:31:36.451470 IP (tos 0x0, ttl 64, id 13169, offset 0, flags [DF], proto TCP (6), length 40)
HOSTNAME.61740 > 172.xxx.xxx.xxx.31103: Flags [R.], cksum 0x9a1b (correct), seq 0, ack 186377874, win 0, length 0
12:31:41.619076 IP (tos 0x0, ttl 64, id 41997, offset 0, flags [DF], proto TCP (6), length 60)
172.xxx.xxx.xxx.31252 > HOSTNAME.61740: Flags [S], cksum 0xb989 (incorrect -> 0xfe18), seq 2176478683, win 14600, options [mss 1460,sackOK,TS val 853002710 ecr 0,nop,wscale 6], length 0
12:31:41.619119 IP (tos 0x0, ttl 64, id 13179, offset 0, flags [DF], proto TCP (6), length 40)
HOSTNAME.61740 > 172.xxx.xxx.xxx.31252: Flags [R.], cksum 0x9b9d (correct), seq 0, ack 2176478684, win 0, length 0
The IP 172.xxx.xxx.xxx is the container IP, but I actually run mesos-master like this:
mesos-master --log_dir=/var/log/mesos/master/ --work_dir=/var/log/mesos/work/ --quorum=2 --cluster=XXXX --zk=file:///etc/mesos/zk --advertise_ip=10.XXX.XXX.XXX --hostname=HOSTNAME
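One workaround I am considering (untested, and it assumes the Chronos scheduler honours the standard libprocess environment variables; the image name and port 4400 below are placeholders) is to give the scheduler a fixed, published port instead of letting it pick an ephemeral one:
# sketch only: pin and publish the scheduler's libprocess port, and advertise the host IP
docker run -d --name chronos \
    -p 4400:4400 \
    -e LIBPROCESS_PORT=4400 \
    -e LIBPROCESS_ADVERTISE_IP=10.XXX.XXX.XXX \
    -e LIBPROCESS_ADVERTISE_PORT=4400 \
    my-chronos-image
That way the master would be dialing a port that is actually reachable from outside the container.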
Any idea or suggestion will be appreciated.
Thanks.

In the tcpdump capture we can see the incorrect checksum. It seems like a bug in kernel version 3.10. This is fixed in 3.14+, but I can't verify it because we can't upgrade the kernel in this environment.
https://tech.vijayp.ca/linux-kernel-bug-delivers-corrupt-tcp-ip-data-to-mesos-kubernetes-docker-containers-4986f88f7a19#.w6eui9yc9
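A possible mitigation in the meantime (I have not been able to verify it here, and the interface name is just an example) is to disable checksum offloading on the affected interfaces so the checksums are computed in software:
# disable RX/TX checksum offload on the interface the containers use
ethtool -K eth0 rx off tx off
# confirm the new offload settings
ethtool -k eth0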

Related

Job Fails with odd message

I have a job that is failing at the very start of the message:
"#*" and "#N" are reserved sharding specs. Filepattern must not contain any of them.
I have altered the destination location to be something other than the default (an email address, which would include the @ symbol), but I can still see it is using temporary destinations within that path that I am unable to edit.
Did anyone experience this issue before? I've got a file which is only 65k rows long, and I can preview all of the complete data in Data Prep, but when I run the job it fails, which is super tedious: ~3hrs of cleaning down the drain if this won't run. (I appreciate it's not designed for this, but Excel was being a mare so it seemed like a good solution!)
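To make the failure mode concrete, here is a hypothetical example (not my real paths) of the kind of filepattern the error complains about versus one that avoids it:
gs://example-bucket/exports/report@2.csv     # contains "@2", which matches the reserved "@N" sharding spec
gs://example-bucket/exports/report_2.csv     # no "@" anywhere in the filepattern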
Edit - Adding Logs:
2018-03-10 (13:47:34) Value "PTableLoadTransformGCS/Shuffle/GroupByKey/Session" materialized.
2018-03-10 (13:47:34) Executing operation PTableLoadTransformGCS/SumQuoteAndDelimiterCounts/GroupByKey/Read+PTableLoadTran...
2018-03-10 (13:47:38) Executing operation PTableLoadTransformGCS/Shuffle/GroupByKey/Close
2018-03-10 (13:47:38) Executing operation PTableStoreTransformGCS/WriteFiles/GroupUnwritten/Create
2018-03-10 (13:47:39) Value "PTableStoreTransformGCS/WriteFiles/GroupUnwritten/Session" materialized.
2018-03-10 (13:47:39) Executing operation PTableLoadTransformGCS/Shuffle/GroupByKey/Read+PTableLoadTransformGCS/Shuffle/Gr...
2018-03-10 (13:47:39) Executing failure step failure49
2018-03-10 (13:47:39) Workflow failed. Causes: (c759db2a23a80ea): "@*" and "@N" are reserved sharding specs. Filepattern m...
(c759db2a23a8c5b): Workflow failed. Causes: (c759db2a23a80ea): "@*" and "@N" are reserved sharding specs. Filepattern must not contain any of them.
2018-03-10 (13:47:39) Cleaning up.
2018-03-10 (13:47:39) Starting worker pool teardown.
2018-03-10 (13:47:39) Stopping worker pool...
And StackDriver warnings or higher:
W ACPI: RSDP 0x00000000000F23A0 000014 (v00 Google)
W ACPI: RSDT 0x00000000BFFF3430 000038 (v01 Google GOOGRSDT 00000001 GOOG 00000001)
W ACPI: FACP 0x00000000BFFFCF60 0000F4 (v02 Google GOOGFACP 00000001 GOOG 00000001)
W ACPI: DSDT 0x00000000BFFF3470 0017B2 (v01 Google GOOGDSDT 00000001 GOOG 00000001)
W ACPI: FACS 0x00000000BFFFCF00 000040
W ACPI: FACS 0x00000000BFFFCF00 000040
W ACPI: SSDT 0x00000000BFFF65F0 00690D (v01 Google GOOGSSDT 00000001 GOOG 00000001)
W ACPI: APIC 0x00000000BFFF5D10 00006E (v01 Google GOOGAPIC 00000001 GOOG 00000001)
W ACPI: WAET 0x00000000BFFF5CE0 000028 (v01 Google GOOGWAET 00000001 GOOG 00000001)
W ACPI: SRAT 0x00000000BFFF4C30 0000B8 (v01 Google GOOGSRAT 00000001 GOOG 00000001)
W ACPI: 2 ACPI AML tables successfully acquired and loaded
W ACPI: Executed 2 blocks of module-level executable AML code
W acpi PNP0A03:00: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.
W ACPI: Enabled 16 GPEs in block 00 to 0F
W ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 11
W ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 10
W i8042: Warning: Keylock active
W GPT:Primary header thinks Alt. header is not at the end of the disk.
W GPT:41943039 != 524287999
W GPT:Alternate GPT header not at the end of the disk.
W GPT:41943039 != 524287999
W GPT: Use GNU Parted to correct GPT errors.
W device-mapper: verity: Argument 0: 'payload=PARTUUID=245B0EEC-6404-8744-AAF2-E8C6BF78D7B2'
W device-mapper: verity: Argument 1: 'hashtree=PARTUUID=245B0EEC-6404-8744-AAF2-E8C6BF78D7B2'
W device-mapper: verity: Argument 2: 'hashstart=2539520'
W device-mapper: verity: Argument 3: 'alg=sha1'
W device-mapper: verity: Argument 4: 'root_hexdigest=244007b512ddbf69792d485fdcbc3440531f1264'
W device-mapper: verity: Argument 5: 'salt=5bacc0df39d2a60191e9b221ffc962c55e251ead18cf1472bf8d3ed84383765b'
E EXT4-fs (dm-0): couldn't mount as ext3 due to feature incompatibilities
W [/usr/lib/tmpfiles.d/var.conf:12] Duplicate line for path "/var/run", ignoring.
W Could not stat /dev/pstore: No such file or directory
W Kernel does not support crash dumping
W Could not load the device policy file.
W [CLOUDINIT] cc_write_files.py[WARNING]: Undecodable permissions None, assuming 420
W [CLOUDINIT] cc_write_files.py[WARNING]: Undecodable permissions None, assuming 420
W [CLOUDINIT] cc_write_files.py[WARNING]: Undecodable permissions None, assuming 420
W [CLOUDINIT] cc_write_files.py[WARNING]: Undecodable permissions None, assuming 420
W [WARNING:persistent_integer.cc(75)] cannot open /var/lib/metrics/version.cycle for reading: No such file or directory
W No API client: no api servers specified
W Unable to update cni config: No networks found in /etc/cni/net.d
W unable to connect to Rkt api service: rkt: cannot tcp Dial rkt api service: dial tcp 127.0.0.1:15441: getsockopt: connection refused
W No api server defined - no events will be sent to API server.
W Hairpin mode set to "promiscuous-bridge" but kubenet is not enabled, falling back to "hairpin-veth"
W Unable to update cni config: No networks found in /etc/cni/net.d
E Image garbage collection failed once. Stats initialization may not have completed yet: unable to find data for container /
W No api server defined - no node status update will be sent.
E Failed to check if disk space is available for the runtime: failed to get fs info for "runtime": unable to find data for container /
E Failed to check if disk space is available on the root partition: failed to get fs info for "root": unable to find data for container /
E [ContainerManager]: Fail to get rootfs information unable to find data for container /
W Registration of the rkt container factory failed: unable to communicate with Rkt api service: rkt: cannot tcp Dial rkt api service: dial tcp 127.0.0.1:15441: getsockopt: connection refused
E Could not find capacity information for resource storage.kubernetes.io/scratch
W eviction manager: no observation found for eviction signal allocatableNodeFs.available
W Profiling Agent not found. Profiles will not be available from this worker.
E debconf: delaying package configuration, since apt-utils is not installed
W [WARNING:metrics_daemon.cc(598)] cannot read /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
E % Total % Received % Xferd Average Speed Time Time Time Current
E Dload Upload Total Spent Left Speed
E
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 3698 100 3698 0 0 64248 0 --:--:-- --:--:-- --:--:-- 64877

tcpreplay, different packet number with pcap file

I have a pcap file, captured with Wireshark, with 2690 packets.
I use tcpreplay to replay it on an interface, but the problem is that the number of attempted packets reported by tcpreplay is lower than the number of packets shown in Wireshark:
#tcpreplay -i eth1 test.pcap
sending out eth1
processing file: /tmp/reloadtest4.pcap
Actual: 1826 packets (1634597 bytes) sent in 58.06 seconds.
Rated: 28153.6 bps, 0.21 Mbps, 31.45 pps
Statistics for network device: eth1
Attempted packets: 1826
Successful packets: 1826
Failed packets: 0
Retried packets (ENOBUFS): 0
Retried packets (EAGAIN): 0
It may be a bug in Tcpreplay, or it may be an issue with your capture. I suggest filing the issue here and including the output of tcpreplay -V. Also, rename the packet capture with a .txt suffix and you will be able to attach it to the issue.
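Before filing, it can also help to confirm the packet counts outside of Wireshark and to capture the version string in one go, for example (this assumes the Wireshark command-line tools are installed):
# count the packets in the capture file itself
capinfos -c /tmp/reloadtest4.pcap
tcpdump -r /tmp/reloadtest4.pcap 2>/dev/null | wc -l
# version information to include in the bug report
tcpreplay -V
If both counts agree with Wireshark's 2690 but tcpreplay still attempts fewer, that strengthens the case for a Tcpreplay bug.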

How to determine what is blocking a connection to website?

I need to connect (let's say from a script) from one IP to a website, and I can't, because:
wget: domain.com…
--2014-11-10 11:20:53-- domain.com…
Resolving domain.com... <IP>
Connecting to domain.com|<IP>|... failed: Connection timed out.
Retrying.
I asked the admin of the server I host the domain on to give me a ping and traceroute to that IP, and here it is:
pat@daz:/# ping 188.127.246.50
PING 188.127.246.50 (188.127.246.50) 56(84) bytes of data.
^C
--- 188.127.246.50 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1009ms
pat@daz:/# traceroute 188.127.246.50
traceroute to 188.127.246.50 (188.127.246.50), 30 hops max, 40 byte packets
1 79.133.193.65 (79.133.193.65) 1.348 ms 0.994 ms 0.796 ms
2 ae1-JunM10.etop.pl (79.133.201.129) 72.054 ms 56.309 ms 52.532 ms
3 frm-k90-cr1.rascom.ru (80.81.192.166) 20.489 ms 20.356 ms 20.388 ms
4 * * *
5 oversun.w-ix.net (193.106.112.195) 63.164 ms 59.526 ms 124.898 ms
6 * * *
7 * *^C
The website's .htaccess is clean, robots.txt too, and the hosting admin says there's nothing on their side (firewall, blacklist, etc.) that could be blocking that IP from connecting.
What could it be then? How can I check it?
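A few checks that could help narrow down where the connection is being dropped (a sketch: plain HTTP on port 80 is assumed, and these commands need to run from the blocked IP):
# does a TCP connection to port 80 ever complete?
nc -vz -w 10 188.127.246.50 80
# trace using TCP SYN packets instead of ICMP/UDP, so firewalls that drop ICMP don't hide the path
traceroute -T -p 80 188.127.246.50
# full client view of the attempt, with an explicit timeout
curl -v --connect-timeout 15 http://188.127.246.50/
If the TCP traceroute dies at the same hop every time while the site works from other networks, the block is most likely at or near that hop rather than on the web server itself.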

How do I pump traffic using tcpreplay at 100 MBps, 500 MBps and 1Gbps speeds?

I used the -R and -K options but it doesn't seem to be working: I captured the pumped traffic using tcpdump, and the number of packets I see there doesn't match the number of packets I expect in that time frame.
First of all make sure you are using the latest version, available here. You will want to use the -K and --mbps (or -M) options, for example:
# tcpreplay -i eth7 -K --mbps 1000 smallFlows.pcap
File Cache is enabled
Actual: 14261 packets (9216531 bytes) sent in 0.073761 seconds.
Rated: 124951275.0 Bps, 999.61 Mbps, 193340.65 pps
Flows: 1209 flows, 16390.77 fps, 14243 flow packets, 18 non-flow
Statistics for network device: eth7
Attempted packets: 14261
Successful packets: 14261
Failed packets: 0
Truncated packets: 0
Retried packets (ENOBUFS): 0
Retried packets (EAGAIN): 0
When you attempt to move to higher speeds (e.g. 10GigE) you may need to generate a bigger block of data by using the --loop option. Also, with Tcpreplay version 4.0 there are the more advanced --netmap and --unique-ip options, which, on a properly set up system, will achieve near wire rate and very high flows/sec. More information is available at Tcpreplay How To. Here is an example:
# tcpreplay -i eth7 -K --mbps 9500 --loop 100 --netmap --unique-ip smallFlows.pcap
Switching network driver for eth7 to netmap bypass mode... done!
File Cache is enabled
Actual: 1426100 packets (921653100 bytes) sent in 0.776133 seconds.
Rated: 1187493767.1 Bps, 9499.95 Mbps, 1837442.80 pps
Flows: 120900 flows, 155772.27 fps, 1424300 flow packets, 1800 non-flow
Statistics for network device: eth7
Attempted packets: 1426100
Successful packets: 1426100
Failed packets: 0
Truncated packets: 0
Retried packets (ENOBUFS): 0
Retried packets (EAGAIN): 0
Switching network driver for eth7 to normal mode... done!
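One more note on measuring with tcpdump: at these rates the kernel can drop packets from the capture itself, so the tcpdump count can legitimately be lower than what was actually sent. The summary tcpdump prints on exit and the raw interface counters both make this visible (eth0 is an example name):
# capture on the receiving interface; the exit summary reports "packets dropped by kernel"
tcpdump -i eth0 -n -w /dev/null
# or compare the interface counter before and after the replay
cat /sys/class/net/eth0/statistics/rx_packets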

How do I get siege to use the cookies for benchmark and load testing

I've set up siege (v2.70) to test a webapp that has a login, and I have a list of about 180 URLs to stress test the speed of the app. The problem is that the cookie returned by the login URL is being ignored.
Here is a tcpdump of the network traffic between siege and the app (box.example.org has been substituted for the actual URL; that is not the problem).
You can see the Set-Cookie in the first response, and then the cookie is missing from the next GET request. I found this:
How do I use the --header option to send cookies with Siege?
but the cookie I need to send depends on the login, so I can't just hard-code it. The FAQ says it supports cookies, but tcpdump argues otherwise:
20:25:48.003094 IP (tos 0x0, ttl 64, id 31699, offset 0, flags [DF], proto TCP (6), length 223)
192.168.20.34.48923 > 192.168.20.81.80: Flags [P.], seq 2734994720:2734994891, ack 3331849910, win 913, options [nop,nop,TS val 2229194446 ecr 50245070], length 171
E...{.#.#......"...Q...P... ...............
........GET /saml/testlogon/ HTTP/1.1^M
Host: box.example.org^M
Accept: */*^M
Accept-Encoding: gzip^M
User-Agent: JoeDog/1.00 [en] (X11; I; Siege 2.70)^M
Connection: close^M
^M
20:25:48.406571 IP (tos 0x0, ttl 64, id 21369, offset 0, flags [DF], proto TCP (6), length 427)
192.168.20.81.80 > 192.168.20.34.48923: Flags [P.], seq 1:376, ack 171, win 972, options [nop,nop,TS val 50245160 ecr 2229194446], length 375
E...Sy#.#.<....Q...".P...............*.....
...(....HTTP/1.1 302 FOUND^M
Date: Wed, 21 Nov 2012 02:33:32 GMT^M
Server: Apache/2.2.22 (Ubuntu)^M
Vary: Cookie,Accept-Encoding^M
Set-Cookie: sessionid=233511e6001797ec77f7f3a08683ce97; httponly; Path=/^M
Location: http://box.example.org/viewer/start/^M
Content-Encoding: gzip^M
Content-Length: 20^M
Connection: close^M
Content-Type: text/html; charset=utf-8^M
^M
....................
20:25:49.410684 IP (tos 0x0, ttl 64, id 3041, offset 0, flags [DF], proto TCP (6), length 221)
192.168.20.34.48924 > 192.168.20.81.80: Flags [P.], seq 1168563186:1168563355, ack 1414846755, win 913, options [nop,nop,TS val 2229194798 ecr 50245422], length 169
E.....#.#..v..."...Q...PE...TT.#...........
........GET /viewer/start/ HTTP/1.1^M
Host: box.example.org^M
Accept: */*^M
Accept-Encoding: gzip^M
User-Agent: JoeDog/1.00 [en] (X11; I; Siege 2.70)^M
Connection: close^M
^M
20:25:49.419109 IP (tos 0x0, ttl 64, id 30222, offset 0, flags [DF], proto TCP (6), length 375)
192.168.20.81.80 > 192.168.20.34.48924: Flags [P.], seq 1:324, ack 169, win 972, options [nop,nop,TS val 50245424 ecr 2229194798], length 323
E..wv.#.#......Q...".P..TT.#E........*.....
...0....HTTP/1.1 302 FOUND^M
Date: Wed, 21 Nov 2012 02:33:34 GMT^M
Server: Apache/2.2.22 (Ubuntu)^M
Vary: Cookie,Accept-Encoding^M
Location: http://box.example.org/saml/testlogon/?next=/viewer/start/^M
Content-Encoding: gzip^M
Content-Length: 20^M
Connection: close^M
Content-Type: text/html; charset=utf-8^M
^M
Turns out the problem was in siege itself. I fixed the code and pushed it to this repo
https://github.com/mark0978/siege
Siege did not expect the httponly; attribute (it has no =), and if the expires time is not present, it got the expiration wrong too. The last one I could have fixed by setting an expiration time in my app, but instead I just fixed the calculation.
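For anyone who cannot patch siege, a rough workaround under the same setup (the login URL and cookie name come from the capture above; urls.txt and the sed parsing are assumptions that may need adjusting) is to fetch the session cookie once with curl and hand it to siege with --header:
# log in once and extract the sessionid cookie from the response headers
COOKIE=$(curl -s -D - -o /dev/null http://box.example.org/saml/testlogon/ \
  | sed -n 's/^Set-Cookie: \(sessionid=[^;]*\).*/\1/p')
# replay the URL list with the captured cookie attached to every request
siege --header="Cookie: $COOKIE" -f urls.txt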
