Solace Message Broker Message Spool Active Disk Partition - solace

What is meant by Active Disk Partition Usage: 100.00% when checking the message-spool config? If this is a concern, how would I go about fixing it?
I have zero messages spooled into Solace, but I am getting the above status, and I also had issues provisioning new queues.
Is there a way to clear all system logs, i.e. command, event, debug and system? I understand that there is an archive policy for that, but I would like to have a clean state for my logs.
The full message-spool status output is:
Config Status: Enabled (Primary)
Maximum Spool Usage: 1500 MB
Using Internal Disk: Yes
Operational Status: AD-Active
Datapath Status: Up
Synchronization Status: Synced
Spool-Sync Status: Synced
Last Failure Reason: N/A
Last Failure Time: N/A
Max Message Count: 240M
Message Count Utilization: 0.00%
Transaction Resource Utilization: 0.00%
Delivered Unacked Msgs Utilization: 0.00%
Spool Files Utilization: 0.00%
Active Disk Partition Usage: 100.00%
Mate Disk Partition Usage: -%
Next Message Id: 222789873
Defragmentation Status: Idle
Number of delete in-progress: 0
Current Persistent Store Usage (MB) 0.0000 0.0000 0.0000
Number of Messages Currently Spooled 0 0 0
I am using system software SolOS-TR version 7.2.2.34.

When configuring a software message broker, disk space needs to be allocated to the broker for spooling messages. The Active Disk Partition Usage statistic refers to how much of the disk partition backing the spool is currently in use.
If no disk space is partitioned specifically for the spool, the spool shares its partition with the entire system. In that case you can observe high active disk partition usage with no messages spooled. To resolve the “Active Disk Partition Usage: 100.00%” you are experiencing, dedicated spool storage should be provisioned for the broker. You can read more about Storage Configuration on Solace brokers here (https://docs.solace.com/Configuring-and-Managing/Configuring-Storage-Machine-Cloud.htm#).
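As a quick sanity check (assuming a software broker whose storage shares the host's filesystem rather than having its own dedicated partition), you can confirm from the host which partition backs the broker's storage directory and whether it is the one that is full:
df -h    # the filesystem backing the broker's storage will show ~100% use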
To address your system logs question, clearing the logs is not supported by the broker. Information stored in the system logs is timestamped and kept for troubleshooting purposes; as time progresses, new information is appended to these files.

Related

How to get rid of warnings with MEM and SSD tiers

I have two tiers: MEM + SSD. The MEM tier is almost always around 90% full, and sometimes the SSD tier is also full.
Messages like the following sometimes spam my log:
2022-06-14 07:11:43,607 WARN TieredBlockStore - Target tier: BlockStoreLocation{TierAlias=MEM, DirIndex=0, MediumType=MEM} has no available space to store 67108864 bytes for session: -4254416005596851101
2022-06-14 07:11:43,607 WARN BlockTransferExecutor - Transfer-order: BlockTransferInfo{TransferType=SWAP, SrcBlockId=36401609441282, DstBlockId=36240078405636, SrcLocation=BlockStoreLocation{TierAlias=MEM, DirIndex=0, MediumType=MEM}, DstLocation=BlockStoreLocation{TierAlias=SSD, DirIndex=0, MediumType=SSD}} failed. alluxio.exception.WorkerOutOfSpaceException: Failed to find space in BlockStoreLocation{TierAlias=MEM, DirIndex=0, MediumType=MEM} to move blockId 36240078405636
2022-06-14 07:11:43,607 WARN AlignTask - Insufficient space for worker swap space, swap restore task called.
Is my setup flawed? What can I do to get rid of these warnings?
It looks like the Alluxio worker is trying to move/swap some blocks, but there is not enough space to finish the operation. I suspect this is because both the MEM and SSD tiers are full. Have you tried the property alluxio.worker.tieredstore.free.ahead.bytes? It can help determine whether the swap failed due to insufficient storage space.
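For reference, this is a worker-side setting in alluxio-site.properties; the value below is only an illustrative guess and should be tuned to at least a few of your block sizes (the warning above shows 64 MB blocks). Workers need a restart to pick it up.
# alluxio-site.properties on the workers: keep some headroom free on the top tier
alluxio.worker.tieredstore.free.ahead.bytes=1GB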

What is the best way to performance test an SQS consumer to find the max TPS that one host can handle?

I have an SQS consumer running in EventConsumerService that needs to handle up to 3K TPS successfully, sometimes upwards of 20K TPS (or 1.2 million messages per minute). For each message processed, I make a REST call to DataService's TCP VIP. I'm trying to perform a load test to find the max TPS that one host can handle in EventConsumerService without overstraining:
Request volume on dependencies, DynamoDB storage, etc
CPU utilization in both EventConsumerService and DataService
Network connections per host
IO stats due to overlogging
DLQ size must be minimal. Currently my DLQ grows to 500K messages due to 500 Service Unavailable exceptions thrown by DataService, so something must be wrong.
Approximate age of oldest message. I do not want a message sitting in the queue for over X minutes.
Fatals and latency of the REST call to DataService
Active threads
This is how I am performing the performance test:
I set up both my consumer and the other service on one host, the reason being I want to understand the load on both services per host.
I use a TPS generator to fill the SQS queue with a million messages
The EventConsumerService service is already running in production. Once messages started filling the SQS queue, I could immediately see requests being sent to DataService.
Here are the parameters I am tuning to find messagesPolledPerSecond:
messagesPolledPerSecond = (numberOfHosts * numberOfPollers * messageFetchSize) * (1000/(sleepTimeBetweenPollsPerMs+receiveMessageTimePerMs))
messagesInSurge / messagesPolledPerSecond = ageOfOldestMessageSLA
ageOfOldestMessage + settingsUpdatedLatency < latencySLA
The variables for SqsConsumer which I kept constant are:
numberOfHosts = 1
receiveMessageTimePerMs ≈ 60 ms (it's out of my control)
Max thread pool size: 300
The other factors are all fair game:
Number of pollers (default 1), I set to 150
Sleep time between polls (default 100 ms), I set to 0 ms
Sleep time when no messages (default 1000 ms), ???
message fetch size (default 1), I set to 10
However, with the above parameters I am seeing a high number of messages being sent to the DLQ due to server errors, so I have clearly set the values too high. This testing methodology seems highly inefficient, and I am unable to find the optimal TPS that avoids such a tremendous number of messages being sent to the DLQ and such a high approximate age of the oldest message.
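For reference, plugging the values above into the messagesPolledPerSecond formula (a rough estimate that assumes receiveMessageTimePerMs really is about 60 ms):
messagesPolledPerSecond = (1 * 150 * 10) * (1000 / (0 + 60)) ≈ 25,000
That is well above the 3K TPS target, so DataService answering with 503s at this poll rate is consistent with the settings being too aggressive.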
Any guidance on how best I should test is appreciated. It'd be very helpful if we could set up a time to chat; PM me directly.

nodetool info and java memory mismatch

On a 6-node Cassandra cluster, the heap size is configured as 31 GB. When I run nodetool info, I see the output below.
Nodetool info -
[root@ip-10-216-86-94 ~]# nodetool info
ID : 88esdsd01-5233-4b56-a240-ea051ced2928
Gossip active : true
Thrift active : false
Native Transport active: true
Load : 53.31 GiB
Generation No : 1549564460
Uptime (seconds) : 734
Heap Memory (MB) : 828.45 / 31744.00
Off Heap Memory (MB) : 277.25
Data Center : us-east
Rack : 1a
Exceptions : 0
Key Cache : entries 8491, size 1.12 MiB, capacity 100 MiB, 35299 hits, 44315 requests, 0.797 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 5414, size 1.22 MiB, capacity 50 MiB, 5387 hits, 10801 requests, 0.499 recent hit rate, 7200 save period in seconds
Chunk Cache : entries 6164, size 249.5 MiB, capacity 480 MiB, 34840 misses, 177139 requests, 0.803 recent hit rate, 121.979 microseconds miss latency
Percent Repaired : 0.0%
Token : (invoke with -T/--tokens to see all 8 tokens)
The heap memory used and allocated map to what I see in JConsole. But for non-heap memory, JConsole shows 188 MB whereas the info command shows 277 MB. Why is there a mismatch?
Non-Heap Memory in JConsole and Off Heap Memory shown by nodetool are completely different things.
Non-Heap Memory in JConsole is the sum of JVM non-heap memory pools. JVM exports this information through MemoryPoolMXBean. As of JDK 8, these pools include:
Metaspace
Compressed Class Space
Code Cache
So, Non-Heap pools show how much memory JVM uses for class metadata and compiled code.
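If you want to see exactly which pools JConsole is summing, a small sketch like the following (illustrative and not Cassandra-specific) prints the non-heap pools through the same MemoryPoolMXBean interface that JConsole uses:
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class NonHeapPools {
    public static void main(String[] args) {
        long totalUsed = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.NON_HEAP) {
                long used = pool.getUsage().getUsed();
                totalUsed += used;
                // e.g. Metaspace, Compressed Class Space, Code Cache on JDK 8
                System.out.printf("%-25s %d MB%n", pool.getName(), used / (1024 * 1024));
            }
        }
        // This sum is what JConsole reports as "Non-Heap Memory"; it has nothing
        // to do with Cassandra's off-heap table structures reported by nodetool.
        System.out.printf("Total non-heap used: %d MB%n", totalUsed / (1024 * 1024));
    }
}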
Nodetool gets Off Heap Memory stats from Cassandra's Column Family Metrics. This is the total size of Bloom filters, Index Summaries and Compression Metadata for all open tables.
See nodetool tablestats for a detailed breakdown of these statistics.
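For example, in the per-table section of nodetool tablestats the relevant lines are roughly the following (exact names can vary slightly between Cassandra versions); summing them across all tables approximates the Off Heap Memory figure from nodetool info:
Bloom filter off heap memory used: <bytes>
Index summary off heap memory used: <bytes>
Compression metadata off heap memory used: <bytes>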

Jenkins web ui is totally unresponsive

My Jenkins instance has been running for over two years without issue, but yesterday it quit responding to HTTP requests. No errors, just spinning clocks, endlessly.
I've restarted the service, then restarted the entire server.
There's been a lot of mention of a thread dump. I attempted to get one, but I'm not sure that what I captured below actually is one.
Heap
PSYoungGen total 663552K, used 244203K [0x00000000d6700000, 0x0000000100000000, 0x0000000100000000)
eden space 646144K, 36% used [0x00000000d6700000,0x00000000e4df5f70,0x00000000fde00000)
from space 17408K, 44% used [0x00000000fef00000,0x00000000ff685060,0x0000000100000000)
to space 17408K, 0% used [0x00000000fde00000,0x00000000fde00000,0x00000000fef00000)
ParOldGen total 194048K, used 85627K [0x0000000083400000, 0x000000008f180000, 0x00000000d6700000)
object space 194048K, 44% used [0x0000000083400000,0x000000008879ee10,0x000000008f180000)
Metaspace used 96605K, capacity 104986K, committed 105108K, reserved 1138688K
class space used 12782K, capacity 14961K, committed 14996K, reserved 1048576K
Ubuntu 16.04.5 LTS
I prefer looking in the Jenkins log file. There you can see the errors and then fix them.
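Note that the output pasted above is a GC heap summary, not a thread dump. The next time the UI hangs, a real thread dump (which shows what every Jenkins thread is waiting on) can be captured in one of these ways, assuming you know the Jenkins process PID:
jstack -l <jenkins-pid> > /tmp/jenkins-threads.txt    # requires a JDK, not just a JRE
kill -3 <jenkins-pid>    # dumps all threads to the JVM's stdout, usually the Jenkins log
Alternatively, an administrator can open http://<jenkins-host>/threadDump in a browser.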

Flume doesn't recover after memory transaction capacity is exceeded

I'm creating a proof of concept of a Flume agent that buffers events and stops consuming from the source when the sink is unavailable. Only when the sink is available again should the buffered events be processed, after which the source resumes consumption.
For this I've created a simple agent which reads from a SpoolDir and writes to a file. To simulate the sink service being down, I change the file permissions so Flume can't write to it. When I then start Flume, some events are buffered in the memory channel and it stops consuming events when the channel capacity is full, as expected. As soon as the file becomes writable, the sink is able to process the events and Flume recovers. However, that only works when the transaction capacity is not exceeded. As soon as the transaction capacity is exceeded, Flume never recovers and keeps writing the following error:
2015-10-02 14:52:51,940 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR -
org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:160)] Unable to
deliver event. Exception follows.
org.apache.flume.EventDeliveryException: Failed to process transaction
at org.apache.flume.sink.RollingFileSink.process(RollingFileSink.java:218)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flume.ChannelException: Take list for MemoryTransaction,
capacity 4 full, consider committing more frequently, increasing capacity, or
increasing thread count
at org.apache.flume.channel.MemoryChannel$MemoryTransaction.doTake(MemoryChannel.java:96)
at org.apache.flume.channel.BasicTransactionSemantics.take(BasicTransactionSemantics.java:113)
at org.apache.flume.channel.BasicChannelSemantics.take(BasicChannelSemantics.java:95)
at org.apache.flume.sink.RollingFileSink.process(RollingFileSink.java:191)
... 3 more
As soon as the number of events buffered in memory exceeds the transaction capacity (4), this error occurs. I don't understand why, because the batchSize of the fileout sink is 1, so it should take the events out one by one.
This is the config I'm using:
agent.sources = spool-src
agent.channels = mem-channel
agent.sinks = fileout
agent.sources.spool-src.channels = mem-channel
agent.sources.spool-src.type = spooldir
agent.sources.spool-src.spoolDir = /tmp/flume-spool
agent.sources.spool-src.batchSize = 1
agent.channels.mem-channel.type = memory
agent.channels.mem-channel.capacity = 10
agent.channels.mem-channel.transactionCapacity = 4
agent.sinks.fileout.channel = mem-channel
agent.sinks.fileout.type = file_roll
agent.sinks.fileout.sink.directory = /tmp/flume-output
agent.sinks.fileout.sink.rollInterval = 0
agent.sinks.fileout.batchSize = 1
I've tested this config with different values for the channel capacity & transaction capacity (e.g., 3 and 3), but haven't found a situation where the channel capacity is full and Flume is able to recover.
On the Flume mailing list someone told me it was probably this bug that affected my proof of concept. The bug entails that the batch size is effectively 100, even though it's specified differently in the config. I re-ran the test with the source and sink batchSizes set to 100, the memory channel transactionCapacity set to 100, and its capacity set to 300. With those values, the proof of concept works exactly as expected.
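For completeness, the working configuration differs from the one above only in these sizing lines (values taken from the description; everything else stays the same):
agent.sources.spool-src.batchSize = 100
agent.channels.mem-channel.capacity = 300
agent.channels.mem-channel.transactionCapacity = 100
agent.sinks.fileout.batchSize = 100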
