I’m using Mongoid (gem 'mongoid', '~> 7.2.4', MongoDB 3.6) with Rails 5, and I have a database with customers and bills collections related like this:
class Bill
...
belongs_to :customer, index: true
...
end
class Customer
....
has_many :bills
...
end
Then, in a pry console, I test with two customers:
[55] pry(main)> c_slow.class
=> Customer
[58] pry(main)> c_slow.bills.count
MONGODB | pro-app-mongodb-05:27017 req:1030 conn:1:1 | db_api_production.aggregate | STARTED | {"aggregate"=>"bills", "pipeline"=>[{"$match"=>{"deleted_at"=>nil, "customer_id"=>BSON::ObjectId('60c76b9e21225c002044f6c5')}}, {"$group"=>{"_id"=>1, "n"=>{"$sum"=>1}}}], "cursor"=>{}, "$db"=>"db_api_production", "$clusterTime"=>{"clusterTime"=>#...
MONGODB | pro-app-mongodb-05:27017 req:1030 | db_api_production.aggregate | SUCCEEDED | 0.008s
=> 523
[59] pry(main)> c_fast.bills.count
MONGODB | pro-app-mongodb-05:27017 req:1031 conn:1:1 | db_api_production.aggregate | STARTED | {"aggregate"=>"bills", "pipeline"=>[{"$match"=>{"deleted_at"=>nil, "customer_id"=>BSON::ObjectId('571636f44a506256d6000003')}}, {"$group"=>{"_id"=>1, "n"=>{"$sum"=>1}}}], "cursor"=>{}, "$db"=>"db_api_production", "$clusterTime"=>{"clusterTime"=>#...
MONGODB | pro-app-mongodb-05:27017 req:1031 | db_api_production.aggregate | SUCCEEDED | 0.135s
=> 35913
Up to this point it seems correct, but when I execute this query:
[60] pry(main)> c_slow.bills.excludes(_id: BSON::ObjectId('62753df4a54d7584e56ea829')).order(reference: :desc).limit(1000).pluck(:reference, :_id)
MONGODB | pro-app-mongodb-05:27017 req:1083 conn:1:1 | db_api_production.find | STARTED | {"find"=>"bills", "filter"=>{"deleted_at"=>nil, "customer_id"=>BSON::ObjectId('60c76b9e21225c002044f6c5'), "_id"=>{"$ne"=>BSON::ObjectId('62753df4a54d7584e56ea829')}}, "limit"=>1000, "sort"=>{"reference"=>-1}, "projection"=>{"reference"=>1, "_id"=>1...
MONGODB | pro-app-mongodb-05:27017 req:1083 | db_api_production.find | SUCCEEDED | 10.075s
MONGODB | pro-app-mongodb-05:27017 req:1087 conn:1:1 | db_api_production.getMore | STARTED | {"getMore"=>#<BSON::Int64:0x0000558bcd7ba5f8 #value=165481790189>, "collection"=>"bills", "$db"=>"db_api_production", "$clusterTime"=>{"clusterTime"=>#<BSON::Timestamp:0x0000558bcd7a4b90 #seconds=1652511506, #increment=1>, "signature"=>{"hash"=><...
MONGODB | pro-app-mongodb-05:27017 req:1087 | db_api_production.getMore | SUCCEEDED | 1.181s
[61] pry(main)> c_fast.bills.excludes(_id: BSON::ObjectId('62753df4a54d7584e56ea829')).order(reference: :desc).limit(1000).pluck(:reference, :_id)
MONGODB | pro-app-mongodb-05:27017 req:1091 conn:1:1 | db_api_production.find | STARTED | {"find"=>"bills", "filter"=>{"deleted_at"=>nil, "customer_id"=>BSON::ObjectId('571636f44a506256d6000003'), "_id"=>{"$ne"=>BSON::ObjectId('62753df4a54d7584e56ea829')}}, "limit"=>1000, "sort"=>{"reference"=>-1}, "projection"=>{"reference"=>1, "_id"=>1...
MONGODB | pro-app-mongodb-05:27017 req:1091 | db_api_production.find | SUCCEEDED | 0.004s
MONGODB | pro-app-mongodb-05:27017 req:1092 conn:1:1 | db_api_production.getMore | STARTED | {"getMore"=>#<BSON::Int64:0x0000558bcd89c4d0 #value=166614148534>, "collection"=>"bills", "$db"=>"db_api_production", "$clusterTime"=>{"clusterTime"=>#<BSON::Timestamp:0x0000558bcd88eab0 #seconds=1652511516, #increment=1>, "signature"=>{"hash"=><...
MONGODB | pro-app-mongodb-05:27017 req:1092 | db_api_production.getMore | SUCCEEDED | 0.013s
The slow customer takes 10 seconds and the fast one 0.004s for the same query, even though the slow customer has fewer than 600 documents and the fast one more than 35,000. It makes no sense to me.
We reindexed the bills collection and ran the query over all customers. It seemed to work at the beginning, but on the second run it went slow again, and the same customers are always slower than the fast ones:
[1] pry(main)> Customer.all.collect do |c|
[1] pry(main)* starting = Process.clock_gettime(Process::CLOCK_MONOTONIC)
[1] pry(main)* c.bills.excludes(_id: BSON::ObjectId('62753df4a54d7584e56ea829')).order(reference: :desc).limit(1000).pluck(:reference_string, :id);nil
[1] pry(main)* ending = Process.clock_gettime(Process::CLOCK_MONOTONIC)
[1] pry(main)* [c.acronym, ending - starting]
[1] pry(main)* end
I cannot call explain on the pluck query. I reviewed the indexes and they are correctly in place on the collection, but running explain on the same query (without the pluck) is also slow:
MONGODB | pro-app-mongodb-05:27017 req:1440 | dbapiproduction.explain | SUCCEEDED | 10.841s
MONGODB | pro-app-mongodb-05:27017 req:2005 | dbapiproduction.explain | SUCCEEDED | 0.006s
Obviously the time differs, but so does docsExamined. The query is the same, only the ids change:
[23] pry(main)> h_slow["queryPlanner"]["parsedQuery"]
=> {"$and"=>
[{"customer_id"=>{"$eq"=>BSON::ObjectId('60c76b9e21225c002044f6c5')}},
{"deleted_at"=>{"$eq"=>nil}},
{"$nor"=>[{"_id"=>{"$eq"=>BSON::ObjectId('62753df4a54d7584e56ea829')}}]}]}
[24] pry(main)> h_fast["queryPlanner"]["parsedQuery"]
=> {"$and"=>
[{"customer_id"=>{"$eq"=>BSON::ObjectId('571636f44a506256d6000003')}},
{"deleted_at"=>{"$eq"=>nil}},
{"$nor"=>[{"_id"=>{"$eq"=>BSON::ObjectId('62753df4a54d7584e56ea829')}}]}]}
"inputStage": {
"advanced": 1000,
"direction": "backward",
"dupsDropped": 0,
"dupsTested": 0,
"executionTimeMillisEstimate": 0,
"indexBounds": {
"reference": [
"[MaxKey, MinKey]"
]
},
"indexName": "reference_1",
"indexVersion": 2,
"invalidates": 0,
"isEOF": 0,
"isMultiKey": false,
"isPartial": false,
"isSparse": false,
"isUnique": false,
"keyPattern": {
"reference": 1
},
"keysExamined": 1000,
"multiKeyPaths": {
"reference": []
},
"nReturned": 1000,
"needTime": 0,
"needYield": 0,
"restoreState": 10,
"saveState": 10,
"seeks": 1,
"seenInvalidated": 0,
"stage": "IXSCAN",
"works": 1000
},
"invalidates": 0,
"isEOF": 0,
"nReturned": 1000,
"needTime": 0,
"needYield": 0,
"restoreState": 10,
"saveState": 10,
"stage": "FETCH",
"works": 1000
},
"invalidates": 0,
"isEOF": 1,
"limitAmount": 1000,
"nReturned": 1000,
"needTime": 0,
"needYield": 0,
"restoreState": 10,
"saveState": 10,
"stage": "LIMIT",
"works": 1001
},
"executionSuccess": true,
"executionTimeMillis": 7,
"nReturned": 1000,
"totalDocsExamined": 1000,
"totalKeysExamined": 1000
}
"inputStage": {
"advanced": 604411,
"direction": "backward",
"dupsDropped": 0,
"dupsTested": 0,
"executionTimeMillisEstimate": 320,
"indexBounds": {
"reference": [
"[MaxKey, MinKey]"
]
},
"indexName": "reference_1",
"indexVersion": 2,
"invalidates": 0,
"isEOF": 1,
"isMultiKey": false,
"isPartial": false,
"isSparse": false,
"isUnique": false,
"keyPattern": {
"reference": 1
},
"keysExamined": 604411,
"multiKeyPaths": {
"reference": []
},
"nReturned": 604411,
"needTime": 0,
"needYield": 0,
"restoreState": 6138,
"saveState": 6138,
"seeks": 1,
"seenInvalidated": 0,
"stage": "IXSCAN",
"works": 604412
},
"invalidates": 0,
"isEOF": 1,
"nReturned": 523,
"needTime": 603888,
"needYield": 0,
"restoreState": 6138,
"saveState": 6138,
"stage": "FETCH",
"works": 604412
},
"invalidates": 0,
"isEOF": 1,
"limitAmount": 1000,
"nReturned": 523,
"needTime": 603888,
"needYield": 0,
"restoreState": 6138,
"saveState": 6138,
"stage": "LIMIT",
"works": 604412
},
"executionSuccess": true,
"executionTimeMillis": 9472,
"nReturned": 523,
"totalDocsExamined": 604411,
"totalKeysExamined": 604411
}
Why do these differences happen, and what can I do to fix this collection?
The goal is to perform linear regression for each user in a scalable way in PySpark. Features: x1 and x2. Output: y
Regression equation (zero intercept): y = m(x1) + n(x2)
Example:
pdf = pd.DataFrame(
{
"user": [1, 1, 1, 2, 2, 2],
"x1": [1, 2, 3, 1, 2, 3],
"x2": [2, 3, 4, 5, 6, 7],
"y": [2, 4, 6, 3, 6, 9],
}
)
df = spark.createDataFrame(pdf)
df.show()
Data looks like:
+----+---+---+---+
|user| x1| x2| y|
+----+---+---+---+
| 1| 1| 2| 2|
| 1| 2| 3| 4|
| 1| 3| 4| 6|
| 2| 1| 5| 3|
| 2| 2| 6| 6|
| 2| 3| 7| 9|
+----+---+---+---+
I have used a Pandas UDF, which works for my use case. Ben Webber explains it in his post.
From his post we can take the following approach (BDR: 6.4; Spark: 2.4.5):
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import statsmodels.api as sm
import pandas as pd
pdf = pd.DataFrame(
{
"user": [1, 1, 1, 2, 2, 2],
"x1": [1, 2, 3, 1, 2, 3],
"x2": [2, 3, 4, 5, 6, 7],
"y": [2, 4, 6, 3, 6, 9],
}
)
df = spark.createDataFrame(pdf)
schema = StructType([StructField('user', DoubleType(), True),
StructField('r_squared', DoubleType(), True)])
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def train_LR(input_pd):
    usr = input_pd.iloc[0]['user']
    # Implement linear regression as per your needs
    model = sm.OLS(input_pd['y'], input_pd[['x1', 'x2']]).fit()
    R_sq = model.rsquared
    return pd.DataFrame({'user': usr, 'r_squared': R_sq}, index=[0])
results = df.groupby('user').apply(train_LR)
display(results)
Note that the UDF receives a pandas DataFrame containing only the data for each group.
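If you also need the fitted coefficients m and n from the zero-intercept model y = m(x1) + n(x2), the same pattern can return them alongside r_squared. A sketch extending the UDF above (the extra schema, the column names m and n, and the float cast are my own choices, not from the original post):
schema_coef = StructType([StructField('user', DoubleType(), True),
                          StructField('m', DoubleType(), True),          # coefficient of x1
                          StructField('n', DoubleType(), True),          # coefficient of x2
                          StructField('r_squared', DoubleType(), True)])

@pandas_udf(schema_coef, PandasUDFType.GROUPED_MAP)
def train_LR_coefs(input_pd):
    usr = float(input_pd.iloc[0]['user'])
    # Same zero-intercept OLS fit as above; model.params is indexed by the column names
    model = sm.OLS(input_pd['y'], input_pd[['x1', 'x2']]).fit()
    return pd.DataFrame({'user': usr,
                         'm': model.params['x1'],
                         'n': model.params['x2'],
                         'r_squared': model.rsquared}, index=[0])

results_coefs = df.groupby('user').apply(train_LR_coefs)
results_coefs then has one row per user with that user's coefficients and fit quality.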
My application runs a sampling query periodically and was working fine without any issue for around 30 hours. Then it suddenly gave the following error.
The Java client is unable to access the database and gets the following exception.
Caused by: com.toshiba.mwcloud.gs.common.GSConnectionException: [145028:JC_BAD_CONNECTION] Failed to update by notification (address=/239.0.0.1:31999, reason=Receive timed out)
at com.toshiba.mwcloud.gs.subnet.NodeResolver.updateMasterInfo(NodeResolver.java:815)
at com.toshiba.mwcloud.gs.subnet.NodeResolver.prepareConnectionAndClusterInfo(NodeResolver.java:522)
at com.toshiba.mwcloud.gs.subnet.NodeResolver.getPartitionCount(NodeResolver.java:205)
at com.toshiba.mwcloud.gs.subnet.GridStoreChannel$5.execute(GridStoreChannel.java:2106)
at com.toshiba.mwcloud.gs.subnet.GridStoreChannel.executeStatement(GridStoreChannel.java:1675)
... 38 more
Caused by: java.net.SocketTimeoutException: Receive timed out
Why is this happening? What is the cause?
This is the output of gs_stat -u admin/admin
{
"checkpoint": {
"archiveLog": 0,
"backupOperation": 0,
"duplicateLog": 0,
"endTime": 1580053987745,
"mode": "NORMAL_CHECKPOINT",
"normalCheckpointOperation": 1470,
"pendingPartition": 0,
"periodicCheckpoint": "ACTIVE",
"requestedCheckpointOperation": 0,
"startTime": 1580053987741
},
"cluster": {
"activeCount": 0,
"autoGoal": "ACTIVE",
"clusterName": "defaultCluster",
"clusterRevisionId": "4e9be62e-7911-48a4-8d93-19af09be7a15",
"clusterRevisionNo": 17651,
"clusterStatus": "SUB_CLUSTER",
"designatedCount": 1,
"loadBalancer": "ACTIVE",
"nodeList": [
{
"address": "10.128.0.2",
"port": 10040
}
],
"nodeStatus": "ABNORMAL",
"notificationMode": "MULTICAST",
"partitionStatus": "INITIAL",
"startupTime": "2020-01-25T15:20:31.377Z",
"syncCount": 0
},
"currentTime": "2020-01-26T17:20:39Z",
"performance": {
"backupCount": 0,
"batchFree": 0,
"bufferHashCollisionCount": 0,
"checkpointFileAllocateSize": 5443584,
"checkpointFileFlushCount": 0,
"checkpointFileFlushTime": 0,
"checkpointFileSize": 5439488,
"checkpointFileUsageRate": 0.927710843373494,
"checkpointMemory": 196608,
"checkpointMemoryLimit": 1073741824,
"checkpointWriteSize": 270139392,
"checkpointWriteTime": 214,
"currentCheckpointWriteBufferSize": 0,
"currentTime": 1580059239771,
"expirationDetail": {
"autoExpire": false,
"erasableExpiredTime": "1970-01-01T00:00:00.000Z",
"latestExpirationCheckTime": "1970-01-01T00:00:00.000Z"
},
"logFileFlushCount": 8832,
"logFileFlushTime": 38224,
"numBackground": 0,
"numConnection": 2,
"numNoExpireTxn": 0,
"numSession": 0,
"numTxn": 0,
"ownerCount": 128,
"peakProcessMemory": 86626304,
"processMemory": 86626304,
"recoveryReadSize": 262144,
"recoveryReadTime": 0,
"recoveryReadUncompressTime": 0,
"storeCompressionMode": "NO_BLOCK_COMPRESSION",
"storeDetail": {
"batchFreeMapData": {
"storeMemory": 0,
"storeUse": 0,
"swapRead": 0,
"swapWrite": 0
},
"batchFreeRowData": {
"storeMemory": 0,
"storeUse": 0,
"swapRead": 0,
"swapWrite": 0
},
"mapData": {
"storeMemory": 131072,
"storeUse": 131072,
"swapRead": 0,
"swapWrite": 0
},
"metaData": {
"storeMemory": 131072,
"storeUse": 131072,
"swapRead": 0,
"swapWrite": 0
},
"rowData": {
"storeMemory": 4784128,
"storeUse": 4784128,
"swapRead": 0,
"swapWrite": 0
}
},
"storeMemory": 5046272,
"storeMemoryLimit": 1073741824,
"storeTotalUse": 5046272,
"swapRead": 0,
"swapReadSize": 0,
"swapReadTime": 0,
"swapReadUncompressTime": 0,
"swapWrite": 0,
"swapWriteCompressTime": 0,
"swapWriteSize": 0,
"swapWriteTime": 0,
"syncReadSize": 0,
"syncReadTime": 0,
"syncReadUncompressTime": 0,
"totalBackupLsn": 0,
"totalLockConflictCount": 0,
"totalOtherLsn": 0,
"totalOwnerLsn": 110220,
"totalReadOperation": 4733,
"totalRowRead": 2325894,
"totalRowWrite": 55108,
"totalWriteOperation": 55108,
"txnDetail": {
"totalBackgroundOperation": 0
}
},
"recovery": {
"progressRate": 1
},
"version": "4.3.0-36424 CE"
}
It seems this is being dealt with separately on GitHub.
Please refer to this issue:
https://github.com/griddb/griddb_nosql/issues/235
I know that the CPU utilization of the container can be obtained by docker stats:
#docker stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
05076af468cd mystifying_kepler 0.02% 10.5MiB / 5.712GiB 0.18% 656B / 0B 0B / 0B 1
And I want to get this data through the HTTP API.
The data I get from this HTTP API is:
{
"read": "2019-11-26T22:18:33.027963669Z",
"preread": "2019-11-26T22:18:32.013978454Z",
"pids_stats": {
"current": 1
},
"blkio_stats": {
"io_service_bytes_recursive": [],
"io_serviced_recursive": [],
"io_queue_recursive": [],
"io_service_time_recursive": [],
"io_wait_time_recursive": [],
"io_merged_recursive": [],
"io_time_recursive": [],
"sectors_recursive": []
},
"num_procs": 0,
"storage_stats": {},
"cpu_stats": {
"cpu_usage": {
"total_usage": 361652820,
"percpu_usage": [361652820],
"usage_in_kernelmode": 50000000,
"usage_in_usermode": 100000000
},
"system_cpu_usage": 144599100000000,
"online_cpus": 1,
"throttling_data": {
"periods": 0,
"throttled_periods": 0,
"throttled_time": 0
}
},
"precpu_stats": {
"cpu_usage": {
"total_usage": 361488978,
"percpu_usage": [361488978],
"usage_in_kernelmode": 50000000,
"usage_in_usermode": 100000000
},
"system_cpu_usage": 144598090000000,
"online_cpus": 1,
"throttling_data": {
"periods": 0,
"throttled_periods": 0,
"throttled_time": 0
}
},
"memory_stats": {
"usage": 11005952,
"max_usage": 11108352,
"stats": {
"active_anon": 11005952,
"active_file": 0,
"cache": 0,
"dirty": 0,
"hierarchical_memory_limit": 9223372036854771712,
"hierarchical_memsw_limit": 9223372036854771712,
"inactive_anon": 0,
"inactive_file": 0,
"mapped_file": 0,
"pgfault": 8151,
"pgmajfault": 0,
"pgpgin": 4137,
"pgpgout": 1450,
"rss": 11005952,
"rss_huge": 0,
"total_active_anon": 11005952,
"total_active_file": 0,
"total_cache": 0,
"total_dirty": 0,
"total_inactive_anon": 0,
"total_inactive_file": 0,
"total_mapped_file": 0,
"total_pgfault": 8151,
"total_pgmajfault": 0,
"total_pgpgin": 4137,
"total_pgpgout": 1450,
"total_rss": 11005952,
"total_rss_huge": 0,
"total_unevictable": 0,
"total_writeback": 0,
"unevictable": 0,
"writeback": 0
},
"limit": 6133108736
},
"name": "/mystifying_kepler",
"id": "05076af468cdeb3d15d147a25e8ccee5f4d029ffcba1d60f14f84e2c9e25d6a9",
"networks": {
"eth0": {
"rx_bytes": 656,
"rx_packets": 8,
"rx_errors": 0,
"rx_dropped": 0,
"tx_bytes": 0,
"tx_packets": 0,
"tx_errors": 0,
"tx_dropped": 0
}
}
}
I was able to calculate the memory utilization from this data, but I didn't find a way to get the CPU utilization.
Any ideas?
You've probably solved this by now, but for the next person... This example is in Python, but the data fields and math are the same if you're making API calls.
The API returns cumulative values, so you need more than one sample - do the math using the difference between samples to get the utilization for that period. This example uses the streaming mode, which pushes an update every second.
import docker

client = docker.from_env()           # requires the Docker SDK for Python (pip install docker)
containerID = "mystifying_kepler"    # target container name or ID (as shown by docker stats above)

# These initial values will seed the "last" cycle's saved values
containerCPU = 0
systemCPU = 0

container = client.containers.get(containerID)

# This call is blocking; the loop will proceed when there's a new update to iterate
for stats in container.stats(decode=True):
    # Save the values from the last sample
    lastContainerCPU = containerCPU
    lastSystemCPU = systemCPU

    # Get the container's usage, the total system capacity, and the number of CPUs.
    # The math returns a Linux-style %util, where 100.0 = 1 CPU core fully used.
    containerCPU = stats.get('cpu_stats', {}).get('cpu_usage', {}).get('total_usage')
    systemCPU = stats.get('cpu_stats', {}).get('system_cpu_usage')
    numCPU = len(stats.get('cpu_stats', {}).get('cpu_usage', {}).get('percpu_usage', []))

    # Skip the first sample (result would be wrong because the saved values are 0)
    if lastContainerCPU and lastSystemCPU:
        cpuUtil = (containerCPU - lastContainerCPU) / (systemCPU - lastSystemCPU)
        cpuUtil = cpuUtil * numCPU * 100
        print(cpuUtil)
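For the HTTP API specifically, the same math can be applied to a single response like the JSON in the question, because it already carries the previous sample under precpu_stats. A minimal sketch (the host/port, the use of requests, and the one-shot stream=false call are assumptions; adapt to however you reach the daemon):
import requests

CONTAINER_ID = "05076af468cd"  # from the docker stats output in the question
URL = "http://localhost:2375/containers/{}/stats?stream=false".format(CONTAINER_ID)

stats = requests.get(URL).json()

cpu_delta = (stats["cpu_stats"]["cpu_usage"]["total_usage"]
             - stats["precpu_stats"]["cpu_usage"]["total_usage"])
system_delta = (stats["cpu_stats"]["system_cpu_usage"]
                - stats["precpu_stats"]["system_cpu_usage"])
online_cpus = (stats["cpu_stats"].get("online_cpus")
               or len(stats["cpu_stats"]["cpu_usage"]["percpu_usage"]))

cpu_percent = (cpu_delta / system_delta) * online_cpus * 100.0 if system_delta > 0 else 0.0
print("CPU %: {:.3f}".format(cpu_percent))
Plugging in the numbers from the JSON above: (361652820 - 361488978) / (144599100000000 - 144598090000000) * 1 * 100 ≈ 0.016%, which lines up with the 0.02% that docker stats reported.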
I would like to compare the results of two runs of my pipeline, i.e. get the diff between JSON records with the same schema but different data.
Run1 JSON
{"doc_id": 1, "entity": "Anthony", "start": 0, "end": 7}
{"doc_id": 1, "entity": "New York", "start": 30, "end": 38} # Missing from Run2
{"doc_id": 2, "entity": "Istanbul", "start": 0, "end": 8}
Run2 JSON
{"doc_id": 1, "entity": "Anthony", "start": 0, "end": 7} # same as in Run1
{"doc_id": 2, "entity": "Istanbul", "start": 0, "end": 10} # different end span
{"doc_id": 2, "entity": "Karim", "start": 10, "end": 15} # added in Run2, not in Run1
Based on the answer to How do I perform a "diff" on two Sources given a key using Apache Beam Python SDK?, my approach has been to make a tuple out of some of the JSON values and then co-group using that composite key.
Is there a better way to diff jsons with beam?
Code based on linked answer:
def make_kv_pair(x):
    """Key the record by a composite (doc_id, entity) tuple."""
    if x and isinstance(x, basestring):
        x = json.loads(x)
    key = tuple(x[dict_key] for dict_key in ["doc_id", "entity"])
    return (key, x)
class FilterDoFn(beam.DoFn):
    def process(self, (key, values)):
        table_a_value = list(values['table_a'])
        table_b_value = list(values['table_b'])
        if table_a_value == table_b_value:
            yield pvalue.TaggedOutput('unchanged', key)
        elif len(table_a_value) < len(table_b_value):
            yield pvalue.TaggedOutput('added', key)
        elif len(table_a_value) > len(table_b_value):
            yield pvalue.TaggedOutput('removed', key)
        elif table_a_value != table_b_value:
            yield pvalue.TaggedOutput('changed', key)
Pipeline code:
table_a = (p | 'ReadJSONRun1' >> ReadFromText("run1.json")
| 'SetKeysRun1' >> beam.Map(make_kv_pair))
table_b = (p | 'ReadJSONRun2' >> ReadFromText("run2.json")
| 'SetKeysRun2' >> beam.Map(make_kv_pair))
joined_tables = ({'table_a': table_a, 'table_b': table_b}
| beam.CoGroupByKey())
output_types = ['changed', 'added', 'removed', 'unchanged']
key_collections = (joined_tables
| beam.ParDo(FilterDoFn()).with_outputs(*output_types))
# Now you can handle each output
key_collections.unchanged | "WriteUnchanged" >> WriteToText("unchanged/", file_name_suffix="_unchanged.json.gz")
key_collections.changed | "WriteChanged" >> WriteToText("changed/", file_name_suffix="_changed.json.gz")
key_collections.added | "WriteAdded" >> WriteToText("added/", file_name_suffix="_added.json.gz")
key_collections.removed | "WriteRemoved" >> WriteToText("removed/", file_name_suffix="_removed.json.gz")
We have IDs that are 15 to 18 characters long, a mix of letters and numbers.
Regularly, we need to perform a COUNTIF() to determine the exact number of unique IDs.
The issue is that sometimes the only difference between one ID or another is whether the case of one letter is upper or lower.
COUNTIF() is not case-sensitive, so we have to apply a very long formula that converts each ID to a unique combination of numbers in a separate column and then perform the COUNTIF() in yet another column.
It is very important that one of the duplicate IDs is marked with 1, as this is key for further processes.
Is there a simpler but still accurate way to do this with a single formula?
The formula in question:
=IFERROR(CODE(MID(AL3,1,1))&CODE(MID(AL3,2,1))&CODE(MID(AL3,3,1))&CODE(MID(AL3,4,1))&CODE(MID(AL3,5,1))&CODE(MID(AL3,6,1))&CODE(MID(AL3,7,1))&CODE(MID(AL3,8,1))&CODE(MID(AL3,9,1))&CODE(MID(AL3,10,1))&CODE(MID(AL3,11,1))&CODE(MID(AL3,12,1))&CODE(MID(AL3,13,1))&CODE(MID(AL3,14,1))&CODE(MID(AL3,15,1))&IFERROR(CODE(MID(AL3,16,1)),""))
Some dummy sample IDs:
003B999992CcVWS
003B999992GdEDo
003B999992D4afI
003B999992CcVWs
003B999992CcVWZ
003B999992D40gR
003B999992D40gR
003B999992CcVWz
Formula's output:
484851665757575757506799868783
48485166575757575750711006968111
4848516657575757575068529710273
4848516657575757575067998687115
484851665757575757506799868790
4848516657575757575068524810382
4848516657575757575068524810382
4848516657575757575067998687122
The desired output can be seen in the last column on the right:
+---+-----------------+----------------------------------+---------+
| # | Account ID | Formula ID | Countif |
+---+-----------------+----------------------------------+---------+
| 1 | 003B999992CcVWS | 484851665757575757506799868783 | 1 |
+---+-----------------+----------------------------------+---------+
| 2 | 003B999992GdEDo | 48485166575757575750711006968111 | 1 |
+---+-----------------+----------------------------------+---------+
| 3 | 003B999992D4afI | 4848516657575757575068529710273 | 1 |
+---+-----------------+----------------------------------+---------+
| 4 | 003B999992CcVWs | 4848516657575757575067998687115 | 1 |
+---+-----------------+----------------------------------+---------+
| 5 | 003B999992CcVWZ | 484851665757575757506799868790 | 1 |
+---+-----------------+----------------------------------+---------+
| 6 | 003B999992D40gR | 4848516657575757575068524810382 | 1 |
+---+-----------------+----------------------------------+---------+
| 7 | 003B999992D40gR | 4848516657575757575068524810382 | 2 |
+---+-----------------+----------------------------------+---------+
| 8 | 003B999992CcVWz | 4848516657575757575067998687122 | 1 |
+---+-----------------+----------------------------------+---------+
={"#", "Account ID", "Formula ID", "Countif";
ARRAYFORMULA({ROW(INDIRECT("A1:A"&COUNTA(A21:A))), ARRAY_CONSTRAIN({A21:A,
IFERROR(CODE(MID(A21:A, 1, 1))&
CODE(MID(A21:A, 2, 1))&CODE(MID(A21:A, 3, 1))&
CODE(MID(A21:A, 4, 1))&CODE(MID(A21:A, 5, 1))&
CODE(MID(A21:A, 6, 1))&CODE(MID(A21:A, 7, 1))&
CODE(MID(A21:A, 8, 1))&CODE(MID(A21:A, 9, 1))&
CODE(MID(A21:A, 10, 1))&CODE(MID(A21:A, 11, 1))&
CODE(MID(A21:A, 12, 1))&CODE(MID(A21:A, 13, 1))&
CODE(MID(A21:A, 14, 1))&CODE(MID(A21:A, 15, 1))&
IFERROR(CODE(MID(A21:A, 16, 1)), )),
IF(LEN(A21:A), MMULT((
IFERROR(CODE(MID(A21:A, 1, 1))&
CODE(MID(A21:A, 2, 1))&CODE(MID(A21:A, 3, 1))&
CODE(MID(A21:A, 4, 1))&CODE(MID(A21:A, 5, 1))&
CODE(MID(A21:A, 6, 1))&CODE(MID(A21:A, 7, 1))&
CODE(MID(A21:A, 8, 1))&CODE(MID(A21:A, 9, 1))&
CODE(MID(A21:A, 10, 1))&CODE(MID(A21:A, 11, 1))&
CODE(MID(A21:A, 12, 1))&CODE(MID(A21:A, 13, 1))&
CODE(MID(A21:A, 14, 1))&CODE(MID(A21:A, 15, 1))&
IFERROR(CODE(MID(A21:A, 16, 1)), )) = TRANSPOSE(
IFERROR(CODE(MID(A21:A, 1, 1))&
CODE(MID(A21:A, 2, 1))&CODE(MID(A21:A, 3, 1))&
CODE(MID(A21:A, 4, 1))&CODE(MID(A21:A, 5, 1))&
CODE(MID(A21:A, 6, 1))&CODE(MID(A21:A, 7, 1))&
CODE(MID(A21:A, 8, 1))&CODE(MID(A21:A, 9, 1))&
CODE(MID(A21:A, 10, 1))&CODE(MID(A21:A, 11, 1))&
CODE(MID(A21:A, 12, 1))&CODE(MID(A21:A, 13, 1))&
CODE(MID(A21:A, 14, 1))&CODE(MID(A21:A, 15, 1))&
IFERROR(CODE(MID(A21:A, 16, 1)), )))) * (ROW(A21:A) >= TRANSPOSE(ROW(A21:A))),
SIGN(ROW(A21:A))), IFERROR(1/0))}, COUNTA(A21:A), 3)})}
How about MMULT?
Case-sensitive COUNTIFS
=ARRAYFORMULA(MMULT(
N(EXACT(A2:A9,TRANSPOSE(A2:A9))),
ROW(A2:A9)^0
))
Case-sensitive COUNTIFS with increment
=ARRAYFORMULA(VLOOKUP(
{ROW(A2:A9)&A2:A9},
{
QUERY(
{ROW(A2:A9)&A2:A9,A2:A9},
"select Col1,Col2 order by Col2 label Col1'',Col2''"
),
TRANSPOSE(SPLIT(TEXTJOIN("|",0,
IF(TRANSPOSE(ROW(A2:A9)-1)<=QUERY(
{A2:A9},
"select count(Col1) where Col1<>'' group by Col1 label count(Col1)''",
),
TRANSPOSE(ROW(A2:A9)-1),
)
),"|"))
},
{3},
))
Updates 2019-09-26 08:01:56
The final formula is
=ARRAYFORMULA(MMULT(
(ROW(A2:A17)>=TRANSPOSE(ROW(A2:A17))) *
EXACT(A2:A17,TRANSPOSE(A2:A17))^1,
ROW(A2:A17)^0
))
Sheet example
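For anyone who wants to sanity-check what the final formula computes: for each row it counts the exact (case-sensitive) matches at or above that row. A quick Python sketch of the same idea on the dummy IDs (purely illustrative, not part of the sheet):
ids = [
    "003B999992CcVWS", "003B999992GdEDo", "003B999992D4afI", "003B999992CcVWs",
    "003B999992CcVWZ", "003B999992D40gR", "003B999992D40gR", "003B999992CcVWz",
]

# Case-sensitive running count: row i counts exact matches among rows 0..i
countif = [sum(x == ids[i] for x in ids[:i + 1]) for i in range(len(ids))]
print(countif)  # [1, 1, 1, 1, 1, 1, 2, 1]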