How to replace arbitrary values of a variable with sequential values? - foreach

For example, a variable has the values 1 2 3 5 10 11 12 13 14 20 21 ...
I want to replace them with 1 2 3 4 5 6 7 8 9 10 11 ...
The variable is district, and I want to replace its values with the correct sequential values. I was using this command, but it is not giving the desired results:
levelsof district, local(district_new)
foreach i in `district_new'{
replace district= mod(_n-1,707)+1
}

I'm not fully sure what you are trying to do, but does this solve it?
sort district
replace district = _n
This will replace the values in district with 1 for the lowest current value, 2 for the second lowest value, etc. It might not be a good solution if your variable has duplicates, because tied observations would then receive different numbers.

I agree with @TheIceBear, but more can be said that won't fit easily into comments.
The particular code posted boils down to a single statement repeated
replace district = mod(_n-1,707) + 1
as that action is repeated regardless of the values of district. In a dataset with 707 or fewer observations, that in turn would be equivalent to
replace district = _n
as @TheIceBear points out. If there were duplicate observations on any district, this would definitely be a bad idea, and something like
egen newid = group(district), label
would be a better idea. For more, see https://www.stata.com/support/faqs/data-management/creating-group-identifiers/
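For readers more at home in a general-purpose language, the idea behind a group identifier can be sketched in Python. This is only an illustration of the logic, not Stata code, and `group_ids` is a hypothetical helper name: every distinct value gets its 1-based position among the sorted distinct values, so duplicates share one identifier.

```python
def group_ids(values):
    """Map each value to its 1-based rank among the sorted distinct values."""
    # Build the lookup once: smallest distinct value -> 1, next -> 2, ...
    rank = {v: i for i, v in enumerate(sorted(set(values)), start=1)}
    # Duplicates look up the same rank, so they keep a shared identifier.
    return [rank[v] for v in values]


districts = [1, 2, 3, 5, 10, 10, 11, 20]
print(group_ids(districts))  # [1, 2, 3, 4, 5, 5, 6, 7]
```

Note how the two observations with district 10 both become 5, which is what sorting and numbering by observation would not give you.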

Related

Google Sheet: How to determine where I should put arrayformula?

I understand some basic usages of Arrayformula, but when it comes to complex formulas, I often get confused and don't know where to put it.
Products:
ID | Name | init Stock | Current Stock
23 | Bag | 24 | What arrayformula should I put in this cell?
43 | Book | 45 | =C3 + SUM(filter('Records'!C2:C,'Records'!A2:A = A3,'Records'!B2:B = "in")) - SUM(filter('Records'!C2:C,'Records'!A2:A = A3,'Records'!B2:B = "out")) //a normal formula
31 | Table | 42 | =ARRAYFORMULA(C2:C + SUM(filter('Records'!C2:C,'Records'!A2:A = A2:A,'Records'!B2:B = "in")) - SUM(filter('Records'!C2:C,'Records'!A2:A = A2:A,'Records'!B2:B = "out")) //This doesn't work
Records:
ID | in/out | quantity
23 | in | 1
43 | in | 34
31 | out | 5
23 | out | 13
23 | in | 14
23 | in | 111
I am using the above tables to track product stock. When a new in/out record is added to the Records table, the value in Current Stock should change accordingly.
The table above shows my attempt, but it doesn't work: it returns an error saying the FILTER ranges mismatch. I guess I will have to wrap another ARRAYFORMULA around SUM and/or FILTER. This is where the confusion starts.
How do I determine where I should put another ARRAYFORMULA?
As far as I understand, inside an ARRAYFORMULA some functions that would normally take a single value as a parameter can take an array instead, but others can't. How do I know which functions behave this way?
I'm no expert, so I can't fully explain how to use ARRAYFORMULA, but it always gets tricky when you need to use it with formulas that already include ranges. I recommend looking into BYROW and BYCOL; they apply a formula to a whole range row by row or column by column. Try this:
=BYROW(A2:C, LAMBDA(row, INDEX(row,1,3) + SUM(IFERROR(FILTER(Records!C2:C, Records!A2:A = INDEX(row,1,1), Records!B2:B = "in"), 0)) - SUM(IFERROR(FILTER(Records!C2:C, Records!A2:A = INDEX(row,1,1), Records!B2:B = "out"), 0))))
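To make the intended per-row calculation explicit, here is a minimal Python sketch of the same bookkeeping, using the sample data from the question. It is not a Sheets formula, and `current_stock` is a hypothetical name: for each product, the current stock is the initial stock plus all "in" quantities minus all "out" quantities recorded for that ID.

```python
from collections import defaultdict


def current_stock(products, records):
    """Return {product_id: init stock + sum of 'in' - sum of 'out'}."""
    net = defaultdict(int)
    for product_id, direction, quantity in records:
        if direction == "in":
            net[product_id] += quantity
        elif direction == "out":
            net[product_id] -= quantity
    # Products without any record simply keep their initial stock.
    return {pid: init + net[pid] for pid, _name, init in products}


products = [(23, "Bag", 24), (43, "Book", 45), (31, "Table", 42)]
records = [
    (23, "in", 1), (43, "in", 34), (31, "out", 5),
    (23, "out", 13), (23, "in", 14), (23, "in", 111),
]
print(current_stock(products, records))  # {23: 137, 43: 79, 31: 37}
```

The point of the sketch is that the sum has to be evaluated once per product row, which is exactly what a row-by-row function provides and what a single SUM inside ARRAYFORMULA does not.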

Filter based on Unique Values that only match certain criteria

This may be beyond my skill level in Google Sheets, and it's certainly straining my brain to think through. I have two columns out of a large spreadsheet (30000 lines or so), and I need to find matches between unique values in one list and non-unique but specific values ONLY in another list. That is, I need the following list to return only the values on the left that have a 3 in the right column every time that value appears on the left, not just in a specific instance.
"Unique" Identifier (can repeat) | Value
1 | 2
2 | 3
3 | 2
4 | 2
5 | 3
6 | 2
1 | 2
2 | 2
3 | 2
4 | 2
5 | 2
6 | 2
I have the following formula, mocked up from a couple of other answers, but it doesn't get me all the way there:
=UNIQUE(FILTER(A2:A,B2:B>0))
How can I get it to exclude the ones that have, for instance, both a 2 and a 3 in the right column for the same value in the left column?
Edit: To put it in more real terms (I was trying to keep it abstract so I could understand the basics), I have a Catalog ID and a Condition for items, and need to find all Catalog IDs that only have Good copies, not any Very Good copies. This link should show what I want to achieve:
https://docs.google.com/spreadsheets/d/e/2PACX-1vSjenkDS2Mk3t4kTcDoJqSc8AV6ONu4Q17K1HPaIUdJkb7dhdnbAt-CzUxGO3ZoJISNpGajUtFTGz8c/pubhtml?gid=0&single=true
"to return only the values on the left that had a 3 in the right column every time"
Try:
=UNIQUE(FILTER(A:A; B:B=3))
update 1:
=UNIQUE(FILTER(Sheet1!A:A; Sheet1!B:B="Good"))
update 2:
=UNIQUE(FILTER(Sheet1!A:A, Sheet1!B:B="Good",
NOT(COUNTIF(FILTER(Sheet1!A:A, Sheet1!B:B<>"Good"), Sheet1!A:A))))
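The "every time" condition in the last formula can be illustrated with a short Python sketch. This is an illustration of the logic only, with `ids_with_only` as a hypothetical helper: an identifier is kept only if all of its rows carry the wanted value, and it is dropped as soon as any row carries something else.

```python
def ids_with_only(rows, wanted):
    """Return identifiers whose value equals `wanted` on every row."""
    all_match = {}
    for key, value in rows:
        # Stays True only while every row seen so far for this key matches.
        all_match[key] = all_match.get(key, True) and value == wanted
    return [key for key, ok in all_match.items() if ok]


# The sample rows from the question, as (identifier, value) pairs.
rows = [(1, 2), (2, 3), (3, 2), (4, 2), (5, 3), (6, 2),
        (1, 2), (2, 2), (3, 2), (4, 2), (5, 2), (6, 2)]
print(ids_with_only(rows, 3))  # []
print(ids_with_only(rows, 2))  # [1, 3, 4, 6]
```

In the sample data no identifier has a 3 on every row, so the first call returns an empty list; identifiers 2 and 5 are excluded from the second call because they also appear with a 3.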

Google Sheets - how to find a sum of 3 higher values from the range

How can I find the sum of the 3 highest values from a range of 6 that are in one row? E.g. we have integer values in A1:A6 like 2 5 7 4 9 9. It should sum 9+9+7, so 25.
Is this possible with a formula or something similar?
Take a look at the answer to "Extracting the top five maximum unique values".
That should provide you with a basic mechanism (QUERY) to get the top 3 values. Then apply the SUM function to that result.
So, in your case, you would want:
=SUM(QUERY(A1:A6,"select A order by A desc limit 3",-1))
Here's another one:
=SUM(ARRAY_CONSTRAIN( SORT(A1:A6,1,0),3,1))
Shorter version, which applies to the entire column (though A:A could be limited to A1:A6):
=large(A:A,1)+large(A:A,2)+large(A:A,3)
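As a cross-check of the expected result, the same "sum of the n largest values" logic can be sketched in Python with the standard library. `sum_of_largest` is a hypothetical name, and the list holds the sample values from the question.

```python
import heapq


def sum_of_largest(values, n=3):
    """Sum the n largest values; duplicates count separately."""
    return sum(heapq.nlargest(n, values))


print(sum_of_largest([2, 5, 7, 4, 9, 9]))  # 25
```

Both 9s are counted, which matches the 9+9+7 the question asks for.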

Formula to check if one cell is within a range between two cells

I'm trying to search through two columns with a given value. For example:
A (values) | B
0-2 | 275
3-4 | 285
5-6 | 295
7-8 | 305
9-10 | 330
Now say I have 3 as a given value. I would like to compare it with the range of values in A, so in a logical sense it would fall under 3-4 and return 285.
I think Vlookup would take part ... maybe an if statement.
It may be simpler to change your A values to the lower bound of each range (0, 3, 5, 7, 9) and use a formula like:
=vlookup(D1,A:B,2)
In which case any value greater than 9 would also return 330 (unless, say, an IF clause precludes that).
vlookup without a fourth parameter makes inexact matches (as well as exact ones) and, when the first column of the lookup range is sorted ascending, will choose the match for the highest value that is less than or equal to the search_key.
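The approximate-match behaviour described above can be sketched in Python with a binary search. This assumes the A column has been reduced to the lower bound of each range, and the names `LOWER_BOUNDS`, `RESULTS` and `approximate_lookup` are hypothetical.

```python
from bisect import bisect_right

LOWER_BOUNDS = [0, 3, 5, 7, 9]        # lower bound of each range, sorted ascending
RESULTS = [275, 285, 295, 305, 330]   # value returned for each range


def approximate_lookup(key):
    """Return the result for the highest lower bound that is <= key."""
    index = bisect_right(LOWER_BOUNDS, key) - 1
    if index < 0:
        # Mirrors a failed lookup when the key is below the first bound.
        raise LookupError(f"{key} is below the smallest lower bound")
    return RESULTS[index]


print(approximate_lookup(3))   # 285
print(approximate_lookup(12))  # 330
```

As with the formula, any key above the last lower bound falls into the last range, so 12 still returns 330 unless an explicit upper limit is checked first.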
Does this formula work as you want:
=LOOKUP(3,ARRAYFORMULA(VALUE(LEFT(FILTER(A:A,LEN(A:A)),SEARCH("-",FILTER(A:A,LEN(A:A)))-1))),FILTER(B:B,LEN(B:B)))
In addition, if you use 'closed ranges' you can try something like:
=ArrayFormula(VLOOKUP("3", {REGEXEXTRACT(A2:A6, "(\d+)-"), B2:B6}, 2, 1))

Apache Pig: Join records by shifting

I have records of type:
time | url
==========
34 google.com
42 cnn.com
54 yahoo.com
64 fb.com
I want to add another column to these records time_diff which basically takes the difference of the time of the current record with the previous record. Output should look like:
time | url | time_diff
======================
34 google.com -- <can drop this row>
42 cnn.com 08
54 yahoo.com 12
64 fb.com 10
If I can somehow add another column (same as time) shifting the time by one such that 42 is aligned with 34, 54 is aligned with 42 and so on, then I can take the difference between these columns to calculate time_diff column.
I can project the time column to a new variable T and if I can drop the first record in the original data, then I can join it with T to obtain the desired result.
I appreciate any help. Thanks!
See this question, for example. You'll need to get your tuples in a bag (using GROUP ... ALL in your case), and then in a nested FOREACH, ORDER them and call a UDF to rank them. After you have this rank, you can FLATTEN the bag back out into a set of tuples again, and you'll have three fields: time, url, and rank. Once you have this, create a fourth column which is rank-1, do a self-join on those latter two columns, and you'll have what you need to compute the time_diff.
Since multiple records can have the same time, it would be a good idea to also sort on url so that you are guaranteed the same result every time.
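The rank and self-join steps above can be sketched outside Pig. The following Python version is only an illustration of the logic, with `time_diffs` as a hypothetical name: records are sorted on time and then url, each one is given a rank, and every record is paired with the record whose rank is one lower.

```python
def time_diffs(records):
    """Pair each (time, url) record with its predecessor and return the time difference."""
    ordered = sorted(records)  # tuples sort by time first, then by url
    by_rank = dict(enumerate(ordered))
    result = []
    for rank, (time, url) in by_rank.items():
        previous = by_rank.get(rank - 1)  # the "self-join" on rank - 1
        if previous is not None:          # the first record has no predecessor
            result.append((time, url, time - previous[0]))
    return result


records = [(34, "google.com"), (42, "cnn.com"), (54, "yahoo.com"), (64, "fb.com")]
print(time_diffs(records))
# [(42, 'cnn.com', 8), (54, 'yahoo.com', 12), (64, 'fb.com', 10)]
```

The first record is dropped because it has no match in the join, which corresponds to the row the question says can be dropped.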
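I think you can use the Over UDF from PiggyBank with its 'lag' function, which returns the value from the previous record ('lead' would return the next one). Something like the following might work.
-- the field names and types here are assumed from the question
A = LOAD 'T' AS (time:int, url:chararray);
B = GROUP A ALL;
C = FOREACH B {
    C1 = ORDER A BY time;
    -- Stitch appends the lagged time as a third field ($2)
    GENERATE FLATTEN(Stitch(C1, Over(C1.time, 'lag')));
}
D = FOREACH C
    GENERATE stitched::time AS time,
             stitched::url AS url,
             stitched::time - $2 AS time_diff;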
https://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/evaluation/Over.html
