Regression fit Issue: could not convert string to float: '-'

I am trying to convert the data types of some columns in a pandas DataFrame. My data has the following columns: user id, age, gender, marital status and prod code. I want to convert the columns shown below to float. I used the .replace function to change the values inside the entries.
   ITEM_ID  prod_code        id  gender  age  marital_status
0        1          0  156873.0       -    -               -
1        2          1  156872.0       0   29               -
2        3          2  156871.0       0   24               -
3        4          3  156870.0       0   25               -
4        5          4  156869.0       0   23               -
Got the following error:
ValueError: could not convert string to float: '-'

If you only want to change the data type of a column, you can use the code below:
df['name of the column'] = df['name of the column'].astype('float64')
However, the problem you are facing here is a different one.
Some of the columns contain string values such as '-', which cannot be converted to float. You have to replace them with missing values first and, if needed, fill them in later.
To do this, you can use the code below (with numpy imported as np):
df['name of the column'] = df['name of the column'].replace('-', np.nan)
To catch this sort of issue early, use df.info() or df.dtypes to see the type of each column. If a column should contain only numbers but its dtype is object, it holds non-numeric values like this one that you have to check.
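As a rough illustration of the steps above, here is a minimal sketch. The sample frame is hypothetical (it only mimics the shape of the data in the question), and pd.to_numeric with errors="coerce" is shown as an alternative that the answer itself does not mention.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the question's data: '-' marks a missing entry.
df = pd.DataFrame({
    "id": [156873.0, 156872.0, 156871.0],
    "gender": ["-", "0", "0"],
    "age": ["-", "29", "24"],
})

# Option 1 (the approach from the answer): replace the placeholder
# with NaN, then cast the column to float.
df["age"] = df["age"].replace("-", np.nan).astype("float64")

# Option 2 (an alternative, not from the answer): let pandas turn
# anything non-numeric into NaN in a single step.
df["gender"] = pd.to_numeric(df["gender"], errors="coerce")

# Both columns should now report float64 instead of object.
print(df.dtypes)
```

The NaN values left behind can then be filled or dropped (for example with fillna or dropna) before fitting the regression.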

Related

Google Sheet - It's possible to array sum function in the following condition?

Would it be possible to use ARRAYFORMULA for this condition?
Sum all the rows that have the same PID; the result should be as in the image.
I tried this formula, but I think it's too long, and it would not work if a PID spans more than 20 rows.
=IF(A3<>A2,BJ3+IF(A3=A4,BJ4,0)+IF(A3=A5,BJ5,0)+IF(A3=A6,BJ6,0)+IF(A3=A7,BJ7,0)+IF(A3=A8,BJ8,0)+IF(A3=A9,BJ9,0)+IF(A3=A10,BJ10,0)+IF(A3=A11,BJ11,0)+IF(A3=A12,BJ12,0)+IF(A3=A13,BJ13,0)+IF(A3=A14,BJ14,0)+IF(A3=A15,BJ15,0)+IF(A3=A16,BJ16,0)+IF(A3=A17,BJ17,0)+IF(A3=A18,BJ18,0)+IF(A3=A19,BJ19,0)+IF(A3=A20,BJ20,0)+IF(A3=A21,BJ21,0)+IF(A3=A22,BJ22,0),0)
With a table like this:
ID  Value
1   5
1   10
2   5
2   10
2   15
You have an expected output of:
ID  Value  Sum
1   5      15
1   10     blank
2   5      30
2   10     blank
2   15     blank
This is achievable with the following formula (enter it in the first row of your sum column and drag it down):
=IF(A2=A1,"",SUMIFS(B$2:B$12,A$2:A$12,A2))
It checks whether the ID is the same as in the row above. If it is not, it sums all values with that ID, so the sum is shown only on the row where the ID first appears.
Found it on Google by searching "google sheets sum group by".
The following in C2 will generate the required answer without any copying-down required:
=arrayformula(if(len(A2:A),ifna(vlookup(row(A2:A),query({row(A2:B),A2:B},"select min(Col1),sum(Col3) where Col2 is not null group by Col2"),2,false)),))
We build a lookup table of grouped sums keyed by the first row of each 'P#' group using QUERY, then use VLOOKUP to place each group sum on the first row of its group. It is probably also doable with a SCAN/OFFSET combination, I think.
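For readers who want to check the expected output outside of Sheets, the logic behind both formulas above can be sketched in Python. This is only an illustration; the function name group_sums_on_first_row is made up for this example.

```python
from collections import defaultdict


def group_sums_on_first_row(ids, values):
    """Return one entry per row: the group total on the first row
    of each ID, and an empty string on every later row of that ID."""
    # First pass: total per ID (the SUMIFS / "group by" part).
    totals = defaultdict(int)
    for row_id, value in zip(ids, values):
        totals[row_id] += value

    # Second pass: show the total only where the ID first appears.
    seen = set()
    result = []
    for row_id in ids:
        if row_id in seen:
            result.append("")
        else:
            seen.add(row_id)
            result.append(totals[row_id])
    return result


ids = [1, 1, 2, 2, 2]
values = [5, 10, 5, 10, 15]
print(group_sums_on_first_row(ids, values))  # [15, '', 30, '', '']
```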

How to replace arbitrary values of a variable with sequential values?

For example, a variable has the values 1 2 3 5 10 11 12 13 14 20 21 ....
I want to replace them with 1 2 3 4 5 6 7 8 9 10 11 .....
The old variable is district, and I want to replace its values with the correct sequential values.
I was using this code, but it is not giving the desired results:
levelsof district, local(district_new)
foreach i in `district_new'{
replace district= mod(_n-1,707)+1
}
I am not fully sure what you are trying to do, but would this solve it?
sort district
replace district = _n
This will replace the values in district with 1 for the lowest current value, 2 for the second lowest value, and so on. This is not a good solution if your variable has duplicate values.
I agree with @TheIceBear, but more can be said that won't fit easily into comments.
The particular code posted boils down to a single statement repeated
replace district = mod(_n-1,707) + 1
as that action is repeated regardless of the values of district. In a dataset with 707 or fewer observations, that in turn would be equivalent to
replace district = _n
as @TheIceBear points out. If there were duplicate observations on any district, this would definitely be a bad idea, and something like
egen newid = group(district), label
would be a better idea. For more, see https://www.stata.com/support/faqs/data-management/creating-group-identifiers/
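To make the difference between the two approaches concrete, here is a small Python sketch of the group-identifier idea: every distinct value gets the next sequential number, so duplicates share an id. It illustrates the concept only and is not how Stata implements egen group(); the name sequential_ids is made up.

```python
def sequential_ids(values):
    """Map arbitrary values to 1, 2, 3, ... in sorted order,
    giving duplicate values the same id."""
    rank = {v: k for k, v in enumerate(sorted(set(values)), start=1)}
    return [rank[v] for v in values]


# Hypothetical district codes with gaps and duplicates.
district = [1, 2, 3, 5, 10, 10, 11, 20, 5]
print(sequential_ids(district))  # [1, 2, 3, 4, 5, 5, 6, 7, 4]
```

Numbering by observation order instead (the replace district = _n approach) would give the two rows with value 10 different ids, which is why it breaks down with duplicates.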

Collect value that is below other values

I'm trying to figure out how to collect the value that is always in LINE 9 of texts with this same template:
Aposta
Sport: 11.718.177
Compartilhar
Feita por
Privado
em 25/06/2021 às 10:04
Vitória
10:04 25/06/2021
Katerina Siniakova - Sorribes Tormo, Sara
2nd set jogo 6 - vencedor
Vitória
Katerina Siniakova
1,30
2-0
In this case, the value of LINE 9 is:
Vitória
I tried to use:
=TRANSPOSE(SPLIT(A1,"
"))
After creating a column with the separate values, I tried using QUERY to remove the first lines of the text and LIMIT 9 to keep only the value of row 9, but QUERY joins the values from other lines and ends up giving a wrong value.
Note: I will need to use this to analyze texts like this one in several different rows of column A, so I am looking for an option that also works as an array formula, so that I don't need to put a different formula on each row.
This will give you the 9th column of an array split on line feeds (CHAR(10)):
=INDEX(SPLIT(A2:A,CHAR(10),0,0),,9)
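The same idea can be sketched in Python: split each cell's text on the newline character and pick the entry at position 9. The helper name nth_line and the sample cells are hypothetical and only stand in for the contents of column A.

```python
def nth_line(text, n):
    """Return the n-th line (1-based) of text, or '' if there are fewer lines.
    Empty lines are kept, so positions match the original layout."""
    lines = text.split("\n")
    return lines[n - 1] if len(lines) >= n else ""


# Hypothetical stand-ins for cells A2, A3, A4.
cells = [
    "\n".join(f"row1-line{k}" for k in range(1, 13)),
    "\n".join(f"row2-line{k}" for k in range(1, 13)),
    "too\nshort",
]
print([nth_line(cell, 9) for cell in cells])
```

Keeping empty lines corresponds to passing 0 as the last argument of SPLIT in the formula above; dropping them would shift the position of every later line.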

Negative References or reversing order of column for DATEDIF

I have a list of irregular dates, sorted in ascending order, in column A:A:
A           B   C   D (A:A,A2:A)  E (A:A,A3:A)
2017-11-09  10  10  NA            NA
2017-11-10  11  21  1             NA
2017-11-14  15  36  4             5
2017-11-15  22  58  1             5
Column C:C is a running sum of B:B. I'm trying to get an array formula in D:D/E:E that finds the DATEDIF between the current row and the row X rows above it:
=ArrayFormula(DATEDIF(B:B-(X Rows),B:B,"D"))
The goal is to find the rate of change in C:C over X days:
(C:C - C:C-rowX) / datedif(A:A-rowX, A:A)
e.g. for 2 days on row C4:
(C4-C2) / datedif(C4-2,C4,"D")
(58-21) / datedif(C2,C4,"D")
37 / 5 = 7.4
for 5 days on row C10:
(C10-C5) / datedif(C10-5,C10,"D")
for 15 days on row C20:
(C20-C5) / datedif(C20-15,C20,"D")
I'm trying to calculate X for 1,2,3,4,7,28 rows up which means the array has to start that 1,2,3,4,7,28 rows down.
Right now, the array fails with a bad reference because the first start date is DATEDIF(B-X,B1,"D"), where B-X is an invalid negative reference. Array formulas with bad values, as opposed to bad references, seem to just skip past the errors and start working once the inputs are valid, but I can't figure out how to skip bad references. I've tried forcing the start date with INDIRECT, but can't get it to recognize the value as a date. I also tried DATEDIF(B:B, B:B+X,"D"), which gives the correct numbers, but the results are offset by X rows. I've also tried reverse sorting A:A with =ArrayFormula(if(len(A:A),DATEDIF(SORT(A2:A,1,0),SORT(A:A,1,0),"D"),"")); it produces a reverse-ordered list of correct answers that I can't figure out how to flip back.
Seems like I'm missing something obvious?
EDIT: tried to clarify original post
Is there an easy way to displace an entire column?
Alternative Solution?
The formula roughly works but is not aligned to the correct row:
C D E
1 2 3
1 2 3
1 2 3
1 2
1
I just need it to display
C D E
1
1 2
1 2 3
1 2 3
1 2 3
To get things aligned, I can put this in a cell in row 2 of column F:
=array_constrain(ARRAYFORMULA(D:D),COUNT(A:A)-2,1)
Or this in a cell in row 3 of column G:
=array_constrain(ARRAYFORMULA(E:E),COUNT(A:A)-3,1)
But if I try to trigger the formula from row 1 via:
=arrayformula(if(row(A:A)>=2,array_constrain(D:D,COUNT(A:A)-2,1)))
it outputs false where the row >= 2 condition is not met and still renders D:D without shifting the cells down by the proper number of rows:
C D
1 false
1 2
1 2
1 2
1
EDIT: I'm closing the request. I ended up just using vlookup(B:B-X), which gave a result that is approximate but good enough for my needs.
Short answer
Add the following formula to D1
=ArrayFormula({"N/A";ARRAY_CONSTRAIN(DATEDIF(A:A,A2:A,"D"),COUNT(A:A)-1,1)})
And the following formula to E1
=ArrayFormula({"N/A";"N/A";ARRAY_CONSTRAIN(DATEDIF(A:A,A3:A,"D"),COUNT(A:A)-2,1)})
Explanation
The solution uses ARRAY_CONSTRAIN to return just the required result values, and uses array notation ({...;...}) to add the required N/A values for the rows that don't have a pair to calculate the date difference with.
REMARK:
Please note that the DATEDIF functions reference column A, as this is the column that holds the date values.
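As a cross-check of the numbers in the question, the lagged date difference and the resulting rate of change can be sketched in Python. The function names are made up for this illustration; rows that have no partner X rows above get None, which plays the role of the N/A padding in the formulas above.

```python
from datetime import date


def lagged_day_diff(dates, lag):
    """Days between each date and the date `lag` rows above it.
    The first `lag` rows have no partner and get None."""
    return [
        None if i < lag else (dates[i] - dates[i - lag]).days
        for i in range(len(dates))
    ]


def rate_of_change(dates, totals, lag):
    """Change in the running total per day over the last `lag` rows."""
    days = lagged_day_diff(dates, lag)
    return [
        None if d is None or d == 0 else (totals[i] - totals[i - lag]) / d
        for i, d in enumerate(days)
    ]


# Columns A and C from the table in the question.
dates = [date(2017, 11, 9), date(2017, 11, 10), date(2017, 11, 14), date(2017, 11, 15)]
running = [10, 21, 36, 58]

print(lagged_day_diff(dates, 1))          # [None, 1, 4, 1]
print(lagged_day_diff(dates, 2))          # [None, None, 5, 5]
print(rate_of_change(dates, running, 2))  # [None, None, 5.2, 7.4]
```

The last value, 7.4, matches the worked example in the question: (58-21) / 5.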

Excel: Count no. of times last value in a row occured

I'm sorry, I am unable to paste the table here as my work laptop's security doesn't let me.
I have a row with repeating values, e.g. columns B to BI containing 2s, 3s, 1s, and 3s again.
The value in the last column is 3. I want to count for how many of the last columns the value has been 3 since it changed from something else.
For example, if the row looks like
2 2 3 3 3 1 1 2 2 3 3 2 2 2 2 2, then the answer I want is 5, because the last value is 2 and it has been there for the last 5 columns.
I hope it makes sense.
Thank you,
Parul.
You can do it by creating a UDF (User Defined Function) in VBA like this one:
Function CountLast(x As Range, y As Integer) As Long
    Dim i As Long, n As Long
    n = 0
    For i = x.Cells.Count To 1 Step -1 'Start with the last cell in range x and work from right to left
        If x.Cells(i).Value <> y Then 'Compare each cell with the value provided in y and leave the loop at the first mismatch
            Exit For
        End If
        n = n + 1 'Count how many times the value is found
    Next i
    CountLast = n 'Return the counted value
End Function
Then you would use it like this:
=CountLast(B1:BI1,BI1)
For the example data that you provide in your question I used:
=CountLast(A1:P1,P1)
and the resulting answer is 5
The UDF finds the last cell in the range and, starting there, works from right to left (Step -1), comparing each cell with the value you pass in. It keeps counting as long as the cells match and returns the count at the end.
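The counting logic described above can also be sketched in Python for a quick sanity check. Here the comparison value is taken from the last element itself rather than passed in separately, which is an assumption made for brevity; the name count_trailing_run is made up.

```python
def count_trailing_run(values):
    """Count how many consecutive entries at the end of the list
    are equal to the last entry."""
    if not values:
        return 0
    last = values[-1]
    count = 0
    # Walk from right to left and stop at the first different value.
    for value in reversed(values):
        if value != last:
            break
        count += 1
    return count


# The example row from the question.
row = [2, 2, 3, 3, 3, 1, 1, 2, 2, 3, 3, 2, 2, 2, 2, 2]
print(count_trailing_run(row))  # 5
```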
