I am looking for a Python solution for data stored in an Excel file that contains different statements and numbers. I want to filter out rows based on column values.
import pandas as pd
df = pd.read_excel("Data.xlsx")
df = df[df.Numbers.apply(lambda x: str(x).isdigit())]
df.to_excel("Data1.xlsx")
Any suggestions please?
Here is one way to perform the filtering, using pandas' string tools and boolean masks. I did each step separately (easier to test, and easier to understand in the future).
# remove CAS and Cascade
mask = (df['Evaluations'].str.startswith('CAS') |
        df['Evaluations'].str.contains('CASCADE'))
df = df[~mask]
# remove Numbers starting with 21 or 99
mask = (df['Numbers'].astype(str).str.startswith('21') |
        df['Numbers'].astype(str).str.startswith('99'))
df = df[~mask]
# remove rows with a letter as 2nd character (index 1 => zero-based indexing)
# the length check avoids an IndexError on single-character values
mask = df['Numbers'].astype(str).apply(lambda x: len(x) > 1 and x[1].isalpha())
df = df[~mask]
# write to file
with open('Data1.xlsx', 'wb') as handle:
    df.to_excel(handle)
print(df)
                      Evaluations   Numbers
2  Nastolgic behaviours of people  75903324
3                    google drive  76308764
6          Tesla's new inventions  83492836
7                   Electric cars  78363522
1- If in the column named Evaluations, its content starts with "OBS" or has the word "Obsolete" in it then remove these rows
(^OBS|Obsolete)
2- If the column value in the Numbers column start with digits "99" or "51" then remove these rows
^(99|51)
3- If the 5th digit in the Numbers column is an alphabetic character then also remove these rows
^\d{4}[A-Za-z]
These are the Regexes that will help match these conditions.
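A minimal sketch of how these three patterns could be applied in Python, using only the standard `re` module. The sample rows and the helper name `keep_row` are made up for illustration. The third pattern is written with `[A-Za-z]` because `\w` would also match digits and underscores.

```python
import re

# Patterns for the three rules listed above.
EVAL_PATTERN = re.compile(r"^OBS|Obsolete")
PREFIX_PATTERN = re.compile(r"^(99|51)")
FIFTH_LETTER_PATTERN = re.compile(r"^\d{4}[A-Za-z]")


def keep_row(evaluation, number):
    """Return True if the row passes all three rules (hypothetical helper)."""
    number = str(number)
    if EVAL_PATTERN.search(evaluation):
        return False
    if PREFIX_PATTERN.search(number):
        return False
    if FIFTH_LETTER_PATTERN.search(number):
        return False
    return True


# Made-up sample rows: (Evaluations, Numbers)
rows = [
    ("OBS item", "75903324"),
    ("This part is Obsolete", "76308764"),
    ("google drive", "99123456"),
    ("Electric cars", "7836A522"),
    ("Tesla's new inventions", "83492836"),
]
kept = [row for row in rows if keep_row(*row)]
print(kept)
```

With pandas, the same patterns could presumably be passed to `Series.str.contains` to build boolean masks, in the same style as the answer above.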
Let's say I have text in the following format in Column A, imported from another spreadsheet (it is impossible to add = manually because the data is imported automatically and changes):
45+5
45+3
90+2
90+7
Is there any formula that can convert this text into an equation that gives the result of the sum in Column B?
For example:
=ARRAYFORMULA(FUNCTIONTOCONVERTTEXTTOEQUATION(A1:A))
Expected Result:
50
48
92
97
Note: The texts will always be a number after the + sign and then another number.
Given your response to my clarifying question above, let's assume that your raw data is in A2:A. Place the following in the Row-2 cell (e.g., B2) of an otherwise empty column:
=ArrayFormula(IF(A2:A="",,MMULT(IFERROR(TRIM(SPLIT(A2:A,"+"))*1,0),SEQUENCE(COLUMNS(SPLIT(A2:A,"+")),1,1,0))))
MMULT is a powerful yet underused function. I'll include a graphic that explains what it does better than words might:
SPLIT will form the elements of the first matrix, while SEQUENCE will simply create the second matrix consisting of a column of 1's the same length as the number of horizontal elements formed by the SPLIT (which, in your case, will apparently always be 2).
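To make the MMULT idea concrete, here is a rough Python sketch of the same computation: each text cell is split into a row of numbers, and the resulting matrix is multiplied by a column of ones, which yields the row sums. The helper names `split_numbers` and `mmult` are hypothetical stand-ins for the Sheets functions, not part of any library.

```python
def split_numbers(text, sep="+"):
    """Split a string like '45+5' into floats; non-numeric pieces become 0."""
    values = []
    for piece in text.split(sep):
        try:
            values.append(float(piece.strip()))
        except ValueError:
            values.append(0.0)
    return values


def mmult(a, b):
    """Plain matrix product of two lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


cells = ["45+5", "45+3", "90+2", "90+7"]
matrix = [split_numbers(cell) for cell in cells]  # 4 x 2 matrix
ones = [[1.0] for _ in range(len(matrix[0]))]     # 2 x 1 column of ones
sums = [row[0] for row in mmult(matrix, ones)]
print(sums)  # [50.0, 48.0, 92.0, 97.0]
```

Multiplying by a column of ones is just a way of summing across each row, which is why the formula works for every row at once.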
Try, assuming the imported data starts at A1
=arrayformula(sum(value(split(A1,"+"))))
or, in a single formula at the top of the column
=mmult(arrayformula(value(split(A1:A4,"+"))),sequence(2,1,1,0))
I have three columns of information. For example: color, model, year.
Can I use the "unique" instruction to generate in three new columns each unique combination for color, model, year, each in one column?
ex.
color model year
red sedan 2016
red sedan 2020
black truck 2018
Thanks!
Suppose your three headers are in A1, B1 and C1 with your data running A2:C. And suppose you want the unique combinations in E:G. First, be sure that the entire range E:G is empty. Then place the following formula in E1:
=ArrayFormula({A1:C1;SPLIT(FLATTEN(UNIQUE(FILTER(A2:A,A2:A<>""))&"|"&TRANSPOSE(FLATTEN(UNIQUE(FILTER(B2:B,B2:B<>""))&"|"&TRANSPOSE(UNIQUE(FILTER(C2:C,C2:C<>"")))))),"|")})
The formula first reproduces the headers from A1:C1.
The combinations are formed by first concatenating each UNIQUE model (from a list that is FILTERed to remove blanks) with each UNIQUE year (from a list that is also FILTERed to remove blanks), with a pipe symbol between each as a separator that SPLIT will later use.
That grid of combinations is FLATTENed into a single column and then concatenated once more with a UNIQUE and FILTERed list of the colors leading off, and again with a pipe symbol as a separator. Once more, the entire grid of results is FLATTENed into a single column.
Finally, SPLIT acts on the pipe symbols to separate the three pieces into their own columns under the headers.
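For comparison, the same steps can be sketched in Python with `itertools.product`. Note that this builds every combination of the unique values in each column, not just the rows that already exist in the data. The sample rows come from the question; the helper name `unique_in_order` is made up for illustration.

```python
from itertools import product

rows = [
    ("red", "sedan", 2016),
    ("red", "sedan", 2020),
    ("black", "truck", 2018),
]


def unique_in_order(values):
    """Drop blanks and duplicates, keeping first-seen order."""
    return list(dict.fromkeys(v for v in values if v != ""))


colors = unique_in_order(r[0] for r in rows)
models = unique_in_order(r[1] for r in rows)
years = unique_in_order(r[2] for r in rows)

# Every color paired with every model and every year.
combinations = list(product(colors, models, years))
for combo in combinations:
    print(combo)
```

With two colors, two models and three years, this produces 2 x 2 x 3 = 12 combinations.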
try:
=INDEX({A1:C1; UNIQUE(QUERY(SPLIT(FLATTEN(FLATTEN(A2:A&"×"&
TRANSPOSE(B2:B))&"×"&TRANSPOSE(C2:C)), "×"),
"where Col3 is not null"))})
The task is simple: take column A and combine it with transposed column B. Flatten the output into a single column, combine it with transposed column C, and flatten it again into a single column. Then split it and use QUERY to drop any combination that has fewer than 3 filled columns. Finally, run the result through UNIQUE to remove duplicates.
My intention is to convert a single line of data into rows consisting of a specific number of columns in Google Sheets.
For example, starting with the raw data:
     A    B        C        D    E        F
1    id1  attr1-1  attr1-2  id2  attr2-1  attr2-2
And the expected result is:
(by dividing columns by three)
     A    B        C
1    id1  attr1-1  attr1-2
2    id2  attr2-1  attr2-2
I already know that it's possible a bit manually, like:
=ARRAYFORMULA({A1:C1;D1:F1})
But I have to start over with it every time the target range is moved OR the subset size needs to be changed (in the case above it was three)!
So I guess there will be a much more graceful way (i.e. formula does not require manual update) to do the same thing and suspect ARRAYFORMULA() is the key.
Any help will be appreciated!
I added a new sheet ("Erik Help") where I reduced your manually entered parameters from two to one (leaving only # of columns to be entered in A2).
The formula that reshapes the grid:
=ArrayFormula(IFERROR(VLOOKUP(SEQUENCE(ROUNDUP(COUNTA(7:7)/A2),A2),{SEQUENCE(COUNTA(7:7),1),FLATTEN(FILTER(7:7,7:7<>""))},2,FALSE)))
SEQUENCE is used to shape the grid according to whatever is entered in A2. The number of rows is the count of items in Row 7 divided by the number in A2, rounded up to the next whole number; the number of columns is simply whatever number is entered in A2.
Example: If there are 11 items in Row 7 and you want 4 columns, ROUNDUP(11/4)=3 rows to the SEQUENCE and your requested 4 columns.
Then, each of those numbers in the grid is VLOOKUP'ed in a virtual array consisting of a vertical SEQUENCE of ordered numbers matching the number of data pieces in Row 7 (in Column 1) and a FLATTENed (vertical) version of the Row-7 data pieces themselves (in Column 2). Matches are filled into the original SEQUENCE grid, while non-matches are left blank by IFERROR.
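The reshaping logic can be sketched in Python as follows. This is only an illustration of the idea (count the items, round the row count up, fill the grid in order, leave leftovers blank), not a translation of the formula itself; the function name `reshape_row` is hypothetical.

```python
import math


def reshape_row(items, columns):
    """Reshape a flat list into rows of `columns` items, padding with ''."""
    items = [item for item in items if item != ""]  # like FILTER(7:7, 7:7<>"")
    rows = math.ceil(len(items) / columns)           # like ROUNDUP(COUNTA/A2)
    grid = []
    for r in range(rows):
        chunk = items[r * columns:(r + 1) * columns]
        chunk += [""] * (columns - len(chunk))       # blanks, like IFERROR
        grid.append(chunk)
    return grid


# 11 items into 4 columns gives ceil(11 / 4) = 3 rows.
grid = reshape_row(list(range(1, 12)), 4)
for row in grid:
    print(row)
```

The last row is padded with empty strings, mirroring the blank cells the formula leaves when there is no matching item.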
Though it's a bit messy, I managed to get it done thanks to the SEQUENCE() function.
It constructs a grid from a given number of rows and columns, and that was exactly what I was looking for.
For reference set up a sheet with the sample data here:
https://docs.google.com/spreadsheets/d/1p972tYlsPvC6nM39qLNjYRZZWGZYsUnGaA7kXyfJ8F4/edit#gid=0
Use a custom formula
Although you have already solved this, if you do this kind of thing a lot, it could be beneficial to look into Apps Script and custom formulas.
In this case you could use something like:
function transposeSingleRow(range, size) {
  // initialize new range
  let newRange = [];
  // initialize counter to keep track
  let count = 0;
  // start while loop to go through row (range[0])
  while (count < range[0].length) {
    // add a slice of the original range to the new range
    newRange.push(
      range[0].slice(count, count + size)
    );
    // increment counter
    count += size;
  }
  return newRange;
}
Which works like this:
The nice thing about the formula here is that you select the range, and then you put in a number to represent its throw, or how many elements make up a complete row. So if instead of 3 attributes you had 4, instead of calling:
=transposeSingleRow(A7:L7, 3)
you could do:
=transposeSingleRow(A7:L7, 4)
Additionally, if you want this conversion to be permanent and not dependent on formula recalculation, it would be necessary to run it fully in Apps Script without using formulas.
Reference
Apps Script
Custom Functions
I'm trying to sort a list of documents, but I'm having an issue with the documents that have a letter as a suffix.
Whenever we amend a document we add a letter to the end of the number, but when I sort by number in excel it sorts like this:
1
2
3
10
11
1606
1603D
1605B
1606A
1606C
1610A
1623A
20A
220B
390A
399A
402A
415A
450A
488A
557B
How can I make it sort in order of document number and amendment?
Like so:
1
2
3
10
11
1603D
1605B
1606
1606A
1606C
1610A
1623A
20A
220B
390A
399A
402A
415A
450A
488A
557B
As long as you have a mix of text and number, you won't be able to use Excel's built-in sort to achieve the result you describe.
If you append a letter to a number you effectively change the data type from number to text. Text will always be sorted after any number, hence the number 1606 comes before the text 1606A.
You could try to make all values real text, maybe indicate levels by appending digits with dots, like this:
1.
1.0.
1.1.
1.6.0.3.D
1.6.0.5.B
1.6.0.6.
1.6.0.6.A
1.6.0.6.C
1.6.1.0.A
1.6.2.3.A
2.
2.0.A.
2.2.0.B.
3.
3.9.0.A.
3.9.9.A.
4.0.2.A.
4.1.5.A.
4.5.0.A.
4.8.8.A.
5.5.7.B.
But even that does not give you the sort order you describe as the desired result.
Your desired sort order will be hard to achieve even if all values are text, or if you replace the A, B, C with a decimal .1, .2, .3. -- It's really hard to understand why 20 would come after 1623.
The solution I found was to add a column, and copy this formula into each cell:
=IF(ISNUMBER(--RIGHT(A2)),A2,LEFT(A2,LEN(A2)-1))
The formula removes the letters from the numbers. You can then sort your sheet using the new column of clean numbers.
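Outside of Excel, the same idea (separate the numeric part from the amendment letter, then sort on both) can be sketched in Python. This assumes every document number is a run of digits followed by an optional letter suffix; the helper name `doc_key` is made up. It sorts strictly by number and then by amendment, so 20A lands before 1603D.

```python
import re


def doc_key(doc):
    """Sort key: (numeric part, amendment letters), e.g. '1606A' -> (1606, 'A')."""
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", str(doc).strip())
    if match is None:
        raise ValueError(f"unexpected document number: {doc!r}")
    return int(match.group(1)), match.group(2).upper()


docs = ["1", "2", "3", "10", "11", "1606", "1603D", "1605B",
        "1606A", "1606C", "20A", "220B"]
ordered = sorted(docs, key=doc_key)
print(ordered)
```

Because an empty suffix sorts before any letter, the unamended 1606 comes before 1606A and 1606C.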
I have 5 columns of numbers that I want to sort per row into another set of columns. I figured I need to use small() (e.g. small(a2:e2,1) for f2; small(a2:e2,2) for g2 and so on). Is there a way to iterate this for the next rows, if possible using only native Google Sheets formulas?
Thanks in advance
I was able to make a temporary work around, but I had to use 3 cheat columns. It looks ok for now but I imagine it will be troublesome for really huge numbers.
Here's a sample sheet for reference: https://docs.google.com/spreadsheets/d/1MQTP2XkRsPRAnPQ5wLhkR8JoNVY6YOExVlOkkX8UeRs/edit#gid=0
The original data are in A3:E
The first cheat column (G3:G) simply creates a column of numbers from 1 to the largest number found in the source data. 1-9 is changed to 01-09 for easier searching. A "#" is then added at the end; this will come in handy later:
Cheat Column 1 =filter(if(row(A:A)=max(A:E)+1,"#",text(row(A:A),"00")),row(A:A)<=max(A:E)+1)
The second cheat column (H3:H) combines each row into a string separated by "-" with a "#" marker:
Cheat Column 2 =filter(text(A3:A,"00")&"-"&text(B3:B,"00")&"-"&text(C3:C,"00")&"-"&text(D3:D,"00")&"-"&text(E3:E,"00")&"#",A3:A<>"")
The last cheat column (I3:I) sorts each line (from cheat column 2) by searching for each number from cheat column 1, from 01 up to the max number, and then the "#" character (this ensures that each line still has the # end marker). "Find" returns the position of each number, or an error if it is not found. By wrapping it in "if", we can return the actual number, or "" when it is not found.
=filter(arrayformula(if(iferror(find(transpose(filter(G3:G,G3:G<>"")),H3:H),""), transpose(filter(G3:G,G3:G<>"")),"")),A3:A<>"")
The formula above creates as many columns as there are numbers from cheat column 1. To prevent this, a "-" is added to each number then "Concatenate" is used to combine everything into one massive string with each set separated by "#". The string is then split using the "#" marker.
Cheat Column 3 =transpose(split(concatenate(filter(arrayformula(if(iferror(find(transpose(filter(G3:G,G3:G<>"")),H3:H),""),"-"&transpose(filter(G3:G,G3:G<>"")),"")),A3:A<>"")),"#"))
Each number is then separated into each corresponding column by using mid().
Small 1 =filter(mid(I3:I,2,2)*1,A3:A<>"")
Small 2 =filter(mid(I3:I,5,2)*1,A3:A<>"")
Small 3 =filter(mid(I3:I,8,2)*1,A3:A<>"")
Small 4 =filter(mid(I3:I,11,2)*1,A3:A<>"")
Small 5 =filter(mid(I3:I,14,2)*1,A3:A<>"")
Note that the formula above is only for numbers 1-99. For larger numbers, the Text() formulas should have more zeroes to correspond to the number of digits of the biggest number. The Mid() formulas should also be adjusted accordingly.
I would like to stress that I am very far from being a spreadsheet expert and that this solution is very "unoptimized". It requires several cheat columns; with the first one even having more rows than the original data. If anyone can help me get rid of the cheat columns (or at least the first one) I will be very grateful.
How about using SMALL like you mentioned in your question?
=small($A3:$E3,column()-columns($A3:$G3))
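You will need to change the ranges accordingly. The last reference, $G3, is the cell just before the cell where the formula is placed, so column()-columns($A3:$G3) evaluates to 1 in the first output column, 2 in the next, and so on.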
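For readers who want to check the expected output outside of Sheets, here is a small Python sketch of the same per-row sort. The sample numbers are made up, and `small` is a hypothetical helper that mimics the spreadsheet function of the same name.

```python
def small(values, k):
    """Return the k-th smallest value (1-based), like the SMALL function."""
    return sorted(values)[k - 1]


# Made-up sample data: each inner list is one spreadsheet row (A:E).
rows = [
    [5, 3, 9, 1, 7],
    [12, 4, 8, 20, 2],
]

# One output column per rank k, one output row per input row.
sorted_rows = [[small(row, k) for k in range(1, len(row) + 1)] for row in rows]
for row in sorted_rows:
    print(row)
```

Calling `small` once per output column is equivalent to sorting each row independently, which is what the formula does as it is copied across and down.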