Could you advise me on how to compare a downloaded file with the expected one? I haven't found anything like that anywhere. Also, when comparing two xlsx files, for example, I would like to see which cell values differ.
Example
First xlsx file
Column A  Column B
1         2
3         4
Second xlsx file
Column A  Column B
2         1
4         3
Comparison result - example
----------------- DIFF -------------------
DIFF Cell at A2 => '1' v/s '2'
.
.
.
I was thinking of basing the comparison on https://github.com/vrootic/FileCompare
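A minimal sketch of such a cell-by-cell comparison, in Python with only the standard library. It assumes each sheet has already been loaded into a list of rows (for instance with openpyxl's `ws.iter_rows(values_only=True)`). The helper names `column_letter` and `diff_cells` are made up for this illustration and are not taken from FileCompare.

```python
from itertools import zip_longest


def column_letter(index):
    """Convert a zero-based column index to spreadsheet letters (0 -> A, 26 -> AA)."""
    letters = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def diff_cells(rows_a, rows_b):
    """Return one message per cell whose value differs between the two sheets."""
    diffs = []
    # zip_longest pads the shorter sheet/row, so missing cells compare as None.
    for row_no, (row_a, row_b) in enumerate(
        zip_longest(rows_a, rows_b, fillvalue=()), start=1
    ):
        for col_no, (a, b) in enumerate(zip_longest(row_a, row_b, fillvalue=None)):
            if a != b:
                cell = f"{column_letter(col_no)}{row_no}"
                diffs.append(f"DIFF Cell at {cell} => '{a}' v/s '{b}'")
    return diffs


# The two example sheets from the question, header row included.
first = [["Column A", "Column B"], [1, 2], [3, 4]]
second = [["Column A", "Column B"], [2, 1], [4, 3]]

diffs = diff_cells(first, second)
print("----------------- DIFF -------------------")
for line in diffs:
    print(line)
```

Comparing plain row lists keeps the diff logic independent of how the files are read, so the same function could be fed from either the downloaded or the expected file.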
I have several google spreadsheets with different number of records (rows) - let's say
file 1: 200.000 records (rows)
file 2: 350.000 records (rows)
file 3: 246.000 records (rows)
etc.
I use a lot of formulas (20-30) that reference entire columns from file 1:
sumif(a$2:a$200000,">3")
countif(b$2:b$200000, "=n")
etc.
I want to reuse the already created formulas for the other files, but since the number of records there is different, I would have to replace the 200.000 with 350.000 for file 2 in 20-30 cells, with 246.000 for file 3 in 20-30 cells etc.
That would be too much work.
Is there a way to specify the end point of the range not with a constant but by pointing to a cell that contains the number of rows?
e.g.
I would add in cell z1 the number of rows: 200000
The other formulas would contain something like
sumif(a$2:a$ (something that tells sheets to use as row number the number from z1) )
This way I would need to only replace the number in z1, and all formulas would be updated correctly. Any ideas?
I tried using indirect:
="a"&indirect("z1")
where z1 contains 200000
This pastes
a200000
But if I try using it in a range, it's not recognized as a range
=sum(a1:"a"&indirect("z1"))
Any ideas how to do that correctly?
Why not just skip the end row? Instead of:
=sumif(a$2:a$200000,">3")
use:
=sumif(a$2:a,">3")
To answer your INDIRECT question, the correct syntax would be:
=sum(INDIRECT("a1:a"&z1))
You don't need to specify a row limit in this case.
Just use sumif(A$2:A,">3") and it will read the whole of column A starting from row 2.
I want to check a column of strings against another column of substrings to see whether each string contains any of the substrings.
I am currently attempting =SEARCH(lower('Sheet2'!$A$1:$A$100),lower(B1)) and would like a True/False response.
Thanks in advance.
An example of Sheet2 would be:
A
1 Hello
2 Hi
3 I said she
An example of Sheet1 with the expected result in column C would be:
A B C
1 23 There are many FALSE
2 45 I said he is slow FALSE
3 3 I said she is bad TRUE
4 78 he yelled hello TRUE
Any help is appreciated
EDIT: link to example - https://docs.google.com/spreadsheets/d/1c2pskSYsGs12Yjbn-5gORQ22mDSaC9cSnp1nWeULlf4/edit?usp=sharing
In Sheet1!C1:
=ArrayFormula(IF(B:B="",,REGEXMATCH(LOWER(B:B),JOIN("|","\b"&FILTER(LOWER(Sheet2!A:A),Sheet2!A:A<>"")&"\b"))))
You haven't shared a link to a spreadsheet, so this is untested on any actual data. Your locale is also unknown, which may require modifications as well. So if this formula doesn't work as provided, share a link to your sample spreadsheet.
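For readers who want to check the matching logic outside Sheets, here is a rough Python equivalent of what the formula does: lowercase everything, wrap each substring in \b word boundaries, join the alternatives with |, and test each string. This is only a sketch on the sample data from the question; `build_pattern` is a hypothetical helper name, and the `re.escape` call is an addition that the Sheets formula does not have.

```python
import re


def build_pattern(substrings):
    """Join the non-empty substrings into one alternation with word boundaries."""
    parts = [r"\b" + re.escape(s.lower()) + r"\b" for s in substrings if s]
    return re.compile("|".join(parts))


# Sheet2!A:A and Sheet1!B:B from the question.
substrings = ["Hello", "Hi", "I said she"]
strings = [
    "There are many",
    "I said he is slow",
    "I said she is bad",
    "he yelled hello",
]

pattern = build_pattern(substrings)
results = [bool(pattern.search(s.lower())) for s in strings]
print(results)  # [False, False, True, True]
```

The word boundaries are what keep "Hi" from matching inside a longer word such as "this".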
I am trying to write a formula that will look for a value in a column, and return the first cell in the row in which it finds the value. So a little like VLOOKUP, but I don't want to search the first column.
Here is an example dataset:
Room   Monday  Tuesday
DWG 1  S01     S02
DWG 2  S02     S04
DWG 3  S03     S06
DWG 4  S04     S07
Here is what I would like to generate using a formula.
So for the value at B2, I would like it to look up A2 ("S01") in the B column ("Monday") of the top table, and return the value of the cell in the 1st column ("DWG 1").
Ideally it would return nothing or a blank if it doesn't find the exact string in the top table.
Section  Monday  Tuesday
S01      DWG 1
S02      DWG 2   DWG 1
S03      DWG 3
S04      DWG 4   DWG 2
S05
S06              DWG 3
S07              DWG 4
After seeing your in-sheet data and layout, and meeting up with you there live, this is the formula I left for you in the newly added sheet "Erik Help":
=ArrayFormula({"S"&TEXT(SEQUENCE(24,1),"00"),IFERROR(VLOOKUP(FILTER(B1:1,B1:1<>"")&"S"&TEXT(SEQUENCE(24,1),"00"),SPLIT(FLATTEN(FILTER(Sheet1!B1:1,Sheet1!B1:1<>"")&FILTER(INDIRECT("Sheet1!B3:"&ROWS(Sheet1!A:A)),Sheet1!B1:1<>"")&"|"&FILTER(Sheet1!$A3:$A,Sheet1!$A3:$A<>"")),"|"),2,FALSE))})
For the understanding of others, the days of the week (i.e., Monday, Tuesday...) are entered manually as top headers in B1:F1. The header "Section" is entered in A2 with B2:F2 blank. (This is just how the OP wanted it set up.) And the formula is in A3, processing data for A3:F.
The first part of the virtual array just generates the SEQUENCE of section names (S01 - S24) in A3:A26.
The next part looks up every element of one array within another array. The first array is a concatenation of every weekday with every section number from Column A. The second array didn't technically need to be as long as it is in the formula, because we already know exactly how many weekdays, classrooms and sections there are. But it is written to accommodate flexibility, perhaps for future use where four or six days are required, with more or fewer sections.
That second array concatenates every weekday with every section from Sheet1 followed by a pipe symbol and the room for each row from Sheet1. That grid is FLATTENed to one column, and then SPLIT to two columns at the pipe symbol.
Found elements then return the room name (which was SPLIT to Column 2 of the virtual VLOOKUP array). If there is no match, IFERROR returns null.
A shorter version is possible, since we know exactly how many days, sections and rooms we have (I left it in a new sheet called "Erik Help 2"):
=ArrayFormula({"S"&TEXT(SEQUENCE(24,1),"00"),IFERROR(VLOOKUP(Sheet1!B1:F1&"S"&TEXT(SEQUENCE(24,1),"00"),SPLIT(FLATTEN(Sheet1!B1:F1&Sheet1!B3:F17&"|"&Sheet1!A3:A17),"|"),2,FALSE))})
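The lookup-table idea behind these formulas can be sketched in Python. The sketch below uses the small example dataset from the question (two days, four rooms, sections S01 to S07) rather than the real sheet with 24 sections, and all variable names are illustrative only. Each day and section are concatenated into a key that maps to a room, which mirrors the FLATTEN + SPLIT step, and every day/section combination is then looked up, with a blank when nothing is found.

```python
days = ["Monday", "Tuesday"]

# Rows of the source table: room -> section booked on each day, in day order.
rooms = {
    "DWG 1": ["S01", "S02"],
    "DWG 2": ["S02", "S04"],
    "DWG 3": ["S03", "S06"],
    "DWG 4": ["S04", "S07"],
}

# Build the "virtual" two-column lookup array: "MondayS01" -> "DWG 1".
lookup = {}
for room, booked in rooms.items():
    for day, section in zip(days, booked):
        # setdefault keeps the first match, like an exact-match VLOOKUP.
        lookup.setdefault(day + section, room)

# Generate the section names (like "S"&TEXT(SEQUENCE(...),"00")) and look each up.
sections = [f"S{n:02d}" for n in range(1, 8)]
table = [[s] + [lookup.get(day + s, "") for day in days] for s in sections]

for row in table:
    print(row)
```

A missing key falls back to an empty string, which plays the role of IFERROR in the formula.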
The function you're looking for doesn't exist in Sheets, because VLOOKUP always searches the first column of its range and returns a value from a column to its right. A workaround is to rearrange the columns with QUERY and run VLOOKUP on the result.
Here's an example formula you can use: =iferror(vlookup("S01",QUERY(A2:C,"Select B,A",0),2,FALSE),"")
The FALSE argument forces an exact match; this will also leave the cell blank if there are no matching results.
Here's an example of what the end result would look like when the string "S01" is looked up:
I am looking for a solution in Python for my data, which is in an Excel file that contains different statements and numbers. I want to filter out rows based on their column values.
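import pandas as pd
df = pd.read_excel("Data.xlsx")  # the function is read_excel, not read.excel
# keep only the rows whose Numbers value consists solely of digits
df = df[df.Numbers.apply(lambda x: str(x).isdigit())]
df.to_excel("Data1.xlsx")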
Any suggestions please?
Here is one way to perform the filtering, using pandas' string tools and boolean masks. I did each step separately (easier to test, and easier to understand in the future).
# remove CAS and Cascade
mask = (df['Evaluations'].str.startswith('CAS') |
        df['Evaluations'].str.contains('CASCADE'))
df = df[~mask]
# remove Numbers starting with 21 or 99
mask = (df['Numbers'].astype(str).str.startswith('21') |
        df['Numbers'].astype(str).str.startswith('99'))
df = df[~mask]
# remove rows with a letter as the 2nd character (index 1, zero-based);
# the length check avoids an IndexError on one-character values
mask = df['Numbers'].astype(str).apply(lambda x: len(x) > 1 and x[1].isalpha())
df = df[~mask]
# write to file
with open('Data1.xlsx', 'wb') as handle:
    df.to_excel(handle)
print(df)
Evaluations Numbers
2 Nastolgic behaviours of people 75903324
3 google drive 76308764
6 Tesla's new inventions 83492836
7 Electric cars 78363522
1- If the content of the column named Evaluations starts with "OBS" or has the word "Obsolete" in it, then remove these rows
(^OBS|Obsolete)
2- If the value in the Numbers column starts with the digits "99" or "51", then remove these rows
^(99|51)
3- If the 5th character in the Numbers column is an alphabetic character, then also remove these rows (\w would also match digits, so a letter class is used)
^\d{4}[A-Za-z]
These are the Regexes that will help match these conditions.
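As a sketch of how these three patterns could be applied with pandas: the sample rows below are invented for illustration, and the column names Evaluations and Numbers are taken from the question. `str.contains` is used with non-capturing groups because pandas warns when a pattern passed to it contains capture groups.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "Evaluations": [
        "OBS pump",
        "Marked Obsolete in 2019",
        "Electric cars",
        "google drive",
        "Tesla's new inventions",
        "Solar panels",
        "Wind turbines",
    ],
    "Numbers": [
        "75903324", "76308764", "99123456", "51000001",
        "8349A836", "78363522", "83492836",
    ],
})

numbers = df["Numbers"].astype(str)  # works for mixed int/str columns too

mask = (
    df["Evaluations"].str.contains(r"^OBS|Obsolete", regex=True)  # rule 1
    | numbers.str.contains(r"^(?:99|51)", regex=True)             # rule 2
    | numbers.str.contains(r"^.{4}[A-Za-z]", regex=True)          # rule 3
)

filtered = df[~mask]
print(filtered)
```

Rule 3 here uses `^.{4}[A-Za-z]` so that the 5th character is tested regardless of what the first four are; whether that is the intended reading of the requirement is an assumption.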
I have 2 Pandas Dataframes.
The first one looks like this:
date rank id points
2010-01-04 1 100001 10550
2010-01-04 2 100002 9205
The second one like this:
id name
100001 A
100002 B
I want to join both dataframes via the id column. So the result should look like:
date rank id points name
2010-01-04 1 100001 10550 A
2010-01-04 2 100002 9205 B
Some weeks ago I wrote code for that, but for some reason it does not work anymore. I end up with an empty dataframe after I execute this code for joining:
join = pd.merge(df1,df2, on='id')
Why is join empty?
Short story: as already pointed out in the comments, I was comparing strings with integers.
Long story: I didn't expect pandas to parse the id columns of the two input CSV files to different data types. df1.id was of type object, while df2.id was of type int. I needed to find out why df1.id was parsed as object rather than int, given that it contained only numbers.
It turned out to be related to the encoding of my CSV file. In Notepad++ the file was shown as plain UTF-8, but it evidently started with a byte order mark (BOM). pandas did not like this: when I tried to convert the id column to int, it raised an error like ValueError: invalid literal for int() with base 10: '\ufeff100001'. The number 100001 is the first ID in the first row, so the character \ufeff in front of it (at the very beginning of the file) prevented pandas from parsing the whole column as int. In Notepad++ I then changed the encoding of the file to UTF-8 without BOM, and everything worked.
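A small sketch of the problem and one way to repair it in code, without re-saving the file. The frames below are reconstructed by hand from the example in the question; the BOM-prefixed first id imitates what the parser produced in this case. The repair strips the stray character and casts the column so both id columns share an integer dtype before merging.

```python
import pandas as pd

# Hypothetical reconstruction: df1's ids arrived as strings, the first with a BOM.
df1 = pd.DataFrame({
    "date": ["2010-01-04", "2010-01-04"],
    "rank": [1, 2],
    "id": ["\ufeff100001", "100002"],
    "points": [10550, 9205],
})
df2 = pd.DataFrame({"id": [100001, 100002], "name": ["A", "B"]})

# Compare the key dtypes first; a string/int mismatch is the usual culprit.
print(df1["id"].dtype, df2["id"].dtype)

# Strip the BOM and convert to integers so the keys are comparable.
df1["id"] = df1["id"].str.lstrip("\ufeff").astype("int64")

joined = pd.merge(df1, df2, on="id")
print(joined)
```

When the file is read with `pd.read_csv`, passing `encoding="utf-8-sig"` is the usual way to have the BOM dropped at read time, which avoids the manual cleanup.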