Joining two Pandas DataFrames does not work anymore? - join

I have 2 Pandas Dataframes.
The first one looks like this:
date rank id points
2010-01-04 1 100001 10550
2010-01-04 2 100002 9205
The second one like this:
id name
100001 A
100002 B
I want to join both dataframes via the id column. So the result should look like:
date rank id points name
2010-01-04 1 100001 10550 A
2010-01-04 2 100002 9205 B
Some weeks ago I wrote code for that, but for some reason it does not work anymore. I end up with an empty dataframe after I execute this code for joining:
join = pd.merge(df1,df2, on='id')
Why is join empty?

short story: as pointed out in the comment already, i was comparing strings with integers.
long story: i didn't expect python to parse the id-columns of two input csv files to different datatpyes. df1.id was of type Object. df2.id was of type int. and i needed to find out why df1.id was parsed to Object and not automatically to int, because it only contained numbers.
turns out that it had something to do with the encoding of my CSV file. in notepad++ the file was encoded as plain UTF-8. it seems that pandas did not like this, because when i tried to convert the id column to int, it raised an error like ValueError: invalid literal for int() with base 10: '\ufeff100001'. The number 100001 is the first ID of the first row. So there seems to be some encoded character before this number (at the very beginning of the file) \ufeff that prevented pandas to parse the whole column as int. in notepad++ i then changed the encoding of the file to UTF-8 without BOM and then everything worked.

Related

How can I get the last numerical value value in a column in Google Sheets?

I need to find the last numerical value in a column. I was using this formula to get the last value in column G, but I made some changes and it no longer works: =INDEX(G:G, COUNTA(G:G), 1). My column now looks like this:
645
2345
4674.2345
123.1
"-"
"-"
"-"
...and the formula returns "-". I want it to return 123.1. How can I do this?
There are many ways to go about this. Here is one of them:
=QUERY(FILTER({G:G,ROW(G:G)},ISNUMBER(G:G)),"Select Col1 ORDER BY Col2 Desc LIMIT 1")
FILTER creates a virtual array of only numeric values in G in the first column and the row of those numeric values in the second column.
QUERY returns flips the order by row number and returns only the new top value from the first column (which winds up being your last numeric value in the original range).
However, if your numeric values start at G1, and if there are only numeric values up to where you start adding hyphens in cells, you could just alter your original formula like this:
=INDEX(G:G,COUNT(G:G))
This would work because COUNT only counts numeric values while COUNTA counts all non-null values (including errors BTW).
Not to take anything away from the accepted answer, but I've been working on this a bit lately in relation to this for the never-ending last row discussion and thought I'd share some potential similar solutions. These ideas are inspired by a pattern of google sheet array questions that seem to be coming up more often. I am also intentionally using different ways to do the same thing just to give people some ideas (i.e. left and Regex).
Last Row that is...
Number: =max(filter(row(G:G),isnumber(G:G)))
Text: =max(filter(row(G:G),isText(G:G)))
An error: =max(filter(row(G:G),iserror(G:G)))
Under 0 : =max(filter(row(G:G),G:G<0))
Also exists in column D: =max(filter(row(G:G),ISNUMBER(match(G:G,D:D,0))))
Not Blank: =max(filter(row(A:A),NOT(ISBLANK(A:A))))
Starts with ab: =max(filter(row(G:G),left(G:G,2)="ab"))
Contains the character !: =max(filter(row(G:G),isnumber(Find("!",G:G))))
Starts with a number: =max(filter(row(G:G),REGEXMATCH(G:G,"^\d")))
Only contains letters: =max(filter(row(G:G),REGEXMATCH(G:G,"^[a-zA-Z]+$")
Last four digits are upper case: =Max(filter(row(G:G),REGEXMATCH(G:G,"[A-Z]{4}$")))
To get the actual value (which I realize was the actual question), just wrap an index function around the Max function. So for this question, a solution could be :
=Index(G:G,max(filter(row(G:G),isnumber(G:G))))

How to filter Excel column?

I am looking for a solution in python for my data which is in an excel file that contains different statements and numbers. I want to filter out the rows on the base of column values.
import pandas as pd
df=pd.read.excel("Data.xlsx")
df[df.Numbers.apply(lambda x: str(x).isdigit())]
df.to_excel("Data1.xlsx")
Any suggestions please?
Here is one way to perform the filtering, using pandas' string tools and boolean masks. I did each step separately (easier to test, and easier to understand in the future).
# remove CAS and Cascade
mask = (df['Evaluations'].str.startswith('CAS') |
df['Evaluations'].str.contains('CASCADE'))
df = df[~mask]
# remove Numbers starting with 21 or 99
mask = (df['Numbers'].astype(str).str.startswith('21') |
df['Numbers'].astype(str).str.startswith('99'))
df = df[~mask]
# remove letter as 2th character (1 => zero-based indexing)
mask = df['Numbers'].astype(str).apply(lambda x: x[1].isalpha())
df = df[~mask]
# write to file
with open('Data1.xlsx', 'wb') as handle:
df.to_excel(handle)
print(df)
Evaluations Numbers
2 Nastolgic behaviours of people 75903324
3 google drive 76308764
6 Tesla's new inventions 83492836
7 Electric cars 78363522
1- If in the column named Evaluations, its content starts with "OBS" or has the word "Obsolete" in it then remove these rows
(^OBS|Obsolete)
2- If the column value in the Numbers column start with digits "99" or "51" then remove these rows
^(99|51)
3- If the 5th digit in the Numbers column is an alphabetic character then also remove these rows
^\d{4}\w
These are the Regexes that will help match these conditions.

TextPad Replace Character and Line Feed with Nothing

How do I replace a line in TextPad ' with nothing (ie: delete lines with just that one character)?
I have an Excel Spreadsheet containing three columns:
Column A - single quote
Column B - some number
Column C - single quote plus a comma
There are over 90,000 rows on this spreadsheet with data in column B. There are over one million rows with just a single quote in column A because I did a "Ctrl+D" on that column to copy the value in that column (a single quote) down to all rows.
When I copy and paste these three columns into TextPad, I end up with over one million lines. I replaced the tabs with nothing using the F8/Replace dialog.
(Replace: tab with: empty string)
The majority of what is left are lines that contain only a single quote. I want to delete these 900,000 extra lines.
How do I specify a Replace (delete) of single quote + line feed. I do not want to delete any of the single quotes from the lines that include a number that came from column B.
I just figured it out. The backslash n is the line feed.
If I check Regular Expression and enter this Find what:
'\n
(Keeping empty string for Replace with) and Replace All, I have deleted those extra lines.
I also experienced the same...it did not work for me until I did this:
uncheck the regular expression first before entering \n in the find box and replacing with whatever you chose to (in my case, it was ',').
Your result might be an entire list becoming transposed (that's what happened to my data).

How do you sort an alpha-numeric list in excel?

I'm trying to sort a list of documents, but I'm having an issue with the documents that have a letter as a suffix.
Whenever we amend a document we add a letter to the end of the number, but when I sort by number in excel it sorts like this:
1
2
3
10
11
1606
1603D
1605B
1606A
1606C
1610A
1623A
20A
220B
390A
399A
402A
415A
450A
488A
557B
How can I make it sort in order of document number and amendment?
Like so:
1
2
3
10
11
1603D
1605B
1606
1606A
1606C
1610A
1623A
20A
220B
390A
399A
402A
415A
450A
488A
557B
As long as you have a mix of text and number, you won't be able to use Excel's built-in sort to achieve the result you describe.
If you append a letter to a number you effectively change the data type from number to text. Text will always be sorted after any number, hence the number 1606 comes before the text 1606A.
You could try to make all values real text, maybe indicate levels by appending digits with dots, like this:
1.
1.0.
1.1.
1.6.0.3.D
1.6.0.5.B
1.6.0.6.
1.6.0.6.A
1.6.0.6.C
1.6.1.0.A
1.6.2.3.A
2.
2.0.A.
2.2.0.B.
3.
3.9.0.A.
3.9.9.A.
4.0.2.A.
4.1.5.A.
4.5.0.A.
4.8.8.A.
5.5.7.B.
But even that does not give you the sort order you describe as the desired result.
Your desired sort order will be hard to achieve even if all values are text, or if you replace the A, B, C with a decimal .1, .2, .3. -- It's really hard to understand why 20 would come after 1623.
The solution I found was to add a column, and copy this formula into each cell:
=IF(ISNUMBER(--RIGHT(A2)),A2,LEFT(A2,LEN(A2)-1))
The formula removes the letters from the numbers, you can then sort your sheet using the new column of clean numbers.

extract number from cell in openoffice calc

I have a column in open office like this:
abc-23
abc-32
abc-1
Now, I need to get only the sum of the numbers 23, 32 and 1 using a formula and regular expressions in calc.
How do I do that?
I tried
=SUMIF(F7:F16,"([:digit:].)$")
But somehow this does not work.
Starting with LibreOffice 6.4, you can use the newly added REGEX function to generically extract all numbers from a cell / text using a regular expression:
=REGEX(A1;"[^[:digit:]]";"";"g")
Replace A1 with the cell-reference you want to extract numbers from.
Explanation of REGEX function arguments:
Arguments are separated by a semicolon ;
A1: Value to extract numbers from. Can be a cell-reference (like A1) or a quoted text value (like "123abc"). The following regular expression will be applied to this cell / text.
"[^[:digit:]]": Match every character which is not a decimal digit. See also list of regular expressions in LibreOffice
The outer square brackets [] encapsulate the list of characters to search for
^ adds a NOT, meaning that every character not included in the search list is matched
[:digit:] represents any decimal digit
"": replace matching characters (every non-digit) with nothing = remove them
"g": replace all matches (don't stop after the first non-digit character)
Unfortunately Libre-Office only supports regex in find/replace and in search.
If this is a once-only deal, I would copy column A to column to B, then use [data] [text to columns] in B and use the - as a separator, leaving you with all the text in column B and the numbers in column C.
Alternatively, you could use =Right(A1,find("-",A1,1)+1) in column B, then sum Column C.
I think that this is not exactly what do you want, but maybe it can help you or others.
It is all about substring (in Calc called [MID][1] function):
First: Choose your cell (for example with "abc-23" content).
Secondly: Enter the start length ("british" --> start length 4 = tish).
After that: To print all remaining text, you can use the [LEN][2] function (known as length) with your cell ("abc-23") in parameter.
Code now looks like this:
D15="abc-23"
=MID(D15; 5; LEN(D15))
And the output is: 23
When you edit numbers (in this example 23), no problem. However, if you change anything before (text "abc-"), the algorithm collapses because the start length is defined to "5".
Paste the string in a cell, open search and replace dialog (ctrl + f) extended search option mark regular expression search for ([\s,0-9])([^0-9\s])+ and replace it with $1
adjust regex to your needs
I didn't figure out how to do this in OpenOffice/LibreOffice directly. After frustrations in searching online and trying various formulas, I realised my sheet was a simple CSV format, so I opened it up in vim and used vim's built-in sed-like feature to find/replace the text in vim command mode:
:%s/abc-//g
This only worked for me because there were no other columns with this matching text. If there are other columns with the same text, then the solution would be a bit more complex.
If your sheet is not a CSV, you could copy the column out to a text file and use vim to find/replace, and then paste the data back into the spreadsheet. For me, this was a lot less frustrating than trying to figure this out in LibreOffice...
I won't bother with a solution without knowing if there really is interest, but, you could write a macro to do this. Extract all the numbers and then implement the sum by checking for contained numbers in the text.

Resources