how to convert pandas multiple columns of text into tensors? - machine-learning

Hi, I am working on the Key Point Analysis task shared by IBM (here is the link). In the given dataset each row has more than one text column. Can anyone tell me how to convert those text columns into tensors and assign them back into the same DataFrame, since there are other columns of data in it as well?
Problem
I am facing a problem here: I have never seen this kind of data before, with multiple text columns. How can I convert all of those columns into tensors and then apply a model? Most of the time the data has one text column and the other columns are labels, for example Movie Reviews or Toxic Comment Classification.
import re
from nltk.corpus import stopwords

# Assumed definitions, not shown in the question (these are the usual ones
# from the common text-classification tutorials):
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
    text: a string
    return: modified initial string
    """
    text = text.lower()                        # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace listed symbols with a space
    text = BAD_SYMBOLS_RE.sub('', text)        # delete bad symbols
    text = text.replace('x', '')               # note: this removes every letter 'x'
    # text = re.sub(r'\W+', '', text)
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)
    return text

If I got your question right, you will do something like the following:
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
DF["args"] = DF["args"].apply(lambda x: tokenizer(x)["input_ids"])
This will convert the sentences into arrays of token IDs.
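Note that `tokenizer(x)["input_ids"]` returns a plain Python list per row, so the column ends up holding ragged lists rather than tensors. Below is a minimal, framework-free sketch of the padding step that turns those ragged lists into a rectangular batch; the token IDs are made up, and the pad id of 1 is RoBERTa's convention:

```python
import pandas as pd

# Hypothetical token-ID lists, as produced by tokenizer(x)["input_ids"]
DF = pd.DataFrame({"args": [[0, 8, 21, 2], [0, 8, 2]],
                   "label": [1, 0]})

pad_id = 1  # RoBERTa's pad token id
max_len = DF["args"].map(len).max()

# Pad every row to the same length, then stack into a rectangular batch
padded = DF["args"].map(lambda ids: ids + [pad_id] * (max_len - len(ids)))
batch = padded.tolist()   # ready for torch.tensor(batch) / tf.constant(batch)
print(batch)              # [[0, 8, 21, 2], [0, 8, 2, 1]]
```

In practice you can let the tokenizer do all of this in one call: `tokenizer(DF["args"].tolist(), padding=True, truncation=True, return_tensors="pt")` returns ready-made PyTorch tensors.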

Related

How to SPLIT cell content into sets of 50000 characters in new columns Google Sheets

I have 3 columns A, B & C as shown in the image. Column A contains the search key. The second column B contains names and their respective content in the third column C.
I am filtering rows that contain the text in A1 in B:C and concatenating them. The challenge is that each text in the third column is roughly 40k characters. The filter formula works well, so the issue is the character limit. This formula =ArrayFormula(query(C1:C,,100000)), which I have in F1, concatenates more than 50,000 characters, but I am not sure how to apply it to my case.
I tried to wrap my formula in E1 inside the query function, like so, but it wasn't successful:
=ArrayFormula(query(CLEAN(CONCATENATE(FILTER(C1:C, B1:B=A1))),,100000)).
I also tried to SPLIT the concatenated result into sets of 50,000 characters and put the extras in the next columns, but couldn't manage that either. The formula I tried in this case is:
=SPLIT(REGEXREPLACE(CLEAN(CONCATENATE(FILTER(C1:C, B1:B=A1))),".{50000}", "$0,"),",")
The link to the spreadsheet
https://docs.google.com/spreadsheets/d/1rhVSQJBGaPQu6y2WbqkO2_UqzyfCc3_76t4AK3PdF7M/edit?usp=sharing
Since a cell is limited to 50,000 characters, using CONCATENATE is not possible. An alternative solution is to use a Google Apps Script custom function. The good thing about Apps Script is that it can handle millions of characters in a string.
To create custom function:
Create or open a spreadsheet in Google Sheets.
Select the menu item Tools > Script editor.
Delete any code in the script editor and copy and paste the code below.
At the top, click Save.
To use custom function:
Click the cell where you want to use the function.
Type an equals sign (=) followed by the function name and any input value — for example, =myFunction(A1) — and press Enter.
The cell will momentarily display Loading..., then return the result.
Code:
function myFunction(text) {
  var arr = text.flat();
  var newStr = arr.join(' ');
  var slicedStr = stringChop(newStr, 50000);
  return [slicedStr];
}

function stringChop(str, size) {
  if (str == null) return [];
  str = String(str);
  size = ~~size;
  return size > 0 ? str.match(new RegExp('.{1,' + size + '}', 'g')) : [str];
}
Example:
Based on your sample spreadsheet, there are 4 rows that match the filter criteria and each cell contains 38,976 characters, which is 155,904 characters in total. Dividing that by 50,000 gives 3.12, and the ceiling of 3.12 is 4, which means the result spans 4 columns of data.
Usage:
Paste this in cell E1:
=myFunction(FILTER(C1:C, B1:B=A1))
Output:
Reference:
Custom Function
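The chunking done by stringChop above can also be mirrored in Python, if you prefer to pre-split the text before it ever reaches Sheets; a small sketch using the 50,000-character cell cap and the 155,904-character total from the example:

```python
def string_chop(s, size=50000):
    """Split s into consecutive chunks of at most `size` characters."""
    if not s:
        return []
    return [s[i:i + size] for i in range(0, len(s), size)]

# 155,904 characters in total, as in the example above
text = "x" * 155904
chunks = string_chop(text)
print(len(chunks))      # 4 chunks -> 4 columns
print(len(chunks[-1]))  # 5,904 characters left over in the last chunk
```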

How to filter Excel column?

I am looking for a solution in Python for my data, which is in an Excel file containing different statements and numbers. I want to filter out rows based on the column values.
import pandas as pd
df = pd.read_excel("Data.xlsx")
df = df[df.Numbers.apply(lambda x: str(x).isdigit())]
df.to_excel("Data1.xlsx")
Any suggestions, please?
Here is one way to perform the filtering, using pandas' string tools and boolean masks. I did each step separately (easier to test, and easier to understand in the future).
# remove CAS and Cascade
mask = (df['Evaluations'].str.startswith('CAS') |
        df['Evaluations'].str.contains('CASCADE'))
df = df[~mask]

# remove Numbers starting with 21 or 99
mask = (df['Numbers'].astype(str).str.startswith('21') |
        df['Numbers'].astype(str).str.startswith('99'))
df = df[~mask]

# remove rows with a letter as the 2nd character (index 1, zero-based)
mask = df['Numbers'].astype(str).apply(lambda x: x[1].isalpha())
df = df[~mask]

# write to file
df.to_excel('Data1.xlsx')
print(df)
                      Evaluations   Numbers
2  Nastolgic behaviours of people  75903324
3                    google drive  76308764
6          Tesla's new inventions  83492836
7                   Electric cars  78363522
1- If the content of the column named Evaluations starts with "OBS" or contains the word "Obsolete", remove those rows:
(^OBS|Obsolete)
2- If the value in the Numbers column starts with the digits "99" or "51", remove those rows:
^(99|51)
3- If the 5th character in the Numbers column is an alphabetic character, also remove those rows:
^\d{4}[A-Za-z]
These are the regexes that will match these conditions. (Note: \w would also match digits and underscore, hence the explicit letter class in the last pattern.)
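The three regexes above can be applied in pandas with str.contains; a sketch with made-up data, using the Evaluations and Numbers column names from the question:

```python
import pandas as pd

# Hypothetical sample data covering each removal condition
df = pd.DataFrame({
    "Evaluations": ["OBS unit", "marked Obsolete", "Electric cars", "google drive"],
    "Numbers": ["99123456", "51009876", "7836A522", "76308764"],
})

# Build one boolean mask per condition, then drop rows matching any of them
bad_eval = df["Evaluations"].str.contains(r"(?:^OBS|Obsolete)")
bad_prefix = df["Numbers"].str.contains(r"^(?:99|51)")
bad_fifth = df["Numbers"].str.contains(r"^\d{4}[A-Za-z]")

df = df[~(bad_eval | bad_prefix | bad_fifth)]
print(df["Numbers"].tolist())  # ['76308764']
```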

How to create a tooltip showing multiple values of a field in Altair?

I have successfully created an interactive chart following the tutorial here, and below is my result:
I want to create a tooltip as the image below:
But I cannot find any hint in Altair tutorial document.
Can anyone provide a suggestion about how to proceed?
You can create a single tooltip for multiple lines using the pivot transform. Here is an example using one of the vega datasets.
The idea is that the tooltip is tied to the vertical rule mark, which represents a pivoted version of the data where each relevant value is in the same row:
import altair as alt
from vega_datasets import data

source = data.stocks()
base = alt.Chart(source).encode(x='date:T')
columns = sorted(source.symbol.unique())
selection = alt.selection_single(
    fields=['date'], nearest=True, on='mouseover', empty='none', clear='mouseout'
)

lines = base.mark_line().encode(y='price:Q', color='symbol:N')
points = lines.mark_point().transform_filter(selection)

rule = base.transform_pivot(
    'symbol', value='price', groupby=['date']
).mark_rule().encode(
    opacity=alt.condition(selection, alt.value(0.3), alt.value(0)),
    tooltip=[alt.Tooltip(c, type='quantitative') for c in columns]
).add_selection(selection)

lines + points + rule

Google Sheets: How do I include a newline within a field in a local .tsv file I want to import

I know that in Google Sheets I can type Control-Return to create a sort of "phantom" return that starts a new line within a field. But what is the actual character that represents this? Obviously it's not ASCII code 13, as that is the record separator.
I would like to be able to include this mystery character in a local .tsv file which I import into Google sheets, so that these multi-line fields will display as such. Is this possible?
Thanks!
it's CHAR(10)
so something like the following in a cell would give two distinct lines of text...
="direct text input"&A3&" more text"&CHAR(10)&"Newline with new text"
Q: Google Sheets: How do I include a newline within a field in a local .tsv file I want to import
A: Surround that field with "
Update: this only works with CSV; Google Sheets' TSV import doesn't seem to preserve the 'new lines'. More info at the link at the end.
In technical drawing class I learnt that if you don't know how to get A-->C, try to go C-->A and draw conclusions to help you achieve your goal.
If you have this CSV text file
The google sheet will look like this
Which I obtained going the other way around.
If surrounding the field with " is not possible, you may want to use this formula, which does the opposite (changes 'new line' for § ) [adapt to your needs].
# For a whole column, if there are 'new lines' in the cell,
# copy the cell changing them to '§' otherwise copy the cell 'as is'
={"description";ArrayFormula(IF( REGEXMATCH(G2:G; char(10)) ; REGEXREPLACE(G2:G;char(10);char(167));G2:G))}
More rambling here
Inspired by @MattKing's answer, an import of a *.tsv file containing this example content (note the double-quoting within the 2nd column value) ...
"example value with no line breaks" "=""line one""&CHAR(10)&""line two"""
... seems to have the desired outcome ...
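The quoting trick above can also be reproduced programmatically; a sketch using Python's csv module, which quotes any field containing a line break automatically:

```python
import csv
import io

# One row whose second field contains an embedded newline
rows = [["example value with no line breaks", "line one\nline two"]]

# csv.writer quotes any field that contains the delimiter, a quote,
# or a newline -- which is what Google Sheets' CSV import expects.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(buf.getvalue())
```

On import, the quoted field arrives as a single multi-line cell.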

Twitter co-hashtagging with OpenRefine

I am using OpenRefine to format some Twitter metadata into an edge list to be read by Gephi.
It works easily if I want to study user-mention associations or user-hashtags associations.
But now I would like to study co-hashtagging, i.e. how often hashtags co-occur in tweets.
Doing this in OpenRefine (which I do not know very well) is a bit trickier and I need some help.
My data are in CSV, with two columns: the name of the user, and a comma-separated string of the hashtags used in the tweet.
To get user-hashtag edge lists with OpenRefine I use "Split multi-valued cells" on the hashtags column and then "Fill down" on the user column (very easy).
I do not know how to get hashtag-hashtag edge lists. I can use "Split multi-valued cells" on the hashtags column to get a new row for every hashtag mentioned in a tweet. But how do I "fill" the rows so as to get all combinations of hashtag-hashtag co-occurrence?
Example:
Data:
User Hashtags
Dario Data mining, R, OpenRefine
Desired result:
Hashtag 1 Hashtag 2
Data mining R
Data mining OpenRefine
R OpenRefine
Also posted on the OpenRefine Google Group:
I think you could do this with a combination of forEach and forRange. Try the following transformation on the cell containing the comma delimited hashtags:
forEachIndex(value.split(","),i,v,forRange(i+1,value.split(",").length(),1,j,v.trim() + "," + value.split(",")[j].trim()).join("|")).join("|")
This should produce a pipe-delimited list of the unique combinations. Then you can use "Split multi-valued cells".
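The same pair expansion can be checked outside OpenRefine with itertools.combinations; a sketch using the example row from the question:

```python
from itertools import combinations

hashtags = "Data mining, R, OpenRefine"

# Every unordered pair of hashtags, matching the desired edge list
tags = [t.strip() for t in hashtags.split(",")]
pairs = list(combinations(tags, 2))
print(pairs)
# [('Data mining', 'R'), ('Data mining', 'OpenRefine'), ('R', 'OpenRefine')]
```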
Here is my suggestion.
Let's use your example:
User   Hashtags
Dario  Data mining, R, OpenRefine
1°/ Use the function called "Split multi-valued cells" on the Hashtags column.
You should get something like:
User   Hashtags
Dario  Data mining
       R
       OpenRefine
2°/ Try this transformation on the Hashtags column:
if((row.record.cells["Hashtags"].value[-1])==value,value+","+(row.record.cells["Hashtags"].value[0]),value+","+(row.record.cells["Hashtags"].value[-1]))
3°/ Split your column into columns based on the "," separator.
It works for me.
Edit :
This solution generates a duplicate entry that can be easily removed like this :
Join multi valued cells, using a | separator (for example).
You get something like
1.
Dario
Data mining,Prout|R,Prout|OpenRefine,Prout|Prout,Data mining
2.
Essai
Data mining,R|R,Data mining
Then split cells into columns based on the separator | and, finally, remove the first hashtag column.
