Hugging Face load_dataset() function throws "ValueError: Couldn't cast"

My goal is to train a classifier for sentiment analysis in the Slovak language, using the SlovakBERT model and the Hugging Face library. The code is executed on Google Colaboratory.
My test dataset is read from this csv file:
https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_games.csv
and train dataset:
https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_accomodation.csv
The data has two columns: a column of Slovak sentences and a second column of labels indicating the sentiment of each sentence. The labels take the values -1, 0, or 1.
load_dataset() throws this error:
ValueError: Couldn't cast
Vrtuľník je veľmi zraniteľný pri dobre mierenej streľbe zo zeme. Brániť sa, unikať, alebo vedieť zneškodniť nepriateľa je vecou sekúnd, ak nie stotín, kedy ide život. : string
-1: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 954
to
{'Priestorovo a vybavenim OK.': Value(dtype='string', id=None), '1': Value(dtype='int64', id=None)}
because column names don't match
Code:
!pip install transformers==4.10.0 -qqq
!pip install datasets -qqq
import numpy as np
from datasets import load_metric, load_dataset, Dataset
from transformers import TrainingArguments, Trainer, AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding
import pandas as pd
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
#links to dataset
test = 'https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_games.csv'
train = 'https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_accomodation.csv'
model_name = 'gerulata/slovakbert'
#Load data
dataset = load_dataset('csv', data_files={'train': train, 'test': test})
What am I doing wrong when loading the dataset?

The reason is that these CSV files have no header row and the delimiter (a comma) also appears inside the sentences, so load_dataset() cannot infer the columns reliably: it takes the first data row of each file as the column names, and those rows differ between the train and test files (hence "because column names don't match"). The solution is simple: just supply the column names explicitly:
dataset = load_dataset('csv', data_files={'train': train, 'test': test}, column_names=['sentence', 'label'])
output:
DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 89
    })
    test: Dataset({
        features: ['sentence', 'label'],
        num_rows: 91
    })
})
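If you then want to continue toward fine-tuning, here is a minimal sketch (not from the original post; the preprocess function name is mine): tokenize the sentences and shift the labels -1/0/1 to 0/1/2, since the sequence-classification head expects class indices in [0, num_labels):
tokenizer = AutoTokenizer.from_pretrained(model_name)

def preprocess(batch):
    # Tokenize the Slovak sentences; padding is handled later by the data collator
    enc = tokenizer(batch['sentence'], truncation=True)
    # Shift labels -1/0/1 -> 0/1/2 for the cross-entropy loss
    enc['labels'] = [label + 1 for label in batch['label']]
    return enc

tokenized = dataset.map(preprocess, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)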

Related

Preprocess and data transformation in machine learning

I have a problem where I have to predict a buyer using machine learning (I created a dummy dataset). I need to transform the data before I can use it for machine learning: I aggregate information at the (id, visit) level, which gives me a list of the fruits and cloths bought. This list needs to be one-hot encoded before applying a classifier model.
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
def preprocess(df):
    # Only keep rows till the first Buyer == 1 in each group
    df = df.groupby(["id1", "visit"], group_keys=False).apply(
        lambda g: g.loc[: g["Buyer"].idxmax()]
    )
    # Form lists on each (id1, visit) level
    df1 = df.groupby(["id1", "visit"], as_index=False).agg(
        is_Pax=("Buyer", "max"),
        fruits=("fruits", lambda x: x.dropna().unique().tolist()),
        cloths=("cloths", lambda x: x.dropna().unique().tolist()),
    )
    col = ["fruits", "cloths"]
    df_transformed = onehot(df1, col)
    return df_transformed

def onehot(df, col):
    """
    This function does one hot encoding of a list column.
    """
    onehot_list_encoder = MultiLabelBinarizer()
    for cl in col:
        print("One hot encoding", cl)
        newd = pd.DataFrame(
            onehot_list_encoder.fit_transform(df[cl]),
            columns=onehot_list_encoder.classes_,
        ).add_prefix(cl + "_")
        df = df.join(newd)
    return df

df = pd.DataFrame(
    np.array([
        ['a', 'a', 'b', 'b', 'a', 'a'],
        [1, 2, 2, 2, 1, 1],
        ['Apple', 'Apple', 'Banana', None, 'Orange', 'Pear'],
        [1, 2, 1, 3, 4, 5],
        [0, 0, 1, 0, 1, 0],
    ]).T,
    columns=['id1', 'visit', 'fruits', 'cloths', 'Buyer'],
)
df['Buyer'] = df['Buyer'].astype('int')
How do I build a simple ML model that applies this preprocessing (for both fit and predict), so that the test data gets the same transformation (i.e. 0 for all one-hot columns not present in the test rows)? Can a Pipeline solve this? I am not good at writing pipelines and am getting errors.
droplist = ['id1', 'visit', 'fruits', 'cloths']
pipe = Pipeline(steps=[
    ("preprocess", preprocess(df)),
    ("coltrans", ColumnTransformer([("drop", 'drop', droplist)])),
    ("model", GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)),
])
Can someone help?
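One way this could work (a sketch under assumptions, not a definitive answer): the groupby aggregation changes the number of rows and creates the target, so it has to stay outside the Pipeline, because sklearn transformers may not change n_samples. The one-hot step, however, can become a custom transformer that remembers the vocabulary it saw during fit, so unseen test rows get 0 for every known category and unknown categories are ignored:
from sklearn.base import BaseEstimator, TransformerMixin

class ListOneHotEncoder(BaseEstimator, TransformerMixin):
    """Fit one MultiLabelBinarizer per list column and reuse its
    vocabulary at transform time."""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        self.encoders_ = {cl: MultiLabelBinarizer().fit(X[cl]) for cl in self.columns}
        return self

    def transform(self, X):
        X = X.copy()
        for cl, enc in self.encoders_.items():
            newd = pd.DataFrame(
                enc.transform(X[cl]),  # unseen labels are ignored with a warning
                columns=enc.classes_,
                index=X.index,
            ).add_prefix(cl + "_")
            X = X.join(newd)
        return X.drop(columns=self.columns)

# Usage sketch: df1 is assumed to be the aggregated frame produced inside
# preprocess() (i.e. the result of the groupby/agg, before the one-hot step)
X = df1[['fruits', 'cloths']]
y = df1['is_Pax']
pipe = Pipeline(steps=[
    ('onehot', ListOneHotEncoder(columns=['fruits', 'cloths'])),
    ('model', GradientBoostingClassifier()),
])
pipe.fit(X, y)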

pyspark.sql.utils.IllegalArgumentException: 'Field "features" does not exist

I am trying to perform topic modelling and sentiment analysis on text data with Spark NLP. I have done all the pre-processing steps on the dataset, but LDA raises an error.
Error: (screenshot not shown; the exception is quoted in the title)
The program is:
from pyspark.ml import Pipeline
from pyspark.ml.clustering import LDA
from pyspark.ml.feature import Tokenizer, RegexTokenizer, StopWordsRemover, CountVectorizer, IDF, Normalizer
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col, lit, concat, regexp_replace, udf
from pyspark.sql.types import IntegerType
from pyspark.sql.utils import AnalysisException

dataframe_new = spark.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('/home/cdh#psnet.com/Gourav/chap3/abcnews-date-text.csv')
get_tokenizers = Tokenizer(inputCol="headline_text", outputCol="get_tokens")
get_tokenized = get_tokenizers.transform(dataframe_new)
remover = StopWordsRemover(inputCol="get_tokens", outputCol="row")
get_remover = remover.transform(get_tokenized)
counter_vectorized = CountVectorizer(inputCol="row", outputCol="get_features")
getmodel = counter_vectorized.fit(get_remover)
get_result = getmodel.transform(get_remover)
idf_function = IDF(inputCol="get_features", outputCol="get_idf_feature")
train_model = idf_function.fit(get_result)
outcome = train_model.transform(get_result)
lda = LDA(k=10, maxIter=10)
model = lda.fit(outcome)
Schema of the DataFrame after the IDF stage: (screenshot not shown)
According to the documentation, LDA includes a featuresCol argument, with default value featuresCol='features', i.e. the name of the column that holds the actual features; according to your shown schema, such a column is not present in your dataframe, hence the expected error.
It is not exactly clear which column contains the features in your dataframe - get_features or get_idf_feature (they look identical in the sample you show); assuming it is get_idf_feature, you should change the LDA call to:
lda = LDA(featuresCol='get_idf_feature', k=10, maxIter=10)
The Spark (including PySpark) ML API follows a quite different logic from, say, scikit-learn and similar frameworks; one of the differences is indeed that all the features have to be in a single column of the respective dataframe. For a general demonstration of the idea, see my own answer to KMeans clustering in PySpark (it is about k-means, but the logic is identical).
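To illustrate that last point (a sketch only; the column names x1, x2 and df are placeholders, and in your case CountVectorizer/IDF already produce vector columns), separate numeric columns are typically combined with VectorAssembler:
from pyspark.ml.feature import VectorAssembler

# Assemble individual numeric columns into the single vector column
# that Spark ML estimators expect (named 'features' by default)
assembler = VectorAssembler(inputCols=['x1', 'x2'], outputCol='features')
df_with_features = assembler.transform(df)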

Not able to train models in sklearn (scikit-learn) using Python

I have a data file containing data to predict admission to an MS program.
It has 9 columns: 8 columns contain student data and the 9th contains the student's chance of selection.
I am new to this and don't understand the error that comes up while training the model.
import pandas
import numpy as np
import sklearn as sl
from sklearn.neural_network import MLPClassifier
classifier = MLPClassifier()
data = pandas.read_csv('Addmition.csv')
data_array = np.array(data)
X = data_array[:,1:8]
y = data_array[:,8]
classifier.fit(X,y)
print(classifier)
Traceback (most recent call last):
  File "c.py", line 14, in <module>
    classifier.fit(X,y)
  File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 977, in fit
    hasattr(self, "classes_")))
  File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 324, in _fit
    X, y = self._validate_input(X, y, incremental)
  File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 920, in _validate_input
    self._label_binarizer.fit(y)
  File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\preprocessing\label.py", line 413, in fit
    self.classes_ = unique_labels(y)
  File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\multiclass.py", line 96, in unique_labels
    raise ValueError("Unknown label type: %s" % repr(ys))
ValueError: Unknown label type: (array
Try this:
import pandas
import numpy as np
from sklearn.neural_network import MLPRegressor

regressor = MLPRegressor()
data = pandas.read_csv('Addmition.csv')
data_array = np.array(data)
X = data_array[:, 1:8]
y = data_array[:, 8]
regressor.fit(X, y)
print(regressor)
Explanation:
In machine learning we may have two types of problems:
1) Classification. Example: predict whether a person is male or female (discrete).
2) Regression. Example: predict the age of a person (continuous).
With this in hand, look at your problem: your label (chance of selection) is continuous, so we have a regression problem.
You were using MLPClassifier, which results in the 'Unknown label type' error.
Try using MLPRegressor instead.
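As a quick check (a sketch using the same y as above), scikit-learn exposes the label-type inference that triggers this error:
from sklearn.utils.multiclass import type_of_target

# For a column of fractional admission chances this prints 'continuous',
# which classifiers reject and regressors accept
print(type_of_target(y))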

How do you make a KMeans prediction more accurate?

I'm learning about clustering and KMeans, so my knowledge of the topic is very basic. What I have below is a bit of a self-study on how it works. Basically, if 'a' shows up in any of the columns, 'Binary' equals 1; essentially I am trying to teach it a pattern. I learned the following from a tutorial using the Titanic dataset, but I've adapted it to my own data.
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt
My constructed data:
dataset = [
    [0,'x','f','g'], [1,'a','c','b'], [1,'d','k','a'], [0,'y','v','w'],
    [0,'q','w','e'], [1,'c','a','l'], [0,'t','x','j'], [1,'w','o','a'],
    [0,'z','m','n'], [1,'z','x','a'], [0,'f','g','h'], [1,'h','a','c'],
    [1,'a','r','e'], [0,'g','c','c'],
]
df = pd.DataFrame(dataset, columns=['Binary','Col1','Col2','Col3'])
df.head()
df:
Binary Col1 Col2 Col3
------------------------
1 a b c
0 x t v
0 s q w
1 n m a
1 u a r
Encode the non-numeric columns to integers:
labelEncoder = LabelEncoder()
labelEncoder.fit(df['Col1'])
df['Col1'] = labelEncoder.transform(df['Col1'])
labelEncoder.fit(df['Col2'])
df['Col2'] = labelEncoder.transform(df['Col2'])
labelEncoder.fit(df['Col3'])
df['Col3'] = labelEncoder.transform(df['Col3'])
Set clusters to two, because it's either 1 or 0?
X = np.array(df.drop(['Binary'], 1).astype(float))
y = np.array(df['Binary'])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
Test it:
correct = 0
for i in range(len(X)):
    predict_me = np.array(X[i].astype(float))
    predict_me = predict_me.reshape(-1, len(predict_me))
    prediction = kmeans.predict(predict_me)
    if prediction[0] == y[i]:
        correct += 1
The result:
print(f'{round(correct/len(X) * 100)}% Accuracy')
>>> 71%
How can I get it more accurate to the point where it 99.99% knows that 'a' means binary column is 1? More data?
K-means does not even try to predict this value, because it is an unsupervised method: it is not a prediction algorithm but a structure-discovery task. Don't mistake clustering for classification.
The cluster numbers have no meaning; they are 0 and 1 only because these are the first two integers. K-means is also randomized: run it a few times and you will sometimes score just 29%, when the arbitrary cluster labels happen to be flipped relative to your 'Binary' column.
Also, k-means is designed for continuous input. You can apply it on binary encoded data, but the results will be pretty poor.
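To make the contrast concrete (a sketch, not part of the original answer): a supervised classifier can learn the "contains 'a'" rule from the labeled examples, which k-means has no reason to discover:
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame(dataset, columns=['Binary', 'Col1', 'Col2', 'Col3'])
# One-hot encode the letters so the presence of 'a' is directly learnable
X = pd.get_dummies(df[['Col1', 'Col2', 'Col3']])
y = df['Binary']

clf = DecisionTreeClassifier().fit(X, y)
print(clf.score(X, y))  # training accuracy; near 1.0 on this toy data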

How to fix tuple object error in feature union & pipelines (while using sklearn)?

I have a pandas DataFrame with 56 columns. Around half of the columns are float and the others are string (textual data); finally, column 56 is the label column. The dataset looks something like this:
Col1  Col2  ...  Col26  Col27        Col28          ...  Col55     Col56
-------------------------------------------------------------------------
1     4     ...  76     I like cats  Cats are cool  ...  Cat bags  1
...
(1900 rows)
I want to use both the numeric and the textual data to run classification algorithms. A quick Google search suggested that the best way to proceed is to use FeatureUnion.
This is the code so far
import pandas as pd
import numpy as np
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_csv('url')
X = df[['Col1', ..., 'Col55']]
y = df[['Col56']]
from sklearn.model_selection import train_test_split
stop_list = ('i', 'am', 'the', ...)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
pipeline = Pipeline([
    ('union', FeatureUnion([
        ('Col1', Pipeline([
            ('selector', ItemSelector(column='Col1')),
            ('caster', ArrayCaster())
        ])),
        ...
        ('Col27', Pipeline([
            ('selector', ItemSelector(column='Col27')),
            ('vectorizer', CountVectorizer())
        ])),
        ...
        ('Col55', Pipeline([
            ('selector', ItemSelector(column='Col55')),
            ('vectorizer', CountVectorizer())
        ]))
    ])),
    ('model', SVC())
])
Then I get an error
TypeError                                 Traceback (most recent call last)
<ipython-input-8-7a2cab7bed7d> in <module>
    167         ('Col27', Pipeline([
    168             ('selector', ItemSelector(column='Col27')),
--> 169             ('vectorizer', CountVectorizer(stop_words=stop_list))
    170         ]))
TypeError: 'tuple' object is not callable
I don't understand, since the exact same method is used here and here, and there doesn't seem to be any error there. What am I doing wrong? How can I fix this?
I think the issue is with how CountVectorizer is used. For example:
cv = CountVectorizer  # the class is never instantiated (missing parentheses)
word_count_vector = cv.fit_transform(data)
word_count_vector = cv.shape()
This produces a similar error. You could actually do the stuff manually: use CountVectorizer to create a sparse matrix of the text data, then align it with your numeric data matrix or DataFrame using sparse.hstack from scipy, which horizontally stacks two matrices with equal rows and equal or different columns.
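A minimal sketch of that manual route (the column names are placeholders taken from the question's sample):
from scipy import sparse

cv = CountVectorizer()                        # instantiated, with parentheses
text_matrix = cv.fit_transform(df['Col27'])   # sparse (n_rows, vocab_size) matrix
numeric_matrix = sparse.csr_matrix(df[['Col1', 'Col2']].values)

# hstack requires equal row counts; the result stays sparse
X = sparse.hstack([numeric_matrix, text_matrix])
model = SVC().fit(X, df['Col56'])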
