How do I convert a list of dictionaries to a Huggingface Dataset object?

I have a list of dictionaries:
print(type(train_dataset))
>>> <class 'list'>
print(len(train_dataset))
>>> 4000
train_dataset[0]
>>>
{'id': '7',
'question': {'stem': 'Who is A',
'choices': [{'text': 'A is X', 'label': 'A'},
{'text': 'A is not B', 'label': 'D'}]},
'answerKey': 'D'}
How can I convert this to a Hugging Face Dataset object? From their website it seems like you can only convert a pandas DataFrame (dataset = Dataset.from_pandas(df)) or a dictionary (dataset = Dataset.from_dict(my_dict)), but it's not clear how to use a list of dictionaries.
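For what it's worth, newer releases of the datasets library expose Dataset.from_list, which accepts exactly this structure; on older versions you can pivot the list into a dict of columns and use from_dict. A sketch, assuming a reasonably recent datasets version:

from datasets import Dataset

# train_dataset is the list of dicts shown above
dataset = Dataset.from_list(train_dataset)

# on older versions, pivot the list into a dict of columns and use from_dict:
# dataset = Dataset.from_dict({k: [d[k] for d in train_dataset] for k in train_dataset[0]})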

Related

Sequential Model incompatible with layer

I've recently updated my project to include more intents for my NLU chatbot. I retrained the model. However, when I give the program an input, I receive an error message saying:
File "C:\Users\jiann\ChatBot - Copy\chatbot.py", line 39, in predict_clas
s
res = model.predict(np.array([bow]))[0]
File "c:\users\jiann\appdata\local\programs\python\python39\lib\site-pack
ages\keras\utils\traceback_utils.py", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "c:\users\jiann\appdata\local\programs\python\python39\lib\site-pack
ages\tensorflow\python\framework\func_graph.py", line 1147, in autograph_ha
ndler
raise e.ag_error_metadata.to_exception(e)
ValueError: in user code:
File "c:\users\jiann\appdata\local\programs\python\python39\lib\site-pa
ckages\keras\engine\training.py", line 1801, in predict_function *
return step_function(self, iterator)
File "c:\users\jiann\appdata\local\programs\python\python39\lib\site-pa
ckages\keras\engine\training.py", line 1790, in step_function **
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "c:\users\jiann\appdata\local\programs\python\python39\lib\site-pa
ckages\keras\engine\training.py", line 1783, in run_step **
outputs = model.predict_step(data)
File "c:\users\jiann\appdata\local\programs\python\python39\lib\site-pa
ckages\keras\engine\training.py", line 1751, in predict_step
return self(x, training=False)
File "c:\users\jiann\appdata\local\programs\python\python39\lib\site-pa
ckages\keras\utils\traceback_utils.py", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
ckages\keras\engine\input_spec.py", line 264, in assert_input_compatibilityckages\keras\engine\input_spec.py", line 264, in assert_input_compatibilityckage
raise ValueError(f'Input {input_index} of layer "{layer_name}" is ' raise ValueError(f'Input {input_index} of layer "{layer_name}" is '
ValueError: Input 0 of layer "sequential" is incompatible with the laye
r: expected shape=(None, 9), found shape=(None, 40)
This error only pops up when I include more than one intent. Below I've included the relevant code for the Sequential model and the intents:
Intents.json:
{"intents": [
{"tag": "greeting",
"patterns": ["Hi", "How are you", "Is anyone there?", "Hello", "Good day", "Whats up", "Hey", "greetings"],
"responses": ["Hello!", "Good to see you again!", "Hi there, how can I help?"],
"context_set": ""
},
{"tag": "goodbye",
"patterns": ["cya", "See you later", "Goodbye", "I am Leaving", "Have a Good day", "bye", "cao", "see ya"],
"responses": ["Sad to see you go :(", "Talk to you later", "Goodbye!"],
"context_set": ""
},
{"tag": "stocks",
"patterns": ["what stocks do I own?", "how are my shares?", "what companies am I investing in?", "what am I doing in the markets?"],
"responses": ["You own the following shares: ABBV, AAPL, FB, NVDA and an ETF of the S&P 500 Index!"],
"context_set": ""
}
]
}
training.py:
import random
import json
import pickle
import numpy as np
import nltk
from keras.layers import Dense, Dropout
from keras.models import Sequential
from keras.optimizer_v2.gradient_descent import SGD
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
# Lemmatizer uses stem of a word instead of conjugate (performance purposes)
from nltk.stem import WordNetLemmatizer
from tensorflow import keras
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers i
# mport Dense, Activation, Dropout
# from tensorflow.keras.optimizers import SGD
lemmatizer = WordNetLemmatizer()
# Reading json file, pass to load function, get json object dictionary
intents = json.loads(open('intents.json').read())
words = []
classes = []
documents = []
# Characters that you won't pay attention to
ignore_letters = ['?', '!', '.', ',']
# Splits each pattern entry into individual words
for intent in intents['intents']:
    for pattern in intent['patterns']:
        word_list = nltk.word_tokenize(pattern)
        words.extend(word_list)
        # Word list belongs to a specific tag
        documents.append((word_list, intent['tag']))
        if intent['tag'] not in classes:
            classes.append(intent['tag'])
print(documents)
# lemmatize each word in the word list if it is not ignored
words = [lemmatizer.lemmatize(word) for word in words if word not in ignore_letters]
#Set Eliminates duplicate words
words = sorted(set(words))
classes = sorted(set(classes))
#Save the words in file
pickle.dump(words,open('words.pkl','wb'))
#Save classes in file
pickle.dump(classes,open('classes.pkl','wb'))
#CREATING THE TRAINING DATA
#Set individual word values to 0 or 1 depending on whether it occurs
training = []
output_empty = [0] * len(classes)
for document in documents:
    bag = []
    word_patterns = document[0]
    word_patterns = [lemmatizer.lemmatize(word.lower()) for word in word_patterns]
    for word in words:  # check whether each known word occurs in the pattern
        bag.append(1) if word in word_patterns else bag.append(0)
    output_row = list(output_empty)
    # mark the class this document belongs to with a 1 in output_row
    output_row[classes.index(document[1])] = 1
    training.append([bag, output_row])
#shuffle the data
random.shuffle(training)
#turn into numpy array
training = np.array(training)
#split into x and y values, Features & Labels
train_x =list(training[:,0])
train_y = list(training[:,1])
#Start building Neural Network Model
model = Sequential()
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]),activation='softmax'))
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
hist = model.fit(np.array(train_x), np.array(train_y), epochs=200, batch_size=5, verbose=1)
model.save('chatbotmodel.h5',hist)
print('done')
chatbot.py:
import random
import pickle
import numpy as np
import nltk
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
from nltk.stem import WordNetLemmatizer
from keras.models import load_model
lemmatizer = WordNetLemmatizer()
words = pickle.load(open('words.pkl', 'rb'))
classes = pickle.load(open('classes.pkl', 'rb'))
model = load_model('chatbot_model.model')
print(classes)
def clean_up_sentence(sentence):
    sentence_words = nltk.word_tokenize(sentence)
    sentence_words = [lemmatizer.lemmatize(word) for word in sentence_words]
    return sentence_words

def bag_of_words(sentence):
    sentence_words = clean_up_sentence(sentence)
    bag = [0] * len(words)
    for w in sentence_words:
        for i, word in enumerate(words):
            if word == w:
                bag[i] = 1
    return np.array(bag)

def predict_class(sentence):
    bow = bag_of_words(sentence)
    res = model.predict(np.array([bow]))[0]
    # Allows for some uncertainty:
    # predictions below the threshold are not taken into account
    ERROR_THRESHOLD = 0.25
    results = [[i, r] for i, r in enumerate(res) if r > ERROR_THRESHOLD]
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append({'intent': classes[r[0]], 'probability': str(r[1])})
    return return_list

def get_response(intents_list, intents_json):
    tag = intents_list[0]['intent']
    list_of_intents = intents_json['intents']
    for i in list_of_intents:
        if i['tag'] == tag:
            result = random.choice(i['responses'])
            break
    return result

print("Go! Bot is running!")
If I had to take a guess, it would be something wrong with the shape. I'm just not sure how to fix this.
There seems to be a mismatch between the input_shape of your model and the sample(s) you are feeding it. I believe the issue stems from these two lines:
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
and
res = model.predict(np.array([bow]))[0]
The input layer is built for len(train_x[0]) features, but calling model.predict() on np.array([bow]) fails because the bag-of-words vector has 40 entries while the loaded model expects 9. Note also that training.py saves the model as 'chatbotmodel.h5' while chatbot.py loads 'chatbot_model.model', so you may be loading a stale model trained on the old, smaller vocabulary; after adding intents, retrain and make sure the chatbot loads the freshly saved model together with the regenerated words.pkl. Check out this answer for an in-depth explanation of how the various Keras inputs work.
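A quick way to confirm the mismatch is to compare the saved vocabulary against the loaded model's input shape; a minimal sketch, assuming the files produced by training.py above:

import pickle
from keras.models import load_model

words = pickle.load(open('words.pkl', 'rb'))  # vocabulary used to build each bag-of-words vector
model = load_model('chatbotmodel.h5')         # the file training.py actually saves

print(len(words))         # length of each bag-of-words vector (40 after adding intents)
print(model.input_shape)  # what the loaded model expects, e.g. (None, 9) if it is stale

If the two numbers disagree, the model on disk was trained with a different vocabulary than the one chatbot.py is using.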

How do I operate on groups returned by Dask's group by?

I have the following table.
value category
0 2 A
1 20 B
2 4 A
3 40 B
I want to add a mean column that contains the mean of the values for each category.
value category mean
0 2 A 3.0
1 20 B 30.0
2 4 A 3.0
3 40 B 30.0
I can do this in pandas like so
p = pd.DataFrame({"value":[2, 20, 4, 40], "category": ["A", "B", "A", "B"]})
groups = []
for _, group in p.groupby("category"):
group.loc[:,"mean"] = group.loc[:,"value"].mean()
groups.append(group)
pd.concat(groups).sort_index()
How do I do the same thing in Dask?
I can't use the pandas functions as-is because you can't enumerate over a groupby object in Dask. This
import dask.dataframe as dd
d = dd.from_pandas(p, chunksize=100)
list(d.groupby("category"))
raises KeyError: 'Column not found: 0'.
I can use an apply function to calculate the mean in Dask.
import dask.dataframe as dd
d = dd.from_pandas(p, chunksize=100)
q = d.groupby(["category"]).apply(lambda group: group["value"].mean(), meta="object")
q.compute()
returns
category
A 3.0
B 30.0
dtype: float64
But I can't figure out how to fold these back into the rows of the original table.
I would use a merge to achieve this operation:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame({
    'value': [2, 20, 4, 40],
    'category': ['A', 'B', 'A', 'B']
})
ddf = dd.from_pandas(df, npartitions=1)
# Lazy-compute mean per category
mean_by_category = (ddf
                    .groupby('category')
                    .agg({'value': 'mean'})
                    .rename(columns={'value': 'mean'})
                    ).persist()
mean_by_category.head()
# Assign 'mean' value to each corresponding category
ddf = ddf.merge(mean_by_category, left_on='category', right_index=True)
ddf.head()
Which should then output:
category value mean
0 A 2 3.0
2 A 4 3.0
1 B 20 30.0
3 B 40 30.0
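One caveat: as the output above shows, the merge can reorder rows. If you want the original row order back, sorting on the index after computing restores it (a small sketch, assuming the default integer index was kept):

# bring the result back to pandas and restore the original row order
result = ddf.compute().sort_index()
print(result)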

pyspark ml model map id column after prediction

I have trained a classification model using pyspark.ml.classification.RandomForestClassifier and applied it on a new dataset for prediction.
I am removing the customer_id column before feeding the dataset to the model, but I am not sure how to map the customer_id back after prediction. There is no way for me to identify which row belongs to which customer, as Spark dataframes are inherently unordered.
Here is a nice example from the Spark docs of classification using the pipeline technique, where the original schema is preserved and only the selected columns are used as input features for the learning algorithm (note: I swapped the original estimator for a random forest).
reference => https://spark.apache.org/docs/latest/ml-pipeline.html
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import HashingTF, Tokenizer
# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])
# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and rf.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, rf])
# Fit the pipeline to training documents.
model = pipeline.fit(training)
# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])
# Make predictions on test documents and print columns of interest.
prediction = model.transform(test)
# schema is preserved
prediction.printSchema()
root
|-- id: long (nullable = true)
|-- text: string (nullable = true)
|-- words: array (nullable = true)
| |-- element: string (containsNull = true)
|-- features: vector (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = false)
# sample row
for i in prediction.take(1): print(i)
Row(id=4, text='spark i j k', words=['spark', 'i', 'j', 'k'], features=SparseVector(262144, {20197: 1.0, 24417: 1.0, 227520: 1.0, 234657: 1.0}), rawPrediction=DenseVector([5.0857, 4.9143]), probability=DenseVector([0.5086, 0.4914]), prediction=0.0)
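Since the schema is preserved, you can read the id straight off the predictions alongside the columns you care about (a short sketch using the dataframe above):

# id survives model.transform() because it was never consumed as a feature
prediction.select("id", "probability", "prediction").show(truncate=False)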
Here is a nice example from the Spark docs of the VectorAssembler class, where multiple columns are combined into a single feature vector that is fed to the learning algorithm.
reference => https://spark.apache.org/docs/latest/ml-features.html#vectorassembler
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
dataset = spark.createDataFrame(
    [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)],
    ["id", "hour", "mobile", "userFeatures", "clicked"])
assembler = VectorAssembler(
    inputCols=["hour", "mobile", "userFeatures"],
    outputCol="features")
output = assembler.transform(dataset)
print("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")
output.select("features", "clicked").show(truncate=False)
Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'
+-----------------------+-------+
|features |clicked|
+-----------------------+-------+
|[18.0,1.0,0.0,10.0,0.5]|1.0 |
+-----------------------+-------+
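Putting the two together for the customer_id case: keep the id out of inputCols and it will ride along untouched through the pipeline. A sketch with hypothetical column and dataframe names (customer_id, age, balance, num_orders, train_df, new_df are placeholders, not from the question):

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler

# customer_id is deliberately absent from inputCols, so it is preserved in the output
assembler = VectorAssembler(inputCols=["age", "balance", "num_orders"], outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10)
pipeline = Pipeline(stages=[assembler, rf])

model = pipeline.fit(train_df)    # train_df: customer_id, age, balance, num_orders, label
scored = model.transform(new_df)  # new_df: customer_id, age, balance, num_orders
scored.select("customer_id", "prediction").show()  # each prediction still carries its id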

How does this binary encoder function work?

I'm trying to understand the logic behind this binary encoder.
It automatically takes categorical variables and dummy-codes them (similar to one-hot encoding in sklearn), but reduces the number of output columns to roughly the log2 of the number of unique values.
Basically, when I used this library, I noticed that my dummy variables are limited to only a few of the unique values. Upon further investigation I noticed this @staticmethod, which takes the log2 of the number of unique values in a categorical variable.
My question is WHY? I realize that this reduces the dimensionality of the output data, but what is the logic behind doing this? How does taking the log2 determine how many digits are needed to represent the data?
@staticmethod
def calc_required_digits(X, col):
    """
    figure out how many digits we need to represent the classes present
    """
    return int(np.ceil(np.log2(len(X[col].unique()))))
Full source code:
"""Binary encoding"""
import copy
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from category_encoders.ordinal import OrdinalEncoder
from category_encoders.utils import get_obj_cols, convert_input
__author__ = 'willmcginnis'
class BinaryEncoder(BaseEstimator, TransformerMixin):
    """Binary encoding for categorical variables, similar to onehot, but stores categories as binary bitstrings.

    Parameters
    ----------
    verbose: int
        integer indicating verbosity of output. 0 for none.
    cols: list
        a list of columns to encode, if None, all string columns will be encoded
    drop_invariant: bool
        boolean for whether or not to drop columns with 0 variance
    return_df: bool
        boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array)
    impute_missing: bool
        boolean for whether or not to apply the logic for handle_unknown, will be deprecated in the future.
    handle_unknown: str
        options are 'error', 'ignore' and 'impute', defaults to 'impute', which will impute the category -1. Warning: if
        impute is used, an extra column will be added in if the transform matrix has unknown categories. This can cause
        unexpected changes in dimension in some cases.

    Example
    -------
    >>> from category_encoders import *
    >>> import pandas as pd
    >>> from sklearn.datasets import load_boston
    >>> bunch = load_boston()
    >>> y = bunch.target
    >>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    >>> enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X, y)
    >>> numeric_dataset = enc.transform(X)
    >>> print(numeric_dataset.info())
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 506 entries, 0 to 505
    Data columns (total 16 columns):
    CHAS_0     506 non-null int64
    RAD_0      506 non-null int64
    RAD_1      506 non-null int64
    RAD_2      506 non-null int64
    RAD_3      506 non-null int64
    CRIM       506 non-null float64
    ZN         506 non-null float64
    INDUS      506 non-null float64
    NOX        506 non-null float64
    RM         506 non-null float64
    AGE        506 non-null float64
    DIS        506 non-null float64
    TAX        506 non-null float64
    PTRATIO    506 non-null float64
    B          506 non-null float64
    LSTAT      506 non-null float64
    dtypes: float64(11), int64(5)
    memory usage: 63.3 KB
    None
    """

    def __init__(self, verbose=0, cols=None, drop_invariant=False, return_df=True, impute_missing=True, handle_unknown='impute'):
        self.return_df = return_df
        self.drop_invariant = drop_invariant
        self.drop_cols = []
        self.verbose = verbose
        self.impute_missing = impute_missing
        self.handle_unknown = handle_unknown
        self.cols = cols
        self.ordinal_encoder = None
        self._dim = None
        self.digits_per_col = {}
    def fit(self, X, y=None, **kwargs):
        """Fit encoder according to X and y.

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : encoder
            Returns self.
        """
        # if the input dataset isn't already a dataframe, convert it to one (using default column names)
        # first check the type
        X = convert_input(X)
        self._dim = X.shape[1]
        # if columns aren't passed, just use every string column
        if self.cols is None:
            self.cols = get_obj_cols(X)
        # train an ordinal pre-encoder
        self.ordinal_encoder = OrdinalEncoder(
            verbose=self.verbose,
            cols=self.cols,
            impute_missing=self.impute_missing,
            handle_unknown=self.handle_unknown
        )
        self.ordinal_encoder = self.ordinal_encoder.fit(X)
        for col in self.cols:
            self.digits_per_col[col] = self.calc_required_digits(X, col)
        # drop all output columns with 0 variance.
        if self.drop_invariant:
            self.drop_cols = []
            X_temp = self.transform(X)
            self.drop_cols = [x for x in X_temp.columns.values if X_temp[x].var() <= 10e-5]
        return self
    def transform(self, X):
        """Perform the transformation to new categorical data.

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]

        Returns
        -------
        p : array, shape = [n_samples, n_numeric + N]
            Transformed values with encoding applied.
        """
        if self._dim is None:
            raise ValueError('Must train encoder before it can be used to transform data.')
        # first check the type
        X = convert_input(X)
        # then make sure that it is the right size
        if X.shape[1] != self._dim:
            raise ValueError('Unexpected input dimension %d, expected %d' % (X.shape[1], self._dim, ))
        if not self.cols:
            return X
        X = self.ordinal_encoder.transform(X)
        X = self.binary(X, cols=self.cols)
        if self.drop_invariant:
            for col in self.drop_cols:
                X.drop(col, 1, inplace=True)
        if self.return_df:
            return X
        else:
            return X.values
    def binary(self, X_in, cols=None):
        """
        Binary encoding encodes the integers as binary code with one column per digit.
        """
        X = X_in.copy(deep=True)
        if cols is None:
            cols = X.columns.values
            pass_thru = []
        else:
            pass_thru = [col for col in X.columns.values if col not in cols]
        bin_cols = []
        for col in cols:
            # get how many digits we need to represent the classes present
            digits = self.digits_per_col[col]
            # map the ordinal column into a list of these digits, of length digits
            X[col] = X[col].map(lambda x: self.col_transform(x, digits))
            for dig in range(digits):
                X[str(col) + '_%d' % (dig, )] = X[col].map(lambda r: int(r[dig]) if r is not None else None)
                bin_cols.append(str(col) + '_%d' % (dig, ))
        X = X.reindex(columns=bin_cols + pass_thru)
        return X
    @staticmethod
    def calc_required_digits(X, col):
        """
        figure out how many digits we need to represent the classes present
        """
        return int(np.ceil(np.log2(len(X[col].unique()))))

    @staticmethod
    def col_transform(col, digits):
        """
        The lambda body to transform the column values
        """
        if col is None or float(col) < 0.0:
            return None
        else:
            col = list("{0:b}".format(int(col)))
            if len(col) == digits:
                return col
            else:
                return [0 for _ in range(digits - len(col))] + col
My question is WHY? I realize that this reduces the dimensionality of
the output data, but what is the logic behind doing this?
Basically, the point of categorical encoding is to make your algorithm able to deal with categorical features. Several methods are available for doing so, including binary encoding. Its logic is actually close to the logic of one-hot encoding (OHE), if you understood that one.
For binary encoding, each unique label in your categorical vector is randomly associated with a number between 0 and (the number of unique labels - 1). You then encode this number in base 2 and "transcribe" it into 0s and 1s through the newly created columns.
As an example, let's say your dataset has three different labels: 'A', 'B' and 'C'.
The following correspondence is randomly built:
'A' -> 1 -> 01;
'B' -> 2 -> 10;
'C' -> 0 -> 00.
Therefore, an example of encoding of a given dataset is:
index  my_category  enc_category_0  enc_category_1
0      A            1               0
1      B            0               1
2      C            0               0
3      A            1               0
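To make that concrete, here is a minimal pandas sketch of the same idea (the real encoder assigns the ordinals through its OrdinalEncoder; category codes are used here purely for illustration):

import numpy as np
import pandas as pd

labels = pd.Series(['A', 'B', 'C', 'A'], name='my_category')
codes = labels.astype('category').cat.codes       # e.g. A -> 0, B -> 1, C -> 2
digits = int(np.ceil(np.log2(labels.nunique())))  # 2 bits suffice for 3 classes
# transcribe each ordinal into its binary digits, one output column per bit
encoded = pd.DataFrame({'enc_category_%d' % d: (codes // 2**d) % 2 for d in range(digits)})
print(encoded)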
Regarding its utility, as you said, it reduces the dimensionality. Besides, I guess it helps to have fewer zeros in the encoded columns than with OHE. Here is an interesting post: https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931
How does taking the log2 determine how many digits are needed to represent the data?
If you understood the working principle, you understand the use of log2. Computing the log2 of a number retrieves the number of digits needed for a binary encoding of that number. Example: ceil(log2(10)) = ceil(3.32) = 4, so 4 digits are needed to binary-encode 10.
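A quick check of that formula, using the same computation as calc_required_digits:

import numpy as np

def required_digits(n_classes):
    # smallest number of binary digits that can distinguish n_classes values
    return int(np.ceil(np.log2(n_classes)))

print(required_digits(3))   # 2 -> codes 0..2 fit into two bits: 00, 01, 10
print(required_digits(10))  # 4 -> ceil(log2(10)) = ceil(3.32) = 4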
For more info about the implementation and code example: http://contrib.scikit-learn.org/categorical-encoding/_modules/category_encoders/binary.html#BinaryEncoder
Hope I was clear,
Tchau

Unknown y_type for KNeighborsClassifier

I'm trying to run KNeighborsClassifier on some numpy arrays and I've been getting the error ValueError: Unknown label type: 'unknown'
The types of my X_matrix and my y_vector are both
<class 'numpy.ndarray'>, and their shapes are respectively
(46, 240)
(46,)
Both the X_matrix and y_vector contain only ints, the y_vector containing only 1s and 0s.
Any help will be greatly appreciated.
When you pass label (y) data to the KNeighborsClassifier via classifier.fit(X_matrix, y_vector), it expects y_vector to be a 1D list of labels. If your labels live in a pandas structure, convert them first:
y_vector = list(y_vector.values)
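If the error persists after that, a common cause of ValueError: Unknown label type: 'unknown' is a label array with object dtype; casting the labels to int is worth a try (an assumption, since the question doesn't show the array's dtype):

import numpy as np

# ensure integer labels rather than object dtype (a guess at the root cause)
y_vector = np.asarray(y_vector).astype(int)
classifier.fit(X_matrix, y_vector)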
You need to check the shape of the numpy arrays:
Example
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
y = np.array( [0,1,0,1,0] )
x=np.array( [ [2.3,5.3,6.8,9,10],[1,2,3,4,5] ] )
x=x.reshape(5,2)
clf=KNeighborsClassifier()
clf.fit(x,y)
# check type and shape
type(x)
x.shape
type(y)
y.shape
Result:
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
(5L, 2L)
(5L,)
If you want to predict using the fitted clf:
x_new = np.array( [10, 20] )
x_new = x_new.reshape(1,2)
clf.predict(x_new)
Result:
array([0])
