AttributeError: 'int' object has no attribute 'lower' - machine-learning

I am trying to pass a tweet from a Flask UI and predict the type of the tweet: whether it is a donation, a disaster, etc.
Here is working code from a Jupyter notebook:
loaded_model = joblib.load('NB_spam_model.pkl')
result = loaded_model.score(X_test, y_test)
predict = loaded_model.predict([new_tweet])
print(result)
print(predict)
The results:
0.8409090909090909
['donations_and_help']
Could someone help me look at the code and point out where I am going wrong?
app = Flask(__name__)

@app.route('/')
def index():
    return render_template("index.html")

@app.route("/predict", methods=["GET", "POST"])
def api():
    if request.method == "POST":
        words = joblib.load('words.pkl')
        model = joblib.load('NB_spam_model.pkl')
        pstem = PorterStemmer()
        tweet = request.form["tweet"]
        text = tweet
        text = re.sub("[^a-zA-Z]", ' ', text)
        text = text.lower()
        text = text.split()
        text = [pstem.stem(word) for word in text if not word in set(stopwords.words('english'))]
        text = ' '.join(text)
        print("This is the text: %s" % text)
        query = []
        for word in words:
            if word in text:
                query.append(1)
            else:
                query.append(0)
        print(query)
        #prediction = list(model.predict(np.matrix(query)))[0]
        pred = model.predict(query)[0]
        print(pred)
        if pred == 1:
            msg = "."
            return render_template("index.html", msg=msg, tweet=tweet)
        else:
            error = "Approximately 70%, your tweet Fake"
            return render_template("index.html", error=error, tweet=tweet)
    else:
        return redirect(url_for("index"))

if __name__ == '__main__':
    app.run(debug=False)
Stack trace:
Project\lib\site-packages\sklearn\feature_extraction\text.py",
line 69, in _preprocess
doc = doc.lower()
AttributeError: 'int' object has no attribute 'lower'
127.0.0.1 - - [03/Jun/2020 20:17:47] "POST /predict HTTP/1.1" 500 -

As you can see, the error is AttributeError: 'int' object has no attribute 'lower', which means an integer cannot be lower-cased; somewhere in your code an integer object is being lower-cased, which is not possible. The traceback points into sklearn's own text preprocessing (doc.lower() in feature_extraction/text.py), so it is the loaded model that is trying to lower-case your input: it apparently expects raw text, the way your notebook passed [new_tweet], but here predict receives query, a plain list of 0/1 integers.
First check the value you are passing to predict by printing it just before the call.
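A minimal sketch of the fix, assuming NB_spam_model.pkl is a scikit-learn Pipeline whose first step is a text vectorizer (which is what the notebook usage suggests); if your model instead expects the hand-built bag-of-words vector, it needs a 2-D array, not a flat list:
# Hedged sketch: if the pickled model is a Pipeline with its own vectorizer,
# feed it the preprocessed string, exactly as the notebook code did.
pred = model.predict([text])[0]

# Alternative, if the model really does take the manual 0/1 feature vector:
# scikit-learn estimators expect one row per sample, so reshape first.
# import numpy as np
# pred = model.predict(np.array(query).reshape(1, -1))[0]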

Related

IfcOpenShell(Parse)_IFC PropertySet, printing issue

Hi, I am new to programming and I have problems with printing my property sets and values.
I have multiple elements in my IFC file and want to parse all property sets and values.
My current result prints the ID of every element, but it takes the attributes (property sets and values) from the first one.
My code:
import ifcopenshell

ifc_file = ifcopenshell.open('D:\\PZI_9-1_1441_LIN_CES_1-17c-O_M-M3.ifc')

products = ifc_file.by_type('IFCPROPERTYSET')
for product in products:
    print(product.is_a())
    print(product)  # Prints

Category_Name_1 = ifc_file.by_type('IFCBUILDINGELEMENTPROXY')[0]
for definition in Category_Name_1.IsDefinedBy:
    property_set = definition.RelatingPropertyDefinition
    headers_list = []
    data_list = []
    max_len = 0
    for property in property_set.HasProperties:
        if property.is_a('IfcPropertySingleValue'):
            headers = property.Name
            data = str(property.NominalValue.wrappedValue)
            #print(headers)
            headers_list.append(headers)
            if len(headers) > max_len: max_len = len(headers)
            #print(data)
            data_list.append(data)
            if len(data) > max_len: max_len = len(data)
    headers_list = [headers.ljust(max_len) for headers in headers_list]
    data_list = [data.ljust(max_len) for data in data_list]
    print(" ".join(headers_list))
    print(" ".join(data_list))
Does somebody have a solution?
Thanks and kind regards.
On the line:
Category_Name_1 = ifc_file.by_type('IFCBUILDINGELEMENTPROXY')[0]
it seems that you are always referring to the first IfcBuildingElementProxy object (because of the 0 index). Instead of taking only index 0, you should iterate over every such product, I would guess.
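A minimal sketch of that idea, reusing the names from your code (the two is_a guards are my assumption, since IsDefinedBy can also yield other relationship and definition types):
# Hedged sketch: loop over every IfcBuildingElementProxy rather than only the first.
for proxy in ifc_file.by_type('IFCBUILDINGELEMENTPROXY'):
    print(proxy.GlobalId)
    for definition in proxy.IsDefinedBy:
        # Only property relationships carry RelatingPropertyDefinition.
        if not definition.is_a('IfcRelDefinesByProperties'):
            continue
        property_set = definition.RelatingPropertyDefinition
        if not property_set.is_a('IfcPropertySet'):
            continue  # skip e.g. quantity sets, which have no HasProperties
        for prop in property_set.HasProperties:
            if prop.is_a('IfcPropertySingleValue'):
                print(prop.Name, prop.NominalValue.wrappedValue)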

how to speed up "POS tag" with StanfordPOSTagger?

I want to extract noun phrases from tweets; the code follows. The problem is that it processes only 300 tweets at a time and takes 5 minutes. How can I speed it up?
By the way, some of the code was adapted from TextBlob.
I use the gate-EN-twitter model (https://gate.ac.uk/wiki/twitter-postagger.html) and the NLTK interface to the Stanford POS tagger to tag tweets:
from nltk.tag import StanfordPOSTagger
from nltk.tokenize import word_tokenize
import time, nltk

start_time = time.time()

CFG = {
    ('NNP', 'NNP'): 'NNP',
    ('NN', 'NN'): 'NNI',
    ('NNI', 'NN'): 'NNI',
    ('JJ', 'JJ'): 'JJ',
    ('JJ', 'NN'): 'NNI',
}

st = StanfordPOSTagger('/models/gate-EN-twitter.model', '/twitie_tagger/twitie_tag.jar', encoding='utf-8')

def _normalize_tags(chunk):
    '''Normalize the corpus tags.
    ("NN", "NN-PL", "NNS") -> "NN"
    '''
    ret = []
    for word, tag in chunk:
        if tag == 'NP-TL' or tag == 'NP':
            ret.append((word, 'NNP'))
            continue
        if tag.endswith('-TL'):
            ret.append((word, tag[:-3]))
            continue
        if tag.endswith('S'):
            ret.append((word, tag[:-1]))
            continue
        ret.append((word, tag))
    return ret

def noun_phrase_count(text):
    matches1 = []
    print('len(text)', len(text))
    for i in range(len(text) // 1000):
        tokenized_text = word_tokenize(text[i*1000:i*1000+1000])
        classified_text = st.tag(tokenized_text)
        tags = _normalize_tags(classified_text)
        merge = True
        while merge:
            merge = False
            for x in range(0, len(tags) - 1):
                t1 = tags[x]
                t2 = tags[x + 1]
                key = t1[1], t2[1]
                value = CFG.get(key, '')
                if value:
                    merge = True
                    tags.pop(x)
                    tags.pop(x)
                    match = '%s %s' % (t1[0], t2[0])
                    pos = value
                    tags.insert(x, (match, pos))
                    break
        matches = [t[0] for t in tags if t[1] in ['NNP', 'NNI']]
        matches1 += matches
    print("--- %s seconds ---" % (time.time() - start_time))
    fdist = nltk.FreqDist(matches1)
    return [(tag, num) for (tag, num) in fdist.most_common()]

noun_phrase_count(tweets)
Looks like a duplicate of "Stanford POS tagger with GATE twitter model is slow", so you may find more info there.
Additionally, if there is any chance of stumbling upon identical inputs (tweets) twice or more, you can consider a dictionary with the tweet (plain str) as key and the tagged result as value, so that when you encounter a tweet you first check whether it is already in your dict. If not, tag it and put it there (and if this route is viable, why not pickle/unpickle that dictionary so that debugging and subsequent runs of your code go faster as well).
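A minimal sketch of that caching idea (CACHE_PATH and tag_cached are hypothetical names; st and word_tokenize come from the question's code):
import os
import pickle

CACHE_PATH = 'tag_cache.pkl'  # hypothetical file name for the pickled cache

# Load a previously pickled cache if one exists, otherwise start empty.
if os.path.exists(CACHE_PATH):
    with open(CACHE_PATH, 'rb') as f:
        tag_cache = pickle.load(f)
else:
    tag_cache = {}

def tag_cached(tweet):
    # Only invoke the slow Stanford tagger for tweets not seen before.
    if tweet not in tag_cache:
        tag_cache[tweet] = st.tag(word_tokenize(tweet))
    return tag_cache[tweet]

# After processing, persist the cache so subsequent runs start warm.
with open(CACHE_PATH, 'wb') as f:
    pickle.dump(tag_cache, f)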

TF-IDF extracting keywords

I am working on a function somewhat like this:
def get_feature_name_by_tfidf(text_to_process):
    with open(master_path + '\\additional_stopwords.txt', 'r') as f:
        additional_stop_words = ast.literal_eval(f.read())
    stop_words = text.ENGLISH_STOP_WORDS.union(set(additional_stop_words))
    tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 4), min_df=0, stop_words=stop_words)
    tfidf_matrix = tf.fit_transform(text_to_process.split(','))
    tagged = nltk.pos_tag(tf.get_feature_names())
    feature_names_with_tags = {k: v for k, v in dict(tagged).items() if v != 'VBP'}
    return list(feature_names_with_tags.keys())
This returns the list of keywords in the passed text.
Is there any way to get the keywords in the same case as they were provided in the passed string?
Input:
a = "TIME is the company where I work"
Instead of getting the keyword list as:
['time', 'company']
I would like to get:
['TIME', 'company']
By default, TfidfVectorizer converts words to lowercase. Use this line:
tf = TfidfVectorizer(analyzer='word', lowercase=False, ngram_range=(1, 4), min_df=0, stop_words=stop_words)
and it should work. See the TfidfVectorizer documentation for reference.
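A minimal sketch of the effect (using get_feature_names() as in your code; newer scikit-learn versions rename it to get_feature_names_out(), and the exact tokens kept depend on your stop-word list):
from sklearn.feature_extraction.text import TfidfVectorizer

a = "TIME is the company where I work"
tf = TfidfVectorizer(analyzer='word', lowercase=False, stop_words='english')
tf.fit_transform([a])
# With lowercase=False the original casing survives into the vocabulary,
# so 'TIME' stays as-is instead of being folded to 'time'.
print(tf.get_feature_names())
One caveat worth knowing: stop-word filtering compares the now-unlowercased tokens against the lowercase stop list, so an all-caps stop word such as "THE" would no longer be removed.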

Emulate string to label dict

Since Bazel does not provide a way to map labels to strings, I am wondering how to work around this via Skylark.
Following is my partial, horrible "workaround".
First the statics:
_INDEX_COUNT = 50

def _build_label_mapping():
    lmap = {}
    for i in range(_INDEX_COUNT):
        lmap["map_name%s" % i] = attr.string()
        lmap["map_label%s" % i] = attr.label(allow_files = True)
    return lmap

_LABEL_MAPPING = _build_label_mapping()
And in the implementation:
item_pairs = {}
for i in range(_INDEX_COUNT):
    id = getattr(ctx.attr, "map_name%s" % i)
    if not id:
        continue
    mapl = getattr(ctx.attr, "map_label%s" % i)
    if len(mapl.files):
        item_pairs[id] = list(mapl.files)[0].path
    else:
        item_pairs[id] = ""
if item_pairs:
    arguments += [
        "--map", str(item_pairs),  # Pass JSON data
    ]
And then the rule:
_foo = rule(
    implementation = _impl,
    attrs = dict({
        "srcs": attr.label_list(allow_files = True, mandatory = True),
    }.items() + _LABEL_MAPPING.items()),
)
This needs to be wrapped like:
def foo(map={}, **kwargs):
    map_args = {}
    # TODO: Check whether order of items is defined
    for i, item in enumerate(map.items()):
        key, value = item
        map_args["map_name%s" % i] = key
        map_args["map_label%s" % i] = value
    return _foo(
        **dict(map_args.items() + kwargs.items())
    )
Is there a better way of doing that in Skylark?
To rephrase your question: you want to create a rule attribute that maps strings to labels?
This is currently not supported (see the list of attributes), but you can file a feature request for it.
Do you think using label_keyed_string_dict is a reasonable workaround? (It won't work if you have duplicated keys.)
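A minimal sketch of that workaround, under the assumption that inverting the mapping (label -> string) is acceptable for your use case; on older Bazel versions list(target.files) may be needed instead of target.files.to_list():
def _impl(ctx):
    # Invert the label -> string dict back into string -> file path.
    item_pairs = {}
    for target, name in ctx.attr.map.items():
        files = target.files.to_list()
        item_pairs[name] = files[0].path if files else ""
    # ... pass item_pairs to the action, as in your current implementation ...

_foo = rule(
    implementation = _impl,
    attrs = {
        "srcs": attr.label_list(allow_files = True, mandatory = True),
        "map": attr.label_keyed_string_dict(allow_files = True),
    },
)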

LPeg Increment for Each Match

I'm making a serialization library for Lua, and I'm using LPeg to parse the string. I've got K/V pairs working (with the key explicitly named), but now I'm going to add auto-indexing.
It'll work like so:
#"value"
#"value2"
Will evaluate to
{
    [1] = "value",
    [2] = "value2"
}
I've already got the value matching working (strings, tables, numbers, and Booleans all work perfectly), so I don't need help with that; what I'm looking for is the indexing. For each match of #[value pattern], it should capture the number of #[value pattern] matches found so far. In other words, I can match a sequence of values (#"value1" #"value2") but I don't know how to assign them indexes according to the number of matches. If that's not clear enough, just comment and I'll attempt to explain it better.
Here's something of what my current pattern looks like (using compressed notation):
local process = {} -- Process a captured value
process.number = tonumber
process.string = function(s) return s:sub(2, -2) end -- Strip off the opening and closing quotes
process.boolean = function(s) if s == "true" then return true else return false end end
number = [decimal number, scientific notation] / process.number
string = [double or single quoted string, supports escaped quotation characters] / process.string
boolean = P("true") + "false" / process.boolean
table = [balanced brackets] / [parse the table]
type = number + string + boolean + table
at_notation = (P("#") * whitespace * type) / [creates a table that includes the key and value]
As you can see in the last line of code, I've got a function that does this:
k,v matched in the pattern
-- turns into --
{k, v}
-- which is then added into an "entry table" (I loop through it and add it into the return table)
Based on what you've described so far, you should be able to accomplish this using a simple capture and table capture.
Here's a simplified example I knocked up to illustrate:
lpeg = require 'lpeg'
l = lpeg.locale(lpeg)
whitesp = l.space ^ 0
bool_val = (l.P "true" + "false") / function (s) return s == "true" end
num_val = l.digit ^ 1 / tonumber
string_val = '"' * l.C(l.alnum ^ 1) * '"'
val = bool_val + num_val + string_val
at_notation = l.Ct( (l.P "#" * whitesp * val * whitesp) ^ 0 )
local testdata = [[
#"value1"
#42
# "value2"
#true
]]
local res = l.match(at_notation, testdata)
The match returns a table with the following contents:
{
    [1] = "value1",
    [2] = 42,
    [3] = "value2",
    [4] = true
}
