Why are there strange characters in the embedding value? - r-text

I am doing a simple text embedding task with the textEmbed function in r-text.
rm(list=ls())
Sys.setenv(LANG = "C.UTF-8", LC_ALL="C.UTF-8")
library(text)
temp <- textEmbed("I'm trying to do so good and I keep messing up my life. I hate it so much.", model="roberta-large", layers=23:24, dim_name = FALSE)
View(temp[["tokens"]][["texts"]][[1]])
In the result, the column "tokens" has strange characters "Ġ", "<s>", "</s>", "<pad>". And some of the embedding rows do not have values, only "NA" values.
Could anyone kindly help me find out why?
I have tried nothing to solve it yet.

Thanks to the comments below the question: these symbols come from the tokenizer used in RoBERTa.
https://medium.com/analytics-vidhya/create-a-tokenizer-and-train-a-huggingface-roberta-model-from-scratch-f3ed1138180c
"<s>" or BOS, Beginning Of Sentence
"</s>" or EOS, End Of Sentence
"<pad>" the padding token
"Ġ" marks a token that is preceded by a space

Related

How to delete non text characters, new line, tab, etc, from params of JSON string

I have JSON strings that may contain \n, \t, which I don't want to save into database. strip_tags helps only with simple strings. I am using gsub(/(\\n)|(\\t)/, "").
I wonder if there is another Rails helper method or a better way to achieve this.
For example:
"[{\"type\":\"checkbox-group\",\"label\":\"\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nFill in\\nthe Gap (Please\\nfill in the blank box with correct wor\",\"name\":\"checkbox-group-1527245153706\",\"values\":[{\"label\":\"Option 1\",\"value\":\"option-1\",\"selected\":true}]},{\"type\":\"text\",\"label\":\"\\n\\t\\t\\n\\t\\n\\t\\n\\t\\t\\n\\t\\t\\t\\n\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t\\tWhat are the unique features of\\ne-commerce, digital markets, and\\ndigital goods? \\n\\t\\t\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t\",\"className\":\"form-control\",\"name\":\"text-1527245426509\",\"subtype\":\"text\"}]"
You can make use of squish or squish!
" Some text \n\n and a tab \t\t with new line \n\n ".squish
#=> "Some text and a tab with new line"
squish removes all whitespace from both ends of the string and collapses each remaining run of whitespace characters (\n, \t, space) into a single space.
I think this one may help you,
JSON.parse(string).map{ |a| a['label'] = a['label'].squish; a}
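As a rough sketch of the parse-then-squish approach above, the snippet below uses only Ruby's standard library. squish_like is a hypothetical stand-in that approximates ActiveSupport's squish (in a Rails app you would call squish directly), clean_labels is an illustrative name, and the raw string is a shortened, made-up version of the JSON from the question.

```ruby
require "json"

# Hypothetical helper approximating ActiveSupport's String#squish:
# collapse every run of whitespace to one space and trim both ends.
def squish_like(text)
  text.gsub(/\s+/, " ").strip
end

# Parse the JSON, clean each "label" value, and serialize it again.
# (clean_labels is an illustrative name, not a Rails helper.)
def clean_labels(json_string)
  fields = JSON.parse(json_string)
  fields.each do |field|
    field["label"] = squish_like(field["label"]) if field["label"].is_a?(String)
  end
  JSON.generate(fields)
end

# Shortened sample input (an assumption, not the original data).
raw = "[{\"type\":\"text\",\"label\":\"\\n\\t\\tWhat are the unique\\nfeatures? \\n\\t\"}]"
puts clean_labels(raw)
```

Cleaning after JSON.parse means the regex never has to deal with the escaped \\n and \\t sequences in the raw string, only with real whitespace inside each label.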

Match a word or whitespaces in Lua

(Sorry for my broken English)
What I'm trying to do is matching a word (with or without numbers and special characters) or whitespace characters (whitespaces, tabs, optional new lines) in a string in Lua.
For example:
local my_string = "foo bar"
my_string:match(regex) --> should return 'foo', ' ', 'bar'
my_string = "   123!#." -- note: three whitespaces before '123!#.'
my_string:match(regex) --> should return ' ', ' ', ' ', '123!#.'
Where regex is the Lua regular expression pattern I'm asking for.
Of course I've done some research on Google, but I couldn't find anything useful. What I've got so far is [%s%S]+ and [%s+%S+], but neither seems to work.
Any solution using the standard library, e.g. string.find, string.gmatch etc., is OK.
match returns either the captures or the whole match, and your patterns do not define any captures. [%s%S]+ matches "(whitespace or non-whitespace) one or more times", which is basically everything. [%s+%S+] is plain wrong: a character class [ ] is a set of single-character members. It does not treat sequences of characters in any other way ("[cat]" matches "c", "a" or "t"), nor does it care about + as a quantifier. So [%s+%S+] means "a single character that is whitespace, a plus, or non-whitespace", which again is any single character.
The first example 'foo', ' ', 'bar' could be solved by:
regex="(%S+)(%s)(%S+)"
If you want a variable number of captures you are going to need the gmatch iterator:
local capt={}
for q,w,e in my_string:gmatch("(%s*)(%S+)(%s*)") do
  if q and #q>0 then
    table.insert(capt,q)
  end
  table.insert(capt,w)
  if e and #e>0 then
    table.insert(capt,e)
  end
end
Note, however, that this returns each run of whitespace as a single string rather than one entry per space, so if you need the spaces individually you'll have to add that to the processing of the match results.
Lua's standard patterns are simplistic. If you need more intricate matching, you might want to have a look at the LPeg library.

Lua String Manipulation (Find Words Before & After)

I'm fairly new to this forum. I am having trouble with manipulating the correct string to achieve this.
Basically, what I'm trying to do is receive an input string like this example:
str = "Say hello to=Stack overflow, Say goodbye to=other resources"
for question, answer in pairs(string.gmatch(s, "(%w+)=(%w+)"))
print(question, answer)
end
I want it to return question = "Say hello to" and answer = "Stack overflow", then question = "Say goodbye to", and so on. Instead, it picks up only the word just before the equal sign and the word just after it. I've even tried the * quantifier, and it does exactly the same thing.
I've also tried this pattern
[%w%s]*=[%w%s]
I just want to be able to sort this string into a key-value table where the key is all words before each = and the value is all words after that equal but before the comma.
Does anyone have a suggestion?
You can use something like this:
local str = "Say hello to=Stack overflow, Say goodbye to=other resources"
for question, answer in string.gmatch(str..",", "([^=]+)=([^,]+),%s*") do
print(question, answer)
end
"([^=]+)=([^,]+),%s*" means the following: anything except = ([^=]) repeated 1 or more times (+) followed by = and then anything except ',', followed by comma and optional whitespaces (to avoid including them in the next question). I also added comma to the string, so it parses the last pair as well.
To elaborate a bit further per request in the comments: in the expression [^=]+, [=] designates a set with one allowed character (=) and [^=] negates that, so it's a set with any character allowed except = and + allows the set to be repeated 1 or more times.
As @lhf suggested, you can use a simpler expression: (.-)=(.-),%s*, which means: take all characters until the first = (- makes matching non-greedy) and then take all characters until the first ,.

Rails strip all except numbers commas and decimal points

Hi I've been struggling with this for the last hour and am no closer. How exactly do I strip everything except numbers, commas and decimal points from a rails string? The closest I have so far is:-
rate = rate.gsub!(/[^0-9]/i, '')
This strips everything but the numbers. When I try to add commas to the expression, everything gets stripped. I got the above from somewhere else, and as far as I can gather:
^ = not
Everything to the left of the comma gets replaced by what's in the '' on the right
No idea what the /i does
I'm very new to gsub. Does anyone know of a good tutorial on building expressions?
Thanks
Try:
rate = rate.gsub(/[^0-9,\.]/, '')
Basically, ^ means "not" when it is the first character inside the character class brackets [], which you are using, so you can just add the comma to the list. Outside a character class the period is a special character that means "match any character"; inside the brackets it is already literal, so the backslash is optional but harmless.
Also, be aware of whether you are using gsub or gsub!
gsub! has the bang, so it edits the instance of the string you're passing in, rather than returning another one.
So if using gsub! it would be:
rate.gsub!(/[^0-9,\.]/, '')
And rate would be altered.
If you do not want to alter the original variable, then you can use the version without the bang (and assign it to a different var):
cleaned_rate = rate.gsub(/[^0-9,\.]/, '')
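The difference between the two methods can be sketched as follows. The sample strings and variable names are made up for illustration; the one behavior worth noting is that gsub! returns nil when it makes no substitution, which matters if you assign its result back to a variable.

```ruby
# gsub returns a new string and leaves the receiver untouched.
rate = "USD 1,234.50 per hour"           # sample value (assumption)
cleaned_rate = rate.gsub(/[^0-9,.]/, "")
puts rate          # the original is unchanged
puts cleaned_rate  # only digits, commas and periods remain

# gsub! modifies the receiver in place.
mutable_rate = "USD 1,234.50 per hour".dup
mutable_rate.gsub!(/[^0-9,.]/, "")
puts mutable_rate

# gsub! returns nil when nothing was replaced, so writing
# `value = value.gsub!(...)` can silently replace the value with nil.
already_clean = "1,234.50".dup
result = already_clean.gsub!(/[^0-9,.]/, "")
puts result.inspect
```

If the cleaned value is what you want to keep, assigning the result of gsub (without the bang) is the safer of the two forms.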
I'd just google for tutorials. I haven't used one. Regexes are a LOT of time and trial and error (and table-flipping).
This is a cool tool to use with a mini cheat-sheet on it for ruby that allows you to quickly edit and test your expression:
http://rubular.com/
You can just add the comma and period in the square-bracketed expression:
rate.gsub(/[^0-9,.]/, '')
You don't need the i for case-insensitivity for numbers and symbols.
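A quick way to check both points is to run the variants side by side. This is only a sketch with a made-up sample string; it compares the unescaped class, the escaped class, and the same class with the i flag.

```ruby
sample = "Rate: $1,234.50 (approx.)"   # sample value (assumption)

unescaped = sample.gsub(/[^0-9,.]/, "")    # period left as-is inside the class
escaped   = sample.gsub(/[^0-9,\.]/, "")   # period escaped with a backslash
with_i    = sample.gsub(/[^0-9,.]/i, "")   # case-insensitive flag added

puts unescaped             # the trailing period of "approx." survives too
puts unescaped == escaped  # escaping makes no difference inside [...]
puts unescaped == with_i   # /i makes no difference for digits and symbols
```

Note that the class keeps every comma and period, not just the ones that belong to the number, so input such as "approx." still leaves a stray period behind.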
There's lots of info on regular expressions, regex, etc. Maybe search for those instead of gsub.
You can use this:
rate = rate.gsub(/[^0-9\.\,]/, '')
Also check this out to learn more about regular expressions:
http://www.regexr.com/

Split lua string into characters

I only found this related to what I am looking for: Split string by count of characters but it is not useful for what I mean.
I have a string variable made up of 3 digits (anything from 000 to 999). I need to separate the digits (characters) and put them into a table.
I am programming for a game mod which uses lua, and it has some extra functions. If you could help me to make it using: http://wiki.multitheftauto.com/wiki/Split would be amazing, but any other way is ok too.
Thanks in advance
Corrected to what the OP wanted to ask:
To just split a 3-digit number in 3 numbers, that's even easier:
s='429'
c1,c2,c3=s:match('(%d)(%d)(%d)')
t={tonumber(c1),tonumber(c2),tonumber(c3)}
The answer to "How do I split a long string composed of 3 digit numbers":
This is trivial. You might take a look at the gmatch function in the reference manual:
s="123456789"
res={}
for num in s:gmatch('%d%d%d') do
res[#res+1]=tonumber(num)
end
or if you don't like looping:
res={}
s:gsub('%d%d%d',function(n)res[#res+1]=tonumber(n)end)
I was looking for something like this, but avoiding looping - and hopefully having it as one-liner. Eventually, I found this example from lua-users wiki: Split Join:
fields = {str:match((str:gsub("[^"..sep.."]*"..sep, "([^"..sep.."]*)"..sep)))}
... which is exactly the kind of syntax I'd like - one liner, returns a table - except, I don't really understand what is going on :/ Still, after some poking about, I managed to find the right syntax to split into characters with this idiom, which apparently is:
fields = { str:match( (str:gsub(".", "(.)")) ) }
I guess what happens is that gsub puts parentheses '(.)' around each character '.', so that match treats each one as a separate capture and "extracts" them as separate values. But I still don't get why there is an extra pair of parentheses around the str:gsub(".", "(.)") piece.
I tested this with Lua5.1:
str = "a - b - c"
fields = { str:match( (str:gsub(".", "(.)")) ) }
print(table_print(fields))
... where table_print is from lua-users wiki: Table Serialization; and this code prints:
"a"
" "
"-"
" "
"b"
" "
"-"
" "
"c"
