PGError: ERROR: invalid byte sequence for encoding "UTF8" - ruby-on-rails

I'm getting the following PGError while ingesting Rails emails from Cloudmailin:
PGError: ERROR: invalid byte sequence for encoding "UTF8": 0xbb HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding". : INSERT INTO "comments" ("content") VALUES ('Reply with blah blah ����������������������������������������������������� .....
So it seems pretty clear that some invalid UTF-8 characters are getting into the email, right? I tried to clean that up, but something is still sneaking through. Here's what I have so far:
message_all_clean = params[:message]
Iconv.conv('UTF-8//IGNORE', 'UTF-8', message_all_clean)
message_plain_clean = params[:plain]
Iconv.conv('UTF-8//IGNORE', 'UTF-8', message_plain_clean)
#incoming_mail = IncomingMail.create(:message_all => Base64.encode64(message_all_clean), :message_plain => Base64.encode64(message_plain_clean))
Any ideas, thoughts or suggestions? Thanks

When we ran into this issue on Heroku, we sanitized incoming data (e.g. text pasted from Word) by converting it from US-ASCII and ignoring anything that could not be converted:
Iconv.conv("UTF-8//IGNORE", "US-ASCII", content)
With this, we had no more issues with character encoding.
Also, double-check that there are no other fields that need the same conversion, as the problem can affect any field that passes a block of text to the database.
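On Ruby 2.1 and later, where Iconv is no longer part of the standard library, a similar cleanup can be sketched with String#scrub. This is a minimal sketch rather than the approach from the answer above: the helper name sanitize_utf8 and the sample string are made up for illustration, and it assumes that dropping invalid bytes (instead of transcoding them) is acceptable. Like Iconv.conv, it returns a new string rather than changing its argument, so the return value is what has to be saved.

```ruby
# Hypothetical helper (not from the original posts): returns a UTF-8 copy of
# +text+ with every invalid byte sequence removed.
def sanitize_utf8(text)
  text.to_s.dup.force_encoding(Encoding::UTF_8).scrub("")
end

# Made-up sample: the byte 0xBB on its own is not valid UTF-8.
raw = "Reply with blah \xBB blah".b
clean = sanitize_utf8(raw)

puts clean.valid_encoding?
```

The cleaned value, not the original parameter, is what should then be passed to the model.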

Related

Rails loading ActiveSupport regex with ISO-8859-1 encoding instead of UTF-8

When I call '返回'.titleize on a Chinese-language string in the Rails 6 app on my server, I get an error:
Encoding::CompatibilityError: incompatible encoding regexp match (ISO-8859-1 regexp with UTF-8 string)
The source for the titleize function leads to the following in ActiveSupport::Inflector:
def titleize(word, keep_id_suffix: false)
humanize(underscore(word), keep_id_suffix: keep_id_suffix).gsub(/\b(?<!\w['’`()])[a-z]/) do |match|
match.capitalize
end
end
Although calling ActiveSupport::Inflector.titleize('返回') gives the error above, if I just copy the function body and run it as follows, there's no error; I just get the correct titleize behaviour:
ActiveSupport::Inflector.humanize(ActiveSupport::Inflector.underscore('返回'), keep_id_suffix: false).gsub(/\b(?<!\w['’`()])[a-z]/) do |match|
match.capitalize
end
I guess that the regular expression /\b(?<!\w['’`()])[a-z]/ is compiled with an ISO-8859-1 encoding when ActiveSupport::Inflector loads, but with a different encoding when I run it in my Rails console. However, I don't see anything I can do with this information.
What can I do to get the Rails helper titleize to work on my server as intended?
I'm on Rails 6, Ruby 2.6.6, and ENV["LANG"] is en_US.utf8.
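One way to test the hypothesis above is to build the same pattern twice, once from a source string tagged as ISO-8859-1 and once as UTF-8, and run both against the UTF-8 word. This is only a sketch: try_titleize is a made-up helper, and tagging the source with force_encoding merely simulates a file being read with the wrong encoding; it does not show how that happened on the server.

```ruby
# Same pattern source as in ActiveSupport::Inflector.titleize.
pattern_source = "\\b(?<!\\w['’`()])[a-z]"

# Simulate the source being read as ISO-8859-1 versus UTF-8.
latin1_regex = Regexp.new(pattern_source.dup.force_encoding(Encoding::ISO_8859_1))
utf8_regex   = Regexp.new(pattern_source)

# Hypothetical helper: applies the pattern and reports any encoding error.
def try_titleize(word, regex)
  word.gsub(regex) { |match| match.capitalize }
rescue Encoding::CompatibilityError => e
  "error: #{e.message}"
end

latin1_result = try_titleize("返回", latin1_regex)
utf8_result   = try_titleize("返回", utf8_regex)

puts latin1_result
puts utf8_result
```

If the first call reports the same incompatible-encoding error, that is consistent with the inflector source being read as ISO-8859-1 at load time. Checking Encoding.default_external and ActiveSupport::Inflector.method(:titleize).source_location on the server might then help narrow down where that comes from.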

neo4j-admin import "Multi-line fields are illegal"

I'm getting the following error in Neo4j Community 4.1.2 using the neo4j-admin import tool.
Caused by:ERROR in input
data source: BufferedCharSeeker[source:/home/ubuntu/workspace/neo4j-community-4.1.2/bin/../import/nodes.csv, position:24455, line:359]
in field: code:string:6
for header: [id:ID, labels:LABEL, type:string, flags:string, lineno:string, code:string, childnum:string, funcid:string, classname:string, namespace:string, endlineno:string, name:string, doccomment:string]
raw field value: 402
original error: At /home/ubuntu/workspace/neo4j-community-4.1.2/bin/../import/nodes.csv # position 24455 - Multi-line fields are illegal in this context and so this might suggest that there's a field with a start quote, but a missing end quote. See /home/ubuntu/workspace/neo4j-community-4.1.2/bin/../import/nodes.csv # position 24455.
I checked every single byte with hexedit (line 359, character 24455, line 358 and line 360):
357,AST,string,,34,"/load.php",1,310,,"",,,
358,AST,AST_CALL,,37,,9,310,,"",,,
359,AST,AST_NAME,NAME_NOT_FQ,37,,0,310,,"",,,
360,AST,string,,37,"wp_check_php_mysql_versions",0,310,,"",,,
361,AST,AST_ARG_LIST,,37,,1,310,,"",,,
362,AST,AST_INCLUDE_OR_EVAL,EXEC_REQUIRE,40,,10,310,,"",,,
This is the absurd situation:
no multi-line fields are present
no special characters are present
no extra 0A byte
no extra "start quote" without its matching "end quote"
I found some issues on GitHub, but they refer to old versions of Neo4j. What could be the reason?
Finally I found the line causing the exception.
The cause reported by the exception was correct, but the line number was completely wrong.
I located the real line by adding the flag --multiline-fields=true to the neo4j-admin import command.
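Because the reported line number can be wrong, an independent scan for lines with an odd number of double quotes can help find the real culprit. The sketch below is a heuristic, not part of neo4j-admin: the helper name and the sample rows are invented for illustration, and it assumes that quotes inside fields are escaped by doubling them, so a well-formed single-line record always has an even count.

```ruby
# Hypothetical helper: returns the 1-based numbers of lines that contain an
# odd number of double quotes (a likely missing start or end quote).
def lines_with_unbalanced_quotes(csv_text)
  csv_text.each_line.with_index(1)
          .select { |line, _number| line.count('"').odd? }
          .map { |_line, number| number }
end

# Made-up sample data: the second row has a start quote without an end quote.
sample = <<~CSV
  358,AST,AST_CALL,,37,,9,310,,"",,,
  359,AST,string,,37,"unterminated,0,310,,"",,,
  360,AST,AST_ARG_LIST,,37,,1,310,,"",,,
CSV

suspects = lines_with_unbalanced_quotes(sample)
puts suspects.inspect
```

For a real file, the text could be read with File.read and the reported numbers checked by hand, since a legitimate multi-line field would also show up as a pair of flagged lines.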

invalid byte sequence in UTF-8 for single quote in Ruby

I'm using the following code to show description in template:
json.description resource.description if resource.description.present?
It gives me an "invalid byte sequence in UTF-8" error. I dug into this a bit and found that the issue is that my description contains the single quote ’ instead of '. What's the best way to fix this encoding issue? The description is entered by the user and I have no control over it. Another odd thing: I have multiple test environments, all with the same Ruby and Rails versions and running the same code, but only one of them has this error.
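# Returns a valid UTF-8 copy of str, replacing bytes that cannot be converted.
def to_utf8(str)
  str = str.dup.force_encoding("UTF-8") # dup so the caller's string is not mutated
  return str if str.valid_encoding?
  # Not valid UTF-8: treat the data as raw bytes and replace each non-ASCII byte.
  str.force_encoding("BINARY").encode("UTF-8", invalid: :replace, undef: :replace)
end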
ref: https://stackoverflow.com/a/17028706/455770
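If the stray ’ arrives as the single byte 0x92, the text is likely Windows-1252 rather than UTF-8, and transcoding keeps the character instead of replacing it. This is a sketch under that assumption, which the question does not confirm; the helper name and the sample bytes are hypothetical.

```ruby
# Hypothetical helper: interprets the bytes as Windows-1252 and transcodes them
# to UTF-8. Assumption: the input really is Windows-1252.
def cp1252_to_utf8(str)
  str.dup.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
end

# Made-up sample: 0x92 is the right single quotation mark in Windows-1252.
raw = "It\x92s here".b
fixed = cp1252_to_utf8(raw)

puts fixed
```

A few byte values are undefined in Windows-1252, so encode can still raise for arbitrary input; passing undef: :replace, as in to_utf8, would cover that case.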

SearchIO.parse xml blast and ampersands cElementTree.ParseError: not well-formed (invalid token) error

I would like some advice on working around an XML parsing error. In my BLAST XML output, I have a description containing an '&' character, which is throwing off the SearchIO.parse function.
If I run
from Bio import SearchIO

qresults = SearchIO.parse(PLAST_output, "blast-xml")
for record in qresults:
    pass  # do some stuff
I get the following error:
cElementTree.ParseError: not well-formed (invalid token): line 13701986, column 30
Which directs me to this line:
<Hit_def>Lysosomal & prostatic acid phosphatases [Xanthophyllomyces dendrorhous</Hit_def>
Is there a way to override this in Biopython so I do not have to change my XML file? Right now I'm just using a try/except block, but that is not optimal!
Thanks for your help!
Courtney

Rails 3, Heroku: Taps Server Error: PGError: ERROR: invalid byte sequence for encoding "UTF8": 0xba

I have a Rails 3.0.9 application running both locally in my dev environment and remotely on a Heroku app. I have a method that imports a CSV file into a model, and this file can contain non-English characters such as °, á, é, í, etc. (it's in Spanish).
I am currently able to import the complete file (75k records) without any problems into my local dev (SQLite) database, but when uploading the db to Heroku with heroku db:push, it fails with the error I'm posting in the title:
!!! Caught Server Exception
HTTP CODE: 500
Taps Server Error: PGError: ERROR: invalid byte sequence for encoding "UTF8": 0xba
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
Apparently, Heroku has issues inserting the '°' character. (At the moment the file doesn't have any á,é,í, etc characters, but I suspect these might fail too.)
I have set the default encoding in my application.rb file as follows:
#.../application.rb
config.encoding = "utf-8"
What else can I do to set the 'client encoding' and solve this problem?
The byte 0xBA is º (the masculine ordinal indicator) in ISO-8859-1; on its own it is not valid UTF-8. So your CSV file is encoded as Latin-1, but you're trying to store it in your database as UTF-8 without fixing the encoding.
You can try telling your CSV library that it is dealing with Latin-1 encoded text and maybe it will take care of converting to UTF-8. If that doesn't work, then you can do it yourself with Iconv:
ruby-1.9.2 > Iconv.iconv('UTF-8', 'ISO-8859-1', "\xba")
=> ["º"]
ruby-1.9.2 > Iconv.iconv('UTF-8', 'ISO-8859-1', "\xb0")
=> ["°"]
You're not having trouble with SQLite because SQLite tends to be very forgiving and has a very loose type system. PostgreSQL, on the other hand, tends to be rather strict and properly complains if you try to feed it invalid data. I'd recommend that you stop developing on top of SQLite if you're going to deploy to Heroku and PostgreSQL; there are other differences that will cause problems (the behavior of GROUP BY and LIKE, for example).
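Iconv was removed from the standard library in Ruby 2.0; on newer Rubies the same Latin-1 to UTF-8 conversion can be sketched with String#encode. The sample bytes below are invented for illustration, and the sketch assumes the file really is ISO-8859-1.

```ruby
# Made-up sample: Latin-1 bytes containing 0xBA (º) and 0xB0 (°).
latin1_bytes = "N\xBA 5, 20\xB0".b

# Tag the bytes with their real encoding, then transcode to UTF-8.
utf8_text = latin1_bytes.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

puts utf8_text
puts utf8_text.valid_encoding?
```

Ruby's CSV library also accepts an encoding option of the form "ISO-8859-1:UTF-8" when opening a file, which may make the manual step unnecessary.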

Resources