Rails more than one empty space - ruby-on-rails

I want to retrieve a string from the database with Rails (Rails 2.3.8 and JRuby 1.6.5.1) that looks like "I am    human" (with several consecutive spaces between the words), but in a browser it always looks like "I am human". The extra spaces in between are gone. How can I keep the extra spaces, or what is stripping the string?
Operation is pretty simple, just pull a string and show it in view.
Thanks.

Nothing is stripping the string - it's the semantics of HTML that is screwing you over. All consecutive spaces in HTML are rendered as a single space, unless explicitly prevented by a <pre> element or an equivalent CSS rule (white-space). You can also keep your spaces if you convert them to another kind of space - a non-breaking space (&nbsp;) does not get collapsed.

Browsers interpret ordinary spaces this way. You can use &nbsp; to keep the spaces.
Do
string_from_db.gsub(' ', '&nbsp;')
or use a <pre> tag.
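A minimal sketch of the gsub approach, with one assumption added on top of the answer: the text is HTML-escaped first, so that the & in &nbsp; is the only markup the helper introduces. The helper name preserve_spaces is hypothetical and not part of Rails.

```ruby
require 'cgi'

# Hypothetical helper: escape the text for HTML, then turn every ordinary
# space into a non-breaking space so the browser does not collapse runs.
def preserve_spaces(text)
  CGI.escapeHTML(text).gsub(' ', '&nbsp;')
end
```

In a Rails 2.3 view this could be called as <%= preserve_spaces(string_from_db) %>. Note that text made entirely of non-breaking spaces will no longer wrap at those spaces.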

Related

iPhone XML parsing Norwegian characters æ ø å

I've had this problem for a long time but I've been implementing this ugly hack on the backend to get around it.
Now I've decided to act as a real developer and deal with it.
My problem is that when parsing an XML feed with any of the Norwegian characters æ, ø or å in the title node, all the letters appearing before these special characters are omitted.
So if the word is "Bålhuset" it only displays "ålhuset" - the funny thing is that æ, ø and å characters AFTER the initial problem character are included.
So if I put for example "ÅBålhuset", I will get "Bålhuset". So it seems it's only the first occurrence of one of these special characters that causes a problem.
Any help would be immensely appreciated!
-Chris
When creating the XML, try wrapping the text in CDATA tags, like
<title><![CDATA[Transport "Bålhuset"Classic World's]]></title>
Also, here is a list of HTML tags and more cases. XML with those characters is invalid unless they are contained within a CDATA section. Also try this question; I hope it will help you.
Otherwise you need to use their character codes. If you want to represent ö, you need to type &#246;.
You can validate your XML while creating it and easily fix the bug.
What did it for me was getting the data in JSON and using the native JSON methods; no dropped characters and other sporadic behaviour.
So what that means to me is that there is an issue with NSXMLParser that makes it choke on international characters (on the first occurrence, mind you) even though everything is in order with the encoding etc.

RegEx how to properly use OR pipelines

I need to know how to properly use "OR" when it comes to individual characters and whole phrases... For example I have code that is checking for any number of characters OR words that are found in an array...
I want to check for some unicode characters and also some html lines of code.
I'm currently just checking for the characters using this:
([\u200b\u200c\u200d\0\1\2\3\4\5\6\7]*)
(the backslashes are representing the unicode characters u+200b - u+200d and the special characters in my software \0-\7 (They are all individual characters), these are valid escape sequences in Objective-C.)
Now what if I wanted to check for these characters AND check for phrases like <b> or <font color="#FF0000">
I found stuff while doing research that said to use pipelines | but I'm not sure if I put them only in-between the words or also in-between the individual characters and I'm not sure if I put quotes around the words or what not... I need help before I screw this up badly haha!
(p.s., not sure if it will be any different but I'm also doing it for this:
([^\u200b\u200c\u200d\0\1\2\3\4\5\6\7])
It'd be something like
/([^....]|\<b\/\>|\<font color .... \>)/
though, the usual caveats about regexes and html apply here.
As for the confusion about where to put the |, consider this hackneyed example: you want to find the word color, but also want to accommodate the British spelling, colour:
/(color|colour)/
/(colou?r)/
/(colo(r|ur))/
are all basically equivalent.
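The same idea can be checked in Ruby, whose regex syntax is close enough for this purpose. This is only a sketch: the constant and method names are made up for illustration, and the second pattern assumes the goal is to remove the invisible characters and the two example tags from a string.

```ruby
# The three spellings of the alternation from the answer.
COLOR_PATTERNS = [/(color|colour)/, /(colou?r)/, /(colo(r|ur))/].freeze

# True when every variant matches the given word.
def all_color_patterns_match?(word)
  COLOR_PATTERNS.all? { |pattern| pattern.match?(word) }
end

# A character class and whole phrases can sit side by side: the | goes
# between complete alternatives, and the phrases need no quotes.
INVISIBLE_OR_TAG = /[\u200b\u200c\u200d]|<b>|<font color="#FF0000">/

# Remove every invisible character or example tag from the text.
def strip_invisible_and_tags(text)
  text.gsub(INVISIBLE_OR_TAG, '')
end
```

Inside the brackets each character is already an alternative on its own, so no | is needed there; the | only separates the bracket expression from the multi-character phrases.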

FitNesse: Can't see the difference between the expected and the actual result in failed assertion

I'm using FitNesse to test web service responses using check to compare the expected to the actual response.
In some cases the check is failing and I can't see what the differences are between the expected and the actual that is causing it to fail.
Here's a screenshot from what it's telling me in a specific instance (of many similar instances):
Feel free to point out the obvious; it's probably staring at me in the face and I'm looking so hard I can't see it!
I would check that the expected and actual strings are both written with the same text encoding. I've seen this error plenty of times when the text comparison failed because a comma or apostrophe was written in a different encoding.
It is possible that your string contains extra spaces in the actual value. FitNesse, being html based, will not respect leading or trailing spaces. It might not handle any extra spaces inside the actual either. So this can cause the result to be different, but not visibly so.
See if you can add some debug messages that would help you see the extra spaces, or at least count the number of characters in both strings.
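As a rough illustration of that kind of debug output, here is a sketch in Ruby (the fixtures in question are more likely Java or .NET, so treat this as pseudocode for the idea). The method name first_difference is hypothetical.

```ruby
# Report the first position where two strings differ, or nil when they are
# identical. The character at that position is nil for the shorter string.
def first_difference(expected, actual)
  limit = [expected.length, actual.length].max
  index = (0...limit).find { |i| expected[i] != actual[i] }
  return nil if index.nil?

  { index: index, expected: expected[index], actual: actual[index] }
end
```

Printing the result with inspect, together with both string lengths, makes an extra space or a carriage return visible even when the rendered HTML hides it.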
This question doesn't specify whether Slim or Fit are being used, or which Slim server/plugin if using Slim, but I found the following to be true for me using FitNesse release 20130530 and fitSharp release 2.2:
1. Non-ASCII characters and apostrophes (single quote characters) in input arguments/parameters that are strings are HTML encoded. The values in my FitNesse test tables are HTML encoded, but only the required syntax characters and (double) quotes; not the non-ASCII characters (and FitNesse doesn't seem to have any problems storing those values).
2. EOL characters in the input arguments that are strings consist of a linefeed character only.
3. I imagine that because I'm using .NET, EOLs in my return values consist of carriage return and linefeed characters.
Because of [1], I'm HTML-encoding non-ASCII characters (but not the HTML syntax characters or quotes). Because of [2] and [3], I'm now removing carriage return characters from my fixture return values. Both changes seem to have resolved this issue for me and expected and actual values are now reported as being the same.
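The two changes described above could look roughly like this. It is a Ruby sketch of the idea only, not the .NET fixture code the answer refers to, and the method name normalize_for_fitnesse is made up.

```ruby
# Drop carriage returns and write every non-ASCII character as a decimal
# character reference, leaving HTML syntax characters and quotes untouched.
def normalize_for_fitnesse(value)
  without_carriage_returns = value.delete("\r")
  without_carriage_returns.gsub(/[^\x00-\x7F]/) do |char|
    format('&#%d;', char.ord)
  end
end
```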
Whitespace has troubled me often. The resulting HTML just collapses whitespace, but the compare in code does not.
I now use a fixture to make differences more explicit to me. Example usage: http://fhoeben.github.io/hsac-fitnesse-fixtures/examples-results/HsacExamples.SlimTests.UtilityFixtures.CompareFixtureTest.html
Newer versions of FitNesse (since 20151230) do a diff on the expected and actual result values. Has that helped you at all?

&#146; is getting converted to "\u0092" by nokogiri in ruby on rails

I have an HTML page which has the following line with some HTML entities like "&#146;".
# Here I am not pasting the whole HTML page content, just the line with the issue.
require 'nokogiri'
html_file = "<html>....<body><p>they&#146;re originally intended to describe the spread of of viral diseases, but they&#146;re nice analogies for how web/SN apps grow.<p> ...</body></html>"
doc = Nokogiri::HTML(html_file)
body = doc.xpath('//body')
body_content = body[0].inner_html
puts body_content
Result:
These terms come from the fields of medicine and biology they\u0092re originally intended to describe the spread of of viral diseases, but they\u0092re nice analogies for how web/SN apps grow.
I want to leave these entities as they are instead of changing them to Unicode.
Am I missing anything?
Thanks
they&#146;re
is wrong and should be avoided. If you want to use a close-single-quote there, to reproduce the typographical practice of rendering apostrophes as a slanted quote, then the correct character is U+2019 RIGHT SINGLE QUOTATION MARK, which can be written as &#8217; or &rsquo;. Or, if you're using UTF-8, just include it verbatim as ’.
&#146; should refer to character U+0092, a little-used and pointless control character that typically renders as blank or a missing-glyph box. And indeed in XML, it does.
But in HTML (other than XHTML, which uses XML rules), it's a long-standing browser quirk that character references in the range &#128; to &#159; are misinterpreted to mean the characters associated with bytes 128 to 159 in the Windows Western code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behaviour.
The problem is that Nokogiri doesn't know about this quirk, and takes character reference 146 at its word, ending up with the character 146 (\u0092) that you don't really want. I think Nokogiri is using libxml2 to parse HTML, so ultimately the proper fix would be to libxml2's htmlParseCharRef function, to substitute characters 128–159.
In the meantime you could perhaps try ‘fixing up’ character references manually with crude string substitution like &#146; -> &#8217; before parsing. It's a bit wrong, but at least in HTML the only other place you can have the markup sequence &#146; without it being a character reference would be in a comment, so hopefully it wouldn't matter if you changed the content there accidentally too.
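A sketch of that crude fix-up, run on the raw HTML string before it is handed to the parser. The table maps decimal references in the 128 to 159 range to the Unicode code points of the corresponding cp1252 bytes; values that cp1252 leaves undefined are not listed and pass through unchanged. The names CP1252_FIXUPS and fix_cp1252_references are hypothetical, and only decimal references are handled.

```ruby
# cp1252 byte value => Unicode code point of the character it represents.
CP1252_FIXUPS = {
  128 => 0x20AC, 130 => 0x201A, 131 => 0x0192, 132 => 0x201E,
  133 => 0x2026, 134 => 0x2020, 135 => 0x2021, 136 => 0x02C6,
  137 => 0x2030, 138 => 0x0160, 139 => 0x2039, 140 => 0x0152,
  142 => 0x017D, 145 => 0x2018, 146 => 0x2019, 147 => 0x201C,
  148 => 0x201D, 149 => 0x2022, 150 => 0x2013, 151 => 0x2014,
  152 => 0x02DC, 153 => 0x2122, 154 => 0x0161, 155 => 0x203A,
  156 => 0x0153, 158 => 0x017E, 159 => 0x0178
}.freeze

# Rewrite decimal character references that fall in the cp1252 range so
# they name the intended Unicode character; leave all others alone.
def fix_cp1252_references(html)
  html.gsub(/&#(\d+);/) do |reference|
    code_point = CP1252_FIXUPS[Regexp.last_match(1).to_i]
    code_point ? format('&#%d;', code_point) : reference
  end
end
```

The fixed string can then be passed to Nokogiri::HTML as before. Whether the output keeps the reference or shows the literal ’ character depends on how the document is serialized.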
Have you tried changing
&#146;
into
&amp;#146;
I think the parser parses the ampersand first, then concatenates it with "#146" and parses them both. It's just an opinion, though. (I meant this to be a comment rather than an answer.)
Well I got the idea from focos in his answer post here, and the unicode from here.

Sanitize pasted text from MS-Word

Here's my wild and wacky pseudo-code. Anyone know how to make this real?
Background:
This dynamic content comes from a ckeditor, and a lot of folks paste Microsoft Word content into it. No worries: if I just output the attribute untouched, it loads pretty. The catch is that I want it abbreviated to just 125 characters. When I add truncation, all of the Microsoft Word markup starts popping up. I then added simple_format, sanitize and truncate, and even made my controller spot specific strings that MS Word produces and gsub them out. But there are too many of them, and it seems like an awfully messy way to accomplish this. Realizing that the content is clean by itself, I thought, why not just slice it? However, the Microsoft Word text renders as blank but still occupies its positions in the string. So I came up with the (probably awful) solution below.
It's in three steps.
1. When the text renders, it doesn't display any of the MS Word junk. But that text still occupies positions in a slice statement. So I want to use a regexp to find the first actual character.
2. Take that character and find out what its position is in the total string.
3. Use a slice statement to cut from there.
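def about_us_truncated(first_real_character)
  # first_real_character: a Regexp matching the first character of the
  # actual text (still to be worked out)
  start = about_us.index(first_real_character) || 0
  # Take 125 characters starting at that position
  about_us[start, 125]
end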
The only other idea I have is a regex statement that explicitly slices only actual characters, like so:
about_us([a-zA-Z][0..125]) , but that is definitely not how it is written.
Here is some sample text of MS Word junk :
&Lt;! [If Gte Mso 9]>&Lt;Xml>&Lt;Br /> &Lt;O:Office Document Settings>&Lt;Br /> &Lt;O:Allow Png/>&Lt;Br /> &Lt;/O:Off...
You haven't provided much information to go off of, but don't be too leery of trying to build this regex on your own before you seek help...
Take your sample text and paste it in Rubular in the test string area and start building your regex. It has a great quick reference at the bottom.
Stumbled across this
http://gist.github.com/139987
it looks like it requires the sanitize gem.
This is technically not a straight answer, but it seems like the best possible one you can find.
To keep MS Word markup out, you should use CKEditor's built-in MS Word sanitizer. Writing a regex for it can get very complicated, and you can very easily break tags in half and destroy your site with it.
As a workaround, I forced pasting as plain text in CKEditor.
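For the 125-character preview specifically, one option is to reduce the stored HTML to plain text before truncating, so that no tag can be cut in half. The sketch below is an assumption-laden illustration, not the approach from the gist or from CKEditor: it assumes the stored content holds real (unescaped) Word conditional comments and tags, and the names WORD_CONDITIONAL, ANY_TAG and plain_text_preview are made up.

```ruby
# Word wraps its settings in conditional comments such as
# <!--[if gte mso 9]> ... <![endif]-->; drop those blocks whole.
WORD_CONDITIONAL = /<!--\[if.*?<!\[endif\]-->/mi
# Any remaining tag, replaced by a space so words do not run together.
ANY_TAG = /<[^>]*>/

# Strip the markup, collapse whitespace, then cut to the limit.
def plain_text_preview(html, limit = 125)
  text = html.gsub(WORD_CONDITIONAL, ' ').gsub(ANY_TAG, ' ')
  text = text.gsub(/\s+/, ' ').strip
  text.length > limit ? "#{text[0, limit]}..." : text
end
```

Because the result is plain text, it is only suitable for the abbreviated preview; the full content should still go through the editor's own sanitizer as advised above.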

Resources