crawler: HTML file encoding issue

I am trying to write a crawler to collect some information, but the text I get differs from what the web page shows. For example, the word Möller is Möller in the HTML file.
How can I recover the original text after I download the HTML file?

I have fixed this problem and am posting the answer in case a beginner runs into the same thing.
I use chr() to substitute the wrong code, for example chr(246) in place of ö.
If there is a better solution, please tell me.
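One possibly better option, offered as a sketch: the post does not show the raw markup, so this assumes the differing text is an HTML character reference such as `&#246;` or `&ouml;` (a guess based on the chr(246) workaround). In that case Python's standard `html.unescape` decodes every reference at once, so there is no need to substitute characters one by one. The function name `decode_entities` is made up for illustration.

```python
import html


def decode_entities(raw: str) -> str:
    """Decode numeric and named HTML character references in raw markup."""
    return html.unescape(raw)


# Assumed input: the raw source holds a character reference instead of "ö".
print(decode_entities("M&#246;ller"))  # Möller
print(decode_entities("M&ouml;ller"))  # Möller
```

If the text is still wrong after this step, the cause is more likely the byte encoding used when reading the response than character references.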

Related

Where can I see the specification for thymeleaf th:method?

I've seen a lot of answers about how to send PUT/DELETE/PATCH HTTP requests with Thymeleaf by using th:method = "the_specific_method", but I haven't found the Thymeleaf specification for it. Can anyone show me where it is?
Thanks in advance.
I've tried to google for the answer, but no luck.

th:method isn't special to Thymeleaf -- it's just like any other plain old attribute which will output the result of an expression to the method attribute. It doesn't do (or care about) anything else. You can put any string and/or string expression into it, and Thymeleaf will happily output it.
th:method="${'the_specific_method'}"
will output
method="the_specific_method"
without regard to whether or not it's valid. If you want to learn about the method attribute, you just need to learn how method works in plain old regular HTML and how browsers (and/or Spring) work with it.

Getting yen symbol when I try to type backslash in PyCharm

Note: I have been redirected to this website, as it is believed to be the appropriate place for questions like this. If this is not the correct website, could someone please let me know where I can find help?
I'm trying to write my program in PyCharm, but for some annoying reason whenever I try to type \, it shows up as ¥.
Here's a screenshot:
This is actually supposed to say print('\n'). Whatever has happened has changed all the \ to ¥ in all my files!
And yes, I have tried copying and pasting the \, but it just ends up changing into ¥.
So, could someone please let me know how to fix this?

This could be happening because of the font you are using, particularly if it is a Japanese font. Change the font to an English font like Arial.
If that doesn't work, you can enter the backslash by its code point: in both Unicode and ASCII it is encoded at U+005C.
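To check whether the files really changed or only the display did, a small sketch like the following counts both characters in a piece of source text. The function name `count_backslash_and_yen` is hypothetical. If the count for U+00A5 is zero, the files still contain real backslashes and only the font is drawing them as ¥.

```python
BACKSLASH = chr(0x5C)  # U+005C REVERSE SOLIDUS
YEN_SIGN = chr(0xA5)   # U+00A5 YEN SIGN


def count_backslash_and_yen(source: str) -> dict:
    """Count real backslashes and real yen signs in the given text."""
    return {
        "backslash": source.count(BACKSLASH),
        "yen": source.count(YEN_SIGN),
    }


# Build the sample without typing a backslash, in case the editor shows it as a yen sign.
sample = "print('" + BACKSLASH + "n')"
print(count_backslash_and_yen(sample))
```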

\u0092 is not printed in UILabel

I have a local JSON file with some descriptions of an app, and I have found weird behaviour when parsing the \u0092 and \u0091 characters.
When the JSON file contains these characters, the corresponding parsed NSString is printed as "?" and in a UILabel it disappears completely.
Example: "L\u0092H\u00e9r." is shown as "LHér." instead of "L'Hér."
If I replace these characters with \u2019, then I can see the character ' in the UILabel.
Does anybody have any clue about this?
EDIT: For the moment I will substitute both of them with character \u2019, it is also a ' and there is no problem confusing it with a control character. Thank you all!

This answer is a little speculative, but I hope it gets you on the right track.
Your best bet may be to give up and replace \u0091 and \u0092 with something else as a preprocessing step before displaying the string. These are control characters and are unprintable in most encodings. But:
If the rest of the file is proper UTF, your JSON file probably has problems: the encoding is wrong (CP-1250?) while you read the file as UTF, some error was made when converting the file, or a similar issue. So another solution is of course fixing your file.
If you're not sure about how your file is encoded, it may simply be encoded in CP-1250 - so reading the file using NSWindowsCP1250StringEncoding might fix your problem.
BTW, if you hardcode the string @"\u0091", you'll get the compile-time error "Universal character name refers to a control character". Yes, not even a warning, it's that unprintable in Unicode ;)
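The preprocessing substitution suggested above could be sketched as follows. It is written in Python rather than Objective-C to keep it short, and it assumes the stray control characters are really CP-1250 bytes that were carried over unchanged, which is the speculation in this answer, not a confirmed diagnosis. The name `repair_c1_controls` is hypothetical.

```python
def repair_c1_controls(text: str, codepage: str = "cp1250") -> str:
    """Reinterpret C1 control characters (U+0080..U+009F) as code page bytes.

    Assumption: each such character is a single code page byte that was
    copied through without conversion. Bytes the code page does not
    define are left untouched.
    """
    repaired = []
    for ch in text:
        code = ord(ch)
        if 0x80 <= code <= 0x9F:
            try:
                ch = bytes([code]).decode(codepage)
            except UnicodeDecodeError:
                pass  # undefined in this code page, keep the original character
        repaired.append(ch)
    return "".join(repaired)


fixed = repair_c1_controls("L\u0092H\u00e9r.")
print(fixed == "L\u2019H\u00e9r.")  # True
```

In CP-1250, bytes 0x91 and 0x92 are the left and right single quotation marks (U+2018 and U+2019), which matches the manual replacement with \u2019 described in the question.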

How to replace Mandrill's *| |* symbols?

Is there any way to replace Mandrill's *| |* symbols?
The CMS I'm using (MODX) has its own symbols to enclose tags, e.g. [[+ ]]
The case is that I also have "read on web" link, where the page on the web needs to generate dynamic content as well.
I have googled and searched on http://help.mandrill.com but still no luck.
Any hint will be appreciated.

You wouldn't be able to use different symbols in your emails: those are how Mandrill's system recognizes merge tags and replaces them in the HTML and/or text of your email. You'd need to convert any placeholders you have or want for the email to that format, so you can pass the data to Mandrill as expected. If it's going to mirror what you're putting on the web, then you probably just want something that transforms strings, for example, to convert your CMS tags to Mandrill tags specifically for the emails.
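A string transformation of that kind might look like the sketch below. It is not Mandrill's or MODX's API, just a regular expression that rewrites placeholders of the form [[+name]] into *|NAME|* before the message is handed to Mandrill. The function name `modx_to_mandrill`, the allowed characters in a tag name, and the uppercasing are all assumptions made for illustration.

```python
import re

# Matches MODX-style placeholders such as [[+fname]] or [[+ fname ]].
# Assumption: tag names consist of letters, digits and underscores only.
MODX_TAG = re.compile(r"\[\[\+\s*([A-Za-z0-9_]+)\s*\]\]")


def modx_to_mandrill(template: str) -> str:
    """Rewrite [[+name]] placeholders as Mandrill-style *|NAME|* merge tags."""
    return MODX_TAG.sub(lambda m: "*|" + m.group(1).upper() + "|*", template)


print(modx_to_mandrill("Hi [[+fname]], read on web: [[+url]]"))
# Hi *|FNAME|*, read on web: *|URL|*
```

Running the conversion only on the copy that goes out by email leaves the web version free to keep its MODX tags.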

@kaitlin-mandrill,
Exactly,
I just figured it out.
I need to replace it right before it is sent.
More or less, this is the code.
Hopefully it's useful for anyone else.

How to parse a .xfa file

Hoping that someone has some info on how to parse an .xfa file. I can parse CSV or XML files just fine, but an .xfa one has come along and I'm not familiar with the format. It looks like a tab-delimited body with column metadata at the top.
Anyone dealt with these before or can give me a steer on how to parse them?
I use VB.NET, but the language of any solution isn't too relevant.
Much appreciated.

Mmm, looks like nobody has a clue. The problem is that .xfa doesn't look like a "standard" extension: after all, anybody can create their own extension names, from .xyz to .something...
I looked around a bit, found, unsurprisingly (the 'x') an XML format with this extension, not much more.
Indicating where this kind of file comes from, and what kind of data it holds, might help. Or not.
You describe the file as being a simple TSV (tab separated values) with a header. It is quite trivial to parse, with a tokenizer or some regex, so I am not sure where you are stuck.
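As a rough illustration of how little is needed for plain TSV, here is a Python sketch using the standard `csv` module. It assumes the "column metadata at the top" is a single header row of column names, which the question does not confirm. A real .xfa file may need extra handling for its metadata lines. The function name `parse_tab_delimited` and the sample data are made up.

```python
import csv
import io


def parse_tab_delimited(text: str) -> list:
    """Parse tab-separated text whose first line holds the column names."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


# Hypothetical sample data, not taken from a real .xfa file.
sample = "id\tname\n1\tAda\n2\tLin\n"
for row in parse_tab_delimited(sample):
    print(row["id"], row["name"])
```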

I think you might be talking about this: http://en.wikipedia.org/wiki/XFA_forms
This seemed to be a page that was designed to deal with that template: http://www.w3.org/1999/05/XFA/xfa-template-19990614
That information should be enough to get the ball rolling. If that fails then you can always analyse the file itself for patterns and go from there. I don't see it being too tricky.
Anyway, I hope that helps.
P.S. If you could provide a link to that .xfa we could probably give you more help.

The original post says the content looks like "tab delimited body with column metadata at the top". An XFA form doesn't look anything like that - XFA forms typically use a *.xdp extension and are XML.

Check out the Adobe page:
http://partners.adobe.com/public/developer/xml/index_arch.html
(Adobe XML Forms Architecture, currently 1400 pages)
Let LiveCycle/Acrobat parse it for you.
