Why is Twitter double-encoding XML entity references?
Here's an example tweet:
xml entity ref test < & '
The response from statuses/friends_timeline:
<status>
<created_at>Wed Jun 24 00:16:15 +0000 2009</created_at>
<id>2302770346</id>
<text>xml entity ref test &amp;lt; &amp; '</text>
<source>web</source>
<truncated>false</truncated>
shouldn't it be
&lt; &amp; '
I did some more tests. Here's what happens in the HTTP POST that sends the update:
sniff again < & '
post data:
authenticity_token=secret_sauce_removed&status=sniff+again+%3C+%26+'&twttr=true&return_rendered_status=true
I've confirmed Justin's observation that only < and > are double-encoded. The first line is the XML response, the second line is JSON.
<text>" &amp; ' &amp;lt; &amp;gt;</text>
"text":"\" & ' &lt; &gt;"
The Twitter documentation says "escaped and HTML encoded status body". I guess "escaped" means XML-encoding < and >.
But I still don't understand why they're doing it. No web pages are involved in the whole process. The tweet is sent through the REST API url-encoded, and it is retrieved as XML or JSON.
It's double-encoded because the text property is quasi-HTML-encoded text (it looks like they only encode < and >, so that you can't start or end an HTML element in your tweet). Therefore, before the text is XML-encoded for transmission over the wire, you'd have:
xml entity ref test &lt; & '
That string then gets encoded again (so that when it is decoded, it is still the proper HTML-encoded text), which turns it into:
xml entity ref test &amp;lt; &amp; '
which is what you are getting back.
It looks like it's taking the HTML code, and sticking that inside of an XML field, so when you use your XML parser on the XML, you get valid HTML.
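The two encoding layers described above can be reproduced in a few lines. This is a Python sketch for illustration only, not Twitter's actual code: it assumes the first layer escapes only < and >, as observed in this thread, and it uses the standard library's xml.sax.saxutils.escape to stand in for the XML layer.

```python
import html
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

tweet = "xml entity ref test < & '"

# Layer 1 (assumed behaviour): only < and > are HTML-escaped before storage.
stored = tweet.replace("<", "&lt;").replace(">", "&gt;")

# Layer 2: serialising into XML escapes &, < and > once more.
wire = "<text>" + escape(stored) + "</text>"

# One XML parse removes layer 2 and yields the HTML-encoded text.
parsed = ET.fromstring(wire).text

# A second, HTML-level decode recovers what the user typed.
original = html.unescape(parsed)

print(wire)
print(parsed)
print(original)
```

A client that wants the literal tweet text therefore needs one HTML-level decode after the XML parser has done its work.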
Currently we are using SendGrid Inbound Parse to receive emails.
We handle the Inbound Parse webhook request with an Azure HttpTrigger function implemented in C# (.NET 6).
When the received email is in UTF-8 encoding, everything's okay.
However, when we tried to receive an email in shift_jis encoding, the headers were okay,
but the Japanese characters in text and html were garbled.
From Inbound Parse request, we got the charsets as below:
subject: UTF-8
to: UTF-8
from: UTF-8
cc: UTF-8
html: shift_jis
text: shift_jis
And the string we got directly from request.form["text"] (or "html") was already garbled like "�e�L�X�gshiftJis-007"
(should be "テキストshiftJis-007"), so we cannot use string in request directly.
Then we tried to convert it (with the System.Text.Encoding.Convert method) from the charset encoding (shift_jis) to UTF-8,
and the result was different from the original string but still unreadable: "?e?L?X?gshiftJis-007".
Our questions are:
When using C# HttpTrigger Azure function to handle Inbound Parse webhook request
(request data is passed through AspNetCore.)
What encoding is used for the html/text strings in an Inbound Parse webhook request
when the email is sent in an encoding other than UTF-8?
How can we read text and html in shift_jis encoding (or any other non-UTF-8 encoding)
correctly from an Inbound Parse webhook request?
Twilio Developer Evangelist here. I would recommend reaching out to the support team, because the payload needs to be investigated to figure out what is going on.
I also tried to replicate the issue on my end using the send_raw option. Here's the payload, and it does contain shift_jis characters. You may be able to process the payload manually.
(stripped X-Mailer info)
'Content-Type: text/plain; charset="shift_jis"\n' +
'X-Mailer: \n' +
'Content-Transfer-Encoding: quoted-printable\n' +
'\n' +
'\n' +
'=83e=83L=83X=83gshiftJis-007\n'
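As a rough illustration of processing the raw payload manually, the body line above can be decoded in two steps: undo the quoted-printable transfer encoding, then decode the resulting bytes with the declared charset. The sketch below is in Python rather than C#, purely to show the steps. It also shows that reading the same bytes as UTF-8 reproduces the garbled string from the question.

```python
import quopri

# Body line copied from the raw payload shown above.
raw_line = b"=83e=83L=83X=83gshiftJis-007"

# Step 1: undo the quoted-printable transfer encoding to get the raw bytes.
raw_bytes = quopri.decodestring(raw_line)

# Step 2: decode with the charset declared in the Content-Type header.
text = raw_bytes.decode("shift_jis")

# Decoding the same bytes as UTF-8 instead replaces each invalid byte
# with U+FFFD, which is the garbling seen in the question.
garbled = raw_bytes.decode("utf-8", errors="replace")

print(text)     # テキストshiftJis-007
print(garbled)
```

In .NET 6 the same two steps apply, with one caveat: shift_jis is not among the encodings available by default, so Encoding.GetEncoding("shift_jis") only works after Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) has been called.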
I want my app to respond with a UTF-8 or ISO-8859-1 encoded body,
depending on whether the request has Accept-Charset="utf-8" or Accept-Charset="iso-8859-1".
The response body is always JSON.
In my controller, when I do this
render(json: data, status: :created)
it responds with Content-Type="application/json; charset=utf-8", as expected.
But how do I make the response body ISO-8859-1 encoded when the request has Accept-Charset="iso-8859-1"?
To do this, you can use the force_encoding and encoding methods, for example:
data = {'name'=>'raghav'}.to_json
data.encoding # => #<Encoding:UTF-8>, the encoding the string is currently tagged with
new_data = data.force_encoding('ISO-8859-1') # relabels the same bytes as ISO-8859-1
new_data.encoding # => #<Encoding:ISO-8859-1>
To do this only for specific requests, you can read the request.headers hash to determine the requested encoding.
There is also another method called encode. The main difference is that force_encoding changes how the existing bytes are interpreted without touching them, whereas encode rewrites the bytes in the new encoding while keeping the characters the same (if possible). For data containing non-ASCII characters, encode is the one that actually produces an ISO-8859-1 body.
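The difference between reinterpreting and transcoding is easier to see with a non-ASCII character. Python has no force_encoding, so the sketch below is only an analogy, not the Ruby API: decoding the same bytes with another charset plays the role of force_encoding, and encoding the same text into another charset plays the role of encode. The sample name "José" is made up for the example.

```python
import json

# A JSON body with one non-ASCII character (hypothetical sample data).
body = json.dumps({"name": "José"}, ensure_ascii=False)
utf8_bytes = body.encode("utf-8")

# Reinterpreting (analogous to force_encoding): the bytes stay the same,
# but reading them as ISO-8859-1 turns the two-byte "é" into two characters.
relabelled = utf8_bytes.decode("iso-8859-1")

# Transcoding (analogous to encode): the characters stay the same,
# but "é" becomes the single byte 0xE9.
latin1_bytes = body.encode("iso-8859-1")

print(relabelled)
print(latin1_bytes)
```

For pure ASCII data such as the 'raghav' example the two approaches give identical bytes, which is why the difference is easy to miss.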
This is kind of related to this post. I am trying to post some form data using TIdHTTP and TIdMultiPartFormDataStream, but when monitoring the communication with Wireshark, I see that each form field gets a Content-Type: text/plain header attached to it, and for some reason the server that I am sending this data to does not like it. Is there a way to make sure that only the name and value are sent?
The Content-Transfer-Encoding header was also being added, and I was able to remove that using:
aFieldItem := PostStream.AddFormField(fName, fValue);
aFieldItem.ContentTransfer := '';
but I cannot find any way to get rid of the content type.
At this moment the data that is being sent looks like this (in Wireshark):
Boundary: \r\n----------051715151353026\r\n
Encapsulated multipart part: (text/plain)
Content-Disposition: form-data; name="description"\r\n
Content-Type: text/plain\r\n
Line-based text data: text/plain
\r\n
Testing new AW Mobile
and I want it to look like:
Boundary: \r\n------WebKitFormBoundary32hCBG8zkGMBpxqL\r\n
Encapsulated multipart part:
Content-Disposition: form-data; name="description"\r\n
Data (21 bytes)
Data: 0d0a5465737420616e6420747261636520636f6d6d
Length: 21
Thank you
Sam
HTML5 Section 4.10.22.7 alters how RFC 2388 applies to webform submissions:
The parts of the generated multipart/form-data resource that correspond to non-file fields must not have a Content-Type header specified. Their names and values must be encoded using the character encoding selected above (field names in particular do not get converted to a 7-bit safe encoding as suggested in RFC 2388).
This is different from RFC 2388:
As with all multipart MIME types, each part has an optional "Content-Type", which defaults to text/plain.
Your server is clearly expecting the HTML5 behavior.
The Content-Type header on each MIME part added to TIdMultipartFormDataStream can be omitted by setting the TIdFormDataField.ContentType property to a space character (not a blank string, like the ContentTransfer property allows):
aFieldItem := PostStream.AddFormField(fName, fValue);
aFieldItem.ContentTransfer := '';
aFieldItem.ContentType := ' '; // <-- here
If you set the ContentType property to a blank string, it will set the Content-Type header to application/octet-stream, but assigning a space character instead has a side effect of omitting the header when the property setter parses the new value.
That being said, I have already made some changes to TIdMultipartFormDataStream to account for this change in webform submission in HTML5, but I have not finalized and released it yet.
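To make the HTML5 layout concrete, here is a minimal Python sketch that assembles a multipart/form-data body whose text fields carry only a Content-Disposition header. It is not Indy code, and build_form_data and the boundary value are hypothetical names chosen for the example.

```python
def build_form_data(fields, boundary):
    """Build an HTML5-style multipart body: text parts have no Content-Type."""
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}")
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")  # blank line ends the part headers
        lines.append(value)
    lines.append(f"--{boundary}--")  # closing boundary
    lines.append("")
    return "\r\n".join(lines).encode("utf-8")


body = build_form_data({"description": "Testing new AW Mobile"}, "boundary123")
print(body.decode("utf-8"))
```

The sketch does no escaping of field names or values, so it is only suitable for comparing the wire format against a Wireshark capture.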
I'm trying to get some HTML code from a POST request with the Symfony 2 API.
For example, when I post something like "< p > hello < / p > ",
my request action handler (using $request->request->get(X)) receives an escaped string: "p hello /p".
Is there any way to get the raw data in the action handler?
$request->request->get(X) doesn't escape values. To view the posted data, use the Profiler. Your data is being escaped by something else.
My app sends out an email with a URL in it. The URL contains a query string attribute that is encrypted. I CGI-escaped the encrypted value so that symbols like + * . etc. are escaped. The escaped URL appears in the email as expected, but when we click on the link, the escaped values are decoded.
For example, the URL in the email is as follows:
http://development.com/activate/snJAmJxkMo3WZ1sG27Aq?album_id=2&email=5M%2BjE1G6UB26tw/Ah%2Bzr1%2BJSSxeAoP6j&owner_id=4
email=5M%2BjE1G6UB26tw/Ah%2Bzr1%2BJSSxeAoP6j
When we click on this link, the URL in the browser appears as
http://development.com/activate/snJAmJxkMo3WZ1sG27Aq?album_id=2&email=5M+jE1G6UB26tw/Ah+zr1+JSSxeAoP6j&owner_id=4
email=5M+jE1G6UB26tw/Ah+zr1+JSSxeAoP6j
The + is then treated as a space when the query string is parsed. As a result
params[:email] = 5M jE1G6UB26tw/Ah zr1 JSSxeAoP6j
which gives me a 404.
Is there any way I can avoid this situation? How can I make the URL also appear as
http://development.com/activate/snJAmJxkMo3WZ1sG27Aq?album_id=2&email=5M%2BjE1G6UB26tw/Ah%2Bzr1%2BJSSxeAoP6j&owner_id=4
in the browser?
In order to avoid this situation, I hex-encoded the email attribute so that it contains only letters and digits. These are the methods I used to hex-encode and decode.
convert string2hex:
# Encodes each byte of the string as two hex digits (zero-padded).
def hexdigest_to_string(string)
  string.unpack('C*').collect { |x| x.to_s(16).rjust(2, '0') }.join
end
convert hex2string:
# Splits the hex string into pairs of digits and turns each pair back into a byte.
def hexdigest_to_digest(hex)
  hex.unpack('a2' * (hex.size / 2)).collect { |i| i.hex.chr }.join
end
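The behaviour described in this thread, and the hex workaround, can be checked with a short Python sketch. It uses only the standard library and the sample value from the question, and is meant as an illustration rather than the Rails code path.

```python
from urllib.parse import parse_qs, quote

token = "5M+jE1G6UB26tw/Ah+zr1+JSSxeAoP6j"

# A literal "+" in a query string is decoded as a space.
broken = parse_qs("email=" + token)["email"][0]

# Percent-encoded, the "+" survives the round trip.
escaped = quote(token, safe="")
intact = parse_qs("email=" + escaped)["email"][0]

# Hex encoding sidesteps the problem: the value contains only 0-9 and a-f,
# so no layer has anything to escape or unescape.
hex_token = token.encode("ascii").hex()
restored = bytes.fromhex(hex_token).decode("ascii")

print(broken)
print(escaped)
print(hex_token)
```

The hex form is twice as long as the original value, which is the price paid for not depending on how each mail client or browser treats percent-escapes.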