Azure Data Factory encoding issue for characters with acute accents, for example: ú, á, é

Azure Data Factory is not encoding special characters properly.
For example, a CSV file contains the word sún, which comes out as sÃºn after the data flow transformation writes it to the blob storage container.
My container holds files with several different encodings (UTF-8, ANSI, etc.), and the data flow picks all of them up to apply the transformation.
If I set the encoding of the DelimitedText dataset to WINDOWS-1252, the ANSI-encoded CSV files come out fine, but if a file is UTF-8 I have to set it to UTF-8 before the data flow produces the correct output for these special characters.
Screenshots of the dataset settings and of the CSV file data are omitted here.
Is there a generic way to produce the correct output for such characters regardless of the file's encoding?

If I understand you correctly: Data Factory must be given one encoding type up front to read a file. If your files use several encodings and you want to preserve the data across all of them, the limitation comes from the encoding itself, not from Data Factory. If the output encoding cannot represent the data, the characters are converted to something else. Data Factory only offers this fixed list of encoding types for reading and writing data.
Data Factory cannot detect the encoding of a file, not even with the Get Metadata activity. You may be able to do that at the code level, for example in an Azure Function or a notebook; that is the only way.
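For illustration, here is a minimal sketch (not a Data Factory feature) of what that code-level detection could look like in Python, assuming the third-party chardet package is installed; the fallback encoding is only an assumption:

import chardet  # third-party: pip install chardet

def detect_encoding(path, sample_size=64 * 1024):
    # Guess the file's encoding from its first bytes.
    with open(path, "rb") as f:
        raw = f.read(sample_size)
    result = chardet.detect(raw)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
    return result["encoding"] or "windows-1252"  # assumed fallback when detection fails

def read_as_text(path):
    # Read the file with its detected encoding; the caller can then re-write it as UTF-8.
    with open(path, "r", encoding=detect_encoding(path)) as f:
        return f.read()

Running something like this in a notebook or an Azure Function before the data flow, and re-saving every file as UTF-8, would let the dataset keep a single encoding setting.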
HTH.

Related

Multiple encryption append to a file

I have a log of a program's state. This log can be saved to a file manually or at a time interval for persistent storage. Before saving it to the file, it is encrypted with RNCryptor.
My current appending (saving) flow:
Read the file
Decrypt the information from the read string
Concatenate the decrypted string with the new string
Encrypt the concatenated string
Write it to the file
What I imagine instead:
Encrypt the new string
Append it to the file
When I read this back, I will have to build a string from all the encrypted blocks. But I don't know how to decrypt a file with multiple encrypted blocks in it, i.e. how to tell where one block ends and the next begins.
Also, is this the best choice for performance? The text in the file could reach 100 MB at most (though it will probably never get that big).
Is using Core Data viable? Each append as a different record, or something similar. And Core Data can be encrypted, so there would be no need for RNCryptor.
I would appreciate code in Objective-C, if any.
There are many things you can do:
The easiest option is to encode the ciphertexts as text (e.g. with Base64) and write each encoded ciphertext on a new line. You need the encoding because the ciphertext itself might contain bytes that could be interpreted as newline control characters, which cannot happen with a text encoding. The drawback is that it inflates the logs unnecessarily (by about 33% if Base64 is used).
You can prepend each unencoded ciphertext with its length (e.g. as a big-endian int32) and write both as-is to a file opened in binary mode. If you read the file from the beginning, you can tell the ciphertexts apart, because you always know how long the following ciphertext is and where the next encoded length starts. The overhead is only the encoded length in front of each ciphertext (see the sketch after this list).
Use a binary delimiter such as 0x0101 between ciphertexts. Such a delimiter might still appear inside a ciphertext, so you need to escape it wherever it occurs, which is a little tricky to get right.
If the amount of log data is small (a few MB), you can find a library to append to a ZIP file.
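Here is a minimal sketch of the second option (length-prefixed ciphertexts), written in Python for brevity rather than the Objective-C asked for in the question; the framing logic carries over directly:

import struct

def append_ciphertext(path, ciphertext):
    # Append one ciphertext, prefixed with its length as a big-endian int32.
    with open(path, "ab") as f:
        f.write(struct.pack(">I", len(ciphertext)))
        f.write(ciphertext)

def read_ciphertexts(path):
    # Read back every length-prefixed ciphertext, in order.
    blocks = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break  # end of file (or a truncated header)
            (length,) = struct.unpack(">I", header)
            blocks.append(f.read(length))
    return blocks

Each block can then be decrypted individually (with RNCryptor in your case) and the plaintexts concatenated in order.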
You can use an array to store the information and then read and write that array to a file; an example can be found here, and a rough sketch follows after the steps below.
Steps:
Read the array from the file.
Add the new encrypted string to the array.
Write the array back to the file.
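As an illustration only, here is a rough Python equivalent of that read-append-write cycle (the original suggestion is about an Objective-C array; the JSON/Base64 representation here is just an assumption to keep the sketch self-contained):

import base64
import json
import os

def append_record(path, ciphertext):
    # Load the stored list, append the new encrypted block, write the list back.
    records = []
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            records = json.load(f)
    records.append(base64.b64encode(ciphertext).decode("ascii"))
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)

Note that this rewrites the whole file on every append, so it avoids the delimiter problem at the cost of extra I/O as the log grows.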

file encoding on a mac, charset=binary

I typed in
file -I *
to look at the encodings of all the CSV files in an entire directory. A lot of the files are reported as charset=binary. I'm not too familiar with this encoding format.
Does anyone know how to handle this encoding?
Thanks a lot for your time.
"Binary" encoding pretty much means that the encoding is unknown.
Everything is binary data under the hood. In a text file, each byte or sequence of bytes represents a specific character, and which character in particular depends on the encoding the file was written with and the encoding you interpret it with. Some encodings are unambiguously recognisable; others aren't (for example, any file is valid in any single-byte encoding, so you can't easily distinguish one single-byte encoding from another). What file is telling you with charset=binary is that it has no more specific information than that the file contains bits and bytes (Capt'n Obvious to the rescue). It's up to you to interpret the file in the correct encoding, or as the correct file format.
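If you want to make that interpretation concrete, here is a small sketch in Python that tries a few likely encodings and reports the first one that decodes the whole file cleanly; the candidate list is just an assumption for CSV data and should be adjusted to your sources:

CANDIDATES = ["utf-8", "utf-16", "windows-1252"]  # note: latin-1 would accept any byte sequence

def guess_encoding(path):
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in CANDIDATES:
        try:
            raw.decode(encoding)
            return encoding  # first candidate that decodes without errors
        except UnicodeDecodeError:
            continue
    return None  # nothing fit: treat the file as binary

A cleanly decoding candidate is not proof of the original encoding (several encodings can accept the same bytes), but it narrows down what the file can sensibly be read as.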

Rails oracle raw16

I'm using Rails 3.2.1 and I have been stuck on a problem for quite a while.
I'm using the Oracle enhanced adapter and I have a RAW(16) (UUID) column; when I try to display the data, one of two things happens:
1) I see weird symbols
2) I get "incompatible character encodings: ASCII-8BIT and UTF-8".
In my application.rb file I added the
config.encoding = 'utf-8'
and in my view file I added
# encoding: utf-8
But so far nothing has worked.
I also tried adding html_safe, but it failed.
How can I safely display my UUID data?
Thank you very much
Answer: I used the unpack method to convert the binary with the parameters H8H4H4H4H12 and joined the resulting array at the end. :-)
The RAW datatype is a string of bytes that can take any value, including binary data that doesn't translate to anything meaningful in ASCII, UTF-8, or any other character set.
You should really read Joel Spolsky's note about character sets and Unicode before continuing.
Now, since the data can't be reliably translated into a string, how can we display it? Usually we convert or encode it, for instance:
we could use the hexadecimal representation, where each byte is converted to two [0-9A-F] characters (in Oracle, with the RAWTOHEX function). This is fine for displaying a small binary field such as RAW(16); a short sketch follows below.
you could also use other encodings such as Base64, in Oracle with the UTL_ENCODE package.
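As an illustration of the hexadecimal route on the application side, here is a short Python sketch (the question itself is Ruby on Rails; this only shows the same idea) that formats 16 raw bytes in the familiar 8-4-4-4-12 UUID layout, mirroring the H8H4H4H4H12 unpack mentioned above:

import uuid

def raw16_to_uuid_string(raw):
    # raw must be exactly 16 bytes, e.g. the value read from the RAW(16) column.
    return str(uuid.UUID(bytes=raw))  # e.g. '0f8fad5b-d9cb-469f-a165-70867728950e'

def raw16_to_uuid_string_manual(raw):
    # The same result without the uuid module: hex-encode and insert the dashes.
    h = raw.hex()
    return "-".join([h[0:8], h[8:12], h[12:16], h[16:20], h[20:32]])

Either form yields plain ASCII text, so it can be displayed in a UTF-8 view without any encoding clash.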

What is the usefulness of mb_http_output() given that the output encoding is typically fixed by other means?

All over the Internet, including on Stack Overflow, it is suggested to use mb_http_input('utf-8') to have PHP work in the UTF-8 encoding; for example, see "PHP/MySQL encoding problems. � instead of certain characters". On the other hand, the PHP manual says that we cannot fix the input encoding within the PHP script and that mb_http_input is only a way to query what it is, not a way to set it. See http://www.php.net/manual/en/mbstring.http.php and http://php.net/manual/en/function.mb-http-input.php. That was just a clarification of the context before the question.

It seems to me that there are a lot of redundant commands in Apache + PHP + HTML to control the conversion from the input encoding to the internal encoding and finally to the output encoding, and I don't understand the usefulness of this. For example, if the original input encoding from some external HTTP client is EUC-JP and I set the internal encoding to UTF-8, then PHP would have to make the conversion. Am I right? If so, why would I set an input encoding in php.ini (instead of just passing the original one through) given that it would immediately be converted to the UTF-8 internal encoding anyway?

A similar question holds for the output. In all my HTML files I use a meta tag with charset=utf-8, so the output encoding is fixed. Moreover, in php.ini I can set the default_charset that will appear in the HTTP header to utf-8. Why would I bother to use mb_http_output('utf-8') when the final output encoding is already fixed?

To sum up, can someone give me a practical, concrete example where mb_http_output('utf-8') is clearly necessary and cannot be replaced by the more usual settings that editors such as Dreamweaver often insert by default?
These two options are just about the worst idea the PHP designers ever had, and they had plenty of bad ideas when it comes to encodings.
To convert strings to a specific encoding, one has to know what encoding one is converting from. Incoming data often arrives in an undeclared encoding; the server just receives some binary data and doesn't know what encoding it represents. You should declare which encoding you expect the browser to send by setting the accept-charset attribute on forms; doing so is no guarantee that the browser will comply, and it doesn't tell PHP what encoding to expect either.
The same goes for output; PHP strings are just byte arrays, they do not have an associated encoding. I have no idea how PHP thinks it knows how to convert arbitrary strings to a specific encoding upon input or output.
You should handle this manually, and it's really easy to do: declare to clients what encoding you expect, check whether input is in the correct encoding using mb_check_encoding (not mb_detect_encoding or some such, just check), reject invalid input, and take care to keep everything in the same encoding throughout the whole application flow. I.e., ideally you have no conversion whatsoever in your app.
If you do need to convert at any point, make it a Unicode sandwich: convert input from the expected encoding to UTF-8 (or another Unicode encoding) on input, and convert it back to the desired output encoding on output. Whenever you need to convert, make sure you know what you're converting from. You cannot magically "make all strings UTF-8" with one declaration.
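To make the "check, reject, keep one encoding" advice concrete, here is a tiny sketch of the same idea expressed in Python rather than PHP (in PHP you would use mb_check_encoding); the function name is made up for illustration:

def accept_utf8(raw):
    # raw is the request body as bytes; accept it only if it is valid UTF-8.
    try:
        return raw.decode("utf-8")  # everything downstream then stays UTF-8
    except UnicodeDecodeError:
        raise ValueError("request body is not valid UTF-8; rejecting input")

The point is the shape of the check: validate once at the boundary, refuse anything invalid, and never guess or silently convert in the middle of the application.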

How to read unicode characters accurately

I have a text file containing what I am told are unicode characters, for example:
\320\222\320\21015-25'ish per main or \320\222\320\21020-40'ish per starter
Which should read:
£15-25'ish per main or £20-40'ish per main starter
However, when viewing this text in Firefox, the output is mangled with various unwanted characters.
So, are these really Unicode characters? And if so, how can I convert them to a form which displays correctly?
You need to:
know the encoding of the text file
read the data without losing information (either by reading it as binary or by reading it as text with the right encoding)
write the data with the right encoding (either by writing it out in binary and specifying the original encoding, or writing it out as text in an encoding which you also specify in the headers)
Try to separate out the problem into "reading" and/or "writing". Do you know the encoding of the file? What do you have to do with the file? When you've written it with backslashes, is that actually what's in the file (i.e. an escaped form) or is it actually just a "normal" text encoding such as UTF-8?
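If those backslash sequences are literally in the file (i.e. the text contains the six characters "\320\222" rather than two raw bytes), then the first step is to turn the escapes back into bytes before worrying about the display encoding. A small sketch in Python, with UTF-8 used only as a guess for the final decode:

import re

def unescape_octal(text):
    # Replace each \NNN octal escape with the single byte it stands for.
    return re.sub(
        rb"\\([0-7]{3})",
        lambda m: bytes([int(m.group(1), 8)]),
        text.encode("ascii"),
    )

raw = unescape_octal(r"\320\222\320\21015-25'ish per main")
print(raw.decode("utf-8", errors="replace"))  # inspect what the bytes actually decode to

If the result still looks wrong, the bytes were probably produced with a different (or doubly applied) encoding, which brings you back to the questions above: what wrote the file, and with what encoding?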
