Best practices for creating a CSV file? - ios

I am working in Swift although perhaps the language is not as relevant, and I am creating a relatively simple CSV file.
I wanted to ask for some recommendations in creating the files, in particular:
Should I wrap each column/value in single or double quotes? Or nothing? I understand if I use quotes I'll need to escape them appropriately in case the text in my file legitimately has those values. Same for \r\n
Is it ok to end each line with \r\n ? Anything specific to Mac vs. Windows I need to think about?
What encoding should I use? I'd like to make sure my csv file can be read by most readers (so on mobile devices, mac, windows, etc.)
Any other recommendations / tips to make sure the quality of my CSV is ideal for most readers?

I have a couple of apps that create CSV files.
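Any column value that contains a newline, the field separator, or the quote character itself must be enclosed in quotes (double quotes are common, single quotes less so). A quote character inside a quoted value is normally escaped by doubling it.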
I end lines with just \n.
You may wish to give the user some options when creating the CSV file. Let them choose the field separator. While the comma is common, a tab is also common. You can also use a semi-colon, space, or other characters. Just be sure to properly quote values that contain the chosen field separator.
UTF-8 is arguably the best choice for encoding the file. It lets you support all Unicode characters, and just about any tool that supports CSV can handle UTF-8. It also avoids any issues with platform-specific encodings. But again, depending on the needs of your users, you may wish to give them the choice of encoding.
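As a rough illustration of the quoting rule above, here is a minimal Python sketch (the same logic ports directly to Swift). The helper names quote_field and make_csv are made up for this example, and doubling embedded quotes is the common convention rather than something every reader is guaranteed to follow.

```python
import csv
import io


def quote_field(value: str, delimiter: str = ",") -> str:
    """Quote a field only when it needs it, doubling any embedded quotes."""
    needs_quotes = any(ch in value for ch in (delimiter, '"', "\n", "\r"))
    if not needs_quotes:
        return value
    return '"' + value.replace('"', '""') + '"'


def make_csv(rows, delimiter: str = ",", line_end: str = "\n") -> str:
    """Join rows into CSV text using the chosen separator and line ending."""
    return "".join(
        delimiter.join(quote_field(str(v), delimiter) for v in row) + line_end
        for row in rows
    )


rows = [
    ["name", "note"],
    ["Ann", 'said "hi", then left'],
    ["Bob", "line one\nline two"],
]
text = make_csv(rows)
data = text.encode("utf-8")  # the bytes you would write to the file

# Sanity check: a standard CSV reader gets the original values back.
parsed = list(csv.reader(io.StringIO(text, newline="")))
```

Passing delimiter="\t" or delimiter=";" covers the case where the user picks a different field separator; values containing that separator are then the ones that get quoted.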

Related

Eggplant : How to read text with special characters like ' _ etc

I am trying to read a text in a given rectangle using readText() function.
The function works correctly except when it has to read some text which has special characters like ' _ & etc.
I tried using validCharacters with readText() function. But it didn't help.
Code -
put ReadText((287,125,810,164),validCharacters:"_-'.ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567890") into Login
I tried working with character collections, but that doesn't seem to be the right approach, because the text I am trying to pick up is a dynamic combination of letters, numbers and a special character. One cannot create a character collection library covering every letter (a-z, A-Z), number (0-9) and special character.
Example of the text I am trying to read:
Login_Userid1_1, Login'Userid1_1
So how do I read such text correctly?
Debugging OCR is a bit of an imprecise science. EggPlant has a lot of OCR parameters to tweak. When designing test cases it's best to try to use other mechanisms to gather information whenever possible. ReadText() should be considered a last resort when more reliable methods are unavailable. When I've used it I've often needed a lot of trial and error to find the right set of settings and the right SearchRectangle to get consistent results. Without seeing exactly what images you are trying to read text from, it's difficult to impossible to troubleshoot where the issue might be.
One thing that does stand out to me is that you're trying to read strings that may contain underscores. ReadText() has an optional property IgnoreUnderscores which treats underscores as spaces. By default this property is set to ON. It defaults to ON because some OCR engines have problems identifying underscore characters consistently.
If you want to have ReadText() handle underscores you'll want to explicitly set this property to OFF.
ReadText(rect, validCharacters:chars, ignoreUnderscores:OFF)

Loading data in hive table with multiple charsets

I am facing an issue where I have multiple files with different charsets: say one file has Chinese characters and another has French characters. How can I load them into a single Hive table? I searched online and found this:
ALTER TABLE mytable SET SERDEPROPERTIES ('serialization.encoding'='SJIS');
With this I can handle the charset for one of the files, either Chinese or French. Is there a way to handle both charsets at once?
[UPDATE]
Okay, I am using RegexSerDe for a fixed-width file, and the encoding scheme being used is ISO 8859-1. It seems RegexSerDe is not taking this encoding into account and is splitting the characters as if they were in the default UTF-8 encoding. Is there a way to make RegexSerDe take the encoding into account?
I am not sure if this is possible (I think it isn't, based on https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/AbstractEncodingAwareSerDe.java). A workaround could be to create two tables with different encodings and create a view on top of them.

Is it possible to have a special character in a string that comes from a YAML file?

I'm working on a translation project and I'm moving all of the English strings out of the views and into a YAML file. Some of the very well-written strings employ special characters such as ampersands and en dashes.
Is there any way to include those?
In the meantime I've turned "&" to "and" and "–" to "--"
but, at least in the English version, I feel like the copy starts to lose its flavor. I doubt the Chinese version will miss these, but maybe they will want different special characters that I don't know about.
You can have special characters in YAML file values as long as they're not at the beginning of a string.
Take & as an example: if it is the first character of an unquoted string, the YAML parser will treat it as the start of an anchor when it reads the string. If it appears anywhere else, as in key: this & that, it is read as part of the string, as you would expect.
For more information about what you can and can't have in your YAML strings (and what are considered special characters), see:
YAML Ruby Cookbook
The question and accepted answer for Do I need quotes for strings in Yaml?
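A small YAML fragment may make the distinction concrete. The key names here are invented for illustration, and wrapping a value in quotes is the usual way to keep a leading special character from being interpreted by the parser.

```yaml
en:
  plain: this & that        # & in the middle of a value: read as ordinary text
  quoted: "& more"          # quoted, so the leading & is not treated as an anchor
  hours: Monday – Friday    # an en dash is an ordinary character in a UTF-8 file
```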

Is there known URI scheme or URN namespace for Unicode characters?

I need to refer to a Unicode character with a URI. The following IANA references list multiple schemes and namespaces but do not mention anything about identifiers for Unicode characters. Does anyone know if something like this exists already?
http://www.iana.org/assignments/uri-schemes.html
http://www.iana.org/assignments/urn-namespaces/urn-namespaces.xml
I hoped to find something like
unicode://U+0394
urn:unicode://0394
http://unicode.org/unicode/0394
for the greek capital letter delta Δ.
If someone wonders, this is for a semantic web like application that uses URIs as identifiers for concepts, including concepts of the Unicode characters.
I’m afraid there is no URL or URN for referring to authoritative information on a Unicode character in general. In the Unicode Standard, information about individual characters is partly in the so-called character database (mostly plain text files in specific formats), partly in the Code Charts (PDF files). Neither of them offers a way to point at an individual character. Moreover, the information there is not exhaustive: there are important remarks on individual characters scattered around the standard.
The Decodeunicode site has individually addressable items, such as
http://www.decodeunicode.org/en/u+0394
but its information content varies a lot and is generally very limited. It is not official, and it currently contains Unicode 5.0 only.
The Fileformat.info site is much more systematic, but it, too, is unofficial. It is basically limited to formal properties and data derivable from them, plus comments extracted from the Code Charts, plus instructions on typing the character in Windows, plus information about support in fonts—but that’s quite a lot! Example:
http://www.fileformat.info/info/unicode/char/0394/
[EDIT]: found this URL matching your needs: http://unicode.org/cldr/utility/character.jsp?a=1F40F
Well, there is a URL referencing the authoritative information in the Unicode database, even though it does not describe (as said in the other answer) all the information on one specific character.
You have the following URL, pointing to the latest Unicode database. This is a simple list of existing valid Unicode characters. Some upcoming characters are missing (㋿), and you should expect it to be mutable.
https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
The contents look like the following, which isn't so practical to use as-is.
$ grep -ai kangaroo UnicodeData.txt -C 7
1F991;SQUID;So;0;ON;;;;;N;;;;;
1F992;GIRAFFE FACE;So;0;ON;;;;;N;;;;;
1F993;ZEBRA FACE;So;0;ON;;;;;N;;;;;
1F994;HEDGEHOG;So;0;ON;;;;;N;;;;;
1F995;SAUROPOD;So;0;ON;;;;;N;;;;;
1F996;T-REX;So;0;ON;;;;;N;;;;;
1F997;CRICKET;So;0;ON;;;;;N;;;;;
1F998;KANGAROO;So;0;ON;;;;;N;;;;;
1F999;LLAMA;So;0;ON;;;;;N;;;;;
1F99A;PEACOCK;So;0;ON;;;;;N;;;;;
1F99B;HIPPOPOTAMUS;So;0;ON;;;;;N;;;;;
1F99C;PARROT;So;0;ON;;;;;N;;;;;
1F99D;RACCOON;So;0;ON;;;;;N;;;;;
1F99E;LOBSTER;So;0;ON;;;;;N;;;;;
1F99F;MOSQUITO;So;0;ON;;;;;N;;;;;
You could build up a hacky « hash-based » namespace with a suffix like this, but that's definitely non-standard.
https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt#1F998
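For what it's worth, here is a small Python sketch of how an application might read those semicolon-separated lines and mint such a fragment identifier. The function names are hypothetical, only the first three fields (code point, name, general category) are used, and the #1F998 fragment remains a private convention rather than anything the Unicode Consortium defines.

```python
# A few lines copied from the grep output above.
SAMPLE = """\
1F997;CRICKET;So;0;ON;;;;;N;;;;;
1F998;KANGAROO;So;0;ON;;;;;N;;;;;
1F999;LLAMA;So;0;ON;;;;;N;;;;;
"""

BASE = "https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt"


def parse_unicode_data(text: str) -> dict:
    """Map each hex code point to its name and general category."""
    table = {}
    for line in text.splitlines():
        if not line:
            continue
        fields = line.split(";")
        table[fields[0]] = {"name": fields[1], "category": fields[2]}
    return table


def fragment_uri(code_point_hex: str) -> str:
    """Build the non-standard 'hash-based' identifier for a code point."""
    return f"{BASE}#{code_point_hex}"


table = parse_unicode_data(SAMPLE)
print(table["1F998"]["name"], fragment_uri("1F998"))
```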
Since this is also tagged semantic-web, I will suggest URIs that are easily (and permanently) dereferenceable and cannot be mistaken for a document describing that character: those of the data: scheme. Such a URI can refer not only to a single Unicode character but to text in any encoding, including strings of several characters.
data:;charset=utf-8,%CE%94
Attempting to open this URI should result in a text/plain file with the single character as its content.
If the system accepts IRIs (as many semantic web applications do), the character can be included directly:
data:;charset=utf-8,Δ
This is mapped to the same URI as shown above, and your browser may convert it directly. Specifying UTF-8 is necessary in this case, since the mapping is not defined for other encodings.
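The mapping from the IRI form to the percent-encoded URI can be sketched in a few lines of Python using the standard library's urllib.parse. The function name char_data_uri is just a label for this example.

```python
from urllib.parse import quote, unquote


def char_data_uri(text: str) -> str:
    """Percent-encode the UTF-8 bytes of text and wrap them in a data: URI."""
    return "data:;charset=utf-8," + quote(text, safe="")


uri = char_data_uri("\u0394")  # GREEK CAPITAL LETTER DELTA
print(uri)  # data:;charset=utf-8,%CE%94

# Going the other way: take everything after the first comma and decode it.
payload = unquote(uri.split(",", 1)[1])
```

The same function works for strings of several characters, which matches the point above that the data: scheme is not limited to a single character.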

Is it ever appropriate to localize a single ASCII character?

When would it be appropriate to localize a single ASCII character?
For instance / or | ?
Is it ever necessary to add these "strings" to the localization effort?
I just want to give some people the benefit of the doubt and make sure there's not something I didn't think of.
Generally it wouldn't be appropriate to use something like that except as a graphic element (which of course wouldn't be I18N'd in the first place, much less L10N'd). If you are trying to use it to e.g. indicate a ratio then you should have something like "%d / %d" instead, and localize the whole thing.
Yes, there are cases where these individual characters change in localization. This is not a comprehensive list, just examples I happen to know.
Not every locale uses , to separate thousands and . for the decimal. (However, these will usually be handled by your number formatter. If you do so yourself, you're probably doing it wrong. See this MSDN blog post by Michael Kaplan, Number format and currency format are not always the same.)
Not every language uses the same quotation marks (“, ”, ‘ and ’). See Wikipedia on Non-English Uses of Quotation Marks. (Many of these are only easy to replace if you use full quote marks. If you use the " and ' on your keyboard to mark both the start and end of sentences, you won't know which of two symbols to substitute.)
In Spanish, a question or exclamation is preceded by an inverted ? or !. ¿Question? ¡Exclamation! (Obviously, you can't fix this with a locale substitution for a single character. Any questions or exclamations in your application should be entire strings anyway, unless you're writing some stunningly intelligent natural language generator.)
If you do find a circumstance where you need to localize these symbols, be extra cautious not to accidentally localize a symbol like / used as a file separator, " to denote a string literal or ? for a search wildcard.
However, this has already happened with CSV files. These may be separated by a comma, or may be separated by the local list separator. See What would happen if you defined your system's CSV delimiter as being a quotation mark?
In Greek, questions end with a semicolon rather than ?, so essentially the ? is replaced with ; ... however, you should aim to always translate the question as a complete string including question mark anyway.
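The recurring advice in these answers, localize the whole string rather than the single character, can be sketched as follows in Python. The message keys, the localize helper and the translations themselves are illustrative assumptions, not taken from any real application.

```python
# Whole strings are the unit of translation; punctuation travels with them.
MESSAGES = {
    "en": {"ratio": "{done} / {total}", "confirm": "Delete this file?"},
    "es": {"ratio": "{done} / {total}", "confirm": "¿Eliminar este archivo?"},
    "el": {"ratio": "{done} / {total}", "confirm": "Διαγραφή αυτού του αρχείου;"},
}


def localize(locale: str, key: str, **values) -> str:
    """Look up a complete message for the locale and fill in its placeholders."""
    template = MESSAGES[locale][key]
    return template.format(**values)


print(localize("en", "ratio", done=3, total=10))
print(localize("es", "confirm"))
print(localize("el", "confirm"))
```

Because the question marks live inside each translated string, no code ever has to decide whether ? should become ¿...? or ; for a given locale, and a / used as a path separator elsewhere is never touched.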

Resources