I'm trying to read a text file made up of stock symbols and their associated companies into a dictionary. Each line in the text file has one symbol and one company, like:
APPL Apple
GOOG-Q Google
Ultimately, I'm trying to have a searchbar that looks up the corresponding company based on the stock symbol (or vice versa).
So, what is the best way to approach this? Should I read the entire file as a string and then separate the items with [fileString componentsSeparatedByString:@"\n"], or is there a better way?
NSScanner is often a good approach for parsing data.
Pay attention to the end-of-line character: depending on where the file comes from, it can be a linefeed (\n), a carriage return (\r), or both characters (\r\n).
Using NSScanner makes it easy to scan line by line regardless of the actual end-of-line character (see [NSCharacterSet newlineCharacterSet]).
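As a language-neutral illustration of the same idea (written in Python rather than Objective-C, so it does not use NSScanner), here is a minimal sketch that splits on any of the three line endings and builds lookups in both directions. The function name parse_symbols and the MSFT sample line are made up for the example, and skipping malformed lines is an assumption about the desired behaviour.

```python
def parse_symbols(text):
    """Map each stock symbol to its company name, one 'SYMBOL Company' pair per line."""
    companies = {}
    # str.splitlines() treats \n, \r and \r\n all as line boundaries.
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first run of whitespace only, so multi-word
        # company names stay in one piece.
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # assumption: skip lines that have no company name
        symbol, company = parts
        companies[symbol] = company
    return companies


# Sample data mixing all three line endings (the MSFT line is invented).
sample = "APPL Apple\r\nGOOG-Q Google\rMSFT Microsoft Corporation\n"
by_symbol = parse_symbols(sample)
# Reverse lookup for searching by company name instead of symbol.
by_company = {name: symbol for symbol, name in by_symbol.items()}
```

Keeping two dictionaries is one simple way to support a search bar that works in either direction. For prefix or substring search you would still need to iterate over the keys.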
I am working in Swift although perhaps the language is not as relevant, and I am creating a relatively simple CSV file.
I wanted to ask for some recommendations in creating the files, in particular:
Should I wrap each column/value in single or double quotes? Or nothing? I understand if I use quotes I'll need to escape them appropriately in case the text in my file legitimately has those values. Same for \r\n
Is it ok to end each line with \r\n ? Anything specific to Mac vs. Windows I need to think about?
What encoding should I use? I'd like to make sure my CSV file can be read by most readers (on mobile devices, Mac, Windows, etc.)
Any other recommendations / tips to make sure the quality of my CSV is ideal for most readers?
I have a couple of apps that create CSV files.
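Any column value that contains a newline, the field separator, or the quote character itself must be enclosed in quotes (double quotes are common, single quotes less so). A quote character inside a quoted value is usually escaped by doubling it.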
I end lines with just \n.
You may wish to give the user some options when creating the CSV file. Let them choose the field separator. While the comma is common, a tab is also common. You can also use a semi-colon, space, or other characters. Just be sure to properly quote values that contain the chosen field separator.
Using UTF-8 is arguably the best choice for encoding the file. It lets you support all Unicode characters, and just about any tool that supports CSV can handle UTF-8. It avoids any issues with platform-specific encodings. But again, depending on the needs of your users, you may wish to give them the choice of encoding.
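To make the quoting rules concrete, here is a minimal sketch in Python (not Swift) using the standard csv module. The helper name to_csv and the sample rows are made up for illustration. It quotes only the values that need it, ends lines with \n, lets the caller pick the field separator, and encodes the result as UTF-8.

```python
import csv
import io


def to_csv(rows, delimiter=","):
    """Serialize rows to CSV text, quoting only the values that need it."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=delimiter,
        quoting=csv.QUOTE_MINIMAL,  # quote only fields containing special characters
        lineterminator="\n",        # end each line with just \n
    )
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    ["name", "note"],
    ["Smith, John", 'said "hi"'],    # contains the separator and a quote
    ["Lee", "line one\nline two"],   # contains a newline
]
text = to_csv(rows)
data = text.encode("utf-8")  # the bytes you would write to disk
```

With these rows, the second line comes out as "Smith, John","said ""hi""", which is the doubled-quote escaping that most CSV readers expect.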
I need to know how to properly use "OR" when it comes to individual characters and whole phrases... For example I have code that is checking for any number of characters OR words that are found in an array...
I want to check for some unicode characters and also some html lines of code.
I'm currently just checking for the characters using this:
([\u200b\u200c\u200d\0\1\2\3\4\5\6\7]*)
(the backslashes are representing the unicode characters u+200b - u+200d and the special characters in my software \0-\7 (They are all individual characters), these are valid escape sequences in Objective-C.)
Now what if I wanted to check for these characters AND check for phrases like <b> or <font color="#FF0000">
I found stuff while doing research that said to use pipelines | but I'm not sure if I put them only in-between the words or also in-between the individual characters and I'm not sure if I put quotes around the words or what not... I need help before I screw this up badly haha!
(p.s., not sure if it will be any different but I'm also doing it for this:
([^\u200b\u200c\u200d\0\1\2\3\4\5\6\7])
It'd be something like
/([^....]|\<b\/\>|\<font color .... \>)/
though, the usual caveats about regexes and html apply here.
As for the confusion about where to put the |, consider this hackneyed example: you want to find the word color, but also want to accommodate the British spelling, colour:
/(color|colour)/
/(colou?r)/
/(colo(r|ur))/
are all basically equivalent.
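Here is a small sketch that checks this, written with Python's re module rather than NSRegularExpression, so escaping details may differ slightly in Objective-C. It also shows one way to combine a character class with whole phrases: the class covers the individual characters, | separates it from each phrase, and no quotes are needed around the phrases. The \0-\7 characters from the question are left out to keep it short, and the sample strings are made up.

```python
import re

# The three spellings of the pattern from the example above.
patterns = [r"(color|colour)", r"(colou?r)", r"(colo(r|ur))"]
words = ["color", "colour", "colr", "colouur"]

# For each pattern, collect the words it matches in full.
results = {p: [w for w in words if re.fullmatch(p, w)] for p in patterns}

# A character class for the single characters, then | before each phrase.
invisible_or_tag = re.compile(r'[\u200b\u200c\u200d]|<b>|<font color="#FF0000">')

sample = 'a\u200bb<b>c<font color="#FF0000">d'
cleaned = invisible_or_tag.sub("", sample)  # strip every match
```

All three patterns accept exactly the same two spellings. They differ only in what the capture groups contain.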
I'm having trouble parsing utf8 characters into Text when deriving a Read instance. For example, when I run the following in ghci...
> import Data.Text
> data Message = Message Text deriving (Read, Show)
> read ("Message \"→\"") :: Message
Message "\8594"
Can I do anything to keep my text inside Message utf-8 encoded? I.e. The result should be...
Message "→"
(P.S. I already receive my serialized messages as Text, but currently need to unpack to a String in order to call read. I'd love to avoid this...)
EDIT: Ah sorry, answers rightly point out that it's show not read which converts to "\8594" - is there a way to show and convert back to Text again without the backslash encoding?
To the best of my knowledge, the internal encoding used by Text (UTF-16 in versions of the text package before 2.0, UTF-8 since then) is consistent and not exposed directly. If you want UTF-8, you can encode a Text value to a ByteString, or decode one back, as appropriate. Similarly, it doesn't make sense to talk about an encoding for String, because that's just a list of Char, where each Char is a Unicode code point.
Most likely, it's only the Show instance for Text displaying things differently here.
Also, keep in mind that (by consistent convention in standard libraries) read and show are expected to behave as (de-)serialization functions, with a "serialized" format that, interpreted as a Haskell expression, describes a value equivalent to the one being (de-)serialized. As such, backslash escaping into ASCII text is often preferred for being widely supported and unambiguous. If you want to display a Text value with the actual code points, show isn't what you want.
I'm not entirely clear on what you want to do with the Text, since using show directly is exactly what you're trying to avoid. If you want to display text in a terminal window, the terminal dictates the encoding, and you want the functions defined in Data.Text.IO. If you need to convert to a specific encoding for whatever other reason, Data.Text.Encoding will give you an encoded ByteString (emphasis on "byte", not "string": a ByteString is a sequence of raw bytes, not a string of characters).
If you just want to convert from Text to String and back to Text... what's wrong with the backslash escaping? show is not really intended for pretty-printing output for users to read, despite many people's initial expectations otherwise.
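The distinction between an escaped representation and an encoding is not specific to Haskell. As a rough analogue in Python (not the Haskell API), ascii() plays the role of show and ast.literal_eval plays the role of read. The escaped form round-trips to the same value, while encoding to UTF-8 is a separate step that produces bytes.

```python
import ast

arrow = "→"
code_point = ord(arrow)  # 8594, the number that appears in Haskell's "\8594"

# ascii() produces an ASCII-only, source-code-style representation
# in which non-ASCII characters are escaped.
shown = ascii(arrow)

# ast.literal_eval parses that representation back into the original
# value, so nothing is lost in the round trip.
restored = ast.literal_eval(shown)

# Encoding is a different operation: it turns characters into raw bytes.
utf8_bytes = arrow.encode("utf-8")
```

The escaped text is only how the value is written down. The value itself still holds the single character →.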
My requirement is to write binary records to a file. The binary records can be thought of as raw bytes in memory. I need a way to delimit each record so that I can do something similar to a binary search on the file: for example, start in the middle of the file, find the next record delimiter, and start the search from there.
My question is: can an ASCII string such as "START-RECORD" be used to delimit the binary records?
START-RECORD, data-length, .......binary data...........START-RECORD, data-length, .......binary data...........
When starting from an arbitrary position within the file, I can simply search for the ASCII string "START-RECORD". Is this approach feasible?
Not in a single pass, and it makes no difference whether you read in binary mode or not. If you insert a string or some other pattern as a delimiter, you'd need to search for its binary representation while reading the file.
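Here is a minimal Python sketch of searching for the marker's bytes from an arbitrary offset. The 4-byte big-endian length field and the helper names pack_record and next_record are assumptions for the example, since the question does not specify a format. Because a payload may itself contain the marker bytes, the sketch uses the stored length to check that each hit ends at another marker or at the end of the data. It works on bytes already in memory; for a large file you would apply the same logic to a memory-mapped file or to chunks.

```python
import struct

MARKER = b"START-RECORD"
LENGTH = struct.Struct(">I")  # assumption: 4-byte big-endian length after the marker


def pack_record(payload):
    """Frame one payload as marker + length + raw bytes."""
    return MARKER + LENGTH.pack(len(payload)) + payload


def next_record(data, start):
    """Return (offset, payload) of the first plausible record at or after start, or None."""
    pos = data.find(MARKER, start)
    while pos != -1:
        header_end = pos + len(MARKER) + LENGTH.size
        if header_end <= len(data):
            (size,) = LENGTH.unpack_from(data, pos + len(MARKER))
            end = header_end + size
            # Accept the hit only if the record ends at the end of the data
            # or is followed by another marker. This reduces, but does not
            # eliminate, false hits on marker bytes inside a payload.
            if end == len(data) or data.startswith(MARKER, end):
                return pos, data[header_end:end]
        pos = data.find(MARKER, pos + 1)
    return None


# Three records; the middle payload deliberately contains the marker bytes.
data = pack_record(b"alpha") + pack_record(b"xxSTART-RECORDxx") + pack_record(b"omega")
```

Calling next_record(data, len(data) // 2) then behaves like the "start in the middle and find the next record" step described in the question.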
When would it be appropriate to localize a single ascii character?
for instance /, or | ?
is it ever necessary to add these "strings" to the localization effort?
just want to give some people the benefit of the doubt and make sure there's not something I didn't think of.
Generally it wouldn't be appropriate to use something like that except as a graphic element (which of course wouldn't be I18N'd in the first place, much less L10N'd). If you are trying to use it to e.g. indicate a ratio then you should have something like "%d / %d" instead, and localize the whole thing.
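Here is a minimal sketch of localizing the whole pattern rather than the single character. The message catalog, its keys, and the German wording are all hypothetical and only illustrate the shape. A real project would load these strings from its localization framework.

```python
# Hypothetical message catalog: the keys, locales and translations are
# made up for illustration and not taken from any real project.
MESSAGES = {
    "en": {"progress": "{done} / {total}"},
    "de": {"progress": "{done} von {total}"},
}


def progress_text(locale, done, total):
    """Format a ratio using the whole localized pattern, falling back to English."""
    catalog = MESSAGES.get(locale, MESSAGES["en"])
    return catalog["progress"].format(done=done, total=total)
```

Because the translator sees the entire pattern, they can replace or reorder anything in it, which is not possible when only the / is exposed as a separate string.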
Yes, there are cases where these individual characters change in localization. This is not a comprehensive list, just examples I happen to know.
Not every locale uses , to separate thousands and . for the decimal. (However, these will usually be handled by your number formatter. If you do so yourself, you're probably doing it wrong. See this MSDN blog post by Michael Kaplan, Number format and currency format are not always the same.)
Not every language uses the same quotation marks (“, ”, ‘ and ’). See Wikipedia on Non-English Uses of Quotation Marks. (Many of these are only easy to replace if you use full quote marks. If you use the " and ' on your keyboard to mark both the start and end of quotations, you won't know which of two symbols to substitute.)
In Spanish, a question or exclamation is preceded by an inverted ? or !. ¿Question? ¡Exclamation! (Obviously, you can't fix this with a locale substitution for a single character. Any questions or exclamations in your application should be entire strings anyway, unless you're writing some stunningly intelligent natural language generator.)
If you do find a circumstance where you need to localize these symbols, be extra cautious not to accidentally localize a symbol like / used as a file separator, " to denote a string literal or ? for a search wildcard.
However, this has already happened with CSV files. These may be separated by ,, or may be separated by the local list separator. See What would happen if you defined your system's CSV delimiter as being a quotation mark?
In Greek, questions end with a semicolon rather than ?, so essentially the ? is replaced with ; ... however, you should aim to always translate the question as a complete string including question mark anyway.