I've seen a number of QR codes that contain a URL but also have some extra text after it. Something like:
http://www.example.com Thanks for scanning this QR code!
I've experimented with using a number of different delimiting methods (several spaces, a question mark, two dashes, one or two returns) and all work to varying degrees on various scanning programs.
Some respect the space character, others respect the return. Some think the URL isn't a URL at all when I use a return. Long story short, it's all over the map how the various scanning programs (NeoReader, iNigma, Qrafter, Beetag, OptiScan, etc.) treat characters after a URL.
Is there any consensus on whether (a) this is even a good idea and (b) if so, what the 'correct' (best-practice) way to do it is? (I know I should go read the RFC for the exact definition of a URL, but since the reader programs are all over the map, I suspect they didn't read it either.)
You can make it work by converting the text message into a valid URL while trying to keep it readable.
In your case it could be:
http://www.example.com?Thanks_for_scanning_this_QR_code
It's not perfect, but it can help on the web analytics side to distinguish QR code users.
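If you want to keep the message's original spacing and punctuation, a variant is to fold it into a proper query parameter and let the encoder do the work. A minimal sketch in Swift (the msg parameter name is just an illustration, not anything standard):
import Foundation
// Fold the human-readable message into the URL itself as a query string,
// so the whole QR payload stays a single valid URL.
var components = URLComponents(string: "http://www.example.com")!
components.queryItems = [URLQueryItem(name: "msg",
                                      value: "Thanks for scanning this QR code!")]
// URLComponents percent-encodes the value automatically:
// http://www.example.com?msg=Thanks%20for%20scanning%20this%20QR%20code!
print(components.url!)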
Spaces are definitely not part of a URL, so in that sense a space definitely should delimit the end of a URL.
But the entire string, taken as a whole, is not a URL. So yes, it's asking for trouble.
As you've found, the empirical answer is that not every reader does what you want. Barcode Scanner for instance understands the split here, but does not prompt the user to launch the browser since the payload isn't a URL per se.
So: it's a bad idea.
For those who are not familiar with what a homophone is, I provide the following examples:
our & are
hi & high
to & too & two
While using the Speech API included with iOS, I am encountering situations where a user may say one of these words, but it will not always return the word I want.
I looked into the alternativeSubstrings property, wondering if this would help, but in my testing of the above words it always comes back empty.
I also looked into the Natural Language API, but could not find anything in there that looked useful.
I understand that as a user adds more words, the Speech API can begin to infer context and correct for these, but my use case will not work well with that, since it will often involve only one or two words at most, limiting the effectiveness of context.
An example of contextual processing:
Using the words above on their own, I get these results:
are
hi
to
However, if I put them into the following sentence, you can see that those isolated results are all wrong:
I am too high for our ladder
Ideally, I would either get back a list containing [are, our], [to, too, two], [hi, high] for each transcription segment, or have a way to compare a string against a function that supports homophones.
An example of this would be:
if myDetectedWord == "to" then { ... }
Where myDetectedWord can be [to, too, two], and this function would return true for each of these.
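Concretely, a sketch of the helper I have in mind (the sets are just my examples above, maintained by hand; this is not any Apple API):
// Homophone sets; extend as needed for the target vocabulary.
let homophoneSets: [Set<String>] = [
    ["our", "are"],
    ["hi", "high"],
    ["to", "too", "two"],
]
// True if the two words are equal or belong to the same homophone set.
func matchesHomophone(_ detected: String, _ expected: String) -> Bool {
    let d = detected.lowercased(), e = expected.lowercased()
    if d == e { return true }
    return homophoneSets.contains { $0.contains(d) && $0.contains(e) }
}
matchesHomophone("two", "to")   // true
matchesHomophone("high", "to")  // false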
This is a common NLP dilemma, and I'm not sure what your desired output in this application might be. If you can, you may want to bypass this problem in your design/architecture process; otherwise it turns into a real challenge.
That said, if you wish to really get into it, I like this idea of yours:
string against a function
This might be more efficient and performance-friendly.
One way I'd like to solve this problem would be through regex processing, instead of endless loops and arrays. You could prototype with loops and arrays to begin with and see how it works, then switch to regular expressions to gain performance.
You could, for instance, encode fixed word sets as alternations in regular expressions and quickly check them against your string (word by word, maybe using back-referencing), and you can add as many boundaries to your expressions as you wish for string processing; see the sketch below.
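A hedged sketch of that idea in Swift, with one homophone set written as a single alternation guarded by word boundaries (the set is just the question's example):
import Foundation
let toSet = #"\b(to|too|two)\b"#   // one fixed set as a regex alternation
let detected = "two"
// range(of:options:) runs the pattern against the word.
let isMatch = detected.range(of: toSet,
                             options: [.regularExpression, .caseInsensitive]) != nil
print(isMatch)  // true: "two" falls inside the to/too/two set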
Your fixed sets can also be designed based on the probability of certain words occurring in certain parts of a string. For instance,
^I
vs
^eye
The probability of I being the first word is much higher than that of eye.
The probability of I appearing in any part of a string is also higher than that of eye.
You might want to weight words based on that.
I'd say the key is to narrow your desired outputs down as far as possible to increase accuracy (maybe even to 100 words, if possible), if you wish to have a good, working application.
Good project though, I hope you like/enjoy the challenge.
I am trying to read text in a given rectangle using the readText() function.
The function works correctly except when it has to read text that contains special characters like ' _ & etc.
I tried using validCharacters with the readText() function, but it didn't help.
Code -
put ReadText((287,125,810,164),validCharacters:"_-'.ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789") into Login
I tried working with character collections, but that doesn't seem right, because the text I am trying to pick up is dynamic: a combination of numbers, letters, and a special character. One cannot create a character-collection library covering every letter (a-z, A-Z), number (0-9), and special character.
Example of the text I am trying to read:
Login_Userid1_1, Login'Userid1_1
So how do I read such text correctly?
Debugging OCR is a bit of an imprecise science. EggPlant has a lot of OCR parameters to tweak. When designing test cases, it's best to use other mechanisms to gather information whenever possible; ReadText() should be considered a last resort when more reliable methods are unavailable. When I've used it, I've often needed a lot of trial and error with the settings and the SearchRectangle to get consistent results. Without seeing exactly what images you are trying to read text from, it's difficult to impossible to troubleshoot where the issue might be.
One thing that does stand out to me is that you're trying to read strings that may contain underscores. ReadText() has an optional property IgnoreUnderscores which treats underscores as spaces. By default this property is set to ON. It defaults to ON because some OCR engines have problems identifying underscore characters consistently.
If you want to have ReadText() handle underscores you'll want to explicitly set this property to OFF.
ReadText(rect, validCharacters:chars, ignoreUnderscores:OFF)
I'm using the same zachrone iphonepdf, but I do not get the text; my text view shows nothing. What's the problem?
Here is my code:
NSString *text = convertPDF(@"Course.pdf");
texview.text = text;
But I do not get anything in the text view.
The text extractor zachron / pdfiphone (I assume you meant that one) is extremely naive and makes very many assumptions:
It ignores the PDF file structure and therefore completely ignores whether the data it inspects is still used in the current revision.
It ignores encryption and therefore will fail completely for many documents with usage restrictions.
It completely ignores font encodings and implicitly assumes an ASCII-ish one. This is fairly often true in small PDFs with English text only and no embedded fonts; otherwise the result can be anything.
... many many more assumptions ...
Unless one only has to deal with very simple documents and the extracted text is not really necessary for the functionality of one's code, I would propose using different code for text extraction.
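For example, on iOS 11 and later, one alternative is Apple's PDFKit, which handles the file structure and font encodings properly. A minimal sketch (the resource name is just the question's example):
import PDFKit
// Load the bundled PDF and let PDFKit extract the text of all pages.
func extractText(fromResource name: String) -> String? {
    guard let url = Bundle.main.url(forResource: name, withExtension: "pdf"),
          let document = PDFDocument(url: url) else { return nil }
    return document.string
}
// textView.text = extractText(fromResource: "Course") ?? ""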
My application parses incoming emails. I try to parse them as well as possible, but every now and then I get one with puzzling content. This time it is an email that looks to be ASCII, but the specified charset is: ansi_x3.110-1983.
My application handles it correctly by defaulting to ASCII, but it throws a warning which I'd like to stop receiving, so my question is: what is ansi_x3.110-1983 and what should I do with it?
According to this page on the IANA's site, ANSI_X3.110-1983 is also known as:
iso-ir-99
CSA_T500-1983
NAPLPS
csISO99NAPLPS
Of those, only the name NAPLPS seems interesting or informative. If you can, consider getting in touch with the people sending those mails. If they're really using Prodigy in this day and age, I'd be amazed.
The IANA site also has a pointer to RFC 1345, which contains a description of the bytes and the characters that they map to. Compared to ISO-8859-1, the control characters are the same, as are most of the punctuation, all of the numbers and letters, and most of the remaining characters in the first 7 bits.
You could possibly use the guide in the RFC to write a tool that maps the characters over, if someone hasn't written one already. To be honest, it may be easier to simply ignore the warnings about the weird character set, given that the character mapping is close enough to what is expected anyway.
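A hedged sketch of such a mapping tool, assuming (per RFC 1345) that the low 7 bits behave like ASCII; the override table is left empty here and would be filled in from the RFC for the positions where ANSI_X3.110-1983 actually diverges:
// Bytes below 0x80 pass through as ASCII; anything else goes through the
// override table or falls back to the Unicode replacement character.
func decodeAnsiX3110(_ bytes: [UInt8]) -> String {
    let overrides: [UInt8: Character] = [:]  // fill in from RFC 1345
    return String(bytes.map { b -> Character in
        if let mapped = overrides[b] { return mapped }
        return b < 0x80 ? Character(UnicodeScalar(b)) : "\u{FFFD}"
    })
}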
I have lots of UTF-8 content that I want inserted into URLs for SEO purposes. For example, post tags that I want to include in the URI (site.com/tags/id/TAG-NAME). However, only ASCII characters are allowed by the standards.
Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.
The solution seems to be to:
Convert the character string into a sequence of bytes using the UTF-8 encoding.
Convert each byte that is not an ASCII letter or digit to %HH, where HH is the hexadecimal value of the byte.
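A minimal sketch of those two steps using Foundation's addingPercentEncoding (the tag value is just an illustration):
import Foundation
let tag = "café au lait"
// Everything outside the unreserved set is converted byte-wise to %HH.
let unreserved = CharacterSet(charactersIn:
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~")
let encoded = tag.addingPercentEncoding(withAllowedCharacters: unreserved)!
print("site.com/tags/id/\(encoded)")  // site.com/tags/id/caf%C3%A9%20au%20lait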
However, that converts the legible (and SEO-valuable) words into mumbo-jumbo. So I'm wondering if Google is still smart enough to handle searches on URLs that contain encoded data, or if I should attempt to convert those non-English characters into their ASCII counterparts (which might help with Latin-based languages)?
Firstly, search engines really don't care about the URLs; they help visitors. Visitors link to sites, and search engines care about that. URLs are easy to spam; if search engines cared, there would be an incentive to spam, and no major search engine wants that. The allinurl: operator is merely a feature of Google to help advanced users, not something that gets factored into organic rankings. Any benefit you get from using a more natural URL will probably come as a fringe benefit of the PR from an inferior search engine indexing your site, and there is some evidence this can be negative with the advent of negative PR too.
From Google Webmaster Central:
Does that mean I should avoid rewriting dynamic URLs at all?
That's our recommendation, unless your rewrites are limited to removing unnecessary parameters, or you are very diligent in removing all parameters that could cause problems. If you transform your dynamic URL to make it look static you should be aware that we might not be able to interpret the information correctly in all cases. If you want to serve a static equivalent of your site, you might want to consider transforming the underlying content by serving a replacement which is truly static. One example would be to generate files for all the paths and make them accessible somewhere on your site. However, if you're using URL rewriting (rather than making a copy of the content) to produce static-looking URLs from a dynamic site, you could be doing harm rather than good. Feel free to serve us your standard dynamic URL and we will automatically find the parameters which are unnecessary.
I personally don't believe it matters all that much, short of getting a little more click-through and helping users out. As far as Unicode goes, here is how it works: the request goes to the percent-encoded destination, but the rendering engine must know how to handle this if it wishes to decode it back to something visually appealing. Google will render (i.e., decode) percent-encoded Unicode URLs properly.
Some browsers make this slightly more complex by always encoding the hostname portion, because of phishing attacks using ideographs that look the same.
I wanted to show you an example of this; here is a request to http://hy.wikipedia.org/wiki/Գլխավոր_Էջ issued by wget:
Hypertext Transfer Protocol
GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n
Request Method: GET
Request URI: /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB
Request Version: HTTP/1.0
User-Agent: Wget/1.11.4\r\n
Accept: */*\r\n
Host: hy.wikipedia.org\r\n
Connection: Keep-Alive\r\n
\r\n
As you can see, wget, like every browser, will just URL-encode the destination for you and then continue the request to the URL-encoded destination. The URL-decoded form only exists as a visual convenience.
Do you know what language everything will be in? Is it all Latin-based?
If so, then I would suggest building a sort of lookup table that converts UTF-8 to ASCII when possible (and non-colliding). Something like that would convert Ź into Z and such, and when there is a collision or the character doesn't exist in your lookup table, it just falls back to %HH.
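A hedged sketch of that idea, leaning on Foundation's built-in Latin transliteration rather than a hand-rolled table; anything still outside the unreserved set falls back to %HH:
import Foundation
func slug(_ tag: String) -> String {
    // Transliterate to Latin, then strip the accents: "Ź" -> "Z".
    let latin = tag.applyingTransform(.toLatin, reverse: false) ?? tag
    let ascii = latin.applyingTransform(.stripCombiningMarks, reverse: false) ?? latin
    // Whatever is left that is not an unreserved URI character becomes %HH.
    let unreserved = CharacterSet(charactersIn:
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~")
    return ascii.addingPercentEncoding(withAllowedCharacters: unreserved) ?? ascii
}
print(slug("Źrebię tag"))  // Zrebie%20tag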