FSharp.Data.Toolbox.Twitter truncates Text to 140 characters

F#'s type provider for Twitter doesn't seem to have been updated to account for the larger 280 character limit. Is anyone aware of a work-around or is this no longer usable?

Related

Matching text parsed from a PDF with PDFBox

This is more of a lesson learned than a question. I was recently struggling to match strings parsed out of a PDF using PDFBox. My solution might be helpful to others.
A list of text lines was obtained from the PDF using PDFBox like this (exceptions omitted for brevity):
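// f is the java.io.File to read (exception handling omitted)
List<String> lines = new ArrayList<String>();
PDDocument document = PDDocument.load(f);
PDFTextStripper pdfStripper = new PDFTextStripper();
String text = pdfStripper.getText(document);
document.close(); // release the file once the text has been extracted
// Split the extracted text into lines using the stripper's own separator
String[] pageText = text.trim().split(pdfStripper.getLineSeparator());
for (String line : pageText) {
    lines.add(line);
}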
The List now contains all the lines from the file in order.
However, String.contains and String.equals fail on lines that look identical in the logs (e.g. 'EMERA INCORPORATED'). Converting each character to hex made it clear that the space character was the issue:
Line (Parsed from PDF with PDF Box): EMERA INCORPORATED
45 4d 45 52 41 a0 49 4e 43 4f 52 50 4f 52 41 54 45 44
CompanyName (Set In Java): EMERA INCORPORATED
45 4d 45 52 41 20 49 4e 43 4f 52 50 4f 52 41 54 45 44
Note the 'a0' in the PDFBox String where in Java there is the space ('20').
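A hex dump like the ones shown here can be produced with a few lines of Java. This is only a minimal sketch: the class name HexDump and the method toHex are hypothetical names chosen for illustration, not part of PDFBox or of the original code.

```java
final class HexDump {
    // Renders each UTF-16 code unit of the string as lowercase hex, separated by spaces.
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First string contains a NO-BREAK SPACE (U+00A0), second a regular SPACE (U+0020).
        System.out.println(toHex("EMERA\u00A0INCORPORATED"));
        System.out.println(toHex("EMERA INCORPORATED"));
    }
}
```

Logging the hex form next to the plain string makes invisible differences such as 'a0' versus '20' show up immediately.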
The solution was to use a regex to identify the line: EMERA\S+INCORPORATED. (This matches because Java's default '\s' class does not include the 'a0' character, so '\S' accepts it; the same pattern would not match a regular space.) This gives better control over the matching, so it's not bad. But it was a bit annoying to figure out because, when reviewing the logs, the Strings being compared looked identical, yet both contains and equals returned false.
My conclusion: use a regex to identify text patterns coming out of a PDF (obtained with PDFBox), and make sure the pattern allows for space characters other than the regular one. Maybe this post can save someone some pain. Also, perhaps someone more familiar with PDFBox could provide tips on using the API better if this is user error on my part.
"perhaps someone more familiar with PDFBox could provide tips on using the API better if this is user error on my part"
It is not an error in PDFBox API usage. It is not even specific to PDFBox at all. It is more a matter of wrong expectations.
Different kinds of space characters
First of all, there are different kinds of space characters. The most often used one is of course the Unicode Character 'SPACE' (U+0020), but there are others, in particular the Unicode Character 'NO-BREAK SPACE' (U+00A0).
Thus, unless you know that only one particular space character is used in a given text, it is completely normal to use regular expressions with a character class that covers the possible space characters instead of a literal ' '. Note that in Java the default '\s' does not include U+00A0, so that character has to be added to the class explicitly.
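As a rough sketch of that idea in Java: the pattern below accepts either kind of space between the two words, and a second helper normalizes no-break spaces so that plain equals and contains work again. The class and method names are hypothetical, and the company name is just the example from the question.

```java
import java.util.regex.Pattern;

final class SpaceTolerantMatch {
    // One or more whitespace characters between the words. Java's default \s
    // does not include U+00A0, so the no-break space is listed explicitly.
    static final Pattern NAME = Pattern.compile("EMERA[\\s\\u00A0]+INCORPORATED");

    // True if the line contains the company name with any accepted kind of space.
    static boolean containsName(String line) {
        return NAME.matcher(line).find();
    }

    // Alternative: replace no-break spaces with regular spaces before comparing.
    static String normalizeSpaces(String line) {
        return line.replace('\u00A0', ' ');
    }

    public static void main(String[] args) {
        String fromPdf = "EMERA\u00A0INCORPORATED";
        System.out.println(containsName(fromPdf));
        System.out.println(normalizeSpaces(fromPdf).equals("EMERA INCORPORATED"));
    }
}
```

Normalizing once, right after extraction, keeps the rest of the code free of special cases; the regex variant is handier when the original text must stay untouched.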
What does PDFBox extract?
In the case at hand, the non-breaking space was not even chosen by PDFBox. Instead, it was ingrained in the PDF.
When extracting text from a PDF, PDFBox (just like other PDF libraries) uses the information inside the PDF concerning which glyph represents which Unicode character. This information can be given by an Encoding entry or a ToUnicode entry of the respective font declaration in the PDF.
Only if there is a gap between two text chunks (a free space not created by drawing a space character but by moving the text insertion point without a text character) do PDF text extractors add a space character of their respective choice, usually the regular space.
As PDFBox does use the regular space in the latter case, the issue at hand is an instance of the first case: the PDF itself indicates that the space there is a non-breaking one.

Why do some characters have Escape character before them?

Some special characters such as |, ~, ^, {, } and many others have escape characters before them. Have a look at the screenshot or visit this link: http://messente.com/documentation/sms-length-calculator to check for yourself.
I want to know why these characters have escape characters before them, and how or why they differ from other (special) characters.
See here for information about the GSM 03.38 encoding.
“Why” questions are always difficult to answer precisely, but my guess is that the goal is to be able to encode the characters deemed most common with 7 bits, while other, less frequent characters will require 14 bits.
There are only 1120 bits (140 bytes) per SMS, so saving space is desirable. With the above encoding, a “normal” text message can hold up to 160 characters (1120 / 7) instead of 140.
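The arithmetic can be sketched in Java as follows. This is a simplified illustration, not a full GSM 03.38 encoder: it assumes every character outside the extension table belongs to the basic 7-bit alphabet, and the class and method names are made up for the example.

```java
final class Gsm7Length {
    // Characters of the GSM 03.38 extension table: form feed ^ { } \ [ ~ ] | and the euro sign.
    // Each is sent as an escape code plus a second code, i.e. 14 bits instead of 7.
    static final String EXTENSION = "\f^{}\\[~]|\u20AC";

    static final int BITS_PER_SMS = 1120;

    // Number of 7-bit units (septets) needed for the text, assuming all
    // non-extension characters are in the basic alphabet (not checked here).
    static int septets(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            count += EXTENSION.indexOf(text.charAt(i)) >= 0 ? 2 : 1;
        }
        return count;
    }

    // True if the text fits into the 1120 bits of a single message.
    static boolean fitsInOneSms(String text) {
        return septets(text) * 7 <= BITS_PER_SMS;
    }

    public static void main(String[] args) {
        System.out.println(septets("a|b"));
        System.out.println(fitsInOneSms("Hello {world}"));
    }
}
```

A message made up only of escaped characters would therefore hold half as many characters as one using only the basic alphabet.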

iOS Twitter post issue - number of characters

As far as I know, the maximum length of a Twitter post is 117 characters. When I post plain text, there is no problem. But when I add text containing a hyperlink, posting 117 characters fails (I can reduce the total length in order to post successfully). Why is that?
Plain text:
Text with a URL inside (this causes a problem even when 3 characters remain; if I keep reducing the text, for example until 10 characters are left, I can post successfully):
Error:
I think Twitter's counting algorithm is different from the one iOS uses. Any ideas? Thanks.
I figured it out finally. My conclusion:
The maximum length of Twitter text is 140 on the web, 117 in iOS if there is no URL inside.
Each URL counts as 23 characters, no matter what its original length is. So you have to calculate the maximum allowed text length yourself.
Refer to https://github.com/twitter/twitter-text for details, although the max length in the ObjC version is wrong.
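The counting rule from this answer could be sketched in Java like this. The 140 and 23 figures are the ones quoted in the answer; the URL pattern is a deliberately naive assumption for illustration (twitter-text uses a far more thorough one), and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TweetLength {
    static final int MAX_LENGTH = 140; // limit quoted in the answer
    static final int URL_LENGTH = 23;  // every URL counts as 23 characters

    // Simplified URL matcher (assumption): "http://" or "https://" up to the next whitespace.
    static final Pattern URL = Pattern.compile("https?://\\S+");

    // Length of the text with every URL counted as URL_LENGTH characters.
    static int countedLength(String text) {
        int length = text.length();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            length += URL_LENGTH - (m.end() - m.start());
        }
        return length;
    }

    // Characters still available before the limit is reached.
    static int remaining(String text) {
        return MAX_LENGTH - countedLength(text);
    }

    public static void main(String[] args) {
        System.out.println(remaining("see http://a.co"));
    }
}
```

Under these figures, a post with one URL leaves 140 - 23 = 117 characters for everything else, which may be where the 117 figure comes from.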

Blackberry scrambled device ID in 8 characters unique

I need to scramble device IDs on the BlackBerry for privacy reasons.
When I call the function DeviceInfo.getDeviceId() I get a 9-digit number. After converting it to hexadecimal, I get the real PIN for the device (or device ID, depending on what you call it) as 8 characters.
As I said, for privacy reasons I can't store the PIN as is in my database. So I would like to scramble the ID into a unique one, still 8 characters long. If I use MD5 or other encryption, I always get a value containing more than 8 characters.
Do you know a way to get a unique 8-character string from the device ID?
Thank you.
You can use a short block cipher to obfuscate the message. Look at the CBC-MAC mode of operation.
As the output you want is actually only 4 bytes long, you could even use a CRC, such as CRC32.
Note that you would need a "perfect hash" to avoid overlaps; neither short-key CBC-MAC nor CRC32 will give you a perfect hash. I would strongly suggest using a longer hash function.
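For illustration, the CRC32 suggestion might look like this with the standard Java SE class java.util.zip.CRC32. Whether that class is available on the BlackBerry platform has not been checked here, and the class and method names are hypothetical. As noted, a CRC is not a cryptographic function and different inputs can collide.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class DeviceIdScrambler {
    // Maps the device ID string to 8 lowercase hex characters (a 4-byte CRC32 value).
    static String scramble(String deviceId) {
        CRC32 crc = new CRC32();
        crc.update(deviceId.getBytes(StandardCharsets.UTF_8));
        // getValue() returns the 32-bit checksum as a non-negative long.
        return String.format("%08x", crc.getValue());
    }

    public static void main(String[] args) {
        System.out.println(scramble("123456789"));
    }
}
```

The result always has exactly 8 characters, which meets the length requirement but not the uniqueness or privacy guarantees a longer keyed hash would give.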

Why Doesn't Delphi 2009 Give A Message For A String Constant that is Too Long?

It got me stuck for an hour. I happened to have a string constant with about 280 characters in it.
I was trying to use Pos to look for a substring within a long string constant. It gives the result 0 when the string is longer than 255 characters, but gives the correct result when the string is shorter.
e.g.:
pos('5', '123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.12345')
Containing 255 characters correctly returns the number 5.
pos('5', '123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456')
Containing 256 characters gives a compiler error:
[DCC Error] E2056 String literals may have at most 255 elements.
pos('5', '123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.1234567')
But 257 characters or more does not produce any messages and incorrectly returns the number 0.
This led me on a wild goose chase for a while.
I also found the same is true for a simple assignment to a string, e.g.:
S1 :='123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456789.123456';
gives that error message and won't compile.
But add one more character, and S1 is assigned a null string.
Is there a Delphi option that should be turned on to warn me of this, or is this just a "bug" by the Embarcadero programmers?
Could someone please check if Delphi 2010 now gives a message for all strings >= 255 characters, or for just 256 characters and not for those >= 257.
If not, how do I go about getting it noticed at Quality Central? I can't even figure out how to see if that problem has been reported.
Thanks for allowing me to vent.
Kornel's answer linked to a forum discussion that links to the bug report that says it has been fixed in build 14.0.3513.24210.
p.s. Don't you think Embarcadero should have eliminated the 255 limit when Delphi 2009's Unicode strings were introduced?
Are you using AnsiStrings or ShortStrings? ShortStrings (string) have a cap on length, AnsiStrings don't (they're null terminated). Or alternatively, have you tried compiling with {$H+} (AnsiStrings by default)?
To get over the length limit of the constant, use "split longer into addition" + "of shorter strings under 255 chars".
Also, there's a similar discussion on the Delphi support forums here.
I assume that the reason why the literals can't be longer is because the compiler stores them as short strings (not to allocate them on heap) hence the one-byte-size length limit stands here.
As for why Delphi doesn't report it... well, it's a known bug that supposedly has been fixed, and even has a compiler patch.
Delphi has a limit of 255 characters for a string constant. I think you can make it longer by concatenating pieces together using +, but it's a very old limit based on the fact that in Pascal and Delphi 1 all strings were limited to a maximum of 255 characters.
Annoying but easily worked around.

Resources