Regex: Capturing a hashtag when it is the first character of a tweet - twitter

/[\s\W]#(\w+)/g
This is a very simple expression that will correctly capture most desired Twitter hashtag cases, EXCEPT for the special case where the hashtag is the first word of the tweet, with no leading characters:
http://regexr.com/3h2rr
If we make the first character set optional, it will correctly capture a hashtag at the start of the string, but it will also incorrectly accept hashtags with leading alphanumeric characters:
http://regexr.com/3h2ru
A programmatic workaround is simply to always add a space to the beginning of the tweet string, thus bypassing the limitation of this simple expression, but now I'm really curious to see how to do this the right way.
Cheers

/[\s\W]*#(\w+)/g
This makes the preceding character optional, so a hashtag at the very start is captured. Note, though, that because the quantifier can also match zero characters mid-string, it accepts embedded hashtags like abc#tag; anchoring with an alternation, /(?:^|[\s\W])#(\w+)/g, avoids that.
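
A minimal Python sketch of the anchored alternative (Python's re module is an assumption here; the same pattern works in JavaScript's regex engine):

    import re

    # Match a hashtag either at the start of the string or after a
    # non-word character; the word characters after '#' form group 1.
    HASHTAG = re.compile(r'(?:^|[^\w])#(\w+)')

    def hashtags(tweet):
        return HASHTAG.findall(tweet)

    print(hashtags('#first tweet with #second, but not abc#third'))
    # ['first', 'second']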


Unicode URLs shown in wrong order

I have enabled Unicode URLs on my Joomla site.
My language is Persian, which is a right-to-left language, but URLs written in Persian appear in the wrong order. For example:
Mysite.com/محصولات/محصول-اول
Translated, it displays as:
Mysite.com/first-product/products
when it should be:
Mysite.com/products/first-product
This is only a matter of displaying text. I know that the actual text the server receives is in the correct order, because the URL-encoded version has the correct order.
(If you don't get the idea, type "something.com/" in your URL bar. Now copy/paste this at the end of the URL:
محصولات
Now type a slash and copy/paste this at the end:
محصول
You see? The last one should have gone to the right, but it goes to the left.)
I have two questions regarding this issue:
1. Is there anything I can do to display URLs in the correct order?
2. Can it affect how Google indexes my pages? Can it misdirect Google?
The behaviour of the URL display is entirely correct in the Unicode sense, as the slash is defined as bidirectionally neutral:
http://www.fileformat.info/info/unicode/char/002f/index.htm
Thus, standing between two Arabic-script (right-to-left) words, the slash has to adapt to the writing direction of the surrounding words. The slash would, however, never adapt to the writing direction of the whole line while within a right-to-left neighborhood.
To answer your questions:
(1) It is not possible to influence this behaviour if you do not change the URL, as Jukka K. Korpela already assumed.
(2) As long as the order of the words is correctly encoded, I do not see any bad consequences for search engine indexings.
If you want to change it anyway, and assuming that your URLs are artificial and do not represent real paths, I can see the following workarounds:
(a) Substitute the slash with another "strong" symbol which influences the writing direction.
(b) Insert a "pseudo-strong" character (U+200E, LEFT-TO-RIGHT MARK) before the slash, which will enforce LTR for the slash.
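As an illustration of workaround (b), here is a minimal Python sketch (the function name and the display-only usage are my assumptions):

    # U+200E (LEFT-TO-RIGHT MARK) is an invisible, strongly-LTR character.
    # Per workaround (b), inserting it before each slash gives the
    # bidirectionally neutral '/' a left-to-right neighbour, so it keeps
    # its place when displayed between right-to-left words.
    LRM = "\u200e"

    def ltr_slashes(display_url):
        # For display only; the actual href should stay unchanged.
        return display_url.replace("/", LRM + "/")

    print(ltr_slashes("Mysite.com/محصولات/محصول-اول"))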
Hope this helps.

Regular expression for hashtag text with multi language support

I have texts like #sample_123, #123_sample, #_sample123, so I have to use a regular expression to check that the text contains only alphanumeric characters and underscores, and I also want to support multiple languages.
Currently I am using a regular expression like (#)([:alpha:]+), but it detects only #sample (e.g., from #sample_123). Can anyone please suggest the correct regular expression to fix this problem?
you can use:
^#(\d|\w|_)+$
Debuggex Demo
This validates any word that starts with a hash and contains only alphanumeric characters or underscores. Of course, there are no restrictions on how many characters must follow the hash, so a hashtag like #_ is considered valid; if this is not the wanted behavior, please be more detailed about the constraints you want.
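
Since \w already covers digits and the underscore, the pattern reduces to ^#\w+$. A minimal Python sketch (in Python 3, \w is Unicode-aware by default, which gives the multi-language support asked for):

    import re

    # \w matches letters of any script, digits, and '_' in Python 3,
    # so one pattern covers #sample_123 as well as non-Latin hashtags.
    HASHTAG = re.compile(r'^#\w+$')

    for tag in ('#sample_123', '#123_sample', '#_sample123', '#نمونه_123', '#bad tag'):
        print(tag, bool(HASHTAG.match(tag)))  # the last one prints False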

CFStringTokenizer not tokenizing lower-case sentences

I'm trying to use CFStringTokenizer with kCFStringTokenizerUnitSentence to split a string into sentences. The first problem I'm having is that sentences need to be capitalized in order for them to be recognized as sentences. If not, it just thinks it's part of the previous sentence.
I'm splitting user-entered text so I'm expecting the text to be very unclean.
Is there something else I can do with CFStringTokenizer to have it detect uncapitalized sentences? Or will I have to use another method of splitting altogether?
I followed the answer on this SO question for my implementation:
How to get an array of sentences using CFStringTokenizer?
NOTE: After testing a bit more, it seems that with kCFStringTokenizerUnitSentence, if a '!' or a '?' is followed by an uncapitalized sentence, it will recognize the sentence. It will also still split the sentences if one of those punctuation marks is followed by a sentence with no space between the punctuation mark and the first word.
So the one case I need to work around is a '.' followed by an uncapitalized sentence.
ANOTHER OPTION I found, if you're getting the text from a textField, is to use this:
textField.autocapitalizationType = UITextAutocapitalizationTypeSentences;
It will automatically capitalize sentences so you don't have to worry about converting for CFStringTokenizer. It still doesn't account for edge cases like abbreviations, but at least in my case the user will have an option to delete the auto-capitalization if it's wrong.
You can convert the input string to all uppercase first, then run it through CFStringTokenizer and use the resulting ranges to take substrings of the original input string. But be careful here: some characters become more than one character after conversion to uppercase (the German 'ß', for example, uppercases to 'SS'), which shifts the ranges.
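A quick illustration of that caveat, using Python's Unicode-aware upper() (an assumption: CFString's default uppercasing applies the same full case mapping):

    s = "das ist ein straße-test. noch ein satz."
    u = s.upper()

    # The German sharp s 'ß' uppercases to 'SS', so the uppercased copy
    # is longer than the original: token ranges computed on one string
    # no longer line up with the other.
    print(len(s), len(u))  # 39 40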

Valid URL separators

I have a long URL with several values.
Example 1:
http://www.domain.com/list?seach_type[]=0&search_period[]=1&search_min=3000&search_max=21000&search_area=6855%3B7470%3B7700%3B7730%3B7741%3B7742%3B7752%3B7755%3B7760%3B7770%3B7800%3B7840%3B7850%3B7860%3B7870%3B7884%3B7900%3B7950%3B7960%3B7970%3B7980%3B7990%3B8620%3B8643%3B8800%3B8830%3B8831%3B8832%3B8840%3B8850%3B8860%3B8881%3B9620%3B9631%3B9632
My variable search_area contains only 4-digit numbers (for example 4000, 5000), but it can contain a lot of them. Right now I separate these in the URL by using ; as the separator symbol. However, as seen in Example 1, the ; is converted into %3B. This makes me believe that it is a bad symbol to use.
What is the best URL separator?
Moontear, I think you have misread the linked document. That limitation only applies to the "scheme" portion of the URL. In the case of WWW URLs, that is the "http".
The next section of the document goes on to say:
Thus, only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.
I'd personally use comma (,). However, plus (+) and dash (-) are both reasonable options as well.
BTW, that document also mentions that semi-colon (;) is reserved in some schemes.
Well, according to RFC1738, valid URLs may only contain the letters a-z, the plus sign (+), period and hyphen (-).
Generally I would go with a plus to separate your search areas. So your URL would become http://www.domain.com/list?seach_type=0&search_period=1&search_min=3000&search_max=21000&search_area=6855+7470+7700+...
--EDIT--
As GinoA pointed out I misread the document. Hence "$-_.+!*'()," are valid characters too. I'd still go with the + sign though.
If there are only numbers to separate, you have a large choice of separators; you could choose any letter, for example.
A space can also be a good choice. It will be transformed into the + character in the URL, which is more readable than a letter.
Example: search_area=4000+5000+6000
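For example, with Python's urllib (an illustration; any form-encoding library behaves the same way):

    from urllib.parse import urlencode

    # A space-separated value is form-encoded with '+' for each space.
    print(urlencode({'search_area': '4000 5000 6000'}))
    # search_area=4000+5000+6000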
I'm very late to the party, but a valid query string can repeat variables so instead of...
http://x.y.z/list?type=0&period=1&min=3000&max=21000&area=6855+7470+7700
...you could also use...
http://x.y.z/list?type=0&period=1&min=3000&max=21000&area=6855&area=7470&area=7700
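A sketch of both directions in Python (the use of urllib is an assumption; the repeated-key convention itself is framework-agnostic):

    from urllib.parse import urlencode, parse_qs

    # Encoding: a sequence of pairs repeats the key.
    query = urlencode([('area', 6855), ('area', 7470), ('area', 7700)])
    print(query)           # area=6855&area=7470&area=7700

    # Decoding: parse_qs collects repeated keys into a list.
    print(parse_qs(query)) # {'area': ['6855', '7470', '7700']}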
"+" is to be interpreted as a space " " when the content-type is application/x-www-form-urlencoded (standard for HTML forms). This may be handled by your server software.
I prefer "!". It doesn't get URL encoded (at least not in Chrome) and it reserves "+" for use as a real space character in the typical case.

Why do you need to encode URLs?

Why do you need to encode URLs? Is there a good reason why you have to change every space in the GET data to %20?
Because some characters have special meanings.
For instance, in a query string, the ampersand (&) is used as a separator between key-value pairs. If you were to put an ampersand into one of those values, it would look like the separator between the end of a value and the beginning of the next key. So for special characters like this, we use percent encoding so that we can be sure that the data is unambiguously encoded.
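A small Python illustration (the 'title'/'year' keys are hypothetical):

    from urllib.parse import urlencode

    # Unencoded, the '&' inside the value would look like a pair separator:
    #   title=Tom & Jerry&year=1940   -> three apparent pairs, one broken
    # Percent encoding keeps the data unambiguous:
    print(urlencode({'title': 'Tom & Jerry', 'year': '1940'}))
    # title=Tom+%26+Jerry&year=1940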
From RFC 2396, section 2.4.3:
The space character is excluded because significant spaces may disappear and insignificant spaces may be introduced when URI are transcribed or typeset or subjected to the treatment of word-processing programs. Whitespace is also used to delimit URI in many contexts.
Originally, older browsers could get confused by the spaces (not really an issue anymore).
Now, if someone copies the URL to send as a link, the space can break the hyperlink. For example:
Hey! Check out this derping cat playing a piano!
http://www.mysite.com/?video=funny cat plays piano.
See how the link breaks?
Now look at this:
http://www.mysite.com/?video=funny%20cat%20plays%20piano.
Let's break down your question.
Why do you need to encode URLs?
A URL may be composed of only a limited set of characters: digits (0-9), letters (A-Z, a-z), and a few special characters ("-", ".", "_", "~").
So does that mean we cannot use any other character?
The answer is "yes", but there is a workaround: URL encoding, also known as percent encoding. If you want to transmit any character that is not in the set above, you need to encode it. That is why we encode a space as "%20".
Is that all there is to URL encoding? No, there is a lot more to it, but I'm not going to turn this into a big, boring technical answer. If you want to know more, you can read about it here: https://www.urlencoder.io/learn/ (credit goes to that writer).
Well, you do so because every browser knows how the string that makes up the URL is encoded. Converting the space to %20, etc., makes the URL/URI portable. It could be Latin-1, it could be Unicode; it needs to be normalized to something that is understood universally. Take a look at RFC 3986: https://www.rfc-editor.org/rfc/rfc3986#section-2.1
