TL;DR What rules does Twitter use to determine whether a tweet matches a given query, and how can they be replicated?
Hello,
I am using the Twitter API (both v1 and v2, long story) for the development of an academic tool for research purposes.
I need to be able to tell whether a given string would match a Twitter query. A simple regex keyword match wouldn't work because, as I understand it, tweets are tokenized, so that looking for Pied Piper can return #PiedPiper, #piedpiper_official, Pied Piper, #pied piper, etc.
I think the problem requires a deeper understanding of how the search works "under the hood" (not how to use the API, but rather the matching process and rules Twitter uses to determine which tweets a query returns). After days of research, I have found nothing.
Please let me know if you know any details. As small as they might seem, they can help a lot.
Concatenate tokens, lowercase, then match.
#!/usr/bin/env python3
import re  # docs.python.org/3/howto/regex.html

# Grab runs of word characters (letters, digits, underscore); skip '#' and whitespace.
tokens = re.findall(r'\w+', 'Pied Piper')
print(tokens)  # ['Pied', 'Piper']

# Concatenate and lowercase (www.w3schools.com/python/ref_string_join.asp)
lowercase = ''.join(tokens).lower()
print(lowercase)  # piedpiper
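To go from the normalization to an actual "would this match?" check, something like the sketch below could work. It only applies the concatenate-and-lowercase heuristic from this answer to both the query and the tweet text; it is not Twitter's documented matching rule, and the function names normalize and might_match are made up for illustration.

```python
import re

def normalize(text):
    # Keep runs of word characters, join them, and lowercase the result.
    return "".join(re.findall(r"\w+", text)).lower()

def might_match(query, tweet_text):
    # Heuristic only: the normalized query must appear inside the normalized tweet.
    return normalize(query) in normalize(tweet_text)

for sample in ["#PiedPiper", "#piedpiper_official", "Check out #pied piper", "Pie Piper"]:
    print(sample, might_match("Pied Piper", sample))
```

Be aware that this over-matches: "occupied piper" normalizes to a string that contains "piedpiper", so treat a positive result as a candidate rather than a confirmed match.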
I am developing an iOS application where I need to add functionality to switch the application language between Chinese and English.
I am using the Baidu API to achieve this. I am able to translate a single word or one complete sentence. But suppose my app has multiple pieces of text that I need to place at different locations: then I either have to hit the API multiple times or wrap them all into one API call.
I have followed their documentation, but nothing seems to work.
As per their documentation.....
1. How do I translate multiple words or more text in a request?
You can use the newline character (the escape sequence \n in most programming languages) in the q field to separate multiple words or pieces of text, so that each one gets an independent translation in the result. Note that the q field must be URL-encoded before the request is sent!
And I am trying to get the result for this....
appid = 2015063000000001 + q = apple + salt = 1435660288 + key = 12345678
Let me give an example: suppose I need to convert two different words, “apple” and “mango”.
2015063000000001+apple\ nmango+1435660288+7_8ogRLnl7PO52O0UYd2
2015063000000001apple\n mango143566028812345678 (Get the MD5 = c0610b314af72e42a4a5b9e62757faf7)
http://api.fanyi.baidu.com/api/trans/vip/translate?q=apple\nmango&from=en&to=zh&appid=2015063000000001&salt=1435660288&sign=c0610b314af72e42a4a5b9e62757faf7
When I am hitting above url on chrome then getting this result.
Result: {"error_code":"54001","error_msg":"Invalid Sign"}
I have now found the answer to my question.
In the URL, replace "\n" with its URL-encoded form, "%0A".
Also, generate the MD5 in code rather than with an online tool.
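For anyone who wants to see both fixes together, here is a rough Python sketch. It assumes the sign is the MD5 of appid + q + salt + key (the order shown in the question), computed over the raw text containing a real newline, and that only the URL carries the encoded "%0A". The appid, salt and key are the placeholder values from the question, and build_translate_url is a made-up helper name.

```python
import hashlib
from urllib.parse import urlencode

def build_translate_url(appid, key, salt, texts, src="en", dst="zh"):
    # Join the pieces with a real newline character, not the two characters "\" and "n".
    q = "\n".join(texts)
    # Assumption: the sign is computed over the un-encoded q.
    raw = f"{appid}{q}{salt}{key}"
    sign = hashlib.md5(raw.encode("utf-8")).hexdigest()
    params = {"q": q, "from": src, "to": dst,
              "appid": appid, "salt": salt, "sign": sign}
    # urlencode percent-encodes the newline as %0A.
    return "http://api.fanyi.baidu.com/api/trans/vip/translate?" + urlencode(params)

url = build_translate_url("2015063000000001", "12345678", "1435660288",
                          ["apple", "mango"])
print(url)
```

Computing the hash in code avoids the trap of pasting a literal backslash-n into an online MD5 tool, which hashes different bytes than the newline the server sees.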
I found the powerful RegexNER and its superset TokensRegex from Stanford CoreNLP.
There are some rules that should give me fine results, like the pattern for PERSONs with titles:
"g. Meho Mehic" or "gdin. N. Neko" (g. and gdin. are abbrevs in Bosnian for mr.).
I'm having some trouble with the existing tokenizer. It splits some strings into two tokens and leaves others as one. For example, "g." is left as a single token, <word>g.</word>, while "gdin." is split into two tokens: <word>gdin</word> and <word>.</word>.
That causes trouble with my regex, because I have to deal with both the one-token and the multi-token case (note the two "maybe-dot"s). RegexNER example:
( /g\.?|gdin\.?/ /\./? ([{ word:/[A-Z][a-z]*\.?/ }]+) ) PERSON
This also causes another issue, with sentence splitting: some sentences are not recognized correctly, so the regex fails. For example, when a sentence contains "gdin." it is split into two, because the dot is taken as the end of a (non-existent) sentence. I managed to bypass this with ssplit.isOneSentence = true for now.
Questions:
Do I have to make my own tokenizer, and how? (to merge some tokens like "gdin.")
Are there any settings I missed that could help me with this?
OK, I thought about this for a bit and can actually think of something pretty straightforward for your case. One thing you could do is add "gdin" to the list of titles in the tokenizer.
The tokenizer rules are in edu.stanford.nlp.process.PTBLexer.flex (look at line 741)
I do not really understand the tokenizer that well, but there is clearly a list of job titles in there, so these must be the cases where it will not split off the period.
This will of course require you to work with a custom build of Stanford CoreNLP.
You can get the full code at our GitHub: https://github.com/stanfordnlp/CoreNLP
There are instructions on the main page for building a jar with all of the main Stanford CoreNLP classes. I think if you just run the ant process it will automatically generate the new PTBLexer.java based on PTBLexer.flex.
I'm making a list of the videos on my channel, and want to use the search endpoint of the API: https://developers.google.com/youtube/v3/docs/search/list
There is a "q" parameter to send the query. What completely bugs me is that no wildcard is mentioned in the documentation, and using * doesn't do anything. For example, in order to find any video containing "television" in the title, the full word has to be entered! Sending "tel" won't work, nor will sending "televisio".
Did I miss something? Is there a way around this?
Thanks!
YouTube searching works along the same paradigm as Google searching, which is quite a bit different than the character-wildcard keyword approach. It's semantic probabilistic searching, looking for relevance based on the terms you give it, so while the * does represent a wildcard, it represents a whole word. For example, you can search for "a * saved" and it will return to you the videos which score the highest relevance score where any word could be substituted in place of your wildcard.
You can also use other punctuation based search operators ... the + sign, - sign, quotation marks, etc. Just make sure they're all URL encoded before you send the query in.
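As a rough illustration of the URL-encoding step, here is a small Python sketch that builds a search.list request URL with an operator-laden query. The query string, the "YOUR_API_KEY" placeholder and the helper name build_search_url are made up for the example; it only builds the URL and does not call the API.

```python
from urllib.parse import urlencode

def build_search_url(query, api_key, max_results=25):
    # urlencode percent-encodes quotes, '*' and other punctuation in q.
    params = {
        "part": "snippet",
        "type": "video",
        "q": query,
        "maxResults": max_results,
        "key": api_key,
    }
    return "https://www.googleapis.com/youtube/v3/search?" + urlencode(params)

# Whole-word wildcard inside a quoted phrase, plus an excluded term.
url = build_search_url('"a * saved" -trailer', "YOUR_API_KEY")
print(url)
```

Letting the library do the encoding is safer than hand-escaping, since characters like the double quote and * are easy to miss.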
In developing an iOS app containing a twitter client, I must allow for user generated hashtags (which may be created elsewhere within the app, not just in the tweet body).
I would like to ensure any such hashtags are valid for twitter, so I would like to error check the entered value for invalid characters. Bear in mind that users may be from non-English speaking countries.
I am aware of the usual limitations, such as not beginning a hashtag with a number, and no special punctuation characters, but I was wondering if there is a known list of all additional characters that are technically allowed within hashtags (i.e. international characters).
Karl, as you've rightly pointed out, any word in any language can be a valid twitter hashtag (as long as it meets a number of basic criteria). As such what you are asking for is a list of valid international word characters. I'm sure someone has compiled such a list somewhere, but using it would not be the most efficient approach to reaching what appears to be your initial goal: ensuring that a given hashtag is valid for twitter.
I believe what you are looking for is a regular expression that can match all word characters within a Unicode range. Such an expression would not be dependent on your locale and would match all characters in modern typography that can appear as part of a word.
You didn't specify what language you are writing your app in, so I can't help you with a language specific implementation. However, the basic approach would be as follows:
Check if any of the bracket expressions or character classes already support Unicode character ranges in your language. If yes, then use them.
Check if there is a regex modifier that can enable Unicode character range support for your language.
Most modern languages implement regular expressions in a fairly similar way and a lot of them borrow heavily from Perl, so I hope the following two examples will put you on the right track:
Perl:
Use POSIX bracket expressions (e.g. [[:alpha:]], [[:alnum:]], [[:digit:]], etc.) as they give you greater control over the characters you want to match, compared to character classes (e.g. \w).
Use /u modifier to enable Unicode support when pattern matching. Under this modifier, the ASCII platform effectively becomes a Unicode platform; and hence, for example, \w will match any of the more than 100,000 word characters in Unicode.
See Perl documentation for more info:
http://perldoc.perl.org/perlre.html#Character-set-modifiers
http://perldoc.perl.org/perlrecharclass.html#POSIX-Character-Classes
Ruby:
Use POSIX bracket expressions as they encompass non-ASCII characters. For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas /[[:digit:]]/ matches any character in the Unicode Nd category.
See Ruby documentation for more info:
http://www.ruby-doc.org/core-2.1.1/Regexp.html#class-Regexp-label-Character+Classes
Examples:
Given a list of hashtags, the following regex will match all hashtags that start with a word character (inc. international word characters) followed by at least one other word character, a number or an underscore:
m/^#[[:alpha:]][[:alnum:]_]+$/u # Perl
/^#[[:alpha:]][[:alnum:]_]+$/ # Ruby
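For comparison, here is roughly how the same check might look in Python, whose built-in re module has no POSIX bracket expressions. This is only an approximation: [^\W\d_] is used as a stand-in for [[:alpha:]] and \w for [[:alnum:]_], which is close but not identical (for example, \w also accepts some non-decimal numeric characters). The name is_valid_hashtag is made up for the sketch.

```python
import re

# '#', then one letter-like character, then one or more word characters.
# Python 3 str patterns are Unicode-aware by default, so no extra flag is needed.
HASHTAG_RE = re.compile(r"#[^\W\d_]\w+")

def is_valid_hashtag(tag):
    # fullmatch requires the whole string to match, like ^...$ in the Perl/Ruby versions.
    return HASHTAG_RE.fullmatch(tag) is not None

for tag in ["#piedpiper", "#été", "#日本語", "#1abc", "#foo[bar", "#a"]:
    print(tag, is_valid_hashtag(tag))
```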
Twitter allows letters, numbers, and underscores.
I checked this by generating tweets via their API. For example, tweeting
Hash tag test #foo[bar
resulted in "#foo" being marked as a hash tag, and "[bar" being unformatted text.
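If you want to mimic that cut-off behavior locally, a short Python sketch like the one below reproduces the result for this test case. It only approximates what I observed (letters, numbers and underscores continue the tag, anything else ends it) and is not Twitter's own extraction code; extract_hashtags is a made-up name.

```python
import re

# '#' followed by a run of word characters (letters, digits, underscore).
HASHTAG_IN_TEXT = re.compile(r"#\w+")

def extract_hashtags(text):
    # Returns each hashtag up to, but not including, the first disallowed character.
    return HASHTAG_IN_TEXT.findall(text)

print(extract_hashtags("Hash tag test #foo[bar"))
```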
Well, for starters you can't use a # in the hashtag (##hash).
The guidelines below are being quoted from Twitter's help center:
People use the hashtag symbol # before a relevant keyword or phrase (no spaces) in their Tweet to categorize those Tweets and help them show more easily in Twitter Search.
Clicking on a hashtagged word in any message shows you all other Tweets marked with that keyword.
Hashtags can occur anywhere in the Tweet – at the beginning, middle, or end.
Hashtagged words that become very popular are often Trending Topics.
Example: In the Tweet below, #eddie included the hashtag #FF. Users created this as shorthand for "Follow Friday," a weekly tradition where users recommend people that others should follow on Twitter. You'll see this on Fridays.
Using hashtags correctly:
If you Tweet with a hashtag on a public account, anyone who does a search for that hashtag may find your Tweet
Don't #spam #with #hashtags. Don't over-tag a single Tweet. (Best practices recommend using no more than 2 hashtags per Tweet.)
Use hashtags only on Tweets relevant to the topic.
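Just want to add that, in addition to alphanumeric characters and underscores, you can apparently use what looks like an em dash in a Twitter hashtag, as in #COVIDー19. (The character in that example is U+30FC, the Katakana-Hiragana prolonged sound mark, which Unicode classifies as a letter rather than as punctuation.)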
Only letters and numbers are allowed to be part of a hashtag. If a character other than these follows the leading # and a letter or number, the hashtag will be cut off at this point.
I would recommend that your user interface indicate this to the user by changing the text color of the input field if the user enters anything other than a letter or number.
I had the same issue to implement in golang.
It seems that [[:alpha:]] only matches the English (ASCII) alphabet in Go, so this syntax cannot be used for characters from other languages.
Instead, I was able to use \p{L} for this purpose.
My test with \p{L} is here.
* Arabic, Hebrew, Hindi, etc. have not been confirmed yet.
I'm using the Twitter streaming API. It works wonderfully for single words, but seemingly cannot filter by an exact bigram (two word string).
I'm testing this by searching for common words that often occur together:
e.g. "feel good"
This is the URL (requires OAuth login):
https://stream.twitter.com/1.1/statuses/filter.json?track=keywords_go_here
Things that don't work:
track=feel%20good ==> still produces: "text":"Feels so good outside!..."
track=%27feel%20good%27 ==> produces nothing
track=feel%20good, ==> still produces "good that my friend has an ED too because I can feel..."
Any ideas on getting this to work?
edit: someone sort-of answered this in early 2010: Twitter Streaming API - tracking exact multiple keywords in exact order , but are there any updates on this issue?
According to the API documentation, it seems you can do that search: https://dev.twitter.com/docs/using-search
"happy hour" containing the exact phrase "happy hour"
You just need to put your phrase in quotation marks.
I am sorry, but the answer is:
Exact matching of phrases (equivalent to quoted phrases in most search engines) is not supported.
Furthermore,
Punctuation and special characters will be considered part of the term they are adjacent to.
So if you track "feel good", you will get messages such as
He said, "feel it", and I replied, "I am good".
If you want exact matches, then you have two options:
A) track both terms and then discard all tweets that don't have exact matches, or
B) get a paid subscription to the Twitter firehose with Gnip or DataSift. Twitter makes a living out of things like this, so I don't think it's ever gonna be available on the Streaming API.
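Option A can be done with a small client-side filter. Here is a minimal Python sketch, assuming you already receive each tweet's text from the stream; the helper names and the sample strings are only for illustration. It treats "exact" as the words appearing adjacent and in order, case-insensitively, separated only by whitespace.

```python
import re

def phrase_pattern(phrase):
    # Words must be adjacent and in order, with only whitespace between them.
    words = [re.escape(w) for w in phrase.split()]
    return re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)

def contains_exact_phrase(text, phrase):
    return phrase_pattern(phrase).search(text) is not None

samples = [
    "I feel good today",
    "Feels so good outside!",
    'He said, "feel it", and I replied, "I am good".',
]
# Keep only the tweets that really contain the phrase.
matches = [t for t in samples if contains_exact_phrase(t, "feel good")]
print(matches)
```

You would still track "feel good" on the stream to cut down the volume, and then discard whatever this filter rejects.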