I am compiling a corpus of Tweets for sentiment analysis and am trying to grab Tweets with Apple Emoji characters.
I have found the Unicode character for one of the faces: U+1F604 (surrogate pair U+D83D U+DE04), UTF-8: F0 9F 98 84.
So far, I haven't been able to get any meaningful results. If I search for \ud83d\ude04 I'll get some Tweets back, but nothing useful, and \U0001f604 doesn't return anything on search.
Is there any way for me to query Twitter for these characters?
I am using the python-twitter wrapper for the API, but would be willing to use something else if a better alternative exists.
As Terence Eden points out, Twitter's REST search API doesn't work with emoji characters, but the streaming API does (as of Jan 2016).
There are a few tools out there for accessing Twitter's APIs in Python. The one I've mostly used is tweepy, which can be installed with pip.
The tweepy docs on setting up the streaming API are quite easy to follow. The strings you filter on need to contain the actual emoji characters (e.g. '😀'); a minimal example is sketched below.
Note that this searches for emoji as "words": that is, surrounded by whitespace. Something like "free😀" won't be found!
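Here is a minimal sketch using tweepy's streaming interface as it existed around that time (tweepy 3.x; the 4.x API is different). The credential strings are placeholders for your own app's keys:

```python
import tweepy

# Listener that just prints the text of every matching tweet.
class EmojiListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.text)

# Placeholder credentials -- substitute your own consumer/access tokens.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

stream = tweepy.Stream(auth=auth, listener=EmojiListener())
# The track strings must contain the literal emoji character.
stream.filter(track=["😀"])
```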
This is possible - but it's slightly tricky....
You can't use the standard Twitter search - but you can use the Streaming Search.
There are open source libraries available at https://github.com/mroth/emojitrack-feeder in Ruby and Node.
A simple search for "how Alexa works" yielded no results, so here it is.
If you go through the documentation for utterances, the need to exhaustively list out all possible variations seems ridiculous. For example, you need to list the following variations separately to support them:
what's my horoscope
what is my horoscope
what my horoscope is
Maybe I didn't interpret the documentation correctly, but I'm curious where exactly the machine learning algorithms come in for identifying intents and skills.
Any pointers to helpful resources will be fine too.
Just pure pattern matching on the transcribed text. We are still in the 21st century...
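A caricature of that in Python, just to illustrate why every phrasing has to be listed (the intent name and lookup table are made up for this example, not Alexa's actual internals):

```python
# Every supported phrasing has to appear explicitly in the table.
UTTERANCE_TO_INTENT = {
    "what's my horoscope": "GetHoroscopeIntent",
    "what is my horoscope": "GetHoroscopeIntent",
    "what my horoscope is": "GetHoroscopeIntent",
}

def resolve_intent(transcribed_text):
    # Exact-match lookup on the normalized transcription.
    return UTTERANCE_TO_INTENT.get(transcribed_text.lower().strip())

print(resolve_intent("What is my horoscope"))   # -> GetHoroscopeIntent
print(resolve_intent("tell me my horoscope"))   # -> None, because it wasn't listed
```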
I read a few papers about machine translation but did not understand them well.
As best I can tell, the language models (in Google Translate) use phonetics and machine learning.
My question, then, is: is it possible to convert an Arabic word that is phonetically spelled in English into the Arabic word the user intended?
For instance, the word 'Hadith' is an English phonetic spelling of the Arabic word 'حديث'. Can I programmatically go from 'Hadith' to the Arabic?
Thanks to the Wiki article, I found there's an entire field of work in the area of transliteration. There was a Google API for this, which was deprecated in 2011 and moved to the Google Input Tools service.
The simplest answer is Buckwalter transliteration, but at first glance a 1:1 mapping doesn't seem like a good enough idea.
I am going to see if there's a way to hack Google Input Tools and call it even at the CLI level, because their online demo works very well.
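To make the 1:1-mapping point concrete, here is a tiny Python sketch using a hand-picked subset of the Buckwalter table. It works on text that is already in Buckwalter notation, but falls apart on an English phonetic spelling like 'Hadith', which is roughly why a plain character map isn't enough:

```python
# Partial reverse-Buckwalter mapping (only a handful of letters, for illustration).
BUCKWALTER_TO_ARABIC = {
    "H": "ح", "d": "د", "y": "ي", "v": "ث",
    "t": "ت", "h": "ه", "A": "ا", "b": "ب",
}

def buckwalter_to_arabic(text):
    """Map each Buckwalter character to its Arabic letter; pass unknowns through."""
    return "".join(BUCKWALTER_TO_ARABIC.get(ch, ch) for ch in text)

print(buckwalter_to_arabic("Hdyv"))    # -> حديث  (the Buckwalter spelling works)
print(buckwalter_to_arabic("Hadith"))  # garbled: 'a'/'i' are short vowels, and 'th'
                                       # is two Latin letters for one Arabic letter
```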
I am trying to create a simple tool that uses the functionality of this website, http://cat.prhlt.upv.es/mer/, which parses strokes into a math formula. They mention that it converts the input to InkML or MathML.
I also noticed that, according to this link: Tradeoff between LaTex, MathML, and XHTMLMathML in an iOS app?, you can use MathJax to convert certain input to MathML.
What I need clarification/assistance with is how I can take input (say, from finger strokes) or a picture, convert it into a format I can submit to this website from an iOS device, and read the result at the top of the page. I have done everything regarding taking a picture or drawing an equation on an iPhone, but I am confused about how to take that and feed it to this site in order to get a result.
Is this possible, and if so how?
I think there's a misunderstanding here. http://cat.prhlt.upv.es/mer/ isn't an API for converting images into formulae—it's just an example demonstration of the Seshat software tool.
If you're looking to convert hand-drawn math expressions into LaTeX or MathML (which can then be pretty printed on your device), you want to compile Seshat and then feed it, not the website, your input. My answer here explains how to format input for Seshat.
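Roughly, the input side looks like this: Seshat consumes stroke data rather than pixels, so on iOS you would capture each finger stroke as a list of (x, y) points and serialize them into something like InkML (which the question already mentions). Below is a minimal Python sketch of that serialization; it only illustrates the general shape of InkML, so check Seshat's documentation for the exact input format it expects:

```python
def strokes_to_inkml(strokes):
    """Serialize strokes (each a list of (x, y) points) into a minimal InkML document."""
    traces = []
    for points in strokes:
        coords = ", ".join(f"{x} {y}" for x, y in points)
        traces.append(f"  <trace>{coords}</trace>")
    return (
        '<ink xmlns="http://www.w3.org/2003/InkML">\n'
        + "\n".join(traces)
        + "\n</ink>"
    )

# Example: two strokes roughly forming a "+" sign.
print(strokes_to_inkml([[(0, 5), (10, 5)], [(5, 0), (5, 10)]]))
```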
I've got a bunch of .DOC documents. I'm not even positive they are Word documents, but even if they are, I need to open and parse them with e.g. Python to extract information from them.
The problem is, I couldn't figure out how they are encoded: UltraEdit's conversion function wouldn't correct the text no matter which encoding I tried, and OpenOffice 3.2 also failed to display the contents correctly (guessing Windows-1252).
Here's an example, hoping that someone knows what codepage it is:
"lÕAssemblŽe gŽnŽrale" instead of "l'Assemblée générale"
Thank you for any tip.
The Greenstone digital library (http://www.greenstone.org/) provides pretty good text extraction from Word documents, including encoding detection.
Running MS Word in server mode gives you a range of scripting options - I'm sure detecting the encoding will be possible.
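As a quick sanity check in Python, you can also try candidate codepages directly. The sample string looks consistent with Mac Roman bytes being displayed as Windows-1252 (older Mac versions of Word used Mac Roman), though that's an assumption to verify against the real files:

```python
# The garbled text as displayed, assumed to be Mac Roman bytes shown as cp1252.
garbled = "lÕAssemblŽe gŽnŽrale"

raw_bytes = garbled.encode("cp1252")   # recover the original byte values
print(raw_bytes.decode("mac_roman"))   # -> l'Assemblée générale (with a curly apostrophe)
```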
As the title says, how SEO-friendly is a URL containing Unicode characters?
Edit: To clarify, I meant a URL with non-ASCII characters that are valid Unicode.
If I were Google or another search engine authority, I wouldn't consider Unicode URLs an advantage. I have been using Unicode URLs for more than two years on my Persian website, but believe me, I only did it because I felt I was forced to. We know Google handles Unicode URLs very well, but I can't see the Unicode words in the URLs when I'm working with them in Google Webmaster Tools. Here is an example:
http://www.learnfast.ir/%D9%88%D8%A8%D9%84%D8%A7%DA%AF-%D8%A2%DA%AF%D9%87%DB%8C
There are only two Farsi words in such a messy and lengthy URL.
I believe other Unicode URL users don't like this either, but they do it only for SEO optimization, not for categorizing their content or directing their users to the right address. Of course Unicode is excellent for crawling the contents, but there should be other ways to index URLs. Moreover, English is our international language, isn't it? It can be used beneficially for URLs. (Sorry for so many words from an amateur webmaster.)
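For what it's worth, the "messy and lengthy" form is just the standard percent-encoding of the URL's UTF-8 bytes; you can round-trip between the two views with a couple of lines of Python:

```python
from urllib.parse import quote, unquote

# The path from the example URL above, as it appears in Webmaster Tools.
encoded = "/%D9%88%D8%A8%D9%84%D8%A7%DA%AF-%D8%A2%DA%AF%D9%87%DB%8C"

print(unquote(encoded))          # -> the two Farsi words, readable again
print(quote(unquote(encoded)))   # -> back to the percent-encoded form
```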
All URLs can be represented as Unicode. Unicode just defines a range of code-points from U+0000 to U+10FFFF, which allows you to define any characters.
If what you mean is "how SEO-friendly are URLs containing characters above U+007F", then they should be as good as anything else, as long as the word is correct. However, they won't be very easy for most users to type if that's a concern, and they may not be supported by all internet browsers/libraries/proxies etc., so I'd tend to steer clear.
FWIW, Amazon (Japan) uses Unicode URLs for their product pages.
http://www.amazon.co.jp/任天堂-193706011-Wiiスポーツ-リゾート-「Wiiモーションプラス」1個同梱/dp/B001DLXXCC/ref=pd_bxgy_vg_img_a
(As you can see, it causes trouble with systems like the Stack Overflow wiki formatter.)
If we consider that URLs containing the searched keywords get higher placement in search results, and you're targeting Unicode search terms, then it may actually help.
But of course this is hardly the most important thing when it comes to position in search results.
I would guess that from an SEO point of view it would be a really bad idea, unless you are specifically looking to target Unicode search terms.