My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene?
More detailed version:
I want to index a number of tweets in Lucene and keep terms like @user or #hashtag intact. StandardTokenizer does not work because it discards the punctuation (though it does other useful things, like keeping domain names and email addresses and recognizing acronyms). How can I have an analyzer which does everything StandardTokenizer does but does not touch terms like @user and #hashtag?
My current solution is to preprocess the tweet text before feeding it into the analyzer and replace those characters with other alphanumeric strings. For example,
String newText = tweetText.replaceAll("#", "hashtag");
newText = newText.replaceAll("@", "addresstag");
Unfortunately this method breaks legitimate email addresses but I can live with that. Does that approach make sense?
Thanks in advance!
Amaç
The StandardTokenizer and StandardAnalyzer basically pass your tokens through a StandardFilter (which removes all kinds of characters from your standard tokens, like the 's at the ends of words), followed by a LowerCaseFilter (to lowercase your words) and finally by a StopFilter. That last one removes insignificant words like "as", "in", "for", etc.
What you could easily do to get started is implement your own analyzer that performs the same as the StandardAnalyzer but uses a WhitespaceTokenizer as the first item that processes the input stream.
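As a rough starting point, here is a minimal sketch of that idea written against a recent Lucene API (7+); package locations and the Analyzer API differ in older versions, and the class name is just a placeholder:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;

public class TwitterFriendlyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Split on whitespace only, so @user and #hashtag survive as single tokens.
        Tokenizer tokenizer = new WhitespaceTokenizer();
        // Then lowercase and drop English stop words, roughly what StandardAnalyzer does.
        TokenStream stream = new LowerCaseFilter(tokenizer);
        stream = new StopFilter(stream, EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(tokenizer, stream);
    }
}

One caveat: splitting only on whitespace also leaves trailing punctuation attached (e.g. a comma right after a hashtag), so you may want an extra filter to strip that.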
For more details on the inner workings of the analyzers, you can have a look over here
It is cleaner to use a custom tokenizer that handles Twitter usernames natively. I have made one here: https://github.com/wetneb/lucene-twitter
This tokenizer will recognize Twitter usernames and hashtags, and a companion filter can be used to lowercase them (given that they are case-insensitive):
<fieldType name="text_twitter" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="org.opentapioca.analysis.twitter.TwitterTokenizerFactory" />
<filter class="org.opentapioca.analysis.twitter.TwitterLowercaseFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="org.opentapioca.analysis.twitter.TwitterTokenizerFactory" />
<filter class="org.opentapioca.analysis.twitter.TwitterLowercaseFilterFactory" />
</analyzer>
</fieldType>
There's a Twitter-specific tokenizer here: https://github.com/brendano/ark-tweet-nlp/blob/master/src/cmu/arktweetnlp/Twokenize.java
A tutorial on a Twitter-specific tokenizer, which is a modified version of the ark-tweet-nlp API, can be found at http://preciselyconcise.com/apis_and_installations/tweet_pos_tagger.php
This API is capable of identifying emoticons, hashtags, interjections, etc. present in a tweet.
The Twitter API can be told to return all Tweets, Bios, etc. with the "entities" (hashtags, user IDs, URLs, etc.) already parsed out of the content into collections.
https://dev.twitter.com/docs/entities
So aren't you just looking for a way to re-do something that the folks at Twitter have already done for you?
Twitter has open-sourced its text processing library, which implements token handlers for hashtags and the like, such as the HashtagExtractor:
https://github.com/twitter/commons/blob/master/src/java/com/twitter/common/text/extractor/HashtagExtractor.java
It is based on Lucene's TokenStream.
Related
I have a field in an RSS item that includes a URL such as:
https://www.facebook.com/9999249845065110
https://www.yelp.com/biz/bix-berkeley-2?hrid=TaFUhHhVrhEJdCPjaB6RUQ
https://www.google.com/search?q=hello%20Signs%20&%20Graphics&ludocid=1720220414695611454#lrd=0x0:0x17df735a614e9c3e,1
I'm trying to set up a Zap in Zapier using the Formatter tool to essentially extract the root domain without the .com. So:
facebook
yelp
google
I have no clue how to use the Formatter Extract Pattern tool, though; I can't figure out the syntax.
Best case scenario, it can look at any URL and extract the name of the site (e.g. facebook/google/yelp). If that's too complicated, then I could provide a finite list of what terms to look for and have it return the first (and only) one found. So it would check if the URL contained facebook or google or yelp and if so return that name as a value.
Any help would be appreciated. Thanks.
David here, from the Zapier Platform team.
This is totally possible. The input is the text you want to search (the full url) and the pattern is your regular expression.
In your case, you want to find the word between www. and .com. Use the regular expression www\.(\w+)\.com.
That worked for me, and pulled out yelp.
You can see each part of the regex explained here: https://regex101.com/r/KmwMAV/1
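If you ever need the same extraction outside Zapier, here is a minimal Java sketch of that regular expression, using one of the URLs from the question:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DomainExtractor {
    public static void main(String[] args) {
        // Capture the word between "www." and ".com", as in the Formatter step above.
        Pattern pattern = Pattern.compile("www\\.(\\w+)\\.com");
        Matcher matcher = pattern.matcher(
                "https://www.yelp.com/biz/bix-berkeley-2?hrid=TaFUhHhVrhEJdCPjaB6RUQ");
        if (matcher.find()) {
            System.out.println(matcher.group(1)); // prints "yelp"
        }
    }
}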
Let me know if you've got any other questions!
When I was coding my meta tag and trying to figure out what other companies have implemented, I noticed some of them have @@ instead of @. Does this make any difference?
<meta name="twitter:creator" content="@@https://twitter.com/company">
<meta name="twitter:site" content="@@company">
I always implement with only one @ sign.
I was wondering, could this actually have something to do with SEO strategy?
Update: "# is sufficient. ## at your own risk." - Twitter Engineer
After a cursory browsing of the Twitter Card documentation, I see only examples of a single at-sign (@) preceding content. In my experience, no other official convention or pattern exists, or is encouraged by the documentation.
One likely explanation for the redundancy could be confusion in the template. Suppose the following exists in your source:
<meta name="twitter:creator" content="#<?= $username; ?>">
If the $username variable already starts with an at-sign, the resulting output will contain two. Twitter may have no issue with this, depending on how they search the value for usernames. If they look for nothing more than an at-sign followed by a valid username, @@jonathansampson is valid.
Searching GitHub also didn't yield examples of developers explicitly and unequivocally desiring to use @@, but instead a smattering of resources showing the above pattern: an at-sign followed by a variable (which could also contain its own at-sign).
I have some easy-to-read URLs for finding data that belongs to a collection of record IDs, which use a comma as a delimiter.
Example:
http://www.example.com/find:1%2C2%2C3%2C4%2C5
I want to know if I can change the delimiter from a comma to a period. Since periods are not a special character in a URL, they won't have to be encoded.
Example:
http://www.example.com/find:1.2.3.4.5
Are there any browsers (Firefox, Chrome, IE, etc) that will have a problem with that URL?
There are some related questions here on SO, but none that specifically say whether it's a good or bad practice.
To me, that looks like a resource with an odd query string format.
If I understand correctly this would be equal to something like:
http://www.example.com/find?id=1&id=2&id=3&id=4&id=5
Since your filter is acting like a multi-select (IDs instead of search fields), that would be my guess at a standard equivalent.
Browsers should not have any issues with it, as long as the application's route mechanism handles it properly. And as long as you are not building that query-like thing with an HTML form (in which case you would need JS or some rewrites, ew!).
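Purely as an illustration (not any particular framework's routing API; the variable names are made up), pulling the IDs back out of such a path is straightforward:

public class FindRouteDemo {
    public static void main(String[] args) {
        // Hypothetical path as it would arrive at the route handler.
        String path = "/find:1.2.3.4.5";
        String idsPart = path.substring(path.indexOf(':') + 1);
        String[] ids = idsPart.split("\\.");  // ["1", "2", "3", "4", "5"]
        for (String id : ids) {
            System.out.println(id);
        }
    }
}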
May I ask why not use a more standard URL and query string? Perhaps something that includes the element class (/reports/search?name=...), just to make clear what is being queried by find. Just curious; I know sometimes standards don't apply.
For example, this image:
https://pbs.twimg.com/media/BFmDUA5CcAAmcBl.jpg
Then I add a colon symbol to send a query string:
https://pbs.twimg.com/media/BFmDUA5CcAAmcBl.jpg:large
https://pbs.twimg.com/media/BFmDUA5CcAAmcBl.jpg:small
I googled it and found that this is a Twitter image.
What coding language can achieve this?
PHP? Ruby on Rails?
Or is it some htaccess rewrite rule?
Any.
It has nothing to do with programming languages, but with CGI: http://en.wikipedia.org/wiki/Common_Gateway_Interface
The colon is, however, not a valid part of the CGI spec, so the server receiving the request will probably parse it in code.
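As a rough illustration only (an assumption about how such a server might handle it, not Twitter's actual code), stripping a ':large'/':small' suffix off the request path could look like this:

public class MediaSizeDemo {
    public static void main(String[] args) {
        // Hypothetical request path; the part after ':' selects the image size.
        String path = "/media/BFmDUA5CcAAmcBl.jpg:large";
        int colon = path.lastIndexOf(':');
        boolean hasSuffix = colon > path.lastIndexOf('/');
        String file = hasSuffix ? path.substring(0, colon) : path;      // "/media/BFmDUA5CcAAmcBl.jpg"
        String size = hasSuffix ? path.substring(colon + 1) : "medium"; // "large"; the default is an assumption
        System.out.println(file + " -> " + size);
    }
}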
Note, though, that the CGI spec defines '&' as the separator between variable/value pairs, which results in invalid (X)HTML when used unescaped in <a> tags, because a bare '&' does not start a valid entity. To remedy this, at least in PHP, you can change this separator: http://www.php.net/manual/en/ini.core.php#ini.arg-separator.output
I have a database of URLs that I would like to search. Because URLs are not always written the same (may or may not have www), I am looking for the correct way to Index and Query urls.
I've tried a few things, and I think I'm close, but I'm not sure why it doesn't work:
Here is my custom field type:
<fieldType name="customUrlType" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
For example:
http://www.twitter.com/AndersonCooper, when indexed, will have the following words in different positions: http, www, twitter, com, andersoncooper
If I search for simply twitter.com/andersoncooper, I would like this query to match the record that was indexed, which is why I also use the WDF to split the search query,
however, the search query ends up looking like this:
myfield:("twitter com andersoncooper") when really want it to match all records that have all of the following separate words: twitter com andersoncooper
Is there a different query filter or tokenizer I should be using?
If I understand this statement from your question
myfield:("twitter com andersoncooper") when really want it to match all records that have all of the following separate words: twitter com andersoncooper
You are trying to write a query that would match both:
http://www.twitter.com/AndersonCooper
and
http://www.andersoncooper.com/socialmedia/twitter
(both links contain all of the tokens), but not match either
http://www.facebook.com/AndersonCooper
or
http://www.twitter.com/AliceCooper
If that is correct, your existing configuration should work just fine. Assuming that you are using the standard query parser and you are querying via curl or some other URL-based mechanism, you need the query parameter to look like this:
&q=myField:andersoncooper AND myField:twitter AND myField:com
One of the gotchas that may have been tripping you up is that the default query operator (between terms in a query) is "OR", which is why the ANDs must be explicitly specified above. Alternatively, to save some space, you can change the default query operator to "AND" like this:
&q.op=AND&q=myField:(andersoncooper twitter com)
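If you are querying from Java instead of curl, the SolrJ equivalent would be roughly the following (a sketch only; it assumes SolrJ 6+, and the core URL and field name are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class UrlSearchDemo {
    public static void main(String[] args) throws Exception {
        // Core URL is a placeholder; adjust to your setup.
        try (HttpSolrClient client =
                new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            SolrQuery query = new SolrQuery("myField:(andersoncooper twitter com)");
            query.set("q.op", "AND"); // require all terms, as discussed above
            QueryResponse response = client.query(query);
            System.out.println(response.getResults().getNumFound() + " matching documents");
        }
    }
}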
This should be the simplest solution:
<field name="iconUrl" type="string" indexed="true" stored="true" />
But for your requirement you will need to make it multivalued and index it (1) unchanged, (2) without http, and (3) without www,
or make the URL searchable via wildcards at the front (which is slower, I guess).
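A quick sketch of producing those variants before sending them to a multivalued field (variable names are just for illustration):

import java.util.ArrayList;
import java.util.List;

public class UrlVariants {
    public static void main(String[] args) {
        String url = "http://www.twitter.com/AndersonCooper";
        List<String> variants = new ArrayList<>();
        variants.add(url);                                     // 1. unchanged
        String noScheme = url.replaceFirst("^https?://", "");  // 2. without http(s)://
        variants.add(noScheme);
        variants.add(noScheme.replaceFirst("^www\\.", ""));    // 3. without www.
        variants.forEach(System.out::println);
    }
}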
You can try the keyword tokenizer.
From the book Solr 1.4 Enterprise Search Server, published by Packt:
KeywordTokenizerFactory: This doesn't actually do any tokenization or anything at all for that matter! It returns the original text as one term. There are cases where you have a field that always gets one word, but you need to do some basic analysis like lowercasing. However, it is more likely that due to sorting or faceting requirements you will require an indexed field with no more than one term. Certainly a document's identifier field, if supplied and not a number, would use this.
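To see what that means in practice, here is a small Lucene sketch (recent Lucene API assumed) showing that the keyword tokenizer keeps the whole input as a single term:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class KeywordTokenizerDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new KeywordAnalyzer();
        try (TokenStream ts =
                analyzer.tokenStream("url", "http://www.twitter.com/AndersonCooper")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // prints the entire URL as one token
            }
            ts.end();
        }
    }
}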