I would like to know how to extract all the Wikipedia links that were added and removed within a time window for a specific Wikipedia article.
So far I know how to extract Wikipedia revisions, from this question: How to get full Wikipedia revision-history list from some article?
And how to do it for a specific time window: API to get Wikipedia revision id by date
For example, here is how I obtain the content of the revisions within a time window for the article Germanwings_Flight_9525:
https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles=Germanwings_Flight_9525&rvstart=20150325180000&rvend=20150323180000&rvprop=ids|timestamp|content
How can I obtain the links that were added and removed?
Thanks
You could retrieve all the revisions, split each one on "[[", and look for the next "|" or "]" character; everything in between is a link target. Collect the links of each revision in a set so you can compare consecutive revisions and recognize which links were added or removed.
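To make that concrete, here is a minimal Python sketch of the idea, reusing the API query from the question. It uses a regex instead of manual splitting, and it assumes the older JSON response format where revision content sits under the "*" key; treat it as a starting point, not Program-grade code.

    # Fetch the revisions in the window, pull out every [[wikilink]],
    # and diff the link sets of consecutive revisions.
    import re
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": "Germanwings_Flight_9525",
        "rvlimit": 500,
        "rvstart": "20150325180000",  # newer bound (revisions are listed newest-first)
        "rvend": "20150323180000",
        "rvprop": "ids|timestamp|content",
    }

    page = next(iter(requests.get(API, params=params).json()["query"]["pages"].values()))
    revisions = page.get("revisions", [])

    # Everything between "[[" and the first "|" or "]" is the link target.
    link_re = re.compile(r"\[\[([^\]|]+)")

    def links(rev):
        # Older API responses put the wikitext under the "*" key.
        return set(link_re.findall(rev.get("*", "")))

    # The API returns newest-first, so walk the list oldest-first.
    ordered = revisions[::-1]
    for older, newer in zip(ordered, ordered[1:]):
        added = links(newer) - links(older)
        removed = links(older) - links(newer)
        print(newer["timestamp"], "added:", added, "removed:", removed)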
I want to cluster sentences based on their context and extract common keywords from sentences with a similar context.
For example:
1. I need to go to home
2. I am eating
3. He will be going home tomorrow
4. He is at restaurant
Sentences 1 and 3 would be similar, with keywords like go and home, and maybe their synonyms like travel and house.
A pre-existing API would be helpful, for example using IBM Watson somehow.
This API actually does exactly what you are asking for (clustering sentences + giving keywords):
http://www.rxnlp.com/api-reference/cluster-sentences-api-reference/
Unfortunately, the algorithms used for the clustering and for generating the keywords are not available.
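Since that algorithm is not public, here is a minimal do-it-yourself sketch in Python. The TF-IDF + KMeans approach is my own assumption for illustration, not what RxNLP actually does, and synonym matching (go/travel) would need something extra such as WordNet or word embeddings.

    # Cluster sentences with TF-IDF vectors + KMeans, then take the
    # highest-weighted terms of each cluster centroid as "keywords".
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    sentences = [
        "I need to go to home",
        "I am eating",
        "He will be going home tomorrow",
        "He is at restaurant",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(sentences)

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    terms = vectorizer.get_feature_names_out()
    for label in range(km.n_clusters):
        members = [s for s, l in zip(sentences, km.labels_) if l == label]
        # The top centroid terms serve as the cluster's keywords.
        top = km.cluster_centers_[label].argsort()[::-1][:3]
        print(members, "->", [terms[i] for i in top])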
Hope this helps.
You can use RapidMiner with the Text Processing Extension.
Insert each sentence in a separate file and put them all in one folder.
Add the operators and arrange a design like the one below.
Click on the Process Documents from Files operator and, in the right sidebar, choose "Edit list" on the "text directories" field. Then choose the folder that contains your files.
Double-click on the Process Documents from Files operator and, in the new window, add the operators as in the design below (just the ones you need).
Then run your process.
I am trying to develop an artificial bot. I found that AIML is something that can be used for achieving such a goal, and I found these points regarding AIML parsing as done by Program O:
1.) All letters in the input are converted to UPPERCASE
2.) All punctuation is stripped out and replaced with spaces
3.) Extra whitespace characters, including tabs, are removed
From there, Program O performs a search in the database, looking for all potential matches to the input, including wildcards. The returned results are then “scored” for relevancy and the “best match” is selected. Program O then processes the AIML from the selected result, and returns the finished product to the user.
I am just wondering how to define the score and find the relevant answer closest to the user's input.
Any help or ideas will be appreciated
#user3589042 (rather cumbersome name, don't you think?)
I'm Dave Morton, lead developer for Program O. I'm sorry I missed this at the time you asked the question. It only came to my attention today.
The way that Program O scores the potential matches pulled from the database is this:
1. Is the response from the aiml_userdefined table? yes=300/no=0
2. Is the category for this bot, or its parent (if it has one)? this=250/parent=0
3. Does the pattern have one or more underscore (_) wildcards? yes=100/no=0
4. Does the current category have a <topic> tag? yes (see below)/no=0
   a. Does the <topic> contain one or more underscore (_) wildcards? yes=80/no=0
   b. Does the <topic> directly match the current topic? yes=50/no=0
   c. Does the <topic> contain a star (*) wildcard? yes=10/no=0
5. Does the current category contain a <that> tag? yes (see below)/no=0
   a. Does the <that> contain one or more underscore (_) wildcards? yes=45/no=0
   b. Does the <that> directly match the current topic? yes=15/no=0
   c. Does the <that> contain a star (*) wildcard? yes=2/no=0
6. Is the <pattern> a direct match to the user's input? yes=10/no=0
7. Does the <pattern> contain one or more star (*) wildcards? yes=1/no=0
8. Does the <pattern> match the default AIML pattern from the config? yes=5/no=0
The script then adds up all passed tests listed above, and also adds a point for each word in the category's <pattern> that also matches a word in the user's input. The AIML category with the highest score is considered to be the "best match". In the event of a tie, the script will select either the "first" highest-scoring category, the "last" one, or one at random, depending on the configuration settings. This selected category is then returned to other functions for parsing of the XML.
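For clarity, here is a rough Python paraphrase of those scoring rules. Program O itself is written in PHP, and the dict keys below are illustrative names, not its real fields.

    # Score one AIML category against the user's input per the rules above.
    def score_category(cat, user_input, current_topic, default_pattern):
        score = 0
        score += 300 if cat.get("from_userdefined_table") else 0
        score += 250 if cat.get("belongs_to_this_bot") else 0  # parent bot scores 0
        score += 100 if "_" in cat["pattern"] else 0
        topic = cat.get("topic")
        if topic:
            score += 80 if "_" in topic else 0
            score += 50 if topic == current_topic else 0
            score += 10 if "*" in topic else 0
        that = cat.get("that")
        if that:
            score += 45 if "_" in that else 0
            score += 15 if that == current_topic else 0
            score += 2 if "*" in that else 0
        score += 10 if cat["pattern"] == user_input else 0
        score += 1 if "*" in cat["pattern"] else 0
        score += 5 if cat["pattern"] == default_pattern else 0
        # One extra point per pattern word that also appears in the input.
        input_words = set(user_input.split())
        score += sum(1 for word in cat["pattern"].split() if word in input_words)
        return score

    # The category with the highest score wins; ties are broken by the
    # configured strategy (first, last, or random).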
I hope this answers your question.
I'm making a list of the videos of my channel, and want to use the search endpoint of the API : https://developers.google.com/youtube/v3/docs/search/list
There is a "q" parameter to send the query. What completely bugs me is that no wildcard is referenced in the documentation, and when using * it doesn't do anything. For example, in order to find any video containing "television" in the title, the full word has to be input! Sending "tel" won't work, nor will sending "televisio".
Did I miss something? Is there a way around this?
Thanks!
YouTube searching works along the same paradigm as Google searching, which is quite a bit different than the character-wildcard keyword approach. It's semantic probabilistic searching, looking for relevance based on the terms you give it, so while the * does represent a wildcard, it represents a whole word. For example, you can search for "a * saved" and it will return to you the videos which score the highest relevance score where any word could be substituted in place of your wildcard.
You can also use other punctuation-based search operators: the + sign, the - sign, quotation marks, etc. Just make sure they're all URL-encoded before you send the query.
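For example, a minimal request against the search endpoint could look like this in Python (API_KEY is a placeholder for your own key; the library handles the URL encoding):

    import requests

    resp = requests.get(
        "https://www.googleapis.com/youtube/v3/search",
        params={
            "part": "snippet",
            "q": '"a * saved"',  # * stands for a whole word inside the quoted phrase
            "type": "video",
            "maxResults": 25,
            # add "channelId": "..." here to restrict results to your own channel
            "key": "API_KEY",
        },
    )
    for item in resp.json().get("items", []):
        print(item["snippet"]["title"])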
I have a file that contains Twitter posts, and I am trying to identify the structure of each post per line (nouns, verbs, and so on) using OpenNLP.
It works perfectly until it reaches a line that contains only a hashtag and a link.
Example:
#birthday www.mybirthday/test/mypi.com
It gives the error com.cybozu.labs.langdetect.LangDetectException: no features in text.
When I write a sentence next to the line, it works. Any idea how to handle this? There are more than a thousand lines almost like the example.
To use the POS tagger, you need to pass tokens (in layman's terms, individual words). The link contains multiple words separated by a slash /, and the link in itself is not associated with any part of speech. See here the list of tags and how they are assigned to words. If you want the tagger to identify your link and give a separate tag to it, say LN, either supply your own training data (here you will learn how to create the training data), or split the link into separate tokens (you can split a link on the slash /, question mark ?, equals sign =, or ampersand &) to get the underlying words and then use the POS tagger to get the parts of speech (similar for the hashtag). For tokenization, you can also use the OpenNLP tokenizer and, for your special case, train it. Go through the documentation; it will help you a lot.
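To illustrate only the splitting idea (the tagging itself would still go through OpenNLP in Java), here is a small Python sketch that breaks such a line into word-like tokens:

    # Split a hashtag/link-only line on the separators named above
    # (slash, question mark, equals sign, ampersand, hash, whitespace).
    import re

    line = "#birthday www.mybirthday/test/mypi.com"

    tokens = [t for t in re.split(r"[/?=&#\s]+", line) if t]
    print(tokens)  # ['birthday', 'www.mybirthday', 'test', 'mypi.com']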
I want to do some mining on tweets. Is there a more specific stop word list for tweets, for example one that removes "lol" and other Twitter smileys?
I guess you should merge an ordinary stop word list, like this one or that, with a specific acronym dictionary, e.g. this slang dictionary, or that, or that, or that (the last one seems to be the easiest to parse; see the comments here for the idea).
I'm not aware of a specific stopwords list, but you could get a list of most frequent single words here:
http://clic.cimec.unitn.it/amac/twitter_ngram/ (download en.1grams.gz)
To detect and then ignore smileys, use: https://github.com/brendano/tweetmotif
You may also find these tools useful:
https://github.com/willf/segment (if you want to segment hashtags)
https://github.com/amacinho/Rovereto-Twitter-Tokenizer (if you don't)
I'm not aware of a Twitter-specific stop word list, but it is common practice to simply remove the n most frequent words from your analyses, where n could be 100, for example. Depending on what you would like to do, smileys may actually provide very relevant information.
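As a small illustration of that practice, here is a Python sketch that drops the n most frequent words together with a hand-made slang list (both word lists here are made up for the example):

    # Build a stop word set from corpus frequency plus a slang list,
    # then filter the tweets with it.
    from collections import Counter

    tweets = [
        "lol that was so funny :)",
        "omg lol I can't even",
        "going home now, so tired",
    ]

    n = 2  # in practice n could be 100, as suggested above
    counts = Counter(w for t in tweets for w in t.lower().split())
    frequent = {w for w, _ in counts.most_common(n)}
    slang = {"lol", "omg", "smh", ":)", ":("}

    stopwords = frequent | slang
    filtered = [[w for w in t.lower().split() if w not in stopwords] for t in tweets]
    print(filtered)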