I am consuming tweets from the Twitter streaming API.
I also have a list of target words that I want to take into account.
What I want to do is always store in my Cassandra database the most accurate count of how many times each word was used during the day.
I was thinking of using window functions to consolidate the results every 5 seconds and then writing the consolidated value to the database.
I don't know if this is the best approach.
If it is, I tried to put together a simple example following the documentation, but it doesn't group the words every 5 seconds.
val env = StreamExecutionEnvironment.getExecutionEnvironment

val counts = env
  .fromElements("foo bar test test baz foo", "yes no no yes", "hi hello hi hello")
  .flatMap { _.toLowerCase.split("\\W+").filter(_.nonEmpty) }
  .filter(word => Words.listOfWords.contains(word) || Words.listOfWords2.contains(word))
  .map { (_, 1) }
  .keyBy(0)
  .timeWindow(Time.seconds(5))
  .sum(1)

counts.print()
env.execute("test-code")
Well, currently it will not work as a test, because you are creating the DataStream from elements, which is not a good fit for windowing: the job won't run for anywhere near 5 seconds, so you never get more than one window and all of the messages land in the same window. But if you ran this against the actual Twitter API, it should generally group the items into windows properly.
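To make the windowing idea concrete outside of Flink, here is a tiny pure-Python sketch (illustrative only, not Flink code; names like tracked are made up) of what a 5-second tumbling window consolidates per word:

from collections import Counter

WINDOW_SECONDS = 5
tracked = {"foo", "bar"}  # stand-in for Words.listOfWords / listOfWords2

def window_counts(timestamped_words):
    # timestamped_words: iterable of (epoch_seconds, word) pairs
    windows = {}
    for ts, word in timestamped_words:
        if word in tracked:
            window_start = int(ts // WINDOW_SECONDS) * WINDOW_SECONDS
            windows.setdefault(window_start, Counter())[word] += 1
    return windows  # one Counter per window, ready to upsert into Cassandra

print(window_counts([(0, "foo"), (1, "bar"), (6, "foo")]))
# {0: Counter({'foo': 1, 'bar': 1}), 5: Counter({'foo': 1})}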
I would like to find public users on Twitter that have 0 followers. I was thinking of using https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-users-search, but this doesn't have a way to filter by the number of followers. Are there any simple alternatives? (Otherwise, I might have to resort to a graph/search-based approach starting from a random point.)
Well, you didn't specify which library you are using to interact with the Twitter API, but regardless of the technology, the underlying concept is the same. I will use the tweepy library in Python for my example.
Start by getting the public users via the user search endpoint. The return type is a list of user objects. The user object has several attributes, which you can learn about in the documentation. For now we are interested in the followers_count attribute. Simply loop through the objects returned and check where the value of this attribute is 0.
Here's how the implementation would look in Python using the tweepy library:
search_query = 'your search query here'
users = api.search_users(q=search_query)  # returns a list of user objects
for user in users:
    if user.followers_count == 0:
        print(user.screen_name)  # do stuff if the user has 0 followers
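If you need more than the first page of results, tweepy's Cursor helper can page through search_users for you (a sketch, assuming api is an authenticated tweepy.API instance as above):

from tweepy import Cursor

# Walk through up to 100 matching user objects, page by page
for user in Cursor(api.search_users, q=search_query).items(100):
    if user.followers_count == 0:
        print(user.screen_name)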
Bird SQL by Perplexity AI allows you to do this simply: https://www.perplexity.ai/sql
Query: Users with 0 followers, and 0 following, with at least 5 tweets
SELECT user_url, full_name, followers_count, following_count, tweet_count
FROM users
WHERE (followers_count = 0)
AND (following_count = 0)
AND (tweet_count >= 5)
ORDER BY tweet_count DESC
LIMIT 10
I get a daily email that lists upcoming appointments and their length. The number of appointments varies from day to day.
The emails go like this:
================
Today's Schedule
9:30 AM
3h
Brazilian Blowout
[Client #1 name]
12:30 PM
1h
Women's Cut
[Client 2 name]
6:00 PM
45m
Men's Cut
[Client #3 name]
Projected Revenue
===================
I want to create an event in a Google Calendar for each appointment, and it seems like Zapier MIGHT be able to do this, but all the help resources I can find are very general in nature.
Is this doable on Zapier? If so, any nudges in the right direction would be awesome.
Any thoughts greatly appreciated.
I had some time to kill and enjoy the odd challenge, so I have put together a solution that should do what you are looking for. I will break it down by steps.
TEMPLATE
Zapier Trigger - Step 1
Type: Trigger
Module: Gmail
Criteria: User Dependent
Comments: For the trigger you will want to use a Gmail-specific trigger, something to the effect of "execute on emails titled 'xyz'", or "emails labeled 'xyz'" if you set up a filter in your inbox.
Zapier Action - Step 2
Type: Action
Module: Code (Python 3)
Comments: The Code module offered by Zapier executes whatever (properly written) code you place in its container. It is especially handy because it lets you incorporate data from previous steps through a dictionary variable named input_data. Zapier offers the Code module in two languages: JavaScript and Python. As I am most familiar with Python, my solution for this step is written in Python; I will append the code to the end of this answer. Using the data held in the body of the email (retrieved in step 1), we can run some string manipulations and datetime conversions to break the email into its component parts and pass those on to the following action step: Create Event.
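For example, if you map the email body from step 1 to a field named email_body in the Code step's setup, it arrives in the predefined input_data dictionary (the key is whatever name you chose):

# Inside the Zapier Code (Python) step
email_body = input_data.get("email_body", "")  # raw text of the email from step 1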
Zapier Action - Step 3
Type: Action
Module: Google Calendar - Create Event
Comments: Using the data output by the previous Code step, we can fill out the required fields for creating a new appointment.
PYTHON CODE
from datetime import timedelta, date, datetime

'''
Goal: Extract individual appointment details from a variable-length email.
Steps:
  1. Remove all extraneous and newline characters.
  2. Isolate each individual appointment and group its relevant details.
  3. Derive appointment start and end times from the appointment time and duration.
  4. Return all appointments in a list.
'''

def format_appt_times(appt_dict):
    appt_start_str = appt_dict.get("appt_start")
    appt_dur_str = appt_dict.get("appt_length")

    # isolate hour and minutes from the appointment start time
    appt_s_hour = int(appt_start_str[:appt_start_str.find(":")])
    if "pm" in appt_start_str.lower() and appt_s_hour != 12:
        appt_s_hour += 12
    elif "am" in appt_start_str.lower() and appt_s_hour == 12:
        appt_s_hour = 0  # 12 AM is midnight
    appt_s_min = int(appt_start_str[appt_start_str.find(":") + 1 :
                                    appt_start_str.find(":") + 3])

    # isolate hours and minutes from the duration (e.g. "3h", "45m", "1h30m")
    appt_d_hour = 0
    appt_d_min = 0
    if "h" in appt_dur_str:
        appt_d_hour = int(appt_dur_str[:appt_dur_str.find("h")])
    if "m" in appt_dur_str:
        # minutes start right after the 'h' (or at index 0 if there is no 'h')
        appt_d_min = int(appt_dur_str[appt_dur_str.find("h") + 1 : appt_dur_str.find("m")])

    # NOTE: adjust the timedelta hours depending on your relation to UTC
    # create datetime objects for the appointment start and end times
    time_zone = timedelta(hours=0)
    tdy = date.today() - time_zone
    duration = timedelta(hours=appt_d_hour, minutes=appt_d_min)
    appt_start_dto = datetime(year=tdy.year,
                              month=tdy.month,
                              day=tdy.day,
                              hour=appt_s_hour,
                              minute=appt_s_min)
    appt_end_dto = appt_start_dto + duration

    # return properly formatted datetimes as strings for use in the next step
    return (appt_start_dto.strftime("%Y-%m-%dT%H:%M"),
            appt_end_dto.strftime("%Y-%m-%dT%H:%M"))

def partition_list(target, part_size):
    # yield successive part_size-length chunks of target
    for data in range(0, len(target), part_size):
        yield target[data : data + part_size]

def main():
    # Remove all extraneous and newline characters.
    email_body = input_data.get("email_body")  # provided by the Zapier Code step
    head,delin,*email_body,delin,foot = [text for text in email_body.splitlines() if text != ""]

    # Isolate each individual appointment and group its relevant details.
    appointment_list = []
    for text in partition_list(email_body, 4):
        template = {
            "appt_start": text[0],
            "appt_end": None,
            "appt_length": text[1],
            "appt_title": text[2],
            "appt_client": text[3]
        }
        appointment_list.append(template)

    for appt in appointment_list:
        appt["appt_start"], appt["appt_end"] = format_appt_times(appt)
    return appointment_list

# Zapier's Code step runs this snippet inside a wrapper, so a top-level
# return is valid here (it would not be in a plain Python script).
return main()
I am not sure of your familiarity with Python, or programming more generally, but the comments in the code explain what each section is doing. If you have any specific questions regarding aspects of the code, let me know. Assuming your email template does not change, this setup should work exactly as needed. Let me know if anything is unclear.
UPDATE
I thought it best to address your question in the original answer, should anyone else have similar questions.
Regarding "how this code is removing the extra characters":
There is actually a fair bit going on in the first line, so I will do my best to break it down, and provide resources where necessary.
The code in question:
head,delin,*email_body,delin,foot = [text for text in email_body.splitlines() if text != ""]
The first step was to break the text into manageable chunks. I did so with email_body.splitlines(), which breaks the string into a list at each newline character found (if you want to split on your own delimiter, use str.split() instead).
If we were to inspect the list at this moment its contents would be something of the following:
["================", "", "Today's Schedule", "", "9:30 AM", "", "3h", ..., "[Client #3 name]", "", "Projected Revenue", "", "==================="]
You will notice there is a fair amount of information in there that we really don't want.
First let's look at the "" elements. These are left over from the blank lines between each line of text, which, even though they are blank, still have newline characters at the end of them. There are a number of ways you could address this within Python. We could simply write a for-loop to go through and copy all elements that are not "" to a new list (see the comparison after the next code line).
To me this felt like additional work, and besides, Python offers list comprehension for just such a scenario. I won't go too deep into list comprehension as there is a lot that can be said about it, and in more insightful ways than I could muster, but it essentially allows you to apply logic against a set of 'data' to form a list. In this case, I specifically wanted to filter out the "" elements returned from the call to splitlines().
And so you will see I address this with the following line
[text for text in email_body.splitlines() if text != ""]
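For comparison, the for-loop version of the same filter mentioned above would be:

cleaned = []
for text in email_body.splitlines():
    if text != "":  # keep only non-blank lines
        cleaned.append(text)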
With that we have a list as above, less the "" elements. Now we must turn our attention to the more 'dynamic' garbage strings. Again, there are a number of ways to do this. A not particularly flexible option would be to simply store the strings we want to remove in variables, something to the effect of:
garb_1 = "==================="
garb_2 = "Projected Revenue"
garb_3 = ...
and once again filter the list with yet another for-loop. I instead chose to leverage Python's list-unpacking idiom, which allows us to 'unpack' list objects (and tuples) into variables. As an example:
one, two, three = ["a", "b", "c"]
I'm sure you can guess what is happening above: as long as we provide the same number of variables as there are elements in the list, we can 'unpack' it in this fashion. But wait! In our case we don't know how long the list is going to be, as it is entirely dependent on the number of appointments you have on any given day. This is where star unpacking comes in to elevate the functionality. Using my code as the example:
head,delin,*email_body,delin,foot = [text for text in email_body.splitlines() if text != ""]
The *, in plain English, is saying "I don't know how many elements to expect, just give me all of them in a list". As we know that there will always be two lines of garbage at the beginning and end of the email, we can assign them to throwaway variables and capture everything in between using our variable-length *email_body container.
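To see the star target in isolation with a list of arbitrary length:

first, *middle, last = [1, 2, 3, 4, 5]
# first == 1, middle == [2, 3, 4], last == 5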
With all of this complete, we now have a list with only the data we are looking to capture. If, as you say, there are additional lines of garbage before or after the email_body, you can simply add additional throwaway variables to account for them.
Once again feel free to ask any follow up questions.
Michael
Resources
List Comprehension
Star Unpacking
I am creating paging and fetching 10 JSON records per page:
var coursemodel = query.Skip(skip).Take(take).ToList();
I need to display on the web page the total number of records available in the database. For example: you are viewing 20 to 30 of x (where x is the total number of records). Can x be found without transferring the records over the network?
Sorted it. I did this:
http://yourserver7:40479/odata/Courses?$top=1&$skip=1&$inlinecount=allpages
Got this
{
  "odata.metadata": "http://yourserver7:40479/odata/$metadata#Courses",
  "odata.count": "503",
  "value": [
    {
      "CourseID": 20, "Name": "Name 20", "Description": "Description 20", "Guid": "Guid 20"
    }
  ]
}
I then got the value from odata.count! My URL gets all records found; add $filter where applicable...
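If it helps, here is a minimal sketch of grabbing just the count programmatically (Python with the requests library; the server URL is the placeholder from above, and $top=1 keeps the payload tiny while $inlinecount=allpages still reports the total):

import requests

url = "http://yourserver7:40479/odata/Courses"
params = {"$top": 1, "$inlinecount": "allpages", "$format": "json"}

resp = requests.get(url, params=params)
total = int(resp.json()["odata.count"])  # "503" in the sample above
print(total)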
You can use the $count operator to return the total number of records, something along these lines:
http://services.odata.org/OData/OData.svc/Categories(1)/Products/$count
Not sure what the syntax would be with LINQ, but pretty sure it is possible.
PS - Always a good reference: http://www.odata.org/documentation/odata-version-2-0/uri-conventions/
Just try this:
context.EFCars.Where(c => c.Description == desc).Count();
Where EFCars is an entity set name.
Please be patient and read my current scenario. My question is below.
My application takes in speech input and is successfully able to group words that match together to form either one word or a group of words (called phrases), be it a name, an action, a pet, or a time frame.
I have a master list of the allowed phrases, stored in their respective arrays: validNamesArray, validActionsArray, validPetsArray, and validTimeFramesArray.
A new array of phrases is returned each and every time the user stops speaking.
NSArray *phrasesBeingFedIn = @[@"CHARLIE", @"EAT", @"AT TEN O CLOCK",
                               @"CAT",
                               @"DOG", @"URINATE",
                               @"CHILDREN", @"ITS TIME TO", @"PLAY"];
Knowing that it's OK to have the following combinations to create a command:
COMMAND 1: NAME + ACTION + TIME FRAME
COMMAND 2: PET + ACTION
COMMAND n: n + n, .. + n
//In the example above, only the phrase groups 'Charlie eat at ten o clock' and 'dog urinate'
//would be valid commands; the phrase 'cat' would not qualify for any of the commands
//and will therefore be ignored
Question
What is the best way for me to parse through the phrases being fed in and determine which combination phrases will satisfy my list of commands?
POSSIBLE solution I've come up with
One way is to step through the array with if and else statements that look at the phrases ahead and check whether they satisfy any valid command pattern from the list. However, this solution is not dynamic: I would have to add a new set of if and else statements for every single new command permutation I create.
My solution is not efficient. Any ideas on how I could go about creating something that works and stays dynamic, no matter what new command sequences of phrase combinations I add?
I think what I would do is make an array for each category of speech (pet, command, etc.). Those arrays would obviously have strings as elements. You could then test each word against each simple array using
[simpleWordListOfPets containsObject:word]
which returns a BOOL result. You could do that in a case statement. The logic after that is up to you, but I would keep scanning the sentence using NSScanner until you have finished evaluating each section.
I've used some similar concepts to analyze a paragraph... it starts off like this:
while ([scanner scanUpToString:@"," intoString:&word]) {
    processedWordCount++;
    NSLog(@"%i total words processed", processedWordCount);
    // Does word exist in the simple list?
    if ([simpleWordList containsObject:word]) {
        //NSLog(@"Word already exists: %@", word);
You would continue it with whatever logic you want (and you would search for a space rather than a ",").
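Since you mentioned wanting something dynamic, here is a small sketch of the same idea in data-driven form (written in Python purely for illustration; the category contents and names are made up): the command patterns are plain data, so a new command is a new list entry rather than another if/else branch.

CATEGORIES = {
    "NAME": {"CHARLIE", "CHILDREN"},
    "PET": {"CAT", "DOG"},
    "ACTION": {"EAT", "URINATE", "PLAY"},
    "TIME": {"AT TEN O CLOCK"},
}

COMMANDS = [
    ["NAME", "ACTION", "TIME"],  # COMMAND 1: NAME + ACTION + TIME FRAME
    ["PET", "ACTION"],           # COMMAND 2: PET + ACTION
]

def category_of(phrase):
    # return the category a phrase belongs to, or None if it is unknown
    return next((c for c, words in CATEGORIES.items() if phrase in words), None)

def match_commands(phrases):
    cats = [category_of(p) for p in phrases]
    matches, i = [], 0
    while i < len(cats):
        for pattern in COMMANDS:
            if cats[i:i + len(pattern)] == pattern:
                matches.append(phrases[i:i + len(pattern)])
                i += len(pattern)
                break
        else:
            i += 1  # phrase starts no command (like 'CAT' alone); skip it
    return matches

print(match_commands(["CHARLIE", "EAT", "AT TEN O CLOCK", "CAT", "DOG", "URINATE"]))
# [['CHARLIE', 'EAT', 'AT TEN O CLOCK'], ['DOG', 'URINATE']]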
I'm using Riak to store JSON documents right now, and I want to sort them based on some attribute. Let's say there's a key, i.e.
{
  "someAttribute": "whatever",
  "order": 1
}
so I want to sort the documents based on the "order".
I am currently retrieving the documents in Riak with the Erlang interface. I can retrieve a document back as a string, but I don't really know what to do after that. I'm thinking the map function just reduces the JSON document itself, and in the reduce function, I'd check whether the item I'm looking at has a higher "order" than the head of the rest of the list, and if so append it to the beginning and then return a lists:reverse.
Despite my ideas above, I've had zero results after almost an entire day; I'm so confused by the Erlang interface in Riak. Can someone provide insight on how to write this map/reduce function, or just how to parse the JSON document?
As far as I know, you do not have access to the whole input list in the map phase. You emit from Map a document as a one-element list.
Inputs (all the docs to handle, as {Bucket, Key}) -> Map (handles a single doc) -> Reduce (handles the whole list emitted from Map).
Maps are executed per document on many nodes, whereas Reduce is run once on the so-called coordinator node (the one where the query was called).
Solution:
Define Inputs (as a list or a bucket)
Retrieve the Value in Map and emit the whole doc or {Id, Val_to_sort_by}
Sort in Reduce (using a regular lists:keysort); a sketch of the two phases follows below
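To illustrate the shape of the two phases (sketched in Python purely for illustration; in Riak itself these would be your Erlang or JavaScript functions):

import json

def map_phase(doc_value):
    # runs once per document; emits a one-element list of {sort_key, doc}
    data = json.loads(doc_value)
    return [(data["order"], data)]

def reduce_phase(emitted):
    # runs once on the coordinator over everything the maps emitted;
    # the moral equivalent of lists:keysort(1, Emitted) in Erlang
    return sorted(emitted, key=lambda kv: kv[0])

docs = ['{"firstName": "Billie", "order": 2}', '{"firstName": "John", "order": 1}']
emitted = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(emitted))  # John (order 1) first, then Billie (order 2)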
This is not a map reduce solution but you should check out Riak Search.
so i "solved" the problem using javascript, still can't do it using erlang.
here is my query
{"inputs":"test",
"query":[{"map":{"language":"javascript",
"source":"function(value, keyData, arg){ var data = Riak.mapValuesJson(value)[0]; var obj = {}; obj[data.order] = data; return [ obj ];}"}},
{"reduce":{"language":"javascript",
"source":"function(values, arg){ return [ values.reduce(function(acc, item){ for(var order in item){ acc[order] = item[order]; } return acc; }) ];}",
"keep":true}}
]
}
So in the map phase, all I do is create a new object, obj, with the order as the key and the data itself as the value. Visually, obj looks like this:
{"1": {"firstName": "John", "order": 1}}
In the reduce phase, I'm just merging everything into the accumulator, so basically that's the sort if you think about it, because when you're done, everything will be put in order for you. I put in 2 JSON documents for testing; one is above, the other is just firstName: Billie, order 2. And here is my result for the query above:
[{"1":{"firstName":"John","order":1},"2":{"firstName":"Billie","order":2}}]
So it works! But I still need to do this in Erlang. Any insights?