Build a graph with Twitter stream and query using Apache Flink

I am listening to a Twitter stream and have succeeded in extracting the data I want from the tweets. Now I want to keep building a graph from the extracted info, like:
(user)--[tweets]-->(tweet)
(tweet)--[mentions]-->(user)
(tweet)--[tagged]-->(hashtag)
While this graph keeps growing over time, I want to run queries over it. How can I do this with Apache Flink?

After some more digging through the forums and JIRA, I found that gelly-streaming matches my needs.
With it, we can create a GraphStream:
GraphStream<Long, NullValue, NullValue> graph = new SimpleEdgeStream<>(getEdgesDataSet(env), env);
Examples: https://github.com/vasia/gelly-streaming/tree/master/src/main/java/org/apache/flink/graph/streaming/example
Here are some other relevant links.
On the Apache Flink mailing list: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Graph-with-stream-of-updates-td5166.html
Vasia Kalavri's talk on Graphs as Streams: https://berlinbuzzwords.de/session/graphs-streams-rethinking-graph-processing-streaming-era
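As a rough illustration of the edge-extraction step described in the question, here is a minimal sketch in plain Java. It does not use Flink or gelly-streaming; the class `TweetEdges`, the record `Edge`, and the `user:` / `tweet:` / `hashtag:` id prefixes are all hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flink or gelly-streaming.
final class TweetEdges {
    // One labeled, directed edge such as (user)--[tweets]-->(tweet).
    record Edge(String source, String label, String target) {}

    // Turns the fields extracted from one tweet into the three edge types
    // listed in the question.
    static List<Edge> edgesOf(String author, String tweetId,
                              List<String> mentions, List<String> hashtags) {
        List<Edge> edges = new ArrayList<>();
        edges.add(new Edge("user:" + author, "tweets", "tweet:" + tweetId));
        for (String user : mentions) {
            edges.add(new Edge("tweet:" + tweetId, "mentions", "user:" + user));
        }
        for (String tag : hashtags) {
            edges.add(new Edge("tweet:" + tweetId, "tagged", "hashtag:" + tag));
        }
        return edges;
    }
}
```

In a streaming job, logic like this would presumably run once per incoming tweet, with the resulting edges fed into the graph stream. The GraphStream in the snippet above is keyed by Long, so string ids like these would first have to be mapped to whatever vertex id type the stream uses.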

Related

Azure-devops rest api - pagination and rate limit

I am trying to pull Azure DevOps entity data (teams, projects, repositories, members, etc.) and process that data locally.
I cannot find any documentation regarding rate limiting and pagination.
Does anyone have any experience with that?
There is some documentation for pagination on the members api:
https://learn.microsoft.com/en-us/rest/api/azure/devops/memberentitlementmanagement/members/get?view=azure-devops-rest-6.0
But that is the only one; I couldn't find any documentation for any of the Git entities,
e.g. repositories:
https://learn.microsoft.com/en-us/rest/api/azure/devops/git/repositories/list?view=azure-devops-rest-6.0
If someone could point me to the right documentation
or shed some light on these subjects, it would be great.
Thanks.
I cannot find any documentation regarding rate limiting and pagination. Does anyone have any experience with that?
There is a document about service limits and rate limits, which describes the limits that all projects and organizations are subject to.
For rate limiting:
Azure DevOps Services, like many Software-as-a-Service solutions, uses
multi-tenancy to reduce costs and to enhance scalability and
performance. This leaves users vulnerable to performance issues and
even outages when other users of their shared resources have spikes in
their consumption. To combat these problems, Azure DevOps Services
limits the resources individuals can consume and the number of
requests they can make to certain commands. When these limits are
exceeded, subsequent requests may be either delayed or blocked.
You can refer to the Rate limits documentation for details.
For pagination: in general, the Azure DevOps (ADO) REST API returns paginated responses, normally limited to 100 or 200 items per page (depending on the API). To retrieve the next page, read the x-ms-continuationtoken response header and pass its value as the continuationToken parameter of the next request.
But Microsoft does not document this very well; this should have been mentioned for every API call that supports continuation tokens:
Builds - List:
GET https://dev.azure.com/{organization}/{project}/_apis/build/builds?definitions={definitions}&continuationToken={continuationToken}&maxBuildsPerDefinition={maxBuildsPerDefinition}&deletedFilter={deletedFilter}&queryOrder={queryOrder}&branchName={branchName}&buildIds={buildIds}&repositoryId={repositoryId}&repositoryType={repositoryType}&api-version=5.1
If I call the above REST API with $top=50, I get 50 results back as expected, plus a response header called "x-ms-continuationtoken". We can then loop, passing that token to each subsequent request, until all results have been retrieved.
You can check this similar thread for more details.
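The continuation-token loop described above could be sketched roughly as follows in Java. This is an illustration under assumptions, not an official client: `ContinuationPager`, `Page`, `fetchAll`, and `httpPage` are hypothetical names, the caller supplies the Authorization header value, and the base URL is assumed to already contain a query string.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch only: these names are illustrative, not an Azure DevOps SDK.
final class ContinuationPager {
    // One page: the raw response body plus the token for the next page (null when done).
    record Page(String body, String nextToken) {}

    // Calls fetchPage repeatedly, feeding each returned token into the next call,
    // and stops when a response carries no continuation token.
    static List<String> fetchAll(Function<String, Page> fetchPage) {
        List<String> bodies = new ArrayList<>();
        String token = null; // the first request has no token
        do {
            Page page = fetchPage.apply(token);
            bodies.add(page.body());
            token = page.nextToken();
        } while (token != null && !token.isEmpty());
        return bodies;
    }

    // Fetches one page over HTTP. baseUrl must already contain a query string;
    // authHeader is whatever Authorization value the caller uses (assumption).
    static Page httpPage(HttpClient client, String baseUrl, String authHeader, String token) {
        String url = token == null
                ? baseUrl
                : baseUrl + "&continuationToken=" + URLEncoder.encode(token, StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Authorization", authHeader)
                .GET()
                .build();
        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            // Header names are matched case-insensitively by HttpHeaders.
            String next = response.headers().firstValue("x-ms-continuationtoken").orElse(null);
            return new Page(response.body(), next);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while fetching a page", e);
        }
    }
}
```

A call would look something like `ContinuationPager.fetchAll(token -> ContinuationPager.httpPage(client, url, auth, token))`. Keeping the loop separate from the HTTP call makes it easy to exercise the paging logic with a fake page source. The sketch does not check HTTP status codes or handle throttled responses.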
I think most of the APIs have $top/$skip query parameters, which you can use for pagination. Let's say the default request returns 200 items in the response. For the next request, skip those 200 by adding $skip=200 to the query string to get the next 200 items. You can keep iterating until the count attribute of the response becomes 0.
For those APIs where you don't have these parameters, you can use the continuation token as mentioned by Leo Liu-MSFT.
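A minimal Java sketch of the $top/$skip loop described above, assuming the endpoint in question actually supports both parameters. `SkipPager`, `fetchAll`, and `pageUrl` are hypothetical names, and the page source is passed in as a function so the loop itself stays independent of any HTTP client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Sketch only: illustrative names, not part of any Azure DevOps library.
final class SkipPager {
    // fetchPage receives (skip, top) and returns the items of that page.
    // The loop ends when a page comes back empty, i.e. its count is 0.
    static <T> List<T> fetchAll(int pageSize, BiFunction<Integer, Integer, List<T>> fetchPage) {
        List<T> all = new ArrayList<>();
        int skip = 0;
        while (true) {
            List<T> page = fetchPage.apply(skip, pageSize);
            if (page.isEmpty()) {
                break;
            }
            all.addAll(page);
            // Advance by what was actually returned, in case the server
            // caps the page below the requested size.
            skip += page.size();
        }
        return all;
    }

    // Builds the URL for one page; baseUrl must already contain a query string.
    static String pageUrl(String baseUrl, int skip, int top) {
        return baseUrl + "&$top=" + top + "&$skip=" + skip;
    }
}
```

Advancing by the number of items actually received is a design choice made here for robustness; advancing by a fixed 200, as in the description above, behaves the same whenever every page is full.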
It looks like you can pass $top and continuationToken to list Azure Git Refs.
The documentation is here:
https://learn.microsoft.com/en-us/rest/api/azure/devops/git/refs/list?view=azure-devops-rest-6.0

Send Docker Entrypoint logs to APP in realtime

I'm looking for ideas on how to send the Docker logs of each run to my application in real time. Please let me know how this can be done.
Let me know if you have done this already or know how it can be achieved. I want to build a feature similar to Netlify or Vercel, where the full build log is shown in the UI in real time. I want something similar for my Node application.
You can achieve this with Vercel and Log Drains.
Log Drains make it easy to collect logs from your deployments and forward them to archival, search, and alerting services by sending them via HTTPS, HTTP, TLS, and TCP once a new log line is created.
At the time of writing, we currently support 3 types of Log Drains:
JSON
NDJSON
Syslog
Along with Log Drains, we are introducing two new open-source integrations with logging services for you to start using them today: LogDNA and Datadog.
Install the integration: https://vercel.com/integrations?category=logging
See the announcement blog post: https://vercel.com/blog/log-drains
Note that Vercel does not allow Docker deployments, but does support Serverless Functions.

Dashboard using output results from Jenkins API

I am new to the Jenkins API. I was just given an assignment at my company: my PL asked me to create a new job in Jenkins that runs all the testing and build-related tasks on my code, and to create a dashboard showing all the figures and graphs. He said that it's feasible. Can anyone please guide me?
Check out the Sectioned View Plugin.
Create a job in Jenkins and add /api after its URL. There you can see the API information for the job you have just created. The API exposes GET endpoints for retrieving the data. The data is available as JSON as well as XML, which you can parse and use as the data source for your dashboard. You can also trigger a new build by using the POST API.
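As a rough sketch of the retrieval step, the following Java code builds the /api/json URL for a job and fetches the raw JSON. The class `JenkinsJobApi` and its methods are hypothetical names for this example, the job URL is a placeholder, and the Authorization header value is assumed to be supplied by the caller.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative helper; not part of any Jenkins client library.
final class JenkinsJobApi {
    // Appends /api/json to a job URL, tolerating a trailing slash.
    static String jsonApiUrl(String jobUrl) {
        String base = jobUrl.endsWith("/")
                ? jobUrl.substring(0, jobUrl.length() - 1)
                : jobUrl;
        return base + "/api/json";
    }

    // Fetches the job description as a raw JSON string for a dashboard to parse.
    static String fetchJobJson(HttpClient client, String jobUrl, String authHeader)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(jsonApiUrl(jobUrl)))
                .header("Authorization", authHeader)
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }
}
```

For example, `JenkinsJobApi.fetchJobJson(HttpClient.newHttpClient(), "http://jenkins.example/job/demo/", auth)` would request `http://jenkins.example/job/demo/api/json`. Parsing the JSON and drawing the graphs is left to whichever JSON library and dashboard tool you choose.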

What are the differences between versions 1 and 2 of STRAVA API and how to get "streams" objects with v2?

I am currently tinkering with STRAVA API (Strava is a site for logging, sharing and comparing GPS tracklogs taken during cycling and running activities).
In order to get the streams (sample logs) of an activity like this:
http://www.strava.com/rides/9999
one can use the Version 1 of the API like this:
http://www.strava.com/api/v1/streams/9999
which returns a json string with time-series of speed, position, heartrate, etc.
My problems are:
Is there a way to get streams using API v2?
Where is the documentation for API v1?
Docs for API v2 are here
I've read somewhere that there are differences between POST and GET methods of the API, and that some data require authentication, but I am still (yet more) confused...
Thanks for any help!
UPDATE:
For anyone arriving here: as of the end of 2013, Strava has (not) released their rather closed V3 API and has shut down their V1 and V2 endpoints.
However, it's still possible to get the JSON streams of a given activity with URLs like these (using activity of Id 9999 as a working example):
http://app.strava.com/stream/9999
http://app.strava.com/activities/9999/streams
Be aware that these APIs are being deprecated. Here is a link to both versions of the API documentation, and a place to sign up for notification about the new API coming in early 2013.
You will find that the REST style is only loosely followed for these versions of the API, so your confusion is understandable. The new API follows the REST style much more rigorously. For V1 and V2, a GET of a resource will usually return the object representing that resource in JSON format. However, there are cases where a POST returns the object rather than creating one. Streams are not returned by the V2 API, only V1. IHTH

Gmail IMAP, Searching for optimal method for X most recently active Gmail threads

I am looking for the optimal way to acquire a list of the top X most recently active Gmail threads.
Background:
I am using Java to access IMAP with OAuth in a Google Apps for Education domain. A Gmail Atom inbox feed is available which can list the last 20 threads containing unread messages. Access to this feed seems to be very fast, much faster than anything I have managed to produce using OAuth/IMAP.
The advantage of the IMAP approach over the Gmail Atom inbox feed is that with IMAP I can access an arbitrary number of messages (not just 20), see read messages, get thread size, get any associated Google labels, fetch quota details and check for flags. Essentially this will give my users a much more Gmail-like experience (I only need a read-only experience for our portal). My problem is that IMAP access is significantly slower than the Atom feed. By comparison, the IMAP method takes around 10 seconds, whilst the Atom feed is usually returned within 2 seconds.
I am aware of, and have been working with, the Gmail IMAP Extensions and the Gmail Advanced Search syntax.
Current Method:
Imagine I want the top 40 threads from my IMAP inbox. Currently I download some arbitrary number of messages, say (40 * 4), fetching only the X-GM-THRID. I iterate through these messages, storing the thread ids as I go (fetching more messages if required), until I either exhaust the list of inbox messages or reach my target number of threads.
I then have a list of Gmail thread ids which I can use to perform an IMAP search (with appropriate FetchProfile.Item values, depending on what message details are required).
I iterate through the search results producing something like (using the wonderful Google Guava/Google Collections Multimap):
Multimap<Long, Message> threadMultiMap = LinkedListMultimap.create();
and this is easily massaged into:
LinkedHashMap<Long, Message[]> threadMap;
Is there a better way than iterating through the INBOX until X distinct message threads have been identified?
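For reference, the scan in the current method can be sketched in plain Java as below. This is only an illustration of the loop, without any JavaMail calls: `RecentThreads` and `firstDistinct` are hypothetical names, and the input is assumed to be the X-GM-THRID values of the inbox messages, ordered most recent first.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative helper; the thread ids would come from fetching X-GM-THRID over IMAP.
final class RecentThreads {
    // Walks the thread ids (most recent message first) and keeps the first
    // `limit` distinct ones, preserving the order in which they were seen.
    static Set<Long> firstDistinct(Iterable<Long> threadIdsNewestFirst, int limit) {
        Set<Long> threads = new LinkedHashSet<>();
        if (limit <= 0) {
            return threads;
        }
        for (Long id : threadIdsNewestFirst) {
            threads.add(id); // duplicates are ignored by the set
            if (threads.size() == limit) {
                break; // target number of threads reached
            }
        }
        return threads;
    }
}
```

Because the input is an `Iterable`, the caller could back it with a lazy iterator that fetches further batches of messages only when the loop asks for more, which matches the "fetching more messages if required" step.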
Not actually an answer but a relevant query.
Mark, as per your api request, I'm posting a comment as an answer (http://code.google.com/p/java-gmail-imap/wiki/DisplayingGmailThreadBasedView)
Does your lib support 3-legged OAuth? I tried looking for XoauthAuthenticator in the source in the repo and could not find it.
Thanks
Hi agallego,
I use java-gmail-imap with XOAUTH. There is nothing explicitly in JavaMail that requires any changes to work with XOAUTH.
If you look at the XOAUTH projects (google-mail-xoauth-tools and google-mail-xoauth-tool-java-two-legged) you will see that you can create a SASL provider that can be used to authenticate against Gmail. See e.g. XoauthAuthenticator.java
I hope this helps,
Mark

Resources