How to use Python ParDo in a Java pipeline? - google-cloud-dataflow

I have a Python code for data analysis and I want to embed it inside a much bigger pipeline which is written in Apache Beam Java SDK.
I know that there is also a Python SDK, but I don't know how to combine the two.
How can it be done?

This is currently not supported. It is a great idea, and something that is being considered. Much of the work on the in-progress portability API has the potential to enable this, but there is no estimate for when it will actually be possible.

Related

Building and extending a Knowledge Graph with entity extraction, with Neo4j as my database

My goal is to build an automated Knowledge Graph, and I have decided to use Neo4j as my database. I intend to load JSON files from my local directory into Neo4j. The data I will be using are the Yelp datasets (the JSON files are quite large).
I have seen some Neo4j examples with GraphAware and OpenNLP. I have read that Neo4j has good support for Java apps, and that it also supports Python (I intend to use NLTK). Is it advisable to use Neo4j with Java (Maven/Gradle) and OpenNLP? Or should I use it with py2neo and NLTK?
I am really sorry that I don't have any prior experience with these tools. Any advice or recommendation will be greatly appreciated. Thank you so much!
Welcome to Stack Overflow! Unfortunately, this question is a suggestion/opinion question so isn't appropriate for this forum.
However, this is an area I have worked in, so I can confidently say that Java (or Kotlin) is the best way to go for Neo4j. It is Neo4j's native language, and there is significantly more community support and library availability for it.
However, NLTK is much more powerful than OpenNLP. So, if your use case is simple enough for OpenNLP, then a purely Java/Kotlin approach is a good fit. Alternatively, you can use Java as an interfacing layer for the stored graph, but use Python with NLTK for the language work feeding into the graph (a minimal sketch of this hybrid approach follows below). This would, of course, increase the complexity of your project considerably.
Ultimately, the best approach depends on your exact use case and which trade-offs make the most sense for you.
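To make the hybrid option concrete, here is a minimal sketch, assuming a local Neo4j instance reachable over Bolt and the Yelp review dump in JSON Lines format; the connection details, credentials, labels and the MENTIONS relationship type are placeholders, not a prescribed schema:

    # Hybrid sketch: NLTK extracts entities, py2neo writes them to Neo4j.
    # URI, credentials, labels and relationship type are placeholders.
    import json

    import nltk
    from py2neo import Graph, Node, Relationship

    # One-time downloads for NLTK's tokenizer, tagger and NE chunker.
    for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
        nltk.download(pkg)

    graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

    def extract_entities(text):
        """Yield (entity text, entity label) pairs from NLTK's NE chunker."""
        tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
        for subtree in tree.subtrees(lambda t: t.label() != "S"):
            yield " ".join(word for word, _tag in subtree.leaves()), subtree.label()

    # The Yelp dumps are JSON Lines: one review object per line.
    with open("yelp_academic_dataset_review.json") as f:
        for line in f:
            review = json.loads(line)
            review_node = Node("Review", id=review["review_id"])
            graph.merge(review_node, "Review", "id")
            for name, label in extract_entities(review["text"]):
                entity = Node("Entity", name=name, type=label)
                graph.merge(entity, "Entity", "name")
                graph.create(Relationship(review_node, "MENTIONS", entity))

This split also keeps the option open to move the graph-facing layer to Java later, since the Python side only touches Neo4j through Bolt.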

Distributed Dask CPP workers

Dask has a very powerful distributed API. As far as I can understand, though, it only supports native Python code and modules.
Does anyone know whether distributed Dask can support C++ workers?
I could not find anything in the docs.
Is there any other approach, apart from adding Python bindings to the C++ code, to use that functionality?
You are correct: if you want to call into C++ code using Dask, you do it by calling from Python, which usually means writing some form of binding layer to make the calling convenient. If there is also a C API, you can use ctypes or cffi (a minimal ctypes sketch follows below).
In theory, the scheduler is agnostic to the language of the client and workers, so long as they agree with each other, but no one has implemented a C++ client/worker. This has been done, at least as a proof of concept, for Julia.
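To make the binding-layer idea concrete, here is a minimal ctypes sketch, assuming a hypothetical shared library libsim.so that exposes a C function double heavy_compute(double); the library name and function are illustrative, not a real API:

    # Sketch: wrap a C-callable function with ctypes and fan it out over Dask.
    # libsim.so and heavy_compute are hypothetical placeholders.
    import ctypes

    from dask.distributed import Client

    def heavy_compute(x):
        # Load the library inside the task so it resolves on every worker
        # (each worker needs the .so on its own filesystem).
        lib = ctypes.CDLL("./libsim.so")
        lib.heavy_compute.argtypes = [ctypes.c_double]
        lib.heavy_compute.restype = ctypes.c_double
        return lib.heavy_compute(x)

    if __name__ == "__main__":
        client = Client()  # or Client("scheduler-host:8786") for a real cluster
        futures = client.map(heavy_compute, range(100))
        print(sum(client.gather(futures)))

Note that ctypes releases the GIL for the duration of the foreign call, so worker threads can overlap the C++ work even within a single process.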

Entering the second knowledge level of the Jenkins scripted pipeline

It is easy to find simple examples for declarative or scripted pipelines. But when you reach the point where you go deep into scripting, you need much more information. When you're not familiar with the world of web development, Java and Groovy, you run out of questions to ask in order to get further. Googling turns up some magic "hudson.model.Hudson..." or .methods and e.g. #NonCPS-operator solutions. Those solutions work, but I'm searching for the bigger context so I can work my way up from the bottom, not down from the top. I'm looking for the knowledge that is obvious to insiders.
I'm looking for links/books/API references or introductions that provide an entrance to the knowledge around the Jenkins scripted pipeline, e.g. like this one =).
I am not looking for answers to the questions below from the Stack Overflow community; that would be too much! I am looking for links to documentation that go deep into the topic. I assume that to an insider it is not obvious which of their knowledge is insider knowledge, so I'm stating some questions here to make clear what I would describe as insider knowledge.
Example questions:
like : "hudson.model.Hudson..." but where do I get those magical dot.separated strings?
Is there a documentation of the Jenkins Api?
How can I find documentation of the classes and methods usable in jenkins like e.g. X.Y.collect?
Is there a way to debug a pipeline?
Is there a faster way in testing code than every time run it in a pipeline?
How does the inner mechanism work?
Is the knowledge more about Groovy, about Jenkins in general, or about Java?
Why does println MyArrayList.getClass() print class java.util.ArrayList, which is a Java class? Does Groovy inherit its types from Java, or does the pipeline inherit its types from Jenkins, which is written in Java?
...
Taking one question at a time:
where do I get those magical dot.separated strings?
Those are inner Java classes of the Jenkins core (or of plugins). For the former, Javadoc is available; the latter have their code on GitHub.
classes and methods usable in jenkins
Almost every Java and Groovy class/method is usable.
debug a pipeline?
You can only replay it, making changes on each run.
testing
You have two approaches: the LesFurets one (JenkinsPipelineUnit) and genuine unit testing.
innards
Wide question, and a wider answer: pipelines are loaded, transformed and run as something close to Groovy code (the #NonCPS annotation alters this behaviour).
Knowledge about Java, Groovy and Jenkins all applies.
Groovy indeed extends Java, hence both languages apply.

Tool for load-testing a website by requests per second

I need a tool similar to ApacheBench but with the ability to specify the requests per second.
This tool needs to be runnable from the command line on Windows (any scripting language, such as Ruby or Python, is fine) and should be able to output results to a file.
Bonus points if it can generate graphs or produce data files that can easily be graphed.
Yes. JMeter is a good tool to use, and it has other useful features as well (non-GUI mode, it is lightweight, open source, etc.).
If licensing is not an issue, you could go for LoadRunner, which is very costly but has more of the features you need (graphs, analysis).
JMeter is what you are looking for!
It has good official documentation - http://jmeter.apache.org/ - and there are a lot of other sources, like this.
JMeter can be run from the command line; this is called non-GUI mode (e.g. jmeter -n -t plan.jmx -l results.jtl).
JMeter supports scripting in JavaScript, Groovy, Java and BeanShell.
And of course it can generate graphs.
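If a scripted approach is acceptable (the question allows Python), pacing requests at a fixed rate is simple enough to sketch; the target URL, rate and duration below are placeholders, and the results land in a CSV that is easy to graph:

    # Minimal fixed requests-per-second load generator; URL, RATE and
    # DURATION are placeholders. Results go to results.csv for graphing.
    import csv
    import threading
    import time
    import urllib.request

    URL = "http://localhost:8080/"  # placeholder target
    RATE = 10                       # requests per second
    DURATION = 30                   # seconds

    results = []
    lock = threading.Lock()

    def fire():
        start = time.time()
        try:
            with urllib.request.urlopen(URL, timeout=10) as resp:
                status = resp.status
        except Exception:
            status = 0  # record failures as status 0
        with lock:
            results.append((start, time.time() - start, status))

    threads = []
    deadline = time.time() + DURATION
    while time.time() < deadline:
        t = threading.Thread(target=fire)
        t.start()
        threads.append(t)
        time.sleep(1.0 / RATE)  # pace new requests to the target rate

    for t in threads:
        t.join()

    with open("results.csv", "w", newline="") as f:
        csv.writer(f).writerows([("start", "latency_s", "status"), *results])

One thread per request only holds up at modest rates, though; for serious load, a dedicated tool like JMeter is still the better choice.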

Python for New Distributed Computing Project?

I need to write a compute-intensive simulation program. I tried writing a multi-threaded version of this program, but it's taking too much time. Now I plan to expand to multiple nodes (probably via Amazon EC2 nodes).
I'm already familiar with Python. Is Python with some parallel module a viable option if I care about speed, or would I be better off with another framework or language such as Erlang?
Can you even write a simulation program in Erlang?
The project is more about dividing up computation than about dividing up a dataset, so I didn't consider frameworks based on MapReduce.
dispy is a framework for distributed computing with Python. It uses asyncoro, a framework for asynchronous, concurrent programming using coroutines, with some features of Erlang (broadly speaking). Disclaimer: I am the author of both of these frameworks. A minimal usage sketch follows below.
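A minimal dispy sketch, assuming dispynode is already running on the worker machines, with an illustrative compute function standing in for the real simulation step:

    # Sketch: distribute a CPU-bound function with dispy. Assumes dispynode
    # is running on the workers; compute() is an illustrative placeholder.
    import dispy

    def compute(n):
        return sum(i * i for i in range(n))  # stand-in for the simulation step

    if __name__ == "__main__":
        cluster = dispy.JobCluster(compute)  # discovers dispynode instances
        jobs = []
        for i in range(20):
            job = cluster.submit(10**6 + i)
            job.id = i  # optional user-assigned identifier
            jobs.append(job)
        for job in jobs:
            result = job()  # blocks until the job finishes, returns its result
            print(job.id, result)
        cluster.print_status()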
If you are already familiar with Python, I would recommend you keep the simulation in Python (and speed up critical parts in C) and use Erlang for managing it. Writing the simulation in Erlang would take you far out of your comfort zone (even though, personally, I would do it). You can probably reuse parts of Erlang projects such as Disco or Riak Core. Start your project with a sub-optimal proof of concept and tune it in iterations: begin with Python, embed it in Erlang (probably Disco), and then move bits around until you are happy with performance and features. You can end up with anything, including a pure Erlang solution or Python embedded in the BEAM using NIFs, or whatever else satisfies your needs.
Does your problem parallelize trivially? Then you may want to take a look at Elastic MapReduce instead of EC2.
