predictionio not producing any predictions - mahout

I am trying to test out PredictionIO for the first time. I followed the installation instructions for Linux and developed several test engines. After repeatedly getting the following error on my own datasets, I decided to follow the movie 100k tutorial (https://github.com/PredictionIO/PredictionIO-Docs/blob/cbca03b1c2bad949db951a3a798f0080c48b3674/source/tutorials/movie-recommendation.rst). The same error persists even though Hadoop appears to be running correctly (and not in safe mode) and the engine reports that it is running and training is complete. The error that I am getting is:
predictionio.ItemRecNotFoundError: request: GET
/engines/itemrec/movie-rec/topn.json {'pio_n': 10, 'pio_uid': '28',
'pio_appkey':
'UsZmneFir39GXO9hID3wDhDQqYNje4S9Ea3jiQjrpHFzHwMEqCqwJKhtAziveC9D'}
/engines/itemrec/movie-rec/topn.json?pio_n=10&pio_uid=28&pio_appkey=UsZmneFir39GXO9hID3wDhDQqYNje4S9Ea3jiQjrpHFzHwMEqCqwJKhtAziveC9D
status: 404 body: {"message":"Cannot find recommendation for user."}
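For context, the failing lookup boils down to something like this (a sketch based on the tutorial; the app key is a placeholder and the exact SDK signatures may differ between PredictionIO versions):

# Roughly what the tutorial does when asking for recommendations.
import predictionio

client = predictionio.Client("YOUR_APP_KEY", apiurl="http://localhost:8000")
client.identify("28")  # becomes pio_uid in the request above
try:
    topn = client.get_itemrec_topn("movie-rec", 10)  # engine name and pio_n
    print(topn)
except predictionio.ItemRecNotFoundError:
    print("Cannot find recommendation for user.")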
The rest of the tutorial runs as expected, just no predictions ever seem to appear. Can someone please point me in the right direction on how to solve this issue?
Thanks!

Several suggestions:
Check whether there is data in PredictionIO's database. I have seen jobs fail because there were some items in the database but no users and no user-to-item actions. Look into the Mongo database appdata - there should be collections named users, items and u2iActions. These collections are only created when you add the first user, item or u2iAction via the API (a quick way to check is sketched after this list). It's unfortunate that the web interface doesn't make it clear whether the job completed successfully or not.
Check the logs - the PredictionIO logs, and the Hadoop logs if you use Hadoop jobs. See whether the model training jobs completed (by the way, did you invoke "Train prediction model now" via the web interface?).
Verify that there is some data in predictionio_modeldata for your algorithm.
Even if the model is trained OK, there can still be too little data to produce recommendations for a particular user. Try the "Random" algorithm, which gives the simplest recommendations available for everyone, to check whether the system as a whole works.
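For the first suggestion, a quick check of the appdata collections from Python (a minimal sketch, assuming MongoDB on localhost and that the database is called appdata - depending on your configuration it may be prefixed, e.g. predictionio_appdata):

# Count the documents PredictionIO's API has written into MongoDB.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
appdata = client["appdata"]  # adjust if your database name is prefixed

for name in ("users", "items", "u2iActions"):
    # A missing collection or a zero count means that kind of data never arrived via the API.
    if name in appdata.list_collection_names():
        print(name, appdata[name].count_documents({}))
    else:
        print(name, "collection does not exist yet")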

Related

Custom REST end point is not called at all from doccano auto labelling

I have filed an issue in the official doccano repo. Here. However, I am also posting here in the hope of getting some idea of what I am doing wrong.
I have two EC2 instances, both running Ubuntu 20.
In one of them I have set up doccano and uploaded some data.
I annotated a bit of that data and then trained a custom model using Hugging Face.
In the second EC2 instance I have uploaded the trained model and created a FastAPI-based API to serve the results.
I want to set up auto labeling (it is a Sequence Labeling project).
I followed the steps in the official documentation and also took help from here.
Everything went right; at the second step, when testing the API connectivity, doccano could successfully connect and fetch the data.
Once all was done I went to one of the documents and tried the auto labeling. And surprise:
NOTHING HAPPENS.
There are no logs on the model server, which shows that no request ever reached it!
Both doccano and the model server are running via Docker inside the EC2 instances.
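For reference, the serving API on the second instance is essentially a FastAPI endpoint along these lines (a minimal sketch; the endpoint path, payload and response fields here are assumptions - doccano's Custom REST template mapping can be adapted to whatever the real service returns):

# Minimal sketch of a sequence-labeling endpoint for doccano's auto labeling.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class LabelRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(req: LabelRequest):
    # Replace this stub with the real Hugging Face model inference.
    # Each span uses doccano-style fields: a label plus character offsets.
    return [{"label": "PERSON", "start_offset": 0, "end_offset": 4}]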
What am I doing wrong?
Please help.
Warm regards
OK, I found the reason, thanks to a reply in the GitHub issue. Here I am going to quote the reply as it is.
The website is not behaving intuitively from the UX perspective. When you turn on Auto Labeling, just try going to the next example (arrow on the top right of the page) and your Auto Labeling API should be called. Then go back and it will be called on the first example. Also, it's called only on examples that are not marked as "labeled".
So, anyone having difficulty with the same problem will hopefully get some help from here.

Creating a structured Jenkins Failing Test Report

The situation right now:
Every Monday morning I manually check the JUnit results of Jenkins jobs that ran over the weekend; using the Project Health plugin I can filter on the timeboxed runs. I then copy-paste this table into Excel and go over each test case's output log to see what failed, noting down the failure cause. Every weekend gets another tab in Excel. All this makes traceability a nightmare and involves time-consuming manual labor.
What I am looking for (and hoping that already exists to some degree):
A database that stores all failed tests for all jobs I specify. It parses the output log of a failed test case and, based on some regex, applies a 'tag', e.g. 'Audio' if a test regarding audio is failing. Since everything is in a database, I could make or use a frontend that can apply filters at will.
For example, if I want to see all tests regarding audio failing over the weekend (over multiple jobs and multiple runs) I could run a query that returns all entries with the Audio tag.
I'm OK with manually tagging failed tests and their causes, as well as writing my own frontend. Is there a way (the Jenkins API perhaps?) to grab the failed tests (JUnit format and Jenkins plugin) and create such a system myself if it does not exist?
A good question. Unfortunately, it is very difficult in Jenkins to get such "meta statistics" that span several jobs. There is no existing solution for that.
Basically, I see two options for getting what you want:
Post-processing Jenkins-internal data to get the statistics that you need.
Feeding a database on-the-fly with build execution data.
The first option basically means automating the tasks that you do manually right now.
you can use external scripting (Python, Perl, ...) to process Jenkins-internal data (via the REST or CLI APIs, or by directly reading the on-disk data); see the sketch below
or you can run Groovy scripts internally (which will be faster and more powerful)
It's the most direct way to go. However, depending on the statistics that you need and on your requirements regarding data persistence, you may want to go for...
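As an illustration of that first option, a rough Python sketch that pulls failed JUnit cases from a build's testReport REST endpoint and tags them with regexes (the Jenkins URL, job name, credentials and tag patterns are placeholders for your own setup):

import re
import requests

JENKINS = "https://jenkins.example.com"
JOB = "my-weekend-job"
BUILD = "lastCompletedBuild"

# Map a tag to a regex applied to the failure output.
TAG_PATTERNS = {
    "Audio": re.compile(r"audio", re.IGNORECASE),
    "Network": re.compile(r"timeout|connection refused", re.IGNORECASE),
}

report = requests.get(
    f"{JENKINS}/job/{JOB}/{BUILD}/testReport/api/json",
    auth=("user", "api-token"),
).json()

for suite in report.get("suites", []):
    for case in suite.get("cases", []):
        if case["status"] in ("FAILED", "REGRESSION"):
            output = (case.get("errorDetails") or "") + (case.get("errorStackTrace") or "")
            tags = [tag for tag, rx in TAG_PATTERNS.items() if rx.search(output)]
            print(case["className"], case["name"], tags)
            # ...store (job, build, test case, tags, cause) in your own database here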
The second option: more flexible and completely decoupled from Jenkins' internal data storage. You could implement it by
introducing a Groovy post-build step for all your jobs
that script parses the job results and puts the data of interest into a custom, external database
Statistics would then come from querying that database.
Typically, you'd start with the first option. Once requirements grow, you'd slowly migrate to the second one (e.g., by collecting internal data via explicit post-processing scripts, putting that into a database, and then running queries on it). You'll want to cut this migration phase as short as possible, as it eventually requires the effort of implementing both options.
You may want to have a look at couchdb-statistics. It is far from a perfect fit, but at least seems to do partially what you want to achieve.

Dataflow OutOfMemoryError while reading small tables from BigQuery

We have a pipeline reading data from BigQuery and processing historical data for various calendar years. It fails with OutOfMemoryError if the input data is small (~500MB).
On startup it reads from BigQuery at about 10,000 elements/sec; after a short time it slows down to hundreds of elements/sec and then hangs completely.
Observing 'Elements Added' on the next processing step (BQImportAndCompute), the value increases and then decreases again. That looks to me like some already loaded data is dropped and then loaded again.
Stackdriver Logging console contains errors with various stack traces that contain java.lang.OutOfMemoryError, for example:
Error reporting workitem progress update to Dataflow service:
"java.lang.OutOfMemoryError: Java heap space
at com.google.cloud.dataflow.sdk.runners.worker.BigQueryAvroReader$BigQueryAvroFileIterator.getProgress(BigQueryAvroReader.java:145)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$SynchronizedReaderIterator.setProgressFromIteratorConcurrent(ReadOperation.java:397)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$SynchronizedReaderIterator.setProgressFromIterator(ReadOperation.java:389)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$1.run(ReadOperation.java:206)
I would suspect that there is a problem with the topology of the pipeline, but running the same pipeline
locally with DirectPipelineRunner works fine
in the cloud with DataflowPipelineRunner on a larger dataset (5GB, for another year) works fine
I assume the problem is in how Dataflow parallelizes and distributes work in the pipeline. Is there any way to inspect or influence it?
The problem here doesn't seem to be related to the size of the BigQuery table, but likely the number of BigQuery sources being used and the rest of the pipeline.
Instead of reading from multiple BigQuery sources and flattening them, have you tried reading from a query that pulls in all the information? Doing that in a single step should simplify the pipeline and also allow BigQuery to execute better (one query against multiple tables vs. multiple queries against individual tables).
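As a sketch of that idea (the pipeline in the question uses the Java Dataflow SDK; this shows the same shape with the Apache Beam Python SDK, and the project, table and field names are placeholders), one query spanning all years replaces several BigQuery sources plus a Flatten:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# One query over all the yearly tables instead of one source per table.
query = """
SELECT year, field_a, field_b
FROM `my_project.my_dataset.history_*`
WHERE _TABLE_SUFFIX BETWEEN '2010' AND '2015'
"""

with beam.Pipeline(options=PipelineOptions()) as p:
    _ = (
        p
        # In practice you also need a GCS temp_location for the BigQuery export.
        | "ReadAllYears" >> beam.io.ReadFromBigQuery(query=query, use_standard_sql=True)
        | "BQImportAndCompute" >> beam.Map(lambda row: row)  # placeholder for the real computation
    )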
Another possible problem is if there is a high degree of fan-out within or after the BQImportAndCompute operation. Depending on the computation being done there, you may be able to reduce the fan-out using clever CombineFns or WindowFns. If you want help figuring out how to improve that path, please share more details about what is happening after the BQImportAndCompute.
Have you tried debugging with Stackdriver?
https://cloud.google.com/blog/big-data/2016/04/debugging-data-transformations-using-cloud-dataflow-and-stackdriver-debugger

Database to store & process client logs efficiently

So the context is that I have a client application that generates logs and I want to occasionally upload this data to a backend. The backend will function as an analytics server, storing, processing and displaying this data - so as you can imagine there will be some querying involved.
In terms of data collection peak load, I expect to have about 5k clients, each generating about 50-100 lines per day, and I'd like the solution I'm tackling to be able to process that kind of data. If you do the math, that's on the order of 10 million log lines a month.
In terms of data analytics load, it will be fairly low - I expect a couple of us (admins) to run queries to harvest some info once a week or so from all the logs.
My application is currently running RoR + Postgres, though I'm open to using a different DB if it maps better to my needs. Current contenders in my head are MongoDB and Cassandra, but I don't really want to leave Postgres if it can scale to get the job done.
I'd recommend a purpose-built tool like Logstash for this:
http://logstash.net/
Another alternative would be Apache Flume:
http://flume.apache.org/
In my experience, you will need a search engine rather than a database to do troubleshooting and analysis when you have a lot of logs (a search engine will be much faster than a database).
For now, I am using a Logstash + Elasticsearch + Kibana stack as a complete solution for my log system.
Logstash is a tool that can parse the logs and make them more human-readable.
Elasticsearch is a search engine that does the indexing and searching of your logs.
Kibana is a web UI that you can use to communicate with your Elasticsearch.
This is a Kibana demo website you can visit: http://demo.kibana.org/ .
It provides the search interface and analysis tools such as pie charts, tables, etc.
In my project, my application generates over 1.5 million logs per day. This log system can handle all of them.
Enjoy it.
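As a rough illustration of what ends up in Elasticsearch, each parsed log line becomes one JSON document, indexed here with the official Python client (in the stack above Logstash normally does this for you; the index and field names are made up):

from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One document per log line; Kibana then searches and aggregates over these fields.
es.index(
    index="client-logs-2024.01",
    document={
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "client_id": "client-0042",
        "level": "ERROR",
        "message": "upload failed: connection reset",
    },
)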
If you are looking for a database solution that will grow with requests, then I would recommend looking beyond Postgres.
Cassandra is really well-suited for time-series data, though key-value stores are not suited for ad-hoc analytics. One idea could be to store your logs in Cassandra, and then roll them up into a different system later.
For straightforward storing-and-displaying of data, take a look at Graphite, a realtime graphing project.
You can create your own custom graphs with Graphite, and save them as dashboards.

Agresso payment creation via acrbatchinput

We're attempting to generate payments in an Agresso 5.5 system. The mechanism we've been told to use is to write new payment data into table acrbatchinput where it will be picked up and processed by a regular job running in agrbibat.dll. We have code that worked on a previous version of Agresso but following the upgrade our payments get rejected by the agrbibat job. Sometimes it generates useful messages in the log, sometimes it doesn't, and working through failures without good information is becoming a bit of a slog.
Is there some documentation we're missing? In particular, it would be useful to have a full list of the validation rules the job is using so we can implement these ourselves rather than trying to infer them from the log. I can't find any - there's not a lot for acrbatchinput on Google. Does this list or some other documentation exist? Is agrbibat something easily decompilable, e.g. .NET?
Thanks. The test system we have is running against Oracle on Solaris with the Agresso jobs hosted on Windows. We have limited access to the Oracle and Agresso systems because (I think!) the same Oracle server is hosting the live payment system, but I could probably talk finance into giving us agrbibat.dll if that might help. We're unlikely to get enough access to their servers to debug it in place.
It turns out that our problem is partly because the new test system we've been given access to wasn't set up correctly, so we might be able to progress this without extra information - we're waiting on the financial team here for input.
However, we're still interested in acrbatchinput or agrbibat documentation or information. You've missed the bounty I set, but ticks, votes and gratitude are still available.
I know this is an ancient question, but here's my response anyway for anyone else who finds it.
The only documentation is the usual Agresso help files from within the desktop client. Meaningful information is only gleaned through trial and error, however!
The required fields differ depending on whether a given record is a GL, AP/AR or tax transaction. (That much is, at least, explained in the help.)
In addition to using the log file, it's often helpful to look at GL07's report output for errors.
