Scrapy - parse_item not being called

I'm having two main problems:
1) The parse_item method is not being called/executed after crawling a page.
2) When callback='self.parse_item' is included in the rules, Scrapy does not continue to follow the links. Instead, it only follows the links immediately available from the start_urls.
Here is the code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from sheprime.items import SheprimeItem

class HerroomSpider(CrawlSpider):
    name = "herroom"
    allowed_domains = ["herroom.com"]
    start_urls = [
        "http://www.herroom.com/simone-perele-12p314-trocadero-sheer-seamless-racerback-bra.shtml",
        "http://www.herroom.com/hosiery.aspx"
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=(r'/[A-Za-z0-9\-]+\.shtml', )), callback='self.parse_item')
    ]

    def parse_item(self, response):
        print "some message"
        # I have put in this simple parse function, because I just want to get it to work
Thanks for your help,
L

Your code:
Rule(SgmlLinkExtractor(allow=(r'/[A-Za-z0-9\-]+\.shtml', )), callback='self.parse_item')
It should be:
Rule(SgmlLinkExtractor(allow=(r'/[A-Za-z0-9\-]+\.shtml', )), callback='parse_item')
This works for me:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class HerroomSpider(CrawlSpider):
    name = "herroom"
    allowed_domains = ["herroom.com"]
    start_urls = [
        "http://www.herroom.com/simone-perele-12p314-trocadero-sheer-seamless-racerback-bra.shtml",
        "http://www.herroom.com/hosiery.aspx"
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=(r'/[A-Za-z0-9\-]+\.shtml', )), callback='parse_item')
    ]

    def parse_item(self, response):
        print "some message"
Results:
vic@wic:~/projects/test$ scrapy crawl herroom
2012-07-09 08:08:51+0400 [scrapy] INFO: Scrapy 0.15.1 started (bot: domains_scraper)
2012-07-09 08:08:51+0400 [scrapy] DEBUG: Enabled extensions: LogStats, CloseSpider, CoreStats, SpiderState
2012-07-09 08:08:51+0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-09 08:08:51+0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-09 08:08:51+0400 [scrapy] DEBUG: Enabled item pipelines: Pipeline
2012-07-09 08:08:51+0400 [herroom] INFO: Spider opened
2012-07-09 08:08:51+0400 [herroom] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-09 08:08:52+0400 [herroom] DEBUG: Crawled (200) <GET http://www.herroom.com/simone-perele-12p314-trocadero-sheer-seamless-racerback-bra.shtml> (referer: None)
2012-07-09 08:08:54+0400 [herroom] DEBUG: Crawled (200) <GET http://www.herroom.com/hosiery.aspx> (referer: None)
2012-07-09 08:08:55+0400 [herroom] DEBUG: Crawled (200) <GET http://www.herroom.com/simone-perele.shtml> (referer: http://www.herroom.com/simone-perele-12p314-trocadero-sheer-seamless-racerback-bra.shtml)
some message
2012-07-09 08:08:56+0400 [herroom] DEBUG: Crawled (200) <GET http://www.herroom.com/simone-perele-12p300-trocadero-strapless-bra.shtml> (referer: http://www.herroom.com/simone-perele-12p314-trocadero-sheer-seamless-racerback-bra.shtml)
some message
2012-07-09 08:08:57+0400 [herroom] DEBUG: Crawled (200) <GET http://www.herroom.com/simone-perele-12p342-trocadero-push-up-bra-with-racerback.shtml> (referer: http://www.herroom.com/simone-perele-12p314-trocadero-sheer-seamless-racerback-bra.shtml)
some message
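As a side note, the scrapy.contrib and SgmlLinkExtractor imports used above belong to the old Scrapy 0.x API and have since been removed; a minimal sketch of the same spider against the modern API (assuming Scrapy 1.x or later on Python 3) would look like this:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HerroomSpider(CrawlSpider):
    name = "herroom"
    allowed_domains = ["herroom.com"]
    start_urls = [
        "http://www.herroom.com/simone-perele-12p314-trocadero-sheer-seamless-racerback-bra.shtml",
        "http://www.herroom.com/hosiery.aspx",
    ]
    # The callback is always given as the plain method name, never 'self.parse_item'.
    rules = [
        Rule(LinkExtractor(allow=(r'/[A-Za-z0-9\-]+\.shtml',)), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        # Python 3 print function instead of the print statement used above.
        print("some message")

Setting follow=True keeps the spider crawling links found on the matched pages as well (with a callback set, follow defaults to False), which should also address the second problem in the question.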

Related

iOS error Failed to load resource: unsupported URL

I'm using the chrisben imgcache.js plugin to cache images. Since the change to WKWebView this plugin has stopped working for me. I've switched to v2.1.1, which is working fine on Android. I'm using Cordova with the following plugins:
cordova-plugin-wkwebview-engine: 1.2.1
cordova-plugin-wkwebview-file-xhr: 2.1.4
The following console output indicates the file is downloaded/stored correctly, but fetching it fails.
> [Log] INFO: Download complete: file:///Users/shadow4768/Library/Developer/CoreSimulator/Devices/3965C47C-7718-48C3-82ED-DF9A2CCB3989/data/Containers/Data/Application/3BFC0F90-F7D4-4DFA-8648-0F440929F835/Library/NoCloud/imgcache/5b1950b1ee383f3fdd0e51bf84dfdbd505006d79 (cordova.js, line 1540)
> [Log] INFO: Cached file size: 37161 (cordova.js, line 1540)
> [Log] INFO: current size: 2533404 (cordova.js, line 1540)
> [Log] INFO: com.apple.MobileBackup metadata set (cordova.js, line 1540)
> [Log] INFO: File getdocument?documentid=41623&width=300 loaded from cache (cordova.js, line 1540)
> [Error] Failed to load resource: unsupported URL cdvfile://localhost/library-nosync/imgcache/91c59e590d88a60c252d8281aa165be35a7d5798
The only solutions I have found are related to Ionic.
I initially thought the code below was a fix, only to realise it might not be compatible with what I'm using, as only some of the functions were working.
ImgCache.getCachedFileURL(src,
    (originalUrl, cacheUrl) => {
        const file = new File();
        const cacheFileUrl = cacheUrl.replace('cdvfile://localhost/persistent/', file.documentsDirectory);
        const localServerFileUrl = cacheFileUrl.replace('file://', 'http://localhost:8080');
        // localServerFileUrl contains the loadable url
        resolve(localServerFileUrl);
    },
    (e) => {
        console.error('img-cache-error:', e);
        reject(e);
    });
Any ideas on how I could get around this issue would be greatly appreciated.

How to parse JSON into labels and timestamp with Promtail

I have a problem parsing a JSON log with Promtail; can somebody please help me? I have tried many configurations, but the timestamp and other labels are not parsed.
log entry:
{timestamp=2019-10-25T15:25:41.041-03, level=WARN, thread=http-nio-0.0.0.0-8080-exec-2, mdc={handler=MediaController, ctxCli=127.0.0.1, ctxId=FdD3FVqBAb0}, logger=br.com.brainyit.cdn.vbox.
controller.MediaController, message=[http://localhost:8080/media/sdf],c[500],t[4],l[null], context=default}
promtail-config.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: vbox-main
    static_configs:
      - targets:
          - localhost
        labels:
          job: vbox
          appender: main
          __path__: /var/log/vbox/main.log
    pipeline_stages:
      - json:
          expressions:
            timestamp: timestamp
            message: message
            context: context
            level: level
      - timestamp:
          source: timestamp
          format: RFC3339Nano
      - labels:
          context:
          level:
      - output:
          source: message
I've tried this setup of Promtail with a Java Spring Boot application (which writes its logs to a file in JSON format via the Logstash Logback encoder) and it works.
An example log line generated by the application:
{"timestamp":"2020-06-06T01:00:30.840+02:00","version":1,"message":"Started ApiApplication in 1.431 seconds (JVM running for 6.824)","logger_name":"com.github.pnowy.spring.api.ApiApplication","thread_name":"main","level":"INFO","level_value":20000}
The Promtail config:
# Promtail Server Config
server:
  http_listen_port: 9080
  grpc_listen_port: 0

# Positions
positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: springboot
    pipeline_stages:
      - json:
          expressions:
            level: level
            message: message
            timestamp: timestamp
            logger_name: logger_name
            stack_trace: stack_trace
            thread_name: thread_name
      - labels:
          level:
      - template:
          source: new_key
          template: 'logger={{ .logger_name }} threadName={{ .thread_name }} | {{ or .message .stack_trace }}'
      - output:
          source: new_key
    static_configs:
      - targets:
          - localhost
        labels:
          job: applogs
          __path__: /Users/przemek/tools/promtail/*.log
Please notice that the output (the log text) is first built as new_key by Go templating and later set as the output source. The logger={{ .logger_name }} prefix helps to recognise the field as parsed in the Loki view (but how you configure it is an individual matter for your application).
Here you will find quite nice documentation about the entire process: https://grafana.com/docs/loki/latest/clients/promtail/pipelines/
The example was run on release v1.5.0 of Loki and Promtail (Update 2020-04-25: I've updated the links to the current version, 2.2, as the old links stopped working).
The section about the timestamp stage is here: https://grafana.com/docs/loki/latest/clients/promtail/stages/timestamp/ with examples. I've tested it and also didn't notice any problem. Hope that helps a little bit.
The JSON configuration part: https://grafana.com/docs/loki/latest/clients/promtail/stages/json/
Result on Loki:

Failed to load API definition error using Flasgger in Python

I am creating an API for my machine learning model using Flasgger and Flask in Python.
After running my API file I am getting the error below: 'Failed to load API definition.
Fetch error
Internal Server Error /apispec_1.json'
Below is my code:
import pickle
from flask import Flask, abort, jsonify, request
import numpy as np
import pandas as pd
from flasgger import Swagger

with open('./im.pkl', 'rb') as model_file:
    model = pickle.load(model_file)

app = Flask(__name__)
swagger = Swagger(app)

@app.route('/predict')
def predict1():
    """Example
    ---
    parameters:
      -name: Days
       in: query
       type= number
       required: true
    --
    --
    --
    """
    Days = request.args.json('Days')
    prediction = model.predict(np.array([[Days]]))
    return str(prediction)

if __name__ == '__main__':
    app.run(port=5000, debug=True)
You've got an error in the docstring description:
@app.route('/predict')
def predict1():
    """Example
    ---
    parameters:
      - name: Days
        in: query
        type: integer
        required: true
    """
Just replace type= number with type: integer.
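For completeness, a minimal sketch of the whole corrected file (based on the code in the question; note that request.args.json does not exist in Flask, so request.args.get is used here instead, and the pickled model path is taken from the question as-is):

import pickle

import numpy as np
from flasgger import Swagger
from flask import Flask, request

# Load the pickled model once at startup.
with open('./im.pkl', 'rb') as model_file:
    model = pickle.load(model_file)

app = Flask(__name__)
swagger = Swagger(app)

@app.route('/predict')
def predict1():
    """Example
    ---
    parameters:
      - name: Days
        in: query
        type: integer
        required: true
    """
    # Read the query parameter from /predict?Days=...
    days = request.args.get('Days', type=int)
    prediction = model.predict(np.array([[days]]))
    return str(prediction)

if __name__ == '__main__':
    app.run(port=5000, debug=True)

With the docstring fixed, the spec should be served at /apispec_1.json and the Swagger UI at /apidocs.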

Error: gazebo_ros_control plugin is waiting for model URDF in parameter

Please help.
Only very rarely does Gazebo open with the loaded model; almost 99 times out of 100 it fails with the error below.
After searching all the forums for a day I tried the following, so far with no luck :(
1) running verbose:=true
2) running rosrun gzclient and then the launch file
3) making sure the box size is not zero
4) transmission type properly mentioned
5) gazebo_ros_control plugin installed and mentioned in the model file
6) gazebo_ros_control plugin installed (please note that I was able to run the same launch before; suddenly this error started coming up)
7) checked the namespace
Error trace:
balaji@balaji:~/Documents/balaji/unl/Media/Downloads/robot_ws_final$ source devel/setup.bash
balaji@balaji:~/Documents/balaji/unl/Media/Downloads/robot_ws_final$ roslaunch robot_gazebo robot_world.launch
... logging to /home/balaji/.ros/log/e78e4fbc-7f83-11e7-9f51-9801a7b07983/roslaunch-balaji-31825.log
Checking log directory for disk usage. This may take awhile.
Press Ctrl-C to interrupt
WARNING: disk usage in log directory [/home/balaji/.ros/log] is over 1GB.
It's recommended that you use the 'rosclean' command.
xacro: Traditional processing is deprecated. Switch to --inorder processing!
To check for compatibility of your document, use option --check-order.
For more infos, see http://wiki.ros.org/xacro#Processing_Order
started roslaunch server http://balaji:45487/
SUMMARY
========
PARAMETERS
* /first_pelican/image_processing_node/namesapce_deploy: first_pelican
* /first_pelican/joint1_position_controller/joint: palm_riser
* /first_pelican/joint1_position_controller/pid/d: 10.0
* /first_pelican/joint1_position_controller/pid/i: 0.01
* /first_pelican/joint1_position_controller/pid/p: 100.0
* /first_pelican/joint1_position_controller/type: effort_controller...
* /first_pelican/joint_state_controller/publish_rate: 100
* /first_pelican/joint_state_controller/type: joint_state_contr...
* /first_pelican/robot_description: <?xml version="1....
* /first_pelican/smart_exploration/dist_x: 0
* /first_pelican/smart_exploration/dist_y: 0
* /first_pelican/smart_exploration/namesapce_deploy: first_pelican
* /rosdistro: kinetic
* /rosversion: 1.12.7
* /use_sim_time: True
NODES
/first_pelican/
controller_spawner (controller_manager/spawner)
image_processing_node (image_processing/image_processing_node)
mybot_spawn (gazebo_ros/spawn_model)
robot_state_publisher (robot_state_publisher/robot_state_publisher)
smart_exploration (robot_exploration/smart_exploration)
/
gazebo (gazebo_ros/gzserver)
gazebo_gui (gazebo_ros/gzclient)
auto-starting new master
process[master]: started with pid [31839]
ROS_MASTER_URI=http://localhost:11311
setting /run_id to e78e4fbc-7f83-11e7-9f51-9801a7b07983
process[rosout-1]: started with pid [31852]
started core service [/rosout]
process[gazebo-2]: started with pid [31864]
process[gazebo_gui-3]: started with pid [31879]
process[first_pelican/mybot_spawn-4]: started with pid [31886]
process[first_pelican/controller_spawner-5]: started with pid [31887]
process[first_pelican/robot_state_publisher-6]: started with pid [31888]
process[first_pelican/image_processing_node-7]: started with pid [31889]
process[first_pelican/smart_exploration-8]: started with pid [31890]
[ WARN] [1502559016.978709697]: The root link chassis has an inertia specified in the URDF, but KDL does not support a root link with an inertia. As a workaround, you can add an extra dummy link to your URDF.
[ INFO] [1502559016.986332012]: Got param: 0.000000
[ INFO] [1502559016.995995700]: Got param: 0.000000
[ INFO] [1502559016.999604731]: Got param: first_pelican
[ INFO] [1502559017.008884277]: In image_converter, got param: first_pelican
SpawnModel script started
[INFO] [1502559017.185603, 0.000000]: Loading model XML from ros parameter
[INFO] [1502559017.190666, 0.000000]: Waiting for service /gazebo/spawn_urdf_model
[ INFO] [1502559017.208092409]: Finished loading Gazebo ROS API Plugin.
[ INFO] [1502559017.209366293]: waitForService: Service [/gazebo/set_physics_properties] has not been advertised, waiting...
[INFO] [1502559017.386893, 0.000000]: Controller Spawner: Waiting for service controller_manager/load_controller
[ INFO] [1502559017.566665686, 246.206000000]: waitForService: Service [/gazebo/set_physics_properties] is now available.
[ INFO] [1502559017.611486634, 246.249000000]: Physics dynamic reconfigure ready.
[INFO] [1502559017.795112, 246.428000]: Calling service /gazebo/spawn_urdf_model
[ INFO] [1502559018.103326226, 246.494000000]: Camera Plugin: Using the 'robotNamespace' param: '/first_pelican/'
[ INFO] [1502559018.107184854, 246.494000000]: Camera Plugin (ns = /first_pelican/) <tf_prefix_>, set to "/first_pelican"
[ INFO] [1502559018.628739638, 246.494000000]: Laser Plugin: Using the 'robotNamespace' param: '/first_pelican/'
[ INFO] [1502559018.628941833, 246.494000000]: Starting Laser Plugin (ns = /first_pelican/)
[ INFO] [1502559018.630496093, 246.494000000]: Laser Plugin (ns = /first_pelican/) <tf_prefix_>, set to "/first_pelican"
[INFO] [1502559018.650747, 246.494000]: Spawn status: SpawnModel: Successfully spawned entity
[ INFO] [1502559018.669444812, 246.494000000]: Loading gazebo_ros_control plugin
[ INFO] [1502559018.669578793, 246.494000000]: Starting gazebo_ros_control plugin in namespace: first_pelican
[ INFO] [1502559018.670483364, 246.494000000]: gazebo_ros_control plugin is waiting for model URDF in parameter [/robot_description] on the ROS param server.
I know it has been some time since this question has been asked, but if someone is still looking for an answer then please find it below:
I was able to load the robot after making changes in the launch file. I had to set the robot_description parameter outside of the <group> tags, and then load/spawn the URDF in Gazebo inside the <group> tags; please find the changes below:
<arg name="robot_description"
     default="$(find urdf_test_pkg)/model/robot.xacro"/>
<param name="/robot_description"
       command="$(find xacro)/xacro --inorder $(arg robot_description) namesapce_deploy:=$(arg ns_1)"/>

<group ns="$(arg ns_1)">
  <node name="mybot_spawn" pkg="gazebo_ros" type="spawn_model" output="screen"
        args="-urdf -param /robot_description -model mybot_$(arg ns_1)
              -x $(arg x) -y $(arg y) -z $(arg z)
              -R $(arg roll) -P $(arg pitch) -Y $(arg yaw)" respawn="false" />

  <!-- convert joint states to TF transforms for rviz, etc -->
  <!-- Notice the leading '/' in '/robot_description' -->
  <node name="robot_state_publisher" pkg="robot_state_publisher" type="robot_state_publisher" respawn="false" output="screen">
    <remap from="/joint_states" to="/$(arg ns_1)/joint_states" />
  </node>
</group>
Please note that this answer is based on the solution provided on the following page: https://answers.ros.org/question/268655/gazebo_ros_control-plugin-is-waiting-for-model-urdf-in-parameter/. For more details about the question and answer, please visit that link.

PredictionIO text classification quick start failing when reading the data

I'm following this quick start after starting this ready-to-use PredictionIO Amazon EC2 instance, and after running these commands it fails at pio train:
pio app new MyTextApp
pio import --appid 1 --input data/stopwords.json
pio import --appid 1 --input data/emails.json
pio build
pio train
...
Data set is empty, make sure event fields match imported data.
Exception in thread "main" java.lang.IllegalStateException: Haven't seen any document yet.
at org.apache.spark.mllib.feature.IDF$DocumentFrequencyAggregator.idf(IDF.scala:132)
at org.apache.spark.mllib.feature.IDF.fit(IDF.scala:56)
at uk.co.news.PreparedData.<init>(Preparator.scala:70)
at uk.co.news.Preparator.prepare(Preparator.scala:47)
at uk.co.news.Preparator.prepare(Preparator.scala:43)
Since there is no error when running the command to import emails, I don't understand why the data set is still empty. I double-checked the emails.json file and the data is indeed there. This is the result of running
pio import --appid 1 --input data/emails.json
ubuntu@ip-172-31-0-60:~/pio-textclassification$ pio import --appid 1 --input data/emails.json
[INFO] [Runner$] Submission command: /opt/spark-1.4.1-bin-hadoop2.6/bin/spark-submit --class io.prediction.tools.imprt.FileToEvents --files file:/opt/PredictionIO/conf/log4j.properties --driver-class-path /opt/PredictionIO/conf file:/opt/PredictionIO/lib/pio-assembly-0.9.4.jar --appid 1 --input file:/home/ubuntu/pio-textclassification/data/emails.json --env PIO_ENV_LOADED=1,PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta,PIO_FS_BASEDIR=/home/ubuntu/.pio_store,PIO_HOME=/opt/PredictionIO,PIO_FS_ENGINESDIR=/home/ubuntu/.pio_store/engines,PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql://localhost/pio,PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=PGSQL,PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=PGSQL,PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event,PIO_STORAGE_SOURCES_PGSQL_PASSWORD=pio,PIO_STORAGE_SOURCES_PGSQL_TYPE=jdbc,PIO_FS_TMPDIR=/home/ubuntu/.pio_store/tmp,PIO_STORAGE_SOURCES_PGSQL_USERNAME=pio,PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model,PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=PGSQL,PIO_CONF_DIR=/opt/PredictionIO/conf
[INFO] [Remoting] Starting remoting
[INFO] [Remoting] Remoting started; listening on addresses :[akka.tcp://sparkDriver#172.31.0.60:49257]
[INFO] [FileToEvents$] Events are imported.
[INFO] [FileToEvents$] Done.
EDIT:
pio build --verbose
showed an exception that was being swallowed. The problem is with the database connection, but it's still not clear what is wrong since parts of the exception are being replaced with "..."
[DEBUG] [ConnectionPool$] Registered connection pool : ConnectionPool(url:jdbc:postgresql://localhost/pio, user:pio) using factory : <default>
[DEBUG] [ConnectionPool$] Registered singleton connection pool : ConnectionPool(url:jdbc:postgresql://localhost/pio, user:pio)
[DEBUG] [StatementExecutor$$anon$1] SQL execution completed
[SQL Execution]
create table if not exists pio_meta_enginemanifests ( id varchar(100) not null primary key, version text not null, engineName text not null, description text, files text not null, engineFactory text not null); (10 ms)
[Stack Trace]
...
io.prediction.data.storage.jdbc.JDBCEngineManifests$$anonfun$1.apply(JDBCEngineManifests.scala:37)
io.prediction.data.storage.jdbc.JDBCEngineManifests$$anonfun$1.apply(JDBCEngineManifests.scala:29)
scalikejdbc.DBConnection$class.autoCommit(DBConnection.scala:222)
scalikejdbc.DB.autoCommit(DB.scala:60)
scalikejdbc.DB$$anonfun$autoCommit$1.apply(DB.scala:215)
scalikejdbc.DB$$anonfun$autoCommit$1.apply(DB.scala:214)
scalikejdbc.LoanPattern$class.using(LoanPattern.scala:18)
scalikejdbc.DB$.using(DB.scala:138)
scalikejdbc.DB$.autoCommit(DB.scala:214)
io.prediction.data.storage.jdbc.JDBCEngineManifests.<init>(JDBCEngineManifests.scala:29)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:526)
io.prediction.data.storage.Storage$.getDataObject(Storage.scala:293)
...
[INFO] [RegisterEngine$] Registering engine JmhjlGoEjJuKXhXpY70MbEkuGHMuOZzL 8ccd38126d56ed48adaa9f85547131467f7629f7
[DEBUG] [StatementExecutor$$anon$1] SQL execution completed
[SQL Execution]
update pio_meta_enginemanifests set engineName = 'pio-textclassification', description = 'pio-autogen-manifest', files = 'file:/home/ubuntu/pio-textclassification/target/scala-2.10/uk.co.news-assembly-0.1-SNAPSHOT-deps.jar... (192)', engineFactory = '' where id = 'JmhjlGoEjJuKXhXpY70MbEkuGHMuOZzL' and version = '8ccd38126d56ed48adaa9f85547131467f7629f7'; (3 ms)
[Stack Trace]
...
io.prediction.data.storage.jdbc.JDBCEngineManifests$$anonfun$7.apply(JDBCEngineManifests.scala:85)
io.prediction.data.storage.jdbc.JDBCEngineManifests$$anonfun$7.apply(JDBCEngineManifests.scala:78)
scalikejdbc.DBConnection$$anonfun$3.apply(DBConnection.scala:297)
scalikejdbc.DBConnection$class.scalikejdbc$DBConnection$$rollbackIfThrowable(DBConnection.scala:274)
scalikejdbc.DBConnection$class.localTx(DBConnection.scala:295)
scalikejdbc.DB.localTx(DB.scala:60)
scalikejdbc.DB$.localTx(DB.scala:257)
io.prediction.data.storage.jdbc.JDBCEngineManifests.update(JDBCEngineManifests.scala:78)
io.prediction.tools.RegisterEngine$.registerEngine(RegisterEngine.scala:50)
io.prediction.tools.console.Console$.build(Console.scala:813)
io.prediction.tools.console.Console$$anonfun$main$1.apply(Console.scala:698)
io.prediction.tools.console.Console$$anonfun$main$1.apply(Console.scala:684)
scala.Option.map(Option.scala:145)
io.prediction.tools.console.Console$.main(Console.scala:684)
io.prediction.tools.console.Console.main(Console.scala)
...
[DEBUG] [StatementExecutor$$anon$1] SQL execution completed
[SQL Execution]
INSERT INTO pio_meta_enginemanifests VALUES( 'JmhjlGoEjJuKXhXpY70MbEkuGHMuOZzL', '8ccd38126d56ed48adaa9f85547131467f7629f7', 'pio-textclassification', 'pio-autogen-manifest', 'file:/home/ubuntu/pio-textclassification/target/scala-2.10/uk.co.news-assembly-0.1-SNAPSHOT-deps.jar... (192)', ''); (1 ms)
[Stack Trace]
...
io.prediction.data.storage.jdbc.JDBCEngineManifests$$anonfun$2.apply(JDBCEngineManifests.scala:48)
io.prediction.data.storage.jdbc.JDBCEngineManifests$$anonfun$2.apply(JDBCEngineManifests.scala:40)
scalikejdbc.DBConnection$$anonfun$3.apply(DBConnection.scala:297)
scalikejdbc.DBConnection$class.scalikejdbc$DBConnection$$rollbackIfThrowable(DBConnection.scala:274)
scalikejdbc.DBConnection$class.localTx(DBConnection.scala:295)
scalikejdbc.DB.localTx(DB.scala:60)
scalikejdbc.DB$.localTx(DB.scala:257)
io.prediction.data.storage.jdbc.JDBCEngineManifests.insert(JDBCEngineManifests.scala:40)
io.prediction.data.storage.jdbc.JDBCEngineManifests.update(JDBCEngineManifests.scala:89)
io.prediction.tools.RegisterEngine$.registerEngine(RegisterEngine.scala:50)
io.prediction.tools.console.Console$.build(Console.scala:813)
io.prediction.tools.console.Console$$anonfun$main$1.apply(Console.scala:698)
io.prediction.tools.console.Console$$anonfun$main$1.apply(Console.scala:684)
scala.Option.map(Option.scala:145)
io.prediction.tools.console.Console$.main(Console.scala:684)
...
[INFO] [Console$] Your engine is ready for training.
A few things to check:
1) Does "pio app list" show that MyTextApp has appId 1?
2) Download https://github.com/yipjustin/pio-event-distribution-checker and change engine.json so that appId reads 1, then run "pio build" and "pio train" to see if the data is actually imported.
P.S. There is a Google group (https://groups.google.com/forum/#!forum/predictionio-user) where your question will be answered more quickly by the community of PredictionIO users.
The solution was to change the DataSource.scala to match the schema in the emails.json file before running pio build.
This is the only method I had to change in the file:
private def readEventData(sc: SparkContext): RDD[Observation] = {
  // Get RDD of Events.
  PEventStore.find(
    appName = dsp.appName,
    entityType = Some("content"),
    eventNames = Some(List("e-mail"))
    // Convert the collected RDD of events to an RDD of Observation
    // objects.
  )(sc).map(e => {
    val label: String = e.properties.get[String]("label")
    Observation(
      if (label == "spam") 1.0 else 0.0,
      e.properties.get[String]("text"),
      label
    )
  }).cache
}
I had to change the previous values to "content", "e-mail" and "spam".
