I'm using spark-iforest as described here: https://github.com/titicaca/spark-iforest
But model.save() is throwing an exception:
Exception:
scala.NotImplementedError: The default jsonEncode only supports string, vector and matrix. org.apache.spark.ml.param.Param must override jsonEncode for java.lang.Double.
I followed the code snippet in the "Python API" section of the linked GitHub page.
from pyspark.ml.feature import VectorAssembler
import os
import tempfile
from pyspark_iforest.ml.iforest import *

The input DataFrame df has this schema:

col_1:integer
col_2:integer
col_3:integer

in_cols = ["col_1", "col_2", "col_3"]  # the input columns listed in the schema above
assembler = VectorAssembler(inputCols=in_cols, outputCol="features")
df = assembler.transform(df)  # adds the "features" vector column
iforest = IForest(contamination=0.5, maxDepth=2)
model = iforest.fit(df)
model.save("model_path")
model.save() should be able to save model files.
Below is the output dataframe I'm getting after executing model.transform(df):
col_1:integer
col_2:integer
col_3:integer
features:udt
anomalyScore:double
prediction:double
I have just fixed this issue. It was caused by an incorrect param type. You can check out the latest code in the master branch and try it again.
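For reference, once the fixed master is installed, a save/load round trip along the lines of the project's README should work. This is a minimal sketch, assuming the IForestModel class exported by pyspark_iforest.ml.iforest (as shown in the README) and the model fitted above:

import tempfile
from pyspark_iforest.ml.iforest import IForest, IForestModel

# Write the fitted model to a temporary directory.
temp_path = tempfile.mkdtemp()
model_path = temp_path + "/iforest_model"
model.save(model_path)

# Reload it and score the same DataFrame again.
loaded_model = IForestModel.load(model_path)
loaded_model.transform(df).show()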
When I run the following code in a Jupyter Notebook, it originally worked fine, but now an error suddenly pops up.
import tensorflow.compat.v1 as tf
import tensorflow_hub as hub  # needed for hub.Module below

tf.disable_v2_behavior()
tf.disable_eager_execution()

# Load the pre-trained ELMo model from TensorFlow Hub
elmo = hub.Module("https://tfhub.dev/google/elmo/1", trainable=True)
The error is:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 149: invalid start byte
I had the same error before, and changing the ELMo version fixed it then, but now that doesn't solve it either.
I'm getting an error while adding a spaCy-compatible extension, med7, to the pipeline. I've included reproducible code below.
!pip install -U https://med7.s3.eu-west-2.amazonaws.com/en_core_med7_lg.tar.gz
import spacy
import en_core_med7_lg
from spacy.lang.en import English
med7 = en_core_med7_lg.load()
# Create the nlp object
nlp2 = English()
nlp2.add_pipe(med7)
# Process a text
doc = nlp2("This is a sentence.")
The error I get is
Argument 'string' has incorrect type (expected str, got spacy.tokens.doc.Doc)
I realized I'm having this problem because I don't understand the difference between some spaCy components. For instance, in the Negex extension package, the component is loaded and added to the pipeline like this:
negex = Negex(nlp, ent_types=["PERSON","ORG"])
nlp.add_pipe(negex, last=True)
I don't understand the difference between Negex and en_core_med7_lg.load(). For some reason, when I add med7 to the pipeline, it causes this error. I'm new to spaCy and would appreciate an explanation so that I can learn. Please let me know if I can make this question any clearer. Thanks!
med7 is already the loaded pipeline (a full spaCy Language object), not an individual component to be added with add_pipe. Run it directly:
doc = med7("This is a sentence.")
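Putting it together, a minimal sketch (using only the en_core_med7_lg package installed above):

import en_core_med7_lg

# en_core_med7_lg.load() returns a complete spaCy pipeline (a Language object),
# so you call it on text directly instead of adding it to another pipeline.
med7 = en_core_med7_lg.load()

doc = med7("This is a sentence.")
for ent in doc.ents:
    print(ent.text, ent.label_)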
I am using the dependency parsing of CoreNLP for a project of mine. The basic and enhanced dependencies give different results for a particular dependency.
I used the following code to get enhanced dependencies.
import edu.stanford.nlp.parser.lexparser.LexicalizedParser
import edu.stanford.nlp.trees.{GrammaticalStructure, GrammaticalStructureFactory, PennTreebankLanguagePack}

val lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
lp.setOptionFlags("-maxLength", "80")
val rawWords = edu.stanford.nlp.ling.Sentence.toCoreLabelList(tokens_arr:_*)
val parse = lp.apply(rawWords)
val tlp = new PennTreebankLanguagePack()
val gsf: GrammaticalStructureFactory = tlp.grammaticalStructureFactory()
val gs: GrammaticalStructure = gsf.newGrammaticalStructure(parse)
// CC-processed (collapsed) typed dependencies
val tdl = gs.typedDependenciesCCprocessed()
For the following example,
Account name of ramkumar.
I use the simple API to get basic dependencies. The dependency I get between (account, name) is (compound). But when I use the above code to get enhanced dependencies, I get the relation between (account, name) as (dobj).
What is the fix for this? Is this a bug or am I doing something wrong?
When I run this command:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse -file example.txt -outputFormat json
With your example text in the file example.txt, I see compound as the relationship between both of those words for both types of dependencies.
I also tried this with the simple API and got the same results.
You can see what the simple API produces with this code:
package edu.stanford.nlp.examples;

import edu.stanford.nlp.semgraph.SemanticGraphFactory;
import edu.stanford.nlp.simple.*;

import java.util.*;

public class SimpleDepParserExample {

  public static void main(String[] args) {
    Sentence sent = new Sentence("...example text...");
    Properties props = new Properties();
    // use sent.dependencyGraph() or sent.dependencyGraph(props, SemanticGraphFactory.Mode.ENHANCED) to see enhanced dependencies
    System.out.println(sent.dependencyGraph(props, SemanticGraphFactory.Mode.BASIC));
  }
}
I don't know anything about any Scala interfaces for Stanford CoreNLP. I should also note that my results are from the latest code on GitHub, though I presume Stanford CoreNLP 3.8.0 would produce similar results. If you are using an older version of Stanford CoreNLP, that could be a potential cause of the error. But running this example in various ways using Java, I don't see the issue you are encountering.
A PicklingError is raised when I run my data pipeline remotely. The pipeline is written using the Beam SDK for Python and runs on top of Google Cloud Dataflow; it works fine when I run it locally.
The following code generates the PicklingError and ought to reproduce the problem:
import apache_beam as beam
from apache_beam.transforms import pvalue
from apache_beam.io.fileio import _CompressionType
from apache_beam.utils.options import PipelineOptions
from apache_beam.utils.options import GoogleCloudOptions
from apache_beam.utils.options import SetupOptions
from apache_beam.utils.options import StandardOptions

if __name__ == "__main__":
    pipeline_options = PipelineOptions()
    pipeline_options.view_as(StandardOptions).runner = 'BlockingDataflowPipelineRunner'
    pipeline_options.view_as(SetupOptions).save_main_session = True
    google_cloud_options = pipeline_options.view_as(GoogleCloudOptions)
    google_cloud_options.project = "project-name"
    google_cloud_options.job_name = "job-name"
    google_cloud_options.staging_location = 'gs://path/to/bucket/staging'
    google_cloud_options.temp_location = 'gs://path/to/bucket/temp'

    p = beam.Pipeline(options=pipeline_options)
    p.run()
Below is a sample from the beginning and the end of the Traceback:
WARNING: Could not acquire lock C:\Users\ghousains\AppData\Roaming\gcloud\credentials.lock in 0 seconds
WARNING: The credentials file (C:\Users\ghousains\AppData\Roaming\gcloud\credentials) is not writable. Opening in read-only mode. Any refreshed credentials will only be valid for this run.
Traceback (most recent call last):
File "formatter_debug.py", line 133, in <module>
p.run()
File "C:\Miniconda3\envs\beam\lib\site-packages\apache_beam\pipeline.py", line 159, in run
return self.runner.run(self)
....
....
....
File "C:\Miniconda3\envs\beam\lib\sitepackages\apache_beam\runners\dataflow_runner.py", line 172, in run
self.dataflow_client.create_job(self.job))
StockPickler.save_global(pickler, obj)
File "C:\Miniconda3\envs\beam\lib\pickle.py", line 754, in save_global (obj, module, name))
pickle.PicklingError: Can't pickle <class 'apache_beam.internal.clients.dataflow.dataflow_v1b3_messages.TypeValueValuesEnum'>: it's not found as apache_beam.internal.clients.dataflow.dataflow_v1b3_messages.TypeValueValuesEnum
I've found that your error gets raised when a Pipeline object is included in the context that gets pickled and sent to the cloud:
pickle.PicklingError: Can't pickle <class 'apache_beam.internal.clients.dataflow.dataflow_v1b3_messages.TypeValueValuesEnum'>: it's not found as apache_beam.internal.clients.dataflow.dataflow_v1b3_messages.TypeValueValuesEnum
Naturally, you might ask:
(1) What's making the Pipeline object unpickleable when it's sent to the cloud, since normally it's pickleable?
(2) If this were really the problem, then wouldn't I get this error all the time - isn't a Pipeline object normally included in the context sent to the cloud?
(3) If the Pipeline object isn't normally included in the context sent to the cloud, then why is a Pipeline object being included in my case?
(1)
When you call p.run() on a Pipeline with cloud=True, one of the first things that happens is that p.runner.job=apiclient.Job(pipeline.options) is set in apache_beam.runners.dataflow_runner.DataflowPipelineRunner.run.
Without this attribute set, the Pipeline is pickleable. But once this is set, the Pipeline is no longer pickleable, since p.runner.job.proto._Message__tags[17] is a TypeValueValuesEnum, which is defined as a nested class in apache_beam.internal.clients.dataflow.dataflow_v1b3_messages. AFAIK nested classes cannot be pickled (even by dill - see How can I pickle a nested class in python?).
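A small illustration of that point, under Python 2 (which the traceback above comes from; Python 3 can pickle nested classes via __qualname__):

import pickle

class Outer(object):
    class Inner(object):  # stands in for the nested TypeValueValuesEnum class
        pass

# Under Python 2 this raises a PicklingError along the lines of:
#   Can't pickle <class '__main__.Inner'>: it's not found as __main__.Inner
# because pickle looks the class up by its bare name at module level.
pickle.dumps(Outer.Inner())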
(2)-(3)
Counterintuitively, a Pipeline object is normally not included in the context sent to the cloud. When you call p.run() on a Pipeline with cloud=True, only the following objects are pickled (and note that the pickling happens after p.runner.job gets set):
1. If save_main_session=True, then all global objects in the module designated __main__ are pickled. (__main__ is the script that you ran from the command line.)
2. Each transform defined in the pipeline is individually pickled.
In your case, you encountered #1, which is why your solution worked. I actually encountered #2 where I defined a beam.Map lambda function as a method of a composite PTransform. (When composite transforms are applied, the pipeline gets added as an attribute of the transform...) My solution was to define those lambda functions in the module instead.
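As an illustration of that fix, here is a minimal sketch against the current Beam API; MyComposite and add_key are hypothetical names, not taken from the original pipeline:

import apache_beam as beam

def add_key(element):
    # Module-level helper: pickle can find it by name, so it serializes cleanly.
    return (element, 1)

class MyComposite(beam.PTransform):
    def expand(self, pcoll):
        # Passing a lambda or a method of this class to beam.Map would drag `self`
        # (and, once the transform is applied, the pipeline) into what gets pickled;
        # a module-level function avoids that.
        return pcoll | "AddKey" >> beam.Map(add_key)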
A longer-term solution would be for us to fix this in the Apache Beam project. TBD!
This should be fixed in the google-dataflow 0.4.4 SDK release with https://github.com/apache/incubator-beam/pull/1485
I resolved this problem by encapsulating the body of main within a run() method and invoking run().
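A minimal sketch of that restructuring, reusing the options code from the question:

import apache_beam as beam
from apache_beam.utils.options import PipelineOptions

def run():
    # Everything that previously sat directly under `if __name__ == "__main__":`
    # moves in here, so the Pipeline is a local variable rather than a module-level
    # global that save_main_session would try to pickle.
    pipeline_options = PipelineOptions()
    p = beam.Pipeline(options=pipeline_options)
    # ... build the pipeline here ...
    p.run()

if __name__ == "__main__":
    run()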
I upgraded to py2neo 2.0 and the code I used for streaming and writing results to file doesn't work any more. The error I am getting is TypeError: 'CypherResource' object is not callable.
from py2neo import Graph
from py2neo.packages.httpstream import http
http.socket_timeout = 9999
graph = Graph()
query="""
MY QUERY
"""
result = graph.cypher(graph, query)
with open('myfile', 'w') as f:
    for record in result.stream():
        v = record.values
I assume the error is in graph.cypher(graph,query), but I was not successful in fixing the error.
You are using Graph.cypher incorrectly. This is an object with execution and transaction management methods, not a callable itself. Please read the docs at:
http://py2neo.org/2.0/cypher.html
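For example, a minimal sketch of the streaming pattern under py2neo 2.0, assuming the CypherResource methods (execute/stream) described in the linked docs:

from py2neo import Graph

graph = Graph()

query = """
MY QUERY
"""

# graph.cypher is a CypherResource: call its methods instead of calling it directly.
with open('myfile', 'w') as f:
    for record in graph.cypher.stream(query):
        v = record.values
        f.write(str(v) + "\n")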