Dependency parse a large text file with Python

I am trying to parse a large txt file (about 2000 sentences). When I try to set the model_path, I get this message:
NLTK was unable to find stanford-parser.jar! Set the CLASSPATH
environment variable.
And when I set the CLASSPATH to this file, another message appears:
NLTK was unable to find stanford-parser-(\d+)(.(\d+))+-models.jar!
Set the CLASSPATH environment variable.
Could you help me solve this?
This is my code:
import nltk
from nltk.parse.stanford import StanfordDependencyParser
dependency_parser = StanfordDependencyParser( model_path="edu\stanford\lp\models\lexparser\englishPCFG.ser.gz")
===========================================================================
NLTK was unable to find stanford-parser.jar! Set the CLASSPATH
environment variable.
For more information, on stanford-parser.jar, see:
https://nlp.stanford.edu/software/lex-parser.shtml
import os
os.environ['CLASSPATH'] = "stanford-corenlp-full-2018-10-05/*"
dependency_parser = StanfordDependencyParser( model_path="edu\stanford\lp\models\lexparser\englishPCFG.ser.gz")
===========================================================================
NLTK was unable to find stanford-parser.jar! Set the CLASSPATH
environment variable.
For more information, on stanford-parser.jar, see:
https://nlp.stanford.edu/software/lex-parser.shtml
os.environ['CLASSPATH'] = "stanford-corenlp-full-2018-10-05/stanford-parser-full-2018-10-17/stanford-parser.jar"
>>> dependency_parser = StanfordDependencyParser( model_path="stanford-corenlp-full-2018-10-05/stanford-parser-full-2018-10-17/edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
NLTK was unable to find stanford-parser-(\d+)(.(\d+))+-models.jar!
Set the CLASSPATH environment variable.
For more information, on stanford-parser-(\d+)(.(\d+))+-models.jar, see:
https://nlp.stanford.edu/software/lex-parser.shtml

You should get the new stanfordnlp dependency parser, which is native to Python!
It will run slower on a CPU than on a GPU, but it should still run reasonably fast.
Just run pip install stanfordnlp to install it.
import stanfordnlp
stanfordnlp.download('en') # This downloads the English models for the neural pipeline
nlp = stanfordnlp.Pipeline() # This sets up a default neural pipeline in English
doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
doc.sentences[0].print_dependencies()
There is also a helpful command line tool:
python -m stanfordnlp.run_pipeline -l en example.txt
Full details here: https://stanfordnlp.github.io/stanfordnlp/
GitHub: https://github.com/stanfordnlp/stanfordnlp
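For the original task (a text file of roughly 2000 sentences), a minimal sketch built on the pipeline above; the file name is a placeholder, and the default pipeline does its own sentence splitting:
import stanfordnlp

stanfordnlp.download('en')    # one-time download of the English models
nlp = stanfordnlp.Pipeline()  # default neural pipeline in English

# 'sentences.txt' is a placeholder for your ~2000-sentence input file.
with open('sentences.txt', encoding='utf-8') as f:
    text = f.read()

doc = nlp(text)  # the pipeline segments the raw text into sentences itself
for sentence in doc.sentences:
    sentence.print_dependencies()  # one dependency parse per sentence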

Related

Where do I save the log4j.properties file so that RedhawkSDR will automatically find it?

I've compiled RedhawkSDR v3.0.0 within a Docker container and tried to implement the Basic Sandbox Example from RedhawkSDR's manual (Manual for v3.0.0 & v3.0.1). However, I've encountered this problem, as shown below:
$ docker run -it docker-redhawk3.0.0-src
[root@2c65bf0659d7 Redhawk_User]# python3
Python 3.6.8 (default, Nov 16 2020, 16:55:22)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from ossie.utils import sb
>>> sb.catalog()
['rh.AmFmPmBasebandDemod', 'rh.ArbitraryRateResampler', 'rh.DataConverter', 'rh.FileReader', 'rh.FileWriter', 'rh.HardLimit', 'rh.RBDSDecoder', 'rh.SigGen', 'rh.SinkSDDS', 'rh.SinkVITA49', 'rh.SourceSDDS', 'rh.SourceVITA49', 'rh.TuneFilterDecimate', 'rh.agc', 'rh.autocorrelate', 'rh.fastfilter', 'rh.fcalc', 'rh.psd', 'rh.psk_soft', 'rh.sinksocket', 'rh.sourcesocket']
>>> sigGen = sb.launch("rh.SigGen")
>>> hardLimit = sb.launch("rh.HardLimit")
>>> sb.IDELocation("/eclipse")
>>> plot = sb.Plot()
>>> sigGen.connect(hardLimit)
>>> hardLimit.connect(plot)
>>> log4j:WARN No appenders could be found for logger (jacorb.config).
log4j:WARN Please initialize the log4j system properly.
Plotter: Cannot open display:
(Note: The error takes about 5 sec to appear, which is why the >>> prompt appears before the error message.)
I've Googled this error and found a relevant StackOverflow entry (No appenders could be found for logger (log4j)?). I also looked at the section on logging in the Redhawk Manual (RedhawkSDR-LoggingConfig). I got the impression that if I created a file called log4j.properties and either indicated where it was by including it in the Java CLASSPATH environment variable, or placed it in the directory with the log4j jar files, then this error would be resolved.
A reasonable default for this file was given as follows. (Note: multiple answers on StackOverflow had the ConversionPattern string without quotes, but the Redhawk version used quotes; I tried both with and without, without success.)
# Root logger option
log4j.rootLogger=INFO, stdout
# Direct log messages to stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
However, adding this file in both locations did not solve the problem. I also tried adding the directory with log4j jar files to the CLASSPATH variable, but without success.
At this point, I don't know what to do. How can I fix the log4j errors? How can I open the plotter? Is the plotter's failure to open a consequence of the log4j errors, or does it have something to do with my use of Docker?
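For reference, log4j 1.x loads log4j.properties as a classpath resource, so the CLASSPATH entry needs to be the directory containing the file, not the path to the file itself (and not the directory holding the jars). A sketch with hypothetical paths:
# Hypothetical location for log4j.properties; put its directory on the classpath.
export CLASSPATH="/home/Redhawk_User/config:$CLASSPATH"

# Alternatively, point log4j directly at the file when the JVM starts:
# java -Dlog4j.configuration=file:/home/Redhawk_User/config/log4j.properties ...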

Debugging AML Model Deployment

I have an ML model in Python, trained locally. The model was previously deployed to a Windows IIS server and works fine there.
Now I am trying to deploy it as a service on an Azure Container Instance (ACI) with 1 core and 1 GB of memory. I took my references from two Microsoft docs (one and two). The docs use the SDK for all the steps, but I am using the GUI features of the Azure portal.
After registering the model, I created an entry script and a conda environment YAML file (see below), and uploaded both under "Custom deployment asset" (in the Deploy model area).
Unfortunately, after hitting deploy, the deployment is stuck in the Transitioning state. Even after 4 hours the state remains the same, and there are no deployment logs either, so I am unable to find what I am doing wrong.
NOTE: below is just an excerpt of the entry script
import os
import pandas as pd
import pickle
import re, json
import numpy as np
import sklearn

def init():
    global model
    global classes
    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'randomForest50.pkl')
    model = pickle.load(open(model_path, "rb"))
    classes = lambda x: ["F", "M"][x]

def run(data):
    try:
        # preprocessing() is defined elsewhere in the full script
        namesList = json.loads(data)["data"]["names"]
        pred = list(map(classes, model.predict(preprocessing(namesList))))
        return str(pred[0])
    except Exception as e:
        error = str(e)
        return error
name: gender_prediction
dependencies:
  - python
  - numpy
  - scikit-learn
  - pip:
      - pandas
      - pickle
      - re
      - json
The issue was in the YAML file. The dependencies/libraries listed in the YAML have to be packages that conda (or pip) can actually install; pickle, re, and json are part of the Python standard library, not pip packages. So I changed everything accordingly, and it worked.
Modified YAML file:
name: gender_prediction
dependencies:
  - python=3.7
  - numpy
  - scikit-learn
  - pip:
      - azureml-defaults
      - pandas
      - pickle4
      - regex
      - inference-schema[numpy-support]
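Once the deployment reaches a Healthy state, a quick sanity check of the run() function above could look like this; the scoring URI is a placeholder to copy from the service's details page in the portal:
import json
import requests

# Placeholder URI; the real one is shown on the deployed service's details page.
scoring_uri = "http://<your-aci-service>.azurecontainer.io/score"

payload = json.dumps({"data": {"names": ["Alex"]}})  # the shape run() expects
resp = requests.post(scoring_uri, data=payload,
                     headers={"Content-Type": "application/json"})
print(resp.text)  # "F" or "M" on success, otherwise the error string from run()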

ImportError: No module named cv2 when run Batch transform jobs in SageMaker

When I try to run a Batch transform job in AWS SageMaker, I get the error below:
ImportError: No module named cv2
Note that I am able to import cv2 in the notebook instance; Jupyter runs it without any problem. It only fails in the endpoint at inference time. I tried the env approach from AWS Sagemaker - Install External Library and Make it Persist, but it still does not work.
Does anyone have a good way to solve this? Thanks!
My code is:
from sagemaker.mxnet import MXNetModel

env = {
    'SAGEMAKER_REQUIREMENTS': 'requirements.txt',  # path relative to `source_dir` below.
}

image_embed_model = MXNetModel(model_data=model_data,
                               entry_point='sagemaker_entrypoint.py',
                               role=role,
                               source_dir='src',
                               env=env,
                               py_version='py3',
                               framework_version='1.6.0')

transformer = image_embed_model.transformer(instance_count=1,  # Please pay attention here!!!
                                            instance_type='ml.m4.xlarge',
                                            output_path=output_path,
                                            assemble_with='Line',
                                            accept='text/csv')

transformer.transform(batch_input,
                      content_type='text/csv',
                      split_type='Line',
                      input_filter='$[0:]',
                      join_source='Input',
                      wait=False)
You can follow https://github.com/aws/sagemaker-python-sdk/blob/master/doc/using_mxnet.rst#use-third-party-libraries to import third-party libraries into your batch transform instances. Make sure the requirements.txt file is saved in the right directory before packaging the model data.
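For example, with the source_dir='src' used above, a requirements.txt next to the entry point should get installed into the transform container; opencv-python-headless is usually the safer choice on a display-less instance (the package choice here is a suggestion, not taken from the original post):
src/
    sagemaker_entrypoint.py
    requirements.txt        # contents: opencv-python-headless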

How to change the Python version in Azure Machine Learning sdk ContainerImage with CondaDependencies

I am trying to get my Faster R-CNN model into a Container Instance on ACI. For that I need my Docker image to have Python 3.5.*. I specify that in my conda YAML file, but every time I spin up an instance and docker run -it *** /bin/bash into it, I see that it only has Python 3.6.7.
https://user-images.githubusercontent.com/21140767/50680590-82b20b80-1008-11e9-9bfe-4a0e71084ce0.png
How can I get my Docker image to have Python 3.5.*? I already tried conda-installing Python 3.5.2, but that didn't work: the image still ended up with Python 3.6.7 instead of 3.5.2. (dfimage lets you see the Dockerfile from which the image was created: https://hub.docker.com/r/chenzj/dfimage/.)
https://user-images.githubusercontent.com/21140767/50680673-d6245980-1008-11e9-9d48-71a7c150d925.png
My yaml:
name: project_environment
dependencies:
  - python=3.5.2
  - pip:
      - matplotlib
      - opencv-python==3.4.3.18
      - azureml-core==1.0.6
      - numpy
      - cntk
      - cython
channels:
  - anaconda
Notebook cell:
from azureml.core.conda_dependencies import CondaDependencies

svmandss = CondaDependencies.create(python_version="3.5.2", pip_packages=[
    "matplotlib",
    "opencv-python==3.4.3.18",
    "azureml-core",
    "numpy",
    "cntk",
    "cython"])
svmandss.add_channel('anaconda')

with open("fasterrcnn.yml", "w") as f:
    f.write(svmandss.serialize_to_string())
Another notebook cell with the ContainerImage specification:
image_config = ContainerImage.image_configuration(
    execution_script="score_fasterrcnn.py",
    runtime="python",
    conda_file="./fasterrcnn.yml",
    dependencies=listdir("utils"),
    docker_file="./Dockerfile")

service = Webservice.deploy_from_model(workspace=ws,
                                       name='faster-rcnn',
                                       deployment_config=aciconfig,
                                       models=[Model(workspace=ws, name='Faster-RCNN')],
                                       image_config=image_config)

service.wait_for_deployment(show_output=True)
Note
For better readability see my GitHub issue: (https://github.com/Azure/MachineLearningNotebooks/issues/163).
Currently, the Python version is fixed to what's in Azure ML's base image when deploying the web service. We're investigating removing this limitation in the future.
Since this is one of the top Google results when searching for "azureml python version", I'm posting the answer here. The documentation is not very clear on this point, but the following will work:
from azureml.core import Workspace
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

ws = Workspace.from_config()

# This is the important part
conda_dep = CondaDependencies(conda_dependencies_file_path="pipeline/environment.yml")
aml_run_config = RunConfiguration(conda_dependencies=conda_dep)

# Define compute target - must be preconfigured in the workspace
compute_target = ws.compute_targets['my-azureml-target']
aml_run_config.target = compute_target

from azureml.pipeline.steps import PythonScriptStep

script_source_dir = "./pipeline"
step_1_script = "test.py"
step_1 = PythonScriptStep(
    script_name=step_1_script,
    source_directory=script_source_dir,
    compute_target=compute_target,
    runconfig=aml_run_config,
    allow_reuse=True
)

from azureml.pipeline.core import Pipeline

# Build the pipeline
pipeline1 = Pipeline(workspace=ws, steps=[step_1])

from azureml.core import Experiment

# Submit the pipeline to be run
pipeline_run1 = Experiment(ws, 'Test-pipeline').submit(pipeline1)
pipeline_run1.wait_for_completion(show_output=True)
This assumes the following directory structure:
root/
    create_pipeline.py
    pipeline/
        test.py
        environment.yml
where create_pipeline.py is the file above, test.py is the script you would like to run, and environment.yml is the conda environment file, including the Python version.
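A minimal pipeline/environment.yml for this layout might look as follows; the package list is a placeholder, and the python= line is what pins the interpreter version for the run:
name: pipeline_env
dependencies:
  - python=3.5.2
  - pip:
      - azureml-defaults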
I was able to change the Python version by registering the environment in Azure ML Workspace:
from azureml.core import Environment, Workspace
environment = Environment.from_conda_specification(name='myenv', file_path='environment.yml')
environment.python.user_managed_dependencies = False
workspace = Workspace.from_config()
environment = environment.register(workspace=workspace)
env_build = environment.build(workspace=workspace)
Then, configure the endpoint for publishing as follows:
from azureml.core.model import InferenceConfig

environment = Environment.get(workspace=workspace, name='myenv')
inference_config = InferenceConfig(
    entry_script='inference.py',
    source_directory='.',
    environment=environment
)
This is using Azure ML SDK 1.29.0. Perhaps this has already been fixed and the original method works as well, but I didn't test that.
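From there, the inference_config can be handed to a deployment call. A sketch, assuming a model named 'my-model' is already registered and ACI is the target (both names are placeholders):
from azureml.core.model import Model
from azureml.core.webservice import AciWebservice

# Placeholder model and service names; substitute your own.
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)
model = Model(workspace, name='my-model')

service = Model.deploy(workspace, 'myenv-service', [model],
                       inference_config=inference_config,
                       deployment_config=deployment_config)
service.wait_for_deployment(show_output=True)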
EDIT:
This is no longer an issue for me; I found another way to get my code working with Python 3.6.7.
In my view this is still an issue, though: if I ever do need Python 3.5, there is no solution as of now.
You are still welcome to post an answer if you would like.

How to configure Jenkins Cobertura plugin to monitor specific packages?

My project has a number of packages ("models", "controllers", etc.). I've set up Jenkins with the Cobertura plugin to generate coverage reports, which is great. I'd like to mark a build as unstable if coverage drops below a certain threshold, but only on certain packages (e.g., "controllers", but not "models"). I don't see an obvious way to do this in the configuration UI, however -- it looks like the thresholds are global.
Is there a way to do this?
(Answering my own question here)
As far as I can tell, this isn't possible -- I haven't found anything after a couple of days of looking. So I wrote a simple script that does what I want: take the coverage output, parse it, and fail the build if coverage of specific packages doesn't meet the given thresholds. It's dirty and can be cleaned up/expanded, but the basic idea is here. Comments are welcome.
#!/usr/bin/env python
'''
Jenkins' Cobertura plugin doesn't allow marking a build as successful or
failed based on coverage of individual packages -- only the project as a
whole. This script will parse the coverage.xml file and fail if the coverage of
specified packages doesn't meet the thresholds given
'''
import sys

from lxml import etree

PACKAGES_XPATH = etree.XPath('/coverage/packages/package')

def main(argv):
    filename = argv[0]
    package_args = argv[1:] if len(argv) > 1 else []
    # format is package_name:coverage_threshold
    package_coverage = {package: int(coverage) for
                        package, coverage in [x.split(':') for x in package_args]}
    xml = open(filename, 'r').read()
    root = etree.fromstring(xml)
    packages = PACKAGES_XPATH(root)
    failed = False
    for package in packages:
        name = package.get('name')
        if name in package_coverage:
            # We care about this one
            print 'Checking package {} -- need {}% coverage'.format(
                name, package_coverage[name])
            coverage = float(package.get('line-rate', '0.0')) * 100
            if coverage < package_coverage[name]:
                print ('FAILED - Coverage for package {} is {}% -- '
                       'minimum is {}%'.format(
                           name, coverage, package_coverage[name]))
                failed = True
            else:
                print "PASS"
    if failed:
        sys.exit(1)

if __name__ == '__main__':
    main(sys.argv[1:])
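A hypothetical invocation, assuming the script is saved as check_coverage.py and Cobertura's coverage.xml sits in the workspace root; each extra argument is package_name:minimum_percent:
python check_coverage.py coverage.xml controllers:80 models:50
Run as an Execute shell build step after the coverage report is generated, this fails the build (exit code 1) when any listed package falls below its threshold.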
