Trying to find out if you can use multiple files for your dataset in Amazon SageMaker BlazingText.
I am trying to use it in Text Classification mode.
It appears that it's not possible, certainly not in File mode, but I'm wondering whether Pipe mode supports it. I don't want to have all my training data in one file, because if it's generated by an EMR cluster I would need to combine it afterwards, which is clunky.
Thanks!
You are right in that File mode doesn't support multiple files (https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html).
Pipe mode would in theory work but there are a few caveats:
The format expected is Augmented Manifest (https://docs.aws.amazon.com/sagemaker/latest/dg/augmented-manifest.html). This is essentially JSON Lines, for instance:
{"source":"linux ready for prime time ", "label":1}
{"source":"bowled by the slower one ", "label":2}
and then you have to pass the AttributeNames argument to the CreateTrainingJob SageMaker API (it is all explained in the link above; a rough sketch of the call is further down).
With Augmented Manifest, currently only one label is supported.
In order to use Pipe mode, you would need to modify your EMR job to generate the Augmented Manifest format, and you could only use one label per sentence.
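For reference, a minimal sketch of what that CreateTrainingJob call could look like via boto3. The job name, image URI, role ARN and S3 paths are placeholders, and the AttributeNames match the "source"/"label" keys from the JSON lines example above:

import boto3

sm = boto3.client("sagemaker")

sm.create_training_job(
    TrainingJobName="blazingtext-pipe-demo",          # placeholder name
    AlgorithmSpecification={
        "TrainingImage": "<blazingtext-image-uri>",   # region-specific BlazingText image
        "TrainingInputMode": "Pipe",
    },
    RoleArn="<execution-role-arn>",
    InputDataConfig=[{
        "ChannelName": "train",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "AugmentedManifestFile",
                "S3Uri": "s3://my-bucket/train.manifest",   # placeholder manifest location
                "S3DataDistributionType": "FullyReplicated",
                "AttributeNames": ["source", "label"],      # keys from the JSON lines above
            }
        },
        "RecordWrapperType": "RecordIO",                    # augmented manifest in Pipe mode uses RecordIO wrapping
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/output/"},
    ResourceConfig={"InstanceType": "ml.c5.2xlarge", "InstanceCount": 1, "VolumeSizeInGB": 30},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)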
At this stage, concatenating the files generated by your EMR job into a single file seems like the best option.
Related
I am attempting to build a program which restricts users from modifying certain configuration parameters across a variety of items. For example, I would like to allow a user to upload the nginx.conf file or a preconfiguration file for the Linux OS, then be able to identify the key-value pairs in these files (which may have different delimiters), extract these KV pairs, and store them in a database somewhere.
As there is a wide variety of config file structures out there, I was thinking along the lines of using an NLP library that could look for these KV pairs (as opposed to a function in my program based on standard delimiters). Is there anything that you've used before or would recommend? A Go library would be a bonus, as my program is currently written in Go.
I want to collect a point cloud of a simulated space in Gazebo. I have tried scanning the environment, saving the scans as individual PCD files, and then concatenating them, but this did not work. I have also tried to take the scans from Gazebo and visualise them in Open3D, but this ended up being the same as concatenating the PCD files. I know the issue is that I am not transforming the messages correctly, but I have not found a working method with clear steps to execute the transformation. I am doing this on ROS Noetic and would really appreciate help.
You should be using rosbag record for saving topic data inside ROS. This command simply records topic data, saves it to a file, and allows you to analyze or play it back later.
In your situation, you would also need to record the transform topic if you’re having tf issues.
To record data you can simply run a command such as: rosbag record /tf /my_scan_topic
Based on your comments, what you actually want to do is combine multiple scans from the lidar into a single point cloud. The easiest way might be to use the laser_assembler package; since this is all in ROS, the transforms will be handled automatically for you. Then, once you have all your scans assembled, write the result to a PCD file (see the sketch below).
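A rough Python sketch of that flow, assuming a laser_assembler node is already running with its fixed_frame parameter set, and using Open3D (which you mentioned) to write the PCD; the node and file names are just placeholders:

#!/usr/bin/env python
# Sketch: ask laser_assembler for everything it has assembled so far
# and write the result out as a PCD file with Open3D.
import numpy as np
import open3d as o3d
import rospy
import sensor_msgs.point_cloud2 as pc2
from laser_assembler.srv import AssembleScans2

rospy.init_node("assemble_to_pcd")
rospy.wait_for_service("assemble_scans2")
assemble = rospy.ServiceProxy("assemble_scans2", AssembleScans2)

# Request all scans assembled between time 0 and now; the assembler applies
# the tf transforms, so the points come back in the fixed frame.
resp = assemble(rospy.Time(0), rospy.get_rostime())
points = np.array(list(pc2.read_points(resp.cloud, field_names=("x", "y", "z"), skip_nans=True)))

cloud = o3d.geometry.PointCloud()
cloud.points = o3d.utility.Vector3dVector(points)
o3d.io.write_point_cloud("assembled.pcd", cloud)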
In one of my Argo workflow steps a Docker container splits up a large file into a number of smaller files.
The tutorials show how one can save a small and pre-determined number of outputs (e.g., 2 or 3) as artifacts in an S3 bucket by going through each output one at a time.
In my use case, I do not know in advance how many smaller files will be created; it can be upwards of hundreds. The large number of output files makes it hard, if not impossible, to follow the tutorials and specify each one individually, even if I did know in advance how many smaller files would be created.
Is there a way to save all the outputs to an S3 bucket?
This sounds like standard output artifacts. You can put all your files in a single directory and then have the directory be the output artifact.
Here are some examples to help you:
https://argoproj.github.io/argo-workflows/examples/#artifacts
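If it helps, here is a sketch of the splitting step in Python, writing every chunk into one directory (a made-up /mnt/out) which the workflow template would then declare as its single output artifact path:

import os

OUTPUT_DIR = "/mnt/out"  # hypothetical; must match the artifact path declared in the template

def split_file(big_file, lines_per_chunk=100000):
    """Split a large text file into chunks, all written into OUTPUT_DIR."""
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    chunk, idx = [], 0
    with open(big_file) as src:
        for line in src:
            chunk.append(line)
            if len(chunk) >= lines_per_chunk:
                _flush(chunk, idx)
                chunk, idx = [], idx + 1
    if chunk:
        _flush(chunk, idx)

def _flush(lines, idx):
    # Each chunk becomes one file inside the artifact directory.
    with open(os.path.join(OUTPUT_DIR, "part-%05d.txt" % idx), "w") as dst:
        dst.writelines(lines)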
I have a drawing application that renders images (using Quartz2D). I want to be able to run regression tests to determine if a file is rendered consistently, so I can tell whether anything broke following code changes. Are there any APIs that allow me to compare screenshots (or image files) and get some similarity score?
I am not sure if this will suit your needs (it doesn't return a score), but this software lets you compare images and decide which parts to ignore:
Visual CI
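If you do want a numeric score rather than a visual diff, a small Python sketch using scikit-image's structural similarity is one option (this is separate from Visual CI, and the file names are placeholders):

import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity as ssim

def similarity_score(path_a, path_b):
    """Return a structural similarity score; 1.0 means the renders are identical."""
    a = np.asarray(Image.open(path_a).convert("L"))  # compare as 8-bit grayscale
    b = np.asarray(Image.open(path_b).convert("L"))
    return ssim(a, b, data_range=255)

score = similarity_score("baseline_render.png", "new_render.png")
print("similarity:", score)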
I am looking for libraries that would help in programmatically manipulating EPS (Encapsulated PostScript) files. Basically, what I want to do is the following:
Show / Hide preexisting layers in the EPS file (toggle them on and off)
Fill (color) named shapes in the EPS file
Retrieve coordinates of named points in the EPS file
Draw shapes on a new layer in the EPS file
All of this on a server, without user interaction (scripting Adobe Illustrator won't work).
I am aware that the EPS file format is based on the PostScript language and must therefore be interpreted - for creating simple drawings from scratch this is rather easy. But for actually modifying existing files, I guess you need a library that interprets the file and provides some kind of "DOM" for manipulation.
Can I even have named shapes and points inside an EPS file?
EDIT: Assuming I had the layers saved in separate EPS files. Or better still: Just the "data" part of the layers. Could I then concatenate this stuff to create a new EPS file? And append drawing commands? Fill existing named objects?
This is extremely difficult, and here is why: a PS file is a program whose execution results in pixels put on a page. Instructions in a PS program are at the level of "draw a line using the current pen and color" or "rotate the coordinate system by 90 degrees"; there is no notion of layers or complex objects as you would see them in a vector drawing application.
There are very few conventions in the structure of PS files that allow external programs to modify them: pages are marked separately, and font resources and media dimensions are spelled out in special comments. This is especially true for Encapsulated PostScript (EPS), which must follow these guidelines because it is meant to be read by applications, but not for general PS as it is sent to a printer. A PS program is at a much lower level of abstraction than what you need, and there is no way to reconstruct that higher-level structure from arbitrary PS code. In principle, a PS file could even produce different output every time it is printed, because it may query its execution environment and branch based on random decisions.
Applications like Adobe Illustrator emit PS code that follows a rigid structure, so there is a chance that it could be parsed and manipulated without interpreting the code. I would still suggest rethinking the current architecture: you are working at too low a level of abstraction for what you need.
PDF is not really manipulable either, since in general it is not possible to change existing parts of a PDF, only to add to it. EPS is the same as PostScript except that it has a bounding box header.
The problem with doing what you want is that PS is a programming language whose output is (mostly) some kind of image. So the question could just as well be stated as "how can I draw shapes on a new layer in the Java file". You probably need to generate the complete PS on the fly, or use another image format altogether.
I am not aware of any available libraries for this, but you may be able to build something to meet your needs based on epstool from Ghostscript/GSview.
I think your best bet is to generate a PDF from the EPS, manipulate the PDF, and then convert back to EPS. PDF is much more "manipulable" than EPS.
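A rough Python sketch of that round trip, shelling out to Ghostscript (assumes gs is on your PATH; the actual manipulation of the intermediate PDF, e.g. with a PDF library, is left out):

import subprocess

def eps_to_pdf(eps_path, pdf_path):
    # Convert EPS to PDF, cropping to the EPS bounding box.
    subprocess.run(
        ["gs", "-dBATCH", "-dNOPAUSE", "-dEPSCrop",
         "-sDEVICE=pdfwrite", "-sOutputFile=" + pdf_path, eps_path],
        check=True,
    )

def pdf_to_eps(pdf_path, eps_path):
    # Convert the (manipulated) PDF back to EPS.
    subprocess.run(
        ["gs", "-dBATCH", "-dNOPAUSE",
         "-sDEVICE=eps2write", "-sOutputFile=" + eps_path, pdf_path],
        check=True,
    )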