Apache Beam TextIO.Read with line number - google-cloud-dataflow

Is it possible to get access to line numbers with the lines read into the PCollection from TextIO.Read? For context here, I'm processing a CSV file and need access to the line number for a given line.
If not possible through TextIO.Read it seems like it should be possible using some kind of custom Read or transform, but I'm having trouble figuring out where to begin.

You can use FileIO to read the file manually, where you can determine the line number when you read from the ReadableFile.
A simple solution can look as follows:
p
    .apply(FileIO.match().filepattern("/file.csv"))
    .apply(FileIO.readMatches())
    .apply(FlatMapElements
        .into(strings())
        .via((FileIO.ReadableFile f) -> {
            List<String> result = new ArrayList<>();
            try (BufferedReader br = new BufferedReader(Channels.newReader(f.open(), "UTF-8"))) {
                int lineNr = 1;
                String line = br.readLine();
                while (line != null) {
                    result.add(lineNr + "," + line);
                    line = br.readLine();
                    lineNr++;
                }
            } catch (IOException e) {
                throw new RuntimeException("Error while reading", e);
            }
            return result;
        }));
The solution above just prepends the line number to each input line.
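Because the line number is joined to the line with a comma, a later step has to split on the first comma only; otherwise the CSV fields of the line itself would be split apart as well. A minimal sketch in plain Java, where the class and method names (`NumberedLine`, `parse`) are illustrative and not part of Beam:

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

public class NumberedLine {
    // Splits "lineNr,rest-of-line" on the FIRST comma only, so commas
    // inside the original CSV line are preserved.
    static Map.Entry<Long, String> parse(String prefixed) {
        int comma = prefixed.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("No line-number prefix: " + prefixed);
        }
        long lineNr = Long.parseLong(prefixed.substring(0, comma));
        String line = prefixed.substring(comma + 1);
        return new SimpleImmutableEntry<>(lineNr, line);
    }

    public static void main(String[] args) {
        Map.Entry<Long, String> e = parse("3,alice,42,London");
        System.out.println(e.getKey() + " -> " + e.getValue()); // prints "3 -> alice,42,London"
    }
}
```

Inside a pipeline, the same logic could sit in a transform that follows the FlatMapElements step, though how the result is typed there is left open here.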

Related

How to pipe to a process using vala/glib

I'm trying to pipe output from echo into a command using GLib's spawn_command_line_sync method. The problem I've run into is echo is interpreting the entire command as the argument.
To better explain, I run this in my code:
string command = "echo \"" + some_var + "\" | command";
Process.spawn_command_line_sync (command.escape (),
out r, out e, out s);
I would expect the variable to be echoed to the pipe and the command run with the data piped, however when I check on the result it's just echoing everything after echo like this:
"some_var's value" | command
I think I could just use the Posix class to run the command but I like having the result, error and status values to listen to that the spawn_command_line_sync method provides.
The problem is that you are providing shell syntax to what is essentially the kernel’s exec() syscall. The shell pipe operator redirects the stdout of one process to the stdin of the next. To implement that using Vala, you need to get the file descriptor for the stdin of the command process which you’re running, and write some_var to it manually.
You are combining two subprocesses into one. Instead, echo and command should be treated separately and have a pipe set up between them. For some reason many examples on Stack Overflow and other sites use the Process.spawn_* functions, but GSubprocess offers a simpler API.
This example pipes the output of find . to sort and then prints the output to the console. The example is a bit longer because it is a fully working example and makes use of a GMainContext for asynchronous calls. GMainContext is used by GMainLoop, GApplication and GtkApplication:
void main () {
var mainloop = new MainLoop ();
SourceFunc quit = ()=> {
mainloop.quit ();
return Source.REMOVE;
};
read_piped_commands.begin ("find .", "sort", quit);
mainloop.run ();
}
async void read_piped_commands (string first_command, string second_command, SourceFunc quit) {
var output = splice_subprocesses (first_command, second_command);
try {
string? line = null;
do {
line = yield output.read_line_async ();
print (#"$(line ?? "")\n");
}
while (line != null);
} catch (Error error) {
print (#"Error: $(error.message)\n");
}
quit ();
}
DataInputStream splice_subprocesses (string first_command, string second_command) {
InputStream end_pipe = null;
try {
var first = new Subprocess.newv (first_command.split (" "), STDOUT_PIPE);
var second = new Subprocess.newv (second_command.split (" "), STDIN_PIPE | STDOUT_PIPE);
second.get_stdin_pipe ().splice (first.get_stdout_pipe (), CLOSE_TARGET);
end_pipe = second.get_stdout_pipe ();
} catch (Error error) {
print (#"Error: $(error.message)\n");
}
return new DataInputStream (end_pipe);
}
It is the splice_subprocesses function that answers your question. It takes the STDOUT from the first command as an InputStream and splices it with the OutputStream (STDIN) for the second command.
The read_piped_commands function takes the output from the end of the pipe. This is an InputStream that has been wrapped in a DataInputStream to give access to the read_line_async convenience method.
Here's the full, working implementation:
try {
string[] command = {"command", "-options", "-etc"};
string[] env = Environ.get ();
Pid child_pid;
string some_string = "This is what gets piped to stdin";
int stdin;
int stdout;
int stderr;
Process.spawn_async_with_pipes ("/",
command,
env,
SpawnFlags.SEARCH_PATH | SpawnFlags.DO_NOT_REAP_CHILD,
null,
out child_pid,
out stdin,
out stdout,
out stderr);
FileStream input = FileStream.fdopen (stdin, "w");
input.write (some_string.data);
/* Make sure we close the process using its pid */
ChildWatch.add (child_pid, (pid, status) => {
Process.close_pid (pid);
});
} catch (SpawnError e) {
/* Do something w the Error */
}
I guess playing with the FileStream is what really made it hard to figure this out. Turned out to be pretty straightforward.
Building on the previous answers, an interesting variation is a general-purpose program that takes the command and its input as program arguments and pipes the input to the command:
pipe.vala:
void main (string[] args) {
try {
string command = args[1];
var subproc = new Subprocess(STDIN_PIPE | STDOUT_PIPE, command);
var data = args[2].data;
var input = new MemoryInputStream.from_data(data, GLib.free);
subproc.get_stdin_pipe ().splice (input, CLOSE_TARGET);
var end_pipe = subproc.get_stdout_pipe ();
var output = new DataInputStream (end_pipe);
string? line = null;
do {
line = output.read_line();
print (#"$(line ?? "")\n");
} while (line != null);
} catch (Error error) {
print (#"Error: $(error.message)\n");
}
}
build:
$ valac --pkg gio-2.0 pipe.vala
and run:
$ ./pipe sort "cc
ab
aa
b
"
Output:
aa
ab
b
cc

I am not able to parse IOS driver page source

I got the page source using
String pageSource = driver.getPageSource();
Now I need to save this XML to a local cache, so that I can read element attributes such as x and y from it rather than calling element.getAttribute("x") every time. But I am not able to parse the pageSource XML because of some special characters. I cannot simply remove these characters, because then an element's value/text would no longer match the original. Appium does this the same way.
I was facing the same issue and resolved it with the code below, which I wrote and which works fine:
public static void removeEscapeCharacter(File xmlFile) {
String pattern = "(\\\"([^=])*\\\")";
String contentBuilder = null;
try {
contentBuilder = Files.toString(xmlFile, Charsets.UTF_8);
} catch (IOException e1) {
e1.printStackTrace();
}
if (contentBuilder == null)
return;
Pattern pattern2 = Pattern.compile(pattern);
Matcher matcher = pattern2.matcher(contentBuilder);
StrBuilder sb = new StrBuilder(contentBuilder);
while (matcher.find()) {
String str = matcher.group(1).substring(1, matcher.group(1).length() - 1);
try {
sb = sb.replaceFirst(StrMatcher.stringMatcher(str),
StringEscapeUtils.escapeXml(str));
} catch (Exception e) {
e.printStackTrace();
}
}
try {
Writer output = null;
output = new BufferedWriter(new FileWriter(xmlFile, false));
output.write(sb.toString());
output.close();
} catch (IOException e) {
e.printStackTrace();
}
}
If you run into this kind of problem, catch the parse error, escape the special characters, and parse again.
try {
    doc = db.parse(file);
} catch (Exception e) {
    // Escape the offending attribute values in place, then retry.
    removeEscapeCharacter(file);
    doc = db.parse(file);
}
It might work for you.
I was able to do the same using SAXParser with a handler added for this.
Refer to SAX Parser.
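As a rough sketch of the SAX approach mentioned above, a `DefaultHandler` can collect the `x` and `y` attributes of every element in a single pass, so they do not have to be fetched one by one with `element.getAttribute("x")`. The class name `PageSourceCoordinates`, the method `collect`, and the sample XML are illustrative assumptions; real page source may be shaped differently, and this does not by itself fix unescaped special characters.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PageSourceCoordinates {
    // Returns one "elementName:x,y" entry per element that carries both attributes.
    static List<String> collect(String pageSource) {
        List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                String x = attrs.getValue("x");
                String y = attrs.getValue("y");
                if (x != null && y != null) {
                    found.add(qName + ":" + x + "," + y);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(pageSource)), handler);
        } catch (Exception e) {
            // Parsing fails on malformed XML, e.g. unescaped special characters.
            throw new IllegalStateException("Could not parse page source", e);
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical, simplified page source for illustration only.
        String xml = "<AppiumAUT><XCUIElementTypeButton name=\"OK\" x=\"10\" y=\"20\"/></AppiumAUT>";
        System.out.println(collect(xml));
    }
}
```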

How to convert sequence file generated in mahout to text file

I have been looking for a parser to convert the generated sequence file (.seq) to a normal text file, so that I can inspect intermediate outputs. I would be glad to hear from anyone who has come across a way to do this.
I think you can create a SequenceFile reader in a few lines of code, as below:
public static void main(String[] args) throws IOException {
String uri = "path/to/your/sequence/file";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
Path path = new Path(uri);
SequenceFile.Reader reader = null;
try {
reader = new SequenceFile.Reader(fs, path, conf);
Writable key = (Writable) ReflectionUtils.newInstance(
reader.getKeyClass(), conf);
Writable value = (Writable) ReflectionUtils.newInstance(
reader.getValueClass(), conf);
long position = reader.getPosition();
while (reader.next(key, value)) {
System.out.println("Key: " + key + " value:" + value);
position = reader.getPosition();
}
} finally {
if (reader != null) reader.close(); // reader stays null if the constructor threw
}
}
Suppose you have sequence data in HDFS under /ex-seqdata/part-000...
The part-* files are in binary format.
You can run the command hadoop fs -text /ex-seqdata/part*
at the command prompt to get the data in human-readable format.

Is it possible to subscribe to a statement of an EPL module?

I have deployed an epl module with the code:
InputStream inputFile = this.getClass().getClassLoader().getResourceAsStream("Temperature.epl");
if (inputFile == null) {
inputFile = this.getClass().getClassLoader().getResourceAsStream("etc/Temperature.epl");
}
if (inputFile == null) {
throw new RuntimeException("Failed to find file 'Temperature.epl' in classpath or relative to classpath");
}
try {
epService.getEPAdministrator().getDeploymentAdmin().readDeploy(inputFile, null, null, null);
// subscribers OK, tested before with epService.getEPAdministrator().createEPL()
// statements OK, printed
EPStatement statement;
statement = epService.getEPAdministrator().getStatement("Monitor");
System.out.println(statement.getText() + ";");
statement.setSubscriber(new MonitorEventSubscriber());
statement = epService.getEPAdministrator().getStatement("Warning");
System.out.println(statement.getText() + ";");
statement.setSubscriber(new WarningEventSubscriber());
statement = epService.getEPAdministrator().getStatement("Error");
System.out.println(statement.getText() + ";");
statement.setSubscriber(new ErrorEventSubscriber());
}
catch (Exception e) {
throw new RuntimeException("Error deploying EPL from 'Temperature.epl': " + e.getMessage(), e);
}
I can get the statements' text with statement.getText(), but the subscribers are not activated. What is wrong?
I'm working with Esper 5.0.0.
Seeing that your code uses the current classloader, you'd want to make sure the classloader is the same, or else you can get different engine instances.
Also have your code actually send an event to see if it matches, since this code doesn't send events.

How to identify correct url

I have a list of URLs in a text file that I am using for a performance test. Because some URLs were not formed correctly, a java.io.IOException was thrown. I would like to know how to check that a URL is well formed, and whether it is working. I have more than 35K URLs, so checking them manually would take a lot of time.
To check whether the URLs are properly formed, try constructing a URI object from each string.
e.g.:
public void validURLs(List<String> urlList)
{
    int line = 1;
    for (String s : urlList)
    {
        try
        {
            // The constructor throws URISyntaxException for malformed input.
            URI test = new URI(s);
        }
        catch (Exception e)
        {
            System.err.println(s + " is not a valid URL, item " + line);
        }
        line++;
    }
}
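Note that `new URI(s)` also accepts relative references such as `foo`, so a syntax check alone may let through strings that are not usable as URLs. A stricter sketch, assuming only absolute http/https URLs with a host are wanted (the class and method names are illustrative):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UrlCheck {
    // True only for syntactically valid, absolute http(s) URIs that have a host.
    static boolean isWellFormedHttpUrl(String s) {
        try {
            URI uri = new URI(s.trim());
            String scheme = uri.getScheme();
            return scheme != null
                    && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))
                    && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] samples = {"http://example.com/a?b=1", "foo", "http://exa mple.com"};
        for (String s : samples) {
            System.out.println(s + " -> " + isWellFormedHttpUrl(s));
        }
    }
}
```

This still only checks form. Whether a URL actually responds can only be established by sending a request to it, which is outside the scope of this sketch.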
