Can the Sphinx Search Engine index a folder of XML files?

I have folders that contain XML files that I need to index in Sphinx. I explored the xmlpipe2 driver, and my understanding is that it only reads XML generated from a script, i.e., a single source. Is there a way to index a folder of XML files if I don't have the option of putting it all in a single XML file?

An xmlpipe2 script is just a script that outputs XML, which Sphinx then ingests.
It doesn't matter where that script gets the data it outputs.
It could get it from your XML files: the script would just walk the folder structure, read each file, and output the combined XML.
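As an illustration, here is a minimal sketch of such a script in Java. The folder path and the single "content" field are placeholder assumptions; the sphinx:schema should declare whatever fields your documents actually have.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Stream;

public class XmlPipe {
    public static void main(String[] args) throws IOException {
        // xmlpipe2 envelope: one docset, one schema, then one document per file
        System.out.println("<?xml version=\"1.0\" encoding=\"utf-8\"?>");
        System.out.println("<sphinx:docset>");
        System.out.println("<sphinx:schema><sphinx:field name=\"content\"/></sphinx:schema>");
        AtomicLong id = new AtomicLong(1);
        try (Stream<Path> paths = Files.walk(Paths.get("/path/to/xml/folder"))) {
            paths.filter(p -> p.toString().endsWith(".xml")).forEach(p -> {
                try {
                    // Escape the file contents so they survive as field text;
                    // a real script would extract the fields it needs instead.
                    String escaped = Files.readString(p)
                            .replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
                    System.out.printf("<sphinx:document id=\"%d\"><content>%s</content></sphinx:document>%n",
                            id.getAndIncrement(), escaped);
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
        }
        System.out.println("</sphinx:docset>");
    }
}

In sphinx.conf, the xmlpipe2 source would then point xmlpipe_command at this program (e.g. xmlpipe_command = java XmlPipe).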

Related

tool to convert avpr file to avdl file

Avro's IDL page documents that avro-tools.jar has an idl command converting an avdl file to an avpr file. Is there a way to go in the other direction, from an avpr file to an avdl file?
I was unable to find any documentation on this matter but given that the two formats appear to contain the same data with different syntax, it should be possible to convert both ways.
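For reference, the documented forward direction is just the idl command on avro-tools (file names here are placeholders):

java -jar avro-tools.jar idl myprotocol.avdl myprotocol.avpr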
I have written a Java utility to create an IDL from a bunch of Avro schemas; it is part of spf4j-avro, see that project for more detail. It makes schemas a lot more readable...

Read files from a PCollection of GCS filenames in Pipeline?

I have a streaming pipeline hooked up to pub/sub that publishes filenames of GCS files. From there I want to read each file and parse out the events on each line (the events are what I ultimately want to process).
Can I use TextIO? Can it be used in a streaming pipeline when the filenames are determined during execution (as opposed to using TextIO as a source, where the filenames are known at construction time)? If not, I'm thinking of doing something like the following:
Get the topic from pub/sub
ParDo to read each file and get the lines
Process the lines of the file...
Could I use the FileBasedReader or something similar in this case to read the files? The files aren't too big, so I wouldn't need to parallelize the reading of a single file, but I would need to read a lot of files.
You can use the TextIO.readAll() transform, which was recently added to Beam in #3443. For example:
PCollection<String> filenames = p.apply(PubsubIO.readStrings()...);
PCollection<String> lines = filenames.apply(TextIO.readAll());
This will read all lines in each file arriving over pubsub.
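A fuller version of the same pipeline might look like the sketch below; the topic name is a placeholder, and ParseEventFn stands in for whatever DoFn you write to parse a line into an event.

Pipeline p = Pipeline.create(options);
PCollection<String> filenames =
    p.apply(PubsubIO.readStrings().fromTopic("projects/my-project/topics/gcs-files"));
PCollection<String> lines = filenames.apply(TextIO.readAll());
lines.apply(ParDo.of(new ParseEventFn())); // ParseEventFn: your line-parsing DoFn
p.run();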

How to use flume for uploading zip files to hdfs sink

I am new to Flume. My Flume agent has an HTTP server as its source, from which it receives zip files (compressed XML files) at regular intervals. These zip files are very small (less than 10 MB), and I want the extracted contents put into the HDFS sink. Please share some ideas on how to do this. Do I have to go for a custom interceptor?
Flume will try to read your files line by line, unless you configure a specific deserializer. A deserializer lets you control how the file is parsed and split into events. You could of course follow the example of the blob deserializer, which is designed for PDFs and such, but I understand that you actually want to unpack the files and then read them line by line. In that case you would need to write a custom deserializer that reads the zip and emits line-by-line events.
Here's the reference in the documentation:
https://flume.apache.org/FlumeUserGuide.html#event-deserializers
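A minimal sketch of such a deserializer, assuming each zip contains a single entry. The class and builder names are mine, and mark/reset are left as no-ops, so this sketch does not support transactional replay.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipInputStream;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.serialization.EventDeserializer;
import org.apache.flume.serialization.ResettableInputStream;

public class ZipLineDeserializer implements EventDeserializer {
    private final BufferedReader reader;

    ZipLineDeserializer(Context context, final ResettableInputStream in) throws IOException {
        // Adapt Flume's ResettableInputStream to a plain java.io.InputStream
        // so java.util.zip can consume it.
        InputStream adapted = new InputStream() {
            @Override public int read() throws IOException { return in.read(); }
            @Override public int read(byte[] b, int off, int len) throws IOException {
                return in.read(b, off, len);
            }
        };
        ZipInputStream zip = new ZipInputStream(adapted);
        zip.getNextEntry(); // position on the first (and assumed only) entry
        reader = new BufferedReader(new InputStreamReader(zip, StandardCharsets.UTF_8));
    }

    @Override public Event readEvent() throws IOException {
        String line = reader.readLine();
        return line == null ? null : EventBuilder.withBody(line, StandardCharsets.UTF_8);
    }

    @Override public List<Event> readEvents(int numEvents) throws IOException {
        List<Event> events = new ArrayList<>(numEvents);
        for (int i = 0; i < numEvents; i++) {
            Event e = readEvent();
            if (e == null) break;
            events.add(e);
        }
        return events;
    }

    @Override public void mark() throws IOException { /* replay not supported in this sketch */ }
    @Override public void reset() throws IOException { /* replay not supported in this sketch */ }
    @Override public void close() throws IOException { reader.close(); }

    public static class Builder implements EventDeserializer.Builder {
        @Override public EventDeserializer build(Context context, ResettableInputStream in) {
            try {
                return new ZipLineDeserializer(context, in);
            } catch (IOException e) {
                throw new RuntimeException("failed to open zip stream", e);
            }
        }
    }
}

Depending on the source type, it would then be wired in via the source's deserializer property (e.g. for the spooling directory source, deserializer = com.example.ZipLineDeserializer$Builder).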

Printing out Javadocs

Something that I've had a good hard look for, and have not been able to find, is how to efficiently obtain a hard copy of Javadocs. Obviously, one solution is simply to navigate to each page and execute a browser print, but there's got to be a better way! Do you guys have any ideas?
You can use DocBook Doclet (dbdoclet) to create DocBook XML from your JavaDoc comments. The DocBook XML can then be transformed to PDF or (single-page) HTML.
You can call the tool from the command line. Point it at your source files and it will generate the DocBook XML. This works similarly to the javadoc command, which generates the JavaDoc HTML. Example:
./dbdoclet -sourcepath ~/my-java-program/src/main/java -subpackages org.example
The result is a DocBook XML file in a dbdoclet subdirectory which can be used to create a PDF or HTML file. This can also be done from the command line; I am using the docbkx-maven-plugin for this.
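For the Maven route, the transformation is run via the plugin's goals, along the lines of:

mvn docbkx:generate-pdf

(goal name as documented by the docbkx-maven-plugin; the pom configuration has to point at the generated DocBook file).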
You can do mass conversions with it, but it would require some time to make it work the way you want.

How can I generate a SWC from asset files dynamically?

Let's say you have 3 SWF files in a directory:
/game/assets/
1.swf
2.swf
3.swf
What I need to do is package these up into a SWC file, and then move that SWC file to the libs/ directory.
I plan to use Ant, so this step must always occur before the compilation stage.
Today I use a VBS file to generate an XML file. Then I use that XML file to generate an AssetMap, which is a series of [Embed]s (1.swf, 2.swf, 3.swf) exposed as ByteArrays.
I then pass these byte arrays to Loader.loadBytes() to generate a MovieClip.
But this real-time ByteArray conversion is far too slow. I'd prefer to have direct access to instances, like I do with a SWC.
Can anyone offer me advice?
