How to convert sequence file generated in mahout to text file - mahout

I have been looking for a parser to convert a sequence file (.seq) generated by Mahout into a plain text file, so that I can inspect the intermediate outputs. I would be glad to hear from anyone who has come across a way to do this.

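I think you can create a SequenceFile reader in a few lines of code, as below:
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// The class name is only a placeholder.
public class SequenceFileDump {
    public static void main(String[] args) throws IOException {
        String uri = "path/to/your/sequence/file";
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);
        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, path, conf);
            // The key and value classes recorded in the file must be on the classpath.
            Writable key = (Writable) ReflectionUtils.newInstance(
                    reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(
                    reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                System.out.println("Key: " + key + " value: " + value);
            }
        } finally {
            // reader is still null if the constructor threw.
            if (reader != null) {
                reader.close();
            }
        }
    }
}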

Suppose you have sequence data in HDFS under /ex-seqdata/part-000...
The part-* files are in binary format.
You can run the command hadoop fs -text /ex-seqdata/part*
at the command prompt to print the data in human-readable form.

Related

Saxon CS: transform.doTransform cannot find out file from first transformation on windows machine but can on mac

I am creating an Azure Functions application to validate XML files using a zip folder of Schematron files.
I have run into a compatibility issue with how the URIs for the files are created on Mac versus Windows.
The files are downloaded from a zip on Azure Blob Storage and then extracted to the function's local storage.
When a colleague runs the transform method of the SaxonCS API on a Windows machine, the first transformation runs and produces the stage1.out file, but the second transformation throws an exception stating that it cannot find the file, even though it is present in the temp directory.
On Mac the URI is /var/folders/6_/3x594vpn6z1fjclc0vx4v89m0000gn/T, and on Windows the library looks for the file at file:///C:/Users/44741/AppData/Local/Temp/, but it cannot find the file on the Windows machine even if it is moved out of temp storage.
Unable to retrieve URI file:///C:/Users/44741/Desktop/files/stage1.out
The file is present at this location, but for some reason the library cannot pick it up on the Windows machine, while it works fine on my Mac. I am using Path.Combine to build the URI.
Has anyone else run into this issue before?
The code used for the transformations is below.
{
try
{
var transform = new Transform();
transform.doTransform(GetTransformArguments(arguments[Constants.InStage1File],
arguments[Constants.SourceDir] + "/" + schematronFile, arguments[Constants.Stage1Out]));
transform.doTransform(GetTransformArguments(arguments[Constants.InStage2File], arguments[Constants.Stage1Out],
arguments[Constants.Stage2Out]));
transform.doTransform(GetFinalTransformArguments(arguments[Constants.InStage3File], arguments[Constants.Stage2Out],
arguments[Constants.Stage3Out]));
Log.Information("Stage 3 out file written to : " + arguments[Constants.Stage3Out]);
return true;
}
catch (FileNotFoundException ex)
{
Log.Warning("Cannot find files" + ex);
return false;
}
}
private static string[] GetTransformArguments(string xslFile, string inputFile, string outputFile)
{
return new[]
{
"-xsl:" + xslFile,
"-s:" + inputFile,
"-o:" + outputFile
};
}
private static string[] GetFinalTransformArguments(string xslFile, string inputFile, string outputFile)
{
return new[]
{
"-xsl:" + xslFile,
"-s:" + inputFile,
"-o:" + outputFile,
"allow-foreign=true",
"generate-fired-rule=true"
};
}
Assuming you do not need the intermediate results as files and only want the final result (which I assume is the Schematron schema compiled to XSLT), you could run XSLT 3.0 through the SaxonCS API (using Saxon.Api) by compiling and chaining your three stylesheets, e.g.
using Saxon.Api;
string isoSchematronDir = @"C:\SomePath\SomeDir\iso-schematron-xslt2";
string[] isoSchematronXslts = { "iso_dsdl_include.xsl", "iso_abstract_expand.xsl", "iso_svrl_for_xslt2.xsl" };
Processor processor = new(true);
var xsltCompiler = processor.NewXsltCompiler();
var baseUri = new Uri(Path.Combine(isoSchematronDir, isoSchematronXslts[2]));
xsltCompiler.BaseUri = baseUri;
var isoSchematronStages = isoSchematronXslts.Select(xslt => xsltCompiler.Compile(new Uri(baseUri, xslt)).Load30()).ToList();
isoSchematronStages[2].SetStylesheetParameters(new Dictionary<QName, XdmValue>() { { new QName("allow-foreign"), new XdmAtomicValue(true) } });
using (var schematronIs = File.OpenRead("price.sch"))
{
using (var compiledOs = File.OpenWrite("price.sch.xsl"))
{
isoSchematronStages[0].ApplyTemplates(
schematronIs,
isoSchematronStages[1].AsDocumentDestination(
isoSchematronStages[2].AsDocumentDestination(processor.NewSerializer(compiledOs))
)
);
}
}
If you only need the compiled Schematron in order to validate an XML instance document against it, you could even capture the compiled Schematron in an XdmDestination and feed its XdmNode to the XsltCompiler, e.g.
using Saxon.Api;
string isoSchematronDir = @"C:\SomePath\SomeDir\iso-schematron-xslt2";
string[] isoSchematronXslts = { "iso_dsdl_include.xsl", "iso_abstract_expand.xsl", "iso_svrl_for_xslt2.xsl" };
Processor processor = new(true);
var xsltCompiler = processor.NewXsltCompiler();
var baseUri = new Uri(Path.Combine(isoSchematronDir, isoSchematronXslts[2]));
xsltCompiler.BaseUri = baseUri;
var isoSchematronStages = isoSchematronXslts.Select(xslt => xsltCompiler.Compile(new Uri(baseUri, xslt)).Load30()).ToList();
isoSchematronStages[2].SetStylesheetParameters(new Dictionary<QName, XdmValue>() { { new QName("allow-foreign"), new XdmAtomicValue(true) } });
var compiledSchematronXslt = new XdmDestination();
using (var schematronIs = File.OpenRead("price.sch"))
{
isoSchematronStages[0].ApplyTemplates(
schematronIs,
isoSchematronStages[1].AsDocumentDestination(
isoSchematronStages[2].AsDocumentDestination(compiledSchematronXslt)
)
);
}
var schematronValidator = xsltCompiler.Compile(compiledSchematronXslt.XdmNode).Load30();
using (var sampleIs = File.OpenRead("books.xml"))
{
schematronValidator.ApplyTemplates(sampleIs, processor.NewSerializer(Console.Out));
}
The last example writes the XSLT/Schematron validation SVRL output to the console but could of course also write it to a file.

Apache Beam TextIO.Read with line number

Is it possible to get access to line numbers with the lines read into the PCollection from TextIO.Read? For context here, I'm processing a CSV file and need access to the line number for a given line.
If not possible through TextIO.Read it seems like it should be possible using some kind of custom Read or transform, but I'm having trouble figuring out where to begin.
You can use FileIO to read the file manually, where you can determine the line number when you read from the ReadableFile.
A simple solution can look as follows:
p
.apply(FileIO.match().filepattern("/file.csv"))
.apply(FileIO.readMatches())
.apply(FlatMapElements
.into(strings())
.via((FileIO.ReadableFile f) -> {
List<String> result = new ArrayList<>();
try (BufferedReader br = new BufferedReader(Channels.newReader(f.open(), "UTF-8"))) {
int lineNr = 1;
String line = br.readLine();
while (line != null) {
result.add(lineNr + "," + line);
line = br.readLine();
lineNr++;
}
} catch (IOException e) {
throw new RuntimeException("Error while reading", e);
}
return result;
}));
The solution above just prepends the line number to each input line.
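The numbering step itself does not depend on Beam, so it can be tried out in isolation. The following is a minimal sketch in plain Java; the class name LineNumbering and the method numberLines are hypothetical names chosen for illustration, and the sample CSV in main is made up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: the same loop as in the lambda above, without Beam.
final class LineNumbering {

    /** Returns every line of the input prefixed with its 1-based line number and a comma. */
    static List<String> numberLines(BufferedReader reader) throws IOException {
        List<String> result = new ArrayList<>();
        int lineNr = 1;
        for (String line = reader.readLine(); line != null; line = reader.readLine()) {
            result.add(lineNr + "," + line);
            lineNr++;
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample data; the header row counts as line 1.
        String csv = "id,name\n1,alice\n2,bob\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(csv))) {
            numberLines(reader).forEach(System.out::println);
        }
    }
}
```

Inside the pipeline, the body of the lambda could delegate to such a helper. Note that the whole file is still collected into one in-memory list per file, as in the answer above.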

Generate torrent links from server-side

I don't know a lot about torrents, at least not enough to understand how certain websites can offer both a normal download link and a torrent link to download a file uploaded by a user.
Is generating a torrent link something common and simple to achieve? Would I need a server installation?
I've made an ugly C# implementation from a Java source, and to make sure some of my encoded variables were correct, I used NBEncode from Lars Warholm.
// There are 'args' because I'm using it from command-line. (arg0 is an option not used here)
// Source file
args[1] = Directory.GetCurrentDirectory() + args[1];
// Name to give to the torrent file
args[2] = Directory.GetCurrentDirectory() + args[2];
var inFileStream = new FileStream(args[1], FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
var filename = args[2];
// BEncoding with NBEncode
var transform = new BObjectTransform();
MemoryStream s = new MemoryStream();
OSS.NBEncode.Entities.BDictionary bod = new OSS.NBEncode.Entities.BDictionary();
OSS.NBEncode.Entities.BDictionary meta = new OSS.NBEncode.Entities.BDictionary();
// Preparing the first part of the file by creating BEncoded objects
string announceURL = "https://www.mysite.com/announce";
int pieceLength = 512 * 1024;
bod.Value.Add(new BByteString("name"), new OSS.NBEncode.Entities.BByteString(filename));
bod.Value.Add(new BByteString("length"), new OSS.NBEncode.Entities.BInteger(inFileStream.Length));
bod.Value.Add(new BByteString("piece length"), new OSS.NBEncode.Entities.BInteger(pieceLength));
bod.Value.Add(new BByteString("pieces"), new BByteString(""));
meta.Value.Add(new BByteString("announce"), new BByteString(announceURL));
meta.Value.Add(new BByteString("info"), bod);
byte[] pieces = hashPieces(args[1], pieceLength);
transform.EncodeObject(meta, s);
// Notice that the encoded data ends with a "pieces" dictionary entry holding an empty string.
byte[] trs = s.ToArray();
s.Close();
inFileStream.Close();
// I don't know how to write an array of bytes using the NBEncode library, so let's continue manually.
// All data has been written to a MemoryStream, except the byte array with the SHA1 hash of each piece of the file.
Stream st = new FileStream(filename + ".torrent", FileMode.Create);
BinaryWriter bw = new BinaryWriter(st);
// Let's write these Bencoded variables to the torrent file:
// The -4 is there to skip the current end of the file created by NBEncode
for (int i = 0; i < trs.Length - 4; i++)
{
bw.BaseStream.WriteByte(trs[i]);
}
// We'll add the length of the pieces SHA1 hashes:
var bt = stringToBytes(pieces.Length.ToString() + ":");
// Then we'll close the Bencoded dictionary with 'ee'
var bu = stringToBytes("ee");
// Let's append this to the end of the file.
foreach (byte b in bt)
{
bw.BaseStream.WriteByte(b);
}
foreach (byte b in pieces)
{
bw.BaseStream.WriteByte(b);
}
foreach (byte b in bu)
{
bw.BaseStream.WriteByte(b);
}
bw.Close();
st.Close();
// That's it.
}
Functions used:
private static byte[] stringToBytes(String str)
{
System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding();
Byte[] bytes = encoding.GetBytes(str);
return bytes;
}
private static byte[] hashPieces(string file, int pieceLength)
{
    using (SHA1 sha1 = SHA1.Create())
    using (FileStream input = File.OpenRead(file))
    using (MemoryStream pieces = new MemoryStream())
    {
        byte[] buffer = new byte[pieceLength];
        int filled = 0;
        int readCount;
        // Read may return fewer bytes than requested, so keep filling the same piece.
        while ((readCount = input.Read(buffer, filled, pieceLength - filled)) != 0)
        {
            filled += readCount;
            if (filled == pieceLength)
            {
                // A full piece is buffered: append its 20-byte SHA1 digest.
                byte[] digest = sha1.ComputeHash(buffer, 0, filled);
                pieces.Write(digest, 0, digest.Length);
                filled = 0;
            }
        }
        // Hash the final, shorter piece if the file length is not a multiple of pieceLength.
        if (filled > 0)
        {
            byte[] digest = sha1.ComputeHash(buffer, 0, filled);
            pieces.Write(digest, 0, digest.Length);
        }
        return pieces.ToArray();
    }
}
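For readers working from the Java side, the piece-hashing step can be sketched as follows. This is only an illustration under stated assumptions: the class name PieceHasher is hypothetical, the input is taken as an InputStream rather than a file path, and pieceLength is assumed to be greater than zero.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper mirroring hashPieces above.
final class PieceHasher {

    /** Concatenates the 20-byte SHA-1 digest of each pieceLength-sized chunk of the stream. */
    static byte[] hashPieces(InputStream in, int pieceLength)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        ByteArrayOutputStream pieces = new ByteArrayOutputStream();
        byte[] buffer = new byte[pieceLength];
        int filled = 0;
        int read;
        // read() may return fewer bytes than requested, so keep filling the same piece.
        while ((read = in.read(buffer, filled, pieceLength - filled)) != -1) {
            filled += read;
            if (filled == pieceLength) {
                sha1.update(buffer, 0, filled);
                pieces.write(sha1.digest()); // digest() also resets the MessageDigest
                filled = 0;
            }
        }
        if (filled > 0) { // final piece, shorter than pieceLength
            sha1.update(buffer, 0, filled);
            pieces.write(sha1.digest());
        }
        return pieces.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Five bytes with a piece length of two give three pieces (2 + 2 + 1 bytes).
        byte[] hashes = hashPieces(new ByteArrayInputStream(new byte[5]), 2);
        System.out.println(hashes.length + " bytes of piece hashes");
    }
}
```

The result is the raw byte string that goes into the "pieces" entry of the info dictionary, 20 bytes per piece.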
It depends on how you're trying to create it. If you run a website, and want to generate torrent files from uploaded files, then you'll obviously need server-side code.
Generating a torrent file involves: adding the files you want to the torrent, and adding tracker info. Some popular trackers are:
http://open.tracker.thepiratebay.org/announce
http://www.torrent-downloads.to:2710/announce
To create the .torrent file, you'll have to read about the format of the file. A piece of Java that generates .torrent files is given at https://stackoverflow.com/a/2033298/384155
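To give a feel for the file format, here is a minimal sketch of a bencoder in Java that can produce the metainfo for a single-file torrent. It is not taken from the linked answer; the class name Bencode and the methods encode and singleFileTorrent are hypothetical, and the announce URL in main is a placeholder. The dictionary keys (announce, info, name, length, piece length, pieces) are the same ones used in the C# code above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: a small bencoder, just enough for a single-file torrent.
final class Bencode {

    /** Bencodes a String, byte[], Integer, Long, List, or Map with String keys. */
    static byte[] encode(Object value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        write(value, out);
        return out.toByteArray();
    }

    private static void write(Object value, ByteArrayOutputStream out) {
        if (value instanceof String) {
            write(((String) value).getBytes(StandardCharsets.UTF_8), out);
        } else if (value instanceof byte[]) {
            byte[] bytes = (byte[]) value;
            writeAscii(bytes.length + ":", out);   // byte string: <length>:<bytes>
            out.write(bytes, 0, bytes.length);
        } else if (value instanceof Integer || value instanceof Long) {
            writeAscii("i" + value + "e", out);    // integer: i<number>e
        } else if (value instanceof List) {
            out.write('l');
            for (Object item : (List<?>) value) {
                write(item, out);
            }
            out.write('e');
        } else if (value instanceof Map) {
            // Dictionary keys must be written in sorted order. TreeMap sorts Strings,
            // which matches byte order for the ASCII keys used here.
            Map<String, Object> sorted = new TreeMap<>();
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                sorted.put((String) entry.getKey(), entry.getValue());
            }
            out.write('d');
            for (Map.Entry<String, Object> entry : sorted.entrySet()) {
                write(entry.getKey(), out);
                write(entry.getValue(), out);
            }
            out.write('e');
        } else {
            throw new IllegalArgumentException("Cannot bencode " + value);
        }
    }

    private static void writeAscii(String text, ByteArrayOutputStream out) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        out.write(bytes, 0, bytes.length);
    }

    /** Builds the metainfo bytes of a single-file torrent from precomputed piece hashes. */
    static byte[] singleFileTorrent(String announceUrl, String name, long length,
                                    int pieceLength, byte[] pieceHashes) {
        Map<String, Object> info = new HashMap<>();
        info.put("name", name);
        info.put("length", length);
        info.put("piece length", pieceLength);
        info.put("pieces", pieceHashes);
        Map<String, Object> meta = new HashMap<>();
        meta.put("announce", announceUrl);
        meta.put("info", info);
        return encode(meta);
    }

    public static void main(String[] args) {
        // Placeholder values; a real torrent needs 20 bytes of SHA-1 per piece.
        byte[] torrent = singleFileTorrent("https://tracker.example/announce",
                "file.bin", 5L, 2, new byte[60]);
        System.out.println(torrent.length + " bytes of metainfo");
    }
}
```

Because the encoder writes raw byte arrays directly, the "pieces" value does not have to be patched into the output by hand as in the C# version above.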

OleDbConnection to Excel File in MOSS 2007 Shared Documents

I need to programmatically open an Excel file that is stored in a MOSS 2007 Shared Documents list. I'd like to use an OleDbConnection so that I can return the contents of the file as a DataTable. I believe this is possible, since a number of articles on the Web imply that it is. Currently my code fails when trying to initialize a new connection (oledbConn = new OleDbConnection(_connStringName);). The error message is:
Format of the initialization string does not conform to specification starting at index 0.
I believe I am just not able to figure the right path to the file. Here is my code:
public DataTable GetData(string fileName, string workSheetName, string filePath)
{
// filePath == C:\inetpub\wwwroot\wss\VirtualDirectories\80\MySpWebAppName\Shared Documents\FY12_FHP_SPREADSHEET.xlsx
// Initialize global vars
_connStringName = DataSource.Conn_Excel(fileName, filePath).ToString();
_workSheetName = workSheetName;
dt = new DataTable();
//Create the connection object
if (!string.IsNullOrEmpty(_connStringName))
{
SPSecurity.RunWithElevatedPrivileges(delegate()
{
oledbConn = new OleDbConnection(_connStringName);
try
{
oledbConn.Open();
//Create OleDbCommand obj and select data from worksheet GrandTotals
OleDbCommand cmd = new OleDbCommand("SELECT * FROM " + _workSheetName + ";", oledbConn);
//create new OleDbDataAdapter
OleDbDataAdapter oleda = new OleDbDataAdapter();
oleda.SelectCommand = cmd;
oleda.Fill(dt);
}
catch (Exception ex)
{
System.Diagnostics.Debug.WriteLine(ex.Message);
}
finally
{
oledbConn.Close();
}
});
}
return dt;
}
public static OleDbConnection Conn_Excel(string ExcelFileName, string filePath)
{
// filePath == C:\inetpub\wwwroot\wss\VirtualDirectories\80\MySpWebAppName\Shared Documents\FY12_FHP_SPREADSHEET.xlsx
OleDbConnection myConn = new OleDbConnection();
myConn.ConnectionString = string.Format(@"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + filePath + ";Extended Properties=Excel 12.0");
return myConn;
}
What am I doing wrong, or is there a better way to get the Excel file contents as a DataTable?
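I ended up using the open source project Excel Data Reader. (The original error is consistent with Conn_Excel(...).ToString() returning the type name System.Data.OleDb.OleDbConnection rather than a connection string; the ConnectionString property of the returned connection would be the value to pass instead.)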

Accessing the 'Media' directory of a Blackberry within the JDK

I am trying to use JSR 75 to access media saved under the '/home/video/' directory on the device, using BlackBerry JDK 4.6.1. A single line of code throws a 'FileSystem IO Error' exception, which is, as usual, unhelpful in the extreme.
fconn = (FileConnection)Connector.open("file:///home/user/videos/"+name, Connector.READ);
Has anyone tried to do this? I can open files within my jar, but can't seem to access the media folder. I have the javax.microedition.io.Connector.file.read permission set and my application is signed.
There are two kinds of filesystem on BlackBerry: SDCard and store. You have to pick one of them and name it in the path. The standard directory on the SDCard where video, music, etc. are stored is "file:///SDCard/BlackBerry".
String standardPath = "file:///SDCard/BlackBerry";
String videoDir = System.getProperty("fileconn.dir.videos.name");
String fileName = "video.txt";
String path = standardPath+"/"+videoDir+"/"+fileName;
String content = "";
FileConnection fconn = null;
DataInputStream is = null;
ByteVector bytes = new ByteVector();
try {
    fconn = (FileConnection) Connector.open(path, Connector.READ);
    is = fconn.openDataInputStream();
    // Read the file byte by byte until end of stream.
    int c = is.read();
    while (-1 != c) {
        bytes.addElement((byte) c);
        c = is.read();
    }
} catch (IOException e) {
    e.printStackTrace();
} finally {
    // Close the stream and the connection even if reading fails.
    try {
        if (is != null) is.close();
        if (fconn != null) fconn.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
content = new String(bytes.toArray());
add(new RichTextField(content));
See also
SUN Dev Network - Getting Started with the FileConnection APIs
RIM Forum - Some questions about FileConnection/JSR 75
Use System.getProperty("fileconn.dir.memorycard") to check if SDCard available
How to save & delete a Bitmap image in Blackberry Storm?
