Extracting the text inside a docx file - f#

I am using the code below to read a .docx file, and it successfully extracts the text. The problem is that it extracts only the text and drops the line breaks and spacing. For example, if my document looks like this:
I am line 1
I am line 2 I am some other text
then it returns:
I am line 1I am line 2I am some other text.
I want the text exactly as it appears in the document. How can I do that? Below is the code I am using now.
    open System
    open System.IO
    open System.IO.Packaging
    open System.Xml

    let getDocxContent (path: string) =
        use package = Package.Open(path, FileMode.Open)
        let stream = package.GetPart(new Uri("/word/document.xml",UriKind.Relative)).GetStream()
        stream.Seek(0L, SeekOrigin.Begin) |> ignore
        let xmlDoc = new XmlDocument()
        xmlDoc.Load(stream)
        xmlDoc.DocumentElement.InnerText

    let docData = getDocxContent @"C:\a1.docx"
    printfn "%s" docData

You need to set the PreserveWhitespace property on your XmlDocument before loading it.
So change the code from:
    let xmlDoc = new XmlDocument()
    xmlDoc.Load(stream)
To:
    let xmlDoc = new XmlDocument()
    xmlDoc.PreserveWhitespace <- true
    xmlDoc.Load(stream)
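To see what PreserveWhitespace changes in isolation, here is a minimal sketch that parses a small in-memory XML string rather than a real .docx part. The innerText helper and the sample XML are made up for illustration. It only shows that whitespace-only nodes between elements survive when the flag is set; whether that is enough for a particular document depends on whether its document.xml actually contains whitespace between paragraphs.

```fsharp
open System.Xml

// Hypothetical helper: parse an XML string and return the root element's InnerText.
let innerText (preserve: bool) (xml: string) : string =
    let doc = XmlDocument()
    // Must be set before loading; it controls whether whitespace-only nodes are kept.
    doc.PreserveWhitespace <- preserve
    doc.LoadXml xml
    doc.DocumentElement.InnerText

// Made-up sample: two elements separated by a line break.
let sampleXml = "<root><p>I am line 1</p>\n<p>I am line 2</p></root>"

let collapsed = innerText false sampleXml   // whitespace between the <p> elements is dropped
let preserved = innerText true sampleXml    // the line break between the <p> elements is kept

printfn "%s" collapsed
printfn "%s" preserved
```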

Related

Writing a function that replaces a string using OpenText, ReadLine and WriteLine

I have to write a function that, given a filename, a needle and a replacement, replaces the needle with the replacement in a given text document. The function has to use System.IO.File.OpenText, WriteLine and ReadLine. I'm currently stuck here: the function seems to overwrite the given text document instead of replacing the needle.
    open System
    let fileReplace (filename : string) (needle : string) (replace : string) : unit =
        try // uses try-with to catch fail-cases
            let lines = seq {
                use file = IO.File.OpenText filename // uses OpenText
                while not file.EndOfStream // runs through the file
                    do yield file.ReadLine().Replace(needle, replace)
                file.Close()
            }
            use writer = IO.File.CreateText filename // creates the file
            for line in lines
                do writer.Write line
        with
            _ -> failwith "Something went wrong opening this file" // uses failwith exception
    let filename = @"C:\Users\....\abc.txt"
    let needle = "string" // given string already appearing in the text
    let replace = "string" // Whatever string that needs to be replaced
    fileReplace filename needle replace
The problem with your code is that you are using a lazy sequence when reading the lines. When you use seq { .. }, the body is not actually evaluated until it is needed. In your example, that is when the for loop iterates over lines, but before the code gets there, you call CreateText and overwrite the file!
You can fix this by using a list, which is evaluated immediately. You also need to replace Write with WriteLine, but the rest works!
    let fileReplace (filename : string) (needle : string) (replace : string) : unit =
        try // uses try-with to catch fail-cases
            let lines =
                [ use file = IO.File.OpenText filename // uses OpenText
                  while not file.EndOfStream do // runs through the file
                      yield file.ReadLine().Replace(needle, replace)
                ]
            use writer = IO.File.CreateText filename // creates the file
            for line in lines do
                writer.WriteLine line
        with
            _ -> failwith "Something went wrong opening this file" // uses failwith exception
I also removed the Close call, because use takes care of that for you.
EDIT: I put back the required do keywords - I was confused by your formatting. Most people would write them at the end of the previous line as in my updated version.
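The difference between the lazy seq { .. } and the eager list can be seen without touching any file. The following is a small sketch, with a made-up events log standing in for the file operations, that records the order in which the bodies actually run.

```fsharp
// Records the order in which things actually happen.
let events = System.Collections.Generic.List<string>()

// seq { .. } only describes the work; nothing runs yet.
let lazyLines =
    seq {
        events.Add "seq body ran"
        yield "a"
    }
events.Add "seq defined"

// A list expression runs its body immediately.
let eagerLines =
    [ events.Add "list body ran"
      yield "b" ]
events.Add "list defined"

// Iterating the sequence is what finally runs its body.
for line in lazyLines do
    ignore line

printfn "%A" (List.ofSeq events)
```

The list body runs before "list defined" is logged, while the seq body runs only once the for loop starts, which is why CreateText got to the file first in the original code.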

JSON Type provider with Deedle (F#): how to read several JSON files on HD?

Suppose I have a lot of files with the same JSON format saved on my hard drive.
I can recover the samples by doing the following:
    type TypeA = JsonProvider<".../Documents/FileA.json">
    let sampleA = TypeA.GetSamples()
But if I have many files (for example, with their names in a list), how do I recover all the samples?
JsonProvider supplies a number of methods for parsing data supplied at runtime into the data types generated by the type provider:
    //Load from named file/website
    member this.Load(uri: string): this.Root[]
    //Load data from stream source
    member this.Load(reader: System.IO.TextReader): this.Root[]
    member this.Load(stream: System.IO.Stream): this.Root[]
    //Load data from named file/website (async)
    member this.AsyncLoad(uri: string): Async<this.Root[]>
    //Load data directly from string
    member this.Parse(text: string): this.Root[]
These will all load the relevant data into an array of the type generated from the static parameter of the type provider. For example:
    open FSharp.Data

    type TypeA = JsonProvider<"C:\\DataTemp\\FileA.json">

    let directory = "C:\\DataTemp\\"
    let files: string[] =
        [|
            "FileA.json"
            "FileB.json"
            "FileC.json"
            "FileD.json"
        |]

    [<EntryPoint>]
    let main argv =
        let sampleA = TypeA.GetSamples()
        let sampleB = TypeA.Load(directory+"FileB.json")
        let allData = files |> Array.collect (fun f -> TypeA.Load(directory+f))
        for d in allData do
            printfn "%A" d
        //etc
        0
Note that these will not always strictly enforce the schema. For example, string values are allowed to be missing and will silently be replaced by empty strings; extra data is allowed to be present and will be loaded into the JsonValue data, but will be inaccessible through the statically typed properties, and so on.

How to close a stream properly?

I have been assigned a task to write a program that will:
Open a file.
Read the content.
Replace a specific word with another word.
Save the changes to the file.
I know for sure that my code can open, read and replace words. The problem occurs when I add the "Save the changes to the file" part. Here is the code:
    open System.IO
    //Getting the filename, needle and replace-word.
    System.Console.WriteLine "What is the name of the file?"
    let filename = string (System.Console.ReadLine ())
    System.Console.WriteLine "What is your needle?"
    let needle = string (System.Console.ReadLine ())
    System.Console.WriteLine "What you want your needle replaced with?"
    let replace = string (System.Console.ReadLine ())
    //Saves the content of the file
    let mutable saveLine = ""
    //Opens a stream to read the file
    let reader = File.OpenText filename
    //Reads the file, and replaces the needle.
    let printFile (reader : System.IO.StreamReader) =
        while not(reader.EndOfStream) do
            let line = reader.ReadLine ()
            let lineReplace = line.Replace(needle,replace)
            saveLine <- saveLine + lineReplace
            printfn "%s" lineReplace
    //Opens a stream to write to the file
    let readerWrite = File.CreateText(filename)
    //Writes to the file
    let editFile (readerWrite : System.IO.StreamWriter) =
        File.WriteAllText(filename,saveLine)
    printf "%A" (printFile reader)
I get the error message "Sharing violation on path...", which makes me believe that the reading stream does not close properly. I have tried playing around with the structure of my code and tried different things from the .NET library, but I always get the same error message. Any help is much appreciated.
Streams are normally closed by calling Stream.Close() or disposing them.
System.IO has methods to read or write complete files from/to arrays of lines. This would shorten the operation to something like this:
    File.ReadAllLines filePath
    |> Array.map (fun line -> line.Replace(needle, replace))
    |> fun editedLines -> File.WriteAllLines(filePath, editedLines)
What documentation are you using? Have a look at the MSDN documentation for System.IO and the similar MSDN documentations for various things in .NET/the CLR; these answer questions like this one quickly.
I retained most of your original code, although it's not very idiomatic. If you use use with disposable resources, .NET will clean up after you. See for example F# Docs and Fun&Profit, the latter also has a nice section on Expressions and syntax.
If you execute your code, you should get System.IO.IOException:
    Unhandled Exception: System.IO.IOException: The process cannot access the file 'C:\Users\xcs\Documents\Visual Studio 2015\Projects\StackOverflow6\ConsoleApplication11\bin\Release\testout.txt' because it is being used by another process.
       at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
       at System.IO.FileStream.Init(String path, FileMode mode, FileAccess access, Int32 rights, Boolean useRights, FileShare share, Int32 bufferSize, FileOptions options, SECURITY_ATTRIBUTES secAttrs, String msgPath, Boolean bFromProxy, Boolean useLongPath, Boolean checkHost)
       at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, FileOptions options, String msgPath, Boolean bFromProxy, Boolean useLongPath, Boolean checkHost)
       at System.IO.StreamWriter.CreateFile(String path, Boolean append, Boolean checkHost)
       at System.IO.StreamWriter..ctor(String path, Boolean append, Encoding encoding, Int32 bufferSize, Boolean checkHost)
       at System.IO.StreamWriter..ctor(String path, Boolean append)
       at System.IO.File.CreateText(String path)
       at Program.op@46(Unit unitVar0) in C:\Users\xcs\Documents\Visual Studio 2015\Projects\StackOverflow6\ConsoleApplication11\Program.fs:line 74
       at Program.main(String[] argv) in C:\Users\xcs\Documents\Visual Studio 2015\Projects\StackOverflow6\ConsoleApplication11\Program.fs:line 83
It starts at line 83, which is the call to the function, goes to line 74. Line 74 is the following: let readerWrite = File.CreateText(filename). Nowhere in your code have you closed reader. There is also another problem, you're opening a StreamWriter with File.CreateText. And then you're trying to write to this opened stream with File.WriteAllText, which opens the file, writes to it and closes it. So a bunch of IO handles are floating around there...
To quickly fix it consider the following:
    open System.IO
    //Getting the filename, needle and replace-word.
    System.Console.WriteLine "What is the name of the file?"
    let filename = string (System.Console.ReadLine ())
    System.Console.WriteLine "What is your needle?"
    let needle = string (System.Console.ReadLine ())
    System.Console.WriteLine "What you want your needle replaced with?"
    let replace = string (System.Console.ReadLine ())
    //Saves the content of the file
    //Opens a stream to read the file
    //let reader = File.OpenText filename
    //Reads the file, and replaces the needle.
    let printFile (filename:string) (needle:string) (replace:string) =
        let mutable saveLine = ""
        use reader = File.OpenText filename //use will ensure that the stream is disposed once it's out of scope, i.e. when the function exits
        while not(reader.EndOfStream) do
            let line = reader.ReadLine ()
            let lineReplace = line.Replace(needle,replace)
            saveLine <- saveLine + lineReplace + "\r\n" //you will need a newline character
            printfn "%s" lineReplace
        saveLine
    //Writes to the file
    let editFile (filename:string) (saveLine:string) =
        File.WriteAllText(filename,saveLine) //you don't need a stream here, since File.WriteAllText will open, write, then close the file
    let saveLine = printFile filename needle replace //read the file into saveLine
    editFile filename saveLine //write saveLine into the file
It does a couple of things:
1. creates the StreamReader inside printFile
2. binds it to reader with use, not let, to ensure it is closed once we don't need it anymore
3. adds a line break to the string, since you insist on rebuilding a mutable string
4. encapsulates the mutable saveLine inside the function
5. passes the needle and replace arguments explicitly
6. returns a new string to be used in 7.
7. gets rid of the StreamWriter by using File.WriteAllText, and also passes in the filename and the string to write explicitly
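If the task requires an explicit reader and writer rather than File.WriteAllText, the same scoping idea still applies. The sketch below is one possible arrangement, not the only one: the replaceInFile name and the temp-file usage are made up for illustration. The reader is bound with use inside its own block, so it is disposed before the writer opens the same path.

```fsharp
open System.IO

// Hypothetical helper: replace `needle` with `replacement` in the file at `path`.
let replaceInFile (path: string) (needle: string) (replacement: string) : unit =
    let edited =
        // The reader lives only inside this block and is disposed when the block ends.
        use reader = File.OpenText path
        [ while not reader.EndOfStream do
            yield reader.ReadLine().Replace(needle, replacement) ]
    // The reader is already closed here, so opening the same path for writing does not conflict.
    use writer = File.CreateText path
    for line in edited do
        writer.WriteLine line

// Usage on a throwaway temp file.
let tempPath = Path.GetTempFileName()
File.WriteAllLines(tempPath, [| "a needle here"; "no match"; "needle needle" |])
replaceInFile tempPath "needle" "pin"
let result = File.ReadAllLines tempPath
printfn "%A" result
File.Delete tempPath
```

The list is built eagerly, so every line has been read before the reader is disposed and before CreateText truncates the file.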

How to pass F# a string and get the result back in C#

I am a SQL developer and am really new to both F# and C#. I need help with how to pass a string to the F# function below and return the result from F# to C#.
Description of project:
I am using the Stanford POS tagger to tag a sentence with parts of speech.
This is the reference link from which I copied the code:
(http://sergey-tihon.github.io/Stanford.NLP.NET/StanfordPOSTagger.html)
    module File1
    open java.io
    open java.util
    open edu.stanford.nlp.ling
    open edu.stanford.nlp.tagger.maxent
    // Path to the folder with models
    let modelsDirectry =
        __SOURCE_DIRECTORY__ + @"..\stanford-postagger-2013-06-20\models\"
    // Loading POS Tagger
    let tagger = MaxentTagger(modelsDirectry + "wsj-0-18-bidirectional-nodistsim.tagger")
    let tagTexrFromReader (reader:Reader) =
        let sentances = MaxentTagger.tokenizeText(reader).toArray()
        sentances |> Seq.iter (fun sentence ->
            let taggedSentence = tagger.tagSentence(sentence :?> ArrayList)
            printfn "%O" (Sentence.listToString(taggedSentence, false))
        )
    // Text for tagging
    let text = System.Console.ReadLine();
    tagTexrFromReader <| new StringReader(text)
It doesn't matter whether the caller is C# or F#. To make a function that takes a string and returns, let's say, an int, you just need something like this (put it in some MyModule.fs):
    namespace MyNamespace
    module MyModule =
        // this is your function with one argument (a string named input) and a result of int
        let myFun (input : string) : int =
            // do whatever you have to
            5 // the value of the last line will be your result - in this case the integer 5
Call it from C#/.NET with:
    int result = MyNamespace.MyModule.myFun("Hallo");
I hope this helps you out a bit
For your example this would be:
    let myFun (text : string) =
        use reader = new StringReader(text)
        tagTexrFromReader reader
As you'll have this in the module File1, you can just call it with var res = File1.myFun(text);
BTW: use is in there because StringReader is IDisposable and using use F# will dispose the object when you exit the scope.
PS: is tagTexrFromReader a typo?
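Since tagTexrFromReader in the question only prints, a C# caller gets nothing back from it. A rough sketch of returning the tagged text instead is shown here. The Stanford libraries are not assumed to be available, so fakeTag is a hypothetical stand-in that just appends "/X" to every word, and it uses the .NET StringReader rather than the java.io one; the real code would call tagger.tagSentence and collect the strings instead of printing them.

```fsharp
open System.IO

// Hypothetical stand-in for the Stanford tagger: it marks every word with "/X".
// The real code would call tagger.tagSentence here instead.
let fakeTag (reader: TextReader) : string[] =
    reader.ReadToEnd().Split([| ' ' |], System.StringSplitOptions.RemoveEmptyEntries)
    |> Array.map (fun word -> word + "/X")

// Returns the tagged text instead of printing it, so the caller receives a string.
let tagText (text: string) : string =
    use reader = new StringReader(text)
    fakeTag reader |> String.concat " "

printfn "%s" (tagText "A short sentence")
```

If these definitions lived in the module File1, a C# caller could presumably write string tagged = File1.tagText(text); to receive the result.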

F# lazy eval from stream reader?

I'm running into a bug in my code that makes me think that I don't really understand some of the details about F# and lazy evaluation. I know that F# evaluates eagerly and therefore am somewhat perplexed by the following function:
    // Open a file, then read from it. Close the file. return the data.
    let getStringFromFile =
        File.OpenRead("c:\\eo\\raw.txt")
        |> fun s -> let r = new StreamReader(s)
                    let data = r.ReadToEnd
                    r.Close()
                    s.Close()
                    data
When I call this in FSI:
    > let d = getStringFromFile();;
    System.ObjectDisposedException: Cannot read from a closed TextReader.
       at System.IO.__Error.ReaderClosed()
       at System.IO.StreamReader.ReadToEnd()
       at <StartupCode$FSI_0134>.$FSI_0134.main@()
    Stopped due to error
This makes me think that getStringFromFile is being evaluated lazily--so I'm totally confused. I'm not getting something about how F# evaluates functions.
For a quick explanation of what's happening, let's start here:
    let getStringFromFile =
        File.OpenRead("c:\\eo\\raw.txt")
        |> fun s -> let r = new StreamReader(s)
                    let data = r.ReadToEnd
                    r.Close()
                    s.Close()
                    data
You can rewrite the first two lines of your function as:
    let s = File.OpenRead(@"c:\eo\raw.txt")
Next, you've omitted the parentheses on this method:
    let data = r.ReadToEnd
    r.Close()
    s.Close()
    data
As a result, data has the type unit -> string. When you return this value from your function, the entire result is unit -> string. But look what happens in between assigning your variable and returning it: you closed your streams.
End result, when a user calls the function, the streams are already closed, resulting in the error you're seeing above.
And don't forget to dispose your objects by declaring use whatever = ... instead of let whatever = ....
With that in mind, here's a fix:
    let getStringFromFile() =
        use s = File.OpenRead(@"c:\eo\raw.txt")
        use r = new StreamReader(s)
        r.ReadToEnd()
You don't read from your file. You bind method ReadToEnd of your instance of StreamReader to the value data and then call it when you call getStringFromFile(). The problem is that the stream is closed at this moment.
I think you have missed the parentheses, and here's the corrected version:
    // Open a file, then read from it. Close the file. return the data.
    let getStringFromFile () =
        File.OpenRead("c:\\eo\\raw.txt")
        |> fun s -> let r = new StreamReader(s)
                    let data = r.ReadToEnd()
                    r.Close()
                    s.Close()
                    data
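The effect of the missing parentheses can be reproduced without a file. This is a minimal sketch using an in-memory StringReader in place of the StreamReader from the question; the names deferred and outcome are made up for illustration.

```fsharp
open System
open System.IO

let reader = new StringReader("some text")

// No parentheses: this does not read anything, it only captures the method.
let deferred : unit -> string = reader.ReadToEnd

reader.Close()

// Invoking the captured method now hits the already-closed reader.
let outcome =
    try
        deferred ()
    with
    | :? ObjectDisposedException -> "reader was already closed"

printfn "%s" outcome
```

Nothing here is lazy in the seq sense: the read simply never happened until deferred () was invoked, by which point Close had already run.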

Resources