I have a lot of text in a log file, but sometimes the responses contain XML code that I have to cut out and move to other files.
For example:
sThread1....dsadasdsadsadasdasdasdas.......dasdasdasdadasdasdasdadadsada
important xml code to cut and move to other file: <response><important> 1 </import...></response>
important xml code to other file: <response><important> 2 </important...></response>
sThread2....dsadasdsadsadasdasdasdas.......dasdasdasdadasdasdasdadadsada
Hindrance: the XML code starts at a different character position in each line (it does not always start at the same offset).
Please help me find a method to locate the XML code in the text.
Right now I have tested the substring() method, but the XML code does not always start at the same position :(
EDIT:
I found what I wanted; the function I was searching for was indexOf().
I needed the character position where the String "Response is : " ends, so I used:
int positionOfXmlInLine = lineTxt.indexOf("<response")
And after this I can cut the string to the end of the line:
def cuttedText = lineTxt.substring(positionOfXmlInLine);
So now I have only the XML text/code from the log file.
Next is parsing the XML values, as BDKosher wrote below.
Hopefully this will help someone.
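For readers in other languages, the same indexOf/substring idea can be sketched in Python (the sample line here is made up; str.find plays the role of indexOf and slicing plays the role of substring):

```python
# str.find returns the index of the first match, or -1 if the marker is
# absent; slicing from that index cuts the rest of the line, like substring.
line_txt = 'sThread1...noise... Response is : <response><important> 1 </important></response>'

position = line_txt.find('<response')
if position != -1:
    cut_text = line_txt[position:]  # only the XML part of the line remains
```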
You might be able to leverage XmlSlurper for this, assuming your XML is valid enough. The code below takes each line of the log, wraps it in a root element, and parses it. Once parsed, it extracts and prints the value of the <important> element's value attribute, but you could instead do whatever you need to with the data:
def input = '''
sThread1..sdadassda..sdadasdsada....sdadasdas...
important code to cut and move to other file: <response><important value="1"></important></response>
important code to other file: <response><important value="3"></important></response>
sThread2..dsadasd.s.da.das.d.as.das.d.as.da.sd.a.
'''
def parser = new XmlSlurper()
input.eachLine { line, lineNo ->
    def output = parser.parseText("<wrapper>$line</wrapper>")
    if (!output.response.isEmpty()) {
        println "Line $lineNo is of importance ${output.response.important.@value.text()}"
    }
}
This prints out:
Line 2 is of importance 1
Line 3 is of importance 3
Related
I am trying to bulk-download the text visible to the "end-user" from 10-K SEC EDGAR reports (I don't care about the tables) and save it in a text file. I found the code below on YouTube, but I am facing 2 challenges:
I am not sure if I am capturing all the text, and when I print the result of the code below, I get very weird output (special characters, e.g., at the very end of the print-out).
I can't seem to save the text in .txt files; I'm not sure if this is due to encoding (I am entirely new to programming).
import re
import requests
import unicodedata
from bs4 import BeautifulSoup

def restore_windows_1252_characters(restore_string):
    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''
    return re.sub(r'[\u0080-\u0099]', to_windows_1252, restore_string)

# define the url to the specific html_text file
new_html_text = r"https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt"

# grab the response
response = requests.get(new_html_text)
page_soup = BeautifulSoup(response.content, 'html5lib')
page_text = page_soup.html.body.get_text(' ', strip=True)

# normalize the text, remove characters; additionally, restore missing Windows-1252 characters
page_text_norm = restore_windows_1252_characters(unicodedata.normalize('NFKD', page_text))

# print: this works, however it gives me weird special characters (e.g., at the very end)
print(page_text_norm)

# save to file: this only gives me an empty text file
with open('testfile.txt', 'w') as file:
    file.write(page_text_norm)
Try this. If you include an example of the output you expect, it will be easier for people to understand your needs.
from simplified_scrapy import SimplifiedDoc,req,utils
url = 'https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt'
html = req.get(url)
doc = SimplifiedDoc(html)
# text = doc.body.text
text = doc.body.unescape() # Converting HTML entities
utils.saveFile("testfile.txt",text)
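If the empty testfile.txt comes from an encoding problem, a likely fix in the original requests/BeautifulSoup code is to pass an explicit encoding to open(). A minimal, self-contained sketch of just that step (the sample text is made up):

```python
# On platforms whose default file encoding cannot represent every character,
# file.write() can raise UnicodeEncodeError and leave an empty or truncated
# file behind. An explicit encoding avoids this.
page_text = 'Net income \u2026 \u20ac1,000'  # stand-in for the scraped text

with open('testfile.txt', 'w', encoding='utf-8') as f:
    f.write(page_text)

with open('testfile.txt', encoding='utf-8') as f:
    saved = f.read()
```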
I have a file shaped like this:
How can I parse a file like this with Talend Open Studio ?
Here's what I tried :
In the tJavaRow, the input is the whole file in a single row. I split it and parse it manually, but I can't figure out how to create an output row for each OBJ in the file.
Is this the "right" way of doing it? Or is there a specific component for this type of file?
But I can't figure out how to create an output row for each OBJ in the file
You can do this by using the tJavaFlex component:
Put your raw content in the globalMap by connecting it to tFlowToIterate
Put your split-and-parse logic in the "Start Code" part of tJavaFlex, using the contents you made available in step 1
Start a loop in the "Start Code" part of tJavaFlex (e.g. for each object)
Define your output schema in tJavaFlex
In the "Main Code" part of tJavaFlex, map your parsed object to the columns of your output row
Don't forget to close your loop in the "End Code" part of tJavaFlex
I laid out a quick example, with no parsing logic. But since you already have that down, I think it should be sufficient:
Start Code
String[] lines = ((String)globalMap.get("row1.content")).split("\r\n");
for(String line : lines) { // starts the "generating" loop
Main Code
row2.key = line; // uses the "generating" loop
End Code
} // closes the "generating" loop
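The three tJavaFlex sections behave like a generator: setup plus loop header, a body emitting one row per iteration, and a closing brace. For illustration only, the same shape sketched in Python (the field name key is made up):

```python
# "Start Code": split the raw content and open the loop.
# "Main Code": map each parsed object onto one output row.
# "End Code": close the loop (implicit in Python).
def rows(raw_content):
    for line in raw_content.split('\r\n'):
        yield {'key': line}  # one output row per OBJ

result = list(rows('OBJ1\r\nOBJ2'))
```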
I've stored a string in the database. When I save and retrieve the string and the result I'm getting is as following:
This is my new object
Testing multiple lines
-- Test 1
-- Test 2
-- Test 3
That is what I get from a println command when I call the save and index methods.
But when I show it on screen, it's displayed like:
This is my object Testing multiple lines -- Test 1 -- Test 2 -- Test 3
I already tried showing it like the following:
${adviceInstance.advice?.encodeAsHTML()}
But still the same thing.
Do I need to replace \n with <br> or something like that? Is there an easier way to show it properly?
Common problems like this have a variety of solutions:
1> You could replace \n with <br>
so either in your controller/service or if you like in gsp:
${adviceInstance.advice?.replace('\n','<br>')}
2> display the content in a read-only textarea
<g:textArea name="something" readonly="true">
${adviceInstance.advice}
</g:textArea>
3> Use the <pre> tag
<pre>
${adviceInstance.advice}
</pre>
4> Use css white-space http://www.w3schools.com/cssref/pr_text_white-space.asp:
<div class="space">
${adviceInstance.advice}
</div>
//css code:
.space {
white-space:pre
}
Also note: if you have a strict configuration for the storage of such fields, then when the value is submitted via a form there may be additional characters present. I didn't delve into what they actually were; they may have been carriage returns (\r), as explained in the comments below. It is a good rule to define a setter that trims the value each time it is received, i.e.:
class Advice {
    String advice

    static constraints = {
        advice(nullable:false, minSize:1, maxSize:255)
    }

    /*
     * In this scenario with a maxSize value, ensure you
     * set your own setter to trim any hidden \r
     * that may be posted back as part of the form request
     * by the end user. Trust me, I got to know the hard way.
     */
    void setAdvice(String adv) {
        advice = adv.trim()
    }
}
${raw(adviceInstance.advice?.encodeAsHTML().replace("\n", "<br>"))}
This is how I solved the problem.
First, make sure the string contains \n to denote line breaks.
For example :
String test = "This is first line. \n This is second line";
Then in gsp page use:
${raw(test?.replace("\n", "<br>"))}
The output will be:
This is first line.
This is second line.
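One detail worth noting about the encode-then-replace pattern: the order matters. Escape the HTML first and replace the newlines second, or the inserted <br> tags get escaped along with everything else. The same pitfall illustrated in Python (the sample string is made up):

```python
import html

advice = 'Line 1\nLine 2 <script>'

# Escaping first neutralises user-supplied markup; replacing second keeps
# the <br> tags we insert ourselves intact.
safe = html.escape(advice).replace('\n', '<br>')

# Doing it in the other order would escape our own <br> tags too.
wrong = html.escape(advice.replace('\n', '<br>'))
```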
I am currently trying to parse a file. It looks like this:
A|00CA|GOLDSTONE GTS|35350525|-116888367|3038
R|04|37|6000|0|0|0|35349333|-116893334|3038|300|50
R|22|217|6000|0|0|0|35360333|-116877500|3038|300|50
A|00WI|NORTHERN LITE|44304283|-89050111|860
R|09|90|1000|0|0|0|44304217|-89052022|860|300|50
R|27|270|1000|0|0|0|44304350|-89048208|860|300|50
A|01ID|LAVA HOT SPRINGS|42608250|-112032461|5268
R|14|143|2894|0|0|0|42611000|-112034867|5268|300|50
R|32|323|2894|0|0|0|42603733|-112030533|5268|300|50
A|01LS|COUNTRY BREEZE|30722639|-91077361|125
R|09|91|1800|0|0|0|30722747|-91080222|125|300|50
R|27|271|1800|0|0|0|30722531|-91074500|125|300|50
A|01MT|CRYSTAL LAKES RESORT|48789131|-114880436|3141
R|13|131|5000|0|0|0|48794975|-114885842|3141|300|50
R|31|311|5000|0|0|0|48783292|-114875003|3141|300|50
but longer; you get the picture.
Say I want to get a whole line out of this using only a four-digit code.
So when the user types in 00CA, it will pull the following whole line and break it up into the numbers or letters between the "|":
A|00CA|GOLDSTONE GTS|35350525|-116888367|3038
I have been given code that looks like this:
file = assert(io.open("Airports.txt", "r"))
for line in file:lines() do
  fields = { line:match "(%w+)|(%w+)|([%w ]+)|([%d-]+)|([%d-]+)|([%d-]+)" }
  print(fields[4], fields[5]) -- the 2 numeric fields you're interested in
end
file:close()
Of this whole line:
A|00CA|GOLDSTONE GTS|35350525|-116888367|3038
I would only be interested in getting these pieces of data: 35350525 and -116888367.
However, when I try to put this (or anything like it) in, it just puts out a nil value.
-- ICAO == "00CA"
fields = { line:match "(%w+)|" .. ICAO .. "|([%w ]+)|([%d-]+)|([%d-]+)|([%d-]+)" }
And obviously I need to put some custom data (the ICAO code) in there, as many of the lines follow that pattern.
What am I doing wrong?
Add parentheses around the call:
line:match("(%w+)|" .. ICAO .. "|([%w ]+)|([%d-]+)|([%d-]+)|([%d-]+)")
The original code is parsed as
(line:match "(%w+)|") .. ICAO .. "|([%w ]+)|([%d-]+)|([%d-]+)|([%d-]+)"
Here is the complete code that I tested:
line="A|00CA|GOLDSTONE GTS|35350525|-116888367|3038"
ICAO = "00CA"
print(line:match("(%w+)|" .. ICAO .. "|([%w ]+)|([%d-]+)|([%d-]+)|([%d-]+)"))
The output is
A GOLDSTONE GTS 35350525 -116888367 3038
For this task, I'd use a simpler pattern: "(.-)|" .. ICAO .. "|(.-)|(.-)|(.-)|(.-)$"
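For comparison, this kind of lookup is often simplest without patterns at all: split each line on the delimiter and compare fields. A rough Python sketch of the idea (the field layout is taken from the sample above):

```python
# Split each line on '|'; an airport record starts with 'A' and has the
# ICAO code in its second field. Fields 4 and 5 are the two coordinates.
data = '''A|00CA|GOLDSTONE GTS|35350525|-116888367|3038
R|04|37|6000|0|0|0|35349333|-116893334|3038|300|50
A|00WI|NORTHERN LITE|44304283|-89050111|860'''

def find_airport(icao, text):
    for line in text.splitlines():
        fields = line.split('|')
        if fields[0] == 'A' and fields[1] == icao:
            return fields  # the whole record, already broken up
    return None

fields = find_airport('00CA', data)
```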
I have a file containing a text representation of an object. I have written a combinator parser grammar that parses the text and returns the object. In the text, "#" is a comment delimiter: everything from that character to the end of the line is ignored. Blank lines are also ignored. I want to process text one line at a time, so that I can handle very large files.
I don't want to clutter up my parser grammar with generic comment and blank line logic. I'd like to remove these as a preprocessing step. Converting the file to an iterator over line I can do something like this:
Source.fromFile("file.txt").getLines.map(_.replaceAll("#.*", "").trim).filter(!_.isEmpty)
How can I pass the output of an expression like that into a combinator parser? I can't figure out how to create a Reader object out of a filtered expression like this. The Java FileReader interface doesn't work that way.
Is there a way to do this, or should I put my comment and blank line logic in the parser grammar? If the latter, is there some util.parsing package that already does this for me?
The simplest way to do this is to use the fromLines method on PagedSeq:
import scala.collection.immutable.PagedSeq
import scala.io.Source
import scala.util.parsing.input.PagedSeqReader
val lines = Source.fromFile("file.txt").getLines.map(
_.replaceAll("#.*", "").trim
).filterNot(_.isEmpty)
val reader = new PagedSeqReader(PagedSeq.fromLines(lines))
And now you've got a scala.util.parsing.input.Reader that you can plug into your parser. This is essentially what happens when you parse a java.io.Reader anyway: it immediately gets wrapped in a PagedSeqReader.
Not the prettiest code you'll ever write, but you could go through a new Source as follows:
val SEP = System.getProperty("line.separator")
def lineMap(fileName : String, trans : String=>String) : Source = {
Source.fromIterable(
Source.fromFile(fileName).getLines.flatMap(
line => trans(line) + SEP
).toIterable
)
}
Explanation: flatMap will produce an iterator on characters, which you can turn into an Iterable, which you can use to build a new Source. You need the extra SEP because getLines removes it by default (using \n may not work as Source will not properly separate the lines).
If you want to apply filtering too, i.e. remove some of the lines, you could for instance try:
// whenever `trans` returns `None`, the line is dropped.
def lineMapFilter(fileName : String, trans : String=>Option[String]) : Source = {
Source.fromIterable(
Source.fromFile(fileName).getLines.flatMap(
line => trans(line).map(_ + SEP).getOrElse("")
).toIterable
)
}
As an example:
lineMapFilter("in.txt", line => if(line.isEmpty) None else Some(line.reverse))
...will remove empty lines and reverse non-empty ones.