Saving SEC 10-K annual report text to files (trouble with decoding) - parsing

I am trying to bulk-download the text visible to the "end-user" from 10-K SEC EDGAR reports (I don't care about tables) and save it to a text file. I found the code below on YouTube, however I am facing 2 challenges:
1. I am not sure if I am capturing all the text, and when I print the text fetched from the URL below, I receive very weird output (special characters, e.g., at the very end of the print-out).
2. I can't seem to save the text to .txt files; I am not sure if this is due to encoding (I am entirely new to programming).
import re
import requests
import unicodedata
from bs4 import BeautifulSoup

def restore_windows_1252_characters(restore_string):
    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''
    return re.sub(r'[\u0080-\u0099]', to_windows_1252, restore_string)

# define the url to the specific html_text file
new_html_text = r"https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt"

# grab the response
response = requests.get(new_html_text)
page_soup = BeautifulSoup(response.content, 'html5lib')
page_text = page_soup.html.body.get_text(' ', strip=True)

# normalize the text and restore missing Windows-1252 characters
page_text_norm = restore_windows_1252_characters(unicodedata.normalize('NFKD', page_text))

# print: this works, however it gives me weird special characters (e.g., at the very end)
print(page_text_norm)

# save to file: this only gives me an empty text file
with open('testfile.txt', 'w') as file:
    file.write(page_text_norm)
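(If the empty file is caused by the platform default encoding rejecting some of the printed characters, which is an assumption since no traceback is shown, a minimal variant to test is to pass an explicit encoding to open():)

# Sketch under the assumption that the platform default encoding
# (e.g. cp1252 on Windows) cannot represent some characters in page_text_norm.
with open('testfile.txt', 'w', encoding='utf-8', errors='replace') as file:
    file.write(page_text_norm)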

Try this. If you include an example of the data you expect, it will be easier for people to understand your needs.
from simplified_scrapy import SimplifiedDoc,req,utils
url = 'https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt'
html = req.get(url)
doc = SimplifiedDoc(html)
# text = doc.body.text
text = doc.body.unescape() # Converting HTML entities
utils.saveFile("testfile.txt",text)

Related

Weird URL Encoding/Decoding for non-English Characters

How and why is a non-English word converted to weird characters, e.g. پاکستان to Ù¾Ø§Ú©Ø³ØªØ§Ù†, and is there any way to get پاکستان back from Ù¾Ø§Ú©Ø³ØªØ§Ù†? It happens in code shown in the browser and in requests received at the server.
Background:
I get a lot of requests on my non-English-content (Urdu) website with URLs like
Ù¾Ø§Ú©Ø³ØªØ§Ù†
I tried to find out what that means, but search engines don't help. I tried things like
Decode this 'mystring'
What encoding is this 'mystring'
I thought it might be a corrupted/spam URL, based on this link:
Weird characters in URL
Problem explanation/example
But when I viewed one of my js files in the browser (while having a look at a working js file), it showed me the same weird characters for the same line, even at localhost:
'pakistan': {'eng': 'Pakistan', 'ur': 'Ù¾Ø§Ú©Ø³ØªØ§Ù†'},
// But the actual source code for the above line is the following:
'pakistan': {'eng': 'Pakistan', 'ur': 'پاکستان'},
My knowledge
I only know about encoding/decoding, which to the best of my knowledge seems unrelated here:
encodeURI and decodeURI in JS, or quote and unquote in Python, and the same for other languages. All they do for me is convert
`پاکستان` to `%D9%BE%D8%A7%DA%A9%D8%B3%D8%AA%D8%A7%D9%86` and vice versa
Why needed?
I don't want to miss the requests received with those malformed URLs. There must be some way to undo this: all browsers (Chrome/Firefox/Edge) show those characters the same way, and if their conversion method and result are the same, then there should be a technique available to reverse it as well.
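(For context: this kind of garbling typically appears when UTF-8 bytes are decoded as cp1252. If that is what happened here, which is an assumption, the round trip can be reversed directly, e.g. in Python:)

# Hypothetical example: 'Ù¾Ø§Ú©Ø³ØªØ§Ù†' is the UTF-8 encoding of
# 'پاکستان' misread as cp1252; re-encoding as cp1252 recovers the original bytes.
garbled = 'Ù¾Ø§Ú©Ø³ØªØ§Ù†'
restored = garbled.encode('cp1252').decode('utf-8')
print(restored)  # پاکستان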
Thanks to Giacomo Catenazzi, and I am also grateful to the following answer:
How to decode cp1252 string?
A very custom and still imperfect solution to my problem.
This algorithm needs to be improved; I only found out by experiment how it works, as it was not working for me when the string is long or includes hyphens (-).
So I made changes according to my requirements, and it works well enough that I can guess what the actual string was.
import re, itertools
from lxml.builder import unicode

def specific_my_required_processing(received_string):
    starting_characters_in_encoded_string_in_my_case = ['Ø', 'Ã', 'Ù', 'Ù', 'Ú']
    arr = received_string.split('-')
    res = []
    missed = []
    for string_item in arr:
        decoded_string = guess_decode_string_without_hyphens(string_item)
        if decoded_string and decoded_string[:1] not in starting_characters_in_encoded_string_in_my_case:
            res.append(decoded_string)
        else:
            missed.append({string_item: decoded_string})
    resulting_urdu_string = '-'.join(res)
    print('\n\nResult', resulting_urdu_string)
    print('\nCould not be decoded', missed)

def guess_decode_string_without_hyphens(s):
    encodings = ['cp1251', 'cp1252', 'utf8']
    for steps in range(2, 10, 2):
        for encs in itertools.product(encodings, repeat=steps):
            r = s
            try:
                for enc in encs:
                    r = r.encode(enc) if isinstance(r, unicode) else r.decode(enc)
            except (UnicodeEncodeError, UnicodeDecodeError) as e:
                continue
            if re.match(u'^[\w\sа-яА-Я]+$', r):
                res = str(r)
                print('Encoding => ', encs, ' Conversion = ' + s + ' => ' + res)
                return res

sample_encoded_string = 'اسلام-آباد-Ûائیکورٹ-ای-ÙˆÛŒ-ایم-قانون-سازی-کالعدم-قرار-دینے-Ú©ÛŒ-درخواست-نامکمل-قرار'
specific_my_required_processing(sample_encoded_string)

How to retrieve XMLTYPE data that contains special characters?

I want to retrieve XMLTYPE data from an Oracle table using cx_Oracle.
The data looks like this:
<infos>
  <Comment/>
  <Observation>àéèç</Observation>
  <Level>L3</Level>
  <Duration/>
  <Cause/>
  <Depot> Haren </Depot>
  <Resolution/>
</infos>
Here's my code:
#!/usr/bin/python
from __future__ import print_function
import cx_Oracle

# Connection to RTDIAG
try:
    dsn_test = cx_Oracle.makedsn(host='xxxxx', port='1521', service_name='xxxxx')
    con_test = cx_Oracle.connect(user='xxxx', password='xxxxx', dsn=dsn_test)
except cx_Oracle.InterfaceError:
    print("Impossible to connect to the DB!")
    print("***exit script***")
    quit()

ID_record = 1729
cursor = con_test.cursor()
query = """select a.content.getClobVal() from emb_log a
           where ID = :id and uncompleted_record=1"""
cursor.execute(query, id=ID_record)
xml_retrieved = cursor.fetchone()[0].read()  # string
print(xml_retrieved)
Here's what I get:
<infos>
  <Comment/>
  <Observation>aeec</Observation>
  <Level>L3</Level>
  <Duration/>
  <Cause/>
  <Depot> Haren </Depot>
  <Resolution/>
</infos>
The special characters contained within the XML child elements are not retrieved properly; they are converted into ASCII-like characters.
Why, and how do I fetch the XML exactly the way it appears in the DB?
Thank you.
Set your NLS environment. You will probably find it easiest to use the encoding option when you connect. For performance, you will want to fetch the CLOB via an OutputTypeHandler.
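(A minimal sketch of both suggestions, assuming a cx_Oracle version that accepts the encoding keyword at connect time; host and credentials are the placeholders from the question:)

import cx_Oracle

dsn = cx_Oracle.makedsn(host='xxxxx', port='1521', service_name='xxxxx')

# Explicit encodings sidestep whatever NLS_LANG happens to be set to.
con = cx_Oracle.connect(user='xxxx', password='xxxxx', dsn=dsn,
                        encoding='UTF-8', nencoding='UTF-8')

# Fetch CLOB columns as plain strings instead of LOB locators;
# this is faster and removes the need for .read().
def output_type_handler(cursor, name, default_type, size, precision, scale):
    if default_type == cx_Oracle.CLOB:
        return cursor.var(cx_Oracle.LONG_STRING, arraysize=cursor.arraysize)

con.outputtypehandler = output_type_handler

cursor = con.cursor()
cursor.execute("select a.content.getClobVal() from emb_log a where ID = :id", id=1729)
print(cursor.fetchone()[0])  # already a str, no .read() needed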

Groovy - searching for and extracting XML code from a log file

I have a lot of text in my log file, but sometimes I get responses as XML code, and I have to cut this XML code out and move it to other files.
For example:
sThread1....dsadasdsadsadasdasdasdas.......dasdasdasdadasdasdasdadadsada
important xml code to cut and move to other file: <response><important> 1 </import...></response>
important xml code to other file: <response><important> 2 </important...></response>
sThread2....dsadasdsadsadasdasdasdas.......dasdasdasdadasdasdasdadadsada
Hindrance: the XML code starts at a different character position each time (it does not always start at the same offset).
Please help me find a method for locating the XML code in the text.
Right now I have tested the substring() method, but the XML code does not always start at the same position :(
EDIT:
I found what I wanted; the function I was searching for was indexOf().
I needed the position where the string "Response is : " ends, so I used:
int positionOfXmlInLine = lineTxt.indexOf("<response")
And after this I can cut the string to the end of the line:
def cuttedText = lineTxt.substring(positionOfXmlInLine);
So now I have only the XML text/code from the log file.
Next comes parsing the XML value, as BDKosher wrote below.
Hopefully that will help someone, you guys
You might be able to leverage XmlSlurper for this, assuming your XML is valid enough. The code below will take each line of the log, wrap it in a root element, and parse it. Once parsed, it extracts and prints out the value of the <important> element's value attribute, but instead you could do whatever you need to do with the data:
def input = '''
sThread1..sdadassda..sdadasdsada....sdadasdas...
important code to cut and move to other file: **<response><important value="1"></important></response>**
important code to other file: ****<response><important value="3"></important></response>****
sThread2..dsadasd.s.da.das.d.as.das.d.as.da.sd.a.
'''

def parser = new XmlSlurper()
input.eachLine { line, lineNo ->
    def output = parser.parseText("<wrapper>$line</wrapper>")
    if (!output.response.isEmpty()) {
        println "Line $lineNo is of importance ${output.response.important.@value.text()}"
    }
}
This prints out:
Line 2 is of importance 1
Line 3 is of importance 3

Preprocessing Scala parser Reader input

I have a file containing a text representation of an object. I have written a combinator parser grammar that parses the text and returns the object. In the text, "#" is a comment delimiter: everything from that character to the end of the line is ignored. Blank lines are also ignored. I want to process text one line at a time, so that I can handle very large files.
I don't want to clutter up my parser grammar with generic comment and blank-line logic. I'd like to remove these as a preprocessing step. Converting the file to an iterator over lines, I can do something like this:
Source.fromFile("file.txt").getLines.map(_.replaceAll("#.*", "").trim).filter(!_.isEmpty)
How can I pass the output of an expression like that into a combinator parser? I can't figure out how to create a Reader object out of a filtered expression like this. The Java FileReader interface doesn't work that way.
Is there a way to do this, or should I put my comment and blank line logic in the parser grammar? If the latter, is there some util.parsing package that already does this for me?
The simplest way to do this is to use the fromLines method on PagedSeq:
import scala.collection.immutable.PagedSeq
import scala.io.Source
import scala.util.parsing.input.PagedSeqReader

val lines = Source.fromFile("file.txt").getLines.map(
  _.replaceAll("#.*", "").trim
).filterNot(_.isEmpty)

val reader = new PagedSeqReader(PagedSeq.fromLines(lines))
And now you've got a scala.util.parsing.input.Reader that you can plug into your parser. This is essentially what happens when you parse a java.io.Reader, anyway—it immediately gets wrapped in a PagedSeqReader.
Not the prettiest code you'll ever write, but you could go through a new Source as follows:
val SEP = System.getProperty("line.separator")

def lineMap(fileName: String, trans: String => String): Source = {
  Source.fromIterable(
    Source.fromFile(fileName).getLines.flatMap(
      line => trans(line) + SEP
    ).toIterable
  )
}
Explanation: flatMap will produce an iterator on characters, which you can turn into an Iterable, which you can use to build a new Source. You need the extra SEP because getLines removes it by default (using \n may not work as Source will not properly separate the lines).
If you want to apply filtering too, i.e. remove some of the lines, you could for instance try:
// whenever `trans` returns `None`, the line is dropped.
def lineMapFilter(fileName: String, trans: String => Option[String]): Source = {
  Source.fromIterable(
    Source.fromFile(fileName).getLines.flatMap(
      line => trans(line).map(_ + SEP).getOrElse("")
    ).toIterable
  )
}
As an example:
lineMapFilter("in.txt", line => if(line.isEmpty) None else Some(line.reverse))
...will remove empty lines and reverse non-empty ones.

Grails: Replacing symbols with HTML equivalent

I'm reading a CSV file, and one of the columns contains text with symbols that are not recognized. After I read the file, symbols such as ' become �. I'm also saving this into a DB.
Obviously, when I display this on a webpage it shows garbage. How can I substitute the HTML code (e.g. &#180;) for these symbols with Grails?
I am reading the CSV using the csv plugin. Code below:
def f = "clientDocs/testfile.csv"
def fReader = new File(f).toCsvMapReader([batchSize: 50, charset: 'UTF-8'])

fReader.each { batchList ->
    batchList.each {
        def description = substituteSymbols(it.Description)
    }
}

def substituteSymbols(inText) {
    // HOW TO SUBSTITUTE HERE
}
Thanks for any help or suggestions. I've already tried string.replaceAll(regExp).
Grails comes with a basic set of encoders/decoders for common tasks.
What you want here is it.Description.encodeAsHTML().
And then if you want the original when displaying in the view, just reverse it with .decodeHTML()
You can read more about these here: http://grails.org/doc/latest/guide/single.html#codecs
(Edited decode method name typo as per the comment)
