How to retrieve XMLTYPE data that contains special characters? - character-encoding

I want to retrieve XMLTYPE data from an Oracle table using cx_Oracle.
The data looks like this:
<infos>
<Comment/>
<Observation>àéèç</Observation>
<Level>L3</Level>
<Duration/>
<Cause/>
<Depot> Haren </Depot>
<Resolution/>
</infos>
Here's my code:
#!/usr/bin/python
from __future__ import print_function
import cx_Oracle
# Connection to RTDIAG
try:
    dsn_test = cx_Oracle.makedsn(host='xxxxx', port='1521', service_name='xxxxx')
    con_test = cx_Oracle.connect(user='xxxx', password='xxxxx', dsn=dsn_test)
except cx_Oracle.InterfaceError:
    print("Impossible to connect to the DB!")
    print("***exit script***")
    quit()
ID_record = 1729
cursor = con_test.cursor()
query = """select a.content.getClobVal() from emb_log a
           where ID = :id and uncompleted_record = 1"""
cursor.execute(query, id=ID_record)
xml_retrieved = cursor.fetchone()[0].read()  # string
print(xml_retrieved)
Here's what I get
<infos>
<Comment/>
<Observation>aeec</Observation>
<Level>L3</Level>
<Duration/>
<Cause/>
<Depot> Haren </Depot>
<Resolution/>
</infos>
The special characters contained within the XML child elements are not being retrieved properly; they are converted to ASCII-like characters.
Why is this happening, and how can I fetch the XML exactly as it appears in the DB?
Thank you.

Set your NLS environment. You will probably find it easiest to use the encoding option when you connect. For performance, you will want to fetch the CLOB via an OutputTypeHandler.
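A minimal sketch of both suggestions with cx_Oracle, reusing the placeholder connection details from the question (the UTF-8 encoding name and the continued use of getClobVal() are assumptions to check against your environment):
import cx_Oracle

def output_type_handler(cursor, name, default_type, size, precision, scale):
    # Return CLOB columns as plain strings instead of LOB locators.
    if default_type == cx_Oracle.CLOB:
        return cursor.var(cx_Oracle.LONG_STRING, arraysize=cursor.arraysize)

dsn_test = cx_Oracle.makedsn(host='xxxxx', port='1521', service_name='xxxxx')
# encoding sets the character set cx_Oracle uses for string data on this connection
con_test = cx_Oracle.connect(user='xxxx', password='xxxxx', dsn=dsn_test, encoding='UTF-8')
con_test.outputtypehandler = output_type_handler

cursor = con_test.cursor()
cursor.execute("select a.content.getClobVal() from emb_log a where ID = :id", id=1729)
xml_retrieved = cursor.fetchone()[0]  # already a str, no .read() needed
print(xml_retrieved)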

Related

Beautiful soup findAll returns empty list on this website?

I'm trying to extract property value history from this website: https://www.properly.ca/buy/home/view/ma-tEpHcSzeES-OlhE-V6A/bc/vancouver/1268-w-broadway-%23720/
But my code returns an empty list instead of the property cost history.
I used the following code:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
url= "https://www.properly.ca/buy/home/view/ma-tEpHcSzeES-OlhE-V6A/bc/vancouver/1268-w-broadway-%23720/"
driver.maximize_window()
driver.get(url)
time.sleep(5)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
officials = soup.findAll("table",{"id":"property-history"})
for entry in officials:
    print(str(entry))
Which returns an empty list, although this URL does have a property history table. Any help would be appreciated.
Thanks!
officials = soup.findAll("table",{"id":"property-history"})
In the browser, I don't see a table with id="property-history" - but there is a div with that id, so maybe you can instead get the data you want through
officials = soup.find_all("div", {"id":"property-history"})
Btw, the only table I could find while inspecting the page was inside the map, and I don't think it holds any useful information for you.
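For example, here is a minimal adaptation of the script from the question using that div id (whether the history data is present as plain text inside that div once the JavaScript has run is an assumption to verify):
from selenium import webdriver
import time
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
url = "https://www.properly.ca/buy/home/view/ma-tEpHcSzeES-OlhE-V6A/bc/vancouver/1268-w-broadway-%23720/"
driver.get(url)
time.sleep(5)  # give the JavaScript-rendered page time to load

soup = BeautifulSoup(driver.page_source, "html.parser")
history = soup.find("div", {"id": "property-history"})
if history is not None:
    # Dump the text content of the div; parse further as needed.
    print(history.get_text(" ", strip=True))
driver.quit()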

Saving SEC 10-K annual report text to files (trouble with decoding)

I am trying to bulk-download the text visible to the "end-user" from 10-K SEC Edgar reports (I don't care about tables) and save it in a text file. I found the code below on YouTube; however, I am facing 2 challenges:
I am not sure if I am capturing all of the text, and when I print what the URL below returns, I receive very weird output (e.g., special characters at the very end of the print-out).
I can't seem to save the text in .txt files; I'm not sure if this is due to encoding (I am entirely new to programming).
import re
import requests
import unicodedata
from bs4 import BeautifulSoup
def restore_windows_1252_characters(restore_string):
    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''
    return re.sub(r'[\u0080-\u0099]', to_windows_1252, restore_string)
# define the url to specific html_text file
new_html_text = r"https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt"
# grab the response
response = requests.get(new_html_text)
page_soup = BeautifulSoup(response.content,'html5lib')
page_text = page_soup.html.body.get_text(' ',strip = True)
# normalize the text, remove characters. Additionally, restore missing window characters.
page_text_norm = restore_windows_1252_characters(unicodedata.normalize('NFKD', page_text))
# print: this works however gives me weird special characters in the print (e.g., at the very end)
print(page_text_norm)
# save to file: this only gives me an empty text file
with open('testfile.txt','w') as file:
    file.write(page_text_norm)
Try this. If you take the data you expect as an example, it will be easier for people to understand your needs.
from simplified_scrapy import SimplifiedDoc,req,utils
url = 'https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt'
html = req.get(url)
doc = SimplifiedDoc(html)
# text = doc.body.text
text = doc.body.unescape() # Converting HTML entities
utils.saveFile("testfile.txt",text)
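If you would rather keep the original requests/BeautifulSoup code, the empty file is often just a write-time encoding problem; opening the file with an explicit encoding is a minimal fix to try (assuming the page_text_norm variable from the question):
# Write with an explicit encoding so non-ASCII characters from the filing
# don't raise UnicodeEncodeError on platforms with a non-UTF-8 default.
with open('testfile.txt', 'w', encoding='utf-8') as file:
    file.write(page_text_norm)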

How to convert py2neo.database.cursor class to a dictionary or list in python?

Using graph.run() in py2neo v3 to connect to a Neo4j DB: how can I convert an instance of the py2neo.database.Cursor class to a dictionary or list in Python?
This was simple in py2neo v2 using the py2neo.cypher.core.RecordList class, which is what graph.cypher.execute() (the equivalent of graph.run) would have returned in the previous version...
Looks like you could do with the data method:
http://py2neo.org/v3/database.html#py2neo.database.Cursor.data
This is designed for use with libraries like Pandas where you need to extract the entire result.
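For example, a minimal sketch of feeding the result into pandas (graph and query are assumed to be the connection and Cypher string used in the surrounding examples):
import pandas as pd

# .data() returns a list of dictionaries, one per record,
# which pandas can turn straight into a DataFrame.
df = pd.DataFrame(graph.run(query).data())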
As pointed out by Nigel Small, you can use the .data() method to convert it into a list.
Here is how it works. Let's take for example this line of code calling graph.run, whose result is saved into the tags variable:
tags = graph.run(query)
the result is a py2neo.database.Cursor class (containing a dictionary):
tags
-------------------------------
['py2neo', 'python', 'neo4j']
And by applying the .data() method, it becomes:
tags = graph.run(query).data()
whose result is a list (containing a dictionary):
[{'tags': ['py2neo', 'python', 'neo4j']}]
You can just cast it to a list:
result = session.run("MATCH (n) RETURN n")
records = list(result)

TypeError when attempting to parse pubmed EFetch

I'm new to this Python/Biopython stuff, so am struggling to work out why the following code, pretty much lifted straight out of the Biopython Cookbook, isn't doing what I'd expect.
I'd have thought it would end up with the interpreter displaying two lists containing the same numbers, but all I get is one list and then a message saying TypeError: 'generator' object is not subscriptable.
I'm guessing something is going wrong with the Medline.parse step and the result of the efetch isn't being processed in a way that allows subsequent iteration to extract the PMID values. Or the efetch isn't returning anything.
Any pointers at to what I'm doing wrong?
Thanks
from Bio import Medline
from Bio import Entrez
Entrez.email = 'A.N.Other#example.com'
handle = Entrez.esearch(db="pubmed", term="biopython")
record = Entrez.read(handle)
print(record['IdList'])
items = record['IdList']
handle2 = Entrez.efetch(db="pubmed", id=items, rettype="medline", retmode="text")
records = Medline.parse(handle2)
for r in records:
    print(records['PMID'])
You're trying to print records['PMID'] which is a generator. I think you meant to do print(r['PMID']) which will print the 'PMID' entry in the current record dictionary object for each iteration. This is confirmed by the example given in the Bio.Medline.parse() documentation.
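A corrected version of the loop from the question:
records = Medline.parse(handle2)
for r in records:
    # r is one dictionary-like Medline record; index it, not the generator
    print(r['PMID'])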

How can you join two or more dictionaries created by Bio.SeqIO.index?

I would like to be able to join the two "dictionaries" stored in "indata" and "pairdata", but this code,
indata = SeqIO.index(infile, infmt)
pairdata = SeqIO.index(pairfile, infmt)
indata.update(pairdata)
produces the following error:
indata.update(pairdata)
TypeError: update() takes exactly 1 argument (2 given)
I have tried using,
indata = SeqIO.to_dict(SeqIO.parse(infile, infmt))
pairdata = SeqIO.to_dict(SeqIO.parse(pairfile, infmt))
indata.update(pairdata)
which does work, but the resulting dictionaries take up too much memory to be practical for the sizes of infile and pairfile I have.
The final option I have explored is:
indata = SeqIO.index_db(indexfile, [infile, pairfile], infmt)
which works perfectly, but is very slow. Does anyone know how/whether I can successfully join the two indexes from the first example above?
SeqIO.index returns a read-only dictionary-like object, so update will not work on it (apologies for the confusing error message; I just checked in a fix for that to the main Biopython repository).
The best approach is to either use index_db, which will be slower but only needs to index the file once, or to define a higher level object which acts like a dictionary over your multiple files. Here is a simple example:
from Bio import SeqIO
class MultiIndexDict:
    def __init__(self, *indexes):
        self._indexes = indexes
    def __getitem__(self, key):
        for idx in self._indexes:
            try:
                return idx[key]
            except KeyError:
                pass
        raise KeyError("{0} not found".format(key))

indata = SeqIO.index("f001", "fasta")
pairdata = SeqIO.index("f002", "fasta")
combo = MultiIndexDict(indata, pairdata)
print(combo['gi|3318709|pdb|1A91|'].description)
print(combo['gi|1348917|gb|G26685|G26685'].description)
print(combo["key_failure"])
If you don't plan to use the index again and memory isn't a limitation (which both appear to be true in your case), you can tell Bio.SeqIO.index_db(...) to use an in-memory SQLite3 index with the special index name ":memory:" like so:
indata = SeqIO.index_db(":memory:", [infile, pairfile], infmt)
where infile and pairfile are filenames, and infmt is their format type as defined in Bio.SeqIO (e.g. "fasta").
This is actually a general trick with Python's SQLite3 library. For a small set of files this should be much faster than building the SQLite index on disk.
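As a standard-library illustration of that ":memory:" trick, unrelated to Biopython itself:
import sqlite3

# ":memory:" creates a throwaway database that lives only in RAM
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo (key TEXT PRIMARY KEY, value TEXT)")
con.execute("INSERT INTO demo VALUES (?, ?)", ("record-1", "some value"))
print(con.execute("SELECT value FROM demo WHERE key = ?", ("record-1",)).fetchone())
con.close()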
