AttributeError: 'BloomForCausalLM' object has no attribute 'encode' - machine-learning

I'm trying to do some basic text inference using the BLOOM model:
from transformers import AutoModelForCausalLM, AutoModel
# checkpoint = "bigscience/bloomz-7b1-mt"
checkpoint = "bigscience/bloom-1b7"
tokenizer = AutoModelForCausalLM.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint, torch_dtype="auto", device_map="auto")
# Set the prompt and maximum length
prompt = "This is the prompt text"
max_length = 100000
# Tokenize the prompt
inputs = tokenizer.encode("Translate to English: Je t’aime.", return_tensors="pt").to("cuda")
# Generate the text
outputs = model.generate(inputs)
result = tokenizer.result(outputs[0])
# Print the generated text
print(result)
I get the error
Traceback (most recent call last):
File "/tmp/pycharm_project_444/bloom.py", line 15, in <module>
inputs = tokenizer.encode("Translate to English: Je t’aime.", return_tensors="pt").to("cuda")
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1265, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'BloomForCausalLM' object has no attribute 'encode'
Anyone know what the issue is?
It's running on a remote server

I was loading the tokenizer with AutoModelForCausalLM instead of AutoTokenizer.
The model object returned by AutoModelForCausalLM does not have an encode() method.

The problem was that you were using a model class to create your tokenizer.
AutoModelForCausalLM loads a model for causal language modelling (LM) but not a tokenizer, as stated in the documentation. (https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForCausalLM) You got that error because the model does not have a method called "encode".
You can use AutoTokenizer to achieve what you want. Also, in Hugging Face Transformers, calling a tokenizer directly is equivalent to calling its encode method, so you do not have to spell out the method call. See below:
from transformers import AutoTokenizer, AutoModel
checkpoint = "bigscience/bloom-1b7"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint, torch_dtype="auto", device_map="auto")
# Set the prompt and maximum length
prompt = "This is the prompt text"
max_length = 100000
# Tokenize the prompt
inputs = tokenizer("Translate to English: Je t’aime.", return_tensors="pt").to("cuda") # Same as tokenizer.encode(...)

Related

Writing and saving GenBank files with the Biopython SeqIO module

I want to save some DNA sequences in GenBank file format, including information about genes, domains, etc. I know how to create SeqRecord objects and include all the information I want to have in the file:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.Alphabet import IUPAC

#my DNA sequence and encoded protein sequence of gene1
genome_seq = 'ATTTTGTGCAGCCGAGAGCGCGAGCGAAGCGCTTAAAAAATTCCCCCGCTCTGTTCTCCGGCAGGACACAAAGTCATGCCGTGGAGACCGCCGGTCCATAACGTGCCAGGTAGAGAGAATCAATGGTTTGCAGCGTTCTTTCACGGTCATGCTGCTTTCTGCGGGTGTGGTGACCCTGTTGGGCATCTTAACGGAAGC'
protein_seq = 'QQRILGVKLRLLFNQVQKIQQNQDP'
#position of gene1
start = 12
end = start + len(protein_seq)
#some information
name = 'my_contig'
bioproject = 'BodySites'
sample_type = 'blood'
taxonomy = ['Homo Sapiens']
reference_prot_ID = 'YP_92845z2093857'
#dictionaries with information for SeqFeature qualifiers and SeqRecord annotations
dict1 = {'gene': 'ORF1', 'ref_ID': reference_prot_ID, 'translation': protein_seq}
dict2 = {'SOURCE': sample_type, 'ORGANISM': 'Human', 'Taxonomy': taxonomy}
#create SeqFeature and SeqRecord
f1 = SeqFeature(FeatureLocation(start, end, strand=1), type='domain', qualifiers=dict1)
my_features = [f1]
record = SeqRecord(Seq(genome_seq, alphabet=IUPAC.unambiguous_dna), id=name, name=name,
                   description=bioproject, annotations=dict2, features=my_features)
print(record)
with open('/media/sf_Desktop/test.gb', 'w') as handle:
    SeqIO.write(record, handle, 'genbank')
What gets printed to the screen for the SeqRecord object looks like this, where everything seems to be included:
ID: my_contig
Name: ma_contig
Description: BodySites
Number of features: 1
/SOURCE=blood
/ORGANISM=Human
/Taxonomy=['Homo Sapiens']
Seq('ATTTTGTGCAGCCGAGAGCGCGAGCGAAGCGCTTAAAAAATTCCCCCGCTCTGT...AGC', IUPACUnambiguousDNA())
But in the resulting file the information on SOURCE, ORGANISM and Taxonomy is missing:
LOCUS       my_contig                198 bp    DNA              UNK 01-JAN-1980
DEFINITION  BodySites.
ACCESSION   my_contig
VERSION     my_contig
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     domain          13..37
                     /gene="ORF1"
                     /ref_ID="YP_92845z2093857"
                     /translation="QQRILGVKLRLLFNQVQKIQQNQDP"
ORIGIN
        1 attttgtgca gccgagagcg cgagcgaagc gcttaaaaaa ttcccccgct ctgttctccg
       61 gcaggacaca aagtcatgcc gtggagaccg ccggtccata acgtgccagg tagagagaat
      121 caatggtttg cagcgttctt tcacggtcat gctgctttct gcgggtgtgg tgaccctgtt
      181 gggcatctta acggaagc
//
Can anyone help me include the annotation information in the output file as well?
I found that the GenBank.Record module makes it possible to include all the information, and it looks very nice on the screen, but there is no information on how to save a Record object to a file...
OK, I found my mistake:
all annotation keys need to be in lowercase letters. So changing 'SOURCE' to 'source', 'ORGANISM' to 'organism' and so on did the job.
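For example, the annotations dictionary from the question would become (a minimal sketch; 'source', 'organism' and 'taxonomy' are the lowercase annotation keys the GenBank writer recognises):
dict2 = {'source': sample_type, 'organism': 'Human', 'taxonomy': taxonomy}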
Cheers!

Saving SEC 10-K annual report text to files (trouble with decoding)

I am trying to bulk-download the text visible to the "end user" from SEC EDGAR 10-K reports (I don't care about tables) and save it in a text file. I found the code below on YouTube, however I am facing two challenges:
I am not sure if I am capturing all the text, and when I print the output for the URL below, I get very strange characters (e.g. at the very end of the printout).
I can't seem to save the text in .txt files; I'm not sure if this is due to encoding (I am entirely new to programming).
import re
import requests
import unicodedata
from bs4 import BeautifulSoup

def restore_windows_1252_characters(restore_string):
    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''
    return re.sub(r'[\u0080-\u0099]', to_windows_1252, restore_string)

# define the url to specific html_text file
new_html_text = r"https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt"
# grab the response
response = requests.get(new_html_text)
page_soup = BeautifulSoup(response.content, 'html5lib')
page_text = page_soup.html.body.get_text(' ', strip=True)
# normalize the text, remove characters. Additionally, restore missing windows characters.
page_text_norm = restore_windows_1252_characters(unicodedata.normalize('NFKD', page_text))
# print: this works however gives me weird special characters in the print (e.g., at the very end)
print(page_text_norm)
# save to file: this only gives me an empty text file
with open('testfile.txt', 'w') as file:
    file.write(page_text_norm)
Try this. If you take the data you expect as an example, it will be easier for people to understand your needs.
from simplified_scrapy import SimplifiedDoc,req,utils
url = 'https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt'
html = req.get(url)
doc = SimplifiedDoc(html)
# text = doc.body.text
text = doc.body.unescape() # Converting HTML entities
utils.saveFile("testfile.txt",text)
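If you would rather keep the original requests/BeautifulSoup approach, the empty file is most likely a write-time encoding problem (the platform's default encoding cannot represent some characters in the page text). A minimal sketch of the fix, assuming page_text_norm already holds the text you want to keep:
with open('testfile.txt', 'w', encoding='utf-8') as file:
    file.write(page_text_norm)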

django reverse foreign key on admin list_display fails

I'm trying to display a reverse foreign key on the admin change list:
# models.py
class Person(models.Model):
    id = models.CharField(max_length=50, primary_key=True)

class Book(models.Model):
    person = models.ForeignKey(Person, related_name='samples')
    name = models.CharField(max_length=50)

# admin.py
@admin.register(Person)
class PersonAdmin(CustomAdmin):
    list_display = ('id', 'links')

    def links(self, obj):
        links = obj.book_set().all()
        return mark_safe('<br/>'.join(links))
It's similar to this other post.
I'm using Django 1.8 and it fails:
Stacktrace:
  File "/venv/lib/python2.7/site-packages/django/contrib/admin/templatetags/admin_list.py", line 320, in result_list
    'results': list(results(cl))}
  File "/venv/lib/python2.7/site-packages/django/contrib/admin/templatetags/admin_list.py", line 296, in results
    yield ResultList(None, items_for_result(cl, res, None))
  File "/venv/lib/python2.7/site-packages/django/contrib/admin/templatetags/admin_list.py", line 287, in __init__
    super(ResultList, self).__init__(*items)
  File "/venv/lib/python2.7/site-packages/django/contrib/admin/templatetags/admin_list.py", line 199, in items_for_result
    f, attr, value = lookup_field(field_name, result, cl.model_admin)
  File "/venv/lib/python2.7/site-packages/django/contrib/admin/utils.py", line 278, in lookup_field
    value = attr(obj)
  File "/home/kparsa/boom/myapp/myapp/admin.py", line 299, in links
    link = obj.book_set().all()
  File "/venv/lib/python2.7/site-packages/django/db/models/fields/related.py", line 691, in __call__
    manager = getattr(self.model, kwargs.pop('manager'))
KeyError: u'manager'
Does anyone know how to get this to work properly?
Note that I do NOT want to do something like
qres = Book.objects.filter(person__id=obj.id).values_list('id', 'name').order_by('name')
for x, y in qres:
    links.append('{}'.format(x, y))
because it will run many duplicate queries (one per row).
The error was caused by the fact that I had set a related_name on the ForeignKey in the model, but was trying to use the default reverse accessor (book_set) instead.
To solve the problem of the duplicate queries, I queried for all the results in get_queryset(), stored the results in memcached, and then in the links method I simply pulled them from the cache.
Works great!
The only danger here would be getting out of sync when new data is pushed; I put a 100-second timeout on the cache to avoid issues.
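For reference, here is a minimal sketch of the related_name fix without the cache; it assumes a plain ModelAdmin (the CustomAdmin base class from the question is not shown) and uses prefetch_related() so the reverse lookup costs a single extra query rather than one per row:
from django.contrib import admin
from django.utils.html import escape
from django.utils.safestring import mark_safe

@admin.register(Person)
class PersonAdmin(admin.ModelAdmin):
    list_display = ('id', 'links')

    def get_queryset(self, request):
        # fetch every person's books up front in one extra query
        qs = super(PersonAdmin, self).get_queryset(request)
        return qs.prefetch_related('samples')

    def links(self, obj):
        # 'samples' is the related_name defined on Book.person
        return mark_safe('<br/>'.join(escape(book.name) for book in obj.samples.all()))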

TypeError when attempting to parse pubmed EFetch

I'm new to this Python/Biopython stuff, so I am struggling to work out why the following code, pretty much lifted straight out of the Biopython Cookbook, isn't doing what I'd expect.
I'd have thought it would end up with the interpreter displaying two lists containing the same numbers, but all I get is one list and then a message saying TypeError: 'generator' object is not subscriptable.
I'm guessing something is going wrong with the Medline.parse step and the result of the efetch isn't being processed in a way that allows subsequent iteration to extract the PMID values. Or the efetch isn't returning anything.
Any pointers as to what I'm doing wrong?
Thanks
from Bio import Medline
from Bio import Entrez
Entrez.email = 'A.N.Other#example.com'
handle = Entrez.esearch(db="pubmed", term="biopython")
record = Entrez.read(handle)
print(record['IdList'])
items = record['IdList']
handle2 = Entrez.efetch(db="pubmed", id=items, rettype="medline", retmode="text")
records = Medline.parse(handle2)
for r in records:
    print(records['PMID'])
You're trying to subscript records, which is a generator. I think you meant to do print(r['PMID']), which will print the 'PMID' entry of the current record dictionary on each iteration. This is confirmed by the example given in the Bio.Medline.parse() documentation.
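In other words, keeping the rest of the question's code unchanged, the loop becomes:
records = Medline.parse(handle2)
for r in records:
    print(r['PMID'])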

How can you join two or more dictionaries created by Bio.SeqIO.index?

I would like to be able to join the two "dictionaries" stored in "indata" and "pairdata", but this code,
indata = SeqIO.index(infile, infmt)
pairdata = SeqIO.index(pairfile, infmt)
indata.update(pairdata)
produces the following error:
indata.update(pairdata)
TypeError: update() takes exactly 1 argument (2 given)
I have tried using,
indata = SeqIO.to_dict(SeqIO.parse(infile, infmt))
pairdata = SeqIO.to_dict(SeqIO.parse(pairfile, infmt))
indata.update(pairdata)
which does work, but the resulting dictionaries take up too much memory to be practical for the sizes of infile and pairfile I have.
The final option I have explored is:
indata = SeqIO.index_db(indexfile, [infile, pairfile], infmt)
which works perfectly, but is very slow. Does anyone know how/whether I can successfully join the two indexes from the first example above?
SeqIO.index returns a read-only dictionary-like object, so update will not work on it (apologies for the confusing error message; I just checked in a fix for that to the main Biopython repository).
The best approach is to either use index_db, which will be slower but
only needs to index the file once, or to define a higher level object
which acts like a dictionary over your multiple files. Here is a
simple example:
from Bio import SeqIO

class MultiIndexDict:
    def __init__(self, *indexes):
        self._indexes = indexes
    def __getitem__(self, key):
        for idx in self._indexes:
            try:
                return idx[key]
            except KeyError:
                pass
        raise KeyError("{0} not found".format(key))

indata = SeqIO.index("f001", "fasta")
pairdata = SeqIO.index("f002", "fasta")
combo = MultiIndexDict(indata, pairdata)

print(combo['gi|3318709|pdb|1A91|'].description)
print(combo['gi|1348917|gb|G26685|G26685'].description)
print(combo["key_failure"])  # raises KeyError
If you don't plan to use the index again and memory isn't a limitation (which both appear to be true in your case), you can tell Bio.SeqIO.index_db(...) to use an in-memory SQLite3 index with the special index name ":memory:" like so:
indata = SeqIO.index_db(":memory:", [infile, pairfile], infmt)
where infile and pairfile are filenames, and infmt is their format type as defined in Bio.SeqIO (e.g. "fasta").
This is actually a general trick with Python's SQLite3 library. For a small set of files this should be much faster than building the SQLite index on disk.
