I have a FASTA file containing papillomavirus sequences (entire genomes, partial CDS, ...) and I'm using Biopython to retrieve the entire genomes (around 7 kb) from this file. Here's my code:
from Bio import SeqIO

rec_dict = SeqIO.index("hpv_id_name_all.fasta", "fasta")
c = 0
for k in rec_dict.keys():
    c = c + 1
    if len(rec_dict[k].seq) > 7000:
        handle = open(rec_dict[k].description + "_" + str(len(rec_dict[k].seq)) + ".fasta", "w")
        handle.write(">" + rec_dict[k].description + "\n" + str(rec_dict[k].seq) + "\n")
        handle.close()
I'm using SeqIO.index (a dictionary-like object) to avoid loading everything into memory. The variable "c" is used to know how many iterations are made before this error pops up:
Traceback (most recent call last):
File "<stdin>", line 4, in <module>
IOError: [Errno 2] No such file or directory: 'EU410347.1|Human papillomavirus FA75/KI88-03_7401.fasta'
When I print the value of "c", I get 9013 while the file contains 10447 sequences, meaning the for loop didn't go through all the sequences (the count is done before the "if" condition, so I count all the iterations, not only those which match the condition). I don't understand the input/output error: it should create the 'EU410347.1|Human papillomavirus FA75/KI88-03_7401.fasta' file instead of verifying its existence, shouldn't it?
The file you were trying to create -- 'EU410347.1|Human papillomavirus FA75/KI88-03_7401.fasta' -- contains a slash ('/'), which is interpreted as a directory 'EU410347.1|Human papillomavirus FA75' followed by a filename 'KI88-03_7401.fasta', so Python complains that the directory does not exist.
You may want to replace the slash with something else, such as
handle=open(rec_dict[k].description.replace('/', '_')+"_"+str(len(rec_dict[k].seq))+".fasta","w")
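If you want to guard against other characters that are also illegal in file names (not just '/'), you could run the description through a small sanitizing helper first. This is a minimal sketch; sanitize_filename and the exact character set it strips are my own choices, not part of Biopython:

```python
import re

def sanitize_filename(name):
    # Replace '/' plus the characters Windows also rejects with '_'.
    return re.sub(r'[/\\:*?"<>|]', "_", name)

desc = "EU410347.1|Human papillomavirus FA75/KI88-03"
print(sanitize_filename(desc) + "_7401.fasta")
# EU410347.1_Human papillomavirus FA75_KI88-03_7401.fasta
```

You would then use sanitize_filename(rec_dict[k].description) when building the output filename.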
words_to_guess = (which code should I enter here in order to import words from a text file?)
My text file is named words.txt, so I tried:
words_to_guess = import words.txt
You can assign the content of a .txt file to a variable using this code:
words_to_guess = open("words.txt").read()
Just make sure that the file "words.txt" is in the same directory as the .py file (or if you're compiling it to a .exe, the same directory as the .exe file)
I would also like to point out that based on the screenshot you provided, it looks like you're trying to get a random word from the .txt file. Once you've done the above code, I would recommend adding this code below it as well:
words_to_guess = words_to_guess.split()
This will take the content of "words_to_guess" and split every word into a list that can be further accessed. You can then call:
word = random.choice(words_to_guess)
And it will select a random element from the list into the "word" variable, hence providing you with a random word from the .txt file.
Just note that in the split() function, a word is determined by the spaces in between, so if you have a phrase like "Halloween Pumpkin" or "American Flag", the split() function would make each individual word an element, so it would be turned into ["Halloween", "Pumpkin"] or ["American", "Flag"].
That's all!
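Putting those steps together, here is a self-contained sketch (it writes a sample words.txt first so it runs on its own; in the real game the file would already exist next to the script):

```python
import random

# Create a sample words.txt for the demo; in practice the file already exists.
with open("words.txt", "w") as f:
    f.write("apple banana cherry\n")

# Read the whole file into one string, then split on whitespace.
words_to_guess = open("words.txt").read().split()

# Pick one word at random for the game.
word = random.choice(words_to_guess)
print(word)  # one of apple / banana / cherry
```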
I have a list of tokens for each row in a CSV file, as follows:
[['thxx'], ['too', 'late', 'now', 'dumbass'], ['you', '‘', 're', 'so', 'dumb', '?', '?'], ['thxxx'], ['i', '‘', 'd', 'be', 'fucked']]
When I try to pass this on to the lemmatizer like this:
from nltk.stem import WordNetLemmatizer
lemmatized_words = [WordNetLemmatizer.lemmatize(word) for word in tokened_text]
print(lemmatized_words)
I get the following error:
TypeError: lemmatize() missing 1 required positional argument: 'word'
Why is that?
As a side question: do I need to do this before passing the data for vectorization? I am building a machine learning model and saw the CountVectorizer function in scikit-learn, but could not find any information on whether it does lemmatization and so on beforehand as well.
There are some things wrong in your code:
WordNetLemmatizer is a class; you need to instantiate it first.
tokened_text is a nested list, hence you need a nested list comprehension to preserve the structure. Also, lemmatize is expecting a string.
Here's how you could do this:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
lemmatized_words = [[wnl.lemmatize(word) for word in l] for l in tokened_text]
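To see that the nested comprehension keeps the per-row structure, here is the same pattern with str.upper standing in for wnl.lemmatize, so the sketch runs without downloading the WordNet data (swap the lemmatizer back in for real use):

```python
tokened_text = [['thxx'], ['too', 'late', 'now'], ['thxxx']]

# Same nested list comprehension as above; str.upper is only a stand-in
# for wnl.lemmatize so the example has no NLTK dependency.
lemmatized_words = [[w.upper() for w in row] for row in tokened_text]
print(lemmatized_words)  # [['THXX'], ['TOO', 'LATE', 'NOW'], ['THXXX']]
```

Each inner list in the result still corresponds to one row of the original CSV.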
Traceback (most recent call last):
File "C:/Users/tarun/PycharmProjects/Yumnam_jr_ChatBot/training.py", line 31, in <module>
words = [lemmatizer.lemmatize(word) for word in words if word not in ignore_letter]
File "C:/Users/tarun/PycharmProjects/Yumnam_jr_ChatBot/training.py", line 31, in <listcomp>
words = [lemmatizer.lemmatize(word) for word in words if word not in ignore_letter]
TypeError: lemmatize() missing 1 required positional argument: 'word'
The error PyCharm shows is "Parameter 'self' unfilled": lemmatize is being called on the class itself instead of an instance, so nothing fills the self parameter. Instantiate the class first and the call works:
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(word) for word in words if word not in ignore_letter]
words = sorted(set(words))
It works fine for me.
The reason you are getting the error is that you are missing the parentheses () after the class name when assigning it to a variable:
xyz = WordNetLemmatizer() --> these parentheses were missing, leading to the error
You can just write this to make your code work:
your_variable_here = WordNetLemmatizer()
In your code, you haven't added the parentheses () after WordNetLemmatizer.
Add them and you are good to go.
I am following instructions from this link on how to append Stata files via a foreach loop. I think that it's pretty straightforward.
However, when I try to refer to each f in datafiles in my foreach loop, I receive the error:
invalid `
I've set my working directory and the data is in a subfolder called csvfiles. I am trying to call each file f in the csvfiles subfolder using my local macro datafiles and then append each file to an aggregate Stata dataset called data.dta.
I've included the code from my do file below:
clear
local datafiles: dir "csvfiles" files "*.csv"
foreach f of local datafiles {
preserve
insheet using "csvfiles\`f'", clear
** add syntax here to run on each file**
save temp, replace
restore
append using temp
}
rm temp
save data.dta, replace
The backslash character has meaning to Stata: it will prevent the interpretation of any following character that has a special meaning to Stata. In particular, the left single quote character
`
will not be interpreted as indicating a reference to a macro.
But all is not lost: Stata will allow you to use the forward slash character in path names on any operating system, and on Windows will take care of doing what must be done to appease Windows. Replacing your insheet command with
insheet using "csvfiles/`f'", clear
should solve your problem.
Note that the instructions you linked to do exactly that; some of the code includes backslashes in path names, but where a macro is included, forward slashes are used instead.
I'm working with Erlang, writing an escript. I've seen many examples with file I/O that are not so easy to follow, so I found this:
Text = file:read_file("f.txt"),
io:format("~n", Text).
It works, somehow: it prints the file contents, followed by multiple errors:
in call from erl_eval:do_apply/6 (erl_eval.erl, line 572)
in call from escript:eval_exprs/5 (escript.erl, line 850)
in call from erl_eval:local_func/5 (erl_eval.erl, line 470)
in call from escript:interpret/4 (escript.erl, line 768)
in call from escript:start/1 (escript.erl, line 277)
in call from init:start_it/1 (init.erl, line 1050)
in call from init:start_em/1 (init.erl, line 1030)
So what would be the easiest way to read the whole file and store the contents in a list for later use?
First, file:read_file/1 will return {ok, Binary} on success, where Binary is a binary representing the contents of the file. On error, {error, Reason} is returned. Thus your Text variable is actually a tuple. The easy fix (crashing if there is an error):
{ok, Text} = file:read_file("f.txt")
Next, the first argument to io:format/2 is a format string. ~n is a format that means "newline", but you haven't given it a format that means anything else, so it's not expecting Text as an argument. Furthermore, all arguments to the format string should be in a list passed as the second argument. ~s means string, so:
io:format("~s~n", [Text])
will print out the entire file, followed by a newline. If you want to pass multiple arguments, it would look something like:
io:format("The number ~B and the string ~s~n", [100, "hello"])
Notice how there are only two arguments to io:format/2; one just happens to be a list containing multiple entries.
Since your question asked for an easy way to read the contents of a file into a data structure, you might enjoy file:consult/1. This solution assumes you have control over the format of the file, since consult/1 expects the file to consist of lines terminated with '.'. It returns {ok, [terms()]} | {error, Reason}.
So, if your file, t.txt, consisted of lines terminated by '.' as follows:
'this is an atom'.
{person, "john", "smith"}.
[1,2,3].
then you could utilize file:consult/1
1> file:consult("c:/t.txt").
{ok,['this is an atom',{person,"john","smith"},[1,2,3]]}
Creating a document in CouchDB is generating the following error:
12> ADoc.
[{<<"Adress">>,<<"Hjalmar Brantingsgatan 7 C">>},
{<<"District">>,<<"Brämaregården">>},
{<<"Rent">>,3964},
{<<"Rooms">>,2},
{<<"Area">>,0}]
13> IDoc.
[{<<"Adress">>,<<"Segeparksgatan 2A">>},
{<<"District">>,<<"Kirseberg">>},
{<<"Rent">>,9701},
{<<"Rooms">>,3},
{<<"Area">>,83}]
14> erlang_couchdb:create_document({"127.0.0.1", 5984}, "proto_v1", IDoc).
{json,{struct,[{<<"ok">>,true},
{<<"id">>,<<"c6d96b5f923f50bfb9263638d4167b1e">>},
{<<"rev">>,<<"1-0d17a3416d50129328f632fd5cfa1d90">>}]}}
15> erlang_couchdb:create_document({"127.0.0.1", 5984}, "proto_v1", ADoc).
** exception exit: {ucs,{bad_utf8_character_code}}
in function xmerl_ucs:from_utf8/1 (xmerl_ucs.erl, line 185)
in call from mochijson2:json_encode_string/2 (/Users/admin/AlphaGroup/src/mochijson2.erl, line 200)
in call from mochijson2:'-json_encode_proplist/2-fun-0-'/3 (/Users/admin/AlphaGroup/src/mochijson2.erl, line 181)
in call from lists:foldl/3 (lists.erl, line 1197)
in call from mochijson2:json_encode_proplist/2 (/Users/admin/AlphaGroup/src/mochijson2.erl, line 184)
in call from erlang_couchdb:create_document/3 (/Users/admin/AlphaGroup/src/erlang_couchdb.erl, line 256)
Of the two documents above, one (IDoc) can be created in CouchDB with no problem.
Can anyone help me figure out what is causing this?
I think the problem is in the <<"Brämaregården">>. It is necessary to convert the Unicode to binary first. An example is in the following links:
unicode discussion. The core function is in unicode
Entering non-ASCII characters in Erlang code is fiddly, not the least because it works differently in the shell than in compiled Erlang code.
Try inputting the binary explicitly as UTF-8:
<<"Br", 16#c3, 16#a4, "mareg", 16#c3, 16#a5, "rden">>
That is, "ä" is represented by the bytes C3 A4 in UTF-8, and "å" by C3 A5. There are many ways to find those codes; a quick search turned up this table.
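Those byte values are easy to double-check from any language with a UTF-8 encoder; for example, in Python:

```python
# Verify the UTF-8 encodings quoted above: "ä" -> C3 A4, "å" -> C3 A5.
print("ä".encode("utf-8").hex())  # c3a4
print("å".encode("utf-8").hex())  # c3a5
```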
Normally you'd get the input from somewhere outside your code, e.g. reading from a file, typed into a web form etc, and then you wouldn't have this problem.