I'm trying to BLAST an 8-mer (a string of length 8) against the NCBI database. However, whenever I use qblast, it returns no matches. This is my code:
from Bio.Blast.NCBIWWW import qblast
import Bio.Blast.NCBIXML as parser

# Submit the 8-mer to NCBI using qblast's default settings
a = qblast('blastp', 'nr', 'GGMPSGCS')
b = parser.read(a)
print(b.alignments)
Whenever I do this, it just prints the empty list []. Why is this happening? Can anyone shed some light on it?
I can get a match using the NCBI online BLAST tool, and I can even get a match if I use a longer kmer like "SSRVQDGMGLYTARRVR". It just happens that all the 8-mers I search come up empty.
From the FAQ at http://biopython.org/DIST/docs/tutorial/Tutorial.html
Why doesn’t Bio.Blast.NCBIWWW.qblast() give the same results as the NCBI BLAST website?
You need to specify the same options – the NCBI often adjust the default settings on the website, and they do not match the QBLAST defaults anymore. Check things like the gap penalties and expectation threshold.
Check that qblast is using the same defaults; if you are not sure, make them explicit. I wouldn't be surprised if it's doing some sort of "read too short" filtering step.
You have to fine-tune the qblast parameters to override the defaults. The web frontend of NCBI BLAST automatically adjusts your parameters to suit short (8-residue) sequences, but if you go through the Biopython API you have to do it manually.
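As a rough sketch of what "doing it manually" could look like: the keyword names below (expect, word_size, matrix_name, gapcosts) are real qblast parameters, but the values are my assumption, modeled on the kind of adjustments usually made for short protein queries. Compare them with whatever the website shows for your search before relying on them.

```python
def short_query_options():
    """Keyword arguments for qblast, tuned for very short peptides.

    The values are assumptions, not confirmed NCBI website defaults.
    """
    return {
        "expect": 200000,        # very permissive E-value; short hits score poorly
        "word_size": 2,          # smaller seed word so an 8-mer can still seed a hit
        "matrix_name": "PAM30",  # matrix usually suggested for short peptides
        "gapcosts": "9 1",       # gap existence and extension costs for PAM30
    }


def blast_short_peptide(sequence):
    """Run blastp against nr with the short-query options and parse the result."""
    # Imported here so the options helper can be used without Biopython installed.
    from Bio.Blast import NCBIWWW, NCBIXML

    handle = NCBIWWW.qblast("blastp", "nr", sequence, **short_query_options())
    return NCBIXML.read(handle)


# Usage (performs a network request to NCBI):
#   record = blast_short_peptide("GGMPSGCS")
#   for alignment in record.alignments:
#       print(alignment.title)
```

If the list is still empty, raising expect further or checking the other settings on the website's "Algorithm parameters" panel would be the next thing to try.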
For those who are not familiar with what a homophone is, I provide the following examples:
our & are
hi & high
to & too & two
While using the Speech API included with iOS, I am encountering situations where a user may say one of these words, but it will not always return the word I want.
I looked into the alternativeSubstrings property, wondering if it would help, but in my testing of the above words it always comes back empty.
I also looked into the Natural Language API, but could not find anything in there that looked useful.
I understand that as a user adds more words, the Speech API can begin to infer context and correct for these, but my use case will not work well with this, since it will often involve only one or two words at most, which limits the effectiveness of context.
An example of contextual processing:
Using the words above on their own, I get these results:
are
hi
to
However, if I put together the following sentence, you can see they are all wrong:
I am too high for our ladder
Ideally, I would either get a list back containing [are, our], [to, too, two], [hi, high] for each transcription segment, or would have a way to compare a string against a function that supports homophones.
An example of this would be:
if myDetectedWord == "to" then { ... }
Where myDetectedWord can be [to, too, two], and this function would return true for each of these.
This is a common NLP dilemma, and I'm not sure what output you want in this application. If possible, you may want to design around the problem in your architecture; otherwise it will turn into a real challenge.
That said, if you really want to get into it, I like this idea of yours:
string against a function
This might be more efficient and performance-friendly.
One way I would approach this is through regex processing instead of endless loops and arrays. You could prototype with loops and arrays first to see how it works, then switch to regular expressions to gain performance.
For instance, you could define fixed word lists as alternations in regular expressions and quickly check your string against them (word by word, maybe using back-references), adding as many boundaries to your expressions as you need.
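Here is a minimal sketch of that idea in Python (the same pattern carries over to NSRegularExpression on iOS). The group list and the function names are made up for illustration; the groups are just the three examples from the question.

```python
import re

# Hypothetical homophone groups, taken from the examples in the question.
HOMOPHONE_GROUPS = [
    ("are", "our"),
    ("hi", "high"),
    ("to", "too", "two"),
]


def homophone_pattern(target):
    """Build a regex that matches the target word or any of its homophones."""
    words = (target,)
    for group in HOMOPHONE_GROUPS:
        if target.lower() in group:
            words = group
            break
    alternation = "|".join(re.escape(word) for word in words)
    # \b keeps "to" from matching inside a longer word such as "tomato".
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


def is_homophone_of(detected, target):
    """True if the detected word is the target or one of its homophones."""
    return homophone_pattern(target).fullmatch(detected.strip()) is not None


def contains_homophone_of(transcript, target):
    """True if any word of the transcript is the target or a homophone of it."""
    return homophone_pattern(target).search(transcript) is not None
```

With this, is_homophone_of("two", "to") and is_homophone_of("too", "to") are both true, which is the behaviour the question asks for in place of myDetectedWord == "to".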
Your fixed lists can also be designed around the probability of certain words occurring in certain parts of a string. For instance,
^I
vs
^eye
The probability of I being the first word is much higher than that of eye.
The probability of I in any part of a string is also higher than that of eye.
You might want to weight words based on that.
I'd say the key is to narrow your desired outputs as much as possible to increase accuracy (maybe even to 100 words, if possible) if you want a good, working application.
Good project, though. I hope you enjoy the challenge.
I need a way to extract the Salesforce record ID from a URL using Zapier Push. How can I find the first 3 characters in a string that match the start of the Id like 006 and then return a set number of characters after that?
The url is formatted as such:
https://useindio.lightning.force.com/lightning/r/Opportunity/006f400000AiVufAAF/view
David here, from the Zapier Platform team. Good question!
Whenever you want to extract data from a string and you know the exact format the string will be in, Regular Expressions are the answer.
Assuming you want to grab anything after 006 (and you know it'll always be there), you could use the regex 006(\w{15}), which will find the 15 characters after that. If you know the surrounding URL will always be the same, you could easily grab the whole ID by anchoring on Opportunity and view: \/Opportunity\/(.*)\/view.
Either way, you can do this with a Formatter step in your Zap, or you could do it in a Code step (JavaScript or Python).
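For the code route, a small Python sketch of both patterns could look like this. The function names are mine, and the sample URL is the one from the question; inside a Zapier Code step you would read the URL from the step's input data rather than hard-coding it.

```python
import re

SAMPLE_URL = (
    "https://useindio.lightning.force.com/lightning/r/"
    "Opportunity/006f400000AiVufAAF/view"
)


def id_by_prefix(url, prefix="006"):
    """Return the prefix plus the 15 word characters that follow it, or None."""
    match = re.search(re.escape(prefix) + r"\w{15}", url)
    return match.group(0) if match else None


def id_by_anchor(url):
    """Return whatever sits between /Opportunity/ and /view, or None."""
    match = re.search(r"/Opportunity/(.*)/view", url)
    return match.group(1) if match else None


print(id_by_prefix(SAMPLE_URL))   # 006f400000AiVufAAF
print(id_by_anchor(SAMPLE_URL))   # 006f400000AiVufAAF
```

The prefix version returns the whole 18-character ID (the three-character prefix plus 15 more), while the anchored version does not care what the ID starts with.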
Let me know if you've got any other questions!
I want to make custom slots that accept any and all entries, as long as those entries follow a certain regex pattern, e.g. any number of letters or digits with no spaces in between. Can anyone tell me if there is a way to achieve this in Amazon Lex?
Also, if I want to take a certain type of data, say email IDs, but want to give the user the option to enter any number of them (more than one), what is the way to do that?
I am new to Amazon Lex and any suggestions would be appreciated.
Make a slot in the Lex console for your intent, but do not mark it as required, and give it any slot type.
Now, in your Lambda code, first set the slot to null, then parse the user's input text with a regex and assign the correct value to the slot.
This way both of your problems will be addressed.
Hope it helps. Let us know if you run into any problems.
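A rough sketch of such a Lambda handler in Python follows. It assumes the Lex V1 event shape, where the raw utterance arrives as inputTranscript and the slots sit under currentIntent; the slot names Emails and Code are hypothetical, and the email regex is deliberately loose. Slot values are kept as strings, so several emails are joined with commas.

```python
import re

# Loose patterns for illustration only; tighten them for real use.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CODE_RE = re.compile(r"[A-Za-z0-9]+")  # letters and digits, no spaces


def lambda_handler(event, context):
    """Fill the (hypothetical) Emails and Code slots from the raw utterance."""
    text = event.get("inputTranscript", "").strip()
    slots = dict(event["currentIntent"]["slots"] or {})

    # Reset the slots first, then assign only what the regexes accept.
    slots["Emails"] = None
    slots["Code"] = None

    emails = EMAIL_RE.findall(text)
    if emails:
        slots["Emails"] = ",".join(emails)
    elif CODE_RE.fullmatch(text):
        slots["Code"] = text

    # Hand control back to Lex with the updated slot values.
    return {
        "sessionAttributes": event.get("sessionAttributes") or {},
        "dialogAction": {"type": "Delegate", "slots": slots},
    }
```

If neither pattern matches, both slots stay null, and you could return an ElicitSlot response instead of Delegate to ask the user again.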
I am trying to validate fields in my iOS program:
I need to match a phone number, but the field is optional.
I thought of using a regex that matches the number and also accepts the case where there is no phone number:
[0-9\-\+\*]{4,14}
Then I wondered how to match either a valid number or no number at all:
(:?[0-9\-\+\*]{4,14})?
Meaning: match either 4 to 14 characters from the set 0-9, +, -, * or nothing at all.
A regex-testing website is showing infinite matches for that pattern.
Any ideas?
^$|^[0-9\-\+\*]{4,14}$
To address the questions this has raised here:
Regex is a great validation method, and it is cross-platform.
There is no need for another layer of code. Simple and clean.
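To see why the anchors matter, here is a small Python sketch. It assumes the (:? in the question was meant as a non-capturing group, (?:. Without anchors the optional group is allowed to match the empty string at every position, which is what the "infinite matches" were; with ^ and $ the whole field has to be either empty or a valid number.

```python
import re

# Anchored pattern from the answer: an empty field, or 4 to 14 allowed characters.
OPTIONAL_PHONE = re.compile(r"^$|^[0-9\-+*]{4,14}$")

# Unanchored optional group, assuming (:? was meant to be (?: .
UNANCHORED = re.compile(r"(?:[0-9\-+*]{4,14})?")


def is_valid_phone_field(value):
    """True for an empty field or a 4 to 14 character phone number."""
    # Note: in Python, $ also matches just before a trailing newline,
    # so strip the input first if that matters.
    return OPTIONAL_PHONE.match(value) is not None


def unanchored_matches(text):
    """All matches of the unanchored pattern, including the empty ones."""
    return UNANCHORED.findall(text)


print(is_valid_phone_field(""))             # True
print(is_valid_phone_field("+1-555-0100"))  # True
print(is_valid_phone_field("123"))          # False
print(unanchored_matches("ab"))             # ['', '', '']
```

The last line shows one empty match per position in "ab", including the end of the string.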
You should just code it. I don't know your language, but basically:
if (field.isEmpty)
should do the trick.
I've done some Google searching but couldn't find what I was looking for.
I'm developing a scrabble-type word game in rails, and was wondering if there was a simple way to validate what the player inputs in the game is actually a word. They'd be typing the word out.
Is validation against some sort of English language dictionary database loaded within the app best way to solve this problem? If so, are there any libraries that offer this kind of functionality? If not, what would you suggest?
Thanks for your help!
You need two things:
a word list
some code
The word list is the tricky part. On most Unix systems there's a word list at /usr/share/dict/words or /usr/dict/words -- see http://en.wikipedia.org/wiki/Words_(Unix) for more details. The one on my Mac has 234,936 words in it. But they're not all valid Scrabble words. So you'd have to somehow acquire a Scrabble dictionary, make sure you have the right license to use it, and process it so it's a text file.
(Update: The word list for LetterPress is now open source, and available on GitHub.)
The code is no problem in the simple case. Here's a script I whipped up just now:
# Load every word into a hash for fast lookup
words = {}
File.open("/usr/share/dict/words") do |file|
  file.each do |line|
    words[line.strip] = true
  end
end

p words["magic"]    # a real word
p words["saldkaj"]  # not a word
This will output
true
nil
I leave it as an exercise for the reader to make it into a proper Words object. (Technically it's not a Dictionary since it has no definitions.) Or to use a DAWG instead of a hash, even though a hash is probably fine for your needs.
A piece of language-agnostic advice: if you only care about whether a word exists (which you do here), and you are planning to load the entire database into the application (which your question suggests you're considering), then a DAWG lets you check existence in O(n) time, where n is the length of the word. Dictionary size has no effect, so the lookup is essentially O(1) with respect to the dictionary. It is also a relatively compact structure in memory: some insertions actually reduce its size, so a DAWG for "top, tap, taps, tops" has fewer nodes than one for "tops, tap".
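To make the lookup cost concrete, here is a minimal sketch in Python of a plain trie. This is not a real DAWG: a DAWG additionally merges identical suffixes to save memory, which is left out here. The lookup walk is the same, though, one step per letter regardless of how many words are stored. The class names are mine.

```python
class TrieNode:
    """One node per prefix; is_word marks the end of a complete word."""

    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}
        self.is_word = False


class WordTrie:
    """Plain trie (no suffix merging) supporting `word in trie` checks."""

    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for letter in word:
            node = node.children.setdefault(letter, TrieNode())
        node.is_word = True

    def __contains__(self, word):
        # One dictionary lookup per letter: O(len(word)).
        node = self.root
        for letter in word:
            node = node.children.get(letter)
            if node is None:
                return False
        return node.is_word


trie = WordTrie(["top", "tap", "taps", "tops"])
print("taps" in trie)     # True
print("ta" in trie)       # False, only a prefix
print("saldkaj" in trie)  # False
```

For a Scrabble-sized word list, a hash or set is probably fine, as the other answer says; the trie or DAWG starts to pay off when you also need prefix queries, for example while generating candidate moves.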