Name generation from human rules - machine-learning

I'm looking for a solution that lets me generate names from rules that are unofficial and not written down anywhere.
Here are several examples:
Mairie de Paris -> MPARIS
Mairie de Saint Etienne -> MSTETIENNE
Transport Dupont -> DUPONTTRANSPORT
Lycée Louis Barthou -> LLouisBarthou
I was initially thinking about machine learning, but I have no clue where to start.
Thank you very much.

This is fairly ad hoc, but you can try the following function and edit it to suit your needs.
from nltk import word_tokenize

def generate_name(s):
    s_tokenized = word_tokenize(s)
    stop_words = ['de']
    s_tokenized_list = []
    for w in s_tokenized:
        if w not in stop_words:
            s_tokenized_list.append(w)
    name = []
    length_of_list = len(s_tokenized_list)
    # Take the first letter of every word except the last one.
    if length_of_list >= 3:
        for n in s_tokenized_list[:length_of_list-1]:
            name.append(n[0])
    elif length_of_list == 2:
        for n in s_tokenized_list[:length_of_list-1]:
            name.append(n[0])
    name = ''.join(name)
    # Append the last word in upper case.
    return ''.join(name + s_tokenized_list[length_of_list-1].upper())
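Since the examples in the question follow different patterns (initial plus place name, abbreviation of "Saint", word order swapped for "Transport"), one alternative to a single formula is a small table of hand-written rules. The following is only a sketch under assumptions: the rule tables STOP_WORDS, ABBREVIATIONS, INITIAL_ONLY and MOVE_TO_END are hypothetical and were guessed from the four examples, and the result is always upper case (so the Lycée example comes out as LLOUISBARTHOU rather than the mixed-case form in the question).

```python
# Hypothetical rule tables inferred from the four examples in the question.
STOP_WORDS = {"de", "du", "des", "la", "le"}
ABBREVIATIONS = {"saint": "ST", "sainte": "STE"}
INITIAL_ONLY = {"mairie", "lycée"}   # leading word is reduced to its initial
MOVE_TO_END = {"transport"}          # leading word is moved after the rest


def generate_name(label):
    """Build a short code from a label using the rule tables above."""
    words = [w for w in label.split() if w.lower() not in STOP_WORDS]
    if not words:
        return ""
    head, tail = words[0], words[1:]
    # Abbreviate known words, upper-case everything else.
    tail_part = "".join(ABBREVIATIONS.get(w.lower(), w.upper()) for w in tail)
    key = head.lower()
    if key in INITIAL_ONLY:
        return head[0].upper() + tail_part
    if key in MOVE_TO_END:
        return tail_part + head.upper()
    return head.upper() + tail_part


for label in ["Mairie de Paris", "Mairie de Saint Etienne", "Transport Dupont"]:
    print(label, "->", generate_name(label))
```

New patterns are handled by adding entries to the tables rather than by changing the function, which keeps the unwritten rules in one visible place.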


Find sequence IDs of DNA subsequences in DNA-sequences from FASTA-file

I want to write a function that reads a FASTA file of (possibly ambiguous) DNA sequences, takes a subsequence as input, and returns the IDs of all sequences that contain that subsequence.
To make the script more efficient, I tried to use nt_search to get all the possibilities for each ambiguous sequence in the FASTA file. This seemed more efficient than producing all unambiguous possibilities, especially for larger sequences and FASTA files.
Right now, I'm struggling to see how I can check whether the subsequence is part of the output given by nt_search.
I want to see whether, e.g., 'CGC' (the input subsequence) is among the possibilities given by nt_search, ['TA[GATC][AT][GT]GCGGT'], and return the IDs of all sequences for which this is true.
What I have so far:
def bonus_subsequence(file, unambiguous_sequence):
    seq_records = SeqIO.parse(file,'fasta', alphabet =ambiguous_dna)
    resultListOfSeqIds = []
    print(f'Unambiguous sequence {unambiguous_sequence} could be a subsequence of:')
    for record in seq_records:
        d = Seq.IUPAC.IUPACData.ambiguous_dna_values
        couldBeSubSequence = False;
        if unambiguous_sequence in nt_search(unambiguous_sequence,record):
            couldBeSubSequence = True;
        if couldBeSubSequence == True:
            print(f'{record.id}')
            resultListOfSeqIds.append({record.id})
In a second phase, I want to be able to also use this for ambiguous subsequences, but I'd be more than happy with help on this first question, thanks in advance!
I'm not sure I understood you correctly, but you can try this:
Example fasta file:
>seq1
ATGTACGTACGTACNNNNACTG
>seq2
NNNATCGTAGTCANNA
>seq3
NNNNATGNNN
Code:
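from Bio import SeqIO
from Bio import SeqUtils
# Requires Biopython < 1.78; Bio.Alphabet was removed in 1.78.
from Bio.Alphabet.IUPAC import ambiguous_dna

if __name__ == '__main__':
    sub_seq = input('Enter a subsequence: ')
    results = []
    with open('test.fasta', 'r') as fh:
        for seq in SeqIO.parse(fh, 'fasta', alphabet=ambiguous_dna):
            # Plain substring test: an 'N' only matches a literal 'N'.
            if sub_seq in seq:
                results.append(seq.name)
    # Print a header, then one ID per line.
    print('Results:', *results, sep='\n')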
Results (console):
Enter a subsequence: ATG
Results:
seq1
seq3
Enter a subsequence: NNNA
Results:
seq1
seq2
seq3
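The substring test above treats N as a literal character. For the case in the question, where an unambiguous query such as 'CGC' should match wherever the ambiguity codes of a record allow it, a window-by-window comparison against the IUPAC code table is one option. This is a minimal sketch in plain Python, not the Biopython API: could_contain, parse_fasta and matching_ids are hypothetical helper names, and the query is assumed to contain only A, C, G and T.

```python
# IUPAC nucleotide ambiguity codes mapped to the bases they may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def could_contain(ambiguous_seq, query):
    """Return True if some resolution of ambiguous_seq contains query."""
    seq = ambiguous_seq.upper()
    query = query.upper()
    for start in range(len(seq) - len(query) + 1):
        window = seq[start:start + len(query)]
        # Every query base must be allowed by the code at that position.
        if all(base in IUPAC.get(code, "") for code, base in zip(window, query)):
            return True
    return False


def parse_fasta(text):
    """Yield (id, sequence) pairs from FASTA-formatted text."""
    seq_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            seq_id, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def matching_ids(fasta_text, query):
    """Return the IDs of all records that could contain query."""
    return [sid for sid, seq in parse_fasta(fasta_text) if could_contain(seq, query)]


example = ">seq1\nACGT\n>seq2\nTANWKGCGGT\n>seq3\nTTTT\n"
print(matching_ids(example, "GCG"))
```

The cost is proportional to sequence length times query length per record, and no list of unambiguous expansions is ever built.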

Error "subscript not supported on set[Declaration]" for list comprehension in Rascal

I do not understand why I get the error I'm currently getting in Rascal.
|cwd:///loader.rsc|(391,1,<19,33>,<19,34>): subscript not supported on set[Declaration] at |cwd:///loader.rsc|(391,1,<19,33>,<19,34>)
Advice: |http://tutor.rascal-mpl.org/Errors/Static/UnsupportedOperation/UnsupportedOperation.html|
I get this on the following list comprehension:
{asts[astIndexes[i]] | int i <- [0 .. size(astIndexes)]}
If needed, this is the entire file:
module loader

import IO;
import Set;
import List;
import lang::java::m3::Core;
import lang::java::m3::AST;
import String;

set[Declaration] asts = {};

void getAsts(list[loc] partialScanList){
    asts = {};
    for (loc m <- partialScanList)
        asts += createAstFromFile(m, true);
}

void scanMetric(void (set[Declaration]) metricFunction, list[int] astIndexes){
    metricFunction({asts[astIndexes[i]] | int i <- [0 .. size(astIndexes)]});
    println(0);
}
The answer is that the subscript operator is defined on maps and relations, not on sets. For example, given rel[int,int] x = {<1,2>} you could write x[1] and get {2}, and given map[int,int] y = (1:2) you could write y[1] and get 2.
As a side note, this code looks like it is computing lookup indexes for AST nodes, but Rascal already has pretty efficient hashes for all ADT constructor trees, and those are used for lookups in relations and maps. Since these hash codes are also integers and their distribution is pretty uniform, it is very hard to increase performance by introducing your own indexing scheme on top of this. Most likely it would degrade performance rather than improve it.
So if you need a lookup per AST node, you could use a rel[Declaration, Something else]. People often also use loc values as references to AST nodes, since they are supposed to be pretty unique. That helps if you can't keep all ASTs in memory all the time.

Can I Use Multiple Properties in a StructuredFormatDisplayAttribute?

I'm playing around with StructuredFormatDisplay and I assumed I could use multiple properties for the Value, but it seems that is not the case. This question (and accepted answer) talk about customizing in general, but the examples given only use a single property. MSDN is not helpful when it comes to usage of this attribute.
Here's my example:
[<StructuredFormatDisplay("My name is {First} {Last}")>]
type Person = {First:string; Last:string}
If I then try this:
let johnDoe = {First="John"; Last="Doe"}
I end up with this error:
<StructuredFormatDisplay exception: Method 'FSI_0038+Person.First}
{Last' not found.>
The error seems to hint at it only capturing the first property mentioned in my Value but it's hard for me to say that with any confidence.
I have figured out I can work around this by declaring my type like this:
[<StructuredFormatDisplay("My name is {Combined}")>]
type Person = {First:string; Last:string} with
    member this.Combined = this.First + " " + this.Last
But I was wondering if anyone could explain why I can't use more than one property, or if you can, what syntax I'm missing.
I did some digging in the source and found this comment:
In this version of F# the only valid values are of the form PreText {PropertyName} PostText
But I can't find where that limitation is actually implemented, so perhaps someone more familiar with the code base could simply point me to where this limitation is implemented and I'd admit defeat.
The relevant code from the F# repository is in the file sformat.fs, around line 868. Omitting lots of details and some error handling, it looks something like this:
let p1 = txt.IndexOf ("{", StringComparison.Ordinal)
let p2 = txt.LastIndexOf ("}", StringComparison.Ordinal)
if p1 < 0 || p2 < 0 || p1+1 >= p2 then
    None
else
    let preText = if p1 <= 0 then "" else txt.[0..p1-1]
    let postText = if p2+1 >= txt.Length then "" else txt.[p2+1..]
    let prop = txt.[p1+1..p2-1]
    match catchExn (fun () -> getProperty x prop) with
    | Choice2Of2 e ->
        Some (wordL ("<StructuredFormatDisplay exception: " + e.Message + ">"))
    | Choice1Of2 alternativeObj ->
        let alternativeObjL =
            match alternativeObj with
            | :? string as s -> sepL s
            | _ -> sameObjL (depthLim-1) Precedence.BracketIfTuple alternativeObj
        countNodes 0 // 0 means we do not count the preText and postText
        Some (leftL preText ^^ alternativeObjL ^^ rightL postText)
So, you can easily see that this looks for the first { and the last }, and then picks the text between them. So for foo {A} {B} bar, it extracts the text A} {B.
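To make the first-brace/last-brace behaviour concrete, here is a small Python imitation of just the index arithmetic. It is an illustration only, not the F# implementation; split_format is a hypothetical name.

```python
def split_format(txt):
    """Mimic the lookup above: first '{', last '}', text in between."""
    p1 = txt.find("{")
    p2 = txt.rfind("}")
    if p1 < 0 or p2 < 0 or p1 + 1 >= p2:
        return None
    pre_text = txt[:p1]
    prop = txt[p1 + 1:p2]
    post_text = txt[p2 + 1:]
    return pre_text, prop, post_text


print(split_format("foo {A} {B} bar"))
print(split_format("My name is {First} {Last}"))
```

With two placeholders, the "property name" spans everything from the first opening brace to the last closing brace, which is exactly the name that shows up in the error message from the question.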
This does sound like a silly limitation and also one that would not be that hard to improve. So, feel free to open an issue on the F# GitHub page and consider sending a pull request!
Just to put a bow on this, I did submit a PR to add this capability and yesterday it was accepted and pulled into the 4.0 branch.
So starting with F# 4.0, you'll be able to use multiple properties in a StructuredFormatDisplay attribute, with the only downside that all curly braces you wish to use in the message will now need to be escaped by a leading \ (e.g. "I love \{ braces").
I rewrote the offending method to support recursion and switched to using a regular expression to detect property references. It seems to work pretty well, although it isn't the prettiest code I've ever written.
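The general idea of detecting property references with a regular expression while honouring backslash-escaped braces can be sketched in Python as follows. This is an illustration of the technique under assumptions, not the code from the pull request: render and the Person class are hypothetical, and property names are assumed to consist of word characters only.

```python
import re
from dataclasses import dataclass

# Matches {Name} unless the opening brace is preceded by a backslash.
_PROPERTY = re.compile(r"(?<!\\)\{(\w+)\}")


def render(template, obj):
    """Replace every {Name} with str(obj.Name), then unescape \\{ and \\}."""
    def lookup(match):
        return str(getattr(obj, match.group(1)))

    text = _PROPERTY.sub(lookup, template)
    return text.replace("\\{", "{").replace("\\}", "}")


@dataclass
class Person:
    First: str
    Last: str


print(render("My name is {First} {Last}", Person("John", "Doe")))
```

Because each placeholder is matched on its own, any number of properties can appear in one format string, and literal braces survive as long as they are escaped.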

How to pass a digraph to a different process and node?

I have created a digraph term on process A and I want to pass this digraph to a process on another node. Whenever I use this digraph on the other process I am getting errors such as:
** {badarg,
[{ets,insert,[598105,{"EPqzYxiM9UV0pplPTRg8vX28h",[]}],[]},
{digraph,do_add_vertex,2,[{file,"digraph.erl"},{line,377}]},
Because a digraph is based on ETS, this appears to be considerably more complicated: a digraph is pretty much tied to the process that created it. I have found this entry that describes a similar problem: ETS on a different process
I know I can create the digraph in a server and then connect to it through OTP messages, but I cannot do that in my architecture. All nodes can communicate using a specific approach designed to pass the state along as terms.
It appears to me that sending digraphs across different nodes that cannot directly communicate with each other is not possible. Overall, it appears that a digraph cannot be directly serialized. I am thinking that I can "unwind" the digraph into a list of vertices and edges and then transmit and recreate it on the other process (not efficient, performant or elegant). Is there a better way to serialize it? Is there a way to serialize the digraph state out of the ETS store?
Any thoughts?
You can serialize/deserialize a digraph like this:
serialize({digraph, V, E, N, B}) ->
    {ets:tab2list(V),
     ets:tab2list(E),
     ets:tab2list(N),
     B}.
deserialize({VL, EL, NL, B}) ->
    DG = {digraph, V, E, N, B} = case B of
        true -> digraph:new();
        false -> digraph:new([acyclic])
    end,
    ets:delete_all_objects(V),
    ets:delete_all_objects(E),
    ets:delete_all_objects(N),
    ets:insert(V, VL),
    ets:insert(E, EL),
    ets:insert(N, NL),
    DG.
And this is the code I used to test this:
passer() ->
    G = digraph:new(),
    V1 = digraph:add_vertex(G),
    V2 = digraph:add_vertex(G),
    V3 = digraph:add_vertex(G),
    digraph:add_edge(G, V1, V2, "edge1"),
    digraph:add_edge(G, V1, V3, "edge2"),
    Pid = spawn(fun receiver/0),
    Pid ! serialize(G).
receiver() ->
    receive
        SG = {_VL, _EL, _NL, _B} ->
            G = deserialize(SG),
            io:format("Edges: ~p~n", [digraph:edges(G)]),
            io:format("Vertices: ~p~n", [digraph:vertices(G)])
    end.
It's a pretty ugly solution, but it works. I think this is the only way to pass a digraph between nodes, since ETS tables cannot be shared between nodes.
Edit: remove unnecessary loops
I have a solution, but it relies on the structure of the term returned by digraph:new(), so I am not sure that it will be compatible with future versions.
D = digraph:new(),
...
% some code modifying D
...
{digraph,Vertices,Edges,Neighbours,Cyclic} = D, % get the table IDs of the 3 tables containing D's values
% It would be preferable to use the record definition from the digraph module:
%-record(digraph, {vtab = notable :: ets:tab(),
%                  etab = notable :: ets:tab(),
%                  ntab = notable :: ets:tab(),
%                  cyclic = true :: boolean()}).
LV = ets:tab2list(Vertices),
LE = ets:tab2list(Edges),
LN = ets:tab2list(Neighbours),
...
% then serialize and send all these variables to the target node, ideally in a single tuple like
% {my_digraph_data,LV,LE,LN,Cyclic}, or using refs to avoid mixing up messages,
% and on reception on the remote node:
receive
    {my_digraph_data,LV,LE,LN,Cyclic} ->
        % digraph:new/1 takes a list of options, not the Cyclic boolean
        Dcopy = case Cyclic of
                    true -> digraph:new();
                    false -> digraph:new([acyclic])
                end,
        {digraph,Vertices,Edges,Neighbours,_Cyclic} = Dcopy,
        ets:insert(Vertices,LV),
        ets:insert(Edges,LE),
        ets:insert(Neighbours,LN),
        Dcopy;
    ...
and that's it.
Note: If you are testing this in the same shell, make sure to change the names of Vertices, Edges, and Neighbours in the spawned process's receive expression; otherwise it will fail with badmatch (as they have already been bound when matching D).

How to create a list of 1000 random numbers in Erlang?

I'm sure that there is a function for that. I just want to make a list of 1000 numbers, each of which should be random.
To generate a 1000-element list with random numbers between 1 and 10:
[rand:uniform(10) || _ <- lists:seq(1, 1000)].
Change 10 and 1000 to appropriate numbers. If you omit the 10 from the rand:uniform call, you'll get a random floating-point number between 0.0 and 1.0.
On Erlang versions below 18.0, use the random module instead. Caution: you need to call random:seed/3 in each process before using it, to avoid getting the same pseudo-random numbers.
Make sure to seed appropriately.
> F = fun() -> io:format("~p~n", [[random:uniform(10) || _ <- lists:seq(1, 10)]]) end.
> spawn(F).
[1,5,8,10,6,4,6,10,7,5]
> spawn(F).
[1,5,8,10,6,4,6,10,7,5]
You might expect the two results to differ. However, a random seed in Erlang is process-specific, and the default seed is fixed. That's why you get the same result even though there are two processes in the example.
> G = fun() -> {A1,A2,A3} = now(),
               random:seed(A1, A2, A3),
               io:format("~p~n", [[random:uniform(10) || _ <- lists:seq(1, 10)]])
      end.
> spawn(G).
[3,1,10,7,9,4,9,2,8,3]
> spawn(G).
[9,1,4,7,8,8,8,3,5,6]
Note that if now() returns the same value in two different processes, you end up with the same problem as above, which is why some people like to wrap random number generation in a gen_server. Alternatively, you can use better seeds.
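The underlying point, that a pseudo-random generator started from the same seed always yields the same sequence, is not specific to Erlang. As a rough analogy only (Python's random module, not Erlang's random or rand), the effect can be reproduced like this; draw is a hypothetical helper:

```python
import random


def draw(seed, count=10):
    """Return `count` integers in 1..10 from a generator with a fixed seed."""
    rng = random.Random(seed)  # independent generator, like per-process state
    return [rng.randint(1, 10) for _ in range(count)]


first = draw(0)
second = draw(0)
print(first == second)  # same seed, same sequence
```

Two generators only diverge when their seeds differ, which is why a timestamp that happens to be identical in two processes brings the original problem back.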
I would also be happy to get a pointer to a site where I can read about this. Thanks.
You should check out Learn You Some Erlang which will guide you through the language.
The pseudorandom number generator from the crypto module works better: crypto:rand_uniform(From, To).
To generate a 1000-element list with random numbers between 1 and 10 (the upper bound To is exclusive, hence 11):
crypto:start(),
[crypto:rand_uniform(1, 11) || _ <- lists:seq(1, 1000)].
From Erlang Central wiki:
http://erlangcentral.org/wiki/index.php?title=Random_Numbers
Where N = number of items, StartVal = minimum value and Lim = maximum value (random:uniform(K) returns a value from 1 to K, so the range size is Lim-StartVal+1):
generate_random_int_list(N,StartVal,Lim) ->
    lists:map(fun (_) -> random:uniform(Lim-StartVal+1) + StartVal - 1 end, lists:seq(1,N)).
You need to correctly seed first of all.
_ = rand:seed(exs1024s),
[rand:uniform(100) || _ <- lists:seq(1, 1000)].
