How to prepare data for Weka for word sense disambiguation - machine-learning

I want to use Weka for word sense disambiguation. I prepared some files, each containing a Persian sentence, a tab, a Persian word, a tab, and then an English word. They are plain .txt files edited in Notepad++. Now how should I use these files with Weka? How should I convert them?
The sample file:
https://www.dropbox.com/s/o7wtvrvkiir80la/F.txt?dl=0

I found it. The files should have the same number of columns, so I put each sentence in quotation marks, then a comma, then the English word in quotation marks. Above the data, we should declare the proper relation and attributes.
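In other words, Weka expects an ARFF file (or a CSV with a header row). Below is a minimal Python sketch of that conversion, assuming the tab-separated layout described above (sentence, Persian word, English sense); the file names F.txt / F.arff and the attribute names are just placeholders.

    # Convert the tab-separated file (sentence<TAB>Persian word<TAB>English word)
    # into Weka's ARFF format. File and attribute names are illustrative.
    rows = []
    with open("F.txt", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 3:
                rows.append(parts)

    senses = sorted({eng for _, _, eng in rows})   # nominal class values for the header

    def quote(s):
        return '"%s"' % s.replace('"', '\\"')      # escape embedded double quotes

    with open("F.arff", "w", encoding="utf-8") as out:
        out.write("@relation wsd\n\n")
        out.write("@attribute sentence string\n")
        out.write("@attribute target_word string\n")
        out.write("@attribute sense {%s}\n\n" % ",".join(senses))
        out.write("@data\n")
        for sentence, word, sense in rows:
            out.write("%s,%s,%s\n" % (quote(sentence), quote(word), sense))

The string attributes can then be turned into features inside Weka, for example with the StringToWordVector filter.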

Related

How can I determine if a word is part of an English word or is a portmanteau (a word created by combining parts of valid English words)?

I am trying to create a validator that takes in words and tries to determine if the word is one of the following:
It is a valid English word
It is a part of an English word
It is an abbreviation
It is a portmanteau -- a word created by concatenating parts of valid English words
Are there Java or Python libraries/frameworks that can perform this task?
Samples of words: meds, ppg, reauthorization, appmetadata, reconsent, rawlog
I've tried Python NLTK (cursory investigation so far) and a Python library called enchant (this fails to identify many valid words/parts of words and portmanteaus).
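For what it's worth, here is a rough sketch of such a validator built only on NLTK's "words" corpus. The rules (a prefix or suffix of a dictionary word counts as "part of a word", a concatenation of two dictionary words counts as a portmanteau candidate, anything left over is treated as a possible abbreviation) are simplifying assumptions, not an existing library feature.

    # Crude word validator on top of NLTK's "words" corpus.
    import nltk
    nltk.download("words", quiet=True)
    from nltk.corpus import words

    VOCAB = set(w.lower() for w in words.words())

    def classify(token):
        token = token.lower()
        if token in VOCAB:
            return "valid English word"
        # "part of a word": token is a prefix or suffix of some dictionary word
        # (a linear scan over the vocabulary; fine for a sketch, slow in bulk)
        if any(w.startswith(token) or w.endswith(token) for w in VOCAB):
            return "part of an English word"
        # portmanteau/concatenation candidate: splits into two dictionary words
        for i in range(2, len(token) - 1):
            if token[:i] in VOCAB and token[i:] in VOCAB:
                return "portmanteau candidate"
        return "unknown (possibly an abbreviation)"

    for sample in ["meds", "ppg", "reauthorization", "appmetadata", "reconsent", "rawlog"]:
        print(sample, "->", classify(sample))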

How to find a list of words and their stems in the Hindi language

I am doing an NLP project on the Hindi language, where I have to find the stem/root of a given word. My approach is to build a deep neural network, but for that I need labelled data of words and their root words. I am unable to find proper resources (datasets). Where can I find these resources?
Well, you can try Hindi-to-English dictionary websites for feeding data, which can be converted from HTML to CSV format. A lot of data cleaning will be needed, but it may work out.
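As a sketch of the HTML-to-CSV step, assuming the dictionary site lists its entries in an HTML table (the URL and output file name below are placeholders, not a specific resource):

    # Pull every <table> from a dictionary page and save the word list as CSV.
    import pandas as pd

    url = "https://example.com/hindi-dictionary"   # placeholder URL
    tables = pd.read_html(url)                     # one DataFrame per <table> on the page
    entries = tables[0]                            # pick the table that holds the entries
    entries.to_csv("hindi_words.csv", index=False, encoding="utf-8")

The cleaning step (deduplication, stripping markup, normalising Unicode) is where most of the work will be.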

Best practices for creating a CSV file?

I am working in Swift, although perhaps the language is not that relevant, and I am creating a relatively simple CSV file.
I wanted to ask for some recommendations in creating the files, in particular:
Should I wrap each column/value in single or double quotes? Or nothing? I understand that if I use quotes I'll need to escape them appropriately in case the text in my file legitimately contains those characters. The same goes for \r\n.
Is it OK to end each line with \r\n? Anything specific to Mac vs. Windows I need to think about?
What encoding should I use? I'd like to make sure my CSV file can be read by most readers (on mobile devices, Mac, Windows, etc.).
Any other recommendations / tips to make sure the quality of my CSV is ideal for most readers?
I have a couple of apps that create CSV files.
Any column value that contains a newline, a quote character or the field separator must be enclosed in quotes (double quotes are common, single quotes less so).
I end lines with just \n.
You may wish to give the user some options when creating the CSV file. Let them choose the field separator. While the comma is common, a tab is also common. You can also use a semi-colon, space, or other characters. Just be sure to properly quote values that contain the chosen field separator.
Using UTF-8 encoding is arguably the best choice. It lets you support all Unicode characters, and just about any tool that supports CSV can handle UTF-8. It avoids any issues with platform-specific encodings. But again, depending on the needs of your users, you may wish to give them a choice of encoding.
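The quoting and escaping rules above are the same in every language; purely as an illustration (in Python rather than the asker's Swift), the standard csv module applies exactly those conventions:

    # Write a row whose values contain the delimiter, quotes and a newline.
    import csv

    rows = [["plain value", "has, a comma", 'has "quotes"', "has\na newline"]]

    with open("sample.csv", "w", newline="", encoding="utf-8") as f:
        # QUOTE_MINIMAL quotes only the fields that contain the delimiter, a quote
        # or a newline, and doubles embedded double quotes; lineterminator="\n"
        # matches the plain-\n line endings suggested above. Swap the delimiter
        # for "\t" or ";" if you let the user choose a separator.
        writer = csv.writer(f, delimiter=",", quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
        writer.writerows(rows)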

How to scan words, lines and their properties in a text?

I need to scan a document. It's not OCR; let me show you:
--Example--
Table of Contents
Some Italic Words
Sentence 23
--End--
Suppose that is a ".doc"-formatted text. I need to scan it line by line and understand that the first line is bold, the second is italic, and the third has a space after the first word followed by a number. The reason I want to recognize them is that I need to categorize them in a table view: bold lines, italics, numbered lines, etc.
I'm okay in both Swift and Objective-C but totally clueless about document scanning. If you can offer any reference, framework or approach, I would be grateful.
One variant: your doc is really a .docx (docx is XML). Parse the XML. The format defines XML tags that mark text as bold, italic, and so on; a docx is kind of like HTML.
Another variant: if your doc is really a .doc, then we are not talking about XML but a binary format. It is also documented and you can parse it, but I don't think it will be easy.
BUT
There is a library I know of: doc2text, which can parse a lot of stuff. (http://www.textlib.com/doc2text.html)
We used it in past projects and it did an okay job; using it saves you a lot of effort writing your own parsers.
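To make the first variant concrete, here is a small sketch using the python-docx library (a different tool from the doc2text library mentioned above, and Python rather than the asker's Swift/Objective-C, shown only because the same run-level formatting properties are exposed whichever XML reader you use):

    # Walk a .docx paragraph by paragraph and report bold/italic/trailing-number lines.
    from docx import Document

    doc = Document("example.docx")                 # placeholder file name
    for para in doc.paragraphs:
        # Formatting lives on the runs inside each paragraph; run.bold/run.italic
        # are None when the value is inherited from the paragraph style.
        bold = any(run.bold for run in para.runs)
        italic = any(run.italic for run in para.runs)
        tokens = para.text.split()
        numbered = bool(tokens) and tokens[-1].isdigit()
        labels = [name for name, flag in
                  (("bold", bold), ("italic", italic), ("numbered", numbered)) if flag]
        print(repr(para.text), ", ".join(labels) or "plain")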

Which Alphabet type should I use with FASTA files in Biopython?

If I'm using the FASTA files from the link below, what Alphabet type should I use in Biopython? Would it be IUPAC.unambiguous_dna?
link to FASTA files: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/?C=S;O=A
Did you read 3.1 Sequences and Alphabets? It explains the different alphabets available and what cases they cover.
There are a lot of sequences in the link you provided (too many for us to pore through). My recommendation would be to just go with IUPAC.unambiguous_dna. If the four basic nucleotides aren't enough, the parser will complain, and you should pick a more extensive alphabet.
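A minimal sketch of reading one of those chromosome files with that alphabet, assuming an older Biopython release that still ships Bio.Alphabet (the alphabet system was removed in Biopython 1.78); the file name is a placeholder for one of the downloads from the link:

    # Parse a FASTA file with an explicit unambiguous-DNA alphabet (Biopython < 1.78).
    from Bio import SeqIO
    from Bio.Alphabet import IUPAC

    for record in SeqIO.parse("chr21.fa", "fasta", alphabet=IUPAC.unambiguous_dna):
        print(record.id, len(record.seq), record.seq.alphabet)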
