Dividing a line of text into elements using a common delimiter

I have a playlist text file. I'm trying to extract a list of the artists and their songs. There are 39 line items and they appear as:
Rush - Red Sector A
Blues Traveler - Hook
This is a unicode file.
I'm trying to use the '-' as the delimiter and split the lines there:
x = open(u'list.txt')
for line in x:
    line = line.strip()
    elements = line.split('-')
    artist = elements[0]
    song = elements[1]
I get a traceback:
Traceback (most recent call last):
  File "playlist.py", line 34, in <module>
    song = line[1]
IndexError: list index out of range
It appears the delimiter is not being recognized. If I comment out "song = elements[1]" and print artists, I get the full line of text, delimiter and all. I've seen similar questions, but I can't get enough insight from their solutions to make this work. Any help would be appreciated.

The problem is the delimiting character '–'. You think it's "-", but it's actually a different character that just looks like the hyphen. Because split('-') never finds a match, each line comes back as a single element and elements[1] does not exist. This character is not in the ASCII table, so we have to tell Python that the source file uses UTF-8, which covers almost all characters we might be using.
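# -*- coding: utf-8 -*-
# Python 2: the coding line above lets this source file contain the non-ASCII delimiter.
x = open(u'songs.txt')
delimiter = '–'  # the dash actually used in the file, not the ASCII hyphen
for line in x:
    line = line.strip()
    elements = line.split(delimiter)
    artist = elements[0]
    song = elements[1]
    print "{artist} {song}".format(artist=artist, song=song)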
My previous answers did not address the root of the problem, but this has been a great learning experience for me as well.
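For readers on Python 3, here is a minimal sketch of the same idea. It assumes the look-alike character is an en dash (U+2013), so check your own file, for example with print(repr(line)). The helper name split_line and the in-memory sample are made up for illustration. With a real file you would open it with an explicit encoding, e.g. open('list.txt', encoding='utf-8').

```python
import io
import unicodedata

# Assumption: the dash in the playlist is an en dash, not the ASCII hyphen.
DELIMITER = "\u2013"


def split_line(line, delimiter=DELIMITER):
    """Return (artist, song), or None if the delimiter is not in the line."""
    artist, sep, song = line.strip().partition(delimiter)
    if not sep:
        # Delimiter missing: report it instead of raising IndexError.
        return None
    return artist.strip(), song.strip()


# In-memory stand-in for the playlist file (hypothetical sample data).
sample = io.StringIO("Rush \u2013 Red Sector A\nBlues Traveler \u2013 Hook\n")
pairs = [split_line(line) for line in sample]

# Looking up the character name shows which dash you are dealing with.
dash_name = unicodedata.name(DELIMITER)
```

Using partition instead of split means a line without the delimiter is reported as None rather than crashing on elements[1].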

Related

How to find a word in a single long string?

I want to be able to copy and paste a large string of words from, say, a text document where there are spaces, returns and not commas between each and every word. Then I want to be able to take out each word individually and put it in a table, for example:
input:
please i need help
output:
{1, "please"},
{2, "i"},
{3, "need"},
{4, "help"}
(I will have the table already made, with the second column set to something like " ".)
I haven't tried anything yet, as nothing has come to mind. All I could think of was using gsub to turn spaces into commas and finding a solution from there, but I don't think that would work out so well.

Your delimiters are spaces ( ), commas (,) and newlines (\n, sometimes \r\n or \r, the latter very rarely). You now want to find words delimited by these delimiters. A word is a sequence of one or more non-delimiter characters. This trivially translates to a Lua pattern which can be fed into gmatch. Paired with a loop & inserting the matches in a table you get the following:
local words = {}
for word in input:gmatch"[^ ,\r\n]+" do
  table.insert(words, word)
end
If you know that your words are gonna be in your locale-specific character set (usually ASCII or extended ASCII), you can use Lua's %w character class for matching sequences of alphanumeric characters:
local words = {}
for word in input:gmatch"%w+" do
  table.insert(words, word)
end
Note: The resulting table will be in "list" form:
{
  [1] = "first",
  [2] = "second",
  [3] = "third",
}
(for which {"first", "second", "third"} would be shorthand)
I don't see any good reasons for the table format you have described, but it can be trivially created by inserting tables instead of strings into the list.
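For comparison, the same "run of non-delimiter characters" idea can be sketched in Python with the standard re module. The function name split_words and the sample input are illustrative only. The (index, word) pairs mirror the table layout asked for in the question.

```python
import re


def split_words(text):
    """Return every run of characters that is not a space, comma, CR or LF."""
    return re.findall(r"[^ ,\r\n]+", text)


words = split_words("please i need\nhelp")

# Pair each word with its 1-based position, like {1, "please"}, {2, "i"}, ...
rows = [(index, word) for index, word in enumerate(words, start=1)]
```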

extract data from string in lua - SubStrings and Numbers

I'm trying to parse a string for a hobby project. I'm self-taught from code snippets on this site and I'm having a hard time working out this problem. I hope you guys can help.
I have a large string, containing many lines, and each line has a certain format.
I can get each line in the string using this code...
for line in string.gmatch(deckData,'[^\r\n]+') do
  print(line)
end
Each line looks something like this...
3x Rivendell Minstrel (The Hunt for Gollum)
What I am trying to do is make a table that looks something like this for the above line.
table = {}
table['The Hunt for Gollum'].card = 'Rivendell Minstrel'
table['The Hunt for Gollum'].count = 3
So my thinking was to extract everything inside the parentheses, then extract the numeric value, then delete the first 4 chars in the line, as it will always be '1x ', '2x ' or '3x '.
I have tried a bunch of things.. like this...
word=str:match("%((%a+)%)")
but it errors if there are spaces...
my test code looks like this at the moment...
line = '3x Rivendell Minstrel (The Hunt for Gollum)'
num = line:gsub('%D+', '')
print(num) -- Prints "3"
card2Fetch = string.sub(line, 5)
print(card2Fetch) -- Prints "Rivendell Minstrel (The Hunt for Gollum)"
key = string.gsub(card2Fetch, "%s+", "") -- Remove all Spaces
key=key:match("%((%a+)%)") -- Fetch between ()s
print(key) -- Prints "TheHuntforGollum"
Any ideas how to get the "The Hunt for Gollum" text out of there including the spaces?

Try a single pattern capturing all fields:
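-- x = count, y = card name, z = text inside the parentheses
x,y,z=line:match("(%d+)x%s+(.-)%s+%((.*)%)")
t = {}
t[z] = {}
t[z].card = y
t[z].count = tonumber(x) -- match returns strings, so convert the count to a number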
The pattern reads: capture a run of digits before x, skip whitespace, capture everything until whitespace followed by open parenthesis, and finally capture everything until a close parenthesis.
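The same single-pattern approach can be sketched in Python, where the Lua classes %d and %s become \d and \s and the lazy .- becomes .*?. The names LINE_PATTERN and parse_card_line are made up for this illustration.

```python
import re

# Digits before "x", whitespace, lazy card name, whitespace, then "(set name)".
LINE_PATTERN = re.compile(r"(\d+)x\s+(.*?)\s+\((.*)\)")


def parse_card_line(line):
    """Return (set_name, {"card": ..., "count": ...}), or None if no match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    count, card, set_name = match.groups()
    return set_name, {"card": card, "count": int(count)}


key, entry = parse_card_line("3x Rivendell Minstrel (The Hunt for Gollum)")
```

Because the card name is captured lazily and the set name greedily, spaces inside either field are preserved.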

how to tokenize/parse/search&replace document by font AND font style in LibreOffice Writer?

I need to update a bilingual dictionary written in Writer by first parsing all entries into their parts e.g.
main word (font 1, bold)
foreign equivalent transliterated (font 1, italic)
foreign equivalent (font 2, bold)
part of speech (font 1, italic)
Each line of the document is the main word followed by the parts listed above, each separated by a space or punctuation.
I need to automate the process of walking through the whole file, line by line, and place a delimiter between each part, ignoring spaces and punctuation, so I can mass import it into a Calc file. In other words, "each part" is a sequence of character (ignoring spaces and punctuation) that have the same font AND font-style.
I have tried the standard Search&Replace feature, and AltSearch extension, but neither are able to complete the task. The main problem is I am not able to write a search query that says:
Find: consecutive characters with the same font AND font_style, ignore spaces and punctuation
Replace: term found above + "delimiter"
Any suggestions how I can write a script for this, or if an existing tool can solve the problem?
Thanks!
Pseudo code for desired effect:
var $delimiter = "|"
Go to beginning of document
While not end of document do:
    var $currLine = get line from doc
    var $currChar = get next character which is not space or punctuation
    var $font = $currChar.font
    var $font_style = $currChar.font_style  (e.g. bold, italic, normal)
    While not end of line do:
        $currChar = next character which is not space or punctuation
        if ($currChar.font != $font || $currChar.font_style != $font_style) {  // font or style has changed
            print $delimiter
            $font = $currChar.font
            $font_style = $currChar.font_style
        }
    end While
end While

Here are tips for each of the things your pseudocode does.
First, the easiest way to move line by line is with the TextViewCursor, although it is slow. Notice the XLineCursor section. For the while loop, oVC.goDown() will return false when the end of the document is reached. (oVC is our variable for the TextViewCursor).
Get each character by calling oVC.goRight(0, False) to deselect, followed by oVC.goRight(1, True) to select. The selected value is then obtained with oVC.getString(). To ignore spaces and punctuation, perhaps use Python's isalnum() or the re module.
To determine the font of the character, call oVC.getPropertyValue(attr). Values for attr could simply be CharAutoStyleName and CharStyleName to check for any changes in formatting.
Or grab a list of specific properties such as 'CharFontFamily', 'CharFontFamilyAsian', 'CharFontFamilyComplex', 'CharFontPitch', 'CharFontPitchAsian' etc. Character properties are described at https://wiki.openoffice.org/wiki/Documentation/DevGuide/Text/Formatting.
To insert the delimiter into the text: oVC.getText().insertString(oVC, "|", 0).
This Python code from GitHub shows how to do most of these things, although you'll need to read through it to find the relevant parts.
Alternatively, instead of using the LibreOffice API, unzip the .odt file and parse content.xml with a script.
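Whichever way the characters are obtained, the grouping step itself can be sketched independently of LibreOffice. The example below assumes each character has already been extracted as a (character, font, style) tuple. That tuple format and the function name delimit_by_format are hypothetical and are not part of the LibreOffice API.

```python
def delimit_by_format(chars, delimiter="|"):
    """Insert a delimiter wherever the (font, style) of the text changes.

    chars is an iterable of (character, font, style) tuples. This format is
    an assumption made for illustration. Spaces and punctuation are copied
    through but never trigger a delimiter on their own.
    """
    pieces = []
    current_format = None  # (font, style) of the run being built
    for character, font, style in chars:
        if character.isalnum():
            new_format = (font, style)
            if current_format is not None and new_format != current_format:
                pieces.append(delimiter)
            current_format = new_format
        pieces.append(character)
    return "".join(pieces)


# Hypothetical entry: a bold main word followed by an italic part of speech.
entry = [
    ("a", "Font1", "bold"),
    ("b", "Font1", "bold"),
    (" ", "Font1", "normal"),
    ("n", "Font1", "italic"),
]
result = delimit_by_format(entry)
```

In a real macro, the tuples would come from reading the character and its properties with calls such as getString() and getPropertyValue() as described above.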

Lua Pattern Matching issue

I'm trying to parse a text file using lua and store the results in two arrays. I thought my pattern would be correct, but this is the first time I've done anything of the sort.
fileio.lua:
questNames = {}
questLevels = {}
lineNumber = 1
file = io.open("results.txt", "w")
io.input(file)
for line in io.lines("questlist.txt") do
  questNames[lineNumber], questLevels[lineNumber] = string.match(line, "(%a+)(%d+)")
  lineNumber = lineNumber + 1
end
for i=1,lineNumber do
  if (questNames[i] ~= nil and questLevels[i] ~= nil) then
    file:write(questNames[i])
    file:write(" ")
    file:write(questLevels[i])
    file:write("\n")
  end
end
io.close(file)
Here's a small snippet of questlist.txt:
If the dead could talk16
Forgotten soul16
The Toothmaul Ploy9
Well-Armed Savages9
And here's a matching snippet of results.txt:
talk 16
soul 16
Ploy 9
Savages 9
What I'm after in results.txt is:
If the dead could talk 16
Forgotten soul 16
The Toothmaul Ploy 9
Well-Armed Savages 9
So my question is, which pattern do I use in order to select all text up to a number?
Thanks for your time.

%a matches letters. It does not match spaces.
If you want to match everything up to a sequence of digits you want (.-)(%d+).
If you want to match a leading sequence of non-digits then you want ([^%d]+)(%d+).
That being said, if all you want to do is insert a space before a sequence of digits, you can just use line:gsub("%d+", " %0", 1). The final argument 1 limits the replacement to the first match; leave it off to do it for every match on the line.
As an aside, I don't think io.input(file) is doing anything useful for you (or what you might expect). It replaces the default input file with the handle file, which you opened for writing.
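For comparison, both suggestions translate directly to Python's re module: the lazy (.-) becomes (.*?), and %0 in the replacement becomes \g<0>. The function names below are made up for this sketch.

```python
import re


def split_name_level(line):
    """Split 'Forgotten soul16' into ('Forgotten soul', 16)."""
    match = re.match(r"(.*?)(\d+)", line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def space_before_digits(line):
    """Insert one space before the first run of digits only."""
    return re.sub(r"\d+", r" \g<0>", line, count=1)


name, level = split_name_level("If the dead could talk16")
spaced = space_before_digits("Well-Armed Savages9")
```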

RAILS 3 CSV "Illegal quoting" is a lie

I've hit a problem during parsing of a CSV file where I get the following error:
CSV::MalformedCSVError: Illegal quoting on line 3.
RAILS code in question:
csv = CSV.read(args.local_file_path, col_sep: "\t", headers: true)
Line 3 in the CSV file is:
A-067067 VO VIA CE 0 8 8 SWCH Ter 4, Loc Is Here, Mne, Per Fl Auia/Sey IMAC NEK_HW 2011-03-09 09:47:44 2011-03-09 11:50:26 2011-01-13 10:49:17 2011-02-14 14:02:43 2011-02-14 14:02:44 0 0 771 771 46273 "[O/H 15/02] B270 W31 ""TEXT TEXT 2 X TEXT SWITC" SOME_TEXT SOME_TEXT N/A Name Here RESOLVED_CLOSED RESOLVED_CLOSED
UPDATE: Tabs don't appear to have come across above. See pastebin RAW TEXT: http://pastebin.com/4gj7iUpP
I've read numerous threads all over StackOverflow and Google about why this is and I understand that. But the CSV row above has perfectly legal quoting does it not?
The CSV is tab delimited and there is only a tab followed by the quote on either side of the column in question. There is 1 quote in that field and it is double quoted to escape it. So what gives? I can't work it out. :(
Assuming I've got something wrong here, I'd like the solution to include a way to work around the issue as I don't have control over how the CSV is constructed.

This part of your CSV is at fault:
46273 "[O/H 15/02] B270 W31 ""TEXT TEXT 2 X TEXT SWITC" SOME_TEXT
At least one of these parts has a stray space:
46273 "
" SOME_TEXT
I'd guess that the "3" and the double quote are supposed to be separated by one or more tabs but there is a space before the quote. Or there is a space after the quote on the other end, when there are only supposed to be tabs between the closing quote and the "S".
CSV escapes double quotes by doubling them, so this:
"[O/H 15/02] B270 W31 ""TEXT TEXT 2 X TEXT SWITC"
is supposed to be a single field that contains an embedded quote:
[O/H 15/02] B270 W31 "TEXT TEXT 2 X TEXT SWITC
If you have a space before the first quote or after the last quote then, since your fields are tab delimited, you have an unescaped double quote inside a field and that's where your "illegal quoting" error comes from.
Try sending your CSV file through cat -t (which should represent tabs as ^I) to find where the stray space is.
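As a rough illustration in Python (not Ruby), the sketch below shows how a correctly tab-delimited version of the field parses with doubled quotes, and how to make tabs visible in the same way cat -t does. The sample strings are reconstructions with assumed tab positions, and the helper names are hypothetical. Python's csv module is more lenient than Ruby's CSV, so this does not reproduce the "Illegal quoting" error itself.

```python
import csv


def parse_tsv_line(line):
    """Parse one tab-delimited line; doubled quotes become a single quote."""
    return next(csv.reader([line], delimiter="\t"))


def show_tabs(line):
    """Render tabs as ^I so a stray space next to a quote stands out."""
    return line.replace("\t", "^I")


# Assumed layout: tabs on both sides of the quoted field.
good = '46273\t"[O/H 15/02] B270 W31 ""TEXT TEXT 2 X TEXT SWITC"\tSOME_TEXT'
# Same data with a space where the first tab should be.
bad = '46273 "[O/H 15/02] B270 W31 ""TEXT TEXT 2 X TEXT SWITC"\tSOME_TEXT'

fields = parse_tsv_line(good)
visible_good = show_tabs(good)
visible_bad = show_tabs(bad)
```

In visible_good the opening quote is preceded by ^I, while in visible_bad it is preceded by a plain space, which is the kind of difference to look for on line 3 of the real file.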
