I have a text file which has many lines. I want to parse all the sentences, but it seems that although I read in all the sentences, only the first one gets parsed. I'm not sure where I'm making a mistake.
import nltk
from nltk.parse.stanford import StanfordDependencyParser

dependency_parser = StanfordDependencyParser(model_path="edu\stanford\lp\models\lexparser\englishPCFG.ser.gz")

txtfile = open('sample.txt', encoding="latin-1")
s = txtfile.read()
print(s)
result = dependency_parser.raw_parse(s)
for i in result:
    print(list(i.triples()))
But it gives only the parse triples of the first sentence, not the other sentences. Any help?
'i like this computer'
'The great Buddha, the .....'
'My Ashford experience .... great experience.'
[[(('i', 'VBZ'), 'nsubj', ("'", 'POS')), (('i', 'VBZ'), 'nmod', ('computer', 'NN')), (('computer', 'NN'), 'case', ('like', 'IN')), (('computer', 'NN'), 'det', ('this', 'DT')), (('computer', 'NN'), 'case', ("'", 'POS'))]]
You have to split the text first. You're currently parsing the literal text you posted, quotes and all. This is evident from this part of the parsing result: ("'", 'POS')
To do that, it seems you can use ast.literal_eval on each line. Note that an apostrophe (in a word like "don't") will break the formatting, and then you'll have to handle the quotes yourself with something like line = line[1:-1]:
import ast
from nltk.parse.stanford import StanfordDependencyParser

dependency_parser = StanfordDependencyParser(model_path="edu\stanford\lp\models\lexparser\englishPCFG.ser.gz")

with open('sample.txt', encoding="latin-1") as f:
    lines = [ast.literal_eval(line) for line in f.readlines()]

parsed_lines = [dependency_parser.raw_parse(line) for line in lines]
# now parsed_lines contains one parse result per line of the file
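Then, to print the triples for each line, iterate over each parse result the same way the question does (raw_parse yields an iterator of dependency graphs per input):

for parsed in parsed_lines:
    for graph in parsed:
        print(list(graph.triples()))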
Try:

from nltk.parse.stanford import StanfordDependencyParser

dependency_parser = StanfordDependencyParser(model_path="edu\stanford\lp\models\lexparser\englishPCFG.ser.gz")

with open('sample.txt') as fin:
    sents = fin.readlines()

# raw_parse_sents parses a list of sentences in one batch;
# it yields one iterator of parses per input sentence
result = dependency_parser.raw_parse_sents(sents)
for parsed_sent in result:
    for parse in parsed_sent:
        print(list(parse.triples()))
Do check the docstring code or the demo code in the repository for examples; they're usually very helpful.
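If the file is not one sentence per line, here is a minimal sketch that splits the raw text into sentences first (assuming NLTK's Punkt data is installed so that nltk.sent_tokenize works):

import nltk
from nltk.parse.stanford import StanfordDependencyParser

dependency_parser = StanfordDependencyParser(model_path="edu\stanford\lp\models\lexparser\englishPCFG.ser.gz")

with open('sample.txt', encoding="latin-1") as f:
    text = f.read()

# split the raw text into sentences, then parse them all in one batch
sentences = nltk.sent_tokenize(text)
for parsed_sent in dependency_parser.raw_parse_sents(sentences):
    for parse in parsed_sent:
        print(list(parse.triples()))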
The documentation found on csv-conduit's GitHub page is scant. My use case involves reading a string in CSV form, i.e.:
csv :: String
csv = "\"column1 (text)\",\"column2 (text)\",\"column3 (number)\",\"column4 (number)\"\r\nanId,stuff,12,30.454\r\n"
and transforming it into some intermediate data type; so, supposing we declare a data type Row, I'd have
csv' :: Row
csv' = Row (Just "anId") "stuff" 12 (Just 30.454)
But I'm not sure which functions to call. Furthermore, it seems like csv-conduit exports some Row type already, but I'm not sure how to use it.
Here is an example which shows how to add a processing step in a csv-conduit pipeline. Here we just add a column to each input row.
{-# LANGUAGE NoMonomorphismRestriction, OverloadedStrings #-}
module Lib
  where

import Data.Conduit
import Data.Conduit.Binary
import Data.CSV.Conduit
import Data.Text
import qualified Data.Map as Map
import Control.Monad

-- Insert a constant column "asd" with value "qwe" into every row.
myProcessor :: Monad m => Conduit (MapRow Text) m (MapRow Text)
myProcessor = do
  x <- await
  case x of
    Nothing -> return ()
    Just m  -> do let m' = Map.insert "asd" "qwe" m
                  yield m'
                  myProcessor

test = runResourceT $
  sourceFile "input.csv" $=
  intoCSV defCSVSettings $=
  myProcessor $=
  (writeHeaders defCSVSettings >> fromCSV defCSVSettings) $$
  sinkFile "output.csv"
Of course, your processing stage doesn't have to produce MapRow Text items - it can produce items of whatever type you want. Use other conduit operations to collect / filter / process that pipeline.
If you have a specific task you want to perform I can address that.
I have a question to discuss. I am working on INFORMIX 4GL programs. Those programs produce output text files. This is an example of the output:
Lot No|Purchaser name|Billing|Payment|Deposit|Balance|
J1006|JAUHARI BIN HAMIDI|5285.05|4923.25|0.00|361.80|
J1007|LEE, CHIA-JUI AKA LEE, ANDREW J. R.|5366.15|5313.70|0.00|52.45|
J1008|NAZRIN ANEEZA BINTI NAZARUDDIN|5669.55|5365.30|0.00|304.25|
J1009|YAZID LUTFI BIN AHMAD LUTFI|3180.05|3022.30|0.00|157.75|
From those output text (.txt) files, we can open the data manually in Excel (.xls). Given that, is there any 4GL code or any command that we can use to open the text file in Microsoft Excel automatically, right after we run the program? If there are any ideas, please share with me... Thank you
The output shown is in the normal Informix UNLOAD format, using the pipe as a delimiter between fields. The nearest approach to this for Excel is a CSV file with comma-separated values. Generating one of those from that output is a little fiddly. You need to enclose fields containing a comma inside double quotes. You need to use commas in place of pipes. And you might have to worry about backslashes too.
It is a moot point whether it is easier to do the conversion in I4GL or whether to use a program to do the conversion. I think the latter, so I wrote this script a couple of years ago:
#!/usr/bin/env perl
#
# @(#)$Id: unl2csv.pl,v 1.1 2011/05/17 10:20:09 jleffler Exp $
#
# Convert Informix UNLOAD format to CSV
use strict;
use warnings;
use Text::CSV;
use IO::Wrap;
my $csv = new Text::CSV({ binary => 1 }) or die "Failed to create CSV handle ($!)";
my $dlm = defined $ENV{DBDELIMITER} ? $ENV{DBDELIMITER} : "|";
my $out = wraphandle(\*STDOUT);
my $rgx = qr/((?:[^$dlm]|(?:\\.))*)$dlm/sm;
# $csv->eol("\r\n");
while (my $line = <>)
{
    print "1: $line";
MultiLine:
    while ($line eq "\\\n" || $line =~ m/[^\\](?:\\\\)*\\$/)
    {
        my $extra = <>;
        last MultiLine unless defined $extra;
        $line .= $extra;
    }
    my @fields = split_unload($line);
    $csv->print($out, \@fields);
}

sub split_unload
{
    my($line) = @_;
    my @fields;
    print "$line";
    while ($line =~ m/$rgx/g)
    {
        printf "%d: %s\n", scalar(@fields), $1;
        push @fields, $1;
    }
    return @fields;
}
__END__
=head1 NAME
unl2csv - Convert Informix UNLOAD to CSV format
=head1 SYNOPSIS
unl2csv [file ...]
=head1 DESCRIPTION
The unl2csv program converts a file from Informix UNLOAD file format to
the corresponding CSV (comma separated values) format.
The input delimiter is determined by the environment variable
DBDELIMITER, and defaults to the pipe symbol "|".
It is not assumed that each input line is terminated with a delimiter
(there are two variants of the UNLOAD format, one with and one without
the final delimiter).
=head1 EXAMPLES
Input:
10|12|excessive|cost \|of, living|
20|40|bou\\ncing tigger|grrrrrrrr|
Output:
10,12,"excessive","cost |of, living"
20,40,"bou\ncing tigger",grrrrrrrr
=head1 RESTRICTIONS
Since the csv2unl program does not know about binary blob data, it
cannot convert such data into the hex-encoded format that Informix
requires.
It can and does handle text blob data.
=head1 PRE-REQUISITES
Text::CSV_XS
=head1 AUTHOR
Jonathan Leffler <jleffler@us.ibm.com>
=cut
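For simple data with no backslash-escaped delimiters and no embedded newlines, a much shorter conversion works. Here is a minimal Python sketch under those assumptions (the csv module applies the quoting rules described above):

import csv
import os
import sys

# Convert pipe-delimited Informix UNLOAD lines to CSV.
# Assumes no backslash escapes and no embedded newlines in the data.
delim = os.environ.get('DBDELIMITER', '|')
writer = csv.writer(sys.stdout)
for line in sys.stdin:
    line = line.rstrip('\n')
    if line.endswith(delim):  # UNLOAD variant with a trailing delimiter
        line = line[:-1]
    writer.writerow(line.split(delim))

Run it as a filter, e.g. python unl2csv.py < report.unl > report.csv (the script name is illustrative).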
I generate Excel files from 4GL code by writing XML with the Excel progid (<?mso-application progid="Excel.Sheet"?>) so Excel opens it as a spreadsheet.
It's like writing HTML from 4GL, where you just write HTML code to a file; with Excel, you write XML instead.
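A minimal sketch of that approach (shown in Python for brevity; the tags are standard SpreadsheetML, but the file name and sample rows are illustrative):

rows = [["Lot No", "Purchaser name", "Billing"],
        ["J1006", "JAUHARI BIN HAMIDI", "5285.05"]]
with open('report.xml', 'w') as f:
    # the progid processing instruction makes Excel claim the file
    f.write('<?xml version="1.0"?>\n')
    f.write('<?mso-application progid="Excel.Sheet"?>\n')
    f.write('<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"\n')
    f.write('          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">\n')
    f.write(' <Worksheet ss:Name="Sheet1">\n  <Table>\n')
    for row in rows:
        f.write('   <Row>\n')
        for cell in row:
            f.write('    <Cell><Data ss:Type="String">%s</Data></Cell>\n' % cell)
        f.write('   </Row>\n')
    f.write('  </Table>\n </Worksheet>\n</Workbook>\n')

In 4GL the same thing is done with ordinary PRINT or file-output statements emitting those tags.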
How do I remove or address a specific occurrence of a character in sed?
I'm editing a CSV file and I want to remove all text between the third and the fifth occurrence of the comma (that is, dropping fields four and five). Is there any way to achieve this using sed?
E.g:
% cat myfile
one,two,three,dropthis,dropthat,six,...
% sed -i 's/someregex//' myfile
% cat myfile
one,two,three,,six,...
If it is okay to consider the cut command, then:
$ cut -d, -f1-3,6- file
awk, or any other tool that can split strings on delimiters, is better suited for this job than sed.
$ cat file
1,2,3,4,5,6,7,8,9,10
Ruby(1.9+)
$ ruby -ne 's = $_.split(","); s[2,3] = nil; puts s.compact.join(",")' file
1,2,6,7,8,9,10
using awk
$ awk 'BEGIN{FS=OFS=","}{$3=$4=$5="";}{gsub(/,,*/,",")}1' file
1,2,6,7,8,9,10
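That said, if you really do want sed and the data contains no quoted commas, a sketch like this drops fields four and five (note it keeps the empty field, matching the expected output shown in the question):

$ sed 's/^\(\([^,]*,\)\{3\}\)[^,]*,[^,]*/\1/' file
1,2,3,,6,7,8,9,10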
A real parser in action
#!/usr/bin/python
import csv
import sys

cr = csv.reader(open('my-data.csv', 'rb'))
cw = csv.writer(open('stripped-data.csv', 'wb'))
for row in cr:
    # keep fields one to three and everything from field six onward
    cw.writerow(row[0:3] + row[5:])
But do note the preface to the csv module:
The so-called CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases. There is no “CSV standard”, so the format is operationally defined by the many applications which read and write it. The lack of a standard means that subtle differences often exist in the data produced and consumed by different applications. These differences can make it annoying to process CSV files from multiple sources. Still, while the delimiters and quoting characters vary, the overall format is similar enough that it is possible to write a single module which can efficiently manipulate such data, hiding the details of reading and writing the data from the programmer.
$ cat my-data.csv
1
1,2
1,2,3
1,2,3,4,
1,2,3,4,5
1,2,3,4,5,6
1,2,3,4,5,6,
1,2,,4,5,6
1,2,"3,3",4,5,6
1,"2,2",3,4,5,6
,,3,4,5
,,,4,5
,,,,5
$ python csvdrop.py
$ cat stripped-data.csv
1
1,2
1,2,3
1,2,3
1,2,3
1,2,3,6
1,2,3,6,
1,2,,6
1,2,"3,3",6
1,"2,2",3,6
,,3
,,
,,
I have just over 1500 HTML pages (1.html to 1500.html). I have written code using Beautiful Soup that extracts most of the data I need but "misses" some of the data within the table.
My input: e.g. file 1500.html
My Code:
#!/usr/bin/env python
import glob
import codecs
from BeautifulSoup import BeautifulSoup

with codecs.open('dump2.csv', "w", encoding="utf-8") as csvfile:
    for file in glob.glob('*html*'):
        print 'Processing', file
        soup = BeautifulSoup(open(file).read())
        rows = soup.findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            #print >> csvfile, "#".join(col.string for col in cols)
            #print >> csvfile, "#".join(td.find(text=True))
            for col in cols:
                print >> csvfile, col.string
            print >> csvfile, "==="
        print >> csvfile, "***"
Output:
One CSV file, with 1500 lines of text and columns of data. For some reason my code does not pull out all the required data but "misses" some, e.g. the Address1 and Address2 data at the start of the table do not come out. I modified the code to put in *** and === separators; I then use Perl to turn the result into a clean CSV file. Unfortunately, I'm not sure how to rework my code to capture all the data I'm looking for!
Find the files where parameters are missing,
and after that try to analyse what happened...
I think that some files have a different format, or maybe the address field really is missing.
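One likely cause worth checking (an assumption, since the HTML isn't shown): col.string returns None whenever a <td> contains nested tags, which is common for address cells split with <br/>. A sketch that joins all text nodes inside each cell instead:

#!/usr/bin/env python
import glob
import codecs
from BeautifulSoup import BeautifulSoup

with codecs.open('dump2.csv', "w", encoding="utf-8") as csvfile:
    for file in glob.glob('*html*'):
        soup = BeautifulSoup(open(file).read())
        for tr in soup.findAll('tr'):
            cells = []
            for col in tr.findAll('td'):
                # join every text fragment inside the cell, nested tags included
                cells.append(' '.join(col.findAll(text=True)).strip())
            print >> csvfile, "#".join(cells)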