Read a csv file and fill the data in a BigQuery Table - google-cloud-dataflow

The following is the code that is supposed to read from a csv file and write to another csv file and BigQuery:
import argparse
import logging
import re

import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.metrics import Metrics
from apache_beam.metrics.metric import MetricsFilter
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions

parser = argparse.ArgumentParser()
parser.add_argument('--input',
                    dest='input',
                    default='gs://dataflow-samples/shakespeare/kinglear.txt',
                    help='Input file to process.')
parser.add_argument('--output',
                    dest='output',
                    required=True,
                    help='Output file to write results to.')
known_args, pipeline_args = parser.parse_known_args(None)
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = True

p = beam.Pipeline(options=pipeline_options)

# Read the text file[pattern] into a PCollection.
lines = p | 'read' >> ReadFromText(known_args.input)

lines | beam.Map(lambda x: x.split(','))
lines | 'write' >> WriteToText(known_args.output)
lines | 'write2' >> beam.io.Write(beam.io.BigQuerySink('xxxx:yyyy.aaaa'))

# Actually run the pipeline (all operations above are deferred).
result = p.run()
The pipeline is able to write to the output file, but it is not able to write to the BigQuery table (xxxx:yyyy.aaaa).
The following is the message that appears:
WARNING:root:A task failed with exception.
'unicode' object has no attribute 'iteritems'
The data from the csv file is not written into BigQuery even though the schema matches and the BigQuery table is empty. I suspect the reason for this is that the data has to be converted to JSON format.
What corrections have to be made to this code in order for it to work properly? Could you please give the lines of code that I have to add in order for this to work?

Looking at the following lines:
1: lines = p | 'read' >> ReadFromText(known_args.input)
2: lines | beam.Map(lambda x: x.split(','))
3: lines | 'write' >> WriteToText(known_args.output)
4: lines | 'write2' >> beam.io.Write(beam.io.BigQuerySink('xxxx:yyyy.aaaa'))
Line 1 defines lines to be a PCollection of the lines read from the text file.
Line 2 creates a new PCollection of words by splitting each line, but it doesn't actually retain that PCollection, so it effectively does nothing.
Line 3 writes the original lines to a text file (so you don't see one word per line; you see one original line on each output line).
Line 4 writes the lines read from the input to the BigQuery table.
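(To make line 2 useful, its result would have to be assigned to a new PCollection and that PCollection used downstream. Continuing the pipeline above, with a purely illustrative label and variable name:
words = lines | 'split' >> beam.Map(lambda x: x.split(',')))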
If you look at the BigQuery tornadoes example you can see that (1) you need to convert each line into a dictionary with fields for each column and (2) you need to provide the schema matching that dictionary to the BigQuerySink. For example:
def to_table_row(x):
    values = x.split(',')
    return {'field1': values[0], 'field2': values[1]}

lines = p | 'read' >> ReadFromText(known_args.input)

lines | 'write' >> WriteToText(known_args.output)

(lines
 | 'ToTableRows' >> beam.Map(to_table_row)
 | 'write2' >> beam.io.Write(beam.io.BigQuerySink(
       'xxxx:yyyy.aaaa',
       schema='field1:INTEGER, field2:INTEGER')))
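The 'unicode' object has no attribute 'iteritems' warning is consistent with this diagnosis: the sink tries to treat each element as a dictionary of column values, and a plain line of text is not one. If you are on a newer Beam SDK where BigQuerySink has been removed, the same fix applies with beam.io.WriteToBigQuery; the following is a minimal sketch only, continuing from the pipeline above with the same placeholder table and schema:

(lines
 | 'ToTableRows' >> beam.Map(to_table_row)
 | 'write2' >> beam.io.WriteToBigQuery(
       'xxxx:yyyy.aaaa',
       schema='field1:INTEGER, field2:INTEGER'))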

Related

How to parse more than one sentence from a text file using the Stanford dependency parser?

I have a text file which has many lines. I want to parse all the sentences, but it seems like I read all the sentences yet only the first sentence gets parsed; I'm not sure where I'm making a mistake.
import nltk
from nltk.parse.stanford import StanfordDependencyParser

dependency_parser = StanfordDependencyParser(model_path="edu\stanford\lp\models\lexparser\englishPCFG.ser.gz")

txtfile = open('sample.txt', encoding="latin-1")
s = txtfile.read()
print(s)

result = dependency_parser.raw_parse(s)
for i in result:
    print(list(i.triples()))
but it gives only the parse triples for the first sentence, not the other sentences. Any help? The file sample.txt contains lines like:
'i like this computer'
'The great Buddha, the .....'
'My Ashford experience .... great experience.'
and the output is:
[[(('i', 'VBZ'), 'nsubj', ("'", 'POS')), (('i', 'VBZ'), 'nmod', ('computer', 'NN')), (('computer', 'NN'), 'case', ('like', 'IN')), (('computer', 'NN'), 'det', ('this', 'DT')), (('computer', 'NN'), 'case', ("'", 'POS'))]]
You have to split the text first. You're currently parsing the literal text you posted, with the quotes and everything. This is evident from this part of the parsing result: ("'", 'POS')
To do that, you seem to be able to use ast.literal_eval on each line. Note that an apostrophe (in a word like "don't") will ruin the formatting, and you'll then have to handle the apostrophes yourself with something like line = line[1:-1]:
import ast
from nltk.parse.stanford import StanfordDependencyParser

dependency_parser = StanfordDependencyParser(model_path="edu\stanford\lp\models\lexparser\englishPCFG.ser.gz")

with open('sample.txt', encoding="latin-1") as f:
    lines = [ast.literal_eval(line) for line in f.readlines()]

parsed_lines = []
for line in lines:
    parsed_lines.append(dependency_parser.raw_parse(line))
# now parsed_lines should contain the parsed results for each line of the file
Try:
from nltk.parse.stanford import StanfordDependencyParser

dependency_parser = StanfordDependencyParser(model_path="edu\stanford\lp\models\lexparser\englishPCFG.ser.gz")

with open('sample.txt') as fin:
    sents = fin.readlines()

result = dependency_parser.raw_parse_sents(sents)
for parse in result:
    print(list(parse.triples()))
Do check the docstring code or demo code in the repository for examples; they're usually very helpful.
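For completeness, a sketch that combines the two suggestions above (untested, and subject to the apostrophe caveat already mentioned): strip the literal quotes from each line with ast.literal_eval, then parse each cleaned sentence individually.

import ast
from nltk.parse.stanford import StanfordDependencyParser

# Same model path as in the question; adjust to your local Stanford models.
dependency_parser = StanfordDependencyParser(
    model_path="edu\stanford\lp\models\lexparser\englishPCFG.ser.gz")

with open('sample.txt', encoding="latin-1") as fin:
    # Each line of sample.txt is assumed to be a quoted sentence, as posted.
    sents = [ast.literal_eval(line.strip()) for line in fin if line.strip()]

for sent in sents:
    for dep_graph in dependency_parser.raw_parse(sent):
        print(list(dep_graph.triples()))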

Can csv-conduit read a string in csv form and parse it into some intermediate datatype?

The documentation found on csv-conduit's GitHub page is scant. My use case involves reading a string in csv form, i.e.:
csv :: String
csv = "\"column1 (text)\",\"column2 (text)\",\"column3 (number)\",\"column4 (number)\"\r\nanId,stuff,12,30.454\r\n"
and transforming it into some intermediate data type. So, supposing we declare a data type Row, then I'd have:
csv' :: Row
csv' = Row (Just "anId") "stuff" 12 (Just 30.454)
But I'm not sure which functions to call. Furthermore, it seems like csv-conduit exports some Row type already, but I'm not sure how to use it.
Here is an example which shows how to add a processing step in a csv-conduit pipeline. Here we just add a column to each input row.
{-# LANGUAGE NoMonomorphismRestriction, OverloadedStrings #-}
module Lib
  where

import Data.Conduit
import Data.Conduit.Binary
import Data.CSV.Conduit
import Data.Text
import qualified Data.Map as Map
import Control.Monad

myProcessor :: Monad m => Conduit (MapRow Text) m (MapRow Text)
myProcessor = do
    x <- await
    case x of
      Nothing -> return ()
      Just m  -> do let m' = Map.insert "asd" "qwe" m
                    yield m'
                    myProcessor

test = runResourceT $
    sourceFile "input.csv" $=
    intoCSV defCSVSettings $=
    myProcessor $=
    (writeHeaders defCSVSettings >> fromCSV defCSVSettings) $$
    sinkFile "output.csv"
Of course, your processing stage doesn't have to produce MapRow Text items - it can produce items of whatever type you want. Use other conduit operations to collect / filter / process that pipeline.
If you have a specific task you want to perform I can address that.

How to convert .txt files to .xls files using Informix 4GL code

I have a question to discuss. I am working on INFORMIX 4GL programs that produce output text files. This is an example of the output:
Lot No|Purchaser name|Billing|Payment|Deposit|Balance|
J1006|JAUHARI BIN HAMIDI|5285.05|4923.25|0.00|361.80|
J1007|LEE, CHIA-JUI AKA LEE, ANDREW J. R.|5366.15|5313.70|0.00|52.45|
J1008|NAZRIN ANEEZA BINTI NAZARUDDIN|5669.55|5365.30|0.00|304.25|
J1009|YAZID LUTFI BIN AHMAD LUTFI|3180.05|3022.30|0.00|157.75|
From those output text (.txt) files, we can open the data manually in Excel (.xls). Is there any 4GL code or command that we can use to open the text files in Microsoft Excel automatically right after we run the program? If there are any ideas, please share them with me. Thank you.
The output shown is in the normal Informix UNLOAD format, using the pipe as a delimiter between fields. The nearest approach to this for Excel is a CSV file with comma-separated values. Generating one of those from that output is a little fiddly. You need to enclose fields containing a comma inside double quotes. You need to use commas in place of pipes. And you might have to worry about backslashes too.
It is a moot point whether it is easier to do the conversion in I4GL or whether to use a program to do the conversion. I think the latter, so I wrote this script a couple of years ago:
#!/usr/bin/env perl
#
# @(#)$Id: unl2csv.pl,v 1.1 2011/05/17 10:20:09 jleffler Exp $
#
# Convert Informix UNLOAD format to CSV

use strict;
use warnings;
use Text::CSV;
use IO::Wrap;

my $csv = new Text::CSV({ binary => 1 }) or die "Failed to create CSV handle ($!)";
my $dlm = defined $ENV{DBDELIMITER} ? $ENV{DBDELIMITER} : "|";
my $out = wraphandle(\*STDOUT);
my $rgx = qr/((?:[^$dlm]|(?:\\.))*)$dlm/sm;

# $csv->eol("\r\n");

while (my $line = <>)
{
    print "1: $line";
    MultiLine:
    while ($line eq "\\\n" || $line =~ m/[^\\](?:\\\\)*\\$/)
    {
        my $extra = <>;
        last MultiLine unless defined $extra;
        $line .= $extra;
    }
    my @fields = split_unload($line);
    $csv->print($out, \@fields);
}

sub split_unload
{
    my($line) = @_;
    my @fields;
    print "$line";
    # Iterate over the delimited fields with /g so the match advances.
    while ($line =~ m/$rgx/g)
    {
        printf "%d: %s\n", scalar(@fields), $1;
        push @fields, $1;
    }
    return @fields;
}

__END__

=head1 NAME

unl2csv - Convert Informix UNLOAD to CSV format

=head1 SYNOPSIS

unl2csv [file ...]

=head1 DESCRIPTION

The unl2csv program converts a file from Informix UNLOAD file format to
the corresponding CSV (comma separated values) format.

The input delimiter is determined by the environment variable
DBDELIMITER, and defaults to the pipe symbol "|".

It is not assumed that each input line is terminated with a delimiter
(there are two variants of the UNLOAD format, one with and one without
the final delimiter).

=head1 EXAMPLES

Input:

  10|12|excessive|cost \|of, living|
  20|40|bou\\ncing tigger|grrrrrrrr|

Output:

  10,12,"excessive","cost |of, living"
  20,40,"bou\ncing tigger",grrrrrrrr

=head1 RESTRICTIONS

Since the csv2unl program does not know about binary blob data, it
cannot convert such data into the hex-encoded format that Informix
requires.

It can and does handle text blob data.

=head1 PRE-REQUISITES

Text::CSV_XS

=head1 AUTHOR

Jonathan Leffler <jleffler@us.ibm.com>

=cut
I generate Excel files from 4GL code by writing XML with the Excel progid (<?mso-application progid="Excel.Sheet"?>) so Excel opens it as such.
It's like writing HTML from 4GL: you just write HTML code to a file. But with Excel you write XML.
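To sketch that second approach outside 4GL: the file names below are made up, and the tags come from the Excel 2003 SpreadsheetML format, so treat the details as illustrative rather than a full specification. A small Python script that wraps a pipe-delimited UNLOAD file in XML that Excel will open looks roughly like this:

# Illustrative only: turn a pipe-delimited UNLOAD file into a minimal
# Excel 2003 SpreadsheetML document that Excel opens directly.
import xml.sax.saxutils as su

header = '''<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
 <Worksheet ss:Name="Sheet1">
  <Table>
'''
footer = '''  </Table>
 </Worksheet>
</Workbook>
'''

with open('report.txt') as src, open('report.xml', 'w') as dst:
    dst.write(header)
    for line in src:
        fields = line.rstrip('\r\n').split('|')
        if fields and fields[-1] == '':
            fields = fields[:-1]          # drop the trailing delimiter
        dst.write('   <Row>\n')
        for field in fields:
            dst.write('    <Cell><Data ss:Type="String">%s</Data></Cell>\n'
                      % su.escape(field))
        dst.write('   </Row>\n')
    dst.write(footer)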

Addressing a specific occurrence of a character in sed

How do I remove or address a specific occurrence of a character in sed?
I'm editing a CSV file and I want to remove all text between the third and the fifth occurrence of the comma (that is, dropping fields four and five). Is there any way to achieve this using sed?
E.g:
% cat myfile
one,two,three,dropthis,dropthat,six,...
% sed -i 's/someregex//' myfile
% cat myfile
one,two,three,,six,...
If it is okay to consider the cut command, then:
$ cut -d, -f1-3,6- file
awk, or any other tool that can split strings on delimiters, is better for this job than sed.
$ cat file
1,2,3,4,5,6,7,8,9,10
Ruby(1.9+)
$ ruby -ne 's=$_.split(","); s[2,3]=nil ;puts s.compact.join(",") ' file
1,2,6,7,8,9,10
using awk
$ awk 'BEGIN{FS=OFS=","}{$3=$4=$5="";}{gsub(/,,*/,",")}1' file
1,2,6,7,8,9,10
A real parser in action
#!/usr/bin/python
import csv
import sys

cr = csv.reader(open('my-data.csv', 'rb'))
cw = csv.writer(open('stripped-data.csv', 'wb'))

for row in cr:
    cw.writerow(row[0:3] + row[5:])
But do note the preface to the csv module:
The so-called CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases. There is no “CSV standard”, so the format is operationally defined by the many applications which read and write it. The lack of a standard means that subtle differences often exist in the data produced and consumed by different applications. These differences can make it annoying to process CSV files from multiple sources. Still, while the delimiters and quoting characters vary, the overall format is similar enough that it is possible to write a single module which can efficiently manipulate such data, hiding the details of reading and writing the data from the programmer.
$ cat my-data.csv
1
1,2
1,2,3
1,2,3,4,
1,2,3,4,5
1,2,3,4,5,6
1,2,3,4,5,6,
1,2,,4,5,6
1,2,"3,3",4,5,6
1,"2,2",3,4,5,6
,,3,4,5
,,,4,5
,,,,5
$ python csvdrop.py
$ cat stripped-data.csv
1
1,2
1,2,3
1,2,3
1,2,3
1,2,3,6
1,2,3,6,
1,2,,6
1,2,"3,3",6
1,"2,2",3,6
,,3
,,
,,

Scraping text from multiple HTML files into a single CSV file

I have just over 1500 html pages (1.html to 1500.html). I have written code using Beautiful Soup that extracts most of the data I need but "misses" out some of the data within the table.
My input: e.g. file 1500.html
My Code:
#!/usr/bin/env python
import glob
import codecs
from BeautifulSoup import BeautifulSoup
with codecs.open('dump2.csv', "w", encoding="utf-8") as csvfile:
for file in glob.glob('*html*'):
print 'Processing', file
soup = BeautifulSoup(open(file).read())
rows = soup.findAll('tr')
for tr in rows:
cols = tr.findAll('td')
#print >> csvfile,"#".join(col.string for col in cols)
#print >> csvfile,"#".join(td.find(text=True))
for col in cols:
print >> csvfile, col.string
print >> csvfile, "==="
print >> csvfile, "***"
Output:
One CSV file with 1500 lines of text and columns of data. For some reason my code does not pull out all the required data but "misses" some of it; e.g. the Address1 and Address2 data at the start of the table do not come out. I modified the code to put in *** and === separators, and I then use perl to turn that into a clean csv file; unfortunately I'm not sure how to rework my code to get all the data I'm looking for!
Find the files where you get missing parameters, and after that try to analyse what happened. I think those files have a different format, or maybe the address field really is missing.
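For what it's worth, one common reason BeautifulSoup appears to "miss" cells is that col.string is None whenever a td contains nested tags (links, spans, line breaks); joining all of the cell's text nodes instead usually recovers those values. A sketch along those lines, keeping the same BeautifulSoup 3 import and separators as above:

#!/usr/bin/env python
import glob
import codecs
from BeautifulSoup import BeautifulSoup

with codecs.open('dump2.csv', "w", encoding="utf-8") as csvfile:
    for file in glob.glob('*html*'):
        soup = BeautifulSoup(open(file).read())
        for tr in soup.findAll('tr'):
            cols = tr.findAll('td')
            # col.string is None for cells that contain nested markup;
            # joining every text node keeps that data as well.
            texts = [u' '.join(col.findAll(text=True)).strip() for col in cols]
            print >> csvfile, u'#'.join(texts)
            print >> csvfile, "==="
        print >> csvfile, "***"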
