Currently two Avro files are generated for a 10 KB file. If I follow the same approach with my actual file (30 MB+), I will get n number of files.
So I need a solution that generates only one or two .avro files even if the source file is large.
Also, is there any way to avoid manually declaring the column names?
Current approach:
spark-shell --packages com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.10:2.0.1
import org.apache.spark.sql.types.{StructType, StructField, StringType}
// Manual schema declaration of the 'ind' and 'co' column names and types
val customSchema = StructType(Array(
StructField("ind", StringType, true),
StructField("co", StringType, true)))
val df = sqlContext.read.format("com.databricks.spark.csv").option("comment", "\"").option("quote", "|").schema(customSchema).load("/tmp/file.txt")
df.write.format("com.databricks.spark.avro").save("/tmp/avroout")
// Note: /tmp/file.txt is input file/dir, and /tmp/avroout is the output dir
Try specifying the number of partitions for your DataFrame while writing the data as Avro (or any other format). To fix this, use the repartition or coalesce DataFrame functions.
df.coalesce(1).write.format("com.databricks.spark.avro").save("/tmp/avroout")
So that it writes only one file in "/tmp/avroout"
Hope this helps!
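As for avoiding the manual schema declaration: spark-csv can infer column types from the data and take column names from a header row, if the file has one. A minimal sketch, assuming /tmp/file.txt has a header line (without one, drop the "header" option and the columns are auto-named C0, C1, ...):

// Let spark-csv infer the schema instead of declaring a StructType by hand.
val inferred = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")       // take column names from the first line
  .option("inferSchema", "true")  // scan the data to guess column types
  .load("/tmp/file.txt")
inferred.coalesce(1).write.format("com.databricks.spark.avro").save("/tmp/avroout")

Note that inferSchema makes an extra pass over the data, which is cheap at 30 MB.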
According to the spec for the structure of an ISO 9660 / ECMA-119 filesystem, the path table contains a record for each path, including the location of its starting sector and its name, but not its size. I can find the directory, but I don't know how many sectors (each normally 2048 bytes) it contains. Is it one? Two? Six?
If I "walk the directory tree", each directory entry includes the referenced location and size, so I can know how many bytes (essentially, how many sectors, since a directory must use entire sectors) to read. However, the path table only includes the starting location, and not the size, leaving me not knowing how many bytes to read.
In an example ISO I have (ubuntu-18.04.1-live-server-amd64.iso, fwiw), the root directory entry in the primary volume descriptor shows:
Root Directory:
Directory Record Length: 34
Extended Attribute Length: 0
Location of Extent: 20 $00000014 00:00:20
Data Length: 2048 $00000800
Recording Date and Time: 23:39:04 07/25/2018 GMT 0
File Flags: $02 visible regular dir non-record no-perms single-extent
File Unit Size: 0
Interleave Gap Size: 0
Volume Sequence Number: 1
File Identifier: . (current directory)
Since it says the Data Length is 2048, I know to read just one sector.
However, the root directory entry in the path table shows:
Path Record Length: 10 $0A
Extended Attribute Length: 0 $00
Location of Extent: 20 $00000014 00:00:20
Parent Directory Number: 1 $0001
File Identifier: . (current directory)
It also points to sector 20, but doesn't tell me how many sectors it uses, leaving me guessing.
Yes, unused bytes in a sector should be all 0x00, so if I read in a sector, read records, and then come to one whose first byte (the length) is 0x00, I know I have reached the end of the records (see the sketch after this list). But that approach has three issues:
If that were the canonical way, why bother including size in the directory entry?
If it includes 2 or 3 sectors, it is more efficient for me to read them all at once than one at a time.
If I have a directory whose records precisely fill a sector, without some size attribute, I don't know if the next sector is supposed to be read as an entry, or if the directory ended here.
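For reference, that heuristic looks roughly like this (a sketch over a single sector already read into memory; each directory record begins with its own length byte):

// Walk directory records within one 2048-byte sector until a zero
// length byte. Returns (offset, recordLength) pairs. As noted above,
// this alone cannot tell whether a directory continues into the next
// sector when its records exactly fill this one.
def recordsInSector(sector: Array[Byte]): List[(Int, Int)] = {
  var off = 0
  val found = List.newBuilder[(Int, Int)]
  while (off < sector.length && (sector(off) & 0xFF) != 0) {
    val len = sector(off) & 0xFF
    found += ((off, len))
    off += len
  }
  found.result()
}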
Basically, I know how to read the ordered path table to find a directory's location, but not how to determine how many sectors to read for the directory itself. I could, in theory, read the parent directory to get this directory's entry and its size, but that adds a seek and a read, and pretty much defeats the purpose of the path table.
Ah, I figured it out. Because a directory's extent always starts with a directory entry for the directory itself, and the data length always occupies bytes 10-17 of that record (10-13 little-endian, 14-17 big-endian), you can just read bytes 10-17 from the beginning of the sector and get the size. Still not as efficient as putting it in the path table itself - no idea why they did not - but it works.
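In code, that looks roughly like the following (a sketch: the ISO filename and the extent LBA are placeholders from the example above, and only the little-endian half of the both-endian field is read):

import java.io.RandomAccessFile

// Seek to the directory's extent and read the data length straight out of
// the first record (the "." entry): bytes 10-13 hold the little-endian half.
val sectorSize = 2048
val extentLba  = 20L  // "Location of Extent" from the path table record

val iso = new RandomAccessFile("ubuntu-18.04.1-live-server-amd64.iso", "r")
try {
  iso.seek(extentLba * sectorSize + 10)
  val buf = new Array[Byte](4)
  iso.readFully(buf)
  val dataLength = (buf(0) & 0xFFL) | ((buf(1) & 0xFFL) << 8) |
                   ((buf(2) & 0xFFL) << 16) | ((buf(3) & 0xFFL) << 24)
  val sectors = (dataLength + sectorSize - 1) / sectorSize
  println(s"directory is $dataLength bytes = $sectors sector(s)")
} finally iso.close()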
I am trying to create a file descriptor using the command:
$ MAHOUT_HOME/core/target/mahout-core--job.jar org.apache.mahout.classifier.df.tools.Describe -p testdata/KDDTrain+.arff -f testdata/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
from the link https://mahout.apache.org/users/classification/partial-implementation.html on my data file. But whatever file I take, and however I change the attribute descriptor string N 3 C 2 N C 4 N C 8 N 2 C 19 N L,
I get the following exception:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong number of attributes in the string
Please help!
There are a couple of reasons why you might get an error like that...
Wrong descriptor: Putting this here for the sake of completeness; you must have already checked this one. You may have given a wrong descriptor for the data. Re-check the number and type of columns, and give them to the descriptor correctly.
Bad separator: Re-check the delimiter used in the data; that can also cause trouble. Maybe the data has a misplaced delimiter in some records. Make sure of that.
Special characters: In my experiments, I have noticed Mahout does not enjoy certain special characters, or data containing characters from languages other than English (unless, of course, you tweak the code). So make sure you have a way of handling them, and you should be good to go.
Anyway, all this fighting is just so you can create a descriptor for the data. ATB.
Old question, but I have a more specific answer that I discovered after landing here with the same problem.
In this particular case, the problem I found was that the format of the data file (from http://nsl.cs.unb.ca/NSL-KDD/) seems to have changed from the example listed on the Mahout Random Forest example page.
The example lists a line format with the specifier
N 3 C 2 N C 4 N C 8 N 2 C 19 N L
but there's an extra element at the end of the lines; for example:
13,tcp,telnet,SF,118,2425,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,26,10,0.38,0.12,0.04,0.00,0.00,0.00,0.12,0.30,guess_passwd,2
which has one more field. Adding another numerical field (N) to the end of the specifier fixed it:
N 3 C 2 N C 4 N C 8 N 2 C 19 N L N
I also had luck using the plain .txt file format instead of the .arff file format.
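To see why the original specifier fails, remember how the descriptor expands: a number token repeats the type token that follows it (N = numerical, C = categorical, L = label). A quick sanity check, not Mahout code, that mirrors that convention and compares the counts:

// Expand the descriptor and compare its attribute count against the
// number of fields in a sample data line; Describe throws
// "Wrong number of attributes in the string" when they differ.
def expand(tokens: List[String]): List[String] = tokens match {
  case count :: tpe :: rest if count.forall(_.isDigit) =>
    List.fill(count.toInt)(tpe) ::: expand(rest)
  case tpe :: rest => tpe :: expand(rest)
  case Nil => Nil
}

val attrs = expand("N 3 C 2 N C 4 N C 8 N 2 C 19 N L N".split("\\s+").toList)
val line = "13,tcp,telnet,SF,118,2425,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,26,10,0.38,0.12,0.04,0.00,0.00,0.00,0.12,0.30,guess_passwd,2"
println(s"descriptor: ${attrs.size} attributes, line: ${line.split(",").length} fields")  // both 43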
I'm trying to redefine the XFD file structure based on the file settings analysis below.
Analysis Result:
Max record length 300
Min record length 61
No of records 466
Blocking factor 1
Preallocation amount 0
Extension amount 1
Compression factor 80
Encrypted? No
Number of keys 4
Primary key has 1 segment
Key size 13 offset 0
Key 02 has 3 segments
Duplicates are allowed
Key size 4 offset 4
Key size 40 offset 21
Key size 4 offset 0
Key 03 has 3 segments
Duplicates are allowed
Key size 4 offset 4
Key size 8 offset 13
Key size 4 offset 0
Key 04 has 1 segment
Duplicates are allowed
Key size 10 offset 21
I'm linking pote.XFD to the pote Acu database file through AcuODBC. Here is the given XFD file structure, which has already failed to obtain data through AcuODBC:
XFD,02,POTE,POTE
00026,00018,002
1,0,008,00000
02
INTIND-UNIQ
INTIND-OCC
1,1,010,00008
01
IND16
000
0004,00004
00000,00004,12,00009,+00,000,000,INTIND-UNIQ
00004,00004,12,00009,+00,000,000,INTIND-OCC
00008,00010,16,00010,+00,000,000,IND16
00018,00008,16,00008,+00,000,000,TERM20
My question: how can I change my pote.XFD structure, based on the analysis given at the top, to form a correct XFD structure?
I know there are four keys in this COBOL table, but I still don't know how to manually configure this data structure from the given analysis information.
Below is a reference guide I've already obtained on how to form a correct XFD structure manually; I hope an expert can explain how to apply it here.
# This xfd layout is a generic one suitable for accessing any
# .DAD file. However, it needs to be copied and amended for each
# DAD file that you wish to get access to.
# The simplest scenario is that you copy dad.xfd to a new file
# with the same name as the database you wish to access and extension .XFD
# Then edit this new file and replace the two instances of 'FILE' with the
# filename that you want to access. e.g. if you want to have ODBC access to
# icvc.dad then copy dad.xfd to new file icvc.xfd and change line
# XFD,02,FILE,FILE to be
# XFD,02,ICVC,ICVC
#
# If this doesn't work then the database file you are trying to access has
# probably set different values for search index sizes. The easiest way to
# check this is to run $list for the database that you want to access and
# note down all the key information that it gives. If that is different
# to the key info in this file then you need to modify the xfd file to match
# In the current xfd there are four indexes defined. In all cases the first
# index will be correct and so should the third index. However, the other
# two may need to be modified or removed if not present.
# Index 4 is optional and is not present if the database is rebuilt without
# the fast list option.
# Explaining the details of the 2nd index. The 1st line consists of 8 values separated
# by commas. The first value of 3 is how many segments the index consists of.
# second value 1 means duplicates allowed (0 means NO DUPS).
# The remaining six fields are three pairs of key size and byte offset, e.g.
# first index segment is 4 bytes long and starts from byte 4, second index
# segment is 20 bytes long and starts from byte 21 etc.
# The second line specifies how many field names there are to follow and lines 3
# to 5 are the three field names as defined lower in this xfd. For instance
# if you look at field D1UNIQ you will see it is defined as starting from byte 0
# and is 4 bytes long. This corresponds to the values entered in the key definition.
#
XFD,02,ICVC,ICVC
00300,00041,004
# [Key Section]
# [1st index]
01,0,013,00000
04
D1UNIQ
D1NAME
D1NAMX
D1OCCU
# [2nd index]
3,1,004,00004,020,00021,004,00000
03
D1NAME
D1TUPP
D1UNIQ
# [3rd index]
3,1,004,00004,008,00013,004,00000
03
D1NAME
D1NUMB
D1UNIQ
# [4th index]
1,1,020,00021
01
D1TUPP
# [Condition Section]
000
# [Field Section]
0015,00015,00016
00000,00013,16,00013,+00,000,999,D1KEY
00000,00004,12,00009,+00,000,000,D1UNIQ
00004,00004,16,00004,+00,000,000,D1NAME
00008,00001,16,00001,+00,000,000,D1NAMX
00009,00004,12,00009,+00,000,000,D1OCCU
00013,00008,11,00018,-06,000,000,D1NUMB
00021,00040,16,00040,+00,000,000,D1TUPP
00061,00001,01,00001,+00,000,000,D1GRAD
00062,00004,12,00008,+00,000,000,D1DLUP
00066,00004,12,00008,+00,000,000,D1TLUP
00070,00004,16,00004,+00,000,000,D1OLUP
00074,00001,16,00001,+00,000,000,D1TYPE
00075,00002,16,00002,+00,000,000,D1FORM
00077,00160,16,00160,+00,000,000,D1TEXT
00237,00001,16,00001,+00,000,000,D1PRIN
00238,00062,16,00062,+00,000,000,D1FILL
First, do what the comment block at the top of the template says: copy dad.xfd to pote.xfd and change the line XFD,02,FILE,FILE to XFD,02,POTE,POTE.
If that does not work, follow the template's instructions for finding out how many keys there are and the values for the second key. If you discover there are only three keys, remove the fourth key from the template. If the values for the second key are different, change them in the [Key Section].
Get it working before changing anything else.
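Once the template works, the key lines have to be rebuilt to match the analysis at the top. A hedged sketch of how the [Key Section] lines for pote.XFD could be derived from it, following the segments, dup-flag, (size, offset)-pairs format that the template comments describe (the field names each key references still have to come from your own record layout):

// Build [Key Section] lines from the analysis: segment count, dup flag
// (1 = duplicates allowed), then a 3-digit size and 5-digit offset per segment.
case class Segment(size: Int, offset: Int)
case class Key(dups: Boolean, segments: List[Segment])

val keys = List(
  Key(dups = false, List(Segment(13, 0))),                                // primary key
  Key(dups = true,  List(Segment(4, 4), Segment(40, 21), Segment(4, 0))), // key 02
  Key(dups = true,  List(Segment(4, 4), Segment(8, 13), Segment(4, 0))),  // key 03
  Key(dups = true,  List(Segment(10, 21)))                                // key 04
)

def keyLine(k: Key): String =
  s"${k.segments.size},${if (k.dups) 1 else 0}," +
    k.segments.map(s => f"${s.size}%03d,${s.offset}%05d").mkString(",")

keys.foreach(k => println(keyLine(k)))
// 1,0,013,00000
// 3,1,004,00004,040,00021,004,00000
// 3,1,004,00004,008,00013,004,00000
// 1,1,010,00021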