I have 2 txt files.
The 1st txt file is like this:
sequence_id description
Solyc01g005420.2.1 No description available
Solyc01g006950.3.1 "31.4 cell.vesicle transport Encodes a syntaxin localized at the plasma membrane (SYR1 Syntaxin Related Protein 1 also known as SYP121 PENETRATION1/PEN1). SYR1/PEN1 is a member of the SNARE superfamily proteins. SNARE proteins are involved in cell signaling vesicle traffic growth and development. SYR1/PEN1 functions in positioning anchoring of the KAT1 K+ channel protein at the plasma membrane. Transcription is upregulated by abscisic acid suggesting a role in ABA signaling. Also functions in non-host resistance against barley powdery mildew Blumeria graminis sp. hordei. SYR1/PEN1 is a nonessential component of the preinvasive resistance against Colletotrichum fungus. Required for mlo resistance. syntaxin of plants 121 (SYP121)"
Solyc01g007770.2.1 No description available
Solyc01g008560.3.1 No description available
Solyc01g068490.3.1 20.1 stress.biotic Encodes a protein containing a U-box and an ARM domain. senescence-associated E3 ubiquitin ligase 1 (SAUL1)
..
.
The 2nd txt file has the gene IDs:
Solyc02g080050.2.1
Solyc09g083200.3.1
Solyc05g050380.3.1
Solyc09g011490.3.1
Solyc04g051490.3.1
Solyc08g006470.3.1
Solyc01g107810.3.1
Solyc03g095770.3.1
Solyc12g006370.2.1
Solyc03g033840.3.1
Solyc02g069250.3.1
Solyc02g077040.3.1
Solyc03g093890.3.1
..
.
.
Each txt file has many more lines than the ones I show. I just want to know what grep command I should use so that I get only the genes that are listed in the 2nd txt file, pulled from the 1st file with the description next to them.
thanks
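A minimal command that should do this, assuming the annotated table is file1.txt and the ID list is file2.txt (-F treats each ID as a fixed string rather than a regex, -w avoids partial matches between similar IDs, -f reads the patterns from the second file):
grep -F -w -f file2.txt file1.txt > matched_with_descriptions.txt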
Use Case
I have some terabytes of US property data to merge. It is spread across two distinct file formats and thousands of files. The source data is split geographically.
I can't find a way to branch a single pipeline into many independent processing flows.
This is especially difficult because the Dataframe API doesn't seem to support a PTransform on a collection of filenames.
Detailed Background
The distribution of files is like this:
StateData - 51 total files (US states + DC)
CountyData - ~2000 total files (county specific, grouped by state)
The ideal pipeline would split into thousands of independent processing steps and complete in minutes.
1 -> 51 (each US state + DC starts processing)
51 -> thousands (each US state then spawns a process that merges the counties, combining at the end for the whole state)
The directory structure is like this:
📂state-data/
|-📜AL.zip
|-📜AK.zip
|-📜...
|-📜WY.zip
📂county-data/
|-📂AL/
|-📜COUNTY1.csv
|-📜COUNTY2.csv
|-📜...
|-📜COUNTY68.csv
|-📂AK/
|-📜...
|-📂.../
|-📂WY/
|-📜...
Sample Data
This is extremely abbreviated, but imagine something like this:
State Level Data - 51 of these (~200 cols wide)
uid      census_plot     flood_zone
abc121   ACVB-1249575    R50
abc122   ACVB-1249575    R50
abc123   ACVB-1249575    R51
abc124   ACVB-1249599    R51
abc125   ACVB-1249599    R50
...      ...             ...
County Level Data - thousands of these (~300 cols wide)
uid      county   subdivision      tax_id
abc121   04021    Roland Heights   3t4g
abc122   04021    Roland Heights   3g444
abc123   04021    Roland Heights   09udd
...      ...      ...              ...
So we join many county-level to a single state level, and thus have an aggregated, more-complete state-level data set.
Then we aggregate all the states, and we have a national level data set.
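Purely to illustrate the shape of that join (not the Beam pipeline itself), here is a single-machine pandas sketch; the file layout follows the directory structure above and everything else is hypothetical:
import glob
import pandas as pd

def merge_state(abbr):
    # join the many county files onto the single state file by uid
    state = pd.read_csv(f"state-data/{abbr}.zip", compression="zip")
    counties = pd.concat(
        pd.read_csv(path) for path in glob.glob(f"county-data/{abbr}/*.csv")
    )
    return state.merge(counties, on="uid", how="left")

# aggregate every state into one national data set
national = pd.concat(merge_state(abbr) for abbr in ["AL", "AK", "WY"])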
Desired Outcome
I can successfully merge one state at a time (many county to one state). I built a pipeline to do that, but the pipeline starts with a single CountyData CSV and a single StateData CSV. The issue is getting to the point where I can load the CountyData and StateData.
In other words:
#
# I need to find a way to generalize this flow to
# dynamically created COUNTY and STATE variables.
#
import apache_beam as beam
import pandas as pd
from apache_beam.dataframe.convert import to_pcollection
from apache_beam.dataframe.io import read_csv

COUNTY = "county-data/AL/*.csv"
STATE = "state-data/AL.zip"

def key_by_uid(elem):
    return (elem.uid, elem)

with beam.Pipeline() as p:
    county_df = p | read_csv(COUNTY)
    county_rows_keyed = to_pcollection(county_df) | beam.Map(key_by_uid)

    state_df = pd.read_csv(STATE, compression="zip")
    state_rows_keyed = to_pcollection(state_df, pipeline=p) | beam.Map(key_by_uid)

    merged = ({"state": state_rows_keyed, "county": county_rows_keyed}
              | beam.CoGroupByKey()
              | beam.Map(merge_logic))  # merge_logic: my per-uid merge function
    merged | WriteToParquet()  # placeholder for the real Parquet sink
Conceptually, the flow I want is:
1. Start with a list of states.
2. By state, generate filepatterns to the source data.
3. By state, load and merge the files.
4. Flatten the output from each state into a US data set.
5. Write to a Parquet file.
with beam.Pipeline(options=pipeline_options) as p:
    merged_data = (
        p
        | beam.Create(cx.STATES)
        | "PathsKeyedByState" >> tx.PathsKeyedByState()
        # ('AL', {'county-data': 'gs://data/county-data/AL/COUNTY*.csv', 'state-data': 'gs://data/state-data/AL.zip'})
        | "MergeSourceDataByState" >> tx.MergeSourceDataByState()
        | "MergeAllStateData" >> beam.Flatten()
    )
    merged_data | "WriteParquet" >> tx.WriteParquet()
The issue I'm having is something like this:
I have two filepatterns in a dictionary, per state. To access those I need a DoFn, because they only exist as elements inside the pipeline.
To express how the data flows, I need access to the Pipeline object itself, since read_csv is a PTransform applied directly to it, e.g. df = p | read_csv(...).
These appear to be incompatible needs.
Here's an alternative answer.
Read the state data one at a time and flatten them, e.g.
state_dataframe = None
for state in STATES:
    # each application of read_csv inside the loop needs its own label
    df = p | f"Read{state}" >> read_csv('/path/to/state')
    df['state'] = state
    if state_dataframe is None:
        state_dataframe = df
    else:
        state_dataframe = state_dataframe.append(df)
Similarly for county data. Now join them using dataframe operations.
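A sketch of the county half and the join, under the same assumptions (STATES, the filepatterns, and the uid column all come from the earlier examples; the exact merge semantics are up to you):
county_dataframe = None
for state in STATES:
    df = p | f"ReadCounties{state}" >> read_csv(f"county-data/{state}/*.csv")
    df['state'] = state
    if county_dataframe is None:
        county_dataframe = df
    else:
        county_dataframe = county_dataframe.append(df)

# join state and county rows on uid using deferred dataframe operations
merged = state_dataframe.merge(county_dataframe, on="uid", how="left")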
I'm not sure exactly what kind of merging you're doing here, but one way to structure this pipeline might be to have a DoFn that takes a county-data filename as its input element (i.e. you'd have a PCollection of county data filenames), opens it using "normal" Python (e.g. pandas), and then reads the relevant state data in as a side input to do the merge.
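A rough, single-state sketch of that idea (file paths and the merge itself are placeholders; the point is the shape: filenames as pipeline elements, the state table as a side input):
import apache_beam as beam
import pandas as pd

class MergeCountyWithState(beam.DoFn):
    def process(self, county_filename, state_df):
        county = pd.read_csv(county_filename)  # "normal" pandas read
        merged = county.merge(state_df, on="uid", how="left")
        for row in merged.to_dict(orient="records"):
            yield row

with beam.Pipeline() as p:
    state_df = pd.read_csv("state-data/AL.zip", compression="zip")
    state_side = beam.pvalue.AsSingleton(p | "StateTable" >> beam.Create([state_df]))

    (p
     | "CountyFiles" >> beam.Create(["county-data/AL/COUNTY1.csv",
                                     "county-data/AL/COUNTY2.csv"])
     | "MergeWithState" >> beam.ParDo(MergeCountyWithState(), state_side))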
I am writing a PostScript file through code in VB.net and pslibrary. My main purpose is tray switching between 3 different trays and stapling the sets based on variable input. For example, in a 100-page PostScript file the first two pages are simplex and are printed from two different trays. From the third page we use the third tray, and the next 10 pages from that tray should be stapled as one set. After page eleven, the next 8 pages should be stapled separately, and so on.
Note: Ricoh Aficio, Gestetner and Toshiba printers are in use (2105 and 2090 models).
Tray switching is working fine in the file; only stapling is not.
Stapling is not working through PostScript, although it works fine when done on the machine directly.
The following code is being used to do the work:
%%Page: 3 3
%%BeginPageSetup
<< /PageSize[595 841] /Duplex false /MediaColor (Red) /Jog 3 /Staple 3 /StapleDetails << /Type 1 /StapleLocation (SinglePortrait) >>>> setpagedevice
save
%%EndPageSetup
(InvoiceNo 50011287697) 72 755.28 /ArialMT 15 SF
%EndPage: 3
restore
showpage
<</PageSize [595 842]/MediaType (Red) /MediaColor (Red) /MediaWeight 75/Duplex false>> setpagedevice
%%Page: 4 4
%%BeginPageSetup
save
%%EndPageSetup
(InvoiceNo 50011287697) 72 755.28 /ArialMT 15 SF
%EndPage: 4
restore
showpage
<< /Jog 0 >> setpagedevice
<< /Staple 0 >> setpagedevice
But no stapling is done. Printing starts from the first page and the output does go out through the finisher, but the printer is just ignoring the Staple commands.
Things like tray selection and stapling are printer specific. You'll need to extract appropriate code fragments from the .PPD files for the printers in question.
Depending on the exact code fragments needed, it may be possible to combine them into a single PostScript fragment that will work on all of these printers. But it's unlikely that this will give a fully general solution.
For example, the Ricoh Aficio 2105 PPD file has fragments like this:
<<
/Collate true /CollateDetails <</Type 6 /AlignSet true>>
/Staple 2 /StapleDetails << /Type 14 /Angle 0 /Position 0 >>
>> setpagedevice
The Position changes for different locations but is always a small integer for this printer.
Gestetner 2212 shows fragments that look the same to me as for the Ricoh.
The fragment for a Toshiba 2500C is completely different:
<</TSBPrivate (DSSC PRINT STAPLING=769) >> setpagedevice
Some underlines in my image are very close to the text. For that particular text, Tesseract is unable to produce accurate results. I have attached the image and text file. Is there any way I can increase the accuracy of the text?
I have tried to remove the underlines with some image processing techniques, but the lines that are close to the text are not getting removed.
And are there any parameters in Tesseract which I can use to improve the accuracy? Thanks in advance.
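For what it's worth, one common way to strip long horizontal underlines before OCR is a morphological opening with a wide, flat kernel. A minimal sketch with OpenCV and pytesseract (the file name and kernel width are made up and will need tuning for your scan; lines that touch the text can still leave gaps in the characters):
import cv2
import pytesseract

img = cv2.imread("form.png", cv2.IMREAD_GRAYSCALE)
# binarize with text as white on black so morphology works on the strokes
binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# keep only shapes much wider than they are tall, i.e. the underlines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, horizontal_kernel)

# subtract the detected lines and flip back to dark text on white
cleaned = cv2.bitwise_not(cv2.bitwise_and(binary, cv2.bitwise_not(lines)))

print(pytesseract.image_to_string(cleaned))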
The image which I am trying to run:
Its result:
ARR!
D.
1.
\OCIJHJO‘LI'IJ?‘
3..
10.
E.
F.
SITE NUMBER
ARCHEOLOGICAL DESCRIPTION
General site description SITE IS COVERED WITH LARGE PINES AND IS IN RELATIVLY
GOOD CONDITION, snowING'EITTrE‘SIGNS‘OFTRosmN—EXCEPT—AEONG-Tmm
_"—""NHERE IT DROPS or INTO FLOODPLAIN OF CREEK THERE ARE A EEN ANIMAL TRAILS THAT
HAVE APPARENTLY ERODED OUT IN THE PAST. ONE OF THESE WAS QUIET DEEP ACCORDING
“TO AUGER TEST, BUT HAS FILLED UP WITH SAND AND GROWN OVER AGAIN. FIRST AUGER TEST
“WAS INTO THIS DEE P"GULLY" AND GAVE A FALSE IMPRESSION AS TO THE TRUE DEPTH OF
SITE. THIS TEST HOLE PRODUCED LIEHLQ FLAKES ALL THE WAY DOWN TO 42 INCHES AND
_m STERILE SAND DOWN TO 60 INCHES= REST OF SITE PRODUCED SAND AND CHIPS ONLY TO
l- an I ' A: : I L I i : ‘5!) THIS 3 1.0 5.- 3.. 'Y __
FINE SITE.
Site size .AT L S - E Y CONSIDERABLY MQBE
Nature of archeological deposition EAIBIEIHNDESTURBED EXCEPT ALONG THE EDGES OF SITE
T D0.
Site depth. 20-22 INCHES
Hidden
Faunal preservation
Floral preservation
Human remains
Cultural features (type and number)
Charcoal preservation
DATA RECOVERY METHODS
Ground surface visibility: 0% x 1-251 26—50% 51-75% 76—100%
Description of ground cover iMATURE PINE FOREST
Time spent collecting Number of peeple collecting
Description of surface collecting methods
Type and extent of testing and/or excavation FIVE TEST HOLES WERE SUNK IN SITE WITH 8"
AUGERa THESE WERE TAKEN DOWN IN 6" LEVELS UNTIL STERILE CLAY WAS REACHED. DIRTTA T-
FROM EACH 6" LEVEL WAS SCREENED THROUGH_l/4" WIRE MESH AND ARTIFACTS KEPT FOR
ANALYSIS. ALL TEST HOLES QERE PLQIIED EIIE TRANSIT IN RELATION TO DATUM MARKER
WHI IS A PIPE ‘ _ -: fl' : 3:0. . .: U' J I: : : . !" uFF 3L
GROUND. P__\l: IS I : um \I' :i “I ' I ' .M' I ' D' . I’ I 2! ti 0 .1. ' -. _ .L l' .
ARCHEOLOGICAL COMPONENTS
Paleo-Indian Late Whodland 17th century
Early Archaic Mississippian 18th century
Middle Archaic Late prehistoric 19th century
Late Archaic Unknown prehistoric ___ 20th century __
Early Woodland Ceramic prehistoric ____ Unknown historic
Middle Woodland 16th century
I'm having a problem forming the field section's structure in the .xfd file after analysing the file by issuing the command "vutil32.exe -i -kx pogl.dad". I hope somebody could help me work out how to form the field structure as highlighted below. I've uploaded a sample of my file, known as "pglc.dad"; I hope someone with expert knowledge can guide me on how to form the .xfd file from it. Thanks.
Result from vutil32.exe
file size: 250880
record size (min/max): 121/1024 compressed(80%)
# of keys: 4
key size: 16:02 31:03 56:03 15
key offset: 0 0 0 1
duplicates okay: N N N N
block size: 512
blocks per granule: 1
tree height: 4/2/2.7
# of nodes: 200
# of deleted nodes: 1
total node space: 101800
node space used: 67463 (66%)
user count: 0
Key Dups Seg-1 Seg-2 Seg-3 Seg-4 Seg-5 Seg-6
(sz/of) (sz/of) (sz/of) (sz/of) (sz/of) (sz/of)
0 N 1/0 15/1
1 N 1/0 15/66 15/1
2 N 1/0 40/81 15/1
3 N 15/1
Here is my construction of the .xfd file so far.
XFD,02,PGLC,PGLC
00300,00041,004
1,0,013,00000
01
PGSTAT
3,0,004,00004,020,00021,004,00000
3
PGSTAT
PGDESC
PGLINE
3,0,004,00004,008,00013,004,00000
03
PGSTAT
PGDESC
PGLINE
1,0,012,00021
01
PGSTAT
000
0150,00150,00003 =================>> How can I form this field section?
00000,00013,16,00016,+00,000,000,PGSTAT
00000,00001,16,00001,+00,000,000,PGDESC
00001,00015,16,00015,+00,000,000,PGLINE
Here is the link for my pglc.dad: http://files.engineering.com/getfile.aspx?folder=080fdad6-b1d5-4a37-8dd0-b89f9a985c69&file=PGLC.DAD
Thanks in advance to anyone who can help.
I know the XFD format intimately as I have written a couple of parsers of this file format in both Perl and Cobol.
Having said that, I would strongly recommend that you do not try to hand craft an XFD file from scratch.
If you have an AcuCobol (MicroFocus) compiler, and the source of the file's SELECT and FD definitions, then you can create a very small Cobol program that has just the SELECT and FD definitions and then compile the program using:
ccbl32.exe -Fx <program>
That will create an XFD file for the indexed file definition. Note, you can specify a directory for the created XFD file using the -Fo <directory> option.
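For example (the program name and output directory here are only illustrative):
ccbl32.exe -Fx -Fo c:\xfd pglcdef.cbl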
If you don't have the source of the file definitions, then you are just going to be guessing what and where the fields are. The indexed file by itself will not tell you that information. I can see from extracting the data in your file (using the vutil -e option) that the file contains binary data as well as text, so without knowing exactly what PICture those fields are (COMP-?) you will be struggling to figure out the structure of those fields.
I am cleaning up an image using Leptonica and then passing it to Tesseract for OCR. However, Tesseract is not able to recognize the characters even though the image is of high quality. The image specifications are as follows:
1 bpp, uncompressed, 1280 x 960, 300 dpi horizontal and vertical resolution
Following are the image processing operations I carry out, in sequence, using Leptonica:
pixConvertTo8
pixBackgroundNormSimple
pixOtsuAdaptiveThreshold
pixContrastTRC {Regarding this - I am passing high values like 1.0 or even 5.0 but the image doesn't really change}
pixFindSkew
pixRotate {rotate by the angle found by pixFindSkew}
pixRotate90 {do this 4 times to read the image in all 4 orientations}
pixClipRectangle {crop image}
Finally, the tesseract command.
I get garbage characters in the output. A sample input image is as follows.
The output that I get is as follows:
Final K-1
II]
s h d | K-1 ,.,
(F°o.~?n‘i&1) 5/>.©12 mm E2‘;
Deparlrnenl of tho Treasury , ,
I 1 I l I
‘mama, Ravenuo SGMW For cnlundm your 201), ‘ " °F°$ "'100fTIO
or lax yum boqmnnnq 7 _ 20\Q_
‘ 7660
and ondmg _ W vv I go
Beneï¬ciary's Share of Income, Deductions,
cl'editS, etc. F 800 buck 01 loam nnd lnstruoflons»
___lnformatI0n About mo Estate or Trust
‘ Ordmary d|v|dm
i 12113
_
‘; Quahfmd dlVIdG
\ 8132
3 1
Net shun-term
A Estate's at trust's omgiuym ldonnlmnluon numbol
56-0987654
B Estate's u trust‘: namo
ESTATE OF MARTHA SMITH
0 Fiduc§ary's name, address, clly, smlu‘ and /IP codo
N01 long~lerm c
\ 24043
u
‘ 28% vale gann
Ti
Unreptumd 5
Omar porfloho 4
nonbuslness lfll
/\..4........ L. ._.._ ,.
What should I do to improve the accuracy?
Part 2:
I tried to follow this link and created an eng.user-words.traineddata file and a bazaar.train file and tried to run with "bazaar" as an additional parameter, but I get a "read_params_file: can't open bazaar" error.
Any suggestions?
For part one,
I don't know if the image you posted up here is the actual one you've been trying to scan, but when I tried it, I got this:
Department oi the Treasury Internal Revenue Service
For cnlundm your V019, 1 ‘ '"l0T°5' |nC0m0
or tax yam boqlnnlnq , 2o12_ ‘ 7660 and ondlng I go 2: ‘ Ordinary
dlvndm " “T ' x 12113
1; Quali?ed dwnda ‘ 8132 Netshun-term:
M Not long ~terrn c
i 24043 Ab ‘ 2896 ralagann
Bene?ciary’s Share of Income, Deductions, Cfedits, etc. 5 800 back oi
form nnd Instruc?ons
| Partl Information About the state or Trust
A Estate's or IvLsl's omuoym Idonnlncnluon numhu
56-0987654
8 Estate‘: a trust‘: namo
ESTATE OF MARTHA SMITH
M: Unreptumd 5
017161 portioho : nonbuslness Inl
C Fiduc§ary's name, address, city, smlul an-(V1/If’ Eooo
It's not great but it seems a bit better than what you got. I'm using Tesseract v3 on Windows.
My basic command was:
- tesseract.exe nnm.tif nnm
For part two,
your bazaar file should be in the configs folder
.....\Tesseract-OCR\tessdata\configs\bazaar
and there are some requirements for it to be saved in a particular format, such as UTF-8 with only a LF at the end of each line, not CR + LF; it seems to be quite fussy about file formats.
you can get a copy of it from http://code.metager.de/source/raw/google/tesseract-ocr/tessdata/configs/bazaar
I made a digits config file that I used for scanning some images where I was only interested in the numbers and that worked fine:
- tesseract.exe scanfile.jpg scanfile digits
The documentation for Tesseract is pretty poor and it doesn't work well on a PC.
For part one,
I think you should consider the preprocessing done by Capture2Text. It uses both Leptonica and Tesseract to OCR the images.
I am not sure about part 2.