Mapping Fasta to Fasta + Bed file to make new bed file - alignment

I've been thinking a lot about this problem and I don't know how to solve it (I also asked here https://www.biostars.org/p/426234/ but got no answer).
I have a fasta file with microRNA precursors and a bed file with the coordinates of these precursors in a genome.
fasta file of microRNA precursors:
>LQNS02278089.1_34108
CGGTCGTGATGGGAGCAAATTTGAACAATTAAATAGCAAATTGCACTCGTCCCGGCCTGC
>LQNS02278089.1_34106
CGGTCGTGATTGGTGCAACTTGGGTCACTTAACCGCCAATTGCACTGATCCCGGCCTGC
>LQNS02278089.1_34110
CGGCCGGTATGAGGGCAAATCAATTTCTGTATAAATGACGAATTGCACTCGTCCCGGCCTTC
bed file precursors:
LQNS02278089.1 848170 848230 LQNS02278089.1_34108 2249659.3 - 848170 848230 0,0,255
LQNS02278089.1 847652 847711 LQNS02278089.1_34106 1566285.5 - 847652 847711 0,0,255
LQNS02278089.1 848490 848552 LQNS02278089.1_34110 882643.1 - 848490 848552 0,0,255
I also have a fasta file of mature microRNA sequences that I would like to map to the precursor fasta file to obtain another bed file for the mature microRNAs.
fasta file mature microRNAs:
>LQNS02278089.1_34108
AATTGCACTCGTCCCGGCCTGC
>LQNS02278089.1_34106
AATTGCACTGATCCCGGCCTGC
>LQNS02278089.1_34110
AATTGCACTCGTCCCGGCCTTC
Any help or recommendation would be greatly appreciated!
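Since each mature sequence appears verbatim inside its precursor, one way to approach this is a minimal sketch (not a tested tool): find each mature miRNA in its precursor by exact substring match, then shift the precursor's BED coordinates by that offset, counting from the opposite end on the minus strand. This assumes 0-based half-open BED coordinates, that precursor and mature records share the same FASTA header, and that the precursor FASTA already holds the strand-corrected (transcribed) sequence; `parse_fasta` and `mature_bed` are hypothetical helper names.

```python
# Sketch: derive mature-miRNA BED rows from precursor FASTA + BED.
# Assumes exact substring matches and strand-oriented FASTA sequences.

def parse_fasta(lines):
    """Return {header: sequence} from FASTA-formatted lines."""
    seqs, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line
    return seqs

def mature_bed(precursors, matures, bed_rows):
    """Yield (chrom, start, end, name, score, strand) rows for matures."""
    bed = {row[3]: row for row in bed_rows}   # key on the BED name column
    for name, mat in matures.items():
        chrom, p_start, p_end, _, score, strand = bed[name][:6]
        offset = precursors[name].find(mat)   # offset within the precursor
        if offset < 0:
            continue                          # mature not found: skip it
        if strand == "+":
            start = int(p_start) + offset
        else:                                 # minus strand: offset counts
            start = int(p_end) - offset - len(mat)  # from the genomic end
        yield (chrom, start, start + len(mat), name, score, strand)
```

With the sample data above, `LQNS02278089.1_34108` sits at offset 38 of its 60 nt precursor, so on the minus strand the mature maps near the precursor's genomic start rather than its end.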

Related

Error when opening a .nc file with raster package

I'm new to the raster package in R. I was trying to open a .nc file with the raster package and an error popped up. In case you want to try, I was using a dataset from Copernicus of monthly sea salinity for the years 2018 and 2019 (the grid was the Quebec St. Lawrence estuary and the Gaspésie coast).
I have opened similar data files before but never got this error, and a search online did not clarify much.
Here is my script
library(raster)
library(ncdf4)
#Load the .nc files describing SSS.
SSS = stack('SSS.nc')
and the output error
Warning message:
In .rasterObjectFromCDF(x, type = objecttype, band = band, ...) :
"level" set to 1 (there are 17 levels)
Thanks
I expected to create a RasterStack object to work with.
There is no error; there is a warning. If you want another level, you can select it.
Either way, the "raster" package is obsolete, and you should try this with terra.
You can probably do
library(terra)
x <- rast('SSS.nc')
"Probably" because you do not provide a file. You should at least include a hyperlink to the file you are using. It is hard to help you without your example being reproducible.

How to grep between two txt files

I have 2 txt files
The 1st txt file looks like this:
sequence_id description
Solyc01g005420.2.1 No description available
Solyc01g006950.3.1 "31.4 cell.vesicle transport Encodes a syntaxin localized at the plasma membrane (SYR1 Syntaxin Related Protein 1 also known as SYP121 PENETRATION1/PEN1). SYR1/PEN1 is a member of the SNARE superfamily proteins. SNARE proteins are involved in cell signaling vesicle traffic growth and development. SYR1/PEN1 functions in positioning anchoring of the KAT1 K+ channel protein at the plasma membrane. Transcription is upregulated by abscisic acid suggesting a role in ABA signaling. Also functions in non-host resistance against barley powdery mildew Blumeria graminis sp. hordei. SYR1/PEN1 is a nonessential component of the preinvasive resistance against Colletotrichum fungus. Required for mlo resistance. syntaxin of plants 121 (SYP121)"
Solyc01g007770.2.1 No description available
Solyc01g008560.3.1 No description available
Solyc01g068490.3.1 20.1 stress.biotic Encodes a protein containing a U-box and an ARM domain. senescence-associated E3 ubiquitin ligase 1 (SAUL1)
..
.
The 2nd txt file has the gene ids:
Solyc02g080050.2.1
Solyc09g083200.3.1
Solyc05g050380.3.1
Solyc09g011490.3.1
Solyc04g051490.3.1
Solyc08g006470.3.1
Solyc01g107810.3.1
Solyc03g095770.3.1
Solyc12g006370.2.1
Solyc03g033840.3.1
Solyc02g069250.3.1
Solyc02g077040.3.1
Solyc03g093890.3.1
..
.
.
Each txt file has many more lines than the ones I show. I just want to know which grep command I should use so that I only get the genes that are in the 2nd txt file, extracted from the 1st file with the description next to them.
Thanks
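The classic grep answer here is `grep -wFf ids.txt descriptions.txt` (`-f` reads the patterns from the id file, `-F` treats them as fixed strings, `-w` matches whole words only, so `Solyc01g005420.2` cannot match `Solyc01g005420.2.1`); the file names are placeholders for your two files. The same filtering as a checkable Python sketch, keying strictly on the first column:

```python
# Sketch of the grep -wFf approach in Python: keep only the lines of
# the description file whose first column (the gene id) appears in
# the id file. Both arguments are lists of text lines.

def filter_descriptions(desc_lines, id_lines):
    """Return description lines whose first field is a wanted gene id."""
    wanted = {line.strip() for line in id_lines if line.strip()}
    return [line for line in desc_lines
            if line.strip() and line.split(None, 1)[0] in wanted]
```

Matching on the first column only is slightly safer than plain grep, since an id could in principle also occur inside a description.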

How can I generate a single .avro file for large flat file with 30MB+ data

Currently, two avro files are generated for a 10 kB file; if I follow the same approach with my actual file (30 MB+) I will get n files.
So I need a solution that generates only one or two .avro files even if the source file is large.
Also, is there any way to avoid manual declaration of the column names?
current approach...
spark-shell --packages com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.10:2.0.1
import org.apache.spark.sql.types.{StructType, StructField, StringType}
// Manual schema declaration of the 'co' and 'id' column names and types
val customSchema = StructType(Array(
  StructField("ind", StringType, true),
  StructField("co", StringType, true)))
val df = sqlContext.read.format("com.databricks.spark.csv").option("comment", "\"").option("quote", "|").schema(customSchema).load("/tmp/file.txt")
df.write.format("com.databricks.spark.avro").save("/tmp/avroout")
// Note: /tmp/file.txt is input file/dir, and /tmp/avroout is the output dir
Try specifying the number of partitions of your dataframe while writing the data as avro or any other format. To fix this, use the repartition or coalesce DataFrame functions.
df.coalesce(1).write.format("com.databricks.spark.avro").save("/tmp/avroout")
This way it writes only one file in "/tmp/avroout".
As for avoiding the manual column declaration, spark-csv should be able to infer the schema for you: drop the .schema(customSchema) call and use .option("header", "true") (if the file has a header row) together with .option("inferSchema", "true").
Hope this helps!

Reading and parsing a large .dat file

I am trying to parse a huge .dat file (4 GB). I have tried with R but it just takes too long. Is there a way to parse a .dat file in segments, for example every 30000 lines? Any other solutions would also be welcome.
This is what it looks like:
These are the first two lines with header:
ST|ZIPCODE|GEO_ID|GEO_TTL|FOOTID_GEO|NAICS2012|NAICS2012_TTL|FOOTID_NAICS|YEAR|EMPSZES|EMPSZES_TTL|ESTAB|ESTAB_F <br/>
01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|001|All establishments|167| <br/>
01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|212|Establishments with 1 to 4 employees|91|
Here is an option to read the data faster in R, using the fread function from the data.table package.
EDIT
I removed all <br/> new-line tags. This is the edited dataset
ST|ZIPCODE|GEO_ID|GEO_TTL|FOOTID_GEO|NAICS2012|NAICS2012_TTL|FOOTID_NAICS|YEAR|EMPSZES|EMPSZES_TTL|ESTAB|ESTAB_F
01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|001|All establishments|167|
01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|212|Establishments with 1 to 4 employees|91|
Then I matched variables with classes by reading a small sample of the file first (you should use nrows ~ 100 so the classes are detected reliably):
colclasses = sapply(read.table(edited_data, nrows=1, sep="|", header=T),class)
Then I read the edited data.
your_data <- fread(edited_data, sep="|", sep2=NULL, nrows=-1L, header=T, na.strings="NA",
stringsAsFactors=FALSE, verbose=FALSE, autostart=30L, skip=-1L, select=NULL,
colClasses=colclasses)
Everything worked like a charm. In case you have problems removing the tags, use this simple Python script (it will take some time for sure):
original_file = file_path_to_original_file # e.g. "/Users/User/file.dat"
edited_file = file_path_to_new_file # e.g. "/Users/User/file_edited.dat"
with open(original_file) as inp:
    with open(edited_file, "w") as op:
        for line in inp:
            op.write(line.replace("<br/>", ""))
P.S.
You can use read.table with similar optimizations, but it won't give you nearly as much speed.
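The segment-by-segment part of the question can also be handled outside R. A minimal pandas sketch (the file path and chunk size are placeholders), assuming the '|'-delimited format shown above: `read_csv` with `chunksize` returns an iterator of DataFrames, so only one segment is in memory at a time.

```python
# Stream a large '|'-delimited file in fixed-size row chunks with pandas.
import pandas as pd

def process_in_chunks(path, chunk_rows=30000):
    """Iterate over the file ~chunk_rows lines at a time; returns row count."""
    total_rows = 0
    for chunk in pd.read_csv(path, sep="|", dtype=str,
                             chunksize=chunk_rows):
        # replace this counter with your real per-chunk processing
        total_rows += len(chunk)
    return total_rows
```

`dtype=str` avoids per-chunk type guessing, which mirrors the colClasses trick used with fread above.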

ImageJ - Image to Stack in Batch

I have .tiff files which contain 25 sections of a stack each. Is there a way to use the "Image to Stack" command in batch? Each data set contains 60 tiffs for all three channels of color.
Thanks
Christine
The general way to discover how to do these things is to use the macro recorder, which you can find under Plugins > Macros > Record .... If you then go to File > Import > Image Sequence... and select the first file of the sequence as normal, you should see something like the following appear in the recorder:
run("Image Sequence...", "open=[/home/mark/a/1.tif] number=60 starting=1 increment=1 scale=100 file=[] or=[] sort");
To allow this to work for arbitrary numbers of slices (my example happened to have 60) just leave out the number=60 bit. So, for example, to convert this directory of files to a single file from the command-line you can do:
imagej -eval 'run("Image Sequence...", "open=[/home/mark/a/1.tif] starting=1 increment=1 scale=100 file=[] or=[] sort"); saveAs("Tiff", "/home/mark/stack.tif");' -batch
