Phred qual error after fastq trimming with Cutadapt - mapping

I would like to trim the beginning of all the reads in a fastq file by a given length, before mapping to the genome with bowtie2. I have used Cutadapt:
cutadapt -u 48 -o output.fastq.gz input.fastq.gz
My fastq file after trimming looks like this:
gunzip -c output.fastq.gz | head
@NB502143:99:HFF7TAFX2:1:11101:4133:1019 1:N:0:ATCACG
CATGAAAAAGAGCTCATTTTCAGATGCAGGAATTCCTATCCG
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@NB502143:99:HFF7TAFX2:1:11101:19790:1020 1:N:0:ATCACG
CATGATCCACTTTTCCACGCGCTTTGACGACCATTTTATAA
+
EEEEE<EEEEEEEEEEEEEEEEE<EE/EEAEEEEEEEEEEE
@NB502143:99:HFF7TAFX2:1:11101:6327:1020 1:N:0:ATCACG
CATGATCTCAGTAAAGGCATTTGTGGTTGTTAAGTAGCCATT
When I try to map it with bowtie2, I get the following error message:
Saw ASCII character 10 but expected 33-based Phred qual.
I don't get this error if I map input.fastq.gz, so I suspect something wrong is happening during the trimming but I can't figure out what!
I checked both files with FastQC and they're both Sanger / Illumina 1.9 encoded.
Thanks for your help.

I have been having a similar issue. The error occurs when I use cutadapt, but does not happen when I trim with another tool, fastp.
Checking the integrity of the resulting trimmed fastq files showed that some reads had no bases left after trimming. A tool like fastq_info from the fastq_utils package works well for that check.
If that is the problem, you might need to use the -m <minimum-length> flag when running cutadapt. This discards reads that fall below the designated length after trimming. Alignment should then work.
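A minimal sketch of that workflow, reusing the file names from the question; the minimum length of 20 is an arbitrary choice, and the fastq_info call is just one way to validate the result:
cutadapt -u 48 -m 20 -o output.fastq.gz input.fastq.gz
fastq_info output.fastq.gz
With -m set, reads that would end up shorter than the threshold after the -u 48 trim are dropped instead of being written out as empty records, which is what bowtie2 was choking on.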

Related

Pipe character ignored in SPSS syntax

I am trying to use the pipe character "|" in SPSS syntax with strange results:
In the syntax window it appears like this:
SELECT IF(SEX = 1 | SEX = 2).
But when I copy this line from the syntax window to here, this is what I get:
SELECT IF(SEX = 1 SEX = 2).
The pipe just disappears!
If I run this line, this is the output:
SELECT IF(SEX = 1 SEX = 2).
Error # 4007 in column 20. Text: SEX
The expression is incomplete. Check for missing operands, invalid operators,
unmatched parentheses or excessive string length.
Execution of this command stops.
So the pipe is invisible to the program too!
When I save this syntax and reopen it, the pipe is gone...
The only way I found to get SPSS to work with the pipe was to edit the syntax (adding the pipe) and save it in an alternative editor (Notepad++ in this case). Then, without opening that syntax file in SPSS, I ran it from another syntax file using the INSERT command, and it worked.
EDIT: some background info:
I have spss version 23 (+service pack 3) 64 bit.
The same thing happens whether I use my locale (encoding: windows-1255) or Unicode (encoding: UTF-8). Suspecting my Hebrew keyboard, I tried copying syntax from the web, with the same results.
Can anyone shed any light on this subject?
It turns out (according to SPSS support) that this is a version-specific bug (ver. 21) that was fixed in later versions.

'Missing $ inserted' error message when converting jupyter notebook to pdf with nbconvert

When attempting to convert a jupyter notebook to pdf with the following command:
jupyter nbconvert --to pdf "Search and Other Content Finding Features.ipynb"
I'm getting an error message:
! Missing $ inserted.
<inserted text>
$
l.380 ... Other Content Finding Features_10_0.png}
?
! Emergency stop.
<inserted text>
$
l.380 ... Other Content Finding Features_10_0.png}
I've found some discussion of what that is here.
However, I can't find these characters in my code. Could there be another cause?
For me it was another, though related, issue: underscores. I assume that the cause is that text in cells marked as Raw Text is passed directly to LaTeX, where it can be interpreted as LaTeX code itself. Maybe the underscores in your figure's name?
At some point, I had a raw cell with three underscores ___ which then made the conversion break. The temporary solution was to convert the cell to markdown instead of raw (and not run it) so that it would appear in the pdf.
To find the error, I used the following conversion (taken from this answer):
jupyter nbconvert thenotebook.ipynb --to latex
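The line number in the LaTeX error (l.380 in the output above) should then correspond to a line in the generated .tex file, so you can look at the offending spot directly, for example:
sed -n '375,385p' thenotebook.tex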
Another error, related, was caused by a link containing underscores:
[text](https://en.wikipedia.org/wiki/Python_(programming_language))
This was also in a Raw Text cell, which I converted to markdown to generate the pdf. The formatting (colors, links) is different, though.
Last note: My file's name also contains empty spaces, but that wasn't an issue at all!
A very common gotcha here might be the following:
Leading or trailing spaces are not allowed in the pandoc extension tex_math_dollars, which is used by nbconvert.
This means that this won't work:
$ \epsilon \gt 0 $
And we see the error message:
! Missing $ inserted.
<inserted text>
$
l.364 \$ \epsilon
\gt 0 \$
?
! Emergency stop.
<inserted text>
$
l.364 \$ \epsilon
\gt 0 \$
No pages of output.
Transcript written on notebook.log.
The correct formula without spaces works fine:
$\epsilon \gt 0$
This seems to be a bug in Jupyter nbconvert.
The pandoc documentation suggests that for pandoc this is by design, to allow dollar signs to be used without an escape sequence:
Anything between two $ characters will be treated as TeX math. The
opening $ must have a non-space character immediately to its right,
while the closing $ must have a non-space character immediately to its
left, and must not be followed immediately by a digit. Thus, $20,000
and $30,000 won’t parse as math. If for some reason you need to
enclose text in literal $ characters, backslash-escape them and they
won’t be treated as math delimiters.
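As an illustration of the quoted rule (my own example, not from the question), a markdown cell can mix literal and math dollar signs like this:
We budgeted \$20,000 and \$30,000, and we require $\epsilon \gt 0$.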
The problem in this case seems to have been caused by my notebook's filename. I don't fully understand what caused the problem, but the error message above includes a reference to some text:
... Other Content Finding Features_10_0.png}.
That text includes _ which can cause this error. I think what happens is that somewhere in the conversion script, if there are spaces in the filename, a file is generated with underscores as shown, and that then triggers the error. (This seems a little bit like a bug to me, or at least a weakness).
The fix that worked for me was simply to change the jupyter notebook's filename not to include any spaces. Then the conversion ran without a hitch.
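A sketch of that fix; the new space-free name here is just an arbitrary choice:
mv "Search and Other Content Finding Features.ipynb" SearchAndOtherContentFindingFeatures.ipynb
jupyter nbconvert --to pdf SearchAndOtherContentFindingFeatures.ipynb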
For me it was caused by a significant difference between LaTeX and MathJax. For example, the cases environment can be rendered outside math mode by MathJax, which is the default renderer in Jupyter notebooks, but in LaTeX it causes an error stating "Missing $ inserted". The error message disappeared after correcting the syntax in the Markdown cells.

Sublime Text: Not representable characters

I'm using Sublime Text for LaTeX, so I need to use a specific encoding. However, in some cases, when I paste text copied from a different program (Word or a browser in most cases), I get the message:
"Not all characters are representable in XXX encoding, falling back to UTF-8"
My question is: Is there any way to see which parts of the text cannot be encoded, so I can delete them manually?
I had this problem. It is caused by corrupt characters in your document. Here is how I solved it.
1) Search your document for everything that is not a standard character. Make sure you enable regular expressions in your search, then paste this:
[^a-zA-Z0-9 -\.;<>/ ={}\[\]\^\?_\\\|:\r\n#]
You can add the normal accented characters of your language to that set; here is a version with the characters for French and German (é, à and so on):
[^a-zA-Z0-9 -\.;<>/ ='{}\[\]\^\?_\\\|:\r\n~#éàèêîôâûçäöüÄÖÜß]
2) Search for that, and keep pressing F3 until you see mangled characters, usually something like "è", which is a corrupt version of "à".
3) Delete those characters or replace them with what they should be.
You will be able to convert the document to another encoding when you have cleared all corrupt characters out.
For Linux users, it's also possible to automatically remove broken characters with the iconv command:
iconv -f UTF-8 -t Windows-1251 -c < ~/temp/data.csv > ~/temp/data01.csv
-c Silently discard characters that cannot be converted instead of terminating when encountering such characters.
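If you would rather locate the offending characters than discard them, running iconv without -c is a quick check, since it stops with an error as soon as it hits a character it cannot convert (the file name and encodings here are just placeholders):
iconv -f UTF-8 -t ISO-8859-1 main.tex > /dev/null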
Just adding to @Draken's response: here is the RegEx with Spanish characters added.
[^a-zA-Z0-9 -\.;<>/ =“”'{}\[\]\^\?_\\\|:\r\n~#àèêîôâûçäöüÄÖÜßáéíóúñÑ¿€]
In my case I hit Ctrl+H (for replacement) and used nothing as the replacement expression. So everything got cleared super fast and I was able to save it using ISO-8859-1.

How to diagnose errors in LaTeX generated by Doxygen 1.8.x: LT#LL#FM#cr

I've been using Doxygen successfully to generate PDF documentation for a sizable Fortran 90 project since v1.6. After a recent upgrade to Doxygen 1.8, pdflatex is choking with an error I can't understand. From refman.log:
.
.
.
<use classfate__source_a022bf629bdc1d3059ebd5fb86d13b4f4_icgraph.pdf>
Package pdftex.def Info: classfate__source_a022bf629bdc1d3059ebd5fb86d13b4f4_ic
graph.pdf used on input line 607.
(pdftex.def) Requested size: 350.0pt x 65.42921pt.
)
(./classm__aerosol.tex
! Undefined control sequence.
<recently read> \LT#LL#FM#cr
l.25 ...1833ffa6f2fae54ededb}{ia\-\_\-nsize}), \\*
? ?
Type <return> to proceed, S to scroll future error messages,
R to run without stopping, Q to run quietly,
I to insert something, E to edit your file,
1 or ... or 9 to ignore the next 1 to 9 tokens of input,
H for help, X to quit.
Looking at the first 25 lines of classm__aerosol.tex, nothing obviously matches the error message:
\hypertarget{classm__aerosol}{\section{m\-\_\-aerosol Module Reference}
\label{classm__aerosol}\index{m\-\_\-aerosol#{m\-\_\-aerosol}}
}
Contains general aerosol-\/related constants and routines.
\subsection*{Public Member Functions}
\begin{DoxyCompactItemize}
\item
subroutine \hyperlink{classm__aerosol_aa06c1f39c6bd34f22be92d21535f0320}{aerdis} (I\-A\-E\-R\-O, M\-A\-E\-R\-O, V\-O\-L, A\-R\-E\-A, M\-U, T\-G\-A\-S, R\-H\-O, A\-G\-A\-M\-M\-A, X\-L\-A\-E\-R, D\-M\-E\-A\-N, N\-A\-E\-R, X\-N\-D\-A\-E\-R, L\-S\-D\-A\-E\-R)
\begin{DoxyCompactList}\small\item\em Return aerosol mass given a volume, based on aerosol size distribution function. \end{DoxyCompactList}\item
real(kind=wp) function \hyperlink{classm__aerosol_a2dff4ff413057e8788fba7270a30c093}{lamsed} (V\-O\-L, H, M\-U\-G, R\-H\-O\-A\-E\-R, A\-G\-A\-M\-M\-A, A\-C\-H\-I, A\-F\-E\-O, K\-O, M\-A\-E\-R, F\-M\-A\-E\-R, F\-A\-E\-R\-S\-S, F\-S\-E\-D\-D\-K)
\begin{DoxyCompactList}\small\item\em Calculate aerosol removal constant and interpolation factor between steady-\/state and decaying aerosol correlations. \end{DoxyCompactList}\item
pure real(kind=wp) function \hyperlink{classm__aerosol_a6d0a04004f49c404c67e0aa69dd39ee1}{fdbend} (V\-E\-L, H\-S\-E\-D, T\-G, R\-H\-O\-G, M\-U\-G, R\-H\-O\-P\-A\-R, C\-A\-E\-R\-O, X\-D\-B\-E\-N\-D, N90\-J)
\begin{DoxyCompactList}\small\item\em Find total impaction efficiency for aerosol deposition considering 90-\/degree bends in a flow path. \end{DoxyCompactList}\end{DoxyCompactItemize}
\subsection*{Public Attributes}
\begin{DoxyCompactItemize}
\item
integer, parameter \hyperlink{classm__aerosol_a8f604b7ffe3c1833ffa6f2fae54ededb}{ia\-\_\-nsize} = 30
\item
integer, parameter \hyperlink{classm__aerosol_ae71813ecf0c7768af9d6292efb14774f}{ia\-\_\-nmass} = 10
\item
real(kind=wp), dimension(\hyperlink{classm__aerosol_a8f604b7ffe3c1833ffa6f2fae54ededb}{ia\-\_\-nsize}), \\*
Nothing obviously matches the recently read chunk "\LT#LL#FM#cr" and I don't know enough low-level TeX to translate that into something that might actually be in the source text.
Suspecting this might have been fixed in a later version of Doxygen than the one shipping with Linux Mint (v1.8.1.2), I built & installed v1.8.3.1 from source, updated my doxyfile, blew away the old documentation and regenerated it. I get the same baffling error.
There's nothing obvious in refman.log that would indicate missing or broken LaTeX packages and I'm completely at a loss as to what's causing this.
As this still gets a hit on Google when you search:
doxygen missing $ inserted
I would like to add something.
Do not use a PROJECT_NAME containing underscores (_)!
After a brief look into Doxygen's current documentation (I am using 1.8.4), it does not make that explicit.
This will be difficult to solve unless you provide a bit more information, possibly using \errorcontextlines=9999 as suggested in the comments on the question.
As a first shot, though, the name of the control sequence that can't be found (i.e. \LT#LL#FM#cr) is one defined by the longtable package (documentation, p. 15), so adding:
\usepackage{longtable}
to the preamble of the document might help.
If so, according to the doxygen documentation here, adding the following to your configuration file should do the trick:
EXTRA_PACKAGES=longtable
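After adding that to your configuration file, regenerating the documentation and rebuilding the PDF would look roughly like this (assuming the default latex output directory and a config file named Doxyfile):
doxygen Doxyfile
cd latex && make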

Correct word-count of a LaTeX document

I'm currently searching for an application or a script that does a correct word count for a LaTeX document.
Up till now, I have only encountered scripts that work on a single file, but what I want is a script that can safely ignore LaTeX keywords and also traverse linked files, i.e. follow \include and \input links, to produce a correct word count for the whole document.
With vim, I currently use ggVG followed by g CTRL+G, but obviously that shows the count for the current file only and does not ignore LaTeX keywords.
Does anyone know of any script (or application) that can do this job?
I use texcount. The webpage has a Perl script to download (and a manual).
It will include tex files that are included (\input or \include) in the document (see -inc), supports macros, and has many other nice features.
When following included files you will get details about each separate file as well as a total. For example, here is the total output for a 12-page document of mine:
TOTAL COUNT
Files: 20
Words in text: 4188
Words in headers: 26
Words in float captions: 404
Number of headers: 12
Number of floats: 7
Number of math inlines: 85
Number of math displayed: 19
If you're only interested in the total, use the -total argument.
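For a multi-file document, a typical invocation combining the options mentioned above would be (main.tex stands in for your root file):
texcount -inc -total main.tex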
I went with icio's comment and did a word-count on the pdf itself by piping the output of pdftotext to wc:
pdftotext file.pdf - | wc -w
latex file.tex
dvips -o - file.dvi | ps2ascii | wc -w
should give you a fairly accurate word count.
To add to @aioobe's answer: if you use pdflatex, just do
pdftops file.pdf
ps2ascii file.ps | wc -w
I compared this count to the count in Microsoft Word for a 1599-word document (according to Word). pdftotext produced a text with 1700+ words. texcount did not include the references and produced 1088 words. ps2ascii returned 1603 words, 4 more than in Word.
I say that's a pretty good count. I am not sure where the 4-word difference comes from, though. :)
In the Texmaker interface you can get the word count by right-clicking in the PDF preview.
Overleaf also has a word count feature, available in both Overleaf v1 and v2.
I use the following VIM script:
function! WC()
  let filename = expand("%")
  let cmd = "detex " . filename . " | wc -w | perl -pe 'chomp; s/ +//;'"
  let result = system(cmd)
  echo result . " words"
endfunction
… but it doesn’t follow links. This would basically entail parsing the TeX file to get all linked files, wouldn’t it?
The advantage over the other answers is that it doesn’t have to produce an output file (PDF or PS) to compute the word count so it’s potentially (depending on usage) much more efficient.
Although icio’s comment is theoretically correct, I found that the above method gives quite accurate estimates for the number of words. For most texts, it’s well within the 5% margin that is used in many assignments.
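To use the function above, source it (for example from your vimrc) and run it from the buffer of the file you want counted; it assumes detex and perl are available on your PATH:
:call WC()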
If the use of a vim plugin suits you, the vimtex plugin has integrated the texcount tool quite nicely.
Here is an excerpt from their documentation:
:VimtexCountLetters Shows the number of letters/characters or words in
:VimtexCountWords the current project or in the selected region. The
count is created with `texcount` through a call on
the main project file similar to: >
texcount -nosub -sum [-letter] -merge -q -1 FILE
<
Note: Default arguments may be controlled with
|g:vimtex_texcount_custom_arg|.
Note: One may access the information through the
function `vimtex#misc#wordcount(opts)`, where
`opts` is a dictionary with the following
keys (defaults indicated): >
'range' : [1, line('$')]
'count_letters' : 0/1
'detailed' : 0
<
If `detailed` is 0, then it only returns the
total count. This makes it possible to use for
e.g. statusline functions. If the `opts` dict
is not passed, then the defaults are assumed.
*VimtexCountLetters!*
*VimtexCountWords!*
:VimtexCountLetters! Similar to |VimtexCountLetters|/|VimtexCountWords|, but
:VimtexCountWords! show separate reports for included files. I.e.
presents the result of: >
texcount -nosub -sum [-letter] -inc FILE
<
The nice part about this is how extensible it is. On top of counting the number of words in your current file, you can make a visual selection (say two or three paragraphs) and then only apply the command to your selection.
For a very basic article class document I just look at the number of matches for a regex to find words. I use Sublime Text, so this method may not work for you in a different editor, but I just hit Ctrl+F (Command+F on Mac) and then, with regex enabled, search for
(^|\s+|"|((h|f|te){)|\()\w+
which should ignore text declaring a floating environment or captions on figures as well as most kinds of basic equations and \usepackage declarations, while including quotations and parentheticals. It also counts footnotes and \emphasized text and will count \hyperref links as one word. It's not perfect, but it's typically accurate to within a few dozen words or so. You could refine it to work for you, but a script is probably a better solution, since LaTeX source code isn't a regular language. Just thought I'd throw this up here.
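Roughly the same count can be reproduced on the command line with GNU grep's PCRE mode (main.tex is a placeholder, and this is just as approximate as the editor search):
grep -oP '(^|\s+|"|((h|f|te)\{)|\()\w+' main.tex | wc -l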
