Ruby/Rails: Traverse folders and parse metadata to seed DB

I have a bunch of documents that I'd like to index in a Rails application. I'd like to use a rake task of sorts to comb a directory hierarchy looking for files and capturing the metadata from those files to index in Rails.
I'm not really sure how to do this in Ruby. I have found a utility called pdftk which can extract the metadata from PDF files (much of what I'm indexing is PDFs), but I'm not sure how to capture the individual pieces of that data.
For example, I'd like to grab the ModDate, or each BookmarkTitle and BookmarkPageNumber, from the output below.
Specifically I want to traverse a file hierarchy, execute the pdftk $filename dump_data command for each .pdf I find and then capture the important parts of that output into a rails model(s).
Output from pdftk:
$ pdftk BoringDocument883c2.pdf dump_data
InfoKey: Creator
InfoValue: Adobe Acrobat 9.3.4
InfoKey: Producer
InfoValue: Adobe Acrobat 9.34 Paper Capture Plug-in
InfoKey: ModDate
InfoValue: D:20110312194536-04'00'
InfoKey: CreationDate
InfoValue: D:20110214174733-05'00'
PdfID0: 2f28dcb8474c6849ae8628bc4157df43
PdfID1: 3e13c82c73a9f44bad90eeed137e7a1a
NumberOfPages: 126
BookmarkTitle: Alternative Maintenance Techniques
BookmarkLevel: 1
BookmarkPageNumber: 3
BookmarkTitle: CONTENTS
BookmarkLevel: 1
BookmarkPageNumber: 4
BookmarkTitle: EXHIBITS
BookmarkLevel: 1
BookmarkPageNumber: 6
BookmarkTitle: I - INTRODUCTION
BookmarkLevel: 1
BookmarkPageNumber: 8
BookmarkLevel: 1
BookmarkPageNumber: 13
BookmarkLevel: 1
BookmarkPageNumber: 30
BookmarkLevel: 1
BookmarkPageNumber: 55
BookmarkLevel: 1
BookmarkPageNumber: 66
BookmarkLevel: 1
BookmarkPageNumber: 77
...shortened for brevity...
PageLabelNewIndex: 1
PageLabelStart: 1
PageLabelPrefix: F-E12_0001.jpg
PageLabelNumStyle: NoNumber
PageLabelNewIndex: 2
PageLabelStart: 1
PageLabelPrefix: F-E12_0002.jpg
PageLabelNumStyle: NoNumber
PageLabelNewIndex: 3
PageLabelStart: 1
PageLabelPrefix: F-E12_0003.jpg
PageLabelNumStyle: NoNumber
Edit: I've recently found the pdf-reader gem, which looks promising and may remove the need to shell out to pdftk at all.

First off, let me say that my knowledge of Rake isn't that good, so there might be some mistakes. Let me know if something doesn't work and I would be happy to try and fix the problem.
To solve this, I am going to use 2 rake tasks. One of the rake tasks will be a recursive directory traversal task, and the other will be a task which kicks off the recursion.
require "pdf-reader"

desc "Populate the database with PDF metadata from the default PDF path"
task :populate_all_pdf_metadata do
  pdf_path = "/path/to/pdfs"
  Rake::Task[:populate_pdf_metadata].invoke(pdf_path)
end

desc "Recursively traverse a path looking for PDF metadata"
task :populate_pdf_metadata, :pdf_path do |t, args|
  excluded_dir_names = [".", ".."] # Do not look in dirs with these names.
  pdf_path = args[:pdf_path]
  Dir.entries(pdf_path).each do |file|
    full_path = File.join(pdf_path, file)
    if File.directory?(full_path) && !excluded_dir_names.include?(file)
      # Rake runs a task only once unless it is re-enabled first.
      t.reenable
      t.invoke(full_path)
    elsif File.extname(file) == ".pdf"
      reader = PDF::Reader.new(full_path) # from the pdf-reader gem
      # Populate the database here, e.g. from reader.info
    end
  end
end
I believe the code above is similar to what you want to do. In order to access the database you will need to add the :environment dependency to your tasks. You can search Google for how to access ActiveRecord models from a rake task. I hope this helps.
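If you do end up shelling out to pdftk, here is a minimal sketch of the two remaining pieces: finding the PDFs and splitting the dump_data text into fields. The helper names find_pdfs and parse_pdftk_dump are made up for illustration, and the parser assumes the "Key: value" layout shown in the question, with each bookmark ending at its BookmarkPageNumber line.

```ruby
# Hypothetical helpers; the names are illustrative, not part of any gem.

# Recursively collect every .pdf file below root.
def find_pdfs(root)
  Dir.glob(File.join(root, "**", "*.pdf")).sort
end

# Turn the text printed by `pdftk FILE dump_data` into a Hash.
def parse_pdftk_dump(text)
  info = {}
  bookmarks = []
  pending_key = nil
  current = {}
  pages = nil

  text.each_line do |line|
    key, value = line.chomp.split(": ", 2)
    next if value.nil? # skip lines that are not "Key: value"

    case key
    when "InfoKey"
      pending_key = value
    when "InfoValue"
      info[pending_key] = value if pending_key
      pending_key = nil
    when "NumberOfPages"
      pages = value.to_i
    when "BookmarkTitle"
      current[:title] = value
    when "BookmarkLevel"
      current[:level] = value.to_i
    when "BookmarkPageNumber"
      # Assumption: the page number is the last field of one bookmark.
      current[:page] = value.to_i
      bookmarks << current
      current = {}
    end
  end

  { info: info, pages: pages, bookmarks: bookmarks }
end
```

The text itself could come from something like Open3.capture2("pdftk", path, "dump_data") in the standard library, and the resulting hash (result[:info]["ModDate"], result[:bookmarks]) can then be mapped onto whatever models you define.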


Biopython: Extract CDS from modified GenBank records?

I have some basic familiarity with Python and have been extracting coding sequences from GenBank records. However, I'm unsure how to handle records where the coding sequence has been modified (e.g. owing to correcting internal stop codons). An example of such a sequence is this GenBank record (or accession: XM_021385495.1 if the link does not work).
In this example, I can translate the two coding sequences that I can access, but both have internal stop codons - and according to the notes also indels! This is the way I have accessed the CDS:
1 - gb_record.seq
2 - cds.location.extract(gb_record) for each feature where feature.type == "CDS"
However, I need the sequence that has been corrected. As far as I can tell, I think I need to use the "transl_except" tags in the CDS feature but I am at a loss how to do this.
I wonder if anybody might be able to provide an example or some insight of how to do this?
I've got some demo code written in python3 that should help explain this GenBank record.
import re
aa_convert_codon_di = {
    # (the entries for the other amino acid codes are not shown here)
    'J':['[ARWM][TYWK][^G]', '[CYSM][TYWK].', '[TYWK][TYWK][AGRSKWM]'],
}
dna_convert_aa_di = {
'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
'TAC':'Y', 'TAT':'Y', 'TAA':'*', 'TAG':'*',
'TGC':'C', 'TGT':'C', 'TGA':'*', 'TGG':'W'}
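# dna_str (the CDS nucleotides) and aa_str (its annotated translation)
# are assumed to have been loaded from the record beforehand.
mod_dna_str = ""
mod_aa_str = aa_str[:]
start = 0
for index in range(start, len(dna_str), 3):
    codon = dna_str[index:index+3]
    if len(mod_aa_str) == 0:
        break
    # The codon translates to the expected amino acid: move on.
    if codon in dna_convert_aa_di and dna_convert_aa_di[codon] == mod_aa_str[0]:
        mod_aa_str = mod_aa_str[1:]
        continue
    # Otherwise report codons that only match via the ambiguity patterns.
    codon_match = "|".join(aa_convert_codon_di[mod_aa_str[0]])
    if len(re.findall(codon_match, codon)) > 0:
        print(index, codon_match, codon)
        mod_aa_str = mod_aa_str[1:]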
Code output:
804 ... TAG
930 ... TGA
1056 ... TAG
1065 ... TAA
1209 ... TAG
1389 ... NTT
1518 ... TGA
1566 ... TAA
1800 ... NNC
2019 ... TAA
2529 ... TGA
2622 ... NAG
2985 ... NGA
3087 ... TAG
3186 ... TGA
From the note section of the CDS, we have: "inserted 5 bases in 4 codons; deleted 2 bases in 2 codons; substituted 11 bases at 11 genomic stop codons".
How does this relate to our output? The reading frame never changes, suggesting that the 2 deleted bases are absent from the given nucleotide sequence. Five unknown nucleotides (N) exist in 4 codons (unknown amino acid, X). The authors of the sequence have accounted for indels. Eleven premature stop codons are present, which are simply translated as unknown amino acids. The "transl_except" tags match the locations of the premature stop codons. The nucleotides at these sites have not been altered. The authors provide XP_021241170 as a possible corrected translation product, but it's still very bad.

Read data from csv file with foreach function

I have been reading data from a CSV file. For large files, to avoid the Rack timeout (12 seconds), I read only 25 rows per request: after 25 rows the action returns and the client makes another request, and this continues until all the rows have been read.
def read_csv(offset)
  r_count = 1
  CSV.foreach(file.tempfile, options) do |row|
    if r_count > offset.to_i
      # process the row
    end
    r_count += 1
  end
end
But this creates a new issue. Say the first request reads 25 rows; when the next request comes in with an offset of 25, it still reads through the first 25 rows before it starts processing from row 26. How can I skip the rows that have already been read? I tried an if with next to skip the iteration, but that fails. Or is there another, more efficient way to do this?
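def read_csv(fileName)
  # wc -l counts newlines; add 1 in case the last line has no trailing newline
  lines = (`wc -l #{fileName}`).to_i + 1
  lines_processed = 0
  File.open(fileName) do |csv|
    csv.each_line do |line|
      # process the line here
      lines_processed += 1
    end
  end
end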
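Pure Ruby - SLOWER
def read_csv(fileName)
  # Count lines without shelling out and without loading the whole file
  lines = File.foreach(fileName).count
  lines_processed = 0
  File.open(fileName) do |csv|
    csv.each_line do |line|
      # process the line here
      lines_processed += 1
    end
  end
end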
I ran a new benchmark comparing the original method you provided with my own. I also included the test file information.
"File Information"
Lines: 1172319
Size: 126M
"django's original method"
Time: 18.58 secs
Memory: 0.45 MB
"OneNeptune's method"
Time: 0.58 secs
Memory: 2.18 MB
"Pure Ruby method"
Time: 0.96 secs
Memory: 2.06 MB
NOTE: I added a pure Ruby method, since using wc is sort of cheating and not portable. In most cases it's important to use pure language solutions.
You can use this method to process a very large CSV file.
I feel ~2 MB of memory is pretty reasonable considering the file size. It's a bit of an increase in memory usage, but the time savings seem to be a fair trade, and this will prevent timeouts.
I did modify the method to take a fileName, but this was just because I was testing many different CSV files to make sure they all worked correctly. You can remove this if you'd like, but it'll likely be helpful.
I also removed the concept of an offset, since you stated you originally included it to try to optimize the parsing yourself, but this is no longer necessary.
Also, I keep track of how many lines are in the file and how many were processed, since you needed that information. Note that the wc-based line count only works on Unix-based systems; it's a trick to avoid loading the entire file into memory. It counts the newlines, and I add 1 to account for the last line. If you're not going to count the header as a line, though, you could remove the +1 and rename lines to "rows" to be more accurate.
Another logistical problem you may run into is how to handle a CSV file that has headers.
You could use lazy reading to speed this up: the whole file wouldn't be read, only the part from the beginning of the file up to the end of the chunk you use.
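As a rough sketch of the lazy approach, assuming a file with a header row (read_chunk is a hypothetical helper name, not something from the question): CSV.foreach called without a block returns an enumerator, so it can be made lazy and cut off after the chunk you need. Note that the skipped rows are still parsed, just not processed, and nothing after the chunk is read.

```ruby
require "csv"

# Return up to `limit` rows as hashes, skipping the first `offset` data rows.
# Parsing stops as soon as the chunk is complete.
def read_chunk(path, offset, limit = 25)
  CSV.foreach(path, headers: true)
     .lazy
     .drop(offset)
     .first(limit)
     .map(&:to_h)
end
```

Each request would then call something like read_chunk(file.tempfile.path, offset.to_i) and report back the next offset.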
You could also use SmarterCSV to work in chunks like this.
SmarterCSV.process(file_path, {:chunk_size => 1000}) do |chunk|
  chunk.each do |row|
    # Do your processing
  end
end
The way I did this was by streaming the result to the user; if you can see what is happening, having to wait doesn't bother you that much. The timeout you mention won't happen here.
I'm not a Rails user, so I give an example from Sinatra, but this can be done with Rails as well.
require 'sinatra'

get '/' do
  stream :keep_open do |out|
    1.upto(100) do |line| # this would be your CSV file opened
      out << "processing line #{line}<br>"
      # process line
      sleep 1 # for simulating the delay
    end
  end
end
A still better but somewhat more complicated solution would be to use WebSockets: the browser would receive the results from the server once the processing is finished. You will also need some JavaScript in the client to handle this.

Neo4j batch importer NotFoundException

I'm consistently running into a NotFoundException when using the batch importer to read large nodes and relationship files. I've used the importer successfully before with an even larger dataset, but I've rewritten the way I generate the two files, and I'm trying to figure out why it now throws an error.
The problem
It seems to read the nodes file and then throws an error near the start of the rels file, stating that it cannot find a node. I believe this is because it hasn't really imported all the nodes. It reports importing only half of the nodes in nodes.tsv (2.1m of 4.6m total).
Things I've checked:
The node numbers in nodes.tsv are sequential and continuous (0 to ~4.5m)
The node that throws the exception appears in both files (including as both source and target in rels.tsv)
I can successfully import a smaller subset of my data (~80k nodes) using the same tsv generator script
Even though the relationships are not sorted on target (only on source), the smaller subset does not throw this exception
The insert command:
./ wiki.db nodes.tsv rels.tsv
Error message
Using Existing Configuration File
Importing 2129648 Nodes took 6400 seconds
Total import time: 6404 seconds
Exception in thread "main" org.neo4j.graphdb.NotFoundException: id=3608148
at org.neo4j.unsafe.batchinsert.BatchInserterImpl.getNodeRecord(BatchInserterImpl
at org.neo4j.unsafe.batchinsert.BatchInserterImpl.createRelationship(BatchInserte
at org.neo4j.batchimport.Importer.importRelationships(
at org.neo4j.batchimport.Importer.doImport(
at org.neo4j.batchimport.Importer.main(
The files
nodes.tsv (4578730 lines)
node name l:label degrees
0 Stroud_railway_station Page 21
1 ATP–ADP_translocase Page 38
2 Pedro_Hernández_Martínez Page 12
3 Christopher_Lowther Page 4
4 Cloncurry_River Page 10
5 Neil_Kinnock Page 147
6 Free_agent_(business) Page 10
7 Christian_Hilt Page 27
8 2009_Riviera_di_Rimini_Challenger Page 27
rels.tsv (113322480 lines)
start end type
0 3608148 LINKS_TO
0 870126 LINKS_TO
0 1516248 LINKS_TO
0 3493391 LINKS_TO
0 3034096 LINKS_TO
0 1421544 LINKS_TO
0 2808745 LINKS_TO
0 1872783 LINKS_TO
0 1673612 LINKS_TO
Hmm, this looks like a problem with your input file. Did you try running CSVKit or something similar on it?
Perhaps you can narrow down the issue by bisecting nodes.tsv and finding the offending line?
Also try to use the opencsv parser by enabling quotes in your
Or flip it to false. Perhaps you have stray single or double quotes in your text? If so, please quote that text.
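To hunt for the offending line without bisecting by hand, a small script along these lines might help. It is only a sketch: suspicious_lines is a made-up helper, and it assumes a tab-separated file whose first line is the header. It flags lines whose field count differs from the header's and lines containing a quote character.

```ruby
# Report [line_number, reason] pairs for lines that look malformed.
def suspicious_lines(path)
  expected = nil
  problems = []
  File.foreach(path).with_index(1) do |line, number|
    fields = line.chomp.split("\t", -1)
    expected ||= fields.size # the header defines the expected width
    reasons = []
    if fields.size != expected
      reasons << "expected #{expected} fields, found #{fields.size}"
    end
    reasons << "contains a quote character" if line.match?(/["']/)
    problems << [number, reasons.join("; ")] unless reasons.empty?
  end
  problems
end
```

Running suspicious_lines("nodes.tsv").first(20) and looking at the reported line numbers should show quickly whether the importer stops near the first flagged line.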

Simplest way to sort html table content

Given content from what is the easiest way to represent content of this table sorted by second column? What tools are best for this kind of job?
Since the content looked simple, I tried to hack something together with tr, sed and awk (mainly to learn the tools), but it turned out to be too complex to get all the rows right. The output could look like this:
47 strict
54 Win32
55 transformers-base
57 enumerator
68 system-filepath
69 xml
or any other format, as long as it does not make further processing too complex.
I like Perl, and just for learning I did the job using the Web::Scraper module. It uses CSS selectors to extract both columns of the table and sorts the rows by the second one, which indicates the number of dependencies for each package:
The file:
#!/usr/bin/env perl
use strict;
use warnings;
use Web::Scraper;
use URI;

die qq|Usage: perl $0 <url>\n| unless @ARGV == 1;

# Extract the package name and dependency count from every table row
my $packages_deps = scraper {
    process 'tr', 'package_deps[]' => scraper {
        process 'td:first-child > a', 'package_name' => 'TEXT';
        process 'td:nth-child(2)', 'tot_deps' => 'TEXT';
    };
    result 'package_deps';
};

my $response = $packages_deps->scrape( URI->new( shift ) );
# Skip the header row (index 0) and sort numerically by dependency count
for ( sort { $a->{tot_deps} <=> $b->{tot_deps} } @$response[1..$#$response] ) {
    printf qq|%d %s\n|, $_->{tot_deps}, $_->{package_name};
}
Run it providing the url:
perl ""
And it yields (showing only the beginning and the end of the list):
1 abstract-par-accelerate
1 accelerate-fft
1 acme-year
1 action-permutations
1 active
1 activehs-base
766 text
794 filepath
796 transformers
915 directory
1467 mtl
1741 bytestring
1857 containers
5287 base
Javascript includes a native sorting function, so Javascript is a natural choice.
There is a simple script here which you can either use or examine and learn from:

Make Sphinx quiet (non-verbose)

I'm using Sphinx through Thinking Sphinx in a Ruby on Rails project. Whenever I create seed data, and in fact pretty much all the time, it's quite verbose, printing this:
using config file '/Users/pupeno/projectx/config/development.sphinx.conf'...
indexing index 'user_delta'...
collected 7 docs, 0.0 MB
collected 0 attr values
sorted 0.0 Mvalues, 100.0% done
sorted 0.0 Mhits, 99.6% done
total 7 docs, 159 bytes
total 0.042 sec, 3749.29 bytes/sec, 165.06 docs/sec
Sphinx (r1533)
Copyright (c) 2001-2008, Andrew Aksyonoff
for every record that is created or so. Is there a way to suppress that output?
There is actually a setting to stop this - you'll want to set it at the end of your environment.rb file:
ThinkingSphinx.suppress_delta_output = true
In Thinking Sphinx v3 and newer, this has changed, and this setting is now managed through config/thinking_sphinx.yml. Repeat for each appropriate environment:
quiet_deltas: true
Run Sphinx with the --quiet flag. I'm not using TS, though, so I don't know how to make TS use this flag. HTH.
