Let's say I have three columns, each with over 1,000 entries:
A B C
1 2 6
5 3 7
7 4 8
Now I reorder the elements in column A as
5
1
7
...
How can I sort columns B and C so that I have
5 3 7
1 2 6
7 4 8
...
Excel has the "custom list" sort feature that can do exactly what I want. All I need to do is enter column 1 as "5, 1, 7, ..." into the "custom list". However, it doesn't work if my column 1 has 1,000+ entries (I cannot paste the list there). I am looking for a solution with awk or grep.
If you're open to a Perl solution:
($in, $list) = @ARGV;
# Index every input line by its first field.
open IN, "< $in" or die;
while ($line = <IN>) {
    @F = split /\s+/, $line;
    if (defined $h{$F[0]}) {
        die "ERROR: multiple input lines have first column $F[0]\n";
    }
    $h{$F[0]} = $line;
}
# Print the stored lines in the order given by the list file.
open LIST, "< $list" or die;
while ($line = <LIST>) {
    @F = split /\s+/, $line;
    if (defined $h{$F[0]}) {
        print $h{$F[0]};
    } else {
        die "ERROR: no match found for $F[0]\n";
    }
}
Save this file as "script"
Save your input data as "input"
Save your custom list as "list"
Run: perl script input list
How it works:
Iterate through the input file which contains columns of data, separated by whitespace. If your input file is comma-separated, change /\s+/ to /,/
Split line into fields array @F
Store line into hash h, keyed based on first field $F[0]
Iterate through the list file
Split line into fields (this handles trailing whitespace)
Print contents of hash for that key
Sanity checking is also done
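The steps above can also be sketched in Python. This is a minimal illustration, not a port of the Perl script: the function name reorder_rows is made up for this example, it works on lists of strings instead of files, and it skips blank lines, which the Perl version does not.

```python
def reorder_rows(rows, order):
    """Return rows rearranged so that their first fields follow `order`."""
    by_key = {}
    for row in rows:
        fields = row.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        key = fields[0]
        if key in by_key:
            raise ValueError(f"multiple input lines have first column {key}")
        by_key[key] = row

    result = []
    for entry in order:
        fields = entry.split()  # split() also drops trailing whitespace
        if not fields:
            continue
        key = fields[0]
        if key not in by_key:
            raise ValueError(f"no match found for {key}")
        result.append(by_key[key])
    return result


rows = ["1 2 6", "5 3 7", "7 4 8"]
order = ["5", "1", "7"]
print("\n".join(reorder_rows(rows, order)))
# prints:
# 5 3 7
# 1 2 6
# 7 4 8
```

To use it on files, read each file into a list of lines first, for example with open(path).read().splitlines().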
I have a list of files, and each file has 3 lines, each containing a search term, that I need to extract.
basically each file can be looked at like this:
random
random
random
random
LINE 1 TEXT RANDOM TEXT
random
LINE 2 TEXT RANDOM TEXT
random
random
LINE 3 TEXT RANDOM TEXT
and what I'm looking to get is a text file containing this (without the FILE * PART):
FILE1 - LINE 1 TEXT RANDOM TEXT | LINE 2 TEXT RANDOM TEXT | LINE 3 TEXT RANDOM TEXT
FILE2 - LINE 1 TEXT RANDOM TEXT | LINE 2 TEXT RANDOM TEXT | LINE 3 TEXT RANDOM TEXT
FILE3 - LINE 1 TEXT RANDOM TEXT | LINE 2 TEXT RANDOM TEXT | LINE 3 TEXT RANDOM TEXT
FILE4 - LINE 1 TEXT RANDOM TEXT | LINE 2 TEXT RANDOM TEXT | LINE 3 TEXT RANDOM TEXT
TEXT RANDOM TEXT is the arbitrary text that I'm looking to find. Any help would be appreciated. I tried PowerGREP, but it doesn't have an option to retrieve only unique records from each file
(meaning, only 1 match per search term, I get
LINE 1
LINE 2
LINE 2
LINE 3
)
With PowerGREP I searched for the terms, but instead of 3 unique lines per file I got 3 lines for some files and 4, 5, or 6 for others, because sometimes several lines contain one of the search terms.
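One way to keep only the first hit per search term is a small script. The following Python sketch is an illustration under assumptions: the function names first_matches and summarize are invented here, the search terms are the literal strings "LINE 1", "LINE 2" and "LINE 3" from the example, and matching is a plain substring test.

```python
def first_matches(lines, terms):
    """Return the first line containing each term, in the order of `terms`."""
    found = {}
    for line in lines:
        for term in terms:
            # Later lines with an already-seen term are ignored.
            if term not in found and term in line:
                found[term] = line.strip()
    return [found[term] for term in terms if term in found]


def summarize(lines, terms):
    """Join the first match of every term into one output line."""
    return " | ".join(first_matches(lines, terms))


sample = [
    "random",
    "LINE 1 first",
    "random",
    "LINE 2 second",
    "LINE 2 duplicate",
    "LINE 3 third",
]
print(summarize(sample, ["LINE 1", "LINE 2", "LINE 3"]))
# prints: LINE 1 first | LINE 2 second | LINE 3 third
```

Calling summarize once per file and writing each result on its own line gives the one-line-per-file layout described above.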
I have a non-delimited text file and want to parse it to add tabs at specific spots to delimit columns. The columns are sometimes empty or vary in length, which is why I need to add tabs to those specific spots. I had found the answer to this once a couple of years ago on the net using batch, but now can't find it or the code. I already have the following code to replace more than 2 spaces in the file, but this doesn't account for when the columns are empty.
gc $FileToOpen | % { $_ -replace ' +',"`t" } | set-content $FileToSave
So, I need to read each line, but be able to only read a portion (certain number of characters) of it and add the tabs after each portion to itself.
Here is a sample of the data file, the top row is the header and the data rows have no blank lines in between them:
MRUN Number Name X Exception Reason Data CDM# Quantity D.O.S
000000 00000000 Name W MODIFIER CANNOT BE FILED WITHOUT 08/13/2015 0000000 0 08/13/2015
000000 00000000 Name W MODIFIER CANNOT BE FILED WITHOUT 0000000 0 08/13/2015
The second data row is missing Data.
Using Ansgar's answer, my code that does find empty fields:
gc $FileToOpen |
? { $_ -match '^(.{8})(.{12})(.{20})(.{3})(.{34})(.{62})(.{10})(.{22})(.{10})$' } |
% { "{0}`t{1}`t{2}`t{3}`t{4}`t{5}`t{6}`t{7}`t{8}" -f $matches[1].Trim(), $matches[2].Trim(), $matches[3].Trim(), $matches[4].Trim(), $matches[5].Trim(), $matches[6].Trim(), $matches[7].Trim(), $matches[8].Trim(), $matches[9].Trim() } |
Set-Content $FileToSave
Thanks for your patience Ansgar, I know I tried it! I really do appreciate the help!
Since you seem to have an input file with fixed-width columns, you should probably use a regular expression for transforming the input into a tab-delimited format.
Assume the following input file:
A     B   C   
foo   13  22  
bar   4   17  
baz   142 23  
The file has 3 columns. The first column is 6 characters wide, the other two columns 4 characters each.
The transformation could be done with a regular expression like this:
Get-Content 'C:\path\to\input.txt' |
? { $_ -match '^(.{6})(.{4})(.{4})$' } |
% { "{0}`t{1}`t{2}" -f $matches[1].Trim(), $matches[2].Trim(), $matches[3].Trim() } |
Set-Content 'C:\path\to\output.txt'
The regular expression defines the columns by character count and captures them in groups (parentheses). The groups can then be accessed as the indexes 1 and above of the resulting $matches collection. Trimming removes the leading/trailing whitespace. The format operator (-f) then inserts the trimmed values into the tab-separated format string.
If the last column has a variable width (because its values are aligned to the left and don't have trailing spaces), change the regular expression to ^(.{6})(.{4})(.{0,4})$ to take care of that. The quantifier {0,4} means the preceding expression may occur up to four times. Write the lower bound explicitly: .NET regular expressions, which PowerShell uses, do not accept the short form {,4} and treat it as literal text.
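The same fixed-width idea can be sketched without a regular expression by slicing each line at the column boundaries. This Python version is only an illustration: the function name fixed_width_to_tsv is made up, and unlike the PowerShell pipeline it does not filter out lines that are too short, because slicing past the end of a string simply yields an empty field.

```python
def fixed_width_to_tsv(line, widths):
    """Cut `line` at the given column widths and join the trimmed fields with tabs."""
    fields = []
    start = 0
    for width in widths:
        # A slice past the end of the line is an empty string, so a short
        # or missing last column becomes an empty field.
        fields.append(line[start:start + width].strip())
        start += width
    return "\t".join(fields)


# Columns are 6, 4 and 4 characters wide, as in the example above.
print(repr(fixed_width_to_tsv("foo   13  22", [6, 4, 4])))
# An empty middle column still produces three fields.
print(repr(fixed_width_to_tsv("bar       17", [6, 4, 4])))
```

For the data file in the question, the widths list would be the nine widths from the question's own regular expression: [8, 12, 20, 3, 34, 62, 10, 22, 10].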
I have a string of “words”, like this: fIsh mOuntain rIver. The words are separated by a space, and I added spaces to the beginning and ending of the string to simplify the definition of a “word”.
I need to replace any words containing A, B, or C, with 1, any words containing X, Y, or Z with 2, and all remaining words with 3, e.g.:
the CAT ATE the Xylophone
First, replacing words containing A, B, or C with 1, the string becomes:
the 1 1 the Xylophone
Next, replacing words containing X, Y, or Z with 2, the string becomes:
the 1 1 the 2
Finally, replacing all remaining words with 3, the string becomes:
3 1 1 3 2
The final output is a string containing only numbers, with spaces between.
The words might contain any kind of symbols, e.g.: $5鱼fish can be a word. The only feature defining the beginning and ending of words is the spaces.
The matches are found in order, so a word that contains two matches, e.g. ZebrA, is simply replaced with 1.
The string is in UTF-8.
How can I replace all of the words containing these particular characters with numbers, and finally replace all remaining words with 3?
Try the following code:
function replace(str)
    return (str:gsub("%S+", function(word)
        if word:match("[ABC]") then return 1 end
        if word:match("[XYZ]") then return 2 end
        return 3
    end))
end
print(replace("the CAT ATE the Xylophone")) --> 3 1 1 3 2
The slnunicode module provides UTF-8 string functions.
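For comparison, the same word-by-word logic can be sketched in Python, where strings are Unicode by default. This is an illustration only, not a Lua solution: the function name replace_words is invented here, and a word is taken to be any run of characters other than the space character, as the question defines it.

```python
import re


def replace_words(text):
    """Replace each space-delimited word with 1, 2 or 3."""

    def classify(match):
        word = match.group(0)
        if re.search(r"[ABC]", word):  # checked first, so ZebrA becomes 1
            return "1"
        if re.search(r"[XYZ]", word):
            return "2"
        return "3"

    # [^ ]+ matches a run of non-space characters, i.e. one word.
    return re.sub(r"[^ ]+", classify, text)


print(replace_words("the CAT ATE the Xylophone"))  # prints: 3 1 1 3 2
print(replace_words("ZebrA $5鱼fish"))  # prints: 1 3
```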
The gsub function/method in Lua replaces every match of a pattern in a string and also returns how many matches were found: string.gsub(s, pattern, repl)
local str = "Hello, world!"
local newStr, recursions = str:gsub("Hello", "Bye")
print(newStr, recursions)
Bye, world! 1
newStr is "Bye, world!" because the pattern ("Hello") was replaced with the replacement ("Bye"), and recursions is 1 because "Hello" was found only once in str.
I have a text area full of lines of ingredient; typically in a [quantity] [measurement] [ingredient] [additional] format. For example, a few ingredient lines might be:
1 tablespoon garlic, minced
1 cup bell pepper, chopped
I want to be able to identify each measurement and ingredient -- how would you process this? My line of thought was...
// loop thru line by line of textarea
// explode each line by the space thus line[0] would be 1, line[1] tablespoon, line[2] garlic... etc
Now here is my problem and I'm not sure what is efficient to do. Do I run each line[X] thru a db search for that measurement, ingredient, etc? But since "bell pepper" is separated by a space, I won't get a match.
// does line[1] appear in the measurements table?
// does line[2] appear in the ingredients table?
Does anyone have any creative solutions?
Separate your data not by space but another delimiter. For example you could do:
$strRecipe = "1 | tablespoon | bell pepper | minced";
And then you can use:
$recipe = explode(" | ", $strRecipe); // split on " | " so the fields carry no stray spaces
Now you can access each field by index: $recipe[0], $recipe[1], and so on.
Try stripos() to locate substring instead of explode().
$mytext = "1 tablespoon garlic, minced 1 cup bell pepper, chopped"; # or any text
$keyword = "bell pepper"; # or any search term
if (stripos($mytext, $keyword) === false) {
# not found
...
}
else {
# found
...
}
References
stripos() - case-insensitive search
strpos() - case-sensitive search
You can use explode() (not recommended), but then you also have to split your search term into words and look for the first keyword in the array, followed by the next keyword in the next element, and so on. That is an unnecessary complication.
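A different way to keep multi-word ingredients such as "bell pepper" intact is to peel off only the leading quantity and measurement and treat the rest of the text before the comma as the ingredient. The Python sketch below illustrates that idea under assumptions: the function name parse_ingredient is invented, and the MEASUREMENTS set is a hypothetical stand-in for the measurements table mentioned in the question.

```python
# Hypothetical stand-in for the measurements table in the database.
MEASUREMENTS = {"tablespoon", "teaspoon", "cup"}


def parse_ingredient(line):
    """Split '[quantity] [measurement] [ingredient], [additional]' into a dict."""
    main, _, additional = line.partition(",")
    words = main.split()
    quantity = words[0] if words else ""
    rest = words[1:]
    measurement = ""
    if rest and rest[0].lower() in MEASUREMENTS:
        measurement = rest[0]
        rest = rest[1:]
    return {
        "quantity": quantity,
        "measurement": measurement,
        # Everything left before the comma is the ingredient, spaces included.
        "ingredient": " ".join(rest),
        "additional": additional.strip(),
    }


print(parse_ingredient("1 cup bell pepper, chopped"))
```

Only the measurement needs a lookup here; the ingredient name can then be checked against the ingredients table as one string.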
I've been given a large file with a funny CSV format to parse into a database.
The separator character is a semicolon (;). If one of the fields contains a semicolon it is "escaped" by wrapping it in doublequotes, like this ";".
I have been assured that there will never be two adjacent fields with trailing/ leading doublequotes, so this format should technically be ok.
Now, for parsing it in VBScript I was thinking of
Replacing each instance of ";" with a GUID,
Splitting the line into an array by semicolon,
Running back through the array, replacing the GUIDs with ";"
It seems to be the quickest way. Is there a better way? I guess I could use substrings but this method seems to be acceptable...
Your method sounds fine with the caveat that there's absolutely no possibility that your GUID will occur in the text itself.
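The placeholder method from the question can be sketched as follows. This is a Python illustration, not VBScript: the function name split_with_placeholder is made up, and the explicit collision check is an addition that guards against the caveat just mentioned.

```python
import uuid


def split_with_placeholder(line):
    """Split on ';' while protecting the escape sequence '";"'."""
    placeholder = uuid.uuid4().hex  # 32 hex digits, never contains ';'
    if placeholder in line:
        raise ValueError("placeholder collides with the data")
    protected = line.replace('";"', placeholder)
    # After splitting, turn every placeholder back into a plain semicolon.
    return [field.replace(placeholder, ";") for field in protected.split(";")]


print(split_with_placeholder('Pax;is;a;good;guy";" so;says;his;wife.'))
# prints: ['Pax', 'is', 'a', 'good', 'guy; so', 'says', 'his', 'wife.']
```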
One approach I've used for this type of data before is to split on the semicolons regardless and then, if two adjacent fields end and start with a quote, combine them.
For example:
Pax;is;a;good;guy";" so;says;his;wife.
becomes:
0 Pax
1 is
2 a
3 good
4 guy"
5 " so
6 says
7 his
8 wife.
Then, when you discover that fields 4 and 5 end and start (respectively) with a quote, you combine them by replacing the field 4 closing quote with a semicolon and removing the field 5 opening quote (and joining them of course).
0 Pax
1 is
2 a
3 good
4 guy; so
5 says
6 his
7 wife.
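The split-then-merge steps above can be sketched in Python. The function name split_and_merge is invented for this illustration, and the sketch relies on the guarantee from the question that two adjacent real fields never end and start with a double quote.

```python
def split_and_merge(line):
    """Split on ';', then rejoin pieces that were separated inside '";"'."""
    fields = []
    for piece in line.split(";"):
        if fields and fields[-1].endswith('"') and piece.startswith('"'):
            # Drop the closing quote of the previous piece and the opening
            # quote of this one, and restore the semicolon between them.
            fields[-1] = fields[-1][:-1] + ";" + piece[1:]
        else:
            fields.append(piece)
    return fields


print(split_and_merge('Pax;is;a;good;guy";" so;says;his;wife.'))
# prints: ['Pax', 'is', 'a', 'good', 'guy; so', 'says', 'his', 'wife.']
```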
In pseudo-code, given:
input: A string, first character is input[0]; last
       character is input[length-1]. Further, assume one dummy
       character, input[length]. It can be anything except
       ; and ". This string is one line of the "CSV" file.
length: positive integer, number of characters in input
Do this:
set start = 0
if input[0] = ';':
    you have a blank field in the beginning; do whatever with it
    set start = 1
endif
for each c between 1 and length-1:
    next iteration unless input[c] = ';'
    if input[c-1] ≠ '"' or input[c+1] ≠ '"': // test for escape sequence ";"
        found field consisting of half-open range [start,c); do whatever
        with it. Note that in the case of empty fields, start=c, leaving
        an empty range
        set start = c+1
    endif
end foreach
the last field is the half-open range [start,length); do whatever with it
Untested, of course. Debugging code like this is always fun….
The special case of input[0] is to make sure we don't ever look at input[-1]. If you can make input[-1] safe, then you can get rid of that special case. You can also put a dummy character in input[0] and then start your data—and your parsing—from input[1].
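A runnable version of this scan might look like the Python sketch below. It is an interpretation, not the author's code: the function name split_escaped is invented, explicit bounds checks replace both the dummy character and the input[0] special case, and the final unescaping of ";" is an addition that the pseudo-code leaves to the caller.

```python
def split_escaped(line):
    """Split on ';' unless it is the middle of the escape sequence '";"'."""
    fields = []
    start = 0
    for c, ch in enumerate(line):
        if ch != ";":
            continue
        prev_is_quote = c > 0 and line[c - 1] == '"'
        next_is_quote = c + 1 < len(line) and line[c + 1] == '"'
        if prev_is_quote and next_is_quote:
            continue  # escaped semicolon, stays inside the current field
        fields.append(line[start:c])  # half-open range [start, c)
        start = c + 1
    fields.append(line[start:])  # the last field runs to the end of the line
    # Addition: turn the escape sequence back into a plain semicolon.
    return [field.replace('";"', ";") for field in fields]


print(split_escaped('Pax;is;a;good;guy";" so;says;his;wife.'))
# prints: ['Pax', 'is', 'a', 'good', 'guy; so', 'says', 'his', 'wife.']
print(split_escaped(";a;;b"))
# prints: ['', 'a', '', 'b']
```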
One option would be to find instances of the regex:
[^"];[^"]
and then break the string apart with substring:
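// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
List<string> ret = new List<string>();
// A separator is a semicolon with a non-quote character on both sides,
// so this does not detect empty fields or a semicolon at the start of the line.
Regex r = new Regex(@"[^""];[^""]");
Match m;
while((m = r.Match(line)).Success)
{
    ret.Add(line.Substring(0, m.Index + 1)); // field up to and including the character before ';'
    line = line.Substring(m.Index + 2);      // continue after the ';'
}
ret.Add(line); // whatever remains is the last field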
(Sorry about the C#, I don't know VBScript)
Using quotes is normal for .csv files. If a field itself contains quotes, you may see the opening quote, the closing quote, and the embedded quotes strung together two or three in a row.
If you're using SQL Server you could try using T-SQL to handle everything for you.
SELECT * INTO MyTable FROM OPENDATASOURCE('Microsoft.JET.OLEDB.4.0',
'Data Source=F:\MyDirectory;Extended Properties="text;HDR=No"')...
[MyCsvFile#csv]
That will create and populate "MyTable". Read more on this subject here on SO.
I would recommend using RegEx to break up the strings.
Find every ';' that is not a part of
";" and change it to something else
that does not appear in your fields.
Then go through and replace ";" with ;
Now you have your fields with the correct data.
Most importers can swap out separator characters pretty easily.
This is basically your GUID idea. Just make sure the GUID is unique to your file before you start and you will be fine. I tend to start using 'Z'. After enough 'Z's, you will be unique (sometimes as few as 1-3 will do).
Jacob