[SOLVED] Remove trailing characters while adding leading characters

sharky · 04-03-2024, 08:17 PM

A text file contains numerous strings that end in a trailing sub-string.

Example, where _xx is the trailing sub-string:

Quote:

"m1_xx" some other text "m2_xx"
"p2_xx" yet more text "p2_xx" extra text
change is good "hello_xx"

desired output:

Quote:

"yy_m1" some other text "yy_m2"
"yy_p2" yet more text "yy_p2" extra text
change is good "yy_hello"

I found ways to make the substitution. However, with my method I lose the existing spacing - all the strings in the output are separated by a single space.

syg00 · 04-03-2024, 08:56 PM

sed is your friend - use regex and capture groups. Do-able in a single invocation.

grail · 04-03-2024, 10:56 PM

Please provide what you have tried so we may assist?

sharky · 04-04-2024, 01:05 PM

Quote:

Originally Posted by syg00

sed is your friend - use regex and capture groups. Do-able in a single invocation.

What is a 'capture group'?

sharky · 04-04-2024, 01:32 PM

Quote:

Originally Posted by grail

Please provide what you have tried so we may assist?

Code:

#!/usr/bin/env python3

def changeXXToYY():

  # read lines into list
  with open("testText") as fp:
    mapList = fp.readlines()

  # strip the newline and any other leading/trailing whitespace from each line
  mapList = [x.strip() for x in mapList]

  # lowercase, to match the _xx / yy_ strings in the example
  toRemove = '_xx"'
  toAdd = '"yy_'

  for elem in mapList:
    elem = elem.split()
    for item in elem:
      if toRemove in item:
        # builds the new string, but only rebinds the loop variable
        item = toAdd + item.split(toRemove)[0].split('"')[-1] + '"'
        print(item)

changeXXToYY()

This prints out the desired new string but the original line remains unchanged.
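
A likely reason the line stays unchanged: assigning to item inside the loop only rebinds the loop variable and never writes back to the list, and split() with no argument throws away the original whitespace. A minimal sketch of one way around both, assuming the lowercase _xx / yy_ strings from the example; the function name swap_affix is made up for illustration:

```python
import re

def swap_affix(line, suffix="_xx", prefix="yy_"):
    """Move a trailing suffix on quoted tokens to a leading prefix, keeping spacing."""
    # Splitting on a captured group keeps the whitespace runs in the result,
    # so joining the pieces back together restores the original spacing.
    pieces = re.split(r"(\s+)", line)
    tail = suffix + '"'
    for i, piece in enumerate(pieces):
        if piece.startswith('"') and piece.endswith(tail):
            # Write the new token back into the list by index
            # instead of rebinding a loop variable.
            pieces[i] = '"' + prefix + piece[1:-len(tail)] + '"'
    return "".join(pieces)

print(swap_affix('"m1_xx"   some other text\t"m2_xx"'))
```

This only handles tokens that are a single quoted word; the regex answers further down the thread are more general.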

MadeInGermany · 04-04-2024, 02:37 PM

With sed:

Code:

sed 's/"\([^"]*\)_xx"/"yy_\1"/g' testText

The \( \) group's match is referred to by \1 in the replacement string.
[^"] is a character that is not a "
[^"]* is any number of such characters (including none).
The /g modifier looks for further matches/substitutions in the line; each further match is searched for to the right of the current match.
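
For comparison with the Python attempt earlier in the thread, the same capture-group substitution can be sketched with re.sub. This is an illustration rather than part of the sed answer; the name rewrite is arbitrary, and the extra spaces in the second sample line are there only to show that spacing survives:

```python
import re

# Same pattern as the sed command, in Python syntax:
# grouping parentheses are written without backslashes here.
pattern = re.compile(r'"([^"]*)_xx"')

def rewrite(line):
    # \1 in the replacement is whatever the parenthesised group matched.
    return pattern.sub(r'"yy_\1"', line)

for line in ['"m1_xx" some other text "m2_xx"',
             '"p2_xx" yet more text   "p2_xx" extra text',
             'change is good "hello_xx"']:
    print(rewrite(line))
```

Because only the matched text is replaced, everything between matches, including runs of spaces, is left exactly as it was.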

sharky · 04-04-2024, 03:12 PM

Quote:

Originally Posted by MadeInGermany

With sed:

Code:

sed 's/"\([^"]*\)_xx"/"yy_\1"/g' testText

The \( \) group's match is referred to by \1 in the replacement string.
[^"] is a character that is not a "
[^"]* is any number of such characters (including none).
The /g modifier looks for further matches/substitutions in the line; each further match is searched for to the right of the current match.

It works. Thanks for the explanation also.

sharky · 04-04-2024, 03:25 PM

Quote:

Originally Posted by MadeInGermany

With sed:

Code:

sed 's/"\([^"]*\)_xx"/"yy_\1"/g' testText

The \( \) group's match is referred to by \1 in the replacement string.
[^"] is a character that is not a "
[^"]* is any number of such characters (including none).
The /g modifier looks for further matches/substitutions in the line; each further match is searched for to the right of the current match.

My apologies, but I noticed that my input file will also have cases where the original string is not within double quotes.

How should this sed command be modified to work in such cases? I've tried a few things but nothing changed.

MadeInGermany · 04-04-2024, 08:50 PM

If " anchors cannot be used, you can try \b anchors ("word boundaries"):

Code:

sed 's/\b\([^" ]*\)_xx\b/yy_\1/g' testText

[^" ]* is a string of characters that are not " or space.
The pre-defined "word boundary" (\b is a GNU sed extension) is a zero-width marker, not a character, so it does not need to be re-inserted. It is less precise, though: a boundary also occurs next to a - character, for example.
The following uses Extended Regular Expressions and three ( ) groups:

Code:

sed -E 's/(^|[" ])([^" ]*)_xx([" ]|$)/\1yy_\2\3/g' testText

The 1st group is the beginning marker or a " or space character.
The 2nd group is a string of not " or space characters.
The 3rd group is a " or space character or the end marker.
\1 \2 \3 is what the respective group has matched.
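
One corner case worth checking with the three-group pattern: each match consumes its trailing delimiter, so when two targets are separated by a single space, the second one has no leading delimiter left to match and /g appears to skip it. The sketch below reproduces this in Python, whose re.sub also scans left to right without overlapping matches, and shows a lookaround variant that tests the delimiters without consuming them. Lookarounds are a feature of Python's re module, not of sed; the function names and the sample line are made up for illustration.

```python
import re

# Direct translation of the sed -E pattern: delimiters are captured and re-inserted.
consuming = re.compile(r'(^|[" ])([^" ]*)_xx([" ]|$)')

# Variant: lookbehind/lookahead only assert that the neighbour is a delimiter
# (or the line edge), so two targets sharing one space can both match.
lookaround = re.compile(r'(?<![^" ])([^" ]*)_xx(?![^" ])')

def with_groups(line):
    return consuming.sub(r'\1yy_\2\3', line)

def with_lookaround(line):
    return lookaround.sub(r'yy_\1', line)

sample = 'm1_xx m2_xx "p2_xx" end'
print(with_groups(sample))      # m2_xx is left alone
print(with_lookaround(sample))  # all three are rewritten
```

On the sample input from the first post this does not show up, because every pair of targets has other words between them.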

syg00 · 04-04-2024, 09:15 PM

An alternate approach is to specify what you are looking for, rather than what you are not looking for. Also protects from overlooking possible corner cases (like what if one of those blanks is a tab ?).

Code:

 sed -r 's/([[:alnum:]]+)_xx/yy_\1/g' input.file
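
A rough Python equivalent of this "say what you want" pattern, as a sketch only. Python's re has no POSIX classes, so [A-Za-z0-9]+ stands in for [[:alnum:]]+ here, which is an assumption that the names are plain ASCII; rewrite_alnum is an arbitrary name.

```python
import re

# [A-Za-z0-9]+ stands in for sed's [[:alnum:]]+ (ASCII-only assumption).
target = re.compile(r'([A-Za-z0-9]+)_xx')

def rewrite_alnum(line):
    return target.sub(r'yy_\1', line)

# Quoted, tab-separated and unquoted targets are all handled by one pattern.
print(rewrite_alnum('"m1_xx"\tplain m2_xx'))

# Only the alphanumeric run directly before _xx is moved.
print(rewrite_alnum('foo_bar_xx'))
```

The second call is a reminder that an underscore is not alphanumeric: a name like foo_bar_xx comes out as foo_yy_bar, which may or may not be what is wanted.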

sundialsvcs · 04-05-2024, 09:01 AM

Just to clarify: a “regular expression (regex …)” can not only match a string pattern – (“yes or no, does it match?”) – but also return to you various specified pieces of the matching string. Such as: “some other text” and “more text.” In environments like sed, these pieces are instantly available as things like [left to right …] “$1” and “$2.” Or maybe, “\1” and “\2.” Which you can immediately use to produce output.

Also: These days, “regex support” is universal, and the syntax has become standardized. Implementations now vary only in the details. Every language has it. Therefore, understanding this very important power-tool is definitely “an essential life skill.” (Like knowing how to use a life jacket …) If you need to “tear apart a text string,” (and who doesn’t?), regex has your back.

There are “esoteric fee-churs” in regexes that you can learn about if and when you actually need them, and others that you might use every day.

sharky · 04-05-2024, 05:59 PM

Quote:

Originally Posted by sundialsvcs

Just to clarify: a “regular expression (regex …)” can not only match a string pattern – (“yes or no, does it match?”) – but also return to you various specified pieces of the matching string. Such as: “some other text” and “more text.” In environments like sed, these pieces are instantly available as things like [left to right …] “$1” and “$2.” Or maybe, “\1” and “\2.” Which you can immediately use to produce output.

Also: These days, “regex support” is universal, and the syntax has become standardized. Implementations now vary only in the details. Every language has it. Therefore, understanding this very important power-tool is definitely “an essential life skill.” (Like knowing how to use a life jacket …) If you need to “tear apart a text string,” (and who doesn’t?), regex has your back.

There are “esoteric fee-churs” in regexes that you can learn about if and when you actually need them, and others that you might use every day.

I do coding in Cadence SKILL language for design automation in a Linux environment (analog IC design). However, to my complete and utter shame, I have never gotten past a few rudimentary regular expression usages. The fact is, despite working in a Linux environment, I don't often have much need for regular expressions and have never taken that deep dive. I blame it on linuxquestions - you guys spoil me with amazing solutions.

sharky · 04-05-2024, 05:59 PM

Quote:

Originally Posted by syg00

An alternate approach is to specify what you are looking for, rather than what you are not looking for. Also protects from overlooking possible corner cases (like what if one of those blanks is a tab ?).

Code:

 sed -r 's/([[:alnum:]]+)_xx/yy_\1/g' input.file

This worked perfectly.

Thanks!

syg00 · 04-06-2024, 05:40 AM

You need to take that "deep dive" - regex is a powerful and useful tool. MadeInGermany has given you good pointers to get you started.

danielbmartin · 04-06-2024, 11:53 AM

Please forgive if this is obvious to LQ regulars.

The excellent solution posted by syg00 may be generalized.
xx and yy could be shell variables instead of literal character strings.

With this InFile ...

Code:

m1_SALT some other text m2_SALT
p2_SALT yet more text p2_SALT extra text
change is good hello_SALT
m1_HAM some other text m2_HAM
p2_HAM yet more text p2_HAM extra text
change is good hello_HAM

... this code ...

Code:

xx='SALT'
yy='SUGAR'
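# double quotes let the shell expand the variables; ${yy} keeps the following _ out of the variable name
sed -r "s/([[:alnum:]]+)_${xx}/${yy}_\1/g" <"$InFile" >"$OutFile"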

... produces this OutFile ...

Code:

SUGAR_m1 some other text SUGAR_m2
SUGAR_p2 yet more text SUGAR_p2 extra text
change is good SUGAR_hello
m1_HAM some other text m2_HAM
p2_HAM yet more text p2_HAM extra text
change is good hello_HAM

Which shows how we may change SALT into SUGAR
but not HAM into CHEESE.

Daniel B. Martin
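
The same generalization can be sketched in Python, where the suffix and prefix become function arguments. This is an illustration, not part of the post above; the name swap is hypothetical. re.escape is used so that a suffix containing regex metacharacters such as . or + is matched literally, a precaution the shell version would need to handle separately.

```python
import re

def swap(line, old_suffix, new_prefix):
    # re.escape makes characters such as "." or "+" in the suffix match literally.
    pattern = re.compile(r'([A-Za-z0-9]+)_' + re.escape(old_suffix))
    # A function replacement avoids backslash processing of new_prefix.
    return pattern.sub(lambda m: new_prefix + '_' + m.group(1), line)

print(swap('m1_SALT some other text m2_SALT', 'SALT', 'SUGAR'))
print(swap('m1_HAM some other text m2_HAM', 'SALT', 'SUGAR'))
```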
