[SOLVED] Replace text string with sequential numbers inside a textfile

K-Veikko · 04-06-2013, 11:38 AM

I get (book-size) text files from my OCR program. Each page end is marked by a special character. In gedit it is shown as a square containing the digits 000C. As far as my question is concerned, this could basically be any string of text:

Is there a sed, awk or any one-liner command to replace a text string with sequential numbers?

Then I could insert a page-number as the page separator.

Preferably a script where I can decide the starting number of pages.

This starting number in some cases is negative to make the page numbers equal with the real page-numbers of the book because there may be cover-pages etc. that are "not counted". – However this (starting number) is a minor problem.

colucix · 04-06-2013, 04:34 PM

The character you see in gedit is the hidden control character represented by hexadecimal C in the ASCII table. Not by chance it is the NP (new page) form feed: it is inserted in plain text books to let you distinguish pages programmatically.

A simple GNU awk program can accomplish your task, but if you replace the character with numbers, how will you distinguish them from other numbers in the text? In other words, the ability to separate pages is lost unless you choose a unique sequence of characters or another hidden control character that does not appear in the text. Just out of curiosity.

Anyway, here is a rough solution in gawk:

Code:

gawk 'BEGIN{ RS = "\xC" } { printf  "%s%d", $0, ++c }' book > page_numbered_book

Basically it treats the ASCII character with hexadecimal code C as the Record Separator and prints out every (untouched) record, that is, the content of a page (newlines included), followed by a number incremented on each pass. In other words, because it is the separator, the hidden character is no longer part of the record, which is how it gets removed from the output.

At this point (if you are experienced in awk, of course) you can easily add the conditional expressions to adjust numbering at your pleasure and customize the format of the number, even by adding the \xC character again! Feel free to ask if in doubt.
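As a minimal sketch of such an adjustment (not part of the command above): the snippet below assumes a small sample file and a hypothetical variable named start that holds the number of the first page, which may be negative. It is written with plain awk and "\f", which is the same form feed character as "\xC", and it prints that character again after each number so the page breaks survive.

Code:

```shell
# Build a tiny sample "book": four pages, each ended by a form feed (\f is the same character as \xC).
printf 'cover\fpreface\fpage one\fpage two\f' > sample_book

# number_pages START FILE
# START (a hypothetical parameter) is the number given to the first page;
# a negative value lets uncounted front matter count up towards page 1.
number_pages() {
    awk -v start="$1" 'BEGIN { RS = "\f"; n = start }
        { printf "%s[%d]\f", $0, n++ }' "$2"
}

number_pages -1 sample_book > sample_numbered

# Reveal the form feeds as ^L to check the result.
cat -v sample_numbered
```

With a start value of -1 the cover gets -1, the preface 0, and the first real page 1. The square brackets are only an illustrative marker; any string that does not occur in the text would do.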

As an aside: most likely you cannot see the new page character in the CLI, for example using the cat command (unless you use the -v option). It appears as a kind of newline, because recent terminal emulators interpret it by moving the cursor down one line without returning to the first column. For example, suppose I create a single-line file with the following content

Code:

page one<np>page two<np>page three<np>fine

Using cat I will see it like this:

Code:

$ cat book
page one
        page two
                page three
                          fine

Instead, using the -v option the hidden character is somewhat revealed:

Code:

$ cat -v book
page one^Lpage two^Lpage three^Lfine

Hope this helps.

K-Veikko · 04-07-2013, 02:04 AM

Quote:

Originally Posted by colucix

how will you distinguish them from other numbers inserted in the text

Thank you very much. This solved my problem!

– Before numbering the pages I convert this "\xC" into something like: \n\n\n--"\xC"--\n\n\n using gedit's replace function. So I get a unique string which quite obviously does not exist elsewhere in the text.

After getting the page numbers in the correct place I manually number the first few pages as: "--000--".

Only leading zeros are missing to make the page break look "standard": 001, 002, 003, etc.

colucix · 04-07-2013, 03:23 AM

It looks nice. Next time you might consider editing the book directly in awk. You may only have to check manually at which page the numbering should start:

Code:

gawk -v s="$PAGE" 'BEGIN{ RS = "\xC" } NR >= s { c++ } { printf "%s\n\n\n--%03d--\n\n\n", $0, c }' book > page_numbered_book

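where $PAGE is the aforementioned number, that is, the position in the file of the page where the numbering should start. That page is marked --001--, and every page before it is marked --000--.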
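A quick way to check the behaviour is to run the same logic on a small sample. This is only a sketch: the file names and the value 3 are made up for illustration, and it uses plain awk with "\f" (the same form feed character as "\xC") so that it should also run where gawk is not installed.

Code:

```shell
# Sample book: two uncounted front pages followed by two real pages, each ended by a form feed.
printf 'cover\fpreface\fchapter one\fchapter two\f' > sample_book

# Numbering should start at the third page in the file (illustrative value).
PAGE=3

awk -v s="$PAGE" 'BEGIN { RS = "\f" } NR >= s { c++ }
    { printf "%s\n\n\n--%03d--\n\n\n", $0, c }' sample_book > sample_numbered

# Show only the inserted page markers.
grep -e '^--' sample_numbered
```

The two front pages are marked --000-- and the following pages --001-- and --002--, which matches the manual "--000--" convention described earlier in the thread.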