LinuxQuestions.org
Old 04-06-2013, 11:38 AM   #1
K-Veikko
LQ Newbie
 
Registered: Jul 2005
Posts: 11

Rep: Reputation: 0
Replace text string with sequential numbers inside a textfile


I get book-sized text files from my OCR program. Each page end is marked by a special character, which gedit displays as a square containing the digits 000C. (For the purposes of my question, the marker could be any text string.)

Is there a sed, awk or any one-liner command to replace a text string with sequential numbers?

Then I could insert a page-number as the page separator.

Preferably a script where I can choose the starting page number.
In some cases this starting number is negative, so that the page numbers match the real page numbers of the book, because there may be cover pages etc. that are "not counted". However, the starting number is a minor problem.
 
Old 04-06-2013, 04:34 PM   #2
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
The character you see in gedit is a hidden control character: hexadecimal 0C in the ASCII table. Not by chance, it is the form feed (FF, also called NP, new page), which is inserted in plain-text books so that pages can be distinguished programmatically.

A simple GNU awk program can accomplish your task, but if you replace the character with numbers, how will you distinguish them from other numbers in the text? In other words, the ability to separate pages is lost unless you choose a unique sequence of characters, or another hidden control character, that does not appear in the text. I ask just out of curiosity.

Anyway, here is a rough solution in gawk:
Code:
gawk 'BEGIN{ RS = "\xC" } { printf  "%s%d", $0, ++c }' book > page_numbered_book
Basically, it treats the ASCII character with hexadecimal code C as the record separator (RS) and prints every record unchanged, that is, the content of a page (newlines included), followed by a number that is incremented for each record. Because the hidden character is the separator, it is not part of any record, so it is dropped from the output.

At this point (if you are experienced in awk, of course) you can easily add conditional expressions to adjust the numbering as you like and customize the format of the number, even adding the \xC character back. Feel free to ask if in doubt.
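A minimal sketch of the customization described above, assuming a POSIX awk (the `\f` escape denotes the same form feed as `\xC`): it appends a label to each page and then writes the form feed back, so the pages stay separable. The function name `number_pages_keep_ff` and the `[page N]` label format are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: number each page but keep the form feed separator.
# $1 = input file; the result goes to standard output.
number_pages_keep_ff() {
    awk 'BEGIN { RS = "\f" }                 # one record per page
         { printf "%s[page %d]\f", $0, ++c } # page text, label, form feed
    ' "$1"
}
```

It could be called as `number_pages_keep_ff book > page_numbered_book`. Because the form feed is written back after every label, the output can be split into pages again later with the same `RS` setting.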

As an aside: most likely you cannot see the form feed character in the CLI, for example with the cat command (unless you use the -v option). It appears as a kind of newline, because terminal emulators typically treat it as a line feed without a carriage return, so the following text continues one line down at the same column. For example, suppose I create a single-line file with the following content:
Code:
page one<np>page two<np>page three<np>fine
Using cat I will see it like this:
Code:
$ cat book
page one
        page two
                page three
                          fine
With the -v option, instead, the hidden character is revealed as ^L:
Code:
$ cat -v book
page one^Lpage two^Lpage three^Lfine
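The sample file above can be reproduced as follows; this is a sketch assuming a POSIX shell, where `printf` and `tr` understand `\f` as the form feed. The variable `sample` and its location are hypothetical stand-ins for the file `book`.

```shell
# Hypothetical path for the sample file (stands in for "book").
sample="${TMPDIR:-/tmp}/book_sample.txt"

# Write one line containing three form feeds (\f is hexadecimal 0C).
printf 'page one\fpage two\fpage three\ffine\n' > "$sample"

# Count the form feeds: delete every other byte, then count what is left.
ff_count=$(tr -cd '\f' < "$sample" | wc -c)
echo "form feeds: $ff_count"
```

Counting the separators this way is a quick check that a file really contains one form feed per page break before numbering it.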
Hope this helps.

Last edited by colucix; 04-06-2013 at 04:40 PM. Reason: typo
 
Old 04-07-2013, 02:04 AM   #3
K-Veikko
LQ Newbie
 
Registered: Jul 2005
Posts: 11

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by colucix
how will you distinguish them from other numbers inserted in the text
Thank you very much. This solved my problem!

Before numbering the pages, I convert this "\xC" into something like \n\n\n--"\xC"--\n\n\n using gedit's replace function. That gives me a unique string that quite obviously does not occur elsewhere in the text.

After getting the page numbers in the correct place, I manually number the first few pages as "--000--".

Only the leading zeros are missing to make the page breaks look "standard": 001, 002, 003, etc.

Last edited by K-Veikko; 04-07-2013 at 02:08 AM.
 
Old 04-07-2013, 03:23 AM   #4
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
It looks nice. Next time, you may consider editing the book directly in awk. You would only have to check manually at which page the numbering should start:
Code:
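gawk -v s="$PAGE" 'BEGIN{ RS = "\xC" } NR >= s { c++ } { printf  "%s\n\n\n--%03d--\n\n\n", $0, c }' book > page_numbered_book
where $PAGE is the number of the page at which numbering should start. Earlier pages are printed as --000--, because the counter c is still unset there.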
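For the original request, a starting number that may be negative, one variation is to pass the first label rather than the first numbered page. This is a sketch assuming a POSIX awk, not the method posted above; the function name `number_pages_from` and its arguments are hypothetical.

```shell
# Hypothetical helper: label pages starting from an arbitrary number.
# $1 = input file, $2 = label of the first page (may be negative).
number_pages_from() {
    awk -v n="$2" 'BEGIN { RS = "\f" }   # one record per page
         { printf "%s\n\n\n--%03d--\n\n\n", $0, n++ }  # label, then step
    ' "$1"
}
```

Calling `number_pages_from book -2 > page_numbered_book` labels the pages -2, -1, 0, 1 and so on. Note that `%03d` counts the minus sign toward the width, so negative labels carry only two digits.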
 
  


All times are GMT -5.