Unexpected carriage returns

Explaining a strange output from sed

Posted by Owen Stephens on December 16, 2019

This is a quick blog post about a gotcha that we encountered recently, with sed seemingly not working as we expected (unsurprising spoiler: sed was just fine; the problem was our input). We started with a file similar to the following (but with thousands of rows):

$ head -n3 example.csv
ref
AB0001
BC0002

We were working with a simple "CSV" file (which had no comma separators, since there was only a single column). Our aim was to generate a SQL query restricted to these refs, for example:

SELECT * FROM the_table WHERE ref IN
-- We wanted to generate the following line...
('AB0001','BC0002')

Our first thought was "we can use sed to easily add quotes around these refs", so we quickly ran through a simple example to test our sed command:

$ echo -e 'abc\ndef' | sed "s/.*/'&'/"
'abc'
'def'

So far, so good. We used the & replacement to reference the whole matched portion (see the manual for details), so the command says "match any number of any characters, and replace the match with itself, surrounded by single quotes".

Unexpected output

However, when we ran the CSV file through our previously tested sed program, the output was not what we expected:

$ head -n3 example.csv | sed "s/.*/'&'/"
'ref
'AB0001
'BC0002

It appeared that the final single quote wasn't being printed. On seeing this, we tried to check that our sed replacement was working:

$ head -n3 example.csv | sed "s/.*/'%%%&'/"
'%%%ref
'%%%AB0001
'%%%BC0002

Hmm, we were still not seeing the trailing quote, but additional leading characters were being printed. We wondered what would happen if we tried adding some additional trailing characters (this time, by chance, we added characters outside the single quotes):

$ head -n3 example.csv | sed "s/.*/'&'%%%/"
'%%%
'%%%001
'%%%002

Aha! It was as if the trailing characters were overwriting the already-printed line... That sounded to us like the behaviour of a carriage return (\r), so we checked for them with file:

$ file example.csv
example.csv: ASCII text, with CRLF line terminators
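
If file isn't to hand, or you'd like to see the offending bytes for yourself, a rough alternative is to count the lines that end in a carriage return, or to dump the bytes with od -c. The following is a minimal sketch: count_cr_lines is a name made up for this example, and the sample file is a temporary stand-in for example.csv so that the snippet is self-contained.

```shell
# A literal carriage return; command substitution only strips trailing newlines.
cr=$(printf '\r')

# Hypothetical helper: count the lines in a file that end in a carriage return.
count_cr_lines() {
  grep -c "${cr}\$" "$1"
}

# A three-line stand-in for example.csv, with CRLF line endings.
sample=$(mktemp)
printf 'ref\r\nAB0001\r\nBC0002\r\n' > "$sample"

count_cr_lines "$sample"     # prints 3

# Dump the first line byte by byte: the \r is shown explicitly before the \n.
head -n1 "$sample" | od -c

rm -f "$sample"
```

A non-zero count is a strong hint that the file came from a system using \r\n line endings.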

Bingo! We'd received this file from a colleague who uses Windows, and we'd thus inherited their line-endings style. As a quick fix, we ran dos2unix to convert \r\n line-endings into \n, and tried our original command again:

$ head -n3 example.csv | dos2unix | sed "s/.*/'&'/"
'ref'
'AB0001'
'BC0002'
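
dos2unix isn't installed everywhere. As a hedged alternative, the carriage returns can be stripped with tr, which is a standard tool. Note that tr -d '\r' removes every carriage return, not just those directly before a newline; that is fine for a file like ours, but it is an assumption worth checking for other data. strip_cr is a hypothetical helper name used only in this sketch.

```shell
# Hypothetical helper: delete every carriage return from standard input.
strip_cr() {
  tr -d '\r'
}

# Same shape as the sample file, with CRLF line endings.
printf 'ref\r\nAB0001\r\nBC0002\r\n' | strip_cr | sed "s/.*/'&'/"
# 'ref'
# 'AB0001'
# 'BC0002'
```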

Success; now all that remained was for us to:

  1. Remove the header row with: tail -n+2
  2. Use paste to join lines with commas, using: paste -s -d,
  3. Surround in parentheses with: sed 's/.*/(&)/'

This left us with:

$ dos2unix < example.csv | sed "s/.*/'&'/" | tail -n+2 | \
  paste -s -d, | sed 's/.*/(&)/'
('AB0001','BC0002','CD0003','XY1337')

and we were done.
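
The steps above can also be wrapped up so that the whole query is produced in one go. The following is a minimal sketch rather than the exact command we ran: it swaps dos2unix for tr -d '\r' (an assumption made here so that the snippet depends only on standard tools), and in_list is a hypothetical function name. The table and column names are the ones from the query earlier in the post.

```shell
# Hypothetical helper: read a single-column CSV (with a header row) on stdin
# and print its values as a quoted, parenthesised, comma-separated list.
in_list() {
  tr -d '\r' |           # strip carriage returns (stand-in for dos2unix)
    tail -n +2 |         # drop the header row
    sed "s/.*/'&'/" |    # wrap each ref in single quotes
    paste -s -d, - |     # join the lines with commas
    sed 's/.*/(&)/'      # surround the result in parentheses
}

# A two-row stand-in for example.csv, with CRLF line endings.
printf 'SELECT * FROM the_table WHERE ref IN\n%s\n' \
  "$(printf 'ref\r\nAB0001\r\nBC0002\r\n' | in_list)"
# SELECT * FROM the_table WHERE ref IN
# ('AB0001','BC0002')
```

The explicit - operand to paste is there because some implementations do not read standard input unless asked to.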

Why was this happening?

sed operates by reading a line at a time, removing the trailing newline character, applying the command(s) and then printing the result with the newline character added back.

This means that when matching against a file with \r\n line endings, sed operates as follows:

  1. Read in a line: AB0001\r\n
  2. Remove the newline character: AB0001\r
  3. Apply the s// command: 'AB0001\r'
  4. Write out the result, with a trailing newline character: 'AB0001\r'\n

Notice how in step 3 we added ' after the \r. This means that when the string is printed to the terminal, the second ' overwrites the first ', as the cursor is returned to the start of the line by the \r character. To demonstrate that this is what is happening, we can change the second single quote to another character, and check that we only see that new character (since the quote is being overwritten):

$ echo -e 'abc\r' | sed "s/.*/'&~/"
~abc

Notice that now the ' is overwritten by the ~.
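
Because the terminal acts on the \r, the output above hides what sed actually wrote. One way to confirm the four steps is to send the output to od -c, which prints each byte instead of interpreting it. A small sketch follows; quote_lines is a hypothetical wrapper around the substitution from earlier, and the exact column spacing of od's output may vary between implementations.

```shell
# The same substitution as in the post, reading from standard input.
quote_lines() {
  sed "s/.*/'&'/"
}

# printf is used rather than echo -e so that the input bytes are unambiguous.
printf 'AB0001\r\n' | quote_lines | od -c
# In the dump, the closing quote appears after the \r and before the \n:
#   ' A B 0 0 0 1 \r ' \n
```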

Bonus use of \r

Due to the "overwriting" behaviour of carriage return characters, they can be used to create simple progress bars for interactive console applications. A small example in Ruby is:

puts "Progressing..."

21.times do |i|
  bar = "#{"=" * i}#{" " * (20 - i)}"
  counter = "#{i}/20"

  print "\r[#{bar}]#{counter}"

  sleep 0.1
end

puts "\nDone!"

This prints output to the terminal as:

[Image: terminal_progress]

which is neat, given such a small amount of code. For something more production-ready that is based on the same approach under the hood, check out the ruby-progressbar library.