[SOLVED] Converting a binary file from little endian to big endian

jaepi · 08-27-2011, 12:19 AM

I'm a little new to dealing with endianness. I created a binary file on Windows with a program compiled in Visual Studio, so the data is little endian. My problem is that this file will be read on Linux, by an application compiled for a big-endian machine. I have a function to convert unsigned long variables (the header of my bin file) which does its job pretty well but slows down the application. My other concern is the entire buffer, i.e. the content of the file (I can only byte swap the header but not the entire content X_X). I think the best way to deal with this is to convert the entire file first before reading it, but I have no idea how to start. I will appreciate all the help and suggestions I can get. Thank you

syg00 · 08-27-2011, 01:22 AM

By binary, I hope you mean "binary data" and not "binary executable".
If so, you know the layout of the data - your structs after all. You need to know the type of (big-endian) hardware to know what the data-types represent. How big (as in how many bytes are occupied by ...) an int is, or a double - that sort of thing. Each field has to be handled separately.
AFAIC, it's a no-brainer that the entire file be converted prior to feeding it to the app. Where to do the conversion might be a consideration, but you'd expect Intel CPU cycles to be cheaper - do it there.
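
To make "each field has to be handled separately" concrete, here is a minimal sketch in C. The header layout (`struct file_header` and its four fields), the 12-byte on-disk size, and the helper names are all made up for illustration; they are assumptions, not the actual format from this thread. The idea is to decode every field from its own offset in the raw bytes rather than reading the whole struct in one go.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header: field names and sizes are invented for this
 * example; substitute the layout your writer actually used. */
struct file_header {
        uint32_t magic;
        uint16_t version;
        uint16_t flags;
        uint32_t record_count;
};

#define HEADER_BYTES 12  /* bytes the header occupies on disk (assumed) */

/* Assemble little-endian values one byte at a time, so the result is
 * the same on little-endian and big-endian hosts. */
static uint16_t load_le16(const unsigned char *p)
{
        return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t load_le32(const unsigned char *p)
{
        return  (uint32_t)p[0]
             | ((uint32_t)p[1] <<  8)
             | ((uint32_t)p[2] << 16)
             | ((uint32_t)p[3] << 24);
}

/* Decode each field separately from its on-disk offset. */
static void parse_header(const unsigned char raw[HEADER_BYTES],
                         struct file_header *h)
{
        h->magic        = load_le32(raw + 0);
        h->version      = load_le16(raw + 4);
        h->flags        = load_le16(raw + 6);
        h->record_count = load_le32(raw + 8);
}
```

Because the offsets are spelled out, this also sidesteps any padding the compiler may add to the in-memory struct.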

Nominal Animal · 08-28-2011, 12:52 AM

If you handle very large files, or files only rarely accessed, it turns out you can do the endianness conversion while reading the data without slowing down the program. It does consume more CPU cycles than not doing a conversion, but reading a large file is I/O bound anyway; so, if you do the conversion while still reading the file, the conversion is practically free. I write my own low-level I/O routines, reading the data in 64k to 2M chunks, and apply any necessary conversions for each completed chunk. It does get a bit complicated, because the read does not necessarily end with a field boundary, but for large data files it is certainly worth the code complexity.
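
A rough sketch of the chunked approach, under some loud assumptions: the file is nothing but 32-bit values, and I use stdio (`fread`/`fwrite`) here for brevity instead of the low-level routines described above. The names `swap32_in_place`, `convert_stream32` and `CHUNK_BYTES` are invented for this example. The point it shows is the carry-over: bytes of an incomplete value at the end of a chunk are moved to the front of the buffer and completed by the next read.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_BYTES 65536  /* 64k, the low end of the range mentioned */

/* Reverse the bytes of every complete 32-bit value in buf[0..len).
 * Returns the number of bytes converted (a multiple of 4); up to three
 * trailing bytes of an incomplete value are left untouched. */
static size_t swap32_in_place(unsigned char *buf, size_t len)
{
        size_t done = len - (len % 4);
        for (size_t i = 0; i < done; i += 4) {
                unsigned char t;
                t = buf[i];     buf[i]     = buf[i + 3]; buf[i + 3] = t;
                t = buf[i + 1]; buf[i + 1] = buf[i + 2]; buf[i + 2] = t;
        }
        return done;
}

/* Copy `in` to `out`, byte-swapping each 32-bit value on the way.
 * Returns 0 on success, -1 on I/O error or if the input length is not
 * a multiple of 4. Uses a static buffer, so it is not reentrant. */
static int convert_stream32(FILE *in, FILE *out)
{
        static unsigned char buf[CHUNK_BYTES];
        size_t have = 0;   /* bytes carried over from the previous chunk */

        for (;;) {
                size_t got = fread(buf + have, 1, sizeof buf - have, in);
                if (got == 0)
                        break;
                have += got;

                size_t done = swap32_in_place(buf, have);
                if (fwrite(buf, 1, done, out) != done)
                        return -1;

                /* Move the incomplete tail to the front for the next read. */
                memmove(buf, buf + done, have - done);
                have -= done;
        }
        return (ferror(in) || have != 0) ? -1 : 0;
}
```

With stdio, short reads normally only happen at end of file, so the carry-over rarely triggers; with raw `read()` calls on pipes or sockets it matters much more. A real file with mixed field sizes needs a per-field conversion in place of `swap32_in_place`.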

If you know the data is always in little-endian order, you could use

Code:

#include <stdint.h>

static inline uint16_t get_le16(const void *const from)
{
        return ((uint16_t)(((const unsigned char *)from)[0])      )
             | ((uint16_t)(((const unsigned char *)from)[1]) << 8U);
}

static inline uint32_t get_le32(const void *const from)
{
        return ((uint32_t)(((const unsigned char *)from)[0])       )
             | ((uint32_t)(((const unsigned char *)from)[1]) <<  8U)
             | ((uint32_t)(((const unsigned char *)from)[2]) << 16U)
             | ((uint32_t)(((const unsigned char *)from)[3]) << 24U);
}

static inline uint64_t get_le64(const void *const from)
{
        return ((uint64_t)(((const unsigned char *)from)[0])       )
             | ((uint64_t)(((const unsigned char *)from)[1]) <<  8U)
             | ((uint64_t)(((const unsigned char *)from)[2]) << 16U)
             | ((uint64_t)(((const unsigned char *)from)[3]) << 24U)
             | ((uint64_t)(((const unsigned char *)from)[4]) << 32U)
             | ((uint64_t)(((const unsigned char *)from)[5]) << 40U)
             | ((uint64_t)(((const unsigned char *)from)[6]) << 48U)
             | ((uint64_t)(((const unsigned char *)from)[7]) << 56U);
}

but the above functions end up being pretty slow. Certainly they are much slower than just reversing the endianness:

Code:

#include <stdint.h>

static inline uint16_t swap_endian16(uint16_t u)
{
        return ((u >> 8U) & 0xFFU)
             | ((u & 0xFFU) << 8U);
}

static inline uint32_t swap_endian32(uint32_t u)
{
        const uint32_t m8  = (uint32_t)0xFF00FFUL;
        const uint32_t m16 = (uint32_t)0xFFFFUL;

        u = ((u >>  8U) & m8)  | ((u & m8)  <<  8U);
        u = ((u >> 16U) & m16) | ((u & m16) << 16U);

        return u;
}

static inline uint64_t swap_endian64(uint64_t u)
{
        const uint64_t m8  = (uint64_t)0x00FF00FF00FF00FFULL;
        const uint64_t m16 = (uint64_t)0x0000FFFF0000FFFFULL;
        const uint64_t m32 = (uint64_t)0x00000000FFFFFFFFULL;

        u = ((u >>  8U) & m8)  | ((u & m8)  <<  8U);
        u = ((u >> 16U) & m16) | ((u & m16) << 16U);
        u = ((u >> 32U) & m32) | ((u & m32) << 32U);

        return u;
}

On 32-bit architectures, arrays of 16-bit values are fastest to convert two at a time; use a variant of swap_endian32() that only does the m8 step.

On 64-bit architectures, arrays of 16-bit values are fastest to convert four at a time; use a variant of swap_endian64() that only does the m8 step. Arrays of 32-bit values are fastest to convert two at a time; use a variant of swap_endian64() that only does the m8 and m16 steps.
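
For reference, the packed variants could look something like this. The function names are mine, chosen for this example; the masks and steps are the same ones used in swap_endian32() and swap_endian64() above.

```c
#include <stdint.h>

/* Two 16-bit values packed in a 32-bit word: m8 step only. */
static inline uint32_t swap_endian16x2(uint32_t u)
{
        const uint32_t m8 = (uint32_t)0x00FF00FFUL;

        return ((u >> 8U) & m8) | ((u & m8) << 8U);
}

/* Four 16-bit values packed in a 64-bit word: m8 step only. */
static inline uint64_t swap_endian16x4(uint64_t u)
{
        const uint64_t m8 = (uint64_t)0x00FF00FF00FF00FFULL;

        return ((u >> 8U) & m8) | ((u & m8) << 8U);
}

/* Two 32-bit values packed in a 64-bit word: m8 and m16 steps only. */
static inline uint64_t swap_endian32x2(uint64_t u)
{
        const uint64_t m8  = (uint64_t)0x00FF00FF00FF00FFULL;
        const uint64_t m16 = (uint64_t)0x0000FFFF0000FFFFULL;

        u = ((u >>  8U) & m8)  | ((u & m8)  <<  8U);
        u = ((u >> 16U) & m16) | ((u & m16) << 16U);

        return u;
}
```

To feed an array of narrower values through these, copy the bytes into the wider word with memcpy() rather than casting the array pointer; the cast can run into alignment and aliasing trouble.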

There used to be certain architectures which had mixed byte orders (CDAB); the latter conversion functions only need small modifications to convert those too.
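
As a sketch of such a modification (the function name is invented here): if the bytes of a 32-bit value are labelled ABCD in big-endian order, going between ABCD and CDAB only exchanges the two 16-bit halves, which is the m16 step on its own. Going between CDAB and fully little-endian DCBA is the m8 step on its own.

```c
#include <stdint.h>

/* ABCD <-> CDAB: exchange the 16-bit halves, leaving the byte order
 * inside each half alone (the m16 step of swap_endian32() only). */
static inline uint32_t swap_halves32(uint32_t u)
{
        const uint32_t m16 = (uint32_t)0xFFFFUL;

        return ((u >> 16U) & m16) | ((u & m16) << 16U);
}
```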

On some architectures it is possible that floating-point values (float and double) have a different byte order than integer values. I personally put prototype values in the header:

uint16_t: 43981 (0xABCD)
uint32_t: 67305985 (0x04030201)
float: 721409.0/1048576.0 (0x3f302010)
double: 66809.0/8323200.0 (0x3f80706050403020)

Note that internally, float can be treated as uint32_t, and double as uint64_t, if you remember that they may have different endianness than the integer values. The prototype values also make sure the architecture understands IEEE 754 (float AKA binary32, and double AKA binary64) floating-point values, possibly after an endianness correction.
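
A minimal sketch of how a reader might check the prototypes, assuming float is 32 bits wide and that only "same order" and "fully reversed" need to be told apart. The enum and function names are invented for this example. The raw value is whatever you get by copying the four prototype bytes from the file straight into a uint32_t.

```c
#include <stdint.h>
#include <string.h>

enum byte_order { ORDER_NATIVE, ORDER_SWAPPED, ORDER_UNKNOWN };

static uint32_t reverse32(uint32_t u)
{
        u = ((u >>  8U) & 0x00FF00FFUL) | ((u & 0x00FF00FFUL) <<  8U);
        u = ((u >> 16U) & 0x0000FFFFUL) | ((u & 0x0000FFFFUL) << 16U);
        return u;
}

/* Compare a raw 32-bit prototype from the file with the value this
 * machine expects, as-is and byte-reversed. */
static enum byte_order classify_u32(uint32_t raw, uint32_t expected)
{
        if (raw == expected)
                return ORDER_NATIVE;
        if (reverse32(raw) == expected)
                return ORDER_SWAPPED;
        return ORDER_UNKNOWN;
}

/* Bit pattern of a float on this machine, viewed as a uint32_t.
 * Assumes sizeof(float) == 4. */
static uint32_t float_bits(float f)
{
        uint32_t u;
        memcpy(&u, &f, sizeof u);
        return u;
}

/* The float prototype is checked separately from the integer one,
 * because its byte order may differ. */
static enum byte_order classify_float(uint32_t raw)
{
        return classify_u32(raw, float_bits(721409.0f / 1048576.0f));
}
```

An ORDER_UNKNOWN result means either a mixed byte order or a non-IEEE 754 format, and the reader should refuse the file rather than guess.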

For those who still use Fortran, it is possible to do the conversion in Fortran too, if the compiler supports sequential raw I/O (binary, no record boundaries). That is what I originally developed this for. It was an order of magnitude faster than text I/O; conversion between strings and floating-point values is surprisingly slow.

paulsm4 · 08-28-2011, 01:30 AM

jaepi -

* Please double-check and make sure that byte ordering is even a problem. Byte ordering usually comes into play when you exchange data between two different CPU architectures (e.g. SPARC or MIPS with an Intel CPU). It's seldom an issue between Windows and Linux (assuming you're using Intel CPUs on both).

* If it *is* an issue, and if you can convert the file en masse, perhaps the fastest/easiest/most efficient way is with "dd conv=swab", which swaps each pair of adjacent bytes:

http://www.codecoffee.com/tipsforlin...icles/036.html

'Hope that helps .. PSM

syg00 · 08-28-2011, 03:17 AM

If it's big-endian vs little-endian, it's a problem. And it's extraordinarily unlikely to be a simple byte swap that dd can help with.
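
To illustrate why: dd's conv=swab exchanges each pair of adjacent bytes, so it only does the right thing for data made up entirely of 16-bit values. The C function below is my own stand-in for that behaviour (it is not dd's code), applied to the four bytes of one 32-bit value.

```c
#include <stddef.h>

/* Exchange each pair of adjacent bytes, as conv=swab does. An odd
 * trailing byte is simply left alone in this sketch. */
static void swap_byte_pairs(unsigned char *buf, size_t len)
{
        for (size_t i = 0; i + 1 < len; i += 2) {
                unsigned char t = buf[i];
                buf[i] = buf[i + 1];
                buf[i + 1] = t;
        }
}
```

Bytes 01 02 03 04 come out as 02 01 04 03, whereas reversing a 32-bit value needs 04 03 02 01. Char buffers and strings in the file would get scrambled as well.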

When I did this, I used perl as it was a one-off for a customer I didn't have access to.

ta0kira · 08-28-2011, 12:32 PM

As a general rule, when storing binary data (especially data structures) you should only expect the data to be accessible on the machine it was created on unless you've standardized the file format to be usable on all systems. If you're using data structures, you can't even guarantee the member alignment will be the same (without explicit manipulation, that is). If you don't need to mmap or have random access you might consider storing the data as text and compressing the file (e.g. bzip2) to make up for the increase in size.
Kevin Barry
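
A small sketch of the alignment point, using a made-up `struct record` (not anything from this thread). It prints how many bytes the compiler actually uses for the struct and where each member sits, next to the number of bytes the members themselves hold.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record, for illustration only. */
struct record {
        uint8_t  tag;
        uint32_t value;
        uint16_t count;
};

/* Bytes of data the members hold, without any padding. */
#define RECORD_PAYLOAD (sizeof(uint8_t) + sizeof(uint32_t) + sizeof(uint16_t))

static void print_layout(void)
{
        printf("sizeof(struct record) = %zu, payload = %zu\n",
               sizeof(struct record), (size_t)RECORD_PAYLOAD);
        printf("offsets: tag=%zu value=%zu count=%zu\n",
               offsetof(struct record, tag),
               offsetof(struct record, value),
               offsetof(struct record, count));
}
```

The payload is 7 bytes, but the struct is commonly larger because the compiler pads members to their natural alignment, and the amount of padding is up to the compiler and ABI. Writing the struct with a single fwrite() therefore bakes that particular layout into the file.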

jaepi · 09-05-2011, 09:51 PM

Hello, everyone. Sorry for the late reply, I've been very busy lately x_X. It turns out that I don't need to convert the content of the .bin file, since it was written using a char* buffer (my bad). The header, which contains all the important information for me, is converted at run time in my app, so I don't have to worry about it. Although the application runs a little slower because of the conversion, this is still rather efficient because the binary file remains unchanged. Thanks for all the help, guys