How can I fix a Unicode error in my program?

Varister · 02-12-2022, 06:23 PM

I am trying to use a download.pcap file with this program, but when I run it, I keep getting a "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 0: invalid continuation byte" error. I am not sure how to fix this.

Code:

import dpkt
import optparse
import socket
THRESH = 1000

def findDownload(pcap):
    for (ts, buf) in pcap:
        try:
            eth = dpkt.ethernet.Ethernet(buf)
            ip = eth.data
            src = socket.inet_ntoa(ip.src)
            tcp = ip.data
            http = dpkt.http.Request(tcp.data)
            if http.method == 'GET':
                uri = http.uri.lower()
                if '.zip' in uri and 'loic' in uri:
                    print ('[!] ' + src + ' Downloaded LOIC.')
        except:
            pass

def findHivemind(pcap):
    for (ts, buf) in pcap:
        try:
            eth = dpkt.ethernet.Ethernet(buf)
            ip = eth.data
            src = socket.inet_ntoa(ip.src)
            dst = socket.inet_ntoa(ip.dst)
            tcp = ip.data
            dport = tcp.dport
            sport = tcp.sport
            if dport == 6667:
                if '!lazor' in tcp.data.lower():
                    print ('[!] DDoS Hivemind issued by: '+src)
                    print ('[+] Target CMD: ' + tcp.data)
            if sport == 6667:
                if '!lazor' in tcp.data.lower():
                    print ('[!] DDoS Hivemind issued to: '+src)
                    print ('[+] Target CMD: ' + tcp.data)
        except:
            pass

def findAttack(pcap):
    pktCount = {}
    for (ts, buf) in pcap:
        try:
            eth = dpkt.ethernet.Ethernet(buf)
            ip = eth.data
            src = socket.inet_ntoa(ip.src)
            dst = socket.inet_ntoa(ip.dst)
            tcp = ip.data
            dport = tcp.dport
            if dport == 80:
                stream = src + ':' + dst
                if pktCount.has_key(stream):
                    pktCount[stream] = pktCount[stream] + 1
                else:
                    pktCount[stream] = 1
        except:
            pass

    for stream in pktCount:
        pktsSent = pktCount[stream]
        if pktsSent > THRESH:
            src = stream.split(':')[0]
            dst = stream.split(':')[1]
            print ("[+] "+src+" attacked "+dst+" with " \
                + str(pktsSent) + " pkts.")

def main():
    parser = optparse.OptionParser("usage %prog '+\
      '-p <pcap file> -t <thresh>"
                              )
    parser.add_option("-p", dest='pcapFile', type="string",\
      help='specify pcap filename')
    parser.add_option("-t", dest="thresh", type="int",\
      help="specify threshold count ")

    (options, args) = parser.parse_args()
    if options.pcapFile == None:
        print (parser.usage)
        exit(0)
    if options.thresh != None:
        THRESH = options.thresh
    pcapFile = options.pcapFile
    f = open(pcapFile)
    pcap = dpkt.pcap.Reader(f)
    with open(pcapFile, 'rb') as f:
        pcap = dpkt.pcap.Reader(f)
        findDownload(pcap)
    with open(pcapFile, 'rb') as f:
        pcap = dpkt.pcap.Reader(f)
        findHivemind(pcap)
    with open(pcapFile, 'rb') as f:
        pcap = dpkt.pcap.Reader(f)
        findAttack(pcap)

if __name__ == "__main__":
   main()

pan64 · 02-13-2022, 03:52 AM

Without knowing what is in that pcap file it is hard to say anything; probably the message is correct and the file is corrupted. It would also be nice to post the full error message, not only one line.
On the other hand, you can pass an encoding to open like this (for example):

Code:

with open(filename, encoding="something") as datafile:
    # work on datafile here

NevemTeve · 02-13-2022, 08:36 AM

I suppose your file is a binary log of network traffic; it might contain any bytes, there is no point assuming any part of it is in utf8.
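
To illustrate the point, here is a minimal sketch using only the standard library. It assumes the capture is a classic little-endian pcap file, whose first byte is 0xd4, and it writes a made-up header-only file (demo_path and read_header are hypothetical names) rather than using the original download.pcap. Read in text mode, the file raises the same kind of UnicodeDecodeError; read in binary mode, the bytes come back untouched. In the posted main(), the open(pcapFile) call without 'rb' just before the first dpkt.pcap.Reader(f) is most likely what triggers the error.

```python
import os
import struct
import tempfile

# Classic little-endian pcap global header: magic 0xa1b2c3d4, version 2.4,
# timezone 0, sigfigs 0, snaplen 65535, link type 1 (Ethernet). 24 bytes.
PCAP_HEADER = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)


def read_header(path, binary):
    """Read the 24-byte header in binary mode or in UTF-8 text mode."""
    if binary:
        with open(path, "rb") as f:
            return f.read(24)
    with open(path, "r", encoding="utf-8") as f:
        return f.read(24)


# Write a tiny stand-in capture file (header only, no packets).
fd, demo_path = tempfile.mkstemp(suffix=".pcap")
with os.fdopen(fd, "wb") as f:
    f.write(PCAP_HEADER)

# Text mode tries to decode the bytes as UTF-8 and fails on the first one.
try:
    read_header(demo_path, binary=False)
    text_mode_error = None
except UnicodeDecodeError as exc:
    text_mode_error = exc

# Binary mode returns the raw bytes without any decoding.
raw = read_header(demo_path, binary=True)
os.remove(demo_path)

print(type(text_mode_error).__name__)
print(raw[:4].hex())
```

If this is indeed the cause, removing the text-mode open and keeping only the with open(pcapFile, 'rb') blocks should make the error go away.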

dugan · 02-13-2022, 11:03 AM

Generally speaking, you can sometimes deal with that by calling .decode('utf-8') or .encode('utf-8') on the strings. This works if the data is actually UTF-8.

If NevemTeve is correct (and he is), then the usual way to deal with it is to hex dump the data, not to print it.

I don't know if you're familiar with hex dumps, but there's a Wikipedia article about them:

https://en.wikipedia.org/wiki/Hex_dump

Do be aware that a common heuristic for determining whether a file is binary (as opposed to text) is to check whether it contains null bytes.
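
A rough sketch of that null-byte check, for illustration only. The function name looks_binary and the 8192-byte sample size are assumptions, not part of any library, and the check is a heuristic: a binary file without null bytes in the sampled region would be misclassified as text.

```python
def looks_binary(data: bytes, sample_size: int = 8192) -> bool:
    """Heuristic: treat the data as binary if the sample holds a null byte."""
    return b"\x00" in data[:sample_size]


# The start of a classic pcap header contains null bytes; plain text does not.
pcap_start = b"\xd4\xc3\xb2\xa1\x02\x00\x04\x00"
text_start = b"GET /loic.zip HTTP/1.1\r\n"

print(looks_binary(pcap_start))
print(looks_binary(text_start))
```

To apply it to a file, read the first few kilobytes with open(path, 'rb') and pass them in.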

sundialsvcs · 02-14-2022, 10:07 AM

There is, in fact, a hexdump utility which can, among other things, print the data in rows of hexadecimal bytes with the corresponding characters (if printable ASCII) beside them.
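
If you would rather do this from inside the Python program, something along these lines gives a similar layout. This is a simplified sketch, not a reimplementation of the hexdump utility; the hexdump function below is a hypothetical helper, and non-printable bytes are shown as dots.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as rows of offset, hex values and printable ASCII."""
    pad = width * 3 - 1  # width of a full row of "xx xx xx ..." text
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  |{text_part}|")
    return "\n".join(lines)


sample = b"GET /\x00\xd4"
dump = hexdump(sample)
print(dump)
```

In the posted program, printing hexdump(tcp.data) instead of tcp.data itself would avoid mixing str and bytes in the print calls.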

A raw data stream like this is certainly not UTF-8 encoded. Any algorithm that is told it is will try to decode it and probably fail ... although it may seem to succeed some of the time.
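
A small sketch of that "succeeds some of the time" behaviour, with made-up payloads (try_utf8 is a hypothetical helper). An ASCII-only payload decodes without complaint, while a payload with arbitrary bytes does not. One way to sidestep decoding altogether, assuming the payload is a bytes object as it is under Python 3, is to search for a bytes pattern instead of a str.

```python
def try_utf8(data: bytes):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Made-up payloads: one happens to be plain ASCII, the other is not text.
ascii_payload = b"PRIVMSG #loic :!LAZOR start"
binary_payload = b"\xd4\xc3\xb2\xa1\x02\x00\x04\x00"

decoded_ok = try_utf8(ascii_payload)
decoded_bad = try_utf8(binary_payload)

# Searching with a bytes pattern needs no decoding at all.
found = b"!lazor" in ascii_payload.lower()

print(decoded_ok)
print(decoded_bad)
print(found)
```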