How to map raw data in a ".txt" file to JSON format

Hello,

I have to send logs to an external system over HTTP, as JSON. The issue is that the data is in syslog format, saved in a .txt file, and I only need some of the fields from it. Does anyone have an idea of how I can do this? As I understand it, I need to open the file, go through each line, and map that data to the "key: value" pairs of the JSON body.

Regards,

You do this by parsing the text line and constructing a Python dict with
the fields. Then use json.dumps() to transcribe that dict as JSON
(unless your HTTP library has a handy presupplied method for this, which
is possible because sending data in JSON form is very common).

It would help to see some of the sample data you have. Here’s a line

Jan  9 00:17:01 borg CRON[3550686]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)

You can see it has a timestamp, a hostname, a service[pid] field, a colon
and a space and the log information.

The timestamp is 3 words (“Jan”, “9”, “00:17:01”), the hostname is the
next word, and the service[pid] part is the next word. You can split()
the line to get these leading words and leave the log text as the final
unsplit string. Then parse the timestamp into a datetime with strptime,
or leave it alone. Then you might construct a dict with these keys:

{ "timestamp": "Jan  9 00:17:01",
  "hostname": "borg",
  "service": "CRON",
  "pid": 3550686,
  "logline": "(root) CMD (   cd / && run-parts --report /etc/cron.hourly)",
}
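
As a rough sketch of the above (not the only way to do it), this splits my
example line and builds that dict. The names "tag" and "record" are just
ones I made up here, and it assumes every line has the service[pid]: part:

```python
import json

line = "Jan  9 00:17:01 borg CRON[3550686]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)"

# Split on runs of whitespace at most 5 times; the remainder stays unsplit.
mon, mday, hhmmss, hostname, tag, logline = line.split(None, 5)

# "CRON[3550686]:" -> service "CRON" and pid 3550686
service, _, pid_part = tag.rstrip(":").partition("[")

record = {
    # Rejoined with single spaces, so the doubled space in "Jan  9" is lost.
    "timestamp": f"{mon} {mday} {hhmmss}",
    "hostname": hostname,
    "service": service,
    # Some services log no [pid]; store None in that case.
    "pid": int(pid_part.rstrip("]")) if pid_part else None,
    "logline": logline,
}

body = json.dumps(record)  # the JSON text to send
```

The "body" string is what you would hand to your HTTP library as the
request body.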

Write some code for that, see how you go.

Cheers,
Cameron Simpson cs@cskk.id.au

Hello, Cameron. I really appreciate your response. I did what you mentioned, but I ran into some issues. I have written code somewhat similar to what you posted.

  1. I declare variables.
  2. I’m using a function to open the “.txt” file and search in it.
  3. My idea is to pass each variable to the JSON body.

In the second point, I have some issues with split(). As an example, in the “.txt” file I have:

srcip=172.16.2.15 srcname="PC-1"

With split():

rec.split('=')[11]

If I do that I get 172.16.2.15 srcname as a result, so it gives me part of the next field. How can I make the value end at the space?

Thanks in advance.

Regards,

Hello, Cameron. I really appreciate your response. I did what you
mentioned, but I ran into some issues. I have written code somewhat
similar to what you posted.

  1. I declare variables.
  2. I’m using a function to open the “.txt” file and search in it.
  3. My idea is to pass each variable to the JSON body.

In the second point, I have some issues with split(). As an example, in the “.txt” file I have:

srcip=172.16.2.15 srcname="PC-1"

Your log lines do not look like mine. Can you paste some complete
example lines?

With split():

rec.split('=')[11]

Yes, split on “=” is a poor fit for the above. I was suggesting split()
with no args (split on whitespace) just to rip off a fixed number of
“words” from the log lines. It sounds like a poor fit for your data.
Please show us some example lines.

If I do that I get 172.16.2.15 srcname as a result, so it gives me part of the next field. How can I make the value end at the space?

By leaving off the arguments, or giving None (or " ") as the delimiter.
However, I expect there will be some strings inside quotes in your lines
which contain spaces themselves, so this won’t work well. Ideally your
log lines will be in a well defined format, even better a “standard” one
for which a parser exists already. Otherwise you’ll need to write
something special.
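
To illustrate the problem with quoted values, here is a small sketch. The
record is invented in the style of your snippet. shlex.split() is a
standard library function which splits using shell quoting rules; whether
those rules suit your real data is an assumption you would need to check:

```python
import shlex

rec = 'srcip=172.16.2.15 srcname="PC-1" devtype="Windows PC"'

# Plain whitespace splitting cuts the quoted "Windows PC" value in two.
naive = rec.split()

# shlex.split() keeps a quoted value together and removes the quotes.
words = shlex.split(rec)

# Split each word once on "=" to get (name, value) pairs for a dict.
fields = dict(word.split("=", 1) for word in words)
```

Here "naive" has 4 items, because "Windows PC" was split, while "words"
has the expected 3. Values containing apostrophes or backslashes would
confuse shlex, which is one reason a custom parser may be the safer choice.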

Start with example lines, and show us your current code (even though it
does not yet work).

Cheers,
Cameron Simpson cs@cskk.id.au

Hello,

I really don’t know if I can change the format, because it is taken from a syslog server. I do not configure the syslog. Below is an example of the format that I receive:

Jan 13 16:54:45 192.168.0.20 date=2022-01-13 time=16:51:15 devname="FG100E4Q16000698" devid="FG100E4Q16000698" logid="0001000014" type="traffic" subtype="local" level="notice" vd="root" eventtime=1642114275 srcip=172.16.2.15 srcname="PC-1" srcport=58871 srcintf="lan" srcintfrole="lan" dstip=192.168.0.20 dstport=443 dstintf="root" dstintfrole="undefined" sessionid=56340 proto=6 action="close" policyid=5 policytype="local-in-policy" service="HTTPS" dstcountry="Reserved" srccountry="Reserved" trandisp="noop" app="Web Management(HTTPS)" duration=1 sentbyte=724 rcvdbyte=313 sentpkt=5 rcvdpkt=4 appcat="unscanned" devtype="Windows PC" osname="Windows" osversion="NT 10.0" mastersrcmac="00:00:ee:67:47:39" srcmac="00:00:ee:67:47:39" srcserver=0

All the logs follow that structure.

I tried splitting on spaces (" "), but I just need the values of specific keys in the JSON. If I split only on spaces, the JSON will not work well in the system where I have to send the data. For example:

srcip=172.16.2.15

(In the above example, I just need the IP address).

Regards,

I really don’t know if I can change the format, because it is taken from
a syslog server. I do not configure the syslog.

I wasn’t asking you to change the format. The text you cited had no
timestamp etc.

Below is an example of the format that I receive:

Jan 13 16:54:45 192.168.0.20 date=2022-01-13 time=16:51:15 devname="FG100E4Q16000698" devid="FG100E4Q16000698" logid="0001000014" type="traffic" subtype="local" level="notice" vd="root" eventtime=1642114275 srcip=172.16.2.15 srcname="PC-1" srcport=58871 srcintf="lan" srcintfrole="lan" dstip=192.168.0.20 dstport=443 dstintf="root" dstintfrole="undefined" sessionid=56340 proto=6 action="close" policyid=5 policytype="local-in-policy" service="HTTPS" dstcountry="Reserved" srccountry="Reserved" trandisp="noop" app="Web Management(HTTPS)" duration=1 sentbyte=724 rcvdbyte=313 sentpkt=5 rcvdpkt=4 appcat="unscanned" devtype="Windows PC" osname="Windows" osversion="NT 10.0" mastersrcmac="00:00:ee:67:47:39" srcmac="00:00:ee:67:47:39" srcserver=0

All the logs follow that structure.

Ok. This is better. We can use part of my suggested approach, then write
some custom parsing.

So, suppose the above is in a variable “line”. You can write this:

mon, mday, hhmmss, hostpart, logline = line.split(None, 4)

which will split the line on whitespace up to 4 times. That gets the
fixed month, month day, time, hostpart into distinct variables and the
rest into “logline”. Then you parse “logline” specially as a sequence of
assignment statements.
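
For instance, on a shortened copy of your line (I dropped most of the
fields to keep it readable), that split behaves like this. The length
check is my own addition, as a guard against truncated lines:

```python
# A shortened copy of the sample line; the real one has many more fields.
line = ('Jan 13 16:54:45 192.168.0.20 date=2022-01-13 time=16:51:15 '
        'devname="FG100E4Q16000698" srcip=172.16.2.15 srcname="PC-1"')

parts = line.split(None, 4)
if len(parts) != 5:
    # A short or empty line would otherwise fail on the unpacking below.
    raise ValueError(f"unexpected log line: {line!r}")

mon, mday, hhmmss, hostpart, logline = parts

print(hostpart)  # 192.168.0.20
print(logline)   # everything from date=2022-01-13 onward, unsplit
```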

So looking at your line you seem to have either:

varname="quoted text here"

or:

varname=nonwhitespace-here

This is probably most handily parsed with regular expressions. Do you
know anything about them?

Basically the approach is:

  • make an empty dict to hold the field values
  • while logline is not empty, match a regexp for a single assignment
  • store the varname and value in the dict, with the varname as the key
  • set logline to the text after the matched regexp
  • repeat until the logline is empty or the regexp does not match

After that your dict will have all the values from the line, ready for
use in your JSON.
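
A minimal sketch of that loop follows. The regexp is my guess at your
format from the one sample line (a name, "=", then either a double-quoted
value or a run of non-whitespace), and the names ASSIGNMENT_RE and
parse_assignments are made up for this example:

```python
import re

# One assignment: a name, "=", then either a double-quoted value or a run
# of non-whitespace characters, followed by optional trailing whitespace.
ASSIGNMENT_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S*))\s*')

def parse_assignments(logline):
    """Return a dict of the varname=value pairs at the start of logline."""
    fields = {}
    logline = logline.lstrip()
    while logline:
        m = ASSIGNMENT_RE.match(logline)
        if m is None:
            break  # unparsed text remains; a real program might report it
        varname, quoted, bare = m.groups()
        # Exactly one of the two value groups matched; keep that one.
        fields[varname] = quoted if quoted is not None else bare
        # Continue with the text after the matched assignment.
        logline = logline[m.end():]
    return fields

logline = 'srcip=172.16.2.15 srcname="PC-1" app="Web Management(HTTPS)" srcserver=0'
fields = parse_assignments(logline)
print(fields["srcip"])  # 172.16.2.15
print(fields["app"])    # Web Management(HTTPS)
```

All the values come out as strings here, with the quotes already removed.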

Cheers,
Cameron Simpson cs@cskk.id.au

Ok, ok. Understood. I know a little about regex. I used it in an Oracle database course. But I have never used it in Python.

I will investigate a little more about what you mention.

Cheers,

Hello, Cameron.

I used regex and now the mapping works as I want. Perhaps I have to optimize it, but for now it is working.

I really appreciate your time👍.

Cheers,

Please paste code directly into your message between triple backticks
(```). That way we can see it and cut/paste it. Having gone to the forum
(I interact via email) and looked at your screenshot I have a few
remarks:

You’re splitting the logline on " ". This works poorly because some of
the fields themselves contain spaces.

You’re doing it for every variable of interest. Do it once, then pick
out what you want:

logwords = logline.split(" ")
local_ip = logwords[10]
... etc etc ...

Your approach depends on the fields always being in the exact same
order, and all of them being present. I would feel uncomfortable about
that. And because you don’t check the field part you’ll not know if this
is ever not the case unless you discover some bad data by accident.
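
If you do stay with positions, a check along these lines would at least
tell you when a line does not match your expectations. This is only a
sketch; field_at is a hypothetical helper, and the short sample record is
invented:

```python
def field_at(logwords, index, expected_name):
    """Return the value of logwords[index], insisting on its field name."""
    name, sep, value = logwords[index].partition("=")
    if sep != "=" or name != expected_name:
        raise ValueError(
            f"word {index} is {logwords[index]!r}, expected {expected_name}=..."
        )
    return value

logwords = "date=2022-01-13 time=16:51:15 srcip=172.16.2.15".split(" ")
src_ip = field_at(logwords, 2, "srcip")  # fine: word 2 is the srcip field
# field_at(logwords, 1, "srcip") would raise ValueError: word 1 is time=...
```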

I was thinking of writing a regexp to recognise the varname=blah, and to
use the matched varname part as the dict key and to use blah as the
value, possibly after tidying it (eg stripping quotes, or converting a
date to a real date or something; it depends on your needs, of course).
Then you would not be dependent on the ordering and position of the
fields.
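
Something like the following sketch is what I have in mind. The regexp,
the tidying rules, and the choice of "wanted" keys are all assumptions
based on your one sample line, and FIELD_RE and parse_fields are names I
invented:

```python
import json
import re

# varname=value, where value is a double-quoted string or non-whitespace.
FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse_fields(logline):
    """Map every varname=value found in logline to a tidied value."""
    fields = {}
    for m in FIELD_RE.finditer(logline):
        varname, value = m.groups()
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]   # strip the surrounding quotes
        elif value.isdecimal():
            value = int(value)    # unquoted digits become a number
        fields[varname] = value
    return fields

logline = ('date=2022-01-13 time=16:51:15 srcip=172.16.2.15 srcname="PC-1" '
           'srcport=58871 osversion="NT 10.0"')
fields = parse_fields(logline)

# Pick out only the keys the receiving system needs (hypothetical choice).
wanted = ("srcip", "srcname", "srcport")
payload = {key: fields[key] for key in wanted if key in fields}
body = json.dumps(payload)
```

Because the lookup is by name, a missing or reordered field no longer
shifts every other value; a missing key is simply left out of the payload.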

Anyway, I’m glad you have it basically working.

Cheers,
Cameron Simpson cs@cskk.id.au