Look up and return values from several CSV files

Hi

I wouldn’t say that I know how to do that, but that is at least what I tried to achieve with a dict function. Somewhere, though, I’m making some mistakes.

You could simply use the dict.fromkeys() method that I’ve already demonstrated to create two intermediate dict objects, then merge the two:

trips_output = dict.fromkeys(trips_filter)
stops_output = dict.fromkeys(stops_filter)
output = {**trips_output, **stops_output}

…before then updating the output dictionary with the data which (as of now) is being displayed.
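
Something along these lines is what I have in mind for that update step (a sketch only; output is the merged dict from above, and todays_trips / todays_stops are the per-row dicts from the existing code, each holding the current row's values keyed by field name):

# Fill the merged output dict from whichever per-row dict holds each key.
for key in output:
    if key in todays_trips:
        output[key] = todays_trips[key]
    elif key in todays_stops:
        output[key] = todays_stops[key]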

I don’t know if there’s a better way to do that. I’m also not sure if the output (from which the JSON file will be constructed) could or should be done on the fly, or if it should be built in full before the JSON file is constructed and subsequently written out.

When reading data that could potentially consume large portions of computer resources, these factors need to be considered. This is why I’ve designed the code (thus far) to read only the data that needs to be processed, rather than reading all the data from both csv files in one hit: the output needs the same consideration.

I also feel that this process could be optimised by combining the JSON construction and output with the data read/processing, but that’s just a thought and a feeling based on what I’ve done thus far; this could turn out to be a blind alley.

Some constructive input from more experienced coders could prove valuable here.

I’ll have a think about all of the above, in the interim.

Also, make sure that what we have here is correct, in terms of the information input/output, as any change has the potential to scupper the project.

I will try to get some time for this over the upcoming weekend; hopefully I will.

Hi all

Today I messed around with the dict.fromkeys() method, but I can’t figure out how to make it return all trips in today’s traffic: I just get the first match. I tried different indentations, but none of my attempts work for all of the trip IDs in trips_filter; I only ever get one match.

from csv import reader as csv_read


def update(reader, dictionary, key):
    for item in next(reader):
        dictionary[next(key)] = item


def trip_data():
    trip_key = iter(todays_trips.keys())
    update(trips_reader, todays_trips, trip_key)


def stop_data():
    stop_key = iter(todays_stops.keys())
    update(stops_reader, todays_stops, stop_key)


trips = (trip_id for trip_id in open(
    "todays_trips.csv", mode="r", encoding="UTF-8"))

stops = (trip_id for trip_id in open(
    "stop_times.txt", mode="r", encoding="UTF-8"))

trips_reader = csv_read(trips)
stops_reader = csv_read(stops)

todays_trips = dict.fromkeys(next(trips_reader))
todays_stops = dict.fromkeys(next(stops_reader))


# driver and filters
trips_filter = ["trip_id", "operating_day_date", "dated_vehicle_journey_gid"]
stops_filter = ["arrival_time", "stop_id", "stop_headsign"]

trip_data()
stop_data()

for trips_fltr in trips_filter:
    print(f"{trips_fltr}: {todays_trips[trips_fltr]}")
    print()
while todays_trips["trip_id"] == todays_stops["trip_id"]:
    for stops_fltr in stops_filter:
        print(f"{stops_fltr}: {todays_stops[stops_fltr]}")
        print()
        stop_data()

Hi Marcus.

This is a working version:

from csv import reader as csv_read


def update(reader, dictionary, key):
    # Read the next csv row and map its fields onto the dict's keys, in order.
    # next() raises StopIteration once the reader (or the key iterator) is exhausted.
    for item in next(reader):
        dictionary[next(key)] = item


def trip_data():
    trip_key = iter(todays_trips.keys())
    update(trips_reader, todays_trips, trip_key)


def stop_data():
    stop_key = iter(todays_stops.keys())
    update(stops_reader, todays_stops, stop_key)


# Stream each file line by line into a csv reader, rather than reading it in full.
trips = (trip_id for trip_id in open(
    "todays_trips.csv", mode="r", encoding="UTF-8"))

stops = (trip_id for trip_id in open(
    "stop_times.txt", mode="r", encoding="UTF-8"))

trips_reader = csv_read(trips)
stops_reader = csv_read(stops)

# Consume each header row: the field names become the dict keys (values None for now).
todays_trips = dict.fromkeys(next(trips_reader))
todays_stops = dict.fromkeys(next(stops_reader))


# driver and filters
trips_filter = ["trip_id", "operating_day_date", "dated_vehicle_journey_gid"]
stops_filter = ["arrival_time", "stop_id", "stop_headsign"]

trip_data()
stop_data()

header = ""
output = ""

# A csv reader object is always truthy, so this loop only exits via the breaks below.
while stops_reader:
    print("Data from trips filter...")
    for trips_fltr in trips_filter:
        print(f"{trips_fltr}: {todays_trips[trips_fltr]}")

    print("---------------------------")
    print("Data from stops filter...")
    print()
    while todays_trips["trip_id"] == todays_stops["trip_id"]:
        for stops_fltr in stops_filter:
            print(f"{stops_fltr}: {todays_stops[stops_fltr]}")
        print()
        try:
            stop_data()
        except StopIteration:
            input("End of stops data.\nPress enter to view any remaining trips data.")
            print()
            break
    try:
        trip_data()
    except StopIteration:
        print("End of trips data.")
        print("Script exit.")
        break

I’m looking into the JSON construction today, based on the output we have here.


This is a sample JSON file; I need you to parse it and make sure that it is what is needed.

{
  "trip_id": {
    "55700000068603293": {
      "operating_day_date": "20230104",
      "dated_vehicle_journey_gid": {
        "9015005000200001": {
          "stop_id": {
            "9022005081011017": {
              "arrival_time": "04:43:00",
              "stop_headsign": "Kvarnberget"
            },
            "9022005000003016": {
              "arrival_time": "04:45:00",
              "stop_headsign": "Kvarnberget"
            },
            "9022005001012016": {
              "arrival_time": "04:45:00",
              "stop_headsign": "Kvarnberget"
            }
          }
        }
      }
    },
    "55700000068603314": "to-do"
  }
}

As is, the script is going to need re-working so that the output can be used to build the JSON file, because right now it’s not suited for that.

I had it in mind that this would be the case as I’ve never coded for JSON before and had little idea about the required format. Now that I’ve looked into it, I can see why this script will need re-coding; the basics seem sound enough, so it’s just the output to re-code.
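
To make the target concrete, this is roughly how I picture the re-coded output step, as a sketch only: build a nested dict in the shape of the sample above, then serialise it in one hit. The output.json name is just a placeholder, and the values shown are lifted straight from the sample:

import json

output = {"trip_id": {}}

# One node per trip, keyed on its trip_id.
trip_id = "55700000068603293"
journey_gid = "9015005000200001"
output["trip_id"][trip_id] = {
    "operating_day_date": "20230104",
    "dated_vehicle_journey_gid": {journey_gid: {"stop_id": {}}},
}

# For every matching stops row, add a node keyed on its stop_id.
stops_node = output["trip_id"][trip_id]["dated_vehicle_journey_gid"][journey_gid]["stop_id"]
stops_node["9022005081011017"] = {
    "arrival_time": "04:43:00",
    "stop_headsign": "Kvarnberget",
}

with open("output.json", mode="w", encoding="UTF-8") as json_file:
    json.dump(output, json_file, indent=2)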


@xxxxl
There’s a question in my mind: did you want the JSON file to be as I’ve posted (above) or maybe one file per trip_id? This is fundamental to the re-design.

If it’s going to be one file per trip_id, then the id number could be used for the file name, so (using the data we have here) you would have 55700000068603293.json and 55700000068603314.json and all the other nodes would be moved up by one level.
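
In code, that option would be little more than this (a sketch; trip_node stands in for whatever dict we end up building for one trip, and the id is from the sample data):

import json

trip_id = "55700000068603293"
trip_node = {"operating_day_date": "20230104"}  # stand-in for the full trip data

# The trip_id becomes the file name, so the top-level "trip_id" node is no longer needed.
with open(f"{trip_id}.json", mode="w", encoding="UTF-8") as json_file:
    json.dump(trip_node, json_file, indent=2)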

Hi Rob

Once again, thanks for all the help. The script didn’t return any data from the stops filter; is it supposed to do that?

Data from trips filter...
trip_id: 55700000067904833
operating_day_date: 20230104
dated_vehicle_journey_gid: 9015005077500002
---------------------------
Data from stops filter...

Data from trips filter...
trip_id: 55700000067904849
operating_day_date: 20230104
dated_vehicle_journey_gid: 9015005077500004

The JSON design I prefer would be stop_id as the filename or, if that’s hard to achieve, as the first level in the JSON. The second level would be trip_id, containing all the data as in your example. Then I of course need to understand all this so I can continue to work with the other files and, even more exciting, parse the realtime data feed for each trip/stop combination.

You’re welcome Marcus.

No. You should be getting data from the stop_times.txt file also.

This means that the code line (50, is it?) that reads…
while todays_trips["trip_id"] == todays_stops["trip_id"]:
…is not finding a match. Remember: the two files must have a common link, namely the trip_id number. In the data we have here, only the first two lines of the trips.csv file have a corresponding link to the stop_times.txt file. I see no match at all, in either file (posted here), for trip_id numbers 55700000067904833 and 55700000067904849, so my basic assumption is that you have a data mismatch in the files that you are using to test this.

Re your JSON proposal: it’s not impossible, but it’s not easy. The stop_id field is not the primary key here. It can be indexed, based on the stop_sequence number, but in terms of coding, this script was not designed to do that.
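
To show what I mean (a sketch of the shape only, not something the current script produces; the sequence numbers here are made up):

# Hypothetical shape: stop_sequence as the key, stop_id demoted to a plain field.
stops_for_trip = {
    "1": {"stop_id": "9022005081011017", "arrival_time": "04:43:00", "stop_headsign": "Kvarnberget"},
    "2": {"stop_id": "9022005000003016", "arrival_time": "04:45:00", "stop_headsign": "Kvarnberget"},
}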

Good morning

I can copy a trip_id from todays_trips and search for, and find, a match in stop_times.txt. I uploaded the complete files to my Google Drive; it should work for reading those files.

If the JSON is problematic, then we can just use your suggestion, but I think we can keep it in one file.

Google Drive for data source files

Good morning and thank you.

It’s all good at my end.

A sample of the output I’m seeing…

Data from trips filter...
trip_id: 55700000068603293
operating_day_date: 20230104
dated_vehicle_journey_gid: 9015005000200001
---------------------------
Data from stops filter...

arrival_time: 04:43:00
stop_id: 9022005081011017
stop_headsign: Kvarnberget

arrival_time: 04:45:00
stop_id: 9022005000003016
stop_headsign: Kvarnberget

arrival_time: 04:46:18
stop_id: 9022005001012016
stop_headsign: Kvarnberget

...

Data from trips filter...
trip_id: 55700000068603314
operating_day_date: 20230104
dated_vehicle_journey_gid: 9015005000200003
---------------------------
Data from stops filter...

arrival_time: 04:58:00
stop_id: 9022005081011017
stop_headsign: Kvarnberget

arrival_time: 05:00:00
stop_id: 9022005000003016
stop_headsign: Kvarnberget

arrival_time: 05:01:18
stop_id: 9022005001012016
stop_headsign: Kvarnberget

...

Are you not seeing this?


So, making the file read terminate when the data no longer has a key match is no more than a minor change, maybe with a simple message to indicate as much; I’ll keep it in mind.

I’ll work on the JSON part and post back as and when.

Hi

No, I’m just seeing this: only the trips filter data, with nothing from the stops filter. I have looked at the owner and permissions for both files and can’t see anything obvious.

Data from trips filter...
trip_id: 55700000065686306
operating_day_date: 20230104
dated_vehicle_journey_gid: 9015005041300011
---------------------------
Data from stops filter...

Data from trips filter...
trip_id: 55700000067835164
operating_day_date: 20230104
dated_vehicle_journey_gid: 9015005041300003
---------------------------
Data from stops filter...

Data from trips filter...
trip_id: 55700000067835205
operating_day_date: 20230104
dated_vehicle_journey_gid: 9015005041300009
---------------------------
Data from stops filter...

Data from trips filter...
trip_id: 55700000067835353
operating_day_date: 20230104
dated_vehicle_journey_gid: 9015005041300013
---------------------------
Data from stops filter...

Data from trips filter...
trip_id: 55700000065686356
operating_day_date: 20230104
dated_vehicle_journey_gid: 9015005041300014
---------------------------
Data from stops filter...

Data from trips filter...
trip_id: 55700000067368675
operating_day_date: 20230104
dated_vehicle_journey_gid: 9015005041300002
---------------------------
Data from stops filter...

Data from trips filter...
trip_id: 55700000067835434
operating_day_date: 20230104
dated_vehicle_journey_gid: 9015005041300012

Again, a data mismatch: once you get past trip_id 55700000068603668, it seems there are no more key matches. As I said, I’ll have the file read routine terminate when that situation occurs.

I’ve not fully tested that hypothesis, and there’s a lot of data (16.8 MB) in the .txt file.

I also think I have a bug in my script; it’s not that it’s reading past the EOF, it’s just not terminating the read of one file when there’s no matching data in the other.


Is there no way for you to ensure that the data is fully matched, when you d/load the files? Doing that would save on having to verify that each record has a key match in both files.

I’ve run a file test. Right now there are 3486 keys in the stops file, with no corresponding key in the trips file.
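
For reference, that kind of key check can be done along these lines (a sketch only: it reads both files in full via csv.DictReader, so it’s for checking, not part of the main script):

from csv import DictReader

with open("todays_trips.csv", mode="r", encoding="UTF-8") as trips_file:
    trip_ids = {row["trip_id"] for row in DictReader(trips_file)}

with open("stop_times.txt", mode="r", encoding="UTF-8") as stops_file:
    stop_trip_ids = {row["trip_id"] for row in DictReader(stops_file)}

# Keys in the stops file that have no corresponding key in the trips file.
unmatched = stop_trip_ids - trip_ids
print(f"{len(unmatched)} keys in the stops file have no match in the trips file.")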

Hi

One key issue is just that the files provided by the government come as-is, and they provide all trips for, let’s say, 6 months. To work out which trips are valid for one specific date, the first step is to filter all trips down to that date, which I did, and then run the whole matching process.

However, it might be that my preprocessing is making it worse. I will upload the whole source for todays_trips.csv, which is originally called trips_dated_vehicle_journey.txt; however, even with that file I do not see any data from the stops filter. It must be something funny in my setup.

Maybe that’s better, and maybe the JSON files could be one per day instead. I just uploaded the full data source.

Yes, I think that the JSON needs some kind of a boundary, lest we end up having a huge file, but before I get to that, I’ll do a quick re-work of the existing script, so that it does not include data for which there are no matched keys.


So, I looked at the date stamps and all of them are the same: 5151 records in trips.csv, dated 2023-01-04.

So, as is, the JSON file will have 5151 top-level nodes. If this data is for human consumption, that’s an overwhelming amount of data.

Hi

Yes, that would be true, but the files wouldn’t be for human eyes, just for machine reading. All of this sure seems like quite a complicated task.

Yes, I agree.

From what you’ve said here, I feel that the method being used is incorrect for the task at hand: there are better ways to share data between computer systems. As I intimated before, a relational database would be (IMHO) the way to approach this task.

Okay

I understand your point about a relational database; is there any good way to get that done? I will also need to make calls to its output from a website that should visualize this data for a couple of stops.

Ideally, and IMHO, the data (from the source) would be accessed via an API, rather than a d/loaded csv file. Then, getting said data into a database would not be a huge task; the bigger task would be the data processing part.
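
To give a feel for the database route, here’s a minimal sqlite3 sketch under a few assumptions: the gtfs.db file name and the narrow table layouts are mine, and it loads both csv files in full, so it’s a starting point rather than a drop-in solution:

import sqlite3
from csv import DictReader

connection = sqlite3.connect("gtfs.db")  # illustrative database file name

# Two narrow tables, holding only the fields we filter on.
connection.execute(
    "CREATE TABLE IF NOT EXISTS trips "
    "(trip_id TEXT PRIMARY KEY, operating_day_date TEXT, dated_vehicle_journey_gid TEXT)"
)
connection.execute(
    "CREATE TABLE IF NOT EXISTS stops "
    "(trip_id TEXT, arrival_time TEXT, stop_id TEXT, stop_headsign TEXT)"
)

with open("todays_trips.csv", mode="r", encoding="UTF-8") as trips_file:
    connection.executemany(
        "INSERT OR REPLACE INTO trips VALUES (?, ?, ?)",
        ((row["trip_id"], row["operating_day_date"], row["dated_vehicle_journey_gid"])
         for row in DictReader(trips_file)),
    )

with open("stop_times.txt", mode="r", encoding="UTF-8") as stops_file:
    connection.executemany(
        "INSERT INTO stops VALUES (?, ?, ?, ?)",
        ((row["trip_id"], row["arrival_time"], row["stop_id"], row["stop_headsign"])
         for row in DictReader(stops_file)),
    )

connection.commit()

# The key match then becomes a simple join, rather than two lock-stepped readers.
for record in connection.execute(
    "SELECT t.trip_id, s.arrival_time, s.stop_id, s.stop_headsign "
    "FROM trips AS t JOIN stops AS s ON s.trip_id = t.trip_id LIMIT 5"
):
    print(record)

connection.close()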