Lookup and return values from several csv files

Hi all

I’m new to python and not a great coder at all, but I have a mission to try working with publictransport gtfs data stream. I’ve gotten that far that I can download and print data from the files and stream one would like to. My proble is that I have several csv files that needs to be combined into one, but I don’t need all data and need to have some functionality close to excel xlookup.

I’ve tried alot off different methods but can not get the right output. Heres one of all codes I’ve tried and output for for items2 in dated_trips: looks good and then I need to lookup row contaning the values of any trip_id that exist in dated_trips in key dated_trips in dated_traffic

import csv
infile = 'todays_trips.csv'
stop_times = 'stop_times.txt'
strings = {"20230104","crtiteria2"}
newlines = ['trip_id,operating_day_date,dated_vehicle_journey_gid,journey_number','\n']



dated_traffic = []

with open(stop_times, mode='r') as input_file:
    rows = []
    for row in input_file:
        rows.append(row.rstrip('\n').split(","))
    keys = rows[0]
    for values in rows[1:]:
        dated_traffic.append(dict(zip(keys, values)))
    for items in dated_traffic:
        print(items['trip_id'],items['arrival_time'],items['stop_headsign'])

dated_trips = []

with open(infile, mode='r') as input_file:
    rows = []
    for row in input_file:
        rows.append(row.rstrip('\n').split(","))
    keys = rows[0]
    for values in rows[1:]:
        dated_trips.append(dict(zip(keys, values)))
    for items2 in dated_trips:
        print(items2['trip_id'],items2['operating_day_date'],items2['dated_vehicle_journey_gid'])
        #print(items.keys())
        #print(dated_trips.values())

intersection = []
for item in dated_traffic.keys():
    if dated_trips.has_key(item):
        intersection.append(item)
print("intersections: ". intersection)

error measseg after successfull print of trip_id operating_day_date dated_vehicle_journey_gid

Traceback (most recent call last):
File “\NAS\Filer\Trafiklab\filter stop times.py”, line 36, in
for item in dated_traffic.keys():
AttributeError: ‘list’ object has no attribute ‘keys’

Is there anyone able and willing to help me

/Marcus

A list does not have a keys() method. I think can just do for item in dated_ traffic:

You do not need to use has_key.

if item in dated_trip:

I think I see what you’re trying to do, but I’m unsure, as there are one or two fundamental errors in your script: as you say, you are new to Python, so errors are to be expected.

Given the above, I think you should take a step back and rethink your project, breaking it down into smaller parts and get each part working before moving on to the next.

  1. get the input
  2. process the data
  3. present the output

… should be the general flow. With a large data set, (which seems to be a .csv file, but you also have a .txt file, so I’m confused here and you are not using the csv import, which is adding to my confusion) the input can be line by line, but with smaller data sets, the entire file can get read in and the file closed.

On the plus side, you have used code formatting (+1 for that) and you have a file handler (well two in fact) coded so another +1. You also seem to know some useful object methods (such as .append(), .strip(), .split() and the like) so you’re good with the basics.

I would need a data sample, which does not need to be large; just large enough to be useful, maybe 10 lines or so, which needs to be posted in a formatted block, just as you’ve done with the Python code.

Forgot to say: the last line of code should be print("intersections: ", intersection) Notice the comma, rather than a period, but we’ll get to that as and when.

Another edit to add: I think that you’re mixing a method from Python2 with your Python3 script, namely if dated_trips.has_key(item): I’m as sure as I can be that .has_key() has been deprecated.

Thanks for your input I made some changes and got two list that prints good, but last piece here is to put the two together and I would need to do this for more sources and in different steps to achive what I wouldlike to. Maybe thats justthat I should hav an indentation here that I miss, so print(intersections) seem to keep going in a infinity loop

import csv
infile = 'todays_trips.csv'
stop_times = 'stop_times.txt'
strings = {"20230104","crtiteria2"}
newlines = ['trip_id,operating_day_date,dated_vehicle_journey_gid,journey_number','\n']



dated_traffic = []

with open(stop_times, mode='r') as input_file:
    rows = []
    for row in input_file:
        rows.append(row.rstrip('\n').split(","))
    keys = rows[0]
    for values in rows[1:]:
        dated_traffic.append(dict(zip(keys, values)))
    for items in dated_traffic:
        print(items['trip_id'],items['arrival_time'],items['stop_headsign'])

dated_trips = []

with open(infile, mode='r') as input_file:
    rows = []
    for row in input_file:
        rows.append(row.rstrip('\n').split(","))
    keys = rows[0]
    for values in rows[1:]:
        dated_trips.append(dict(zip(keys, values)))
    for items2 in dated_trips:
        print(items2['trip_id'],items2['operating_day_date'],items2['dated_vehicle_journey_gid'])
        #print(items.keys())
        #print(dated_trips.values())

intersection = []
for item in dated_traffic:
      if dated_trips:
      intersection.append(item)
      print(intersection)


Thanks for helping me out I’m willing to agree that basic understanding is better before getting into deep or to large datasets. I have google and found differnet tutorials but none really helped. So if yoy find anything strange and have the time for teaching me what I’m missing I will be glad. And I really like stremalined soulutions and my trys doesn’t seem to be anywhere close to that becaus the cpu fan is about to reach the moon.

To me it looks like the .txt file is valid .csv data(I don’t know if file suffix is crucial) my public transportation opendata feed is in GTFS format and filenames is .txt in the download process, but for todays_traffic I filter a period of traffic down to a specific date 2023-01-04 in my test envirionment. I would need to find all rows in stop_times.txt that contains trip_id key values from todays_trips.csv and choose wich data I will have on one row in one file per trip_trip id in todays_trips.csv Just to make sure that that jpourney is valid for that specific date. Later on I’m going to add on realtime updates for changes or cancellations. That process will be quite equal to this if I save data to a file or if I could find a method for working wit that data on the fly.

example data in todays_trips.csv

trip_id,operating_day_date,dated_vehicle_journey_gid,journey_number
55700000068603293,20230104,9015005000200001,1
55700000068603314,20230104,9015005000200003,3
55700000068603349,20230104,9015005000200011,11
55700000068603668,20230104,9015005000200013,13
55700000068603703,20230104,9015005000200015,15
55700000068603738,20230104,9015005000200017,17
55700000068603773,20230104,9015005000200019,19
55700000068603808,20230104,9015005000200021,21
55700000068603843,20230104,9015005000200023,23
55700000068605173,20230104,9015005000200025,25
55700000068605208,20230104,9015005000200027,27
55700000068605243,20230104,9015005000200029,29
55700000068605278,20230104,9015005000200031,31
55700000068605313,20230104,9015005000200033,33
55700000068605348,20230104,9015005000200035,35
55700000068605383,20230104,9015005000200037,37
55700000068605418,20230104,9015005000200039,39
55700000068605453,20230104,9015005000200041,41
55700000068605488,20230104,9015005000200043,43
55700000068605523,20230104,9015005000200045,45
55700000068605558,20230104,9015005000200047,47
55700000068605593,20230104,9015005000200115,115
55700000068605628,20230104,9015005000200117,117
55700000068605663,20230104,9015005000200119,119

example data in stop_time.txt

trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,pickup_type,drop_off_type,shape_dist_traveled,timepoint
55700000068603293,04:43:00,04:43:00,9022005081011017,1,Kvarnberget,3,1,0,1
55700000068603293,04:45:00,04:45:00,9022005000003016,2,Kvarnberget,3,3,494.27,1
55700000068603293,04:46:18,04:46:18,9022005001012016,3,Kvarnberget,3,3,931.86,0
55700000068603293,04:47:07,04:47:07,9022005001014016,4,Kvarnberget,3,3,1202.64,0
55700000068603293,04:48:55,04:48:55,9022005001129016,5,Kvarnberget,3,3,1807.93,0
55700000068603293,04:49:42,04:49:42,9022005001130016,6,Kvarnberget,3,3,2067.69,0
55700000068603293,04:50:53,04:50:53,9022005001095016,7,Kvarnberget,3,3,2465.14,0
55700000068603293,04:52:05,04:52:05,9022005000021016,8,Kvarnberget,3,3,2866.46,0
55700000068603293,04:53:28,04:53:28,9022005001096016,9,Kvarnberget,3,3,3331.45,0
55700000068603293,04:54:58,04:54:58,9022005001098016,10,Kvarnberget,3,3,3834.38,0
55700000068603293,04:55:54,04:55:54,9022005001099016,11,Kvarnberget,3,3,4145.4,0
55700000068603293,04:57:46,04:57:46,9022005000022016,12,Kvarnberget,3,3,4770.88,0
55700000068603293,05:01:00,05:01:00,9022005000023018,13,Kvarnberget,1,3,5320.68,1
55700000068603314,04:58:00,04:58:00,9022005081011017,1,Kvarnberget,3,1,0,1
55700000068603314,05:00:00,05:00:00,9022005000003016,2,Kvarnberget,3,3,494.27,1
55700000068603314,05:01:18,05:01:18,9022005001012016,3,Kvarnberget,3,3,931.86,0
55700000068603314,05:02:07,05:02:07,9022005001014016,4,Kvarnberget,3,3,1202.64,0
55700000068603314,05:03:55,05:03:55,9022005001129016,5,Kvarnberget,3,3,1807.93,0
55700000068603314,05:04:42,05:04:42,9022005001130016,6,Kvarnberget,3,3,2067.69,0
55700000068603314,05:05:53,05:05:53,9022005001095016,7,Kvarnberget,3,3,2465.14,0
55700000068603314,05:07:05,05:07:05,9022005000021016,8,Kvarnberget,3,3,2866.46,0
55700000068603314,05:08:28,05:08:28,9022005001096016,9,Kvarnberget,3,3,3331.45,0
55700000068603314,05:09:58,05:09:58,9022005001098016,10,Kvarnberget,3,3,3834.38,0
55700000068603314,05:10:54,05:10:54,9022005001099016,11,Kvarnberget,3,3,4145.4,0
55700000068603314,05:12:46,05:12:46,9022005000022016,12,Kvarnberget,3,3,4770.88,0

You’re welcome. Yes, both are indeed .csv files, the relationship being the trip_id field (so far as I quick look can show).

I need to get my head around what you’re trying to do, which may take me a little time.

I’ll spend as much time as I have free and it could be that Barry (who is a competent coder, from what I’ve seen him post) will have a better solution (or different one, least ways) than I. Or it could be that someone else will have some valuable input. One way or the other, given time, I’m sure that you’ll get some help with this.


Hey Marcus. Just a quick update:

I would use two generators to handle the files and only retain the information relating to whatever output you need. That way you’ll not use a whole load of computer resources for data that has already been processed.

This is not a full solution; rather a proof of concept that builds a dictionary object to hold one record of the data from todays_trips.csv and presents said data in the for: loop. This is where the (as yet unused) second generator (which reads the stop_times.txt file) would come into play and sync with the trip_id field (in theory).

I’m a little unsure of the exact structure of the output that you need and what you are going to do with said output (display it, write it to a file?), so maybe some insight could be provided?

I hope this goes some way to achieving your goal.

from csv import reader as csv_read

trips = (trip_id for trip_id in open(
    "todays_trips.csv", mode="r", encoding="UTF-8"))

times = (trip_id for trip_id in open(
    "stop_times.txt", mode="r", encoding="UTF-8"))

trips_reader = csv_read(trips)
stops_reader = csv_read(times)

todays_trips = dict.fromkeys(next(trips_reader))
todays_stops = dict.fromkeys(next(stops_reader))

# this is simply a throwaway loop to inspect the data
for trip in trips:
    data = trip.rstrip().split(",")
    for index, key in enumerate(todays_trips):
        todays_trips[key] = data[index]
    for item in todays_trips.items():
        print(f"{item[0]}: {item[1]}")
    print()

Note: given more time and the fact that I’ve not yet used the times generator, or maybe given some feedback from more experienced coders, the above code may or may not get dumped in favour of a rewrite. That said, I don’t think that the core concept is floored.

Edit for code update: now using the csv libary

Hi Again

That seems to work for a part of the need, I would really much need to get some more information to it

from stop_times.txt I need Stopheadsign and Arrivaltime how would that be possible?

And to clearify the long term goal is to produce a dataset in json that could be queried from a local html page and or published on my mqtt broker and have my home automation eventbus to subscribe to mqtt topics preferably via jsonpath. To produce the right dtaset that schould contain

from today_traffic
trip_id
operating_day_date
dated_vehicle_journey

from stop_times.txt
arrival_time
stop_id
stop_headsign

and then there would be atleast one more file that need to be included that contains lin number and destination and all that is for the more static data and two or more realtime data feeds to update just filterd data from the total static output.

Yes. What we have here, or rather what we need (essentially) is the essence of a one-to-many relational database with “trip_id” as the primary key, but (unless you plan on keeping the data, which seems unlikely) without data retention: correct me if I’m wrong.

So, the ‘one’ is coming from todays_trips.csv and the ‘many’ from stop_times.txt, that is to say there are many records in the .txt file that relate to one record in the .csv file. I’ll work on it, as it’s an interesting exercise.

As for the json part, we’ll put a pin in that from now and circle back to it.

I would be creating a class (TripDetails) to hold the data for one trip and create one for each trip_id. You could just use a dict to old the details if you are not used to creating classes.

That class would have all the information about a trip.

A dictionary can be made that maps the trip_id to the TripDetails object.

Then I would read the stop_times.txt file and add the information from that file into the TripDetails object. I think this is a list of (arrival_time, stop_id, stop_headsign).

I think this is the design that Rob was asking about.

As Rob says once you have the data loaded and you can check its right by printing it out we can deal with converting to JSON.

@xxxxl and @barry-scott

This code will iterate through the first dataset in the todays_trips.csv file, displaying the data from the trips_filter, while at the same time display the linked data from the stop_times.txt file, based on the stops_filter. Said filters can of course be set to whatever data is needed.

To update the data, I have two functions: trip_data() and stop_data(), which need to be called as and when. So, keep calling stop_data() until the trip_id fields no longer match, at which point trip_data() would need to be called to get the next dataset. As is, I’ve simply called on one dataset from the todays_trips.csv and as such this has not been fully tested.

I’ll leave it as is for now as I don’t know enough about JSON coding to contribute anything else, but I hope that what I have done can be used, at least in part.

from csv import reader as csv_read


def update(reader, dictionary, key):
    for item in next(reader):
        dictionary[next(key)] = item


def trip_data():
    trip_key = iter(todays_trips.keys())
    update(trips_reader, todays_trips, trip_key)


def stop_data():
    stop_key = iter(todays_stops.keys())
    update(stops_reader, todays_stops, stop_key)


trips = (trip_id for trip_id in open(
    "todays_trips.csv", mode="r", encoding="UTF-8"))

stops = (trip_id for trip_id in open(
    "stop_times.txt", mode="r", encoding="UTF-8"))

trips_reader = csv_read(trips)
stops_reader = csv_read(stops)

todays_trips = dict.fromkeys(next(trips_reader))
todays_stops = dict.fromkeys(next(stops_reader))


# driver and filters
trips_filter = ["trip_id", "operating_day_date", "dated_vehicle_journey_gid"]
stops_filter = ["arrival_time", "stop_id", "stop_headsign"]

trip_data()
stop_data()

for trips_fltr in trips_filter:
    print(f"{trips_fltr}: {todays_trips[trips_fltr]}")
print()
while todays_trips["trip_id"] == todays_stops["trip_id"]:
    for stops_fltr in stops_filter:
        print(f"{stops_fltr}: {todays_stops[stops_fltr]}")
    print()
    stop_data()

Update:

In the interests of a more complete and working model, I’ve added a couple of exception error handlers so that the script does not crash out when it hits the end of either of the data files.

I’m relatively new to generators, so it could very well be that there is a better alternative to the exception handler that I’ve coded, but it seems to work, so there’s that.

The replacement data output routine:

# driver and filters
trips_filter = ["trip_id", "operating_day_date", "dated_vehicle_journey_gid"]
stops_filter = ["arrival_time", "stop_id", "stop_headsign"]

trip_data()
stop_data()

while stops_reader:
    for trips_fltr in trips_filter:
        print(f"{trips_fltr}: {todays_trips[trips_fltr]}")
    print()
    while todays_trips["trip_id"] == todays_stops["trip_id"]:
        for stops_fltr in stops_filter:
            print(f"{stops_fltr}: {todays_stops[stops_fltr]}")
        print()
        try:
            stop_data()
        except StopIteration:
            input("End of stops data.\nPress enter to view any remaining trips data.")
            print()
            break
    try:
        trip_data()
    except StopIteration:
        print("End of trips data.")
        print("Script exit.")
        break

Thanks fpor all efforts of helping, first examplecode just tokk one trip id and listed all stoppoints not all jouneys, secodn example I didn’t know where to do replacement, I tried to replace the who #driver and filers section but did only return information from todays_trips.

Maybe a good method could be to just make one file per stop_id so that file contains all arrivals for that specifc stop.
/Marcus

In the data clips that you’ve posted, while very useful, are only samples and incomplete: only the first two records from the .csv file are used, because there is no corresponding .txt data for the others, which I’m guessing would not be the case for real world use.

If you want to see what data fields you can filter on, you can add some code to do that or you can have a look at the first lines (which are the headers) of each file.

For convenience, these are the header fields that can be in (or not in) each filter:

Trips filter.

trip_id
operating_day_date
dated_vehicle_journey_gid
journey_number


Stops filter.

trip_id
arrival_time
departure_time
stop_id
stop_sequence
stop_headsign
pickup_type
drop_off_type
shape_dist_traveled
timepoint

Let me know if there’s anything else I can help with.

Hi

Yes I thing I get the filter part I’ve used some similar in another script, however this script only runs once for the first common trip_id and doesn’t seem to loop trough all trip_id in .txt file an verify that they exist in .csv file, where I hope .csv file is the one side and .txt file is the many side containg one row for each stoppoint in that trip.

All trips for the whole period will be in the .txt file and if it exist in the .csv file which I already filtered down to just one day the trip is scheduled for that specific date. From the beginning .csv file did have all trips for the whole period as well, bu I just filtered that one down to one day and then need to find which trips are schedueld for that day.

Maybe I’dont need to filter down to one day but the almost static data may change without notice so one need to download the whole period everyday and keep working with it to be sure no changes in times or routes are made.

Maybe I’m misunderstanding the required output. I thought the goal was to link the data from each trip_id field.

Maybe you could show what kind of output you’re looking for as it’s essential to get that right before work starts on the JSON side of this.

I will try to explain. If I get help enough to solwe this then I have three more file merging kind of operations to get my head around but that will be my goal to thanks to your help be able to get to myself with similar or same method as this. So if Rob or anyone else manage to get a poc working and explain the concept a little I’ve learnd alot on this journey.

For each trip_id in stiptimes.txt that exist in todays_trips.csv or if one could work with today as variable and skip splitting trips.txt into todays_trips.csv via filtering function and

show all data in stop_times.txt (but be able to not show some of it like distance and so on and get dated_vehicle_journeygid and journey_number from todays_trips.csv

In the end I would prefer these grouped by stop_id in stop_times.txt

I will try to show an exampple below, just made with copy and paste, but I beleive it’s good if the method gets somewhat flexible so I can slice and dice data and grouping because there might be other purposes for this than just making my own html site containing arrivals and departures from my two most frequent stations.

stop_id,trip_id,arrival_time,departure_time,stop_sequence,stop_headsign,pickup_type,drop_off_type,shape_dist_traveled,timepoint
9022005081011017,55700000068603293,04:43:00,04:43:00,1,Kvarnberget,3,1,0,1
9022005001098016,55700000068603293,04:45:00,04:45:00,2,Central Station ,3,3,494.27,1

Thanks for trying to explain. However, the sample output that you’ve provided makes no use of the the data from .csv file and as such that output could be simply obtained from the .txt file alone, which is counter to your post on Feb 4th

Yes I see my example didn’t get good enough, now I made two rows tha is just like I try to get it where value 3 and 4 is from todays_trips.csv

stop_id,trip_id,dated_vehicle_journey_gid,journey_number,arrival_time,departure_time,stop_sequence,stop_headsign,pickup_type,drop_off_type
9022005081011017,55700000068603293,9015005000200001,1,04:43:00,04:43:00,1,Kvarnberget,3,1,0,1
9022005001098016,55700000068607063,9015005000200105,105,04:45:00,04:45:00,2,Central Station,3,1,0,1

I think that the way forward with this would be to write the results (the output) to a dictionary object, from which a JSON file could be produced. The reason I suggest this is because (from my understanding) there is a strong link between the two structures (that of a Python dictionary object and that of a JSON file) and as such, the JSON code would be relatively straightforward.

Your first post suggests to me that you could code that (at least in part), but I could be taking the wrong approach, so maybe if someone else (possibly @barry-scott) could either confirm my thoughts or suggest a better way; thanks.