Extracting Part of a link

I have an img tag that looks like this:

<img class="mlb-player-image lazyloaded" data- 
srcset="/remote.axd/www.milb.com/images/682626/generic/180x270/682626.jpg? 
preset=teamPlayerCard 1x, 
/remote.axd/www.milb.com/images/682626/generic/180x270/682626.jpg? 
preset=teamPlayerCard@2x 2x" alt="Francisco Alvarez" src="/img/silhouette.gif" 
srcset="" style="">

What code would extract the number that follows /images/ (682626 in this example)?

Thank you
GMD

I’d suggest taking a look at the html.parser module for parsing the tag, and the re module for extracting text.
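
As a rough sketch of the re part of that suggestion: the function name extract_player_id is made up for illustration, and the pattern assumes the ID is always the run of digits right after /images/. The sample string is the data-srcset value from the question.

```python
import re

# Attribute value copied from the img tag in the question.
srcset = (
    "/remote.axd/www.milb.com/images/682626/generic/180x270/682626.jpg"
    "?preset=teamPlayerCard 1x, "
    "/remote.axd/www.milb.com/images/682626/generic/180x270/682626.jpg"
    "?preset=teamPlayerCard@2x 2x"
)


def extract_player_id(text):
    """Return the digits that follow '/images/' as a string, or None."""
    match = re.search(r"/images/(\d+)/", text)
    return match.group(1) if match else None


print(extract_player_id(srcset))
```

Returning None when nothing matches lets the caller decide what to do with tags that carry no ID, such as the /img/silhouette.gif placeholder.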

I probably didn’t ask the right question. The IMG tag is a bs4.object that was extracted as part of a list. There are 30 list elements. This is one of the elements:

<li class="col-xs-6"><div class="rank">1</div><div class="prospect-image"><img alt="Francisco Alvarez" class="lazyload mlb-player-image" data-aspectratio="45/68" data-srcset="/remote.axd/www.milb.com/images/682626/generic/180x270/682626.jpg?preset=teamPlayerCard 1x, /remote.axd/www.milb.com/images/682626/generic/180x270/682626.jpg?preset=teamPlayerCard@2x 2x"/></div><div class="prospect-details"><a href="#104328"> Francisco Alvarez </a> <span class="prospect-position">C</span></div></li>

I’m trying to get the player ID from that. It doesn’t really have a tag or part of the text of a tag that I can figure out how to grab.

Thanks

I have been working on the code and this is what I’ve come up with:

for first_player in my_list:
    rank = first_player.div.text
    ranks.append(rank)
    name = first_player.a.text.strip()
    names.append(name)
    position = first_player.span.text
    positions.append(position)
    player_id = first_player.find('img').attrs['data-srcset'].split('/')[4]
    player_ids.append(player_id)

This has worked on 27 of 30 webpages. I’m stumped as to why it doesn’t work on the other 3. The pages are the same but with different players on them. On the ones that it doesn’t work I get this error:

Traceback (most recent call last):
  File "C:/Users/HP/PycharmProjects/CTCBL/NYN.py", line 54, in <module>
    player_id = first_player.find('img').attrs['data-srcset'].split('/')[4]
KeyError: 'data-srcset'

This is the list item it is using:

<li class="col-xs-6"><div class="rank">2</div><div class="prospect-image"><img 
alt="Ronny Mauricio" class="lazyload mlb-player-image" data-aspectratio="45/68" data- 
srcset="/remote.axd/www.milb.com/images/677595/generic/180x270/677595.jpg? 
preset=teamPlayerCard 1x,/remote.axd/www.milb.com/images/677595/generic/180x270/677595.jpg? 
preset=teamPlayerCard@2x 2x"/></div><div class="prospect-details"><a 
href="#100109"> Ronny Mauricio </a> <span class="prospect-position">SS</span> 
</div></li>

What is also perplexing is that if I take the code out of the loop it runs for the first list item.

Thanks for any help you can give.

That img tag, as pasted, does not have a ‘data-srcset’ attribute; it only has these
(which you get via pprint.pprint(first_player.find('img').attrs)):

    {'alt': 'Ronny Mauricio',
     'class': ['lazyload', 'mlb-player-image'],
     'data-': '',
     'data-aspectratio': '45/68',
     'srcset': '/remote.axd/www.milb.com/images/677595/generic/180x270/677595.jpg?\n'
               'preset=teamPlayerCard '
               '1x,/remote.axd/www.milb.com/images/677595/generic/180x270/677595.jpg?\n'
               'preset=teamPlayerCard@2x 2x'}

You can use a try … except statement to filter out tags without the data-srcset attribute:

try:
    player_id = first_player.find('img').attrs['data-srcset'].split('/')[4]
except KeyError:
    pass  # put any error handling here
else:
    player_ids.append(player_id)
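
An alternative to try/except, sketched here on plain dictionaries so it runs without bs4, is to look the attribute up with dict.get, which returns None instead of raising KeyError. The helper name player_id_from_attrs and the "Placeholder Player" entry are made up for illustration.

```python
def player_id_from_attrs(attrs):
    """Return the player ID from an img tag's attribute dict, or None."""
    srcset = attrs.get("data-srcset")  # None when the attribute is absent
    if not srcset:
        return None
    parts = srcset.split("/")
    # Same index as the code in the question, guarded against short paths.
    return parts[4] if len(parts) > 4 else None


# One attribute dict with the attribute and one (invented) without it.
good = {
    "alt": "Francisco Alvarez",
    "data-srcset": "/remote.axd/www.milb.com/images/682626/generic/"
                   "180x270/682626.jpg?preset=teamPlayerCard 1x",
}
bad = {"alt": "Placeholder Player", "src": "/img/silhouette.gif"}

print(player_id_from_attrs(good))
print(player_id_from_attrs(bad))
```

Inside the loop this would presumably be called with first_player.find('img').attrs, keeping in mind that find returns None when an item has no img tag at all.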

Thanks for the reply. Why does the code work on other pages? There are 30 team pages and it works on 27 of them. This is an object in the list that it does work on:

<li class="col-xs-6"><div class="rank">1</div><div class="prospect-image"><img 
alt="Cristian Pache" class="lazyload mlb-player-image" data-aspectratio="45/68" data- 
srcset="/remote.axd/www.milb.com/images/665506/generic/180x270/665506.jpg? 
preset=teamPlayerCard 1x, 
/remote.axd/www.milb.com/images/665506/generic/180x270/665506.jpg? 
preset=teamPlayerCard@2x 2x"/></div><div class="prospect-details"><a href="#8776"> 
Cristian Pache </a> <span class="prospect-position">OF</span></div></li>

This one doesn’t work; I get the error.

<li class="col-xs-6"><div class="rank">1</div><div class="prospect-image"><img 
alt="Francisco Alvarez" class="lazyload mlb-player-image" data-aspectratio="45/68" data- 
srcset="/remote.axd/www.milb.com/images/682626/generic/180x270/682626.jpg? 
preset=teamPlayerCard 1x, 
/remote.axd/www.milb.com/images/682626/generic/180x270/682626.jpg? 
preset=teamPlayerCard@2x 2x"/></div><div class="prospect-details"><a 
href="#104328"> Francisco Alvarez </a> <span class="prospect-position">C</span> 
</div></li>

If I use the code to get only the first item, it also works:

rank = first_player.div.text
name = first_player.a.text
position = first_player.span.text
player_id = first_player.find('img').attrs['data-srcset'].split('/')[4]

print(rank, name, position, player_id)

I get:
1 Francisco Alvarez C 682626

Thanks again for helping

It is likely that the problem lies in other items of that list that do not have the attribute.

These two you pasted seem fine (see attachment).

Check whether all elements (i.e. <li> tags) of the list have the data-srcset="..." attribute.
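
One way to run that check, sketched with the standard-library html.parser so it works without bs4: the class name ImgAttrCollector and the second list item ("Placeholder Player") are invented for the example, and the first item is shortened from the one posted above.

```python
from html.parser import HTMLParser


class ImgAttrCollector(HTMLParser):
    """Collect the attributes of every <img> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img .../> also end up here by default.
        if tag == "img":
            self.images.append(dict(attrs))


html = (
    '<li><img alt="Francisco Alvarez" data-srcset="/remote.axd/www.milb.com/'
    'images/682626/generic/180x270/682626.jpg?preset=teamPlayerCard 1x"/></li>'
    '<li><img alt="Placeholder Player" src="/img/silhouette.gif"/></li>'
)

collector = ImgAttrCollector()
collector.feed(html)

# Names of the players whose img tag lacks the data-srcset attribute.
missing = [img.get("alt") for img in collector.images if "data-srcset" not in img]
print(missing)
```

With the bs4 objects already in hand, a similar check could filter my_list on first_player.find('img').has_attr('data-srcset') instead.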

Marco, THANK YOU. I have learned something here. It doesn’t necessarily do the operations up until it finds an error; it just errors. There was one element that didn’t have the data-srcset attribute. Because I’m not yet adept enough to write error-handling code, I just added a line of code to remove that item from the list and then inserted another one with corrected data in its place. Now I will inspect the other 2 pages and find which element is causing those to error out.

Thanks again

Pleasure to help. Do consider the suggestion you got from another user of
wrapping your metadata extraction logic in a try/except block so that you may
process the items you can and disregard those you can’t without halting.

You could also log information about items you could not process to study them
or process them with another strategy later. Useful topics: “Python logging”,
“difference between stdout and stderr”.
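
A minimal sketch of that idea, assuming each item has already been reduced to a (name, attrs) pair: the function name collect_player_ids, the logger name "prospects", and the "Placeholder Player" entry are made up for illustration. Warnings go to stderr, so they stay separate from the normal output on stdout.

```python
import logging
import sys

# Send warnings to stderr so they do not mix with regular output.
logging.basicConfig(
    stream=sys.stderr,
    level=logging.WARNING,
    format="%(levelname)s: %(message)s",
)
logger = logging.getLogger("prospects")  # hypothetical logger name


def collect_player_ids(items):
    """Return (player_ids, skipped_names) for a list of (name, attrs) pairs."""
    player_ids, skipped = [], []
    for name, attrs in items:
        try:
            player_ids.append(attrs["data-srcset"].split("/")[4])
        except (KeyError, IndexError):
            # Record the item and keep going instead of halting the loop.
            logger.warning("no usable data-srcset for %s", name)
            skipped.append(name)
    return player_ids, skipped


items = [
    ("Francisco Alvarez", {"data-srcset": "/remote.axd/www.milb.com/images/682626/generic/180x270/682626.jpg?preset=teamPlayerCard 1x"}),
    ("Placeholder Player", {"src": "/img/silhouette.gif"}),  # invented entry
    ("Ronny Mauricio", {"data-srcset": "/remote.axd/www.milb.com/images/677595/generic/180x270/677595.jpg?preset=teamPlayerCard 1x"}),
]

ids, skipped = collect_player_ids(items)
print(ids)
print(skipped)
```

The skipped list can then be inspected by hand or fed to a different extraction strategy later.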

Yes, that definitely is on my agenda. This was my first attempt at getting some baseball data for my personal projects. I plan on doing some more reading and expanding my programs. It is all a learning process.

Thanks again to you and to Erlend!

Glad to help!