How to filter a JSON file in Python?

warstha · October 5, 2022, 6:17am

Hi,

I want to understand and learn. I have JSON files that list all the formats of a video, like this:

[
{
“format_id”:“sb3”,
“format_note”:“storyboard”,
“ext”:“mhtml”,
“protocol”:“mhtml”,
“acodec”:“none”,
“vcodec”:“none”,
“url”:"
“width”:48,
“height”:27,
“fps”:0.12150668286755771,
“rows”:10,
“columns”:10,
“fragments”:[
{
“url”:
}
],
“audio_ext”:“none”,
“video_ext”:“none”,
“format”:“sb3 - 48x27 (storyboard)”,
“resolution”:“48x27”,
“http_headers”:{
“User-Agent”:“Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36”,
“Accept”:“text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8”,
“Accept-Language”:“en-us,en;q=0.5”,
“Sec-Fetch-Mode”:“navigate”
}
},
{
“format_id”:“sb2”,
“format_note”:“storyboard”,
“ext”:“mhtml”,
“protocol”:“mhtml”,
“acodec”:“none”,
“vcodec”:“none”,
“url”:“,
“width”:60,
“height”:45,
“fps”:0.20170109356014582,
“rows”:10,
“columns”:10,
“fragments”:[
{
“url”:”“,
“duration”:495.7831325301205
},
{
“url”:”,
“duration”:327.2168674698795
}
],
“audio_ext”:“none”,
“video_ext”:“none”,
“format”:“sb2 - 60x45 (storyboard)”,
“resolution”:“60x45”,
“http_headers”:{
“User-Agent”:“Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36”,
“Accept”:“text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8”,
“Accept-Language”:“en-us,en;q=0.5”,
“Sec-Fetch-Mode”:“navigate”
}
},
{
“format_id”:“sb1”,
“format_note”:“storyboard”,
“ext”:“mhtml”,
“protocol”:“mhtml”,
“acodec”:“none”,
“vcodec”:“none”,
“url”:“”,
“width”:120,
“height”:90,
“fps”:0.20170109356014582,
“rows”:5,
“columns”:5,
“fragments”:[
{
“url”:“,
“duration”:123.94578313253012
},
{
“url”:”“,
“duration”:123.94578313253012
},
{
“url”:”“,
“duration”:123.94578313253012
},
{
“url”:”“,
“duration”:123.94578313253012
},
{
“url”:”“,
“duration”:123.94578313253012
},
{
“url”:”“,
“duration”:123.94578313253012
},
{
“url”:”“,
“duration”:79.3253012048192
}
],
“audio_ext”:“none”,
“video_ext”:“none”,
“format”:“sb1 - 120x90 (storyboard)”,
“resolution”:“120x90”,
“http_headers”:{
“User-Agent”:“Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36”,
“Accept”:“text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8”,
“Accept-Language”:“en-us,en;q=0.5”,
“Sec-Fetch-Mode”:“navigate”
}
},
{
“format_id”:“sb0”,
“format_note”:“storyboard”,
“ext”:“mhtml”,
“protocol”:“mhtml”,
“acodec”:“none”,
“vcodec”:“none”,
“url”:”,
“width”:240,
“height”:180,
“fps”:0.20170109356014582,
“rows”:3,
“columns”:3,
“fragments”:[
{
“url”:“”,
“duration”:44.62048192771085
},
{
“url”:“”,
“duration”:44.62048192771085
},
{
“url”:“”,
“duration”:44.62048192771085
},
{
“url”:“,
“duration”:44.62048192771085
},
{
“url”:”“,
“duration”:44.62048192771085
},
{
“url”:”“,
“duration”:44.62048192771085
},
{
“url”:”“,
“duration”:44.62048192771085
},
{
“url”:”“,
“duration”:44.62048192771085
},
{
“url”:”“,
“duration”:44.62048192771085
},
{
“url”:”“,
“duration”:44.62048192771085
},
{
“url”:”",
“duration”:44.62048192771085
},
        "duration":19.831325301204743
     }
  ],
  "audio_ext":"none",
  "video_ext":"none",
  "format":"sb0 - 240x180 (storyboard)",
  "resolution":"240x180",
  "http_headers":{
     "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36",
     "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
     "Accept-Language":"en-us,en;q=0.5",
     "Sec-Fetch-Mode":"navigate"
  }
},

This data lists all the formats for one video.
I process it with some other Python code (Python.py) like this:

def extract_video_data_from_url(url):
    command = f'youtube-dl "{url}" -j --no-playlist'
    output = os.popen(command).read()
    video_data = json.loads(output)
    title = video_data["title"]
    formats = video_data["formats"]
    for element in formats:
        if "251 -" in element['format']:
            element['format'] = "1"
        elif "18 -" in element['format']:
            element['format'] = "2"
        elif "22 -" in element['format']:
            element['format'] = "2"
        else:
            element['format'] = "Broken Link"
    thumbnail = video_data["thumbnail"]
    formats = [extract_format_data(format_data) for format_data in formats]
    return {
        "title": title,
        "formats": formats,
        "thumbnail": thumbnail
    }

and it shows a picture like this:

How can I filter the buttons in the Python.py code so that only 1 and 2 are shown? The “Broken Link” entries should be removed or skipped in the loop.

Thanks

vbrozik · October 6, 2022, 9:39pm

Please enclose your code between lines with triple backticks otherwise some information gets mangled.

```
# Your code will be here
```

Your code does not look much related to your JSON because I do not see the strings you are matching like "251 -" in your JSON. …but anyway…

Is it the list formats whose elements get displayed as the green boxes?
Do you want to remove the “Broken Link” elements completely from the list?

If so, you can apply a condition in the list comprehension. Instead of:

formats = [
    extract_format_data(format_data)
    for format_data in formats]

do the same with filtering added:

formats = [
    extract_format_data(format_data)
    for format_data in formats
    if format_data['format'] != "Broken Link"]
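
As a rough, self-contained illustration of the filtering comprehension above: the sample list and the body of extract_format_data below are made-up stand-ins (the real function and data are not shown at this point in the thread), so treat this as a sketch rather than the actual program.

```python
def extract_format_data(format_data):
    # Hypothetical stand-in for the real helper, which is not shown here.
    return {"format_name": format_data["format"], "url": format_data["url"]}


# Made-up sample entries, already relabelled the way the loop in the
# question relabels them ("1", "2" or "Broken Link").
formats = [
    {"format": "1", "url": "https://example.invalid/audio"},
    {"format": "Broken Link", "url": "https://example.invalid/storyboard"},
    {"format": "2", "url": "https://example.invalid/video"},
]

# The trailing "if" drops every element whose label is "Broken Link"
# before extract_format_data is ever called on it.
filtered = [
    extract_format_data(format_data)
    for format_data in formats
    if format_data["format"] != "Broken Link"
]

print([entry["format_name"] for entry in filtered])
```

Only the entries labelled 1 and 2 survive, so whatever renders the buttons never sees the unwanted ones.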

cameron · October 6, 2022, 11:28pm

I want to understand and learn. I have JSON files that list all the
formats of a video, like this:

As Vaclav has mentioned, please paste code/data/output between triple
backticks, like this:

 ```
 your code
 or data
 or output
 goes here
 ```

This preserves indenting and formatting.

Since I’m on email, I get to see your JSON. For others, it looks like
this:

 [
    {
       "format_id":"sb3",
       "format_note":"storyboard",
       "ext":"mhtml",
       "protocol":"mhtml",
       "acodec":"none",
       "vcodec":"none",
       "url":"
       "width":48,
       "height":27,
       "fps":0.12150668286755771,
       "rows":10,
       "columns":10,
       "fragments":[
          {
             "url":
          }
       ],
       "audio_ext":"none",
       "video_ext":"none",
       "format":"sb3 - 48x27 (storyboard)",
       "resolution":"48x27",
       "http_headers":{
          "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36",
          "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
          "Accept-Language":"en-us,en;q=0.5",
          "Sec-Fetch-Mode":"navigate"
       }
    },
    {

This data lists all the formats for one video.
I process it with some other Python code (Python.py) like this:

And the Python code, formatting preserved:

 def extract_video_data_from_url(url):
     command = f'youtube-dl "{url}" -j --no-playlist'
     output = os.popen(command).read()
     video_data = json.loads(output)
     title = video_data["title"]
     formats = video_data["formats"]
     for element in formats:
         if "251 -" in element['format']:
             element['format'] = "1"
         elif "18 -" in element['format']:
             element['format'] = "2"
         elif "22 -" in element['format']:
             element['format'] = "2"
         else:
             element['format'] = "Broken Link"
     thumbnail = video_data["thumbnail"]
     formats = [extract_format_data(format_data) for format_data in formats]
     return {
         "title": title,
         "formats": formats,
         "thumbnail": thumbnail
     }

So the JSON data above are a dump of the video information at YouTube as
fetched by the youtube-dl utility.

My initial thought is that some of your trouble comes from assembling
the youtube-dl command as a string. In particular, YouTube URLs often
contain & separated parameters, and passing those to a UNIX shell
unquoted will break your command line, because & is shell
punctuation.

Try supplying the command as a list of strings:

 command = ['youtube-dl', url, '-j', '--no-playlist']

and see if your results improve; replace your call to os.popen with a
call to subprocess.run, eg:

 from subprocess import run, PIPE

 P = subprocess.run(command, stdout=PIPE)
 output = P.stdout.read()

Then you need to put in some print() statements to show each URL and
the resulting JSON data, to see which ones work for you and which are
causing trouble. That will help narrow the cause of the trouble.

Finally, and I’m not really suggesting this as an approach,
youtube-dl is itself written in Python and is available from PyPI as a
package; you can use it directly in your Python code instead of invoking
it as a separate executable. This is fiddlier than just calling the
command as you are doing, but more flexible if you want to do more work
with it later. Example code snippets:

 from youtube_dl import YoutubeDL
 from youtube_dl.utils import DownloadError

 ydl_opts = {
     'progress_hooks': [self.update_progress],
     'format': DEFAULT_OUTPUT_FORMAT,
     'logger': logging.getLogger(),
     'outtmpl': DEFAULT_OUTPUT_FILENAME_TEMPLATE,
     ##'skip_download': True,
     'writeinfojson': False,
     'updatetime': False,
     'process_info': [self.process_info]
 }
 ydl = YoutubeDL(ydl_opts)
 ie_result = ydl.extract_info(url, download=False, process=True)

This is lifted from some code using the package directly, and therefore
would want some tweaking, but it gets the video info into the variable
ie_result above.

I recommend that you stick with your current approach at present though.

Cheers,
Cameron Simpson cs@cskk.id.au

warstha · October 7, 2022, 4:08am

Hi,

Thanks for your great answer!

Yes, that JSON data is a dump file.
I tried replacing my command, command = f'yt-dlp "{url}" -j --no-playlist', with command = ['youtube-dl', url, '-j', '--no-playlist'] and added these lines below it:

from subprocess import run, PIPE

P = subprocess.run(command, stdout=PIPE)
output = P.stdout.read()

However, it shows the error AttributeError: 'bytes' object has no attribute 'read' at output = P.stdout.read()

To be honest, I would really like to replace my approach of calling youtube-dl as a binary, as I wrote it, but using youtube-dl as a package confuses me as well,

because I don’t know how to integrate this code:

ydl_opts = {
     'progress_hooks': [self.update_progress],
     'format': DEFAULT_OUTPUT_FORMAT,
     'logger': logging.getLogger(),
     'outtmpl': DEFAULT_OUTPUT_FILENAME_TEMPLATE,
     ##'skip_download': True,
     'writeinfojson': False,
     'updatetime': False,
     'process_info': [self.process_info]
 }
 ydl = YoutubeDL(ydl_opts)
 ie_result = ydl.extract_info(url, download=False, process=True)

into my Python code. If I keep using my code in python.py:

def extract_video_data_from_url(url):    
    
    command = f'youtube-dl "{url}" -j --no-playlist'
    #command = ['youtube-dl', url, '-j', '--no-playlist']
    #P = subprocess.run(command, stdout=PIPE)
    #output = P.stdout.read()
    output = os.popen(command).read()
    video_data = json.loads(output)
    title = video_data["title"]
    formats = video_data["formats"]
    for element in formats:
        if '251' in element['format_id']:
            element['format_id'] = "Download MP3 (64KBPS)"
        elif '18' in element['format_id']:
            element['format_id'] = "Download MP4 (360p)"
        elif '22' in element['format_id']:
            element['format_id'] = "Download MP4 (720p)"
        elif 'Playback video' in element['format_note']:
            element['format_note'] = "Download MP4 No Watermark"
        else:
            element['format_id'] = "Broken Link"
    thumbnail = video_data["thumbnail"]
    formats = [extract_format_data(format_data) for format_data in formats]
    return {
        "title": title,
        "formats": formats,
        "thumbnail": thumbnail
    }

Where should I put it?
Thanks!

warstha · October 7, 2022, 4:09am

Hi,

thanks for your tip!

using

formats = [
    extract_format_data(format_data)
    for format_data in formats
    if format_data['format'] != "Broken Link"]

EDIT: I get a syntax error: “[” was not closed. Is it acceptable for formats to use “[” in Python?

vbrozik · October 7, 2022, 6:42am

I do not understand you. The square brackets are balanced in the code excerpt. Probably you have an unbalanced square bracket somewhere else or you have modified this part of code?

It is much, much better to show the error message complete and verbatim. Python puts a lot of useful information into its error messages, so please show them unabridged.

cameron · October 7, 2022, 10:21pm

Yes, that JSON data is a dump file.
I tried replacing my command, command = f'yt-dlp "{url}" -j --no-playlist', with command = ['youtube-dl', url, '-j', '--no-playlist'] and added these lines below it:
from subprocess import run, PIPE

P = subprocess.run(command, stdout=PIPE)
output = P.stdout.read()
However, it shows the error AttributeError: 'bytes' object has no attribute 'read' at output = P.stdout.read()

That’s my fault, I forgot that subprocess.run is a bit different to
subprocess.Popen. run returns a CompletedProcess object whose stdout
attribute holds the output already read in, as a bytes instance (by
default). The docs are here:
subprocess.run.

Try this:

 P = subprocess.run(command, stdout=PIPE, encoding='utf-8')
 output = P.stdout

which should get you a str (because of the encoding parameter),
ready to decode with json.loads(output).
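
Putting the pieces together, a minimal sketch of the subprocess-based approach might look like the following. So that it runs without youtube-dl or network access, a tiny Python child process stands in for the real command; the helper name run_json_command, the example URL and the child script are all assumptions made for illustration.

```python
import json
import subprocess
import sys


def run_json_command(command):
    """Run command (a list of strings) and decode its stdout as JSON."""
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        encoding="utf-8",  # makes completed.stdout a str instead of bytes
        check=True,        # raise CalledProcessError on a non-zero exit
    )
    return json.loads(completed.stdout)


# Made-up URL containing "&", which a shell would treat as punctuation.
url = "https://example.invalid/watch?v=abc&list=xyz"

# Stand-in for ['youtube-dl', url, '-j', '--no-playlist']: a child Python
# process that echoes its argument back inside a JSON document.
child_code = (
    "import json, sys; "
    "print(json.dumps({'title': 'demo', 'url': sys.argv[1]}))"
)
command = [sys.executable, "-c", child_code, url]

video_data = run_json_command(command)
print(video_data["title"])
print(video_data["url"])
```

Because the command is a list and no shell is involved, the URL reaches the child process unchanged, ampersand included. Swapping the stand-in list for the youtube-dl one should leave the rest of the function as it is.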

To be honest, I would really like to replace my approach of calling
youtube-dl as a binary, as I wrote it, but using youtube-dl as a
package confuses me as well,

because I don’t know how to integrate this code:
ydl_opts = {
    'progress_hooks': [self.update_progress],
    'format': DEFAULT_OUTPUT_FORMAT,
    'logger': logging.getLogger(),
    'outtmpl': DEFAULT_OUTPUT_FILENAME_TEMPLATE,
    ##'skip_download': True,
    'writeinfojson': False,
    'updatetime': False,
    'process_info': [self.process_info]
}
ydl = YoutubeDL(ydl_opts)
ie_result = ydl.extract_info(url, download=False, process=True)

The code above gets the info stuff from extract_info() into
ie_result. That’s decoded already. So ie_result is equivalent to
your video_data variable in your current code. You will want to print
it out to check that it has the same structure.

I would still be inclined to get your subprocess based version working
first. Just because it skips the complexity of calling the youtube-dl
package. However, if you want to dive straight into using youtube-dl
then I’d start by commenting out almost everything in the ydl_opts
above. There’s a list of the available options
here.
I would hope the defaults are reasonable and you don’t need to fiddle
with them much.

So I’d (a) copy your existing python.py file to a new filename so that
you have a copy of your old code for reference or just to go back to if
you dislike the youtube-dl package and (b) try getting the video
information like this:

 def extract_video_data_from_url(url):
     ydl_opts = {}  # fill these in later, if you want to
     ydl = YoutubeDL(ydl_opts)
     video_data = ydl.extract_info(url, download=False, process=True)

and proceed from there. Print out the new video_data; it may not be
exactly the same as what you were using before (but I expect it to be
very similar, because you’re using youtube-dl in both cases, just in
different ways).

Use pprint to print out the data, it is easier to read:

 # up the top
 from pprint import pprint

 # in your function
 pprint(video_data)
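
For a large nested structure it can also help to look at the top-level keys first and to limit how deep pprint descends. A small sketch, using a made-up and heavily trimmed stand-in for the real video_data:

```python
from pprint import pformat, pprint

# Hypothetical, trimmed-down stand-in for the dictionary returned by
# youtube-dl; the real one has many more keys and formats.
video_data = {
    "title": "Example video",
    "thumbnail": "https://example.invalid/thumb.jpg",
    "formats": [
        {"format_id": "18", "ext": "mp4", "format": "18 - 640x360 (360p)"},
        {"format_id": "251", "ext": "webm", "format": "251 - audio only"},
    ],
}

# Overview: just the names of the top-level keys.
print(sorted(video_data))

# depth=2 prints the top-level dict and the "formats" list, but
# abbreviates each format dictionary inside the list as {...}.
pprint(video_data, depth=2)

# pformat returns the same text as a string, e.g. for logging.
summary = pformat(video_data, depth=2)
```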

The README from the youtube-dl library has a tiny reference to using it
directly in Python
here.

Here is the extract_info method you call above (youtube_dl/YoutubeDL.py
at commit d35557a75d943865e40410d51bfcc18276e98532 in ytdl-org/youtube-dl
on GitHub).

Don’t forget that you will need to install the youtube-dl package:

 python3 -m pip install youtube-dl

and that you import youtube_dl, not youtube-dl.

Cheers,
Cameron Simpson cs@cskk.id.au

warstha · October 10, 2022, 8:06am

Hey @cameron, it works!

Your answer was a great help to me!
Thanks, I appreciate it.

This is my Python.py code after some corrections based on your suggestions:

import youtube_dl


def extract_format_data(format_data):
    extension = format_data["ext"]
    format_name = format_data["format"]
    format_id = format_data["format_id"]
    format_note = format_data["format_note"]
    url = format_data["url"]
  
    return {
        "extension": extension,
        "format_name": format_name,
        "format_id": format_id,
        "format_note": format_note,
        "url": url        
    }
        
def extract_video_data_from_url(url):    
    
    ydl_opts = {
        'noplaylist':'True',
        'format': 'bestvideo+bestaudio/best',
        'outtmpl': 'test.%(ext)s',      
        
        }  # fill these in later, if you want to
    ydl = youtube_dl.YoutubeDL(ydl_opts)
    video_data = ydl.extract_info(url, download=False, process=True)
    #print(video_data)


    title = video_data["title"]
    formats = video_data["formats"]
    for element in formats:
        if '251' in element['format_id']:
            element['format_id'] = "Download MP3 (64KBPS)"
        elif '18' in element['format_id']:
            element['format_id'] = "Download MP4 (360p)"
        elif '22' in element['format_id']:
            element['format_id'] = "Download MP4 (720p)"
        elif 'Direct video' in element['format_note']:
            element['format_note'] = "Download MP4 No Watermark"
        else:
            element['format_id'] = "Broken Link"   
    thumbnail = video_data["thumbnail"]
    formats = [extract_format_data(format_data) for format_data in formats]
    return {
        "title": title,
        "formats": formats,
        "thumbnail": thumbnail
    }

However, I’d still like to keep the if statements to categorize the different resolutions.
Again, thanks!
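
One possible way to keep the categorisation while dropping the unwanted entries is sketched below. It swaps the if/elif chain for a dictionary lookup with an exact match on format_id; that is a design choice, not something the thread prescribes, and an explicit chain using == would behave the same. The helper name label_formats, the "label" key and the sample entries (including the format id 218) are made up for illustration. Note that a substring test such as '18' in element['format_id'] would also match an id like '218'.

```python
# Labels taken from the code above, keyed by the exact format_id.
LABELS = {
    "251": "Download MP3 (64KBPS)",
    "18": "Download MP4 (360p)",
    "22": "Download MP4 (720p)",
}


def label_formats(formats):
    """Return labelled copies of the formats whose format_id is known."""
    labelled = []
    for element in formats:
        label = LABELS.get(element["format_id"])
        if label is None:
            # Unknown format: skip it instead of marking it "Broken Link".
            continue
        # Copy the entry rather than overwriting its format_id in place.
        labelled.append({**element, "label": label})
    return labelled


# Made-up sample entries; "218" is there only to show the exact match.
sample = [
    {"format_id": "sb0", "ext": "mhtml"},
    {"format_id": "18", "ext": "mp4"},
    {"format_id": "218", "ext": "mp4"},
    {"format_id": "251", "ext": "webm"},
]

result = label_formats(sample)
print([entry["label"] for entry in result])
```

Storing the label under a separate key leaves the original format_id available for later steps.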

If you don’t mind, I would like to ask one more question.
As you can see in

ydl_opts = {
    'noplaylist':'True',
    'format': 'bestvideo+bestaudio/best',
    'outtmpl': 'test.%(ext)s',

my goal is to get (or print) all formats, including the video with no audio (but the best resolution). If I print(video_data), it shows all formats (including the storyboards and the best video with no audio).
But I want to merge the best audio and the best video, let’s say the format code is '303' (1080p).

But the result is still the same.
Is that because I don’t specify which audio and video to merge? Or should I put it in an if statement?

I am still confused about how to read and understand ydl_opts.

Thanks again!

warstha · October 10, 2022, 8:09am

Hi @vbrozik,

Thanks. I tried a different method from the last one (using import youtube_dl),
and I think I will change some of the code.

Thanks! I appreciate it.