Before v1.15.0: c=10, a=1, r=0
Rule #3: source code has changed, increment r:
r=1
Rule #4: interfaces were removed in vpx_tpl.h, set r=0, increment c:
c=11, r=0
Rule #5: no interfaces have been added
Rule #6: interfaces were removed in vpx_tpl.h, set a=0:
a=0
After release: c=11, a=0, r=0
major = c-a = 11
minor = a = 0
patch = r = 0
Bug: webm:384672478
Change-Id: I2e70e7e35c64ece32eaf1dc5625640965483f9b9
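The version arithmetic above can be sketched in C. This is a minimal illustration, assuming the rules are applied in the order quoted; AbiVersion, abi_next() and abi_to_version() are hypothetical names and are not part of libvpx or its build scripts.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical holder for the (current, age, revision) triple. */
typedef struct {
  int c; /* current */
  int a; /* age */
  int r; /* revision */
} AbiVersion;

/* Applies rules #3 to #6 in the order used in the message above. */
AbiVersion abi_next(AbiVersion v, bool source_changed, bool added,
                    bool removed) {
  if (source_changed) v.r += 1; /* rule #3 */
  if (added || removed) {       /* rule #4 */
    v.c += 1;
    v.r = 0;
  }
  if (added) v.a += 1; /* rule #5 */
  if (removed) v.a = 0; /* rule #6 */
  return v;
}

/* Derives major/minor/patch as major = c - a, minor = a, patch = r. */
void abi_to_version(AbiVersion v, int *major, int *minor, int *patch) {
  assert(v.c >= v.a);
  *major = v.c - v.a;
  *minor = v.a;
  *patch = v.r;
}
```

Starting from c=10, a=1, r=0 with source changes and removed interfaces, abi_next() yields c=11, a=0, r=0, which maps to 11.0.0.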
Possible fix for the issue below. The speed feature was only disabled
for screen in a previous change; it is now forced off
always to check whether that clears the issue.
The speed feature disabled is only used for 3 spatial
layers and at least 2 temporal. The impact on speed is
expected to be small, ~2%, so ok to disable for now and
see if it clears the issue.
Bug: 366146260
Change-Id: If7af006425e1e0ef297b9d6466507ea4c90ddb6f
(cherry picked from commit 09b3d5fc5aa48752f95f4c0c37b0bd4ff55c0ba1)
Integer overflow in encode_frame_to_data_rate()
for the update:
lc->total_target_vs_actual += bits_off_for_this_layer
Fix is to use int64_t for total_target_vs_actual.
Bug: chromium:368114043
Change-Id: I9a01e1a69e26ae748e8ae23d9e1287431510388d
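A minimal sketch of the widening fix described above. LayerRc and layer_rc_accumulate() are hypothetical stand-ins for the layer context and the update in encode_frame_to_data_rate(); only the int64_t accumulator reflects the message.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-layer rate control state. */
typedef struct {
  int64_t total_target_vs_actual; /* widened from int to int64_t */
} LayerRc;

/* Accumulates the per-frame difference in 64 bits, so repeated additions of
 * int-sized values do not overflow the running total. */
void layer_rc_accumulate(LayerRc *lc, int bits_off_for_this_layer) {
  assert(lc != NULL);
  lc->total_target_vs_actual += bits_off_for_this_layer;
}
```

With an int accumulator, adding INT_MAX twice would be undefined behavior; with int64_t the sum is simply 2 * INT_MAX.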
Divide by 3 instead of multiplying by 3 in the comparison of
lrc->avg_frame_bandwidth vs lrc->last_avg_frame_bandwidth,
in two functions for reset rc.
Small loss in precision, so acceptable.
Similar change to:
https://chromium-review.googlesource.com/c/webm/libvpx/+/5698570
Bug: chromium:367892770
Change-Id: Ia9ef09a9f6beba930fedd496407cfa7057e39336
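The transformation can be illustrated as follows. This is a sketch of the general idea only: bandwidth_more_than_tripled() is a hypothetical helper, and the exact expression used in the rc reset functions may differ.

```c
#include <assert.h>
#include <limits.h>

/* Returns nonzero if `bandwidth` is more than three times `last_bandwidth`.
 * Dividing the left side avoids computing 3 * last_bandwidth, which can
 * overflow an int when last_bandwidth is close to INT_MAX. */
int bandwidth_more_than_tripled(int bandwidth, int last_bandwidth) {
  assert(bandwidth >= 0 && last_bandwidth >= 0);
  return bandwidth / 3 > last_bandwidth;
}
```

The loss in precision shows up at the boundary: 91 > 3 * 30 is true, but 91 / 3 > 30 is false because the integer division truncates to 30.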
PF_ARM_SVE_INSTRUCTIONS_AVAILABLE and PF_ARM_SVE2_INSTRUCTIONS_AVAILABLE
are available in WinSDK 10.0.26100 and recent versions of mingw-w64.
Based on a patch by Martin Storsjö on ffmpeg-devel:
https://ffmpeg.org/pipermail/ffmpeg-devel/2024-September/333611.html
Change-Id: I34b2341a559f95aa400e84d709f3eb36da5dbb7b
There's no direct processor feature constant for I8MM alone, but
there is a flag for SVE-I8MM (added in WinSDK 10.0.26100 and
recent versions of mingw-w64). If SVE-I8MM is available, we can
assume that I8MM is available.
While HW supporting these features isn't yet commonly running
Windows, this at least allows detecting and running the I8MM codepaths
in Windows builds in Wine (possibly running in QEMU).
Based on patch from Martin Storsjö on ffmpeg-devel:
https://ffmpeg.org/pipermail/ffmpeg-devel/2024-September/333609.html
Change-Id: I77117bee8516924fddcdecccae8bab3cf5beed96
The program requires a minimum of 2 parameters. Previously the tool
would crash if only one input file was given.
Bug: webm:365481206
Change-Id: I875d81b2db4fcc4338061c03b23bb51b0aad58e4
Possible fix for issue below. The speed feature disabled
is only used for 3 spatial layers and at least 2 temporal.
The impact on speed is expected to be small, ~2%, so ok
to disable for now and see if it clears the issue.
Bug: 366146260
Change-Id: I94ab991d583cc2ce758db337abbbb463a65f0767
The wrapped storage must exist for the duration of the vpx_image_t
allocation.
Bug: aomedia:363806063
Change-Id: Ic6b79a56b6c07776222d1767490d873d7408ced0
The default template for https://issues.webmproject.org/ is a public bug
report. Security issues can be reported securely using the 'Security
report' template.
Change-Id: Ic7144a6c7a144772b78852d1415a51a570c79d50
and examples/resize_util.c. These functions were added in:
3cd37dfeb Adds a non-normative resize library to vp9 encoder
but never used meaningfully in the library.
This mirrors the change in libaom:
d10029bb4b Restore function prototype of av1_resize_frame420
except that vp9_resize_frame420() was never exported in the shared
library, so can be deleted along with the rest.
The reasoning for removing examples/resize_util.c is the same: it is not
useful and examples should use the public functions of the libvpx
library.
Change-Id: I386080d3f1a3ef81dfc87fcdf5bbdf459d996f03
Added key frame temporal filtering. Enabled it for VOD encoding
with encoder speed < 2.
Minor improvement in prediction.
Added the restriction of using no more than "arnr_max_frames"
frames for temporal filtering.
Key frame temporal filtering is turned off by default for now. To
enable it, set "--enable-keyframe-filtering=1"
Borg result with "--enable-keyframe-filtering=1"
         avg_psnr  ovr_psnr  ssim    vmaf
hdres2:  -0.762    -0.863    -0.903  -0.680
midres2: -0.813    -0.753    -0.757  -0.743
lowres2: -0.492    -0.598    -0.737  -0.881
The impact on the encoder time is minimal.
Change-Id: If6abea3e21efcb96f1978cd9dfaa742c40dc2a56
`#if defined(__GNUC__)` is enough if a specific version isn't being
looked for.
Bug: aomedia:356832974
Change-Id: I3fcbecf9d547c6a2d89d7b5456e83ee08ddc6f5e
Applied 12-tap filter to temporal filter prediction for better
result. Improved the calculation of frames to be used in temporal
filtering.
The overall PSNR gain was -0.511% (lowres), -0.338% (midres), and
-0.288% (hdres).
Encoder time was increased by ~2%, which would be largely reduced
by the following SIMD optimization.
Change-Id: If3ece30f1614beadc99ebf6b4dc3f2d988d3bdb9
Move the saturate_cast_double_to_int() function in
vp8/encoder/firstpass.c to vpx_dsp/vpx_dsp_common.h so that it can be
used in other files.
Change-Id: I748fea969520542dca68d7a46500d3272f22e16f
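For readers unfamiliar with the idea, a saturating double-to-int conversion can look like the sketch below. This is not the libvpx implementation of saturate_cast_double_to_int(); the function name, the clamping at both ends, and the NaN handling are assumptions made for illustration.

```c
#include <limits.h>

/* Converts a double to int, clamping out-of-range values instead of invoking
 * undefined behavior. Mapping NaN to 0 is an assumption of this sketch. */
int saturate_double_to_int(double d) {
  if (d != d) return 0; /* NaN compares unequal to itself */
  if (d >= (double)INT_MAX) return INT_MAX;
  if (d <= (double)INT_MIN) return INT_MIN;
  return (int)d; /* in range: truncates toward zero */
}
```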
to INT_MAX. This matches calc_iframe_target_size() in VP8
(http://crbug.com/1473473). If rc->avg_frame_bandwidth is large even
small kf_boost values will overflow an int.
Change-Id: Iaca5b47fe97793ae70930b3b2c2f42725d2c96fb
This fixes a build error seen in gcc 15:
3b63004 mkvparser/mkvparser.cc: add missing <cstdint> include
Bug: aomedia:357622679
Change-Id: I6c4a1795d189f9993d4f2c5c9f0375912bc58f0c
Rely on the -I or -isystem compiler option to find "gtest/gtest.h". This
makes it easier to build our tests against a copy of gtest outside the
libvpx source tree.
Bug: webm:42330726
Change-Id: I3b189c6345e13b36b236d1eedc6ee091bfa71f48
Fixes a 'Result of operation is garbage or undefined' static analysis
report (seen with clang-14) related to left shifting a negative value.
Bug: b:328632178
Change-Id: I18f0100eca0deac1cac9be0c7e848685d2911fb3
Motion vectors are now clamped in
vp8_find_best_sub_pixel_step_iteratively, vp8_find_best_sub_pixel_step,
vp8_find_best_half_pixel_step, vp8_full_search_sad,
vp8_refining_search_sadx4 and vp8_refining_search_sad_c (the rtcd for
other optimizations are redirects to vp8_refining_search_sadx4).
The difference of valid motion vectors may still go beyond the range of
the MVcount array, however, so additional checks are added to
rd_update_mvcount() and update_mvcount().
Note the test source and settings (speed 1 and GOOD quality mode) come
from the issue report; additional coverage is added for realtime. The
realtime path does not trigger the error without the fix, but as it's
similar to the rd path, the same clamp is done to be safe.
Fixes:
vp8/encoder/rdopt.c:1579:5: runtime error: index 17467 out of bounds for
type 'unsigned int[2047]'
Bug: oss-fuzz:69906
Change-Id: Ia8bd087cfe4475ab09ba711ed806fbcbaa72e552
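The two safeguards described above (clamping, then a range check before indexing the count array) can be sketched as follows. The names and the MV_MAX value are assumptions chosen so that the array has 2047 entries as in the report; the real update_mvcount() code is not reproduced here.

```c
#define MV_MAX 1023              /* assumed largest component magnitude */
#define MV_VALS (2 * MV_MAX + 1) /* 2047 entries, matching the report */

/* Clamps one motion vector component to [lo, hi]. */
int clamp_mv_component(int v, int lo, int hi) {
  return v < lo ? lo : (v > hi ? hi : v);
}

/* Counts a motion vector difference only when its index is in range.
 * Returns 1 if the count was updated, 0 if the value was skipped. */
int update_mv_count(unsigned int counts[MV_VALS], int diff) {
  const int index = MV_MAX + diff;
  if (index < 0 || index >= MV_VALS) return 0;
  counts[index]++;
  return 1;
}
```

The range check matters even with clamped vectors, because the difference of two in-range vectors can span up to twice the clamped range.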
cpi->output_framerate may be as large as 10M. Previously this would
cause kf_boost to be ~20M which would overflow an int when multiplied by
values in kf_boost_qadjustment[].
Fixes:
vp8/encoder/ratectrl.c:340:25: runtime error: signed integer overflow:
19999984 * 220 cannot be represented in type 'int'
Bug: oss-fuzz:69100
Change-Id: I2d77c9d2912412f6265f6a8dc0e6b361b63b8242
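One general way to guard a multiplication like the one in the report is to widen the intermediate and saturate the result. The sketch below shows that technique only; it is not necessarily the change made in vp8/encoder/ratectrl.c, and scale_boost_percent() is a hypothetical helper.

```c
#include <limits.h>
#include <stdint.h>

/* Scales `boost` by `percent` / 100 using a 64-bit intermediate and saturates
 * the result to the int range. */
int scale_boost_percent(int boost, int percent) {
  const int64_t scaled = (int64_t)boost * percent / 100;
  if (scaled > INT_MAX) return INT_MAX;
  if (scaled < INT_MIN) return INT_MIN;
  return (int)scaled;
}
```

For the values in the UBSan report, 19999984 * 220 is 4399996480, which does not fit in a 32-bit int, while the 64-bit intermediate divided by 100 does.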
The assignment "cpi->output_framerate = cpi->framerate;" after the
vp8_new_framerate() call is not needed, because vp8_new_framerate() sets
cpi->framerate and cpi->output_framerate to the same value.
Change-Id: I4de97b43957142d658e0c08ecfc6628844ce453a
+ fix an additional double -> int overflow warning (chrome's fuzzers do
not have the float-cast-overflow sanitizer enabled)
Bug: chromium:352414650
Change-Id: I634bb421a74236eac434df138ed71dadf197596a
The only real change is in the initialization of frame_window. The (int)
cast is moved to the result of VPXMIN(), so that
cpi->twopass.total_stats.count - cpi->common.current_video_frame is
calculated in double.
Change-Id: Ia80f24614af7184b37cfdd99d8a8b1639460f273
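The effect of moving the cast can be sketched as below. MIN_OF, frames_in_window() and the max_window parameter are illustrative assumptions; the message does not give the other operand of VPXMIN().

```c
/* Local stand-in for a two-argument minimum macro. */
#define MIN_OF(x, y) (((x) < (y)) ? (x) : (y))

/* `total_count` is a double. The subtraction and the minimum are evaluated
 * in double, and only the (small) result is converted to int. */
int frames_in_window(double total_count, unsigned int current_frame,
                     int max_window) {
  return (int)MIN_OF((double)max_window, total_count - current_frame);
}
```

Casting total_count to int before the subtraction would be undefined for counts above INT_MAX; casting the result of the minimum is safe whenever max_window fits in an int.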
rc->avg_frame_bandwidth is capped at INT_MAX. Rather than multiply the
value by 3, divide projected_frame_size by 3 to avoid the overflow.
Without rounding this differs slightly from the original, but loss of
precision is acceptable in this case.
Bug: chromium:348440590
Change-Id: Id5960825c79d7c764d257e9b4bd0a1de751878d8
Replace the VERSION_STRING_NOSP macro by the public API function
vpx_codec_version_str().
Treat vpx_version.h as an absolutely internal header of the libvpx
library.
Change-Id: I86ba8548a62adae91ae7f5caad98169707f3fc64
This change happens in define_gf_group().
Since this part is not critical for ext_ratectrl,
turn off the error reporting for now.
Change-Id: Ie74aa06a116edb8c5d9e7b29cadbd366232fbc1d
The compare_fp_stats() and compare_fp_stats_md5() functions are not used
when CONFIG_REALTIME_ONLY is equal to 1. Define these functions only if
CONFIG_REALTIME_ONLY is 0 to avoid the -Wunused-function warnings.
Change-Id: Iaae208f67708cfaeee5304b0320ebce63c863f96
Allow the TPL group to use up to 3 reference frames from the
previous GOP. This slightly changes the coding stats in the range
of <0.1%.
STATS_CHANGED
Change-Id: Ieb4e948a783bf8ef9ca78717d56ff750f3f795a4
Fix double-to-int cast overflows in vp8 code caused by setting the
target bitrate to the maximum value (2000000).
Tested: Build libvpx with UndefinedBehaviorSanitizer and then run
./vpxenc husky.yuv -o AV1_husky_2000000_10000000_10000000.webm --good \
--cpu-used=2 -v -t 0 -w 352 -h 288 --fps=10000000/10000000 \
--target-bitrate=2000000 --limit=150 --test-decode=fatal --passes=2 \
--lag-in-frames=25 --min-q=0 --max-q=63 --arnr-maxframes=7 \
--arnr-strength=5 --kf-max-dist=9999 --undershoot-pct=100 \
--overshoot-pct=100 --bias-pct=50 --codec=vp8
Note: This is essentially the VP8 version of the libaom CL
https://aomedia-review.googlesource.com/c/aom/+/191361.
Bug: 349440066
Change-Id: Ia43e1aad8fcab60ace49da960579081c2c3a5445
Fix the following UBSan integer errors in test_decode():
vpxenc.c:1589:57: runtime error: implicit conversion from type 'int' of
value -16 (32-bit, signed) to type 'unsigned int' changed the value to
4294967280 (32-bit, unsigned)
vpxenc.c:1590:58: runtime error: implicit conversion from type 'int' of
value -16 (32-bit, signed) to type 'unsigned int' changed the value to
4294967280 (32-bit, unsigned)
Tested: Build libvpx with -fsanitize=integer and then run
./vpxenc husky.yuv -o AV1_husky_2000000_10000000_10000000.webm --good \
--cpu-used=2 -v -t 0 -w 352 -h 288 --fps=10000000/10000000 \
--target-bitrate=2000000 --limit=150 --test-decode=fatal --passes=2 \
--lag-in-frames=25 --min-q=0 --max-q=63 --arnr-maxframes=7 \
--arnr-strength=5 --kf-max-dist=9999 --undershoot-pct=100 \
--overshoot-pct=100 --bias-pct=50 --codec=vp8
Bug: 349440066
Change-Id: Ice2f0e7176ffec664856559e2c02bd51113c4d74
Tested: Build libvpx with -fsanitize=integer and then run
./vpxenc husky.yuv -o AV1_husky_2000000_10000000_10000000.webm --good \
--cpu-used=2 -v -t 0 -w 352 -h 288 --fps=10000000/10000000 \
--target-bitrate=2000000 --limit=150 --test-decode=fatal --passes=2 \
--lag-in-frames=25 --min-q=0 --max-q=63 --min-gf-interval=4 \
--max-gf-interval=22 --arnr-maxframes=7 --arnr-strength=5 \
--kf-max-dist=9999 --aq-mode=0 --undershoot-pct=100 \
--overshoot-pct=100 --bias-pct=50
This unsigned integer overflow seems to be caused by
g_timebase.num=1000000.
Note: This is a port of the libaom CL
https://aomedia-review.googlesource.com/c/aom/+/191401.
Bug: 349440066
Change-Id: I924fa9c653400764dd7320938b88b4ea40f38172
This patch fixes some additional cases where under extreme conditions
some of the VBR adjustment variables can wrap.
As this happens on a per frame level the extra saturation checks should
not be an issue for performance.
Note: This CL is a port of the following libaom CLs:
https://aomedia-review.googlesource.com/c/aom/+/190521
https://aomedia-review.googlesource.com/c/aom/+/190888
Change-Id: I87c4ecca10f39767002f7d90d0f43b19c7150832
Current code was disallowing scene detection for
speeds >= 8, to avoid any encode_time increase
(see comment in the code).
But we can expect the cost to be small even at speeds 8 and 9,
and that concern on encode_time was from some time ago
before 8 and 9 were further optimized. And this is
needed for content with scene changes (see issue attached).
So allow scene detection now for all RTC speed settings (speed >= 5).
Bug: b/346846607
Change-Id: I678dbb88ff1399ed89b2bf9770ae9427e3044fc4
The last reference to the flag in configure was removed in:
fad70a358 Remove -fno-strict-aliasing flag
The library should be expected to function without this flag; it's built
and tested elsewhere without it.
Bug: webm:570, webm:603
Change-Id: Icf85fd9bd5c9cb0c81d6eecf10fba07807f48b4a
The GNU Assembler was removed in r24. clang's internal assembler works,
but `-c` is necessary to avoid linking.
Bug: webm:1856
Change-Id: I61f80cf78657d3b71d5e73c5b2510575533ca5ea
Move the function into define_gf_group().
define_gf_group() has a lot of settings that might cause
performance drop if skipped.
Imitate define_gf_group_structure()'s behavior, which adds
an extra overlay frame at the end of gf_group whenever
alt_ref is used.
After this change, we can feed the baseline decision through
webmrc and get the same result as baseline.
This CL is tested with city_cif.yuv using ffmpeg
BUG = b/345528565
Change-Id: Ib61f0a0a72251f8662fb4072e0cfd7f456a243b3
Quiets some spurious -Wmaybe-uninitialized warnings with gcc 14.1.0.
In function 'calc_plane_error16',
inlined from 'main' at ../tools/tiny_ssim.c:464:5:
../tools/tiny_ssim.c:37:12: warning: 'v[0]' may be used uninitialized
[-Wmaybe-uninitialized]
37 | if (orig == NULL || recon == NULL) {
| ^
In function 'calc_plane_error16',
inlined from 'main' at ../tools/tiny_ssim.c:462:5:
../tools/tiny_ssim.c:37:12: warning: 'u[0]' may be used uninitialized
[-Wmaybe-uninitialized]
37 | if (orig == NULL || recon == NULL) {
| ^
In function 'calc_plane_error',
inlined from 'main' at ../tools/tiny_ssim.c:461:5:
../tools/tiny_ssim.c:61:12: warning: 'y[0]' may be used uninitialized
[-Wmaybe-uninitialized]
61 | if (orig == NULL || recon == NULL) {
To reduce confusion, read_input_file() is changed to return an int as
previously it would only return (size_t)-1/0/1 (and now returns 0/1).
Change-Id: I2344048ecc2bd233891ffcef08002ee98d6d262a
The default behavior changed in:
148d1085f Refactor and extend run-time CPU feature detection on Arm
This fixes build errors with these targets as there is no runtime cpu
detection defined for them.
Change-Id: Ie6b0bae1fc3e244d7dfcc823f60c3e466ccade79
Both VP8 and VP9 internally cap the target bitrate to the smaller of the
uncompressed bitrate and 1000000 kilobits per second.
Change-Id: I4008ce09b5e709e75111800341d015e41eb1da42
This change fixes issues that can occur if the user specifies a very
high target data rate or rate per frame.
Fixes some issues with overflow of int variables used to hold bitrate
values (rate per second, rate per frame, etc.).
Note: This CL is a port of the following libaom CLs:
https://aomedia-review.googlesource.com/c/aom/+/190381
https://aomedia-review.googlesource.com/c/aom/+/190462
All the changes were ported to VP9. For VP8, only the new type of
cpi->bytes (equivalent to ppi->total_bytes in libaom) was ported.
Change-Id: I438dd46efd5a134389b893ffae1f8a2381207906
2024-05-21 v1.14.1 "Venetian Duck"
This release includes enhancements and bug fixes.
- Upgrading:
This release is ABI compatible with the previous release.
- Enhancement:
Improved the detection of compiler support for AArch64 extensions,
particularly SVE.
Added vpx_codec_get_global_headers() support for VP9.
- Bug fixes:
Added buffer bounds checks to vpx_writer and vpx_write_bit_buffer.
Fix to GetSegmentationData() crash in aq_mode=0 for RTC rate control.
Fix to alloc for row_base_thresh_freq_fac.
Free row mt memory before freeing cpi->tile_data.
Fix to buffer alloc for vp9_bitstream_worker_data.
Fix to VP8 race issue for multi-thread with psnr_calc.
Fix to uv width/height in vp9_scale_and_extend_frame_ssse3.
Fix to integer division by zero and overflow in calc_pframe_target_size().
Fix to integer overflow in vpx_img_alloc() & vpx_img_wrap() (CVE-2024-5197).
Fix to UBSan error in vp9_rc_update_framerate().
Fix to UBSan errors in vp8_new_framerate().
Fix to integer overflow in vp8 encodeframe.c.
Handle EINTR from sem_wait().
Change-Id: Ic5e274fdc35c9141591a65e825bf012d2cca3caa
I introduced this bug in commit 2e32276:
https://chromium-review.googlesource.com/c/webm/libvpx/+/5446333
I changed the line
stride_in_bytes = (fmt & VPX_IMG_FMT_HIGHBITDEPTH) ? s * 2 : s;
to three lines:
s = (fmt & VPX_IMG_FMT_HIGHBITDEPTH) ? s * 2 : s;
if (s > INT_MAX) goto fail;
stride_in_bytes = (int)s;
But I didn't realize that `s` is used later in the calculation of
alloc_size.
As a quick fix, undo the effect of s * 2 for high bit depths after `s`
has been assigned to stride_in_bytes.
Bug: chromium:332382766
Change-Id: I53fbf405555645ab1d7254d31aadabe4f426be8c
(cherry picked from commit 74c70af016)
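The quick fix can be sketched in isolation as below. compute_stride() and its parameters are hypothetical; `high_bitdepth` stands in for the (fmt & VPX_IMG_FMT_HIGHBITDEPTH) test, and the sketch assumes *s has already been bounded so that *s * 2 cannot wrap.

```c
#include <limits.h>
#include <stdint.h>

/* Computes the row stride in bytes from `*s` (a stride in samples).
 * Returns 0 on success, or -1 if the byte stride does not fit in an int.
 * On success `*s` is restored to its undoubled value, because later size
 * calculations still use it. */
int compute_stride(uint64_t *s, int high_bitdepth, int *stride_in_bytes) {
  *s = high_bitdepth ? *s * 2 : *s;
  if (*s > INT_MAX) return -1;
  *stride_in_bytes = (int)*s;
  /* Undo the doubling after the stride has been assigned. */
  *s = high_bitdepth ? *s / 2 : *s;
  return 0;
}
```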
A port of the libaom CL
https://aomedia-review.googlesource.com/c/aom/+/188962.
stride_align is documented to be the "alignment, in bytes, of each row
in the image (stride)."
Change-Id: I2184b50dc3607611f47719319fa5adb3adcef2fd
(cherry picked from commit 7d37ffacc6)
A port of the libaom CL
https://aomedia-review.googlesource.com/c/aom/+/188823.
Impose maximum values on the input parameters so that we can perform
arithmetic operations without worrying about overflows.
Also change the VpxImageTest.VpxImgAllocHugeWidth test to write to the
first and last samples in the first row of the Y plane, so that the test
will crash if there is unsigned integer overflow in the calculation of
stride_in_bytes.
Bug: chromium:332382766
Change-Id: I54cec6c9e26377abaa8a991042ba277ff70afdf3
(cherry picked from commit 06af417e79)
A port of the libaom CL
https://aomedia-review.googlesource.com/c/aom/+/188761.
Fix unsigned integer overflows in the calculation of stride_in_bytes in
img_alloc_helper() when d_w is huge.
Change the type of stride_in_bytes from unsigned int to int because it
will be assigned to img->stride[VPX_PLANE_Y], which is of the int type.
Test:
. ../libvpx/tools/set_analyzer_env.sh integer
../libvpx/configure --enable-debug --disable-optimizations
make -j
./test_libvpx --gtest_filter=VpxImageTest.VpxImgAllocHugeWidth
Bug: chromium:332382766
Change-Id: I3b39d78f61c7255e10cbf72ba2f4975425a05a82
(cherry picked from commit 2e32276277)
Ported from test/aom_image_test.cc in libaom commit 04d6253.
Change-Id: I56478d0a5603cfb5b65e644add0918387ff69a00
(cherry picked from commit 3dbab0e664)
If fmt is VPX_IMG_FMT_NONE, currently img_alloc_helper() allocates a
single plane because VPX_IMG_FMT_NONE (0) is not a planar format (the
VPX_IMG_FMT_PLANAR bit is not set in VPX_IMG_FMT_NONE).
Although this seems correct, the problem is that most of the code in
libvpx assumes planar formats and is likely to dereference a null
pointer when it uses img->planes[1]. Also, VPX_IMG_FMT_NONE isn't really
a valid image format. So it is safer to make img_alloc_helper() fail if
fmt is VPX_IMG_FMT_NONE.
Change-Id: I05b47f4b5eceb631a02384b2cce1c2f6fdca8673
(cherry picked from commit d3a946de8c)
The integer overflow happens
in vp9_calc_iframe_target_size_one_pass_cbr(), when
calculating the target size for L1T3 encoding.
The input target bitrate (kbps) is very large, so it gets set
to INT_MAX (before being multiplied by 1000 to convert to bps),
and avg_frame_bandwidth is then set to (INT_MAX / lc->framerate),
which when multiplied by (16 + kf_boost) can exceed INT_MAX.
Fix is to cast the operands to int64_t and final result to int.
Bug: chromium:340918567
Change-Id: Ic00094b22c1f12ca988c0cb1fcaed473e1f8ed2b
In multi-threaded scenario, when the bitstream
buffer allocated is insufficient, the main thread
called 'longjmp' without waiting for the completion
of workers. In this patch, 'longjmp' is called by
the main thread after joining other worker threads.
This resolves the assertion failure as reported in
Bug: webm:1847
Bug: webm:1844
Change-Id: I399c76087b65e7b8d9a9fa4f12d784408243d648
(cherry picked from commit 611d9ba0a5)
Add the `size` and `error` members to the vpx_write_bit_buffer struct.
Add the vpx_wb_init() and vpx_wb_has_error() functions.
Instances of the vpx_write_bit_buffer struct are only allocated in the
vp9_pack_bitstream() function. So vp9_pack_bitstream() is the only
function outside vpx_dsp/bitwriter_buffer.* that needs updating.
This CL completes the work of adding output buffer bounds checks to
vp9/encoder/vp9_bitstream.c.
Bug: webm:1844
Change-Id: I6b362be572852ee51d96023b35bfb334faada7e1
(cherry picked from commit d790001fd5)
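The idea of a bit writer that carries its buffer size and an error flag can be sketched as follows. This is an independent illustration, not the vpx_write_bit_buffer code: BitWriter, bw_init(), bw_has_error() and bw_write_bit() are hypothetical names, and the real vpx_wb_init() and vpx_wb_has_error() signatures are not shown in the message.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit writer that tracks the buffer size and an error flag. */
typedef struct {
  uint8_t *buf;
  size_t size;    /* capacity of buf in bytes */
  size_t bit_off; /* next bit position to write */
  int error;      /* set once a write would go past the end */
} BitWriter;

void bw_init(BitWriter *bw, uint8_t *buf, size_t size) {
  bw->buf = buf;
  bw->size = size;
  bw->bit_off = 0;
  bw->error = 0;
}

int bw_has_error(const BitWriter *bw) { return bw->error; }

/* Writes one bit, most significant bit first. If the buffer is full, the bit
 * is dropped and the error flag is set instead of writing out of bounds. */
void bw_write_bit(BitWriter *bw, int bit) {
  const size_t byte = bw->bit_off / 8;
  const int shift = 7 - (int)(bw->bit_off % 8);
  if (bw->error) return;
  if (byte >= bw->size) {
    bw->error = 1;
    return;
  }
  if (shift == 7) bw->buf[byte] = 0; /* first bit of a byte clears it */
  bw->buf[byte] |= (uint8_t)((bit & 1) << shift);
  bw->bit_off++;
}
```

The caller checks the flag once after packing instead of validating every write, which keeps the call sites unchanged.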
In the vpx_writer struct, change the buffer_end field to the size field.
Change vpx_stop_encode() to return true on success, false on failure
(output buffer full).
In write_compressed_header(), remove the assertion
assert(header_bc.pos <= 0xffff). The caller (vp9_pack_bitstream()) will
check that condition.
In vp9_pack_bitstream(), the variable "first_part_size" is renamed
"compressed_hdr_size".
Bug: webm:1844
Change-Id: I4ed6ab905a707ad44d875e53036d5a42523a65d0
(cherry picked from commit 73703c188b)
Fixes a static analysis warning:
Value stored to 'data_size' is never read
Bug: webm:1844
Change-Id: Ia27181b1051bb2c3a6bc4a4c2549df8b0525e889
(cherry picked from commit 9f73377821)
The buffer_end field will allow bounds checking when vpx_writer writes
to the output buffer. This CL sets up the plumbing to pass the output
buffer size from vp9_pack_bitstream() to vpx_start_encode(), which
initializes the vpx_writer struct. vpx_writer doesn't use the output
buffer size in bounds checks yet, but the code in vp9_bitstream.c does.
Bug: webm:1844
Change-Id: I995e469ab453c02d740f54b46e0b08c7f2eb1a2e
(cherry picked from commit e387187438)
Set up the plumbing to pass the size of the output buffer `dest` to
vp9_pack_bitstream(). The output buffer is the cx_data buffer in the
encoder_encode() function in vp9/vp9_cx_iface.c, and its size is
cx_data_sz.
In this CL vp9_pack_bitstream() ignores the `dest_size` parameter.
Bug: webm:1844
Change-Id: I53c80280143d409cf16f87c4d6deec3d9338aea3
(cherry picked from commit d48577579b)
In multi-threaded scenario, when the bitstream
buffer allocated is insufficient, the main thread
called 'longjmp' without waiting for the completion
of workers. In this patch, 'longjmp' is called by
the main thread after joining other worker threads.
This resolves the assertion failure as reported in
Bug: webm:1847
Bug: webm:1844
Change-Id: I399c76087b65e7b8d9a9fa4f12d784408243d648
cpi_->cyclic_refresh is nullptr if aq_mode is 0, in other words, the
rate controller runs in non adaptive quantization mode. This CL fixes
the crash in GetSegmentationData() in non aq mode.
Bug: b/259487065
Test: video encoding on ChromeOS
Change-Id: I503b30d15c697c8dd1da203b3c7361b91c428e87
(cherry picked from commit 1d007eafa3)
Issue happens for real-time nonrd pickmode.
Due to speed feature: sf->adaptive_rd_thresh_row_mt,
enabled for speed >= 8, and for speed >= 7 svc only.
Issue occurs where resolution (sb_rows) changes and
row_base_thresh_freq_fac needs to be re-allocated.
Fix is to add sb_rows to TileDataEnc and check for
re-alloc of row_base_thresh_freq_fac.
Bug: b:331108922
Change-Id: I1a1ca94c14f343200c180725e4cb8d91d3c55b83
(cherry picked from commit 3f8f19372b)
In vp9_init_tile_data(), call vp9_row_mt_mem_dealloc(cpi) to free the
row mt memory in cpi->tile_data before freeing cpi->tile_data.
Bug: b:331086799, b:331108729
Change-Id: Idc79984ce7e0110e6858139b2ed286492a2e8622
(cherry picked from commit 34277e53ad)
Before proceeding with Encode(). This avoids some static analysis
warnings about uninitialized `cfg_` members.
Change-Id: Ib67b278d6706ab1034219e8c1ad9ba0c5b574ba8
(cherry picked from commit 108f5128e2)
sem_wait() may be interrupted by a signal and fail with EINTR:
https://pubs.opengroup.org/onlinepubs/9699919799/functions/sem_wait.html
Retry the sem_wait() call if it fails with EINTR.
This finishes the fix started in
https://chromium-review.googlesource.com/c/webm/libvpx/+/5299569. As a
speculative fix, that CL fixed only the sem_wait(&cpi->h_event_end_lpf)
calls responsible for bug chromium:324459561. ClusterFuzz verified the
fix, so this CL extends it to the other sem_wait() calls.
Note that sem_wait() calls like the following do not need this fix,
because the while (1) loop retries the sem_wait() call if it fails:
while (1) {
if (vpx_atomic_load_acquire(&cpi->b_multi_threaded) == 0) break;
if (sem_wait(&cpi->h_event_start_lpf) == 0) {
...
}
}
Bug: chromium:324459561
Change-Id: I0f0612616eee37fb3da68049e49b3e86927b5e24
(cherry picked from commit d4959f9825)
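The retry described above is a common POSIX idiom and can be sketched as a small wrapper. sem_wait_retry() is a hypothetical name; the message does not say whether the fix uses a helper or inline loops.

```c
#include <errno.h>
#include <semaphore.h>

/* Waits on `sem`, retrying when the call is interrupted by a signal.
 * Returns 0 on success, or -1 with errno set to a value other than EINTR. */
int sem_wait_retry(sem_t *sem) {
  int ret;
  do {
    ret = sem_wait(sem);
  } while (ret == -1 && errno == EINTR);
  return ret;
}
```

Call sites that already sit inside a loop that re-issues sem_wait() on failure, like the while (1) example above, get the same effect without the wrapper.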
Before proceeding with Encode(). This avoids some static analysis
warnings about uninitialized `cfg_` members.
Change-Id: Ib67b278d6706ab1034219e8c1ad9ba0c5b574ba8
In very rare cases (e.g. encoding with very high bit rate), the
allocated token memory isn't enough, which causes a buffer overflow
and then an encoder failure. This is fixed by using the aligned
number of blocks while allocating this buffer.
BUG=b/328803779
Change-Id: I5437cce13398206bf9982d57f35d6f9da17b187f
This is a port of the change in libaom:
https://aomedia-review.googlesource.com/c/aom/+/189761
5ccdc66ab6 cpu.cmake: Do more elaborate test of whether SVE can be compiled
For Windows targets, Clang will successfully compile simpler
SVE functions, but if the function requires backing up and restoring
SVE registers (as part of the AAPCS calling convention), Clang
will fail to generate unwind data for this function, resulting
in an error.
This issue is tracked upstream in Clang in
https://github.com/llvm/llvm-project/issues/80009.
Check whether the compiler can compile such a function, and
disable SVE if it is unable to handle that case.
Change-Id: I8550248abd6a7876bd8ecf6ba66bc70518133566
(cherry picked from commit 35f0262c5e)
This mode is used infrequently and is quite slow. This shifts the tests
to nightly to speed up the presubmit.
Change-Id: I3020887e0ca0150d7cbea9cc726649c11f94d56c
Use the utility functions and set gf_group_size in
ext_rc_define_gf_group_structure()
Avoid using gop_decision->update_type to keep the logic simple
for now.
Also simplify the interface.
Change-Id: I78fd5892e6f9731d50d6e5da97598b46c70a1dde
The vpx_ports/msvc.h header provides snprintf() and round() for MSVC
older than Visual Studio 2015 and Visual Studio 2013, respectively.
Since configure now requires vs14 (Visual Studio 2015) or later, it is
safe to remove vpx_ports/msvc.h.
Change-Id: I2fe4c41eaa126f4cf17639c11895f1e464294c76
Replace %ld with %zu for `size_t`. Added in:
fd28f6f3c Add rate_ctrl_log_path
Fixes:
vp9\encoder\vp9_encoder.c(5748,15): warning C4477: 'fprintf' : format
string '%ld' requires an argument of type 'long', but variadic
argument 2 has type 'size_t'
Change-Id: I36fa9c7a9e14d4a2d9ef51a7f5c55de71bb34518
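For reference, a minimal example of the conversion in question. format_size() is a hypothetical helper used only to show the %zu specifier; it is not code from vp9_encoder.c.

```c
#include <stddef.h>
#include <stdio.h>

/* Formats a size_t with %zu, the C99 conversion for that type, and returns
 * the value from snprintf(). Using %ld here is wrong on platforms where long
 * and size_t differ in width, such as 64-bit Windows. */
int format_size(char *out, size_t out_size, size_t value) {
  return snprintf(out, out_size, "%zu", value);
}
```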
If img_data is not NULL, img_alloc_helper ignores buf_align, so
vpx_img_wrap can set buf_align to any placeholder value.
A port of the libaom CL
https://aomedia-review.googlesource.com/c/aom/+/90362.
Bug: webm:1850
Change-Id: I42bc45aecf822a9314caf23058fe123d0574dc20
Port the changes to aom/src/aom_image.c in the libaom CL
https://aomedia-review.googlesource.com/c/aom/+/56643. The changes
related to `border` are not ported.
Bug: webm:1850
Change-Id: Ie81fffe0c84e912da880ffca245ae27cd71cf348
I introduced this bug in commit 2e32276:
https://chromium-review.googlesource.com/c/webm/libvpx/+/5446333
I changed the line
stride_in_bytes = (fmt & VPX_IMG_FMT_HIGHBITDEPTH) ? s * 2 : s;
to three lines:
s = (fmt & VPX_IMG_FMT_HIGHBITDEPTH) ? s * 2 : s;
if (s > INT_MAX) goto fail;
stride_in_bytes = (int)s;
But I didn't realize that `s` is used later in the calculation of
alloc_size.
As a quick fix, undo the effect of s * 2 for high bit depths after `s`
has been assigned to stride_in_bytes.
Bug: chromium:332382766
Change-Id: I53fbf405555645ab1d7254d31aadabe4f426be8c
A port of the libaom CL
https://aomedia-review.googlesource.com/c/aom/+/188962.
stride_align is documented to be the "alignment, in bytes, of each row
in the image (stride)."
Change-Id: I2184b50dc3607611f47719319fa5adb3adcef2fd
A port of the libaom CL
https://aomedia-review.googlesource.com/c/aom/+/188823.
Impose maximum values on the input parameters so that we can perform
arithmetic operations without worrying about overflows.
Also change the VpxImageTest.VpxImgAllocHugeWidth test to write to the
first and last samples in the first row of the Y plane, so that the test
will crash if there is unsigned integer overflow in the calculation of
stride_in_bytes.
Bug: chromium:332382766
Change-Id: I54cec6c9e26377abaa8a991042ba277ff70afdf3
A port of the libaom CL
https://aomedia-review.googlesource.com/c/aom/+/188761.
Fix unsigned integer overflows in the calculation of stride_in_bytes in
img_alloc_helper() when d_w is huge.
Change the type of stride_in_bytes from unsigned int to int because it
will be assigned to img->stride[VPX_PLANE_Y], which is of the int type.
Test:
. ../libvpx/tools/set_analyzer_env.sh integer
../libvpx/configure --enable-debug --disable-optimizations
make -j
./test_libvpx --gtest_filter=VpxImageTest.VpxImgAllocHugeWidth
Bug: chromium:332382766
Change-Id: I3b39d78f61c7255e10cbf72ba2f4975425a05a82
The MAX_NUM_THREADS macro is unrelated to the VPxWorkerInterface, so it
doesn't need to be defined in vpx_util/vpx_thread.h.
The VP8 code doesn't seem to depend on MAX_NUM_THREADS, so VP8 can use
64 directly in the range check of its g_threads option. Move the
definition of the MAX_NUM_THREADS macro to vp9/encoder/vp9_ethread.h and
use it in VP9 code only.
Change-Id: Ibf788ca2496c743a2ac0498fefaab8a3c181228d
The `error: use of undeclared identifier 'EBUSY'` in
vpx_util/vpx_pthread.h was found in Mozilla's bug 1886318 [1]. This
patch addresses the issue by adding the `<errno.h>` header to introduce
the `EBUSY` identifier, resolving the problem.
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1886318#c1
Change-Id: Ic417dafebf5ab160060dd29f692fa9c40d8db05a
The Google cpp style guide dictates that you should "include what you
use" with respect to symbols. This CL adds vpx_config.h includes to unit
tests that rely on config flags but previously got the header only
indirectly.
Change-Id: Ia70a512cebe6c104d2d64afbed3cde8a405c68df
This CL will help run libvpx tests under Chromium against its partition
allocator. The allocator does not support single allocations above
3.998GiB. Because of this, tests related to large video sizes that
Chromium is configured for are expected to fail.
Chromium also only supports the CONFIG_REALTIME_ONLY option, so
some changes are scoped behind this flag.
Change-Id: I80e8743c0619ce502688109ce0be01cb252d5f92
ctx->pending_cx_data is a pointer. It looks nicer to compare
ctx->pending_cx_data with NULL than with 0.
Change-Id: I18815907b3d75551abfc603cb3c5c0297dceed23
cpi_->cyclic_refresh is nullptr if aq_mode is 0, in other words, the
rate controller runs in non adaptive quantization mode. This CL fixes
the crash in GetSegmentationData() in non aq mode.
Bug: b/259487065
Test: video encoding on ChromeOS
Change-Id: I503b30d15c697c8dd1da203b3c7361b91c428e87
VPX_CODEC_CORRUPT_FRAME is a decoder error. It is strange for
vpx_codec_encode() to fail with this error. In set_frame_size(), change
VPX_CODEC_CORRUPT_FRAME to VPX_CODEC_ERROR.
The use of VPX_CODEC_CORRUPT_FRAME was originally added in
commit 1ed56a46b3.
Change-Id: Iee92ed4cfca5061289b278ece2ba475cf98fec06
The current SVE2 approach to 2D convolution is:
1) Filter horizontally, storing to an intermediate buffer.
2) Filter vertically and store the final output.
This patch merges the two phases for high bitdepth 2D convolution for
filter sizes smaller than or equal to 4 to avoid storing to and
re-loading from the intermediate buffer.
This approach is not beneficial when applying an 8tap filter in the
convolution.
Change-Id: Ie090eb79f1cbf182300d9343ae63069396ef3956
These invalid value definitions are necessary to initialize
the gop decision in external RC so libvpx can tell which is populated
and which is not.
Bug: b/329483680
Change-Id: I06bbb41fa59d0fb95296aebd0d05a703ec953b81
Coverity somehow thinks the return value of read_tx_mode() is between 0
and 7 (inclusive).
Hopefully this will fix Coverity CID 1584457: Out-of-bounds access in
read_coef_probs().
Change-Id: I49fbddf6fd6861bc9def9dfa91eaaaa4aefe5710
This array will be partially configured and used in later rate
distortion optimization search.
BUG=webm:1846
Change-Id: I83daba341c56767187031edb1c10d4528a4257a3
Add the `size` and `error` members to the vpx_write_bit_buffer struct.
Add the vpx_wb_init() and vpx_wb_has_error() functions.
Instances of the vpx_write_bit_buffer struct are only allocated in the
vp9_pack_bitstream() function. So vp9_pack_bitstream() is the only
function outside vpx_dsp/bitwriter_buffer.* that needs updating.
This CL completes the work of adding output buffer bounds checks to
vp9/encoder/vp9_bitstream.c.
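A minimal sketch of how a write bit buffer with `size` and `error` members can refuse to write past the end of its output. The `demo_wb` struct and the `demo_wb_*` functions are hypothetical stand-ins for illustration, not the actual libvpx definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a write bit buffer with bounds checking. */
struct demo_wb {
  uint8_t *buf;   /* output buffer */
  size_t size;    /* capacity of buf in bytes */
  size_t bit_off; /* position of the next bit to write */
  int error;      /* set once a write would overflow buf */
};

static void demo_wb_init(struct demo_wb *wb, uint8_t *buf, size_t size) {
  assert(wb != NULL && buf != NULL);
  wb->buf = buf;
  wb->size = size;
  wb->bit_off = 0;
  wb->error = 0;
}

static int demo_wb_has_error(const struct demo_wb *wb) { return wb->error; }

/* Write one bit MSB-first. On overflow, record the error and drop the bit. */
static void demo_wb_write_bit(struct demo_wb *wb, int bit) {
  const size_t byte = wb->bit_off / 8;
  const int shift = 7 - (int)(wb->bit_off % 8);
  if (wb->error) return;
  if (byte >= wb->size) {
    wb->error = 1;
    return;
  }
  if (shift == 7) wb->buf[byte] = 0; /* first bit of a fresh byte: clear it */
  wb->buf[byte] = (uint8_t)(wb->buf[byte] | ((bit & 1) << shift));
  wb->bit_off++;
}
```

With this shape the caller can write the whole header unconditionally and check the error flag once at the end, instead of checking every individual write.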
Bug: webm:1844
Change-Id: I6b362be572852ee51d96023b35bfb334faada7e1
Issue happens for real-time nonrd pickmode.
Due to speed feature: sf->adaptive_rd_thresh_row_mt,
enabled for speed >= 8, and for speed >= 7 svc only.
Issue occurs where resolution (sb_rows) changes and
row_base_thresh_freq_fact needs to be re-allocated.
Fix is to add sb_rows to TileDataEnc and check for
re-alloc of row_base_thresh_freq_fact.
Bug: b:331108922
Change-Id: I1a1ca94c14f343200c180725e4cb8d91d3c55b83
In the vpx_writer struct, change the buffer_end field to the size field.
Change vpx_stop_encode() to return true on success, false on failure
(output buffer full).
In write_compressed_header(), remove the assertion
assert(header_bc.pos <= 0xffff). The caller (vp9_pack_bitstream()) will
check that condition.
In vp9_pack_bitstream(), the variable "first_part_size" is renamed
"compressed_hdr_size".
Bug: webm:1844
Change-Id: I4ed6ab905a707ad44d875e53036d5a42523a65d0
In vp9_init_tile_data(), call vp9_row_mt_mem_dealloc(cpi) to free the
row mt memory in cpi->tile_data before freeing cpi->tile_data.
Bug: b:331086799, b:331108729
Change-Id: Idc79984ce7e0110e6858139b2ed286492a2e8622
The code was using the bitstream_worker_data when it
had not been allocated with a large enough size. This is because
the existing condition was to only re-alloc the
bitstream_worker_data when current dest_size was larger
than the current frame_size. But under resolution change
where frame_size is increased, beyond the current dest_size,
we need to allow re-alloc to the new size.
The existing condition to re-alloc when dest_size is
larger than frame_size (which is not required) is kept
for now.
Also increase the dest_size to account for image format.
Added tests, for both ROW_MT=0 and 1, that reproduce
the failures in the bugs below.
Note: this issue only affects the REALTIME encoding path.
Bug: b/329088759, b/329674887, b/329179808
Change-Id: Icd65dbc5317120304d803f648d4bd9405710db6f
(cherry picked from commit c29e637283)
2D 8-tap convolution filtering is performed in two passes -
horizontal and vertical. The horizontal pass must produce enough
input data for the subsequent vertical pass - 3 rows above and 4 rows
below, in addition to the actual block height.
At present, all highbd SVE horizontal convolution algorithms process
4 rows at a time, but this means we end up doing at least 1 row too
much work in the 2D first pass case where we need h + 7, not h + 8
rows of output.
This patch adds an additional SVE2 path that processes h + 7 rows of
data exactly, saving the work of the unnecessary extra row.
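The row arithmetic above can be sketched as follows. The helper names are hypothetical; only the h + 7 and h + 8 figures come from the description.

```c
#include <assert.h>

/* Rows the horizontal (first) pass must produce for a vertical filter with
 * `taps` coefficients: the block height plus (taps - 1) context rows. */
static int first_pass_rows(int h, int taps) {
  assert(h > 0 && taps > 0);
  return h + taps - 1;
}

/* Rows actually computed when the kernel works in groups of `step` rows. */
static int rows_processed(int rows_needed, int step) {
  assert(rows_needed > 0 && step > 0);
  return (rows_needed + step - 1) / step * step;
}
```

For an 8-tap filter and a block height that is a multiple of 4, `first_pass_rows()` gives h + 7, which a 4-rows-at-a-time kernel rounds up to h + 8. That one extra row is the work the new path avoids.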
Change-Id: I2f5d39ad737dbd7eccb08dd2b51586c6710119b8
If a local variable "pc" is defined as &cpi->common, replace
"cpi->common." with "pc->".
Also replace a memcpy() call with a struct assignment.
Change-Id: I6f4f12e69d9989beaa6e04c83d93230e7d726278
Declare the dest_size member of the VP9BitstreamWorkerData struct as
size_t instead of int.
Fix the following MSVC warning:
vp9\encoder\vp9_bitstream.c(1031,37): warning C4267: '=':
conversion from 'size_t' to 'int', possible loss of data
Change-Id: Idab5ad5d4bf4d1e4754f011a3073c9a89da29f55
The buffer_end field will allow bounds checking when vpx_writer writes
to the output buffer. This CL sets up the plumbing to pass the output
buffer size from vp9_pack_bitstream() to vpx_start_encode(), which
initializes the vpx_writer struct. vpx_writer doesn't use the output
buffer size in bounds checks yet, but the code in vp9_bitstream.c does.
Bug: webm:1844
Change-Id: I995e469ab453c02d740f54b46e0b08c7f2eb1a2e
This was added in libaom in:
5ddac0aac8 RTCD defs: Remove empty specialize statements once and for all.
https://aomedia-review.googlesource.com/c/aom/+/9062
Change-Id: I9c8fb0c8e4bd4dc9373d8533ab083dff816e7cbe
Set up the plumbing to pass the size of the output buffer `dest` to
vp9_pack_bitstream(). The output buffer is the cx_data buffer in the
encoder_encode() function in vp9/vp9_cx_iface.c, and its size is
cx_data_sz.
In this CL vp9_pack_bitstream() ignores the `dest_size` parameter.
Bug: webm:1844
Change-Id: I53c80280143d409cf16f87c4d6deec3d9338aea3
Avoid calling encode_tiles_buffer_alloc_size() twice by saving its
return value in a local variable.
Change-Id: I3050f9cf7c3520f7edc80abf66620ba233fadad8
The code was using the bitstream_worker_data when it
had not been allocated with a large enough size. This is because
the existing condition was to only re-alloc the
bitstream_worker_data when current dest_size was larger
than the current frame_size. But under resolution change
where frame_size is increased, beyond the current dest_size,
we need to allow re-alloc to the new size.
The existing condition to re-alloc when dest_size is
larger than frame_size (which is not required) is kept
for now.
Also increase the dest_size to account for image format.
Added tests, for both ROW_MT=0 and 1, that reproduce
the failures in the bugs below.
Note: this issue only affects the REALTIME encoding path.
Bug: b/329088759, b/329674887, b/329179808
Change-Id: Icd65dbc5317120304d803f648d4bd9405710db6f
SVE and SVE2 code paths in libvpx require intrinsics from
arm_neon_sve_bridge.h. SVE is disabled if the compiler does not
support this header. This patch conditionally disables SVE2 in the
same way.
Also gate the check for arm_neon_sve_bridge.h on whether SVE is
enabled in the first place. The check isn't necessary if the user has
explicitly disabled SVE. (Explicitly disabling SVE already disables
SVE2 since the former is a pre-requisite for the latter.)
Change-Id: Ibb21f09e8b2470d1ce5d98b71b101f5b7f7dbcdc
In encoder_encode(), remove the return statement after a
vpx_internal_error() call because setjmp() has been called at that
point.
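A minimal sketch of why a statement placed after an error call is dead code once setjmp() is armed. `demo_error()` and `demo_encode()` are hypothetical stand-ins, assuming the error reporter long-jumps back to the recovery point instead of returning.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf demo_jmp; /* hypothetical error-recovery point */

/* Hypothetical stand-in for an error reporter that long-jumps away. */
static void demo_error(void) { longjmp(demo_jmp, 1); }

/* Returns 1 if the error path was taken via longjmp(), 0 otherwise. */
static int demo_encode(int fail) {
  if (setjmp(demo_jmp)) return 1; /* resumed here after demo_error() */
  if (fail) {
    demo_error();
    assert(0 && "unreachable"); /* demo_error() does not return */
  }
  return 0;
}
```

Any `return` written directly after `demo_error()` would sit in the same position as the assertion and could never execute.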
Change-Id: Ib8ebbfbacb21097ce7f1b4e3bf53004bbe88a42b
in struct VP8RateControlRtcConfig and struct VP9RateControlRtcConfig;
structs default to public access.
Change-Id: Icdc5b44fb4c7297b0cb3c6cde8bec33ea5cee18c
vp8/vp8_ratectrl_rtc.h should come first as it's implemented in this
module. Split the rest of the groups on C/C++/vpx bounds.
Change-Id: If6bbbd8f3adf3766fa36fbc53ae06c9f6f76ebe9
Add SVE2 implementation of vpx_highbd_convolve8_avg_vert function.
Add the corresponding tests as well.
Change-Id: I20ca19e09a1686bb00c0b51bf756ddab0adbc2c0
Add SVE implementation of vpx_highbd_convolve8_avg_horiz function.
Add the corresponding tests as well.
Change-Id: If13793fa653834dfdfeddfee60b80129eea85dd7
Add SVE2 implementation of vpx_highbd_convolve8_vert function. Add
the corresponding tests as well.
Change-Id: I289ac79d4493935217feaa4fd2fa0b8ef9a62972
Add 'sve2' arch options to the configure, build and unit test files -
adding appropriate conditional options where necessary. Arm SIMD
extensions are treated as supersets in libvpx, so disable SVE2 if
SVE is unavailable.
Change-Id: Icdec2aace357e36fba77c77cd8b70da1e5427fce
This was deprecated in 1.9.5 [1]. It is now enabled by default. For
earlier versions of doxygen this will set the value to false, but I
don't believe we were relying on this functionality.
[1]: https://www.doxygen.nl/manual/changelog.html#log_1_9_5
Change-Id: I75f576d35ca86636761cf70fda0dd0ad37f71d71
The sem_* macros do not behave exactly like the POSIX sem_* functions.
Add the vp8_ prefix to the sem_* macro names to make it clear that they
are not the POSIX sem_* functions. Another reason for adding the vp8_
prefix is that we need to wrap sem_wait() (to handle EINTR) on the Unix
platforms that have real sem_wait() function.
Handle EINTR in the Unix (non-Apple) definition of vp8_sem_wait().
Change-Id: I3df02a30f851d41691a55cf7a84aa2ff054bba9c
Based on a clang-tidy warning:
`no header providing "sem_wait" is directly included`
Though this may not clear it entirely, it's the closest that can be
done given the platform-dependent includes and implementation in
vp8/common/threading.h
Change-Id: I19984f820f3f380e58deef40563a2f0c66187748
set --target to the more modern aarch64-android-gcc and remove an
incorrect comment regarding realtime-only.
Change-Id: I5f6c9de9fcd96a60817e37fc6f6505725ddea6b9
When dot-product and SVE support are disabled the hwcap variable is
currently unused. Fix this by wrapping it in an #ifdef matching the
conditions where it is needed.
Change-Id: I1c2e302d861c6c726b314e374f07d4fafe17ffc7
libvpx's check for conditionally defining __builtin_prefetch is broken,
since clang-cl defines __builtin_prefetch on Win ARM64; in addition, it
supports up to 3 arguments, with the latter 2 being optional. This
causes build breaks when paired with other libraries, like Abseil, which
do perform the conditional test correctly.
The real fix here is to define something like VPX_PREFETCH rather than
trying to #define an implementation-reserved name, which is undefined
behavior.
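One possible shape for such a macro, as a sketch only. The name VPX_PREFETCH comes from the message above; the definition and the `sum_with_prefetch()` helper are assumptions for illustration, not code from libvpx.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical VPX_PREFETCH: use the builtin where the compiler reports it,
 * otherwise fall back to a no-op. No reserved name is ever #defined. */
#if defined(__has_builtin)
#if __has_builtin(__builtin_prefetch)
#define VPX_PREFETCH(p) __builtin_prefetch(p)
#endif
#endif
#if !defined(VPX_PREFETCH) && defined(__GNUC__)
#define VPX_PREFETCH(p) __builtin_prefetch(p) /* GCC without __has_builtin */
#endif
#ifndef VPX_PREFETCH
#define VPX_PREFETCH(p) ((void)(p)) /* no-op fallback */
#endif

/* Sum n ints, hinting the cache a few elements ahead of the read position. */
static int sum_with_prefetch(const int *v, int n) {
  int sum = 0;
  assert(v != NULL || n == 0);
  for (int i = 0; i < n; ++i) {
    if (i + 8 < n) VPX_PREFETCH(&v[i + 8]);
    sum += v[i];
  }
  return sum;
}
```

Because the macro lives in the project's own namespace, it cannot collide with a compiler that already provides `__builtin_prefetch` or with other libraries that test for it.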
Bug: 328105513
Change-Id: Ibe14d9ce34306654bd20e560973f76c3b40036ee
Refactor the transpose_concat_*() helper function used in the Arm Neon
DotProd and I8MM vertical convolution implementations to not use TBL
instructions. Using vzip* to achieve the same outcome (with the same
number of instructions) avoids needing/loading the lookup indices and
also increases performance on little (in-order) Arm Cortex cores.
Change-Id: Iff62a44f8a9bf0ee239d5bb36be8424cab0dbca5
sem_wait() may be interrupted by a signal and fail with EINTR:
https://pubs.opengroup.org/onlinepubs/9699919799/functions/sem_wait.html
Retry the sem_wait() call if it fails with EINTR.
This finishes the fix started in
https://chromium-review.googlesource.com/c/webm/libvpx/+/5299569. As a
speculative fix, that CL fixed only the sem_wait(&cpi->h_event_end_lpf)
calls responsible for bug chromium:324459561. ClusterFuzz verified the
fix, so this CL extends it to the other sem_wait() calls.
Note that sem_wait() calls like the following do not need this fix,
because the while (1) loop retries the sem_wait() call if it fails:
while (1) {
if (vpx_atomic_load_acquire(&cpi->b_multi_threaded) == 0) break;
if (sem_wait(&cpi->h_event_start_lpf) == 0) {
...
}
}
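For the calls that do need the fix, the retry can be sketched as a small wrapper. `demo_sem_wait()` is a hypothetical name; only `sem_wait()` and `EINTR` are taken from POSIX.

```c
#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <stddef.h>

/* Hypothetical wrapper: retry sem_wait() while it is interrupted by a signal.
 * Returns 0 on success, or -1 with errno set for any other failure. */
static int demo_sem_wait(sem_t *sem) {
  int ret;
  assert(sem != NULL);
  do {
    ret = sem_wait(sem);
  } while (ret == -1 && errno == EINTR);
  return ret;
}
```

Failures other than EINTR are still returned to the caller, so a genuinely invalid semaphore does not turn into an infinite loop.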
Bug: chromium:324459561
Change-Id: I0f0612616eee37fb3da68049e49b3e86927b5e24
We already have some logic in the configure.sh file to selectively
disable code dependent on particular architecture extensions, however we
do not yet have anything to check that the compiler being supplied
recognises and can compile code using these extensions.
This commit adds compiler "-march=..." flag tests to the existing
extension-disable loop so that we now correctly disable extensions that
are not supported by the compiler. For AArch64 this loop also needs to
move below the existing compiler/OS handling to ensure that prefixes
like $CROSS are handled correctly before running compiler tests.
Bug: webm:1841
Change-Id: I936b911c4b0ebf03abc34b7532b2bb4568129f57
(cherry picked from commit fa50b26848)
Disable SVE feature if arm_neon_sve_bridge header is not supported
by the compiler.
Change-Id: I3f78be2dd95b37b8d51b9f1fceca1f9701535eca
(cherry picked from commit 6ea3b51ec2)
Added unit test which triggers the data race in the
bug below, when only C code is forced.
The data race is between the loopfilter and variance
computation from generate_psnr_packet calculation.
Proposed fix is to move the wait for loopfilter thread to
finish up before entering generate_psnr_packet().
Bug: b/266833179.
Change-Id: Id2871c53274be0f404e65601c9a5c98aaead0c72
(cherry picked from commit 756b29a776)
We already have some logic in the configure.sh file to selectively
disable code dependent on particular architecture extensions, however we
do not yet have anything to check that the compiler being supplied
recognises and can compile code using these extensions.
This commit adds compiler "-march=..." flag tests to the existing
extension-disable loop so that we now correctly disable extensions that
are not supported by the compiler. For AArch64 this loop also needs to
move below the existing compiler/OS handling to ensure that prefixes
like $CROSS are handled correctly before running compiler tests.
Bug: webm:1841
Change-Id: I936b911c4b0ebf03abc34b7532b2bb4568129f57
Add SVE implementation for vpx_highbd_convolve8_horiz that specialises
for 4-tap filters. This way we avoid a lot of redundant work to
multiply and add zero, given that some of the 8-tap filters are
zero-padded, so they are effectively 4-tap filters.
Change-Id: Ib5e0377f924df1d893e9436f443fcbe7d196ea27
Rename dot_neon_sve_bridge.h to vpx_neon_sve_bridge.h in order to
reflect that other instructions can be implemented in the header
file. In a subsequent patch, the usage of vtbl with Neon-SVE bridge
intrinsics will be added.
Change-Id: I8f71aad2b7fb4932c9554badf041a80aca58c7cf
Remove the 4-tap Neon DotProd path for the horizontal pass of 2D
convolution since it has been made redundant by the horizontal-
vertical merged implementation. Also move the 8-tap path closer to
where it is used and call it explicitly rather than the filter-
agnostic wrapper.
Change-Id: I1861dc88a67a759c3e8deb0b471ec447a62063f2
The current SBD Neon DotProd approach to 2D convolution is:
1) Filter horizontally, storing to an intermediate buffer.
2) Filter vertically and store the final output.
This patch merges the two phases for 4-tap standard bitdepth 2D
convolution to avoid storing to and re-loading from the intermediate
buffer - giving a 10-25% speedup depending on block size. Merging the
passes for 8-tap filters does not have the same benefit, so keep the
existing implementation.
Change-Id: Ic6008836d1a499ee2cd957b9db194fca5671ccb4
Remove the 4-tap Neon i8mm path for the horizontal pass of 2D
convolution since it has been made redundant by the horizontal-
vertical merged implementation. Also move the 8-tap path closer to
where it is used and call it explicitly rather than the filter-
agnostic wrapper.
Change-Id: Icddecb7e133656c54aa5e79536b49759715b6fcb
The current SBD Neon i8mm approach to 2D convolution is:
1) Filter horizontally, storing to an intermediate buffer.
2) Filter vertically and store the final output.
This patch merges the two phases for 4-tap standard bitdepth 2D
convolution to avoid storing to and re-loading from the intermediate
buffer - giving a 5-40% speedup depending on block size. Merging the
passes for 8-tap filters does not have the same benefit, so keep the
existing implementation.
Change-Id: Ic8ec2822681176ef879dcaf8424d8d91c5e8d2df
With either CONFIG_VP8=0 or CONFIG_VP9=0. Fixes a warning about an extra
';' outside of a function due to VP[89]_INSTANTIATE_TEST_SUITE() being
defined to nothing.
Change-Id: I1878d7596e39c5166efbe96450a733efc08665ea
inter/intra_cost in VP9 TPL is calculated with SATD,
which should be close enough to be used as inter/intra_pred_err.
Bug: b/326262148
Change-Id: Ic0fd08708fcf3640398fc22a1a6bb6f449b2a9b8
Anonymous unions are not supported in C99; they were added in C11:
https://en.cppreference.com/w/c/language/union
Fixes -Wpedantic warning:
vp9/encoder/vp9_context_tree.h:93:4: warning: ISO C99 doesn’t support
unnamed structs/unions [-Wpedantic]
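A sketch of the C99-compatible form, using a hypothetical `demo_value` struct rather than the actual type in vp9_context_tree.h: giving the union member a name removes the dependence on C11.

```c
#include <assert.h>
#include <stddef.h>

/* C11 only:  struct { int kind; union { int i; float f; }; };
 * C99-compatible form: the union becomes a named member. */
struct demo_value {
  int kind; /* 0: u.i is valid, 1: u.f is valid */
  union {
    int i;
    float f;
  } u; /* accessed as v.u.i or v.u.f */
};

static int demo_get_int(const struct demo_value *v) {
  assert(v != NULL && v->kind == 0);
  return v->u.i;
}
```

The cost is one extra path component at every access site (`v.u.i` instead of `v.i`).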
Change-Id: Ibd29d6deca35d81ea886e80e9f44575c73ecd96d
Fixes a -Wpedantic warning:
vp9/encoder/vp9_rdopt.c:1988:20: warning: invalid use of pointers to
arrays with different qualifiers in ISO C before C2X [-Wpedantic]
Change-Id: I581e21d7e59c0bae0e44056a3b3f049c5a4e7cf2
Add SVE implementation of vpx_highbd_convolve8_horiz function. Add
the corresponding tests as well.
Change-Id: I0b2815831daf203e167ea5289307087ce53ff9da
The new Armv8.0 Neon implementation of 4-tap vertical convolution is
faster than Armv8.4 DotProd and Armv8.6 I8MM implementations. This
patch removes the DotProd and I8MM implementations in favour of using
the Armv8.0 version everywhere.
Change-Id: I126470fd4862d8bb116153e90bb2e4f2f2dba1e4
Refactor Armv8.0 Neon 4-tap convolution functions to operate on 8-bit
types directly, rather than first widening to 16-bit.
2-tap (bilinear) filter values are always positive, but 4-tap filter
values are negative on the outer edges (taps 0 and 3), with taps 1
and 2 having much greater positive values to compensate. To use
instructions that operate on 8-bit types we also need the types to be
unsigned. In the convolution kernel, subtracting the products of taps
0 and 3 from the products of taps 1 and 2 always works since 2-tap
filters are 0-padded.
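A scalar sketch of the sign handling described above, assuming taps 0 and 3 are never positive and taps 1 and 2 are never negative. The function names are hypothetical and the real kernels use Neon intrinsics rather than this scalar form.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Reference: signed 4-tap filter on 8-bit samples. */
static int filter4_signed(const uint8_t s[4], const int8_t f[4]) {
  return s[0] * f[0] + s[1] * f[1] + s[2] * f[2] + s[3] * f[3];
}

/* Same result using only unsigned tap magnitudes: add the inner products and
 * subtract the outer ones. Assumes f[0], f[3] <= 0 and f[1], f[2] >= 0. */
static int filter4_unsigned(const uint8_t s[4], const int8_t f[4]) {
  const unsigned t0 = (unsigned)abs(f[0]), t1 = (unsigned)abs(f[1]);
  const unsigned t2 = (unsigned)abs(f[2]), t3 = (unsigned)abs(f[3]);
  assert(f[0] <= 0 && f[3] <= 0 && f[1] >= 0 && f[2] >= 0);
  return (int)(s[1] * t1 + s[2] * t2) - (int)(s[0] * t0 + s[3] * t3);
}
```

A bilinear filter zero-padded to 4 taps has zeros in positions 0 and 3, so the subtraction removes nothing and the same kernel serves both cases.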
Co-authored by: Hari Limaye <hari.limaye@arm.com>
Change-Id: I87b32e2ef8cbd21eebb8cd2642e8826b704905b1
The THREADFN and THREAD_EXIT_SUCCESS macros are used to define the
thread start routines passed to our implementation of pthread_create(),
so it makes sense to define these macros in vpx_util/vpx_pthread.h. This
also allows the VP8 and VP9 code to share the macro definitions.
Replace the THREAD_FUNCTION macro by THREADFN. They have the same
definition.
Change-Id: I79a7476e43652667af6a8da7ad7ce346b1b6b024
This helps prevent name clashes if code e.g. #includes headers from both
libvpx and libaom.
Bug: none
Change-Id: Ifc9e7ac4862dc04a399e7777d2636e1453627970
Currently we use two rounds of complex right-shift operations to
narrow and pack results from the dot-product convolution kernels.
This patch refactors these sequences to use one "simple" right-shift
and one complex right-shift - reducing the latency by 4 cycles on
modern out-of-order Arm CPUs.
Change-Id: I3fd38560bb14d85826e417f40d35f11165ab80da
Currently we use two rounds of complex right-shift operations to
narrow and pack results from the dot-product convolution kernels.
This patch refactors these sequences to use one "simple" right-shift
and one complex right-shift - reducing the latency by 4 cycles on
modern out-of-order Arm CPUs.
Change-Id: I908147ed65a87157009363782399ff398406cdf9
- Initialize gop_decision
- Initialize GF group for a new one
- GF group index for key frame special treatment is not needed any more
when key frame is decided by the RC
Bug: b/323050877
Change-Id: Iaf36ea4f671b833f3ba4c524b9799a3093412dfa
The current Neon approach to 2D convolution is:
1) Filter horizontally, storing to an intermediate buffer.
2) Filter vertically, average with the dst block and store the final
output.
This patch merges the two phases for high bitdepth 2D convolution to
avoid the storing and re-loading from the intermediate buffer. This
provides a small gain (<5%) for large block sizes but the benefit
increases for small block sizes - as the proportion of compute to
memory access decreases. These effects are amplified further when
considering little (in-order) core performance.
Change-Id: I84f1cafcfbbfa48b2cfe4b20881da9c4bc3b56ac
The current Neon approach to 2D convolution is:
1) Filter horizontally, storing to an intermediate buffer.
2) Filter vertically and store the final output.
This patch merges the two phases for high bitdepth 2D convolution to
avoid the storing and re-loading from the intermediate buffer. This
provides a small gain (<5%) for large block sizes but the benefit
increases for small block sizes - as the proportion of compute to
memory access decreases. These effects are amplified further when
considering little (in-order) core performance.
Change-Id: I8ec13fb9edd642fdb927bf5394a3c2a349d22a29
Add a highbd Neon implementation of the horizontal portion of 2D
convolution specialised for executing with 4-tap filters. This new
path is also used when executing with bilinear (2-tap) filters.
Change-Id: I513e35c4f8857bc89e0def5e9402bc31ddd46440
Add a highbd Neon implementation of vertical convolution specialised
for executing with 4-tap filters. This new path is also used when
executing with bilinear (2-tap) filters.
Change-Id: I30469c7b8e6ccff31d96588a3e4c21b401f1ed09
Add a highbd Neon implementation of horizontal convolution specialised
for executing with 4-tap filters. This new path is also used when
executing with bilinear (2-tap) filters.
Change-Id: Icabeea295af3e0bbeda755168996668cb960b0de
Filter tap reporting was made more granular recently[1] to enable Arm
Neon optimizations that specialise convolution implementations
according to the filter size. This patch removes an assert that
should have been removed during that change - it no longer serves any
purpose to assert that the filter being used is a no-op filter.
This change is a pre-requisite for some highbd Neon convolution
changes that specialise implementations according to filter size.
(Without this change a convolve-copy test would fail should we
interrogate the size of the filter.)
[1] https://chromium-review.googlesource.com/c/webm/libvpx/+/5063929
Change-Id: I2a71680d27134535e6c0663b1668ba1b150b1a6f
2D 8-tap convolution filtering is performed in two passes -
horizontal and vertical. The horizontal pass must produce enough
input data for the subsequent vertical pass - 3 rows above and 4 rows
below, in addition to the actual block height.
At present, all highbd Neon horizontal convolution algorithms process
4 rows at a time, but this means we end up doing at least 1 row too
much work in the 2D first pass case where we need h + 7, not h + 8
rows of output.
This patch adds an additional Neon path that processes h + 7 rows of
data exactly, saving the work of the unnecessary extra row.
Change-Id: Id6658b4e9e774effc760ff131e188b6907a57676
Call scalar C implementation of 2D convolution immediately if scaling
is required - instead of entering the Neon functions for the
horizontal and vertical passes and then falling back to the scalar
implementation. This has the benefit of being able to allocate a
smaller intermediate buffer.
Change-Id: Icacdd5f3a1401395951b613da1cd6932955bd0f8
There's no reason for these files to be separate, and merging them
will make life easier in subsequent commits adding a horizontal pass
specialised for the first pass of 2D.
Also perform some refactoring for 2D convolution definitions:
- Add a comment deriving the intermediate buffer height.
- Align the intermediate buffers to 32 bytes.
Change-Id: Ib92524396e6f9c58295339de54d08d894ace3bd1
Mostly a cosmetic change:
1) Remove forward declarations.
2) Remove excessive prefetches - some of which were wrong, prefetching
data that had just been loaded.
Change-Id: I17d8accc2abf3a9b2050603f859fce588a1f7178
CONFIG_PROFILE is unused currently. The option can still be selected
because it is in the CMDLINE_SELECT list and interpreted by configure
directly.
Bug: webm:1835
Change-Id: Id9667289113335a10018803f578b255967bd60b1
Move narrowing shift and max value clipping into the 4-pixel-output
kernel. As well as cleaning up the code quite a bit, this also
improves performance by 5-10% as it eliminates the implied top /
bottom register shuffling of the previous approach.
Also clean up the formatting and magic numbers in the 8-pixel-output
kernel.
Change-Id: I77a5e9e317ef4097f187330d4b32973022ba573f
In https://chromium-review.googlesource.com/c/webm/libvpx/+/71356, the
statement
clamp(q, active_best_quality, active_worst_quality);
was added to rc_pick_q_and_bounds_two_pass() (recently renamed
vp9_rc_pick_q_and_bounds_two_pass()).
The result of the clamp() call is not used, so the clamp() call has no
side effect.
Fix Coverity CID 1577645 Useless call:
side_effect_free: Calling
clamp(q, active_best_quality, active_worst_quality) is only useful for
its return value, which is ignored.
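A minimal sketch of the defect, using a local `clamp()` written in the usual style and two hypothetical `pick_q_*` helpers. The second helper is shown only for contrast.

```c
#include <assert.h>

/* Pure function: returns value bounded to [low, high]. */
static int clamp(int value, int low, int high) {
  return value < low ? low : (value > high ? high : value);
}

static int pick_q_ignoring_result(int q, int best, int worst) {
  assert(best <= worst);
  clamp(q, best, worst); /* no effect on q: the return value is discarded */
  return q;
}

static int pick_q_using_result(int q, int best, int worst) {
  assert(best <= worst);
  q = clamp(q, best, worst); /* q is now within [best, worst] */
  return q;
}
```

Since `clamp()` takes its arguments by value, the only way it can influence the caller is through its return value.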
Change-Id: I014c3e4caf2bc999fe480000acc4e49e7ad15aaf
Various bits of tidying up to make the code more compact:
- Use appropriate load/store helper functions from mem_neon.h.
- Remove variable forward declarations.
- Use != 0 instead of > 0 in loop termination tests.
- Remove excessive prefetches.
Change-Id: I114cf4d2a34f02acc130558d125d2c191c6c5992
Various bits of tidying up to make the code more compact:
- Use/create appropriate mem_neon.h load/store helper functions.
- Remove variable forward declarations.
- Use != 0 instead of > 0 in loop termination tests.
- Remove excessive prefetches.
Change-Id: Ida7d3c4a3fe084600417f196baa26501c6e2d45a
Initialise result vectors of mem_neon.h helpers with vdup_n_<type>(0)
instead of load-broadcast of the first loaded elements. The former is
more easily optimized by modern compilers.
Change-Id: If967e2bb55523670c3e433dd66d060665e13b4f2
Align the intermediate buffers to 32 bytes and always use a stride of
64, regardless of the actual data block width.
Change-Id: I738eaa711168bc8231d8ac54d9e5e5e87b62e703
Add rdmult to the frame decision as RC can return this information, and
we may want to use it in the future.
Bug: b/323234722
Change-Id: I8ddb7038073d89af1ef84932448b1abaf1937cee
Use uv_crop_(width|height). This fixes an issue with 1 to 2 scaling from
1x1 where the unrounded value would go to zero, resulting in a heap
overflow. This path is only executed when the library is built without
--enable-vp9-highbitdepth.
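The rounding problem can be sketched with two hypothetical helpers, assuming 4:2:0 subsampling (shift of 1) and that the cropped chroma size is the rounded-up value.

```c
#include <assert.h>

/* Chroma dimension for luma dimension `d` with subsampling shift `ss`
 * (ss = 1 for 4:2:0), truncating. A 1-pixel luma dimension yields 0. */
static int chroma_dim_truncated(int d, int ss) {
  assert(d > 0 && (ss == 0 || ss == 1));
  return d >> ss;
}

/* Rounded-up variant, which never drops to 0 for d >= 1. */
static int chroma_dim_rounded(int d, int ss) {
  assert(d > 0 && (ss == 0 || ss == 1));
  return (d + ss) >> ss;
}
```

For a 1x1 frame the truncated form gives a 0x0 chroma plane while the rounded form gives 1x1, which is the difference that matters when sizing the scaled output.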
Bug: b:319964497
Change-Id: I9cb6632f864ec54c045608af86aede20657d6253
(cherry picked from commit 7ad5f4f695)
Observed when built using Visual Studio 2019.
Move 720P image allocation to the heap.
Bug: webm:1831
Change-Id: I4e343af08d2f282618ad1b328a39d7dba5e79654
(cherry picked from commit 43e1c8bf10)
This can happen in the setting of the frame
target size for delta frames, for non-CBR mode
(end_usage != USAGE_STREAM_FROM_SERVER) and with
temporal layers.
In calc_pframe_target_size(): the percent_high
(factor to adjust the target_size) may end up dividing
bits_off_target by total_byte_count. The total_byte_count
is defined per layer for temporal layers, so it will be zero
for delta frames if the enhancement layer has never been
encoded before.
Since percent_high is capped to over_shoot_pct, the proposed
fix is to apply this cap if total_byte_count is zero.
Also this CL fixes a few integer overflow issues in setting
the layer target_bandwidth, the rescale function, and in
setting target_bits_per_mb.
A unit test that triggers this issue was added by Wan-Teh.
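A minimal sketch of the guard, assuming the adjustment is a percentage capped at over_shoot_pct. `demo_percent_high()` and its scaling are illustrative only and do not reproduce the expression in calc_pframe_target_size().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: percentage adjustment derived from a ratio, capped at
 * over_shoot_pct. When total_byte_count is 0 the cap is used directly, which
 * avoids the division by zero. */
static int demo_percent_high(int64_t bits_off_target, int64_t total_byte_count,
                             int over_shoot_pct) {
  int64_t pct;
  assert(over_shoot_pct >= 0);
  if (total_byte_count <= 0) return over_shoot_pct;
  pct = bits_off_target * 100 / total_byte_count;
  if (pct > over_shoot_pct) pct = over_shoot_pct;
  if (pct < 0) pct = 0;
  return (int)pct;
}
```

Doing the arithmetic in int64_t also keeps the intermediate product from overflowing a 32-bit int for large bit counts.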
Bug: chromium:1514684
Change-Id: I091158e720ece75d7ab9b7c4d18d30a5783102ab
(cherry picked from commit 43bd567950)
Equivalent to the change to av1_change_config() in the libaom CL
https://aomedia-review.googlesource.com/c/aom/+/182413.
Because we call alloc_compressor_data() only if
cm->mi_alloc_size < new_mi_size, this change won't cause
alloc_compressor_data() to be called unnecessarily, unlike the libaom
bug https://crbug.com/aomedia/3526.
Bug: b:317105128
Change-Id: I8a772a1d5c4766846641a6d541a6d861bf76c60f
(cherry picked from commit aef73b22cb)
This change was intended to be cosmetic in that it tweaks some
comments, removes forward declarations and moves some constant
declarations into the kernels where they're used. However, it also
adds some performance for 8-tap vertical convolution paths as it
appears removing forward declarations also removes some false loop-
carried dependencies that the compiler wasn't able to figure out.
Change-Id: Ic58658b10fbe8378062920199819359d2df008de
The updated test will validate the QP / frame type / ARF settings by the
rate controller and callbacks, making sure the callbacks are working as
expected.
Removed the old tests that verify the signals from the encoder, which
are not needed any more.
Change-Id: Ida3c484e2ac520f3e81358d7cbf7918abfdaca54
Disable some tests because they rely on vpx_rc_gop_info_t
which isn't populated when the callback is used for key frame.
This parameter will be deleted / cleaned up in the follow-up.
Bug: b/323050877
Change-Id: If1c0476eac8d324c8d5a460bfc9afdb6d93aacdf
Use uv_crop_(width|height). This fixes an issue with 1 to 2 scaling from
1x1 where the unrounded value would go to zero, resulting in a heap
overflow. This path is only executed when the library is built without
--enable-vp9-highbitdepth.
Bug: b:319964497
Change-Id: I9cb6632f864ec54c045608af86aede20657d6253
Simplify the computation of the Armv8.4 DotProd convolution
correction constant. Summing 128 * filter_tap[0,7] is always the same
as 128 * 128 since the filter taps always sum to 128.
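The identity can be checked with a short sketch. `correction_from_taps()` is a hypothetical helper that computes the constant the long way, for any 8 taps that sum to 128.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Correction term computed the long way: 128 * sum(filter taps). */
static int correction_from_taps(const int16_t taps[8]) {
  int sum = 0;
  assert(taps != NULL);
  for (int i = 0; i < 8; ++i) sum += taps[i];
  return 128 * sum;
}
```

Whenever the taps sum to 128 the result is 128 * 128 = 16384, so the per-filter summation can be replaced by a compile-time constant.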
Change-Id: I227ba47ae47bed8304a695a2395bcc85f33c245c
Move the convolution kernels using Armv8.4 dotprod and Armv8.6 i8mm
instructions into the respective .c files. These kernels are only used
in the respective .c files so it isn't useful for them to be declared
in a header.
This change also removes the need for feature-macro guarding - which
wasn't being done correctly for MSVC (since Microsoft's Arm
architecture feature macros are named differently to those defined by
GNU-compliant compilers.)
Bug: webm:1838
Change-Id: I495fca2a982c34978b6c9102f144bb9c45352a9a
Move the Arm Neon dotprod and i8mm 2D convolution functions into the
appropriate vpx_convolve8_neon_[dotprod|i8mm].c file. Only the
Armv7/Armv8.0 Neon files needed to be split in this way to allow
linking against a handwritten assembly implementation of the kernels
for Armv7 builds.
Change-Id: Ifc363556c3961aa78b9e53761537d4816c5b9964
This is one commit after the libwebm-1.0.0.31 tag:
affd7f4 In MakeUID(), call rand() under #ifdef _WIN32
Change-Id: I5979a8cd3b064d4f4f0dbeca9f84f6791e593b47
Call indirect RTCD high bitdepth variance functions (instead of the
Neon functions) in the high bitdepth Neon subpel variance paths so
that faster SVE variance functions can be used on CPUs where SVE is
supported.
Change-Id: I04bdef235afac06f2100df0cbaccfb8caef41ac7
Add SVE implementation of get<w>x<h>var functions for 8-, 10-, 12- bit
depth. Add the corresponding tests as well.
Change-Id: Id4feb8726a3eb0a963e3dd8932ee52374a67da48
Add standard and high bitdepth unit tests for vpx_get<w>x<h>var
functions. Enable these unit tests for the C implementation.
Change-Id: I8716fd6a9718dab3eef218a8a60a1efd4c0e316c
Fix Coverity defects CID 1568604 and CID 1568615 (Uninitialized pointer
field). Since the constructors are private and the Create() factory
methods set the cpi_ pointer field, these two Coverity defects are
harmless.
Define the constructors with "= default" instead of "{}".
Change-Id: Ie6b45fce66c23941a9a5c38ee0bccbc4b7d3a2a2
Add SVE implementation of variance functions for 8-, 10-, 12- bit
depth. Add the corresponding tests as well.
Change-Id: I785d85760ad4346cbfbf0f842784b4945870afee
Observed when built using Visual Studio 2019.
Move 720P image allocation to the heap.
Bug: webm:1831
Change-Id: I4e343af08d2f282618ad1b328a39d7dba5e79654
read_yuv_frame() supports VPX_IMG_FMT_NV12. Port its code to
vpx_img_read() and vpx_img_write().
The code in vp9/simple_encode.cc, including img_read(), doesn't support
VPX_IMG_FMT_NV12. Check before the vpx_img_alloc() calls and abort the
process if the image format is VPX_IMG_FMT_NV12.
Bug: chromium:1510090
Change-Id: Ie77e29c2c9ee7a01e6a59c8ad3cbcc769d9f2d4c
If fmt is VPX_IMG_FMT_NONE, currently img_alloc_helper() allocates a
single plane because VPX_IMG_FMT_NONE (0) is not a planar format (the
VPX_IMG_FMT_PLANAR bit is not set in VPX_IMG_FMT_NONE).
Although this seems correct, the problem is that most of the code in
libvpx assumes planar formats and is likely to dereference a null
pointer when it uses img->planes[1]. Also, VPX_IMG_FMT_NONE isn't really
a valid image format. So it is safer to make img_alloc_helper() fail if
fmt is VPX_IMG_FMT_NONE.
Change-Id: I05b47f4b5eceb631a02384b2cce1c2f6fdca8673
This often falls out of sync with the release and the version is already
contained in CHANGELOG.
Bug: webm:1833
Change-Id: Ieee6ca40249bf6e77037fbec30d87b109ca8fe21
Release v1.14.0 Venetian Duck
2024-01-18 v1.14.0 "Venetian Duck"
This release drops support for old C compilers, such as Visual Studio 2012
and older, that disallow mixing variable declarations and statements (a C99
feature). It adds support for run-time CPU feature detection for Arm
platforms, as well as support for darwin23 (macOS 14).
- Upgrading:
This release is ABI incompatible with the previous release.
Various new features for rate control library for real-time: SVC parallel
encoding, loopfilter level, support for frame dropping, and screen content.
New callback function send_tpl_gop_stats for vp9 external rate control
library, which can be used to transmit TPL stats for a group of pictures. A
public header vpx_tpl.h is added for the definition of TPL stats used in
this callback.
libwebm is upgraded to libwebm-1.0.0.29-9-g1930e3c.
- Enhancement:
Improvements on Neon optimizations: VoD: 12-35% speed up for bitdepth 8,
68%-151% speed up for high bitdepth.
Improvements on AVX2 and SSE optimizations.
Improvements on LSX optimizations for LoongArch.
42-49% speedup on speed 0 VoD encoding.
Android API level predicates.
- Bug fixes:
Fix to missing prototypes from the rtcd header.
Fix to segfault when total size is enlarged but width is smaller.
Fix to the build for arm64ec using MSVC.
Fix to copy BLOCK_8X8's mi to PICK_MODE_CONTEXT::mic.
Fix to -Wshadow warnings.
Fix to heap overflow in vpx_get4x4sse_cs_neon.
Fix to buffer overrun in highbd Neon subpel variance filters.
Added bitexact encode test script.
Fix to -Wl,-z,defs with Clang's sanitizers.
Fix to decoder stability after error & continued decoding.
Fix to mismatch of VP9 encode with NEON intrinsics with C only version.
Fix to Arm64 MSVC compile vpx_highbd_fdct4x4_neon.
Fix to fragments count before use.
Fix to a case where target bandwidth is 0 for SVC.
Fix mask in vp9_quantize_avx2,highbd_get_max_lane_eob.
Fix to int overflow in vp9_calc_pframe_target_size_one_pass_cbr.
Fix to integer overflow in vp8,ratectrl.c.
Fix to integer overflow in vp9 svc.
Fix to avg_frame_bandwidth overflow.
Fix to per frame qp for temporal layers.
Fix to unsigned integer overflow in sse computation.
Fix to uninitialized mesh feature for BEST mode.
Fix to overflow in highbd temporal_filter.
Fix to unaligned loads w/w==4 in vpx_convolve_copy_neon.
Skip arm64_neon.h workaround w/VS >= 2019.
Fix to c vs avx mismatch of diamond_search_sad().
Fix to c vs intrinsic mismatch of vpx_hadamard_32x32() function.
Fix to a bug in vpx_hadamard_32x32_neon().
Fix to Clang -Wunreachable-code-aggressive warnings.
Fix to a bug in vpx_highbd_hadamard_32x32_neon().
Fix to -Wunreachable-code in mfqe_partition.
Force mode search on 64x64 if no mode is selected.
Fix to ubsan failure caused by left shift of negative.
Fix to integer overflow in calc_pframe_target_size.
Fix to float-cast-overflow in vp8_change_config().
Fix to a null ptr before use.
Conditionally skip using inter frames in speed features.
Remove invalid reference frames.
Disable intra mode search speed features conditionally.
Set nonrd keyframe under dynamic change of deadline for rtc.
Fix to scaled reference offsets.
Set skip_recode=0 in nonrd_pick_sb_modes.
Fix to an edge case when downsizing to one.
Fix to a bug in frame scaling.
Fix to pred buffer stride.
Fix to a bug in simple motion search.
Update frame size in actual encoding.
Change-Id: I9c27fb2b917f9b80ed4bcc5cb3b4f87c56b62c2f
Add SVE implementation of MSE functions for 10- and 12-bit depth. Add
the corresponding tests as well.
An implementation was not added for 8 bit depth as the Neon DotProd
version is faster than the SVE implementation.
Change-Id: I0c5712ba2735a2879a0aa3a9a52980032fddc7a6
Enable Neon Dotprod 8-bit high bitdepth implementation for MSE
function as it is now not called with bit depth 10 or 12.
Bug: webm:1819
Change-Id: I9d1d506401aa0523fba2d8ea4978dc00fdacbb95
Instead of always calling highbd_get_block_variance_fn with bit depth
8 use the macroblock's bit depth.
Bug: webm:1819
Change-Id: Ib4b19703384e897ee9ffeef73a11a8af2d262558
For svc with no inter-layer prediction: reset
the RC and force max_qp on all spatial layers
on scene/slide changes. In the current code it was only
reset on current spatial layer because it was assumed
we can predict off lower spatial layer to avoid
prediction across scene change. But this does not apply
when inter-layer prediction is off on delta frames.
Also reset only up to current temporal layer.
Because of the hierarchical prediction structure
only the lower temporal layers need the RC to be reset.
This helps to reduce excessive frame drops for the
full_superframe_drop mode.
Change-Id: I76925681850b82aa7fff7f9b1c1a0a605cf3cf3b
for VPX_CODEC_USE_PSNR. This clears a clang-tidy warning. vpx_encoder.h
exports vpx_codec.h so it shouldn't be necessary.
Change-Id: I863b6f8689eeef59cd9eadf3cdc177247a0653f8
This can happen in the setting of the frame
target size for delta frames, for non-CBR mode
(end_usage != USAGE_STREAM_FROM_SERVER) and with
temporal layers.
In calc_pframe_target_size(): the percent_high
(factor to adjust the target_size) may end up dividing
bits_off_target by total_byte_count. The total_byte_count
is defined per layer for temporal layers, so it will be zero
for delta frames if the enhancement layer has never been
encoded before.
Since percent_high is capped to over_shoot_pct, the proposed
fix is to apply this cap if total_byte_count is zero.
Also this CL fixes a few integer overflow issues in setting
the layer target_bandwidth, the rescale function, and in
setting target_bits_per_mb.
A unit test added by Wan-Teh triggers this issue.
Bug: chromium:1514684
Change-Id: I091158e720ece75d7ab9b7c4d18d30a5783102ab
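The zero-divisor guard described above can be sketched as follows. This is a minimal illustration, not the code from calc_pframe_target_size(): the function name capped_percent_high() is hypothetical, and the plain percentage used here is an assumption because the text does not give the encoder's exact scaling.

```c
#include <assert.h>

/* Hypothetical helper: returns the adjustment factor, capped to
 * over_shoot_pct. When total_byte_count is zero (the layer has never
 * been encoded), the cap is applied directly instead of dividing. */
static int capped_percent_high(long long bits_off_target,
                               long long total_byte_count,
                               int over_shoot_pct) {
  long long percent_high;
  assert(over_shoot_pct >= 0);
  if (total_byte_count <= 0) return over_shoot_pct;
  /* Assumed scaling: a simple percentage of the bytes produced so far. */
  percent_high = (100 * bits_off_target) / total_byte_count;
  if (percent_high > over_shoot_pct) return over_shoot_pct;
  if (percent_high < 0) return 0;
  return (int)percent_high;
}
```

Returning the cap for a zero divisor keeps the result inside the range the non-zero path can already produce.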
Add header file containing helper functions to make use of SVE
dot-product intrinsics via the Neon-SVE bridge.
Change-Id: I6cd198f8202559672817cbc19f890db35c03d3ff
GCC already does not allow implicit vector type conversions by default,
add -flax-vector-conversions=none to Clang builds to have the same
behavior.
Change-Id: I9d1adb836377077cf48818c80fe71025e2d2bdc7
Added a unit test which triggers the data race in the
bug below, when only C code is forced.
The data race is between the loopfilter and variance
computation from generate_psnr_packet calculation.
Proposed fix is to move the wait for loopfilter thread to
finish up before entering generate_psnr_packet().
Bug: b/266833179.
Change-Id: Id2871c53274be0f404e65601c9a5c98aaead0c72
Equivalent to the change to av1_change_config() in the libaom CL
https://aomedia-review.googlesource.com/c/aom/+/182413.
Because we call alloc_compressor_data() only if
cm->mi_alloc_size < new_mi_size, this change won't cause
alloc_compressor_data() to be called unnecessarily, unlike the libaom
bug https://crbug.com/aomedia/3526.
Bug: b:317105128
Change-Id: I8a772a1d5c4766846641a6d541a6d861bf76c60f
The VpxTpl* structs defined in vpx_tpl.h are only used by the external
rate control library. Add a VPX_TPL_ABI_VERSION component to
VPX_EXT_RATECTRL_ABI_VERSION and remove the VPX_TPL_ABI_VERSION
component from VPX_ENCODER_ABI_VERSION.
The current value of VPX_TPL_ABI_VERSION is 2. It is subtracted from
VPX_EXT_RATECTRL_ABI_VERSION and added to VPX_ENCODER_ABI_VERSION so
that the values of those two macros stay the same.
Add a note to explain why VPX_ENCODER_ABI_VERSION has a
VPX_EXT_RATECTRL_ABI_VERSION component.
Change-Id: I680b8522dc04328cd51df6de590fdec75ca88ae8
Commit db83435 introduced support for configuring for *-darwin23-gcc.
However configuring for *-darwin23-gcc does not currently add the
`-arch` flag to CFLAGS/LDFLAGS, so correct this here.
Change-Id: Ieeda1a5039ad40590dfcdcc6ba615a1d1697d54d
Before release:
c-a=8, a=0, r=1 -> c=8, a=0, r=1
After release:
- If the library source code has changed at all since the last
update, then increment revision:
c=8, a=0, r=r+1=2
- If any interfaces have been added, removed, or changed since
the last update, increment current, and set revision to 0:
c=c+1=9, a=0, r=0
- If any interfaces have been added since the last public release,
then increment age:
c=9, a=a+1=1, r=0
- If any interfaces have been removed or changed since the last
public release, then set age to 0:
c=9, a=0, r=0 (VpxTpl* structure changes)
MAJOR=c-a=9
MINOR=a=0
PATCH=r=0
Bug: webm:1833
Change-Id: Id24c9a0ff415a6f625d17b6098cdd0baf27432e3
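The update rules listed above can be replayed in a few lines of C. This is a sketch for illustration only: LibtoolVersion, update_version(), and the flag parameters are hypothetical names, not part of libvpx or libtool.

```c
#include <assert.h>

/* Hypothetical container for the libtool-style triple. */
typedef struct {
  int current;
  int age;
  int revision;
} LibtoolVersion;

/* Applies the four update rules in the order listed in the commit
 * message. Each flag is nonzero when the corresponding condition holds. */
static LibtoolVersion update_version(LibtoolVersion v, int source_changed,
                                     int interfaces_changed,
                                     int interfaces_added,
                                     int interfaces_removed_or_changed) {
  if (source_changed) v.revision += 1;
  if (interfaces_changed) {
    v.current += 1;
    v.revision = 0;
  }
  if (interfaces_added) v.age += 1;
  if (interfaces_removed_or_changed) v.age = 0;
  return v;
}

static int version_major(LibtoolVersion v) {
  assert(v.age <= v.current);
  return v.current - v.age;
}
static int version_minor(LibtoolVersion v) { return v.age; }
static int version_patch(LibtoolVersion v) { return v.revision; }

/* Replays the update above: start at c=8, a=0, r=1 with all rules firing. */
static LibtoolVersion example_v1_14_0(void) {
  const LibtoolVersion before = { 8, 0, 1 };
  return update_version(before, 1, 1, 1, 1);
}
```

Because the age rule for removed or changed interfaces runs last, it overrides the increment from added interfaces, which is why the example ends at a=0.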
Change if to assertion in vp9_extrc_get_encodeframe_decision
Clarify comment for VP9E_ENABLE_EXTERNAL_RC_TPL that
rc_type | VPX_RC_QP must be non zero for this control to work.
Change-Id: I2c54cf7eda1f0f12f4ff7ac929e8e6a1fdd2215d
Performance optimization. get_msb utilizes
the compiler/platform specific last significant bit
operator.
Note: 32 bit unsigned assumed, like all get_msb implementations do.
Change-Id: Ib013ad24aa0ea845efeb52aacd448b067edf91da
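A most-significant-bit lookup of this kind might look like the sketch below. It is not the libvpx get_msb() source: the name msb_index() is hypothetical, only the GCC/Clang builtin is shown, and a portable loop stands in for other compilers. A 32-bit unsigned int is assumed, as in the note above.

```c
#include <assert.h>

/* Returns the zero-based index of the highest set bit of n.
 * n must be nonzero; unsigned int is assumed to be 32 bits wide. */
static int msb_index(unsigned int n) {
  assert(n != 0);
#if defined(__GNUC__) || defined(__clang__)
  /* Count leading zeros, then convert to a bit index (31 - clz). */
  return 31 ^ __builtin_clz(n);
#else
  /* Portable fallback: shift right until the value is exhausted. */
  int log = 0;
  while (n >>= 1) ++log;
  return log;
#endif
}
```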
This is never used.
A callback in external rc func was added and used instead.
Change-Id: Iade6f361072f0c28af98904baf457d2f0e9ca904
(cherry picked from commit 41ced868a6)
Commit db83435 introduced support for configuring for *-darwin23-gcc.
However configuring for *-darwin23-gcc does not currently add the
`-arch` flag to CFLAGS/LDFLAGS, so correct this here.
Change-Id: Ieeda1a5039ad40590dfcdcc6ba615a1d1697d54d
Explain why the encoder init functions cannot call update_error_state().
In vp8/vp8_cx_iface.c, this comment should have been added in
https://chromium-review.googlesource.com/c/webm/libvpx/+/4506609.
Rewrite update_error_state() in vp8/vp8_cx_iface.c to look like the
versions in vp9/vp9_cx_iface.c and av1/av1_cx_iface.c (in libaom).
Change-Id: I3f153d67b8c549ca5ac8ea0cfbcaad4ae705c8e6
After a longjmp() call in vp8e_encode(), call update_error_state() so
that we return the error code and error detail set by the
vpx_internal_error() call.
Change-Id: I1f2428eb1b1f61e46c02604e16a5d44dcf162479
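The pattern described above, recording the error state after a longjmp() returns control to the encode entry point, can be sketched in isolation. All names here (ErrorInfo, CodecContext, report_internal_error(), record_error_state(), encode_once()) are hypothetical stand-ins, and the error code 7 is an arbitrary value for the example.

```c
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical error record filled in before the non-local jump. */
typedef struct {
  int error_code;
  const char *detail;
  int setjmp_active;
  jmp_buf jmp;
} ErrorInfo;

/* Hypothetical caller-visible context that should end up holding the error. */
typedef struct {
  int last_error;
  const char *last_detail;
} CodecContext;

/* Stores the error, then jumps back if a handler is armed. */
static void report_internal_error(ErrorInfo *info, int code,
                                  const char *detail) {
  info->error_code = code;
  info->detail = detail;
  if (info->setjmp_active) longjmp(info->jmp, 1);
}

/* Copies the error code and detail into the context and returns the code. */
static int record_error_state(CodecContext *ctx, const ErrorInfo *info) {
  ctx->last_error = info->error_code;
  ctx->last_detail = info->detail;
  return info->error_code;
}

static int encode_once(CodecContext *ctx, ErrorInfo *info,
                       int simulate_failure) {
  info->error_code = 0;
  info->detail = NULL;
  if (setjmp(info->jmp)) {
    /* Reached via longjmp(): propagate what the error report stored. */
    info->setjmp_active = 0;
    return record_error_state(ctx, info);
  }
  info->setjmp_active = 1;
  if (simulate_failure) report_internal_error(info, 7, "simulated failure");
  info->setjmp_active = 0;
  return 0;
}

/* Small driver so the sketch can be exercised without extra setup. */
static int demo_encode(int simulate_failure) {
  CodecContext ctx = { 0, NULL };
  ErrorInfo info;
  return encode_once(&ctx, &info, simulate_failure);
}
```

Without the call in the setjmp() branch, the caller would see a generic failure and lose the code and detail that the error report had set.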
The function convolve8_4_usdot contains a comment relating to the
SDOT implementation of convolve8, which requires addition of a
correction constant to account for range clamp of the input values.
This is not performed in the i8mm USDOT implementation - so remove the
comment.
Also add some const qualifiers to function arguments.
Change-Id: I10aff560d20403897f708ee293bf873be9c35761
Fix the following clang-tidy misc-include-cleaner warnings:
vp9/encoder/vp9_encoder.c:
no header providing "vp9_is_valid_scale" is directly included
no header providing "VPX_CODEC_CORRUPT_FRAME" is directly included
vp9/vp9_cx_iface.c:
no header providing "valid_ref_frame_size" is directly included
Change-Id: I20e846f5b14c42c72aaefec0718b4ae9c7eea44a
Issue explanation:
The unit test calls set_config function twice after encoding the
first frame.
The first call of set_config reduces frame width, but is still within
half of the first frame.
The second call reduces frame width even more, making it less than
half of the first frame. According to the encoder logic, there are
then no valid ref frames, and this frame should be set as a
forced keyframe. This leads to null pointer access in scale_factors
later.
Solution:
To ensure correct detection of a forced key frame,
we need to update the frame width and height only when the actual
encoding is performed.
Bug: b/311985118
Change-Id: Ie2cd3b760d4a4b399845693d7421c4eb11a12775
(cherry picked from commit 1ed56a46b3)
This change fixed a bug revealed by b/311294795.
In simple motion search, the reference buffer pointer needs to be
restored after the search. Otherwise, it causes problems while the
reference frame scaling happens. This CL fixes the bug.
Bug: b/311294795
Change-Id: I093722d5888de3cc6a6542de82a6ec9d601f897d
(cherry picked from commit 50ed636e49)
Use vpx_sse and vpx_highbd_sse instead of vpx_mse16x16 and
vpx_highbd_8_mse16x16 respectively to compute SSE for PSNR
calculations. This solves an issue whereby vpx_highbd_8_mse16x16
was being used to calculate SSE for 10- and 12-bit input.
This is a port of the libaom CL
https://aomedia-review.googlesource.com/c/aom/+/175063
by Jonathan Wright <jonathan.wright@arm.com>.
Bug: webm:1819
Change-Id: I37e3ac72835e67ccb44ac89a4ed16df62c2169a7
(cherry picked from commit 7dfe343199)
Issue explanation:
The unit test calls set_config function twice after encoding the
first frame.
The first call of set_config reduces frame width, but is still within
half of the first frame.
The second call reduces frame width even more, making it less than
half of the first frame. According to the encoder logic, there are
then no valid ref frames, and this frame should be set as a
forced keyframe. This leads to null pointer access in scale_factors
later.
Solution:
To ensure correct detection of a forced key frame,
we need to update the frame width and height only when the actual
encoding is performed.
Bug: b/311985118
Change-Id: Ie2cd3b760d4a4b399845693d7421c4eb11a12775
This change fixed a bug revealed by b/311294795.
In simple motion search, the reference buffer pointer needs to be
restored after the search. Otherwise, it causes problems while the
reference frame scaling happens. This CL fixes the bug.
Bug: b/311294795
Change-Id: I093722d5888de3cc6a6542de82a6ec9d601f897d
fseeko and ftello are available on Android only from API level 24. Add
the needed guards for these functions.
Suggested by Yifan Yang.
Change-Id: I3a6721d31e1d961ab10b434ea6e92959bd5a70ab
(cherry picked from commit bf07554183)
This change fixed a corner case bug revealed by b/311394513.
During frame scaling, vpx_highbd_convolve8() and vpx_scaled_2d()
require that both x_step_q4 and y_step_q4 be less than or equal to a
defined value. Otherwise, the encoder needs to call
vp9_scale_and_extend_frame_nonnormative(), which supports arbitrary
scaling.
The fix was done in the LBD and HBD functions.
Bug: b/311394513
Change-Id: Id0d34e7910ec98859030ef968ac19331488046d4
(cherry picked from commit 8bf3649d41)
Need to set skip_recode properly so that
vp9_encode_block_intra() can work properly when it is
called by block_rd_txfm(). We can not skip "recode" because
it is still at the rd search stage.
Bug: b/310340241
Change-Id: I7d7600ef72addd341636549c2dad1868ad90e1cb
(cherry picked from commit f10481dc0a)
no header providing "CONFIG_VP9_HIGHBITDEPTH" is directly included
no header providing "VPX_BITS_8" is directly included
Change-Id: Ie6d78c79ab462501417f2b451bbe808a1fdce931
Since the reference frame is already scaled, do not scale the offsets.
BUG: b/311489136, b/312656387
Change-Id: Ib346242e7ec8c4d3ed26668fa4094271218278ed
(cherry picked from commit 845a817c05)
Use vpx_sse and vpx_highbd_sse instead of vpx_mse16x16 and
vpx_highbd_8_mse16x16 respectively to compute SSE for PSNR
calculations. This solves an issue whereby vpx_highbd_8_mse16x16
was being used to calculate SSE for 10- and 12-bit input.
This is a port of the libaom CL
https://aomedia-review.googlesource.com/c/aom/+/175063
by Jonathan Wright <jonathan.wright@arm.com>.
Bug: webm:1819
Change-Id: I37e3ac72835e67ccb44ac89a4ed16df62c2169a7
vpx/vpx_integer.h is clearly intended as the facade header for the
Standard C Library headers <stddef.h>, <inttypes.h>, and <stdint.h>.
It is reasonable to expect that vpx/vpx_decoder.h and vpx/vpx_encoder.h
should provide the symbols from vpx/vpx_codec.h.
Change-Id: I220797e63b2efc3dd9e2ac197fe2f918bf80d247
This change fixed a corner case bug revealed by b/311394513.
During frame scaling, vpx_highbd_convolve8() and vpx_scaled_2d()
require that both x_step_q4 and y_step_q4 be less than or equal to a
defined value. Otherwise, the encoder needs to call
vp9_scale_and_extend_frame_nonnormative(), which supports arbitrary
scaling.
The fix was done in the LBD and HBD functions.
Bug: b/311394513
Change-Id: Id0d34e7910ec98859030ef968ac19331488046d4
Need to set skip_recode properly so that
vp9_encode_block_intra() can work properly when it is
called by block_rd_txfm(). We can not skip "recode" because
it is still at the rd search stage.
Bug: b/310340241
Change-Id: I7d7600ef72addd341636549c2dad1868ad90e1cb
Define the VPX_DL_REALTIME, VPX_DL_GOOD_QUALITY, and VPX_DL_BEST_QUALITY
macros as unsigned long, because the deadline parameter of
vpx_codec_encode() is of the unsigned long type. This enables C++
templates to deduce the unsigned long type from these macros.
Change-Id: I2173e3bbf5e15c84c11843790df93a497a35ed7d
fseeko and ftello are available on Android only from API level 24. Add
the needed guards for these functions.
Suggested by Yifan Yang.
Change-Id: I3a6721d31e1d961ab10b434ea6e92959bd5a70ab
The changes in this CL show that both the VP8 and VP9 implementations of
the decode function eventually discard the deadline parameter. Change
the code to ignore the deadline parameter in vpx_codec_decode() without
passing it to the decode function, and document that the deadline
parameter is ignored and 0 should be passed.
Change-Id: Ia977e16cdbdf97901207aa2d749887980137c4c0
Since the reference frame is already scaled, do not scale the offsets.
BUG: b/311489136, b/312656387
Change-Id: Ib346242e7ec8c4d3ed26668fa4094271218278ed
Add an Armv8.0 MLA Neon implementation of horizontal convolution
specialised for executing with 4-tap filters (the most common filter
size for settings --good --cpu-used=1.) This new path is also used
when executing with bilinear (2-tap) filters.
Change-Id: Ic2c3cb307b95964cd0ba86f1c42eece3a8ab7cf4
Add an Armv8.0 MLA Neon implementation of vertical convolution
specialised for executing with 4-tap filters (the most common filter
size for settings --good --cpu-used=1.) This new path is also used
when executing with bilinear (2-tap) filters.
Change-Id: I027eaf2d1bb9711c2217cc8aa6b1e379d3e66b26
The deadline parameter of vpx_codec_encode() is of the unsigned long
type. The cpplint runtime/int check and the clang-tidy
google-runtime-int warn about the use of the unsigned long type. Adding
a type alias works around this issue.
Note: vpx_codec_decode() also has a deadline parameter, but it is of the
long type. So unfortunately this type alias cannot be simply named
vpx_codec_deadline_t and the name must suggest it is encoder-specific.
Change-Id: I27b6b25730b620b328422ec3f91e63fdc55b377a
For realtime mode: if the deadline mode (good/best/realtime)
is changed on the fly (via codec_encode() call), force a
key frame and set the speed feature nonrd_keyframe = 1 to
avoid entering the rd pickmode.
nonrd_keyframe=0/off is the only feature in realtime mode that
involves rd pickmode, so by forcing it on/1 we can cleanly
separate nonrd (realtime) from rd (good/best), so we can
avoid possible issues on this dynamic mode switch, such as in
bug listed below.
Dynamic change of deadline, in particular for realtime mode,
involves a lot of coding/speed feature changes, so best to
also force reset with keyframe.
Added a unit test that triggers the issue in the bug.
Bug: b/310663186
Change-Id: Idf8fd7c9ee54b301968184be5481ee9faa06468d
Add an Armv8.6 USDOT Neon path for the horizontal portion of 2D
convolution, specialised for executing with 4-tap filters (the most
common filter size for settings --good --cpu-used=1.) This new path
is also used when executing with bilinear (2-tap) filters.
Change-Id: I455e5a94bdcea1358025bd8e4d4c8c62e373aa5d
Add an Armv8.6 USDOT Neon implementation of horizontal convolution
specialised for executing with 4-tap filters (the most common filter
size for settings --good --cpu-used=1.) This new path is also used
when executing with bilinear (2-tap) filters.
Change-Id: I8f7633d9852ebfe8feb9b4a055715f849cccf297
Add an Armv8.4 SDOT Neon path for the horizontal portion of 2D
convolution, specialised for executing with 4-tap filters (the most
common filter size for settings --good --cpu-used=1.) This new path
is also used when executing with bilinear (2-tap) filters.
Change-Id: I5116d10ddb371ac2cf302ef905d06f2140dc7600
Add an Armv8.4 SDOT Neon implementation of horizontal convolution
specialised for executing with 4-tap filters (the most common filter
size for settings --good --cpu-used=1.) This new path is also used
when executing with bilinear (2-tap) filters.
Change-Id: Ib396681b3f7b8b0eeba94381fbe33a06cf7b4a13
Add an Armv8.6 USDOT Neon implementation of vertical convolution
specialised for executing with 4-tap filters (the most common filter
size for settings --good --cpu-used=1.) This new path is also used
when executing with bilinear (2-tap) filters.
Change-Id: Ic893b25541e3317c5d5c270c338f868f080aed7c
Add an Armv8.4 SDOT Neon implementation of vertical convolution
specialised for executing with 4-tap filters (the most common filter
size for settings --good --cpu-used=1.) This new path is also used
when executing with bilinear (2-tap) filters.
Change-Id: I3eb00b5a34f5676b68bda60a2a29be56e3d7d0cd
vpx_get_filter_taps() currently reports either 8-tap or 2-tap.
However, many 8-tap filters are actually 0-padded, resulting in a
lot of redundant work (multiplying by, and adding, 0) when processing
using an 8-tap convolution function. In preparation for adding 2- and
4-tap SIMD implementations for the convolution paths, make the filter
size reporting more granular, stripping any 0 padding. Filter sizes
can now be reported as 2-, 4-, 6- or 8-tap.
Change-Id: I100133aac7173134af34b918c9ad3007d98d6060
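The finer-grained tap reporting described above can be sketched as a check that strips zero padding from both ends of an 8-entry kernel. The function name count_filter_taps() and the example kernels are hypothetical, and the sketch assumes the padding is symmetric; the actual vpx_get_filter_taps() may differ.

```c
#include <stdint.h>

/* Reports the effective filter length (2, 4, 6 or 8) of an 8-entry
 * kernel by testing the outermost coefficient pairs for zero. */
static int count_filter_taps(const int16_t filter[8]) {
  if (filter[0] | filter[7]) return 8;
  if (filter[1] | filter[6]) return 6;
  if (filter[2] | filter[5]) return 4;
  return 2;
}

/* Made-up example kernels, each summing to 128. */
static const int16_t kExampleBilinear[8] = { 0, 0, 0, 64, 64, 0, 0, 0 };
static const int16_t kExampleFourTap[8] = { 0, 0, -8, 72, 72, -8, 0, 0 };
static const int16_t kExampleEightTap[8] = { -1, 3, -10, 72, 72, -10, 3, -1 };
```

A caller can then dispatch to a 2- or 4-tap path and skip the multiplications by zero that an 8-tap routine would perform.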
Delete redundant transpose/permute code in the Neon dot-product
vertical convolution paths. Variable values were assigned but never
used before subsequent assignment.
Change-Id: I15b29d0c993f56599e0d18ac1d5787e6385d2a3a
The test shows that the comment for kf_max_dist in vpx/vpx_encoder.h
differs from its behavior by one. We should modify the comment to match
the encoding behavior.
Bug: webm:1829
Change-Id: Icdc58b8f6b25353f10ce8ecc481c862bd3fe86df
When all the inter reference frames are invalid, disable the speed
features that bypass intra mode search.
BUG=b/312517065
Change-Id: I246c953fad3be61b9d307da11c752a21a36b90ff
vpx_codec_iface_t is defined as follows:
typedef const struct vpx_codec_iface vpx_codec_iface_t;
Since vpx_codec_iface_t is already a const struct, it is redundant to
add "const" to vpx_codec_iface_t.
Note: I think vpx_codec_iface_t should not have been defined as a const
struct, but it is too late to change that now.
Change-Id: Ifbd3f8a63c1d48e9169ff77fa0b505ea1e65519d
When the reference frame's scaling factor is not in the supported
range, skip using it for motion compensation prediction in the
partition speed features.
BUG=b/312517065
Change-Id: Ie3687186521ad2616be258e80d3e5b16e5f2d5e9
The code is ported from libaom's aom_sse and aom_highbd_sse at
commit 1e20d2da96515524864b21010dbe23809cff2e9b.
The vpx_sse and vpx_highbd_sse functions will be used by vpx_dsp/psnr.c.
Bug: webm:1819
Change-Id: I4fbffa9000ab92755de5387b1ddd4370cb7020f7
is -> if
returns -> computes
in the documentation for ComputeQP().
This is the same as:
9142314c2 ratectrl_rtc.h: fix a few typos
+ remove a duplicate, commented out, version of GetLoopfilterLevel()
Change-Id: I8832e628b63b0b7dac6236631072f36ad55d90e8
Move some internal drop_frame code to a separate
function so the external RC can use it.
And add new flag setting under VP8E_SET_RTC_EXTERNAL_RATECTRL
to disable vp8_drop_encodedframe_overshoot() for
testing the external RC.
Unittest added for single layer and 3 temporal layers.
Bug: b/280363228
Change-Id: Ibea2f627cc54e7156ff35259a64dd111d42d146c
Older versions of MSVC do not allow declarations after statements in C
files. We don't need to support those versions of MSVC now.
Use -std=gnu99 instead of -std=gnu89.
Change-Id: I76ba962f5a2bca30d6a5b2b05c5786507398ad32
Most are related to include-what-you-use. One is to avoid using the
unsigned long type explicitly (by passing VPX_DL_REALTIME directly to
vpx_codec_encode).
Change-Id: Ieaf3418382ad8516cb4b172f7678893286fcb8cf
Declare the oxcf parameters of vp8_create_compressor() and init_config()
as const. This helps code analysis.
Change-Id: I344ef3e6afc3adced2b2865b7e0057c6d4b1d3c0
Fixes the creation of DT_TEXTREL entries in binaries built with PIE
enabled:
/usr/bin/ld: warning: creating DT_TEXTREL in a PIE
This matches the changes made in libaom:
1df26009da aom_configure: only override CONFIG_PIC if not set on cmd line
7235e65746 aom_configure.cmake: detect PIE and set CONFIG_PIC
Change-Id: I0a43e964af2d8eb8c5e7811ce14ad39285eec3a8
- Enable C vs SIMD test for x86 32-bit platform
- Correct a print message in run_tests()
BUG=webm:1800
Change-Id: Ib1ccd3a87a64b5ec6cde524a14d5d1b7e200abfb
Supports single layer and svc. For svc only the
framedrop_mode = FULL_SUPERFRAME_DROP is allowed
for now.
Dropping frames due to overshoot is enabled by the
oxcf->drop_frames_water_mark, which is zero as default.
Note that this CL also allows for drop/skip encoding of
enhancement layers if that layer bitrate is zero.
max_consec_drop is also added, set to INT_MAX as default.
Note that max_consec_drop is only used for svc mode.
It has not been added yet for single layer in libvpx encoder.
Tests added for single layer and svc case.
Change-Id: Ic12f6a0eb3fbf07d8eb8456c46cec27b2e1930d3
Guard hwcap2 feature interrogation on HAVE_NEON_I8MM so that it gets
disabled if neon_i8mm is disabled when configuring the build.
Bug: webm:1825
Change-Id: Ic6ff71f17387b96219591928a583d43560bb7c7a
The intermediate value in the target bandwidth
calculation may exceed integer bounds.
Bug: 308007926
Change-Id: I8288c5820db06a550d88bf91fccc86106996deaa
Signed-off-by: Xiahong Bao <xiahong.bao@nxp.com>
Add 'sve' arch options to the configure, build and unit test files -
adding appropriate conditional options where necessary. Arm SIMD
extensions are treated as supersets in libvpx, so disable SVE if
either Neon DotProd or I8MM are unavailable.
Change-Id: I39dd24f2b209251084d1e28d7ac68099460309bb
- Use smaller frame size that still triggers the overflow
- Do not run encoder as the encoder init also triggers the overflow
Bug: chromium:1492864
Change-Id: I392549abf69f1cfb3754cc847a214513ec9bedc5
Frame size caps the target bitrate internally, so the frame size needs
to be large enough to reproduce the target bitrate overflow in the
fuzzing test.
However the frame size needed exceeds the max buffer allowed on 32bit
system defined by VPX_MAX_ALLOCABLE_MEMORY
Bug: chromium:1492864
Change-Id: Ia3a9a78cd35516373897039a7769b492e29e8450
avg_frame_bandwidth = target_bandwidth / framerate
If target_bandwidth is too big and/or framerate is too small (< 1),
avg_frame_bandwidth could overflow
Bug: chromium:1492864
Change-Id: I32314da1414b472ae4bf2acdcd81b8a948286146
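One way to keep such a quotient inside the int range is to saturate it, as in the sketch below. The function name saturating_avg_frame_bandwidth() is hypothetical, and clamping to INT_MAX is an assumption for illustration; the text does not say how the fix itself bounds the value.

```c
#include <assert.h>
#include <limits.h>

/* Computes target_bandwidth / framerate, saturating at INT_MAX so a
 * large bandwidth or a framerate below 1 cannot overflow the int result. */
static int saturating_avg_frame_bandwidth(long long target_bandwidth,
                                          double framerate) {
  double avg;
  assert(target_bandwidth >= 0);
  assert(framerate > 0.0);
  avg = (double)target_bandwidth / framerate;
  return avg >= (double)INT_MAX ? INT_MAX : (int)avg;
}
```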
A speed feature disable_split_mask (set to 63) could cause no mode and
partition to be selected in rd_pick_partition because:
-> thresh_mult_sub8x8 all INT_MAX
-> All modes skipped for sub8x8 blocks
-> found_best_rd is 0 -> break from the loop of 4 sub blocks
-> sum_rdc is INT_MAX -> No rd update -> should_encode_sb is 0
-> Propagating to top of the tree
-> No partition / mode is selected
Bug: b/290499385
Change-Id: Ia655e262f3b32445347ae0aaf1a2d868cea997f3
Port the following libaom CLs to libvpx:
https://aomedia-review.googlesource.com/c/aom/+/178361
https://aomedia-review.googlesource.com/c/aom/+/180701
https://aomedia-review.googlesource.com/c/aom/+/181821
The tests themselves are not feature-gated in the same way that they are
used in the rest of the codebase since they are not controlled by
rtcd.pl. This means that tests that assume the existence of features not
present on the target can cause SIGILL to be thrown.
This commit extends init_vpx_test.cc to match the behaviour for other
targets and automatically disable testing for features that are not
available on the machine running the tests.
Call arm_cpu_caps() and x86_simd_caps() inside #if !CONFIG_SHARED.
All the SIMD-specialized functions (arm or x86) are internal functions,
so they are not exported from the libvpx shared library. If
CONFIG_SHARED is 1, it is not necessary to call arm_cpu_caps(),
x86_simd_caps(), and append_negative_gtest_filter() either.
Change-Id: I330631816bdb52842020c5aa2a1ad802865cc285
Fix the TODO(https://crbug.com/1486441) comment in vp8/vp8_cx_iface.c.
Make vp8cx_create_encoder_threads() work after it has been called
before. If there are already the exact number of threads it needs to
create, return immediately. Otherwise, shut down the existing threads
(by calling vp8cx_remove_encoder_threads()) and create the required
number of threads.
Call vp8cx_create_encoder_threads() in vp8e_set_config() to respond to
changes in g_threads or g_w (which also affects the number of threads
through cm->mb_cols and cpi->mt_sync_range).
Change-Id: I552eeca5b1f1f5313f59559eb1da396f270a2429
Add the mt_current_mb_col_size field to VP8_COMP to record the size of
the mt_current_mb_col array.
Move the allocation of the mt_current_mb_col array from
vp8_alloc_compressor_data() to vp8_encode_frame(), where the use of
mt_current_mb_col starts. Allocate mt_current_mb_col right before use
if mt_current_mb_col hasn't been allocated or if the current size is
incorrect.
Move the deallocation of the mt_current_mb_col array from
dealloc_compressor_data() to vp8cx_remove_encoder_threads().
Move the TODO(https://crbug.com/1486441) comment from
vp8/encoder/onyx_if.c to vp8/vp8_cx_iface.c.
Change-Id: Ic5a0793278c2cc94876669aaa0dd732412876673
This CL adds an `emms` instruction at the end of the MMX assembly
for the vpx_subtract_block function, to properly clear the register
state. This resolves a mismatch between x86 build and C only build.
BUG=webm:1816
Change-Id: I79d2947da7f587f3558a2ae17df214d2faf59e74
Make vp8cx_create_encoder_threads() undo everything cleanly before
returning an error.
Make vp8cx_remove_encoder_threads() reset pointer fields to NULL after
freeing them, reset encoding_thread_count to 0, and reset b_lpf_running
to 0 (false). This makes it safe to call vp8cx_create_encoder_threads()
after calling vp8cx_remove_encoder_threads().
Change-Id: I586f06ce3d5b1c88ca46884bb4d6667ffc97e440
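The create/remove contract described above can be illustrated with a small stand-alone sketch. WorkerPool, pool_create(), and pool_remove() are hypothetical names, and a plain array stands in for real thread state. The point is only that teardown resets every field so a later create starts from a clean slate.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the encoder's threading state. */
typedef struct {
  int *per_thread_state;
  int thread_count;
  int lpf_running;
} WorkerPool;

/* Frees resources and resets all fields, so calling it twice is harmless. */
static void pool_remove(WorkerPool *pool) {
  free(pool->per_thread_state);
  pool->per_thread_state = NULL;
  pool->thread_count = 0;
  pool->lpf_running = 0;
}

/* Expects an empty pool. On failure, undoes any partial work. */
static int pool_create(WorkerPool *pool, int thread_count) {
  if (thread_count <= 0) return -1;
  pool->per_thread_state =
      calloc((size_t)thread_count, sizeof(*pool->per_thread_state));
  if (pool->per_thread_state == NULL) {
    pool_remove(pool);
    return -1;
  }
  pool->thread_count = thread_count;
  return 0;
}

/* Creates, removes, and creates again; returns 1 if every step behaved. */
static int demo_recreate(void) {
  WorkerPool pool = { NULL, 0, 0 };
  int ok;
  if (pool_create(&pool, 4) != 0) return 0;
  pool_remove(&pool);
  ok = pool.per_thread_state == NULL && pool.thread_count == 0;
  if (pool_create(&pool, 2) != 0) return 0;
  ok = ok && pool.thread_count == 2;
  pool_remove(&pool);
  return ok;
}
```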
Fix the following compiler warning when libvpx is configured with
the --disable-multithread option:
vp9/common/vp9_thread_common.c:391:7: warning:
variable 'cur_row' set but not used [-Wunused-but-set-variable]
int cur_row;
^
Change-Id: I53aa279152715083df40990eb7fdcaeb77a66777
vp8cx_create_encoder_threads() caps the thread count at
(cm->mb_cols / cpi->mt_sync_range) - 1. If cfg.g_w is 16, cm->mb_cols is
only 1 (see vp8_alloc_frame_buffers: mb_cols = width >> 4), so we won't
be using multiple threads. To reproduce bug chromium:1486441, the test
just needs to increase cfg.g_h sufficiently.
Bug: chromium:1486441
Change-Id: Ie6b2da2e31cfa1717a481f55eebc8c875db94d87
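The cap quoted above can be worked through numerically. In this sketch the function name max_extra_encoder_threads() is hypothetical, the mt_sync_range values passed in are arbitrary example inputs, and clamping a negative result to zero is an assumption made for the illustration.

```c
#include <assert.h>

/* Computes (mb_cols / mt_sync_range) - 1 with mb_cols = width >> 4,
 * clamped to zero when the frame is too narrow for extra threads. */
static int max_extra_encoder_threads(int frame_width, int mt_sync_range) {
  int mb_cols;
  int cap;
  assert(frame_width > 0 && mt_sync_range > 0);
  mb_cols = frame_width >> 4; /* one macroblock column per 16 pixels */
  cap = mb_cols / mt_sync_range - 1;
  return cap > 0 ? cap : 0;
}
```

With a width of 16 there is a single macroblock column, so the cap is zero for any sync range and the encoder stays single-threaded.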
Use $PWD to get the current directory.
Quote directory pathnames.
Suggested by James Zern.
Bug: webm:1800
Change-Id: I51e922b24da0e89d936370f858eab55d193ebdcb
These functions assume the uint16_t samples are <= 255 (bit depth 8),
but vpx_highbd_8_mse16x16() is called for any bit depth, not just 8.
A better fix is to port the libaom CL
https://aomedia-review.googlesource.com/c/aom/+/175063 to libvpx, but
that requires porting aom_sse() and aom_highbd_sse() to libvpx, which is
quite involved. So disable vpx_highbd_8_mse16x16_neon_dotprod, etc.
first.
Bug: webm:1819
Change-Id: If495a5dedc58d9981317b9993c9fbb81ff3ab50c
libvpx 1.13.1
2023-09-29 v1.13.1 "Ugly Duckling"
This release contains two security related fixes. One each for VP8 and VP9.
- Upgrading:
This release is ABI compatible with the previous release.
- Bug fixes:
https://crbug.com/1486441 (CVE-2023-5217)
Fix to a crash related to VP9 encoding (#1642)
* tag 'v1.13.1':
update CHANGELOG
update version to 1.13.1
Fix bug with smaller width bigger size
vp9_alloccommon: clear allocation sizes on free
VP8: disallow thread count changes
encode_api_test: add ConfigResizeChangeThreadCount
README: update release version to 1.13.0
Bug: webm:1818
Change-Id: I732e2423f635d4115890f00fd63f9886e31f39a6
use sizeof(var) instead of sizeof(type) and sizeof(*var) instead of
sizeof(var[0]) for consistency in some places.
Change-Id: Ibd9a783cfef5ce1d06131df3831a4093821a502f
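A short sketch of the sizeof style this change prefers. ExampleDims and
the two helpers are hypothetical and only illustrate the idiom; they are
not taken from libvpx.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type, for illustration only. */
typedef struct {
  int rows;
  int cols;
} ExampleDims;

/* sizeof(*p) follows the pointee type, so the expression stays correct
 * if the declaration of p ever changes. */
static ExampleDims *example_alloc_dims(size_t n) {
  ExampleDims *p = (ExampleDims *)malloc(n * sizeof(*p));
  if (p != NULL) memset(p, 0, n * sizeof(*p));
  return p;
}

/* sizeof(*d) instead of sizeof(ExampleDims) or sizeof(d[0]). */
static void example_clear_dims(ExampleDims *d) { memset(d, 0, sizeof(*d)); }
```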
When the next frame is null and the current frame is an overlay
frame (that is, there is an active alt ref frame), we treat this
as the end of the sequence.
Change-Id: I49c2cf7a001df98aff8b62ba034317e408274bd4
Currently allocations are done at encoder creation time. Going from
threaded to non-threaded would cause a crash.
Bug: chromium:1486441
Change-Id: Ie301c2a70847dff2f0daae408fbef1e4d42e73d4
Update thread counts and resolution to ensure allocations are updated
correctly. VP8 is disabled to avoid a crash.
Bug: chromium:1486441
Change-Id: Ie89776d9818d27dc351eff298a44c699e850761b
define_gf_group is called at the last frame of each GOP to get the GOP
size for the next one, which means it is also called at the last GOP of
the sequence. At that point the call to WebM RC returns an error, since
WebM RC does not have any more GOPs to return.
When gop_coding_frames from the encoder is 1, it means it's running out
of firstpass stats, which means end of sequence.
Bug: b/299610956
Change-Id: I30e077a28fe41593ebabbc1dc0c2915a4bcbece3
This cpu detection implementation doesn't do anything MSVC specific,
it just calls the IsProcessorFeaturePresent function. This can be
compiled with mingw compilers just as well.
Change-Id: I55e607a47c8f5b70d9f707ef96b2fa7553f2f79f
The original ref frame index was the index in the GF group; RC expects
the index to be the one for ref frame buffer.
Change-Id: I9a2b0e72b6332023fb2e8da131b557f82db02e39
Arm Neon DotProd implementations of vpx_highbd_8_mse<w>x<h> currently
need to be enabled at compile time since they're guarded by #ifdef
feature macros. Now that run-time feature detection has been enabled
for Arm platforms, expose these implementations with distinct
*neon_dotprod names in a separate file and wire them up to the build
system and rtcd.pl. Also add new test cases for the new functions.
Change-Id: I26be6fb587258c8fa9fbf03509b7602358a001a8
Enable Arm Neon DotProd implementations of vpx_get_var_sse_sum*
specialty variance functions via run-time feature detection, wiring
up the new *neon_dotprod names to rtcd.pl. Also add new test cases.
Change-Id: I04ac3db87d32ee7f94702b6c0360254e5688f713
Arm Neon DotProd implementations of vpx_variance<w>x<h> currently
need to be enabled at compile time since they're guarded by #ifdef
feature macros. Now that run-time feature detection has been enabled
for Arm platforms, expose these implementations with distinct
*neon_dotprod names in a separate file and wire them up to the build
system and rtcd.pl. Also add new test cases for the new functions.
Remove the _neon suffix in functions making reference to
vpx_variance<w>x<h>_neon() (e.g. sub-pixel variance) - enabling use
of the appropriate *neon or *neon_dotprod version at run time.
Similar changes for the specialty variance and MSE functions will be
made in a subsequent commit.
Change-Id: I69a0ef0d622ecb2d15bd90b4ace53273a32ed22d
Arm Neon DotProd implementations of vpx_sad*4d currently need to be
enabled at compile time since they're guarded by ifdef feature
macros. Now that run-time feature detection has been enabled for Arm
platforms, expose these implementations with distinct *neon_dotprod
names in separate files and wire them up to the build system and
rtcd.pl. Also add new test cases for the new DotProd functions.
Change-Id: Ie99ee0b03ec488626f52c3f13e4111fe26cc5619
Arm Neon DotProd implementations of vpx_sad* currently need to be
enabled at compile time since they're guarded by ifdef feature
macros. Now that run-time feature detection has been enabled for Arm
platforms, expose these implementations with distinct *neon_dotprod
names in separate files and wire them up to the build system and
rtcd.pl. Also add new test cases for the new DotProd functions.
Change-Id: Ic6906c28240276ba89787eadbc9393a232374f95
Arm Neon DotProd and I8MM implementations of vpx_convolve8* currently
need to be enabled at compile time since they're guarded by ifdef
feature macros. Now that run-time feature detection has been enabled
for Arm platforms, expose these implementations with distinct
*neon_dotprod/*neon_i8mm names in separate files and wire them up to
the build system and rtcd.pl. Also add new test cases for the new
DotProd and I8MM functions.
Change-Id: I3db3cd62e8596099d9fec7805ca3ee86b2a01c74
1) Overhaul the Arm CPU feature detection code, taking inspiration
from similar recent changes in libaom.
2) Add neon_dotprod and neon_i8mm arch options in the configure,
build and unit test files, adding appropriate conditional options
where necessary.
3) Soft-enable run-time CPU feature detection by default for both 32-
bit and 64-bit Arm platforms.
Change-Id: I3f13317d88324acc5753394351188baa8d18a261
Simplify the parameters and return values of the Neon MSE helper
functions for both standard and high bitdepth - avoiding unused
return values.
Change-Id: I6f9208f9ce890fbe58346d9c7d9d701f28f2f90f
Overflow was happening in two places:
one in set_encoder_config(), where the input
layer_target_bitrates are converted from kbps to bps,
the other in vp9_calc_pframe_target_size_one_pass_vbr(),
where target is scaled by kf_ratio.
vp9_ratectrl.c:2039: runtime error: signed integer overflow:
-137438983 * 25 cannot be represented in type 'int'
Bug: chromium:1475943
Change-Id: I1ab0980862548c8827fae461df9a7a74425209ff
vp9/encoder/vp9_ratectrl.c:2171:23: runtime error: signed integer
overflow: 103079280 * -22 cannot be represented in type 'int'
Bug: chromium:1473268
Change-Id: Ic1de7d48e74d94c2a992e53ec4382b5b44dba7af
in calc_iframe_target_size():
vp8/encoder/ratectrl.c:349:31: runtime error: signed integer overflow:
38 * 343597280 cannot be represented in type 'int'
Bug: chromium:1473473
Change-Id: Ie8f7b147efb27c92314df09837b66f7d97046883
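The three sanitizer reports above share one shape: two int operands whose
product does not fit in int. A common way to avoid this is to multiply in
int64_t and clamp the result. The sketch below assumes that approach; the
messages do not say exactly how each libvpx fix is written, and
example_scaled_target is a hypothetical name.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Multiplies in 64 bits, then clamps to the range of int. The 64-bit
 * product of two int values cannot overflow when int is 32 bits wide. */
static int example_scaled_target(int target, int factor) {
  const int64_t wide = (int64_t)target * factor;
  if (wide > INT_MAX) return INT_MAX;
  if (wide < INT_MIN) return INT_MIN;
  return (int)wide;
}
```

With the operands from the reports, 38 * 343597280 saturates at INT_MAX
and 103079280 * -22 saturates at INT_MIN instead of invoking undefined
behavior.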
Remove '= {}' (C23 [1]) and use memset to clear a vpx_rc_config_t
instance.
after:
6e2c3b9b3 Add RC mode to vpx external RC interface
Fixes compile with -pedantic and Microsoft's cl compiler.
[1] https://en.cppreference.com/w/c/language/initialization
Change-Id: I2019cdf0c42103cfc80b1e58c68b7596e497007f
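A minimal sketch of the portable form. ExampleRcConfig is a hypothetical
struct standing in for vpx_rc_config_t, whose fields are not listed in
the message.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical config struct; the real vpx_rc_config_t differs. */
typedef struct {
  int frame_width;
  int frame_height;
  int target_bitrate_kbps;
} ExampleRcConfig;

/* Clears the whole struct without relying on the C23 '= {}' form. */
static void example_rc_config_init(ExampleRcConfig *cfg) {
  memset(cfg, 0, sizeof(*cfg));
}
```

In C before C23, '= { 0 }' is the other portable spelling at the point of
declaration; memset also works for a struct that already exists.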
Use an array for constant initialization rather than array syntax which
assumes the underlying type is a vector. Fixes compile error with
cl targeting Windows Arm64:
vpx_dsp\arm\fdct4x4_neon.c(55,52): error C2078: too many initializers
No change in assembly with gcc 12.2.0 & clang 14.
Bug: b/277255390
Bug: webm:1810
Fixed: webm:1810
Change-Id: Ia30edcdbb45067dfe865b9958a5eecf1fd9ddfc8
after:
22818907d normalize *const in rtcd
fixes warnings of the form:
vpx_dsp\x86\quantize_avx.c(145): warning C4028: formal parameter 2
different from declaration
Change-Id: I4dc423f11ec4a9171e18bdb6be2fa8dfb65ee61a
Fix a bug in vpx_int_pro_row_neon (increment pointer after peeled
first loop iteration) and re-enable both vpx_int_pro_row/col_neon
paths.
Also fix IntProRowTest to use width_ (instead of 0) as the src_stride
for the input data block. The test's use of 0 for src_stride is the
reason the tests passed with the buggy Neon implementation noted in
the listed bugs. (The old buggy Neon implementation fails the
adjusted unit tests.)
BUG=webm:1800
BUG=webm:1809
Change-Id: I1f4572ee155653a7596fe2c10b5938ea7a3f63ae
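Why a src_stride of 0 hid the bug can be shown with a scalar model. The
code below is not the Neon implementation: it is a hypothetical column
sum over a 4-wide block, with a deliberately buggy variant that peels the
first row but does not advance the pointer afterwards.

```c
#include <assert.h>
#include <stdint.h>

#define EXAMPLE_COLS 4

/* Reference: sum each column over `height` rows. */
static void example_col_sums(int32_t out[EXAMPLE_COLS], const uint8_t *src,
                             int stride, int height) {
  int r, c;
  for (c = 0; c < EXAMPLE_COLS; ++c) out[c] = 0;
  for (r = 0; r < height; ++r) {
    for (c = 0; c < EXAMPLE_COLS; ++c) out[c] += src[c];
    src += stride;
  }
}

/* Buggy model: row 0 is peeled, but src is not advanced before the loop,
 * so row 0 is counted twice and the last row is never read. */
static void example_col_sums_buggy(int32_t out[EXAMPLE_COLS],
                                   const uint8_t *src, int stride,
                                   int height) {
  int r, c;
  for (c = 0; c < EXAMPLE_COLS; ++c) out[c] = src[c]; /* peeled row 0 */
  /* missing here: src += stride; */
  for (r = 1; r < height; ++r) {
    for (c = 0; c < EXAMPLE_COLS; ++c) out[c] += src[c];
    src += stride;
  }
}

/* Returns 1 if both result vectors are identical. */
static int example_same(const int32_t a[EXAMPLE_COLS],
                        const int32_t b[EXAMPLE_COLS]) {
  int c;
  for (c = 0; c < EXAMPLE_COLS; ++c) {
    if (a[c] != b[c]) return 0;
  }
  return 1;
}
```

With stride 0 every row aliases row 0, so both variants agree. With the
stride set to the block width the rows differ and the buggy variant is
caught.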
Arm SIMD testing was enabled in the C vs. SIMD bit-exactness test after
the Arm SIMD mismatch was resolved.
BUG=webm:1800
Change-Id: Id60127313a0955f4a5c8468281fd5a441668fddb
The vpx_int_pro_row/col Neon SIMD versions caused a mismatch between
Neon encoding and C encoding. Disabled them for now to ensure the
correctness of VP9 encoding on the Arm platform. These 2 functions
were not used much, so this wouldn't affect the overall encoder
speed much.
BUG=webm:1800
BUG=webm:1809
Change-Id: Id1a7d542fc03d4cf9fa1039a49832abf35fb722f
- Include vpx/vpx_ext_ratectrl.h in vp9_ext_ratectrl.c
- Include vpx/internal/vpx_codec_internal.h
- Include <stddef.h> for NULL
Bug: b/294049605
Change-Id: Iedd8b3864da27fde1678bfa6606e6fc5630a7a09
- Use zero initializer instead of memset to avoid including <cstring>
- Include vpx_codec.h for vpx_codec_err_t and error codes
- Include vpx_tpl.h for VpxTplGopStats
Change-Id: Iac5837ce2173bd945bfe8eeb401ff4dfd04fd2e1
This CL adds a shell script to test bit exactness of C and SIMD
VP9 encoder for x86 platform.
As C vs. Neon encoding outputs are not bit-exact (BUG=webm:1809),
Arm tests are currently disabled.
BUG=webm:1800
Change-Id: Iffcc70863e8cf83ccb5bc5be73e8866165697358
apply similar steps as to the other quantize functions to switch to
macroblock_plane and ScanOrder
Change-Id: I486d653326aaf52ffd3beafd2e891ba6a5d57ef3
Pass macroblock_plane and ScanOrder instead of looking up the values
beforehand. Avoids pushing arguments to the stack.
Change-Id: I22df6f645eb1a1d89ba5a4d9bc58acb77af51aa9
Update functions in WRITE_COMPRESSED_STREAM blocks, which are disabled
by default. This caused them to be missed in:
84e6b7ab0 test/*.cc: prefer 'override' to 'virtual'
Change-Id: I0e462263f19c15eb0a30d0c0f4e145062f789489
In file included from ../test/bench.cc:14:
../test/bench.h:17:7: warning: 'AbstractBench' has virtual functions but
non-virtual destructor [-Wnon-virtual-dtor]
class AbstractBench {
Change-Id: Ibbfb949b63c8dff936c7ed4f2d056dea0343377b
With gcc 13.1.1
In function ‘handle_inter_mode’,
inlined from ‘vp9_rd_pick_inter_mode_sb’ at
../vp9/encoder/vp9_rdopt.c:3872:17:
../vp9/encoder/vp9_rdopt.c:3142:8: warning: ‘tmp_rd’ may be used
uninitialized [-Wmaybe-uninitialized]
3142 | rd = tmp_rd + RDCOST(x->rdmult, x->rddiv, rs, 0);
../vp9/encoder/vp9_rdopt.c: In function ‘vp9_rd_pick_inter_mode_sb’:
../vp9/encoder/vp9_rdopt.c:2846:15: note: ‘tmp_rd’ was declared here
2846 | int64_t rd, tmp_rd, best_rd = INT64_MAX;
Change-Id: I8608957cc8bbeb1ae525f3c3dad6fe9785b2a9b4
These were removed in If7a49e920e12f7fca0541190b87e6dae510df05c but
the leftovers can cause a build to fail if the code isn't optimized out.
I just found this out in the Meson port of libvpx for GStreamer.
BUG=webm:1584
Change-Id: I1c953720a2cbec3796200d4ec4020dca0b672bfb
vp9/common/vp9_mfqe.c|240 col 16| warning: code will never be executed
[-Wunreachable-code]
BLOCK_SIZE mfqe_bs, bs_tmp;
^~~~~~~
Change-Id: I566b20d8c294e19bc4b90b57b730f933048e71a5
Based on the change in libaom:
fe36011455 Fix Clang -Wunreachable-code-aggressive warnings
Clang's -Wunreachable-code-aggressive flag enables several warning flags
such as -Wunreachable-code-break and -Wunreachable-code-return. Chrome's
build system enables -Wunreachable-code-aggressive (in
build/config/compiler/BUILD.gn), so it would be good if libvpx could be
compiled without -Wunreachable-code-aggressive warnings.
This requires the VPX_NO_RETURN macro be defined correctly for all the
compilers we support, otherwise some compilers may warn about missing
return statements after a die() or fatal() call (which does not return).
Change-Id: I0c069133af45a7a61759538b6d74c681ea087dcd
This fixes a crash if the application continues to call
vpx_codec_decode(). Previously a non-keyframe could cause a crash if the
decoder failed before fully initializing due to an allocation failure.
The stream info and frame resolution would be 0, skipping an allocation.
Found with vpx_dec_fuzzer_vp8 & Nallocfuzz
(https://github.com/catenacyber/nallocfuzz).
Bug: webm:1807
Change-Id: I1c17302f4d3a488ba3b4eefe0bf53853dc558bc1
This fixes a crash if the application continues to call
vpx_codec_decode(). Previously the decoder instance would be freed,
causing a crash when attempting to access it with restart_threads=1.
Found with vpx_dec_fuzzer_vp8 & Nallocfuzz
(https://github.com/catenacyber/nallocfuzz).
Bug: webm:1807
Change-Id: Ic084894b776729bb1572f747082cef002f0832a8
This fixes a crash if the application continues to call
vpx_codec_decode().
Found with vpx_dec_fuzzer_vp8 & Nallocfuzz
(https://github.com/catenacyber/nallocfuzz).
Bug: webm:1807
Change-Id: I9867f5fc3d1163026f521a9609d3cbbc00568d1d
This avoids a crash if any of the thread allocations fail and the
application continues to call vpx_codec_decode(). Previously
num_tile_workers would be non-zero, but not equal to num_threads, which
would cause a crash during later thread management.
Found with vpx_dec_fuzzer_vp9 & Nallocfuzz
(https://github.com/catenacyber/nallocfuzz).
Bug: webm:1807
Change-Id: Ie3faf7b36764aebedac0924acb6e4cb7545aec7d
This fixes reallocations (and avoids potential crashes) if any
allocation fails and the application continues to call
vpx_codec_decode().
Found with vpx_dec_fuzzer_vp9 & Nallocfuzz
(https://github.com/catenacyber/nallocfuzz).
Bug: webm:1807
Change-Id: If5dc96b73c02efc94ec84c25eb50d10ad6b645a6
If any allocations fail in init_decoder() and the application continues
to call vpx_codec_decode() some of the allocations would be orphaned or
the decoder would be left in a partially initialized state.
Found with vpx_dec_fuzzer_vp9 & Nallocfuzz
(https://github.com/catenacyber/nallocfuzz).
Bug: webm:1807
Change-Id: I44f662526d715ecaeac6180070af40672cd42611
A right shift by 2 is equivalent to two halving operations if there is
no addition or subtraction between the two halving operations.
Note: Since vhaddq_s16() and vhsubq_s16() have 17-bit intermediate
precision, the Neon code doesn't need to go to int32_t as was done in
https://chromium-review.googlesource.com/c/webm/libvpx/+/4604169.
Change-Id: Ibe0691cde0fd3b94ee7c497845ba459d30d503b0
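The equivalence can be checked with a scalar model of a halving add.
This is a sketch under two assumptions: example_hadd only mimics the
wider intermediate precision mentioned for vhaddq_s16(), and >> on a
negative value is an arithmetic shift, as on the usual compilers. All
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a halving add: the sum is formed at wider precision,
 * so it cannot wrap before being halved. */
static int16_t example_hadd(int16_t a, int16_t b) {
  return (int16_t)(((int32_t)a + b) >> 1);
}

/* Halving add followed directly by a second halving. */
static int16_t example_half_twice(int16_t a, int16_t b) {
  return (int16_t)(example_hadd(a, b) >> 1);
}

/* Widen, add, then shift right by 2 in a single step. */
static int16_t example_shift_by_2(int16_t a, int16_t b) {
  return (int16_t)(((int32_t)a + b) >> 2);
}
```

Both paths floor toward negative infinity, so they agree as long as
nothing is added or subtracted between the two halvings.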
The corresponding case block is not only for ARM.
The original comment text is confusing to readers.
Test: N/A, just comment text changes.
Change-Id: I3154d18d3b3d237c1eecfe07dc7ec237c98194cf
Signed-off-by: Chen Wang <wangchen20@iscas.ac.cn>
This CL resolves the mismatch between C and intrinsic implementation
of vpx_hadamard_32x32 function. The mismatch was due to integer
overflow during the addition operation in the intrinsic functions.
Specifically, the addition in the intrinsic function was performed
at the 16-bit level, while the calculation of a0 + a1 resulted in
a 17-bit value.
This code change addresses the problem by performing
the addition at the 32-bit level (with sign extension) in both SSE2
and AVX2, and then converting the results back to the 16-bit level
after a right shift.
STATS_CHANGED
Change-Id: I576ca64e3b9ebb31d143fcd2da64322790bc5853
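A scalar model of the overflow and of the widening fix. This is a sketch,
not the SSE2 or AVX2 code: the function names are hypothetical and the
shift amount of 1 is chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* 16-bit addition: the 17-bit sum wraps before the shift. The final
 * conversion to int16_t is implementation-defined in older C standards
 * and is two's complement wraparound on the usual compilers. */
static int16_t example_add_shift_16(int16_t a0, int16_t a1) {
  const uint16_t wrapped = (uint16_t)((uint16_t)a0 + (uint16_t)a1);
  const int16_t sum = (int16_t)wrapped;
  return (int16_t)(sum >> 1);
}

/* Widened addition: sign-extend to 32 bits, add, shift, then narrow.
 * After the shift the result fits in 16 bits again. */
static int16_t example_add_shift_32(int16_t a0, int16_t a1) {
  const int32_t sum = (int32_t)a0 + (int32_t)a1;
  return (int16_t)(sum >> 1);
}
```

For a0 = a1 = 20000 the true sum is 40000, which needs 17 signed bits.
The widened path returns 20000, while the 16-bit path does not.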
impace -> impact
taget -> target
prediciton -> prediction
addtion -> addition
the the -> the
Bug: webm:1803
Change-Id: I759c9d930a037ca69662164fcd6be160ed707d77
Dont -> Don't
setings -> settings
thresold -> thresh
thresold -> threshold
becasue -> because
itterations -> iterations
its a -> it's a
an constant -> a constant
Bug: webm:1803
Change-Id: I1e019393939ed25c59c898c88d4941ec360b026d
In the function vp9_diamond_search_sad_avx(), arranged
the cost vector in a specific order. This ensures that
the motion vector with the least index is selected,
when there exists more than one candidate motion
vector with the minimum cost, thus resolving the
c vs avx mismatch.
STATS_CHANGED
Change-Id: I4f8864f464f9ea2aae6250db3d8ad91cb08b26e2
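The tie-breaking rule that the C path follows can be sketched in scalar
form. example_argmin_first is a hypothetical helper, not a libvpx
function; it only shows how the least index wins among equal costs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the smallest index that holds the minimum cost. The strict '<'
 * keeps the earlier candidate when two costs are equal. */
static size_t example_argmin_first(const uint32_t *cost, size_t n) {
  size_t best = 0;
  size_t i;
  assert(n > 0);
  for (i = 1; i < n; ++i) {
    if (cost[i] < cost[best]) best = i;
  }
  return best;
}
```

A vectorized search has to reproduce this ordering explicitly, since a
horizontal minimum alone does not say which lane the value came from.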
Double the number of accumulator registers to remove the bottleneck.
Also peel the first loop iteration.
Change-Id: I6a90680369f9c33cdfe14ea547ac1569ec3f50de
* changes:
vpx_dsp_common.h,clip_pixel: work around VS2022 Arm64 issue
fdct_partial_neon.c: work around VS2022 Arm64 issue
fdct8x8_test.cc: work around VS2022 Arm64 issue
New file (vpx_tpl.c) in the following CLs will add new APIs dealing with
TPL stats from VP9 encoder.
Change-Id: I5102ef64214cba1ca6ecea9582a19049666c6ca4
This CL refactors the code related to convolve function.
Furthermore, improved the AVX2 intrinsic to compute
convolve vertical for w = 4 case, and convolve horiz for
w = 16 case.
Please note the module level scaling w.r.t C function
(timer based) for existing (AVX2) and new AVX2 intrinsics:
Block     Scaling
Size      AVX2          AVX2
          (existing)    (New)
4x4       5.34x         5.91x
4x8       7.10x         7.79x
16x8      23.52x        25.63x
16x16     29.47x        30.22x
16x32     33.42x        33.44x
This is a bit exact change.
Change-Id: If130183bc12faab9ca2bcec0ceeaa8d0af05e413
2D 8-tap convolution filtering is performed in two passes -
horizontal and vertical. The horizontal pass must produce enough
input data for the subsequent vertical pass - 3 rows above and 4 rows
below, in addition to the actual block height.
At present, all Neon horizontal convolution algorithms process 4 rows
at a time, but this means we end up doing at least 1 row too much
work in the 2D first pass case where we need h + 7, not h + 8 rows of
output.
This patch adds additional dot-product (SDOT and USDOT) Neon paths
that process h + 7 rows of data exactly, saving the work of the
unnecessary extra row. It is impractical to take a similar approach
for the Armv8.0 MLA paths since we have to transpose the data block
both before and after calling the convolution helper functions.
vpx_convolve_neon performance impact: we observe a speedup of ~9% for
smaller (and wider) blocks, and a speedup of 0-3% for larger blocks.
This is to be expected since the proportion of redundant work
decreases as the block height increases.
Change-Id: Ie77ad1848707d2d48bb8851345a469aae9d097e1
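The row counts in the message above can be worked through with a small
sketch. The helper names are hypothetical; the arithmetic follows the
description of an 8-tap filter needing 3 rows above and 4 rows below the
block.

```c
#include <assert.h>

#define EXAMPLE_TAPS 8

/* Rows the horizontal pass must produce for an h-row block:
 * 3 above + h + 4 below = h + 7. */
static int example_rows_needed(int h) { return h + EXAMPLE_TAPS - 1; }

/* Rows actually processed when the kernel works in groups of 4. */
static int example_rows_processed_by_4(int h) {
  return (example_rows_needed(h) + 3) & ~3;
}

/* Redundant rows produced by the groups-of-4 approach. */
static int example_wasted_rows(int h) {
  return example_rows_processed_by_4(h) - example_rows_needed(h);
}
```

For h = 4 the pass needs 11 rows but processes 12, so one row in twelve
is redundant. For h = 64 it needs 71 and processes 72, so the redundant
share shrinks as the block height grows.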
This avoids link errors related to the sanitizers:
https://clang.llvm.org/docs/AddressSanitizer.html#usage
"When linking shared libraries, the AddressSanitizer run-time is not
linked, so -Wl,-z,defs may cause link errors ..."
See also:
https://crbug.com/aomedia/3438
Bug: webm:1801
Fixed: webm:1801
Change-Id: Ie212318005a5f7222e5486775175534025306367
1) Use #define constant instead of magic numbers for right shifts.
2) Move saturating narrow into helper functions that return 4-element
result vectors.
3) Use mem_neon.h helpers for load/store sequences in Armv8.0 paths.
4) Tidy up: assert conditions and some longer variable names.
5) Prefer != 0 to > 0 where possible for loop termination conditions.
Change-Id: Idfcac43ca38faf729dca07b8cc8f7f45ad264d24
* changes:
vp8_[cd]x_iface: clear setjmp flag on function exit
vp9_decodeframe,tile_worker_hook: relocate setjmp=1
vp9,encoder_set_config: set setjmp flag after setjmp()
rather than define new targets, add a platform to the arm64 list as they
share the same configuration.
Bug: webm:1788
Change-Id: Iac020280b1103fb12b559f21439aeff26568fba4
x86 and armv7 are skipped for now as the intrinsics will need different
flags than cl.exe (/arch:... -> -m...).
Bug: webm:1788
Change-Id: I8ca8660a8644cdd84c51cb1f75005e371ba8207d
Contains the size of GOP - also the size of the list of TPL stats for
each frame in this GOP.
VpxTplGopStats will be the unit for VP9E_GET_TPL_STATS control to return
TPL stats from the encoder.
Bug: b/273736974
Change-Id: I1682242fc6db4aafcd6314af023aa0d704976585
There were multiple implementations of CHECK_MEM_ERROR across the
library that took different arguments and were used in different places.
This CL will unify them and have only one implementation that takes
vpx_internal_error_info.
Change-Id: I2c568639473815bc00b1fc2b72be56e5ccba1a35
* changes:
Overwrite cm->error->detail before freeing
Have vpx_codec_error take const vpx_codec_ctx_t *
Add comments about vpx_codec_enc_init_ver failure
Added AVX2 intrinsic optimization for the following functions
1. vpx_idct16x16_256_add
2. vpx_idct32x32_1024_add
3. vpx_idct32x32_135_add
The module level scaling w.r.t C function (timer based) for
existing (SSE2) and new AVX2 intrinsics:
                           Scaling
Function Name              SSE2     AVX2
vpx_idct32x32_1024_add     3.62x    7.49x
vpx_idct32x32_135_add      4.85x    9.41x
vpx_idct16x16_256_add      4.82x    7.70x
This is a bit-exact change.
Change-Id: Id9dda933aa1f5093bb6b35ac3b8a41846afca9d2
Help detect use after free of the return value of
vpx_codec_error_detail(). If vpx_codec_error_detail() is called after
vpx_codec_encode() fails, the return value may be equal to
cm->error->detail, which is freed when vpx_codec_destroy() is called.
Document the lifetime of the string returned by
vpx_codec_error_detail().
Change-Id: I8089e90a4499b4f3cc5b9cfdbb25d72368faa319
Also have vpx_codec_error_detail take vpx_codec_ctx_t *. Both functions
are getter functions that don't modify the codec context.
Change-Id: I4689022425efbf7b1da5034255ac052fce5e5b4f
Address the questions:
1. If vpx_codec_enc_init_ver() fails, should I still call
vpx_codec_destroy() on the encoder context?
2. Is it safe to call vpx_codec_error_detail() when
vpx_codec_enc_init_ver() failed?
Change-Id: I1b0e090d11dd9f853fe203f4cbb6080c3c7b0506
I realized the calculation of the size of the list of VpxTplBlockStats
is non-trivial. So it's better to add the field for the size.
Bug: b/273736974
Change-Id: Ic1b50597c1f89a8f866b5669ca676407be6dc9d8
This allows AArch64 to be correctly detected when building with Visual
Studio (cl.exe) and fixes a crash in vp9_diamond_search_sad_neon.c.
There are still test failures, however.
Microsoft's compiler doesn't define __ARM_FEATURE_*. To use those paths
we may need to rely on _M_ARM64_EXTENSION.
Bug: webm:1788
Bug: b/277255076
Change-Id: I4d26f5f84dbd0cbcd1cdf0d7d932ebcf109febe5
This will allow identifying Windows Visual Studio targets as aarch64;
the Microsoft compiler does not define __aarch64__.
An alternative would be to define this in the code, checking for
_M_ARM64 or _M_ARM64EC. For now we'll use the existing VPX_ARCH_*
system. For compatibility VPX_ARCH_ARM will continue to be defined to 1
in this case.
Bug: webm:1788
Bug: b/277255076
Change-Id: I12e25710891e86f0c7339ba96884c18ed90ba16f
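The in-code alternative mentioned above could look like the sketch
below. It is not what this change does (the change keeps the VPX_ARCH_*
system); EXAMPLE_ARCH_AARCH64 and example_is_aarch64_build are
hypothetical names used only to show the macro check.

```c
#include <assert.h>

/* __aarch64__ is defined by GCC and Clang; _M_ARM64 and _M_ARM64EC are
 * the macros named in the message for the Microsoft compiler. */
#if defined(__aarch64__) || defined(_M_ARM64) || defined(_M_ARM64EC)
#define EXAMPLE_ARCH_AARCH64 1
#else
#define EXAMPLE_ARCH_AARCH64 0
#endif

/* Returns 1 when this translation unit was built for AArch64. */
static int example_is_aarch64_build(void) { return EXAMPLE_ARCH_AARCH64; }
```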
Get ready for changes to follow:
- Custom reader/writer IO functions
- Codec control to get TPL stats from the encoder
Move the definition of TplFrameStats to public header so applications
can use them directly.
Bug: b/273736974
Change-Id: Ieb0db4560ddd966df1bc01f6a7e179cc97f9bac1
Joint motion search during compound mode eval is optimized by
reducing the number of mv search iterations based on bsize.
The sf 'comp_inter_joint_search_thresh' is renamed as
'comp_inter_joint_search_iter_level' and used to add the logic.
cpu   Testset   Instr. Cnt   BD Rate loss (%)
                Red (%)      avg. psnr   ovr.psnr   ssim
0     LOWRES2   5.373        0.0917      0.1088     0.0294
0     MIDRES2   3.395        0.0239      0.0520     0.0783
0     HDRES2    2.291        0.0223      0.0301     0.0053
0     Average   3.686        0.0460      0.0636     0.0377
STATS_CHANGED
Change-Id: I7ee8873ebc8af967382324ae8f5c70c26665d5e6
This is a reland of commit 3c59378e4e
Addressed issues from the previous CL:
- Both recon_error and rate_cost are scaled up
- recon_error and rate_cost are not accumulated across ref frames,
instead they are calculated with the best ref frame picked.
- get_quantize_error() is put where it was, so there is no behavior
change for vp9.
Bug: b/273736974
Original change's description:
> Calculate recrf_dist and recrf_rate
>
> Change-Id: I74e74807436b92d729e2ccaab96149780f1f52d9
Change-Id: I20e1f5543e83b576a074bd4e6b44d99da65f4b56
This reverts commit 3c59378e4e.
Reason for revert:
recon_error and recon_rate are summed by mistake across reference
frames, as pointed out by Angie.
It could also cause vp9 behavior changes.
Original change's description:
> Calculate recrf_dist and recrf_rate
>
> Change-Id: I74e74807436b92d729e2ccaab96149780f1f52d9
Change-Id: I6106ce77cb0fe8c12b2bcf070d01513ffa8dc613
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
This allows the testdata target to work in environments like cygwin/msys
when a Windows-style path is used. It may also fix using paths with
spaces, though that's not generally recommended.
Change-Id: Id444c14468b05d589bce49c1f612aa712a3f0c8c
in get_rdmult_delta() and compute_frame_aq_offset().
quiets -Wunused-but-set-variable with clang-17
Change-Id: I726852f3bc42afa80a18475de910040a9436b0bb
Add Neon implementations of high bitdepth downsampling SAD4D
functions for all block sizes.
Also add corresponding unit tests.
Change-Id: Ib0c2f852e269cbd6cbb8f4dfb54349654abb0adb
Add Neon implementations of standard bitdepth downsampling SAD4D
functions for all block sizes.
Also add corresponding unit tests.
Change-Id: Ieb77661ea2bbe357529862a5fb54956e34e8d758
Add Neon implementations of high bitdepth downsampling SAD functions
for all block sizes.
Also add corresponding unit tests.
Change-Id: I56ea656e9bb5f8b2aedfdc4637c9ab4e1951b31b
Add Neon implementations of standard bitdepth downsampling SAD
functions for all block sizes.
Also add corresponding unit tests.
Change-Id: Ibda734c270278d947673ffcc29ef17a2f4970b01
Introduced AVX2 intrinsic to compute FDCT for block size
16x16 case. This is a bit-exact change.
Please check the module level scaling w.r.t C function (timer based)
for existing (SSE2) and new AVX2 intrinsics:
Scaling
SSE2 AVX2
3.88x 5.95x
Change-Id: I02299c3746fcb52d808e2a75d30aa62652c816dc
I believe the following comments are wrongly scoped, possibly left over
from previous changesets. This made me very confused when reading the
test suite Makefile, in order to port it to Meson.
Change-Id: Ice3c7ba50c6909a9c7dfd4001afa1e1ddfa4b5ce
New linear model to calculate loopfilter level from frame qp.
Linear regression was done on qvga, vga, and hd clips.
Bug: b/275304642
Change-Id: I552b312212bb4de21b53b762d139aa9588c64ae2
Added an assert for prune_single_mode_based_on_mv_diff_mode_rate
speed feature. This ensures NEARMV or ZEROMV modes are pruned
only when NEARESTMV and NEWMV modes are not early terminated.
Change-Id: Id8b03eef6d1ef3f16714a9cbfde0c171c0c6fe0b
Pack nz_mask with zero. After the result is permuted this has the effect
of ignoring the upper half of the iscan register which is only loaded
with 128-bits. Depending on the optimization level and the load used the
upper half of the ymm register may contain undefined values which can
produce an incorrect eob. If this is large enough it can cause a crash.
Bug: chromium:1431729
Change-Id: I4ebae9fa39f228bdd29dcc19935f3f07759d75f5
Add a widening 4D reduction function operating on uint16x8_t vectors
and use it to optimize the final reduction in Armv8.0 Neon standard
bitdepth 16xh, 32xh and 64xh SAD4D computations.
Also simplify the Armv8.0 Neon version of the sad64xhx4d_neon helper
function since VP9 block sizes are not large enough to require
widening to 32-bit accumulators before the final reduction.
Change-Id: I32b0a283d7688d8cdf21791add9476ed24c66a28
Add a 4D reduction function operating on uint16x8_t vectors and use
it to optimize the final reduction in standard bitdepth 4xh and 8xh
SAD4D computations. Similar 4D reduction optimizations have already
been implemented for all other standard bitdepth block sizes, and all
high bitdepth block sizes.[1]
[1] https://chromium-review.googlesource.com/c/webm/libvpx/+/4224681
Change-Id: I0aa0b6e0f70449776f316879cafc4b830e86ea51
Added AVX2 intrinsic optimization for the following functions
1. vpx_variance8x4
2. vpx_variance8x8
3. vpx_variance8x16
This is a bit-exact change.
                     Instruction Count
cpu   Resolution     Reduction(%)
0     LOWRES2        0.698
0     MIDRES2        0.577
0     HDRES2         0.469
0     Average        0.582
Change-Id: Iae8fdf9344fd012cda4955ed140633141d60ba86
The shift instructions have marginally worse performance on some
micro-architectures, and the vget_{low,high} instructions are
unnecessary.
This commit improves performance of the d135 predictors by 1.5% geomean
averaged across a range of compilers and micro-architectures.
Change-Id: Ied4c3eecc12fc973841696459d868ce403ed4e6c
Use sum_neon.h helpers for horizontal reductions in Neon DC predictors,
enabling use of dedicated Neon reduction instructions on AArch64. Some
of the surrounding code is also optimized to remove redundant broadcast
instructions in the dc_store helpers.
Performance is largely unchanged on both the standard as well as the
high bit-depth predictors. The main improvement appears to be the 16x16
standard-bitdepth dc predictor, which improves by 10-15% when
benchmarked on Neoverse N1.
Change-Id: Ibfcc6ecf4b1b2f87ce1e1f63c314d0cc35a0c76f
* changes:
Avoid LD2/ST2 instructions in highbd v predictors in Neon
Avoid interleaving loads/stores in Neon for highbd dc predictor
Avoid LD2/ST2 instructions in vpx_dc_predictor_32x32_neon
For these block sizes there is no need to widen to 32-bits until the
final reduction, so use a single vabaq instead of vabd + vpadalq.
Change-Id: I9c19d620f7bb8b3a6b0bedd37789c03bb628b563
The interleaving load/store instructions (LD2/LD3/LD4 and ST2/ST3/ST4)
are useful if we are dealing with interleaved data (e.g. real/imag
components of complex numbers), but for simply loading or storing larger
quantities of data it is preferable to simply use the normal load/store
instructions.
This patch replaces such occurrences in the two larger block sizes:
vpx_highbd_v_predictor_16x16_neon and vpx_highbd_v_predictor_32x32_neon.
Change-Id: Ie4ffa298a2466ceaf893566fd0aefe3f66f439e4
The interleaving load/store instructions (LD2/LD3/LD4 and ST2/ST3/ST4)
are useful if we are dealing with interleaved data (e.g. real/imag
components of complex numbers), but for simply loading or storing larger
quantities of data it is preferable to simply use two or more of the
normal load/store instructions.
This patch replaces such occurrences in the two larger block sizes:
vpx_highbd_dc_predictor_16x16_neon, vpx_highbd_dc_predictor_32x32_neon,
and related helper functions.
Speedups over the original Neon code (higher is better):
Microarch. | Compiler | Block | Speedup
Neoverse N1 | LLVM 15 | 16x16 | 1.25
Neoverse N1 | LLVM 15 | 32x32 | 1.13
Neoverse N1 | GCC 12 | 16x16 | 1.56
Neoverse N1 | GCC 12 | 32x32 | 1.52
Neoverse V1 | LLVM 15 | 16x16 | 1.63
Neoverse V1 | LLVM 15 | 32x32 | 1.08
Neoverse V1 | GCC 12 | 16x16 | 1.59
Neoverse V1 | GCC 12 | 32x32 | 1.37
Change-Id: If5ec220aba9dd19785454eabb0f3d6affec0cc8b
The LD2 and ST2 instructions are useful if we are dealing with
interleaved data (e.g. real/imag components of complex numbers), but for
simply loading or storing larger quantities of data it is preferable to
simply use two of the normal load/store instructions.
This patch replaces such occurrences in vpx_dc_predictor_32x32_neon and
related functions.
With Clang-15 this speeds up this function by 10-30% depending on the
micro-architecture being benchmarked on. With GCC-12 this speeds up the
function by 40-60% depending on the micro-architecture being benchmarked
on.
Change-Id: I670dc37908aa238f360104efd74d6c2108ecf945
The existing tests duplicate `above_row_[block_size - 1]` after the
first `block_size` elements, which can lead to tests incorrectly passing
due to differing behaviour when calculating the average for the last
elements of the output.
This change adjusts the above array setup to be fully random instead,
allowing us to catch such issues here rather than in other larger tests
like the external MD5 tests.
It doesn't appear that other architectures are fully clean with this
change so restrict it to just Neon for now until they are fixed.
Bug: webm:1797
Change-Id: If83ff1adbf1e8d30f2a92474d7186c65840a5d0b
The existing standard bitdepth implementation doesn't appear to manifest
as a failure in any of the predictor or MD5 tests, but it does rely on
the predictor tests filling the second `bs` elements of the `above`
input array with copies of `above[bs - 1]` in order to match the C
implementation.
This patch adjusts the Neon implementation to correctly match the C
implementation in the case where the elements of the `above` array all
differ.
The geomean of performance for the predictor is approximately a 2%
slowdown compared to the previous vectorized implementation. This is
still considerably faster than the unspecialized naive C implementation.
Bug: webm:1797
Change-Id: I8fb00a154288d54b24a72a7ff63c816bdcf3aca3
The existing implementation doesn't appear to manifest as a failure in
any of the predictor or MD5 tests, but it does rely on the predictor
tests filling the second `bs` elements of the `above` input array with
copies of `above[bs - 1]` in order to match the C implementation.
This patch adjusts the Neon implementation to correctly match the C
implementation in the case where the elements of the `above` array all
differ.
Performance of the predictor is mostly unchanged, except for the 32x32
block size where it appears to have gotten about 40% faster when
compiled with clang-15.
Bug: webm:1797
Change-Id: Iaad58e77c5467307a3c80d6989b7cf2988e09311
The existing implementation doesn't appear to manifest as a failure in
any of the predictor or MD5 tests, but it does rely on the predictor
tests filling the second `bs` elements of the `above` input array with
copies of `above[bs - 1]` in order to match the C implementation.
This patch adjusts the Neon implementation to correctly match the C
implementation in the case where the elements of the `above` array all
differ.
Performance of the predictor is mostly unchanged, except for the 16x16
block size where it appears to have gotten marginally faster across most
compiler/micro-architecture combinations.
Bug: webm:1797
Change-Id: Iac166d6047316c0382e0f2790ce780fc99674b43
Introduced AVX2 intrinsic to compute convolve vertical for
w = 4 case. This is a bit-exact change.
Instruction Count
cpu | Resolution | Reduction(%)
0   | LOWRES2    | 0.364
0   | MIDRES2    | 0.236
0   | HDRES2     | 0.162
0   | Average    | 0.254
Change-Id: I413f58aa6333a6f2421d4c10d49dec01e55b2098
This matches the style guide and fixes some -Wshadow warnings related to
variables with the same name. Something similar was done in libaom in:
03f6fdcfca Fix warnings reported by -Wshadow: Part1b: scan_order struct
and variable
Bug: webm:1793
Change-Id: Ide5127886b7fd7778e6d8a983bfba6edda21ff28
Fix comment typos for vpx_codec_destroy() and vpx_codec_enc_init_ver().
Based on the change made in libaom:
https://aomedia.googlesource.com/aom/+/365a968684
365a968684 Fix comment typos (likely copy-and-paste errors)
Change-Id: I39edae835ed0752b569e8e7328d0709c59724ac2
This reverts commit 9c15fb62b3.
Reason for revert:
vpxenc should only use public interface
Original change's description:
> Add codec control to get tpl stats
>
> Add command line flag to vpxenc to export tpl stats
>
> Bug: b/273736974
> Change-Id: I6980096531b0c12fbf7a307fdef4c562d0c29e32
Bug: b/273736974
Change-Id: Ifa8951bb34e5936bbfc33086b22e9fc36d379bc9
Add Neon implementation of vpx_highbd_avg_4x4_c and vpx_highbd_avg_8x8_c
as well as the corresponding tests.
Change-Id: Ib1b06af5206774347690c9c56e194b76aa409c91
Shift the final read from the source by 3 to avoid breaking the
assumption that the 6-tap filter needs only 5 pixels outside of the
macroblock; this matches the sse2 and ssse3 implementations.
It's possible this restriction could be removed if the source buffers
are assumed to be padded.
Bug: webm:1795
Change-Id: I4c791e3a214898a503c78f4cedca154c75cdbaef
Fixed: webm:1795
The code to enable trellis coefficient optimization
is refactored using the sf 'trellis_opt_tx_rd'. This
change facilitates adaptive skipping of trellis
optimization based on block properties.
Change-Id: Ia1ff7cbbe5acf86414410f62655d46c099387847
This is a reland of commit 14fc40040f
Parent change fixed in crrev.com/c/webm/libvpx/+/4305500
Original change's description:
> quantize: use scan_order instead of passing scan/iscan
>
> further reduces the arguments for the 32x32. This will be applied to the base
> version as well.
>
> Change-Id: I25a162b5248b14af53d9e20c6a7fa2a77028a6d1
Change-Id: I2a7654558eaddd68bd09336bf317b297f18559d2
This is a reland of commit 573f5e662b
Alignment issue with tests fixed in crrev.com/c/webm/libvpx/+/4305500
Original change's description:
> quantize: simplify highbd 32x32_b args
>
> Change-Id: I431a41279c4c4193bc70cfe819da6ea7e1d2fba1
Change-Id: Ic868b6f987c99d88672858fedd092fa49c125e19
Change the VP9RateControlRtcConfig constructor to initialize
ss_number_layers (to 1).
Change UpdateRateControl() to return bool so that it can report failure
(due to invalid configuration).
Also change InitRateControl() to return bool to propagate the return
value of UpdateRateControl().
Note: This is a port of the libaom CL
https://aomedia-review.googlesource.com/c/aom/+/172042.
Change-Id: I90b60353b5f15692dba5d89e7b1a9c81bb2fdd89
The code that sets oxcf->ts_rate_decimator[tl] does not need to be
inside a loop that iterates over sl. Move the code out of the sl loop so
that oxcf->ts_rate_decimator[tl] is set only once.
Change-Id: I22f6c117d200ec38a757b749a8700660d15436c1
Remove the `ts_number_layers` field from VP9RateControlRtcConfig because
the base class VpxRateControlRtcConfig already has that field.
Note: In commit 65a1751e5b,
`ts_number_layers` was moved to the newly created base class
VpxRateControlRtcConfig but was inadvertently left in
VP9RateControlRtcConfig:
https://chromium-review.googlesource.com/c/webm/libvpx/+/3140048
Change-Id: I98d48e152683ec2e5e62efffb56b7f010c5d0695
Introduced AVX2 intrinsic to compute convolve horizontal for
w = 4 case. This is a bit-exact change.
Instruction Count
cpu | Resolution | Reduction(%)
0   | LOWRES2    | 0.763
0   | MIDRES2    | 0.466
0   | HDRES2     | 0.317
0   | Average    | 0.516
Change-Id: I124f3f8e994c24461812f4963b113819466db44f
Optimize vpx_minmax_8x8_neon on AArch64 targets by using the UMAXV and
UMINV instructions - computing the maximum and minimum elements in a
Neon vector.
Change-Id: I54c3a3a087d266f6774e6113e5947253df288a64
Optimize Neon implementation of vpx_satd by using ABD and UADALP instead
of ABAL and ABAL2, splitting the accumulator and using a dedicated
helper function to perform the final reduction.
Change-Id: Idcfa49e001b68b1dcd87c13fd9acc317a208cd2a
Both are around 3x faster than original C version. 8-bit gives a
small 0.5% speed increase, whereas highbd gives ~2.5%.
Change-Id: I71d75ddd2757b19aa201e879fd9fa8f3a25431ad
Introduced AVX2 intrinsic to compute convolve vertical for
w = 8 case. This is a bit-exact change.
Instruction Count
cpu | Resolution | Reduction(%)
0   | LOWRES2    | 1.347
0   | MIDRES2    | 1.046
0   | HDRES2     | 0.805
0   | Average    | 1.066
Change-Id: Idf77fff054beaf2c985b9bf2335591bda47e811f
Function renamed as 'build_inter_pred_model_rd_earlyterm' and
added a comment to explain its behavior.
Change-Id: I804e6273558ba36241232f62cf18ea754b85e369
The high bitdepth Neon code applying the first pass of the bilinear
filter for subpixel variance on blocks of width 4 processed two rows
at a time. This resulted in a source buffer overread, attempting to
produce two rows of padding for the second (vertical) pass of the
bilinear filter.
This patch modifies highbd_var_filter_block2d_bil_w4 and
highbd_avg_pred_var_filter_block2d_bil_w4 such that they only process
a single row per iteration, and only require a single row of padding
for the second pass. This prevents the buffer overread.
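The row arithmetic behind the overread can be sketched as below. This is a minimal illustration, assuming only what the message states: the second pass consumes h + 1 rows from the first pass. The helper names `first_pass_rows` and `surplus_rows` are hypothetical and do not exist in the tree.

```c
#include <assert.h>

/* Number of rows the first (horizontal) pass writes when it emits
 * rows_per_iter rows per loop iteration. The second (vertical) pass needs
 * h + 1 rows, so the count is h + 1 rounded up to a whole iteration. */
static int first_pass_rows(int h, int rows_per_iter) {
  const int needed = h + 1;
  assert(h > 0 && rows_per_iter > 0);
  return (needed + rows_per_iter - 1) / rows_per_iter * rows_per_iter;
}

/* Rows produced (and therefore read from the source) beyond what the
 * second pass actually consumes. */
static int surplus_rows(int h, int rows_per_iter) {
  return first_pass_rows(h, rows_per_iter) - (h + 1);
}
```

With two rows per iteration a 4x4 block ends up touching six source rows; with one row per iteration it touches five, which is exactly what the second pass needs.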
Since all block sizes are now processed one row at a time, there is
no need for a "padding" macro parameter - the value is always 1, with
no special case for 4xh blocks. As well as re-enabling the Neon paths
and their associated tests, we remove the now-redundant 'padding'
macro parameter.
Bug: webm:1796
Change-Id: Icd6076b38eb4476139795bb1734ca800c9edf079
vpx_highbd_8_sub_pixel_avg_variance4x4_neon
vpx_highbd_8_sub_pixel_avg_variance4x8_neon
vpx_highbd_10_sub_pixel_avg_variance4x4_neon
vpx_highbd_10_sub_pixel_avg_variance4x8_neon
vpx_highbd_12_sub_pixel_avg_variance4x4_neon
vpx_highbd_12_sub_pixel_avg_variance4x8_neon
all cause heap overflows of the form:
[ RUN ] NEON/VpxHBDSubpelAvgVarianceTest.Ref/33
=================================================================
==535205==ERROR: AddressSanitizer: heap-buffer-overflow on address
0xffff95bb0b89 at pc 0x00000116dabc bp 0xffffd09f6430 sp 0xffffd09f6428
READ of size 8 at 0xffff95bb0b89 thread T0
#0 0x116dab8 in load_unaligned_u16q vpx_dsp/arm/mem_neon.h:176:3
#1 0x116dab8 in highbd_var_filter_block2d_bil_w4
vpx_dsp/arm/highbd_subpel_variance_neon.c:49:21
#2 0x116dab8 in vpx_highbd_8_sub_pixel_avg_variance4x4_neon
vpx_dsp/arm/highbd_subpel_variance_neon.c:543:1
...
0xffff95bb0b89 is located 0 bytes to the right of 73-byte region
[0xffff95bb0b40,0xffff95bb0b89)
allocated by thread T0 here:
#0 0x5f18b0 in malloc (test_libvpx+0x5f18b0)
#1 0xce4a40 in vpx_memalign vpx_mem/vpx_mem.c:62:10
#2 0xce4a40 in vpx_malloc vpx_mem/vpx_mem.c:70:40
#3 0xa52238 in (anonymous namespace)::SubpelVarianceTest<unsigned
int (*)(unsigned char const*, int, int, int, unsigned char
const*, int, unsigned int*, unsigned char
const*)>::SetUp()
test/variance_test.cc:586:14
...
This is the same issue as:
e33d4c276 disable vpx_highbd_*_sub_pixel_variance4x{4,8}_neon
They have highbd_var_filter_block2d_bil_w4 in common.
Bug: webm:1796
Change-Id: I3ed70d0ba22e127720542612ea9f6665948eedfc
vpx_highbd_8_sub_pixel_variance4x4_neon
vpx_highbd_8_sub_pixel_variance4x8_neon
vpx_highbd_10_sub_pixel_variance4x4_neon
vpx_highbd_10_sub_pixel_variance4x8_neon
vpx_highbd_12_sub_pixel_variance4x4_neon
vpx_highbd_12_sub_pixel_variance4x8_neon
all cause heap overflows of the form:
[ RUN ] NEON/VpxHBDSubpelVarianceTest.Ref/24
=================================================================
==450528==ERROR: AddressSanitizer: heap-buffer-overflow on address
0xffff8311a571 at pc 0x0000010ca52c bp 0xffffc63e96b0 sp 0xffffc63e96a8
READ of size 8 at 0xffff8311a571 thread T0
#0 0x10ca528 in load_unaligned_u16q vpx_dsp/arm/mem_neon.h:176:3
#1 0x10ca528 in highbd_var_filter_block2d_bil_w4
vpx_dsp/arm/highbd_subpel_variance_neon.c:49:21
#2 0x10ca528 in vpx_highbd_10_sub_pixel_variance4x8_neon
vpx_dsp/arm/highbd_subpel_variance_neon.c:257:1
...
0xffff8311a571 is located 0 bytes to the right of 113-byte region
[0xffff8311a500,0xffff8311a571)
allocated by thread T0 here:
#0 0x5f18b0 in malloc (test_libvpx+0x5f18b0)
#1 0xce4f90 in vpx_memalign vpx_mem/vpx_mem.c:62:10
#2 0xce4f90 in vpx_malloc vpx_mem/vpx_mem.c:70:40
#3 0xa4ad44 in (anonymous namespace)::SubpelVarianceTest<unsigned
int (*)(unsigned char const*, int, int, int, unsigned char
const*, int, unsigned int*)>::SetUp() test/variance_test.cc:586:14
Bug: webm:1796
Change-Id: I39f7f936bae2bcbbe1f803fb10375ec02d1c1277
* changes:
Implement highbd_d207_predictor using Neon
Implement highbd_d153_predictor using Neon
Implement d207_predictor using Neon
Implement d153_predictor using Neon
Implement highbd_d63_predictor using Neon
Introduced AVX2 intrinsic to compute convolve horizontal for
w = 8 case. This is a bit-exact change.
Instruction Count
cpu | Resolution | Reduction(%)
0   | LOWRES2    | 1.509
0   | MIDRES2    | 1.165
0   | HDRES2     | 0.898
0   | Average    | 1.191
Change-Id: I699c94aa3d7ea74c58f901df906eed0b81b4ee79
horizontal_add_int64x2 was incorrectly returning a uint64_t instead of
an int64_t. This patch fixes that.
Change-Id: Ic6016cf87aebfc6a14f540b784d6648757e12b49
Currently vp9_block_error_fp_neon is only used when
CONFIG_VP9_HIGHBITDEPTH is set to false. This patch optimizes the
implementation and uses tran_low_t instead of int16_t so that the
function can also be used in builds where vp9_highbitdepth is enabled.
Change-Id: Ibab7ec5f74b7652fa2ae5edf328f9ec587088fd3
Use a mem_neon.h helper to do strided 4-byte loads instead of Neon
8-byte loads - where the last 4 bytes are out of bounds.
Re-enable the Neon code path and the tests.
Bug: webm:1794
Change-Id: I69ccff730f4a5cbf585dd6a9aa0f3eb13e150074
Add an additional 32-bit vector accumulator to allow parallel
processing on CPUs that have more than one Neon multiply-accumulate
pipeline. Also use sum_neon.h horizontal-add helpers for reduction.
Change-Id: Ibcb48a738f5dee1430c3ebcd305b5ea8ea344c40
The load of `left[bs]` in the standard bitdepth d117 Neon implementation
triggered an address-sanitizer failure.
The highbd equivalent does not appear to trigger any asan failures when
running the VP9/ExternalFrameBufferMD5Test or
VP9/TestVectorTest.MD5Match tests, but for consistency with the standard
bitdepth implementation we adjust it to avoid the over-read.
Performance is roughly identical, with a 0.8% performance improvement on
average over the previous optimised code.
Change-Id: I05dc4d43f244f4915c0ccc52cc0af999bbacb018
Add Neon implementations of the d117 predictor for 4x4, 8x8, 16x16 and
32x32 block sizes. Also update tests to add new corresponding cases.
This re-lands commit 360e9069b6,
previously reverted in commit 394de691a0.
The implementation is mostly identical to the original but with an
adjustment to how data is loaded from the `left` array. In particular
the left array cannot be guaranteed to be larger than the block size, so
the read of e.g. `left[32]` in the `bs=32` case is not valid. This turns
out to be not a problem since the last lane loaded in this case is
unused. I have added comments in the code to explain why this is the
case.
Since we cannot load the last element directly, we instead construct it
from the previous aligned read. This seems to have an inconsistent
effect on performance, improving by up to 10% in some cases and
regressing by up to 10% on others. Either way it is still significantly
faster than the original C code.
Speedups over the C code (higher is better):
Microarch. | Compiler | Block | Speedup
Neoverse N1 | LLVM 15 | 4x4 | 1.88
Neoverse N1 | LLVM 15 | 8x8 | 5.19
Neoverse N1 | LLVM 15 | 16x16 | 9.63
Neoverse N1 | LLVM 15 | 32x32 | 13.85
Neoverse N1 | GCC 12 | 4x4 | 2.04
Neoverse N1 | GCC 12 | 8x8 | 4.62
Neoverse N1 | GCC 12 | 16x16 | 9.79
Neoverse N1 | GCC 12 | 32x32 | 4.69
Neoverse V1 | LLVM 15 | 4x4 | 1.75
Neoverse V1 | LLVM 15 | 8x8 | 6.71
Neoverse V1 | LLVM 15 | 16x16 | 9.62
Neoverse V1 | LLVM 15 | 32x32 | 13.81
Neoverse V1 | GCC 12 | 4x4 | 1.75
Neoverse V1 | GCC 12 | 8x8 | 6.01
Neoverse V1 | GCC 12 | 16x16 | 6.91
Neoverse V1 | GCC 12 | 32x32 | 4.39
Change-Id: Ia0977ff0b0eba2c41c7884b64e7c22ff9bc9549d
Add Neon implementations of the highbd d63 predictor for 4x4, 8x8, 16x16
and 32x32 block sizes. Also update tests to add new corresponding cases.
This re-lands commit 7cdf139e3d,
previously reverted in 7478b7e4e4.
Compared to the previous implementation attempt we now correctly match
the behaviour of the C code when handling the final element loaded from
the 'above' input array. In particular:
- The C code for a 4x4 block performs a full average of the last element
rather than duplicating the final element from the input 'above'
array.
- The C code for other block sizes performs a full average for the
stride=0 and stride=1, and otherwise shifts in duplicates of the final
element from the input 'above' array. Notably this shifting for later
strides _replaces_ the final element which we previously performed an
average on (see {d0,d1}_ext in the code).
It is worth noting that this difference is not caught by the existing
VP9HighbdIntraPredTest test cases since the test vector initialisation
contains this loop:
for (int x = block_size; x < 2 * block_size; x++) {
above_row_[x] = above_row_[block_size - 1];
}
Since AVG2(a, a) and AVG3(a, a, a) are simply 'a', such differences in
behaviour for the final element are not observed.
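A small scalar check makes this concrete. The AVG2/AVG3 definitions below are the usual rounded averages and are reproduced here as an assumption rather than copied from the tree; `replicate_last` and `avg_equals_duplicate` are hypothetical helpers written only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded two- and three-tap averages (assumed definitions). */
#define AVG2(a, b) (((a) + (b) + 1) >> 1)
#define AVG3(a, b, c) (((a) + 2 * (b) + (c) + 2) >> 2)

/* Fills above[bs..2*bs-1] with copies of above[bs-1], as the test loop
 * quoted above does. */
static void replicate_last(uint16_t *above, int bs) {
  int x;
  assert(bs > 0);
  for (x = bs; x < 2 * bs; x++) above[x] = above[bs - 1];
}

/* Returns 1 if averaging at index i gives the same value as simply
 * duplicating above[i], i.e. the two behaviours cannot be told apart. */
static int avg_equals_duplicate(const uint16_t *above, int i) {
  return AVG2(above[i], above[i + 1]) == above[i] &&
         AVG3(above[i], above[i + 1], above[i + 2]) == above[i];
}
```

After `replicate_last` the check holds at index `bs - 1`, so an implementation that duplicates and one that averages agree. With independent values in the second half of the array they generally do not.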
Tested on AArch64 with:
- ./test_libvpx --gtest_filter="*VP9HighbdIntraPredTest*"
- ./test_libvpx --gtest_filter="*VP9/TestVectorTest.MD5Match*"
- ./test_libvpx --gtest_filter="*VP9/ExternalFrameBufferMD5Test*"
Speedups over the C code (higher is better):
Microarch. | Compiler | Block | Speedup
Neoverse N1 | LLVM 15 | 4x4 | 2.43
Neoverse N1 | LLVM 15 | 8x8 | 3.92
Neoverse N1 | LLVM 15 | 16x16 | 3.19
Neoverse N1 | LLVM 15 | 32x32 | 4.13
Neoverse N1 | GCC 12 | 4x4 | 2.92
Neoverse N1 | GCC 12 | 8x8 | 6.51
Neoverse N1 | GCC 12 | 16x16 | 4.55
Neoverse N1 | GCC 12 | 32x32 | 3.18
Neoverse V1 | LLVM 15 | 4x4 | 1.99
Neoverse V1 | LLVM 15 | 8x8 | 3.65
Neoverse V1 | LLVM 15 | 16x16 | 3.72
Neoverse V1 | LLVM 15 | 32x32 | 3.26
Neoverse V1 | GCC 12 | 4x4 | 2.39
Neoverse V1 | GCC 12 | 8x8 | 4.76
Neoverse V1 | GCC 12 | 16x16 | 3.24
Neoverse V1 | GCC 12 | 32x32 | 2.44
Change-Id: Iefaa774d6a20388b523eaa7f5df6bc5f5cf249e4
Allocate mb_plane_ on the heap to ensure src is aligned.
Now that all the implementations of the 32x32 quantize are in
intrinsics we can reference struct members directly. Saves
pushing them to the stack.
n_coeffs is not used at all for this function.
Change-Id: Ib551f7f583977602504d962b72063bc6eda9dda9
This causes various buffer overflows in the tests:
[ RUN ] NEON/SixtapPredictTest.TestWithPresetData/0
=================================================================
==22346==ERROR: AddressSanitizer: global-buffer-overflow on address
0x0000012b4a5b at pc 0x000000df0f60 bp 0xffffcf6e64b0 sp 0xffffcf6e64a8
READ of size 8 at 0x0000012b4a5b thread T0
#0 0xdf0f5c in vp8_sixtap_predict16x16_neon
vp8/common/arm/neon/sixtappredict_neon.c:1507:13
#1 0x8819e4 in (anonymous
namespace)::SixtapPredictTest_TestWithPresetData_Test::TestBody()
test/predict_test.cc:293:3
...
0x0000012b4a5b is located 2 bytes to the right of global variable
'kTestData' defined in '../test/predict_test.cc:237:24' (0x12b48a0) of
size 441
[ RUN ] NEON/SixtapPredictTest.TestWithRandomData/0
=================================================================
==22338==ERROR: AddressSanitizer: heap-buffer-overflow on address
0xffff8b5321fb at pc 0x000000df0f60 bp 0xfffff7e0cf30 sp 0xfffff7e0cf28
READ of size 8 at 0xffff8b5321fb thread T0
#0 0xdf0f5c in vp8_sixtap_predict16x16_neon
vp8/common/arm/neon/sixtappredict_neon.c:1507:13
#1 0x87d4c0 in (anonymous
namespace)::PredictTestBase::TestWithRandomData(void (*)(unsigned
char*, int, int, int, unsigned char*, int))
test/predict_test.cc:170:9
...
0xffff8b5321fb is located 2 bytes to the right of 441-byte region
[0xffff8b532040,0xffff8b5321f9)
allocated by thread T0 here:
#0 0x5fd4f0 in operator new[](unsigned long) (test_libvpx+0x5fd4f0)
#1 0x87c2e0 in (anonymous namespace)::PredictTestBase::SetUp()
test/predict_test.cc:47:12
#2 0x87d074 in non-virtual thunk to (anonymous
namespace)::PredictTestBase::SetUp() test/predict_test.cc
...
Bug: webm:1795
Change-Id: I32213a381eef91547d00f88acf90f1cf2ec2ea75
This function causes a heap overflow in the tests:
[ RUN ] NEON/VpxSseTest.RefSse/0
=================================================================
==876922==ERROR: AddressSanitizer: heap-buffer-overflow on address
0xffff8949d903 at pc 0x000000dd95d4 bp 0xfffffdd7f260 sp 0xfffffdd7f258
READ of size 8 at 0xffff8949d903 thread T0
#0 0xdd95d0 in vpx_get4x4sse_cs_neon
vpx_dsp/arm/variance_neon.c:556:10
#1 0x9d4894 in (anonymous namespace)::MainTestClass<unsigned int
(*)(unsigned char const*, int, unsigned char const*,
int)>::RefTestSse() test/variance_test.cc:531:5
#2 0x9d4894 in (anonymous
namespace)::VpxSseTest_RefSse_Test::TestBody()
test/variance_test.cc:772:30
...
0xffff8949d903 is located 3 bytes to the right of 16-byte region
[0xffff8949d8f0,0xffff8949d900)
allocated by thread T0 here:
#0 0x5fd050 in operator new[](unsigned long) (test_libvpx+0x5fd050)
#1 0x9d3e04 in (anonymous namespace)::MainTestClass<unsigned int
(*)(unsigned char const*, int, unsigned char const*,
int)>::SetUp() test/variance_test.cc:299:12
Bug: webm:1794
Change-Id: I4bc681eb9a436743ef8bfe2a2abae59ce754309c
This reverts commit 360e9069b6.
This causes ASan errors:
[ RUN ] VP9/TestVectorTest.MD5Match/1
=================================================================
==837858==ERROR: AddressSanitizer: stack-buffer-overflow on address
0xffff82ecad40 at pc 0x000000c494d4 bp 0xffffe1695800 sp 0xffffe16957f8
READ of size 16 at 0xffff82ecad40 thread T0
#0 0xc494d0 in vpx_d117_predictor_32x32_neon (test_libvpx+0xc494d0)
#1 0x1040b34 in vp9_predict_intra_block (test_libvpx+0x1040b34)
#2 0xf8feec in decode_block (test_libvpx+0xf8feec)
#3 0xf8f588 in decode_partition (test_libvpx+0xf8f588)
#4 0xf7be5c in vp9_decode_frame (test_libvpx+0xf7be5c)
...
Address 0xffff82ecad40 is located in stack of thread T0 at offset 64 in
frame
#0 0x103fd3c in vp9_predict_intra_block (test_libvpx+0x103fd3c)
This frame has 2 object(s):
[32, 64) 'left_col.i' <== Memory access at offset 64 overflows this
variable
[96, 176) 'above_data.i'
Change-Id: I058213364617dfe1036126c33a3307f8288d9ae0
This reverts commit 5359ae810c.
Reason for revert: Blocks quantize cleanups
Original change's description:
> Allow macroblock_plane to have its own rounding buffer
>
> Add 8 bytes buffer to macroblock_plane to support rounding factor.
>
> Change-Id: I3751689e4449c0caea28d3acf6cd17d7f39508ed
Change-Id: Ia2424d2114207370f0b45350313a5ff8521d25a8
While porting this function to NEON, using the SSE4_1 implementation
as a base, I noticed that both produced files with different
checksums from the C reference implementation. After investigating
further I found that the saturating pack was the culprit. Doing
the multiplication on the 32-bit values instead produces results
that match the C implementation.
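The message does not name the function or the values involved, so the following is only a generic scalar illustration of why narrowing before a multiply can diverge from a wide multiply. All names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Saturates a 32-bit value to the int16_t range, as a saturating pack
 * instruction does per lane. */
static int32_t saturate_s16(int32_t v) {
  if (v > INT16_MAX) return INT16_MAX;
  if (v < INT16_MIN) return INT16_MIN;
  return v;
}

/* Narrow first, then multiply: information is lost when |v| > 32767. */
static int32_t mul_after_pack(int32_t v, int16_t m) {
  return saturate_s16(v) * m;
}

/* Multiply on the full 32-bit value, as a scalar C reference would. */
static int32_t mul_wide(int32_t v, int16_t m) {
  const int64_t wide = (int64_t)v * m;
  assert(wide <= INT32_MAX && wide >= INT32_MIN); /* sketch: no overflow */
  return (int32_t)wide;
}
```

The two helpers agree whenever the input fits in 16 bits and disagree as soon as it does not, which is the kind of mismatch that shows up only as a checksum difference.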
Change-Id: I40c2a36551b2db363a58ea9aa19ef327f2676de3
This reverts commit 848f6e7337.
This has alignment issues, causing crashes in the tests:
SSSE3/VP9QuantizeTest.EOBCheck/*
Change-Id: Ic12014ab0a78ed3cde02d642509061552cdc8fc9
This reverts commit 573f5e662b.
This has alignment issues, causing crashes in the tests:
SSSE3/VP9QuantizeTest.EOBCheck/*
Change-Id: Ibf05e6b116c46f6e2c11187b3e3578bbd2d2c227
This reverts commit 14fc40040f.
This has alignment issues, causing crashes in the tests:
SSSE3/VP9QuantizeTest.EOBCheck/*
Change-Id: I934f9a4c3ce3db33058a65180fa645c8649c3670
This reverts commit 7cdf139e3d.
This causes failures in the VP9/ExternalFrameBufferMD5Test and
VP9/TestVectorTest.MD5Match tests in both armv7 and aarch64 builds.
Change-Id: I7ac4ba0ddc70e7e7860df9f962e6658defe1cdd5
Currently MSE functions just call the variance helpers but don't
actually use the computed sum. This patch adds dedicated helpers to
perform the computation of sse.
Add the corresponding tests as well.
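In scalar terms the difference looks roughly like this. It is a sketch with hypothetical helper names, not the Neon code: the variance path has to accumulate both the sum and the sum of squares, while MSE only needs the latter.

```c
#include <assert.h>
#include <stdint.h>

/* Variance-style helper: accumulates the sum of differences and the sum of
 * squared differences over a w x h block. */
static void sse_and_sum(const uint8_t *src, int src_stride, const uint8_t *ref,
                        int ref_stride, int w, int h, uint32_t *sse,
                        int *sum) {
  int r, c;
  assert(w > 0 && h > 0);
  *sse = 0;
  *sum = 0;
  for (r = 0; r < h; r++) {
    for (c = 0; c < w; c++) {
      const int diff = src[r * src_stride + c] - ref[r * ref_stride + c];
      *sum += diff;
      *sse += (uint32_t)(diff * diff);
    }
  }
}

/* variance = sse - sum^2 / (w * h); needs both accumulators. */
static uint32_t variance_sketch(const uint8_t *src, int src_stride,
                                const uint8_t *ref, int ref_stride, int w,
                                int h) {
  uint32_t sse;
  int sum;
  sse_and_sum(src, src_stride, ref, ref_stride, w, h, &sse, &sum);
  return sse - (uint32_t)(((int64_t)sum * sum) / (w * h));
}

/* Dedicated MSE-style helper: only the sum of squares is accumulated. */
static uint32_t mse_sketch(const uint8_t *src, int src_stride,
                           const uint8_t *ref, int ref_stride, int w, int h) {
  uint32_t sse = 0;
  int r, c;
  assert(w > 0 && h > 0);
  for (r = 0; r < h; r++) {
    for (c = 0; c < w; c++) {
      const int diff = src[r * src_stride + c] - ref[r * ref_stride + c];
      sse += (uint32_t)(diff * diff);
    }
  }
  return sse;
}
```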
Change-Id: I96a8590e3410e84d77f7187344688e02efe03902
* changes:
Implement highbd_d117_predictor using Neon
Implement highbd_d63_predictor using Neon
Implement d117_predictor using Neon
Implement d63_predictor using Neon
Now that all the implementations of the 32x32 quantize are in
intrinsics we can reference struct members directly. Saves
pushing them to the stack.
n_coeffs is not used at all for this function.
Change-Id: I2104fea3fa20c455087e21b347d6abd7ea1f3e1e
Currently only vpx_mse16x16 has a Neon implementation. This patch adds
optimized Armv8.0 and Armv8.4 dot-product paths for all block sizes:
8x8, 8x16, 16x8 and 16x16.
Add the corresponding tests as well.
Change-Id: Ib0357fdcdeb05860385fec89633386e34395e260
1) Use vtrn[12]q_[su]64 in vpx_vtrnq_[su]64* helpers on AArch64
targets. This produces half as many TRN1/2 instructions compared to
the number of MOVs that result from vcombine.
2) Use vpx_vtrnq_[su]64* helpers wherever applicable.
3) Refactor transpose_4x8_s16 to operate on 128-bit vectors.
Change-Id: I9a8b1c1fe2a98a429e0c5f39def5eb2f65759127
Use (void) to indicate an empty parameter list and match the declaration
of vpx_codec_vp[89]_[cd]x. This fixes a cfi sanitizer error.
Change-Id: I190f432eea4d1765afffd84c7458ec44d863f90c
* changes:
Add Neon implementation of high bitdepth 32x32 hadamard transform
Add Neon implementation of high bitdepth 16x16 hadamard transform
Add Neon implementation of high bitdepth 8x8 hadamard transform
This matches the style guide and fixes some -Wshadow warnings related to
variables with the same name. Something similar was done in libaom in:
863b04994b Fix warnings reported by -Wshadow: Part2: av1 directory
Bug: webm:1793
Change-Id: I4df1bbc8d079a3174d75f0d35d54c200ffdbb677
Specialize implementation of high bitdepth variance functions such that
we only widen data processing element types when absolutely necessary.
Change-Id: If4cc3fea7b5ab0821e3129ebd79ff63706a512bf
In joint_motion_search, there are four iterations.
Even iterations search in the first reference frame
and odd iterations search in the second. The last two
iterations use the search result of the first two
iterations as the start point. If the search result does
not change, the last two iterations are not necessary and can
be skipped.
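One possible reading of this early exit, as a sketch only: the struct, the callback type and the two example callbacks are hypothetical, and "does not change" is interpreted here as "neither of the first two iterations moved its motion vector".

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
  int row, col;
} MvSketch;

/* Hypothetical refinement callback: searches one reference frame starting
 * from *mv and writes the refined vector back. */
typedef void (*RefineFn)(int ref_idx, MvSketch *mv, void *ctx);

/* Runs up to four alternating refinements and returns how many ran. */
static int joint_search_sketch(MvSketch mv[2], RefineFn refine, void *ctx) {
  int ite;
  int changed = 0;
  assert(refine != NULL);
  for (ite = 0; ite < 4; ite++) {
    const int id = ite & 1; /* even: first reference, odd: second */
    MvSketch before;
    /* After the first pair, stop if neither vector moved. */
    if (ite == 2 && !changed) break;
    before = mv[id];
    refine(id, &mv[id], ctx);
    if (mv[id].row != before.row || mv[id].col != before.col) changed = 1;
  }
  return ite;
}

/* Example callbacks: one that never moves the vector, one that always does. */
static void refine_noop(int ref_idx, MvSketch *mv, void *ctx) {
  (void)ref_idx;
  (void)mv;
  (void)ctx;
}

static void refine_step(int ref_idx, MvSketch *mv, void *ctx) {
  (void)ref_idx;
  (void)ctx;
  mv->row += 1;
}
```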
Instruction Count
cpu-used | Reduction(%)
0        | 1.411
Change-Id: Ie583c9f75dd0a22bbdfb432ccdd62eea6ec4fce8
Added unit test.
Keep track of spatial layer id and frame type in case where spatial
layers are encoded in parallel by the hardware encoder.
ComputeQP() / PostEncodeUpdate() doesn't need to be called sequentially
when there is no inter layer prediction.
Bug: b/257368998
Change-Id: I50beaefcfc205d3f9a9d3dbe11fead5bfdc71489
* changes:
Optimize vpx_highbd_comp_avg_pred_neon
Add Neon AvgPredTestHBD test suite
Specialize Neon high bitdepth avg subpel variance by filter value
Specialize Neon high bitdepth subpel variance by filter value
Refactor Neon high bitdepth avg subpel variance functions
Optimize Neon high bitdepth subpel variance functions
Optimize the implementation of vpx_highbd_comp_avg_pred_neon by making
use of the URHADD instruction to compute the average.
Change-Id: Id74a6d9c33e89bc548c3c7ecace59af69051b4a7
Use the same specialization as for standard bitdepth. The rationale for
the specialization is as follows:
The optimal implementation of the bilinear interpolation depends on the
filter values being used. For both horizontal and vertical interpolation
this can simplify to just taking the source values, or averaging the
source and reference values - which can be computed more easily than a
bilinear interpolation with arbitrary filter values.
This patch introduces tests to select the optimal bilinear
interpolation implementation based on the filter values being used.
This new specialization is only used for larger block sizes.
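A scalar sketch of the idea follows. It assumes 2-tap bilinear filters of the form {128 - 16 * offset, 16 * offset} with a rounded shift by 7, where offset is the sub-pixel position in eighths; the function name is hypothetical and the real code operates on Neon vectors.

```c
#include <assert.h>
#include <stdint.h>

/* Filters one row of 16-bit samples. Reads width + 1 samples from src
 * whenever offset is non-zero. */
static void bilinear_row_sketch(const uint16_t *src, uint16_t *dst, int width,
                                int offset) {
  int i;
  assert(width > 0 && offset >= 0 && offset < 8);
  if (offset == 0) {
    /* Taps {128, 0}: the output is just the source. */
    for (i = 0; i < width; i++) dst[i] = src[i];
  } else if (offset == 4) {
    /* Taps {64, 64}: a rounded average of the two inputs. */
    for (i = 0; i < width; i++) {
      dst[i] = (uint16_t)((src[i] + src[i + 1] + 1) >> 1);
    }
  } else {
    /* Arbitrary taps: the general multiply-accumulate path. */
    const int f0 = 128 - 16 * offset;
    const int f1 = 16 * offset;
    for (i = 0; i < width; i++) {
      dst[i] = (uint16_t)((src[i] * f0 + src[i + 1] * f1 + 64) >> 7);
    }
  }
}
```

The two special cases give the same results as the general path would for those offsets, since (64a + 64b + 64) >> 7 equals (a + b + 1) >> 1 and (128a + 64) >> 7 equals a; they simply avoid the multiplies.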
Change-Id: Id5a2b2d9fac6f878795a6ed9de2bc27d9e62d661
Use the same specialization as for standard bitdepth. The rationale for
the specialization is as follows:
The optimal implementation of the bilinear interpolation depends on the
filter values being used. For both horizontal and vertical interpolation
this can simplify to just taking the source values, or averaging the
source and reference values - which can be computed more easily than a
bilinear interpolation with arbitrary filter values.
This patch introduces tests to select the optimal bilinear
interpolation implementation based on the filter values being used.
This new specialization is only used for larger block sizes.
Change-Id: I73182c979255f0332a274f2e5907df7f38c9eeb3
Use the same general code style as in the standard bitdepth Neon
implementation - merging the computation of vpx_highbd_comp_avg_pred
with the second pass of the bilinear filter to avoid storing and loading
the block again.
Also move vpx_highbd_comp_avg_pred_neon to its own file (like the
standard bitdepth implementation) since we're no longer using it for
averaging sub-pixel variance.
Change-Id: I2f5916d5b397db44b3247b478ef57046797dae6c
Use the same general code style as in the standard bitdepth Neon
implementation. Additionally, do not unnecessarily widen to 32-bit data
types when doing bilinear filtering - allowing us to process twice as
many elements per instruction.
Change-Id: I1e178991d2aa71f5f77a376e145d19257481e90f
Release v1.13.0 Ugly Duckling
2023-01-31 v1.13.0 "Ugly Duckling"
This release includes more Neon and AVX2 optimizations, adds a new codec
control to set per frame QP, upgrades GoogleTest to v1.12.1, and includes
numerous bug fixes.
- Upgrading:
This release is ABI incompatible with the previous release.
New codec control VP9E_SET_QUANTIZER_ONE_PASS to set per frame QP.
GoogleTest is upgraded to v1.12.1.
.clang-format is upgraded to clang-format-11.
VPX_EXT_RATECTRL_ABI_VERSION was bumped due to incompatible changes to the
feature of using external rate control models for vp9.
- Enhancement:
Numerous improvements on Neon optimizations.
Numerous improvements on AVX2 optimizations.
Additional ARM targets added for Visual Studio.
- Bug fixes:
Fix to calculating internal stats when frame dropped.
Fix to segfault for external resize test in vp9.
Fix to build system with replacing egrep with grep -E.
Fix to a few bugs with external RTC rate control library.
Fix to make SVC work with VBR.
Fix to key frame setting in VP9 external RC.
Fix to -Wimplicit-int (Clang 16).
Fix to VP8 external RC for buffer levels.
Fix to VP8 external RC for dynamic update of layers.
Fix to VP9 auto level.
Fix to off-by-one error of max w/h in validate_config.
Fix to make SVC work for Profile 1.
Bug: webm:1780
Change-Id: I371fc1444ead56f8d7fc510e05582b6415c3ddb1
Use standard loads and stores instead of the significantly slower
interleaving/de-interleaving variants. Also move all loads in loop
bodies above all stores as a mitigation against the compiler thinking
that the src and dst pointers alias (since we can't use restrict in
C89.)
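The loads-before-stores ordering can be shown in plain C. This is a sketch under assumptions: the message does not name the function, so `copy_rows_sketch` is hypothetical, and 4-byte `memcpy` calls stand in for the vector loads and stores.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copies a 4-wide block two rows at a time. */
static void copy_rows_sketch(const uint8_t *src, int src_stride, uint8_t *dst,
                             int dst_stride, int h) {
  assert(h > 0 && h % 2 == 0);
  do {
    uint32_t r0, r1;
    /* All loads first ... */
    memcpy(&r0, src, 4);
    memcpy(&r1, src + src_stride, 4);
    /* ... then all stores, so no load is ordered after a store that the
     * compiler must assume could alias it. */
    memcpy(dst, &r0, 4);
    memcpy(dst + dst_stride, &r1, 4);
    src += 2 * src_stride;
    dst += 2 * dst_stride;
    h -= 2;
  } while (h != 0);
}
```

Interleaving each load with its store would be equally correct, but without `restrict` the compiler has to keep the second load after the first store.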
Change-Id: Idd59dca51387f553f8db27144a2b8f2377c937d3
Add missing 4x4 and 4x8 tests for both high bitdepth sub-pixel variance
and high bitdepth averaging sub-pixel variance.
Change-Id: I042752c5b7ccc14f58075694d0bb1d36f144ad06
Move the 4D reduction helper function to sum_neon.h and use this for
both standard and high bitdepth SAD4D paths. This also removes the
AArch64 requirement for using the UDOT Neon SAD4D paths.
Change-Id: I207f76b3d42aa541809b0672c3b3d86e54d133ff
* changes:
Optimize Neon implementation of high bitdepth SAD4D functions
Optimize Neon implementation of high bitdepth avg SAD functions
Optimize Neon implementation of high bitdepth SAD functions
Optimizations take a similar form to those implemented for Armv8.0
standard bitdepth SAD4D:
- Use ABD, UADALP instead of ABAL, ABAL2 (double the throughput on
modern out-of-order Arm-designed cores.)
- Use more accumulator registers to make better use of Neon pipeline
resources on Arm CPUs that have four Neon pipes.
- Compute the four SAD sums in parallel so that we only load the source
block once - instead of four times.
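The last point can be sketched in scalar C. The function name is hypothetical and the real implementation works on Neon vectors with several accumulators; this only shows the loop structure in which each source sample is loaded once and reused for all four references.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Computes four SADs of one source block against four reference blocks. */
static void sad4d_sketch(const uint16_t *src, int src_stride,
                         const uint16_t *const ref[4], int ref_stride, int w,
                         int h, uint32_t res[4]) {
  int r, c, k;
  assert(w > 0 && h > 0);
  for (k = 0; k < 4; k++) res[k] = 0;
  for (r = 0; r < h; r++) {
    for (c = 0; c < w; c++) {
      const int s = src[r * src_stride + c]; /* loaded once per sample */
      for (k = 0; k < 4; k++) {
        res[k] += (uint32_t)abs(s - ref[k][r * ref_stride + c]);
      }
    }
  }
}
```

Calling a single-reference SAD four times would instead walk the source block four times.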
Change-Id: Ica45c44fd167e5fcc83871d8c138fc72ed3a9723
Optimizations take a similar form to those implemented for standard
bitdepth averaging SAD:
- Use ABD, UADALP instead of ABAL, ABAL2 (double the throughput on
modern out-of-order Arm-designed cores.)
- Use more accumulator registers to make better use of Neon pipeline
resources on Arm CPUs that have four Neon pipes.
Change-Id: I75c5f09948f6bf17200f82e00e7a827a80451108
Optimizations take a similar form to those implemented for standard
bitdepth SAD:
- Use ABD, UADALP instead of ABAL, ABAL2 (double the throughput on
modern out-of-order Arm-designed cores.)
- Use more accumulator registers to make better use of Neon pipeline
resources on Arm CPUs that have four Neon pipes.
Change-Id: I9e626d7fa0e271908dc43448405a7985b80e6230
In BEST encoding mode, the mesh search range wasn't initialized for
non-FC_GRAPHICS_ANIMATION content types, so it mistakenly
used speed 0's setting. Fixed it by adding the initialization.
There were 2 ways to fix this. Patchset 1 set to use speed 0's setting
for non FC_GRAPHICS_ANIMATION type. This didn't change BEST mode's
encoding results much, and only a couple of clips' results were changed.
Borg result for BEST mode:
Test set | avg_psnr | ovr_psnr | ssim   | encoding_spdup
lowres2  | -0.004   | -0.003   | -0.000 | 0.030
midres2  | -0.006   | -0.009   | -0.012 | 0.033
hdres2   | 0.002    | 0.002    | 0.004  | 0.015
Patchset 2 set to use BEST's setting for non FC_GRAPHICS_ANIMATION type.
However, the majority of test clips' BDrate got changed up to
~0.5% (gain or loss), and overall it didn't give better performance
than patchset 1. So, we chose to use patchset 1.
Change-Id: Ibbf578dad04420e6ba22cb9a3ddec137a7e4deef
rather than the gcc specific __attribute__((aligned())); fixes build
targeting ARM64 windows.
Bug: webm:1788
Change-Id: I2210fc215f44d90c1ce9dee9b54888eb1b78c99e
Use the load_unaligned helper functions in mem_neon.h to load strided
sequences of 4 bytes where alignment is not guaranteed in the Neon
SAD and SAD4D paths.
Change-Id: I941d226ef94fd7a633b09fc92165a00ba68a1501
Refactor the Neon implementation of transpose_s16_8x8(q) and
transpose_u16_8x8 so that the final step compiles to 8 ZIP1/ZIP2
instructions as opposed to 8 EXT, MOV pairs. This change removes 8
instructions per call to transpose_s16_8x8(q), transpose_u16_8x8
where the result stays in registers for further processing - rather
than being stored to memory - like in vpx_hadamard_8x8_neon, for
example.
This is a backport of this libaom patch[1].
[1] https://aomedia-review.googlesource.com/c/aom/+/169426
Change-Id: Icef3e51d40efeca7008e1c4fc701bf39bd319c88
In total this gives about 9% extra performance for both rt/best
profiles.
Furthermore, add transpose_s32 16x16 function
Change-Id: Ib6f368bbb9af7f03c9ce0deba1664cef77632fe2
2023-01-24 20:56:02 +00:00
443 changed files with 36998 additions and 16560 deletions