Why is computing ridge regression with a Cholesky decomposition much quicker than using SVD?Why is it much quicker to compute ridge regression than regular linear regression?Computing ridge regression with prior different from 0SVD in linear regressionGaussian Process: Using partitions of a Cholesky decomposition solution for conditional deductionConfirming an understanding of SVDIll-conditioned covariance matrix in GP regression for Bayesian optimizationThe proof of shrinking coefficients using ridge regression through “spectral decomposition”Simulate Moderated Regression with Cholesky DecompositionCoefficients of Principal Components Regression in terms of original regressorsFor symmetric matrices, is the Cholesky decomposition better than the SVD?Why is it much quicker to compute ridge regression than regular linear regression?

Do I really need to have a scientific explanation for my premise?

At what distance can a bugbear, holding a reach weapon, with Polearm Mastery, get their Opportunity Attack?

Latex does not go to next line

Is it necessary to separate DC power cables and data cables?

Examples of a statistic that is not independent of sample's distribution?

NASA's RS-25 Engines shut down time

When traveling to Europe from North America, do I need to purchase a different power strip?

Are there historical instances of the capital of a colonising country being temporarily or permanently shifted to one of its colonies?

What are some noteworthy "mic-drop" moments in math?

Should I take out a loan for a friend to invest on my behalf?

Reversed Sudoku

Declaring and defining template, and specialising them

Why does Captain Marvel assume the people on this planet know this?

ST_Centroid in view produces geometry without further definition from geometry(polygon, ssid)

Signed and unsigned numbers

IBM PAT Number Sequence

When a wind turbine does not produce enough electricity how does the power company compensate for the loss?

How strictly should I take "Candidates must be local"?

Motivation for Zeta Function of an Algebraic Variety

In the late 1940’s to early 1950’s what technology was available that could melt a LOT of ice?

Should this registration form include field for password now or after user passes review?

Virginia employer terminated employee and wants signing bonus returned

They call me Inspector Morse

Error during using callback start_page_number in lualatex



Why is computing ridge regression with a Cholesky decomposition much quicker than using SVD?


Why is it much quicker to compute ridge regression than regular linear regression?Computing ridge regression with prior different from 0SVD in linear regressionGaussian Process: Using partitions of a Cholesky decomposition solution for conditional deductionConfirming an understanding of SVDIll-conditioned covariance matrix in GP regression for Bayesian optimizationThe proof of shrinking coefficients using ridge regression through “spectral decomposition”Simulate Moderated Regression with Cholesky DecompositionCoefficients of Principal Components Regression in terms of original regressorsFor symmetric matrices, is the Cholesky decomposition better than the SVD?Why is it much quicker to compute ridge regression than regular linear regression?













3












$begingroup$


By my understanding, for a matrix with n samples and p features:



  • Ridge regression using Cholesky decomposition takes O(p^3) time

  • Ridge regression using SVD takes O(p^3) time

  • Computing SVD when only the diagonal matrix is needed (and not u and v) takes O(np^2) time

I tested this out in scipy on both random and real-world data with p > n (p = 43624, n = 1750) and found ridge regression with a Cholesky decomposition to be much quicker than computing it using SVD and much quicker than just computing the diagonal matrix of the SVD. Why is this?



import numpy as np
from sklearn import linear_model
import scipy
import time

x = np.random.rand(1750, 43264)
y = np.random.rand(1750)

old_time = time.time()
clf = linear_model.Ridge(alpha=1.0, solver='cholesky')
clf.fit(x, y)
print("Time taken to solve Ridge with cholesky: ", time.time() - old_time)

old_time = time.time()
clf = linear_model.Ridge(alpha=1.0, solver='svd')
clf.fit(x, y)
print("Time taken to solve Ridge with SVD: ", time.time() - old_time)

old_time = time.time()
scipy.linalg.svd(x, full_matrices=False)
print("Time taken for SVD", time.time() - old_time)

old_time = time.time()
scipy.linalg.svd(x, full_matrices=False, compute_uv=False)
print("Time taken for SVD, just s", time.time() - old_time)


Output:



Time taken to solve Ridge with cholesky: 3.2336008548736572
Time taken to solve Ridge with SVD: 118.47378492355347
Time taken for SVD 92.01217150688171
Time taken for SVD, just s 44.7129647731781









share|cite|improve this question







New contributor




pighead10 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$











  • $begingroup$
    Cholesky decomposes a square symmetric matrix (such as correlations) and is much faster itself than SVD which, in general, decomposes a rectangular matrix (such as dataset) into 3 matrices.
    $endgroup$
    – ttnphns
    1 hour ago















3












$begingroup$


By my understanding, for a matrix with n samples and p features:



  • Ridge regression using Cholesky decomposition takes O(p^3) time

  • Ridge regression using SVD takes O(p^3) time

  • Computing SVD when only the diagonal matrix is needed (and not u and v) takes O(np^2) time

I tested this out in scipy on both random and real-world data with p > n (p = 43624, n = 1750) and found ridge regression with a Cholesky decomposition to be much quicker than computing it using SVD and much quicker than just computing the diagonal matrix of the SVD. Why is this?



import numpy as np
from sklearn import linear_model
import scipy
import time

x = np.random.rand(1750, 43264)
y = np.random.rand(1750)

old_time = time.time()
clf = linear_model.Ridge(alpha=1.0, solver='cholesky')
clf.fit(x, y)
print("Time taken to solve Ridge with cholesky: ", time.time() - old_time)

old_time = time.time()
clf = linear_model.Ridge(alpha=1.0, solver='svd')
clf.fit(x, y)
print("Time taken to solve Ridge with SVD: ", time.time() - old_time)

old_time = time.time()
scipy.linalg.svd(x, full_matrices=False)
print("Time taken for SVD", time.time() - old_time)

old_time = time.time()
scipy.linalg.svd(x, full_matrices=False, compute_uv=False)
print("Time taken for SVD, just s", time.time() - old_time)


Output:



Time taken to solve Ridge with cholesky: 3.2336008548736572
Time taken to solve Ridge with SVD: 118.47378492355347
Time taken for SVD 92.01217150688171
Time taken for SVD, just s 44.7129647731781









share|cite|improve this question







New contributor




pighead10 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$











  • $begingroup$
    Cholesky decomposes a square symmetric matrix (such as correlations) and is much faster itself than SVD which, in general, decomposes a rectangular matrix (such as dataset) into 3 matrices.
    $endgroup$
    – ttnphns
    1 hour ago













3












3








3


1



$begingroup$


By my understanding, for a matrix with n samples and p features:



  • Ridge regression using Cholesky decomposition takes O(p^3) time

  • Ridge regression using SVD takes O(p^3) time

  • Computing SVD when only the diagonal matrix is needed (and not u and v) takes O(np^2) time

I tested this out in scipy on both random and real-world data with p > n (p = 43624, n = 1750) and found ridge regression with a Cholesky decomposition to be much quicker than computing it using SVD and much quicker than just computing the diagonal matrix of the SVD. Why is this?



import numpy as np
from sklearn import linear_model
import scipy
import time

x = np.random.rand(1750, 43264)
y = np.random.rand(1750)

old_time = time.time()
clf = linear_model.Ridge(alpha=1.0, solver='cholesky')
clf.fit(x, y)
print("Time taken to solve Ridge with cholesky: ", time.time() - old_time)

old_time = time.time()
clf = linear_model.Ridge(alpha=1.0, solver='svd')
clf.fit(x, y)
print("Time taken to solve Ridge with SVD: ", time.time() - old_time)

old_time = time.time()
scipy.linalg.svd(x, full_matrices=False)
print("Time taken for SVD", time.time() - old_time)

old_time = time.time()
scipy.linalg.svd(x, full_matrices=False, compute_uv=False)
print("Time taken for SVD, just s", time.time() - old_time)


Output:



Time taken to solve Ridge with cholesky: 3.2336008548736572
Time taken to solve Ridge with SVD: 118.47378492355347
Time taken for SVD 92.01217150688171
Time taken for SVD, just s 44.7129647731781









share|cite|improve this question







New contributor




pighead10 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$




By my understanding, for a matrix with n samples and p features:



  • Ridge regression using Cholesky decomposition takes O(p^3) time

  • Ridge regression using SVD takes O(p^3) time

  • Computing SVD when only the diagonal matrix is needed (and not u and v) takes O(np^2) time

I tested this out in scipy on both random and real-world data with p > n (p = 43624, n = 1750) and found ridge regression with a Cholesky decomposition to be much quicker than computing it using SVD and much quicker than just computing the diagonal matrix of the SVD. Why is this?



import numpy as np
from sklearn import linear_model
import scipy
import time

x = np.random.rand(1750, 43264)
y = np.random.rand(1750)

old_time = time.time()
clf = linear_model.Ridge(alpha=1.0, solver='cholesky')
clf.fit(x, y)
print("Time taken to solve Ridge with cholesky: ", time.time() - old_time)

old_time = time.time()
clf = linear_model.Ridge(alpha=1.0, solver='svd')
clf.fit(x, y)
print("Time taken to solve Ridge with SVD: ", time.time() - old_time)

old_time = time.time()
scipy.linalg.svd(x, full_matrices=False)
print("Time taken for SVD", time.time() - old_time)

old_time = time.time()
scipy.linalg.svd(x, full_matrices=False, compute_uv=False)
print("Time taken for SVD, just s", time.time() - old_time)


Output:



Time taken to solve Ridge with cholesky: 3.2336008548736572
Time taken to solve Ridge with SVD: 118.47378492355347
Time taken for SVD 92.01217150688171
Time taken for SVD, just s 44.7129647731781






regression ridge-regression svd scipy cholesky






share|cite|improve this question







New contributor




pighead10 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.











share|cite|improve this question







New contributor




pighead10 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









share|cite|improve this question




share|cite|improve this question






New contributor




pighead10 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









asked 3 hours ago









pighead10pighead10

1264




1264




New contributor




pighead10 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.





New contributor





pighead10 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.






pighead10 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.











  • $begingroup$
    Cholesky decomposes a square symmetric matrix (such as correlations) and is much faster itself than SVD which, in general, decomposes a rectangular matrix (such as dataset) into 3 matrices.
    $endgroup$
    – ttnphns
    1 hour ago
















  • $begingroup$
    Cholesky decomposes a square symmetric matrix (such as correlations) and is much faster itself than SVD which, in general, decomposes a rectangular matrix (such as dataset) into 3 matrices.
    $endgroup$
    – ttnphns
    1 hour ago















$begingroup$
Cholesky decomposes a square symmetric matrix (such as correlations) and is much faster itself than SVD which, in general, decomposes a rectangular matrix (such as dataset) into 3 matrices.
$endgroup$
– ttnphns
1 hour ago




$begingroup$
Cholesky decomposes a square symmetric matrix (such as correlations) and is much faster itself than SVD which, in general, decomposes a rectangular matrix (such as dataset) into 3 matrices.
$endgroup$
– ttnphns
1 hour ago










1 Answer
1






active

oldest

votes


















4












$begingroup$

First, we note that ridge regression can be transformed into an OLS problem via the data augmentation trick.. This adds $p$ rows to the design matrix $mathbfX$, so the new matrix $mathbfX_lambda$ has dimensions $ (n + p) times p $.



We can then look up the run time for various OLS algorithms, for example in the book Least Squares Data Fitting with Applications or in Golub's own book on Matrix Computations. (Golub was one of inventors of currently state-of-the-art Golub-Reinsch SVD algorithm.) Since linking to Google Books is not very reliable, these costs can also be found, for example, here.



We see that relevant costs for the non-ridge case, assuming $mathbfX$ is $n times p$, are:



$$ C_LU = 2 n p^2 - frac23 p^3 tag1 $$



$$ C_SVD = 4 n p^2 + 8 p^3 tag2 $$



To obtain equations for ridge regression, we need to substitute $n rightarrow n + p $ to correctly account for data augmentation:



$$ C_LU = 2 n p^2 + frac43 p^3 tag3 $$



$$ C_SVD = 4 n p^2 + 12 p^3 tag4 $$



Note that unlike your example, $n > p$ for most typical regression problems. If we say that $n approx p$, we can obtained simplified costs in only one variable:



$$ C_LU = frac103 p^3 tag5 $$
$$ C_SVD = 16 p^3 tag6 $$



This is probably where those rough estimates you reference came from, but as you yourself discovered, this simplification hides important detail and makes inappropriate assumptions that don't fit your case. Note, for example, that the constant for SVD is much worse than for Cholesky, an important fact hidden by the big $mathordO$ notation.



What you are doing is exactly the opposite - you have $n ll p$. That means the second term is much more important! By comparing the coefficients of the $p^3$ terms of equations (3) and (4) we can argue the SVD approach will require roughly 10 times as many floating point operations as the Cholesky approach when $n ll p$.



Beyond that, it's hard to know exactly what scipy is doing under the hood - is it calculating both $mathbfU$ and $mathbfV$ when doing SVD? Is it using the randomized algorithm? Is it using data augmentation for ridge regression, or some other approach? And even once those details are known, it's hard to translate these theoretical operation counts to real-world run time because libraries like LAPACK (scipy is almost surely using LAPACK or ATLAS under the hood) are so highly vectorized (taking advantage of CPU SIMD instructions) and often multithreaded that's it's hard to estimate accurately from purely theoretical arguments. But it's safe to say that it's reasonable to expect Cholesky factorization to be much faster than SVD for a given problem, especially when $n ll p$.






share|cite|improve this answer









$endgroup$












  • $begingroup$
    Explicit data augmentation would be a very slow way to compute ridge solution when $nll p$... Everything that scales as $p^3$ can instead be rewritten to scale as $n^3$.
    $endgroup$
    – amoeba
    1 hour ago











  • $begingroup$
    Great answer. This problem arose because a paper I was following suggested, since n ≪ p, computing the SVD in order to grid-search for the best alpha, since the resulting computation for each alpha only takes O(pn^2). However, my results/your answer shows that this would only take less time if there are greater than ~10 different values of alpha to try, due to the initial overheard of the SVD being huge.
    $endgroup$
    – pighead10
    1 hour ago










Your Answer





StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
);
);
, "mathjax-editing");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "65"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);






pighead10 is a new contributor. Be nice, and check out our Code of Conduct.









draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f396914%2fwhy-is-computing-ridge-regression-with-a-cholesky-decomposition-much-quicker-tha%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









4












$begingroup$

First, we note that ridge regression can be transformed into an OLS problem via the data augmentation trick.. This adds $p$ rows to the design matrix $mathbfX$, so the new matrix $mathbfX_lambda$ has dimensions $ (n + p) times p $.



We can then look up the run time for various OLS algorithms, for example in the book Least Squares Data Fitting with Applications or in Golub's own book on Matrix Computations. (Golub was one of inventors of currently state-of-the-art Golub-Reinsch SVD algorithm.) Since linking to Google Books is not very reliable, these costs can also be found, for example, here.



We see that relevant costs for the non-ridge case, assuming $mathbfX$ is $n times p$, are:



$$ C_LU = 2 n p^2 - frac23 p^3 tag1 $$



$$ C_SVD = 4 n p^2 + 8 p^3 tag2 $$



To obtain equations for ridge regression, we need to substitute $n rightarrow n + p $ to correctly account for data augmentation:



$$ C_LU = 2 n p^2 + frac43 p^3 tag3 $$



$$ C_SVD = 4 n p^2 + 12 p^3 tag4 $$



Note that unlike your example, $n > p$ for most typical regression problems. If we say that $n approx p$, we can obtained simplified costs in only one variable:



$$ C_LU = frac103 p^3 tag5 $$
$$ C_SVD = 16 p^3 tag6 $$



This is probably where those rough estimates you reference came from, but as you yourself discovered, this simplification hides important detail and makes inappropriate assumptions that don't fit your case. Note, for example, that the constant for SVD is much worse than for Cholesky, an important fact hidden by the big $mathordO$ notation.



What you are doing is exactly the opposite - you have $n ll p$. That means the second term is much more important! By comparing the coefficients of the $p^3$ terms of equations (3) and (4) we can argue the SVD approach will require roughly 10 times as many floating point operations as the Cholesky approach when $n ll p$.



Beyond that, it's hard to know exactly what scipy is doing under the hood - is it calculating both $mathbfU$ and $mathbfV$ when doing SVD? Is it using the randomized algorithm? Is it using data augmentation for ridge regression, or some other approach? And even once those details are known, it's hard to translate these theoretical operation counts to real-world run time because libraries like LAPACK (scipy is almost surely using LAPACK or ATLAS under the hood) are so highly vectorized (taking advantage of CPU SIMD instructions) and often multithreaded that's it's hard to estimate accurately from purely theoretical arguments. But it's safe to say that it's reasonable to expect Cholesky factorization to be much faster than SVD for a given problem, especially when $n ll p$.






share|cite|improve this answer









$endgroup$












  • $begingroup$
    Explicit data augmentation would be a very slow way to compute ridge solution when $nll p$... Everything that scales as $p^3$ can instead be rewritten to scale as $n^3$.
    $endgroup$
    – amoeba
    1 hour ago











  • $begingroup$
    Great answer. This problem arose because a paper I was following suggested, since n ≪ p, computing the SVD in order to grid-search for the best alpha, since the resulting computation for each alpha only takes O(pn^2). However, my results/your answer shows that this would only take less time if there are greater than ~10 different values of alpha to try, due to the initial overheard of the SVD being huge.
    $endgroup$
    – pighead10
    1 hour ago















4












$begingroup$

First, we note that ridge regression can be transformed into an OLS problem via the data augmentation trick.. This adds $p$ rows to the design matrix $mathbfX$, so the new matrix $mathbfX_lambda$ has dimensions $ (n + p) times p $.



We can then look up the run time for various OLS algorithms, for example in the book Least Squares Data Fitting with Applications or in Golub's own book on Matrix Computations. (Golub was one of inventors of currently state-of-the-art Golub-Reinsch SVD algorithm.) Since linking to Google Books is not very reliable, these costs can also be found, for example, here.



We see that relevant costs for the non-ridge case, assuming $mathbfX$ is $n times p$, are:



$$ C_LU = 2 n p^2 - frac23 p^3 tag1 $$



$$ C_SVD = 4 n p^2 + 8 p^3 tag2 $$



To obtain equations for ridge regression, we need to substitute $n rightarrow n + p $ to correctly account for data augmentation:



$$ C_LU = 2 n p^2 + frac43 p^3 tag3 $$



$$ C_SVD = 4 n p^2 + 12 p^3 tag4 $$



Note that unlike your example, $n > p$ for most typical regression problems. If we say that $n approx p$, we can obtained simplified costs in only one variable:



$$ C_LU = frac103 p^3 tag5 $$
$$ C_SVD = 16 p^3 tag6 $$



This is probably where those rough estimates you reference came from, but as you yourself discovered, this simplification hides important detail and makes inappropriate assumptions that don't fit your case. Note, for example, that the constant for SVD is much worse than for Cholesky, an important fact hidden by the big $mathordO$ notation.



What you are doing is exactly the opposite - you have $n ll p$. That means the second term is much more important! By comparing the coefficients of the $p^3$ terms of equations (3) and (4) we can argue the SVD approach will require roughly 10 times as many floating point operations as the Cholesky approach when $n ll p$.



Beyond that, it's hard to know exactly what scipy is doing under the hood - is it calculating both $mathbfU$ and $mathbfV$ when doing SVD? Is it using the randomized algorithm? Is it using data augmentation for ridge regression, or some other approach? And even once those details are known, it's hard to translate these theoretical operation counts to real-world run time because libraries like LAPACK (scipy is almost surely using LAPACK or ATLAS under the hood) are so highly vectorized (taking advantage of CPU SIMD instructions) and often multithreaded that's it's hard to estimate accurately from purely theoretical arguments. But it's safe to say that it's reasonable to expect Cholesky factorization to be much faster than SVD for a given problem, especially when $n ll p$.






share|cite|improve this answer









$endgroup$












  • $begingroup$
    Explicit data augmentation would be a very slow way to compute ridge solution when $nll p$... Everything that scales as $p^3$ can instead be rewritten to scale as $n^3$.
    $endgroup$
    – amoeba
    1 hour ago











  • $begingroup$
    Great answer. This problem arose because a paper I was following suggested, since n ≪ p, computing the SVD in order to grid-search for the best alpha, since the resulting computation for each alpha only takes O(pn^2). However, my results/your answer shows that this would only take less time if there are greater than ~10 different values of alpha to try, due to the initial overheard of the SVD being huge.
    $endgroup$
    – pighead10
    1 hour ago













4












4








4





$begingroup$

First, we note that ridge regression can be transformed into an OLS problem via the data augmentation trick.. This adds $p$ rows to the design matrix $mathbfX$, so the new matrix $mathbfX_lambda$ has dimensions $ (n + p) times p $.



We can then look up the run time for various OLS algorithms, for example in the book Least Squares Data Fitting with Applications or in Golub's own book on Matrix Computations. (Golub was one of inventors of currently state-of-the-art Golub-Reinsch SVD algorithm.) Since linking to Google Books is not very reliable, these costs can also be found, for example, here.



We see that relevant costs for the non-ridge case, assuming $mathbfX$ is $n times p$, are:



$$ C_LU = 2 n p^2 - frac23 p^3 tag1 $$



$$ C_SVD = 4 n p^2 + 8 p^3 tag2 $$



To obtain equations for ridge regression, we need to substitute $n rightarrow n + p $ to correctly account for data augmentation:



$$ C_LU = 2 n p^2 + frac43 p^3 tag3 $$



$$ C_SVD = 4 n p^2 + 12 p^3 tag4 $$



Note that unlike your example, $n > p$ for most typical regression problems. If we say that $n approx p$, we can obtained simplified costs in only one variable:



$$ C_LU = frac103 p^3 tag5 $$
$$ C_SVD = 16 p^3 tag6 $$



This is probably where those rough estimates you reference came from, but as you yourself discovered, this simplification hides important detail and makes inappropriate assumptions that don't fit your case. Note, for example, that the constant for SVD is much worse than for Cholesky, an important fact hidden by the big $mathordO$ notation.



What you are doing is exactly the opposite - you have $n ll p$. That means the second term is much more important! By comparing the coefficients of the $p^3$ terms of equations (3) and (4) we can argue the SVD approach will require roughly 10 times as many floating point operations as the Cholesky approach when $n ll p$.



Beyond that, it's hard to know exactly what scipy is doing under the hood - is it calculating both $mathbfU$ and $mathbfV$ when doing SVD? Is it using the randomized algorithm? Is it using data augmentation for ridge regression, or some other approach? And even once those details are known, it's hard to translate these theoretical operation counts to real-world run time because libraries like LAPACK (scipy is almost surely using LAPACK or ATLAS under the hood) are so highly vectorized (taking advantage of CPU SIMD instructions) and often multithreaded that's it's hard to estimate accurately from purely theoretical arguments. But it's safe to say that it's reasonable to expect Cholesky factorization to be much faster than SVD for a given problem, especially when $n ll p$.






share|cite|improve this answer









$endgroup$



First, we note that ridge regression can be transformed into an OLS problem via the data augmentation trick.. This adds $p$ rows to the design matrix $mathbfX$, so the new matrix $mathbfX_lambda$ has dimensions $ (n + p) times p $.



We can then look up the run time for various OLS algorithms, for example in the book Least Squares Data Fitting with Applications or in Golub's own book on Matrix Computations. (Golub was one of inventors of currently state-of-the-art Golub-Reinsch SVD algorithm.) Since linking to Google Books is not very reliable, these costs can also be found, for example, here.



We see that relevant costs for the non-ridge case, assuming $mathbfX$ is $n times p$, are:



$$ C_LU = 2 n p^2 - frac23 p^3 tag1 $$



$$ C_SVD = 4 n p^2 + 8 p^3 tag2 $$



To obtain equations for ridge regression, we need to substitute $n rightarrow n + p $ to correctly account for data augmentation:



$$ C_LU = 2 n p^2 + frac43 p^3 tag3 $$



$$ C_SVD = 4 n p^2 + 12 p^3 tag4 $$



Note that unlike your example, $n > p$ for most typical regression problems. If we say that $n approx p$, we can obtained simplified costs in only one variable:



$$ C_LU = frac103 p^3 tag5 $$
$$ C_SVD = 16 p^3 tag6 $$



This is probably where those rough estimates you reference came from, but as you yourself discovered, this simplification hides important detail and makes inappropriate assumptions that don't fit your case. Note, for example, that the constant for SVD is much worse than for Cholesky, an important fact hidden by the big $mathordO$ notation.



What you are doing is exactly the opposite - you have $n ll p$. That means the second term is much more important! By comparing the coefficients of the $p^3$ terms of equations (3) and (4) we can argue the SVD approach will require roughly 10 times as many floating point operations as the Cholesky approach when $n ll p$.



Beyond that, it's hard to know exactly what scipy is doing under the hood - is it calculating both $mathbfU$ and $mathbfV$ when doing SVD? Is it using the randomized algorithm? Is it using data augmentation for ridge regression, or some other approach? And even once those details are known, it's hard to translate these theoretical operation counts to real-world run time because libraries like LAPACK (scipy is almost surely using LAPACK or ATLAS under the hood) are so highly vectorized (taking advantage of CPU SIMD instructions) and often multithreaded that's it's hard to estimate accurately from purely theoretical arguments. But it's safe to say that it's reasonable to expect Cholesky factorization to be much faster than SVD for a given problem, especially when $n ll p$.







share|cite|improve this answer












share|cite|improve this answer



share|cite|improve this answer










answered 2 hours ago









olooneyolooney

1,316614




1,316614











  • $begingroup$
    Explicit data augmentation would be a very slow way to compute ridge solution when $nll p$... Everything that scales as $p^3$ can instead be rewritten to scale as $n^3$.
    $endgroup$
    – amoeba
    1 hour ago











  • $begingroup$
    Great answer. This problem arose because a paper I was following suggested, since n ≪ p, computing the SVD in order to grid-search for the best alpha, since the resulting computation for each alpha only takes O(pn^2). However, my results/your answer shows that this would only take less time if there are greater than ~10 different values of alpha to try, due to the initial overheard of the SVD being huge.
    $endgroup$
    – pighead10
    1 hour ago
















  • $begingroup$
    Explicit data augmentation would be a very slow way to compute ridge solution when $nll p$... Everything that scales as $p^3$ can instead be rewritten to scale as $n^3$.
    $endgroup$
    – amoeba
    1 hour ago











  • $begingroup$
    Great answer. This problem arose because a paper I was following suggested, since n ≪ p, computing the SVD in order to grid-search for the best alpha, since the resulting computation for each alpha only takes O(pn^2). However, my results/your answer shows that this would only take less time if there are greater than ~10 different values of alpha to try, due to the initial overheard of the SVD being huge.
    $endgroup$
    – pighead10
    1 hour ago















$begingroup$
Explicit data augmentation would be a very slow way to compute ridge solution when $nll p$... Everything that scales as $p^3$ can instead be rewritten to scale as $n^3$.
$endgroup$
– amoeba
1 hour ago





$begingroup$
Explicit data augmentation would be a very slow way to compute ridge solution when $nll p$... Everything that scales as $p^3$ can instead be rewritten to scale as $n^3$.
$endgroup$
– amoeba
1 hour ago













$begingroup$
Great answer. This problem arose because a paper I was following suggested, since n ≪ p, computing the SVD in order to grid-search for the best alpha, since the resulting computation for each alpha only takes O(pn^2). However, my results/your answer shows that this would only take less time if there are greater than ~10 different values of alpha to try, due to the initial overheard of the SVD being huge.
$endgroup$
– pighead10
1 hour ago




$begingroup$
Great answer. This problem arose because a paper I was following suggested, since n ≪ p, computing the SVD in order to grid-search for the best alpha, since the resulting computation for each alpha only takes O(pn^2). However, my results/your answer shows that this would only take less time if there are greater than ~10 different values of alpha to try, due to the initial overheard of the SVD being huge.
$endgroup$
– pighead10
1 hour ago










pighead10 is a new contributor. Be nice, and check out our Code of Conduct.









draft saved

draft discarded


















pighead10 is a new contributor. Be nice, and check out our Code of Conduct.












pighead10 is a new contributor. Be nice, and check out our Code of Conduct.











pighead10 is a new contributor. Be nice, and check out our Code of Conduct.














Thanks for contributing an answer to Cross Validated!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f396914%2fwhy-is-computing-ridge-regression-with-a-cholesky-decomposition-much-quicker-tha%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Isabella Eugénie Boyer Biographie | Références | Menu de navigationmodifiermodifier le codeComparator to Compute the Relative Value of a U.S. Dollar Amount – 1774 to Present.

Join wedge with single bond in chemfigHow to make only one part of double bond bold with chemfig?Crossing bonds in chemfigjoining atoms in chemfig. Two adjacent molculesHow do I selectively change bond length in chemfig?Ugly bond joints in chemfigchemfig: reaction above arrowUsing the mhchem and chemfig packages in conjunctionBonding to specific element letter using chemfigResonance hybrids in chemfigScale chemfig molecule in beamer with tikzWhy does this chemfig bond with a hook start in the middle of the atom?

Are small insurances worth itIs insurance worth it if you can afford to replace the item? If not, when is it?Is accident insurance worth it for my kids who play sportsIs insuring property for more than it is worth allowed?At what point does it become worth it to file an insurance claim?Are wage loss insurance programs worth the cost compared to having an emergency fund?When is an event worth insuring against?Is insurance worth it if you can afford to replace the item? If not, when is it?FHA loan just commenced : Any way to get any of the up-front mortgage insurance back?Which types of insurances do I need to buy?Should I carry less renter's insurance if I can self-insure?Mortgage Adviser Signed Me Up For Multiple Home and Life Insurances (UK)Why many travel insurances don't cover country of nationality?