OnlinePCA.jl

Online Principal Component Analysis

📚 Documentation

Description

OnlinePCA.jl binarizes CSV file, summarizes the information of data matrix and, performs some online-PCA functions for extreamly large scale matrix.

Algorithms

Gradient-based
- GD-PCA
- SGD-PCA
- Oja's method : Erkki Oja et. al., 1985, Erkki Oja, 1992
- CCIPCA : Juyang Weng et. al., 2003
- RSGD-PCA : Silvere Bonnabel, 2013
- SVRG-PCA : Ohad Shamir, 2015
- RSVRG-PCA : Hongyi Zhang, et. al., 2016, Hiroyuki Sato, et. al., 2017
Krylov subspace-based
- Orthgonal Iteration (A power method to calculate multiple eigenvectors at once) : Zhaofun Bai, 1987
- Arnoldi method : Zhaofun Bai, 1987
- Lanczos method : Zhaofun Bai, 1987
Random projection-based
- Halko's method : Halko, N., et. al., 2011, Halko, N. et. al., 2011
- Algorithm971 : George C. Linderman, et. al., 2017, Huamin, Li, et. al., 2017, Vladimir Rokhlin, et. al., 2009
- Randomized Block Krylov Iteration : W, Yu, et. al., 2017
- Single-pass PCA : C Musco, et. al., 2015

Learning Parameter Scheduling

Robbins-Monro : Herbert Robbins, et. al., 1951
Momentum : Ning Qian, 1999
Nesterov's Accelerated Gradient Descent（NAG） : Nesterov, 1983
ADAGRAD : John Duchi, et. al., 2011

Installation

julia> Pkg.add(url="https://github.com/rikenbit/OnlinePCA.jl.git")
julia> Pkg.add("PlotlyJS")

# push the key "]" and type the following command.
(v1.7) pkg> add https://github.com/rikenbit/OnlinePCA.jl
# After that, push Ctrl + C to leave from Pkg REPL mode

Basic API usage

Preprocess of CSV

using OnlinePCA
using OnlinePCA: write_csv
using Distributions
using DelimitedFiles
using SparseArrays
using MatrixMarket

# CSV
tmp = mktempdir()
data = Int64.(ceil.(rand(NegativeBinomial(1, 0.5), 300, 99)))
data[1:50, 1:33] .= 100*data[1:50, 1:33]
data[51:100, 34:66] .= 100*data[51:100, 34:66]
data[101:150, 67:99] .= 100*data[101:150, 67:99]
write_csv(joinpath(tmp, "Data.csv"), data)

# Binarization (Zstandard)
csv2bin(csvfile=joinpath(tmp, "Data.csv"), binfile=joinpath(tmp, "Data.zst"))

# Matrix Market (MM)
mmwrite(joinpath(tmp, "Data.mtx"), sparse(data))

# Binarization (Zstandard)
csv2bin(csvfile=joinpath(tmp, "Data.csv"), binfile=joinpath(tmp, "Data.zst"))

# Summary of data for CSV/Dense Matrix
dense_path = mktempdir()
sumr(binfile=joinpath(tmp, "Data.zst"), outdir=dense_path)

Setting for plot

using DataFrames
using PlotlyJS

function subplots(respca, group)
	# data frame
	data_left = DataFrame(pc1=respca[:,1], pc2=respca[:,2], group=group)
	data_right = DataFrame(pc2=respca[:,2], pc3=respca[:,3], group=group)
	# plot
	p_left = Plot(data_left, x=:pc1, y=:pc2, mode="markers", marker_size=10, group=:group)
	p_right = Plot(data_right, x=:pc2, y=:pc3, mode="markers", marker_size=10,
	group=:group, showlegend=false)
	p_left.data[1]["marker_color"] = "red"
	p_left.data[2]["marker_color"] = "blue"
	p_left.data[3]["marker_color"] = "green"
	p_right.data[1]["marker_color"] = "red"
	p_right.data[2]["marker_color"] = "blue"
	p_right.data[3]["marker_color"] = "green"
	p_left.data[1]["name"] = "group1"
	p_left.data[2]["name"] = "group2"
	p_left.data[3]["name"] = "group3"
	p_left.layout["title"] = "PC1 vs PC2"
	p_right.layout["title"] = "PC2 vs PC3"
	p_left.layout["xaxis_title"] = "pc1"
	p_left.layout["yaxis_title"] = "pc2"
	p_right.layout["xaxis_title"] = "pc2"
	p_right.layout["yaxis_title"] = "pc3"
	plot([p_left p_right])
end

group=vcat(repeat(["group1"],inner=33), repeat(["group2"],inner=33), repeat(["group3"],inner=33))

GD-PCA

out_gd1 = gd(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="robbins-monro", stepsize=1E-3,
    numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))
out_gd2 = gd(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="momentum", stepsize=1E-3,
    numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))
out_gd3 = gd(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="nag", stepsize=1E-3,
    numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))
out_gd4 = gd(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="adagrad", stepsize=1E-0,
    numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))

subplots(out_gd1[1], group) # Top, Left
subplots(out_gd2[1], group) # Top, Right
subplots(out_gd3[1], group) # Bottom, Left
subplots(out_gd4[1], group) # Bottom, Right

SGD-PCA

out_sgd1 = sgd(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="robbins-monro", stepsize=1E-3,
    numbatch=100, numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))
out_sgd2 = sgd(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="momentum", stepsize=1E-3,
    numbatch=100, numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))
out_sgd3 = sgd(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="nag", stepsize=1E-3,
    numbatch=100, numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))
out_sgd4 = sgd(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="adagrad", stepsize=1E-0,
    numbatch=100, numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))

subplots(out_sgd1[1], group) # Top, Left
subplots(out_sgd2[1], group) # Top, Right
subplots(out_sgd3[1], group) # Bottom, Left
subplots(out_sgd4[1], group) # Bottom, Right

Oja's method

out_oja1 = oja(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="robbins-monro", stepsize=1E+0,
    numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))
out_oja2 = oja(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="momentum", stepsize=1E-3,
    numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))
out_oja3 = oja(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="nag", stepsize=1E-3,
    numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))
out_oja4 = oja(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="adagrad", stepsize=1E-1,
    numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))

subplots(out_oja1[1], group) # Top, Left
subplots(out_oja2[1], group) # Top, Right
subplots(out_oja3[1], group) # Bottom, Left
subplots(out_oja4[1], group) # Bottom, Right

CCIPCA

out_ccipca1 = ccipca(input=joinpath(tmp, "Data.zst"), dim=3, stepsize=1E-0,
    numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))

subplots(out_ccipca1[1], group)

RSGD-PCA

out_rsgd1 = rsgd(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="robbins-monro", stepsize=1E+2,
    numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))
out_rsgd2 = rsgd(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="momentum", stepsize=1E-3,
    numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))
out_rsgd3 = rsgd(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="nag", stepsize=1E-3,
    numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))
out_rsgd4 = rsgd(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="adagrad", stepsize=1E-1,
    numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))

subplots(out_rsgd1[1], group) # Top, Left
subplots(out_rsgd2[1], group) # Top, Right
subplots(out_rsgd3[1], group) # Bottom, Left
subplots(out_rsgd4[1], group) # Bottom, Right

SVRG-PCA

out_svrg1 = svrg(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="robbins-monro", stepsize=1E-5,
    numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))
out_svrg2 = svrg(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="momentum", stepsize=1E-5,
    numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))
out_svrg3 = svrg(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="nag", stepsize=1E-5,
    numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))
out_svrg4 = svrg(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="adagrad", stepsize=1E-2,
    numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))

subplots(out_svrg1[1], group) # Top, Left
subplots(out_svrg2[1], group) # Top, Right
subplots(out_svrg3[1], group) # Bottom, Left
subplots(out_svrg4[1], group) # Bottom, Right

RSVRG-PCA

out_rsvrg1 = rsvrg(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="robbins-monro", stepsize=1E-6,
    numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))
out_rsvrg2 = rsvrg(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="momentum", stepsize=1E-6,
    numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))
out_rsvrg3 = rsvrg(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="nag", stepsize=1E-6,
    numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))
out_rsvrg4 = rsvrg(input=joinpath(tmp, "Data.zst"), dim=3, scheduling="adagrad", stepsize=1E-2,
    numepoch=10, rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))

subplots(out_rsvrg1[1], group) # Top, Left
subplots(out_rsvrg2[1], group) # Top, Right
subplots(out_rsvrg3[1], group) # Bottom, Left
subplots(out_rsvrg4[1], group) # Bottom, Right

Orthogonal Iteration (Power method)

out_orthiter = orthiter(input=joinpath(tmp, "Data.zst"), dim=3,
    rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))

subplots(out_orthiter[1], group)

Arnoldi method

out_arnoldi = arnoldi(input=joinpath(tmp, "Data.zst"), dim=3,
    rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))

subplots(out_arnoldi[1], group)

Lanczos method

out_lanczos = lanczos(input=joinpath(tmp, "Data.zst"), dim=3,
    rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))

subplots(out_lanczos[1], group)

Halko's method

out_halko = halko(input=joinpath(tmp, "Data.zst"), dim=3,
    rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))

subplots(out_halko[1], group)

Algorithm 971

out_algorithm971 = algorithm971(input=joinpath(tmp, "Data.zst"), dim=3,
    rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))

subplots(out_algorithm971[1], group)

Randomized Block Krylov Iteration

out_rbkiter = rbkiter(input=joinpath(tmp, "Data.zst"), dim=3,
    rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))

subplots(out_rbkiter[1], group)

Single-pass PCA type I

out_singlepass = singlepass(input=joinpath(tmp, "Data.zst"), dim=3,
    rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))

subplots(out_singlepass[1], group)

Single-pass PCA type II

out_singlepass2 = singlepass2(input=joinpath(tmp, "Data.zst"), dim=3,
    rowmeanlist=joinpath(dense_path, "Feature_LogMeans.csv"))

subplots(out_singlepass2[1], group)

Summarization for 10X-HDF5

tenxsumr(tenxfile="Data.h5", group="mm10", chunksize=100)

Algorithm 971 for 10X-HDF5

out_tenxpca = tenxpca(tenxfile="Data.h5", scale="sqrt",
    rowmeanlist="Feature_SqrtMeans.csv", dim=3, chunksize=100, group="mm10")

Summary of data for MM/Sparse Matrix

# Sparsification + Binarization (Zstandard + MM format)
mm2bin(mmfile=joinpath(tmp, "Data.mtx"), binfile=joinpath(tmp, "Data.mtx.zst"))

sparse_path = mktempdir()
sumr(binfile=joinpath(tmp, "Data.mtx.zst"), outdir=sparse_path, mode="sparse_mm")

Sparse Randomized SVD for MM format

out_sparse_rsvd = sparse_rsvd(
	input=joinpath(tmp, "Data.mtx.zst"),
	scale="ftt",
    rowmeanlist=joinpath(sparse_path, "Feature_FTTMeans.csv"),
	dim=3, chunksize=100)

subplots(out_sparse_rsvd[1], group)

Exact Out-of-Core PCA

Unlike other PCAs, this function assumes matrix data with data x dimensions. It is also computationally efficient when the data is vertical with number of data >> number of dimensions. In the following, data assuming this assumption are first prepared. The function can also be used without performing a sumr to extract row and column statistics in advance.

# CSV
tmp2 = mktempdir()
# data2 = Int64.(ceil.(rand(Binomial(1, 0.2), 300, 99)))
# data2[1:100, 1:33] .= 1
# data2[101:200, 34:66] .= 1
# data2[201:300, 67:99] .= 1
data2 = Int64.(ceil.(rand(NegativeBinomial(1, 0.5), 99, 30)))
data2[1:33, 1:10] .= 100*data2[1:33, 1:10]
data2[34:66, 11:20] .= 100*data2[34:66, 11:20]
data2[67:99, 21:30] .= 100*data2[67:99, 21:30]
write_csv(joinpath(tmp2, "Data2.csv"), data2)

# Binarization (Zstandard)
csv2bin(csvfile=joinpath(tmp2, "Data2.csv"), binfile=joinpath(tmp2, "Data2.zst"))

# Matrix Market (MM)
mmwrite(joinpath(tmp2, "Data2.mtx"), sparse(data2))

# Binary COO (BinCOO)
bincoofile = joinpath(tmp2, "Data2.bincoo")
open(bincoofile, "w") do io
    for i in 1:size(data2, 1)
        for j in 1:size(data2, 2)
            if data2[i, j] != 0
                println(io, "$i $j")
            end
        end
    end
end

# Binarization (CSV + Zstandard)
csv2bin(csvfile=joinpath(tmp2, "Data2.csv"), binfile=joinpath(tmp2, "Data2.zst"))

# Binarization (MM + Zstandard)
mm2bin(mmfile=joinpath(tmp2, "Data2.mtx"), binfile=joinpath(tmp2, "Data2.mtx.zst"))

# Binarziation (BinCOO + Zstandard)
bincoo2bin(bincoofile=bincoofile, binfile=joinpath(tmp2, "Data2.bincoo.zst"))

# Dense-mode
out_exact_ooc_pca_dense = exact_ooc_pca(
	input=joinpath(tmp2, "Data2.zst"),
	scale="raw", dim=3, chunksize=10)

subplots(out_exact_ooc_pca_dense[3], group)

# Sparse-mode (MM)
out_exact_ooc_pca_sparse_mm = exact_ooc_pca(
	input=joinpath(tmp2, "Data2.mtx.zst"),
	scale="raw", dim=3, chunksize=10, mode="sparse_mm")

subplots(out_exact_ooc_pca_sparse_mm[3], group)

# Sparse-mode (BinCOO)
out_exact_ooc_pca_sparse_bincoo = exact_ooc_pca(
	input=joinpath(tmp2, "Data2.bincoo.zst"),
	scale="raw", dim=3, chunksize=10, mode="sparse_bincoo")

subplots(out_exact_ooc_pca_sparse_bincoo[3], group)

Command line usage

All the CSV preprocess functions and PCA functions also can be performed as command line tools with same parameter names like below.

# CSV → Julia Binary (e.g, csv2bin, mm2bin)
julia YOUR_HOME_DIR/.julia/v0.x/OnlinePCA/bin/csv2bin \
    --csvfile Data.csv --binfile Data.zst

# Summary statistics extracted from Julia Binary (e.g., sumr, tenxsumr)
julia YOUR_HOME_DIR/.julia/v0.x/OnlinePCA/bin/sumr \
    --binfile Data.zst

# Perform PCA
julia YOUR_HOME_DIR/.julia/v0.x/OnlinePCA/bin/gd \
    --input Data.zst --dim 3 --scheduling robbins-monro --stepsize 10 \
    --numepoch 10 --rowmeanlist Feature_LogMeans.csv

Distributed Computing with Multiple Stepsize Setting

The online PCA algorithms are performed until the reconstruction error is converged. In the default stopping criteria, the calculation is stopped when the relative change is bellow 1E-3 or above 0.03. These values can be changed by lower and upper options, respectively.

The convergence is depend on the step size parameter and default value is set as 1000. This value is tuned for single-cell RNA-Seq dataset, but the appropriate level may change according to the size and dynamic range of data matrix.

Combined with Grid Engine, this step is easily paralled, because each calculation of different step size are independently performed. For example, we firstly make the following template file (e.g., oja_template) containing the online PCA script,

#!/bin/bash

julia YOUR_HOME_DIR/.julia/v0.x/OnlinePCA/bin/oja \
--scale log \
--input Data.zst \
--outdir XXXXX \
--rowmeanlist Feature_LogMeans.csv \
--dim 10 \
--stepsize YYYYY \
--logdir XXXXX/log

and then rewrite the template to set different step size by sed command and submit each job by qsub command.

#!/bin/bash

Steps=(1 10 100 1000 10000 100000 1000000)
for i in ${Step[@]}; do
	OUT="Step"$i
	mkdir -p $OUT
	sed -e "s|XXXXX|$OUT|g" oja_template > TMP_oja_scData.sh
	sed -e "s|YYYYY|$i|g" TMP_oja_scData.sh > oja_scData.sh
	chmod +x oja_scData.sh
	qsub oja_scData.sh
done

Even if there are no distributed computational environment, background process is applicable (just adding & in the end of command).

#!/bin/bash

Steps=(1 10 100 1000 10000 100000 1000000)
for i in ${Steps[@]}; do
	mkdir -p "Step"$i
	julia YOUR_HOME_DIR/.julia/v0.x/OnlinePCA/bin/oja \
	--scale log \
	--input Data.zst \
	--outdir "Step"$i \
	--rowmeanlist Feature_LogMeans.csv \
	--dim 10 \
	--stepsize $i \
	--logdir "Step"$i/log &
done

ps | grep julia

Contributing

If you have suggestions for how OnlineNMF.jl could be improved, or want to report a bug, open an issue! We'd love all and any contributions.

For more, check out the Contributing Guide.

Author

Koki Tsuyuzaki

Name		Name	Last commit message	Last commit date
Latest commit History 315 Commits
.github		.github
bin		bin
docs		docs
src		src
test		test
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE.md		LICENSE.md
Manifest.toml		Manifest.toml
Project.toml		Project.toml
README.md		README.md

License

rikenbit/OnlinePCA.jl

Folders and files

Latest commit

History

Repository files navigation

OnlinePCA.jl

📚 Documentation

Description

Algorithms

Learning Parameter Scheduling

Installation

Basic API usage

Preprocess of CSV

Setting for plot

GD-PCA

SGD-PCA

Oja's method

CCIPCA

RSGD-PCA

SVRG-PCA

RSVRG-PCA

Orthogonal Iteration (Power method)

Arnoldi method

Lanczos method

Halko's method

Algorithm 971

Randomized Block Krylov Iteration

Single-pass PCA type I

Single-pass PCA type II

Summarization for 10X-HDF5

Algorithm 971 for 10X-HDF5

Summary of data for MM/Sparse Matrix

Sparse Randomized SVD for MM format

Exact Out-of-Core PCA

Command line usage

Distributed Computing with Multiple Stepsize Setting

Contributing

Author

About

Resources

License

Code of conduct

Uh oh!

Stars

Watchers

Forks

Releases 17

Packages 0

Uh oh!

Uh oh!

Contributors 4

Uh oh!

Languages

Packages