Actinobacteria

id : 20231217095710

types : undefined

keywords :

#Bacteria

Activation Function

id : 20231217095711

types : undefined

keywords :

Korean term: 활성화 함수

ReLU
Sigmoid
Swish
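
The three activation functions listed above can each be written in a line or two. This is a minimal sketch in plain Python, assuming the usual textbook definitions; the `beta` parameter of Swish and its default of 1.0 are assumptions of this sketch, not something stated in these notes.

```python
import math


def relu(x: float) -> float:
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + exp(-x)), split by sign to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x: float, beta: float = 1.0) -> float:
    """Swish: x * sigmoid(beta * x). beta = 1.0 is an assumed default."""
    return x * sigmoid(beta * x)


for v in (-2.0, 0.0, 2.0):
    print(v, relu(v), round(sigmoid(v), 4), round(swish(v), 4))
```

ReLU zeroes out negative inputs, Sigmoid squashes every input into (0, 1), and Swish behaves like a smoothed ReLU that lets small negative values through.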

References

Active Learning

id : 20231217095712

types : undefined

keywords :

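Active learning addresses the problem that it is impractical for humans to label all the data needed to train a machine learning (ML) model.
A human labels a small portion of the data and provides it to the model; the model then evaluates the remaining data and hands back to the human the important instances it judges to need labeling.

Three scenarios of Active Learning

1. Membership Query Synthesis

The model uses the given data to generate slightly distorted data instances, then asks a human to label them.

2. Stream-Based Selective Sampling

The model evaluates the informativeness of each incoming instance and passes it to a human only when it judges that labeling is needed.
Informativeness is evaluated with a [[today-i-learned/Query Strategy]]. Instances judged to need labeling are queried; the rest are discarded.

3. Pool-Based Sampling

The most widely used scenario.
Used when there is a very large amount of unlabeled data.
The model evaluates the given data and passes only the most informative instances to a human.
No instance is discarded while informativeness is being evaluated.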
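
The difference between the stream-based and pool-based scenarios can be sketched as follows. This is only an illustration: it assumes the model already outputs class probabilities, and it uses prediction entropy as the informativeness score, which is one possible query strategy rather than the one these notes prescribe. The function names, the `threshold`, and the sample `predictions` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def pool_based_query(pool_probs, k):
    """Rank the whole unlabeled pool and return the indices of the k most informative items."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:k]


def stream_based_query(stream_probs, threshold):
    """Decide item by item: query it if informative enough, otherwise discard it."""
    return [i for i, probs in enumerate(stream_probs) if entropy(probs) >= threshold]


# Hypothetical model outputs for four unlabeled instances (binary classification).
predictions = [[0.98, 0.02], [0.55, 0.45], [0.70, 0.30], [0.50, 0.50]]

print(pool_based_query(predictions, k=2))               # indices sent to the human annotator
print(stream_based_query(predictions, threshold=0.65))  # indices that pass the threshold
```

In the pool-based version every instance is scored and kept in the ranking, while the stream-based version drops anything below the threshold as soon as it is seen.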

Reference: https://littlefoxdiary.tistory.com/52

Adaptive Immunity

id : 20231217095713

types : undefined

keywords :

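Adaptive (second-line) immune response

The adaptive immune response is triggered by recognition of a specific antigen.
The antigen receptors that recognize specific antigens are produced by combining gene segments.
The resulting combination is determined by V(D)J recombination, an irreversible somatic DNA recombination.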

Alignment

id : 20231217095714

types : undefined

keywords :

Alpha Diversity

id : 20231217095715

types : undefined

keywords :

Alphaproteobacteria

id : 20231217095716

types : undefined

keywords :

#Bacteria

Alternative Splicing

id : 20231217095717

types : undefined

keywords :

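- pre-mRNA (exons and introns) -> mature mRNA (exons only)
- Splicing removes the intron regions from the pre-mRNA.

pre-mRNA

- pre-mRNA = precursor mRNA
- pre-mRNA carries a poly(A) tail.
- Like pre-tRNA, pre-mRNA is a primary transcript.

RNA splicing process

1. Five snRNPs (small nuclear ribonucleoproteins) approach the pre-mRNA.
2. They bind to the intron.
3. The two ends of the intron (5' and 3') are brought together, forming a loop (lariat).
4. The intron is removed.
  - The excised intron can take part in other processes afterwards.
5. The snRNPs are recycled.

Types of alternative splicing outcomes

Exon Skipping (Cassette Exon)

- A particular exon is skipped and left out of the mature mRNA.

Intron Retention (Retained Intron)

- An intron is left in place within the mature mRNA.

Mutually Exclusive Exon

- Of two or more candidate exons, only one is included in a given mature mRNA, so different transcripts carry different exons.

Alternative 5' Donor Site

- A splice site in the middle of the upstream exon is joined instead of the usual one.

Alternative 3' Acceptor Site

- A splice site in the middle of the downstream exon is joined instead of the usual one.

References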

Amplicon Target Sequencing

id : 20231217095718

types : undefined

keywords :

Error

Allele Drop-Out

#Allele_Drop-Out

aRNA

id : 20231217095719

types : undefined

keywords :

antisense RNA

AUROC

id : 20231217095720

types : undefined

keywords :

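Area Under the Receiver Operating Characteristic curve

The area under the ROC curve.
A statistical measure for evaluating the diagnostic accuracy of a medical test or of a decision threshold.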
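
The area under the ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties count as one half). The sketch below computes AUROC directly from that pairwise definition; the `labels` and `scores` are made-up values for illustration, and for large datasets a rank-based or library implementation would be faster.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 ground-truth values
    scores: iterable of classifier scores (higher = more likely positive)
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical test results: 1 = diseased, 0 = healthy.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auroc(labels, scores))
```

A value of 1.0 means the test separates the two groups perfectly, and 0.5 means it does no better than chance.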

References

https://blog.naver.com/i-doctor/222700461406

Batch Effect

id : 20231217095721

types : undefined

keywords :

Batch effect is the perturbation in measured gene expressions, often introduced by factors such as library preparation, sequencing technologies, and sample origins (donors).
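Because of differences at the data-generation stage, unwanted technical variation inevitably accompanies the biological variation of interest, both between datasets and within a dataset.
Batch effects become mixed with variation from biological factors and make the two hard to tell apart, which can hinder the characterization of cells.

References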

Bayesian statistics

id : 20231217095722

types : undefined

keywords :

BCR

id : 20231217095723

types : undefined

keywords :

B cell receptor

Beta Diversity

id : 20231217095724

types : undefined

keywords :

Beta-Selection

id : 20231217095725

types : undefined

keywords :

β-selection

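T cell progenitors arise from hematopoietic stem cells in the bone marrow and migrate to the thymus.
Most become αβ T cells, but about 5% become cells bearing a γδ TCR.
The thymus is divided into a cortex and a medulla, and T cell progenitors (thymocytes) develop through interactions with thymic stromal cells.
Early thymocytes express neither CD4 nor CD8, so they are called double negative (DN) cells.
The DN stage is divided into four sub-stages, distinguished by expression of proteins such as CD44 and CD25.
At the transition from DN3 to DN4, β-selection takes place: only T cells whose V(D)J recombination yields a functional β-chain are selected.
Only the selected T cells proceed to the next stage of development.

References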

https://www.immunology.org/public-information/bitesized-immunology/immune-development/t-cell-development-thymus

BFS

id : 20231217095726

types : undefined

keywords :

CCA

id : 20231217095727

types : undefined

keywords :

canonical correlation analysis

Cell Atlas

id : 20231217095728

types : undefined

keywords :

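Korean term: 세포지도

As [[today-i-learned/Knowledge/Bioinformatics/scRNA-seq|scRNA-seq]] technology has advanced, it has become possible to go beyond analyzing single cells in isolation and integrate them into a cell atlas of the human body.
This is analogous to the effort to sequence the entire human genome to build a 'genome map'.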

Cell State

id : 20231217095729

types : undefined

keywords :

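As a simple example, consider a B cell. Before exposure to antigen a B cell is in a 'naive' state; once it encounters antigen it becomes activated. In this way cells change 'state' depending on their environment and conditions, and cells of the same type can be in many different states.

An advantage of single-cell studies is that the state of each individual cell can be determined; these states are called cell states.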

Cell trajectory

id : 20231217095730

types : undefined

keywords :

Time series experiments of differentiation have observed cells transitioning between a starting state and one or more end states, with many cells distributed along a “trajectory” between them.

Refers to the developmental process or state-transition process of cells.

References

https://cole-trapnell-lab.github.io/projects/sc-trajectories/

Cell type

id : 20231217095731

types : undefined

keywords :

Central Dogma

id : 20231217095732

types : undefined

keywords :

DNA
RNA
Protein

Chromatin Accessibility

id : 20231217095733

types : undefined

keywords :

Korean term: 염색질 접근성

Chromatin accessibility represents the degree to which nuclear macromolecules physically contact chromatinized DNA and are topologically organized by nucleosomes and other chromatin-binding factors.

References

Chromatin

id : 20231217095734

types : undefined

keywords :

Class Imbalance

id : 20231217095735

types : undefined

keywords :

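A situation in which a few classes have far more samples than the remaining classes.

Why is class imbalance a problem?

Machine learning models are typically trained under the assumption that the classes have similar amounts of data.
If a model is trained without addressing class imbalance, it can become biased toward the majority class.

Solutions to class imbalance

Resampling
Weighting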
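
Both remedies can be sketched in a few lines. The example below assumes the common inverse-frequency rule for class weights (`n_samples / (n_classes * class_count)`) and plain random oversampling of the minority class; the function names and the toy data are hypothetical, and other weighting or resampling schemes are equally valid.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Weighting: give each class a weight inversely proportional to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def random_oversample(samples, labels, seed=0):
    """Resampling: duplicate random minority-class samples until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, cnt in counts.items():
        members = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - cnt):
            out_x.append(rng.choice(members))
            out_y.append(cls)
    return out_x, out_y


# Toy dataset: 8 samples of class "a", 2 of class "b".
labels = ["a"] * 8 + ["b"] * 2
samples = list(range(10))

print(inverse_frequency_weights(labels))
balanced_x, balanced_y = random_oversample(samples, labels)
print(Counter(balanced_y))
```

Weighting leaves the data untouched and changes how much each error costs during training, whereas resampling changes the data itself; undersampling the majority class would be the mirror image of the oversampling shown here.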

References

[Medium] 클래스 불균형 다루기 (Handling class imbalance)

Clipping

id : 20231217095736

types : undefined

keywords :

Clipping

Clipping is a technique that leaves the leading and trailing parts of a sequencing read, where accuracy tends to be lower, out of the alignment and uses only the more accurate middle part.
For example, for a read 1000 bases long, aligning with clipping enabled might use only the middle 700 bases.
The advantage is that only the high-accuracy portion takes part in the alignment.
- This can improve alignment accuracy even for data with many low-quality base calls.
There are two kinds of clipping: Soft Clipping and Hard Clipping.
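
In SAM/BAM files, clipping is recorded in the CIGAR string: `S` marks soft-clipped bases (still present in the read sequence) and `H` marks hard-clipped bases (removed from the stored sequence). The sketch below counts both for a CIGAR string; the two example strings are made up to match the 1000-base read above, and the helper name is hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM format.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clipped_bases(cigar):
    """Return how many bases of a read were soft-clipped and hard-clipped."""
    soft = hard = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op == "S":
            soft += int(length)
        elif op == "H":
            hard += int(length)
    return {"soft": soft, "hard": hard}


# A 1000-base read where only the middle 700 bases align.
print(clipped_bases("150S700M150S"))  # ends kept in the record but ignored by the alignment
print(clipped_bases("150H700M150H"))  # ends dropped from the record entirely
```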

Soft Clipping

Hard Clipping

#Soft_Clipping #Hard_Clipping

References

https://medium.com/@lwy730050619/soft-clipping-vs-hard-clipping-in-read-alignment-bd0c96f47426

CNN

id : 20231217095737

types : undefined

keywords :

Convolutional Neural Network

Convolution

- As an operation, it can be written as follows.

$$\huge f(x) * g(x) = h(x)$$

*f = filter, g = feature map, h = output*

Deconvolution

- Deconvolution is the reverse process of convolution.
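
The convolution operation $f * g = h$ above can be illustrated with a discrete 1D version. This is a sketch of the mathematical ("full") convolution on plain Python lists; the filter and input values are arbitrary. Note that CNN layers in most deep learning libraries actually compute cross-correlation (the filter is not flipped), which differs from this only by reversing the filter.

```python
def convolve1d(f, g):
    """Full discrete convolution: h[n] = sum over k of f[k] * g[n - k]."""
    h = [0.0] * (len(f) + len(g) - 1)
    for i, f_val in enumerate(f):
        for j, g_val in enumerate(g):
            h[i + j] += f_val * g_val
    return h


f = [1, 0, -1]       # filter (a simple difference kernel)
g = [1, 2, 3, 4]     # input signal / feature map
print(convolve1d(f, g))
```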

References

https://dambaekday.tistory.com/3

#Convolution #Deconvolution

CNV

id : 20231217095738

types : undefined

keywords :

Copy Number Variation

Conformational Structure

id : 20231217095739

types : undefined

keywords :

Converged Database

id : 20231217095740

types : undefined

keywords :

Curse of Dimensionality

id : 20231217095741

types : undefined

keywords :

How to solve?

Feature Selection
Feature Extraction

Data Integration

id : 20231217095742

types : undefined

keywords :

Data ordering

id : 20231217095743

types : undefined

keywords :

Time Series
Random Shuffling

Deep Learning

id : 20231217095744

types : undefined

keywords :

CNN
Transformer

DeepVariant

id : 20231217095745

types : undefined

keywords :

https://github.com/google/deepvariant
"DeepVariant is a deep learning-based variant caller that takes aligned reads (in BAM or CRAM format), produces pileup image tensors from them, classifies each tensor using a convolutional neural network, and finally reports the results in a standard VCF or gVCF file."

DFS

id : 20231217095746

types : undefined

keywords :

Depth First Search

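When the search reaches a branch point, it explores one branch all the way to the end, then comes back and explores the other branches.

Should you use DFS or BFS?

If every node must be visited
- Either DFS or BFS works.
If the history of the path to a node must be recorded
- DFS, because its call stack holds the path to the current node, whereas BFS keeps no path unless extra bookkeeping is added.
If the shortest distance is needed
- BFS (when every edge has the same cost)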
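
The trade-off above can be sketched on a small graph. In this illustration DFS carries the current path along with the recursion and so can list every route, while BFS uses a queue plus a `parent` map (the extra bookkeeping it needs to recover a path) and returns a route with the fewest edges. The adjacency-list `graph` and the function names are hypothetical.

```python
from collections import deque


def dfs_paths(graph, start, goal, path=None):
    """Yield every simple path from start to goal, following one branch to its end first."""
    path = (path or []) + [start]
    if start == goal:
        yield path
        return
    for nxt in graph.get(start, []):
        if nxt not in path:          # avoid revisiting nodes on the current path
            yield from dfs_paths(graph, nxt, goal, path)


def bfs_shortest_path(graph, start, goal):
    """Return a path with the fewest edges from start to goal, or None if unreachable."""
    parent = {start: None}           # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk the parent links back to the start
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


graph = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": ["E"], "E": []}
print(list(dfs_paths(graph, "A", "E")))
print(bfs_shortest_path(graph, "A", "E"))
```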

Algorithm implementation

#Algorithm

References

Diversity

id : 20231217095747

types : undefined

keywords :

Alpha Diversity

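The number of species within a single community

Beta Diversity

The change in species composition across environments

Gamma Diversity

The number of species across different sites within the same environment
$\huge \gamma =\alpha +\beta$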
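
The relation $\gamma = \alpha + \beta$ can be checked on a toy example. The sketch below assumes the additive partitioning in which $\alpha$ is the mean species richness per site, $\gamma$ is the total richness across all sites, and $\beta$ is the difference; the three `sites` and their species names are invented for illustration, and other definitions of beta diversity (for example multiplicative) exist.

```python
def additive_partition(sites):
    """Return (alpha, beta, gamma) under additive partitioning of species richness.

    sites: list of sets, each holding the species observed at one site.
    """
    gamma = len(set().union(*sites))                 # total number of distinct species
    alpha = sum(len(s) for s in sites) / len(sites)  # mean richness per site
    beta = gamma - alpha                             # turnover between sites
    return alpha, beta, gamma


# Hypothetical species lists for three sites.
sites = [{"sp1", "sp2", "sp3"}, {"sp2", "sp3", "sp4"}, {"sp5"}]
alpha, beta, gamma = additive_partition(sites)
print(alpha, beta, gamma)
```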

References

DNA

id : 20231217095748

types : undefined

keywords :

ORF

Ensemble

id : 20231217095749

types : undefined

keywords :

Exon

id : 20231217095750

types : undefined

keywords :

Feature Extraction

id : 20231217095751

types : undefined

keywords :

Feature Selection

id : 20231217095752

types : undefined

keywords :

Gamma Diversity

id : 20231217095753

types : undefined

keywords :

GATK HaplotypeCaller

id : 20231217095754

types : undefined

keywords :

Harmony

id : 20231217095755

types : undefined

keywords :

#tool

Histone Modification

id : 20231217095756

types : undefined

keywords :

HMM

id : 20231217095757

types : undefined

keywords :

HVG

id : 20231217095758

types : undefined

keywords :

highly variable genes

IDP

id : 20231217095759

types : undefined

keywords :

Intrinsically Disordered Protein

Immune Repertoire

id : 20231217095760

types : undefined

keywords :

Korean term: 면역 레퍼토리

"Immune repertoire refers to all of the unique T-cell receptor (TCR) and B-cell receptor (BCR) genetic rearrangements within the adaptive immune system."

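How can T cell receptors (TCRs) and B cell receptors (BCRs) respond specifically to each of an enormous number of antigens? The answer is the specificity that comes from combining gene segments.
A limited set of gene segments is combined into a huge number of antigen receptor variants, and the full set of these combinations, together with the broad immune coverage it provides, is referred to as the immune repertoire.

References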

https://www.thermofisher.com/ca/en/home/life-science/sequencing/sequencing-learning-center/next-generation-sequencing-information/immuno-oncology-research/why-immune-repertoire-matters.html

Intron

id : 20231217095761

types : undefined

keywords :

LAE

id : 20231217095762

types : undefined

keywords :

Latent Representation

id : 20231217095763

types : undefined

keywords :

Learn-NextFlow-in-2023

id : 20231217095764

types : undefined

keywords :

Learn Nextflow in 2023

Before you start

Be familiar with
- Linux command
- Python or Perl
- Some biology

Meet the Tutorials!

1. Basic Nextflow Community Training

Basic Training
YouTube Playlist(7hrs): Community Nextflow & nf-core Foundational Training

Environment setup

You can start Nextflow with either 1) a local installation or 2) Gitpod.
For this tutorial, a local installation requires the following:
- Bash
- Java 11
- Git
- Docker
- Singularity 2.5.x (or later)
- Conda 4.5 (or later)
- Graphviz
- AWS CLI
- A configured AWS Batch computing environment
Meanwhile, Gitpod requires only
- GitHub ID
- Internet
I chose Gitpod because I am not familiar with AWS.
Gitpod is an online IDE service.
It is not 100% free, but it offers around 50 hours of credit per month.
Visit https://gitpod.io/#https://github.com/nextflow-io/training and connect your GitHub ID to Gitpod.
You get your own workspace to practice Nextflow.
Test the environment:

nextflow info

I didn't roll back to the earlier version as the tutorial recommended. I guess there are some Java bugs when you follow the tutorial with the old version.

Introduction

Nextflow is a workflow orchestration engine and domain-specific language (DSL) that makes it easy to write data-intensive computational workflows.

Processes and Channels

A process takes input data and produces output data.
- A process contains the command or script to be executed.
A channel is where the data comes in.

Execution abstraction

The executor is where Nextflow runs.
It can be a local executor, such as your own computer,
or a High-Performance Computing (HPC) cluster or a cloud platform.
The good thing is that you don't need to modify the workflow for a local computer or the cloud. It runs the same on every platform; you only need to define the target platform.

Scripting Language

DSL
Nextflow scripting is an extension of the Groovy programming language which, in turn, is a super-set of the Java programming language.

Your first script

hello.nf

#!/usr/bin/env nextflow

params.greeting = 'Hello world!' 
greeting_ch = Channel.of(params.greeting) 

process SPLITLETTERS { 
    input: 
    val x 

    output: 
    path 'chunk_*' 

    script: 
    """
    printf '$x' | split -b 6 - chunk_
    """
} 

process CONVERTTOUPPER { 
    input: 
    path y 

    output: 
    stdout 

    script: 
    """
    cat $y | tr '[a-z]' '[A-Z]'
    """
} 

workflow { 
    letters_ch = SPLITLETTERS(greeting_ch) 
    results_ch = CONVERTTOUPPER(letters_ch.flatten()) 
    results_ch.view { it } 
}

run the code

nextflow run hello.nf 
#N E X T F L O W  ~  version 23.10.0
#Launching `hello.nf` [confident_mclean] DSL2 - revision: 197a0e289a
#executor >  local (3)
#[bd/de1214] process > SPLITLETTERS (1)   [100%] 1 of 1 ✔
#[cc/562f4f] process > CONVERTTOUPPER (1) [100%] 2 of 2 ✔
#WORLD!
#HELLO

Logistic Regression

id : 20231217095765

types : undefined

keywords :

Ridge Regularization

Long-read Sequencing

id : 20231217095766

types : undefined

keywords :

Also known as third-generation sequencing

Types

Nanopore
SMRT

LSTM

id : 20231217095767

types : undefined

keywords :

Long Short-Term Memory

LSTM is a model proposed to address the long-term dependency problem of RNNs.

References

https://ok-lab.tistory.com/209

Machine Learning

id : 20231217095768

types : undefined

keywords :

Example

Deep Learning
Active Learning

Mendelian disease

id : 20231217095769

types : undefined

keywords :

What is the difference between Mendelian and non-Mendelian inheritance?

The Mendelian traits are determined by dominant and recessive alleles of one gene.
On the contrary, non-Mendelian traits are not determined by dominant and recessive alleles and can be governed by more than one gene.

References

https://byjus.com/biology/non-mendelian-inheritance/

Methylation

id : 20231217095770

types : undefined

keywords :

miRNA

id : 20231217095771

types : undefined

keywords :

micro RNA

MNNs

id : 20231217095772

types : undefined

keywords :

Mutual Nearest Neighbors

Motif

id : 20231217095773

types : undefined

keywords :

A motif is a gene or protein 'pattern' presumed to be associated with a biological function.
Motifs are divided into sequence motifs (patterns in gene sequence), short linear motifs (patterns in protein primary structure), and structural motifs (patterns in protein secondary structure).

1. Sequence Motif

A sequence motif is a nucleotide or amino-acid sequence pattern 'presumed' to be associated with a biological function.
If a sequence motif lies in an [[cosma/Exon|Exon]], it may be transcribed and translated into a structural motif.
A sequence motif can also lie in an [[cosma/Intron|Intron]].
A sequence motif can be as long as 15 to 20 nucleotides.

2. Short Linear Motif(minimotifs, SLiMs)

SLiMs can influence protein-protein interactions.
SLiMs are found mainly in intrinsically disordered proteins (IDPs).

3. Structural Motif

Even the same structural motif can have different biological functions in different proteins.

Motif vs Domain

A motif is a statistically discovered pattern and is only presumed to have a biological function.
A domain is closely tied to protein structure and protein families.
A domain has a biological function and may be composed of several motifs.

References

mRNA

id : 20231217095774

types : undefined

keywords :

Steps in making mRNA: Capping → Tailing → Splicing → RNA editing

5' Capping
3' Tailing
Splicing
RNA editing

Multi-omics

id : 20231217095775

types : undefined

keywords :

Mutation

id : 20231217095776

types : undefined

keywords :

Loss of Heterozygosity(LOH)

Korean terms: 이형상실 / 이형접합성 소실
Example of LOH allele proportion
A sample in which the reference allele is G and the alternative allele is A.

Substitution

Point Mutation

Missense Mutation

Frameshift Mutation

https://www.genome.gov/genetics-glossary/Frameshift-Mutation

Deletion

Insertion

Nonsense Mutation

Nanopore

id : 20231217095777

types : undefined

keywords :

2D sequencing

The first technique used with nanopore
A hairpin structure is attached to the double-stranded DNA
Both DNA strands are sequenced

1D sequencing

Uses the one-direction (1D) library kit
Only one DNA strand is sequenced

1D$^2$ sequencing

Sequences both strands of the double helix without using a hairpin structure
Has higher accuracy than 1D

Reference: https://www.mdpi.com/2079-6374/11/7/214

Neural Network

id : 20231217095778

types : undefined

keywords :

NGS

id : 20231217095779

types : undefined

keywords :

Next Generation Sequencing

NoSQL

id : 20231217095780

types : undefined

keywords :

Non SQL / Not Only SQL

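An approach to storing and querying data in non-relational databases (NRDB); unlike SQL, it is not a single language.

Characteristics (differences from SQL)

No schema

In SQL you define a schema, such as rows and columns, and only data that follows it can be added.
In NoSQL, data that does not follow a schema can also be added.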
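
The contrast between the two insertion rules can be mimicked with plain Python containers. This is only an analogy, not the API of any real database: `SCHEMA`, `insert_row`, and `insert_document` are hypothetical names, a list of dicts stands in for both the table and the document collection, and the sample records are invented.

```python
SCHEMA = ("id", "name", "email")  # hypothetical fixed schema for the relational-style table


def insert_row(table, row):
    """Relational style: reject any row whose fields differ from the schema."""
    if set(row) != set(SCHEMA):
        raise ValueError(f"row does not match schema {SCHEMA}: {sorted(row)}")
    table.append(row)


def insert_document(collection, document):
    """Document-store style: accept any dict, whatever fields it has."""
    collection.append(document)


table, collection = [], []
insert_row(table, {"id": 1, "name": "Kim", "email": "kim@example.com"})
insert_document(collection, {"id": 1, "name": "Kim"})
insert_document(collection, {"id": 2, "tags": ["vip"], "address": {"city": "Seoul"}})

try:
    insert_row(table, {"id": 2, "tags": ["vip"]})  # fields do not match the schema
except ValueError as err:
    print("rejected:", err)

print(len(table), len(collection))
```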

References

https://post.naver.com/viewer/postView.naver?volumeNo=34289847&memberNo=6457418&vType=VERTICAL

NRDB

id : 20231217095781

types : undefined

keywords :

Non-Relational DataBase

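Characteristics

The schema does not have to be defined in advance

NRDB use cases

When an NRDB is a good fit

When handling large volumes of data
When low latency and fast response times are required
- For example, online games and shopping

When an NRDB is a poor fit

When data duplication must not be allowed
- For example, finance and accounting

References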

Nucleosome

id : 20231217095782

types : undefined

keywords :

ORF

id : 20231217095783

types : undefined

keywords :

Open Reading Frame

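The stretch of a DNA sequence that runs from a start codon to a stop codon
Generally the part that is actually translated into an amino acid sequence
- In genomic DNA it can span both introns and exons
- It therefore also contains the protein-coding region
Knowing only the ORF, you cannot tell which parts are introns and which are exons.
- Therefore, to work out which codons will be read, three possible reading frames must be considered on a single DNA strand.

Six-frame translation

On a single DNA strand, translations from three reading frames must be considered.
For double-stranded DNA, six frames must be considered.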
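
The six reading frames can be enumerated mechanically: three offsets on the given strand and three on its reverse complement. The sketch below lists the frames and reports any stretch that runs from `ATG` to the first in-frame stop codon (`TAA`, `TAG`, `TGA`, the standard genetic code). The short `dna` string is made up for illustration and the function names are hypothetical; translating the codons to amino acids would additionally need a codon table.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Return the six reading frames: +1..+3 on the strand, -1..-3 on its reverse complement."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = seq[offset:]
        frames[f"-{offset + 1}"] = rc[offset:]
    return frames


def find_orfs(frame_seq):
    """Return every ATG...stop stretch found when reading frame_seq codon by codon."""
    orfs, start = [], None
    for i in range(0, len(frame_seq) - 2, 3):
        codon = frame_seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                        # open an ORF at the first start codon
        elif start is not None and codon in STOP_CODONS:
            orfs.append(frame_seq[start:i + 3])
            start = None                     # close it at the first in-frame stop
    return orfs


dna = "CCATGAAATAGGG"
for name, frame in six_frames(dna).items():
    print(name, find_orfs(frame))
```

Only one of the six frames of this toy sequence contains a complete ORF, which is why all six have to be checked when the correct frame is not known in advance.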

References

OTUs

id : 20231217095784

types : undefined

keywords :

Operational Taxonomic Units

pcaReduce

id : 20231217095785

types : undefined

keywords :

Pearson χ2 test

id : 20231217095786

types : undefined

keywords :

Pearson's chi-squared test

References

https://en.wikipedia.org/wiki/Pearson's_chi-squared_test

piRNA

id : 20231217095787

types : undefined

keywords :

piwi interacting RNA

Protein

id : 20231217095788

types : undefined

keywords :

Conformational Structure

Query Strategy

id : 20231217095789

types : undefined

keywords :

In Active Learning

Reference: https://littlefoxdiary.tistory.com/52

Random Forest

id : 20231217095790

types : undefined

keywords :

Random Grid Search

id : 20231217095791

types : undefined

keywords :

Random Shuffling

id : 20231217095792

types : undefined

keywords :

RBPs

id : 20231217095793

types : undefined

keywords :

RNA binding proteins

RBPs are proteins that bind double-stranded or single-stranded [[today-i-learned/Knowledge/Biology/Genomics/RNA|RNA]].
Inside the cell, RBPs and RNA can bind to form [[ribonucleoprotein]] complexes.
RBPs are built from structural [[motif]]s.
RBPs affect RNA in many ways after transcription: splicing, polyadenylation, ...

Functions of RBPs

1. Alternative Splicing

2. RNA editing

3. Polyadenylation

4. Export

5. mRNA localization

6. Translation

unconventional RBPs(ucRBPs)

RBPs whose biological function has not yet been identified

References

RDB

id : 20231217095794

types : undefined

keywords :

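Relational DataBase (Korean: 관계형 데이터베이스)

An RDB is a database that stores data in table form.
It is easy to picture as a matrix or an Excel sheet.

Basic Terms

Row(or record)

Each row represents one data record.

Column(or field)

Each column represents an attribute of the data.
Besides the attribute itself, a column definition can also state constraints on the data (part of the schema).
- ex) NOT NULL

Primary Key

The column used to tell rows apart.
- Think of how, when we build a table in Excel, the values in the first column are usually what identifies each row.
A table has exactly one primary key (which may span several columns).

Foreign Key

A column whose values refer to the primary key of another table.
If a primary key is the identifier of a record, a foreign key is a pointer from one table to a record in another.

Advantages of RDB

Logical organization and management of data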

CREATE TABLE customers(
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(255) NOT NULL,
email VARCHAR(255) NOT NULL,
phone_number VARCHAR(255) NOT NULL,
PRIMARY KEY (id)
);

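A table created with the SQL statement above stores each customer's ID, name, email, and phone number.
All attributes can be grouped efficiently into a single record.

Effective linking of different data

Let's create another table, 'orders', to link to the 'customers' table created above.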

CREATE TABLE orders(
id INT NOT NULL AUTO_INCREMENT,
customer_id INT NOT NULL,
product_id INT NOT NULL,
quantity INT NOT NULL,
price INT NOT NULL,
PRIMARY KEY (id),
FOREIGN KEY (customer_id) REFERENCES customers (id),
FOREIGN KEY (product_id) REFERENCES products (id)
);

The 'orders' table takes its customer_id from the id of the 'customers' table by reference.
With such a simple statement, one table can reference data in another, linking the two tables.

Efficient queries

To see the list of products a particular customer ordered, use SQL as follows.

SELECT
orders.id,
orders.customer_id,
orders.product_id,
orders.quantity,
orders.price,
products.name
FROM orders
INNER JOIN products
ON orders.product_id=products.id
WHERE orders.customer_id=1;
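
The tables and the join above can be tried end to end with Python's built-in `sqlite3` module. This is a sketch under a few assumptions: SQLite has no `AUTO_INCREMENT` keyword, so `INTEGER PRIMARY KEY` is used instead; the column lists are trimmed; and the `products` table and all inserted rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE products  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers (id),
    product_id  INTEGER NOT NULL REFERENCES products (id),
    quantity INTEGER NOT NULL,
    price INTEGER NOT NULL
);
INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO products  VALUES (10, 'Pipette'), (11, 'Tube');
INSERT INTO orders VALUES (100, 1, 10, 2, 30), (101, 2, 11, 5, 4), (102, 1, 11, 1, 4);
""")

# Products ordered by customer 1, resolved through the foreign key to products.
rows = conn.execute(
    """
    SELECT orders.id, products.name, orders.quantity
    FROM orders
    INNER JOIN products ON orders.product_id = products.id
    WHERE orders.customer_id = ?
    ORDER BY orders.id
    """,
    (1,),
).fetchall()
print(rows)
conn.close()
```

The `?` placeholder passes the customer id as a parameter instead of pasting it into the SQL text, which is the safer habit once the value comes from user input.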

Types of RDB

MySQL
Amazon Aurora
PostgreSQL
Microsoft SQL Server
Oracle Database
MariaDB

References

SQL

REF

id : 20231217095795

types : undefined

keywords :

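Row Echelon Form matrix (Korean: 사다리꼴 행렬)

A method for solving systems of linear equations.
It is obtained by row-reducing the coefficient matrix (or augmented matrix) of the system.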
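
Reducing a matrix to row echelon form is done with Gaussian elimination: pick a pivot in each column and subtract multiples of the pivot row from the rows below it. The sketch below uses exact fractions to avoid rounding; the example system (x + y = 3, 2x - y = 0) and the function name are chosen here for illustration.

```python
from fractions import Fraction


def row_echelon_form(matrix):
    """Return a row echelon form of the matrix using Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in matrix]
    rows, cols = len(m), len(m[0])
    pivot_row = 0
    for col in range(cols):
        if pivot_row == rows:
            break
        # Find a row at or below pivot_row with a non-zero entry in this column.
        pivot = next((r for r in range(pivot_row, rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[pivot_row], m[pivot] = m[pivot], m[pivot_row]
        # Eliminate the entries below the pivot.
        for r in range(pivot_row + 1, rows):
            factor = m[r][col] / m[pivot_row][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[pivot_row])]
        pivot_row += 1
    return m


# Augmented matrix for: x + y = 3, 2x - y = 0
augmented = [[1, 1, 3], [2, -1, 0]]
ref = row_echelon_form(augmented)
print([[str(v) for v in row] for row in ref])
```

Once the matrix is in this staircase shape, the unknowns follow by back-substitution from the last row upward (here -3y = -6 gives y = 2, then x = 1).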

ReLU

id : 20231217095796

types : undefined

keywords :

Rectified Linear Unit

Resampling

id : 20231217095797

types : undefined

keywords :

UnderSampling

OverSampling

Ribonucleoprotein

id : 20231217095798

types : undefined

keywords :

Ridge Regularization

id : 20231217095799

types : undefined

keywords :

RNA-seq

id : 20231217095800

types : undefined

keywords :

RNA

id : 20231217095801

types : undefined

keywords :

mRNA
rRNA
tRNA
snRNA
snoRNA
aRNA
miRNA
siRNA
piRNA

References

https://bioinformaticsandme.tistory.com/249

RNN

id : 20231217095802

types : undefined

keywords :

LSTM appeared later

rRNA

id : 20231217095803

types : undefined

keywords :

ribosomal RNA

Sanger Sequencing

id : 20231217095804

types : undefined

keywords :

SC3

id : 20231217095805

types : undefined

keywords :

scRNA-seq

id : 20231217095806

types : undefined

keywords :

Single Cell RNA Sequencing

Spatial Transcriptomics

Sequencing Types

id : 20231217095807

types : undefined

keywords :

Based on Target

Whole-Genome Sequencing([[Knowledge/Bioinformatics/WGS|WGS]])
Whole-Exome Sequencing([[today-i-learned/Knowledge/Bioinformatics/WES|WES]])
- Amplicon Target Sequencing
Single-Cell RNA Sequencing([[today-i-learned/Knowledge/Bioinformatics/scRNA-seq|scRNA-seq]])

Based on Sequencing Technology Generation

Sanger Sequencing
Next Generation Sequencing(NGS)
Long-read Sequencing

Seurat

id : 20231217095808

types : undefined

keywords :

SHAP

id : 20231217095809

types : undefined

keywords :

Shapley Additive Explanations

Sigmoid

id : 20231217095810

types : undefined

keywords :

siRNA

id : 20231217095811

types : undefined

keywords :

small interfering RNA

smFISH

id : 20231217095812

types : undefined

keywords :

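Single-molecule Fluorescence in situ Hybridization.

Gene expression is measured on the basis of mRNA levels, and smFISH is one technique for doing so.

Background

Before smFISH, mRNA levels were measured with methods such as reverse transcription PCR (RT-PCR), Northern blot, and RNA-seq.
However, these methods extract RNA from many cells without distinguishing among them, so the expression level in each individual cell could not be determined.
Research therefore continued into finding exactly which cell an RNA came from, that is, the spatial distribution of RNA in cells, and smFISH was developed as a result.

Characteristics

smFISH is a method that binds probes carrying DNA oligonucleotides to cells.
Each DNA oligonucleotide carries a fluorescent label at its end so that it can be tracked.
First, multiple probes are bound along a single RNA molecule.
Once the bound probes raise the signal-to-noise ratio, the molecule can be observed under a microscope.
The fluorescent spots in the image can be analyzed with a 3D Gaussian fitting algorithm.

Advantages

smFISH offers resolution down to individual RNA molecules and provides the spatial information of RNA molecules within the cell.

Disadvantages

Because the cells must be fixed, changes in RNA cannot be observed in living cells, and because only 1 to 4 fluorescent colors can be used, only 1 to 4 genes can be analyzed in a single experiment.

References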

https://bio-protocol.org/e3070

SMRT

id : 20231217095813

types : undefined

keywords :

Single molecule real time sequencing from PacBio

snoRNA

id : 20231217095814

types : undefined

keywords :

small nucleolar RNA

snRNA

id : 20231217095815

types : undefined

keywords :

small nuclear RNA

snRNP

id : 20231217095816

types : undefined

keywords :

small nuclear ribonucleoproteins

RNA-protein complex
Takes part in the splicing of pre-mRNA.
Takes part in removing the intron regions.

Sparsity

id : 20231217095817

types : undefined

keywords :

Spatial Transcriptomics

id : 20231217095818

types : undefined

keywords :

SQL

id : 20231217095819

types : undefined

keywords :

Structured Query Language
A programming language used to query and modify data in a database

SELECT "Hello, World!";

Databases that use SQL

RDB


Stochastic Gradient Descent

id : 20231217095820

types : undefined

keywords :

Structural Motif

id : 20231217095821

types : undefined

keywords :

Examples of structural motifs

RNA Recognition Motif(RRM)
dsRNA binding domain
Zinc Finger(ZnF)

Swish

id : 20231217095822

types : undefined

keywords :

T cell

id : 20231217095823

types : undefined

keywords :

Classification

conventional T cell (mainstream)

αβ T cells

unconventional T cell (non-mainstream)

γδ T cells
NK T cells
Tregs
...

References

https://link.springer.com/chapter/10.1007%2F978-1-84800-165-7_6

TCR

id : 20231217095824

types : undefined

keywords :

T cell receptor

Thymus

id : 20231217095825

types : undefined

keywords :

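Korean term: 흉선 (also called 가슴샘).

The thymus lies between the thyroid and the heart.
It is where T cells produced in the bone marrow migrate to mature.
The thymus grows until adolescence and then regresses in adulthood.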

Time Series

id : 20231217095826

types : undefined

keywords :

Transformer

id : 20231217095827

types : undefined

keywords :

Traversal

id : 20231217095828

types : undefined

keywords :

Linear data structures, typified by queues and stacks, differ greatly from tree data structures in how all of their nodes can be 'visited'.
In a linear data structure there is only one way to visit every node: because the structure is a straight line, you simply go from the first node to the last in order.
In a tree data structure, by contrast, each node branches, so there are many ways to visit every node. You could go down to a bottom node that has no further branches and then backtrack, or you could visit just the nodes at a branch point and then pick one branch to continue along.
Visiting all the nodes of a tree data structure in this way is called traversal.
Tree data structures are usually studied with Binary Search Trees(BST) in mind.
- In a BST each node has at most two branches.
A tree data structure can be traversed in four broad ways.
1. Depth First Search(DFS)
  - Inorder Traversal
  - Preorder Traversal
  - Postorder Traversal
2. Level Order Traversal or Breadth First Search or BFS
3. Boundary Traversal
4. Diagonal Traversal

1. DFS

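DFS (depth-first search) helps find the bottom of the tree structure quickly.
With DFS-based traversal, the result depends on when each node is visited relative to its subtrees.
- Start from the leftmost bottom node and work upward in order: left subtree, node, right subtree (inorder)
- Start from the top node: node, left subtree, right subtree (preorder)
- Start from the leftmost bottom node, but finish the nodes below before moving up: left subtree, right subtree, node (postorder)
To repeat, the focus of DFS is on quickly finding where the bottom is and how deep the tree is.
In the example tree built in the code below, note that node 4, the leftmost bottom node, always appears toward the front of the output.

Implementing Inorder Traversal, Preorder Traversal, and Postorder Traversal with DFS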

# Binary Search Tree(BST) Structure

# 0. Build BST node
class Node:
    def __init__(self, key):
        self.left = None   # left child
        self.right = None  # right child
        self.val = key

# 1. Inorder Traversal: left subtree, node, right subtree
def printInorder(root):
    if root:
        printInorder(root.left)   # recurse: keep going left
        print(root.val, end=" ")  # then visit the node itself
        printInorder(root.right)  # then go right

# 2. Preorder Traversal: node, left subtree, right subtree
def printPreorder(root):
    if root:
        print(root.val, end=" ")  # visit the node first
        printPreorder(root.left)  # recurse: keep going left
        printPreorder(root.right) # then go right

# 3. Postorder Traversal: left subtree, right subtree, node
def printPostorder(root):
    if root:
        printPostorder(root.left)   # recurse: keep going left
        printPostorder(root.right)  # then go right
        print(root.val, end=" ")    # visit the node last

# Driver code
if __name__ == "__main__":
    root = Node(1)
    root.left = Node(2)
    root.right = Node(3)
    root.left.left = Node(4)
    root.left.right = Node(5)

    # Function call
    print("Inorder traversal of binary tree is")
    printInorder(root)    # 4 2 5 1 3
    print()               # end the line before the next heading
    print("Preorder traversal of binary tree is")
    printPreorder(root)   # 1 2 4 5 3
    print()
    print("Postorder traversal of binary tree is")
    printPostorder(root)  # 4 5 2 3 1
    print()

2. BFS
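
Level-order traversal (BFS), listed among the four traversal methods, visits the tree one level at a time by keeping the nodes still to be visited in a queue. A minimal sketch follows; it redefines a small `Node` class so that it runs on its own, and it builds the same five-node example tree as the DFS code. The function name `level_order` is hypothetical.

```python
from collections import deque


class Node:
    def __init__(self, key):
        self.left = None   # left child
        self.right = None  # right child
        self.val = key


def level_order(root):
    """Return node values level by level, left to right."""
    order = []
    queue = deque([root] if root else [])
    while queue:
        node = queue.popleft()        # oldest node in the queue comes out first
        order.append(node.val)
        if node.left:
            queue.append(node.left)   # children wait behind the current level
        if node.right:
            queue.append(node.right)
    return order


root = Node(1)
root.left = Node(2)
root.right = Node(3)
root.left.left = Node(4)
root.left.right = Node(5)

print(level_order(root))
```

Unlike the DFS orders, the bottom node 4 now appears late in the output, because every node of a shallower level is visited before any deeper one.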

3. Boundary Traversal

4. Diagonal Traversal

References

https://www.geeksforgeeks.org/tree-traversals-inorder-preorder-and-postorder/

Tree Structure

id : 20231217095829

types : undefined

keywords :

Traversal

tRNA

id : 20231217095830

types : undefined

keywords :

transfer RNA

V(D)J Recombination

id : 20231217095831

types : undefined

keywords :

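V(D)J refers to the gene segments (Variable, Diversity, Joining) that encode the variable region at the tip of an antibody.

Light Chain: V-J
Heavy Chain: V-D-J
Because the gene segments that encode the V-(D)-J region are rearranged through genetic recombination, antibodies can cover a broad Immune Repertoire.
- Antibodies are immunoglobulins, found both as membrane-bound BCRs and as secreted antibodies in serum. TCRs are not immunoglobulins, but their variable regions are assembled by the same mechanism.
Among the many proteins involved are RAG1, RAG2, Ku, Artemis, and DNA Ligase.

-> Therefore V(D)J recombination affects all three: BCRs, serum antibodies, and TCRs.

References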

https://namu.wiki/w/V(D)J 재조합?from=VDJ#s-4

VarDict

id : 20231217095832

types : undefined

keywords :

Weighting

id : 20231217095833

types : undefined

keywords :

Case Weight

Sample Weight

WES

id : 20231217095834

types : undefined

keywords :

Whole-Exome Sequencing

Example

Amplicon Target Sequencing

WGS

id : 20231217095835

types : undefined

keywords :

Workflow

id : 20231217095836

types : undefined

keywords :

1. Data Generation

2. Data Processing

FASTQC

Trimming

[[today-i-learned/Knowledge/Bioinformatics/Alignment|Alignment]]

3. Variant Calling

Variant Caller

GATK HaplotypeCaller
VarDict

4. Analysis

ZnF

id : 20231217095837

types : undefined

keywords :

Zinc Finger

ZnF is a [[structural motif]] that contains one or more zinc ions ($Zn^{2+}$).
ZnFs take on a variety of three-dimensional structures.

Characteristics of ZnF

ZnFsms

Classes of Zinc Finger

1. $Cys_2His_2$

2. Gag-knuckle

3. Treble-clef

4. Zinc Ribbon

5. $Zn_2/Cys_6$

6. Miscellaneous

References

https://en.wikipedia.org/wiki/Zinc_finger

_bcl

id : 20231217095838

types : undefined

keywords :

#file_type

_FASTA&FASTQ

id : 20231217095839

types : undefined

keywords :

#file_type

_sam&bam

id : 20231217095840

types : undefined

keywords :

#file_type

_vcf

id : 20231217095841

types : undefined

keywords :

#file_type
Variant Call Format
