linux学习笔记03
@[TOC] # 概括 今天的任务还是把昨天的知识进行实操练习 简要概括如下: 1. 运行表达定量比对的R程序 2. 继续进行批量操作 3. 将表达定量的结果合并成矩阵 4. 了解最后的定量矩阵、标准化矩阵、TMM算法调整的矩阵
主要程序过程如下:
1 |
|
遇到的问题
R包版本问题
1 |
|
看到上述问题,可以判断为版本不匹配所致,
于是我尝试的解决方案是安装最新的R和最新的edgeR
主要原因是我的旧版本是3.6.3
,更新的版本的是4.0.3
。
强烈建议 先把旧版本的R包给删了
可以参考这里的教程 >Linux系统更新最新版R语言方法
我直接就更新了,导致安装了新版本的r之后,许许多多的r程序都需要更新。呜呜呜~
结果展示
R程序——Subread比对
把上述冲突处理完之后, 运行R程序:
1 |
|
得到了以下结果(以KID_S4_LD3
为例): 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
========== _____ _ _ ____ _____ ______ _____
===== / ____| | | | _ \| __ \| ____| /\ | __ \
===== | (___ | | | | |_) | |__) | |__ / \ | | | |
==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
Rsubread 2.4.3
//========================== featureCounts setting ===========================\\
|| ||
|| Input files : 1 BAM file ||
|| ||
|| KID_S4_LD3.bam ||
|| ||
|| Paired-end : no ||
|| Count read pairs : no ||
|| Annotation : genes.gtf (GTF) ||
|| Dir for temp files : . ||
|| Threads : 1 ||
|| Level : meta-feature level ||
|| Multimapping reads : counted ||
|| Multi-overlapping reads : not counted ||
|| Min overlapping bases : 1 ||
|| ||
\\============================================================================//
//================================= Running ==================================\\
|| ||
|| Load annotation file genes.gtf ... ||
|| Features : 246042 ||
|| Meta-features : 44677 ||
|| Chromosomes/contigs : 18 ||
|| ||
|| Process BAM file KID_S4_LD3.bam... ||
|| Single-end reads are included. ||
|| Total alignments : 556850 ||
|| Successfully assigned alignments : 334274 (60.0%) ||
|| Running time : 0.02 minutes ||
|| ||
|| Write the final count table. ||
|| Write the read assignment summary. ||
|| ||
\\============================================================================//
经过subread
处理后会得到俩个文件,一个是count
文件,记录基因对应的read关系,一个是log
文件,尝试打开一个(KID_S1_LD1.count
):
里面记录了每个基因的表达量,使用的是不同的标准化的模式(FPKM和TPM)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34gene_id counts fpkm tpm
HF32875 0 0 0
HF32876 15 25.4415259802859 23.9388625965012
HF32877 3 2.03342463708189 1.91332363574273
HF32878 25 29.7651040853782 28.0070754176678
HF32879 18 42.955539269997 40.4184384670904
HF32880 0 0 0
HF32881 57 153.870852921363 144.782715020522
HF32882 6 28.953807443051 27.2436967248677
HF32883 11 28.7686512004096 27.0694764420548
HF32884 17 59.0448663461795 55.5574749559139
HF32885 5 39.2299789096495 36.9129224209838
HF32886 4 26.4386282106244 24.8770725649297
HF32887 18 24.5690631369069 23.1179303873095
HF32888 1 3.79997705117859 3.57553743311272
HF32889 0 0 0
HF32890 0 0 0
HF32891 0 0 0
HF32892 5 6.82473976025192 6.4216473298082
HF32893 0 0 0
HF32894 3 3.67203169592005 3.45514896735134
HF32895 6 31.308901828371 29.4596911953114
HF32896 32 127.835125414008 120.284746467792
HF32897 6 18.0761339285346 17.0084957453594
HF32898 6 9.26849926647243 8.72107005640241
HF32899 3 7.59113745026628 7.14277895570777
HF32900 34 27.6717731831349 26.0373837853089
HF32901 83 32.8682837095589 30.9269706587429
HF32902 0 0 0
HF32903 8 10.6790052747932 10.0482667642923
HF32904 18 50.9887829776328 47.9772113752216
HF32905 30 87.4807551086836 82.3138430457231
HF32906 0 0 0
再打开这个转录组log文件: 比对上的一共305644
未比对成功的,其中unmapped
未匹配上注释文件的一共有119800个,Unassigned_NoFeatures(未映射到注释文件中注释的片段)(130553),说明注释文件没有给出较为全面的注释,581个都是比对时模糊不清的。
1
2
3
4
5
6
7
8
9
10
11
12
13
14Assigned 305644
Unassigned_Unmapped 119800
Unassigned_Read_Type 0
Unassigned_Singleton 0
Unassigned_MappingQuality 0
Unassigned_Chimera 0
Unassigned_FragmentLength 0
Unassigned_Duplicate 0
Unassigned_MultiMapping 0
Unassigned_Secondary 0
Unassigned_NonSplit 0
Unassigned_NoFeatures 130553
Unassigned_Overlapping_Length 0
Unassigned_Ambiguity 581
参考网上的说法 Assigned: The read (or fragment) was assigned to a gene feature in the annotation file provided with option -a Ambiguity: Section 5.3 of the paper. The fragment might originate from gene A or gene B, and it is not clear which gene it originated from. MultiMapping: The fragment maps to multiple different positions. Will a read with multiple alignments be assigned or unassigned if I use the --primary option? NoFeatures: The fragment mapped to a region that is not annotated in the annotation file. Unmapped: The fragment is not mapped to the reference at all. MappingQuality: The fragment’s mapping quality is below the threshold I set with option -Q. FragementLength: The insert size between the two read mates is larger or smaller than the options set with -d and -D. Chimera: ??? I’m guessing that the fragment’s mates are mapped to different chromosomes. Secondary: ??? Nonjunction: ??? Duplicate: The fragment is duplicated in the data, so it was not assigned.
我们打开count矩阵,可以看到不同基因下匹配到的reads
reads count矩阵
1 |
|
对上述矩阵进行标准化后:(方法是TPM) TPM矩阵 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21 BLO_S1_LD1 BLO_S1_LD2 BLO_S1_LD3 BLO_S2_LD1 BLO_S2_LD2 BLO_S2_LD3 BLO_S3_LD1 BLO_S3_LD2 BLO_S3_LD3 BLO_S4_LD1 BLO_S4_
HF18496 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
HF16431 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
HF07143 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
HF39048 3.64430415937764 0 6.43321840995913 0 0 0 0 16.4326267389083 5.64536361832566 5.9040116944613 0 0
HF32055 0.728860831875528 5.97388663516309 0 0 0 0 0 0 0 0 0 0 0 6.79366266459393 0
HF35981 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
HF31129 39.2324922495974 39.1744833621847 14.0622023236281 52.9458320708972 40.927037081633 44.2142685920757 49.040039090221 26.939733798730
HF14726 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
HF30136 34.1377121086869 39.4679251851224 24.2872065814187 25.9091803167698 32.1300831021721 28.636368181224 26.468239396608 23.264168192162
HF01285 15.7755890482016 8.97915475837397 21.273035110233 13.5919566175187 14.3271260131465 12.1611843060004 16.8606473865918 18.1128
HF15351 12.8641539257505 10.1104101111559 17.1093983041183 9.37001896517854 4.93841491101642 9.78095250926674 8.13637161714809
HF28792 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
HF42099 14.9759943290718 12.6253227129725 16.6174347746069 6.06706233315074 6.39522101362723 18.9994221266268 22.3901855790316
HF12561 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
HF21569 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
HF05175 2.61418438934525 1.91877931981567 2.75508552443989 1.38309578315818 1.4579054459325 2.16562968886897 0.600499386164851 0.58645
HF24422 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
HF02960 19.3843415319528 11.5547544127497 15.5539984582893 12.493358620238 49.3841491101642 19.5619050185335 37.9697342133578 45.0275
HF03959 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 01
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
group lib.size norm.factors eff.lib.size
BLO_S1_LD1 999623 1.19835285824697 1197901.07921941
BLO_S1_LD2 1000024 1.14444978101036 1144477.24780511
BLO_S1_LD3 1000092 1.17712340823365 1177231.70358721
BLO_S2_LD1 1000051 0.790137765984441 790178.063010506
BLO_S2_LD2 1000037 0.920995506851478 921029.583685232
BLO_S2_LD3 999995 1.0840994226589 1084094.00216179
BLO_S3_LD1 999966 0.971096339652144 971063.322376596
BLO_S3_LD2 1000002 0.897106990690095 897108.784904077
BLO_S3_LD3 999976 0.887155199786789 887133.908061994
BLO_S4_LD1 999972 0.874964482098764 874939.983093265
BLO_S4_LD2 1000096 0.929513259538841 929602.492811757
BLO_S4_LD3 1000037 0.875277014429481 875309.399679015
KID_S1_LD1 999994 1.266534203831 1266526.60462578
KID_S1_LD2 999939 1.21755253986366 1217478.26915873
KID_S1_LD3 1000050 1.24861324728186 1248675.67794423
KID_S2_LD1 999975 0.968264483401714 968240.276789629
KID_S2_LD2 999949 1.08012898371 1080073.89713183
KID_S2_LD3 1000066 1.11022114688009 1110294.42147579
KID_S3_LD1 999960 1.00488458514679 1004844.38976338
KID_S3_LD2 999942 0.898797372534519 898745.242286912
KID_S3_LD3 999931 0.974514145521912 974446.904045871
KID_S4_LD1 1000018 0.87262090923823 872636.616414597
KID_S4_LD2 999966 0.892593891696263 892563.543503945
KID_S4_LD3 999965 0.931133742026098 931101.152345127
在TPM算法下,使用TMM修正后的矩阵:(一般我们使用的就是这个矩阵)
TMM校正依赖这个perl程序
script/support_scripts/run_TMM_scale_matrix.pl
TPM+TMM矩阵 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34 BLO_S1_LD1 BLO_S1_LD2 BLO_S1_LD3 BLO_S2_LD1 BLO_S2_LD2 BLO_S2_LD3 BLO_S3_LD1 BLO_S3_LD2 BLO_S3_LD3 BLO_S4_LD1 BLO_S4_
HF18496 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
HF16431 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
HF07143 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
HF39048 3.041 0.000 5.465 0.000 0.000 0.000 0.000 18.317 6.363 6.748 0.000 0.000 0.000 5.580 0.000 7.018 0.000 0.000 0.000 6.191 0.000
HF32055 0.608 5.220 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 5.580 0.000 0.000 0.000 0.000 0.000 0.000 5.755
HF35981 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
HF31129 32.739 34.230 11.946 67.008 44.438 40.784 50.500 30.030 13.910 33.187 44.993 38.408 24.096 36.590 70.466 49.855 50.071 41.555 50.901 43.980 44.029
HF14726 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
HF30136 28.487 34.486 20.633 32.791 34.886 26.415 27.256 25.932 42.042 42.988 38.854 43.118 27.311 30.281 36.764 31.462 28.272 24.467 41.026 36.519 27.158
HF01285 13.164 7.846 18.072 17.202 15.556 11.218 17.362 20.190 19.129 12.171 5.712 8.451 6.628 8.387 9.691 12.658 21.188 12.469 22.401 14.888 27.680
HF15351 10.735 8.834 14.535 11.859 5.362 9.022 8.379 13.286 16.924 8.157 3.063 15.294 11.993 8.094 12.990 6.787 8.520 3.343 9.008 17.961 8.348
HF28792 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
HF42099 12.497 11.032 14.117 7.678 6.944 17.526 23.057 15.772 22.414 118.840 117.491 110.580 9.060 18.344 18.925 6.592 16.551 9.740 21.873 30.529 40.542
HF12561 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
HF21569 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
HF05175 2.181 1.677 2.341 1.750 1.583 1.998 0.618 0.654 2.044 2.890 1.356 3.762 1.770 0.597 0.000 1.503 2.264 2.220 3.324 2.651 0.000
HF24422 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
HF02960 16.176 10.096 13.214 15.812 53.620 18.044 39.100 50.192 46.156 45.680 42.875 50.979 15.991 13.491 15.588 27.148 20.449 30.085 36.033 20.955 27.828
HF03959 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
HF11853 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
HF20463 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
HF30006 0.351 3.015 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.611 0.000 0.000 0.000 0.000 0.000 0.000 0.000
HF25733 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
HF19008 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
HF42715 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
HF19277 26.619 28.132 35.548 21.269 19.234 18.493 21.467 21.559 23.652 16.302 22.363 28.736 30.729 27.998 27.958 16.954 13.098 23.125 23.080 17.257 14.973
HF34254 2.451 5.802 4.556 4.543 4.109 3.457 3.210 1.697 0.000 0.000 3.520 0.000 1.532 3.101 1.493 1.950 5.876 0.000 5.177 3.441 3.198
HF27633 6.529 7.067 11.628 1.581 4.290 4.812 3.351 5.905 6.154 2.610 3.675 5.438 8.529 6.475 6.235 4.072 2.727 4.011 3.603 3.592 2.226
HF30992 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
HF35339 26.367 39.055 21.648 17.989 19.522 19.161 20.336 32.247 19.605 17.819 33.449 27.840 31.536 36.837 28.376 24.709 37.224 18.255 16.398 27.247 22.796
HF06789 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
HF33530 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
目前记录表达量有俩种手段,一种是基于比对,一种基于kmer,使用比对方法好处就是能了解到reads的情况(有没有污染啥的) edgeR可以没有生物学重复