Details
Bug
Status: Resolved
Major
Resolution: Fixed
1.14
Description
Some reads are identical in every respect and occur twice (in both the FASTQ and the BED output, of course). Here is an example of read headers from a simulated FASTQ file:
@Chr1:8017397-8021712W:AT1G22660.1:541:2005:723:872:S/2
@Chr1:8017397-8021712W:AT1G22660.1:541:2005:723:872:A/1
@Chr1:8017397-8021712W:AT1G22660.1:541:2005:723:872:S/2
@Chr1:8017397-8021712W:AT1G22660.1:541:2005:723:872:A/1
FASTQ files are supposed to have unique read IDs. I am wondering how software usually handles this (does it just skip the second occurrence?) and, hence, what the impact is on the reads after mapping.
I ran a test on a FASTQ file generated directly by flux-simulator. It contained about 14 million reads in total, of which 2 * 489353 = 978706 (approximately 1 million) were duplicated. I have removed them for now. Reads can legitimately occur more than once (due to amplification or other factors, or by chance, although chance is highly unlikely here). However, the read IDs in those cases must still be made distinct somehow, right?
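For anyone who wants to reproduce the check, here is a minimal sketch of one way to count repeated read IDs and keep only the first occurrence of each. It is not necessarily how the numbers above were obtained. It assumes plain 4-line FASTQ records (no wrapped sequences, no empty reads), and the function names `read_fastq`, `read_id`, `count_duplicate_ids`, and `drop_repeated_ids` are hypothetical helpers, not part of flux-simulator.

```python
import itertools
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

# One FASTQ record: header, sequence, separator ("+"), quality string.
Record = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield 4-line FASTQ records (assumption: sequences are not line-wrapped)."""
    # Blank lines are skipped, so empty reads are not supported by this sketch.
    stripped = (line.rstrip("\r\n") for line in lines if line.strip())
    while True:
        record = tuple(itertools.islice(stripped, 4))
        if not record:
            return
        if len(record) != 4 or not record[0].startswith("@"):
            raise ValueError(f"malformed FASTQ record: {record!r}")
        yield record


def read_id(record: Record) -> str:
    """Return the read ID: the header without '@', up to the first space."""
    return record[0][1:].partition(" ")[0]


def count_duplicate_ids(records: Iterable[Record]) -> Dict[str, int]:
    """Map each read ID that occurs more than once to its number of occurrences."""
    counts = Counter(read_id(record) for record in records)
    return {rid: n for rid, n in counts.items() if n > 1}


def drop_repeated_ids(records: Iterable[Record]) -> Iterator[Record]:
    """Yield only the first record seen for each read ID."""
    seen = set()
    for record in records:
        rid = read_id(record)
        if rid not in seen:
            seen.add(rid)
            yield record
```

To use it, open the FASTQ file and pass the file handle to `read_fastq`, then feed the records to either function. Both keep every distinct ID in memory, which should be acceptable for a file of roughly 14 million reads but is worth keeping in mind for larger ones.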